research and evaluation in education iv volume 2, number 1, june 2016 subject indexes a affective assessment, 25, 39 assessment instrument, 1, 4, 6, 7, 8, 9, 13, 15, 16, 17, 18, 19, 20, 21, 23, 25, 92, 94, 95, 106 authentic assessment instrument, 13, 15, 16, 17, 21, 23 automotive electrical system, 71, 72, 73, 74, 75, 76, 77 c creative thinking skills, 1, 3, 6, 7, 9 conation aspect, 1, 3, 6, 7, 9 e elementary education, 79, 80, 82 entrepreneurial behavior, 53, 55, 57, 59, 61, 62, 63, 64, 65, 66, 67 etnography, 79, 91 f fit item, 1, 5, 6, 196 formative assessment, 13, 16, 17, 18 h higher order thinking, 92, 93, 106 j junior high school, 82, 86, 88, 92, 94, 95, 96, 97, 101, 103, 104, 105, 106 l learning trajectory, 13, 15, 16, 17, 18, 21, 22, 23 m mathematics, 92, 93, 95, 96, 98, 103, 104, 105, 106 multimedia, 71, 72, 73, 74, 75, 76, 77 p pcm-1pl, 1 pre-reading, 42, 43 primary school students, 25, 26, 27, 28, 35, 37, 39 productive entrepreneurial learning, 53 r reliability, 4, 5, 6, 7, 8, 9, 13, 16, 17, 18, 20, 21, 22, 25, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39, 42, 46, 48, 49, 51, 92, 97, 99, 102, 106 rural areas, 79, 82, 83, 84, 87, 90 s separation index, 1, 6, 7, 8, 9 social competence, 25, 26, 27, 28, 35, 36, t thematic-integrative, 13, 14, 15, 16, 18 v validity, 4, 6, 9, 16, 17, 18, 20, 21, 22, 23, 25, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 42, 46, 49, 51, 74, 82, 92, 95, 96, 97, 98, 99, 100, 103, 105, 106 research and evaluation in education research and evaluation in education, e-issn: 2460-6995 v author indexes aman, 13 budiyono, 25 faralina, asrab ali nur, 42 kadri, ayop shahrul, 42 kartowagiran, badrun, 1, 4, 11 lastariwati, badraningsih, 53, 61, 62, 63, 64, 65, 66, 69 nurrahmah, 79 pada, andi ulfa tenri, 1 pardjono, 53 samritin, 92 sofyan, herminarto, 71 subali, bambang, 1, 2, 3, 6, 9, 11, 12 sukamto, 53 sumarno, 79 surjono, herman dwi, 71 surya, anesa, 13 suryanto, 92 sutrisno, 25 syamsudin, amir, 25 widjanarko, dwi, 71, 72, 78 yap, abdullah nurul 
syafiqah, 42 zamroni, 79, 81, 91

authors’ biography

abdullah nurul syafiqah yap. was born on 14 april 1983 in kuala lumpur, malaysia. she is currently a senior lecturer at the department of physics, faculty of science and mathematics, universiti pendidikan sultan idris (upsi). she received her first degree, a bachelor of science (industrial physics) with honours, in 2005 from malaysia university of technology (utm). in 2008, she obtained a master of science (physics) from the same university. she completed her ph.d in physics education at malaysia university of science (usm) in 2014. she is interested in research related to educational physics, physics instrumentation and interactive learning.

aman. was born in brebes on october 15, 1974. he completed his bachelor's degree in 1999 at yogyakarta state university, in the history education study program. he completed his master's degree in 2002 in the graduate school of jakarta state university. he attained his doctorate in 2011 in the graduate school of yogyakarta state university, majoring in educational research and evaluation. he currently works as a lecturer in the faculty of social sciences and graduate school of yogyakarta state university, indonesia.

amir syamsudin. was born in ciamis, west java, on february 4, 1971. he works as a lecturer in yogyakarta state university, indonesia, focusing on islamic education. he attained his bachelor's degree in theology and philosophy in march 1997 and his master's degree in islamic philosophy in august 1999 from sunan kalijaga state islamic university of yogyakarta. he was awarded a fellowship by the department of higher education of the ministry of education and culture as a visiting scholar at the college of education, university of illinois at urbana-champaign (uiuc), illinois, usa, from october 2010 until january 2011 to conduct a literature search.
he obtained his doctoral degree from yogyakarta state university, majoring in educational research and evaluation, in 2015.

andi ulfa tenri pada. was born on 21 april 1983, in bantaeng, south sulawesi, indonesia. she has been working as a lecturer in the biology education study program, faculty of teacher training and education, syiah kuala university, indonesia, since 2008. she attained her bachelor's degree in 2005 from makassar state university, majoring in biology education, and her master's degree in 2007 from yogyakarta state university, majoring in science education. she obtained her doctoral degree from yogyakarta state university in 2016, majoring in educational research and evaluation. her research interests include creativity and assessment development in science education.

anesa surya. was born in ngawi on april 20, 1990. she completed her bachelor's degree in 2012 in the faculty of teacher education and educational sciences of sebelas maret university, in the elementary school teacher education study program. she completed her master's degree in 2015 in the graduate school of yogyakarta state university in the same major. she currently works as a lecturer in the department of elementary school teacher education, sarjanawiyata tamansiswa university, yogyakarta, indonesia.

asrab ali nur faralina. she is currently a second-year student in the master of education (physics) program at universiti pendidikan sultan idris (upsi). she was born on 10 june 1991 at penang maternity hospital, malaysia. she obtained a bachelor of education (science) with first class honours from upsi. she looks forward to improving and contributing to the education system. she aims to complete her master's degree and continue to learn and explore the world of education as an educator.

ayop shahrul kadri. he was born in selangor, malaysia, on 13 july 1980.
he is currently an associate professor at the department of physics, faculty of science and mathematics, universiti pendidikan sultan idris (upsi). he obtained a ph.d in physics from the university of hokkaido, japan, in 2011. he completed his master of science (optics) at the university of leipzig, germany, in 2005. he obtained a bachelor of science (industrial physics) with first class honours from malaysia university of technology (utm) in 2002. his research interests are physics education and optical manipulation.

badraningsih lastariwati. was born in yogyakarta in 1960. she attained her undergraduate degree from the family prosperity education study program, institute of teachers training and education, in 1984. then, she earned her graduate degree from the study program of societal health science, faculty of medicine, gadjah mada university, in 1997. she currently works as a lecturer in the culinary and fashion education study program, faculty of engineering, yogyakarta state university, indonesia.

badrun kartowagiran. was born on 25 july 1953. he works as a lecturer in the faculty of engineering and graduate school of yogyakarta state university, indonesia. he attained his bachelor's degree in 1977 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), majoring in machine engineering education. in 1992, he achieved his master's degree in educational research and evaluation from jakarta institute of teacher education and educational sciences (now known as jakarta state university). in 2005, he completed his doctoral study in psychology/psychometrics at gadjah mada university, indonesia.

bambang subali. was born on 12 january 1952. he works as a lecturer in the faculty of mathematics and science, and graduate school of yogyakarta state university, indonesia.
he attained his bachelor's degree from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), majoring in biology education. he achieved his master's degree in forestry science from gadjah mada university, indonesia. he completed his doctoral study in educational research and evaluation at yogyakarta state university, indonesia.

budiyono. he is a professor at the department of mathematics education, faculty of teacher training and education, sebelas maret university, surakarta, indonesia.

dwi widjanarko. was born on 6 january 1969. he currently works as a lecturer and the chief of the department of automotive electricity, faculty of engineering, semarang state university, indonesia. he attained his doctoral degree in 2013 from yogyakarta state university, majoring in vocational education.

herman dwi surjono. he currently works as a senior lecturer at the faculty of engineering and the graduate school of yogyakarta state university, indonesia. he attained his doctoral degree in information technology in 2006 from southern cross university, australia.

herminarto sofyan. was born on 9 august 1954. he currently works as a senior lecturer in the department of automotive engineering education of the faculty of engineering and graduate school of yogyakarta state university, indonesia. he attained his bachelor's degree in 1978 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), majoring in mechanical engineering. in 1986, he attained his master's degree from jakarta institute of teacher education and educational sciences (now known as jakarta state university), majoring in vocational education. he achieved his doctoral degree in 2002 from jakarta state university, majoring in educational technology.

nurrahmah.
was born in the county of dompu, in the province of west nusa tenggara, on 20 december 1974. she completed her elementary, junior high and senior high school education in lombok. she obtained her undergraduate degree from the department of economic and development study, faculty of economy, mataram university. then, she attained her graduate degree from the social science study program of yogyakarta state university, majoring in economic education, in 2005. in 2006, she attained her postgraduate degree in educational research and evaluation from yogyakarta state university. she has been working as a full-time lecturer for the development of economy and cooperation economy subject in the department of tadris economics social science, faculty of tarbiyah, mataram state islamic institute, indonesia, since 2008.

pardjono. was born on 2 september 1953. he currently works as a senior lecturer in the faculty of engineering and graduate school of yogyakarta state university. he obtained his bachelor's degree in 1977 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), indonesia, majoring in mechanical engineering education. in 1986, he attained his master's degree in industrial arts and technology education from state university of new york, usa. he attained his doctoral degree in cognitive education in 2000 from deakin university, australia.

samritin. he currently works as a lecturer at muhammadiyah university of buton, south-east sulawesi, indonesia. he attained his bachelor's degree from yogyakarta state university, majoring in educational research and evaluation, in 2014.

sukamto. was born on 25 february 1947. he is a professor in the department of mechanical engineering education of the faculty of engineering, yogyakarta state university, indonesia.

sumarno.
he currently works as a lecturer in yogyakarta state university in the expertise field of program management of non-formal education and educational research and evaluation. he attained his bachelor's degree from the department of social education of yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university) in 1972. he attained his master's degree in 1980 and his doctoral degree in educational planning and sociology in 1986, both from macquarie university, sydney, australia.

suryanto. currently works as a lecturer in yogyakarta state university in the expertise field of mathematics and statistics.

sutrisno. he is a professor at the department of islamic education, faculty of teacher training and education, sunan kalijaga state islamic university, yogyakarta, indonesia.

zamroni. he currently works as a senior lecturer in yogyakarta state university in the expertise field of educational sociology. he completed his bachelor's degree at yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), focusing on economic enterprise, and his master's and doctoral degrees (m.a. and ph.d.) at florida state university, united states of america. he has been active in publication since he started working in yogyakarta state university in 1974.

submission guidelines

• the submitted manuscript must be the result of research or a scientific assessment of a current issue in the area of research, evaluation, and education in a broad sense, which has not been published elsewhere and is not under consideration by other journals.
• manuscripts are accepted in english. any consistent spelling and punctuation styles may be used.
please use single quotation marks, except where ‘a quotation is “within” a quotation’. long quotations of 40 words or more should be indented without quotation marks.
• a typical manuscript is approximately 8000 words or 12-18 pages including tables, references, captions and endnotes. manuscripts that greatly exceed this will be critically reviewed with respect to length. (a4; margins: top 3, left 3, right 2, bottom 2; double columns [except in abstract: single column]; single-spaced; font: garamond, 12).
• manuscripts should be compiled in the following order: (1) title; (2) abstract; (3) keywords; (4) main text: introduction, research method, findings and discussion, conclusion and/or suggestions; (5) references.
• the title of the manuscript should clearly represent the content of the article, and contain the keywords.
• authors' name(s) should be written under the title (without any academic degree), along with the affiliation(s) and email address(es).
• an abstract that does not exceed 300 words is required for any submitted manuscript. it is written narratively and contains the aim(s), method, and result(s) of the research.
• each manuscript should have 3 to 5 keywords written under the abstract.
• all tables and figures are adjusted to the paper length, and numbered as referred to in the text.
• citations and references follow american psychological association (apa) style, for example: .......... (switzgerald, 2014, p. 8) ............. mardapi (2015, pp. 13-14) [in text].
• american psychological association (apa) style format is used.
• the manuscript must be in *.doc or *.rtf, and sent to reid's management via online submission by creating an account in this open journal system (ojs) [click register if you do not have an account yet; or click log in if you already have an account].
• authors' biography must be written narratively, containing each author's full name, degree(s) attained, place and date of birth, the last three educational levels taken, affiliation/department in which the author is currently working, phone number and email address.
• all author(s)' names and identity(es) must be completely embedded in the form filled in by the corresponding author: email; affiliation; and each author's short biography (in the column of 'bio statement'). [if the manuscript is written by two or more authors, please click 'add author' in the 3rd step of 'enter metadata' in the submission process and then enter each author's data.]
• funding or grant-awarding bodies (if any) are acknowledged in the column of ‘contributors and supporting agencies’ when entering metadata in the open journal system (ojs) of the journal. for single agency grants: "this work was supported by the [name of funding agency] under grant [number xxxx]."
• all correspondence, information and decisions for the submitted manuscripts are sent to the email address written in the manuscript and/or the email address used for the submission.
• a word template is available for this journal.
if you have template queries, please contact reid.ppsuny@gmail.com mailto:reid.ppsuny@gmail.com research and evaluation in education iv reid, 2(2), december 2016 subject indexes a academic performance, 206, 207, 208, 209, 210, 211, 215, 216, 217, 218 aiken formula, 155, 161, 162, 163 assessment instruments, 135, 137, 139, 140, 141, 142, 152 b benchmark, 165, 167, 168, 169, 170, 171, 179, 180, 198, e effective teaching, 135, 136, 137, 138, 139, 141, 152, 153 english, 135, 136, 137, 138, 139, 140, 141, 142, 144, 145, 146, 147, 148, 150, 151, 152, 154, 209 evaluation model, 194, 195, 196, 197, 198, 199, 200, 202, 203, 204, 205 expanded gregory formula, 155, 163 f faith, 182, 183, 184, 185, 186, 189, 190, 191, 192, 193212, 217 h higher education, 130, 154, 166, 179, 180, 205, 206, 216 historical awareness, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121 historical research method, 108, 109, 110, 112, 115, 117, 119 hotel accommodation, 135, 139 i indonesian scholastic aptitude test (tbs), 165, 166, 167, 170, 171, 172, 173, 175, 176, 178 islamic perspective, 206, 209, 213, 214 j junior high school, 177, 194, 196, 198, 199, 204, 205 k knowledge of historical event, 108, 109, 110, 111, 112, 115, 117 m meaning of historical event, 108, 109, 110, 111, 112, 113, 115, 117, 119 measurement, 108, 109, 111, 114, 115, 116, 117, 118, 119, 134, 140, 153, 157, 158, 159, 162, 163, 164, 165, 167, 168, 169, 170, 171, 178, 179, 180, 181, 182, 184, 185, 186, 189, 190, 191, 193, 195, 198, 199 measurement model, 108, 109, 111, 114, 115, 116, 117, 118, 119, 184, 186, 189, 190, 198, 199 model development, 116, 122, 124, 125, 126, 128, 197, 204 moral competence, 206, 207, 208, 209, 210, 211, 214, 215, 216, 217 q quality assurance, 153, 194, 195, 196, 197, 198, 204 r reliability, 116, 117, 118, 119, 125, 130, 132, 133, 134, 135, 140, 141, 152, 157, 162, 163, 182, 185, 192, 198, 201, 202, 213, 214, 217, 218 s scale, 134, 139, 155, 157, 160, 161, 162, 163, 165, 168, 169, 170, 
171, 178, 179, 180, 181, 182, 185, 186, 191, 192, 193, 200 scholastic aptitude, 165, 166, 170, 178, 180 soft skills, 122, 123, 124, 126, 127, 128, 129, 130, 133 srl scale, 155, 157, 160, 163 u usefulness of history, 108, 109, 111, 113, 115, 117, 119 research and evaluation in education research and evaluation in education, issn 2460-6995 v v validity, 116, 117, 118, 119, 122, 125, 127, 130, 132, 133, 135, 139, 152, 153, 155, 157, 158, 159, 160, 161, 162, 163, 164, 182, 185, 192, 198, 212, 218 validity coefficient, 125, 155, 159, 160, 161, 162, 163 research and evaluation in education vi reid, 2(2), december 2016 author indexes aisiah, 108 azwar, saifuddin, 165, 167, 179, 184, 192 dauda, chetubo kuta, 206 effendi, z. mawardi, 122 inra, azwar, 122 krisna, idwin irma, 165 kumaidi, 181 mardapi, djemari, 109, 114, 118, 120, 121, 141, 154, 165, 168, 180 paiko, isa imam, 206 retnawati, heri, 155, 168, 180 sakariyau, olalekan busra, 206 shodiq, 181 soenarto, 194 sudarsono, franciscus xaverius, 135 sugiyanta, 194 suhartono, 108, 112, 113, 121 sukamto, 122 sumarno, 108 widodo, estu, 135 zamroni, 181 zubairu, umaru mustapha, 206 research and evaluation in education research and evaluation in education, issn 2460-6995 vii authors’ biography aisiah. was born in piladang, west sumatera, indonesia, on 15 june 1981. her bachelor degree was attained from universitas negeri padang in 2004, majoring history education. in 2005, she was appointed to be a lecturer in history study program, faculty of social sciences, universitas negeri padang, and started to be actively involved in scientific activities in the forms of seminar/ conference, training, research, and scientific writing. her master degree was attained in 2009 and doctoral degree in 2016, both from universitas negeri yogyakarta majoring educational research and evaluation. azwar inra. was born on august 22nd, 1952 in maga, a small village in southern tapanulis, northern sumatra. 
he started his undergraduate study in the civil engineering education study program, faculty of engineering education, institute of teachers training and education of yogyakarta, under a joint program held by universitas negeri yogyakarta and universitas negeri padang (cross program). he finished his undergraduate study in 1978. then, he continued his graduate study, majoring in vocational technology education, at universitas negeri padang, and he graduated in 2010. after finishing his undergraduate study, he was appointed as a lecturer at universitas negeri padang. in the period of 2003-2007, he was elected to be the first vice dean of the faculty of engineering, universitas negeri padang.

chetubo kuta dauda. he is a lecturer in the department of entrepreneurship and business studies at federal university of technology minna. he is currently a phd candidate at the university of sokoto, nigeria.

djemari mardapi. was born on january 1, 1947. he is a professor at universitas negeri yogyakarta, indonesia. he currently works as a lecturer at the faculty of engineering and the graduate school of universitas negeri yogyakarta. he obtained a bachelor's degree in electrical engineering education in 1973 from yogyakarta institute of teacher education and educational sciences (now known as universitas negeri yogyakarta). in 1984, he achieved his master's degree in educational research and evaluation from universitas negeri yogyakarta. he obtained a ph.d. in educational measurement and statistics from the university of iowa in 1988.

estu widodo. was born in banyuwangi, east java, indonesia, on 20 may 1968. he currently works as a lecturer at the english department of universitas muhammadiyah malang. he attained his bachelor's degree in 1991 from the english education department of universitas jember, east java. he attained his master's degree in 1999 from universitas gadjah mada, majoring in american studies.
he attained his doctoral degree in 2016 from the study program of educational research and evaluation, universitas negeri yogyakarta.

franciscus xaverius sudarsono. he works as a lecturer at the department of educational research and evaluation, graduate school of universitas negeri yogyakarta, in the expertise field of evaluation research.

heri retnawati. was born on 3 january 1973. she currently works as a lecturer in the faculty of mathematics and natural sciences of universitas negeri yogyakarta, indonesia. she attained her bachelor's degree in mathematics education in 1996 at yogyakarta institute of teacher education and educational sciences (now known as universitas negeri yogyakarta), her master's degree in educational research and evaluation in 2003, and her doctoral degree in the same major in 2008, both at the same university.

idwin irma krisna. was born in jakarta on october 22, 1973. she currently works at the centre of educational assessment, office of research and development, ministry of education and culture of the republic of indonesia. in 1997, she attained her bachelor's degree from the faculty of mathematics and natural sciences, universitas gadjah mada. she obtained her master's degree in psychology/psychometrics from universitas indonesia. she obtained her doctoral degree from universitas negeri yogyakarta in 2016, majoring in educational research and evaluation.

isa imam paiko. he currently works as a lecturer in the department of entrepreneurship and business studies at federal university of technology minna. he is currently a phd candidate at ahmadu bello university, nigeria.

kumaidi. he currently works as a lecturer of psychometrics and psychological statistics in the faculty of psychology of universitas muhammadiyah surakarta. he attained his bachelor's degree in 1976 from universitas negeri yogyakarta, indonesia, majoring in mechanical engineering education.
he attained his master's degree in 1984 from the university of iowa, usa, majoring in educational measurement and statistics, and his doctoral degree in 1987 in the same major at the same university.

olalekan busra sakariyau. he works as a lecturer in the department of entrepreneurship and business studies at federal university of technology minna. he obtained his phd in entrepreneurship from universiti malaysia sarawak.

saifuddin azwar. was born on july 3, 1950. he is a professor at universitas gadjah mada. he currently works as a lecturer at the faculty of psychology of universitas gadjah mada, and also teaches at the graduate school as a part-time lecturer, specializing in psychological measurement, test construction, and research methods. he attained his bachelor's degree in 1976 from the faculty of psychology. in 1982, he obtained a master's degree in education from the university of iowa. in 2008, he completed his doctoral study at the faculty of psychology, universitas gadjah mada.

shodiq. was born in pati, central java, indonesia, on 5 december 1968. he currently works in universitas islam negeri walisongo, semarang, central java, indonesia. he completed his bachelor's degree, majoring in islamic education, in 1993 at universitas islam negeri walisongo (formerly known as islamic state institute of walisongo). he completed his master's degree in 1998, majoring in islamic studies, at islamic state institute of sumatera utara. he attained his doctoral degree in 2014 from universitas negeri yogyakarta, majoring in educational research and evaluation.

soenarto. was born on 4 august 1948. he works as a senior lecturer in the faculty of engineering and the graduate school of universitas negeri yogyakarta, indonesia.
he attained his bachelor's degree in 1974 from yogyakarta institute of teacher education and educational sciences (now known as universitas negeri yogyakarta), majoring in electrical engineering education. in 1984, he attained his master of science degree in industrial and vocational education from state university of new york, usa. in 1987, he also achieved a master of arts degree in educational program evaluation from ohio state university, usa. he attained his doctor of philosophy degree in industrial vocational education in 1988 from ohio state university, usa.

sugiyanta. was born in sleman, yogyakarta, indonesia, on 19 june 1968. he obtained his associate degree (diploma-3) from the department of physics education of yogyakarta institute of teacher education and educational sciences (now known as universitas negeri yogyakarta) in 1991. he achieved his bachelor's degree in 1995 in the same major at universitas terbuka. he attained his master's degree in 2003 from the department of educational research and evaluation of the graduate school of universitas negeri yogyakarta. he attained his doctoral degree in 2016 in the same major at the same university. he currently works in the educational quality assurance institution of yogyakarta, indonesia.

suhartono. he currently works as a professor at the faculty of letter and culture of universitas gadjah mada in the history science department. he attained his bachelor's degree in 1966, majoring in history science, from the faculty of letter and culture of universitas gadjah mada. he continued his study in the post graduate training program at leiden university, the netherlands (1978-1979). he conducted archival studies at the rijsarchief nederland institute for asian studies, leiden, the netherlands (2003), and at the institute of oriental culture, tokyo university, and the institute of asian pacific studies, waseda university, japan (2004).

sukamto.
was born on 25 february 1947. he is a professor in the department of mechanical engineering education of the faculty of engineering, yogyakarta state university, indonesia. he has been actively participating in scientific research since 1973, in the expertise field of vocational and technology education.

sumarno. he formerly worked as a senior lecturer in universitas negeri yogyakarta in the expertise field of program management of non-formal education and educational research and evaluation. he attained his bachelor's degree from the department of social education of yogyakarta institute of teacher education and educational sciences (now known as universitas negeri yogyakarta) in 1972. he attained his master's degree in 1980 and his doctoral degree in educational planning and sociology in 1986, both from macquarie university, sydney, australia.

umaru mustapha zubairu. he currently works as a lecturer in the department of entrepreneurship and business studies at the federal university of technology minna. he is currently a phd candidate in accounting at the international islamic university malaysia.

z. mawardi effendi. was born in koto panjang, tanah datar, west sumatera, indonesia, on 4 november 1950. he currently works as a lecturer in the faculty of economics, universitas negeri padang, indonesia. he was elected to be the rector of universitas negeri padang in the periods of 2003-2007 and 2008-2012. he attained his doctoral degree in educational technology from jakarta institute of teacher education and educational sciences (now known as universitas negeri jakarta) in 1990.

zamroni. he currently works as a senior lecturer in universitas negeri yogyakarta in the expertise field of educational sociology.
he completed his bachelor's degree at yogyakarta institute of teacher education and educational sciences (now known as universitas negeri yogyakarta), focusing on economic enterprise, and his master's and doctoral degrees (m.a. and ph.d.) at florida state university, united states of america. he has been active in publication since he started working in universitas negeri yogyakarta in 1974.

submission guidelines

• the submitted manuscript must be the result of research or a scientific assessment of a current issue in the area of research, evaluation, and education in a broad sense, which has not been published elsewhere and is not under consideration by other journals.
• manuscripts are accepted in english. any consistent spelling and punctuation styles may be used. please use single quotation marks, except where ‘a quotation is “within” a quotation’. long quotations of 40 words or more should be indented without quotation marks.
• a typical manuscript is approximately 8000 words or 12-18 pages including tables, references, captions and endnotes. manuscripts that greatly exceed this will be critically reviewed with respect to length. (a4; margins: top 3, left 3, right 2, bottom 2; double columns [except in abstract: single column]; single-spaced; font: garamond, 12).
• manuscripts should be compiled in the following order: (1) title; (2) abstract; (3) keywords; (4) main text: introduction, method, findings and discussion, conclusion and/or suggestions; (5) references.
• the title of the manuscript should clearly represent the content of the article, and contain the keywords.
• authors' name(s) should be written under the title (without any academic degree), along with the affiliation(s). the email address(es) of the corresponding author should be placed in the footer of the first page of the paper.
• an abstract that does not exceed 300 words is required for any submitted manuscript. it is written narratively and contains the aim(s), method, and result(s) of the research.
• each manuscript should have 3 to 5 keywords written under the abstract.
• all tables and figures are adjusted to the paper length, and numbered as referred to in the text.
• citations and references follow american psychological association (apa) style, for example: .......... (switzgerald, 2014, p. 8) ............. mardapi (2015, pp. 13-14) [in text].
• american psychological association (apa) style format is used.
• the manuscript must be in *.doc or *.rtf, and sent to reid's management via online submission by creating an account in this open journal system (ojs) [click register if you do not have an account yet; or click log in if you already have an account].
• authors' biography must be written narratively, containing each author's full name, degree(s) attained, place and date of birth, the last three educational levels taken, affiliation/department in which the author is currently working, phone number and email address.
• all author(s)' names and identity(es) must be completely embedded in the form filled in by the corresponding author: email; affiliation; and each author's short biography (in the column of 'bio statement'). [if the manuscript is written by two or more authors, please click 'add author' in the 3rd step of 'enter metadata' in the submission process and then enter each author's data.]
• funding or grant-awarding bodies (if any) are acknowledged in the column of ‘contributors and supporting agencies’ when entering metadata in the open journal system (ojs) of the journal. for single agency grants: "this work was supported by the [name of funding agency] under grant [number xxxx]."
• all correspondence, information and decisions for the submitted manuscripts are given through the email address written in the manuscript and/or the email addresses used for the submission.
• a word template is available for this journal. if you have template queries, please contact reid.ppsuny@gmail.com

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(2), 2017, 106-113
available online at: http://journal.uny.ac.id/index.php/reid

research article

an evaluation of vocational high schools in indonesia: a comparison between four-year and three-year programs

* 1 soenarto; 2 †muhammad mustaghfirin amin; 3 kumaidi
*faculty of engineering, universitas negeri yogyakarta jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*email: soenarto@uny.ac.id
submitted: 28 november 2017 | revised: 15 december 2017 | accepted: 22 december 2017

abstract

the research aimed to gain insights into the quality of four-year program vocational high school (vhs) in indonesia compared to three-year program vhs. the evaluation was based on three standpoints: the school graduate standard, the business sector and industrial sector (dunia usaha dan dunia industri, dudi), which assesses the performance of the graduates, and the alumni (the graduates’ satisfaction). the research was conducted using the discrepancy evaluation model in 16 vhss (eight four-year program vhss and eight three-year program vhss). the result shows that from the standpoint of the school, the graduates of the four-year program vhs are higher in quality than those of the three-year program vhs. the four-year program vhs graduates are more qualified in seven aspects: teamwork, discipline, tenacity, theoretical knowledge, confidence, creativity, and leadership. meanwhile, from the dudi standpoint, the four-year program vhs graduates are also higher in quality than the three-year program vhs graduates.
in addition, the four-year program vhs graduates are better in the quality of their discipline, tenacity, theoretical knowledge, practical skills, confidence, carefulness, creativity, and leadership. the four-year program vhs graduates have a higher level of satisfaction in terms of income than the three-year program vhs graduates. the higher quality of the four-year program vhs graduates has resulted from the longer duration of the internship program (pkl), which provides them with reliable experience and skills concerning work-related problem-solving activities.

keywords: vocational high school, graduates, four-year program, three-year program

how to cite item: soenarto, s., amin, m., & kumaidi, k. (2017). an evaluation of vocational high schools in indonesia: a comparison between four-year and three-year programs. reid (research and evaluation in education), 3(2), 106-113. doi:http://dx.doi.org/10.21831/reid.v3i2.17077

introduction

an education institution is a human resource production house with managerial competency related to human resources and other related resources. thus, it is the duty of education institutions to keep the process of improvement going and to produce graduates who fulfill the needs of society. the needs of society evolve as time passes and as circumstances change. as the asean free trade area (afta) and the asean economic society (mea) were put into effect in 2003 and in 2015 respectively, the demand of the business sector and industry sector (dunia usaha dan dunia industri or dudi) for an innovative and creative workforce is on the rise. on the other hand, free competition in the open market has made the distribution of goods, services, capital and market-ready skilled labor even more dynamic.
to survive under such circumstances, indonesia has to prepare itself for upcoming chances and challenges in the global market. alisjahbana (2014) states that in the free trade era, from the standpoint of population, manpower and human resources, indonesia has to pay more attention to three things: (1) keeping the demographic momentum, (2) improving the participation of manpower, and (3) improving manpower productivity. keeping the demographic momentum means maintaining the advantage of indonesian demography by pushing the fertility rate and driving migration. migration is an effective strategy to sustain economic growth. the demographic momentum as the foundation of indonesian economic strength has to be followed by the effort to improve manpower participation by nurturing a flexible and efficient working climate and driving the participation of women in improving the national economy. manpower participation is not the only area in need of improvement in indonesia. improvement is also needed in the area of manpower productivity, which can be a competitive advantage that improves the competitive edge of manpower in the open market.

vocational high school (vhs, or smk (sekolah menengah kejuruan)) is one of the education institutions responsible for producing skilled workers with the ability to adapt to changes in the needs of society brought about by the dynamic international economy, with the support of the indonesian demographic bonus. vhs can be a powerful weapon in improving manpower participation and productivity by taking advantage of education processes. pardjono, sugiyono, and budiyono (2015) state that ‘vocational education cannot be removed from existing workforce development’.
in the same light, ramayani, aimon, and anis (2012) conclude in their research that the indonesian government has to support the efforts made to improve manpower productivity by producing policy that focuses on education and health and by providing more funds in the areas related to human resource building.

in the law no. 20 of 2003 of republic of indonesia concerning national education system, vhs is defined as the education institution responsible for preparing students to work in certain fields of work. dewey (1916) argues that ‘a vocation means nothing but such a direction of the life activities as renders them perceptibly significant to a person, because of the consequences they accomplish, and also useful to his associates.’ moreover, thompson (1973) argues that vocational education improves students’ skills, which eventually will improve their productivity. vhss then play an important role in determining the competitive edge of indonesian manpower by providing ready-to-work and high quality workers for national and also international needs. as stated by komariah (2010), vocational high school is an education institution responsible for preparing students for the labor market and the nation-building effort.

prior to 1970, vocational high school and regular high school had the same study duration: three years. in 1970, as stipulated in the first pelita (five-year development, or pembangunan lima tahun in indonesian) program, the indonesian government built eight four-year program engineering vocational high schools under the banner of ‘proyek perintis sekolah teknologi menengah pembangunan’ (‘development engineering high school initiative project’). in 1974, the indonesian government built four more four-year program vocational high schools, this time with agriculture as the concentration.
the goal of this project is to prepare industrial technicians or workers with engineering skills possessing (1) an attitude of initiative, (2) the ability to work and love the work, and (3) the ability to understand, manage, and implement the ideas of engineers from the upper level and to provide guidance to the technical workers from lower levels. the four-year program is expected to support vocational high schools in producing skilled workers with a competitive edge.

all of the goals of four-year program vocational high schools are in the national education standard, specifically the standards for the graduates. in the regulation of the minister of education and culture no. 20 of 2016, the competence standard of the graduates (standar kompetensi lulusan/skl) is the formulation of qualification criteria for graduates, which are achieved upon the completion of a series of programs and education in the area of attitude, knowledge, and skills.

in order to achieve the goals of the development engineering high school initiative project – which is now known as four-year program vocational high schools – the directorate of vocational high school administration focuses on the improvement of curriculum, learning and teaching process, and evaluation process. to take everything one step further, the directorate also focuses on the improvement of teacher professionalism and builds cooperation with parties involved in the business sector and industrial sector (dudi).

however, there were doubts about the effectiveness of four-year program vocational high schools when the national statistics board (badan pusat statistik/bps) released data on the number of unemployed people in indonesia in 2014.
the data show that there were 2.179 million unemployed graduates of vocational high schools, which is 15.15% of the total number of unemployed people in indonesia's workforce aged above 15. the number is an accumulation of all unemployed graduates of four-year program vocational high schools and of three-year program vocational high schools. no distinction was made between the graduates of four-year program vocational high schools and those of three-year program vocational high schools in the data presented by bps, although the two groups do not follow the same education process. this phenomenon then made us wonder about the quality of the graduates of both programs and the differences between them. table 1 shows the number of openly unemployed people with vhs education.

table 1. vocational high school graduate unemployment data in 2011-2014

year    total number
2011    2,270,873
2012    2,085,474
2013    2,122,850
2014    2,179,886

the questions related to the worth or merit of four goals of the four-year program vhss can only be answered through evaluation. stufflebeam and shinkfield (1984) define evaluation as ‘systematic assessment of the worth or merit of some objects’. in this case, evaluation is conducted to define the worth or merit of the goals of four-year program vocational high schools. stufflebeam, madaus, and kellaghan (2000) state that the process of evaluation should not be alien to the process of comparing. the evaluation of the worth or merit of the four-year program vhss is conducted by comparing the competence of the graduates of both programs. the competence of the graduates is measured against the standards set by schools of origin as the providers of education services, the standards of dudi (in terms of the performance of the graduates) as the employer of the graduates, and the personal standard of the graduates (level of satisfaction) related to their jobs.
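as a quick check on the trend in table 1, the year-to-year change in the number of unemployed vhs graduates can be computed directly from the four reported totals. the sketch below is only an illustration and is not part of the original study; the function name `yearly_changes` is hypothetical, and the last row of the table is assumed to refer to 2014.

```python
# unemployed vhs graduates per year, copied from table 1
# (assumption: the final row of the table refers to the year 2014)
unemployed = {
    2011: 2_270_873,
    2012: 2_085_474,
    2013: 2_122_850,
    2014: 2_179_886,
}


def yearly_changes(series):
    """return {year: (absolute change, percent change)} versus the previous year."""
    years = sorted(series)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        diff = series[curr] - series[prev]
        changes[curr] = (diff, round(100 * diff / series[prev], 2))
    return changes


for year, (diff, pct) in yearly_changes(unemployed).items():
    print(f"{year}: {diff:+,} ({pct:+.2f}%)")
```

read this way, the totals fall from 2011 to 2012 and then rise in each of the following two years, which is consistent with the concern raised about the 2014 figure.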
method

the goal of this evaluative research was to gain insights into the quality of education provided in both programs (three-year program and four-year program) of vocational high schools. the method applied in this research was the discrepancy evaluation model (dem) developed by provus. the discrepancy evaluation model identifies the discrepancy between the standards used as the basis of assessment and the performance in reality (kaufman & susan, 1982, p. 127). this research used three-year program vocational high schools’ graduates as the basis of assessment. the performance of the graduates of the three-year program vocational high school was set as the basis or standard of assessment because it was the basis of the innovation that was later known as the four-year program vocational high school. innovation in this case is the production of something better than the existing product or program.

this research was conducted in eight three-year program vhss and eight four-year program vhss. all of the selected four-year program vhss were part of the early four-year program initiative. the three-year program vhss, on the other hand, were selected based on their similarity to the selected four-year program vhss in terms of school location. the respondents included all parties involved in the management of the vocational high schools, such as (1) head master, (2) vice head master, (3) head of skill programs, (4) labor market coordinator, (5) guidance and counseling coordinator, (6) alumni, and (7) business sector and industrial sector. table 2 shows detailed information about the respondents involved in the research.

table 2. research respondents

no  research subjects                      four-year program  three-year program  total
1   head master                            8                  8                   16
2   vice head master                       32                 32                  64
3   head of skill program                  40                 40                  80
4   special labor market coordinator       8                  8                   16
5   guidance and counseling coordinator    8                  8                   16
6   alumni                                 40                 40                  80
7   dudi                                   40                 40                  80
    total                                  176                176                 352

the research data were collected using questionnaire, observation, interview and documentation. the questionnaire was distributed to gather the opinions of schools and parties in dudi about the competence of the graduates from both programs. in addition, the questionnaire was also used to gain insights into the level of satisfaction of the graduates related to their jobs. the validity test showed that the questionnaire was capable of measuring the intended data validly. the instrument reliability estimation shows that the questionnaire had a reliability coefficient of 0.83, which can be categorized as reliable. in collecting the supporting data related to the graduates, this research used not only a questionnaire but also observation, interview and documentation. the validity of the interview guide and observation guide was tested via expert judgement. in order to present the data in the form of tables and diagrams, the data collected from the various instruments were tabulated and analyzed using descriptive statistics. in this step, the qualitative data were used as support for the quantitative descriptive findings.

findings and discussion

the result of the data analysis on the competence, performance, and level of satisfaction of the graduates of the four-year program vocational high schools in comparison with that of the graduates of the three-year program vocational high schools is described as follows.
the graduates’ competence

there are 11 aspects studied in this research, including teamwork, discipline, ethics, tenacity, theoretical knowledge, practical skill, confidence, carefulness, creativity, sense of responsibility, and leadership. figure 1 shows the competence of the graduates of four-year program and three-year program vocational high schools from the standpoint of school.

figure 1. the competence of vocational high school graduates from the standpoint of school

upon the analysis of the above-mentioned aspects, from the standpoint of school – represented by head masters and vice head masters – all of the graduates of four-year program vocational high schools possess ‘very good’ competence. this result is better than the result for graduates of three-year program vocational high schools, of whom only 64% are in the ‘very good’ category and the rest (36%) are in the ‘good’ category.

figure 2. aspects of competence of vocational high school graduates

the result of the assessment on all of the aspects shows that the graduates of four-year program vocational high schools are better in seven out of eleven assessed aspects (teamwork, discipline, tenacity, theoretical knowledge, confidence, creativity, and leadership). on the other four aspects, the competence of the graduates from both programs is at the same level. the superiority of the graduates of the four-year program vocational high schools on the seven aspects resulted from the rich experience gained in a longer internship program (or praktik kerja lapangan (pkl) in indonesian). this longer internship program gave the students of four-year program vocational high schools sufficient time to internalize in-class knowledge.
in addition, the longer internship program made the students more experienced in problem-solving activities in real daily work. the superiority of the four-year program vocational high school graduates in the seven aspects made them more competent in the business world. figure 2 shows the competence of the graduates of both programs in every measured aspect.

the performance of the graduates

in this research, the performance of the graduates from both programs is also analyzed from the standpoint of the business sector and industrial sector (dudi), specifically their aspects of competence. there are 11 aspects measured, including teamwork, discipline, ethics, tenacity, theoretical knowledge, practical skills, confidence, carefulness, creativity, sense of responsibility, and leadership. figure 3 shows the comparison in terms of performance of the graduates from both programs.

figure 3. the performance of vocational high school graduates

the research result shows that 36% of the four-year program vocational high school graduates are in the ‘very good’ category, whereas 64% of them are in the ‘good’ category. the overall performance of the graduates of four-year program vhss is better than that of the graduates of three-year program vhss, since all of the graduates of three-year program vhss are in the ‘good’ category.

figure 4. the performance of the graduates from the standpoint of dudi

according to the employers or dudi, the four-year program vocational high school graduates show superior performance in eight aspects, or 72.72% of the total aspects studied: discipline, tenacity, theoretical knowledge, practical skill, confidence, carefulness, creativity, and leadership (see figure 4).
as in the competence analysis from the standpoint of the schools, the superiority in these aspects results from the longer internship programs, which provide the students with richer and more reliable experience. however, there is something new and intriguing in this competence analysis from the standpoint of dudi. in the aspect of teamwork, the graduates of the four-year program vocational high schools are inferior to those of the three-year program vocational high schools. this is driven by the fact that the graduates of the four-year program vocational high school have the ability to accomplish tasks individually since they are equipped with a higher level of competence and experience.

the satisfaction of the graduates

the satisfaction of the graduates is an accumulation of the graduates’ personal opinions about their jobs. there are nine aspects studied from this standpoint: income, working atmosphere, relationship with supervisors, relationship with co-workers, intention to get another job, working satisfaction, working facilities, working environment, and health insurance. the data on the satisfaction level of the graduates related to their jobs were collected by distributing questionnaires to the graduates of vocational high schools.

figure 5. satisfaction level of the graduates

figure 5 shows the satisfaction level of the graduates of both programs. the data show that 67% of the graduates of four-year program vocational high schools express a ‘very good’ level of satisfaction toward their jobs, 22% of them express ‘good’ satisfaction, and 11% of them are in the ‘low’ category. the level of satisfaction of the four-year program vocational high school graduates is higher than that of the three-year program vocational high school graduates, of whom 56% express a ‘very good’ level of satisfaction toward their jobs.
as many as 33% of them are in the ‘good’ category and 11% of them are in the ‘low’ category. the most striking difference is found in the aspect of income; the data show that the monthly income of the graduates of four-year program vocational high schools is between rp 1,100,000.00 and rp 5,000,000.00, while the income of the graduates of three-year program vocational high schools is between rp 1,000,000.00 and rp 2,500,000.00. the income of the graduates of four-year program vocational high schools is aligned with their level of competence and performance.

conclusion and suggestions

conclusion

the results of the research can be summarized as follows: (1) from the standpoint of school, the competence of the graduates of four-year program vocational high schools is superior in seven aspects: teamwork, discipline, tenacity, theoretical knowledge, confidence, creativity, and leadership; (2) from the standpoint of the employers of the graduates (dudi), the graduates of four-year program vocational high schools are superior in eight aspects: discipline, tenacity, theoretical knowledge, practical skill, confidence, carefulness, creativity, and leadership; (3) from the standpoint of personal satisfaction with the jobs obtained, the graduates of four-year program vocational high schools expressed a higher level of satisfaction, specifically in terms of income; (4) the competence superiority of the graduates of four-year program vocational high schools resulted from a longer internship program (pkl), which provided students with richer skills as well as experience related to problem-solving activities in real daily work.

suggestions

based on the conclusion, some suggestions are proposed as follows: (1) reconsider the role of the internship program in the students' learning process.
for optimized results, there should be more systematic, effective and efficient development, execution, and evaluation in the internship program of the four-year program vocational high schools; (2) for the internship program to be more efficient and effective, there should be unified evaluative efforts among all involved parties in schools and in the business sector and industrial sector; (3) the result of the research shows that the graduates of the four-year program vocational high school are superior to the graduates of the three-year program vocational high school in terms of working competence. hence, there should be better appreciation for the graduates of the four-year program vocational high schools; (4) even though the graduates of the four-year program vocational high schools have sufficient skills to complete tasks individually, there should be a more teamwork-focused learning process for them.

references

alisjahbana, a. s. (2014). arah kebijakan dan program di bidang kependudukan, ketenagakerjaan, dan sumber daya manusia menghadapi globalisasi khususnya masyarakat ekonomi asean. in seminar nasional tantangan kependudukan, ketenagakerjaan, dan sdm indonesia menghadapi globalisasi khususnya masyarakat ekonomi asean. jakarta: ikatan praktisi dan ahli demografi indonesia (ipadi).
dewey, j. (1916). democracy and education: an introduction to the philosophy of education. new york, ny: dover publication.
kaufman, r., & susan, t. (1982). evaluation without fear. london: new view points.
komariah, k. (2010). memimpikan smk di masa depan. in seminar nasional prospek pengembangan pendidikan vokasional dalam era globalisasi (pp. 127–132). bandung: culinary education study program, fptk, universitas pendidikan indonesia.
law no. 20 of 2003 of republic of indonesia on national education system (2003).
pardjono, p., sugiyono, s., & budiyono, a. (2015). developing a model of competency certification test for vocational high school students. reid (research and evaluation in education), 1(2), 129–145. https://doi.org/http://dx.doi.org/10.21831/reid.v1i2.6517
ramayani, c., aimon, h., & anis, a. (2012). analisis produktivitas tenaga kerja dan pertumbuhan ekonomi indonesia. jurnal kajian ekonomi, 1(1). retrieved from http://ejournal.unp.ac.id/index.php/ekonomi/article/view/738
regulation of the minister of education and culture no. 20 of 2016 on the competence standard of primary and secondary education graduates (2016). republic of indonesia.
stufflebeam, d. l., madaus, g. f., & kellaghan, t. (2000). evaluation models: viewpoints on educational and human services evaluation (2nd ed.). boston, ma: kluwer academic publishers.
stufflebeam, d. l., & shinkfield, a. j. (1984). systematic evaluation: a self-instructional guide to theory and practice. dordrecht: springer netherlands.
thompson, j. f. (1973). foundations of vocational education: social and philosophical concepts. englewood cliffs, nj: prentice-hall.

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(1), 2017, 42-49
available online at: http://journal.uny.ac.id/index.php/reid

research article

exploring the construct of school readiness based on child development for kindergarten children

* 1 farida agus setiawati; 2 rita eka izzaty; 3 agus triyanto
*faculty of education, universitas negeri yogyakarta jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*email: farida_as@uny.ac.id
submitted: 13 april 2017 | revised: 05 june 2017 | accepted: 06 june 2017

abstract

the indonesian government has regulated that the basic age of readiness of a child to attend elementary school is 7 years old. in fact, some children are not exactly 7 years old when they first go to school because they develop more rapidly.
this study is aimed at investigating some aspects of child development which affect children's readiness to attend elementary school. the subjects were 101 grade 1, 2, and 3 teachers of elementary schools in yogyakarta, a special region in indonesia. the data were collected through interviews. the results of the data collection were analyzed using both descriptive quantitative and qualitative techniques. the results of the study show some aspects of child development affecting children's readiness to attend elementary school, including: cognitive and language ability, social emotional skills, fine motor skills, gross motor skills, arts, religion and moral values, and some others. besides these aspects, some problems in grades 1, 2, and 3 are also found. this study is expected to give significant indicators to create the construct of school readiness.

keywords: school readiness, elementary schools, child development

how to cite item: setiawati, f., izzaty, r., & triyanto, a. (2017). exploring the construct of school readiness based on child development for kindergarten children. reid (research and evaluation in education), 3(1), 42-49. doi:http://dx.doi.org/10.21831/reid.v3i1.13663

introduction

elementary school admission is conducted every year. when kindergarten-age children begin to attend elementary school, parents are usually curious and frequently ask what skills and capabilities their children must have to be considered ready for school. the government has already regulated new student admission through government regulation no. 17 year 2010 on educational management and implementation (2010) and ministry of national education (2011) regulation number 04/vi/pb/2011, consisting of national education regulations of new student admission in kindergarten or primary schools. this regulation says that the main indicator for new student admission is being 7-12 years old.
however, a problem arises because many children who were born in certain months are not exactly 7 years old when the admission is conducted. moreover, some children develop more rapidly than expected for their age; even though they are not yet 7 years old, their development is already at the level expected at that age.

school readiness is a factor which plays the most important role for children to succeed in the learning process. school readiness consists of children’s ability in various aspects that a human has, including emotion, cognition, language, social, and motor skills. the concept of school readiness has been widely defined and redefined by experts through various points of view. some theories of child development and learning are employed to explain the meaning of school readiness. from the existing definitions, there are two types of meaning for readiness. the first is studying readiness, which describes children’s readiness to be involved in learning physical material. the second is school readiness, which describes capabilities in various aspects that a human has, such as cognitive, language, social, and motor skills, which are related to the curriculum used (lewit & baker, 1995). moreover, dockett & perry (2002) add that chronological age must also be considered, because research findings indicate that chronological age has a positive correlation with the mental readiness and the development of each individual.
the national association for the education of young children in dockett & perry (2002) states that in determining children’s school readiness, the policy makers must consider these three aspects: (1) considering the children’s existing experience, including skills or abilities possessed, in order to be able to predict whether children can be involved in learning activities at a higher educational level or not; (2) recognizing individual differences in children, including differences in the language and culture used; (3) employing reasonable and appropriate expectations of children’s abilities which must be fulfilled as one of the school readiness requirements.

from the aspects which must be considered, the technical planning group (in dockett & perry, 2002) has identified some dimensions which become indicators in conducting an assessment of children’s school readiness, including: (1) motor development and physical condition; (2) social and emotional development; (3) learning approaches; (4) language use; (5) development of cognition and general knowledge.

some experts have created instruments to detect learning readiness in elementary schools. chew created an instrument named the lollipop test which measures school readiness (lemelin & boivin, 2007). this instrument was created to detect children’s school readiness in france. the aspects measured by this test strongly emphasize children’s cognitive skills. the indicators covered by this instrument include: (1) color and shape identification; (2) spatial recognition; (3) number identification and calculation; (4) letter identification and writing system. these four development indicators are given to children in the form of an interview or an oral test. another instrument developed to detect children’s school readiness is the early development instrument (edi). this instrument was developed by janus et al. (2006).
this test reveals some areas of child development, including: (1) physical health and well-being, (2) social competence and emotional maturity, (3) language and cognitive development, (4) communication and general skills.

in indonesia, the minister of national education has created some standards to be applied for early childhood education. the standards are accommodated by the ministry of national education (2014) in the minister regulation number 137 year 2014. one of the standards included in this regulation is standard 1. this standard contains the basic developmental tasks which must be taught to children in each development stage. at the end of early childhood, or age 5-6, the developmental characteristics which should have been attained are: (1) religious and moral values; (2) physical-motor skills, including gross motor skill, fine motor skill, health and safety behavior; (3) cognitive, including learning and problem solving, logical and symbolic thinking; (4) language skill, including language comprehension, language expression skills, and literacy; (5) social-emotional, consisting of self-awareness, sense of responsibility to self and others, and also pro-social behaviors; (6) arts, consisting of children’s ability to enjoy various songs, melodies, or voices, and interest in art activities.

the regulation shows that there are various dimensions or aspects that have to be taught to young children. those various aspects have become competencies which are expected to appear in 5-6 year-old children. however, among those afore-mentioned aspects, the extent to which each aspect plays a significant role in preparing school readiness has not been revealed yet. therefore, it is crucial to identify the aspects which contribute to preparing children’s school readiness.
if certain aspects do not appear or are not part of children’s characteristics, children will encounter many problems while studying in elementary school. the afore-mentioned explanation also shows that views differ on what underlies school readiness. for instance, the view of dockett & perry (2002) differs from that of lemelin & boivin (2007). this difference is affected by teacher and expert perceptions in determining the developmental aspects which underlie children’s school readiness. in addition, the situation, condition, and culture of learning processes also influence general readiness. therefore, it is important to examine the aspects which affect school readiness based on the distinctive characteristics of a region. thus, two research questions for the first year of this research are proposed: (1) what are the developmental aspects which affect children’s school readiness in elementary schools in indonesia? (2) what are the problems related to children’s lack of readiness to attend elementary schools?

method

this research combines qualitative and quantitative approaches. the research in this first year is exploratory, and that in the second year is developmental. the subjects of this research were 101 elementary school teachers of early grades from four districts in yogyakarta special region: bantul, sleman, kulonprogo, and yogyakarta municipality. the sampling technique used in this research was snowball sampling, in which each sample is identified through the previously investigated sample. the investigation began with key persons on the management board of the elementary teacher group, or kelompok kerja guru (kkg) in indonesian. they were asked to select some teachers in each elementary school. after identifying the teachers, the researchers interviewed them by phone.
this research was a survey aimed at revealing the developmental aspects which affect children’s readiness to attend elementary school. in collecting the data, the researchers employed an interview technique with the assistance of interview guidelines; the questions covered the aspects of school readiness and the problems of students who were not yet ready to attend elementary school. the interview results were identified, and the collected information was coded and extracted. in this way, the data revealed the developmental aspects which play important roles in children’s school readiness. in analyzing the data, a descriptive analysis technique was employed: the data were analyzed by the percentage of subjects’ responses.

findings and discussion

aspects of school readiness

the extracted data reveal several factors or dimensions which underlie children’s readiness to attend elementary school. the findings are presented in table 1. based on table 1, there are 29 child performances which influence school readiness. of these 29, the five most influential are concentration (15%), imitating movements (running, jumping, standing on one leg) and dancing (9.82%), team work (9.22%), recognizing letters (8.62%), and reading/comprehending reading texts (7.21%). on the other hand, the five lowest are obeying the rules/discipline (0.4%), drawing curves, straight lines, circles, and squares (0.4%), religious activities (praying, charity) (0.2%), sharing/helping others (0.2%), and creativity (0.2%). the performances shown in table 1 are classified into the developmental aspects used in the activity programs of kindergartens (cognitive, social emotional, gross motor skills, fine motor skills, art, and religion and moral), together with a category of other aspects.
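the percentage tabulation described in the method can be sketched in a few lines of python. this is a minimal sketch, not the authors’ procedure: it assumes that each percentage is the share of all coded mentions (the source does not state the denominator), and the list `coded_mentions` and the function `percentage_table` are hypothetical names filled with made-up values.

```python
from collections import Counter

# hypothetical coded interview mentions (one entry per coded statement);
# illustrative values only, not the study's raw data
coded_mentions = [
    "concentration", "concentration", "concentration",
    "team work", "team work",
    "recognizing letters", "recognizing letters",
    "creativity",
]


def percentage_table(mentions):
    """Return (performance, percent) pairs, most frequent first.

    Assumption: the percentage is the share of all coded mentions.
    """
    counts = Counter(mentions)
    total = sum(counts.values())
    return [(name, round(100 * n / total, 2)) for name, n in counts.most_common()]


table = percentage_table(coded_mentions)
for name, pct in table:
    print(f"{name}: {pct}%")
```

with real coded data, the same function would give one row per child performance, which could then be grouped by developmental aspect.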
reid (research and evaluation in education)
exploring the construct of school readiness based on child development...
farida agus setiawati, rita eka izzaty, & agus triyanto

table 1. child performance which influences school readiness
child performances | data finding (%) | aspects of development
concentration | 15 | other factors
imitating movements (running, jumping) | 9.82 | gross motor
team work | 9.22 | social emotional
recognizing letters | 8.62 | cognitive
reading/comprehending reading texts | 7.21 | cognitive
recognizing numbers | 6.81 | cognitive
parents (caring, breakfast, children’s readiness) | 5.61 | other factors
socialization with peers | 4.61 | social emotional
classroom adaptation | 4.21 | social emotional
fine motor (snipping and sticking) | 4.21 | fine motor
vocabulary | 3.81 | language
writing | 3.41 | fine motor skills
following rhythms, sounds, and tones | 2.81 | art
story telling | 2.2 | language
independence | 2 | social emotional
apprehending the rules | 1.6 | social emotional
age | 1.4 | other factors
apprehending instruction and information | 1.2 | cognitive
counting | 1 | cognitive
fluent speaking | 1 | language
problem solving | 0.8 | cognitive
drawing | 0.8 | art
holding stationery | 0.6 | fine motor skills
recognizing colors and their uses | 0.6 | cognitive
obeying rules/discipline | 0.4 | social emotional
drawing curves, straight lines, circles, and squares | 0.4 | fine motor skills
religious activities (praying, reading al quran, charity) | 0.2 | religious and moral values
sharing/helping others | 0.2 | social emotional
creativity | 0.2 | art

table 2. child performances on cognitive which influence school readiness
child performances | %
reading/comprehending reading texts | 7.2
recognizing numbers | 6.8
recognizing letters | 8.6
counting | 1.0
vocabulary building | 3.8
apprehending instruction and information | 1.2
story telling | 2.2
fluent speaking | 1.0
problem solving | 0.8
recognizing colors and their usage | 0.6
sum | 33.5

based on table 2, there are 10 cognitive child performances.
the three highest cognitive performances of school readiness are recognizing letters (8.6%), reading/comprehending texts (7.2%), and recognizing numbers (6.8%). these are followed by vocabulary building (3.8%), story telling (2.2%), comprehending instruction and information (1.2%), and fluent speaking and counting (1.0% each). the two lowest ranks are problem solving (0.8%) and recognizing colors (0.6%).

table 3. child performances of children’s readiness based on social emotional aspects
no. | social emotional performances | %
1. | team work | 9.2
2. | socialization with peers | 4.6
3. | classroom adaptation | 4.2
4. | independence | 2.0
5. | obeying rules/discipline | 2.0
6. | sharing/helping others | 0.2
 | sum | 22.3

table 3 shows that there are six child performances of school readiness based on social emotional aspects. the highest performance is team work (9.2%), the second is socialization with peers (4.6%), followed by classroom adaptation (4.2%), independence (2.0%), and obeying rules/discipline (2.0%). meanwhile, the lowest rank is sharing/helping others (0.2%).

table 4. child performances of fine motor skills which influence school readiness
performance of fine motor skills | %
cutting and pasting | 4.2
writing | 3.4
holding stationery | 0.6
drawing curves, straight lines, circles, and squares | 0.4
sum | 8.6

table 4 shows that there are four child performances of school readiness based on fine motor skills: cutting and pasting, writing, holding stationery, and drawing curves, straight lines, circles, and squares. the other motor skill is gross motor skill (see table 5), which has only one aspect, namely imitating movements (including running, jumping, standing on one leg) or dancing (9.8%); this aspect is the second highest overall.

table 5. the performance of school readiness based on gross motor skills
no. | performances of gross motor skills | %
1. | imitating movements (running, jumping, standing on one leg) or dancing | 9.8

table 6. child performances of school readiness based on art
no. | child performances of art | %
1. | following rhythms, sounds, and tones | 2.8
2. | drawing | 0.8
3. | creativity | 0.2
 | sum | 4.8

the art aspect is divided into three points (as presented in table 6). the highest is following rhythms, sounds, and tones (2.8%), followed by drawing (0.8%) and then creativity (0.2%).

table 7. child performances of school readiness based on religion and moral
religion and moral performance | %
religious activities (praying, reading al quran, charity) | 0.2

as presented in table 7, the next aspect is religion and morality, which consists of religious activities (0.2%). this phenomenon needs further identification because the religion and moral aspect, which is one of the aspects of school readiness, has the lowest percentage among the influential aspects.

table 8. child performances of school readiness based on other aspects
other performance | %
concentration | 15.0
parents’ caring | 5.6
age | 1.4
sum | 22.0

the last aspect of school readiness, as presented in table 8, covers other aspects which cannot appropriately be categorized into one of the existing aspects of development. this category is divided into three parts, namely concentration (15%), parents’ caring (5.6%), and age (1.4%). the three aspects make a great contribution, especially concentration and parents’ caring. in fact, the most important requirement of student admission in indonesia is age. therefore, the role of the other aspects, in particular concentration and parents’ caring, must also be considered. the findings show a number of developmental aspects which contribute to children’s learning readiness to attend elementary school.
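the ordering of the aspects can be read directly from the sums reported in tables 2 to 8. the following python sketch ranks those sums from the most to the least dominant; the dictionary name `aspect_sums` is an illustrative choice, and for tables 5 and 7, which list a single performance, that single value is used as the sum.

```python
# sums (percent of responses) as reported in tables 2 to 8
aspect_sums = {
    "cognitive": 33.5,           # table 2
    "social emotional": 22.3,    # table 3
    "fine motor skills": 8.6,    # table 4
    "gross motor skills": 9.8,   # table 5 (single performance)
    "art": 4.8,                  # table 6
    "religion and moral": 0.2,   # table 7 (single performance)
    "other aspects": 22.0,       # table 8
}

# sort from the most to the least dominant aspect
ranked = sorted(aspect_sums.items(), key=lambda item: item[1], reverse=True)

for position, (aspect, pct) in enumerate(ranked, start=1):
    print(f"{position}. {aspect}: {pct}%")
```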
in terms of roles, ordered from the most to the least dominant, the aspects are cognitive, social-emotional, other (concentration, parents’ caring, and age), gross motor skills, fine motor skills, and religion and moral aspects. those aspects will be used as a draft for developing the construct of school readiness instruments in elementary school.

learning problems of early grade students

many problems related to children’s lack of readiness to attend elementary school emerge from the data collected in yogyakarta. table 9 shows that there are 26 aspects of learning problems faced by early grade students.

table 9. learning problems of lower grade students
no. | general description | %
1. | children’s focusing on playing | 15.4
2. | hard to follow the rule | 10.5
3. | daydreaming | 9.37
4. | stop studying | 7.44
5. | low learning result | 6.34
6. | slow task working | 5.23
7. | disturbing friends | 4.41
8. | slow instruction comprehending | 4.41
9. | being timid/afraid of asking | 4.41
10. | exiting class | 4.41
11. | wandering/running around the classroom | 4.13
12. | asking to be waited by parents | 3.58
13. | being unconfident | 3.58
14. | story telling/cheating with friends | 3.31
15. | lackluster | 2.48
16. | crying | 2.2
17. | dependence on teacher | 2.2
18. | keeping silent | 1.65
19. | feeling bored | 1.38
20. | easily exhausted | 1.1
21. | not completing the task | 0.83
22. | frequently asking for going home | 0.83
23. | not doing homework | 0.28
24. | moving around/not sitting still | 0.28
25. | being alone/isolated | 0.28

table 9 shows that learning problems are always found at each educational level, and thus efforts to solve these problems are highly needed. through the interviews which were conducted, this research identifies many problems faced by first to third graders of elementary school.
table 10 lists the learning problems encountered by grade 1, 2, and 3 students of elementary school. it shows that the higher the grade, the fewer the problems. however, it also shows that many of the problems faced by students of grades 1, 2, and 3 are identical. their problems concern motivation, learning, independence, concentration, interaction, liveliness, motor skills, and comprehension. the learning problems shared by every grade include stopping learning, not completing the task, not doing assignments/homework, daydreaming, keeping silent, and slowness in comprehending instruction. a problem shared by grade 1 and 2 students is frequently asking to go home. in addition, other common problems encountered by elementary school students are feeling bored, lackluster performance, slowness in comprehending instruction, and not sitting still. in-depth identification shows that problems related to social-emotional aspects occur more commonly than those related to cognitive aspects. this is in line with previous research findings which reveal that social and emotional aspects are among the aspects influencing children’s learning readiness. therefore, social and emotional aspects determine children’s learning readiness.

table 10. learning problems of students of grades 1, 2 and 3 of elementary school
learning problem: marks for grade 1, grade 2, grade 3
stop studying: ✓ ✓ ✓
not completing the task: ✓ ✓ ✓
slow task working: ✓ ✓ ✓
not doing assignment/homework: ✓ ✓ ✓
low learning result: ✓ ✓ ✓
asking to be waited by parents: ✓ ✓ ✓
being not dependent: ✓ ✓ ✓
hard to follow the rules: ✓ ✓ ✓
crying: ✓ ✓ ✓
children focus on play: ✓ ✓ ✓
story telling/cheating with friends: ✓ ✓ ✓
wandering/running around the classroom: ✓ ✓ ✓
daydreaming: ✓ ✓ ✓
disturbing friends: ✓ ✓ ✓
keeping silent: ✓ ✓ ✓
easily exhausted: ✓ ✓ ✓
moving around/not sitting still: ✓ ✓ ✓
frequently asking for going home: ✓
dependence on teacher: ✓
exiting class: ✓
feeling bored: ✓
lackluster behavior: ✓
slow instruction comprehension: ✓ ✓

conclusion and suggestions

the findings of this research show that several developmental aspects influence children’s learning readiness to attend elementary school. ordered from the most to the least dominant, the aspects are as follows: (1) cognitive and language aspects, consisting of recognizing letters, reading, recognizing numbers, counting, vocabulary building, comprehending instruction and information, fluent speaking, story telling, and problem solving; (2) social emotional aspects, consisting of team work, socialization with peers, classroom adaptation, independence, apprehending rules, discipline, and helping others; (3) other aspects, consisting of concentration, parents’ caring, and age; (4) fine motor skills aspects, consisting of writing, snipping and sticking, holding stationery, and drawing lines; (5) gross motor skills aspects, consisting of imitating movements (running, jumping, and standing on one leg) or dancing; (6) arts, consisting of memorizing poems, drawing, and creativity; and (7) religion and other moral aspects, which consist of religious activities such as praying, reading
holy quran, and charity. parents’ caring plays a great role in preparing children to go to school. parenting style and involvement at school and at home have a big influence on children’s academic success (magdalena, 2014). parental characteristics and practices also affect children’s success. furthermore, tunçeli and akman (2013) found that parents’ age and education affect the school readiness of their child.

in addition to the afore-mentioned aspects, this research also finds some problems faced by elementary school students of grades 1, 2, and 3. the most dominant problems are socio-emotional, such as children focusing on playing, finding it hard to follow the rules, and stopping studying. therefore, social-emotional aspects strongly determine the readiness of children to attend elementary school compared to other aspects.

based on the research findings, the following suggestions are proposed: (1) children’s learning readiness to attend elementary school is affected not only by age, but also by other developmental aspects which play significant roles in determining readiness. therefore, the aspects which make up the components of school readiness need to be identified. (2) moreover, cognitive aspects, social-emotional aspects, concentration, and parental caring also play a big role in determining children’s learning readiness to attend elementary school. therefore, teachers, educators, and psychologists need to give special consideration to these aspects. such consideration makes it easier to decide on children’s learning readiness to attend elementary school, especially for those who do not meet the age requirement.
(3) for instrument developers, the findings of this research need to be followed up by composing new instruments which can detect appropriate indicators of school readiness and are oriented not only to cognitive aspects and fine motor skills, but also to social-emotional aspects, concentration, parents’ caring, age, gross motor skills, and arts.

this research focused only on aspects of child development. the roles of environmental factors have not been considered yet, although many researchers have studied the role of the environment in students’ school readiness. shaari & ahmad (2016) found that the physical and social environment have an effect on school readiness. the environment has an indirect effect by stimulating many aspects of child development.

references

dockett, s., & perry, b. (2002). who’s ready for what? young children starting school. contemporary issues in early childhood, 3(1), 67–89. https://doi.org/10.2304/ciec.2002.3.1.9
government regulation no. 17 year 2010 on educational management and implementation (2010). indonesia.
janus, m., brinkman, s., duku, e., hertzman, c., santos, r., sayers, m., … walsh, c. (2006). the early development instrument: a population-based measure for communities. a handbook on development, properties, and use. ontario: offord centre for child studies.
lemelin, j.-p., & boivin, m. (2007). success starts in grade 1: the importance of school readiness. québec longitudinal study of child development, 4(2), 1–12.
lewit, e. m., & baker, l. s. (1995). school readiness. the future of children, 5(2), 128–139. https://doi.org/10.2307/1602361
magdalena, s. m. (2014). the effects of parental influences and school readiness of the child. procedia social and behavioral sciences, 127, 733–737.
https://doi.org/10.1016/j.sbspro.2014.03.345
ministry of national education. minister regulation no. 04/vi/pb/2011 on student admission at kindergarten/raudhatul athfal/bustanul athfal and school/madrasah (2011).
ministry of national education. minister regulation no. 137 year 2014 on the national standard of early childhood education (2014).
shaari, m. f., & ahmad, s. s. (2016). physical learning environment: impact on children school readiness in malaysian preschools. procedia social and behavioral sciences, 222, 9–18. https://doi.org/10.1016/j.sbspro.2016.05.164
tunçeli, h. i., & akman, b. (2013). the investigation of school readiness level of six years old preschool children in terms of different variables. procedia social and behavioral sciences, 106, 2899–2905. https://doi.org/10.1016/j.sbspro.2013.12.335

reid (research and evaluation in education)
issn 2460-6995
reid (research and evaluation in education), 3(2), 2017, 144-151
available online at: http://journal.uny.ac.id/index.php/reid

research article

students’ literature achievement: predictors investigation research

alita arifiana anisa
graduate school of universitas negeri yogyakarta
jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
email: alita.arifiana.anisa@gmail.com
submitted: 22 december 2017 | revised: 06 february 2018 | accepted: 06 february 2018

abstract

this research is an ex post facto study which aims to find out whether there is a mean difference in achievement between genders, and to investigate the variables predicting students’ achievement in literature study, including their direct and indirect effects. this research involved 90 students selected randomly as the sample. quantitative data analysis was used to analyze the mean difference between groups and the direct/indirect effects of the predictors.
the result of this research shows that: (1) girls have higher achievement in literature study compared to boys; (2) the predictors statistically proven to be direct significant predictors of students’ final test score are gender, the second dummy variable for class, and the mid-term test, while the rest of the predictors (except the first dummy variable of class) contribute indirectly to the prediction of students’ achievement in literature study; and (3) the magnitude of the predictors might be different when they are applied in different classes.

keywords: literature achievement, path analysis, gender, attendance

how to cite item: anisa, a. (2017). students’ literature achievement: predictors investigation research. reid (research and evaluation in education), 3(2), 144-151. doi:http://dx.doi.org/10.21831/reid.v3i2.17498

introduction

as a requirement to attain a bachelor degree in an institution in ponorogo, east java, indonesia, students have to conduct a final research project called a thesis (or skripsi in indonesian). the thesis takes students’ time and energy, especially in reading and understanding the literature needed to support their ideas and findings. in order to improve the students’ skill in understanding and citing relevant literature and studies, the institution runs a compulsory course aimed at assisting the students in finding, understanding, reviewing, and citing ideas from texts. the course is called literature study, and it has to be taken by every student in the institution before they take their final research or project (thesis). in the first semester of the academic year 2016/2017, there were ten classes consisting of approximately 350 students undertaking literature study. during the semester, the students have to (1) attend and actively participate in 16 face-to-face meetings, and (2) comply with all of the evaluation requirements, including project presentation and mid-term test.
the issue is that although the subject is considered very important for the success of their final research/project, the students seemed to think that it was not as important as their main subjects (the subjects directly connected to their major), so their achievement in the course was not satisfactory. the unsatisfying final score leads to the stakeholders’ anxiety about the quality of the students’ final research/project (thesis). the poor quality of a thesis is one of the indications of poor academic ability in integrating all of the knowledge that the students have gained in their four-year study.

phye (1995) states that learning and achievement are surely related, but they differ in significant ways. people start to learn anything, consciously or even unconsciously, to achieve their desired objectives. for example, when you see someone playing a doll-fishing machine and find that he never misses the doll, you are excited to know how he does it and start to observe every single thing he does in order to catch the doll. the observation you make to gain information about doll fishing is a learning process, and becoming a good doll-fisher is your objective. after gaining the information, you test whether it works by trying doll fishing yourself, and the result of your doll-fishing test represents your achievement. it could be good (you succeed in catching the doll by using the information you got from the learning process, i.e., observation), but it could also be not so good (you miss the doll). the achievement (whether good or not) shows whether the learning process you went through earlier meets your objectives.
pritchard (2009), in ways of learning, mentions some definitions of learning, including: (1) a change in behaviour as a result of experience or practice, (2) the acquisition of knowledge, (3) knowledge gained through study, (4) gaining knowledge of, or skill in, something through study, teaching, instruction, or experience, (5) the process of gaining knowledge, (6) a process by which one’s behaviour is changed, shaped, or controlled, and (7) the individual process of constructing understanding based on experience from a wide range of sources. thus, learning is a process of gaining knowledge through experience, evidenced by behavioural changes. experience here means any kind of experience, including educational experience through teaching and learning processes at school.

meanwhile, achievement is what you gain from learning processes. the definition of academic achievement by the dictionary of education (in phye, 1995) is an accomplishment or proficiency of knowledge or skill. achievement shows the increase in learners’ knowledge after experiencing learning. one way to see learners’ achievement is by observing their changes. in order to know their achievement as a result of the learning process, teachers need to conduct assessment. the american educational research association (aera, 1999, in reynolds, livingston, and willson, 2009) mentions that, in general, assessment is any systematic procedure for collecting information to make inferences about the characteristics of people. in education, assessment can be defined as a procedure to gain information about students’ learning or to make value judgements concerning the learning process through observations, ratings of performance, projects, or tests (miller, linn, & gronlund, 2009).

figure 1.
the relationship among learning, achievement, and assessment (cumming & maxwell, 1999)

there are some procedural questions which need to be answered in conducting an assessment.

first, what are the learning objectives/goals which need to be achieved? learning leads to changes in knowledge. thus, the first step that the teacher needs to take to assess students’ learning achievement is deciding the specific learning objectives, or identifying what the teacher wants the students to master after the learning process. although mueller (in berg, 2006) mentions that it is not easy for a teacher to write good learning objectives, it is essential to have clarity of purpose, because with clear purposes in mind, assessment can be well designed to match them (phye, 1995).

second, what kind of assessment approach matches the learning objectives? the type of goal needs to be identified. is it a cognitive change which comes as the result of content acquisition? is it a motor change in performing a specific task? or is it a behavioural change? this identification helps teachers choose the best assessment approaches and tools to meet their teaching objectives. the assessment approaches to each learning objective are as follows. (1) cognitive objectives include the building of a knowledge base (thorndike & thorndike-christ, 2010). to assess cognitive changes resulting from content acquisition, various types of test can be conducted, such as multiple-choice, matching, true-or-false, essay, short-answer, or fill-in-the-blank tests (phye, 1995). (2) performance objectives cover motor changes and how learners apply their knowledge in the form of action/skill in doing a specific task.
in order to match the performance objectives, a task that requires students to demonstrate a specific action or project is appropriate, such as playing a musical instrument, using software to analyse data, constructing a housing model, or making a financial report. (3) affective/behaviour objectives involve the development of attitudes, values, interests, and personal or social attributes that teachers can assess through observation (trying to infer what lies behind the behaviour), peers’ or teachers’ ratings, and students’ self-reports (thorndike & thorndike-christ, 2010).

third, after the set of operations that requires students to show their cognitive, performance, and attitude changes has been determined, it is important to set the rule for valuing students’ responses. this rule, called a scale, is crucial for deciding the most suitable number to represent how much of the objective is present (thorndike & thorndike-christ, 2010). four kinds of scale are used in measurement theory: the nominal scale (the number does not refer to the amount of anything), the ordinal scale (the number tells the order of a specific condition without telling how much less or more one thing is than another), the interval scale (numbers are separated by equal differences, and zero is not an absolute zero), and the ratio scale (the scale that has an absolute zero).

educational objectives are measured by various types of instruments (assessment tools) that are able to cover cognitive, performance, and affective objectives. the instrument is assumed to carry an equal amount of the trait in every item. equal differences in score indicate equal differences in the trait, and thus the scores fulfil the requirement for an interval scale. in the interval scale, an absolute zero does not exist; this suits education, where students are not assumed to be empty vessels (some base knowledge is assumed to exist).
after setting the specific rule for valuing students’ responses, the next step is scoring. the common mechanism for scoring is calculating relative mastery of the objective. relative mastery involves estimating the percentage of the domain that the students have mastered. for example, when students answer eight out of ten questions correctly, it indicates that they have mastered 80% of the domain (thorndike & thorndike-christ, 2010).

a study of the variables which can predict the final test score is beneficial in providing the stakeholders (lecturers and academic authorities) of the institution with evidence for deciding on a better package of treatments to assist students in preparing themselves for the final research through the literature study course. through the study, the lecturers are able to decide what to focus on in order to optimize students’ literature achievement. the main purpose of the study is to identify the variables that are able to predict students’ achievement in literature study. the study is significant since there has never been a study related to the final test score in literature study in ponorogo, although literature study is considered beneficial to students in conducting their final research/project (one of the requirements to graduate with a bachelor degree). this study provides the lecturers and academic authorities with empirical evidence for giving appropriate treatment to help students reach their optimum achievement.

method

population and sample

the population of the study was all of the students majoring in islamic religion education who took the literature study course in the academic year of 2016/2017. the sample of the study was 90 students from three classes (x, y, and z), chosen randomly.
data variables

the data employed in this research included: (1) gender, (2) classes, (3) attendance, (4) project presentation score, (5) mid-term test score, and (6) final test score or achievement. table 1 shows the description of the data.

research procedures

this research is an ex post facto study, which examines variables that occurred in the past. the research employed a quantitative approach in order to investigate two independent variables and four dependent variables. the research covered: (1) a descriptive analysis of the frequency of the categorical data (gender and classes) and the central tendency, variance, standard deviation, skewness, and kurtosis of the continuous data (attendance, project presentation, mid-term test, and final test); (2) a mean difference analysis of achievement between genders; (3) a path analysis of the direct and indirect effects in predicting the dependent variable, with its equations and the final model; and (4) a path analysis for each class. the hypothesized research model is presented in figure 2.

figure 2. hypothesis research model
x1: gender; x2: classdummy1; x3: classdummy2
y1: attendance; y2: project presentation; y3: mid-term test; y4: final achievement

findings and discussion

findings

the sample consists of 90 participants; more than half (64.44%, n = 58) are girls, and 35.56% are boys (n = 32). the participants are students taking literature study in three different classes: 27.8% (n = 25) are class x students, 37.8% (n = 34) are class y students, and the remaining 34.4% (n = 31) are class z students.

table 1. data descriptions
data | description | data type | data source
gender | 1 = male, 0 = female | categorical | student identity document
classes | 1 = x, 2 = y, 3 = z | categorical | academic document
attendance | the students’ attendance in 16 meetings | continuous | teacher-attendance report
group project presentation | the average score of group and individual performance; the score is 1 to 100 | continuous | student performance report
mid-term test score | the score is 1-100 | continuous | mid-term test results
final test score | the score is 1-100 | continuous | final test result

figure 3 shows the distribution of the variables studied in this research. all of the variables are considered to be normally distributed, with skewness within ±2.0 and kurtosis within ±7.0. the attendance variable ranges from 57 to 100 (m = 93.91, s = 7.523), while the project variable ranges from 52 to 89, with an average score (m) of 79.74 and a standard deviation (s) of 5.867. furthermore, the average mid-term test score is 71.10 (ranging from 50 to 98, s = 11.138), and the final test score ranges from 50 to 100 (m = 78.17, s = 12.002).

figure 3. the distribution of the attendance, project, mid-term test and final test

analysis of mean difference

gender to achievement. an independent sample t-test was conducted in order to reveal whether there is a significant mean difference between boys and girls in literature study course achievement. the results indicate that girls have a greater mean score in literature study achievement. table 2 presents the result of the mean difference analysis.

table 2. mean difference analysis result
variable | gender | n | mean | s | t | sig
 | girls | 58 | 73.03 | 11.12 | |
final | boys | 32 | 72.22 | 10.98 | -3.74 | .00
 | girls | 58 | 81.45 | 11.34 | |
n = 90, p < .05

path analysis

a regression analysis was conducted in order to investigate the predictors of attendance, project presentation, mid-term test, and final test (the dependent variables). there were two independent variables: gender and class.
both of the independent variables were categorical data; gender consisted of two categories (boys and girls), while class consisted of three categories (x, y, and z). a predictor with three categories cannot simply be coded as 0 and 1, so dummy variables needed to be created. a dummy variable is a way to represent groups of people or conditions using only zero and one, and the number of dummy variables is one less than the number of groups recoded (field, 2013). the final model is shown in figure 4.

figure 4. research’s final model (p = path coefficient; x1: gender; x2: classdummy1; x3: classdummy2; y1: attendance; y2: project presentation; y3: mid-term test; y4: final test)

pedhazur (1997) explains that in simple regression, β is equal to the correlation coefficient (r). he also mentions that the path coefficient from variable 1 to variable 2 is equal to β21, which can be estimated from the data by calculating r12. the coefficient for each path in this research is presented in table 3.

table 3. path coefficients
path coefficient (p) | unstandardized a | unstandardized b | standardized (β)
py1x1 | 95.414 | -4.226 x1 | -.270 x1
py2x1 | 47.513 | -2.430 x1 | -.199 x1
py2y1 | | + .352 y1 | .452 y1
py3y2 | 15.146 | + .702 y2 | .370 y2
py4x1 | 54.627 | 7.685 x1 | -.308 x1
py4x3 | | -6.499 x3 | -.259 x3
py4y3 | | .401 y3 | .372 y3

gender and classes on attendance. the multiple regression using the stepwise method to investigate the predictors of attendance shows that gender is the only independent variable that is statistically significant in predicting attendance. gender explains 7.3% of the variance in attendance. the equation used to predict students’ attendance is as follows:
y1 = py1x1

gender, classes and attendance on project presentation. a multiple regression was conducted to find out the variables which predict project presentation.
the result shows that 29% of the variance in project presentation is accounted for by gender, classes, and attendance. the analysis found that gender and attendance are statistically significant in predicting the students’ project presentation score (p < .05), while classes are not significant (p > .05). unlike the previous multiple regression equation, the equation used to predict the students’ project presentation considers both the direct effect (gender on project presentation) and the indirect effect (gender on project presentation through attendance) using the path analysis. in path analysis, the sum of the direct effect and the indirect effect is called the total effect, or effect coefficient (pedhazur, 1997). the equation to predict the students’ project presentation is as follows:
y2 = py2x1 + (py1x1 * py2y1)

gender, classes, attendance, and project presentation on mid-term test. another multiple regression analysis was conducted to investigate the predictors of students’ mid-term test scores. the result indicates that gender, classes, attendance, and project presentation explain 13.7% of the variance in students’ mid-term test scores, but only project presentation is statistically significant in predicting the mid-term test (p < .05). because gender has no direct effect on the mid-term test score, the equation is constructed only from the indirect effects of: (1) gender on mid-term test through project presentation, and (2) gender on mid-term test through attendance and project presentation. the equation is as follows:
y3 = (py2x1 * py3y2) + (py1x1 * py2y1 * py3y2)

gender, classes, attendance, project presentation, mid-term test on final test. the last multiple regression was conducted to find out which independent variables (gender, classes, attendance, project presentation, and mid-term test) are able to predict students’ final test score.
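the effect decomposition described above can be checked numerically. the following is a minimal sketch, not the authors’ procedure: it plugs the standardized coefficients reported in table 3 into the direct and indirect paths for gender. the dictionary `p` and the helper `path_product` are illustrative names introduced here, and the first indirect path for the mid-term test is taken as py2x1 * py3y2, which is how that path is described in words (gender on mid-term test through project presentation).

```python
# standardized path coefficients as reported in table 3
p = {
    "y1x1": -0.270,  # gender -> attendance
    "y2x1": -0.199,  # gender -> project presentation
    "y2y1": 0.452,   # attendance -> project presentation
    "y3y2": 0.370,   # project presentation -> mid-term test
}


def path_product(*names):
    """Multiply the coefficients along one path (hypothetical helper)."""
    result = 1.0
    for name in names:
        result *= p[name]
    return result


# project presentation: direct effect plus the indirect effect via attendance
direct_y2 = p["y2x1"]
indirect_y2 = path_product("y1x1", "y2y1")
total_y2 = direct_y2 + indirect_y2

# mid-term test: no direct effect of gender, only the two indirect paths
indirect_y3 = path_product("y2x1", "y3y2") + path_product("y1x1", "y2y1", "y3y2")

print(f"total effect of gender on project presentation: {total_y2:.5f}")
print(f"indirect effect of gender on mid-term test: {indirect_y3:.5f}")
```

under these assumptions, the indirect effect on the mid-term test equals the total effect on project presentation multiplied by py3y2, since both indirect paths end with the same final link.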
the result shows that 33.4% of the variance in students’ final test scores is accounted for by gender, classes, attendance, project presentation, and mid-term test result, but only mid-term test, gender, and classdummy2 (the second dummy coding variable for classes) are statistically significant predictors of students’ final test scores. the equation was constructed by considering the direct effects (gender on final test and classdummy2 on final test) and the indirect effects (gender on final test through project presentation and mid-term test, and gender on final test through attendance, project presentation, and mid-term test). the equation used to predict students’ final test score is as follows:
y4 = py4x1 + py4x3 + ((py2x1 * py3y2 * py4y3) + (py1x1 * py2y1 * py3y2 * py4y3))

predicting final test score in different classes. every class has its own characteristics, influenced by the students’ environmental and academic background. the idea of the different characteristics of each class leads to a further analysis comparing the contribution of each predictor in different classes. table 4 shows the result of the multiple regressions using the enter method in each class.

table 4. comparing the effects of gender, attendance, project, mid-term test on final test in different classes
predictors | final test, class x | final test, class y | final test, class z
gender | -.502* | -.171 | -.275
attendance | .067* | .228 | -.388
project presentation | -.122 | .102 | .386
mid-term test | .399* | .348 | .317
r2 | .495 | .434 | .337
e | .505 | .566 | .663
n | 25 | 34 | 31
*p < .05

the result shows that in class x, the biggest predictor is gender and the lowest one is attendance. in class y, the biggest predictor of the students’ final test is the mid-term test score, and the lowest predictor is project presentation. in class z, the biggest predictor of students’ final test is attendance, while the lowest one is gender.
the result suggests that the predictors might predict the dependent variable with different magnitudes depending on the class characteristics. because of these differences, it is recommended that lecturers treat the classes differently.

discussion
the research findings show that girls have a significantly greater mean score than boys in literature study achievement. in line with the mean difference analysis result, the final model of this study also shows the contribution of gender to students’ final achievement. two previous researchers, downing (1977) and droege (1967), mention that girls have greater facility in early reading skill. evidence of girls’ reading ability is also reported by finucci, gottfredson, and childs (1985). based on that research, women become better oral readers, buy more books, and read more for pleasure than men.

besides its direct contribution to students’ achievement, gender is also found to be a predictor of students’ attendance rate and project presentation performance. based on the lecturer reports, girls tend to attend class more often than boys, and girls are more serious in doing their project presentation homework. in line with this, duckworth and seligman (2005, p. 939) also explain that girls are more self-disciplined than boys in terms of final grades, school attendance, and hours spent on homework.

unlike gender, the first dummy variable (x=0, y=1, and z=0) has no significant effect on any dependent variable. the second dummy variable (x=0, y=0, and z=1) has a significant direct negative effect on students’ achievement. it means that class z has significantly lower achievement than the other classes (x and y). based on the predictor analysis conducted for each class, the low achievement of class z was due to the low score of students’ attendance.
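the dummy coding referred to above (classdummy1 and classdummy2, with class x as the reference group) can be illustrated with a minimal sketch. the function name `class_dummies` and the lowercase string labels are assumptions made for this example; the original analysis was presumably carried out in a statistics package rather than in code like this.

```python
def class_dummies(class_label):
    """Return (classdummy1, classdummy2) for a class label.

    Coding follows the description in the text:
    x -> (0, 0), y -> (1, 0), z -> (0, 1), so class x is the reference group.
    """
    if class_label not in ("x", "y", "z"):
        raise ValueError(f"unknown class label: {class_label!r}")
    classdummy1 = 1 if class_label == "y" else 0
    classdummy2 = 1 if class_label == "z" else 0
    return classdummy1, classdummy2


# three groups are represented by two 0/1 variables (one less than the number of groups)
for label in ("x", "y", "z"):
    print(label, class_dummies(label))
```

with this coding, a coefficient on classdummy2 compares class z against the reference class while classdummy1 accounts for class y.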
students’ attendance, which is influenced by gender, is statistically shown to be a predictor of the students’ project presentation performance. attendance is also found to be an indirect predictor of final achievement through the project presentation and mid-term test scores. previous research found a similar phenomenon. deane and murphy (2013) found that students’ attendance is positively correlated with overall examination score. in addition, louis, bastian, mckimmie, and lee (2016) also found a positive correlation between attendance and objective performance.

conclusion and suggestions
conclusion
based on the findings, the following conclusions can be drawn: (1) girls have higher achievement in literature study than boys; (2) the predictors which are statistically shown to be direct significant predictors of students’ final test score are gender, the second dummy variable for class, and mid-term test, while the rest of the predictors (except for the first dummy variable of class) contribute indirectly to predicting students’ achievement in literature study; (3) the magnitude of the predictors might differ between classes.

limitation of the research
the limitations of the research are that: (1) the research was conducted in an islamic institution, in which the references studied in the literature study course are those related to islamic religion education, prophetic character, and intellectual character; (2) every institution has its own policy on the academic activities that students have to complete during the literature study course in one semester.
suggestions
to achieve an optimum final test score in literature study, lecturers are advised to: (1) consider gender, class, attendance, project presentation score, and mid-term test score in improving students’ final test score; (2) pay attention to the mid-term test results, because the mid-term test is the only assessment score with a direct effect on the final test score; and (3) treat each class differently based on its own characteristics.

references
berg, s. l. (2006). two sides of the same coin: authentic assessment. community college enterprise, 12(2), 7–21. retrieved from https://www.questia.com/read/1p3-1167542141/two-sides-of-the-same-coin-authentic-assessment
cumming, j. j., & maxwell, g. s. (1999). contextualising authentic assessment. assessment in education: principles, policy & practice, 6(2), 177–194. https://doi.org/10.1080/09695949992865
deane, r. p., & murphy, d. j. (2013). student attendance and academic performance in undergraduate obstetrics/gynecology clinical rotations. jama, 310(21), 2282–2288. https://doi.org/10.1001/jama.2013.282228
downing, j. (1977). how society creates reading disability. the elementary school journal, 77(4), 274–279. https://doi.org/10.1086/461058
droege, r. c. (1967). sex differences in aptitude maturation during high school. journal of counseling psychology, 14(5), 407–411.
duckworth, a. l., & seligman, m. e. p. (2005). self-discipline outdoes iq in predicting academic performance of adolescents. psychological science, 16(12), 939–944. https://doi.org/10.1111/j.1467-9280.2005.01641.x
field, a. p. (2013). discovering statistics using ibm spss statistics (4th ed.). los angeles, ca: sage publication.
finucci, j. m., gottfredson, l. s., & childs, b. (1985). a follow-up study of dyslexic boys. annals of dyslexia, 35(1), 117–136. https://doi.org/10.1007/bf02659183
louis, w. r., bastian, b., mckimmie, b., & lee, a. j. (2016). teaching psychology in australia: does class attendance matter for performance? australian journal of psychology, 68(1), 47–51. https://doi.org/10.1111/ajpy.12088
miller, m. d., linn, r. l., & gronlund, n. e. (2009). measurement and assessment in teaching (10th ed.). upper saddle river, nj: pearson.
pedhazur, e. j. (1997). multiple regression in behavioral research: explanation and prediction (3rd ed.). orlando, fl: harcourt brace college.
phye, g. d. (ed.). (1995). handbook of classroom assessment: learning, achievement, and adjustment. ames, ia: academic press.
pritchard, a. (2009). ways of learning: learning theories and learning styles in the classroom (2nd ed.). oxford: routledge.
reynolds, c. r., livingston, r. b., & willson, v. l. (2009). measurement and assessment in education (2nd ed.). upper saddle river, nj: pearson.
thorndike, r. m., & thorndike-christ, t. m. (2010). measurement and evaluation in psychology and education (8th ed.). boston, ma: pearson education.

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 5(2), 2019, 95-102
available online at: http://journal.uny.ac.id/index.php/reid

estimation of college students’ ability on real analysis course using rasch model

1isnaini; *2wikan budi utami; 3purwo susongko; 4herani tri lestiani
1,2mathematics education department, universitas pancasakti tegal
jl. halmahera km. 1, mintaragen, kec. tegal tim., kota tegal, jawa tengah 52121, indonesia
3department of natural science education, universitas pancasakti tegal
jl. halmahera km. 1, mintaragen, kec. tegal tim., kota tegal, jawa tengah 52121, indonesia
4mathematics education department, institut agama islam negeri syekh nurjati cirebon
jl. perjuangan by pass sunyaragi, kota cirebon, jawa barat 45132, indonesia
*corresponding author.
e-mail: wikan.piti@gmail.com
submitted: 24 august 2018 | revised: 15 march 2019 | accepted: 21 august 2019

abstract
this study is aimed at estimating the difficulty level of essay tests and the accuracy of the estimation of students’ ability in a real analysis essay test using the rasch model with the quest program and the erm package in r 3.0.3. the population in this study was all students of the department of mathematics education, universitas pancasakti tegal, in the academic year 2016/2017 who were enrolled in the real analysis course. the data were analyzed using the erm package in r 3.0.3 and the quest program. the students’ ability was obtained from the result of the final exam of the first real analysis course. the analysis shows that: (1) using the rasch model for partial credit scoring, 100% of the essay questions in the real analysis final exam are categorized as difficult; (2) the estimation of students’ ability in the real analysis course using the rasch model with the cml method is better than the estimation using the rasch model with the jml approach.
keywords: estimation of ability, level of difficulty, rasch model, item response theory
permalink/doi: https://doi.org/10.21831/reid.v5i2.20924

introduction
one important component in the formation of quality human resources is education. the most important factor in being able to compete globally in the 21st century is education. according to mardapi (2012, p. 12), efforts to improve the quality of education can be pursued through improving the quality of learning and the quality of the assessment system. thus, in higher education, for example in learning mathematics, institutions must strive to implement the learning process and assessment as well as possible. a good process of learning mathematics can certainly be achieved by providing flexibility for students to develop and explore their abilities.
today, the quality of education in indonesia is still considered very low, especially in mathematics, even though mathematics is the main science taught from elementary school to university. this can be seen from the low student achievement in each academic year. ironically, mathematics is a subject that is not liked. many students are afraid of mathematics. for them, math is like a frightening enemy they want to avoid. schwartz (2005, p. 1) suggests that the basic success of mathematics education is to support the development of intelligence in mathematics from a variety of life conditions. students’ mathematical skills in living conditions at school can be seen when students take a test. the implementation of the test is basically to assess the success of students during the learning process.

copyright © 2019, reid (research and evaluation in education), 5(2), 2019, issn 2460-6995

the test is necessary so that the educator, in this case the lecturer, can know the students’ learning achievement after they have been given the subject matter in the learning process. therefore, a good test needs to be constructed by considering the ability of students, so that the test, as a measuring tool for student achievement, can reflect the true abilities of students. students of the mathematics education program at universitas pancasakti tegal have long considered real analysis to be the most difficult subject. real analysis comprises deductive and axiomatic topics. previous observation of the performance of students of universitas pancasakti revealed that the students’ ability in this course is relatively low.
this is indicated by the fact that, although they are able to prove that a sequence is convergent, they find it difficult to solve some problems related to convergent sequences because many theorems are involved. evaluating student learning is one of the important tasks that must be done by lecturers. in the field of education, evaluation of student learning achievement is conducted to determine the progress of students in the curriculum that has been taught. one way to evaluate students is to give examinations in the middle and at the end of the semester. however, questions that are too difficult or too easy sometimes make it hard for lecturers to distinguish students’ abilities. therefore, an analysis of exam questions is needed so that the exam results represent the ability of students. evaluation is a series of activities for improving the quality, performance, or productivity of an institution in carrying out its program. through evaluation, information about what has and has not been achieved is obtained, and this information is then used to improve a program. according to tyler (1950), evaluation is a process of determining the extent to which educational goals have been achieved. according to griffin and nix (1991), evaluation is a judgment on the value of the measurement results or the implications of the measurement results. tyler emphasizes the achievement of the objectives of a program, while griffin and nix emphasize the use of assessment results. thus, the focus of evaluation is a program or group, and there is an element of judgment in determining the success of a program (mardapi, 2012, p. 4). the real analysis subject is evaluated through the mid-term and the final semester examinations. the test takes the form of an essay test, whose advantage is that it is easy to prepare.
the essay form also trains students to express opinions both systematically and logically (buckley, winkel, & leary, 2004). a lecturer will be able to find out where the weaknesses of the students lie in the material that has been taught, which provides input on what must be improved. scoring essay tests takes a long time and is relatively more difficult, so the essay form is difficult to use for large-scale tests. an assessment will be meaningful if the results can be used to improve the quality of the learning process (mcmillan, 2005). the mid-term and final semester exams in the real analysis course exist to evaluate the ability of students. one of the theories and models that can be used to analyze test items is the rasch model. in this study, the rasch model was employed to analyze test items. according to imaroh, susongko, and isnani (2017), the item parameters do not depend on the sample. further, ningsih and isnani (2010) revealed different reliability levels of essay test items analyzed using item response theory models (1pl, 2pl, 3pl) and the rasch model. the concept of objective measurement in the social sciences and educational assessment, according to wright and mok (2004), must meet five criteria, namely: (1) producing linear measurements with equal intervals, (2) an exact estimation process, (3) identifying inaccurate items (misfits) or uncommon items (outliers), (4) being able to handle missing data, and (5) producing measurements that are independent of the parameters studied. so far, only the rasch model can fulfill these five conditions. the quality of measurements in educational assessment carried out with the rasch model will be the same as that of measurements made in the physical dimension in the field of physics (sumintono & widhiarso, 2015). in modern test theory, the rasch model is seen as the most objective measurement model. the use of the rasch model in educational measurement has advantages in specific objectivity and the high stability of item parameter estimates (wu & adams, 2007). the main characteristic of the rasch model is that it considers all responses of a test taker regardless of the sequence in which the problems are solved. it means that the difficulty levels of the test items are not necessarily in consecutive order. the main advantage of the rasch model is that it represents more accurately the mental process used by participants in solving the problems. moreover, compared to other models (particularly classical test theory), this model has the ability to predict missing data based on a systematic response pattern. this model has been applied to mathematics and reading tests, e.g., at the national assessment of educational progress (naep) (susongko, 2014). this model is also suitable for analyzing personality scale responses that have a multi-point scale. unlike the rasch model, which includes all responses without considering the sequence in solving the problems, the gradation model requires sequential responses of the test takers from a low to a high category. in the gradation model, the difficulty levels of the test items are arranged in sequence, while in classical test theory, the pattern of students’ answers is not considered, as classical test theory merely considers correct and incorrect answers.
the gradation model is suitable for a course that requires regularity or sequential responses to each test item, such as mathematics, physics, and chemistry. according to lababa (2008), one of the oldest test theories about behavioral assessment is classical true-score theory. classical test theory is easy to apply. moreover, it is a practical model to describe how measurement errors can affect the observed score. quantitative item analysis emphasizes the analysis of internal test characteristics through empirically obtained data. internal characteristics include the test item parameters, which are the level of difficulty and the discrimination power of a test. the rasch model is a dichotomous scoring model that has only two categories, namely the correct answer with a score of 1 and the incorrect answer with a score of 0. it has since been developed more extensively for polytomous scoring. according to retnawati (2014, p. 32), the polytomous scoring model is an item response model that has more than two scoring categories. in the rasch model, it is assumed that all items have the same discrimination index (isgiyanto, 2011). to deal with polytomous data with various ranks, a new type of rasch analysis was developed, namely the partial credit model. however, the main purpose of the rasch model is to create a measurement scale with equal intervals. meanwhile, as the raw scores are not in interval form, the scores cannot be used directly to interpret the students’ ability. the rasch model requires both per person score data and per item score data. these two scores become the basis for estimating true scores that indicate the level of individual ability as well as the degree of difficulty of the test.
the advantage of the rasch model compared to other models, particularly classical test theory, is its ability to predict missing data based on a systematic response pattern. some studies have been carried out on the use of the rasch model in analyzing test items. a study by kurniawan and mardapi (2015) showed that the rasch model provides complete information about test items, including their difficulty level. this study is aimed at estimating the difficulty level of the essay test in the first real analysis course by using the rasch model and describing the estimation of students’ ability in the real analysis course by using the rasch model, the quest program, and the erm package in r 3.0.3.

method
this research is an explorative descriptive study of data sets of items and responses of participants in the final semester examination of the real analysis subject in the academic year 2016/2017. this research is a post-hoc diagnosis that is described as a retrofitting approach (gierl, 2007). the retrofitting approach is carried out through analysis of the items and item response data from the final semester exam in real analysis in the 2016/2017 academic year. some studies have implemented the rasch model with samples of 30 to 300 students (bond & fox, 2007; keeves & masters, 1999). the subjects of the present study were 82 students of the mathematics education department of universitas pancasakti tegal in the academic year 2016/2017 who took the first real analysis course. the sampling technique used in this study is purposive sampling.
it is one of the non-random sampling techniques, in which the researcher determines the sample by specifying characteristics suited to the objectives of the study so that the sample is expected to answer the research problems. two things are therefore very important in purposive sampling: the sampling is non-random, and the specific characteristics are set by the researchers themselves according to the research objectives. the instrument used in this study was the final exam of the first real analysis course. the test items cover the introductory material, real numbers, sequences and series, and limits (bartle & sherbert, 2000). the rasch model was applied to analyze the collected data. this analysis resulted in a description of the difficulty level of the test items. by using the erm package in r version 3.0.3, the analysis generated the estimation of item parameters on the real analysis exam. measurement modeling explains the procedure for organizing raw scores into more meaningful information. moreover, it can utilize a mathematical model that turns raw scores into scores that provide more valid and accurate information. the analysis of raw scores leads to a new finding: the probability that a student answers an item correctly depends on the comparison between the student’s ability and the difficulty level of the test item (bryan, 2004). the ogive curve function (ocf) became the prototype of rasch model development for polytomous items. if i is a polytomous item with score categories 0, 1, 2, . . . , mi, then the probability of participant n obtaining score x on item i is described by the category response function (crf), which is given in the following equation (glas & verhelst, 1989):

$$P(X_{ni}=x)=\frac{\exp\sum_{j=0}^{x}(\theta_n-\delta_{ij})}{\sum_{h=0}^{m_i}\exp\sum_{j=0}^{h}(\theta_n-\delta_{ij})},\qquad x=0,1,\dots,m_i \tag{2}$$

equation (2) can be elaborated according to the number of categories in the test items.
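the category response function described above can be sketched in code. this is a minimal illustration of the standard partial credit model form, written under the assumption that ability is denoted theta and the step difficulties of one item are delta_1, ..., delta_m; the function name `pcm_category_probabilities` is hypothetical, and the sketch is not the estimation routine used by quest or the erm package.

```python
import math


def pcm_category_probabilities(theta, deltas):
    """Return [P(X=0), ..., P(X=m)] for one item under the partial credit model.

    theta  : ability of the test participant
    deltas : step difficulties delta_1..delta_m of the item
    """
    # cumulative sums of (theta - delta_j); the sum for category 0 is defined as 0,
    # which is why the numerator of category 0 equals exp(0) = 1
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    numerators = [math.exp(c) for c in cumulative]
    denominator = sum(numerators)
    return [n / denominator for n in numerators]


# illustrative call: ability 0 and two step difficulties (three score categories)
probs = pcm_category_probabilities(theta=0.0, deltas=[1.731, 2.787])
for category, prob in enumerate(probs):
    print(f"category {category}: {prob:.3f}")
```

the probabilities over all categories of an item sum to one, and when ability lies below every step difficulty the lowest category is the most likely one.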
for example, if a scale has three score categories, 0, 1, and 2, then there are three individual probability equations, one for each category (j). probability in category 0 is:

$$P(X_{ni}=0)=\frac{1}{1+\exp(\theta_n-\delta_{i1})+\exp(2\theta_n-\delta_{i1}-\delta_{i2})}$$

probability in category 1 is:

$$P(X_{ni}=1)=\frac{\exp(\theta_n-\delta_{i1})}{1+\exp(\theta_n-\delta_{i1})+\exp(2\theta_n-\delta_{i1}-\delta_{i2})}$$

probability in category 2 is:

$$P(X_{ni}=2)=\frac{\exp(2\theta_n-\delta_{i1}-\delta_{i2})}{1+\exp(\theta_n-\delta_{i1})+\exp(2\theta_n-\delta_{i1}-\delta_{i2})}$$

in the probability of category 0, there is a number 1 in the numerator since the rasch model requires the following equation (glas & verhelst, 1989):

$$\sum_{j=0}^{0}(\theta_n-\delta_{ij})=0$$

findings and discussion
the difficulty parameter of the test items, b_ij, has the same value interval as the participants’ ability parameter (θ). the b_ij value ranges from -∞ to +∞. however, the values which are practically (or rationally) used lie only between -4.0 and +4.0. the more negative the difficulty level of an item, or the closer it is to -4, the easier the problem. on the other hand, the more positive the difficulty level, or the closer it is to +4, the more difficult the problem (naga, 2003, p. 224). if the difficulty parameter of a test item meets bj ≤ -2, the item is categorized as very easy. if it meets -2 < bj ≤ 0, the item is categorized as easy. furthermore, if it meets 0 < bj ≤ 2 or bj > 2, the item is categorized as difficult or very difficult, respectively (hambleton, swaminathan, & rogers, 1991). the analysis of question number 1 showed that δ11 = 0.861, δ12 = 0.374, and δ13 = 0.45. it implies that the difficulty levels of the first, second, and third steps fall in the difficult category. in question number 2, the difficulty level of the first step falls in the difficult category (δ21 = 1.731), while the difficulty level of the second step is identified as very difficult (δ22 = 2.787).
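the categorization rule above can be expressed as a small helper. this is a minimal sketch: the function name `classify_difficulty` is hypothetical, and the handling of the boundary values (exactly -2 counted as very easy, exactly 2 counted as difficult) is an assumption made here so that the four categories do not overlap. the step difficulties are the values reported for questions 1 and 2.

```python
def classify_difficulty(b):
    """Map a difficulty estimate b to one of four categories."""
    if b <= -2:
        return "very easy"
    if b <= 0:
        return "easy"
    if b <= 2:
        return "difficult"
    return "very difficult"


# step difficulties reported for questions 1 and 2 (keys are illustrative labels)
steps = {"d11": 0.861, "d12": 0.374, "d13": 0.45, "d21": 1.731, "d22": 2.787}
labels = {name: classify_difficulty(value) for name, value in steps.items()}

for name, label in labels.items():
    print(name, steps[name], label)
```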
in question number 3, the results obtained were δ31 = 1.149 and δ32 = 1.796, which suggest that the difficulty level of the first and second steps can be included in the difficult category. the analysis of question number 4 resulted in δ41 = -0.363 and δ42 = -0.963. it indicates that the difficulty level of both steps is included in the easy category. the results showed that there are three categories (δ12, δ21, δ41) which are identified as easy, one category (δ11) identified as very easy, and six categories (δ22, δ31, δ32, δ42, δ51, and δ52) categorized as difficult. in general, the difficulty level of those items was 0.594, thus the four test items were identified as difficult. it can be inferred from the aforementioned results that the final exam items of the real analysis course are categorized as difficult for the participants, even though all topics in the questions had been discussed during the course. the value of the difficulty level of an item typically varies from about -2.0 to +2.0. item number 1, with the sub-topic of the completeness of real numbers, was identified as a difficult item. likewise, item number 2 and item number 3, with the sub-topics of the limit of a sequence and the theorems of limit of a sequence, respectively, were categorized as difficult items. on the contrary, item number 4, with the sub-topic of the theorems of limit of a sequence, was identified as an easy item. to make it clearer, figure 1, figure 2, and figure 3 present the questions in the test and a sample of a student’s answers. from the answers presented in figure 1, figure 2, and figure 3, it can be seen that the student was unable to solve problems number 1, 2, and 3 systematically because of an inability to understand some theorems and definitions related to the problems. the students could not recognize and analyze the relation between the theorems and definitions.

figure 1.
student’s answer on problem 1

figure 2. student’s answer on problem 2

figure 3. student’s answer on problem 3

it is presented in figure 4 that in the fourth problem, the student seemed to comprehend the topic. the theorems related to sequences and series were analyzed before they were applied to solve the problem. it can be seen from the sample that the student could use the theorems systematically, as suggested, in solving the problem.

figure 4. student’s answer on problem 4

the result of the analysis showed that the ability of the test participants was quite diverse. in fact, merely a small number of students could solve questions number 1, 2, and 3 correctly. most of the students could not determine the specific theorems and definitions needed to solve the problems, especially in the second and third problems. in contrast, most of the students already understood the theorems used to solve the fourth problem, which are the sequences and series theorems, even though they had difficulty analyzing the theorems. the estimation of the students’ ability is presented in the interval scale (-3, +3). the category score in rasch model shows the number of steps required to solve an item correctly. a high score indicates a good ability category. on the contrary, a low score indicates a low category of ability. the output of the estimation of the ability parameter obtained from the quest program and the package erm with partial credit modeling or rasch model is used to illustrate the comparison between the students’ ability estimated using the joint maximum likelihood (jml) approach with the quest program and those estimated using the conditional maximum likelihood (cml) approach with the package erm. in the jml approach, the students’ ability could not be expressed for score 0 and score 100. meanwhile, in the cml approach, the students’ ability can be expressed for score 0 (approximately a value of -3.09) and score 100 (approximately a value of 85). therefore, it can be inferred that rasch model using the cml approach is more suitable than rasch model using the jml approach to estimate the students’ ability in understanding the subject matter. the result of analysis meets the outfitmsq criteria if the value is 0.035 < outfitmsq < 3.239. the analysis resulted in a value of 0.5 < outfitmsq < 1.5, thus it fulfills the range of outfitmsq. the criterion for infit mnsq is 0.5 < mnsq < 1.5. according to the mean value and the standard deviation of rasch model, the cml approach with the package erm is eligible since the mean and the standard deviation meet the criteria. on the contrary, the jml approach with the quest program is less appropriate, as indicated by the mean and the standard deviation that do not meet the criteria. in conclusion, the result of analysis on the estimation of students’ ability reveals that the estimation using rasch model with the cml approach and the erm program is more accurate than the estimation using rasch model with the jml approach and the quest program. similarly, based on outfitmsq, rasch model using the cml approach with the erm program has better performance than rasch model using the jml approach with the quest program.
conclusion

based on the results and discussions, it can be concluded that the essay test items on the first real analysis course that were administered to the students of mathematics education department, universitas pancasakti tegal can be classified as a good test. besides, the students’ ability can be estimated precisely by using rasch model with the cml approach and the erm package. the estimated ability of the participants was quite diverse. only a small number of students could solve questions number 1, 2, and 3 correctly, as these questions were classified as difficult. meanwhile, most of the students already understood the theorems used to solve the fourth problem and were able to apply the theorems systematically to solve it.

references

bartle, r. g., & sherbert, d. r. (2000). introduction to real analysis. new york, ny: john wiley & sons.
bond, t. g., & fox, c. m. (2007). applying the rasch model: fundamental measurement in the human sciences (2nd ed.). mahwah, nj: lawrence erlbaum associates.
buckley, k. e., winkel, r. e., & leary, m. r. (2004). reactions to acceptance and rejection: effects of level and sequence of relational evaluation. journal of experimental social psychology, 40(1), 14–28. https://doi.org/10.1016/s0022-1031(03)00064-7
gierl, m. j. (2007). making diagnostic inferences about cognitive attributes using the rule-space model and attribute hierarchy method. journal of educational measurement, 44(4), 325–340. https://doi.org/10.1111/j.1745-3984.2007.00042.x
glas, c. a. w., & verhelst, n. d. (1989). extensions of the partial credit model. psychometrika, 54(4), 635–659. https://doi.org/10.1007/bf02296401
griffin, p., & nix, p. (1991). educational assessment and reporting: a new approach. sydney: harcourt jovanovich.
hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. newbury park, ca: sage publications.
imaroh, n., susongko, p., & isnani, i. (2017).
uji validitas tes ulangan akhir semester gasal mata pelajaran matematika (studi deskriptif analisis dokumenter di smp negeri slawi tahun pelajaran 2016/2017). jpmp (jurnal pendidikan mipa pancasakti), 1(1), 80–89. https://doi.org/10.24905/jpmp.v1i1.792
isgiyanto, a. (2011). analisis data ujian nasional matematika berdasarkan penskoran model rasch dan model partial credit. prosiding seminar nasional penelitian, pendidikan dan penerapan mipa, 43–52. retrieved from https://eprints.uny.ac.id/7172/1/pm-7 awal isgiyanto.pdf
keeves, j. p., & masters, g. n. (1999). introduction. in g. n. masters & j. p. keeves (eds.), advances in measurement in educational research and assessment. amsterdam: pergamon-elsevier science.
kurniawan, d. d., & mardapi, d. (2015). penyetaraan vertikal tes matematika smp dengan teori respons butir model rasch. jurnal evaluasi pendidikan, 3(1), 12–25. retrieved from http://journal.student.uny.ac.id/ojs/index.php/jep/article/view/1221/1093
lababa, d. (2008). analisis butir soal dengan teori tes klasik: sebuah pengantar. iqra’, 5, 29–37. retrieved from https://jurnaliqro.files.wordpress.com/2008/08/03jun-29-36.pdf
mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha medika.
mcmillan, j. h. (2005). understanding and improving teachers’ classroom assessment decision making: implications for theory and practice. educational measurement: issues and practice, 22(4), 34–43. https://doi.org/10.1111/j.1745-3992.2003.tb00142.x
naga, d. s. (2003). teori pengukuran. retrieved from http://dali.staff.gunadarma.ac.id/downloads/folder/0.1
ningsih, l. d., & isnani, i. (2010).
studi komparatif tingkat reliabilitas tes prestasi hasil belajar matematika pada tes bentuk uraian dengan model penskoran gpcm (generalized partial credit model) dan penskoran grm (graded response model). cakrawala: jurnal pendidikan, 4(8). https://doi.org/10.24905/cakrawala.v4i8.176
retnawati, h. (2014). teori respons butir dan penerapannya: untuk peneliti, praktisi pengukuran dan pengujian, mahasiswa pascasarjana. yogyakarta: nuha medika.
schwartz, s. l. (2005). teaching young children mathematics. london: praeger.
sumintono, b., & widhiarso, w. (2015). aplikasi pemodelan rasch pada assessment pendidikan. cimahi: trim komunikata.
susongko, p. (2014). pengantar metodologi penelitian pendidikan. tegal: universitas pancasakti tegal.
tyler, r. (1950). basic principles of curriculum and instruction. chicago, il: university of chicago press.
wright, b., & mok, m. m. c. (2004). an overview of the family of rasch measurement models. in e. v. smith jr. & r. m. smith (eds.), introduction to rasch measurement: theory, models and applications (pp. 1–24). maple grove, mn: jam press.
wu, m., & adams, r. (2007). applying the rasch model to psycho-social measurement: a practical approach. melbourne: educational measurement.

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 5(1), 2019, 54-60
available online at: http://journal.uny.ac.id/index.php/reid

testing the parent-child communication program: its effectiveness on developing children’s social competences

*1agus mulyanto; 2alifah indalika mulyadi razak
1,2department of early childhood education, universitas islam nusantara
jl. soekarno-hatta no. 530, buahbatu, kota bandung, west java 40286, indonesia
*corresponding author. e-mail: agusmulyantouin@gmail.com
submitted: 02 august 2018 | revised: 16 april 2019 | accepted: 07 may 2019

abstract

social competence has a central role in the development of early childhood.
one of the central roles of children's social competence is its influence on children's academic social abilities, such as the ability to (1) receive learning information, (2) follow the school’s rules, and (3) face increasingly complex academic challenges along with increasing school levels. this study aims to test the model of parent-child communication (pcc) in developing children’s social competence. the exclusive communication model was tested to 250 children aged 4-6 years in west java province, which is divided into five research zones, west java zone 1 to 5. this study used an experimental design pre-test and post-test to determine the effectiveness of the parent-child communication program that was tested through observation and interview techniques consisting of 68 items of social competence. the results show that the pcc program can effectively be applied by the collaboration of parents between fathers and mothers to optimize children's social competencies. the pcc program, which was not attended by both parents, would not be effective, for example, as happened in the west java zone 1, indicating the ineffectiveness of the pcc program because of the characteristics of parental activities that both work and do not have time to communicate with children. while in the other four west java zones pcc can be effective, because working parents want to take the time to interact and communicate actively with their children. keywords: parent-child communication, social competences, children permalink/doi: https://doi.org/10.21831/reid.v5i1.20679 introduction social competence is considered to play an essential role in child development. izard et al. (2001) explain that social competence can affect children's academic abilities. 
social competence includes the ability to know oneself, know the environment, manage interaction with the environment, share, gain prosocial skills, manage the challenges of learning in school, recognize various emotional expressions, and make friends. the results of another study indicate that social competence can have a significant effect on the academic social life of a child (kail, 2012). a child who has good social competence is predicted to have good performance at school (papalia, feldman, & martorell, 2012). the result of a survey of the teachers of the pembina state kindergartens in west java shows that 4-to-6-year-old children in kindergarten groups still need maximum stimulation in order to improve their social competence (mulyanto, muchtar, hanafiah, hoerudin, & razak, 2017), such as in the subcompetences of prosocial behavior, self-concept, emotional well-being, and academic-social life. in addition, mulyanto et al. (2017) also argue that 50% of kindergarten children attending pembina state kindergartens in west java province have not shown maximum social competence. it is clear that children's social competence is influenced by various community contexts (kärtner, keller, & chaudhary, 2010). various efforts are made to stimulate the development of children’s social competence, one of which is giving rewards in the form of praise, a star symbol, or a toy, which can effectively improve children’s prosocial behavior, one of the indicators of social competence (fabes, fultz, eisenberg, may-plumlee, & christopher, 1989).
besides, storytelling activities (mincic, 2007), video games (greitemeyer & osswald, 2010), and lego games (pang, 2010) are also proven to be alternatives to improve children's social competence. this form of stimulation is inseparable from the participation of parents and teachers in directing the children to develop their social competence, as evidenced by the research conducted by berns (1997), which presents the involvement of teachers and parents as one of the factors in the success of school programs. the problem that then arises is that, if parents and teachers have established good communication, how communication between parents and children should be carried out so that learning information to improve children's social competence in school can also be aligned with routine activities at home. to answer this question, john bowlby's attachment theory introduces parent-child communication (pcc) as a form of parent and child communication (papalia et al., 2012; schneider, atkinson, & tardif, 2001). the established communication involves an emotional bond between parents and children. it is hoped that a mutual attachment will contribute to establishing a high-quality relationship in accordance with the concept of attachment (bus, belsky, van ijzendoorn, & crnic, 1997; schneider et al., 2001). through the pcc program, children feel safe and comfortable telling their parents about what they feel and what they want to convey. besides, parents can also ask children to do something or behave in a certain way, so that both parties can easily negotiate and find solutions to the events at hand. the pcc was implemented in 2006 (niles, reynolds, & nagasawa, 2006) as a program to convey cognitive information from parents to children by using learning media as the intermediaries of the parent-child communication program (berns, 1997).
based on the previous experiences in implementing pcc, it is assumed that this program can also be effective in improving children's social competence.

method

this study used a quantitative experimental design in the form of a before-and-after experiment design, commonly known as a pre-posttest design (kumar, 2011). the population of this study consisted of 4-6-year-old kindergarten children who attend tk negeri pembina in west java. these schools were selected as the population because the characteristics of their environmental conditions and facilities are considered homogeneous, since each tk negeri pembina has the same service and facility standards. the total population of tk pembina in the province was 50 schools, each with 80 to 100 kindergarten students. thus, the total student population was approximately 5,000 children. therefore, the population needed to be limited through the determination of research samples by using random sampling (kumar, 2011), so that the total study sample consists of 250 children from five zones in west java province. the participants of this study were 250 kindergarten children in west java province who were 4 to 6 years old, lived with one father and one mother, and interacted with their father and mother every day. the research subjects were then grouped into five zones so that each zone consisted of 50 children. the distribution of research zones is based on the geographical conditions of data collection in the province of west java: (1) zone 1 consists of kindergartens in bandung raya; (2) zone 2 consists of kindergartens in the northern part of west java; (3) zone 3 consists of kindergartens in the southern part of west java; (4) zone 4 consists of kindergartens in the eastern part of west java; (5) zone 5 consists of kindergartens in the western part of west java. this study aims to examine the effectiveness of the parent-child communication program in improving children's social competence. the pcc program is applied in the following situations: (1) communication when the child goes to school; (2) communication when the child comes home from school; (3) communication between day and evening; (4) communication when children have their dinner; and (5) communication when the child is at bedtime. the five-form parent-child communication program is carried out by the parents, consisting of father and mother, for ten days. the program is considered effective if it can improve children's social competence, which consists of 68 items measured through observation and interview. observation was applied to items that can be observed, such as the ability to play with friends, the ability to help friends, and other items. the interview instrument was applied to items that cannot be observed, such as the ability to distinguish sad, happy, disgusted, fearful, and other facial expressions. the data collection using observations and interviews was carried out by the class teacher of each child during the pre-test and post-test process.
data analysis was carried out by testing the parent-child communication program through the following hypotheses:

h0: there is no influence of the parent-child communication intervention on child social competency
h1: there is an influence of the parent-child communication intervention on child social competency

the data were collected using observation and interview techniques consisting of 68 valid items of child social competency indicators. the data were collected in the pre-test, followed by the intervention at the treatment stage, and the effectiveness was tested at the post-test, with the following research steps: (1) implementation of the social competence pre-test for children aged 4-6 years; (2) implementation of the parent-child communication model intervention; and (3) the social competence post-test for children aged 4-6 years. each step is elaborated as follows.

implementation of social competence pre-test for 4-to-6-year-old children

this activity examines children's social competence, which consists of 68 items that have been tested for validity and reliability using expert judgment. the implementation of the pre-test of social competence employs observation and interview instruments. children are observed in playing activities and daily routines at school. in the playing activities, the indicators that can be observed are, for example, indicators number 1, 2, 3, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, and 23. they measure the following abilities: to play with friends, resolve conflicts with friends, ask about feelings of friends, like playing with friends rather than playing alone, help friends, behave well with friends, work with friends, share things, share ideas, resolve conflicts, show concern for younger children, throw jokes, distinguish between right and wrong behaviors, show sympathy, show empathy, play with older friends, lead friends, and be led by friends (kail, 2012).
meanwhile, the ability indicators that can be raised through the interview process are shown through items number 4, 7, 10, 14, 18, 28, 29, 30, 31, and 35, each of which measures the ability to tell the teacher about friends' feelings, know the signs of friends who need help, show attention to friends, give advice to persuade friends, have the initiative to create a game or activity, choose the role or task given by the teacher, express choices, tell others about something, plan an action, and know the purpose of a rule (mulyanto et al., 2017).

implementation of parent-child communication model intervention

the treatment was carried out simultaneously in the five zones in the province of west java. the research team asked the principals and teachers to introduce the parent-child communication (pcc) program to the parents of kindergarten students in west java province. this treatment took place for ten days and was followed consistently by the parents of students (knafo & plomin, 2006) in the form of the following activities: (1) communication when the children go to school; (2) communication when children go home from school; (3) communication between day and evening; (4) communication when children have dinner; and (5) communication when the children are going to sleep. the intervention was carried out in families with one father and one mother, who consistently divided the role assignments in this intervention activity.

post-test social competence of children aged 4-6 years

this activity examines children's social competence, consisting of 68 items, by observing and interviewing children in play activities and daily routines at school after they were given the parent-child communication program intervention.
the items of social competency indicators tested were the same as the indicator items in the pre-test activities, such as the indicator items number 24, 25, 26, 27, 32, 33, 34, 42, 43, 45, 46, 50, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, and 68, each of which measures the ability of being enthusiastic in school, following the rules, giving focused attention for 2 minutes, giving focused attention for 8 minutes, being brave enough to be left by family members at school, running routine activities, completing two to three assignments at school, maintaining their own property, taking care of other people's goods, protecting themselves from strangers, showing the things they need, preserving the environment, controlling anger, controlling anxiety, controlling the fear of scary things, controlling the fear of darkness, controlling the fear of certain situations, controlling pain, controlling sadness, controlling shyness in order to deliver a monologue, playing games, and showing confidence. meanwhile, the indicators of ability that can be measured through the interview process are items number 36, 37, 38, 39, 40, 41, 44, 47, 48, 49, 51, 52, 53, 54, 55, 56, and 67, which respectively measure the ability to distinguish men and women based on their role and physical appearance, have a tendency towards favorite objects, make positive calls to their friends, mention things they like and dislike, distinguish their own goods from those of others, distinguish comfort and discomfort with their physical care, distinguish comfort and discomfort in their physical appearance, follow the rules for their health, and recognize the happy, sad, angry, fearful, surprised, disgusted, and proud facial expressions (csoti, 2009).

findings and discussion

early childhood social competence is a form of children's school readiness (setiawati, izzaty, & triyanto, 2017).
furthermore, the research in 2017 explained that children's social-emotional abilities influence almost all aspects of the children's development. school readiness is something that can be improved through collaboration between parents and teachers. the parent-child communication program is one of the interventions that can be applied to improve children's social competence, which has a long-term influence on school readiness. the parent-child communication program applied in this study involves fathers and mothers who have children aged 4-6 years. the involvement of parents in implementing this program requires good cooperation. this program is carried out for ten days through five forms of communication activities. the problem is that the research subjects differ in the activities of their fathers and mothers. some children have fathers and mothers who both work from morning to night; in other cases the father works from morning to night and the mother stays at home as a housewife. some children have fathers who work and go home once a month. there are also children whose father works and returns once a month while the mother also works from morning to night. therefore, the results of the implementation of the parent-child communication program are very diverse across the five zones of west java province. the research subjects in west java zone 1 (table 1) came from kindergartens throughout bandung raya, where most of the parents of students worked outside the home. the pcc program is carried out for ten days and must involve both parents. in west java zone 1, out of 50 parents, only 15 parents were willing to consistently join the pcc program, covering communication before leaving for school, communication at home from school, communication between the afternoon and evening, communication at dinner, and communication before bedtime, all carried out completely by father and mother. the results show that the pcc program was less effective in west java zone 1, which was indicated by 29 children experiencing an increase in social competence (mean positive ranks 29b = 26.81) with a total value of 777.50, 19 children experiencing a decrease in scores (mean negative ranks 19a = 20.97) with a total value of 398.50, and 2 others having the same pre-test and post-test scores (ties = 2c). based on the results of the pre-test and post-test, it was concluded that h0 was accepted (asymp. sig. (2-tailed) = 0.052 > α = 0.05), so that there was no effect of the pcc program on the average social competence of early childhood. the research subjects in west java zone 2 (table 1) came from kindergartens in the northern region of west java province, namely the subang, purwakarta, and karawang regions, amounting to 50 children. participation of the research subjects in this zone was considered very active because most of the children were accompanied by their mother at home (only the father works), so that the communication ran effectively. it is evidenced by the results showing that 35 children experienced an increase in social competence (mean positive ranks 35b = 26.24) with a total value of 918.5. meanwhile, 11 children experienced reduced scores (mean negative ranks 11a = 14.77) with a total value of 162.5, and four others had the same pre-test and post-test scores (ties = 4c). based on the results of the pre-test and post-test, it was indicated that h0 was rejected (asymp. sig. (2-tailed) = 0.000 < α = 0.05), so that there was an effect of the parent-child communication program on the average social competence of early childhood.
the effectiveness of the parent-child communication program was also tested in the southern west java zone 3 (table 2), namely kindergartens in the garut, cianjur, and sukabumi regions. even though 20 pairs of parents stated that both father and mother worked, they were willing to set aside the time and divide the duties needed to join this pcc program. thus, 50 children received the intervention effectively from their parents for ten days through the pcc program. this can be seen from the results, which showed that 32 children experienced an increase in social competence (mean positive ranks 32b = 30.09) with a total value of 963. meanwhile, 18 children experienced a decrease in value (mean negative ranks 18a = 17.33) with a total value of 312, while no sample had the same pre-test and post-test scores (ties = 0c). based on the results of the pre-test and post-test, it was indicated that h0 was rejected (asymp. sig. (2-tailed) = 0.002 < α = 0.05), so that there was an effect of the parent-child communication program on the average of early childhood social competence.

table 1. result of pcc program in west java for zone 1 and 2

                  west java 1                        west java 2
                  n     mean rank   sum of ranks     n     mean rank   sum of ranks
negative ranks    19a   20.97       398.50           11a   14.77       162.50
positive ranks    29b   26.81       777.50           35b   26.24       918.50
ties              2c                                 4c
total             50                                 50

table 2. result of pcc program in west java for zone 3 and 4

                  west java 3                        west java 4
                  n     mean rank   sum of ranks     n     mean rank   sum of ranks
negative ranks    18a   17.33       312.00           19a   16.92       321.50
positive ranks    32b   30.09       963.00           28b   28.80       806.50
ties              0c                                 3c
total             50                                 50

improvement in children's social competence also occurred in the eastern west java zone 4 (table 2), namely the ciamis and sumedang areas, indicated by an increase in the scores of 28 children (mean positive ranks 28b = 28.80) with a total value of 806.5. meanwhile, 19 children experienced a decrease in value (mean negative ranks 19a = 16.92) with a total score of 321.5, while three samples had the same pre-test and post-test scores (ties = 3c). based on the results of the pre-test and post-test, it was indicated that h0 was rejected (asymp. sig. (2-tailed) = 0.01 < α = 0.05), so that there was an effect of the pcc program on children's social competence.

table 3. result of pcc program in west java zone 5

                  west java 5
                  n     mean rank   sum of ranks
negative ranks    19a   18.11       344.00
positive ranks    30b   29.37       881.00
ties              1c
total             50

good communication between parents and children was also observed in west java zone 5 (table 3), namely the bogor and depok regions, where 30 of the 50 children experienced an increase in social competence (mean positive ranks 30b = 29.37) with a total value of 881. meanwhile, 19 children experienced a decrease in value (mean negative ranks 19a = 18.11) with a total value of 344, while one sample had the same pre-test and post-test scores (ties = 1c). based on the results of the pre-test and post-test, it was indicated that h0 was rejected (asymp. sig. (2-tailed) = 0.008 < α = 0.05), so that there was an effect of the pcc model on the average social competence of children.
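the rank statistics reported for the five zones have the form of a wilcoxon signed-rank test on paired pre-test and post-test scores. as a rough cross-check, the python sketch below recomputes the two-tailed significance values from the reported rank sums using the large-sample normal approximation. it is an illustration under stated assumptions rather than the authors' procedure: the software they used is not stated, the function name `wilcoxon_normal_approx` is hypothetical, ties are excluded from n, and no tie or continuity correction is applied.

```python
import math


def wilcoxon_normal_approx(sum_negative, sum_positive, n_negative, n_positive):
    """Return (z, two-tailed p) from signed-rank sums via the normal approximation."""
    n = n_negative + n_positive  # pairs with a non-zero difference
    # Consistency check: the two rank sums must add up to n(n + 1) / 2.
    if abs((sum_negative + sum_positive) - n * (n + 1) / 2) > 1e-9:
        raise ValueError("rank sums are inconsistent with n")
    w = min(sum_negative, sum_positive)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# (sum of negative ranks, sum of positive ranks, n negative, n positive),
# copied from tables 1 to 3.
zones = {
    "west java 1": (398.50, 777.50, 19, 29),
    "west java 2": (162.50, 918.50, 11, 35),
    "west java 3": (312.00, 963.00, 18, 32),
    "west java 4": (321.50, 806.50, 19, 28),
    "west java 5": (344.00, 881.00, 19, 30),
}
results = {zone: wilcoxon_normal_approx(*values) for zone, values in zones.items()}
```

under this approximation, the rank sums in every zone add up to n(n + 1)/2, only zone 1 stays above α = 0.05, and the recomputed p-values lie close to the reported asymp. sig. values. small differences from the reported figures are to be expected if the original analysis applied a tie correction.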
based on the data compiled from the five zones in the province of west java, involving 250 children and 250 pairs of parents, the parent-child communication model consists of five forms of communication, namely: (1) communication when children go to school; (2) communication when children go home from school; (3) communication between day and evening; (4) communication when children have dinner; and (5) communication when the child is going to sleep. the results show that this model is effective in improving early childhood social competence.

conclusion

this study aims to examine the effectiveness of the parent-child communication program on the development of children's social competence. children's social competence has a central position in their growth and is one of the main factors in children's readiness, so good collaboration between parents and teachers is required for it to develop well. through this parent-child communication program, parents who have diverse activities are asked to maximize their communication with their children, especially in terms of improving children's social competence, which consists of 68 behavioral items. the research findings show that the parent-child communication program can effectively improve children's social competence as measured by these 68 items. the pcc program can be implemented in the form of communication between parents and children consisting of (1) communication before going to school; (2) communication when going home from school; (3) communication between the afternoon and evening; (4) communication at dinner; and (5) communication before going to bed. the five forms of parent-child communication were effective in the eastern, western, southern, and northern parts of west java province, including among research subjects with many mothers and fathers who spend their time at home.
acknowledgment

the researchers express their gratitude to the ministry of research and higher education of the republic of indonesia for the research grant program, which made it possible to publish the research results.

research and evaluation in education
issn 2460-6995
research and evaluation in education, 2(2), 2016, 155-164
available online at: http://journal.uny.ac.id/index.php/reid

research article

proving content validity of self-regulated learning scale (the comparison of aiken index and expanded gregory index)

heri retnawati
faculty of mathematics and natural science, universitas negeri yogyakarta, jl. colombo no. 1, depok, sleman, 55281, yogyakarta, indonesia

abstract

this study aims to prove the content validity of the self-regulated learning (srl) scale in the likert model and the multiple-choice model, using content validity coefficients based on expert assessments calculated with the aiken formula and the expanded gregory formula. in this study, the srl scales in the likert and multiple-choice models are developed using the same outline/format. three experts assess the relevance of the items to the indicators in both scale formats. the results of the expert assessments are then used to calculate the validity coefficient with the aiken formula and the expanded gregory formula. the results show that the content validity coefficient based on expert assessment with the aiken formula is 0.9 for both the likert and the multiple-choice format, while with the expanded gregory formula the coefficient is 0.6 for the likert format and 0.8 for the multiple-choice format.

keywords: validity coefficient, aiken formula, expanded gregory formula, srl scale

how to cite item: retnawati, h. (2016). proving content validity of self-regulated learning scale (the comparison of aiken index and expanded gregory index). research and evaluation in education, 2(2), 155-164. doi:http://dx.doi.org/10.21831/reid.v2i2.11029

*corresponding author. e-mail: retnawati.heriuny1@gmail.com

introduction

successful learning is driven by many factors.
one of them is self-regulated learning, which is related to independent learning such as that carried out by college students. college students study at college and are categorized as adults, both because of their age and because of the demands of independent learning in college. for college students, managing themselves to learn is a factor that supports their success in learning at college. the ability to manage oneself in studying is often referred to as self-regulated learning.

experts have presented various views on self-regulated learning. pintrich (in schunk, 2005) states that self-regulated learning, or self-regulation, is an active, constructive process whereby learners set goals for their learning and then attempt to monitor, regulate, and control their cognition, motivation, and behavior, guided and constrained by their goals and the contextual features in the environment. zimmerman (1989; 1990) writes that self-regulated learning strategies are actions and processes directed at acquiring information or skills that involve agency, purpose, and instrumentality perceptions by learners. it means that a person carries out self-regulated learning if he/she controls his/her behavior and cognition systematically by observing the rules he/she has made, controlling the learning process, integrating the knowledge, practicing to remember the information obtained, and developing and maintaining positive values from his/her learning.

the social cognitive theory of bandura (kivinen, 2013) presents the theoretical basis of the model of self-regulated learning development in an individual, in which contextual factors and interactional behavior give students the opportunity to organize their study and to regulate themselves at the same time.
the social cognitive perspective views self-regulation as an interaction of personal, behavioral, and environmental processes, which is often referred to as the triadic process from bandura, as seen in figure 1. self-regulation is a cyclical process, because input from earlier performance is used to make decisions about repeating the efforts that have been made. such repetition is necessary because the person, the environment, and behavior always change during a learning process, which is therefore continuously observed and monitored.

figure 1. self-regulation triadic form from zimmerman (kivinen, 2013)

discussion on self-regulated learning includes three phases: the forethought and planning phase, the performance monitoring phase, and the reflection on performance phase (zumbrunn, tadlock, and danielle, 2011). in the forethought and planning phase, there are two related things: task analysis, and self-confidence and motivation. the performance monitoring phase includes self-control and self-observation. the self-reflection phase consists of self-consideration and self-reaction. these three phases are interrelated and affect each other, so that they make up a cycle. the cycle is described in figure 2.

figure 2. srl phase (zumbrunn, tadlock, & danielle, 2011)

the forethought phase can be classified into two points, namely task analysis (covering goal setting and strategic planning) and self-motivation (self-confidence and task orientation). the performance monitoring phase includes self-control (covering self-instruction, focus of attention, and task-solving strategies) and self-observation (self-note and self-experimentation).
self-reflection consists of self-consideration (self-evaluation and attribution) and self-reaction (self-satisfaction and adaptability).

in developing an srl scale, wolkers, pintrich, and karabenick (2003) write that items should first be developed to measure the regulation of cognition, followed by the regulation of motivation and of behavior. these three things need to be measured in the academic context. some studies show that srl is strongly associated with motivation (vrieling, bastiaens, and stijnend, 2012). srl can be reinforced by educators in the learning process by preparing tasks that support its improvement (zumbrunn, tadlock, and danielle, 2011). srl is recognized as an important predictor of student academic motivation and achievement (zumbrunn, tadlock, and danielle, 2011).

given the importance of the contribution of srl to success in college education, the srl of students needs to be measured. the result of the measurement can be interpreted and followed up in an effort to maintain or improve srl. therefore, a valid srl measurement instrument needs to be developed following instrument development steps, each of which can be accounted for. the development of an srl measuring instrument consists of several stages, including constructing a format based on the proper construct theory, preparing items, proving the content validity, trying out the instrument on appropriate respondents, estimating the reliability, understanding the characteristics of the items, and assembling the adequate items into an instrument that is ready for use.

one of the instruments that can be used to measure srl is a questionnaire. the questions in a questionnaire take various forms, including dichotomous questions, multiple-choice questions, rank ordering, rating scales, and open-ended questions (cohen, manion, and morrison, 2011). each of these forms has its own characteristics.
dichotomous questions in a questionnaire contain only two answer choices. these questions are used if the researcher wants to ask respondents about a variable with only two possible answers, for example gender (male or female) or items answered with yes or no, or true or false. multiple-choice questions in a questionnaire are basically like multiple-choice questions in a test. respondents are usually allowed to choose one answer only. the scoring can be dichotomous (right or wrong) or stratified across the alternatives. if stratified scoring is used, the questionnaire maker needs to define the ideal condition that receives the highest score. the questionnaire model most often used in indonesia is the rating scale, better known as the likert model.

[figure 2 content: thinking: task analysis (goal setting and strategic planning), self-motivation conviction (self-conviction and task orientation); performance control: self-control (self-instruction, focus of attention, and task-solving strategies), self-observation (self-note and self-experimentation); self-reflection phase: self-consideration (self-evaluation and attribution), self-reaction (self-satisfaction and adaptivity)]

in interviews with practitioners in the educational field, some practitioners question the validity of questionnaires in the likert model compared with those in the multiple-choice model. each practitioner has reasonable arguments. the likert questionnaire model is easy to make and easy for respondents to read, but the data obtained contain desirability bias. the multiple-choice questionnaire model is difficult to make and respondents need more time to read it, but more valid data can be obtained from it. related to this problem, this study describes the proof of the content validity of questionnaires in the likert model and in the multiple-choice model with stratified scoring.
there are various opinions on the validity of the instruments used for measurement, both in education and psychology. according to the american educational research association (aera), the american psychological association (apa), and the national council on measurement in education (ncme) in the standards for educational and psychological testing (1999), validity refers to the degree to which evidence and theory support the interpretation of instrument scores, and it is the most important consideration in the development of an instrument. other experts point out that the validity of a measuring instrument is the extent to which the instrument is able to measure what should be measured (nunnally, 1978; allen and yen, 1979, p.97; kerlinger, 1986). meanwhile, linn and gronlund (1995) explain that validity refers to the adequacy and appropriateness of the interpretations made from an assessment, related to a specific use. this opinion is reinforced by messick (1989), who writes that validity is an integrated evaluative judgment of the extent to which empirical evidence and theoretical reasons support the adequacy and appropriateness of inferences and actions based on test scores or scores of an instrument. based on those opinions, it can be concluded that validity shows the support of empirical evidence and theoretical reasons for the interpretation of test scores or scores of an instrument, and that it is associated with measurement precision.

there are three types of validity, namely: (1) criterion validity (criterion-related validity), (2) content validity, and (3) construct validity (nunnally, 1978; allen and yen, 1979; fernandes, 1984; woolfolk and mccune, 1984; kerlinger, 1986; and lawrence, 1994). validity can be established through validity evidence.
sources of validity evidence can be grouped into content, response processes, internal structure, relations with other variables, and the consequences of the implementation of data collection (aera, apa, and ncme, 1999; cizek, rosenberg, and koons, 2008). the validity of an instrument can be identified through content analysis and through empirical analysis of the instrument scores or item response data (lissitz and samuelsen, 2007).

criterion validity is divided into two, namely predictive validity and concurrent validity. fernandes (1984) writes that criterion-based validity is intended to answer the question of the extent to which an instrument can predict the participants' ability in the future (predictive validity) or estimate the ability measured by other measuring devices at about the same time (concurrent validity). a similar opinion is expressed by lawrence (1994), who says that an instrument has predictive validity if it is able to predict capability in the future. in the analysis of predictive validity, the performances to be predicted are called criteria. the size of the estimated predictive validity of an instrument is described by the correlation coefficient between the predictor and those criteria.

the content validity of an instrument is the extent to which the items in the instrument represent the components of the overall content area of the object to be measured and the extent to which the items reflect the behavioral traits to be measured (nunnally, 1978; fernandes, 1984). meanwhile, lawrence (1994) explains that content validity is the representativeness of the questions with respect to the special abilities that must be measured. based on these opinions, it can be concluded that content validity is related to the rational analysis of the domain to be measured, in order to determine how well the instrument represents the ability to be measured.
construct validity is the validity which shows the extent to which an instrument reveals the ability or particular theoretical construct to be measured (nunnally, 1978; fernandes, 1984). a construct validation procedure starts from the identification and delimitation of the variables to be measured, expressed in terms of a logical construct based on the theory of those variables. from this theory, a practical consequence for the results of measurement under certain conditions is drawn, and this consequence is tested. if the result is in line with expectations, the instrument is considered to have good construct validity.

validity is indispensable in the development of an instrument. according to sireci, supported by lissitz and samuelsen (2007), the validation of instruments used in education should involve content analysis and empirical analysis of the scores obtained from the instrument and of the respondents' responses to the items. content analysis of an instrument is associated with content validity, while empirical analysis is needed later to prove construct validity. both of these analyses are intended to make instruments in education qualify as standard measurement instruments.

content validity is determined using expert agreement. the agreement of experts in the measured domain determines the level of content (content-related) validity. this is because a measuring instrument, for example a test or a questionnaire, is considered valid if the experts believe that the instrument measures the mastery of the abilities defined in the domain or the psychological construct being measured. to quantify this agreement, a validity index can be used, including the index proposed by aiken (1980; 1985).
the item validity index proposed by aiken is formulated as follows:

$$v = \frac{\sum s}{n(c-1)} \qquad (1)$$

where v is the item validity index; s is the score assigned by each rater minus the lowest score in the category used (s = r − lo, with r the category score selected by the rater and lo the lowest score in the scoring category); n is the number of raters; and c is the number of categories that raters can choose. accordingly, v is an index of rater agreement on the suitability of an item to the indicator that the item is intended to measure. if the index is applied to a whole instrument as judged by one rater, n can be replaced by m (the number of items in the instrument). the v index ranges from 0 to 1. the closer the index of an item is to 1, the better the item is, because it is more relevant to the indicator.

another way to prove content validity with expert agreement is to use the expert agreement index suggested by gregory (2007). this index also ranges from 0 to 1. a contingency table is made for two experts, in which the categories not relevant and less relevant are combined into a weak relevance category, and the categories quite relevant and very relevant are combined into a strong relevance category. the expert agreement index for content validity is the ratio of the number of items that both experts, as validators, place in the strong relevance category to the total number of items (gregory, 2007). the results of the relevance tabulation (contingency table) are presented in table 1, and the validity coefficient is presented in formula 2.

table 1. the relevance category scoring with two validators

                         validator 1
                         weak     strong
validator 2    weak      a        b
               strong    c        d

$$\text{content validity coefficient} = \frac{d}{a+b+c+d} \qquad (2)$$

if the validators are three experts, the contingency table has 2x2x2 = 8 cells, as presented in table 2.
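the aiken index (v = sum of s divided by n(c − 1)) and the two-validator agreement coefficient (d divided by a + b + c + d) translate directly into a few lines of code. the sketch below is illustrative only: the function names, the 1-4 rating scale with 3 as the lower bound of the strong category, and the example ratings are assumptions made for the example and are not taken from the study.

```python
def aiken_v(ratings, lowest=1, categories=4):
    """Aiken's item validity index: sum(r - lo) / (n * (c - 1)).

    ratings    -- scores given to one item by n raters
    lowest     -- lowest score of the rating scale (lo)
    categories -- number of rating categories (c)
    """
    n = len(ratings)
    s_total = sum(r - lowest for r in ratings)
    return s_total / (n * (categories - 1))


def gregory_two_validators(ratings_1, ratings_2, strong_from=3):
    """Share of items that both validators rate as strongly relevant.

    Corresponds to d / (a + b + c + d) in the 2x2 contingency table,
    where a rating >= strong_from counts as "strong".
    """
    both_strong = sum(
        1
        for r1, r2 in zip(ratings_1, ratings_2)
        if r1 >= strong_from and r2 >= strong_from
    )
    return both_strong / len(ratings_1)


# hypothetical ratings on a 1-4 relevance scale
item_ratings = [4, 4, 2]        # one item judged by three raters
v = aiken_v(item_ratings)       # (3 + 3 + 1) / (3 * 3)

validator_1 = [4, 3, 2, 4, 1]   # five items judged by validator 1
validator_2 = [3, 4, 4, 2, 1]   # the same five items judged by validator 2
coefficient = gregory_two_validators(validator_1, validator_2)

print(f"aiken v = {v:.2f}, two-validator coefficient = {coefficient:.2f}")
```

in this example only the first two items are rated strong by both validators, so the coefficient is 2 out of 5, while the aiken index still gives partial credit to the rater who chose category 2.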
the content validity coefficient for three validators is an expansion of formula 2 and is presented in formula 3. this coefficient also ranges from 0 to 1. a coefficient close to 0 means that the validators' agreement on the relevance of the instrument items to their indicators is low. conversely, the closer the validity coefficient is to 1, the greater the validators' agreement on the relevance of the instrument items to their indicators.

table 2. table of contingency to calculate the validity coefficient with gregory formula involving three validators

expert 1   weak   weak     weak     weak     strong   strong   strong   strong
expert 2   weak   weak     strong   strong   weak     weak     strong   strong
expert 3   weak   strong   weak     strong   weak     strong   weak     strong
total      a      b        c        d        e        f        g        h

$$\text{content validity coefficient} = \frac{h}{a+b+c+d+e+f+g+h} \qquad (3)$$

method

the sub-indicators were compiled using srl components and indicators (adapted from zimmerman, 2000). the results of the indicator development and the item numbers are presented in table 3. the items of the srl scales were arranged using this outline. the scale was set in two forms: a questionnaire in the likert model and a questionnaire in the multiple-choice model. for example, items 1 and 8 are shown in table 4 for the likert model and in table 5 for the multiple-choice model.

the outline/format and the items of the two forms of the instrument for measuring srl were then given to three validators. the validators consisted of two educational psychologists and one educational measurement expert. the three validators assessed the relevance of the items to the indicators on both scale forms. based on the results of the assessment of the three validators, the validity index and the validity coefficient were calculated using the aiken formula (formula 1) for both scale models.

table 3.
srl components and indicators (adapted from zimmerman (2000))

components            indicators               sub indicators                    items
thought               task analysis            goals setting                     1
                                               strategic planning                2
                      confidence               self-capability                   3
                                               task-oriented                     4
performance control   self-control             self-instruction                  5
                                               study focus effort                6
                                               task-finishing strategy           7
                      sufficient observation   metacognitive observation         8
                                               self-note                         9
                                               self-experimentation              10
self-reflection       self-consideration       self-evaluation                   11
                                               causal attribution                12
                      self-reaction            self-satisfaction (reward)        13
                                               self-satisfaction (punishment)    14
                                               adaptive/defensive                15

table 4. items with likert model

no   statements                                                   never   seldom   often   always
1    i frame my study/course goals before the activity begins     1       2        3       4
8    i make maps of activities that i have done

table 5. items with multiple-choice model

no.   items
1.    at the beginning of the lecture (semester 1), a statement that is the most suitable with your condition is. . . .
      a. i frame my purposes clearly after i graduate. (4)
      b. i just know the best college for me, and my dream after graduate is not important. (2)
      c. i have a principle that life is just flowing, including the lecture. (1)
      d. i know what i will do after i graduate, but i am not sure with that. (3)
8.    about the efforts that you have done, which statement describes your condition. . . .
      a. i record my failure, so it motivates me to correct it. (3)
      b. failure, success, and effort that i have been done or will do, i draw them only in my mind. (2)
      c. i do not map my efforts, success, and failures that i think i fail to correct. (1)
      d. i make a map or diagram of the efforts that i have done and their results, as a success or failure. (4)

table 6.
experts final results of items compatibility with indicators data

likert                               multiple-choice
items   rater1   rater2   rater3     items   rater1   rater2   rater3
1       4        4        2          1       4        4        2
2       4        4        4          2       4        4        4
3       4        4        4          3       4        4        4
4       4        4        4          4       4        3        3
5       4        2        4          5       4        3        2
6       4        4        4          6       4        4        4
7       4        2        3          7       4        2        3
8       4        2        4          8       4        3        4
9       4        2        4          9       4        3        4
10      4        4        4          10      3        3        4
11      4        4        4          11      4        4        4
12      4        4        4          12      4        4        4
13      4        2        4          13      4        4        4
14      4        4        3          14      4        4        4
15      4        4        4          15      4        4        4
notes: (4 = very relevant, 3 = adequate relevant, 2 = less relevant, 1 = irrelevant)

using the same data, the relevance ratings were recoded into two new categories, weak and strong, from which a contingency table as shown in table 2 was made. furthermore, the validity coefficient was calculated using the expanded gregory formula (formula 3) for both scale models.

findings and discussion

the results of the assessment of the validators are presented in table 6. in addition to providing quantitative assessments, the validators also provided qualitative inputs, which include (1) improvement of the statements in the likert items, (2) improvement of the item stems and options in the multiple-choice items, and (3) the observation that indonesian respondents are not yet familiar with multiple-choice questionnaires, because reading them takes longer than reading a questionnaire in the likert model.

furthermore, based on the results of the quantitative assessment, the item validity indices and the scale validity coefficients were calculated using the aiken formula for the likert model and for the multiple-choice model. the results are presented in table 7. a comparison of each item in the two models is presented in figure 3. table 7 and figure 3 show that the item validity indices of the likert model and the multiple-choice model are not much different. similarly, the scale validity coefficients obtained for the likert model and the multiple-choice model are exactly the same.

table 7.
the results of validity calculation using aiken formula

items   likert   multiple-choice
1       0.78     0.78
2       1.00     1.00
3       1.00     1.00
4       1.00     0.78
5       0.78     0.67
6       1.00     1.00
7       0.67     0.67
8       0.78     0.89
9       0.78     0.89
10      1.00     0.78
11      1.00     1.00
12      1.00     1.00
13      0.78     1.00
14      0.89     1.00
15      1.00     1.00
scale   0.90     0.90

figure 3. aiken index on scale of likert and multiple-choice model

based on the same data, the item relevance ratings were recoded into two categories only, weak and strong. the number of items in each combination of categories for the likert questionnaire model is presented in table 8. based on table 8, of the 15 items of the scale, nine items have strong relevance according to the assessment of all three validators. this shows that with formula 3, the content validity coefficient of the srl measurement instrument in the likert model is 0.60.

table 8. likert relevance category

expert 1   weak   weak     weak     weak     strong   strong   strong   strong
expert 2   weak   weak     strong   strong   weak     weak     strong   strong
expert 3   weak   strong   weak     strong   weak     strong   weak     strong
total      0      0        0        0        0        5        1        9

using the same technique, the relevance categories for the multiple-choice model were tabulated. the results are presented in table 9. based on table 9, of the 15 items of the scale, 12 items are strongly relevant according to the assessment of all three validators. this shows that with formula 3, the content validity coefficient of the srl measurement instrument in the multiple-choice model is 0.80.

table 9. multiple-choice relevancy category

expert 1   weak   weak     weak     weak     strong   strong   strong   strong
expert 2   weak   weak     strong   strong   weak     weak     strong   strong
expert 3   weak   strong   weak     strong   weak     strong   weak     strong
total      0      0        0        0        0        1        2        12

figure 4.
comparison of validity coefficient using aiken formula and gregory formula

the comparison of the validity coefficients of the srl scale, by scale form and by formula, is presented in figure 4. the figure shows that the validity coefficient calculated using the aiken formula is more stable than the one calculated using the gregory formula. in terms of scale form, the validity coefficient calculated using the gregory formula for the srl scale in the multiple-choice model (0.80) is higher than the one calculated for the likert model (0.60).

conclusion

in this study, two instruments of srl measurement, in the likert model and in the multiple-choice model, were developed using the same format. the format and the two instrument models were then given to three validators, who assessed the relevance of the items to the indicators. the results of the expert assessment were used to prove the content validity using the aiken formula and the expanded gregory formula. the results of the study show that the content validity coefficients based on expert assessment with the aiken formula are 0.9 for both the likert format and the multiple-choice format, with the index for each item being almost the same, while with the expanded gregory formula the coefficients are 0.6 for the likert format and 0.8 for the multiple-choice format.

these results show that the indices and the validity coefficients obtained using the aiken formula for the likert model and the multiple-choice model are almost the same. this happens because both models were developed using the same format. however, when the validity is verified using the gregory formula, the results are different. the coefficient obtained using the gregory formula is lower than that obtained using the aiken formula, because in the gregory formula an item counts only if all three validators assess it as strongly relevant, and the probability of obtaining this combination is small.
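the coefficients summarized above can be re-derived from the expert ratings in table 6. the sketch below recomputes the aiken index per item and per scale, tabulates the weak/strong patterns of the three validators, and takes the share of items rated strong by all three as the expanded gregory coefficient. the function names are illustrative, and averaging the item indices to obtain the scale coefficient is an assumption about how the reported 0.90 was obtained.

```python
from collections import Counter

# ratings from table 6 as (rater 1, rater 2, rater 3) for items 1 to 15
likert = [
    (4, 4, 2), (4, 4, 4), (4, 4, 4), (4, 4, 4), (4, 2, 4),
    (4, 4, 4), (4, 2, 3), (4, 2, 4), (4, 2, 4), (4, 4, 4),
    (4, 4, 4), (4, 4, 4), (4, 2, 4), (4, 4, 3), (4, 4, 4),
]
multiple_choice = [
    (4, 4, 2), (4, 4, 4), (4, 4, 4), (4, 3, 3), (4, 3, 2),
    (4, 4, 4), (4, 2, 3), (4, 3, 4), (4, 3, 4), (3, 3, 4),
    (4, 4, 4), (4, 4, 4), (4, 4, 4), (4, 4, 4), (4, 4, 4),
]


def aiken_scale(items, lowest=1, categories=4):
    """Return (item indices, scale coefficient) using Aiken's formula.

    The scale coefficient is taken as the mean of the item indices
    (an assumption of this sketch).
    """
    per_item = [
        sum(r - lowest for r in ratings) / (len(ratings) * (categories - 1))
        for ratings in items
    ]
    return per_item, sum(per_item) / len(per_item)


def gregory_expanded(items, strong_from=3):
    """Return (pattern counts, coefficient) for any number of validators.

    Ratings >= strong_from are recoded as "strong", the rest as "weak";
    the coefficient is the share of items rated strong by every validator.
    """
    patterns = Counter(
        tuple("strong" if r >= strong_from else "weak" for r in ratings)
        for ratings in items
    )
    all_strong = patterns[("strong",) * len(items[0])]
    return patterns, all_strong / len(items)


likert_items, likert_scale = aiken_scale(likert)
mc_items, mc_scale = aiken_scale(multiple_choice)
likert_patterns, likert_gregory = gregory_expanded(likert)
mc_patterns, mc_gregory = gregory_expanded(multiple_choice)

print(f"aiken scale coefficient: likert {likert_scale:.2f}, multiple-choice {mc_scale:.2f}")
print(f"expanded gregory coefficient: likert {likert_gregory:.2f}, multiple-choice {mc_gregory:.2f}")
```

the tabulation also shows where the two formulas diverge: five likert items are rated less relevant by the second validator only, which lowers the aiken index of those items slightly but removes them entirely from the all-strong cell used by the gregory formula.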
future research could examine how stable the results are as the number of validators changes. further research is needed on the number of validators with which the index or the coefficient obtained is maximized. this is best done for both the aiken formula and the gregory formula.

references

aiken, l. r. (1980). content validity and reliability of single items or questionnaires. educational and psychological measurement, 40, 955-967.
aiken, l. r. (1985). three coefficients for analyzing the reliability and validity of ratings. educational and psychological measurement, 45, 131-142.
allen, m. j. & yen, w. m. (1979). introduction to measurement theory. monterey, ca: brooks/cole.
american educational research association, american psychological association, and national council on measurement in education. (1999). standards for educational and psychological testing. washington, dc: american psychological association.
cizek, g. j., rosenberg, s. l. & koons, h. h. (2008). source of validity evidence for educational and psychological test. educational and psychological measurement, 68, 397-412.
cohen, l., manion, l., & morrison, k. (2011). research methods in education. new york, ny: routledge.
fernandes, h. j. x. (1984). evaluation of educational program. jakarta: national education planning, evaluating and curriculum development.
gregory, r. j. (2007). psychological testing: history, principles, and applications. boston, ma: pearson.
kerlinger, f. n. (1986). asas-asas penelitian behavioral [behavioral research principles] (l. r. simatupang, trans.). yogyakarta: gajahmada university press.
kivinen, k. (2013). assessing motivation and the use of learning strategies by secondary school students in three international schools (unpublished doctoral dissertation). university of tampere, finland.
lawrence, m. r. (1994). question to ask when evaluating test. eric digest. retrieved from http://www.ericfacility.net/eric digest /edu.385607.html.
linn, r. l. & gronlund, n. e. (1995). measurement and assessment in teaching (7th ed.). englewood cliffs, nj: prentice-hall.
lissitz, w. & samuelsen, k. (2007). further clarification regarding validity and education. educational researcher, 36(8), 482-484.
messick, s. (1989). validity. in r. l. linn (ed.), educational measurement (3rd ed., pp. 13-103). new york, ny: macmillan.
nunnally, j. (1978). psychometric theory (2nd ed.). new york, ny: mcgraw hill.
schunk, d. h. (2005). self-regulated learning: the educational legacy of paul r. pintrich. educational psychologist, 40, 85-94.
vrieling, e., bastiaens, t., & stijnen, s. (2012). effects of increased self-regulated learning opportunities on student teachers’ motivation and use of metacognitive skills. australian journal of teacher education, 37(6).
wolters, c. a., pintrich, p. r., & karabenick, s. a. (2003). assessing academic self-regulated learning. paper presented at the conference on indicators of positive development: definitions, measures, and prospective validity, washington, dc.
woolfolk, a. e. & mccune, l. n. (1984). educational psychology for teachers. englewood cliffs, nj: prentice hall.
zimmerman, b. j. (1989). a social cognitive view of self-regulated academic learning. journal of educational psychology, 81(3).
zimmerman, b. j. (1990). self-regulated learning and academic achievement: an overview. educational psychologist, 25(1), 3-17.
zimmerman, b. j. (2000). attaining self-regulation: a social cognitive perspective. in m. boekaerts, p. r. pintrich, & m. zeidner (eds.), handbook of self-regulation: theory, research, and applications (pp. 13–39). san diego, ca: academic press.
zumbrunn, s., tadlock, j., & danielle, e. (2011). encouraging self-regulated learning in the classroom: a review of the literature. metropolitan educational research consortium (merc), virginia commonwealth university.
copyright © 2018, reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 4(1), 2018, 12-21
available online at: http://journal.uny.ac.id/index.php/reid

assessment of the social attitude of primary school students

*1ari setiawan; 2siti partini suardiman
1universitas sarjanawiyata tamansiswa
jl. kusumanegara 157 yogyakarta 55165, indonesia
2graduate school of universitas negeri yogyakarta
jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*corresponding author. e-mail: ari.setiawan@ustjogja.ac.id
submitted: 11 april 2018 | revised: 28 may 2018 | accepted: 17 july 2018

abstract
the implementation of curriculum 2013 at primary school level brings its own problems for teachers. a serious problem emerges in assessment, especially the assessment of core competence for the social attitude aspect. this problem arises because social attitude has many dimensions and requires judgments in diverse forms. in addition, the assessment of social attitude is focused on the affective sphere. the objective of this research is to assess the social attitude of grade iv and/or v students of primary school using three integrated instrument models: self-assessment (sa), peer assessment (pa), and observational assessment (oa). this research employed a qualitative approach. the respondents were 58 students chosen by using cluster random sampling and purposive sampling techniques. the data were collected through a direct disclosure questionnaire and observation, and analyzed using descriptive quantitative techniques.
the results of the research are as follows: (1) the component of honesty attitude is in category a (entrusted); (2) the component of discipline is in category a (entrusted); (3) the responsibility component is in category b (developing); (4) the politeness component is in category b (developing); (5) the caring component is in category b (developing); (6) the confidence component is in category a (entrusted); and (7) students' social attitude is mainly in category b (good), which indicates that most students have good social attitude.

keywords: assessment, social attitude, primary school

introduction
there are three domains of learning outcomes that a student achieves in a learning process, namely: cognitive, affective, and psychomotor domains (krathwohl, bloom, & masia, 1973, pp. 6–7). the cognitive domain is the result of learning that has something to do with memory, ability to think, or intelligence. the affective domain refers to learning outcomes in the form of sensitivity and emotion that deal with attitude, values, and interests. the psychomotor domain, meanwhile, is related to a certain skill or ability of motion (kurniawan, 2014, pp. 10–12). as learning outcomes, these three domains require assessment, including in the integrated thematic approach model. successful learning is defined by behavior (affective) as well as environment (retnawati, 2016). one aspect that requires assessment is the affective domain. the characteristics of the affective domain are attitude, values and interests (mccoach, gable, & madura, 2013, pp. 7–24). the attitude referred to in this study is the social attitude of elementary school students. social attitude is an affective domain that needs to be assessed using an appropriate instrument. social attitude can be seen as an attitude that is related to social conditions. it is an acquired tendency to evaluate social things in a specific way.
it is characterized by positive or negative beliefs about, feelings toward, and behaviors toward a particular entity. it has three main components: emotional, cognitive, and behavioral components. the emotional component is the feeling experienced in evaluating a particular entity. the cognitive component implies thoughts and beliefs adopted towards the subject, while the behavioral component is the action that results from a social attitude (bernann, 2015, p. 13). lapierre in azwar (2015, p. 5) contends that social attitude is an anticipatory pattern of behavior, a tendency or readiness, a predisposition to conform in social situations; put simply, social attitude is a response to conditioned social stimuli. in other words, social attitude is a pattern of behavior regarding conditioned social situations. ahmadi (2002, p. 163) writes that social attitude is the consciousness of an individual that determines his or her real, repeated actions toward a social object. thus, social attitude represents a person's response to social objects. in line with this idea, gerungan (2004, p. 161) proposes that social attitude is the same and repeated ways of responding to social objects. it leads to repeated ways of behaving toward a social object. as stated by soekanto (supardan, 2011), social objects relate to interpersonal behavior or social processes. they involve relationships between people or groups in social situations. social attitude is a tendency to evaluate social things in a certain way. it plays an important role in children's development, because it shapes children's perceptions of the social environment and has a significant effect on behavior (crano & prislin, 2011, p. 19). children who start interacting with the social environment will begin to have social attitude, and this also occurs in primary school-aged children.
considering the various understandings above, the writer concludes that social attitude is the awareness of a person in acting repetitively in real life to determine the response to social objects in his or her relation with others. social attitude encourages a person to do things in a certain way as a form of his or her reaction to social objects. the evidence of children’s behaviors these days is quite concerning. primary school students are now generally less disciplined than they used to be, and they show little care and responsibility. this is not in accordance with the ideal affective development of primary students. ekowarni (2009) contends that there are some values related to social condition that should be instilled in primary school students, including: politeness, caring, cooperativeness, discipline, humility, even-temperedness, tolerance, independence, honesty, confidence, toughness, positivity, fairness, peacefulness, perseverance, creativity, citizenship, responsibility, and sincerity. in today’s education practice, where social attitude actually becomes the core of education, the assessment of social attitude has not yet been conducted. this is due to the teachers’ limitations, especially in the assessment process. teachers are more likely to spend their time on teaching regardless of the importance of making appropriate assessment. stiggins’s study shows that teachers should spend a third to a half of their available time on assessment activities (stiggins, 1999, p. 3). they are constantly making decisions about how to interact with their students, and these decisions are based in part on information that they collect about their students through classroom assessment. in fact, they do not spend much time on assessment. the results of a study conducted by zuchdi, prasetyo, and masruri (2012, p. 68) show that the practice of assessing learning outcomes, especially in elementary schools, has so far focused mainly on cognitive assessment.
students are appreciated on the basis of their rank and examination scores. although all educators know that education covers the cognitive, affective, and psychomotor (behavioral) aspects, in practice, the affective and psychomotor aspects are not given adequate attention, especially in assessing students (khilmiyah, sumarno, & zuchdi, 2015). teachers are not accustomed to assessing changes in the social attitude (affective sphere) of primary school students. this happens not because educators are unwilling, but because they lack the ability to describe achievement indicators in the affective field. as a consequence, the assessment does not reflect the students' overall abilities. it is clear that the assessment of social attitude cannot be done in the same way as that of the cognitive domain (such as by giving questions). assessment of social attitude is directed more to recording physical activities related to social interaction, not merely the ability to answer a number of questions. in the primary school education system, which applies a thematic approach, the social attitude aspect that is part of the affective domain must be assessed. this refers to the content standards in elementary schools, which contain competence in social attitude reflected by students showing honesty, discipline, responsibility, politeness, care, and confidence in interacting with family, friends, teachers, and neighbors, and showing love for their own nation. the existing assessment system is simple and lacks sufficient indicators. teachers have put more focus on the assessment of the cognitive aspect, which has a clearer construct and criteria, while the affective aspect has a more complicated construct and teachers have insufficient competence in designing the assessment instruments.
another obstacle is the fact that designing learning objectives in terms of affective aspects is more difficult than designing those for the cognitive and psychomotor aspects (mardapi, 2012). in other words, the affective domain is difficult to define and assess because it is latent. based on the data collected by the researchers on the assessment currently employed to assess social attitudes, the models include observation methods (syamsudin, 2015, p. 109; waryadi, 2013, pp. 1–5), self-assessment of social attitude at the end of learning, and assessment developed by the teacher by referring to the technical guidance. each of these three assessments relies on only one method and tends to assess only the apparent aspect of the student from one point of view (teacher or student). these assessments also do not cover all of the aspects listed in the core competencies of social attitude that the curriculum suggests. in addition, assessment which uses only one method will produce inaccurate conclusions on the social attitudes assessed. assessment of social attitude is often done at the end of an instruction, regardless of the process. this is done by the teacher as a routine and an attempt to fulfill the obligation. this kind of assessment captures only the social attitude visible at the end of learning. this results in insufficient information, because the results are viewed from only one section of the lesson. assessment should be done during the teaching-learning process, from the start to the end, based on real or authentic conditions. in addition, an assessment that integrates three assessment methods has not been conducted. thus, this research is very important, because by integrating self-assessment, peer assessment, and observational assessment, the results will be more adequate.
method

this research is explorative descriptive research that describes the social attitude of elementary school students using three forms of instruments: self-assessment (sa), peer assessment (pa), and observational assessment (oa). instrument validity was examined using confirmatory factor analysis (cfa), based on the estimated factor loading of each item. the item factor loadings range from 0.31 to 0.99 (> 0.30), which means that the items in the social attitude instruments (sa, pa, and oa) are valid. the validity criterion of a factor loading of at least 0.30 refers to azwar (2015, p. 143). cronbach's alpha was used to estimate the reliability of the instruments, giving reliability values between 0.788 and 0.886 (> 0.70). this requirement refers to nunally (1981), sunyoto (2012), and mardapi (2017), who state that an instrument is said to be reliable when its alpha reliability coefficient across items is 0.70 or more. the population in this research was the students of elementary schools in yogyakarta which have been implementing curriculum 2013 for two years. a sample of 58 students of kaliagung elementary school in sentolo, kulonprogo regency and pakel elementary school was established using the cluster random sampling technique. the two schools were chosen because they have been implementing curriculum 2013 based on thematic learning and conducting affective assessment. the data were collected using questionnaires for sa and pa, and observation sheets for oa. the questionnaire and observation data were complementary and integrated. the data obtained were analyzed to describe the students’ achievement in social attitude.
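the reliability and validity criteria described above can be made concrete with a small python sketch. this is an illustration only, not the authors' analysis: the function names and the response data are hypothetical, the standard cronbach's alpha formula is assumed to be the one used, and the 0.70 and 0.30 cut-offs are simply the thresholds quoted in the text.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def valid_items(loadings, minimum=0.30):
    """Indices of items whose factor loading is at least the minimum."""
    return [i for i, loading in enumerate(loadings) if loading >= minimum]


# Hypothetical responses: 5 students x 3 items (not data from the study).
responses = [[3, 4, 3], [2, 2, 3], [4, 4, 4], [1, 2, 2], [3, 3, 4]]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}, reliable = {alpha >= 0.70}")

# Hypothetical loadings; the third falls below the 0.30 criterion.
print(valid_items([0.31, 0.99, 0.25]))
```

the same variance function is used for the items and for the total scores, so the ratio inside alpha does not depend on whether sample or population variance is chosen.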
the achievement in social attitude was divided into two parts: (1) achievement based on the honesty, discipline, responsibility, politeness, care, and confidence components, and (2) the achievement of social attitude as the combination of all social attitude components, referring to the results of the social attitude of elementary students. there is also a categorization of social attitude as a whole, obtained by combining all three forms of assessment used in this research. the data analysis was done through categorization of assessment results using score, average, and standard deviation. the data were derived from the overall scores obtained by the respondents. the data obtained were analyzed using the categorization suggested by mardapi (2012), as stated in table 1. this categorization was used to assess the social attitude in detail based on the honesty, discipline, responsibility, politeness, care, and confidence components. this categorization also helps the teacher in monitoring the students’ ability to absorb thematic learning outcomes, especially in the affective aspect. the assessment of each component was then followed by the assessment of the students’ social attitude, which was the integration of all components. to understand and interpret the assessment results of the social attitude using the three models in this research, the researcher wrote a description of the social attitude components performed by the students. the description helped the teacher to reveal the achievement of social attitude, as stated in table 2.

table 1. categorization of components of students’ social attitude

no. | student’s score | categorization of students’ social attitude
1. | x ≥ x̄ + 1·sbx | entrust (a)
2. | x̄ + 1·sbx > x ≥ x̄ | developing (b)
3. | x̄ > x ≥ x̄ − 1·sbx | seen (c)
4. | x < x̄ − 1·sbx | not yet seen (d)

notes:
x̄: the average score of all students in a class
sbx: standard deviation of the overall score of students in one class
x: score achieved by students

table 2. description of students’ social attitude achievement

no | assessed aspect | achievement | description
1. | social attitude components (honesty, discipline, responsibility, politeness, care, and confidence) | entrust | students consistently show social attitude (honesty, discipline, responsibility, politeness, care and confidence*) in daily life and interaction at school.
2. | | developing | students often show social attitude (honesty, discipline, responsibility, politeness, care and confidence*) in daily life and interaction at school.
3. | | seen | students start to show social attitude (honesty, discipline, responsibility, politeness, care and confidence*) in daily life and interaction at school.
4. | | not yet seen | students have not yet shown social attitude (honesty, discipline, responsibility, politeness, care and confidence*) in daily life and interaction at school.

*choose one based on the component being assessed.

table 3. categorization of students’ social attitude

no. | student’s score | category of social attitude achievement
1. | x ≥ x̄ + 1·sbx | sb (sangat baik/ very good)
2. | x̄ + 1·sbx > x ≥ x̄ | b (baik/ good)
3. | x̄ > x ≥ x̄ − 1·sbx | cb (cukup baik/ fair)
4. | x < x̄ − 1·sbx | kb (kurang baik/ poor)

notes:
x̄: the average score of all students in a class
sbx: standard deviation of the overall score of students in one class
x: score achieved by students

table 4. description of the achievement of students’ social attitude

no. | assessed aspect | achievement | description
1. | social attitude | sb (very good) | students are always honest during the learning process and social interaction, disciplined in daily life, and show responsibility for the tasks and duties. the students are also polite to the teachers and friends, show care to others and environment, and also show high confidence in the class. all of those aspects are entrusted.
2. | | b (good) | students are often honest during the learning process and social interaction, disciplined in daily life, and show responsibility for the tasks and duties. the students often show polite behavior to the teachers and friends, show care to others and environment, and also show high confidence in the class. all of those aspects are developed.
3. | | cb (fair) | students sometimes show honesty during the learning process and social interaction, discipline in daily life, and responsibility for the tasks and duties. the students are sometimes polite to the teachers and friends, show care to others and environment, and also show high confidence in the class. all of those aspects start to emerge and be seen.
4. | | kb (poor) | students have not shown honesty during the learning process and social interaction, have not been disciplined in daily life, and have not shown responsibility for the tasks and duties. the students are less polite to the teachers and friends. they also have not given care to others and environment or performed high confidence in the class. all of those aspects are not yet seen or observed.

students’ social attitude (honesty, discipline, responsibility, politeness, care, and confidence) was derived from the categorization presented in table 3. to clarify the meaning of the results of the social attitude assessment, table 4 presents the description of each achievement. the next step was a test of the effectiveness of the assessment. the effectiveness is based on the criteria suggested by four experts in psychometrics, assessment, thematic learning in primary education, and psychological counseling. the consultation also involved three primary teachers. the data obtained were categorized as presented in table 5 (mardapi, 2012).

table 5. categorization of the instruments effectiveness

no. | respondent’s score | effectiveness categorization
1. | x ≥ x̄ + 1·sbx | very effective
2. | x̄ + 1·sbx > x ≥ x̄ | effective
3. | x̄ > x ≥ x̄ − 1·sbx | fairly effective
4. | x < x̄ − 1·sbx | less effective

notes:
x̄: the average score of respondents
sbx: standard deviation of the overall score of respondents
x: score achieved by the respondents

findings and discussion

the assessments were conducted in two qualified primary schools, pakel elementary school and kaliagung elementary school, involving 58 students. the data obtained were analyzed using the descriptive method and categorization. the assessment of these values was done using the sa, pa, and oa instrument models. the results were analyzed to describe the assessment.

figure 1. social attitude viewed from six components

the results of the assessment were analyzed in two phases. the first phase presents each component. the honesty component or value of the primary school students is presented in table 6.

table 6. social attitude value: honesty

no. | value | number of student | percentage
1. | a (entrust) | 23 | 39.66%
2. | b (developing) | 16 | 27.59%
3. | c (seen) | 12 | 20.68%
4. | d (not yet seen) | 7 | 12.07%
total | | 58 | 100%

figure 2. histogram of results of the students’ honesty assessment

table 6 and figure 2 show that the value of honesty in thematic learning for the sample of 58 students is as follows: there are 23 students (39.66%) who are in category a or entrust, 16 students (27.59%) who are in category b, where honesty is developing, 12 students (20.68%) in category c, where honesty starts to be observed, and seven students (12.07%) in category d, which means that their honesty has not been shown. the next value is discipline. the detailed results can be seen in table 7.

table 7. social attitude value: discipline

no. | value | number of student | percentage
1 | a (entrust) | 35 | 60.34%
2 | b (developing) | 19 | 32.76%
3 | c (seen) | 4 | 6.90%
4 | d (not yet seen) | 0 | 0%
total | | 58 | 100%

figure 3. histogram of results of the student’s discipline assessment

table 7 and figure 3 indicate that the sample students' discipline in thematic learning is categorized as follows: there are 35 students (60.34%) who are in category a or entrust, 19 students (32.76%) who are in category b or developing, four students (6.90%) who are in category c, and no student in category d.

table 8. social attitude value: responsibility

no | value | number of student | percentage
1 | a (entrust) | 14 | 24.14%
2 | b (developing) | 32 | 55.17%
3 | c (seen) | 10 | 17.25%
4 | d (not yet seen) | 2 | 3.44%
total | | 58 | 100%

figure 4. histogram of results of the students’ responsibility assessment

table 8 and figure 4 indicate that, among the sample students, there are 32 students (55.17%) in category b (responsibility is developing), 14 students (24.14%) in category a, which means that responsibility is entrusted, 10 students (17.25%) in category c, where responsibility starts to emerge, and two students (3.44%) in category d.

table 9. social attitude value: politeness

no | value | number of student | percentage
1 | a (entrust) | 16 | 27.58%
2 | b (developing) | 30 | 51.73%
3 | c (seen) | 10 | 17.24%
4 | d (not yet seen) | 2 | 3.45%
total | | 58 | 100%

figure 5. histogram of the results of students’ politeness assessment

table 9 and figure 5 indicate that, among the sample students involved, there are 30 students (51.73%) who are in category b or developing, and 16 students (27.58%) who are in category a, which means that politeness is already instilled.
in addition, there are 10 students (17.24%) who are in category c, and two students (3.45%) who are in category d, which means that these students have not yet shown polite behavior in thematic learning.

table 10. social attitude value: care

no | value | number of student | percentage
1 | a (entrust) | 17 | 29.31%
2 | b (developing) | 32 | 55.17%
3 | c (seen) | 8 | 13.79%
4 | d (not yet seen) | 1 | 1.73%
total | | 58 | 100%

figure 6. histogram of the results of students’ care assessment

table 10 and figure 6 show that the results of the assessment of students’ care are as follows: 32 students (55.17%) are in category b, 17 students (29.31%) are in category a, eight students (13.79%) are in category c, and one student (1.73%) is in category d. in addition, table 11 and figure 7 indicate that, among the sample students involved, the results of the confidence assessment are as follows: 46 students (79.31%) are in category a or instilled, nine students (15.53%) are in category b or developing, one student (1.72%) is in category c, and two students (3.44%) are in category d or not showing self-confidence.

table 11. social attitude value: confidence

no | value | number of student | percentage
1 | a (entrust) | 46 | 79.31%
2 | b (developing) | 9 | 15.53%
3 | c (seen) | 1 | 1.72%
4 | d (not yet seen) | 2 | 3.44%
total | | 58 | 100%

figure 7. histogram of the results of the students’ confidence assessment

the second phase of analysis in this research dealt with the description of the assessment results of students’ social attitude in thematic learning. the results are the integration of the three assessment models employed in this research (sa, pa and oa). the results are presented in table 12.

table 12. description of the students’ social attitude assessment

no | value | number of student | percentage
1 | sb (very good) | 11 | 18.96%
2 | b (good) | 38 | 65.52%
3 | cb (fair) | 9 | 15.52%
4 | kb (poor) | 0 | 0%
total | | 58 | 100%

figure 8. histogram of the results of students’ social attitude assessment

from table 12 and figure 8, it can be seen that the students’ social attitude in thematic learning is as follows. eleven students (18.96%) are categorized as sb or very good. there are 38 students (65.52%) included in category b or good. there are nine students (15.52%) considered as cb or fair in terms of their social attitude. there is no student in category kb or poor. an example of the sb (very good) category is when the students are able to show honesty during the teaching-learning process and social interaction, are disciplined in daily activities at school, show responsibility for their tasks and duties, show polite behavior to their teachers and peers, care about others and the environment, and show confidence in class. all those aspects have already been entrusted and instilled in students’ daily life. as previously mentioned, the results of this research are divided into two parts. the first result is the assessment based on the social attitude components, covering honesty, discipline, responsibility, politeness, care, and confidence. the second result deals with the social attitude value along with the description which can be used to fill out the report of the learning outcome. based on the component assessment results, it can generally be said that confidence is in category a or entrust (46 out of 58 students, or 79.31%). in addition, 35 students show discipline as described in category a, while honesty at the category a level is shown by 23 students and is considered as being instilled. there are 32 students showing responsibility, 32 students showing care, and 30 students showing politeness at the category b level. these three values are in category b (developing). another interesting result is that there are seven students (12.07%) who are in category d for honesty. they have not shown honesty in their daily life and social interaction at school.
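the category counts and percentages reported in the tables follow from applying the mean and standard deviation thresholds of table 1 (and, with different labels, table 3) to each student's score. a minimal python sketch of that rule is given here as an illustration; the function names and the scores are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form of standard deviation was used.

```python
from collections import Counter
from statistics import mean, stdev


def categorize(score, avg, sd):
    """Map one score to a category using the thresholds of table 1."""
    if score >= avg + sd:
        return "a"  # entrust
    if score >= avg:
        return "b"  # developing
    if score >= avg - sd:
        return "c"  # seen
    return "d"      # not yet seen


def frequency_table(scores):
    """Count and percentage of students in each category for one class."""
    avg, sd = mean(scores), stdev(scores)  # sample SD: an assumption
    counts = Counter(categorize(s, avg, sd) for s in scores)
    n = len(scores)
    return {
        label: (counts[label], round(100 * counts[label] / n, 2))
        for label in "abcd"
    }


# Hypothetical class scores (not data from the study).
scores = [10, 20, 30, 40, 50]
for label, (count, percent) in frequency_table(scores).items():
    print(label, count, f"{percent}%")
```

because the thresholds depend on the class mean and standard deviation, the same raw score can fall into different categories in different classes.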
the dishonesty is shown when they copy other students’ work. this is in line with the idea of koellhoffer (2009, p. 27) that honesty deals with avoiding plagiarism, including taking others’ ideas or answers without permission during the learning process, tests, etc. the results also show that the social attitude assessment integrates the components that develop the attitude, namely honesty, discipline, responsibility, politeness, care, and confidence. from the sample of 58 students, 11 (18.96%) are included in sb, or, in other words, their social attitude is very good. in addition, 38 students (65.52%) are considered to be good. the social attitude is the result of responses to the social stimuli contained in thematic learning. this is supported by lapierre in azwar (2015, p. 5), who proposes that social attitude is a pattern of behavior, an anticipative tendency or readiness, a predisposition to adapt to social situations, or, put simply, a response to conditioned social stimuli. from the assessment results of the students’ social attitude, it can also be inferred that their social attitude turns out to be varied. there are 11 students (18.96%) in the sb (very good) category and 38 students (65.52%) in the b (good) category. from that result, sb (very good) category has deep meaning. the results can also be used in the report of the learning outcomes of core competence in the social attitude aspect, or kompetensi inti (ki)–2 (core-competence 2), and become the evaluation material for thematic learning. the assessment results obtained are also used by teachers to fill out the report of the learning outcomes in the middle and at the end of the semester. this research also examined the effectiveness of the assessment conducted.
there are 79% of the teachers who claim that the assessment involving three different models in this research is effective. this indicates that more varied and integrated methods can result in more accurate assessment results. this shows that this instrument is useful in helping teachers to assess social attitude as an affective component of integrated thematic learning outcomes in primary school. conclusion and suggestion conclusion the results of this research are divided into two parts. the first result is the assessment based on the components of social attitude covering honesty, discipline, responsebility, politeness, care, and confidence. the second result deals with the social attitude value along with the description which can be used to fill out the report of the learning outcome. for teachers, this assessment can be used to fill in the report of students’ learning outcomes in the affective domain or ki 2 (core-competence 2). for parents and students, the assessment results are helpful in finding out the description of social attitude that has been achieved by students. this description can be used as an introspection and improvement of students' social attitude. suggestion the comprehensive results of this research may become a guidance for the teachers to assess students’ social attitude. the existing assessment can also become an evaluation towards the learning practice. the future research should reveal other components of social attitude as the results of learning process. references ahmadi, h. a. (2002). psikologi sosial. jakarta: rineka cipta. azwar, s. (2009). penyusunan skala psikologi. yogyakarta: pustaka pelajar. azwar, s. (2015). skala pengukuran sikap manusia. yogyakarta: pustaka pelajar. bernann, s. l. (2015). pengetahuan, sikap, dan perilaku manusia. yogyakarta: parama. crano, w. d., & prislin, r. (2011). attitudes and attitude change. new york, ny: psychology press. ekowarni. (2009). pedoman pendidikan akhlak mulia siswa sekolah dasar. 
jakarta: departemen pendidikan nasional, direktorat pendidikan dasar dan menengah. gerungan, w. a. (2004). psikologi sosial. bandung: refika aditama. khilmiyah, a., sumarno, s., & zuchdi, d. (2015). pengembangan model penilaian keterampilan intrapribadi dan antarpribadi dalam pendidikan karakter di sekolah dasar. jurnal penelitian dan evaluasi pendidikan, 19(1), 1–12. https://doi.org/10.21831/pep.v19i1.45 50 koellhoffer, t. t. (2009). character education being fair and honest. new york: infobase publishing. reid (research and evaluation in education), 4(1), 2018 issn 2460-6995 assessment of the social attitude… 21 ari setiawan & siti partini suardiman krathwohl, d. r., bloom, b. s., & masia, b. b. (1973). taxonomy of educational objectives book 2/affective domain. new york, ny: longmans, green. kurniawan, d. (2014). pembelajaran terpadu tematik (teori, praktik, dan penilaian). bandung: alfabeta. mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha medika. mardapi, d. (2017). pengukuran penilaian dan evaluasi pendidikan (2nd ed.). yogyakarta: parama. mccoach, d. b., gable, r. k., & madura, j. p. (2013). instrument development in the affective domain: school and corporate applications. new york, ny: springer. nunally, j. c. (1981). psychometric theory (3rd ed.). new york, ny: mcgraw-hill. retnawati, h. (2016). proving content validity of self-regulated learning scale (the comparison of aiken index and expanded gregory index). reid (research and evaluation in education), 2(2), 155–164. https://doi.org/ 10.21831/reid.v2i2.11029 stiggins, r. j. (1999). assessment, student confidence, and school success. the phi delta kappan, 81(3), 191–198. sunyoto, d. (2012). validitas dan reliabilitas. yogyakarta: nuha medika. supardan, d. (2011). pengantar ilmu sosial: sebuah kajian pendekatan struktural. jakarta: bumi aksara. syamsudin, a. (2015). model penilaian afektif siswa sekolah dasar. 
doctoral dissertation, universitas negeri yogyakarta, yogyakarta. waryadi. (2013). menyiasati pelaksanaan penilaian sikap dalam implementasi kurikulum 2013. jakarta: balitbang kemenag. zuchdi, d., prasetyo, z. k., & masruri, m. s. (2012). model pendidikan karakter terintegrasi dalam pembelajaran dan pengembangan kultur sekolah. yogyakarta: uny press. this is an open access article under the cc-by-sa license. reid (research and evaluation in education), 5(1), 2019, 1-9 available online at: http://journal.uny.ac.id/index.php/reid developing psychomotor evaluation instrument of biochemistry practicum for university students of biology education *1etika dyah puspitasari; 2mohamad joko susilo; 3novi febrianti 1,2,3department of biology education, universitas ahmad dahlan jl. prof. dr. soepomo, s.h., janturan, yogyakarta 55164, indonesia *corresponding author. e-mail: etika.puspitasari@pbio.uad.ac.id submitted: 04 december 2018 | revised: 06 december 2018 | accepted: 06 december 2018 abstract practicum is one of the important aspects of the learning of biology. there is no psychomotor evaluation instrument that is valid and reliable. this study is aimed at developing a valid and reliable psychomotor evaluation instrument for biochemistry practicum. the study is developmental research using the 4-d model of ‘define, design, develop, and disseminate’. instrument validation was carried out through construct validation. the findings show that the developed instrument is characterized by a high level of construct validity although the reliability measure is not very well-estimated. the instrument is constructed of four factors of perception, set, guided response, and mechanism developed into 80 statement items. keywords: instrument development, psychomotor domain, practicum, biochemistry, biology education permalink/doi: https://doi.org/10.21831/reid.v5i1.22126 introduction science learning (particularly biology) involves practicum. 
a practicum, in biology learning, is an activity of exploration as well as experimentation in the laboratory or in the open field that gives students direct experience. since science covers three aspects, namely product, process, and scientific attitudes (tursinawati, 2016), it is often stated that practicum is an inseparable part of science. one of the strengths of practicum as a learning method is that it gives students the chance to test, find, and elucidate theories (suryaningsih, 2017); develop the basic skills of experimentation; foster enthusiasm for knowledge; improve problem-solving skills; and gain access to facilities for scientific investigation. practicum activities can also improve the students’ scientific process skills and concept mastery (lestari & diana, 2018; suardana, liliasari, & ismunandar, 2013). unfortunately, evaluation of practicum activities in the laboratory has so far emphasized the cognitive aspects, while the evaluation of psychomotor skills receives little attention (hamid et al., 2012). this can be seen from the large share of cognitive evaluation in the form of the pre-test, post-test, and final assignment, which are usually written. meanwhile, according to osman, hiong, and vebrianto (2013), in order that students acquire the skills needed for the 21st century, biology learning must involve a lot of inquiry skills. inquiry skills include (1) formulating problems, (2) proposing a hypothesis, (3) designing experimentation to test hypotheses, (4) testing data analyses and making conclusions, and (5) writing a report. in addition, students are also expected to be able to operate experiment tools in the laboratory.
maknun, surtikanti, and subahar (2012) map the essential laboratory skills into 14 as follows: (1) observing, (2) calculating, (3) measuring, (4) classifying, (5) finding space/time relation, (6) formulating hypotheses, (7) designing an experiment, (8) controlling variables, (9) interpreting data, (10) making inferences, (11) predicting, (12) concluding, (13) applying, and (14) communicating. however, in their study, they find that the mastery of the essential laboratory skills of biology teacher-candidate students in the ecology practicum is still low. likewise, kasilingam, ramalingam, and chinnavan (2014) divide psychomotor skills into seven levels, namely: (1) perception, (2) set, (3) guided response, (4) mechanism, (5) complex overt response, (6) adaptation, and (7) origination. verbs that are related to the perception level include selecting, choosing, isolating, and identifying. verbs that represent the set level include showing, starting, explaining, etc. in the guided response level, verbs that can be used include imitating, following steps, making, etc. in the mechanism level, verbs that are relevant include calibrating, measuring, mixing, organizing, heating, manipulating, etc. maknun, surtikanti, munandar, and subahar (2012) categorize the psychomotor skills of the practicum class in the ecology subject matter as setting up the tools in line with the practicum plan, calibrating and maintaining the laboratory tools, operating pipettes, operating microscopes, taking notes, and working safely in accordance with occupational health and safety. the results of their study show that the psychomotor skills of teacher-candidate students in biology practicum are still low.
the conduct of the biochemistry practicum in the study program of biology education, universitas ahmad dahlan (uad) has included cognitive, psychomotor, and affective aspects; however, evaluation of the cognitive aspects dominates the process (80%) while the psychomotor and affective aspects take the rest (20%). besides, no standard and valid instrument has been developed for the evaluation of the psychomotor aspects of learning in the biochemistry practicum. as a result, evaluation of the practicum is highly subjective. to date, many studies have been conducted on the development of evaluation instruments. one example is the study done by ridlo (2012), but this study focuses on the knowledge aspects of biology practicum. instrument development in the psychomotor domain in biology practicum has been done, among others, by yunita, agung, and nuraeni (2016) with good validation in the aspects of material, construction, and language. yulianti, andriani, and taufiq (2014) have developed a psychomotor evaluation instrument for the temperature and heat topic. another study was done by hazarianti, masriani, and hadi (2016) on a psychomotor evaluation rubric in the practicum of the distribution coefficient sub-material. this rubric, however, is used in classes other than biology. psychomotor evaluation instruments have so far been developed mostly for high school students; very little has been done for university students. besides, learning evaluation has so far emphasized cognitive skills, even for practicum classes which actually need psychomotor skills. it is therefore important that evaluation instruments be developed in biology education, especially for the biochemistry topic. this is due to the fact that biochemistry is one of the basic materials in biology along with physiology, genetics, microbiology, and others.
method
this study is research and development using the four-d model proposed by thiagarajan, semmel, and semmel (1974). the four developmental phases in the model are as follows: (a) define, (b) design, (c) develop, and (d) disseminate (haviz, 2013). this study is confined to the 'develop' phase, leaving out the 'disseminate' phase (febriana, rachmadiarti, & faizah, 2016; noverina, taufiq, & wiyono, 2014; yunita et al., 2016). the 'define' phase includes four steps, namely: (a) initial analysis, (b) curriculum review, (c) content review, and (d) learner analysis. the 'design' phase includes four steps, namely: (a) selection of assessment scales, (b) development of the instrument draft, (c) instrument validation, and (d) a limited readability test with practicum assistants. the 'develop' phase consists of two steps, namely: (a) product evaluation by experts and (b) small-group and large-group try-outs.
the study used two questionnaires as the instruments for data collection. the first questionnaire, using the likert scale, consisted of statements concerning the instrument feasibility to be given to material experts and evaluation experts. the second questionnaire, using the guttman (yes/no) scale, was given to practicum assistants to test readability. all instruments were first validated by the material and evaluation experts. data were analyzed by a combination of descriptive and qualitative techniques. the likert scale was scored from 4 to 1, corresponding to very good, good, poor, and very poor. the guttman scale had 2 ratings with a maximum score of 15. for item validity, exploratory factor analysis (efa) was used with the four indicators of kaiser-meyer-olkin measure of sampling adequacy (kmo-msa), bartlett's test of sphericity, anti-image correlation, and factor loading.
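the two sampling-adequacy indicators named above can be illustrated with a small calculation. the sketch below is not the authors' analysis (the software they used is not stated); it assumes a toy 3 x 3 correlation matrix and a hypothetical sample size, and all function names are illustrative. it uses the textbook definitions: kmo compares squared correlations with squared partial correlations taken from the inverse correlation matrix, and bartlett's statistic is -(n - 1 - (2p + 5)/6) ln|r| with p(p - 1)/2 degrees of freedom.

```python
import math


def invert_with_det(matrix):
    """Gauss-Jordan inversion with partial pivoting; returns (inverse, determinant)."""
    n = len(matrix)
    # Augment every row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det  # a row swap flips the sign of the determinant
        pivot_value = aug[col][col]
        det *= pivot_value
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv, _ = invert_with_det(corr)
    n = len(corr)
    sum_r2 = sum_p2 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Partial correlation of items i and j given all other items.
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_p2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_p2)


def bartlett_chi_square(corr, n_obs):
    """Bartlett's test of sphericity: returns (chi-square statistic, degrees of freedom)."""
    p = len(corr)
    _, det = invert_with_det(corr)
    statistic = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(det)
    return statistic, p * (p - 1) // 2


# Toy data (assumption): three items that all correlate 0.5 with each other.
toy_corr = [[1.0, 0.5, 0.5],
            [0.5, 1.0, 0.5],
            [0.5, 0.5, 1.0]]
toy_kmo = kmo(toy_corr)
toy_chi2, toy_df = bartlett_chi_square(toy_corr, n_obs=45)  # n_obs is hypothetical
print(round(toy_kmo, 3), round(toy_chi2, 2), toy_df)
```

the p-value for bartlett's test would come from the chi-square distribution with the returned degrees of freedom; it is left out here to keep the sketch free of external libraries.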
as general criteria, if the significance level of bartlett's test of sphericity is p<0.05, the kmo-msa value is >0.5, and the anti-image correlation is >0.5, the sample data are feasible for analysis. the quantitative data from the experts and assistants were analyzed for feasibility by categorizing them into four interpretation criteria using the formula proposed by mardapi (2008). the research product is regarded as feasible if the results of the analyses are at least in the 'good' category. the criteria include content material, construction, language, objectivity, and utility.
findings and discussion
findings
the study is research and development in three phases, namely: (a) define, (b) design, and (c) develop. in the 'define' phase, analyses are conducted of the initial situation, curriculum content, subject material, and learner characteristics. analyses of the initial situation are done by carrying out discussions with biochemistry practicum coordinators and assistants concerning the running and evaluation of the biochemistry practicum. from this activity, it is found that practicum evaluation is still dominated by the cognitive domain, which approaches 80% of the whole process. psychomotor skills aspects take only about 10%. the curriculum analyses are done on the practicum lesson plans, learning outcomes, and practicum guidebooks. concerning the learning outcomes, among others, students are able to practice making ph solutions of various concentrations, making buffer solutions, and measuring ph solutions. these abilities in making and measuring ph solutions will become the bases for doing other practicum activities. analyses of the content material are directed to look at the basic materials that are given before the practicum class. the content material for the ph practicum is an advanced topic. the topic of ph making and measuring is the fifth item in the whole syllabus of the biochemistry practicum.
the preceding classes contain practicum activities concerning the accuracy and correctness of experiments. in these preceding practicums, students practice diluting, measuring, and using the right tools. it is expected that in the fifth practicum, students are already familiar with the initial and basic steps of experimentation. learner analyses are directed to look at the characteristics of the students who take the biochemistry practicum in semester 2. the biochemistry practicum is the first practicum the students have in their program. there is no practicum in semester 1. the practicum uses four of the six levels of the psychomotor domain (hamid et al., 2012), namely: level 1 (perceiving), level 2 (being ready for active participation), level 3 (integrative responding), and level 4 (showing work performances to become habitual). in the complete scheme, level 5 is complex overt responding and level 6 is adapting. these are not yet included in the items for the learning evaluation.
in the ‘design’ phase, the following steps were carried out: selecting evaluation scales, developing the instrument draft, validating, and readability testing. selection of the evaluation scales is done by reviewing the instrument draft design. initially, the evaluation scales were of the check-list type with yes/no responses. according to ibezim and igwe (2016), the check-list instrument is more objective in measuring psychomotor skills than rating scales. however, following the experts’ and assistants’ recommendation, likert-type rating scales are used for the developed instrument. it is expected that, by using the likert scales, differences in the students’ performances can be more clearly detected. three scale points are used: 1 for inadequate, 2 for good, and 3 for very good.
the product draft consists of an instrument for readability and instrument for expert validation. the draft is formatted in the following aspects: (1) title of the experiment, (2) objectives to be achieved, (3) psychomotor evaluation aspects, (4) levels of the psychomotor skills, (5) indicators for the psychomotor skills, (6) descriptors representing the indicators, (7) evaluation scales, (8) evaluation rubrics, and (9) scoring guides. the experiment title is related to the experiment of making and measuring ph. the learning objective to be achieved is for students to be able to make solutions with various concentrations, making solution buffers, and measuring ph solutions. the aspects that will be observed in the activities consist of preparation for the practicum, running of the practicum, and reporting. the product instrument was evaluated by validators before it was subjected to the try-outs. this evaluation consists of readability checks by practicum assistants and evaluation instrument by evaluation and subject matter experts. the instrument validity was obtained from the wider-scale try-out using exploratory factor analysis (efa). the results of the efa analyses show that the kaiser meyer olkin measure of sampling adequacy (kmo) is 0.787 which means that the factor analysis can be continued. looking at the number of factors that have an eigenvalue of more than 1, four levels of the psychomotor domain can be obtained: level 1 for perception, level 2 for set (readiness for active participation), level 3 for guided response (integrative responses), and level 4 for mechanism (showing performance as a habit). 
the indicators for the psychomotor skills in the developed instrument cover the following details: being able to set the tools and materials for the experiment, writing up the steps of the work, making hcl solution using various concentrations, measuring the ph of the hcl solution, making 2% naoh solution, making 100 ml of 0.2m ch3coona solution, making 1% gelatin solution, making 0.2m nah2po4.h2o solution, making 0.2m ph 5 acetate buffer solution, writing out practicum objectives, writing out observation results, comparing observation results with the theory, writing out the discussion of results, making conclusions, and writing up the practicum report. these are presented in table 1. each indicator is operationalized into descriptors. there are 80 descriptor statements. the three-point likert criteria are 1 for inadequate, 2 for good, and 3 for very good.

table 1. psychomotor levels and indicators of the instrument
psychomotor level: indicators
perception: (1) selecting tools and materials; (2) formulating practicum objectives
set: (1) writing up sequence of works; (2) writing out results of observation
guided responses: (1) making hcl solution using various concentrations; (2) making 2% naoh solution; (3) making 100 ml of 0.2m ch3coona solution; (4) making 1% gelatin solution; (5) making 0.2m nah2po4.h2o solution; (6) making 0.2m ph 5 acetate buffer solution; (7) comparing observation results with the theory; (8) writing out the discussion of results; (9) making conclusions; (10) writing up the practicum report
mechanism: (1) calculating the solution volume; (2) weighing materials; (3) determining height or volume of solution using practicum tools (pipette, bulb, glass, etc.); (4) measuring ph solution

the total score made by the students is the sum of all the scores obtained for each indicator. the maximum score is 240 and the minimum score is 80.
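the score range above follows directly from 80 descriptors rated on a 1 to 3 scale. the following is a minimal sketch of that arithmetic and of rescaling a raw score to 100; the function and constant names are illustrative and not part of the instrument.

```python
N_ITEMS = 80                  # descriptor statements in the instrument (from the text)
SCALE_MIN, SCALE_MAX = 1, 3   # 1 = inadequate, 2 = good, 3 = very good


def score_range(n_items=N_ITEMS, low=SCALE_MIN, high=SCALE_MAX):
    """Return the (minimum, maximum) possible raw total score."""
    return n_items * low, n_items * high


def to_hundred_scale(raw_score, n_items=N_ITEMS, low=SCALE_MIN, high=SCALE_MAX):
    """Rescale a raw total as raw / maximum * 100."""
    min_score, max_score = score_range(n_items, low, high)
    if not min_score <= raw_score <= max_score:
        raise ValueError(f"raw score must lie between {min_score} and {max_score}")
    return raw_score / max_score * 100


print(score_range())                       # (80, 240)
print(round(to_hundred_scale(128), 2))     # 53.33
print(round(to_hundred_scale(148.8), 2))   # 62.0
```

the two converted values correspond to the raw averages of 128 and 148.8 reported for the try-outs. note that, with this formula, a student rated 1 on every descriptor still receives 33.33 rather than 0.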
students’ score can be obtained by the following formula:

\[ \text{student's score} = \frac{\text{score gained by student}}{240} \times 100 \]

the instrument that has been construct-validated was subjected to readability checks by the practicum assistants. the results of the readability test show that the instrument readability can be categorized as very good (93.83%). the readability checks include language, ease, objectivity, and utility.
the 'develop' phase consists of three activities, covering: (a) expert evaluation, (b) small-group try-out, and (c) large-group try-out. based on the results of the evaluation by the subject-matter and evaluation experts, the instrument is categorized as very good (91.67). the evaluation includes language, construct, content, objectivity, and utility. a minor revision is suggested, however, by the subject-matter experts on the use of vocabulary and simplification of the descriptors. the final version of the instrument draft ends up with 80 statement items. some indicators and descriptors of the final draft are presented in table 2, table 3, table 4, and table 5.
in table 2, the indicators and descriptors are those that are used for the perception level. there are two indicators in this level, namely selecting tools and materials and formulating practicum objectives. these indicators are supported by kasilingam et al. (2014), whereby the perception level can be operationalized by choosing, selecting, describing, etc. table 3 shows indicators and descriptors for the psychomotor set level. the set level is operationalized as the mental, emotional, and physical readiness of the student to work. in this level, the indicators are writing down work procedure and writing up observation results. these indicators are chosen for the reason that students’ readiness to do the practicum can be seen from their understanding of the sequence of the steps in the practicum class, which is represented by their ability to write down the steps in accordance with the guidebook.
in the same way, students’ readiness to communicate the results and write a report is shown by their ability to write down the results of the practicum and any important phenomenon in the form of a report draft.

table 2. indicators and descriptors operationalized from perception
indicator: selecting tools and materials
descriptors: (1) selecting tools and materials for making solution from various concentrations, making buffer solution, and measuring ph; (2) arranging tools and materials on the operation desk thoroughly as directed by the guidebook
indicator: formulating practicum objectives
descriptors: (1) writing up learning objectives in accordance with the learning outcomes in biochemistry practicum

table 3. indicators and descriptors operationalized from the set level
indicator: writing down the sequence of work steps
descriptors: (1) writing down the complete steps of the practicum job in accordance with the guidebook; (2) arranging the tools and materials on the work desk in accordance with the guidebook
indicator: writing up observation results
descriptors: (1) writing up results of the practicum in a tentative draft

in the level of guided response, there are 50 items to be tested. these items are written out in accordance with the practicum guidebook (kasilingam et al., 2014). some of these instrument items are shown in table 4. the mechanism level has skill categories with which students are familiar. it includes calculating solution volume, weighing materials, observing solution volume through the glass tube, heating solution, measuring ph solution, etc. these descriptors use operational mechanisms like measuring, organizing, heating, etc. (kasilingam et al., 2014).
after being evaluated by experts, the instrument was subjected to a try-out with a small group of 20 students. the results show that the average of students’ scores is 128.
converted to the 1 to 100 scale, this score is 53.33. this score belongs to the low category.

table 4. indicators and descriptors operationalized from guided response
indicator: making hcl solution using various concentrations
descriptors: (1) putting hcl 1m solution, using the correct pipette, into a 25 ml bulb tube to make hcl 0.05m; (2) pouring distilled water to dilute hcl 1m through the bulb glass wall up to the correct limit stripe
indicator: making 100 ml of 0.2m ch3coona solution
descriptors: (1) putting 1.64 gram ch3coona into bulb tube 50ml using glass beaker; (2) pouring distilled water into the bulb glass containing 1.64 gram ch3coona through the tube wall; (3) closing the bulb glass and shaking it until the ch3coona crystals dissolve
indicator: making 100ml of 1% gelatin solution
descriptors: (1) putting 1 gram gelatin into beaker glass; (2) pouring 60 ml distilled water into the glass beaker containing 1 gram gelatin; (3) dissolving the gelatin; (4) cooling gelatin solution to room temperature; (5) pouring cool gelatin solution into 100ml bulb tube using glass beaker; (6) pouring distilled water into bulb tube up to the limit stripe; (7) closing bulb tube and shaking it to make the solution homogeneous
indicator: making 100ml of 0.2m ph 5 acetate buffer solution
descriptors: (1) taking 62.95 ml of ch3coona solution; (2) taking 37.05 ml of 0.2m ch3cooh solution; (3) mixing ch3coona and ch3cooh solutions by shaking the erlenmeyer to make the solution homogeneous

table 5. indicators and descriptors operationalized from mechanism level
indicator: calculating solution volume
descriptors: (1) constructing formulas for calculating the volumes of the solution concentrations to be used; (2) calculating solution volume for solution dilution following the volume comparison formula and practicum concentration
indicator: weighing material
descriptors: (1) putting filtering paper or watch glass on the analytic plate or pan; (2) pushing the marking button to calibrate the scales until zero (0) appears; (3) putting material on filtering paper or watch glass; (4) weighing the material in accordance with the needed weight
indicator: determining height or volume of solution using practicum tools (pipette, bulb, glass, etc.)
descriptors: (1) observing the height of the solution volume to be measured at eye level; (2) using the meniscus point to determine the volume of solution; (3) taking 5 ml of buffer solution into a test tube using a pipette for each solution concentration
indicator: measuring solution ph
descriptors: (1) immersing the tip of the ph indicator in the test tube for 5 seconds; (2) comparing the colour of the immersed ph indicator paper with that of the universal standard ph indicator

the instrument was finally subjected to the large-group try-out with 45 students. the results of the try-out show that the average of students’ scores is 148.8. converted to the 1 to 100 scale, this score is 62. this score also belongs to the low category.
discussions
the research findings show that the developed instrument has a high construct validity; however, the results of the small-group and large-group try-outs are not satisfactory. this may be because the results of the practicum experiment are shared by the students in a group, so that not every student is able to carry out all of the assessment aspects in the practicum. the results of the large-group try-out show a score in the low category, the same as those of the small-group try-out. this may be caused by the fact that the practicum is carried out by task assignments. this was done because the practicum material is big in volume while the time is limited to two hours. as a consequence, students are not able to conduct all the activities in the practicum, so that the observed psychomotor scores are partial. the low level of the results of the try-outs may also be related to the fact that the instrument reliability measure is not very well-defined or estimated.
according to lee, brennan, and kolen (2000), when the reliability measure is low, the standard error of measurement (sem) is high, with the consequence that the validity of the measurement is undermined. on the other hand, when the reliability measure is high and the sem is low, it means that there is validity in the results of the measurement. in spite of all that, a high reliability measure, however large, does not guarantee the presence of validity (azwar, 2008). consequently, reliability estimation is important in instrument development.
conclusion and suggestions
conclusion
based on the research findings, it can be concluded that the developed instrument is feasible to be used. the instrument has a high measure of construct validity although its reliability is not very well-estimated. in fact, instrument reliability can be raised in two ways, i.e. by adding items that have high internal consistency or removing those with low internal consistency. the instrument is constructed of four psychomotor aspects of perception, set, guided responses, and mechanism distributed into 80 statement items.
suggestions
the reliability of the developed psychomotor evaluation instrument has not been estimated very well. it is suggested that other studies intended to develop an evaluation instrument carry out reliability estimation. the techniques can be suited to the objectives and types of data of the study.
acknowledgment
the authors thank lppm uad for providing the internal fund through hibah penelitian dosen pemula in 2018, and they express their gratitude to all academicians of the biology education study program at universitas ahmad dahlan who helped in the research.
references
azwar, s. (2008). penyusunan skala psikologi. yogyakarta: pustaka pelajar.
febriana, m., rachmadiarti, f., & faizah, u. (2016). kelayakan perangkat penilaian materi ekologi yang sesuai dengan tagihan kurikulum 2013.
bioedu: berkala ilmiah pendidikan biologi, 5(1), 49–54. hamid, r., baharom, s., hamzah, n., badaruzzaman, w. h. w., rahmat, r. a. o. k., & raihantaha, m. (2012). assessment of psychomotor domain in materials technology laboratory work. procedia social and behavioral sciences, 56, 718–723. https://doi.org/10.1016/j.sb spro.2012.09.708 haviz, m. (2013). research and development: penelitian di bidang kependidikan yang inovatif, produktif dan bermakna. ta’dib, 16(1), 28–43. developing psychomotor evaluation instrument... etika dyah puspitasari, mohamad joko susilo, & novi febrianti 8 copyright © 2019, reid (research and evaluation in education), 5(1), 2019 issn 2460-6995 hazarianti, p., masriani, & hadi, l. (2016). pengembangan rubrik penilaian psikomotorik pada praktikum submateri koefisien distribusi mahasiswa pendidikan kimia. jurnal pendidikan dan pembelajaran, 5(11), 1–10. ibezim, n. e., & igwe, n. (2016). checklist versus rating scale in psychomotor assessment: achieving objectivity. international journal of humanities and social science invention, 5(8), 48–51. https://doi.org/10.3389/fpsyt.2014.00 196 kasilingam, g., ramalingam, m., & chinnavan, e. (2014). assessment of learning domains to improve student’s learning in higher education. journal of young pharmacists, 6(1), 27–33. https:// doi.org/10.5530/jyp.2014.1.5 lee, w., brennan, r. l., & kolen, m. j. (2000). estimators of conditional scalescore standard errors of measurement: a simulation study. journal of educational measurement, 37(1), 1–20. lestari, m. y., & diana, n. (2018). keterampilan proses sains (kps) pada pelaksanaan praktikum fisika dasar i. indonesian journal of science and mathematics education, 1(1), 49–54. retrieved from http://ejournal.raden intan.ac.id/index.php/ijsme/article/vi ew/2474/1828 maknun, d., surtikanti, r. r. h. k., munandar, a., & subahar, s. (2012). keterampilan esensial dan kompetensi motorik laboratorium mahasiswa calon guru biologi dalam kegiatan praktikum ekologi. 
jurnal pendidikan ipa indonesia, 1(2), 141–148. https://doi.org/10.1529 4/jpii.v1i2.2131 maknun, d., surtikanti, r. r. h. k., & subahar, t. s. (2012). pemetaan keterampilan esensial laboratorium dalam kegiatan praktikum ekologi. jurnal pendidikan ipa indonesia, 1(1), 1–7. mardapi, d. (2008). teknik penyusunan instrumen tes dan nontes. yogyakarta: mitra cendekia. noverina, s., taufiq, & wiyono, k. (2014). pengembangan rubrik penilaian keterampilan dan sikap ilmiah mata pelajaran fisika kurikulum 2013 di kelas x sekolah menengah atas. jurnal inovasi dan pembelajaran fisika, 1(2), 145–151. retrieved from https://ejournal.unsri. ac.id/index.php/jipf/article/view/1804 /749 osman, k., hiong, l. c., & vebrianto, r. (2013). 21st century biology: an interdisciplinary approach of biology, technology, engineering and mathematics education. procedia social and behavioral sciences, 102, 188–194. https://doi.org/ 10.1016/j.sbspro.2013.10.732 ridlo, s. (2012). pengembangan tes pengetahuan praktikum biologi berdasarkan graded response dan generalized partial credit. jurnal penelitian dan evaluasi pendidikan, 16(edisi dies natalis ke-48 uny), 166–182. https:// doi.org/10.21831/pep.v16i0.1111 suardana, i. n., liliasari, l., & ismunandar, i. (2013). peningkatan penguasaan konsep mahasiswa melalui praktikum elektrolisis berbasis budaya lokal. jurnal pendidikan dan pembelajaran (jpp), 20(1), 45–52. retrieved from http://journal. um.ac.id/index.php/pendidikan-danpembelajaran/article/view/3869 suryaningsih, y. (2017). pembelajaran berbasis praktikum sebagai sarana siswa untuk berlatih menerapkan keterampilan proses sains dalam materi biologi. jurnal bio education, 2(2), 49–57. retrieved from https://jurnal.unma.ac.id/ index.php/be/article/view/759/708 thiagarajan, s., semmel, d. s., & semmel, m. i. (1974). instructional development for training teachers of exceptional children: a sourcebook. bloomington, in: indiana university. tursinawati. (2016). 
penguasaan konsep hakikat sains dalam pelaksanaan percobaan pada pembelajaran ipa di sdn kota banda aceh. jurnal pesona dasar, 2(4), 72–84. retrieved from http://www.jurnal.unsyiah.ac.id/pear/article/view/7534
yulianti, n., andriani, n., & taufiq. (2014). pengembangan instrumen penilaian psikomotorik pada pokok bahasan suhu dan kalor di smp. jurnal inovasi dan pembelajaran fisika, 1(2), 152–156. retrieved from http://ejournal2.unsri.ac.id/index.php/jipf/article/view/1805/750
yunita, l., agung, s., & nuraeni, r. (2016). pengembangan instrumen penilaian aspek psikomotorik siswa sma/ma pada praktikum titrasi asam basa. in pros. semnas pend. ipa pascasarjana um (pp. 662–670). malang: universitas negeri malang. retrieved from http://pasca.um.ac.id/wp-content/uploads/2017/02/luki-yunita-662-670.pdf

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(2), 2017, 114-123
available online at: http://journal.uny.ac.id/index.php/reid

research article

developing physics problem-solving skill test for grade x students of senior high school

1 amipa tri yanti nadapdap; *2 edi istiyono
*graduate school of universitas negeri yogyakarta
jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*email: edi_istiyono@uny.ac.id
submitted: 24 july 2017 | revised: 29 december 2017 | accepted: 29 december 2017

abstract

this research aimed to develop a physics problem-solving skill (pss) test for grade x students of senior high school which met test instrument characteristics and feasibility. the development stages included: (a) test designing, (b) test trial, and (c) test revision and preparation.
the designing stage included: (1) needs analysis, (2) mapping, (3) drawing conclusion, (4) determining test purpose, (5) determining competencies, (6) determining materials, (7) preparing answers, (8) writing items, (9) validating content, (10) improving and preparing the test, and (11) preparing the scoring guide with pcm. the trial stage consisted of: (1) determining trial subjects, (2) performing trial, and (3) analyzing trial result data based on irt. the study was performed in kulonprogo involving 281 students. the result shows that the instrument fulfills content validity with aiken’s v of 0.95 to 0.98. based on infit mnsq criteria, 52 items fit pcm, item difficulty index ranges from -1.47 to 0.88, meaning that all items are good, and information function analysis and sem show that the test fits the ability between -1.3 and 2.7. therefore, the test instrument meets the characteristics and feasibility to measure physics pss in high school. keywords: problem-solving skill, testing, physics, assessment how to cite item: nadapdap, a., & istiyono, e. (2017). developing physics problem-solving skill test for grade x students of senior high school. reid (research and evaluation in education), 3(2), 114-123. doi:http://dx.doi.org/ 10.21831/reid.v3i2.14982 introduction assessment in education must be performed in order to measure student’s cognitive skills. it is expected to increase the success of learning process. thus, a series of test assessment instruments should be developed. a test is a planned measurement instrument used by educators to give an opportunity to students to show their achievement and it is related to predetermined objectives (cangelosi, 1995). a test can show the success rate of teaching based on target aspects. its preparation is adjusted to its purpose, e.g. 
a summative test is used to measure students’ achievement, a formative test is used to measure the success of the learning process, and a diagnostic test is used to examine students’ difficulties before a teaching and learning process. there are other test types used to measure certain skills, such as cognitive, affective, and psychomotor skills. a test can take many forms, i.e. multiple choice, sentence completion, listing, true-false, essay, matching, and modified forms (tonidandel, quiñones, & adams, 2002). therefore, a test should be developed consistently and adjusted to its form and measurement purpose.

problem solving is a skill which should be improved in the 21st century. indonesia is a developing country in terms of education, so problem-solving skill (pss) is a skill which must be mastered by students in curriculum 2013 (k-13). assessment in k-13 takes the form of authentic assessment, which assesses the input, process, and results (outputs) of learning, including attitudes, knowledge, and skills. such an assessment technique is relevant to the scientific learning process and is able to assess students’ ability in both the teaching and learning process and its results. regulation of minister of education and culture no. 59 of 2014 states that problem-solving skill is required to achieve the objectives of k-13, which are to give students the life skills to be individuals and citizens who are faithful, productive, creative, innovative, and affective, as well as able to contribute to social life, nation, country, and world civilization. this skill is expected to produce scientific students (nadapdap & lede, 2016). therefore, a problem-solving skill test should be developed.
problem-solving consists of four parts: (1) understanding a problem; (2) preparing a plan for solution; (3) performing a plan; (4) reexamining (pólya, 1957). the indicators of problem-solving development according to helaiya (2010) include: (1) the ability to identify a problem and the problem-solving process; (2) the ability to define a problem by thinking about situations different from the reality; (3) the ability to think of many possible alternative solutions; (4) the ability to verify the result of a solution; and (5) the ability to verify the process of acquiring a solution. therefore, the aspects of a problem-solving test can be developed, including: (1) understanding; (2) planning a solution in problem solving; (3) describing a problem; (4) finding a way to solve a problem according to the planned solution; (5) bringing about a problem; and (6) evaluating the problem solving result assessment (helaiya, 2010).

in physics teaching, pss is the main topic in physics education research (per) because it has long-term benefits. further, physics pss can help students understand the concepts of physics in real terms. the most important part of teaching physics is that students are expected to understand the real world.

the theory of learning is based on one’s process with its various interactions to gain experience, which brings about changes in cognitive, affective, and psychomotor skills (slameto, 2010). according to bloom, cognitive process thinking consists of lower order thinking, which comprises the abilities to memorize, understand, and apply, and higher order thinking, which comprises the abilities to analyze, evaluate, and create. pss is a part of higher order thinking (carvalho et al., 2015).
higher order thinking skills (hots) are: (1) higher order thinking at the upper part of bloom’s cognitive taxonomy, (2) the teaching purpose behind the cognitive taxonomy, which can prepare students to perform knowledge transfer, and (3) the ability to think, which means that students can apply the knowledge and skills that they develop during the learning process in a new context (brookhart, 2010). pss can be measured by using a test which is consistent with the purpose of students’ higher order thinking. besides, the test has to require the use of knowledge and skills in a new situation. in order to assess hots, something new should be used. one of the ways to do that is to use a valid test, that is, a test which is aimed at measuring hots.

one of the modern measurement models is the generalized partial credit model (gpcm). gpcm is an extension of the partial credit model (pcm). in pcm the item discrimination is constant at 1, while in gpcm the discrimination varies across items. pcm is also appropriate for analyzing responses in the measurement of critical thinking and conceptual understanding in science (istiyono, 2016). pcm was developed to analyze test items that require several steps to resolve. gpcm can be applied to tests which are completed in steps that are clear to the testee. a physics achievement test is a test completed by following exact steps. therefore, gpcm is expected to be applicable.

a multiple-choice test has advantages, including: (1) the material being tested can cover most of the learning materials, (2) the students’ answers can be corrected easily and quickly, and (3) the answer to each question is clearly right or wrong, so it is an objective assessment (istiyono, 2016). therefore, using a multiple-choice test to measure problem-solving skills is appropriate.
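as a minimal sketch of the partial credit model described above, the probability of each ordered score category for a single item can be computed from the person ability and the item’s step difficulties. the function name `pcm_probabilities`, the step difficulties, and the ability value are illustrative assumptions and are not taken from the study; the discrimination is fixed at 1, as in pcm, and the categories are coded 0 to m here (the article labels them 1 to 4).

```python
import math

def pcm_probabilities(theta, deltas):
    """Category probabilities for one item under the partial credit model.

    theta  : person ability in logits
    deltas : step difficulties delta_1..delta_m (hypothetical values)
    Returns a list of m + 1 probabilities for scores 0..m.
    """
    # Cumulative sums of (theta - delta_j); the empty sum for score 0 is 0.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical item with four ordered categories (three steps).
probs = pcm_probabilities(theta=0.0, deltas=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

under gpcm, each term (theta - delta_j) would additionally be multiplied by an item-specific discrimination; that extension is left out of this sketch.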
assessment in education uses two kinds of measurement theories: classical measurement theory and modern measurement theory or item response theory (irt). the classical test theory (ctt) is also called the true-score classical theory. the ctt is so named for the elements of this theory have been developed and applied for a long time, but still survive today (suryabrata, 2002). according to the classical theory of measurement, measuring by using measurement score result is usually conducted partially based on the steps that must be taken in order to correct an answer items. scoring is conducted at every step and score each item participant adds a score obtained by the students of each step, and the ability is estimated by the raw scores. a scoring model is not necessarily right, because the level of difficulty of each step is not taken into account. since a test is an instrument that provides stimulus in the form of a command or a question which requires a response from the test participants, the response which is given by the test participants stated in a score is easy to interpret. in addition, the scoring results of a multiple choice test is gained by the use of a dichotomous model, which means that if the item response is correct, it is given a score of 1 and if the response is wrong, it is given a score of 0. teachers do not use polytomous scoring models that would be more equitable because it considers item response measures. these dichotomous scoring models have yet to appreciate the steps of problem solving, because different error rates will result in the same score of 0. dichotomous scoring models are certainly less fair. one of the scoring guidelines that can be selected is the provision of each category, as presented in table 1. hots is interdependent with students’ problem-solving skill. physics pss can really help students solve physics problems in learning. with that skill, students are expected to solve a given problem with an effective solution. 
an accurate solution is judged on the aspects to be measured, namely the aspects of students’ problem-solving skill that are consistent with the students’ formal operational stage of thinking. high school students are 17 years old on average, an age when they can think abstractly and logically, which is categorized as the problem-solving stage.

table 1. scoring category and description
category | guidelines
category 1 | the students are wrong in writing the concepts used and the results are wrong. this is indicated by the students’ wrong answer to the question and wrong reason.
category 2 | the students are wrong in writing the concepts used but the results can be correct. this is indicated by the students’ correct answer to the question on a wrong basis.
category 3 | the students are correct in writing the concepts used but the end result is wrong. this is indicated by the students’ wrong answer to the question and correct reason.
category 4 | the students are correct in writing the concepts used and the results are correct. this is indicated by the students’ correct answer to the question and correct reason.
(istiyono, 2016)

thinking skill is required in scientific thinking. further, scientific thinking involves hypothetico-deductive and inductive types (piaget, 2005). scientific thinking means working effectively and systematically, as well as proportionally. in terms of pss, at that age, students can draw conclusions and interpret and develop hypotheses. however, the existing tests did not describe the skill which demands thinking consistent with the optimization of the characteristics of students’ ability (eraikhuemen & ogumogu, 2014). the higher the cognitive development stage, the more orderly and abstract the students’ thinking. the appropriate way to get information on students’ thinking skill based on these characteristics is to give a test that is appropriate for measuring the thinking competence level.
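the four categories of table 1 can be sketched as a small scoring function, assuming each item is answered in two parts (an answer and a reason) that are each marked right or wrong. the function names and the sample responses are hypothetical; the sketch only contrasts the polytomous categories with a dichotomous 0/1 score.

```python
def score_category(answer_correct, reason_correct):
    """Map an (answer, reason) response to categories 1-4 of Table 1."""
    if answer_correct and reason_correct:
        return 4  # correct concept and correct result
    if reason_correct:
        return 3  # correct concept, wrong result
    if answer_correct:
        return 2  # wrong concept, but the result is correct
    return 1      # wrong concept and wrong result

def dichotomous_score(answer_correct):
    """Conventional right/wrong scoring that ignores the reason."""
    return 1 if answer_correct else 0

# Hypothetical responses: (answer_correct, reason_correct)
responses = [(False, False), (True, False), (False, True), (True, True)]
polytomous = [score_category(a, r) for a, r in responses]
dichotomous = [dichotomous_score(a) for a, _ in responses]
print(polytomous, dichotomous)
```

in this sketch the first and third responses both receive 0 under dichotomous scoring, while the polytomous categories keep them apart.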
however, the current development of assessment is only based on the classical theory assumption, in which scoring is performed step by step, the student’s score per item is obtained by adding the student’s score in every step, and the skill is estimated by raw scores. thus, an assessment which can cover the thinking skill levels from problem identification to assessment should be developed (gok, 2010). therefore, a physics problem-solving skill test instrument was developed for grade x students of high school. the purpose of the study was to produce an instrument to measure physics pss in grade x students in their even semester and to get the characteristics of the physics pss assessment instrument.

method

this study is a developmental study with a quantitative approach. the instrument development used in this study followed the modified oriondo and dallo-antonio model (oriondo & dallo-antonio, 1998). the developed assessment instrument was a physics pss test for grade x students in the even semester of the 2016/2017 academic year.

population

the study was performed in public high schools (sekolah menengah atas negeri, or sma negeri) in kulonprogo regency, yogyakarta, i.e. sma negeri 1 wates, sma negeri 2 wates, and sma negeri 1 pengasih. the trial subjects were 281 students. the sample consisted of the students who had received similar tested materials in the three schools, and they were not selected based on ranking. the validated instrument was a pss test packed in two packages, each containing 30 multiple-choice questions with 8 anchor items, ready for use in empirical testing. the instrument was tested on 281 students. the respondents were chosen from the classes which had studied the materials of elasticity, static fluid, temperature and heat, and optical equipment.
they were classes x of sma negeri 2 wates, sma negeri 1 wates, and sma negeri 1 pengasih kulon progo. the test results were analyzed by reference of the test using the criteria of acceptance of instrument suitability with rash model, seen from the mean value of infit mnsq (mean of square) which ranged from 0.77 to 1.33 (adams & khoo, 1996, p. 30). the trial sample in the analysis by irt consisted of 281 students, who were required in irt model research. some experts consider that the bigger the sample size, the better the measurement result will be. one of the bases for using 281 students as the trial sample was shin, who was using 200 to 1000 (shin, 2009). therefore, the 281 students used in this measurement was considered adequate. data collection technique the instrument development was based on the aspects and sub-aspects of pss test, including: (a) test design, (b) test trial, and (c) test revision and preparation. meanwhile, the instrument designing stage consisted of: (1) needs analysis, (2) mapping, (3) drawing conclusion, (4) determining test purpose, (5) determining tested competencies, (6) determining tested materials, (7) preparing test answers, (8) writing items, (9) validating content by expert, (10) improving and preparing test, and (11) preparing scoring guide with partial credit model (pcm). the trial stage consisted of: (1) determining trial subjects, (2) performing trial, and also (3) analyzing the trial result data based on irt. figure 1 shows the test development stage. the test developed was a physical test used in high school with problem-solving aspect. the test was developed in the form of a multiple choice item consisting of 60 items including 8 anchor items. the test developed yielded 2 sets of problems with package a of 30 questions and package b of 30 questions. each package has 8 anchor items. 
the data analysis employed in this study was the partial credit model 1-pl (pcm 1-pl), used to test the item fit of the physics pss test for grade x students of high school. based on irt, the sample was adequate and good according to the pcm 1-pl model (adams & khoo, 1996). the content validity analysis was performed qualitatively by material experts using aiken’s v index. based on the index, an item was valid if its aiken’s v was at least 0.87 (aiken, 1980).

figure 1. phases of test development

the data analysis was performed on several aspects, including (1) the fitness of instrument items, (2) the reliability, (3) the item characteristic curve (icc), (4) the difficulty index, and (5) the total information function and standard error of measurement (sem). the goodness of fit test for the overall test and testees (case/person) was based on the average infit mean of square (mean infit mnsq) and its standard deviation, or on the average infit t (mean infit t) and its standard deviation. if the average infit mnsq was approximately 1.0 and its standard deviation was 0.0, or the average infit t was approaching 0.0 and its standard deviation was 1.0, then the whole test fits the model. an item or testee (case/person) fits the model if the infit mnsq ranges from 0.77 to 1.30. an item was good if the difficulty index was over -2.0 and less than 2.0 (hambleton & swaminathan, 1985). the test reliability was examined through the information function and the criteria presented in table 2.

table 2. criteria of ideal score
score criteria | reliability category
> 0.94 | excellent
0.91 – 0.94 | very good
0.81 – 0.90 | good
0.67 – 0.80 | acceptable
< 0.67 | questionable

findings and discussion

the development resulted in a problem-solving skill test with two sets of problems, coded a and b, each consisting of the materials of elasticity, static fluid, temperature and heat, and optical instruments. table 3 shows the item distribution, with eight items as the anchor items, across the aspects of identification, planning, application, and assessment.

table 3. distribution test
aspect | sub-aspect | elasticity | static fluid | temperature and heat | optic
identify | distinguish | 1a*, 1b* | 8a, 8b | 17a, 17b | 24a, 24b
identify | identify | 2a, 2b | - | - | 25a, *25b*
plan | formulate | 3a, 3b | 9a, 9b, 10a, 10b | 19a, 19b | -
plan | devise | 4a, 4b | - | - | 26a, 26b
apply and execute | connect | 5a, 5b | 12a, 12b, 11a, 11b, 16a, *16b* | - | 28a, 28b
apply and execute | apply | - | 13a, 13b | 21a, *21b*, 20a, 20b, 18a, 18b | 29a, 29b
apply and execute | analyze | 6a, 6b | 14a*, 14b* | 23a*, 23b* | 27a*, 27b*
evaluation | investigate | 7a, *7b* | - | 22a, 22b | 30a, 30b
evaluation | conclude | - | 15a, 15b | - | -

the research product was validated by two assessment experts and five practitioners to assess its feasibility. the aiken index is in the range of 0.8 to 1.00. it can be interpreted that all of the items have good content validity and support the overall content validity.

the goodness of fit was tested for the overall test items. the fitness of the overall test items used the principle developed by adams and khoo (1996, p. 30), based on the infit mean of square (mean infit mnsq) and its standard deviation, or on the average infit t (mean infit t) and its standard deviation. if the average infit mnsq was approximately 1.0 and its standard deviation 0.0, or the average infit t approached 0.0 and its standard deviation 1.0, the overall test fits the pcm 1-pl model.
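the aiken index used for the content validation above can be sketched as follows, using aiken’s (1980) formula v = Σ(r − lo) / (n(c − 1)), where r is a rater’s score, lo is the lowest possible score, c is the number of rating categories, and n is the number of raters. the five-point rating scale and the seven ratings are hypothetical values chosen for illustration (the article reports seven validators but not their raw ratings).

```python
def aikens_v(ratings, lowest, highest):
    """Aiken's V for one item rated on an integer scale lowest..highest."""
    n = len(ratings)
    # s = r - lowest for each rater; (highest - lowest) equals c - 1.
    s_total = sum(r - lowest for r in ratings)
    return s_total / (n * (highest - lowest))

# Hypothetical ratings from seven validators on a 1-5 scale.
ratings = [5, 5, 4, 5, 5, 4, 5]
v = aikens_v(ratings, lowest=1, highest=5)
is_valid = v >= 0.87  # minimum index stated in the method section
print(round(v, 2), is_valid)
```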
table 4 shows that the average infit mnsq is 1.00 and its standard deviation is 0.02, so the overall test fits the pcm 1-pl model. the fitness determination of each item followed the principle of adams and khoo (1996, p. 30), in which an item fits the model if its infit mnsq ranges from 0.77 to 1.30 and its infit t lies between -2.0 and 2.0. the infit mnsq values ranged from 0.99 to 1.03, so all of the 52 items fit the pcm.

table 4. testing the statistic fit parameter level
no | test parameter | item estimation | case estimation
1 | average and std. deviation | -0.25 ± 0.28 | 0.17 ± 0.02
2 | infit mnsq | 1.00 ± 0.02 | 1.00 ± 0.12
3 | outfit mnsq | 1.00 ± 0.02 | 1.00 ± 0.12
4 | infit zstd | 0.09 ± 0.75 | 0.06 ± 1.84
5 | average difficulty | 1.00 ± 0.95
6 | estimate reliability | 0.8

the result of the reliability testing shows that the reliability of the instrument is 0.80 (table 4). based on this value, the instrument as a whole is reliable, since the reliability falls in the sufficient category for the interpretation of rasch model data. figure 2 shows the goodness of fit of the items from the analysis with quest. based on the results of the analysis, it can be concluded that all test items are in accordance with the pcm model, with every item within the infit mnsq range from 0.77 to 1.30 and within the infit t limits of -2.0 to 2.0; figure 2 shows that no item exceeds the acceptance limits. in conclusion, 52 items fit the pcm model.

figure 2. goodness of fit instrument

based on the result of analysis, the reliability of the instrument is 0.80. the reliability is adequate.
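the acceptance rules quoted above can be sketched as two small checks: an item-fit test using the infit mnsq range of 0.77 to 1.30 with infit t between -2.0 and 2.0, and a lookup of the reliability category from table 2. the function names are hypothetical, and because table 2 leaves small gaps between its bands (for example between 0.90 and 0.91), this sketch assumes that a value above one band’s upper limit belongs to the next band.

```python
def item_fits(infit_mnsq, infit_t,
              mnsq_range=(0.77, 1.30), t_range=(-2.0, 2.0)):
    """True if an item meets both infit criteria quoted in the text."""
    mnsq_ok = mnsq_range[0] <= infit_mnsq <= mnsq_range[1]
    t_ok = t_range[0] <= infit_t <= t_range[1]
    return mnsq_ok and t_ok

def reliability_category(r):
    """Category from Table 2; the handling of gaps between bands is an assumption."""
    if r > 0.94:
        return "excellent"
    if r > 0.90:
        return "very good"
    if r > 0.80:
        return "good"
    if r >= 0.67:
        return "acceptable"
    return "questionable"

print(item_fits(1.03, 0.5), reliability_category(0.80))
```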
the instrument has adequate strength and reliability because it consists of the items which have high information function (hambleton & swaminathan, 1985, p. 94). it may be because the test fits the skill of the tested students. an item is categorized as good if the difficulty index is higher than -2.0 or less than 2.0 (hambleton & swaminathan, 1985, p. 36). based on the analysis result, the items difficulty is between -0.95 and 1.0 with an average of 0 and standard deviation of 0.32. therefore, based on the difficulty level, 52 items are good. the average difficulty of the aspect of problem-solving skills are shown by table 5. table 5. average difficulty of the aspect of problem-solving skills aspect difficulty identify -0.13 plan -0.16 apply and execute 0.20 evaluate 0.54 construct validity is empirically proven by goodness of fit in the partial credit model (pcm). table 4 shows the average value and standard deviation of infit mnsq are 1.00 and 0.02, respectively, so the test fits pcm 1 pl. this means that the test is empirically valid. the test contains valid aspects of the pss. this is because: (1) the items were developed consistently with the appropriate instrument item development procedure, (2) the items were developed from indicators derived from the aspects of the problem-solving skill and physics materials, (3) the test consisted of 52 items whose content validity was examined through expert judgment, and (4) the tryout respondents (students) worked on the test seriously (istiyono, mardapi, & suparno, 2014). the difficulty level b for good item varies between -2.00 and 2.00. an item with the difficulty level of -2.00 is very easy, while that with the difficulty level of 2.00 is very difficult. based on the test characteristics, the problem-solving skill test had the reliability coefficient, test information function, and estimation parameter which were reliable and had high stability. figure 3. the percentage of difficulty level of aspect figure 4. 
the percentage of difficulty level of sub-aspect (sub-aspects shown: identify, distinguish, make a plan, formulate, sorting, connect, applied, check, criticize)

figures 3 and 4 show the percentages for the aspects and sub-aspects that have been tested. the percentages give the frequency of students’ responses in categories one, two, three, and four for each aspect and sub-aspect. the first category corresponds to a score of one and the fourth category to a score of four. the percentage of each difficulty level is shown in figure 3. it shows that the highest difficulty level is in the application aspect. a high category 1 percentage shows that most students’ answers received a score of 1, so the items are difficult. figure 3 shows that the percentage of the application aspect in category 1 is 64 and that in category 4 is 6. figure 4 shows the level of difficulty of each sub-aspect of the problem-solving skill.

the differences between the classical theory and the modern theory in educational assessment can be illustrated by five students a, b, c, d, and e taking a test of 5 items, each with five alternatives. a wrong answer is given a score of 0 and a correct answer is given a maximum score of four. the most difficult aspect is the evaluation and implementation aspect. this shows that the students’ problem-solving skill in the evaluation and implementation aspects is still low.

figure 5. icc of item no. 38

the characteristic of the item is indicated by the item characteristic curve (icc) and the difficulty index. based on the result of the icc analysis, 52 items are equivalent to the number of the questionnaire items developed. figure 5 shows an example of icc for item 38.
it shows that in category 1, the ability of most of the students is very low (θ = -3), in category 2, the ability of most of the students is low (θ = -1), in category 3, the ability of some students is high (θ = 1), and in category 4, the ability of most of the students is very high (θ = 3). the difficulty level increases from small to large across the sequential categories 1, 2, 3, and 4.

based on figure 6, the measurement information is in the ability range of -1.3 to 2.7. therefore, the test instrument is suitable to be used for students with abilities from -1.3 to 2.7, since in that range the information function shows the ability level estimated by the test (thorpe et al., 2007, p. 179). the assessment of learning achievement in physics is an assessment of the results of the physics learning process, expressed as a number that describes the characteristics of individual students.

figure 6. information function & standard error measurement (sem)

the relationship between the information function and sem shows the contribution of the test to expressing the latent ability as measured by the test. the greater the value of the information function given by the items on the test, the smaller the measurement error. therefore, the test is suitable to be used in measuring students’ problem-solving skill in the low, medium, and high ability categories.

based on the discussion, the test is feasible to use in measuring students’ pss, because: (1) the developed items were consistent with the appropriate instrument item development procedure, (2) the items were developed from problem-solving indicators, (3) the test consists of 52 items whose content validity was examined through expert judgment, and (4) the tested respondents (students) did the test seriously because they were observed by their teachers. this was consistent with the finding of istiyono et al. (2014). therefore, the instrument is expected to be usable for measuring problem-solving skill appropriately.
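the inverse relationship between the information function and sem described above can be sketched under the partial credit model, where the information of an item at ability θ equals the variance of its score at θ, the test information is the sum over items, and sem(θ) = 1 / sqrt(information). the three items and their step difficulties are hypothetical and are not the study’s estimates; the function names are likewise illustrative.

```python
import math

def pcm_probabilities(theta, deltas):
    """Category probabilities (scores 0..m) for one item under PCM."""
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, deltas):
    """Item information under PCM: the variance of the item score at theta."""
    probs = pcm_probabilities(theta, deltas)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

def total_information(theta, items):
    """Test information: the sum of the item informations at theta."""
    return sum(item_information(theta, deltas) for deltas in items)

def sem(theta, items):
    """Standard error of measurement at theta."""
    return 1.0 / math.sqrt(total_information(theta, items))

# Hypothetical step difficulties for three four-category items.
items = [[-1.0, 0.0, 1.0], [-0.5, 0.5, 1.5], [-1.5, -0.5, 0.5]]
for theta in (-3.0, 0.0, 4.0):
    print(theta, round(total_information(theta, items), 3), round(sem(theta, items), 3))
```

in this sketch the information is highest, and the sem lowest, near the abilities that match the step difficulties, which is the pattern the text uses to identify the ability range the test measures well.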
problem-solving assessment can help students understand a problem quickly (gok, 2010). thus, this instrument can be used to measure problem-solving skills accurately.

conclusion

the problem-solving skill instrument was developed in the form of a multiple-choice test based on problem-solving skills in the physics materials of elasticity, static fluid, temperature and heat, and optics. it consists of set a and set b, each with 8 anchor items, and has 52 items in total. the problem-solving skill test fulfills content validity by expert judgment and has empirical evidence of construct validity, since it fits the partial credit model (pcm) based on polytomous data of four categories. the reliability of the pss test has met the requirement (reliability coefficient of 0.79). the difficulty level of the 52 test items is good, between -2 and +2. thus, the test is suitable for measuring the problem-solving ability of students in the low, medium, and high ability categories. based on the information function, the pss test is appropriate for measuring students’ problem-solving skill from -1.3 to 2.7 with a good item difficulty level. therefore, the test is qualified and can be used to measure the physics problem-solving skill of grade x students of high school.

references

adams, r. j., & khoo, s.-t. (1996). quest: the interactive test analysis system version 2.1. victoria: australian council for educational research.
aiken, l. r. (1980). content validity and reliability of single items or questionnaires. educational and psychological measurement, 40(4), 955–959. https://doi.org/10.1177/001316448004000419
brookhart, s. m. (2010). how to assess higher-order thinking skills in your classroom. alexandria: ascd.
cangelosi, j. (1995). merancang tes untuk menilai prestasi siswa. (d. tedjasudhana, ed.). bandung: institut teknologi bandung.
carvalho, c., fíuza, e., conboy, j., fonseca, j., santos, j., gama, a. p., & salema, m. h. (2015). critical thinking, real life problems and feedback in the sciences classroom. journal of turkish science education, 12(2), 21–31.
eraikhuemen, l., & ogumogu, a. e. (2014). an assessment of secondary school physics teachers conceptual understanding of force and motion in edo south senatorial district. academic research international, 5(1), 253–262.
gok, t. (2010). the general assessment of problem solving processes and metacognition in physics education. eurasian journal of physics and chemistry education, 2(2), 110–122. retrieved from http://www.eurasianjournals.com/index.php/ejpce
hambleton, r. k., & swaminathan, h. (1985). item response theory: principles and applications. boston, ma: kluwer nijhoff.
helaiya, s. (2010). development and implementation of life skills programme for student teachers. vadodara: maharaja sayaji rao university of baroda.
istiyono, e. (2016). the application of gpcm on mmc test as a fair alternative assessment model in physics learning. in proceeding of the 3rd international conference on research, implementation and education of mathematics and science (icriems), 16-17 may 2017 (pp. 25–30). yogyakarta: universitas negeri yogyakarta. retrieved from http://seminar.uny.ac.id/icriems/sites/seminar.uny.ac.id.icriems/files/prosiding/pe04.pdf
istiyono, e., mardapi, d., & suparno, s. (2014). pengembangan tes kemampuan berpikir tingkat tinggi fisika (pysthots) peserta didik sma. jurnal penelitian dan evaluasi pendidikan, 18(1), 1–12. https://doi.org/10.21831/pep.v18i1.2120
nadapdap, a. t. y., & lede, y. (2016). authentic assessment of problem solving and critical thinking skill for improvement in learning physics. in proceeding of international seminar on science education (isse), 29 october 2016 (pp. 37–42). yogyakarta: universitas negeri yogyakarta.
oriondo, l. l., & dallo-antonio, e. m. (1998). evaluating educational outcomes.
manila: rex printing company.
piaget, j. (2005). the psychology of intelligence (electronic version). taylor & francis.
pólya, g. (1957). how to solve it: a new aspect of mathematical method. garden city: doubleday.
regulation of minister of education and culture no. 59 of 2014 on the curriculum 2013 of senior high school/madrasah aliyah (2014). republic of indonesia.
shin, s.-h. (2009). how to treat omitted responses in rasch model-based equating. practical assessment, research & evaluation, 14(1), 1–8. retrieved from http://pareonline.net/getvn.asp?v=14&n=1
slameto. (2010). belajar dan faktor-faktor yang mempengaruhi. jakarta: rineka cipta.
suryabrata, s. (2002). pengembangan alat ukur psikologis. yogyakarta: andi offset.
thorpe, g. l., mcmillan, e., sigmon, s. t., owings, l. r., dawson, r., & bouman, p. (2007). latent trait modeling with the common beliefs survey iii: using item response theory to evaluate an irrational beliefs inventory. journal of rational-emotive & cognitive-behavior therapy, 25(3), 175–189. https://doi.org/10.1007/s10942-006-0039-9
tonidandel, s., quiñones, m. a., & adams, a. a. (2002). computer-adaptive testing: the impact of test characteristics on perceived performance and test takers’ reactions. journal of applied psychology, 87(2), 320–32.

copyright © 2018, reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 4(1), 2018, 35-44
available online at: http://journal.uny.ac.id/index.php/reid

developing an instrument for measuring the spiritual attitude of high school students

*1safa’at ariful hudha; 2djemari mardapi
1,2graduate school of universitas negeri yogyakarta
1jl. colombo no. 1, karangmalang, depok, sleman 55281, yogyakarta, indonesia
*corresponding author.
e-mail: safaat.a.huda@gmail.com
submitted: 11 july 2018 | revised: 31 august 2018 | accepted: 18 september 2018

abstract

attitudinal competence is one of the most fundamental concepts in social psychology. it is related to personal identity, morals, and ethics, and it is gaining popularity and importance in educational development. this research aims to develop an instrument to measure the spiritual attitude of high school students. the study was a research and development study consisting of four stages: (a) determining conceptual definition, (b) determining operational definition, (c) drawing indicators, and (d) constructing instrument. quantitative data analysis was used to test the construct validity through confirmatory factor analysis, and the coefficient of construct reliability was used to estimate the instrument reliability. the results of the study show that: (1) the instrument to measure moslems’ spiritual attitude is an inventory model of summated rating scale containing 35 items; (2) the construct validity is supported by the values of the standardized loading factor, which are considered significant, and the instrument reliability, expressed as the construct reliability coefficient, is 0.890 with an average variance extracted of 0.542; (3) the construct of the instrument fits the data, as indicated by the goodness of fit index = 0.91 (≥0.90) and the root mean square error of approximation = 0.032 (≤0.08). the results indicate that the construct of the measurement suits the data. in addition, this research has confirmed that the spiritual attitude of high school students is constructed by seven aspects, namely resignation (tawakal), sincerity (ikhlas), thankfulness (syukur), patience (shabr), fear (khauf), hopefulness (raja’), and righteousness (takwa).
keywords: spiritual attitude, validity, reliability

introduction

in the last decade, many people have been looking for the meaning and purpose of their lives as well as for spiritual experiences. this trend has emerged continuously in recent studies presented by a number of researchers (brown, 2007; fisher, 2013). although it has been discussed in many studies, the exact definition of spiritual experience has not been clearly explained yet. further, spirituality itself can be indicated by the meaning of human life, although how people intend and interpret the meaning of life satisfaction is still being investigated (smither & khorsandi, 2009). spirituality can be interpreted as an understanding related to human identity, ethics, and way of life. besides, it also explains a fundamental element that makes people full of energy and reveals a state of feeling that is integrated with overall internal human resources in a meaning beyond their religious belief (min & yun, 2015). in fact, the spirituality dimension is almost always identified as being equal to the religious state. in support of this, people who are highly religious are found to be typically more spiritual, although to a somewhat lesser extent (bryant, choi, & yasuno, 2003; nikfarjam, heidari-soureshjani, khoshdel, asmand, & ganji, 2017).

the meaning of spirituality as a psychometric property has been variously defined. however, stating the exact meaning of spirituality is difficult (fisher, 2016). there is no specific term which can describe how spirituality is explained. the least clear part of the discussion of the spiritual aspect concerns the transcendental element (koenig, 2009).
one such study implies that spirituality, as a complex construct, includes existential and also religious dimensions (hungelmann, kenkel-rossi, klassen, & stollenwerk, 1996). it refers to the affective experiences of positive feelings arising from a person’s ability to understand the purpose in life related to personal, communal, and transcendental aspects (soleimani et al., 2017). the religious dimension, as the transcendental aspect in the construct of spirituality, can be described as a person’s capacity to control his/her feelings in relation to how he/she interprets and reflects on his/her religious belief. furthermore, spirituality has not only evolved in terms of religious dimensions but has also become one of the most prominent subjects in the media and various disciplines, as well as a salient factor in human health integrated with the internal forces (azarsa, davoodi, markani, gahramanian, & vargaeei, 2015; moberg, 2002).

another outstanding theory explains spirituality as a personal belief in god or a higher power among religious adherents (good & willoughby, 2006). in addition, shodiq, zamroni, and kumaidi (2016) assert that, as a transcendental element, spirituality in islamic studies and in terms of islamic faith has two dimensions, namely belief (tashdiq-al-qalb), which is known as rukun iman, and attitude or personal feeling (amal-al-qalb), which has seven aspects, i.e. thankfulness (syukur), fear (khauf), love (mahabbah), patience, resignation (tawakkal), hopefulness (raja), and sincerity (ikhlas). in the same field of islamic studies, spirituality from a moslem perspective centers on loving submission and closeness to god (ghorbani, watson, geranmayepour, & chen, 2014). spirituality and religiosity are often used interchangeably, but the two concepts are very different.
sheridan and hemert (1999) define spirituality as a human search for the purpose and meaning of life experience, while tanyi (2002) argues that spirituality is a personal search for the purpose and meaning in life. spirituality entails connection to religious beliefs or self-chosen faith. the two definitions are almost the same, but there is a slight difference: the first emphasizes the meaning of life experience, while the second focuses on the meaning in life. according to hill et al. (2000), the term ‘spirituality’ can be used to describe ‘one’s religious experiences,’ while the term ‘religiosity’ is used to express ‘the state of belief.’ spirituality in the general view seems more basic, positive, and sincere, while religiosity implies ritual and obedience in worship related to certain religious adherents.

one of the most important and fundamental concepts in social psychology is attitudinal competence (bidjari, 2011). fishbein and ajzen (1975) define attitude as a person’s location on a bipolar evaluation of affective dimension concerning some objects, viewed as predisposing the individual to perform various overt behaviours. likewise, attitude refers to people’s predisposition to respond consistently according to whether they like the object or not (mardapi, 2017, p. 134). the term ‘bipolar evaluation of affective dimension’ can be described as the state of positive and negative feeling toward a particular object. attitude in this sense has a positive and a negative direction. according to kusaeri and suprananto (2012, p. 206), in line with the previous explanation, attitudinal competence is defined as a state of readiness to react to an object in a certain way as a form of evaluation and reflection of feeling. furthermore, sax (1980, p. 493) emphasizes that attitude has several dimensions, i.e. direction, intensity, pervasiveness, consistency, and salience.
in relation to the spiritual term, attitude can be explained as a person’s predisposition to choose his/her response to the prevalent situation with an internalization of a specific dimension correlating with his/her religious understanding and spiritual conception. spiritual attitude is often treated as identical to religious attitude. hill et al. (2000) affirm that both spirituality and religiosity have been recognized as having a relationship with a person’s mental health status and are relevant to the study of personality and of the genetic determinants of personality. further, huber and huber (2012) state that the dimensions of spiritual attitude can be seen from the ideology, private practice, religious experience, and intellectual dimensions, which are considered to represent the whole of religious life.

although it is hardly practical to discuss the definition of spirituality and its relations, since it is a multidimensional concept (cook, 2004; hill et al., 2000) including such domains as personal, communal, environmental, and transcendental (fisher, 2016), the measurement of spirituality is most popular in research on mental health, human existence, and social well-being. however, this major psychometric property, related to the existential and religious dimensions, is infrequently applied in the scope of education, especially in relation to student achievement and academic behaviour.

the spiritual attitude in terms of educational learning and curriculum is a student’s ability to control him/herself and his/her description of spiritual self-coping. it is associated with character building in education, which is intended to build a moral, democratic, and religious student as the best outcome of educational learning.
the spiritual attitude illustrates the increase of vertical interaction and the strong relationship with god (ministry of religious affairs of republic of indonesia, 2014, p. 8). spirituality and spiritual attitude are gaining popularity within educational curriculum and academics as discussions regarding the prominence of spiritual attitude in education increase. from the perspective of a moslem, spiritual attitude is related to faith, centered on loving submission and closeness to god, which can be seen from his/her religious experience, private practice, and social relationship.

this study is intended to develop an instrument to measure moslems’ spiritual attitude in education with the seven subscales drawn from islamic religious terms, namely resignation (tawakkal), sincerity (ikhlas), thankfulness (syukur), patience (shabr), fear (khauf), hopefulness (raja’), and righteousness (takwa). this study is also intended to test the instrument’s construct validity and estimate its reliability through quantitative analysis of the data obtained from the research sample.

method

this study is a research and development (r & d) study employing the quantitative approach. it is aimed at developing an instrument to measure moslems’ spiritual attitude in education for high school students. the research procedure was carried out through four stages, namely: (a) determining conceptual definition, (b) determining operational definition, (c) drawing indicators, and (d) constructing instrument.

population and sample

this research was conducted at 11 public senior high schools in yogyakarta, indonesia. the population was the grade xi moslem students, and the sample was 307 participants selected using the cluster random sampling technique, with students’ focus of study, mia (mathematics and natural science) and is (social science), as the cluster. the number of the sample respondents is shown in table 1.

table 1. the numbers of sample respondents

| school name | amount |
| --- | --- |
| sma negeri 2 yogyakarta | 78 |
| sma negeri 4 yogyakarta | 89 |
| sma negeri 7 yogyakarta | 79 |
| sma negeri 10 yogyakarta | 61 |
| total | 307 |

data collecting technique

the instrument to measure moslems’ spiritual attitude was developed by using the seven subscales drawn from the islamic religious terms, and contained 24 indicators. those indicators were developed into 35 questionnaire items using the three-point alternative response model (a, b, and c) of the summated rating scale, designed in the multiple-choice form of questionnaire with key-answer scores varying from 1 to 3. the conceptual framework of moslems’ spiritual attitude in this research is shown in figure 1.

figure 1. the conceptual framework of moslems’ spiritual attitude

notes:
msa : moslems’ spiritual attitude
sa1 : resignation (tawakal)
sa2 : sincerity (ikhlas)
sa3 : thankfulness (syukur)
sa4 : patience (shabr)
sa5 : fear (khauf)
sa6 : hopefulness (raja’)
sa7 : righteousness (takwa)

the moslems’ spiritual attitude (msa) has seven subscales. the first subscale is resignation (tawakal), which refers to the state of self-resignation to obey in worship and to accept all of allah’s decisions. the second subscale is sincerity (ikhlas), which refers to being sincere in doing a favor. the third subscale is thankfulness (syukur), referring to admitting all allah’s best creatures and feeling happy to do his order and leave his prohibition. the fourth subscale is patience (shabr), referring to the attitude of being consistent in refraining from ugliness. the fifth subscale is fear (khauf), being afraid of allah. the sixth subscale is hopefulness (raja’), hoping and asking for his grace and forgiveness. the last subscale is righteousness (takwa), which is the islamic concept of having self-restraint.
content validity

the developed items in this research instrument were validated by a panel of five judges, regarded as expert judgement. the aiken’s v formula was used to assess the feasibility of the content validity. lecturers of educational measurement and islamic studies were involved in the panel. all of the experts were selected based on their experience in the field of educational measurement, psychometrics, and islamic studies. the validators, as the experts, assessed the whole instrument by giving scores to the developed items and responded to the instrument’s indicators through comments and suggestions. subsequently, the validators’ suggestions and comments became the basis for the improvements used to rewrite the items of the research instrument.

construct validity

construct validity needs a definition with a specified conceptual circumscription and focuses more on particular attributes of the variable than on the values or scores gained from the instrument (salkind, 2000). construct validity emphasizes logical analysis and investigates the relationships in the data based on theoretical consideration. it explains the extent to which performance on the test is consistent with the constructs in a particular theoretical consideration. the present study is also concerned with investigating the construct validity of the research instrument, to test how consistent the instrument is with the spiritual attitude construct. the confirmatory factor analysis produced standardized loading factors (slf), which were taken as evidence of construct validity. when the slf value of a certain indicator is over 0.30, the indicator is considered significant (igbaria, zinatelli, cragg, & cavaye, 1997, p. 290). further evidence of construct validity is a significant t-value (t-value >1.96) at the significance level of 0.05.
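the aiken’s v index used for content validity above is commonly computed as v = Σ(r − lo) / (n(c − 1)), where r is a rater’s score, lo is the lowest rating category, c is the number of rating categories, and n is the number of raters. a minimal python sketch follows; the 1–5 rating scale and the five ratings are hypothetical, since the article does not report the experts’ raw scores, and the function name is only illustrative.

```python
def aikens_v(ratings, lowest, highest):
    """Aiken's V for one item: sum(r - lowest) / (n * (c - 1)).

    ratings -- scores given by the raters for a single item
    lowest  -- lowest possible rating category
    highest -- highest possible rating category
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < lowest or r > highest for r in ratings):
        raise ValueError("rating outside the scale")
    n_raters = len(ratings)
    n_steps = highest - lowest  # equals c - 1 for consecutive integer categories
    return sum(r - lowest for r in ratings) / (n_raters * n_steps)


# Hypothetical scores from five experts on an assumed 1-5 scale.
example_ratings = [5, 4, 5, 4, 4]
v_index = aikens_v(example_ratings, lowest=1, highest=5)
print(f"aiken's v = {v_index:.2f}")
```

the index runs from 0 (all raters choose the lowest category) to 1 (all raters choose the highest), so values closer to 1 indicate stronger agreement that an item is relevant.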
goodness of fit statistics

the fit statistics of the instrument in this study refers to the fulfilment of two of the three model fit criteria, i.e. root mean square error of approximation (rmsea ≤0.08), p-value ≥0.05, and goodness of fit index (gfi ≥0.90) (suranto, muhyadi, & mardapi, 2014, p. 102). hair, black, babin, and anderson (2010, p. 656) explain that rmsea is the most suitable fit statistic to be used in confirmatory factor analysis. the goodness of fit statistics was used in this research to investigate the fit between the primary data obtained from the research sample and the theoretical consideration. the fulfilment of two of the model fit criteria indicates that the construct of measurement is suitable to the data.

table 2. parameter of fit statistics

| goodness of fit | cut off point | notes |
| --- | --- | --- |
| chi-square (p-value) | p-value ≥0.05 | model fit |
| rmsea | rmsea ≤0.08 | model fit |
| goodness of fit index (gfi) | gfi ≥0.90 | model fit |

data analysis

the scores given by the five experts’ judgement for all items in the research instrument were analyzed with the aiken’s v formula to investigate the content validity of the instrument. the content validity analysis was carried out prior to the dissemination of the research instrument. the primary data obtained from the research instrument were analyzed using the lisrel 8.80 software program. to analyze the quantitative data, two statistical procedures were employed to answer the research question. first, the second-order confirmatory factor analysis was applied to obtain the construct validity of the instrument based on the standardized loading factor and to investigate the fit statistics of the instrument construct.
second, the coefficient omega, or construct reliability, and the average variance extracted formula were applied to estimate the reliability coefficient of the instrument. the fit statistics of the instrument were obtained from the output of the second-order confirmatory factor analysis. rmsea and gfi were used to determine the instrument fit statistics.

findings and discussion

this study aims to develop an instrument to measure moslems’ spiritual attitude as an inventory model. to achieve this goal, a number of respondents were involved as the research sample to obtain the quantitative data based on their responses to the questionnaires. the scores gained by using the instrument were used to test the construct validity and the coefficient of instrument reliability through the data analysis.

the construct dimension of moslems’ spiritual attitude in this study includes seven aspects developed into 24 indicators. the seven aspects include resignation (tawakkal), sincerity (ikhlas), thankfulness (syukur), patience (shabr), fear (khauf), hopefulness (raja’), and righteousness (takwa). the establishment of the moslems’ spiritual attitude construction was based on the experts in islamic studies, psychometry, and educational evaluation, as well as the general practitioners of islamic education in several high schools.

the 35 items of the questionnaire were validated using aiken’s v formula to assess the feasibility of the content validity. the aiken’s v index ranged from 0.80 to 0.95, which can be interpreted to mean that all the items developed from the indicators in this research instrument have good content validity. the validators’ responses reveal that the developed instrument in this research is a suitable instrument to measure moslems’ spiritual attitude in education.

confirmatory factor analysis

the conceptual construct and the analysis result of the developed instrument with second-order cfa are presented in figure 2.
the analysis result of the second order cfa as indicated in figure 2 shows that the model designed in this study complies with the goodness of fit statistics. the model fit of the instrument is indicated by the rmsea = 0.032 and goodness of fit index = 0.91.

figure 2. the result of cfa second order of moslem’s spiritual attitude

notes:
q1 : self-resignation to obey in worship
q2 : recognizing human’s limitation
q3 : not to expect any rewards
q4 : not to be careless in praise
q5 : not to be hopeless at failure
q6 : admitting all allah’s best creatures
q7 : using god’s grace for good
q8 : not using the grace for ugliness
q9 : being consistent with allah’s commandment
q10 : being consistent with allah’s prohibition
q11 : being consistent to tell the good
q12 : being grateful for the tragedy and hardship
q13 : feeling guilty for disregarding allah’s commandment
q14 : feeling guilty for breaking allah’s prohibition
q15 : being afraid of his threat
q16 : hoping for allah’s grace
q17 : asking for his forgiveness
q18 : making shalat a priority in life
q19 : paying for zakat
q20 : being tolerant
q21 : being honest
q22 : rejecting adultery
q23 : not to use other’s property
q24 : not breaking promises

the value of standardized loading factor as the result of the second-order cfa is presented in figure 2, while the t-value and r2 of the instrument indicators are shown in table 3. the result of the second-order confirmatory factor analysis indicates that the 24 indicators in the conceptual construct of the moslems’ spiritual attitude are considered significant based on the t-value index. it is shown by the t-value >1.96 in which the lowest t-value is 3.83 (q5) and the highest is 5.78 (q21).
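the decision rule from the method section (at least two of the three criteria: rmsea ≤0.08, p-value ≥0.05, gfi ≥0.90) can be sketched in python as follows. the function names are hypothetical; the rmsea and gfi values are the ones reported above, and the chi-square p-value is left unset because it is not reported in this section.

```python
# Thresholds as stated in the article's fit criteria.
RMSEA_MAX = 0.08
GFI_MIN = 0.90
P_VALUE_MIN = 0.05


def satisfied_fit_criteria(rmsea=None, gfi=None, p_value=None):
    """Return the names of the fit criteria that are met; indices left as None are skipped."""
    satisfied = []
    if rmsea is not None and rmsea <= RMSEA_MAX:
        satisfied.append("rmsea")
    if gfi is not None and gfi >= GFI_MIN:
        satisfied.append("gfi")
    if p_value is not None and p_value >= P_VALUE_MIN:
        satisfied.append("p_value")
    return satisfied


def model_fits(rmsea=None, gfi=None, p_value=None, required=2):
    """Apply the 'at least two of three criteria' rule."""
    return len(satisfied_fit_criteria(rmsea, gfi, p_value)) >= required


# Values reported for the second-order cfa; the p-value is not given here.
reported = {"rmsea": 0.032, "gfi": 0.91}
met = satisfied_fit_criteria(**reported)
fits = model_fits(**reported)
print(met, fits)
```

with the two reported indices both inside their cut-offs, the rule is already satisfied regardless of the chi-square p-value.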
another piece of evidence is the value of the standardized loading factor from the second-order cfa, shown in table 3. all indicators have slf >0.30 (the lowest slf is 0.42 and the highest is 0.79). the t-values and the standardized loading factors together indicate that the instrument has acceptable construct validity and is suitable to measure moslems’ spiritual attitude in education.

table 3. the result of second order cfa of moslems’ spiritual attitude

| indicator | loading factor | t-value | r2 | notes |
| --- | --- | --- | --- | --- |
| q1 | 0.51 | | 0.26 | reference var |
| q2 | 0.79 | 4.70 | 0.62 | indicator fit |
| q3 | 0.42 | | 0.17 | reference var |
| q4 | 0.67 | 4.59 | 0.44 | indicator fit |
| q5 | 0.69 | 3.83 | 0.47 | indicator fit |
| q6 | 0.45 | | 0.20 | reference var |
| q7 | 0.62 | 5.13 | 0.39 | indicator fit |
| q8 | 0.56 | 4.90 | 0.31 | indicator fit |
| q9 | 0.44 | | 0.19 | reference var |
| q10 | 0.60 | 4.16 | 0.36 | indicator fit |
| q11 | 0.65 | 4.86 | 0.43 | indicator fit |
| q12 | 0.42 | 4.34 | 0.17 | indicator fit |
| q13 | 0.64 | | 0.41 | reference var |
| q14 | 0.46 | 4.19 | 0.21 | indicator fit |
| q15 | 0.61 | 4.77 | 0.37 | indicator fit |
| q16 | 0.46 | | 0.21 | reference var |
| q17 | 0.66 | 4.04 | 0.44 | indicator fit |
| q18 | 0.51 | | 0.26 | reference var |
| q19 | 0.49 | 5.28 | 0.24 | indicator fit |
| q20 | 0.48 | 4.76 | 0.23 | indicator fit |
| q21 | 0.64 | 5.78 | 0.41 | indicator fit |
| q22 | 0.48 | 4.82 | 0.23 | indicator fit |
| q23 | 0.54 | 5.07 | 0.29 | indicator fit |
| q24 | 0.55 | 5.14 | 0.30 | indicator fit |

instrument reliability

reliability is an essential characteristic of a good test and of the scores obtained from it. reliability is a prerequisite for instrument validity. the investigation of both validity evidence and reliability coefficient can be defined as the complementary aspects of identifying, estimating and interpreting different sources of variance in the scores (bachman, davidson, ryan, & choi, 1995).
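the construct reliability (cr) and average variance extracted (ave) used in this section follow the usual formulas cr = (Σslf)² / ((Σslf)² + Σerrorvar) and ave = Σ(slf²) / (Σ(slf²) + Σerrorvar). a minimal python sketch, using the rounded per-aspect values listed in tables 4 and 5; because those inputs are rounded, the results (about 0.89 and 0.54) may differ slightly in the third decimal from the reported 0.890 and 0.542. the function names are hypothetical.

```python
def construct_reliability(loadings, error_variances):
    """cr = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    squared_sum = sum(loadings) ** 2
    return squared_sum / (squared_sum + sum(error_variances))


def average_variance_extracted(loadings, error_variances):
    """ave = sum of squared loadings / (sum of squared loadings + sum of error variances)."""
    sum_of_squares = sum(loading ** 2 for loading in loadings)
    return sum_of_squares / (sum_of_squares + sum(error_variances))


# Standardized loadings and error variances for sa1..sa7 as tabulated (rounded).
slf = [0.58, 0.80, 0.73, 0.73, 0.75, 0.73, 0.80]
error_var = [0.66, 0.35, 0.47, 0.47, 0.44, 0.47, 0.36]

cr = construct_reliability(slf, error_var)
ave = average_variance_extracted(slf, error_var)
print(f"cr = {cr:.3f}, ave = {ave:.3f}")
```

cr squares the sum of the loadings, so it rewards many moderately strong indicators, whereas ave sums the squared loadings and therefore asks whether each aspect individually explains more variance than its error term.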
the reliability coefficient of the instrument was employed to test the consistency of the measurement and was used as an estimation of how much the instrument would give the same result under the same conditions. the estimation of reliability in this research was evaluated with the construct reliability (cr) and the average variance extracted (ave). the index values of the construct reliability coefficients are presented in table 4, and the average variance extracted for the instrument is shown in table 5.

table 4. the coefficient of construct reliability of the instrument

| aspects | slf | errorvar |
| --- | --- | --- |
| sa1 | 0.58 | 0.66 |
| sa2 | 0.80 | 0.35 |
| sa3 | 0.73 | 0.47 |
| sa4 | 0.73 | 0.47 |
| sa5 | 0.75 | 0.44 |
| sa6 | 0.73 | 0.47 |
| sa7 | 0.80 | 0.36 |

(Σslf)² = 26.21; Σerrorvar = 3.19; cr = 0.890

the coefficient of construct reliability shown in table 4 for the instrument is 0.890. the cr formula was used to assess the internal consistency of the instrument and how well the indicators measure the construct of the instrument. the result of the cr computation shows that the instrument has a high reliability index and is considered to have good consistency in measuring moslems’ spiritual attitude in education.

table 5. the coefficient of average variance extracted (ave) of the instrument

| aspects | slf² | errorvar |
| --- | --- | --- |
| sa1 | 0.34 | 0.66 |
| sa2 | 0.64 | 0.35 |
| sa3 | 0.53 | 0.47 |
| sa4 | 0.53 | 0.47 |
| sa5 | 0.56 | 0.44 |
| sa6 | 0.53 | 0.47 |
| sa7 | 0.64 | 0.36 |

Σ(slf²) = 3.78; Σerrorvar = 3.19; ave = 0.542

the average variance extracted (ave) measures the amount of variance captured by a construct relative to the variance due to measurement error. table 5 shows that the developed instrument has a moderately good average variance extracted, with a computed value of 0.542 (slightly above 0.50) for the entire set of subscales in the moslems’ spiritual attitude instrument.

conclusion

the construct of moslems’ spiritual attitude is determined by islamic religious terms.
the instrument of the study is an inventory, defined as a self-report model of the summated rating scale and designed in the multiple-choice form of questionnaire using three-point (1–3) alternative responses. the instrument contains a 35-item questionnaire called the moslems’ spiritual attitude scale. the moslems’ spiritual attitude dimension in education means students’ attitude or personal feeling in self-condition related to their religious experience, ideology, and private practice in terms of relation with god and interaction with him. the moslems’ spiritual attitude consists of seven aspects as the latent variables, developed into 24 indicators as the observed variables.

the construct validity of the moslems’ spiritual attitude scale is considered moderately high according to the standardized loading factor (slf) values. the slf values from the second-order confirmatory factor analysis for the 24 instrument indicators are above 0.30, ranging from 0.42 to 0.79. the computed coefficient of construct reliability (cr) of the instrument is 0.890, while the average variance extracted (ave) is 0.542. the fit statistics indicate model fit, with the root mean square error of approximation (rmsea) = 0.032 (<0.08) and goodness of fit index (gfi) = 0.91. the result indicates that the construct of the measurement is suitable to the data. the model is also suitable for estimating the population covariance matrix, which means that it does not differ from that of the sample respondents in this study.
according to the research findings, it can be concluded that moslems’ spiritual attitude is constructed by seven aspects, namely resignation (tawakkal), sincerity (ikhlas), thankfulness (syukur), patience (shabr), fear (khauf), hopefulness (raja’), and righteousness (takwa).

references

azarsa, t., davoodi, a., markani, a. k., gahramanian, a., & vargaeei, a. (2015). spiritual wellbeing, attitude toward spiritual care and its relationship with spiritual care competence among critical care nurses. journal of caring sciences, 4(4), 309–320. https://doi.org/10.15171/jcs.2015.031
bachman, l. f., davidson, f., ryan, k., & choi, i. c. (1995). an investigation into the comparability of two tests of english as a foreign language: the cambridge toefl comparability study. cambridge: cambridge university press.
bidjari, a. f. (2011). attitude and social representation. procedia social and behavioral sciences, 30, 1593–1597. https://doi.org/10.1016/j.sbspro.2011.10.309
brown, c. g. (2007). secularization, the growth of militancy and the spiritual revolution: religious change and gender power in britain, 1901-2001. historical research, 80, 393–418. https://doi.org/10.1111/j.1468-2281.2007.00417.x
bryant, a. n., choi, j. y., & yasuno, m. (2003). understanding the religious and spiritual dimensions of students’ lives in the first year of college. journal of college student development, 44(6), 723–745. https://doi.org/10.1353/csd.2003.0063
cook, c. c. (2004). addiction and spirituality. addiction, 99(5), 539–551. https://doi.org/10.1111/j.1360-0443.2004.00715.x
fishbein, m., & ajzen, i. (1975). belief, attitude, intention, and behavior: an introduction to theory and research. reading, ma: addison-wesley.
fisher, j. (2013). assessing spiritual wellbeing: relating with god explains greatest variance in spiritual well-being among australian youth. international journal of children’s spirituality, 18(4), 306–317. https://doi.org/10.1080/1364436x.2013.844106
fisher, j. (2016).
selecting the best version of shalom to assess spiritual well-being. religions, 7(5), 45. https:// doi.org/10.3390/rel7050045 ghorbani, n., watson, p. j., geranmayepour, s., & chen, z. (2014). measuring muslim spirituality: relationships of muslim experiential religiousness with religious and psychological adjustment in iran. journal of muslim mental health, reid (research and evaluation in education), 4(1), 2018 issn 2460-6995 developing an instrument for measuring… 43 safa’at ariful hudha & djemari mardapi 8(1). https://doi.org/http://dx.doi.org /10.3998/jmmh.10381607.0008.105 good, m., & willoughby, t. (2006). the role of spirituality versus religiosity in adolescent psychosocial adjustment. journal of youth and adolescence, 35(1), 39–53. https://doi.org/ 10.1007/s10964-005-9018-1 hair, j. f., black, w. c., babin, b. j., & anderson, r. e. (2010). multivariate data analysis (7th ed.). upper saddle river, nj: prentice hall. hill, p. c., pargament, k. i., hood, r. w., mccullough, j. m. e., swyers, j. p., larson, d. b., & zinnbauer, b. j. (2000). conceptualizing religion and spirituality: points of commonality, points of departure. journal for the theory of social behaviour, 30(1), 51–77. https:// doi.org/10.1111/1468-5914.00119 huber, s., & huber, o. w. (2012). the centrality of religiosity scale (crs). religions, 3(3), 710–724. https:// doi.org/10.3390/rel3030710 hungelmann, j., kenkel-rossi, e., klassen, l., & stollenwerk, r. (1996). focus on spiritual well-being: harmonious interconnectedness of mind-bodyspirit—use of the jarel spiritual wellbeing scale: assessment of spiritual well-being is essential to the health of individuals. geriatric nursing, 17(6), 262– 266. https://doi.org/10.1016/s01974572(96)80238-2 igbaria, m., zinatelli, n., cragg, p., & cavaye, a. l. m. (1997). personal computing acceptance factors in small firms: a structural equation model. mis quarterly, 21(3), 279–305. https:// doi.org/10.2307/249498 koenig, h. g. (2009). 
research on religion, spirituality, and mental health: a review. the canadian journal of psychiatry, 54(5), 283–291. https://doi.org/ 10.1177/070674370905400502 kusaeri, & suprananto. (2012). pengukuran dan penelitian pendidikan. yogyakarta: graha ilmu. mardapi, d. (2017). pengukuran, penilaian, dan evaluasi pendidikan (2nd ed.). yogyakarta: parama publishing. min, s., & yun, s. (2015). a study on the differences between spiritual wellbeing and sexual attitude considering the type of university. indian journal of science and technology, 8(s1), 54–58. https://doi. org/10.17485/ijst/2015/v8is1/57582 ministry of religious affairs of republic of indonesia. (2014). model penilaian pencapaian kompetensi peserta didik madrasah tsanawiyah (mts). jakarta: directorate general of islamic education. moberg, d. o. (2002). assessing and measuring spirituality: confronting dilemmas of universal and particular evaluative criteria. journal of adult development, 9(1), 47–60. https://doi.org /10.1023/a:1013877201375 nikfarjam, m., heidari-soureshjani, s., khoshdel, a., asmand, p., & ganji, f. (2017). comparison of spiritual wellbeing and social health among the students attending group and individual religious rites. world family medicine journal, 15(8), 160–165. https://doi. org/10.5742/mewfm.2017.93071 salkind, n. j. (2000). exploring research. michigan, mi: prentice hall. sax, g. (1980). principles of educational and psychological measurement and evaluation. california, ca: wadsworth. sheridan, m. j., & hemert, k. a. (1999). the role of religion and spirituality in social work education and practice: a survey of student views and experiences. journal of social work education, 35(1), 125–141. https://doi.org/ 10.1080/10437797.1999.10778952 shodiq, s., zamroni, z., & kumaidi, k. (2016). developing an instrument for measuring the faith of the students of islamic senior high school. reid (research and evaluation in education), 2(2), 181–193. 
https://doi.org/ 10.21831/reid.v2i2.11117 reid (research and evaluation in education), 4(1), 2018 issn 2460-6995 44 – developing an instrument for measuring... safa’at ariful hudha & djemari mardapi smither, r., & khorsandi, a. (2009). the implicit personality theory of islam. psychology of religion and spirituality, 1(2), 81–96. https://doi.org/ 10.1037/a0015737 soleimani, m. a., sharif, s. p., allen, k. a., yaghoobzadeh, a., nia, h. s., & gorgulu, o. (2017). psychometric properties of the persian version of spiritual well-being scale in patients with acute myocardial infarction. journal of religion and health, 56(6), 1981–1997. https://doi.org/10.1007/s10943-0160305-9 suranto, s., muhyadi, m., & mardapi, d. (2014). pengembangan instrumen evaluasi uji kompetensi keahlian (ukk) administrasi perkantoran di smk. jurnal penelitian dan evaluasi pendidikan, 18(1), 98–114. https://doi.org/ 10.21831/pep.v18i1.2127 tanyi, r. a. (2002). towards clarification of the meaning of spirituality. journal of advanced nursing, 39(5), 500–509. https://doi.org/10.1046/j.13652648.2002.02315.x copyright © 2018, reid (research and evaluation in education) issn 2460-6995 reid (research and evaluation in education), 4(2), 2018, 117-125 available online at: http://journal.uny.ac.id/index.php/reid developing assessment model for bandel attitudes based on the teachings of ki hadjar dewantara 1restituta estin ami wardani; 2supriyoko; *3yuli prihatni 1smp negeri 1 kalasan jl. jogja-solo km.14, tirtomartani, kalasan, sleman, yogyakarta 55571, indonesia 2,3universitas sarjanawiyata tamansiswa tuntungan, jl. batikan uh iii/1043, tahunan, umbulharjo, yogyakarta 55167, indonesia *corresponding author. 
e-mail: yuliku7781@gmail.com

submitted: 09 november 2018 | revised: 14 november 2018 | accepted: 22 november 2018

abstract

the study was aimed at (1) identifying indicators of bandel attitudes in the teachings of ki hadjar dewantara, and (2) finding out the results of the implementation of the developed bandel attitude assessment instrument. the study was developmental research in the affective domain using mardapi’s ten developmental steps. the subjects were 392 junior secondary school students selected by cluster random sampling, 57 for the limited-scale try-out and 335 for the wider-scale try-out. the data analysis techniques covered aiken content validity, concurrent validity, and cronbach’s alpha reliability. data from the instrument implementation were analyzed using descriptive statistics. findings show that (1) there are six indicators of the bandel attitude, developed into a self-assessment questionnaire of 24 items consisting of 12 common statements and 12 factual statements, all of which are valid and reliable; and (2) students’ scores in the implementation of the bandel assessment instrument fall into the very high category.

keywords: affective assessment, bandel attitude, ki hadjar dewantara

introduction

the true concept of education has been proposed by ki hadjar dewantara (khd). as indonesia’s father of education, khd maintains that education is an effort to advance the growth of good conduct (inner power, character), thinking (intellect), and the body (dewantara, 2013, pp. 14–15). this can be understood to mean that education is aimed at forming humans who have good conduct, think intellectually, and have a healthy body. this concept is in conformity with the functions of national education.
national education functions to develop the abilities and to form the character and civilization of the nation in the frame of educating the life of the nation, and to develop the potential of students to become persons who believe in and worship the one almighty god; behave nobly; are healthy, skillful, creative, and independent; and become citizens who are democratic and responsible (law of republic of indonesia no. 20 of 2003 on national education system, 2003). these two educational concepts are sufficient to develop excellent students. this excellence is reflected not only in cognitive thinking abilities and psychomotor skills but also in the character of the students. it is therefore important that character education be realized for the development of a great generation; as agboola and tsai (2012, p. 163) state, character education is a discipline that deliberately optimizes students’ ethical behavior.

in reality, these functions of education have not been achieved, inasmuch as education in indonesia places emphasis on the cognitive domain. students’ learning outcomes are also dominated by cognitive aspects. assessment of the affective aspects related to feelings and sensibilities has not been carried out optimally. the teaching-learning processes, therefore, must pay more attention to affective aspects. olatunji (2013) states that affective learning is related to the learners’ attitudes, thoughts, and behaviors in the future. this learning mode is closely related to students’ feelings when learning. galo (2014, as cited in setiawan, 2017) shows the importance of the instrument in the assessment of the affective domain.

the government has taken various steps in the efforts to develop affective evaluation.
two of these efforts are the revision of curriculum 2006 into curriculum 2013 and the launch of the enforcement of character education (eec) in 2016. learning and evaluation processes are two essential components of the implementation of curriculum 2013. quality learning is learning that is able to achieve the basic competencies prescribed by the curriculum. quality evaluation is evaluation that is able to measure, assess, and evaluate the achievement of those basic competencies. according to kumaidi (2017, as cited in setiawan, 2017, p. 3), quality learning needs the support of quality assessment.

the ministry of education and culture of republic of indonesia (2016, pp. 1–2) states that the monitoring and evaluation of the implementation of curriculum 2013 in 2014 found that one of the difficulties of teachers at the junior secondary level was related to evaluation. approximately 60% of the respondents reported that they were not able to plan, develop, administer, analyze, report, and use evaluation well. the main difficulties were related to formulating indicators, writing test items, and conducting affective evaluation using various techniques.

considering these facts, it is important that an instrument package be developed for evaluating students’ attitudes. the development of the instrument is focused on junior secondary school students in relation to the 'obstinate' attitude, that is, having strong persistence, perseverance, and an unyielding drive to succeed. the problems to be addressed are: (1) what are the indicators for developing an instrument to measure obstinacy? (2) what is the students’ obstinacy like as measured by the developed assessment model?

assessment or evaluation, according to the regulation of the minister of education and culture of republic of indonesia no. 53 of 2015, is the process of gathering data/information about students’ learning achievement in the aspects of attitudes, knowledge, and skills.
assessment describes an individual’s characteristics by assessing the individual’s attitudes and mental processes, which can be done through observation, interviews, rating scales, checklists, projective techniques, and tests (aiken, 2003, p. 54). according to the ministry of education and culture of republic of indonesia (2016), affective evaluation is done to find out the development of the spiritual and social attitudes of the learner. it is done to determine the students’ achievement in spiritual and social values at the levels of receiving, responding, valuing, characterizing, and implementing.

the ministry of education and culture of republic of indonesia (2017, pp. 8–9) has simplified the 18 character values into five main character values: (1) religiosity, (2) nationalism, (3) autonomy, (4) solidarity, and (5) integrity. each main value is divided into several sub-values. for example, autonomy is subdivided into work ethos, toughness, perseverance, professionalism, creativity, truth, and lifelong learning. at present, all of the autonomy sub-values are important to instill and reinforce so that students will struggle persistently to attain education and reach their ambitions.

ki hadjar dewantara, a phenomenal avant-garde figure known for his mental and intellectual sharpness, laid pillars of educational and cultural concepts that amounted to a quantum leap. this intellectual inheritance includes, among others, thoughts on national education and concepts of culture that have stood the test of time (susanto & retnaningsih, 2018, p. 81). one part of this inheritance is the saying 'ngandel-kendel-bandel-kandel', meaning that a free person who is struggling for independence should be ngandel (self-confident), kendel (risk-taking, brave), bandel (obstinate, not giving up when falling), and kandel (immune to negative criticism) (soenarno, 2012, p. 35).
the sub-values of toughness, perseverance, and hard work in the eec are in line with one of ki hadjar dewantara’s teachings, namely bandel (obstinate). etymologically, the word bandel originates from javanese, meaning ‘strong’. in indonesian, the word is defined as ‘able to bear pain, not easily weeping’. the word bandel is synonymous with powerful, unyielding, and resourceful. in the great dictionary of the indonesian language (kamus besar bahasa indonesia), the word ‘tangguh’ means (1) ‘not easily defeated’, ‘dependable’; (2) ‘very strong in self-position’; and (3) ‘brave and enduring’ (of pain, etc.) (department of national education, 2010, p. 1138). retno and haryanto (2016, p. 27) found six indicators of being resourceful, namely (1) a spirit of being unyielding and not giving up, (2) being serious in doing a task to achieve objectives or ambitions, (3) discipline, (4) diligence, (5) not being afraid of failing, and (6) optimism. therefore, the attitude of being tough is realized in working hard, persevering, and not being afraid of failing, as expressed in the ministry of education and culture’s (2017, p. 9) document on the enforcement of character education (eec).

according to dewantara (2013), the word bandel means being obstinate and patient. bandel means not giving up when falling (soenarno, 2012, p. 35). in kamus besar bahasa indonesia, the word ‘tahan uji’ means (1) having evidence of being strong and (2) willing to be tested (department of national education, 2010), while the word ‘tawakal’ means (1) surrendering to god’s will and (2) fully trusting god (in suffering, etc.) (department of national education, 2010, p. 1150).

ki hadjar dewantara sees moral education as being of utmost importance.
moral education is everything that parents do to support the advancement of their children’s lives, in the sense of improving the growth of all of their children’s potential, mentally and physically (soenarno, 2014, p. 15). by having good behavior, every person will be able to stand as an independent person who can direct and control himself.

this assessment instrument for measuring attitudes can be used by teachers and students in the class. teachers will be able to carry out their jobs easily and correctly. in addition, students will be able to do self-evaluation honestly and easily.

the study is aimed at, first, obtaining accurate indicators as a basis for developing the bandel assessment model following ki hadjar dewantara’s teachings and, second, finding out the results of the implementation of the bandel attitude as measured by the developed assessment instrument.

method

the study is developmental research, that is, research to develop a product and evaluate the effectiveness of the product (sugiyono, 2010, p. 407). the development model is the one suggested by mardapi (2008, pp. 109–120), consisting of (1) determining the instrument specification, (2) writing the items, (3) determining the scale of the instrument, (4) deciding on the scoring system, (5) reviewing the instrument, (6) conducting try-outs, (7) analyzing the items, (8) packaging the instrument, (9) administering the test, and (10) analyzing the results of the test.

the design for the try-out was constructed through theoretical reviews of education, of the bandel (obstinate) attitude as one of khd’s teachings, and of assessment according to the regulation of the minister of education and culture of republic of indonesia no. 23 of 2016. an initial observation was also made of the assessment instrument used so far by teachers. based on these reviews and this observation, an initial instrument draft was constructed.
the initial instrument draft consisted of formulations of operational definitions, indicators, questionnaire items, and measurement scales. the initial draft was discussed in consultation with the advisors. the next step was the validation of the contents by experts and practitioners using the aiken approach. this was conducted by giving the initial draft to the experts for quantitative evaluation. the aim of the validation was to find out whether or not the instrument had adequate validity so that it could be used in the next steps.

the next step was a limited-scale try-out (readability) involving 57 students. the results of the limited-scale try-out, as empirical validation i, were used as a basis for revising the instrument. the revised instrument was then administered, as empirical validation ii, to a sample of 335 students from seven junior secondary schools in the kalasan district, out of a total of 6,200 students. the sample was drawn by cluster random sampling, using the krejcie and morgan table as the basis.

the construct of the bandel instrument was developed from analyses of ki hadjar dewantara’s theories and subjected to expert judgment. the content validity data from the experts were analyzed using the item validity index suggested by aiken (kumaidi, 2014, p. 4; setiawan, 2017, p. 36). the reliability of the non-test instrument was estimated using cronbach’s alpha, with a coefficient above 0.700 as the criterion (nunnally jr., 1981, p. 245). finally, the instrument was subjected to a concurrent validity analysis. the results of the test administration were analyzed descriptively using excel and spss 17.0.
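the reliability criterion mentioned above can be illustrated with a small sketch of cronbach's alpha, computed as k/(k − 1) × (1 − Σ item variances / variance of total scores). the study itself used spss; the function name and the response data below are hypothetical and serve only to show the calculation.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses (4 respondents x 3 items, scored 1-4), NOT study data.
responses = [
    [4, 4, 3],
    [3, 3, 3],
    [2, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}; above the 0.70 criterion: {alpha > 0.70}")
```

the sample variance (n − 1 denominator) is used for both the items and the totals, so the ratio is unaffected by that choice as long as it is applied consistently.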
findings and discussion

findings

results of the instrument development

initial draft

the bandel assessment instrument was constructed using relevant theories of affective assessment and ki hadjar dewantara’s teachings as the basis. a focus group discussion (fgd) was conducted to obtain a picture of the existing affective assessment instrument used so far by teachers. the results of the fgd were used in writing the instrument items. subsequently, the steps of the instrument development were carried out as outlined by mardapi (2008). step 1 through step 5 were carried out, from the construction of the instrument specification up to the review of the instrument. the specification table was developed from the theories and concepts of the term bandel based on ki hadjar dewantara’s teachings. the results were subjected to the experts’ judgment to produce six indicators, namely (1) hard-working, (2) enthusiasm, (3) patience, (4) diligence, (5) unyielding, and (6) perseverance. these six indicators were then developed into the items of the bandel assessment model, consisting of 12 common statement items and 12 factual statement items. after the initial reviews, revisions were made, mainly sharpening the terms for the indicators, replacing inappropriate vocabulary, and fixing ambiguous statements.

content validity

the item statements from the initial draft were submitted to four experts for validation of the match between the items and the indicators. the four experts were specialists in tamansiswa teachings, educational psychology, educational evaluation, and instrument assessment, respectively. two practitioners were also asked to validate the first draft: a guidance-counseling teacher and an indonesian teacher. the aiken approach was used. the fit between the 24 item statements and the six instrument indicators was represented by the aiken indexes.
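for readers unfamiliar with the index, aiken's v for one item is v = Σ(r − lo) / (n(c − 1)), where r is each rater's score, lo the lowest possible rating, n the number of raters, and c the number of rating categories. the sketch below assumes six raters (the four experts and two practitioners) and a 1–4 rating scale. the scale is an assumption: the paper does not state it, although the index values reported for this instrument (0.778, 0.833, 0.889, 0.944, 1.000) are all multiples of 1/18, which is consistent with that setup. the ratings themselves are hypothetical.

```python
def aiken_v(ratings, lowest=1, categories=4):
    """Aiken's V for one item: sum(r - lowest) / (n * (categories - 1))."""
    n = len(ratings)
    s = sum(r - lowest for r in ratings)
    return s / (n * (categories - 1))


# Hypothetical ratings from six raters on an ASSUMED 1-4 scale.
ratings_item = [4, 4, 4, 4, 4, 3]
v = aiken_v(ratings_item)
print(f"V = {v:.3f}; above 0.750: {v > 0.750}")
```

under these assumptions, a single rater giving 3 instead of 4 lowers the index from 1.000 to 17/18, which is why the reported values move in steps of about 0.056.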
all of the aiken indexes are above 0.750, as seen in table 1.

table 1. aiken indexes for the fit between statement items and instrument indicators of bandel attitudes

| no. | indicator | item | aiken index |
|---|---|---|---|
| 1. | hard-working | v1.p | 0.944 |
| | | v1.n | 0.833 |
| | | f1.p | 0.833 |
| | | f1.n | 0.944 |
| 2. | enthusiasm | v2.p | 0.833 |
| | | v2.n | 0.944 |
| | | f2.p | 1.000 |
| | | f2.n | 0.889 |
| 3. | patience | v3.p | 1.000 |
| | | v3.n | 0.944 |
| | | f3.p | 1.000 |
| | | f3.n | 0.889 |
| 4. | diligence | v4.p | 1.000 |
| | | v4.n | 0.889 |
| | | f4.p | 1.000 |
| | | f4.n | 0.778 |
| 5. | unyielding | v5.p | 0.889 |
| | | v5.n | 0.833 |
| | | f5.p | 1.000 |
| | | f5.n | 0.778 |
| 6. | perseverance | v6.p | 0.944 |
| | | v6.n | 1.000 |
| | | f6.p | 0.778 |
| | | f6.n | 0.778 |

from the aiken indexes in table 1, it can be stated that all the items are in the good category.

limited-scale try-out (empirical validation i)

once all the items were known to be in the good category, the readability test was conducted. the result of the limited-scale validation is also called empirical validation i. the limited-scale validation involved 57 students of grades vii, viii, and ix of junior high schools in the kalasan district. the results of the try-out can be seen in table 2.

table 2. results of the readability test

| no. | criteria | understanding (total) | understanding (%) | ease (total) | ease (%) |
|---|---|---|---|---|---|
| 1. | good | 45 | 78.95 | 48 | 84.21 |
| 2. | medium | 7 | 12.28 | 6 | 10.53 |
| 3. | poor | 5 | 8.77 | 3 | 5.26 |
| | total | 57 | 100 | 57 | 100 |

according to table 2, most students, 45 students (78.95%), are able to understand more than 75% of the instrument items. on the ease aspect, 48 students (84.21%) rated the instrument as good. this shows that the instrument is good enough to be used, although it underwent revision in word choice and terms as suggested by the students (table 3).

table 3. results of revision

| item no. | before | after |
|---|---|---|
| 2 | student continues to practice until he can really he can do the test correctly. | student continues to practice until he really can do the test correctly. |
| 4 | student only studies when there will be an exam. | student will study when there is an exam. |
| 9 | there is a tendency for students to play with the cellphone rather than to study. | students prefer playing with the cellphone to studying. |
| 16 | i don’t like to study lesson material which is very difficult. | i don’t want to study when the lesson material is difficult. |
| 23 | i don’t want to do home assignment that is hard and difficult. | i only do easy home assignment. |

finally, the instrument was set up for the wider-scale try-out.

wider-scale try-out (empirical validation ii)

the wider-scale try-out was conducted in seven junior secondary schools in kalasan involving 335 students. the result is called empirical validation ii. the results show that all 24 items were valid, consisting of 12 common statements and 12 factual statements. reliability was estimated using cronbach's alpha, and the reliability coefficient of the bandel instrument was found to be 0.850. this means that the instrument is reliable, since its reliability coefficient is > 0.70. the results of the reliability check can be seen in table 4.

table 4. results of the estimation of the instrument reliability

| cronbach's alpha | n of items |
|---|---|
| .850 | 24 |

concurrent validity

concurrent validity shows how each subject fits into groups which are conceptually different in terms of the treatment or decision that will be taken. in other words, the test of concurrent validity is used to find out whether or not there is consistency between attitudes and behaviors (haryanto, 1994, p. 46). the results of the test for concurrent validity can be seen in table 5.

table 5. matches between common statements and factual statements

| indicator | common (valensi) item no. | factual item no. | r |
|---|---|---|---|
| hard-working | 1, 4 | 13, 16 | 0.179* |
| enthusiasm | 3, 10 | 15, 22 | 0.124** |
| patience | 7, 12 | 19, 24 | 0.143** |
| diligence | 5, 9 | 17, 21 | 0.129* |
| unyielding | 6, 8 | 18, 20 | 0.360** |
| perseverance | 2, 11 | 14, 23 | 0.578** |

results of the bandel instrument implementation

the bandel assessment instrument was implemented with 335 junior high school students from different areas in the kalasan district. because of time limitations, and considering that grade ix students were preparing for the practice exam, school exam, and national exam during april-may 2018, the same subjects were involved twice. in other words, the 335 students who participated in empirical validation ii were simultaneously the subjects of the implementation phase. the results of the assessment were analyzed descriptively using the spss 17.0 software program. descriptive analysis was also carried out for each indicator. the results are presented in table 6.

table 6. results of the descriptive analyses of the assessment implementation

| statistic | implementation |
|---|---|
| n valid | 335 |
| mean | 80.5552 |
| median | 81.0000 |
| std. deviation | 6.32662 |
| minimum | 54.00 |
| maximum | 93.00 |

in table 6, the mean score of the obstinate attitudes of the junior secondary school students in the district of kalasan is 80.555. the minimum score is 54.00 and the maximum score is 93.00. the median is 81.00 and the standard deviation is 6.327. intervals are plotted for ideal categories using the determined formula. five levels result from the calculation, categorized as very high (vh), high (h), medium (m), low (l), and very low (vl). these results are presented in table 7.

table 7. ideal categorization

| interval | category | absolute freq. | relative freq. |
|---|---|---|---|
| 78.00 up to 96.00 | very high (vh) | 230 | 68.65% |
| 66.00 up to 78.00 | high (h) | 99 | 29.55% |
| 54.00 up to 66.00 | medium (m) | 6 | 1.79% |
| 42.00 up to 54.00 | low (l) | 0 | 0% |
| 24.00 up to 42.00 | very low (vl) | 0 | 0% |
| total | | 335 | 100% |

table 7 shows that the highest frequency of the results of the bandel assessment is in the very high (vh) category, with 68.65%. the high (h) category follows with 29.55% and the medium (m) category with 1.79%. no student has bandel competencies in the low (l) or very low (vl) categories. these results are also presented as a diagram in figure 1.

figure 1. results of the descriptive analyses of the implementation of the bandel assessment

discussion

the development of the assessment instrument for the students’ bandel attitudes is based on the teachings of ki hadjar dewantara. from the theoretical and conceptual reviews, six indicators of the bandel attitudes are found, namely hard-working, enthusiasm, patience, diligence, unyielding, and perseverance. these six indicators are developed into the specification table of the assessment model. self-assessment statements are written from the specification table, to be responded to honestly by the students. the resulting assessment model is a questionnaire with 24 items in the form of 12 common statements (attitudes) and 12 factual statements (behaviors), each set consisting of six positive statements and six negative statements. a modified likert scale is used, with response options scored from 1 to 4.

the assessment of the bandel attitudes has an expressive function. this means that the common items correlate with the factual items, and that both reflect the attitudes and behaviors of the students with respect to bandel characteristics. the first draft of the instrument is submitted for consultation to education experts and tamansiswa experts.
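returning to the ideal categorization in table 7, the interval bounds can be reproduced from the score range alone. the sketch below assumes the common rule based on the ideal mean mi = (max + min)/2 and the ideal standard deviation sdi = (max − min)/6, with cut-offs at mi ± 0.5 sdi and mi ± 1.5 sdi. this rule is an assumption: the paper refers only to "the determined formula", although for 24 items scored 1 to 4 the rule yields exactly the cut-offs 42, 54, 66, and 78 shown in table 7. treating each lower bound as inclusive is also an assumption, chosen because the minimum observed score of 54 is counted as medium. the function names are hypothetical.

```python
def ideal_cutoffs(n_items=24, low=1, high=4):
    """Return the four cut-offs that separate the five ideal categories."""
    min_score = n_items * low            # 24 for this instrument
    max_score = n_items * high           # 96 for this instrument
    mi = (max_score + min_score) / 2     # ideal mean
    sdi = (max_score - min_score) / 6    # ideal standard deviation
    return [mi - 1.5 * sdi, mi - 0.5 * sdi, mi + 0.5 * sdi, mi + 1.5 * sdi]


def ideal_category(score, cutoffs):
    """Map a total score to a category; lower bounds are treated as inclusive."""
    labels = ["very low", "low", "medium", "high", "very high"]
    level = sum(score >= c for c in cutoffs)  # number of cut-offs reached
    return labels[level]


cutoffs = ideal_cutoffs()
print(cutoffs)
print(ideal_category(80.5552, cutoffs))  # category of the reported mean score
```

applied to the reported mean of 80.5552, the rule places the group in the very high category, in line with the interpretation given in the paper.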
inputs and suggestions from the experts are used to revise the draft. the result is the construction of an initial assessment instrument for the bandel indicators. the initial items are then subjected to expert judgment for content validity by four experts in educational evaluation and educational psychology and two practitioners (one guidance-counseling teacher and one indonesian teacher). the results show that all items are in the good category, meaning that they fit the indicators, each with an aiken index of > 0.750. nevertheless, minor revisions are made to some of the statements as suggested by the experts and practitioners.

after the instrument has been revised, the try-outs are conducted. the first is a limited-scale try-out (readability check) with 57 students of grades vii, viii, and ix of junior high schools in the kalasan district, taken by random sampling. this is empirical validation i, focusing on readability with the two aspects of understanding and ease. the understanding check is to see how far the statements are understood by students, e.g. whether or not they are ambiguous in meaning. meanwhile, the ease aspect is to see how far the vocabulary is known and understood by students. the results show that, out of the 57 students, 45 (78.95%) are able to understand more than 75% of the items. on the ease aspect, 48 students (84.21%) rate the instrument as good. these results show that the instrument can be understood by the students, so that it is feasible to use in the wider-scale try-out. a minor revision was made, however, in word choice and terms, in accordance with students’ feedback.

the wider-scale try-out is conducted with 335 students of grades vii, viii, and ix of the junior secondary schools in the kalasan district. item validity and reliability are computed from the results of this wider try-out. the results of the validity test show that 24 items are valid, consisting of 12 common items and 12 factual items.
it can be stated that all the items are valid. they are then subjected to the reliability test. the reliability check produces a coefficient of 0.850, which means that the instrument is reliable. the next step is to conduct content validation to see whether or not the instrument items represent the indicators being measured. it is found that all the items do represent the indicators. subsequently, a concurrent validity check is conducted to see whether there is consistency between the attitudes and the behaviors. the results show that the scores on the common statements correlate with the scores on the factual statements, indicating that there is consistency between the attitudes and behaviors.

all validation tests have been carried out, and the results show that the bandel assessment instrument is valid and reliable. the instrument has fulfilled the requirements of being a standardized instrument. the last step is setting up the final version of the instrument, ready to be administered.

the implementation of the bandel measurement using the developed product gives the following results. the mean score is 80.555, which is above 78.00. this can be interpreted to mean that the students’ score on the bandel attitudes in the academic year of 2017/2018 is in the very high (vh) category. the same result is found for the six bandel indicators, namely hard-working, enthusiasm, patience, diligence, unyielding, and perseverance, which also give scores in the very high category.

conclusion and suggestions

conclusion

based on the concept of the bandel attitude in the teachings of ki hadjar dewantara, six indicators can be identified to develop the bandel assessment instrument: hard-working, enthusiasm, patience, diligence, unyielding, and perseverance.
the instrument is developed in the format of a self-assessment questionnaire consisting of 24 statement items (12 common statements and 12 factual statements). the findings show that the developed instrument is good. also, the concurrent validation shows that there is consistency between students’ attitudes and behaviors. a standardized instrument in the very good category has thus been developed to measure the bandel attitudes of junior secondary school students.

suggestions

it is suggested that the related parties, especially junior secondary school teachers, use the developed instrument to assess their students’ levels of bandel attitudes. it is also suggested that teachers understand, have high enthusiasm for, and work hard at developing evaluation instruments for the affective domain so that evaluation results can be obtained for future classroom purposes. as a result, teachers will be able to do their jobs professionally, in accord with the demands of the curriculum and 21st-century educational challenges. for educational experts and researchers, the results of this study can be used as reference material for producing assessment instruments for other components of the affective domain, especially the five indicators (eec) prescribed by the ministry of education and other noble values in the teaching of ki hadjar dewantara.

references

agboola, a., & tsai, k. c. (2012). bring character education into classroom. european journal of educational research, 1(2), 163–170.
aiken, l. r. (2003). psychological testing and assessment (11th ed.). boston, ma: allyn and bacon.
department of national education. (2010). the great dictionary of the indonesian language (kamus besar bahasa indonesia). jakarta: language center, department of national education.
dewantara, k. h. (2013). pemikiran, konsepsi, keteladanan, sikap merdeka i (pendidikan). yogyakarta: ust press & majelis luhur persatuan tamansiswa.
haryanto, s. (1994).
pengantar teori pengukuran kepribadian. surakarta: sebelas maret university press. kumaidi. (2014). validitas dan pemvalidasian instrumen penilaian karakter. in seminar psikometri fakultas psikologi universitas muhammadiyah surakarta. surakarta: universitas muhammadiyah surakarta. law of republic of indonesia no. 20 of 2003 on national education system (2003). mardapi, d. (2008). teknik penyusunan instrumen tes dan nontes. yogyakarta: mitra cendekia. ministry of education and culture. (2016). modul pengembangan instrumen penilaian oleh pendidikan sekolah menengah pertama. jakarta: ministry of education and culture of republic of indonesia. ministry of education and culture. (2017). konsep dan pedoman penguatan pendidikan karakter. jakarta: ministry of education and culture of republic of indonesia. nunnally jr., j. c. (1981). introduction to psychological measurement. new york, ny: mcgraw-hill. reid (research and evaluation in education), 4(2), 2018 issn 2460-6995 125 – developing assessment model for bandel... restituta estin ami wardani, supriyoko, & yuli prihatni olatunji, m. o. (2013). teaching and assessing of affective characteristics: a critical missing link in online education. international journal on new trends in education and their implications, 4(1), 96– 107. regulation of the minister of education and culture of republic of indonesia no. 23 of 2016 on educational assessment standard (2016). regulation of the minister of education and culture of republic of indonesia no. 53 of 2015 on learning outcome assessment by educators and educator units on primary and secondary educational levels (2015). retno, a., & haryanto, s. (2016). pengembangan instrumen pengukuran nilai ulet peserta didik sma di sma negeri 1 buluspesantren. wiyata dharma: jurnal penelitian dan evaluasi pendidikan, 4(3). setiawan, a. (2017). pengembangan instrumen penilaian sikap sosial siswa pada pembelajaran tematik sekolah dasar. thesis. universitas negeri yogyakarta, yogyakarta. 
soenarno, h. (2012). ketamansiswaan 1: riwayat hidup, perjuangan, dan konsepsi. yogyakarta: majelis luhur persatuan tamansiswa.
soenarno, h. (2014). ketamansiswaan 3: pendidikan di tamansiswa. yogyakarta: majelis luhur persatuan tamansiswa.
sugiyono. (2010). metode penelitian kuantitatif, kualitatif, dan r & d. bandung: alfabeta.
susanto, m. r., & retnaningsih, r. (2018). melacak pemikiran avant garde ki hadjar dewantara melalui konsep pendidikan nasional sebagai fenomena quantum leap dalam perspektif filsafat organisme. in prosiding seminar nasional pendidikan (vol. 1). yogyakarta: direktorat pascasarjana pendidikan universitas sarjanawiyata tamansiswa.

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 6(1), 2020, 1-9
available online at: http://journal.uny.ac.id/index.php/reid

developing instruments for measuring the level of early childhood development

*¹i wayan gunartha; ²tajularipin sulaiman; ³siti partini suardiman; ⁴badrun kartowagiran
¹faculty of language and arts education, institut keguruan dan ilmu pendidikan pgri bali, jl. seroja, tonja, denpasar timur, kota denpasar, bali 80235, indonesia
²faculty of educational studies, universiti putra malaysia, persiaran masjid, 43400 serdang, selangor, malaysia
³faculty of teacher training and educational sciences, universitas ahmad dahlan, jl. ahmad yani (ringroad selatan), tamanan, banguntapan, bantul, yogyakarta 55166, indonesia
⁴faculty of engineering, universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
*corresponding author. e-mail: w.gunartha@yahoo.com
submitted: 11 january 2020 | revised: 17 april 2020 | accepted: 20 april 2020

abstract

the aims of the study were to: (1) develop a set of instruments to measure the level of early childhood development (kindergarten group b), and (2) assess the quality of the developed instruments. this study is developmental research.
the samples of the study were the students of kindergarten group b. the developed instrument was a set of questionnaires. instrument testing was carried out in three stages, with the number of subjects increasing at each stage. the validity of the questionnaires was analyzed using confirmatory factor analysis (cfa), and their reliability was estimated using composite reliability. the result of the study is a set of instruments for measuring the level of early childhood development, consisting of instruments to measure moral-religious, social-emotional, language, cognitive, and physical-motor development. based on the field study, all instruments have a good model fit, construct validity, and reliability that meet the academic requirements of early childhood education.

keywords: instrument development, early childhood, kindergarten, potential

how to cite: gunartha, i., sulaiman, t., suardiman, s., & kartowagiran, b. (2020). developing instruments for measuring the level of early childhood development. reid (research and evaluation in education), 6(1), 1-9. doi:https://doi.org/10.21831/reid.v6i1.21996

introduction

early childhood is the most important and fundamental period of life. therefore, it is often called the golden age. this period is also often called a sensitive period, the period of play, or the critical period, because it will affect the child's future life. according to woolfolk (2007, p. 23), human brain development begins approximately one month after conception. neurons appear with incredible speed, i.e. 50,000 to 1,000,000 per second for approximately three months. at birth, we have about 100 to 200 billion neurons, and each neuron has about 2,500 synapses. synapses that are unused or do not get stimulation from the environment are trimmed (synaptic pruning). berk (2007, p. 121) added that the complexity of connections between neurons will determine the child's level of intelligence.
the same thing was said by miller and cumming (rushton, 2011, p. 92). all of the aforementioned statements show that the growth and development of the brain (synapses) are determined by the stimuli provided to the child and the activities undertaken by the child. thus, children who are growing and developing must be stimulated through a variety of stimuli and appropriate activities. this is the important role of early childhood education (ece) as a form of physical and psychosocial stimulation, whether at home or in early childhood institutions, besides nutrition and health care. hence, according to valentine, thomson, and antcliff (2009, p. 196), in australia early education and parenting have even become a policy priority. appropriate stimulation, together with nutrition, parenting, and health services at an early age, will optimally develop all of children’s potential, including their physical, cognitive, language, art, social-emotional, self-discipline, religious-value, self-concept, and self-reliance potential. therefore, early childhood education is expected to contribute significantly to improving the quality of human resources (hr), which will make our nation high in quality and competitive in the future. the importance of the stimulation obtained by children in early childhood education institutions has also been shown empirically by many experts. samuelsson (2011, p. 109), in her research on the role of early childhood education, said that learning at an early age has an influence on the future, for example on success at school, and that attitude and attention are formed early on. mann and reynolds (2006, p.
153) concluded that preschool intervention correlates with a reduction in the incidence, frequency, and severity of delinquency at age 18. ashiabi (2007, pp. 205–206) states that many advantages can be gained by letting children play with other children. for example, sociodramatic play can improve a child's ability to imagine before acting, to take a role, and to show empathy and altruism, as well as the child's understanding of emotions and rules. moreover, negotiation and problem-solving skills also increase, as do the abilities to work cooperatively with others, share, exercise self-control, and work in a group. in other words, sociodramatic play can optimally enhance children’s social and emotional development. a similar study was conducted by beard and sugai (2004, p. 408). given the importance of early childhood education, the government's attention to developing it has become greater. since 2000, early childhood education (ece) has become a central issue in education, including in indonesia. moreover, erman syamsudin, director of early childhood development (ministry of national education of republic of indonesia, 2011, p. vii), stated that early childhood education is one of the priority programs of national education development. early childhood education (ece) services are expected to nurture, grow, and optimally develop the whole potential of young children so that basic abilities and behavior are formed according to the children’s developmental stage. in response to the government policy, the public has shown its concern for the problems of education, protection, and care of young children through a variety of services in accordance with their conditions and capabilities. public awareness of the importance of early childhood education for the optimal development of children's potential has been shown by active participation in the implementation and improvement of services.
although various policies have been issued by the government, there are still many problems in the implementation of early childhood services, including in badung regency, province of bali. there are still many children who have not received early childhood services. this fact is acknowledged by the general director of early childhood, non-formal, and informal education: although policies related to early childhood development have been established and socialized, of the 28.8 million children aged 0-6 years in late 2009, only around 53.7% received early childhood education services (ministry of national education of republic of indonesia, 2011, p. iii). research by hiryanto (2007) on mapping the quality achievement level of early childhood programs in yogyakarta reveals that, when the real conditions of program implementation are compared with the implementation guidelines on the basis of the ten benchmarks of national education, several problems can be found in the implementation of early childhood education in the yogyakarta province: (1) variation in the implementation of education; (2) age groupings that do not fit the guidelines because of limited infrastructure and educators; (3) some educators who have not received training; and (4) a teacher-student ratio that is not ideal. the research by hermawati (2007) in a children's daycare in beringharjo, yogyakarta, found two drawbacks in the input variables, namely, teachers’ educational backgrounds and caregiver qualifications that are not relevant to the tasks.
in the process variables, the problem is that mentoring by caregivers cannot be measured. this is associated with the low education level of most caregivers. in addition, assistance has not been provided regularly by the organizers. public access to the beringharjo children's daycare is limited because of its limited capacity. the preliminary study conducted at early childhood institutions in badung regency, province of bali, also found many problems related to the implementation of early childhood education (ece); for example, the quality and quantity of early childhood teachers are still relatively low, and the number of teachers averages three to four people. in terms of process, kindergarten students have been taught reading, writing, and arithmetic skills because, according to the teachers, if this is not done, no parents want to enroll their children in the institution. in order to provide good-quality early childhood education (ece) in accordance with the existing standards, early childhood education services need to be evaluated regularly. according to nugraha (2010, p. 3), a good-quality early childhood education service is regularly evaluated and the results are acted upon appropriately. a similar opinion is expressed by mardapi (2012, p. 12): the quality of education can be improved by improving the quality of learning through improvement of the quality of the assessment system. therefore, the availability of quality evaluation instruments that the government can use to evaluate early childhood education services continuously is very important. from the results of evaluation activities, we can know what has been achieved and whether or not a program meets the established criteria. currently, the internal evaluation of early childhood services has not been done thoroughly.
likewise, in bali province, including badung regency, the preliminary study shows that the department of education has never evaluated the existing early childhood services. the quality of early childhood institutions is often determined by the frequency of participation in competitions and the number of early childhood learners. this is caused by the absence of evaluation instruments for early childhood services that have been tried and tested in terms of both validity and reliability. evaluation results provide accurate information when they are obtained using reliable instruments. until now, the government, especially the badung department of education, has not had a standard evaluation instrument that can be used by the department of education or by heads of kindergartens for internal evaluation. the education service is a system consisting of interlinked components that mutually determine each other. these components are the input, process, and product. the input component includes infrastructure, students, teachers, curriculum, and subject matter. the process component includes lesson planning, implementation, and evaluation. the product component of early childhood education services includes the achievement level of early childhood development, such as moral-religious, social-emotional, language, cognitive, and physical-motor development. in evaluating early childhood education services, these components should be evaluated continuously. therefore, it is necessary to develop an instrument to evaluate the inputs, processes, and products of early childhood services.
based on that description, the instrument developed in this study is limited to measuring the products, that is, an instrument to evaluate the achievement level of early childhood development, which includes: (a) moral-religious, (b) social-emotional, (c) language, (d) cognitive, and (e) physical-motor development. this limitation is due to the limited costs, energy, and time available. based on this background, the problems in this study are as follows. what does an instrument for measuring the level of early childhood development look like? what is the quality of the developed instruments, in terms of both validity and reliability? based on these problems, the purposes of this research are to develop evaluation instruments that can be used to evaluate the level of early childhood development, particularly for kindergarten group b, so as to provide complete and accurate information for program managers, and to assess the quality of the evaluation instruments developed. the product of this study is a set of evaluation instruments for early childhood education services, particularly for kindergarten group b. the evaluation instruments are limited to instruments that measure the achievement level of early childhood development, which includes moral-religious, social-emotional, language, cognitive, and physical-motor development. the development of an evaluation model for early childhood service programs is very beneficial, both theoretically and practically. theoretically, this study contributes to developing the existing evaluation methodology to generate new concepts in the field of evaluation science. practically, the results of this study are useful for teachers, principals of early childhood institutions/kindergartens, as well as the department of education.
for teachers in early childhood education (ece), kindergarten teachers in particular, this instrument can be used to measure the effectiveness of the services provided, and the results can serve as a basis for making corrections to educational services.

method

development model

this study is research and development (r & d), which aims to produce a product in the form of a set of instruments to evaluate the level of young children’s development (specifically kindergarten group b). the research adopts the development model proposed by borg and gall (1983, p. 775). the ten development steps of borg and gall were simplified into four steps, namely: (1) preliminary investigation, (2) design phase, (3) testing, evaluation, and revision, and (4) implementation. in the preliminary stage, we conducted several activities, including a preliminary study, a review of the theory of instrument evaluation models and early childhood education, and a review of the results of previous research. in the design phase, the draft instrument for measuring the level of early childhood development was designed, consisting of instruments for measuring the products of services, together with the test design. in the testing, evaluation, and revision phase, the designed instruments were validated by experts and tested in kindergartens. the test data were then analyzed. if the analysis showed that an instrument was not yet good, it was revised and tested again until a final prototype with an acceptable model fit (a good prototype) was obtained. tests were conducted in three phases. in the implementation phase, the instruments that had been tested and found to be good were implemented.

development procedure

several steps were taken in developing the instrument for measuring the level of early childhood development. each step is elaborated as follows.
drafting of the design

at this stage, evaluation instruments to evaluate the product of early childhood services were structured, consisting of instruments for measuring moral-religious, social-emotional, language, cognitive, and physical-motor development. all of those instruments were in the form of a questionnaire on a five-point likert scale. these instruments are the first draft.

expert judgment

in order to check the content validity and refine the instrument draft, the draft was validated by experts, namely academicians (lecturers), practitioners (kindergarten teachers), and users of the instrument (heads/deputy heads of kindergartens). the expert validation process used the focus group discussion (fgd) model. the fgd was conducted in two stages. the first fgd was conducted with ten academicians (lecturers) from the post-graduate program of universitas negeri yogyakarta. after the instrument had been revised in accordance with the academicians' suggestions, it was followed by another fgd and a readability test with three kindergarten heads and 17 kindergarten teachers. after the readability test was carried out, the instrument was assessed.

tests

the draft instrument, revised based on the advice obtained in the fgd, was piloted in kindergartens to determine the fit of the measurement model, construct validity, and reliability. the instrument test was conducted in three stages with an increasing number of test subjects: 10, 13, and 18 kindergartens, with 160, 260, and 360 kindergarten children as subjects, respectively.
data analysis

the data on the comprehensiveness and clarity of the instrument obtained from the experts were analyzed descriptively. the data from the field tests were analyzed using confirmatory factor analysis (cfa) with the lisrel 8.8 program in order to find out the goodness of fit (gof) and to determine the validity and reliability. in determining the goodness of fit, several indicators were employed: (a) chi-square p-value ≥ 0.05, (b) root mean square error of approximation (rmsea) ≤ 0.08, and (c) goodness of fit index (gfi) ≥ 0.9 (ghozali & fuad, 2008, pp. 29–31; latan, 2012, p. 53). the construct reliability was calculated from the loading (λ) and the error variance (δ) of each indicator. in the descriptive-qualitative analysis, the average score of the quantitative data obtained through the assessment instrument was calculated, converted into qualitative data on a five-level scale, and finally interpreted qualitatively. the results of the qualitative analysis were used as the basis for determining whether or not the developed instrument was good. in converting the quantitative data into qualitative data on the five-level scale, a modification of the rules developed by sudijono (2011, p. 329) was employed. the criteria used for the instrument assessment are presented in table 1.

implementation

after the last version of the instrument had been analyzed, a good prototype was implemented in 18 kindergartens. the whole process of developing the instrument model for early childhood development is illustrated in figure 1.

table 1.
criteria of instrument assessment
average score | qualification | conclusion
> 4.2 | very good | can be an example
> 3.4 – 4.2 | good | can be used without any revision
> 2.6 – 3.4 | quite good | can be used with a little revision
> 1.8 – 2.6 | less good | can be used with some revision
≤ 1.8 | bad | cannot be an example

figure 1. flowchart of instrument model development procedure (annotation: activity process, next step, activity result, review for revision, analysis result)

table 2. instrument of development result
instrument for evaluating early childhood services
instrument | evaluated item | instrument form
instrument for measuring the level of early childhood development | achievement level of development: a. moral-religious | questionnaire
 | b. social-emotional | questionnaire
 | c. cognitive | questionnaire
 | d. language | questionnaire
 | e. physical-motor | questionnaire

findings and discussion

the instrument for measuring the level of early childhood development consists of five components, namely instruments for measuring moral-religious, social-emotional, language, cognitive, and physical-motor development. the types of product instruments developed for early childhood services are presented in table 2.

validation result of experts and practitioners

the instrument assessment by experts and practitioners was directed at four main aspects, namely: (a) the clarity of the instrument guidance, (b) the completeness of the instrument indicators, (c) the suitability of the indicators with the items, and (d) the effectiveness of the indonesian language used. the assessment used a five-point scale, with 1 as the lowest score and 5 as the highest.
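the conversion rule in table 1 can be sketched as a simple threshold function. this is a minimal illustration rather than the authors' own procedure; the function name `qualify` is hypothetical, and the thresholds are copied from table 1.

```python
def qualify(mean_score):
    """Map an average assessment score (1-5 scale) to a table 1 qualification."""
    if not 1 <= mean_score <= 5:
        raise ValueError("mean_score must lie on the 1-5 scale")
    if mean_score > 4.2:
        return "very good"
    if mean_score > 3.4:
        return "good"
    if mean_score > 2.6:
        return "quite good"
    if mean_score > 1.8:
        return "less good"
    return "bad"


# The experts' overall mean reported in the study.
print(qualify(4.1))  # good
```

each interval is open at its lower bound and closed at its upper bound, so a mean of exactly 4.2 still falls in the good category.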
[figure 1 flowchart nodes: preliminary research, instrument design, expert validation, first draft, revision, test, revision draft, analysis (good/bad), prototype, implementation, development result]

based on the scores given by the experts, the overall mean is 4.1. in line with the conversion guidelines, this mean lies in the interval from 3.4 to 4.2 and is classified as good. the assessment by teachers and heads of kindergartens gives an overall mean score of 4.29, which is also classified as good. the overall mean score of the two assessor groups is 4.2, which means that the instrument is good and can be used without any revision, as shown in table 3.

instruments measurement model

based on the analysis, all items on all instruments in the three pilot phases are significant (t > 1.96), meaning that all items measure the construct well. in the third test, two items of the instrument for the achievement level of language development have factor loadings slightly below 0.5, i.e. 0.49 and 0.48. since these values approach 0.5, they are rounded to 0.5. thus, all instruments have good construct validity. regarding model fit, in the third test all requirements are met: p-value (≥ 0.05), rmsea (≤ 0.08), and gfi (≥ 0.9). the construct reliability (cr) of all instruments is above 0.7 in all three stages of the test. thus, based on the three stages of the test, all of the instruments have good construct validity, reliability, and goodness of fit. the analyses of the three phases are presented in table 4.
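the fit and reliability checks described above can be sketched as follows. the sketch assumes the commonly used composite reliability formula cr = (Σλ)² / ((Σλ)² + Σδ) and, when error variances are not supplied, standardized loadings with δ = 1 − λ². the function names and the example loadings are hypothetical, and lisrel itself is not invoked.

```python
def construct_reliability(loadings, error_variances=None):
    """Composite reliability: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    if not loadings:
        raise ValueError("at least one loading is required")
    if error_variances is None:
        # Assumption: loadings are standardized, so delta = 1 - lambda^2.
        error_variances = [1 - lam ** 2 for lam in loadings]
    if len(error_variances) != len(loadings):
        raise ValueError("one error variance per loading is required")
    squared_sum = sum(loadings) ** 2
    return squared_sum / (squared_sum + sum(error_variances))


def meets_fit_criteria(p_value, rmsea, gfi):
    """Apply the three cut-offs used in the study: p >= 0.05, rmsea <= 0.08, gfi >= 0.9."""
    return p_value >= 0.05 and rmsea <= 0.08 and gfi >= 0.9


# Hypothetical standardized loadings for a four-indicator construct.
cr = construct_reliability([0.72, 0.65, 0.80, 0.58])
print(cr > 0.7)

# Third-test values reported for the moral-religious instrument.
print(meets_fit_criteria(p_value=0.075, rmsea=0.019, gfi=0.94))  # True
```

keeping the cut-offs in a single function makes it easy to apply the same rule to every instrument and test stage.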
table 3. recapitulation of experts and practitioners validation
validator | number of validators | average score | qualification
experts | 10 | 4.10 | good
practitioners | 20 | 4.29 | good
total | 30 | 8.40 | -
mean | - | 4.2 | good

table 4. summary of analysis result for instrument measurement model of product and outcome
instrument | number of items | test no. | chi-square score | p-value | rmsea | gfi | λ < 0.5 | cr
moral-religious development | 25 | 1 | 308.30 | 0.07 | 0.029 | 0.87 | 2 | 0.89
 | | 2 | 311.77 | 0.058 | 0.023 | 0.91 | - | 0.91
 | | 3 | 307.31 | 0.075 | 0.019 | 0.94 | - | 0.91
social-emotional development | 26 | 1 | 330.54 | 0.081 | 0.027 | 0.86 | - | 0.92
 | | 2 | 333.69 | 0.07 | 0.022 | 0.91 | - | 0.91
 | | 3 | 331.38 | 0.089 | 0.018 | 0.93 | - | 0.92
language development | 24 | 1 | 282.39 | 0.060 | 0.030 | 0.87 | - | 0.70
 | | 2 | 276.32 | 0.089 | 0.022 | 0.90 | - | 0.75
 | | 3 | 286.48 | 0.051 | 0.02 | 0.94 | 2 | 0.82
cognitive development | 26 | 1 | 331.27 | 0.066 | 0.028 | 0.86 | - | 0.87
 | | 2 | 326.05 | 0.089 | 0.021 | 0.91 | 2 | 0.80
 | | 3 | 330.72 | 0.075 | 0.018 | 0.93 | 1 | 0.76
physical-motor development | 27 | 1 | 356.76 | 0.077 | 0.027 | 0.86 | 4 | 0.72
 | | 2 | 351.72 | 0.094 | 0.02 | 0.91 | 3 | 0.85
 | | 3 | 355.86 | 0.076 | 0.018 | 0.93 | - | 0.82
life skills | 30 | 1 | 439.09 | 0.081 | 0.025 | 0.84 | 1 | 0.72
 | | 2 | 437.05 | 0.092 | 0.019 | 0.90 | - | 0.76
 | | 3 | 447.32 | 0.055 | 0.018 | 0.92 | 2 | 0.74

in this study, five instruments for measuring the level of early childhood development were developed, namely instruments for measuring moral-religious, social-emotional, language, cognitive, and physical-motor development. the instruments are in the form of questionnaires. the instrument indicators are based on the indicators of the achievement level of early childhood development contained in the regulation of the minister of national education no. 58 of 2009 on the standard for early childhood education, specifically the standard level of achievement of
development. the procedure for developing the instruments follows five steps, namely: (1) the design preparation phase, (2) the expert validation phase, (3) the testing phase, (4) the data analysis phase, and (5) the implementation phase. the draft instruments were validated by experts with respect to indicator depth, the formulation of questions or statements, language effectiveness, and other aspects. the experts who validated the instruments consisted of ten people from several fields: two measurement experts, three evaluation experts, one education management expert, two primary education experts, and two childhood education experts. the goal was for the instruments to be assessed from various aspects so as to produce quality instruments. after being revised based on the fgd input, the instruments were tested to determine their construct validity and reliability. the trial was conducted in three stages, with the number of trial subjects increasing at each stage. two assumptions underlie the decision to conduct the test in three stages: (1) increasing the variety and number of trial subjects over three stages is expected to cover all kinds of characteristics of both kindergartens and students, and (2) with all kindergarten and student characteristics represented, a good instrument will be obtained, that is, one that can be applied to all existing kindergartens. the analysis of the test data from the first to the third stage gave the following results. the results of the first phase of the trial show that the five instruments developed were still lacking. after the items were revised, the second trial was conducted. the results of the second phase of the trial (main trial) show that the model fit of the instruments had improved; only some items still had deficiencies. these items were revised again and the third trial was conducted.
in this study, all the deficient instruments were revised in two stages. the results of the third stage of the test show that all instruments have good model fit, validity, and reliability. all instruments developed therefore have a good measurement model, because: (a) all values of χ2 are low (p ≥ 0.05), (b) all rmsea ≤ 0.08, and (c) all gfi values ≥ 0.9. the coefficients of construct reliability (cr) are all above 0.7. thus, all instruments developed have good quality.

conclusion
based on the research findings, two points of conclusion can be drawn. (1) the instruments for measuring the achievement level of early childhood development developed in this research consist of five components: instruments to evaluate the achievement level of moral-religious, social-emotional, cognitive, language, and physical-motor development. (2) according to the assessment of experts and practitioners, the instruments developed have good quality and can be used without any revision. all developed instruments have good validity, reliability, and goodness of fit.

references
ashiabi, g. s. (2007). play in the preschool classroom: its socioemotional significance and the teacher’s role in play. early childhood education journal, 35(2), 199–207. https://doi.org/10.1007/s10643-007-0165-8
beard, k. y., & sugai, g. (2004). first step to success: an early intervention for elementary children at risk for antisocial behavior. behavioral disorders, 29(4), 396–409. https://doi.org/10.1177/019874290402900407
berk, l. e. (2007). development through the lifespan (4th ed.). boston, ma: pearson education.
borg, w. r., & gall, m. d. (1983). educational research: an introduction (4th ed.). new york, ny: longman.
ghozali, i., & fuad, f. (2008). structural equation modeling: teori, konsep, dan aplikasi dengan program lisrel 8.80. semarang: badan penerbit universitas diponegoro.
hermawati, i. (2007).
evaluasi program pendidikan anak usia dini (paud) bagi anak dari keluarga miskin di tempat penitipan anak (tpa) beringharjo, yogyakarta. yogyakarta: departemen sosial ri, badan pendidikan dan penelitian kesejahteraan sosial, balai besar penelitian dan pengembangan, pelayanan kesejahteraan sosial.
hiryanto, h. (2007). pemetaan tingkat pencapaian mutu program pendidikan anak usia dini (paud) di provinsi diy. (laporan penelitian, tidak diterbitkan). yogyakarta: lembaga penelitian uny. diklus: jurnal pendidikan luar sekolah, 6(11), 127–149. retrieved from https://journal.uny.ac.id/index.php/diklus/article/view/5787
latan, h. (2012). structural equation modeling: konsep dan aplikasi menggunakan program lisrel 8.80. bandung: alfabeta.
mann, e. a., & reynolds, a. j. (2006). early intervention and juvenile delinquency prevention: evidence from the chicago longitudinal study. social work research, 30(3), 153–167. https://doi.org/10.1093/swr/30.3.153
mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha medika.
ministry of national education of republic of indonesia. (2011). petunjuk teknis penyaluran bantuan alat permainan edukatif. jakarta: directorate of early childhood education development, ministry of national education of republic of indonesia.
nugraha, a. (2010). evaluasi pembelajaran untuk anak usia dini. bandung: universitas pendidikan indonesia.
regulation of the minister of national education no. 58 of 2009 on the standard for early childhood education. (2009).
rushton, s. (2011). neuroscience, early childhood education and play: we are doing it right! early childhood education journal, 39(2), 89–94. https://doi.org/10.1007/s10643-011-0447-z
samuelsson, i. p.
(2011). why we should begin early with esd: the role of early childhood education. international journal of early childhood, 43(2), 103–118. https://doi.org/10.1007/s13158-011-0034-x
sudijono, a. (2011). pengantar evaluasi pendidikan. jakarta: raja grafindo persada.
valentine, k., thomson, c., & antcliff, g. (2009). early childhood services and support for vulnerable families: lessons from the benevolent society’s partnerships in early childhood program. australian journal of social issues, 44(2), 195–213. https://doi.org/10.1002/j.1839-4655.2009.tb00140.x
woolfolk, a. (2007). educational psychology (10th ed.). boston, ma: allyn & bacon.

copyright © 2018, reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 4(1), 2018, 1-11
available online at: http://journal.uny.ac.id/index.php/reid

an evaluation of islamic moral teaching for students of madrasah aliyah negeri (man)

*1siti amanah; 2haryanto
1graduate school of universitas negeri yogyakarta, jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
2faculty of education, universitas negeri yogyakarta, jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*corresponding author. e-mail: sitia7001@gmail.com
submitted: 14 march 2018 | revised: 09 april 2018 | accepted: 09 april 2018

abstract
this research is aimed at evaluating the preparation, implementation, and outcome of the moral teaching program using stake countenance evaluation model (antecedent, implementation, and outcome). the study was conducted at madrasah aliyah negeri (man) cilacap, man kroya, and man majenang. the subjects were the principals, teachers of aqidah akhlak, chairpersons of the madrasah committee, and 276 grade xii students. interview, observation, questionnaire, and documentation were used as data collection techniques. the data analysis method used was the quantitative-qualitative descriptive analysis.
the result of the evaluation shows that: (1) the preparation of the moral teaching is in ‘good’ category; (2) the implementation of moral teaching in terms of time and methods is in ‘good’ category, but the model of moral judgments used is not in ‘good’ category; (3) the result of moral teaching in the madrasah and outside the madrasah as a whole is in ‘good’ category. thus, islamic moral teaching evaluation of man in cilacap regency viewed from the preparation, implementation, and the result is in accordance with the evaluation criteria. in addition, there is a need for further action to examine the effectiveness of moral teaching in madrasahs.

keywords: evaluation, moral, stake countenance

introduction
according to surah al baqarah [2]: 30, allah subḥānahu wa-taʿālā (swt) created men to be the viceroy (leader) of the earth (indonesian department of religious affairs, 2011, p. 6). a good leader is a leader who provides examples through goodness and nobility of his/her soul and manner. since the very beginning, education (both formal and nonformal) has played a fundamental role for humans because it is the means to bolster their level of knowledge and mannerism, or in arabic, akhlaq. according to the national education system law, the goal of education is to enlighten a nation, develop the potentials, and build a healthy national civilization possessing manners and god-fearing attitude. therefore, men of manners are symbols of success in education.

etymologically, akhlaq comes from an arabic word al akhlaq which is the plural form of khuluqun or ethical conduct, good attitude, good manner, or disposition (ya’qub, 1991, p. 11). there are many verses of the holy quran that focus on akhlaq. one of them is verse 4 in chapter 68 which says, in english: ‘and verily, you (muhammad sallallaahu alaihi wasallam) are on an exalted standard of character’. (al-qalam [68]: 4).
al-qalam [68]: 4 indicates that allah swt has chosen the best man, prophet muhammad sallallaahu alaihi wasallam (saw), as the example for mankind. therefore, it is mandatory to follow his teachings as indicated in the following hadith. prophet muhammad saw said: ‘i have been sent to perfect good character’ (abas, 1437, p. 7).

perfect character is an achievement that is only possible through tireless work from both parents and other family members in educating their closest relatives (daradjat, 2000, p. 35). education starting from the family is essential in developing a child’s character. further, other parts of the community, such as close friends, colleagues, and other acquaintances, have to take part in the character (akhlaq) education (zuchdi, 2013, p. 20).

focusing on achieving the key functions and goals of education, schools or madrasas play an important role in producing human resources possessing an elevated level of intelligence, manner and morality, providing excellent examples to be followed. darmayanti and wibowo (2014, p. 227) state that schools are means to deliver strategic programs and tackle the existing moral problems.

at schools or madrasas, the process to achieve students’ elevated level of mannerism might involve several curricular and extracurricular activities. the curricular activities are the islamic moral teaching and learning activities, while the extracurricular activities are religious discussion forums, islamic holiday celebrations, and also pilgrimage to mecca. the importance of those activities lies in the fact that the learning process and religious atmosphere affect the students’ learning behavior and the fact that the result of the study is heavily affected by the religious atmosphere at the schools (kartowagiran & maddini, 2015, p. 995).
furthermore, the teachers’ tenacity and leadership are also key factors in the success of islamic moral teaching (akhlaq teaching). as professional educators, teachers perform their main tasks including teaching, educating, fostering, guiding, directing, counseling, training, assessing and evaluating students in the levels of early-childhood, elementary, and secondary education (kartowagiran, 2011, p. 464). therefore, professional teachers will be able to help the students build good manners and habits.

as an islamic education institution, madrasah aliyah negeri (state islamic senior high school/man) in cilacap regency, central java, indonesia has put forward religious education by focusing on producing students with good manners (good akhlaq) and high achievement so that they are excellent at science and technology. as a follow up, madrasas have taken many actions. however, in my opinion, many of the factors have not been properly assessed. these factors are (1) available resources, (2) religious curricular and extracurricular activities, (3) applicable learning methods, (4) an applicable evaluation model, (5) the goal and scope of the mannerism being built. additionally, even though the strategic plan had been arranged, the achievement of the activities supporting the programs had not been well arranged.

based on the observation, it was known that (1) the teachers were not consistent in preparing the lesson plans for islamic moral teaching (akhlaq/character building), (2) the teachers did not rely on the standards of character (akhlaq) evaluation in evaluating the students, and (3) the process of evaluation conducted by the teachers was not systematic. therefore, an evaluation of islamic moral teaching at man was highly needed to find out the imperfect parts of the implementation of islamic moral teaching in cilacap regency.
in this context, evaluation is a set of practices to determine the quality, performance, and productivity of an institution in the implementation of its programs (mardapi, 2012, p. 4). it is also noted that ‘evaluation is the determination of the worth of the thing. it includes obtaining information for use in judging the worth of a program, product, procedure, or objective, or the potential utility of alternative approaches designed to attain specified objectives’ (worthen & sanders, 1973, p. 19). the evaluation aims to answer the following questions: (1) how are the islamic moral teaching programs prepared; (2) how are the islamic moral teaching programs executed; and (3) what is the result of islamic moral teaching at man in cilacap regency like?

method
the research was conducted from february to august 2016. the evaluation of the islamic moral teaching was conducted in three man’s in cilacap regency: man cilacap, man kroya, and man majenang. the research setting was determined based on the similarities among these schools in terms of the implementation of the islamic moral teaching programs.

the subjects of the evaluation were grade xii students, madrasas’ principals, teachers of aqidah akhlak (islamic creed and mannerism) and the heads of madrasa committee. the research sample was established using purposive sampling technique by considering the competence and role. table 1 shows the number of the involved respondents. random sampling technique with slovin’s formula was applied to proportionally select 276 respondents out of 806 students (see table 2).

data collection techniques
this was an evaluation research implementing quantitative and qualitative approaches.
this research employed stake countenance evaluation model consisting of three stages of evaluation: preparatory stage (antecedent), implementation stage (transaction), and result stage (outcome). in this research, there were two types of data: quantitative and qualitative data. the quantitative data were collected using a questionnaire, whereas the qualitative data were gathered through interviews with the madrasa principals, teachers, and the head of the madrasa committees. the data collected through observation and documentation were used to support the result of the analysis on quantitative and qualitative data. table 3 shows the data collection techniques used.

table 1. the number of respondents consisting of the madrasa principals, teachers and the head of the madrasa committees
man | the madrasa principals | teachers | the head of committee
man cilacap | 1 | 1 | 1
man kroya | 1 | 1 | 1
man majenang | 1 | 1 | 1
total | 3 | 3 | 3

table 2. the number of respondents from students
man | number of students | % | number of respondents
man cilacap | 238 | 30 | 70
man kroya | 231 | 28 | 65
man majenang | 337 | 42 | 141
total | 806 | 100 | 276

table 3. data collection techniques
aspects | indicators | techniques
preparatory (antecedent) | resources; goals and scopes of islamic moral teaching (akhlaq teaching); the management of the infrastructures | interviews, documentation, observation
implementation (transaction) | implementation time; islamic moral teaching methods; character evaluation models | interviews, documentation, observation, questionnaire
results (outcome) | the application of islamic morality by the students in the area of the madrasas; the application of islamic morality by the students outside the area of the madrasas | interviews, documentation, observation, questionnaire
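the sampling step described in the method (slovin’s formula followed by proportional allocation across the three madrasas) can be sketched as below. this is only an illustration under stated assumptions: the margin of error is not reported in the text, so the value 0.05 is assumed, the function names are hypothetical, and the output is not expected to reproduce the 276 respondents or the per-school figures in table 2. the population sizes are taken from table 2.

```python
import math


def slovin_sample_size(population: int, margin_of_error: float) -> int:
    """Slovin's formula: n = N / (1 + N * e^2), rounded up."""
    return math.ceil(population / (1 + population * margin_of_error ** 2))


def allocate_proportionally(strata: dict[str, int], total_sample: int) -> dict[str, int]:
    """Split a sample across strata in proportion to their size.

    Uses the largest-remainder method so the parts add up to total_sample.
    """
    population = sum(strata.values())
    exact = {name: size * total_sample / population for name, size in strata.items()}
    allocation = {name: math.floor(share) for name, share in exact.items()}
    leftover = total_sample - sum(allocation.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Grade xii populations from table 2; e = 0.05 is an assumed margin of error.
students = {"man cilacap": 238, "man kroya": 231, "man majenang": 337}
n = slovin_sample_size(sum(students.values()), margin_of_error=0.05)
allocation = allocate_proportionally(students, n)
print(n, allocation)
```

a different margin of error, or rounding applied per school rather than to the total, would give different counts, which is one possible reason a computed sample size can differ from a reported one.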
the goals of the preparatory stage (antecedent) were to gain insights into resources, purposes, scope of the material, and management of the infrastructures. the implementation stage (transaction) included evaluation on the implementation time, methods used, and evaluation models used. the result stage (outcome) aimed at understanding the application of islamic morality by the students in the madrasas area. the data concerning these stages were collected through interviews, observations, document analysis, and questionnaire.

in order to better understand the above stages, the researchers conducted interviews with the madrasa principals, the head of the madrasa committees, and the teacher of aqidah akhlak for the twelfth grade students. the preparatory aspect was essential in revealing resources, goals, scopes of the materials, and the management of the infrastructure. moreover, the indicators for resources were the strategic plans and competence of the teachers in delivering islamic moral teaching (akhlaq teaching). in the aspect of implementation, the interviews aimed at revealing the implementation time of the islamic moral teaching and the methods and evaluation of the education applied. the last stage (the result stage) was designed to reveal the application of islamic morality by the students inside and outside the area of the madrasas.

three evaluators, one in each of the three man’s, observed the infrastructures, the process of teaching and learning of the subject of aqidah akhlaq in the madrasas, the time of the islamic moral teaching, and the methods used to build good characters in the students. the evaluators conducted documentation to gather supporting data for the preparation, implementation, and result of the islamic moral teaching at man’s in cilacap regency.
the documentation specifically gathered the data concerning resources, materials, infrastructures of the madrasas, and curricular and extracurricular activities related to islamic moral teaching in the madrasas.

as a part of the evaluation process, the authors distributed a questionnaire to 276 students. the goal was to gather the students’ responses to the components of the implementation of the islamic moral teaching, the methods of the islamic moral teaching, the models of the evaluation, and the application of islamic morality by the students inside and outside the area of the madrasas.

validity of the instruments
content validity was used to assess the accuracy of the observation, interview, and questionnaire instruments, which were judged by three experts in the field of education research and evaluation. the validity of the content was measured using aiken’s v formula. table 4 shows the content validity of the observation sheet. table 5 shows the analysis result of the content validity of the interview instruments. meanwhile, table 6 shows the analysis result of the content validity of the questionnaire.

table 4. the result of v value in the content validity of the observation sheet
v value | v value in table | item numbers | number of items | description
1 | 0.81 | 2, 7, 9 | 3 | valid
0.89 | 0.81 | 1, 3, 4, 5, 7, 8 | 6 | valid
total | | | 9 | valid

table 5. the result of v value in the content validity of the interviews instruments
v value | v value in table | item numbers | number of items | description
1 | 0.68 | 5, 7, 12, 16, 19, 27, 34 | 7 | valid
0.89 | 0.68 | 1, 2, 3, 4, 6, 9, 10, 11, 13, 14, 15, 17, 18, 20, 22, 23, 26, 29, 30, 31, 32, 33, 36, 37, 38, 40, 42, 43, 44, 46, 47, 48, 50, 51, 53, 54, 55, 56, 57 | 39 | valid
0.78 | 0.68 | 8, 21, 24, 25, 28, 35, 39, 41, 45, 49, 52 | 11 | valid
total | | | 57 | valid

construct validity was used to reveal the accuracy of the construction of the questionnaire, which was measured with exploratory factor analysis (retnawati, 2016, p. 42). the result of the construct validity analysis is shown in table 7.

table 7. kmo and bartlett’s test result
kaiser-meyer-olkin measure of sampling adequacy | .614
bartlett's test of sphericity, approx. chi-square | 1475.660
bartlett's test of sphericity, df | 780
bartlett's test of sphericity, sig. | 0.000

it can be concluded that the instruments of the islamic moral teaching evaluation were valid for data collection. out of 50 items in the questionnaire, there were 10 items with anti-image correlation value < 0.5. the items were items 10, 12, 16, 20, 30, 43, 44, 45, 46 and 49. thus, there were 40 valid items in the questionnaire.

reliability of the instruments
the reliability index of the instruments was considered acceptable if the reliability value was > 0.7 (linn, 1989, p. 106). the reliability of the observation sheets was estimated using the icc (intraclass correlation coefficient) formula. the icc of the observation sheet was 0.895 for the average of the raters and 0.740 for a single rater. based on the estimation, the instruments were deemed to be reliable and valid for conducting the research. table 8 shows the reliability of the observation sheets.

the reliability of the questionnaire was estimated with the cronbach’s alpha coefficient formula, computed with the spss program. based on the estimation result, the value of cronbach’s alpha was 0.867, which is higher than 0.7.
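the two indices named above, aiken’s v for content validity and cronbach’s alpha for reliability, follow standard definitions that can be sketched as below. this is a minimal illustration, not the spss procedure used in the study: the function names are hypothetical, the 1 to 4 rating scale is an assumption (it is consistent with three raters producing the reported v values of 1, 0.89 and 0.78, that is 9/9, 8/9 and 7/9), and the small score matrix is made up for the example rather than taken from the study data.

```python
from statistics import variance


def aiken_v(ratings: list[int], lowest: int = 1, highest: int = 4) -> float:
    """Aiken's V = sum(r - lowest) / (n * (highest - lowest)) for one item."""
    n = len(ratings)
    return sum(r - lowest for r in ratings) / (n * (highest - lowest))


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Three hypothetical expert ratings of one item on an assumed 1-4 scale.
print(round(aiken_v([4, 4, 3]), 2))

# Made-up scores: 3 respondents x 3 items that rank respondents identically.
toy_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(toy_scores))
```

in the toy matrix the items agree perfectly, so alpha reaches its upper limit; real questionnaire data, such as the 40 retained items here, give lower values.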
therefore, the questionnaire was deemed reliable (see table 9).

table 9. reliability of the alpha cronbach (reliability statistics)
cronbach's alpha | n of items
0.867 | 40

data analysis techniques
this research employed quantitative-qualitative descriptive analysis. each technique is described in the following sections.

quantitative data analysis
quantitative data analysis was used to describe the data collected through the questionnaire based on the score. the scores were categorized using normal distribution. the categorization was conducted with the normal curve as the reference, using the ideal mean (mi) and ideal standard deviation (sdi) (mardapi, 2008, p. 123). the score categorization of the students is shown in table 10.

table 6. the result of v values in the content validity of the questionnaire
v value | v value in table | item numbers | number of items | description
1 | 0.68 | 1, 3, 5, 7, 8, 10, 16, 17, 19, 21, 22, 24, 25, 28, 29, 30, 31, 34, 35, 37, 40, 42, 43, 47, 50 | 25 | valid
0.89 | 0.68 | 2, 4, 6, 9, 11, 12, 13, 14, 18, 23, 26, 27, 32, 33, 36, 38, 39, 41, 45, 46, 48, 49 | 22 | valid
0.78 | 0.68 | 15, 20, 44 | 3 | valid
total | | | 50 | valid

table 8. the result of observation sheet icc (intraclass correlation coefficient)
 | intraclass correlation | 95% confidence interval, lower bound | 95% confidence interval, upper bound | f test with true value 0: value | df1 | df2 | sig
single measures | .740 | .406 | .927 | 9.538 | 8 | 16 | .000
average measures | .895 | .672 | .974 | 9.538 | 8 | 16 | .000

table 10. the categorization of the islamic moral teaching result (the score-interval column did not survive in the source)
no. | category
1. | very good
2. | good
3. | poor
4. | very poor
description: x̄: average of total score; sbx: standard deviation of total score; x: score achieved

qualitative data analysis
the qualitative data were analyzed using the interactive and sustainable analysis model of miles and huberman, which consists of four stages: data collection, data description, data reduction, and data verification/conclusion. figure 1 schematically shows the analysis process of the qualitative data.

figure 1. analysis model of qualitative data (miles & huberman, 1994)

findings and discussion

the preparation for islamic moral teaching (akhlaq teaching)

the resource of education
from the interviews, it is known that the resources of the islamic moral teaching at man’s in cilacap regency were supported by the vision and mission of the islamic moral teaching as defined in the strategic plan. rusniati and haq (2014, p. 102) state that a strategic plan is a plan to utilize available resources in order to achieve the goals set by an organization. man’s in cilacap regency position the strategic plan as the implementation guidance for gaining academic achievement and as the vision and mission of the madrasas for the years to come.

in addition to the strategic plan, from the interviews, it is also known that the teachers of the madrasas had been trained to facilitate the students to succeed in the cognitive, psychomotor, and also affective areas. moreover, kartowagiran (2011, p. 465) states that teachers are the spearhead of the effort to level up the quality of the service and result of education.

other important factors, in addition to the strategic plan and the competence of the teachers, were material and non-material supports from the parents/guardians in the islamic moral teaching. this is supported by jalaluddin (2011, p. 291) who argues that, in building islamic attitude within the children, parents have to provide supports in the elementary education.
the goals and scope of the materials
islamic moral teaching aims at providing the students with motivation not only to study aqidah but also to put it into everyday practices. in islamic studies, morality includes our morality in relation to god and his prophet, other people, ourselves, and nature. this is in line with previous studies on similar subjects by prihatini, mardapi and sutrisno (2013, p. 347) who discovered that the construction of islamic morality encompassed our morality in our relation to allah subhannahu wa ta’ala, prophet muhammad salallahu alaihi wasallam, parents, ourselves, friends, family, community, and nature. specifically, in the islamic moral teaching, the teaching materials cover the materials on good morality, poor morality, stories and examples, and aqidah (human relationship with allah swt). in madrasas, these materials are transformed into materials of islamic moral teaching.

facilities and infrastructure
the evaluation of the facilities and infrastructure of the madrasas focused on the facilities and infrastructures used in curricular and extracurricular activities. in the curricular activities, there were: (1) lesson plan (rpp); (2) the implementation of the lesson plan in the classroom; (3) learning activity manual and referential books; and (4) evaluation on the islamic moral teaching. in the extracurricular activities, there were: (1) praying room; (2) praying equipment; (3) facilities and infrastructure for ablution before prayers; and (4) toilets. the condition of each facility and infrastructure is shown in figure 2. as shown in figure 2, overall, the facilities and infrastructures at man’s in cilacap are in a good condition.
here are the percentages at each man: man cilacap is 88% (very good), man kroya is 63% (good) and man majenang is 75% (good).

figure 2. percentage of facilities and infrastructures for islamic moral teaching at man’s in cilacap regency

the transaction of islamic moral teaching
the result of the implementation of the islamic moral teaching in those madrasas is categorized based on the score of each madrasa (see table 11).

table 11. the categorization of the result of the transaction of the islamic moral teaching
no. | score | categories
1. | 3.25 ≤ x | very good
2. | 2.50 ≤ x < 3.25 | good
3. | 1.75 ≤ x < 2.50 | poor
4. | x < 1.75 | very poor

generally, the implementation of the islamic moral teaching at man’s in cilacap regency is in ‘good’ category. it can be seen from the score of 2.79 with the percentage as high as 70%. the scores for each man in cilacap regency are presented in figure 3. as shown by the result, the islamic moral teaching has been well implemented, following the schedule set and applying an appropriate learning method and technique.

figure 3. the scores for each man in the implementation stage of the islamic moral teaching

implementation time of islamic moral teaching
in terms of the implementation time, the islamic moral teaching is in a good category, scoring 3.08 with a percentage of 77%. the details of the implementation at each man are shown in table 12.

table 12. scores on the indicator of the implementation time of the islamic moral teaching
indicator | madrasa | score | % | categories
implementation time | man cilacap | 3.07 | 77 | good
 | man kroya | 3.58 | 90 | very good
 | man majenang | 2.86 | 72 | good
 | total | 3.08 | 77 | good

in this research, the islamic moral teaching covers curricular activities (learning aqidah akhlaq in the classroom) and also extracurricular activities.
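the score bands of table 11 can be applied mechanically, as in the sketch below. the thresholds come from table 11; the function names are hypothetical, and the conversion to a percentage assumes a maximum score of 4, which is an inference from the reported pairs (for example 3.08 and 77%) rather than something the text states.

```python
def categorize(score: float) -> str:
    """Map a mean score to the categories of table 11."""
    if score >= 3.25:
        return "very good"
    if score >= 2.50:
        return "good"
    if score >= 1.75:
        return "poor"
    return "very poor"


def as_percentage(score: float, max_score: float = 4.0) -> int:
    """Express a score as a whole-number percentage of an assumed maximum."""
    return round(score / max_score * 100)


# Overall implementation score and overall implementation-time score.
for score in (2.79, 3.08):
    print(score, as_percentage(score), categorize(score))
```

for the two overall scores this gives 70% and 77%, both in the good category.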
the extracurricular activities in the madrasas are himda’is (student’s preachers association) at man cilacap, an nisa study club at man majenang, and rohis (islamic spiritual and religious students group) at man kroya. other religious extracurricular activities include: praying together, reading the 99 holy names of allah, community service, reading the holy quran, organizing islamic holidays, distributing charity, hajj training, short course on islamic studies during ramadhan, musabaqah tilawatil quran (quran recitation festival), hadroh sholawatan (gathering to praise the prophet muhammad saw), socialization, and also istighozah (gathering to ask allah for help).

methods of the islamic moral teaching
the indicator of the method applied in the islamic moral teaching at man’s in cilacap regency is in a good category. the details are presented in table 13.

table 13. score on the indicator of the methods applied in the islamic moral teaching
indicator | madrasa | score | % | category
evaluation methods | man cilacap | 2.88 | 72 | good
 | man kroya | 3.17 | 79 | good
 | man majenang | 2.66 | 67 | good
 | average | 2.84 | 71 | good

in the islamic moral teaching, preaching, providing example, habit formation, giving advice, motivating, and admonishing are the methods used by the teachers. daradjat (1984, p. 262) and thoha, et al. (2004, p. 122) state that there are some effective methods in delivering the islamic moral teaching, including preaching, conducting question-and-answer sessions, opening discussions, habit formation, providing exemplary actions, and providing advice related to the teaching materials.

evaluation model of islamic moral teaching
the evaluation model of the islamic moral teaching at these three man’s in cilacap regency is in a poor category. the reason is that most of the evaluation was conducted only on cognitive aspects. in general, the evaluation put aside the affective aspects. this is in line with previous research which was conducted by syamsudin, budiyono and sutrisno (2016, p.
40) who argue that: ‘the elicitation of data from the objects of research has not been as easy as it has been thought because of the behavioral dynamics of the human individuals involved as research subjects..’. the details of the scores of the evaluation model of the islamic moral teaching in each school are elaborated in table 14. table 14. scores on the indicators of the evaluation model of the islamic moral teaching indicator madrasa score % category evaluation model man cilacap 1.99 50 poor man kroya 2.42 61 poor man majenang 1.77 44 poor average 1.98 50 poor in the next phase, it is known that the difficulty to measure the affective achievement laid on the fact that the indicator of affective aspects is hard to measure directly. it is possible, but it requires more time spent for observation. in addition, khuriyah (2003, p. 60) in her research states that the construct in the measurement of morality has not been fully developed. the outcome of islamic moral teaching the outcome of islamic moral teaching at man’s in cilacap regency is categorized based on the score that each school gained. the details of the categorization are presented in table 15. table 15. the categorization of the outcome of the islamic moral teaching at man’s in cilacap regency no. score category 1. 3.25 ≤ x very good 2. 2.50 ≤ x ˂ 3.25 good 3. 1.75 ≤ x ˂ 2.50 poor 4. x < 1.75 very poor reid (research and evaluation in education), 4(1), 2018 issn 2460-6995 an evaluation of islamic moral teaching… 9 siti amanah & haryanto in general, based on the data which were collected through the questionnaire, the outcome of the islamic moral teaching at man’s in cilacap regency reaches the score of 2.99, or it is in a good category. this shows that the students have partially applied good morality in their relation with allah swt, the prophet, other people (social interaction), themselves, and also the nature. the score which is gained by each man is shown in table 16. table 16. 
scores on the outcome stage of the islamic moral teaching
indicator            madrasa        score   %    category
evaluation outcome   man cilacap    3.04    76   good
                     man kroya      3.15    79   good
                     man majenang   2.89    72   good
                     average        2.99    75   good

there are two indicators in the outcome of the islamic moral teaching at man’s in cilacap regency, namely: (1) the application of the islamic morality inside the area of the madrasas, and (2) the application of the islamic morality outside the area of the madrasas.

the implementation of islamic morality by students inside the area of the madrasas

in applying islamic morality in the area of the madrasas, the students of man cilacap gained the score of 3.07 (77%). the students of man kroya scored 3.18 (80%), and the students of man majenang scored 2.90 (73%). in agreement with these scores, during the interviews, the madrasa principals and the teachers stated that, in general, the students applied islamic morality around the schools well. it is portrayed, evidently, in the behavior and habits of the students. they are friendly, disciplined, soft-spoken, top achievers, and also active in worship and other religious activities. table 17 and figure 4 show the score on the implementation of islamic morality in each madrasah aliyah negeri (man) in cilacap regency.

table 17. the score and percentage of the students’ application of the islamic morality in the area of the madrasas
indicator                                              madrasa        score   %    category
application of islamic morality inside the madrasas   man cilacap    3.07    77   good
                                                       man kroya      3.18    80   good
                                                       man majenang   2.90    73   good
                                                       average        3.01    75   good

figure 4. the score of the students’ application of islamic morality in the area of the madrasas

the good category means that islamic morality education in the madrasas had driven the students to be better human beings. this is in line with the definition of education by marzuki (2009, p.
1) which states that the process of education is a part of the agents of change which shall possess the power to improve the characters of the nation through the improvement of the characters of the students in the education institutions.

the implementation of islamic morality by students outside the area of the madrasas

there were two indicators in defining the application of islamic morality, based on the orders of allah swt and the teachings of prophet muhammad saw, outside the area of the madrasas: (1) the students’ ability to control themselves and shield themselves from promiscuity and negative impacts of technology development, and (2) the students’ ability to foster awareness on social issues. the indicator is in a good category with the score of 2.88 and the percentage of 72%. the scores are shown in table 18 and figure 5.

table 18. the score and percentage of the students’ application of the islamic morality outside the area of the madrasas
indicator                                               madrasa        score   %    category
application of islamic morality outside the madrasas   man cilacap    2.88    72   good
                                                        man kroya      2.99    75   good
                                                        man majenang   2.83    71   good
                                                        average        2.88    72   good

figure 5. the score of the students’ application of the islamic morality outside the area of the madrasas

the good category means that the students had applied what they had learnt from the islamic moral teaching outside the area of the madrasas. it was evidenced in their worship activities and in their ability to stay away from bad habits, promiscuity, and the negative impacts of technology. they were also able to apply the islamic morality in the social aspects. the finding in this research is in line with zuchdi (2013, p.
20) who affirms that the community, represented by colleagues, close friends, co-workers and other parties in the community, has to take part in the development and education of morality of the students as the heirs of the nation.

conclusions and recommendations

conclusions

based on the result of the islamic moral teaching at man’s in cilacap regency, it can be concluded that: (1) the preparatory stage (antecedent) of the islamic moral teaching, covering resources, goals, material scope, and facilities and infrastructures, scored 75% or is in a good category; (2) the implementation stage (transaction) of the islamic moral teaching, covering the implementation time and the method, is in a good category (except for the evaluation model which is in a poor category); (3) in general, the outcome of the islamic moral teaching, i.e. the application of islamic morality inside and outside the area of the madrasas at man’s in cilacap regency, scored 75% or is in a good category.

recommendations

the result of the analysis shows that the indicator of the evaluation model of the programs of islamic moral teaching is in a poor category. therefore, the evaluation aspects of the education need improvement and deeper analysis. the goal is to formulate students’ morality criteria objectively. additionally, research on the effectiveness of the islamic moral teaching has to be conducted. it will serve as the follow-up of this evaluation, which should serve as the basis for researchers to conduct systematic research on the aspects of islamic moral teaching, specifically on the morality evaluation techniques and models, the development of the morality evaluation instruments, the effectiveness of the morality teaching methods and the effectiveness of the programs of the islamic moral teaching.

references

abas, z. z. (1437). makarimal akhlak. cairo: al qomar.
daradjat, z. (1984). dasar-dasar pendidikan agama islam: buku teks pendidikan agama islam pada perguruan tinggi umum.
jakarta: bulan bintang.
daradjat, z. (2000). ilmu pendidikan islam. jakarta: bumi aksara.
darmayanti, s. e., & wibowo, u. b. (2014). evaluasi program pendidikan karakter di sekolah dasar kabupaten kulonprogo. jurnal prima edukasia, 2(2), 223–234. https://doi.org/10.21831/jpe.v2i2.2721
indonesian department of religious affairs. (2011). al qur’an dan terjemahnya: syaamil al qur’an special for women. bandung: pt. sigma exa grafika.
jalaluddin. (2011). psikologi agama. jakarta: raja grafindo persada.
kartowagiran, b. (2011). kinerja guru profesional (guru pasca sertifikasi). cakrawala pendidikan, 30(3), 463–473. https://doi.org/10.21831/cp.v3i3.4208
kartowagiran, b., & maddini, h. (2015). evaluation model for islamic education learning in junior high school and its significance to students’ behaviours. american journal of educational research, 3(8), 990–995. https://doi.org/10.12691/education-3-8-7
khuriyah, k. (2003). pengembangan instrumen evaluasi ranah afektif untuk pedidikan agama islam. jurnal penelitian dan evaluasi pendidikan, 5(6), 59–73. https://doi.org/10.21831/pep.v5i6.2058
linn, r. l. (1989). educational measurement. new york, ny: macmillan.
mardapi, d. (2008). teknik penyusunan instrumen tes dan nontes. yogyakarta: mitra cendekia.
mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha medika.
marzuki. (2009). prinsip dasar akhlak mulia: pengantar studi konsep-konsep dasar etika dalam islam. yogyakarta: debut wahana press.
miles, m. b., & huberman, a. m. (1994). qualitative data analysis: an expanded sourcebook (2nd ed.). thousand oaks, ca: sage publication.
prihatini, s., mardapi, d., & sutrisno, s. (2013). pengembangan model penilaian akhlak peserta didik madrasah aliyah. jurnal penelitian dan evaluasi pendidikan, 17(2), 347–368. https://doi.org/10.21831/pep.v17i2.1705
retnawati, h.
(2016). validitas reliabilitas & karakteristik butir (panduan untuk peneliti, mahasiswa, dan psikometrian). yogyakarta: nuha medika.
rusniati, & haq, a. (2014). perencanaan strategis dalam perspektif organisasi. jurnal intekna: informasi teknik dan niaga, 14(2), 102–209. retrieved from http://ejurnal.poliban.ac.id/index.php/intekna/article/view/178
syamsudin, a., budiyono, b., & sutrisno, s. (2016). model of affective assessment of primary school students. reid (research and evaluation in education), 2(1), 25–41. https://doi.org/10.21831/reid.v2i1.8307
thoha, c. (2004). metodologi pengajaran agama (2nd ed.). yogyakarta: pustaka pelajar.
worthen, b. r., & sanders, j. r. (1973). educational evaluation: theory and practice. worthington, oh: longman.
ya’qub, h. (1991). etika islam: pembinaan akhlaqulkarimah (suatu pengantar). bandung: diponegoro.
zuchdi, d. (2013). pendidikan karakter: konsep dasar dan implementasi di perguruan tinggi. yogyakarta: uny press.

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(2), 2017, 163-173
available online at: http://journal.uny.ac.id/index.php/reid

research article

the utilization of junior high school mathematics national examination data: a conceptual error diagnosis

* 1 kartianom; 2 djemari mardapi
*graduate school of universitas negeri yogyakarta
jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*email: kartianom@gmail.com
submitted: 23 january 2018 | revised: 22 february 2018 | accepted: 26 february 2018

abstract

the goal of the research is to gain insights into the characteristics of the items in the mathematics national examination, the attributes on which the items were formulated and the result of a conceptual error diagnosis of the mathematics materials based on the result of the junior high school mathematics national examination. this is quantitative descriptive research.
the data were collected from 3,079 grade-nine students of junior high schools who took the national examination in the academic year of 2015/2016. the sample was established randomly based on the package code of the examination, which is p0c5520, with 574 students as the examinees. the documentation method was applied in collecting the data. the result of the research shows that, under the classical test theory, there are 16 items in the ‘difficult’ category, 24 in the ‘intermediate’ category, and no items in the ‘easy’ category. furthermore, under the item response theory, the result shows that 28 items are in the ‘good’ category and 12 items are in the ‘poor’ category. in addition, there are 50 attributes on which the junior high school mathematics national examination test (package p0c5520) is formulated. four attributes are content attributes and the rest (46) are process skill attributes. the result of the diagnosis shows that there are 11 types of errors made by the students when trying to complete the content items. most of the errors are conceptual errors related to the geometric materials, especially in the submaterials of polyhedron, triangles, and quadrangles.

keywords: conceptual error, attributes, junior high school mathematics national examination

how to cite item: kartianom, k., & mardapi, d. (2017). the utilization of junior high school mathematics national examination data: a conceptual error diagnosis. reid (research and evaluation in education), 3(2), 163-173. doi:http://dx.doi.org/10.21831/reid.v3i2.18120

introduction

in the education system, evaluation is an essential activity. evaluation is a medium to put students in the context of what they understand and what they are able to perform, while describing what they do not understand and what they are not able to perform (sumintono & widhiarso, 2015, pp. 2–3).
the goal of the evaluation on the result of the study as conducted by the government is to measure the competence level of the graduates on certain subjects as formulated in the national examination (or ujian nasional, un). the items in the national examination are formulated based on the competence standards of the graduates, basic competence and achievement indicators. most education practitioners utilize the reports on the result of the national examination as supporting data in the process of policy-making, as a medium in comparing the achievement of the examinees at the national level and as a medium in mapping the quality of national education. for example, the report of the junior high school national examination result for mathematics in baubau municipality in the academic year of 2014/2015 shows that the average score on mathematics is 42.62 with 15.0 as the lowest score and 97.5 as the highest score (ministry of education and culture, 2015). the result indicates that some examinees gave incorrect responses to some of the items of the mathematics national examination. the mistakes might be caused by the level of the items in the examination, by the examinees’ lack of conceptual knowledge, or by conceptual errors that they made.

a good examination item must go through a calibration process, so that information on the items can be gained from the applied test. this information is commonly called the characteristics of the items, which can be estimated by using two approaches, namely: classical test theory (ctt) and item response theory (irt). a good item can be reviewed from its difficulty level, discrimination index, and distractor effectiveness.
in the ctt approach, the index of the difficulty level of a good item must be 0.3 – 0.8, while the discrimination index must be ≥ 0.3 and each option of an item has to be selected by at least 5% of the examinees (mardapi, 2012, p. 128). in the irt approach, the index of the difficulty level (bi) of a good item must be -2.0 – +2.0 (hambleton, swaminathan, & rogers, 1991, p. 13), while the discrimination index (ai) must be 0 – +2.0 (hambleton et al., 1991, p. 15), and the pseudo guessing index (ci) must be 0 – 1/k, where k is the number of answer options (hambleton et al., 1991, p. 17). items with a very low or very high facility index cannot be categorized as good items because they cannot differentiate the level of ability of the examinees. the error indication of the examinees can be caused by the difficulty level. it might not be caused by the lack of competence. items with a negative discrimination index indicate that the correctness of the answer key is questionable. the correctness of the answer key is also questionable if the distractors are selected by only <5% of the examinees. items with a pseudo guessing index >1/k show that the distractors are not able to attract examinees with low capability (abadyo & bastari, 2015).

a conceptual error is an error in understanding a concept in which the understanding is not in accordance with the scientific definition as generally agreed by the experts in that field. in mathematics, this error happens when students fail to relate the initial concept with the newly-given one (russell, o’dwyer, & miranda, 2009, p. 416). in fact, a conceptual error is closely related to the conceptual knowledge of the examinees. mathematics conceptual knowledge is the examinees’ understanding of the scope of the field of mathematics. the scope of the mathematics subject includes: (1) number, (2) algebra, (3) geometry and measurement, and (4) statistics and probability.
therefore, in mathematics, a conceptual error can be defined as an incorrect use of the concepts which does not follow the scientific definition in the scope of the mathematics field (numbers, algebra, geometry and measurement, and statistics and probability).

in order to learn about the error indication related to a conceptual error, there should be a diagnosis process. the goal of the diagnosis activity is to understand the strengths and weaknesses of the examinees (leighton & gierl, 2007, p. 242). cognitive diagnosis models (cdms) can be utilized in two ways: (a) retrofitting (post-hoc analysis) of a non-diagnostic examination to gain richer or wider information and (b) designing or constructing a set of items for diagnostic purposes (ravand & robitzsch, 2015, p. 3). in the approach of retrofitting (post-hoc analysis), non-diagnostic examination instruments are reconstructed in a way that they can be used to identify the strengths and weaknesses of the examinees by defining the attributes based on which the test items are formulated. attributes are the description of knowledge in completing examination contents in a certain domain (wang & gierl, 2011, p. 166) and the basis of cognitive or skill process crucial to completing the test items (gierl, cui, & zhou, 2009, p. 5; gierl, zheng, & cui, 2008, pp. 66–67; yamtinah & budiyono, 2015, p. 71). in mathematics, attributes consist of three categories: content attributes (common materials), process attributes (expected capability after learning the materials in the content attributes) and skill attributes (specific mathematical skills critical in certain materials) (tatsuoka, 2009, p. 2). attributes utilized in this research are content attributes and process skill attributes.

there are already many studies taking advantage of diagnosis activities in indonesia.
however, most of them focus on the development of the diagnostic instruments. secondary data such as national examination, pisa and timss are rarely used in diagnostic activities. if we take a look at the studies in the last six years (2011-2017), secondary data have been a fresh medium to gain information on the influential factors in the academic achievement of examinees (kartianom & ndayizeye, 2017, p. 200) and the difficulty of the examinees in completing the mathematics test items of the national examination (isgiyanto, 2011, p. 308; retnawati, 2017, p. 33). even though the national examination is neither the main factor in determining the passing of the examinees, nor the main requirement in continuing to a higher education level, the result of the national examination is valuable data for diagnostic purposes.

to be more specific, the poor result of the junior high school national examination in baubau municipality was driven by the lack of comprehensive diagnosis on the result of the national examination, especially on the subject of mathematics. neither the academia nor the municipality administrator seems to see diagnostic activities as an urgent matter. the data of the national examination are left untouched and have not yet been transformed into insightful information. the objective of this research is to gain insights into the characteristics of the test items and see the result of the diagnosis on the conceptual error in mathematics materials based on the result of the junior high school mathematics national examination in baubau municipality.

method

this research is quantitative descriptive research which applies content analysis in drawing conclusions by identifying various characteristics specifically in a message, here the test items and the responses of the examinees, objectively, systematically and generally. the research was conducted in baubau municipality.
the data were collected from the center for education evaluation (commonly known as puspendik) in jakarta, in the form of national examination sheets and the response sheets. the data source is the ninth graders of junior high schools in the academic year of 2015/2016 in baubau municipality. the total number of the examinees is 3,079. the sample was established randomly (random sampling) based on the package code of the examination content. the researchers selected the package code of p0c5520 with 574 examinees in total. the object of the research is 40 test items and 22,960 responses of the examinees. the ex post facto data in the form of the examinees’ responses and the items in the junior high school mathematics national examination were collected using the documentation technique. the data were analyzed for diagnostic information. the items in the national examination were selected to be the data because they had been standardized. therefore, the bias has been minimized. moreover, they had been calibrated, which allowed the researchers to compare the existing series and the packages from each year.

a good examination instrument must be valid and reliable. in this research, the instruments chosen are the instruments of the national examination which have been tested in large and small scales. therefore, it is safe to assume that the validity and reliability of the instruments are fulfilled. the validity implemented in this research is closely related to the attribute formation. the validity of the content of the attributes on which the test items are formulated was proven based on the judgment of the experts. in order to produce the content validity index of the attributes formation, the result of the judgment was then calculated using the aiken formulation. based on the aiken index, the researchers formulated criteria in order to show the content validity of the attributes formation (see table 1) (kartianom, 2017, p. 153).
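as a rough illustration of the aiken formulation mentioned above, the sketch below computes the index for one attribute from the experts’ ratings and maps it to the three bands of table 1. the function names, the 1-4 rating scale and the five ratings are hypothetical and are not taken from the study; the lowest band is read here as an index below 0.4.

```python
def aiken_v(ratings, lowest, categories):
    """aiken's v for one attribute: sum(r - lowest) / (n * (categories - 1))."""
    n = len(ratings)
    total = sum(r - lowest for r in ratings)
    return total / (n * (categories - 1))


def validity_category(v):
    """map an aiken index to the bands of table 1 (band edges are an assumption)."""
    if v > 0.8:
        return "high"
    if v >= 0.4:
        return "medium"
    return "low"


# hypothetical ratings given by five experts on a 1-4 scale
example_ratings = [4, 4, 3, 4, 4]
example_v = aiken_v(example_ratings, lowest=1, categories=4)
example_category = validity_category(example_v)
```

under these assumptions, an index such as the 0.888 reported for the 40 items would fall in the ‘high’ band.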
table 1. content validity index criteria
aiken index   content validity criteria
< 0.4         low
0.4 – 0.8     medium
> 0.8         high

in order to understand the characteristics of the items using the ctt approach, the data were analyzed using tap software version 14.7.4. table 2 shows the criteria of good items based on the ctt approach (mardapi, 2012, p. 128).

table 2. item characteristic criteria using ctt
parameter   criteria
ai          more than or equal to 0.3
bi          0.3 to 0.8
ci          each answer choice is chosen by at least 5% of the examinees
description: ai = item differentiator (discrimination) index; bi = item difficulty level index; ci = distractor effectiveness index

using the irt approach, the data were analyzed with the help of bilog-mg software. prior to the analysis, the sample was tested for its adequacy using spss 11.5 software. the sample is considered adequate when the value of the kaiser-meyer-olkin measure (kmo) is > 0.5 with a significance value (sig.) of < 0.05. after that, the assumptions underlying the item parameter estimation using the irt approach were tested. the assumptions to be fulfilled were unidimensionality and local independence. the unidimensionality assumption was tested with the support of spss 11.5 software based on the formation of the dominant factor. the factors formed were those with an eigenvalue > 1.0. the dominant factor has a large eigenvalue discrepancy with the next factor and it has at least 20% cumulative frequency (retnawati, munadi, & al-zuhdy, 2015). the local independence assumption will be automatically fulfilled when the unidimensionality assumption is fulfilled (retnawati, 2014, p. 141). when the assumptions in the irt approach have been fulfilled, the next step is the goodness of fit test. there are three models in the irt approach: model 1-pl, model 2-pl and model 3-pl.
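the three models named above differ only in which item parameters are free. a minimal sketch of the standard logistic form is given below, assuming the usual notation (a = discrimination, b = difficulty, c = pseudo guessing); the function name and the scaling constant d are illustrative assumptions and do not describe how bilog-mg is configured in this study.

```python
import math


def irt_probability(theta, b, a=1.0, c=0.0, d=1.0):
    """probability of a correct response for ability theta.

    1-pl: only b is free (a fixed, c = 0)
    2-pl: a and b are free (c = 0)
    3-pl: a, b and c are free
    d is an optional scaling constant (assumed 1.0 here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


# an examinee whose ability equals the item difficulty
p_2pl = irt_probability(theta=0.0, b=0.0, a=1.5)          # 2-pl item
p_3pl = irt_probability(theta=0.0, b=0.0, a=1.5, c=0.25)  # 3-pl item, four options
```

in the 1-pl and 2-pl cases the probability at theta = b is one half, while a non-zero c lifts the lower asymptote of the curve, which is why a very difficult item can still be answered correctly by guessing.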
the goodness of fit test is conducted with the support from bilog-mg software by comparing the significance value of $\chi^2$ with $\alpha = 0.05$ and also by inspecting the icc curve. if the value of sig. $\chi^2 > \alpha = 0.05$, the items can be categorized as fit with the model. for the icc curve, the data are considered fit when the distribution of the data matches the model (figure 1).

figure 1. icc curve

in each model, the criteria of good items in the irt approach are presented in table 3 (hambleton et al., 1991, pp. 13–17).

table 3. irt criteria of items characteristics
model   ai           bi            ci
1-pl    0 up to +2
2-pl    0 up to +2   -2 up to +2
3-pl    0 up to +2   -2 up to +2   0 up to 1/k
description: ai = item discrimination index; bi = item difficulty level index; ci = pseudo guessing index

in this research, the errors made by the examinees were analyzed through the responses to the mathematics examination contents (answer sheets of the examinees) of the national examination in the academic year of 2015/2016. the analysis was conducted by formulating the probable description of the alternative response to the test items. at this point, the researchers did not use the description of the examinees’ answers and the responses to determine the achievement of the students, but to understand the type and the area of the error.

in order to conduct the diagnosis on the conceptual errors made by the examinees, the researchers: (1) identified the attributes of the examination content by defining the options of responses to each item using the content analysis; (2) named the type of the error in each response option based on the attributes on which the items were formulated; (3) analyzed the response options using tap software version 14.7.4 to measure the percentage of each type of error in each material. there was a follow up for the most dominant type of error in order to understand the area of the error.
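the third step above, measuring the percentage of each type of error, can be sketched as a simple tally. the answer key, the mapping from wrong options to error types and the four response sheets below are all hypothetical; the study itself obtained these percentages from tap, and the percentages here are taken over incorrect responses only, which is an assumption.

```python
from collections import Counter


def error_type_percentages(responses, answer_key, error_map):
    """percentage of each error type among all incorrect responses.

    responses:  list of dicts mapping item number -> chosen option
    answer_key: dict mapping item number -> correct option
    error_map:  dict mapping (item number, wrong option) -> error type
    """
    counts = Counter()
    for sheet in responses:
        for item, option in sheet.items():
            if option != answer_key[item]:
                counts[error_map.get((item, option), "unclassified")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: 100.0 * n / total for etype, n in counts.items()}


# hypothetical two-item example
answer_key = {1: "a", 2: "c"}
error_map = {
    (1, "b"): "conceptual",
    (1, "c"): "calculation",
    (2, "a"): "conceptual",
    (2, "b"): "procedural",
}
responses = [
    {1: "a", 2: "a"},
    {1: "b", 2: "c"},
    {1: "c", 2: "b"},
    {1: "b", 2: "a"},
]
percentages = error_type_percentages(responses, answer_key, error_map)
```

the most frequent type in such a tally is the one that would receive the follow-up analysis described above.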
findings and discussion

the characteristics of the test items

classical test theory

to understand the difficulty level, differentiator, and distractor effectiveness of the examination content, the researchers applied the classical test theory when analyzing the items. the data were in the form of multiple-choice answer sheets with the answer key. table 4 shows the result of the recapitulation of the characteristics of the test items based on the difficulty level of the items in each material.

table 4. the difficulty level of the items in each material
materials     easy   medium   difficult   total
numbers       0      7        4           11
algebra       0      4        6           10
geometry      0      9        4           13
statistics    0      3        1           4
probability   0      1        1           2
total         0      24       16          40

table 4 shows that: (1) the materials on numbers have seven items in the ‘medium’ category and four items in the ‘difficult’ category; (2) the materials on algebra have four items in the ‘medium’ category and six items in the ‘difficult’ category; (3) the materials on geometry have nine items in the ‘medium’ category and four items in the ‘difficult’ category; (4) the materials on statistics have three items in the ‘medium’ category and one item in the ‘difficult’ category; and (5) the materials on probability have one item in the ‘medium’ category and one item in the ‘difficult’ category. table 5 shows the result of the recapitulation of the characteristics of the test items based on the differentiators of the items in each material.

table 5. the differentiators of the items in each material
materials     good   not good   total
numbers       9      2          11
algebra       6      4          10
geometry      8      5          13
statistics    1      3          4
probability   2      0          2
total         26     14         40

table 5 shows that, overall, the discrimination index of the test items in the content of the mathematics national examination in baubau municipality puts 26 items in the ‘good’ category and 14 items in the ‘not good’ category.
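the two ctt indices summarized in tables 4 and 5 can be sketched from a scored response matrix as follows. this is only an illustration: the discrimination index is computed here as an item-total point-biserial correlation, which is an assumption (the paper does not state which statistic tap reported), the cut-offs for ‘easy’ and ‘difficult’ are inferred from the 0.3 – 0.8 range for good items, and the six scored sheets are invented.

```python
import math


def difficulty_index(item_scores):
    """proportion of examinees who answered the item correctly."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """pearson correlation between a 0/1 item score and the total score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so the item cannot discriminate
    return cov / math.sqrt(var_x * var_y)


def difficulty_category(p):
    """assumed cut-offs: below 0.3 difficult, 0.3 to 0.8 medium, above 0.8 easy."""
    if p < 0.3:
        return "difficult"
    if p <= 0.8:
        return "medium"
    return "easy"


# hypothetical scored responses: six examinees (rows) by three items (columns)
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
totals = [sum(row) for row in scored]
first_item = [row[0] for row in scored]
third_item = [row[2] for row in scored]

first_category = difficulty_category(difficulty_index(first_item))
third_category = difficulty_category(difficulty_index(third_item))
first_is_good = point_biserial(first_item, totals) >= 0.3
```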
if we take a closer look at the materials: (1) the materials on numbers have nine items in the ‘good’ category and two items in the ‘not good’ category; (2) the materials on algebra have six items in the ‘good’ category and four items in the ‘not good’ category; (3) the materials on geometry have eight items in the ‘good’ category and five items in the ‘not good’ category; (4) the materials on statistics have one item in the ‘good’ category and three items in the ‘not good’ category; and (5) the materials on probability have two items in the ‘good’ category and no item in the ‘not good’ category.

other critical information in the classical test theory is distractor effectiveness. the distribution of the response choices can be considered effective or acceptable when each option in the test items is chosen by at least 5% of the examinees (mardapi, 2012, p. 129). figure 2 presents the functionality percentage of the distractors.

figure 2. the functionality percentage of the distractors (good: 100%; not good: 0%)

figure 2 shows that 100% of the items have effective distractors. this means the distractors in the items of the junior high school mathematics national examination in baubau municipality function well. in other words, they are able to attract the examinees.

item response theory

in principle, the item response theory uses a probabilistic model. there are three analytic models: 1-pl, 2-pl and 3-pl. in order to select the analytic model correctly, the goodness of fit test is a crucial process. however, before that, the sample adequacy and assumption tests have to be conducted. table 6 shows the result of the sample adequacy test.

table 6. the result of the kmo and bartlett's test
kaiser-meyer-olkin measure of sampling adequacy        0.810
bartlett's test of sphericity   approx. chi-square     2425.233
                                df                     780
                                sig.                   0.000

table 6 shows that the kmo value is 0.810, which is higher than 0.5. this means that the sample used in this research is adequate. next, the unidimensional assumption test was conducted while considering the scree plot (figure 3).

figure 3. the scree plot of the result of the exploratory factor analysis (x-axis: component number; y-axis: eigenvalue)

the scree plot in figure 3 shows that there is one dominant factor in the junior high school mathematics national examination in the academic year of 2015/2016 in baubau municipality. this can be seen from the shift in the eigenvalue from the first factor to the second factor. in the second factor and beyond, the shift of the eigenvalue is not too high. therefore, it is safe to conclude that the unidimensional assumption on the contents of the junior high school mathematics national examination in the academic year of 2015/2016 in baubau municipality has been fulfilled. when the unidimensional assumption has been fulfilled, the local independency assumption is automatically fulfilled. this also means that there is a correlation among the factors in the junior high school mathematics national examination in the academic year of 2015/2016 in baubau municipality, so the goodness of fit test can be conducted.

the goodness of fit test for models 1-pl, 2-pl and 3-pl is conducted by comparing the significance value of $\chi^2$ with $\alpha = 0.05$ and by inspecting the icc curve. table 7 shows the result of the goodness of fit test for 1-pl, 2-pl and 3-pl.

table 7. the result of the goodness of fit between the items and the model
fitting model           fitting items
                        model 1-pl   model 2-pl   model 3-pl
sig. chi-square value   24           35           13
using icc curve         5            12           2

table 7 shows that based on the goodness of fit test, 24 items fit with model 1-pl, 35 items fit with model 2-pl and 13 items fit with model 3-pl.
when the goodness of fit test with the icc curve is applied, five items fit with model 1-pl, 12 items fit with model 2-pl and two items fit with model 3-pl. this makes model 2-pl the best-fitting analytic model. the parameters used in model 2-pl are the difficulty level (bi) and differentiators (ai), whereas guessing (ci) for the item is considered zero. the items which fit with model 2-pl are brought to the next analytic step. the items are as follows: items 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 and 40. the items that do not fit with model 2-pl are not included in the next analytic steps even though they have difficulty and differentiators as the parameters. these excluded items are items 6, 11, 18, 23 and 28. table 8 shows the result of the characteristics analysis on the test items based on model 2-pl with the support from the bilog-mg program.

table 8. the characteristics of the test items based on the parameter of difficulty level and differentiators
category   parameter a   parameter b   frequency desc.
good       35            28            28
not good   0             7             7
total      35            35            35

table 8 shows that based on the criteria of model 2-pl, there are 28 items in the ‘good’ category and 7 items in the ‘not good’ category. in fact, those 7 items in the ‘not good’ category possess good differentiators but have a bad difficulty level. those items are items 33, 9, 15, 29, 19, 21, and 35. respectively, their difficulty level parameters are 4.463, 4.027, 3.870, 2.747, 2.644, 2.100, and 2.028. these items have a very high difficulty level, with item 33 having the highest difficulty level. in terms of the differentiator’s parameter, all 35 items fall in the ‘good’ category.
this strengthens the indication that the errors in the examinees’ responses, specifically while trying to complete items 33, 9, 15, 29, 19, 21 and 35, are not caused by the difficulty level. in addition to the item parameters, the researchers also gained insights into the test information function as shown in figure 4.

figure 4. information functions and test measurement error

figure 4 shows that the content of the junior high school mathematics national examination in the academic year of 2015/2016 in baubau municipality yields higher information than measurement error in the ability range from -1.6 to +4.0. if the examination was delivered to examinees with ability lower than -1.6 or higher than +4.0, the error in the measurement would be a lot higher than the information function.

subject-matter mastery in the mathematics national examination

the subject-matter mastery of the test takers of the national examination of mathematics of the academic year 2015/2016 can be seen from the proportion of correct answers of the test takers on the number, algebra, geometry, statistics, and probability materials as presented in figure 5.

figure 5. percentage of students’ answers to each material

figure 5 shows that all materials tested on the mathematics national examination of the academic year 2015/2016 in baubau municipality are considered difficult by the test takers. this can be seen from the percentage of wrong answers, which is greater than the percentage of correct answers of the test takers on each material.

attributes on which test items are formulated

the attributes on which the items are formulated were developed and validated by five experts (expert judgment), three of whom are mathematics teachers of state junior high schools in yogyakarta who had previously been involved in the development of the examination, and two of whom are mathematics lecturers. in total, the attributes of the items of the junior high school mathematics national examination in the academic year of 2015/2016 in baubau municipality consist of four content attributes and 46 process skill attributes. the content validity index of the attributes of those 40 items is 0.888, which falls in the ‘high’ category. table 9 shows the distribution of the attributes of the items in each material.

table 9. the distribution of the test items attributes

no    | material                   | content attributes | process skill attributes
1     | numbers                    | 1                  | 13
2     | algebra                    | 1                  | 13
3     | geometry                   | 1                  | 14
4     | statistics and probability | 1                  | 6
total |                            | 4                  | 46

table 9 shows the distribution of the attributes on which the test items are formulated. each material competence has several attributes. some of the attributes are alike and some are different. thus, the material competence has to be divided into groups along with all of the attributes.

diagnosis of the examinees’ errors

error type

the identification of the error focuses on the attributes which are not mastered and applied correctly by the examinees when they are trying to complete the items in the mathematics national examination. based on the content analysis, the errors can be categorized into 11 types: (1) conceptual errors, (2) language-related interpretative errors, (3) procedural errors, (4) calculation errors, (5) representation errors, (6) conceptual and language-related interpretative errors, (7) conceptual and calculation errors, (8) conceptual and representation errors, (9) language-related interpretative and procedural errors, (10) representation and procedural errors, and (11) representation and calculation errors. figure 5 shows the percentage of each type of error. furthermore, in general, table 10 shows the frequency of each type of error. table 10 shows that most of the errors are conceptual errors.
they are in the area of the basic concepts of numbers, algebra, geometry (plane figure and solid figure) and probability. most of them are found in geometric materials.

figure 5. the percentage of each type of error in each material (error types broken down by material: number, algebra, geometry, statistics, probability; horizontal axis: percentage, 0% to 14%)

table 10. types of errors made by the examinees

types of errors                                | frequency | percentage (%)
conceptual                                     | 5804      | 41.41
language-related interpretative                | 1749      | 12.48
procedural                                     | 1106      | 7.89
calculation                                    | 873       | 6.23
representation                                 | 1759      | 12.55
conceptual and language-related interpretative | 966       | 6.89
conceptual and calculation                     | 575       | 4.10
conceptual and representation                  | 347       | 2.48
language-related interpretative and procedural | 271       | 1.93
procedural and representation                  | 81        | 0.58
calculation and representation                 | 486       | 3.47
total                                          | 14017     | 100

the area of the conceptual errors

the most dominant conceptual errors are: (1) the basic concept of integers in the materials of numbers, root form (irrational) and comparison; (2) the concept of relation and function, basic concept of algebraic operation, basic concept of integers and straight line equation in the materials of algebra; (3) the basic concept of geometry, polyhedron, triangles and quadrangles in the materials of geometry; (4) the basic concept of probability in the materials of statistics. these are all shown in detail in figure 6.

discussion

by using ctt and irt, there are five items with a very high level of difficulty (items 9, 15, 19, 21, and 33). item 9 is related to number; items 9, 15 and 21 are about algebra, while item 33 is related to geometry.
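the percentages in table 10 follow directly from the frequencies. a short python check, using only the counts reported in the table (the variable names are illustrative), reproduces them:

```python
# Frequencies of each error type as reported in table 10.
error_counts = {
    "conceptual": 5804,
    "language-related interpretative": 1749,
    "procedural": 1106,
    "calculation": 873,
    "representation": 1759,
    "conceptual and language-related interpretative": 966,
    "conceptual and calculation": 575,
    "conceptual and representation": 347,
    "language-related interpretative and procedural": 271,
    "procedural and representation": 81,
    "calculation and representation": 486,
}

# Total number of recorded errors.
total = sum(error_counts.values())

# Share of each error type, rounded to two decimals as in the table.
percentages = {name: round(100 * count / total, 2)
               for name, count in error_counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
print("total:", total)
```

because each percentage is rounded separately, the rounded values need not add up to exactly 100.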
the high percentage of students answering those items wrongly is due to the very high level of item difficulty. besides, the very high level of item difficulty indicates that there are a lot of students with incomplete attributes of those materials. based on the content analysis, there are 11 types of students’ errors. the conceptual error is the dominant type of error and mostly occurred in geometry-related items. in line with the result of this research, isgiyanto (2011) also found that, in indonesia, junior high school students are weak at geometry and measurement, with a low level of completeness of the content/concept attributes. the conceptual errors made by the students are indicated by the conceptual errors occurring in number and algebra materials. the testees’ understanding of numbers is the key to understanding the material of algebra. the understanding of numbers and algebra is the requirement for the understanding of the geometrical materials.

figure 6. the area of error in each material (areas shown: root forms, proportion, integer, sets, relation or function, quadratic equation, linear equation, shape, basic probability; grouped by material: number, algebra, geometry, statistics and probability; horizontal axis: 0% to 100%)

further, in their study, russell et al. (2009, p. 416) mention that a conceptual error occurs because of the failure in connecting a new concept with the earlier concept. specifically, the conceptual errors made by the students are located in the basic concepts of integers, irrationals, comparisons, relation and function, algebraic operation, linear equation, polyhedron geometry, triangle, square, and probability. the findings of this research are supported by the findings of a research conducted by retnawati (2017, p.
33), which found that junior high school students in yogyakarta, indonesia found it difficult to finish the national examination questions due to their inability to understand the concept of fractions, rationalizing fractions with a square-root denominator, linear equations with one or two variables, determining the members of a set, determining the gradient of a linear equation, and the concept of area.

conclusion and recommendations

conclusion

based on the result of the analysis and description, it can be concluded that, first, based on the classical test theory, 16 test items are in the ‘difficult’ category, 24 are in the ‘medium’ category, and no item is in the ‘easy’ category. based on item response theory, 28 items are in the ‘good’ category and 12 items are in the ‘not good’ category. second, there are 50 attributes (4 content attributes and 46 process skill attributes) on which the junior high school mathematics national examination content (package p0c5520) is formulated. third, there are 11 types of errors made by the examinees when they tried to complete the examination. most of the errors are conceptual errors in the materials of geometry, especially in the sub-materials of polyhedron, triangles and quadrangles.

recommendation

based on the conclusion, the recommendations are: (1) for users of the diagnostic information, the result of the research can be used as material for training on the process of producing diagnostic information. it is expected that this type of training can be used to improve the quality of the learning process in the schools with low results in the mathematics national examination. (2) for researchers, this research focuses only on diagnosing the types and areas of error made by the examinees when trying to complete junior high school mathematics national test items based on the attributes of the items.
therefore, this research can be deepened by diagnosing the errors or difficulties faced by the examinees with the help of the cdm package in r, using the dina model.

references

abadyo, a., & bastari, b. (2015). estimation of ability and item parameters in mathematics testing by using the combination of 3plm/grm and mcm/gpcm scoring model. reid (research and evaluation in education), 1(1), 55–72.
gierl, m. j., cui, y., & zhou, j. (2009). reliability and attribute-based scoring in cognitive diagnostic assessment. journal of educational measurement, 46(3), 293–313. https://doi.org/10.1111/j.1745-3984.2009.00082.x
gierl, m. j., zheng, y., & cui, y. (2008). using the attribute hierarchy method to identify and interpret cognitive skills that produce group differences. journal of educational measurement, 45(1), 65–89. retrieved from https://pdfs.semanticscholar.org/0a0b/180342ee51f6121dd4e3199c9cc4df3bc377.pdf
hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. new delhi: sage publications.
isgiyanto, a. (2011). diagnosis kesalahan siswa berbasis penskoran politomus model partial credit pada matematika. jurnal penelitian dan evaluasi pendidikan, 15(2), 308–325. retrieved from https://journal.uny.ac.id/index.php/jpep/article/view/1099/1151
kartianom, k. (2017). diagnosis kesalahan konsep materi matematika smp berdasarkan hasil ujian nasional di kota baubau. master thesis, universitas negeri yogyakarta, indonesia.
kartianom, k., & ndayizeye, o. (2017). what’s wrong with the asian and african students’ mathematics learning achievement? the multilevel pisa 2015 data analysis for indonesia, japan, and algeria. jurnal riset pendidikan matematika, 4(2), 200–210. https://doi.org/10.21831/jrpm.v4i2.16931
leighton, j. p., & gierl, m. j. (2007).
defining and evaluating models of cognition used in educational measurement to make inferences about examinees’ thinking processes. educational measurement: issues and practice, 26(2), 3–16. https://doi.org/10.1111/j.1745-3992.2007.00090.x
mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha medika.
ministry of education and culture. (2015). laporan hasil ujian nasional. jakarta: balitbang.
ravand, h., & robitzsch, a. (2015). cognitive diagnostic modeling using r. practical assessment, research & evaluation, 20(11). retrieved from http://pareonline.net/getvn.asp?v=20&n=11
retnawati, h. (2014). teori respons butir dan penerapannya: untuk peneliti, praktisi pengukuran dan pengujian, mahasiswa pascasarjana. yogyakarta: nuha medika.
retnawati, h. (2017). diagnosing the junior high school students’ difficulties in learning mathematics. international journal on new trends in education and their implications, 8(1), 33–50. retrieved from http://www.ijonte.org/fileupload/ks63207/file/04.heri_retnawati.pdf
retnawati, h., munadi, s., & al-zuhdy, y. a. (2015). factor analysis to identify the dimension of test of english proficiency (toep) in the listening section. reid (research and evaluation in education), 1(1), 45–54. https://doi.org/10.21831/reid.v1i1.4897
russell, m., o’dwyer, l. m., & miranda, h. (2009). diagnosing students’ misconceptions in algebra: results from an experimental pilot study. behavior research methods, 41(2), 414–424. https://doi.org/10.3758/brm.41.2.414
sumintono, b., & widhiarso, w. (2015). aplikasi pemodelan rasch pada asesmen pendidikan. bandung: trim komunikata.
tatsuoka, k. k. (2009). cognitive assessment: an introduction to the rule space method. new york, ny: routledge/taylor & francis.
wang, c., & gierl, m. j. (2011). using the attribute hierarchy method to make diagnostic inferences about examinees’ cognitive skills in critical reading. journal of educational measurement, 48(2), 165–187.
https://doi.org/10.1111/j.1745-3984.2011.00142.x
yamtinah, s., & budiyono, b. (2015). pengembangan instrumen diagnosis kesulitan belajar pada pembelajaran kimia di sma. jurnal penelitian dan evaluasi pendidikan, 19(1), 69–81. https://doi.org/10.21831/pep.v19i1.4557

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 5(1), 2019, 10-20
available online at: http://journal.uny.ac.id/index.php/reid

methods used by mathematics teachers in developing parallel multiple-choice test items in school

*1kartika pramudita; 2r. rosnawati; 3socheath mam
1,2department of educational research and evaluation, graduate school of universitas negeri yogyakarta
jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
3faculty of education, royal university of phnom penh, cambodia
russian federation boulevard, toul kork, phnom penh, cambodia
*corresponding author. e-mail: kartika_pramudita.2017@student.uny.ac.id
submitted: 04 december 2018 | revised: 12 february 2019 | accepted: 14 february 2019

abstract

the study was aimed at describing five methods of the development of parallel test items of the multiple-choice type in mathematics at yogyakarta (primary education level). the study was descriptive research involving 22 mathematics teachers as the respondents. data collection was conducted through interviews and document reviews concerning the developed test packages. a questionnaire was used to gather data about the procedure the teachers employed in developing the tests. findings show that the teachers used five methods in developing the test items, namely (1) randomizing the item numbers; (2) randomizing the sequences of response options; (3) writing items using the same contexts but different figures; (4) using anchor items; and (5) writing different items based on the same specification table.
all of the respondents stated that they developed the table of specification before developing the test items, and most of them (77%) validated the instruments in content and language.

keywords: parallel test items, test item development, mathematics evaluation, multiple-choice testing

permalink/doi: https://doi.org/10.21831/reid.v5i1.22219

introduction

evaluation is one of the essential aspects of education that will contribute to the achievement of educational quality. one of the objectives of evaluation is to know students’ real competence. effective evaluation can differentiate between high- and low-achieving students. an effective evaluation gathers valid evidence concerning learning outcomes. the process and product of evaluation are also able to improve students’ motivation and achievement in learning (stiggins & chappuis, 2012, p. 3). one type of evaluation conducted in school is cognitive evaluation. cognitive evaluation can be performed by using tests that will show the individual or group characteristics (rasyid & mansur, 2008, p. 11). assessment for learning is integral to best practice in teaching and learning. the development of a measurable test instrument must be done through qualitative and empirical research. according to mardapi (2008, p. 15), an instrument, either test or non-test, must have evidence for validity and reliability so that test results can be comparable and economical. a test is said to be valid if it measures what it is supposed to measure. a test with high validity will have a low error of measurement, meaning that the scores obtained by testees are close to the true scores. a test is said to be reliable if the observed scores have a high correlation with the true scores. sources of evidence for an instrument’s validity can be traced from the contents of the test, in the form of qualitative analyses of the materials, constructs, and language of the test.
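the two definitions above can be written compactly in classical test theory notation. the symbols below (x for the observed score, t for the true score, e for the measurement error) are standard but are not used in the article itself, so this is offered as an illustrative sketch rather than as the authors' formulation.

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma_T^{2}}{\sigma_X^{2}} = \rho_{XT}^{2}, \qquad
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}
```

here the reliability coefficient equals the squared correlation between observed and true scores, and the standard error of measurement (sem) shrinks as the reliability approaches 1.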
http://dx.doi.org/10.21831/reid.v5i1.22219
methods used by mathematics teachers... kartika pramudita, r. rosnawati, & socheath mam copyright © 2019, reid (research and evaluation in education), 5(1), 2019 11 issn 2460-6995

a test battery used in an evaluation can take various test forms. the test form selected must be in line with the objective of the testing. one common test form is the multiple-choice test. a multiple-choice test item consists of a stem followed by several alternative responses (kehoe, 1995b, p. 2). the multiple-choice test form is suitable for testing that involves an enormous amount of material, such as the national examination (ne) or national-standard school examination (nsse). the superiority of multiple-choice testing is that it covers a high number of items, is objective, is efficient, and can be highly reliable (reynolds, livingston, & willson, 2009, pp. 184–186). a multiple-choice test can measure all the thinking processes in the cognitive domain from the lowest to the highest levels. it can therefore be highly suitable for testing in the field of mathematics (torres, lopes, babo, & azevedo, 2011, p. 11). a number of studies have been done on the evaluation of mathematics learning using the multiple-choice test mode. one study was conducted to measure the higher-order thinking skills in mathematics of junior high school students using a multiple-choice test with four options (rosnawati, kartowagiran, & jailani, 2015, pp. 189–196). the multiple-choice tests most frequently studied are those of the ne and nsse. some of the problems related to the use of these two tests are the quality of the tests and the frauds that frequently occur during test administrations. a study shows that, based on item response theory analyses, of the 40 items of the mathematics ne for the junior high school, 28 are good and 12 are poor (kartianom & mardapi, 2017, p. 172).
fraud during the administration of the national examination can be examined through the ne integrity indexes. in some regions, the integrity indexes are low, indicating a high level of fraud in the administration of the exams. this condition indicates that students of the primary and junior secondary schools are still fearful of the exams, although the results are not the only determinant for passing. the national exam, however, is used as a criterion for admission to the higher school level. for this reason, students make all kinds of efforts to get good results, one of which is sharing answer keys. the multiple-choice system makes it possible for the test takers to exchange answers easily. this opportunity gives rise to illegal cooperation among the test takers, which causes the test results to be invalid. consequently, the exam results do not reflect the real competences of the students. this problem needs a solution. one solution taken by the government is to give out several parallel tests. the methods and rules for developing parallel tests differ among subject matters. in mathematics, item stems and options involve a lot of figures. differences in the figures can have an impact on the levels of item difficulty. even numbers and odd numbers give different difficulty levels. the choice of distractors also influences difficulty levels. the development of the test packages for mathematics, therefore, must obey the rules. from another angle, mathematics teachers are expected to prepare the students for the national examination. in order to know the teachers’ readiness to do so, research needs to be conducted. a study on the competence and readiness of mathematics teachers looked at the self-efficacy of mathematics teachers in yogyakarta.
the findings show that the self-efficacy of 43.07% of the teachers is in the low category, 55.47% in the medium category, and the remaining 1.46% in the high category (widdiharto, kartowagiran, & sugiman, 2017, pp. 69–75). these findings indicated that teachers’ confidence in facing the ne was at the medium level. to probe further into the competence and readiness of teachers in approaching the ne and nsse, it was necessary to know the teachers’ competencies in developing practice tests and try-outs for the ne. the purpose of the try-outs was to see each student’s competence achievement to be used as a basis for improvement activities. it is, therefore, crucial that the test items developed by teachers be functional in showing the students’ competences. another measure to be taken is one that could minimize students’ interaction in doing the test. this minimization is done by developing several test packages. the packages should be parallel so that they would not raise a new problem. a parallel test must have identical objective, difficulty level, and format so that the test will be the same, but the items will be different. if the packages have been able to minimize frauds but have different levels of difficulty, the results will not be valid either. it is, therefore, necessary that the development of the test packages consider the parallelism of the items that are developed by teachers through a variety of methods. before testing the parallelism of the test packages, it is necessary to gather information concerning the methods used by the teachers to develop the test packages. this paper aims to figure out how teachers develop parallel test items of the multiple-choice type in mathematics.
method

the research employed a descriptive research approach to obtain information about the methods that the teachers used for developing the mathematics test packages in the school. the study used interviews and document reviews as the test techniques and questionnaires as the non-test technique for collecting pertinent data. open-ended interviews were given to 22 mathematics teachers. each teacher was given the freedom to provide information about the method he/she used in developing the test packages. each teacher was allowed to give more than one response, depending on his/her experiences. the research instrument used to gather data was an interview guide. it contained questions about the methods used by the teachers to develop the test packages and the reasons for selecting the methods. in order to obtain evidence that the teachers did use the packages, a document review was done. besides confirming that the packages existed, it was also used for finding the students’ results on the tests. the questionnaires were used to look at the procedures for developing the packages. they were used to know the steps the teachers employed in developing the packages, from the formulation of the objectives and the construction of the specification table to the item validation of content and language. they were also used to obtain evidence on the consistency between the item development and the test development procedure. the questionnaires were completed with check and cross marks. a check mark was given if a teacher did the step in the test development, and a cross mark if a teacher did not.
findings and discussion

findings

the key findings of the study are that, in developing mathematics test packages, teachers applied five methods: (1) randomizing the item numbers; (2) randomizing the sequences of response options; (3) writing items using the same contexts but different figures; (4) using anchor items; and (5) writing different items based on the same specification table. the largest share of the teachers, 37% (of 22 teachers), used the same contexts with different figures to construct test items (as seen in figure 1). it was followed by 21% who developed different test items from the same table of specification. meanwhile, the other teachers developed the same items in different item numbers, developed the same items with different orders for the options, or used anchor items.

figure 1. methods of test package development
notes:
a: same items in different item numbers
b: same items with different orders of options
c: same contexts with different figures
d: using anchor items
e: different items from the same specification table

figure 2. data of instrument validation

table 1. randomization of the orders of response options

package                                               | sample item
package 1 (sequencing from small to large numbers)    | line gradient passing through point a( , 3) and b(6, 2 ) is 7. value of is … . a. 2 b. 3 c. 4 d. 5
package 2 (sequencing from large to small numbers)    | line gradient passing through point a( , 3) and b(6, 2 ) is 7. value of is … . a. 5 b. 4 c. 3 d. 2

figure 2 presents a diagram of the results of the questionnaires completed by the 22 respondent teachers. it shows that all the teachers constructed the specification table before beginning to write the test items. next, 17 teachers had their items validated in content and language by peer teachers.
the remaining five teachers did not have their items validated. in developing test items, one should follow all steps set out in the procedure. after writing the items, teachers should have subjected them to peer validation by their colleagues as experts (torres et al., 2011, p. 7).

randomizing item numbers

from the interviews, 18% of the teachers stated that they randomized item numbers to produce parallel items. thus, the same test items were developed but were sequenced in different numbers. the difficulty levels and differentiating powers of the items were the same. the distractor functioning was the same too because identical distractors were used. the method of randomizing item numbers is easy to use, does not take much time, and produces many test packages, as many as the test items. the interviews reveal that some respondents commented that developing the items by changing the options order gave an advantage to the students who got an item order that is the same as the content order. however, those who got item orders that are different from the content order were put at a disadvantage because mathematics is built of axiomatic and deductive systems such that content sequences are highly compact.

randomizing sequences of options

a total of 11% of respondents had experience of randomizing the order of the response options. in developing a multiple-choice test, randomizing the order of the response options can minimize illegal interaction among the testees. the interview result reveals that randomizing the order of the options may result in two possibilities. first, if students find out that the options are different only in their order, they can work out a way to interact with each other. in other words, this method still makes it possible for them to interact although they get different test packages. second, if the students do not realize that the tests are different only in the order of the options, they will not get an advantage from their interaction.
thus, in this case, the method functions well in minimizing frauds. at least two test packages are needed in using this method, since sequencing can be done in two ways: from small to large or from large to small. an example of test package development by altering the order of the options by size is shown in table 1. in table 1, the stem in the two packages is the same, but the order of the options is different, although the options are the same. test packages that have all the options in figures can only be developed in two different versions. if the response options are not in the form of figures, more packages can be obtained (table 3). in this version, the stem and options are the same, but the order of the options is different. the number of packages that can be developed depends on the number of options. for example, a three-option item can be sequenced in several versions (table 2).

table 2. randomization of response options

package   | option order
package 1 | a. p1 b. p2 c. p3
package 2 | a. p1 b. p3 c. p2
package 3 | a. p2 b. p3 c. p1
package 4 | a. p2 b. p1 c. p3
package 5 | a. p3 b. p1 c. p2
package 6 | a. p3 b. p2 c. p1

table 3. randomization of response options

package                                                      | item
package 1                                                    | line equation that passes the point (0, -2) and point (4, 1) is … . a. b. c. d.
package 2 (option a is exchanged with d and b with c)        | line equation that passes the point (0, -2) and point (4, 1) is … . a. b. c. d.
package 3 (option a is exchanged with c and b with d)        | line equation that passes the point (0, -2) and point (4, 1) is … . a. b. c. d.
package 4 (option a is exchanged with b and c with d)        | line equation that passes the point (0, -2) and point (4, 1) is … . a. b. c. d.
etc.

from table 2, it can be seen that six test packages can be developed by changing the order of three response options. the number of test packages obtained by randomizing the order of the options is n!, where n is the number of options. the number of packages can increase if a number of ways of item combination are impacted by certain items. suppose package 1 is an initial item; packages 2 to 6 can be constructed in the way shown in table 3. package 7 can be constructed by exchanging options a and b on an even item and b and c on an odd item. this way of combining items will give a larger number of packages. table 3 presents an original item of two test packages with no order of options. package 2 is obtained by changing options a and b in the initial item. the number of options influences the number of packages. generally, the mathematics items for primary and junior secondary schools have four options, while those for senior secondary schools have five options. the method of constructing test items by changing the order of the options is intended to maintain item characteristics. the distractors are also expected to function effectively. numerous studies have been conducted that are related to the quality of multiple-choice tests. the studies commonly look into the quality of items in terms of levels of difficulty, differentiating powers, and distractor effectiveness. in addition to revealing information about test qualities, these studies also look into aspects that need to be improved to increase the quality of tests so that they measure well.

constructing items using the same context but different figures

in the study, 37% of the respondents constructed the test items using the same contexts but different figures. this method (see table 4) results in two test packages that will be able to minimize the testees’ interaction.
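the n! count for option orderings can be checked by enumerating the orderings directly. the python sketch below is only an illustration (the function name is hypothetical, and the option labels p1 to p3 simply follow table 2); it uses the standard library to list every ordering of the response options.

```python
from itertools import permutations
from math import factorial


def option_orderings(options):
    """Return every possible ordering of the given response options."""
    return [list(order) for order in permutations(options)]


# Three options, labelled as in table 2.
three_option_packages = option_orderings(["p1", "p2", "p3"])

# Print each ordering with its option letters (a., b., c.).
for number, order in enumerate(three_option_packages, start=1):
    labelled = " ".join(f"{letter}. {opt}" for letter, opt in zip("abc", order))
    print(f"package {number}: {labelled}")

# Four- and five-option items, as used in junior and senior secondary tests.
four_count = len(option_orderings(["p1", "p2", "p3", "p4"]))
five_count = len(option_orderings(["p1", "p2", "p3", "p4", "p5"]))
print(four_count, five_count)
```

the enumeration order produced by the standard library is not the same as the package numbering in table 2, but the set of orderings is identical.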
the teachers revealed that this method of test construction decreases the students’ chance to cooperate. however, item construction using this method should be done carefully, with full attention to the figures used in each package. even though the figures in each package are different, care must be taken in terms of even and odd figures, since boys and girls perceive these figures differently (wilkie & bodenhausen, 2015, pp. 3–9). besides, the size of the figures must also be taken into account to make sure that the item difficulties are equal. item difficulty levels influence discriminating power; good items will be correctly answered by 30% to 80% of the testees (kehoe, 1995a, p. 1). these percentages must be maintained so that fraud in test administration is minimized and the results are fair to all the testees.

table 4. items with the same context but different figures
package | item
package 1 | a room has an air conditioner set at 3°c. after the device is activated, the room temperature decreases by 2°c every 4 minutes. when the air conditioner has been activated for 28 minutes, the room temperature will be … °c. a. –20°c b. –15°c c. –12°c d. –11°c
package 2 | an air conditioner is set at 5°c. after it is activated, the temperature of the device decreases by 4°c every 8 minutes. when the air conditioner has been activated for 32 minutes, its temperature will be … °c. a. 21 b. 16 c. –11 d. –59

table 4 presents two test items in different test packages developed with the same context but different figures. package 1 uses figures that are relatively smaller than those in package 2. the combination of even and odd numbers, however, is equal. item 1 uses 3, and item 2 uses 5. these are two odd numbers with a small difference.
later on, package 1 uses 4 while package 2 uses 8. meanwhile, 32 and 28 are not far apart; both are two-digit, even figures.

using anchor items

from the interviews, 13% of respondents developed the test packages using anchor items. some studies have been done to obtain evidence for the functioning of anchor items. studies show that the more anchor items are used, the better the results of test equating (kartono, 2008, pp. 317–318). it means that anchor items function to equate tests. one study increased the anchor items of a physics test up to 40%; the results show that items at the low, mid, and high difficulty levels were still not equal (abdullah, mansyur, & rosdiyanah, 2016, pp. 217–218). this inequality may be due to the fact that physics tests involve items with figures in them. the use of different figures in items will have an impact on the item difficulty levels. even and odd figures also influence difficulty levels. mathematics tests involve many figures; thus, in using this method, developers must be accurate and careful to produce parallel tests.

developing items using the same specification table

based on the results of the interviews, 21% of respondents constructed a test specification table and developed several different test packages from it. this mode of instrument development can be done in several ways, such as using various figures in the test items, posing the same problem in different contexts, etc. according to the teachers’ accounts, this method of test development is effective in reducing fraud. the two test items presented in table 5 are developed from the same indicator, problem-solving in daily life using arithmetic series. the contexts and figures used in the items are different. in package 1, what is known is the first term and the amount of increase per year; while in package 2, what is known is the sequence from the first term to the third term. the figures used in the two items are also different.
the teacher needs to pay attention to these differences. the concern is that students may be able to complete package 1 but not package 2 because of the different contexts. this condition may make the test invalid, so that the objective of the evaluation is not achieved. to prevent this, it is suggested that teachers be informed about parallel testing and the ways to develop parallel tests.

table 5. items constructed from the same indicator
type | item
package 1 | the amount of sugar consumed by people in a village is 1,000 kg in 2013 and doubles each year. the total sugar consumption from 2013 to 2018 is … . a. 66,000 kg b. 65,000 kg c. 64,000 kg d. 63,000 kg e. 62,000 kg
package 2 | a scavenger collects plastic bottles. on the first day, he gets 2.5 kg, on the second day 3 kg, on the third day 3.5 kg, and so forth, following an arithmetic sequence. if the plastic bottles are sold to a collector at rp10,000.00/kg, in 15 days the scavenger earns … . a. rp800,000.00 b. rp900,000.00 c. rp1,000,000.00 d. rp1,200,000.00 e. rp1,500,000.00

discussions

the research findings show that 18% of the respondents stated that they developed test packages by randomizing the order of the item numbers, which they believed could produce parallel sets of items. this method has also been used in the entrance test at muhammadiyah university of bengkulu. the item numbers were randomized by computer software using the linear congruent method (lcm). this selection system ran effectively (gunawan & prabowo, 2017, pp. 144–151). the test consisted of 100 items scheduled for 90 minutes. one of the test items is numerical.
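the idea of randomizing item numbers with a linear congruential generator can be sketched as follows. this is only an illustration under stated assumptions, not the software described by gunawan and prabowo (2017): the multiplier, increment, and modulus are commonly published constants chosen here for convenience, the seed values are arbitrary, and the function names `lcg` and `shuffled_item_numbers` are hypothetical.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield pseudo-random integers from x_next = (a * x + c) mod m."""
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x


def shuffled_item_numbers(n_items, seed):
    """Return item numbers 1..n_items in an order driven by the generator.

    The order is produced by a Fisher-Yates shuffle, so every item number
    appears exactly once in each package.
    """
    order = list(range(1, n_items + 1))
    stream = lcg(seed)
    for i in range(n_items - 1, 0, -1):
        # use the higher bits, which are less regular than the low bits
        j = (next(stream) >> 16) % (i + 1)
        order[i], order[j] = order[j], order[i]
    return order


# one seed per package (arbitrary, assumed values)
for package, seed in enumerate([11, 23, 47], start=1):
    print(f"package {package}: {shuffled_item_numbers(10, seed)}")
```

because the generator is deterministic, the same seed always reproduces the same package, which makes it possible to rebuild the answer key for any package afterwards.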
this test item has the same characteristics as the numerical items tested in school mathematics, so the method of randomizing the item numbers is effective. one advantage of this method of developing parallel multiple-choice tests is that it can produce a large number of test packages. the number of test packages equals the number of possible orderings (permutations) of all the items in the test. a simple illustration of a test with three items can be seen in table 6.

table 6. randomization of test item numbers
package | item number
package 1 | 1, 2, 3
package 2 | 1, 3, 2
package 3 | 2, 3, 1
package 4 | 2, 1, 3
package 5 | 3, 2, 1
package 6 | 3, 1, 2

a test with three items can be developed into six test packages. the number of packages is the number of permutations of all the test items; so, if a test has n items, the number of packages that can be developed is n!. a test consisting of 40 items can be developed by randomization of the item numbers into 40! packages.

findings show that 11% of the respondents developed the packages by reordering the response options. in 2016, a study investigated the influence of distractor revision upon item validity and reliability. the study found that distractor revision did influence both (ali, carr, & ruit, 2016, pp. 6–9). some other studies reveal that the quality of an item is influenced by the quality of its distractors. another study found that the quality of distractors has an impact on the item’s difficulty level (tarrant & ware, 2010, pp. 539–543). the number of distractors, on the other hand, does not impact the item quality (royal & dorman, 2018, pp. 3–5). in conclusion, by maintaining the parallelism of the distractors, parallel instrument packages can be obtained. in the interviews with the teachers, it was found that they randomized the response options by using google doc, a computer application for on-line testing. in the process, the teacher input a test set through the application. google doc.
would automatically shuffle the response options of each item. when the students opened the application to do the test, they got items with different orders of the options. this application helped teachers provide test packages from one initial test set. the application can be used, of course, only with the backing of school facilities for on-line testing. one weakness, however, is that the application did not sequence figures from small to large or from large to small, which violates the rule for ordering numerical response options. the use of the google doc application must therefore consider the form of the options. it is best used for options that do not follow an ordered series, such as sizes of figures.

the method of constructing test packages from the same table of specification was claimed by 37% of the respondents. conditions and considerations must be taken into account when developing test packages using this method. however, not all the rules were followed. the teachers merely considered the contexts to get parallel levels of difficulty. as can be seen in package 1 and package 2, the options consist of one correct answer and three distractors. determining the correct answer within the options was hardly a problem. the problem lies, however, in providing distractors that function well. instrument development must also consider the parallel functioning of the distractors because they also contribute to the quality of the item. distractors are made to lead low-achieving students to select them so that the item can distinguish between low-achieving and high-achieving students. it should not happen that low-achieving students choose the correct answer while high-achieving students choose the wrong options.
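the ordering rule for response options described above (numerical options stay in ascending or descending order, other options may be shuffled freely) can be expressed as a small sketch. it is not a feature of the application used by the teachers; the helper names `is_number` and `arrange_options` are hypothetical, and the option values are taken from package 2 in table 4.

```python
import random


def is_number(text):
    """Return True when the option text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def arrange_options(options, rng, ascending=True):
    """Order numerical options by size; shuffle all other options."""
    if all(is_number(option) for option in options):
        return sorted(options, key=float, reverse=not ascending)
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(2019)  # fixed seed so the packages can be reproduced

numeric_options = ["21", "16", "-11", "-59"]
print(arrange_options(numeric_options, rng, ascending=True))   # package a
print(arrange_options(numeric_options, rng, ascending=False))  # package b

# non-numerical options (placeholder labels) can be shuffled freely
print(arrange_options(["p1", "p2", "p3", "p4"], rng))
```

under this rule an item with numerical options still yields only two packages, which agrees with the limit stated for options in the form of figures.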
in this case, distractors do not function well. table 7 presents some possibilities to help distractors function. based on table 7, the possible student errors can be used as a basis for selecting distractors effectively. the distractors in package 2 are 21, 16, and -59. for the item in package 1, if distractors are calculated in the same way as they are in package 2, the values 19, 14, and -53 are obtained. the item sample of package 1 in table 4 shows that the distractors are -20, -15, and -12. this shows that the distractors were not selected in a parallel way, so the parallelism of the items is doubtful. students’ inaccuracy in doing package 2 makes them choose the wrong options or distractors. students’ errors in doing package 1, if there are no good distractors, will induce them to try again to find the correct answer. this may produce unfairness among testees.

from the interview results, it is known that 13% of the respondents used anchor items to develop the packages. development of test packages using anchor items has been done for the nsse for primary, junior secondary, and senior secondary schools, in addition to the ne. for the school examination (nsse), the teachers were involved in developing the test items. some items are standardized by the government, and the others are developed by the teachers. this is anchor-based development. the anchor items function to equate the test packages with one another. it is expected that the test will be able to reveal students’ competencies across regions using tests that are different but equal.

based on the results of the interviews, 21% of respondents developed different test items from the same specification table. this method requires extra time when many packages are expected to be produced. besides, the characteristics of the items produced may not be the same, so the difficulty levels of the items in each package need to be tested.
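testing the difficulty levels of items in each package can be illustrated with the classical difficulty index, that is, the proportion of testees who answer an item correctly, checked against the 30% to 80% range cited from kehoe (1995a). the sketch below is only an assumption about how such a check might be done; the response data are invented for illustration and the function names are hypothetical.

```python
def difficulty_index(responses, key):
    """Proportion of testees who chose the keyed (correct) option."""
    return sum(1 for answer in responses if answer == key) / len(responses)


def within_range(p, low=0.30, high=0.80):
    """Check the proportion against the 30%-80% guideline."""
    return low <= p <= high


# invented responses of ten testees to the same item in two packages
package_1 = ["d", "d", "a", "b", "d", "c", "d", "d", "d", "b"]  # key: d
package_2 = ["c", "c", "c", "c", "c", "c", "c", "c", "c", "a"]  # key: c

p1 = difficulty_index(package_1, "d")
p2 = difficulty_index(package_2, "c")
print(p1, within_range(p1))
print(p2, within_range(p2))
```

a large gap between the two proportions for items written from the same indicator would suggest that the packages are not parallel in difficulty.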
in practice, developing different items from the same national-level specification table has never produced items with the same difficulty level, even though they were developed from the same table of specification (herkusumo, 2011). this must be considered when developing different test packages based on the same specification table.

table 7. possibilities of errors made by testees
type | possibility of errors in package 2 | possibility of errors in package 1
error 1 | option 21 is obtained from: 32 minutes ÷ 8 minutes = 4, so the temperature decreases 4 times. the decrease in temp. in 32 min: 4 × 4°c = 16°c. room temp. after 32 min: 16°c + 5°c = 21°c. | 28 minutes ÷ 4 minutes = 7, so the temperature decreases 7 times. the decrease in temp. in 28 min: 7 × 2°c = 14°c. room temp. after 28 min: 14°c + 5°c = 19°c.
error 2 | option 16 is obtained from: 32 minutes ÷ 8 minutes = 4, so the temperature decreases 4 times. the decrease in temp. in 32 min: 4 × 4°c = 16°c (testee stops at the temp. drop). | 28 minutes ÷ 4 minutes = 7, so the temperature decreases 7 times. the decrease in temp. in 28 min: 7 × 2°c = 14°c (testee stops at the temp. drop).
error 3 | option -59 is obtained from: 32 ÷ 4 = 8 (error in selecting a number to calculate temp.). the decrease in temp. in 32 min: 8 × 8°c = 64°c. 5°c − 64°c = −59°c. | 28 ÷ 2 = 14 (error in selecting a number to calculate temp.). the decrease in temp. in 28 min: 14 × 4°c = 56°c. 3°c − 56°c = −53°c.

conclusion and suggestions

conclusion

teachers can use various methods of developing mathematics test packages: randomizing the item numbers, reordering the response options, using the same context with different figures, using anchor items, and using the same table of specification.
these methods are applied based on the respondents’ logical thinking, supported by analyses suggesting that the test packages being developed are parallel. however, no theoretical bases have been used by the teachers in developing the tests. all the teachers used a specification table to develop tests, and most of them had validated the content and language.

suggestions

further research is needed to look at how the parallelism of the test packages can be developed with those five methods. such research will be useful for teachers to improve their theories and knowledge in developing parallel multiple-choice test items so that their evaluation of students is valid and reflects the students’ real competences.

references

abdullah, s., mansyur, m., & rosdiyanah, r. (2016). pengaruh jumlah butir anchor terhadap hasil penyetaraan tes berdasarkan teori respon butir. jurnal kependidikan: penelitian inovasi pembelajaran, 46(2), 207–218. https://doi.org/10.21831/jk.v46i2.10935
ali, s. h., carr, p. a., & ruit, k. g. (2016). validity and reliability of scores obtained on multiple-choice questions: why functioning distractors matter. journal of the scholarship of teaching and learning, 16(1), 1–14. https://doi.org/10.14434/josotl.v16i1.19106
gunawan, g., & prabowo, d. a. (2017). sistem ujian online seleksi penerimaan mahasiswa baru dengan pengacakan soal menggunakan linear congruent method (studi kasus di universitas muhammadiyah bengkulu). jurnal informatika upgris, 3(2), 143–151. https://doi.org/10.26877/jiu.v3i2.1872
herkusumo, a. p. (2011). penyetaraan (equating) ujian akhir sekolah berstandar nasional (uasbn) dengan teori tes klasik. jurnal pendidikan dan kebudayaan, 17(4), 455–471. https://doi.org/10.24832/jpnk.v17i4.41
kartianom, k., & mardapi, d. (2017). the utilization of junior high school mathematics national examination data: a conceptual error diagnosis. reid (research and evaluation in education), 3(2), 163–173.
https://doi.org/10.21831/reid.v3i2.18120
kartono, k. (2008). penyetaraan tes model campuran butir dikotomus dan politomus pada tes prestasi belajar. jurnal penelitian dan evaluasi pendidikan, 12(2), 302–320. https://doi.org/10.21831/pep.v12i2.1433
kehoe, j. (1995a). basic item analysis for multiple-choice tests. practical assessment, research & evaluation, 4(10), 1–3.
kehoe, j. (1995b). writing multiple-choice test items. eric/ae digest series edotm-95-3, 3, 1–6.
mardapi, d. (2008). teknik penyusunan instrumen tes dan nontes. yogyakarta: mitra cendekia.
rasyid, h., & mansur. (2008). penilaian hasil belajar. bandung: cv wacana prima.
reynolds, c. r., livingston, r. b., & willson, v. l. (2009). measurement and assessment in education (2nd ed.). upper saddle river, nj: pearson.
rosnawati, r., kartowagiran, b., & jailani, j. (2015). a formative assessment model of critical thinking in mathematics learning in junior high school. reid (research and evaluation in education), 1(2), 186–198. https://doi.org/10.21831/reid.v1i2.6472
royal, k., & dorman, d. (2018). comparing item performance on three- versus four-option multiple choice questions in a veterinary toxicology course. veterinary sciences, 5(2), 55. https://doi.org/10.3390/vetsci5020055
stiggins, r. j., & chappuis, j. (2012). an introduction to student-involved assessment for learning. boston, ma: pearson.
tarrant, m., & ware, j. (2010). a comparison of the psychometric properties of three and four-option multiple-choice questions in nursing assessments. nurse education today, 30(6), 539–543. https://doi.org/10.1016/j.nedt.2009.11.002
torres, c., lopes, a. p., babo, l., & azevedo, j. (2011). improving multiple-choice questions. us-china education review, b(1), 1–11.
widdiharto, r., kartowagiran, b., & sugiman, s. (2017).
a construct of the instrument for measuring junior high school mathematics teacher’s self-efficacy. reid (research and evaluation in education), 3(1), 64–76. https://doi.org/10.21831/reid.v3i1.13559
wilkie, j. e. b., & bodenhausen, g. v. (2015). the numerology of gender: gendered perceptions of even and odd numbers. frontiers in psychology, 6, 810. https://doi.org/10.3389/fpsyg.2015.00810

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(1), 2017, 1-11
available online at: http://journal.uny.ac.id/index.php/reid

research article

mapping elementary school students’ creativity in science process skills of life aspects viewed from their divergent thinking patterns

* 1 bambang subali; 2 paidi; 3 siti mariyam
*faculty of mathematics and natural sciences, universitas negeri yogyakarta
jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*email: b_subali@yahoo.co.id
submitted: 10 march 2017 | revised: 11 july 2017 | accepted: 11 july 2017

abstract

the purpose of this study was to map elementary school students’ creativity in science process skills (sps) of life aspects in science subjects viewed from their divergent thinking patterns using written tests whose items were fitted with the partial credit model (pcm). the measurement used a test validated using the irt approach and published in jee journal in 2015. the trials employed four sets of tests, each comprising 20 items complemented with anchor items, which were fitted with reference to the pcm. the measurements were performed on a larger scale in 14 regional technical implementation units (rtiu) in five regencies/cities of yogyakarta special province on students of grades iv, v, and vi. the findings show that the higher the grade level, the higher the testees’ scores.
there were some testees who did not have divergent thinking ability and obtained a score of 0. the divergent thinking ability of the students was not related to the regency/city where an rtiu was located.

keywords: creativity, divergent thinking, science process skills, partial credit models

how to cite item: subali, b., paidi, p., & mariyam, s. (2017). mapping elementary school students' creativity in science process skills of life aspects viewed from their divergent thinking patterns. reid (research and evaluation in education), 3(1), 1-11. doi:http://dx.doi.org/10.21831/reid.v3i1.13294

introduction

the core of teaching natural sciences is to teach the students to investigate natural phenomena in order to arrive at a scientific product by experiencing a scientific process with reference to a scientific attitude (carin & sund, 1989). a scientific process involves aspects of science process skills. a scientific process arranged in a particular order is called the scientific method (towle, 1989). teaching that enables learners to master every aspect of the science process skills is greatly needed so that they can master the scientific process. the science process skills should be taught to students separately at the beginning. after mastering the individual aspects of science process skills, they are taught the science process skills as a unit of the scientific method.

science process skills

according to rezba et al. (2007), science process skills can be divided into two aspects, namely basic skills and integrative skills. basic skills include observing, communicating, classifying, measuring metrically, inferring, and predicting. meanwhile, integrative skills consist of identifying variables, constructing a table of data, constructing a graph, describing relationships between variables, acquiring and processing data, analyzing investigations, constructing hypotheses, defining variables operationally, designing experiments, and experimenting.
unlike rezba et al., bryce, mccall, macgregor, robertson, and weston (1990) divide science process skills into three aspects, namely basic skills, process skills, and investigative skills. basic skills comprise observational skills, recording skills, measurement skills, manipulative skills, procedural skills, and following-instruction skills. process skills are skills of inference and selection of procedures. furthermore, investigative skills include the skills to plan and carry out a practical investigation. according to the american association for the advancement of science in 1965 (chiappetta, 1997), science process skills are categorized into two, namely basic skills and integrated skills. basic skills are skills of observing, classifying, using space/time relations, using numbers, measuring, inferring, and predicting. furthermore, integrated skills include such skills as defining, formulating models, controlling variables, interpreting data, hypothesizing, and experimenting. wenning (2005) says that science process skills can be classified into rudimentary skills, basic skills, intermediate skills, integrated skills, and advanced skills. rudimentary skills comprise observing, collecting and recording data; drawing conclusions; communicating and classifying results; measuring metrically; estimating; decision making 1; explaining; and predicting. basic skills are skills of identifying variables, constructing a table of data, constructing a graph, describing relationships between variables, acquiring and processing data, analyzing investigations, defining variables operationally, designing investigations, experimenting, hypothesizing, decision making 2, developing models, and controlling variables.
integrated skills include skills of identifying problems to investigate, designing and conducting scientific investigations, using technology and mathematics during investigations, generating principles through the process of induction, and communicating and defending a scientific argument. advanced skills consist of solving complex real-world problems, synthesizing complex hypothetical explanations, establishing empirical laws on the basis of evidence and logic, analyzing and evaluating scientific arguments, constructing logical proofs, and generating predictions through the process of deduction. wenning (2010) revised the formulation of science process skills by adding one new category, namely culminating skills. the revision also elaborates each existing category. the formulation of science process skills according to wenning (2010) consists of rudimentary skills, basic skills, intermediate skills, integrated skills, culminating skills, and advanced skills. rudimentary skills are skills of observing, formulating concepts, estimating, drawing conclusions, communicating results, and classifying results. moreover, basic skills include predicting, explaining, estimating, acquiring and processing data, formulating and revising scientific explanations using logic and evidence, and recognizing and analyzing alternative explanations and models. meanwhile, intermediate skills comprise measuring, collecting and recording data, constructing a table of data, designing and conducting scientific investigations, using technology and math during investigations, and describing relationships. next, integrated skills include measuring metrically, establishing empirical laws on the basis of evidence and logic, designing and conducting scientific investigations, and using technology and math during investigations.
furthermore, culminating skills comprise collecting, assessing, and interpreting data from a variety of sources, constructing logical arguments based on scientific evidence, making and defending evidence-based decisions and judgments, clarifying values in relation to natural and civil rights, and practicing interpersonal skills. advanced skills are skills of synthesizing complex hypothetical explanations, analyzing and evaluating scientific arguments, generating predictions through the process of deduction, revising hypotheses and predictions in light of new evidence, and solving complex real-world problems.

creativity and divergent thinking

solving problems to find new products through the scientific method is a process of inquiry. according to mayer (1980), all science is inquiry. biology is one kind of science. biologists try to answer questions about living things. finding new products is creative work. creative thinking belongs to the high cognitive level in bloom's taxonomy referring to anderson, krathwohl, and bloom (2001) and dettmer (2005). this suggests that creativity can be taught to elementary school students. meanwhile, miller (2008) states that something that is not duplicated/imitated is categorized as creative. in addition, rule, schneider, tallakson, and highnam (2012), quoting several sources, state that elementary and middle school students who are high-achieving in science and who exhibit creativity are often not challenged or given the opportunity to fully utilize their abilities in regular classrooms. many gifted students drop out because school is boring, repetitious, and lacks relevance to real life. they expect more exciting and challenging learning processes.
unfortunately, many classroom teachers lack sufficient background knowledge to design stimulating, advanced science projects for these students; some avoid science altogether. in csikszentmihalyi’s model of creativity (peppler & solomou, 2011), individuals build on culturally valued practices and design to produce new variations of the domain, which, if deemed valuable by the community (i.e. the field), become part of what constitutes the evolving domain. each component of the system continues to influence the others over time. the rethinking of design for knowledge sharing is an important part of creating new work processes and has to evolve hand in hand with space planning (mitchell, inouye, & blumenthal, 2003). hadzigeorgiou, fokialis, and kabouropoulou (2012) cited the opinion of barrow (2010) that inquiry in science will be able to develop students’ creativity if there is an imaginative and divergent thinking process.

measuring creativity and divergent thinking skills

students’ mastery of creativity should be measured. according to kelly (2004), the existing research on creativity aims at measuring divergent thinking as proposed by torrance and creative personality as developed by gough. there is only a little research that measures creativity as a multidimensional phenomenon using self-report scales which are valid and easy to administer. the main problem in measuring creativity is ensuring that what is measured is really creativity and is not affected by the measurement of intelligence (cramond, 1994). many studies regarding the strategies to measure creative thinking ability are compiled by kind and kind (2007). a detailed explanation of creativity tests, including a test to measure the divergent thinking process, is presented by cropley (2000). creativity can be measured in many ways and with respect to many aspects.
for instance, one of the strategies to measure the ability of divergent thinking can be classified based on the content and the products as reported by meeker (1969). olivant (2009) says that according to guilford (1950), creativity could and should be studied in non-eminent, ‘everyday’ people using psychometrics such as divergent thinking tasks (or paper and pencil tasks) to measure creative thinking. torrance's tests of creative thinking were created by torrance (1979), and they are probably the best-known and most widely used psychometric instruments for creativity. sternberg and lubart (1999) in torrance (1979) state that many researchers viewed the tests as trivial and inadequate measures, while others charged that the tests, while possibly measuring aspects of creativity, failed to capture their essence. the context dependency of creativity among students has been elaborated by diakidoy & constantinou in 2000-2001 (kind & kind, 2007) by getting as many responses as possible from three open-ended assignment forms, scored based on the divergent thinking skills of guilford, namely: (a) fluency, i.e. the considerations in a given solution, (b) flexibility, i.e. different types of solutions. the science process skill measurement on divergent thinking aspects in the biology subject of senior high school students in diy and central java was performed by subali (2009). in this case, the standardization of the instrument utilized item response theory, or the irt approach. this approach creates a calibration that puts learners’ ability and item difficulty on the same scale. therefore, they can be compared. the results show that the average score of creativity ability is much lower than the item difficulty index of the items measuring creativity. subali (2011) also measured high school students’ creativity in the science process skills in biology subjects.
the results are also relatively low. subali and mariyam (2013) conducted research on the development, by elementary school teachers, of creativity in science process skills related to the aspects of life in science subjects. most of the teachers stated that creativity had been taught to students in science subjects. however, the students’ mastery of creativity has not been studied. therefore, elementary school students’ mastery of creativity on life aspects viewed from divergent thinking skills needs to be investigated. this research aimed at measuring the students’ creativity in science process skills of life aspects viewed from divergent thinking patterns consisting of two aspects, namely basic and process skills. the basic skill aspects have been published in the journal of asia-pacific forum on science learning and teaching, volume 17, issue 1, article 2 (jun., 2016) (subali, paidi, & mariyam, 2016). the research aimed to map the creativity in science process skills (sps) of life aspects of elementary school students in science subjects viewed from the divergent thinking pattern using written tests whose items are fitted based on the partial credit model (pcm).

method

the research was conducted for three years and consisted of three stages. the first stage is divided into two phases. in the first phase, the blueprint of science process skills (sps) was developed. the blueprint of sps was formulated based on the sps blueprint produced in the research conducted by subali (2009), which was used for measuring the divergent thinking ability of sps in biology subjects for senior high school students. in addition, the blueprint was developed with reference to several sources such as rezba et al. (2007), bryce et al. (1990), and cox (1958). the sps aspects include (a) basic skills and (b) process skills. investigative skills were not included because they are considered difficult to teach to students of grades iv and v.
in the second phase, based on the sps blueprint, creativity tests for sps consisting of 63 items were developed. all items were judged by experts, namely three lecturers of the biology education department, all holding doctoral degrees. using the divergent scoring model of diakidoy and constantinou (kind & kind, 2007), the items were tried out on 637 students of grades v and vi. the instruments in this research were validated using the irt approach in 2015, and the report was published in the journal of elementary education (jee), vol. 25, no. 1, pp. 91-105, by subali and mariyam (2015). under the irt approach, an item is declared able to measure the ability if it fits the model, in this case the 1-pl (rasch) model. if all items fit the model, the instrument can also be declared valid (wright & masters, 1982). the fit of the items to the rasch model was tested using the quest program (adams & khoo, 1993).
in the third phase, the instrument was administered on a large scale in elementary schools of the regional technical implementation units (rtius) in diy (yogyakarta special province). the sample was established using the purposive sampling technique by considering the characteristics of the rtius and school achievement in the national examination. the sample testees were taken from 10 rtius in five regencies/cities in diy. two rtius from each regency/city were selected purposively. one of the rtius was located in the national capital and another was located far from the national capital, except for the rtius in the city of yogyakarta because both were in the city center. moreover, two private elementary schools and four public elementary schools from each rtiu were selected. the test participants included students of grades iv, v, and vi. there were 2,563 testees of grade iv, 2,685 testees of grade v, and 2,619 testees of grade vi.
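as a rough illustration of how the rasch family places person ability and item difficulty on one logit scale, the sketch below computes partial credit model (pcm) category probabilities for a single item. it is only a minimal sketch and not the quest implementation: the function name `pcm_probabilities` and the step difficulties in the example are hypothetical values chosen for illustration, not estimates from this study.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Return P(score = 0..m) for one item under the partial credit model.

    theta: person ability in logits.
    step_difficulties: the m step difficulties (delta_1..delta_m) in logits.
    """
    # Cumulative sums of (theta - delta_k); the empty sum for score 0 is 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item with three score categories (0, 1, 2).
probs = pcm_probabilities(theta=0.0, step_difficulties=[-0.5, 1.0])
print([round(p, 3) for p in probs])
```

with a single step difficulty the function reduces to the dichotomous rasch (1-pl) model, so a person whose ability equals the item difficulty has a probability of 0.5 of scoring 1.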
reid (research and evaluation in education) mapping elementary school students’ creativity in science process skills... bambang subali, paidi, & siti mariyam
findings and discussion
findings
after the instrument was administered to elementary school students of grades iv, v, and vi, the findings of the research are presented as follows.
table 1 shows that there is a reasonable increase in the scores achieved by elementary school students from grade iv to grade vi. this means that the higher the grade level, the greater the creativity score in science process skills on life aspects mastered by the students. compared with the total possible score of 40 and a reference average of 20 (half of the total), the average score of 18.5 achieved by grade vi, with a minimum score of 0 and a maximum score of 38, is still relatively low in terms of competence mastery.
table 2 shows that the highest creativity score of the divergent thinking model on science process skills aspects among the fourth grade students in the five regencies/cities is achieved by sleman regency. the score is higher than that of yogyakarta city, the provincial capital. meanwhile, the lowest score is achieved by kulonprogo regency. this indicates that the test results are not related to the characteristics of city or non-city regions.
table 3 shows that the highest score of creativity in science process skills on life aspects among the fifth grade elementary school students in the five regencies/cities is achieved by the city of yogyakarta, and the lowest score by bantul regency. this situation is different from that in the fourth grade.
table 1. creativity scores based on divergent thinking model of science process skill on life aspects in natural sciences subjects based on grades in diy province
                       score
grade       n      ȳ      s     min   max   total
grade iv    2563   12.8   6.7   0     37    40
grade v     2685   15.3   6.5   0     36    40
grade vi    2619   18.5   6.4   0     38    40
table 2.
creativity scores based on divergent thinking model of science process skill on life aspects in natural sciences subjects of the fourth grade students based on types of locations in diy province
                          score
grade iv       n     ȳ      s     min   max   total
yogyakarta     553   12.6   7.3   0     36    40
bantul         593   12.6   6.7   0     37    40
sleman         605   14.0   6.5   0     33    40
kulonprogo     380   11.4   6.0   0     29    40
gunungkidul    432   12.9   6.7   0     31    40
table 3. creativity scores based on divergent thinking model of science process skill on life aspects in natural sciences subjects of the fifth grade students based on types of locations in diy province
                          score
grade v        n     ȳ      s     min   max   total
yogyakarta     534   16.9   6.6   0     36    40
bantul         632   13.5   6.3   0     34    40
sleman         688   16.4   6.3   1     35    40
kulonprogo     361   14.2   6.0   0     32    40
gunungkidul    470   15.2   6.5   0     36    40
table 4. creativity scores based on divergent thinking model of science process skill on life aspects in natural sciences subjects of the sixth grade students based on types of locations in diy province
                          score
grade vi       n     ȳ       s      min   max   total
yogyakarta     571   18.72   5.90   0     36    40
bantul         603   17.58   6.19   0     33    40
sleman         620   19.49   7.02   0     38    40
kulonprogo     335   18.15   6.04   2     34    40
gunungkidul    490   18.10   6.22   0     34    40
table 5.
the mean scores and standard deviations of science process skills creativity on life aspects in natural sciences subject based on the types of rtius of the fourth grade students in diy province
                                            score
regency/city   rtiu (grade iv)    n     ȳ      s      min   max   total
yogyakarta     east yogyakarta    134   23.2   17.5   0     88    120
yogyakarta     west yogyakarta    419   34.5   18.8   0     88    120
bantul         bantul selatan     140   37.0   20.1   2     100   120
bantul         banguntapan        240   33.4   18.3   0     84    120
bantul         piyungan           213   25.6   14.9   0     65    120
sleman         sleman             182   37.7   18.3   5     84    120
sleman         kalasan            256   34.8   16.7   0     86    120
sleman         ngemplak           167   32.7   16.7   2     88    120
kulonprogo     pengasih           105   32.2   16.3   4     73    120
kulonprogo     kalibawang         127   26.5   16.6   0     71    120
kulonprogo     sentolo            148   27.0   14.1   0     70    120
gunungkidul    wonosari           196   37.6   17.1   3     73    120
gunungkidul    panggang           130   24.0   16.8   0     82    120
gunungkidul    purwosari          106   32.9   15.5   0     71    120
table 4 shows that the highest score of science process skills creativity of life aspects of the sixth grade students in five regencies/cities is achieved by sleman regency, and the lowest is achieved by kulon progo regency.
the following are the results of the creativity measurement of science process skills on life aspects in the rtius of each regency/city, from grade iv to grade vi. table 5 presents the results of the measurements for the fourth grade.
table 5 shows that the highest score of science process skills creativity of life aspects among the fourth grade students in the five regencies/cities is achieved by sleman rtiu in sleman regency. it is followed by wonosari rtiu in gunungkidul regency and south bantul (bantul selatan) rtiu in bantul regency. among the lowest scores, rank xii is held by piyungan rtiu in bantul regency, rank xiii by panggang rtiu in gunungkidul regency, and rank xiv by east yogyakarta rtiu. this may imply that the mastery of science process skill creativity of life aspects among the fourth grade students is not dominated by students of the elementary schools located in the capital of the province. table 6 presents the results of measurements of the grade v.
it shows that the highest score of science process skills creativity of life aspects among the fifth graders of elementary schools in the rtius of the five regencies/cities in yogyakarta is achieved by the north yogyakarta. sleman rtiu in sleman regency comes second, and the third rank is achieved by ngemplak rtiu in sleman regency. among the lowest scores, rank xii is held by pengasih rtiu in kulon progo regency, rank xiii by banguntapan rtiu in bantul regency, and rank xiv by piyungan rtiu in bantul regency.
table 6. the mean scores and standard deviations of science process skills creativity of life aspects in natural sciences subject based on the types of rtiu of grade v students in diy
                                            score
regency/city   rtiu (grade v)     n     ȳ      s      min   max   total
yogyakarta     east yogyakarta    122   35.8   18.4   0     88    120
yogyakarta     west yogyakarta    412   45.5   17.4   0     98    120
bantul         bantul selatan     135   40.3   18.0   4     89    120
bantul         banguntapan        250   33.1   17.3   0     86    120
bantul         piyungan           247   31.3   15.4   1     72    120
sleman         sleman             180   42.9   17.3   7     87    120
sleman         kalasan            297   40.4   18.0   3     93    120
sleman         ngemplak           211   41.6   17.0   6     94    120
kulonprogo     pengasih           111   34.2   14.6   0     69    120
kulonprogo     kalibawang         117   35.4   19.9   0     86    120
kulonprogo     sentolo            133   36.0   12.9   10    81    120
gunungkidul    wonosari           227   40.9   16.8   3     89    120
gunungkidul    panggang           131   34.5   16.3   0     78    120
gunungkidul    purwosari          112   37.3   20.3   0     99    120
table 7.
the mean scores and standard deviations of science process skills creativity of life aspects in natural sciences subject based on the types of rtiu of grade vi students in diy
                                            score
regency/city   rtiu (grade vi)    n     ȳ      s      min   max   total
yogyakarta     east yogyakarta    149   43.1   15.8   8     84    120
yogyakarta     west yogyakarta    422   48.7   16.2   0     98    120
bantul         bantul selatan     127   47.2   17.8   10    93    120
bantul         banguntapan        256   43.6   18.7   0     86    120
bantul         piyungan           220   43.7   14.4   9     78    120
sleman         sleman             162   53.6   21.9   10    101   120
sleman         kalasan            277   47.3   19.4   3     108   120
sleman         ngemplak           181   49.4   18.2   0     103   120
kulonprogo     pengasih           102   40.5   15.4   11    75    120
kulonprogo     kalibawang         102   52.5   17.1   16    92    120
kulonprogo     sentolo            131   44.2   15.4   4     84    120
gunungkidul    wonosari           207   49.4   16.2   4     88    120
gunungkidul    panggang           138   43.6   18.2   0     90    120
gunungkidul    purwosari          145   41.9   15.9   3     85    120
the results of the measurements for grade vi are presented in table 7, which shows that the highest score of science process skills creativity of life aspects among the sixth grade elementary school students in the rtius of the five regencies/cities in yogyakarta is achieved by sleman rtiu in sleman regency, rank ii by kalibawang rtiu in kulon progo regency, and rank iii by wonosari rtiu in gunungkidul regency. among the lowest scores, rank xii is held by east yogyakarta rtiu, rank xiii by purwosari rtiu in gunungkidul regency, and rank xiv by pengasih rtiu in kulon progo regency.
discussion
the results of the research show that the average sps creativity on life aspects of the elementary school students of grades iv, v, and vi in the 14 rtius is low. in contrast, in the research conducted by subali and mariyam (2013), most teachers said that they had taught creativity to the students. this is probably because the teachers do not know well how to develop students’ creativity. according to dettmer (2005, pp. 70–78), creativity learning should ideally use an applied learning model and an ideational learning model.
in addition, teachers could encourage the students to be creative by giving examples of how to (a) substitute/replace, (b) combine, (c) adapt, (d) modify or add, (e) put something to another use, (f) eliminate or reduce, and (g) reconstruct or reverse (michalko, 2000). another reason is that teaching focuses on concept understanding, so creativity is not the main teaching target. yet, according to burke-adams (2007), it is very important to consider the learning needs of talented students in integrating creativity into a standard-based system. teachers are not aware that the goal of creativity development in natural sciences teaching is to direct the students to perform open discovery or inquiry, or to do the relevant tasks. meanwhile, teachers are supposed to develop students’ thinking so that they can think logically in a creative way (kind & kind, 2007, pp. 1–37). teachers concentrate more on helping students understand concepts, which automatically develops their convergent thinking skills, and they rarely give questions with divergent answers (croom & stair, 2005). the teachers’ concern that creativity should not be taught to students with low academic potential may be unwarranted. the research findings of ferrando, prieto, ferrandiz, and sanchez (2005) show that smart students are not always creative. moreover, cromie (2003) says that not all studies find a correlation between students’ iq and creativity. in addition, rawat, qazi, and hamid (2012) state that the development of creativity is closely linked to the development of skills to form a corresponding consideration in different situations. with regard to this, teachers should develop students’ creativity as early as possible. the findings indicate that there are two possible reasons why elementary schools located in a big city do not always show the highest scores. the first possibility is that the children do not have high potential.
thus, although the teachers develop creativity, the result may not be optimal. the second possibility is that children who are judged by their parents to have high potential are sent to elementary schools that society regards as good, for example the ungaran, serayu, and muhammadiyah sapen elementary schools. however, the score of the rtiu in north yogyakarta is not always the highest. elementary schools in the area of sleman rtiu are partially regarded by local people as good schools and rank at the top. however, elementary schools in kalibawang rtiu rank second out of the 14 rtius for grade vi, even though kalibawang rtiu is located in a remote area. therefore, it seems that the teacher's role in developing the creativity of learners may not be optimal. moreover, sixth grade students of elementary schools in cities probably focus more on achieving a high national examination score in order to be accepted at junior high schools that the community regards as good based on national examination achievement.
conclusion and suggestions
based on the findings of the research, it can be concluded that the students’ creativity in science process skills of life aspects, as measured with the instrument produced and tested in 2015, is relatively low. recommendations are necessary to improve the ability of teachers to teach science process skill creativity of life aspects to students. the findings indicate that elementary schools in remote rtius may achieve high scores, probably because elementary school teachers in the city focus more on preparing students to reach a high national examination (un) score. this is worth exploring further using an ex post facto retrospective approach.
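the grade-level averages in table 1 can be cross-checked against the regency-level rows of tables 2 to 4, since each grade mean should equal the n-weighted mean of its five regency/city means. the sketch below does this with the published figures only; the helper name `weighted_mean` and the dictionary names are hypothetical, and agreement is expected only up to the rounding of the reported values.

```python
def weighted_mean(rows):
    """Return the n-weighted mean of (n, mean) pairs."""
    total_n = sum(n for n, _ in rows)
    return sum(n * mean for n, mean in rows) / total_n


# (n, mean) per regency/city, copied from Tables 2-4 in the order
# Yogyakarta, Bantul, Sleman, Kulonprogo, Gunungkidul.
regency_rows = {
    "iv": [(553, 12.6), (593, 12.6), (605, 14.0), (380, 11.4), (432, 12.9)],
    "v": [(534, 16.9), (632, 13.5), (688, 16.4), (361, 14.2), (470, 15.2)],
    "vi": [(571, 18.72), (603, 17.58), (620, 19.49), (335, 18.15), (490, 18.10)],
}

# (n, mean) per grade, copied from Table 1.
table_1 = {"iv": (2563, 12.8), "v": (2685, 15.3), "vi": (2619, 18.5)}

for grade, rows in regency_rows.items():
    n_total = sum(n for n, _ in rows)
    pooled = weighted_mean(rows)
    n_reported, mean_reported = table_1[grade]
    # Print whether the sample sizes match, then the pooled and reported means.
    print(grade, n_total == n_reported, round(pooled, 2), mean_reported)
```

for all three grades the regency sample sizes add up to the totals in table 1, and the pooled means agree with the reported averages to within rounding.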
acknowledgement
the deepest gratitude is addressed to the directorate of research and community service of the ministry of research, technology, and higher education, which sponsored this research so that it could be carried out.
references
adams, r. j., & khoo, s.-t. (1993). acer quest version 2.1. camberwell, victoria: australian council for educational research.
anderson, l. w., krathwohl, d. r., & bloom, b. s. (eds.). (2001). a taxonomy for learning, teaching, and assessing: a revision of bloom’s taxonomy of educational objectives. new york, ny: longman. https://doi.org/10.1207/s15430421tip4104_2
bryce, t. g. k., mccall, j., macgregor, j., robertson, i. j., & weston, r. a. j. (1990). techniques for assessing process skills in practical science: teacher’s guide. oxford: heinemann educational books.
burke-adams, a. (2007). the benefits of equalizing standards and creativity: discovering a balance in instruction. gifted child today, 30(1), 58–63.
carin, a. a., & sund, r. b. (1989). teaching science through discovery. columbus, oh: merrill publishing company. retrieved from https://books.google.co.id/books?id=hoplaaaayaaj&source=gbs_book_other_versions
chiappetta, e. l. (1997). inquiry-based science: strategies and techniques for encouraging inquiry in the classroom. science teacher, 64(10), 22–26.
cox, d. r. (1958). planning of experiments. new york, ny: john wiley & sons.
cramond, b. (1994). we can trust creativity tests. educational leadership, 52(2), 70. retrieved from http://ezproxy.lib.ucalgary.ca:2048/login?url=http://search.ebscohost.com/login.aspx?direct=true&db=ehh&an=9411032030&site=ehost-live
cromie, w. j. (2003). creativity tied to mental illness: irrelevance can make you mad. harvard gazette. retrieved from https://news.harvard.edu/gazette/story/2003/10/creativity-tied-to-mentalillness/
croom, b., & stair, k. (2005). getting from q to a: effective questioning for effective learning. the agricultural education magazine, 78(1), 12–14.
retrieved from http://search.proquest.com/docview/224994858?accountid=8579
cropley, a. j. (2000). defining and measuring creativity: are creativity tests worth using? roeper review, 23(2), 72–79. https://doi.org/10.1080/02783190009554069
dettmer, p. (2005). new blooms in established fields: four domains of learning and doing. roeper review, 28(2), 70–78. https://doi.org/10.1080/02783190609554341
ferrando, m., prieto, m. d., ferrandiz, c., & sanchez, c. (2005). intelligence and creativity. electronic journal of research in educational psychology, 3(3), 21–50. retrieved from http://www.investigacion-psicopedagogica.org/revista/new/english/contadorarticulo.php?101
hadzigeorgiou, y., fokialis, p., & kabouropoulou, m. (2012). thinking about creativity in science education. creative education, 3(5), 603–611. https://doi.org/10.4236/ce.2012.35089
kelly, k. e. (2004). a brief measure of creativity among college students. college student journal, 38(4), 594–596.
kind, p. m., & kind, v. (2007). creativity in science education: perspectives and challenges for developing school science. studies in science education, 43(1), 1–37. http://dx.doi.org/10.1080/03057260708560225
mayer, w. v. (ed.). (1980). biological science: an inquiry into life (4th ed.). denver, co: biological sciences curriculum study.
meeker, m. n. (1969). the structure of intellect: its interpretation and uses. columbus, oh: merrill publishing company.
michalko, m. (2000). four steps toward creative thinking. the futurist, 18–21. retrieved from https://www.questia.com/read/1g162026052/four-steps-toward-creativethinking
miller, p. w. (2008). measurement and teaching. munster, in: patrick w. miller & associates.
mitchell, w. j., inouye, a. s., & blumenthal, m. s. (eds.). (2003). beyond productivity: information, technology, innovation, and creativity (committee).
washington, dc: the national academies press. retrieved from https://www.nap.edu/read/10671/chapter/1#ii
olivant, k. f. (2009). an interview study of teachers’ perceptions of the role of creativity in a high-stakes testing environment. fresno, ca: california state university.
peppler, k. a., & solomou, m. (2011). building creativity: collaborative learning and creativity in social media environments. on the horizon, 19(1), 13–23. https://doi.org/10.1108/10748121111107672
rawat, k. j., qazi, w., & hamid, s. (2012). creativity and education. academic research international, 2(2), 264–275.
rezba, r. j., sparague, c. s., fiel, r. l., funk, h. j., okey, j. r., & jaus, h. h. (2007). learning and assessing science process skills (3rd ed.). dubuque, ia: kendall/hunt.
rule, a. c., schneider, j. s., tallakson, d. a., & highnam, d. (2012). creativity and thinking skills integrated into a science enrichment unit on flooding. creative education, 3(8), 1371–1379. https://doi.org/10.4236/ce.2012.38200
subali, b. (2009). pengukuran keterampilan proses sains pola divergen dalam mata pelajaran biologi sma di provinsi diy dan jawa tengah. universitas negeri yogyakarta. retrieved from http://eprints.uny.ac.id/4541/
subali, b. (2011). pengukuran kreativitas keterampilan proses sains dalam konteks assessment for learning. cakrawala pendidikan, 30(1), 130–144. https://doi.org/10.21831/cp.v1i1.4196
subali, b., & mariyam, s. (2013). pengembangan kreativitas keterampilan proses sains dalam aspek kehidupan organisme pada mata pelajaran ipa sd. cakrawala pendidikan, 32(3), 365–381. https://doi.org/10.21831/cp.v3i3.1625
subali, b., & mariyam, s. (2015). measuring the indonesian elementary schools student’s creativity in science processing skills of life aspects on natural sciences subject: in yogyakarta special province (diy). journal of elementary education, 25(1), 91–105.
retrieved from https://www.google.co.id/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahukewjk0tv7qnhwahujpo8khblecmsqfggnmaa&url=http%3a%2f%2fpu.edu.pk%2fimages%2fjournal%2fjee%2fpdf-files%2f6_v25_no1_15.pdf&usg=aovvaw0zfhys1ed_mqaoeflsrpb9
subali, b., paidi, p., & mariyam, s. (2016). the divergent thinking of basic skills of sciences process skills of life aspects on natural sciences subject in indonesian elementary school students. asia-pacific forum on science learning and teaching, 17(1). retrieved from http://proxy.library.vcu.edu/login?url=http://search.proquest.com/docview/1871581264?accountid=14780%5cnhttp://vcualma-primo.hosted.exlibrisgroup.com/openurl/vcu/vcu_services_page?url_ver=z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&genre=article
torrance, e. p. (1979). a three-stage model for teaching for creative thinking. in a. e. lawson (ed.), the psychology of teaching for thinking and creativity (pp. 226–253). columbus, oh: association for education of teachers of science, ohio state university.
towle, a. (1989). modern biology. austin, tx: holt, rinehart and winston.
wenning, c. j. (2005). levels of inquiry: hierarchies of pedagogical practices and inquiry processes. journal of physics teacher education online, 2(3), 3–12. retrieved from http://scholar.google.com/scholar?hl=en&btng=search&q=intitle:levels+of+inquiry:+hierarchies+of+pedagogical+practices+and+inquiry+processes#0
wenning, c. j. (2010). level of inquiry: using inquiry spectrum learning sequences to teach science. journal of physics teacher education online, 5(3), 11–20. retrieved from www2.phy.ilstu.edu/pte/publications/learning_sequences.pdf
wright, b. d., & masters, g. n. (1982). rating scale analysis. chicago, il: mesa press.
reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(2), 2017, 124-132
available online at: http://journal.uny.ac.id/index.php/reid
research article
the implementation of population education in senior high school
* 1 claver nzobonimpa; 2 zamroni
*department of english language and literature, faculty of languages and social sciences, université du burundi (national university of burundi), unesco avenue no. 2, p.o. box 1550 bujumbura, burundi
*email: nzobonimpacl@yahoo.fr
submitted: 14 july 2017 | revised: 27 december 2017 | accepted: 06 february 2018
abstract
this research aimed to evaluate the implementation of population education in senior high school in terms of (1) learning process, (2) learning materials, (3) evaluation process, (4) course outcome, (5) teachers’ role, (6) perception of population education, and (7) factors supporting and inhibiting population education. the research subjects were one teachers’ supervisor, three teachers, and 65 students. the data were collected through questionnaires, interviews, and documentation and analyzed quantitatively using descriptive statistics. the qualitative data collected through interviews were used for deeper explanation. the research findings were: (1) the teaching process was not quite appropriate, (2) materials for population education were available and efficient, (3) the evaluation process was not appropriate, (4) the students were satisfied with the teachers’ role, (5) the students’ perception of population education was very positive, and (6) the constraints in population education included (a) limitation in time, (b) too many extracurricular activities, (c) rapid change of data, and (d) the validity of materials.
keywords: population education, implementation, learning process, integration
how to cite item: nzobonimpa, c., & zamroni, z. (2017). the implementation of population education in senior high school.
reid (research and evaluation in education), 3(2), 124-132. doi:http://dx.doi.org/10.21831/reid.v3i2.10024
introduction
education is very important for human beings. moore (2015, p. 1) says that: ‘changes in society are often in more demands being placed on our education system’. further, as stated in law no. 20 of 2003, the indonesian national educational system ensures equal opportunity, improvement of quality, relevance, and efficiency in education to meet the various challenges of changes in local, national, and global life (unesco, 2015, p. 1). syamsudin, budiyono, and sutrisno (2016, p. 26) state that the goal of education in indonesia is to develop learners’ potentials so that they become indonesian individuals with faith and fear of god, noble morals, good health, great knowledge, high competency, creativity, and independence, and become individuals who are democratic and responsible. in order to reach its education objectives, the indonesian government elaborates a curriculum which contains the objectives and the strategy to achieve the education goals. in line with this opinion, the indonesian national education law of 2003 defines curriculum as “...a set of plan and regulations about the aims, content, materials of lessons and the methods employed as guidelines for implementation of learning activities to achieve given education objectives” (dharma, 2008).
population education is one of the teaching programs delivered in schools. it is a program introduced due to the rapid population growth in both the industrial and developing countries. in the early 1960s, the study of human reproduction, birth control, and also the investigation of the cause and effect of population was included in the school curriculum (sulistyo, 1997, p. 26).
in secondary schools, the government integrated the population education topic into six subjects: biology, geography, economics, civics, physical education, and anthropology, using the integrative approach. though population education was introduced in indonesian formal education many decades ago, some problems still occur. there are still significant differences between the ideal situation (self-reported) and the actual practice related to teachers' roles in teaching population education. this difference indicates that there are role conflicts for teachers in teaching population education. some observable barriers related to the implementation of population education are the lack of teachers’ knowledge and skill, and also the lack of teachers' autonomy in carrying out teaching activities.
education
john dewey (ornstein & levine, 1989, p. 10) considers education as a social process by which the immature members of a group, especially children, learn to participate in group life. thus, through education, children receive knowledge about their cultural heritage and learn to use it in problem solving. hills (1986, p. 50) says that education has two principles: passing on knowledge from one generation to the next, and providing people with skills which enable them to analyze, diagnose, and question something. education, in the narrow sense, is regarded as equivalent to instruction. it consists of ‘specific influences’ given consciously to bring about the development and growth of the students. in general, education aims to transmit a common set of beliefs, values, norms, and understanding from adults to the youth. morality, on the other hand, aims to maintain order in a society, and to respect people as well as regard them holistically (nayef, yaacob, & ismail, 2013, p. 165).
population education
viederman (v. k. rao, 2001, p.
31) says that population education may be defined as an educational process which assists persons to (a) learn the causes and consequences of population problems; (b) define the nature of the problems associated with population processes and characteristics; and (c) assess the positive and effective means by which the society as a whole and he/she as an individual can respond to the areas that influence these processes in order to enhance the quality of life. rao (2004, p. 34) says that: ‘population education is an educational program which provides for a study of population situation in the family, community with the purpose of developing in the students’ rational and responsible attitudes and behavior toward that situation’. based on this definition, we can understand that population education is a program which provides a study of the population situation at various levels. it also intends to develop rational and responsible attitudes and behavior toward that situation.
learning
learning is identified as some kind of change in behavior which is relatively long lasting. according to schunk (2012, p. 3), the definition of learning is ‘an enduring change in behavior, or in the capacity to behave in a given fashion, which result from practice or other form of experience’. learning aims at changing the behavior of the learner. learning is the main activity organized in school, and it has three main criteria: (1) learning involves change, (2) learning endures over time, and (3) learning occurs through experience. illeris (2009, p. 14) distinguishes four meanings of learning. first, learning can refer to the results of individual learning processes. second, learning refers to individual psychological processes that lead to alterations or results described as meaning. third, learning, as well as processes of learning, refers to the interaction process among individuals, his/her material, and social environment described as meaning.
fourth, learning and the process of learning are used identically with the word teaching. this may be interpreted as a result of a tacit short circuit between what is taught and what is learned.
in the discussion of learning activities, assan (2014, p. 340) insists that learning activities, especially in adults, have three features: (1) the learners develop different outlooks and approaches with maturity and/or experience; (2) the learners reveal different degrees of independence in their learning; (3) the learners exhibit a different amount of involvement in, or different approaches to, learning tasks. the type of involvement is often dependent upon the context in which the learning activity takes place.
as far as learning theories are concerned, we distinguish the following learning theories:
self-directed learning. borich (2000, p. 273) has defined self-directed learning as an approach to teaching and learning that actively engages students in the learning process to acquire high levels of behavioral complexity outcome. mohammadi and araghi (2013, p. 75) assert that self-directed learning refers to any self-teaching project in which the learner establishes his specific goal, decides how to achieve it, finds the relevant resources, plans his strategies, and maintains his motivation to learn independently. bear (2012, p. 28) argues that self-directed learning is a process which occurs when individuals take the initiative, with or without the help of others, in diagnosing their learning needs, formulating learning goals, identifying human and material resources for learning, choosing and implementing appropriate learning strategies, and also evaluating learning outcomes.
cooperative learning.
unlike self-directed learning, cooperative learning is defined as activities that involve groups of students jointly working through assigned tasks (after receiving instruction from the teacher) until all of the group members have successfully mastered and completed them (johnson, et al., in thanh, 2014, p. 3).
discovery learning. joy (2014, p. 32) explains that learning happens by discovering, which prioritizes reflection, thinking, experimenting, and exploring. he also suggests that the discovery learning approach is closer to the concepts of exploration, discovering, invention and the ‘knowledge cannot be transferred from one person to another’ concept; instead, a student needs to experience an event in order to make it truly meaningful.
perception
from the perspective of social psychology, walgito (2010, p. 99) defines perception as the process of organizing and interpreting the stimulus received into something meaningful. in perception, the stimulus may come from outside the individual (external) or from within the individual (internal). furthermore, mozkowit and orgel (walgito, 2010, p. 101) argue that perception is a global response to a stimulus. from those definitions, perception is viewed as the response to a stimulus or surroundings. these responses are then interpreted as meaningful information related to the stimuli.
teacher’s role in population education
malik, murtaza, and khan (2011, p. 784) describe teachers in learning-teaching processes as the persons who are responsible for ensuring whether the teaching process puts emphasis on course context, interpersonal relationship, or classroom discipline and control.
the following cases are also taken into consideration by teachers: (1) the kind of learning being promoted by putting emphasis on the acquisition of skill, facts or understanding; (2) the pattern of communication in the classroom; and (3) students’ communication, by keeping an eye on the way in which educational tasks are organized. hudgins et al. (1983, p. 489) distinguish six roles of a teacher in the classroom. first, a teacher is a transmitter. in this role, his duty is to transmit factual information to students. second, he is a socializer; he supervises the development of moral values and norms of his students. third, he is an initiator and administrator of goals; he initiates and administers long-range and short-run activities and goals of the class membership. fourth, he is an evaluator; he evaluates his students’ academic performance. fifth, he is a motivator; he motivates his students to realize their achievement potential. sixth, he is a disciplinarian; his duty is to discipline and apply sanctions in response to the class members’ behavior.

method

the main aim of this research is to find out the implementation of population education in senior high school. this research used a mixed method (quantitative data were analyzed using descriptive statistics, then supported by qualitative data analysis). the basic assumption is that the use of both quantitative and qualitative methods in combination may provide a better understanding of the research problem and question (creswell, 2010). the research was conducted in a senior high school in yogyakarta special region, indonesia, from january to april 2016. the sample consisted of 65 students of class xi, three teachers (sociology teacher, economics teacher, and geography teacher), and also one supervisor.
documentation was used to collect the population education curriculum, students’ books, and teachers’ books. the sample of the research is presented in table 1.

table 1. research sample

data source    number
students       65
teachers       3
principals     1
total          69

research variables

in this research, the aspects evaluated are: (1) the efficiency of the learning/teaching materials, (2) the appropriateness of the learning/teaching process and evaluation process, (3) the teachers’ and students’ satisfaction on population education outcome, (4) the efficiency of the evaluation process, (5) population education outcome, (6) the students’ and teachers’ appreciation of population education, (7) teachers’ role, and (8) the factors that facilitate or inhibit the learning process of population education.

data collection techniques

this research used a variety of data collection techniques, i.e. questionnaires, observation, and interview. in order to collect the quantitative data, questionnaires were given to 65 students. the qualitative data were collected through classroom observations and interviews with four teachers: the teachers of sociology, geography, and economics, and one teacher who is in charge of monitoring the social studies program.

research instruments

the research involved the following instruments. the first one is an observation guide and checklist. observations were conducted at the beginning of the semester. through these observations, the researchers collected information about the school and its population education program. the researchers also checked the teachers’ materials, students’ text books and some teachers’ facilities through the checklist. the second instrument is a questionnaire. the students were given an open and closed questionnaire.
the questions were related to (1) the efficiency of the learning material, (2) the appropriateness of learning/teaching process, (3) teachers’ and students’ satisfaction on population education outcome, (4) efficiency of evaluation, (5) population education outcome, (6) students’ and teachers’ appreciation of population education, (7) teachers’ role, and (8) the factors that facilitate or inhibit population education learning processes. the third instrument is an interview guide. the topics of the interview were identical with the evaluation aspects covered by the questionnaire.

validity and reliability of instruments

validity assessment was required to provide evidence of whether the instrument indeed accomplishes what it is supposed to accomplish (teo, 2013). in this research, face validity and content validity were used to validate the instruments by involving two experts in social studies. in order to check whether the research instruments measured what they were supposed to measure, a tryout test was administered. the tryout results allowed the researchers to revise the content and form of some variables.

data analysis techniques

the questionnaire, which applied the modified likert scale proposed by mardapi (2008, p. 23), was administered to 65 students and analyzed using descriptive statistics. table 2 shows the criteria for learning process, material, course, outcome and also perception of population education. this analysis was followed by three key stages of analyzing qualitative data. miles and huberman in irambona and kumaidi (2015, p. 121) explain that the three stages of qualitative data analysis are data reduction, data display, and conclusion formulation. the qualitative data were reduced to make them simpler to analyze, then were summarized and formulated into a conclusion.
this analysis was done during data collection, as well as after all of the data had been gathered.

findings and discussion

findings

population education learning process

based on the students’ standpoint, the learning process of population education is less appropriate. the mean score of the students’ rating is 29.6, which means that most students chose the ‘sometimes’ category. based on the interviews with the geography, economics and sociology teachers and the teachers’ supervisor, it was discovered that population education is not planned as an integrated lesson. the researchers also discovered several opinions related to population education, including: (1) population education is not popular, (2) some teachers do not have any concern for teaching population education in their courses, (3) population education is not a prominent material in social science class, and (4) the population education course taught only concerns indonesian and east asian issues.

learning materials

questions were asked to the students in order to discover the efficiency of the learning materials and sources used in the population education learning process. it is revealed that students are satisfied with the materials. this is reflected by the number of students who chose ‘always’ and ‘often’: 43.07% of the students chose the ‘always’ category, while 24 students (36.92%) chose the ‘often’ category. the rest of the sample is in the two remaining categories: 12 students chose ‘sometimes’ and only one student chose ‘never’. the mean score of learning material efficiency is 14.17, which falls in the ‘often’ category. according to the interviews with the teachers, the material/books related to population education are easily found. it is also discovered that mass media help teachers to improve and update their learning material.
television, newspapers, internet and other information technology serve the teachers and students as learning sources and references.

evaluation

the objective of this research is to find out whether the students are given assignments and instruction to discuss population issues inside/outside of the class. the researchers found that, in the students’ opinion, the evaluation is less appropriate. only seven students (10.77%) chose the ‘always’ category. meanwhile, 16 out of 65 students (24.61%) chose the ‘often’ category, while more than half of the students chose ‘sometimes’ and ‘never’.

table 2. the criteria of learning process, material, course, outcome and perception

score x             categories               predicate
x ≥ m + 1 sd        strongly agree/always    (very) satisfying/positive/good/appropriate
m ≤ x < m + 1 sd    agree/often              satisfying/positive/good/appropriate
m − 1 sd ≤ x

0.05), as well as root mean square error of approximation (rmsea) or the average size of the expected difference per degree of freedom (df) in the population of less than 0.08. the result of cfa is presented in table 1. according to the table, each dimension of riasec corresponds to its latent variable and loads on one factor, meaning that it fulfills the requirement for construct validity. meanwhile, the instrument’s reliability was tested using the cronbach’s alpha formula, and the resulting reliability coefficient is 0.891.

subjects

the population of this study was the lower-grade students of primary schools in daerah istimewa yogyakarta (special region of yogyakarta), which consists of four regencies and one municipality. the cluster random sampling technique was used to establish the sample. cluster referred to the regencies or municipality which have different characteristics, and random referred to the technique used in selecting both the schools from the selected regencies and municipality, and the classes of the selected schools. the research subjects were 266 primary school students of grade 1, 2, and 3.
the numbers of students undergoing preliminary field testing, main field testing, and operational field testing were 12, 83, and 171 students, respectively. details of the number and classification of the subjects are presented in table 2.

table 1. the result of the confirmatory factor analysis

statistics          r         i         a         s         e         c
χ2                  34.03     35.30     16.04     41.49     34.97     39.44
df                  26        26        15        29        27        32
significance (p)    0.13436   0.10535   0.26074   0.06235   0.13960   0.17136
rmsea               0.023     0.025     0.019     0.027     0.023     0.020
result              fit       fit       fit       fit       fit       fit

table 2. the number of subjects in each field testing

field testing    primary school                     grade      total
preliminary      sdn samirono (kota)                grade 1    4
                                                    grade 2    4
                                                    grade 3    4
main             sdn karangmojo ii (gunungkidul)    grade 1    27
                                                    grade 2    30
                                                    grade 3    26
operational      sdn kotagede (kota)                grade 1    29
                                                    grade 2    27
                 sdn sonosewu (bantul)              grade 2    30
                                                    grade 3    23
                 sdit tunas mulya (gunungkidul)     grade 1    30
                                                    grade 3    32

reid (research and evaluation in education) quartet cards as the media of career exploration... 178 yulia ayriza, farida a. setiawati, agus triyanto, nanang e. gunawan, moh k. anwar, & nugraheni d. budiarti

data analysis technique

the descriptive quantitative analysis technique was employed in this study. in addition, feedback and suggestions from the teachers and research assistants who were in charge of guiding and monitoring the field testing were compiled as a part of the qualitative data used as additional references to improve the research product.

findings and discussion

findings

this section illustrates the findings of the current study which are based on the second to ninth step of borg and gall’s ten-step r and d method. a brief summary of the finding of the first-year study is provided to give a comprehensive depiction of the study. the first stage, research and information collection, shows that the career knowledge and interest of the subjects fit holland’s riasec construct theory.
the second-year research is a follow-up of the previous study’s findings. the planning stage is manifested in developing the concept of quartet playing cards based on holland’s riasec theory. at the top of the card, the word for the type of occupation is made bigger and bolder, while a description of the occupation is provided at the bottom. the cards are grouped into three volumes based on the difficulty levels (low, medium, and high), and four categories, i.e. task, tool, workplace/product/service, and working attire or attribute.

the stage of developing the preliminary product form involves creating the product prototype as designed in the previous stage, and conducting a feasibility test on it. validity tests were performed by a professor with extensive research experience in career development and a ph.d. scholar in primary education, with the following results: (a) several adjustments were made on the pictures, the size of the cards to fit the size of a child’s palm, and on the thickness of the material to ensure the cards’ durability; (b) adjustments were made on the measurement of the career knowledge instrument.

the next stage is the preliminary field testing. in this stage, four students of sdn samirono (yogyakarta) were randomly selected from each grade, making a total of 12 subjects. the result of the limited field testing is presented in table 3. table 3 reveals that most of the first graders (75%) are able to play the game, find it easy to play, and will play it at home, while all second and third graders (100%) have no problem and respond positively to all of the test items. all subjects from the three grades (100%) agree that the playing card helps them learn many types of occupations and enjoy playing the game. however, about 25% of the first graders still have difficulty in playing the game, find it hard to play, and will not consider playing it at home. in addition, the minority also does not find it easy to learn the characteristics of the occupations.
unlike the second and third graders, the first graders show more enthusiasm for obtaining the goal of the quartet career cards as learning media. additionally, during the limited field testing, the research team received feedback and suggestions from the teachers and research assistants, including to make the font size bigger and the colors brighter, and to adjust the pictures to reflect the geographical condition of where the students live.

table 3. the result of limited field testing (primary school: sdn samirono)

item                                            grade 1       grade 2       grade 3       mean
subjects                                        4 students    4 students    4 students
able to play the game                           75%           100%          100%          92%
find the game easy to play                      75%           100%          100%          92%
will play at home                               75%           100%          100%          92%
learn many types of occupations                 100%          100%          100%          92%
learn the characteristics of the occupations    75%           100%          100%          92%
enjoy the game                                  100%          75%           100%          92%

in the next stage, the research team revised the main product based on the recommendations and feedback from the students and teachers in stage four. the result of the revision was the final draft of the main product that was ready for the main field testing.

the main field testing was conducted using the same mechanism as the preliminary field testing in stage four, but with a larger number of subjects. there were 27 grade 1 students, 30 grade 2 students, and 26 grade 3 students of sdn karangmojo gunungkidul. the data are presented in table 4. as illustrated in table 4, the results of the extended main field testing vary among students of grade 1, 2, and 3. nearly all of the students (90%) are able to play the quartet career cards game and find it easy to play. about the same number (94%) of students will play the game at home, and manage to learn the types (92%) and also characteristics (94%) of the occupations from playing it. on average, 99% of the students enjoy playing the game.
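the item means quoted above can be cross-checked against the grade-level percentages listed in table 4. the python sketch below assumes that each reported mean is a simple unweighted average of the three grade percentages, rounded to the nearest whole percent; the source does not state the averaging rule, so this rule and the variable names are illustrative only.

```python
# grade-level percentages from table 4, in the order (grade 1, grade 2, grade 3)
table4 = {
    "able to play the game": (89, 90, 92),
    "find the game easy to play": (89, 90, 92),
    "will play at home": (100, 87, 96),
    "learn many types of occupations": (100, 77, 100),
    "learn the characteristics of the occupations": (100, 87, 96),
    "enjoy the game": (100, 97, 100),
}


def unweighted_mean(values):
    """plain arithmetic mean; an assumed averaging rule, not stated in the source."""
    return sum(values) / len(values)


# mean per item across the three grades, rounded to whole percent
item_means = {item: round(unweighted_mean(v)) for item, v in table4.items()}

# overall achievement: mean of the six item means
overall = round(unweighted_mean(list(item_means.values())))

for item, value in item_means.items():
    print(f"{item}: {value}%")
print(f"overall: {overall}%")
```

under this assumption the six item means come out as 90, 90, 94, 92, 94 and 99 percent, which agrees with the mean column of table 4; weighting the grades by their numbers of students (27, 30, 26) would be an alternative rule.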
overall, 93% of the students achieve the goal of playing the game, as targeted by the research team.

table 4. the result of extended main field testing (primary school: sdn karangmojo ii)

item                                            grade 1        grade 2        grade 3        mean
subjects                                        27 students    30 students    26 students
able to play the game                           89%            90%            92%            90%
find the game easy to play                      89%            90%            92%            90%
will play at home                               100%           87%            96%            94%
learn many types of occupations                 100%           77%            100%           92%
learn the characteristics of the occupations    100%           87%            96%            94%
enjoy the game                                  100%           97%            100%           99%

in the next stage, operational product revision was made according to the feedback and suggestions of the subjects during the extended main field testing, particularly on the colors in the type of occupation and the answer choice. moreover, the colors in the low, medium, and high levels were changed into red, blue, and green, respectively. once the revision was made, it was concluded that the developed learning media was ready for operational field testing.

the operational field testing was done in one primary school in the municipality of yogyakarta, and two other schools in the diy regencies. they were sd kotagede (municipality of yogyakarta) with 29 first graders and 27 second graders; sd sonosewu (bantul regency) with 30 second graders and 23 third graders; and sdit tunas mulia (gunungkidul regency) with 30 first graders and 32 third graders. the result is presented in table 5. table 5 shows that the data obtained from operational field testing vary. in general, the first grade students of the three schools are able to play the game (90%), find the game easy to play (91.5%), will play the game at home (91.5%), learn many types (83%) and characteristics (76.5%) of the occupations, and enjoy the game (95%). on the other hand, 85% of the second graders are able to play the game and find the game easy to play, 99% will play the game at home, 98% learn many types of occupations, 95% learn the characteristics, and 97% enjoy the game. finally, among grade 3 students, 95.5% are able to play the game and find it easy to play, 98.5% will play it at home and learn about many types of occupations, while 100% both learn the characteristics and enjoy the game. the operational field testing reveals that grade 3 students have the highest achievement of the research goal (98%), followed by the second graders (93%), and the first graders (88%). on average, 93% of the lower-grade primary school students achieve the goal of the quartet career cards as targeted by the research team.

table 5. the result of operational field testing

primary school                                  sdn kotagede 1    sdn sonosewu     sdit tunas mulia    mean
grade                                           1       2         2       3        1       3
subjects                                        29      27        30      23       30      32
able to play the game                           83%     89%       81%     100%     97%     91%         90%
find the game easy to play                      86%     89%       81%     100%     97%     91%         91%
will play at home                               90%     100%      97%     100%     93%     97%         96%
learn many types of occupations                 86%     96%       100%    100%     80%     97%         93%
learn the characteristics of the occupations    83%     93%       97%     100%     70%     100%        91%
enjoy the game                                  90%     93%       100%    100%     100%    100%        97%

the ninth and final stage in this research was final product revision. this stage resulted in a suitable final product of career exploration for children aimed at improving their career knowledge.

discussion

this study is a follow-up of the first-year study on the exploration of career interest and knowledge construct using quantitative analysis. the first-year study shows that both the career interest and knowledge of the lower-grade primary students in daerah istimewa yogyakarta are society-oriented, and that they correspond well to holland’s theory of six career categories (riasec).
the fact that students’ career knowledge is society-oriented implies that their career interests are limited to the social scope, as well. as a result, children’s career development may be disrupted, especially when there is no intervention. therefore, the second-year study was aimed at improving children’s career knowledge based on holland’s riasec theory through quartet career cards as the learning media specifically developed for that purpose.

the decision to intervene on the lack of career knowledge among lower-grade primary students is based on the research by xu, hou, and tracey (2014, p. 654) in china, which revealed that the lack of self-exploration and environment exploration was caused by the lack of information or knowledge of exploring career options, as well as the lack of efforts or supporting facilities in the career exploration process. in that case, intervention is imperative to improve the children’s career knowledge.

the final product of the developed learning media in this study is modified quartet cards containing pictures and information aimed at lower-grade primary students’ career development. it is expected that students can acquire wider knowledge and develop their career interests, so that they are well-informed and ready to make a decision on their vocational preference when they grow up. in addition, the game is also intended to strengthen the players’ interpersonal relationships and give them pleasure and enjoyment when playing it. the notion is in line with the study of garris et al. (2002), who state that games can be used as a part of learning activities to make students enjoy the learning process more due to the element of fantasy. moreover, parker and lepper (1992) find that learning in an environment which involves an element of fantasy is more beneficial to students than learning conducted in other conditions.
all field testing, whether preliminary, main, or operational, shows that more than 90% of the students are able to play quartet career cards, find it easy to play, enjoy the game, and will play it again at home. the learning aspect of the game is aimed at helping students improve their career knowledge. this is evident in how students manage to learn more types of occupations, what the jobs entail, and the relevant tools, setting, attire or attributes required for particular jobs. for instance, the children learn that caping (a traditional wide cone-shaped hat made from bamboo) is an attribute associated with indonesian farmers as they do not have special attires for their job.

on the other hand, the discussion on career exploration as early as primary school was in accordance with the study conducted by magnuson and starr (2000). they argue that making simple decisions during early childhood such as choosing what food to eat, toys to play with, clothes to wear, or things to do in their daily lives will help the children have personal preferences. in the future, the preferences will manifest in the formation of individual autonomy that helps them to make decisions and life choices, including in determining what career they would have as an adult.

based on piaget’s cognitive development theory, the lower-grade primary students are in the concrete operational stage, which is marked by how they are not dominated by perception and rely on experience to guide them, in addition to the extraordinary cognitive development and formative stage in the formal education setting (schunk, 2012). the concrete operational stage includes the ability to classify, combine, and compare. in this stage, children are also able to understand the connection and to make sense of a series of events (hill, 2012).
the theory implies that as educators, teachers should be able to provide the appropriate learning style and environment according to the students’ cognitive development so that students are encouraged to explore and actively participate in social interaction. in relation to learning environment, teachers are also responsible for giving new stimuli for students’ cognitive construct to stimulate their development through assimilation and accommodation.

as learning media, the quartet career cards allow the lower-grade primary students to explore a variety of possible career options for their future, as well as to engage in an active participation by interacting with their peers. as a result, the game acts as a stimulus for the environment that simultaneously constructs the children’s cognitive structure with career knowledge. taveira, silva, rodriguez, and maia (1998, p. 90) emphasize the significance of early career exploration for children at the primary school age to support and foster the children’s proper development.

based on the research development, field testing, findings, discussion, relevant theories and previous studies, it can be concluded that the quartet career card game is a contributing factor in children’s career exploration process, as it can help improve the career knowledge of lower-grade primary students. this implies that the quartet career cards can be recommended as media in career guidance activities to expand and enhance children’s career knowledge.

conclusion

the development of the quartet career cards game is aimed at supporting children’s future career development by improving their knowledge on possible career options. while the product development process is based on borg and gall's (1983) ten-step model, the concept of the cards itself relies on holland’s theory of the six career categories, namely realistic, investigative, artistic, social, enterprising, and conventional.
the quartet career cards consist of a picture and information on the types of occupations and each of their tasks, tools, workplace, products or services, as well as the attributes and work attires. there are three levels of difficulty ranging from high, medium, to low. the cards are proven suitable and feasible to be used by lower-grade primary students, based on a series of field testing, as well as validity tests conducted by theory and media experts.

acknowledgment

the authors express gratitude to the islamic development bank and universitas negeri yogyakarta for the financial support in conducting this research, under the grant of penelitian unggulan perguruan tinggi tahun anggaran 2017 no. 19/penel./ p.upt/un34.21/ 2017.

references

ayriza, y., setiawati, f. a., & triyanto, a. (2016). career interest and knowledge of lower grade students of primary school. in international conference of computer, environment, social science, engineering and technology (icest 2016) (p. may 23-25th). medan, indonesia.

borg, w. r., & gall, m. d. (1983). educational research: an introduction (4th ed.). new york, ny: longman.

garris, r., ahlers, r., & driskell, j. e. (2002). games, motivation, and learning: a research and practice model. simulation & gaming, 33(4), 441–467. https://doi.org/10.1177/1046878102238607

hill, w. f. (2012). theories of learning: teori-teori pembelajaran, konsepsi, komparasi dan signifikansi (5th ed.). (m. khozim, trans.). bandung: nusa media.

holland, j. l. (1997). making vocational choices: a theory of vocational personalities and work environments (3rd ed.). odessa, fl: psychological assessment resources. https://doi.org/10.1016/0022-4405(74)90056-9

knight, j. l. (2015). preparing elementary school counselors to promote career development: recommendations for school counselor education programs. journal of career development, 42(2), 75–85. https://doi.org/10.1177/0894845314533745

magnuson, c. s., & starr, m. f. (2000). how early is too early to begin life career planning? the importance of the elementary school years. journal of career development, 27(2), 89–101. https://doi.org/10.1177/089484530002700203

myers, j. e., sweeney, t. j., & witmer, m. (2000). the wheel of wellness counseling for wellness: a holistic model for treatment planning. journal of counseling & development, 78(3), 251–266. https://doi.org/10.1002/j.1556-6676.2000.tb01906.x

nauta, m. m. (2010). the development, evolution, and status of holland’s theory of vocational personalities: reflections and future directions for counseling psychology. journal of counseling psychology, 57(1), 11–22. https://doi.org/10.1037/a0018213

oecd, & asian development bank. (2015). education in indonesia: rising to the challenge. paris: oecd. https://doi.org/10.1525/as.1951.20.15.01p0699q

parker, l. e., & lepper, m. r. (1992). effects of fantasy contexts on children’s learning and motivation: making learning more fun. journal of personality and social psychology, 62(4), 625–633. https://doi.org/10.1037/0022-3514.62.4.625

santrock, j. w. (2008). educational psychology (3rd ed.). new york, ny: mcgraw-hill.

schunk, d. h. (2012). learning theories: an educational perspective. upper saddle river, nj: pearson/merrill prentice hall.

sigelman, c. k., & rider, e. a. (2006). lifespan human development (5th ed.). belmont, ca: thomson wadsworth.

sweeney, t. j. (2009). adlerian counseling and psychotherapy: a practitioner’s approach (5th ed.). new york, ny: taylor & francis. https://doi.org/10.4324/9780203886144

taveira, m. d. c., silva, m. c., rodriguez, m. l., & maia, j. (1998). individual characteristics and career exploration in adolescence. british journal of guidance & counselling, 26(1), 89–104. https://doi.org/10.1080/03069889808253841

xu, h., hou, z.-j., & tracey, t. j. g. (2014). relation of environmental and self-career exploration with career decision-making difficulties in chinese students. journal of career assessment, 22(4), 654–665. https://doi.org/10.1177/1069072713515628

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(2), 2017, 152-162
available online at: http://journal.uny.ac.id/index.php/reid

research article

characteristics and equation of accounting vocational theory trial test items for vocational high schools by subject-matter teachers’ forum

dian normalitasari purnama
graduate school of universitas negeri yogyakarta
jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
email: diannsp@gmail.com

submitted: 23 january 2018 | revised: 12 february 2018 | accepted: 12 february 2018

abstract

this study is aimed at: (1) understanding the characteristics of accounting vocational theory trial test items using the item response theory and (2) determining the horizontal equation of accounting vocational theory trial exam instruments. this was explorative-descriptive research, with eleventh-grade students as the subjects. the research objects were test instruments and responses of students from six schools selected through the stratified random sampling technique. the data analysis employed review sheets and the bilog program for the item response theory 2pl. the findings were as follows. (1) the test item review of test packages a and b found 37 good quality items; the item response theory analysis using 2pl showed that package a contained 27 good questions and package b contained 24 good questions.
(2) the question equating using the mean/sigma method resulted in the equation of = 1.168bx + 0.270, with the mean/mean method resulting in the equation of = 0.997bx 0.250, the mean/mean method at 0.250, while mean/sigma method at 0.320. keywords: accounting questions, vocational high school, horizontal equating, item response theory how to cite item: purnama, d. (2018). characteristics and equation of accounting vocational theory trial test items for vocational high schools by subject-matter teachers' forum. reid (research and evaluation in education), 3(2), 152-162. doi:http://dx.doi.org/10.21831/reid.v3i2.18121 introduction nitko and brookhart (2011, p. 3) define assessment as a broad term referring to a process for obtaining information used for making decisions about students; curricula, programs, and schools; and educational policy. assessment and evaluation of learning outcomes are among the efforts made to monitor the students’ competency following the learning process. in accordance with article 57 paragraph (1) of law no. 20 of 2003 on national education system, evaluation is performed in the national education quality control framework to show education provider’s accountability to interested parties such as students and educational institutions and programs. the evaluation, for instance, is implemented by the government through national examination (ujian nasional or un). national examination is held annually and simultaneously across indonesia. regulation of the minister of education no. 20 of 2007 on the educational assessment standard explains that national examination is an activity which measures students’ competency in certain science and technology subjects to appraise their achievements in national education standards. the outcomes of the national exam are further used by the government to establish policies pertaining to education. article 68 of government regulation no. 
19 of 2005 on national education standard mentions that the outcomes of the national exam are used as a consideration in mapping the quality of educational programs and/or units. the mapping has the purpose of understanding the quality of education in each region.

before the national examination is held, the provincial and regency education offices hold trial exams (nationally known as ‘tryouts’) as a preparation for students in facing the exam. in an interview between the researcher and an accounting teacher at a vocational high school, the teacher said that he chaired the accounting subject-matter teachers’ forum (musyawarah guru mata pelajaran or mgmp) of sleman regency. the interview revealed that the test used in the accounting trial exam for vocational high schools held by the education office of sleman regency, particularly for the productive accounting subject, was prepared by the accounting subject-matter teachers’ forum. the questions were given in two packages (a and b), with the same exam content outline and materials, to avoid cheating during the trial exams.

both packages for the accounting vocational theory trial exam for vocational high schools in sleman regency can be used as a collection of questions with good characteristics. a good test instrument is composed of good items (retnawati, 2014, p. 62). therefore, an analysis of the test items contained in a test instrument is necessary to help find out the quality of the instrument. mardapi (2012, p. 128) suggests that an item analysis can observe the difficulty level, discrimination index, and distractors’ effectiveness of test items. the analysis also helps in observing the validity and reliability of a test.

in addition to test item characteristics, the parallelism of both trial test packages is unproven.
this means that the difficulty level and discrimination index of both test packages may or may not be the same. this can cause a student’s score to be higher than his or her ability thanks to the easier test package he or she received. this situation may result in inaccurate measurement of students’ competency achievement. for this reason, although both packages for the accounting vocational theory trial exam prepared by the accounting subject-matter teachers’ forum are provided with the same exam content outline and materials, the equation between packages a and b still becomes a subject of attention.

when the two test packages are not proven to be parallel, an equation process is the next step to be taken. kolen and brennan (2014, p. 2) define equation or equating as a statistical process used to adjust the scores of a test so that they can be used interchangeably. sukirno (2007) explains that equating makes it possible to compare the scores earned by students even though they used different test packages. in that way, test participants will not be advantaged or disadvantaged by the easier or harder test package they receive.

there are two approaches that can be used for test equating: classical test theory (ctt) and item response theory (irt). in ctt, the tests to be equated must have the same reliability index. item response theory uses a mathematical model in which the probability that a test participant answers a question correctly depends on the participant’s ability and on the characteristics of the question (hambleton, swaminathan, & rogers, 1991, p. 9). test equating using irt is more representative than that using ctt, since irt parameters have the invariance property: the ability parameter is invariant with respect to the test parameters and vice versa (aminah, 2012).

the same measurement scale in the scores obtained by students during a trial exam will make education quality monitoring easier.
the test outcomes will show the students’ competence mastery in facing the national exam, while serving as a consideration in making decisions for improving the quality of graduates.

hambleton and swaminathan (1985, p. 197) explain that horizontal equating is performed between two different versions of a test, and vertical equating is performed on tests across difficulty levels. horizontal equating can also be defined as determining equivalent scores on different forms of a test (crocker & algina, 2008, p. 456). horizontal equating is appropriate when several forms of a test are needed for the security of the test. these forms are not the same, but it is expected that they are similar in their content and difficulty. when the difficulty, reliability, and content of tests are very different from one form to another, few methods of equating can work properly (cook & eignor, 1991).

dorans, moses, and eignor (2010) mention that in an equivalent group design, two tests are administered to two equivalent groups chosen randomly from the same population (they are assumed to have equivalent ability). moghadamzadeh, salehi, and khodaie (2011) also explain that the equivalent group design might reduce the effect of exercise and boredom, but it might also cause a bias since the groups might not have an equivalent distribution of ability. to reduce the possibility of bias, the use of a big sample is suggested.

in addition, liao and livingston (2012) present three approaches that could be considered as alternatives to a common-item equating design. in their paper, the randomly equivalent form approach assembles test forms of equal difficulty by stratified random sampling of items from the item pool.
a previous study conducted by miyatun and mardapi (2000) also introduces non-anchor item equating using the equivalent group design.

the above description illustrates the significance of equating both test packages of the accounting vocational theory trial exam for vocational high schools prepared by the accounting subject-matter teachers’ forum of sleman regency. the question analysis and test instrument equating will provide objective information and show the actual competency of students in preparing for the national examination.

method
this descriptive-quantitative research tries to equate the test instruments of the accounting vocational theory trial exam for vocational high schools that were prepared by the accounting subject-matter teachers’ forum of sleman regency in the academic year of 2015/2016 in two packages, a and b. the research was conducted at vocational high schools in sleman regency, yogyakarta special region.

the subjects of this research are grade xii students of vocational high schools in sleman regency who took the accounting vocational theory trial exam in the academic year of 2015/2016. the research objects were the test instruments and the answer sheets of 650 package b participants from six vocational high schools selected through the stratified random sampling technique based on the national exam rank for the accounting vocational theory subject in the academic year of 2014/2015.

kolen and brennan (2014, p. 13) state that there are two ways to do an equivalent group design: (1) by giving a single test to measure students’ ability, and (2) by doing a structured test administration, for example, x test for the first student, y test for the second student, x test for the third student, and so on.
in reference to the theory, the accounting competency tryout test is considered suitable for the equivalent group design, since the students with odd student identity numbers worked on package a, while those with even numbers worked on package b.

the data were collected through documentation. the test items were reviewed by experts to see their characteristics qualitatively. the review covered the material, construction, and language aspects of the test items. the trial exam answer sheets or responses were used for the quantitative analysis.

the test instruments were analyzed using the item response theory with the assistance of the bilog-mg program, which generates a three-phase output. the first phase revealed the number of test participants correctly answering each test item, the ratio of correct answer probability to wrong answer probability, and the biserial coefficient. the second phase obtained the data on the item parameters according to the item response theory model used. the 1pl model covers the difficulty level; the 2pl model covers the difficulty level and discrimination index; and the 3pl model covers the difficulty level, discrimination index, and guessing factor.

in estimating the parameters, the logistic model with the highest number of fit items was used. fit items are items whose calculated chi-square value is smaller than the table chi-square value, or whose p-value is above 5%. the goodness of fit test aims at knowing whether or not the items used are in accordance with the model applied.

the difficulty level categorizes an item as easy or difficult for students. it can be understood by calculating the number of students who answer correctly.
the difficulty level is considered good when its value ranges from -2 to 2, the discrimination index is considered good when its value ranges from 0 to 2, and the guessing factor is considered good when its value is lower than 0.2 (1/total answer alternatives).

the testing of the equation of the two test packages is aimed at observing whether or not the package a and b tests were parallel. in the presence of any evidence of non-parallelism, both packages need to be equated. allen and yen (1979, p. 59) suggest that two test instruments are considered parallel when both have the same mean and variance. the parallelism testing of the test instruments was carried out using the spss program.

equating was carried out based on the result of parameter estimation from bilog, which generates information on the conversion constants of the equated test instrument. equating was held using the equivalent group design since, as shown by the data, the students’ responses were sourced from two different test instruments answered by two different student groups with equivalent ability. there were no anchor items in the two test instruments.

findings and discussion
findings
validity and reliability of questions
this research involved five raters to estimate the validity using the aiken formula. the validity of the test items in both packages a and b according to the aiken formula is relatively good. package a contains 26 questions with a good validity index (minimum 0.87) and 14 questions with poor content validity. package b contains 27 questions with a good content validity index and 13 questions with a very poor content validity index.

characteristics of accounting vocational theory trial test items based on question item review criteria
table 1 shows the characteristics of the trial test items based on the outcome of expert review. in the material aspect, test packages a and b have 37 good questions and three poor questions.
this is because the prepared questions are not in accordance with the exam content outline. in the construction and language aspects, all 40 items in both packages a and b are in a good category.

table 1. outcome of trial test items review

aspect         package   good (qty, %)   poor (qty, %)   very poor (qty, %)
material       a         37, 92.5        3, 7.5          -
material       b         37, 92.5        3, 7.5          -
construction   a         40, 100         -               -
construction   b         40, 100         -               -
language       a         40, 100         -               -
language       b         40, 100         -               -

characteristics of accounting vocational theory trial test items based on item response theory
the quantitative analysis using item response theory requires an assumption test as a prerequisite. a unidimensional assumption test was carried out to observe whether or not the accounting vocational theory trial exam instruments measure a single ability (trait). the unidimensional test was performed with factor analysis using spss 20.

as presented in figure 1, the result of the factor analysis shows that the 40 test items form 11 factors that explain 55.063% of the total variance. the result also shows that the first factor is dominating, with an eigenvalue of 9.439, which is five times bigger than that of the second factor. therefore, it is safe to say that package a of the accounting vocational theory trial exam instrument is unidimensional.

figure 1. scree plot of package a

as presented in figure 2, in the package b test, the 40 test items form 13 factors which explain 56.740% of the total variance. the result also shows that the first factor is dominating, with an eigenvalue of 7.595, which is four times larger than that of the second factor. therefore, it can be assumed that package b of the accounting vocational theory trial exam instrument is unidimensional.

figure 2.
scree plot of package b

the local independence assumption for package a was tested using the variance-covariance matrix of students’ ability on the package a test, with the students divided into 15 groups. the classification was carried out by ranking the students from the highest to the lowest ability, using the ability estimates from the 2-parameter model. the result shows that the off-diagonal elements of the matrix approach zero, meaning that the test instruments have passed the local independence assumption test.

the parameter invariance assumption test came in two types. the first was the item parameter invariance test, which is aimed at observing whether or not the item parameters changed when the items were answered by different student groups. the second was the parameter invariance test on participants’ abilities, to see whether or not the estimated students’ abilities changed when the test items were changed. the test was performed using scree plots as presented in figures 3, 4, and 5.

figure 3. scree plot of parameter invariance for the difficulty level in package a test
figure 4. scree plot of parameter invariance for discrimination index in package a test
figure 5. scree plot of parameter invariance for participants’ abilities in package a test

figures 3, 4, and 5 show that, in general, all of the plots are relatively close to the diagonal line, which indicates that the parameter invariance in the package a test is met.

figure 6. scree plot of parameter invariance for the difficulty level in package b test
figure 7. scree plot of parameter invariance for discrimination index in package b test

meanwhile, figures 6 and 7 show that, in general, all plots are scattered, away from the diagonal line.
plots scattered away from the diagonal line show that the parameter invariance of the difficulty level and the discrimination index of the package b test is not met.

figure 8. scree plot of parameter invariance for participants’ abilities in package b test

figure 8 shows that, in general, all plots are relatively close to the diagonal line. therefore, it can be inferred that the parameter invariance assumption for students’ abilities in the package b test is met.

the result of model fitness. in order to determine the model that fits the items, the data were analyzed under three logistic models (1pl, 2pl, and 3pl). the fit-model analysis was assisted by bilog software version 3.0. the fit items were the items with a chi-square p-value bigger than 5%. the fit-model analysis served to determine which model would be used in the subsequent item response theory analysis.

table 2. goodness of fit test of model by p-value

category   1pl   2pl   3pl
fit        15    32    31
unfit      25    8     9

table 2 shows that, for package a, the 2pl model has the largest number of fit items. the result of the question analysis based on the 2pl model in the package a test found 27 good questions and 13 poor questions. such poor questions were caused by the difficulty level and discrimination index that exceeded the criteria.

table 3. goodness of fit test of model by p-value

category   1pl   2pl   3pl
fit        11    30    28
unfit      29    10    12

table 3 shows that, for package b, the 2pl model also has the largest number of fit items. the result of the item analysis based on the 2pl model in the package b test found 24 good questions and 16 poor questions.

information function (if). the item information function helps determine the quality of a test instrument. to observe the information function of the package a and b tests, the 2pl model was used. in the 2pl model, the highest point of the item information function is reached when a student who responds to an item has an ability equal to the difficulty level of the item; the height of that point depends on the discrimination index of the item.

figure 9. chart of the information function of package a

figure 9 shows that the maximum information function value is 27.884, reached at -0.250 logit (theta). the estimated standard error of measurement (sem) for package a is 0.189; the sem is inversely related to the information function of the test. this means that the accounting vocational theory trial package a test gives the most information, with the smallest measurement error, when answered by participants with an ability of -0.250.

figure 10. chart of the information function of package b

figure 10 indicates that the maximum information function value of 18.362 is reached at 0 logit (theta). the test’s sem is 0.2337, again inversely related to the test information function. this means that the accounting vocational theory trial package b test gives the most information, with the smallest measurement error, when answered by participants with an ability of zero (0).

accounting vocational theory trial exam equating test. the equation of packages a and b of the accounting vocational theory trial test must be verified in order to see whether or not both packages are parallel. the test for the test instruments’ equation can be done using the t-test. the result of the t-test shows a significance value (equal variances assumed) of 0.000, which is below alpha = 0.05. this means that the average scores in packages a and b differ (with an average difference of 3.092), and therefore, equating is necessary.

equating
when the accounting vocational theory trial exam instruments were proven non-parallel, equating was necessary. during the equating test, one needs to determine which package will be used as the benchmark. this research equated package a to package b, as presented in table 4.
the analysis using bilog 3.0 found that the 2pl model yields the most fit items with good characteristics.

mean/sigma method. in the mean/sigma method, the calculation of the constants α and β using the mean and standard deviation of the difficulty level resulted in α = 1.168 and β = 0.270. from the constants α and β, the equation of package a (x) to package b (y) is found as follows:

\[ \theta_y = 1.168\,\theta_x + 0.270 \]
\[ b_y = 1.168\,b_x + 0.270 \]
\[ a_y = a_x / 1.168 \]

using α and β, item parameter transformation was carried out, which resulted in the equated item parameters presented in table 5. the package a test has 17 test items whose average difficulty level is -0.113 with a standard deviation of 0.641; after equation, the mean changes to 0.138 and the standard deviation changes to 0.749. further, the average discrimination index of the package a test is 1.285 with a standard deviation of 0.386; after equation, the mean changes to 1.100 and the standard deviation changes to 0.330.

table 4. summary of question parameter

      package a                                 package b
no    difficulty level   discrimination index   difficulty level   discrimination index
5     -0.320             1.364                  0.843              1.084
8     -0.608             1.444                  -0.677             1.719
9     -0.136             1.842                  -0.282             1.793
12    -0.369             1.634                  -0.949             1.606
13    -0.484             1.548                  -0.307             1.709
16    -0.262             1.197                  -0.154             1.167
17    -0.743             1.508                  0.927              0.628
20    0.297              1.199                  0.066              1.165
21    1.511              0.573                  1.529              0.630
23    -0.103             1.552                  0.169              1.787
24    -1.116             0.606                  -0.270             1.848
27    -0.035             1.591                  0.229              1.619
30    0.624              0.981                  0.682              1.049
33    0.602              1.272                  0.738              0.910
34    0.427              1.601                  0.791              1.402
36    -0.468             1.317                  0.391              0.887
37    -0.730             0.611                  -1.375             0.905
µ     -0.113             1.285                  0.138              1.289
σ     0.641              0.386                  0.749              0.422

table 5.
conversion of package a to package b using mean/sigma method

      package a (initial)    package b scale (equated)
no    b         a            b         a
5     -0.320    1.364        -0.104    1.168
8     -0.608    1.444        -0.440    1.236
9     -0.136    1.842        0.111     1.577
12    -0.369    1.634        -0.161    1.399
13    -0.484    1.548        -0.295    1.325
16    -0.262    1.197        -0.036    1.025
17    -0.743    1.508        -0.598    1.291
20    0.297     1.199        0.617     1.026
21    1.511     0.573        2.035     0.491
23    -0.103    1.552        0.150     1.329
24    -1.116    0.606        -1.033    0.519
27    -0.035    1.591        0.229     1.362
30    0.624     0.981        0.999     0.840
33    0.602     1.272        0.973     1.089
34    0.427     1.601        0.769     1.371
36    -0.468    1.317        -0.277    1.128
37    -0.730    0.611        -0.583    0.523
µ     -0.113    1.285        0.138     1.100
σ     0.641     0.386        0.749     0.330

mean/mean method. in the mean/mean method, the calculation of the constants α and β uses the means of the difficulty level and the discrimination index, which resulted in α = 0.997 and β = 0.250. from the constants α and β, the equation of package a (x) to package b (y) is found as follows:

\[ \theta_y = 0.997\,\theta_x + 0.250 \]
\[ b_y = 0.997\,b_x + 0.250 \]
\[ a_y = a_x / 0.997 \]

table 6 shows the conversion of the difficulty level and discrimination index parameters that results from the equation. the package a test has 17 test items whose average difficulty level is -0.112 with a standard deviation of 0.641; after equation, the mean changes to 0.138 and the standard deviation changes to 0.639. the average discrimination index of package a is 1.285 with a standard deviation of 0.385; after equation, the mean changes to 1.289 and the standard deviation changes to 0.387.

table 6.
conversion of package a to package b using mean/mean method

      package a (initial)    package b scale (equated)
no    b         a            b         a
5     -0.320    1.364        -0.069    1.368
8     -0.608    1.444        -0.356    1.448
9     -0.136    1.842        0.114     1.847
12    -0.369    1.634        -0.118    1.639
13    -0.484    1.548        -0.232    1.553
16    -0.262    1.197        -0.011    1.201
17    -0.743    1.508        -0.491    1.512
20    0.297     1.199        0.546     1.203
21    1.511     0.573        1.756     0.575
23    -0.103    1.552        0.147     1.557
24    -1.116    0.606        -0.863    0.608
27    -0.035    1.591        0.215     1.596
30    0.624     0.981        0.872     0.984
33    0.602     1.272        0.850     1.276
34    0.427     1.601        0.676     1.606
36    -0.468    1.317        -0.217    1.321
37    -0.730    0.611        -0.478    0.613
µ     -0.112    1.285        0.138     1.289
σ     0.641     0.385        0.639     0.387

accuracy of equating result based on root mean square difference. kim and cohen (1996, p. 17) explain the formula to calculate the equating accuracy as follows.

\[ \mathrm{RMSD}(a) = \sqrt{\frac{\sum_{i=1}^{n} (a_i^{*} - a_i)^2}{n}} \]
\[ \mathrm{RMSD}(b) = \sqrt{\frac{\sum_{i=1}^{n} (b_i^{*} - b_i)^2}{n}} \]
\[ \mathrm{RMSD}(\theta) = \sqrt{\frac{\sum_{j=1}^{N} (\theta_j^{*} - \theta_j)^2}{N}} \]

note:
rmsd = root mean square difference
$a_i^{*}$ = differentiator power of the first test after being equated to the second test
$a_i$ = differentiator power of the first test
$b_i^{*}$ = the difficulty level of the first test after being equated to the second test
$b_i$ = the difficulty level of the first test
$\theta_j^{*}$ = the ability of the test participants of the first test after being equated to the second test
$\theta_j$ = the ability of the test participants of the first test
$n$, $N$ = the number of items and the number of test participants, respectively

table 7. summary of rmsd calculation result for mean/sigma and mean/mean methods

parameter                   mean/sigma method   mean/mean method
the difficulty level (b)    0.272               0.251
discrimination index (a)    0.192               0.004
ability (θ)                 0.320               0.250

table 7 shows that the rmsd values in the mean/mean method are lower than the rmsd values in the mean/sigma method. it can be concluded that equating with the mean/mean method is more accurate than equating with the mean/sigma method.
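the two linear methods and the rmsd comparison above can be retraced from the item parameters listed in table 4. the python sketch below is a minimal illustration under stated assumptions, not the bilog-based procedure used in the study: it assumes the textbook mean/sigma constants (α as the ratio of the standard deviations of the difficulty levels), the textbook mean/mean constants (α as the ratio of the mean discrimination indices), sample standard deviations, and the rmsd as the square root of the mean squared difference. all function and variable names are hypothetical.

```python
import math
from statistics import mean, stdev

# Item parameters of the 17 items in table 4 (package A = x, package B = y).
B_X = [-0.320, -0.608, -0.136, -0.369, -0.484, -0.262, -0.743, 0.297, 1.511,
       -0.103, -1.116, -0.035, 0.624, 0.602, 0.427, -0.468, -0.730]
A_X = [1.364, 1.444, 1.842, 1.634, 1.548, 1.197, 1.508, 1.199, 0.573,
       1.552, 0.606, 1.591, 0.981, 1.272, 1.601, 1.317, 0.611]
B_Y = [0.843, -0.677, -0.282, -0.949, -0.307, -0.154, 0.927, 0.066, 1.529,
       0.169, -0.270, 0.229, 0.682, 0.738, 0.791, 0.391, -1.375]
A_Y = [1.084, 1.719, 1.793, 1.606, 1.709, 1.167, 0.628, 1.165, 0.630,
       1.787, 1.848, 1.619, 1.049, 0.910, 1.402, 0.887, 0.905]


def mean_sigma(b_x, b_y):
    """Mean/sigma constants: alpha is the SD ratio of the difficulty levels."""
    alpha = stdev(b_y) / stdev(b_x)
    beta = mean(b_y) - alpha * mean(b_x)
    return alpha, beta


def mean_mean(a_x, a_y, b_x, b_y):
    """Mean/mean constants: alpha is the ratio of mean discrimination indices."""
    alpha = mean(a_x) / mean(a_y)
    beta = mean(b_y) - alpha * mean(b_x)
    return alpha, beta


def to_y_scale(a_x, b_x, alpha, beta):
    """Place package A items on the package B scale: b* = alpha*b + beta, a* = a/alpha."""
    a_star = [a / alpha for a in a_x]
    b_star = [alpha * b + beta for b in b_x]
    return a_star, b_star


def rmsd(equated, initial):
    """Root mean square difference between equated and initial parameters."""
    squared = sum((e - i) ** 2 for e, i in zip(equated, initial))
    return math.sqrt(squared / len(initial))


if __name__ == "__main__":
    methods = {
        "mean/sigma": mean_sigma(B_X, B_Y),
        "mean/mean": mean_mean(A_X, A_Y, B_X, B_Y),
    }
    for name, (alpha, beta) in methods.items():
        a_star, b_star = to_y_scale(A_X, B_X, alpha, beta)
        print(name, round(alpha, 3), round(beta, 3),
              round(rmsd(b_star, B_X), 3), round(rmsd(a_star, A_X), 3))
```

with the table 4 values as input, this sketch should give constants and item-parameter rmsd values that agree with the text and table 7 to about three decimals. the ability rmsd is not retraced here because the participants' ability estimates are not listed in the article.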
discussion
characteristics of trial exam question item based on question item review
in the material aspect, both the package a and b tests have 3 test items that require revision as they do not fit the exam content outline. for the construction and language aspects, both packages a and b are 100% in the good criteria.

characteristics of test items
the result of the analysis of the package a test shows that 15 test items fit the 1pl model, 32 test items fit the 2pl model, while 31 test items fit the 3pl model. the characteristics of the questions in package a based on the 2pl model show that there are 27 good questions that fit the model. thirteen items are poor as their difficulty level and discrimination index do not meet the criteria (above +2).

the result of the analysis of the package b test shows that 11 test items fit the 1pl model, 30 test items fit the 2pl model, and 28 test items fit the 3pl model. this shows that the 2pl model has the largest number of fit test items. based on the 2pl model, 24 items are good and fit the model, while 16 items are poor.

trial exam question equating
the questions used in the trial exam of the accounting vocational theory in sleman regency were given in packages a and b. if both packages were used without being equated, one of the student groups would be disadvantaged, particularly the students working on the harder test package. the result of the t-test on the scores in the two packages shows that both packages are non-parallel, and therefore, equating is necessary.

the question equating using the mean/sigma method resulted in the equation $b_y = 1.168\,b_x + 0.270$, while the mean/mean method resulted in the equation $b_y = 0.997\,b_x + 0.250$. kilmen and demirtasli (2012) conduct similar equating research by using four methods in the irt approach. those four methods use the least rmsd value to determine the accuracy.
the rmsd value in the mean/mean method is smaller than the rmsd value in the mean/sigma method. the mean/mean method resulted in an rmsd of 0.251 for parameter b, 0.004 for parameter a, and 0.250 for the ability parameter, whereas the mean/sigma method resulted in an rmsd of 0.272 for parameter b, 0.192 for parameter a, and 0.320 for the ability parameter. a lower rmsd value indicates a more accurate equating result; in this case, the mean/mean equating method shows a better result than the mean/sigma method.

conclusion
the results of the expert review of the test items are as follows. (1) in terms of the material, construction, and language aspects, the test items in the test instruments of the accounting vocational theory trial exam prepared by the accounting subject-matter teachers’ forum of sleman regency are in a good category. (2) the content validity of the package a and b test items according to the aiken formula is satisfactory. (3) the reliability coefficient of the test instruments of the accounting vocational theory trial exam for both packages a and b is in a good category, at 0.887 for package a and 0.856 for package b. (4) the analysis based on the item response theory using the 2pl model shows that 32 items of package a and 30 items of package b fit the model. the discrimination index of package a shows that there are 27 good items and 13 poor items. in the package b test, 24 items are in a good category while the remaining 16 items are in a poor category. poor items result from a difficulty level and discrimination index which exceed the criteria. (5) equating using the mean/mean method yields a smaller rmsd value than equating using the mean/sigma method.

references
allen, m. j., & yen, w. m. (1979). introduction to measurement theory. monterey, ca: cole publishing.
aminah, n. s. (2012). karakteristik metode penyetaraan skor tes untuk data dikotomos.
jurnal penelitian dan evaluasi pendidikan, 16(special issue for uny’s 48th dies-natalis), 88–101. https://doi.org/10.21831/pep.v16i0.1107
cook, l. l., & eignor, d. r. (1991). an ncme instructional module on irt equating methods. educational measurement: issues and practice, 10, 37–45.
crocker, l. m., & algina, j. (2008). introduction to classical and modern test theory. new york, ny: holt, rinehart, and winston.
dorans, n. j., moses, t. p., & eignor, d. r. (2010). principles and practices of test score equating. princeton, nj: educational testing service.
government regulation no. 19 year 2005, on national education standard (2005). republic of indonesia.
hambleton, r. k., & swaminathan, h. (1985). item response theory: principles and applications. boston, ma: kluwer nijhoff.
hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. london: sage publications.
kilmen, s., & demirtasli, n. (2012). comparison of test equating methods based on item response theory according to the sample size and ability distribution. procedia social and behavioral sciences, 46, 130–134. https://doi.org/10.1016/j.sbspro.2012.05.081
kim, s.-h., & cohen, a. s. (1996). a comparison of linking and concurrent calibration under item response theory. in american educational research association annual meeting (pp. 1–52). new york, ny: american educational research association.
kolen, m. j., & brennan, r. l. (2014). test equating, scaling, and linking: methods and practices. new york, ny: springer.
law no. 20 of 2003 of republic of indonesia on national education system (2003).
liao, c.-w., & livingston, s. a. (2012). a search for alternatives to common-item equating. in paper presented at the annual meeting of the national council on measurement in education. vancouver, british columbia, canada.
mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha medika.
miyatun, e., & mardapi, d. (2000). komparasi metode penyetaraan tes menurut teori respons butir. jurnal penelitian dan evaluasi pendidikan, 2(3), 1–18. https://doi.org/10.21831/pep.v2i3.2083
moghadamzadeh, a., salehi, k., & khodaie, e. (2011). a comparison method of equating classic and item response theory (irt): a case of iranian study in the university entrance exam. procedia social and behavioral sciences, 29, 1368–1372. https://doi.org/10.1016/j.sbspro.2011.11.375
nitko, a. j., & brookhart, s. m. (2011). educational assessment of students (6th ed.). boston, ma: pearson education.
regulation of the minister of education no. 20 of 2007 on the educational assessment standard (2007). republic of indonesia.
retnawati, h. (2014). teori respons butir dan penerapannya: untuk peneliti, praktisi pengukuran dan pengujian, mahasiswa pascasarjana. yogyakarta: nuha medika.
sukirno, s. (2007). penyetaraan tes uan: mengapa dan bagaimana? cakrawala pendidikan, 26(3), 305–321. https://doi.org/10.21831/cp.v3i3.3983

copyright © 2018, reid (research and evaluation in education)
issn 2460-6995

reid (research and evaluation in education), 4(2), 2018, 126-135
available online at: http://journal.uny.ac.id/index.php/reid

developing higher-order thinking skill (hots) test instrument using lombok local cultures as contexts for junior secondary school mathematics

*1syukrul hamdi; 2iin aulia suganda; 3nila hayati
1,2,3universitas hamzanwadi
jl. cut nyak dien no.85, pancor, selong, lombok timur, west nusa tenggara 83611, indonesia
*corresponding author. e-mail: syukrulhamdi@hamzanwadi.ac.id
submitted: 23 november 2018 | revised: 04 december 2018 | accepted: 10 december 2018

abstract
the study was aimed at producing a valid and reliable higher-order thinking skill (hots) test instrument using lombok local cultures as contexts in the junior secondary school mathematics subject matter.
the study is developmental research involving a field try-out with 75 students of grade viii. data were analyzed using the classical test theory measures of difficulty level, discriminating power, and distractor functioning. the test validity is assessed using the aiken formula and reliability is estimated by cronbach alpha. findings show that, of the 20 initial multiple-choice items, 15 were valid and reliable and had the characteristics of good test items, with a medium-rated average difficulty level of 0.28, a good-rated discriminating power of 0.31, a good-rated reliability coefficient of 0.79, and all distractors functioning well.

keywords: test item development, higher-order thinking skill (hots), junior secondary school mathematics education

introduction
twenty-first-century education does not merely provide access to information for students. it is expected to form generations that are able to act effectively in facing the challenges of a complex and ever-changing world. it must be able to give new experiences and unique and creative ideas, and to develop collaborative attitudes as learners’ capital to face the world of work, get along with society, and live their daily lives. the partnership for 21st century skills (warisdiono, et al., 2017, p. 18) explains that learning in the educational world must focus on developing the 4c’s as competencies which must be acquired to face the 21st century: creativity, critical thinking, communication, and collaboration. this has had a great influence on educational curricula in accommodating 21st-century competencies into the school subject matters, including mathematics.

mathematics is one of the knowledge fields that have central roles in the development of the competencies needed to face 21st-century environments. mathematics understanding is central to the readiness of the young generation to live in modern society.
a growing proportion of the problems and situations encountered in daily life, including in professional contexts, requires some level of mathematical understanding, mathematical thinking, and mathematical tools. mathematics is an important tool for young adults in confronting issues and problems in the personal, professional, societal, and scientific aspects of their daily lives (oecd, 2013 in kurniati, harimukti, & jamil, 2016, p. 143). however, the low level of learners' mathematics knowledge has attracted the attention of educators and researchers and has long been a hot topic of discussion in society.

a number of international evaluations of mathematics learning reveal that indonesian students have not performed satisfactorily. indonesia has 1,095 class hours per year, yet its students' competencies are below the average level, whereas south korea, with 903 hours, and japan, with 712, sit at the high level of the world ranking (rahmawati, 2016, p. 6). indonesia takes part in international assessments to see where its educational achievement stands among other countries in the world. results of the programme for international student assessment (pisa), conducted by the organisation for economic co-operation and development (oecd) since 2000 to examine the thinking abilities of students around 15 years of age in reading, mathematics, and science, show that the average mathematics literacy score of indonesian children is still below the international standard (indonesia pisa center, 2013). the mathematics literacy of indonesian children is, therefore, low.
in the 2015 pisa study, which involved 540,000 15-year-old students from 72 countries, indonesia ranked 63rd of the 70 countries assessed, with a mathematics score of 386. the international standard score was 490 (oecd, 2016). this shows that the average mathematics literacy score of indonesian students is still below the international standard score.

besides pisa, another study, the trends in international mathematics and science study (timss), in which indonesia has taken part since 1999, reported the same thing. the mathematics competences of grade viii indonesian students were low (scientific literacy, october 24, 2014). the 2015 timss study showed that indonesian students scored 397 against the international standard of 500. indonesia was still below average, ranking 45th out of 50 countries (mullis, martin, foy, & arora, 2015). details of the pisa and timss mathematics rankings of indonesian students can be seen in table 1. this condition of indonesian mathematics education is alarming. the pisa and timss studies pointed out that the students lacked logic and reasoning in completing test items that demand the competences of analysis, evaluation, and creation.

table 1. mathematics ranking of indonesian students by pisa and timss
year          pisa rank   pisa score   timss rank   timss score
1999 / 2000   39 of 41    367          34 of 38     403
2003          38 of 40    360          35 of 46     411
2006 / 2007   50 of 57    391          36 of 49     397
2009          61 of 65    371          -            -
2011 / 2012   64 of 64    375          38 of 42     386
2015          63 of 70    386          45 of 50     397
(sources: indonesia pisa center, 2013; mullis, martin, foy, & arora, 2012; mullis et al., 2015; oecd, 2014, 2016; scientific literacy, 2014)

the director of the national educational evaluation centre (neec), nizam (krisiandi, 2016), stated that indonesian students are good at answering questions of the memorization type, but poor at application and reasoning. school learning, from daily quizzes to school exams, has not sharpened students' abilities to reason.
nizam also mentioned that learning through the subject matters must be directed not only to knowledge but also to competences. in the 21st century, basic literacy (science, mathematics, reading, and technology) as well as the competencies of critical, creative, communicative, and collaborative thinking must be mastered. the neec researcher, rahmawati (krisiandi, 2016), also stated that students' competences in higher-order thinking are still weak; students must become accustomed to higher-order thinking test items. teachers are expected to develop test items which deal with higher-order thinking. this is not easy; yet, teachers need to familiarize themselves with higher-order test items, the kind of items used by timss and pisa. this is in agreement with the national curriculum 2013, which demands that learners be able to communicate and think critically and creatively. the study by kurniati et al. (2016, p. 154) is in line with the neec study by rahmawati, stating that the lack of higher-order thinking skills (hots) in students is caused by their inability to understand the subject-matter material and apply it in daily life.

the 2017 revision of the curriculum 2013 requires teachers to make a number of improvements. among others, teachers need to be creative in integrating literacy, the 21st-century 4c skills (creative, critical, communicative, and collaborative), and hots into their classroom instruction (pedia pendidikan, 2017). pohl (kurniati et al., 2016, p. 143) stated that the ability to involve analysis, evaluation, and creativity is a higher-order ability. according to brookhart (2010, p. 29), hots involves logic and reasoning, analysis, evaluation, creation, problem solving, and also judgment. further, hamdi, kartowagiran, and haryanto (2018, p.
1) stated that, at the third level, which is the high level, students' understanding is characterized by the ability to work with complex materials, such as mathematical thinking and reasoning, and by communicative, critical, creative, interpretative, reflective, generalizing, and mathematical skills.

the use of hots items in tests can train students to sharpen the abilities and skills that are in line with 21st-century demands. through hots-based test items, critical thinking skills (creative thinking and doing, creativity, and self-reliant learning) will be built through practice in solving various real problems of daily life (problem solving) (warisdiono et al., 2017, p. 18).

raising higher-order thinking skills has become a priority in school mathematics learning. students at the junior secondary level must be trained toward higher-order thinking in accordance with their age. the teacher can do this by giving test items of the hots type. for this, it is not enough for teachers merely to pick up material from packaged workbooks; they need to resort to more substantial materials.

the problem faced by teachers is that they have insufficient ability to develop test items of the hots type. at school, many teachers still use test items that tend to test students' memory rather than higher-order thinking skills. the test items are directed more to the lower-order thinking skills (lots) of memorization and understanding. on the other hand, what students need to face future demands is hots. the development of hots in students is expected to raise their ability in problem solving, elevate their self-confidence in mathematics, and improve their learning achievement (butkowski, et al., 1994 in budiman & jailani, 2014, p. 142).

a hots test item is given through a stimulus. a stimulus can be derived from recent global issues such as technology, information, science, education, health, and infrastructure.
a stimulus can also be drawn from the environment, such as local cultures. it is a fact, however, that test items in school books rarely involve cultural issues. meanwhile, peoples such as the japanese, chinese, koreans, and others have used cultural issues in their mathematics learning, which has made them far advanced in all fields. kurumeh stated that the success of the japanese and chinese in mathematics learning is due to their use of ethnomathematics (supriadi, arisetyawan, & tiurlina, 2016, p. 2).

various cultural products of the indonesian ancestors show artistic creativity that contains mathematical elements. the same is true of the cultural products of the lombok sasak tribes. one example is the shield from ende used in a traditional dance. it is made of thick buffalo leather with a two-dimensional geometric pattern. another product is the sasak house architecture with three-dimensional ornaments. besides, many traditional sasak clothes have geometrical pattern motifs and the traditional wedding ceremonies have statistical elements. one example is presented in figure 1.

figure 1. example of cultural products of lombok sasak ethnic (department of national education, 2000, p. 21)

according to acho, imako, and uloko (wulandari & puspadewi, 2016, p. 35), students' memory and learning achievement obtained through culture-based instruction are higher than those obtained through conventional teaching. besides, the study by nur and palobo (2017, p. 11) shows that contextual instruction using lombok local cultures as contexts has a positive and significant influence on students' problem-solving abilities in mathematics. in addition, curriculum 2013 requires teachers to be able to develop hots test items in line with local environments.
as such, stimuli for the test items will be attractive since they can be directly observed and accepted by students. besides, the use of local cultures in hots test items will increase students' sense of attachment to and ownership of the local potentials of their place. linking mathematics with cultures is expected to help students see the connection and application of mathematics not only to other disciplines of science, but also to real life.

the item format developed in the present study is the multiple-choice type. according to expert opinion and research results, tests in the multiple-choice format can be used to measure hots (budiman & jailani, 2014, p. 142). the procedure suggested for hots items is a set of items, each consisting of an input followed by answer options.

based on this rationale and the supporting data and evidence presented, there is a need to develop hots test instruments with local cultural contexts for junior secondary school mathematics to prepare students to face the 21st century. a valid and reliable test instrument can be used to train students in attaining hots, help teachers in testing students' hots, and become a reference source for the development of hots test items for other base competencies in the syllabus.

method

the study was development research. it applied seven steps: gathering initial information, planning, development of the first draft and expert validation, limited-scale try-out/readability, revision of the first draft, field try-out, and revision of the final product.

initial information gathering was related to the product to be developed. it was done through theoretical reviews covering needs analyses, reviews of the concepts and theories concerning hots and local cultures, and analyses of the core competencies (cc) and base competences (bc) of semester 2 of grade viii junior secondary mathematics in the curriculum 2013.
in the planning phase, the design of the product was outlined through the steps of defining, formulating the objectives, and designing the initial product. this consisted of formulating the product specification, determining the objectives, and constructing the table of specification for the hots test items using lombok cultures as the contexts.

the hots test items were constructed using hots indicators and also bc indicators. the hots indicators were synthesized from ennis (komalasari, 2013, p. 266); bayer, ellis, gokhale, cotton, and langrehr (thebooke, n.d.); torrance (lestari & yudhanegara, 2015, p. 89); and budiman and jailani (2014, p. 143). the indicators were (1) identifying and relating relevant information from a problem; (2) making an accurate conclusion based on the obtained information; (3) finding consistencies/inconsistencies in the product; (4) evaluating the product against determined criteria/standards; (5) synthesizing ideas/strategies for the problem solution; (6) applying the strategy for solving the problem; and (7) developing new alternatives for solving the problem.

in the development of the initial product, a first draft of the hots instrument was developed. this consisted of 20 multiple-choice test items. the draft was then subjected to validation by the expert team from the department of mathematics education. the objective of this assessment was to see whether or not the developed test was acceptable and feasible to use. another purpose was to obtain feedback for the improvement of the draft.

after being validated by the experts, the draft was subjected to an analysis of the results of the item validation. the data were in the form of the experts' scores for the test items. the analysis used aiken's v formula to calculate the content validity index of the test items.
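the content validity index named in the method can be sketched in code. the snippet below is a minimal illustration of aiken's v for a single item, v = Σ(r − lo) / [n(c − 1)], where r is a rater's score, lo is the lowest scale point, c is the number of scale categories, and n is the number of raters. the function name, the 1-4 rating scale, and the example ratings are assumptions made for illustration only; the study does not report the rating scale or the raw expert scores.

```python
from typing import Sequence


def aiken_v(ratings: Sequence[int], lo: int, hi: int) -> float:
    """Aiken's V for one item: sum(r - lo) / (n * (hi - lo)).

    ratings: one score per rater for the item
    lo, hi:  lowest and highest points of the rating scale (hypothetical here)
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < lo or r > hi for r in ratings):
        raise ValueError("rating outside the scale")
    n = len(ratings)
    return sum(r - lo for r in ratings) / (n * (hi - lo))


# Hypothetical scores from three raters on an assumed 1-4 scale.
print(round(aiken_v([4, 4, 4], lo=1, hi=4), 2))
print(round(aiken_v([3, 3, 3], lo=1, hi=4), 2))
print(round(aiken_v([2, 2, 2], lo=1, hi=4), 2))
```

v ranges from 0 (all raters give the lowest score) to 1 (all raters give the highest score), so an item with a low coefficient is a candidate for revision or deletion.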
the next step was trying out the draft in a limited-scale group. the validated and revised draft was tried out with a group of 15 junior secondary school students. the try-out was done to obtain information on how easily the testees could read the items, how attractive the test was, and how interested the testees were in the test. the results were converted into percentages, wherein ≥ 60% means positive.

the following phase was the revision of the first draft based on the results of the limited-scale try-out. after being revised, the product was subjected to the field try-out. the field try-out was conducted in two grade viii classes in two junior secondary schools. these schools were mts. muallimin nw pancor and mts. nw pancor. this try-out involved 75 students. the resulting data were analyzed empirically by way of classical test parameters.

the final step was the revision of the product. this was done on the second draft that was tried out in the two schools. an item was accepted as a final product if it fulfilled one of the following criteria: (1) the item satisfied all the requirements of difficulty level, discriminating power, and functioning distractors; or (2) an easy or difficult item was accepted if it had a discriminating power in the good/medium category and its distractors were functioning. the items that were accepted were then re-formatted to become final products verified as hots items.

findings and discussion

findings

higher-order thinking skills include critical thinking, creative thinking, and problem solving. problem solving, seen as the main skill in hots, is the skill of critically and effectively managing, combining, or developing information in the form of facts or ideas to solve a problem and make a decision, or to find a solution to a hard-to-handle situation. a hots item is one that requires the ability to apply higher-level thinking.
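returning to the item-analysis criteria described in the method, the classical test parameters and the acceptance rule can be illustrated with a short sketch. several choices below are assumptions and not taken from the study: the upper-lower 27% group formula for discriminating power (the study does not say which formula it used), the function names, and the reading of criterion (1) as "medium difficulty". the cut-offs (0.25 and 0.75 for difficulty, 0.20 for discriminating power) follow the categories reported in the findings, and the 5% threshold is the usual rule for a functioning distractor.

```python
from collections import Counter


def difficulty(item_scores):
    """Proportion of testees who answered the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)


def discrimination(item_scores, total_scores, fraction=0.27):
    """Upper-lower index: p(correct) in the top group minus the bottom group."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, round(fraction * len(order)))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


def distractors_function(responses, key, options="abcd", min_share=0.05):
    """True if every wrong option is chosen by at least min_share of testees."""
    counts = Counter(responses)
    n = len(responses)
    return all(counts[o] / n >= min_share for o in options if o != key)


def accept(p, dp, distractors_ok):
    """Acceptance rule as read from the method section (an interpretation)."""
    medium = 0.25 <= p <= 0.75
    dp_ok = dp >= 0.20  # medium category or better
    criterion_1 = medium and dp_ok and distractors_ok
    criterion_2 = (not medium) and dp_ok and distractors_ok
    return criterion_1 or criterion_2


# Hypothetical data for ten testees, sorted from highest to lowest total score.
totals = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
item = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
answers = list("aaaabbbccd")  # key is "a"

p = difficulty(item)
dp = discrimination(item, totals)
ok = distractors_function(answers, key="a")
print(p, dp, ok, accept(p, dp, ok))
```

under this reading the two criteria together reduce to "discriminating power of at least 0.20 and all distractors functioning"; they are kept separate in the code only to mirror the wording of the method.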
the item is presented using a stimulus. a stimulus can be drawn from global issues such as technology, information, science, education, health, and infrastructure. a stimulus can also be obtained from the environment, such as local cultures.

lombok is one of the islands in indonesia which retains various historical cultures in the forms of objects, non-objects, traditional habits, ethics, and arts. the diversity of these cultures can be used as stimuli and integrated into school learning processes, including mathematics. historical heritage, architecture, dances, musical instruments, and others contain mathematical elements. one example is the shield from ende used in a traditional dance. it is made of thick buffalo leather with a two-dimensional geometric pattern. another product is the sasak house architecture with three-dimensional ornaments. besides, many traditional sasak clothes have geometrical pattern motifs and the traditional wedding ceremonies have statistical elements.

hots items can be integrated with cultural elements. one example of the development of test items based on cultural elements, in this case those of lombok, can be seen in figure 2.

figure 2. example of hots item using the context of sasak culture

translation: gendang belek is a musical instrument specific to the sasak ethnic group in lombok. some gendang belek have the same form but different sizes, as can be seen in the pictures. the radius of kenceng is twice that of pencek. the radius of terumpang is three times that of pencek. if l1, l2, and l3 respectively denote the size of pencek, kenceng, and terumpang, which of the following statements is correct?
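a worked solution for this item may help show the reasoning it demands. the sketch below assumes that "size" refers to the area of the circular face of each drum and writes r for the radius of pencek; the answer options are not reproduced in the text, so only the relationship among l1, l2, and l3 is derived.

```latex
\begin{align*}
L_1 &= \pi r^2 \\
L_2 &= \pi (2r)^2 = 4\pi r^2 = 4L_1 \\
L_3 &= \pi (3r)^2 = 9\pi r^2 = 9L_1 \\
L_1 : L_2 : L_3 &= 1 : 4 : 9
\end{align*}
```

under this assumption, a statement consistent with l1 : l2 : l3 = 1 : 4 : 9 would be the correct one, which is why the item calls for a conclusion drawn from the given information and not only for recall of the circle area formula.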
figure 2 shows an example of a hots test item using a local cultural context of lombok with the hots indicator of critical thinking (making an accurate conclusion from the information of a situation/problem). to be able to answer the question, the testee needs to be able to recall and understand factual, conceptual, and procedural material about circles. then, by analyzing the situation (stimulus), the testee determines the strategy for solving the problem.

other than the example above, there are other items of cultural heritage that can be integrated into mathematics. the use of cultural elements in hots test items can raise students' sense of attachment to and ownership of the local potentials of their place. linking mathematics with cultures will also help students see the connection and application of mathematics not only to other disciplines of science but also to the real world.

instrument development

the product of the developmental study is a valid and reliable hots test instrument, using the local cultures of lombok as a context, consisting of multiple-choice test items for junior secondary school mathematics. the instrument development passes through two assessment phases. the first phase is to assess the validity of the instrument, conducted by three experts in mathematics education. the second involves a limited-scale try-out with 15 testees and a field try-out in two schools with 75 testees.

validation by experts is intended to examine the contents of the initial product and obtain feedback for revising the first draft. in the process, the experts are given the table of specification of the test, the test items, and the evaluation sheets. the data from the experts' evaluation are subjected to aiken's v formula to find the content validity coefficient. the results can be seen in table 2.

table 2. results of experts' validation
item number                                                             aiken's v coefficient   criteria
1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20      0.67-1.00               good to be used
6                                                                       0.33                    need revision/deletion

in table 2, it can be seen that, out of the 20 test items, 19 are feasible for use and one needs revision or deletion. in addition, a number of items need to be revised following the experts' feedback. the revisions concern the format of the writing, the completeness of the stimulus texts, clearer pictures, and suitability for the junior secondary school level.

the results of the readability check in the limited-scale try-out show that the majority of the students give positive responses towards the test, between 75% and 94%. this is strengthened by positive comments written by some students. sample statements can be seen in figure 3. meanwhile, the difficulty levels of the items can be seen in table 3.

figure 3. students' comments on the use of the test instrument
translation:
give short opinions about this culture-based hots test
i like the test that is given because i can know about sasak cultures more deeply
this hots test gives knowledge to me about lombok cultures or sasak ethnic

table 3. difficulty levels of the main product test items (tk = difficulty level)
category                           item number                                                  total
tk < 0.25 (difficult)              1, 3, 6, 17                                                  4
0.25 ≤ tk ≤ 0.75 (medium/enough)   2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18         14
tk ≥ 0.75 (easy)                   -                                                            0

table 3 shows that 14 items (77.78%) have a difficulty level in the medium category. meanwhile, table 4 shows that seven test items (38.89%) have a discriminating power in the medium category. the spread of the distractors of the main product test items can be seen in table 5.

table 4. discriminating power of the main product test items
category                         item number                  total
dp < 0.20 (poor)                 1, 6, 9, 14                  4
0.20 ≤ dp < 0.40 (medium)        4, 12, 13, 15, 16, 17, 18    7
0.40 ≤ dp < 0.70 (good)          2, 3, 5, 7, 10, 11           6
0.70 ≤ dp ≤ 1.00 (very good)     8                            1

table 5. effectiveness of distractors of the test items
category          item number                                                         total
functioning       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18       18
not functioning   -                                                                   -

in table 5, it is clear that the distractors of all of the test items are functioning; this means that each distractor is chosen by at least 5% of the testees. based on the results of the analyses of the item characteristics above, the number of items that are accepted and replaced/rejected can be seen in table 6. in table 6, a total of 15 items (83.33%) are accepted and 3 (16.67%) are rejected. the accepted items are then reformatted to become the final product test instrument of hots in terms of the test validity.

table 6. results of analyses of item characteristics
category   item number                                             total   percentage
accepted   2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18     15      83.33%
rejected   1, 6, 14                                                3       16.67%

revision of final product

the final product revision is conducted to obtain a test instrument that is valid and reliable. revision is done by looking at the results of evaluation in the two product try-outs. the revision involves the experts' validation, the limited-scale try-out, and the field try-out. the experts' validation and the product try-outs are used as the main considerations for revision.

first, item revision is based on the experts' inputs and suggestions. in general, these concern the format of the writing, the completeness of the stimulus texts, clearer pictures, and suitability for the junior secondary school level. figures 4, 5, and 6 show items that are good after revision and items that are rejected.

figure 4. good item after revision
bumbungan is a traditional house of the sasak ethnic group in lombok.
bumbungan has a steep roof, made from hay with a thickness of about 15 cm. the roof is intentionally left to extend down to the bottom of the wall and almost covers the wall, as in the picture below. the roof bottom is 5.2 in length and the top is 5/13 of the roof bottom. the height of the roof is 3/2 of the roof top. the circumference of the roof of the house is …
a. 12.6 m
b. 13.4 m
c. 13.6 m
d. 14.0 m

the item in figure 5 is rejected because the item is not well-formulated and the picture is meaningless (the notes are not clear). meanwhile, the item in figure 6 is deleted because it is considered too difficult for junior secondary school age.

figure 5. poor item that is deleted (1)
figure 6. poor items that are deleted (2)

second, item revision after the limited-scale try-out is based on the results of the analyses of the item characteristics. most of the revision deals with discriminating powers and non-functioning distractors.

third, item revision after the field try-out is done in the same way. all the test items are then checked against the hots indicators to make sure that all indicators are represented. after being verified, all items are reformatted to become the final product of the study.

the field try-out involves 75 students, consisting of 24 from mts. muallimin nw pancor and 51 from mts. nw pancor. in general, the achievement of the students who took part in the study can be seen in figure 7.

figure 7. student achievement profile in mathematics learning viewed from the completed results of the test instrument

based on figure 7, it is clear that the average score of the students of mts. muallimin nw pancor is higher than that of mts. nw pancor. moreover, a total of 11 students of mts. muallimin nw pancor have scores above the average and 11 have scores below the average. for mts.
nw pancor, 27 students are above the average score and 24 students are below.

[item text recovered from the figures of the rejected items]
poor item to be deleted: rudat dance. the lombok-specific rudat dance is usually performed to welcome guests and involves 10 dancers. the distance between the front-most dancer and the rear-most dancer is … units.
a. 3.18
b. 3.56
c. 6.36
d. 10
poor item to be deleted: begasingan, a lombok traditional game. in the picture, an arc is drawn with center point p so that it crosses the line at point q. then, with the same radius, an arc is drawn with center q so that it crosses the first arc at point r. from the points p, q, and r, angle prq is formed. the size of angle prq is …
a. 30°
b. 45°
c. 60°
d. 75°
[figure 7 data: average score of 34.49 for mts. muallimin nw pancor and 25.60 for mts. nw pancor]

discussion

the product of the study is a valid and reliable hots test instrument using lombok cultures as contexts. it is a fact that, up to the present time, no effort has been made to provide evidence of the validity and reliability of such a test. the development of the instrument begins with the review of hots which, according to brookhart (2010, p. 29), consists of the abilities of logic and reasoning, analysis, evaluation, creation, problem solving, and also decision making (judgment). it is followed by formulating the item indicators and writing the test items. then, the test items are subjected to content validation through expert judgment. this is followed by the aiken analyses. before being administered in the field try-out, the items are subjected to a limited-scale try-out for readability. the field try-out involves 75 students from two schools.
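the reliability estimate used in this procedure, cronbach's alpha, can be sketched as follows. this is a minimal illustration under stated assumptions: the function name and the small 0/1 score matrix are hypothetical, and the study's raw response data are not available in the text. for dichotomously scored multiple-choice items, alpha computed this way coincides with the kr-20 coefficient.

```python
from statistics import pvariance


def cronbach_alpha(score_matrix):
    """Cronbach's alpha; rows are testees, columns are item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(score_matrix[0])
    if k < 2:
        raise ValueError("at least two items are required")
    item_vars = [pvariance([row[j] for row in score_matrix]) for j in range(k)]
    total_var = pvariance([sum(row) for row in score_matrix])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1 scores of four testees on three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(scores), 2))
```

population variances are used for both the items and the totals; using sample variances for both gives the same ratio and therefore the same coefficient.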
finally, item analyses and reliability estimation are conducted. the test instrument development has been conducted following the standard procedure, and the test is found to be valid and reliable.

the test items developed in the study are of the multiple-choice type. according to expert opinion and research results, a multiple-choice test can be used to measure hots (budiman & jailani, 2014, p. 142). it is suggested that hots test items in this format consist of an introduction followed by response options.

conclusion and suggestions

based on the research findings and discussion, a conclusion is drawn as follows. the final product of the study is a hots test instrument using lombok local cultures as contexts for junior secondary school mathematics, consisting of 15 multiple-choice test items with four options. the validity of the test is indicated by the experts' judgment showing that the test is good to be used in the aspects of contents, format, and language. based on classical test theory, the instrument fulfills the requirement for reliability, shown by a reliability coefficient of 0.79 (good category), with an average difficulty level of 0.28 (medium category), a discriminating power of 0.31 (good category), and functioning distractors.

based on the conclusion of the study, it is suggested that further research be conducted by analyzing the test items using irt as the more modern method. this is expected to make it possible to compare the item difficulty levels and the testees' abilities across time and location.

references

brookhart, s. m. (2010). how to assess higher-order thinking skills in your classroom. alexandria: ascd.
budiman, a., & jailani. (2014). pengembangan instrumen asesmen higher order thinking skill (hots) pada mata pelajaran matematika smp kelas viii semester 1. jurnal riset pendidikan matematika, 1(2), 139–150. https://doi.org/10.21831/jrpm.v1i2.2671
department of national education. (2000). kain songket lombok. nusa tenggara barat: kantor wilayah provinsi nusa tenggara barat bagian proyek pembinaan permuseuman.
hamdi, s., kartowagiran, b., & haryanto, h. (2018). developing a testlet model for mathematics at elementary level. international journal of instruction, 11(3), 375–390. https://doi.org/10.12973/iji.2018.11326a
indonesia pisa center. (2013). ranking indonesia dalam pisa (2000–2012). retrieved february 11, 2018, from www.indonesiapisacenter.com/2013/08/ranking-indonesia-dalam-pisa-20002012.html
komalasari, k. (2013). pembelajaran kontekstual: konsep dan aplikasi. bandung: pt rafika aditama.
krisiandi. (2016, december 15). daya imajinasi siswa lemah. kompas, p. 11. retrieved from https://nasional.kompas.com/read/2016/12/15/23091361/daya.imajinasi.siswa.lemah
kurniati, d., harimukti, r., & jamil, n. a. (2016). kemampuan berpikir tingkat tinggi siswa smp di kabupaten jember dalam menyelesaikan soal berstandar pisa. jurnal penelitian dan evaluasi pendidikan, 20(2), 142–155. https://doi.org/10.21831/pep.v20i2.8058
lestari, k. e., & yudhanegara, m. r. (2015). penelitian pendidikan matematika. bandung: pt rafika aditama.
mullis, i. v. s., martin, m. o., foy, p., & arora, a. (2012). timss 2011 international result in mathematics. chestnut hill, ma: timss & pirls international study center.
mullis, i. v. s., martin, m. o., foy, p., & arora, a. (2015). timss 2015 international result in mathematics. chestnut hill, ma: timss & pirls international study center.
nur, a. s., & palobo, m. (2017). pengaruh penerapan pendekatan kontekstual berbasis budaya lokal terhadap kemampuan pemecahan masalah matematika. aksioma: jurnal pendidikan matematika, 6(1), 1–14.
oecd. (2014). pisa 2012 result in focus: what 15 year olds know and what they can do with what they know. paris: oecd publishing.
oecd. (2016). pisa 2015 result in focus. paris: oecd publishing.
pedia pendidikan. (2017). penjelasan singkat perbedaan rpp k13 edisi revisi 2017 dengan rpp k13 revisi 2016. retrieved february 8, 2018, from http://www.pediapendidikan.com/2017/05/rppk13-revisi-2017.html
rahmawati, s. (2016, december 14). seminar hasil penilaian pendidikan. seminar hasil timss 2015. retrieved from puspendik.kemendikbud.go.id/seminar/index.php?folder=hasil seminar puspendik20 2016
scientific literacy. (2014, october 24). survei international timss (trends in international mathematics and science study). retrieved february 11, 2018, from literacyofscientific.blogspot.co.id/2014/10/surveiinternasioanl-timms-trends-in.html
supriadi, s., arisetyawan, a., & tiurlina, t. (2016). mengintegrasikan pembelajaran matematika berbasis budaya banten pada pendirian sd laboratorium upi kampus serang. mimbar sekolah dasar, 3(1), 1–18. https://doi.org/10.17509/mimbar-sd.v3i1.2510
thebooke. (n.d.). kemampuan berpikir kritis dan kreatif. retrieved march 15, 2018, from http://thebooke.net/do/downloadgratis-buku-berpikir-kritis
warisdiono, et al. (2017). modul penyusunan higher order thinking skill (hots). jakarta: direktorat pembinaan sma, direktorat jenderal pendidikan dasar dan menengah departemen pendidikan dan kebudayaan.
wulandari, i. g. a. p. a., & puspadewi, k. r. (2016). budaya dan implikasinya terhadap pembelajaran matematika yang kreatif. jurnal santiaji pendidikan, 6(1), 31–37.

copyright © 2018, reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 4(2), 2018, 144-154
available online at: http://journal.uny.ac.id/index.php/reid

mapping of physics problem-solving skills of senior high school students using physpross-cat

*1edi istiyono; 2wipsar sunu brams dwandaru; 3revnika faizah
1department of educational research and evaluation, universitas negeri yogyakarta
jl. colombo no.
1, depok, sleman, yogyakarta 55281, indonesia
2,3department of physics education, universitas negeri yogyakarta jl. colombo no. 1, depok, sleman, yogyakarta 55281, indonesia
*corresponding author. e-mail: edi_istiyono@uny.ac.id
submitted: 04 december 2018 | revised: 19 december 2018 | accepted: 20 december 2018

abstract

evaluation using computerized adaptive tests (cat) is an alternative to paper-based tests (pbt). this study aimed to map physics problem-solving skills using physpross-cat on the basis of item response theory (irt). the study was conducted in sleman regency, yogyakarta, involving 156 grade xi senior high school students selected by a stratified random sampling technique. the results of the study show that physpross-cat is able to accurately measure physics problem-solving skills. students’ competences in physics problem solving are mapped as 6% in the very high category, 4% in the high category, 36% in the medium category, 36% in the low category, and 18% in the very low category. this shows that the majority of the students’ competences in physics problem solving lie within the medium and low categories.

keywords: assessment, problem-solving skill, cat

introduction

one of the 21st-century learning and innovation skills is the ability related to critical thinking, problem solving, technology, and information (daryanto & karim, 2017). technology is an integral aspect of the development of a nation. the more advanced the culture of a nation, the more varied and complicated the technology that is used. problem solving is a cognitive process directed to the attainment of an objective when there is a solution method to solve a problem (bueno, 2014). physics learning relies heavily on problem-solving skills; it is, therefore, necessary to have an evaluation as one of the efforts in elevating the learners’ thinking skills. nitko and brookhart (2011, p.
3) define evaluation as a process to obtain information for making decisions concerning the learners, curriculum, program, school, and educational policy. evaluation instruments used in learning cover tests and non-tests (nitko & brookhart, 2011). test-type instruments can be further grouped into objective tests and non-objective tests. objective tests can be in the form of multiple-choice, short answers, matching, and objective essays. non-objective tests can be open essays, work performance or observation, and portfolios or project tasks (mundilarto, 2010, p. 52). multiple-choice test items can be used to assess more complex learning outcomes concerned with recall, understanding, application, analysis, synthesis, and evaluation (arifin, 2016, p. 138).

the test can be administered in two modes: paper-pencil and computer-based test (cbt). the paper-pencil test is the paper-based test (pbt) that has long been in use, while cbt is computer-based (pakpahan, 2016, p. 24). pbt is based on the assumption that learners with the same level of age and education have the same level of competences. in reality, there is, however, a significant variation (bagus, 2012, pp. 45–46). the pbt model has many shortcomings, especially related to deviating behaviors, such as fraud, discussions, sharing of answer keys, or even teachers or schools giving out answer keys so that society does not regard them as failing in the running of education and learning (balan, sudarmin, & kustiono, 2017, p. 37). further, retnawati (2014, p. 190) states that indonesia is a big archipelago consisting of tens of provinces.
as such, distribution of test packages from the centre to the regions faces many obstacles, for example, during the national examination (ne). this causes, among others, test administration not to be impartial and test results not to be valid, in that they do not represent the real competences of the students.

these limitations of pbt can be overcome by testing using the computer. computer-based testing has some advantages: testees do not need to wait for weeks to receive their scores, since scores can be obtained immediately. cbt also provides the facility for giving each testee test items that are pre-arranged to give the testee the freedom to select the next test item (miller, linn, & gronlund, 2009, p. 12). according to luecht and sireci (2011), the cbt model can be categorized into: (1) computerized fixed tests (cft); (2) linear-on-the-fly tests (loft); (3) computerized adaptive tests (cat); (4) stratified computerized adaptive tests (as); (5) content-constrained cat with shadow tests; (6) test-based cat and multistage computerized mastery tests (combined); and (7) computer-adaptive multistage tests. each model has its own advantages and disadvantages. cbt gives more advantages than pbt does in that, among others, its scoring system is automatic and it reduces the burden on the testees (riley & carle, 2012). however, cbt is similar to pbt in that it may not be able to measure the testees’ abilities accurately since there is still a potential for fraud in its administration. cbt also makes the testees respond to all of the items, which is an inefficient use of time.

there are two theories in assessment that have been empirically and technologically developed: classical test theory (ctt) and item response theory (irt). ctt and irt represent two different frameworks of assessment. in the view of ctt, scoring of a test is done partially, using the steps that need to be taken in answering a test item correctly.
scoring is conducted step by step, each testee’s item score is obtained by summing up the score in each step, and achievement is estimated from raw scores. this scoring model may not be appropriate since the difficulty level of each step is not taken into consideration (istiyono, mardapi, & suparno, 2014, p. 4). at the item level, the ctt model is relatively simple; ctt does not demand a complex theoretical model to relate a testee’s success in responding to a test item. instead, ctt collectively considers a group of testees for a particular item.

irt has been developed as an important complement to ctt in the design, interpretation, and evaluation of a test or examination. irt has a strong mathematical basis and relies on complex algorithms that are calculated more efficiently on the computer (adedoyin, 2010, p. 108). irt supports the use of the computer in educational testing. irt can be used to provide any item saved in the computer independently, so that the computer can select a test from item banks, manage the procedure of item administration, or design a model for a new computer-based item-response test (masters & keeves, 1999, p. 139; van der linden & glas, 2003). thus, a test which uses cat is highly compatible with item response theory (irt).

hambleton, swaminathan, and rogers (1991, p. 9) propose three assumptions underlying item response theory: (1) the chance of answering an item correctly does not depend on that for another item (local independence), (2) an item measures one competence dimension (unidimensionality), and (3) the response pattern of each item can be represented by an item characteristic curve. these three assumptions address the weaknesses of the classical theory. hambleton et al. (1991) identify four limitations of the classical theory.
first, item statistics such as difficulty levels and discriminating powers are restricted by the specific observed samples that are obtained; i.e. they depend on the group and test. second, reliability is defined by parallel-test concepts, which are difficult to realize in practice. this is due to the fact that individuals can never be the same in the second test since they may forget, earn new competences, or have different motivation and anxiety levels. third, standard errors of measurement are assumed to be the same for all subjects, and variabilities in errors are not considered. fourth, the classical theory focuses on test-level information and puts item-level information aside. test-level information is additive, that is, it is the sum of the information across the items, whereas item-level information is the information for a particular item only. these limitations show that the classical theory deals with individual score totals and not with each testee’s competences at the individual level.

a cat is based on item response theory. hambleton and swaminathan (1985, p. 48) state that there are three types of scoring systems: dichotomous, polytomous, and continuous. of the three, the dichotomous system is the most used in educational evaluation. the models that can be used for dichotomous data are latent linearity, perfect scale, latent distance, one-, two-, and three-parameter normal ogive, one-, two-, and three-parameter logistic, and four-parameter logistic (barton & lord, 1981; guttman, 1944; lazarsfeld & henry, 1968; lord, 1952). the dichotomous model is only suitable for items with two-category scores such as true/false. for items with more than two score categories, the polytomous system is used.
the polytomous scoring system has a number of models, such as nominal response, graded response, partial credit model, and others (bock, 1972; masters, 1982; samejima, 1969). the partial credit model (pcm) has been developed in order to analyze test items which require multiple-step responses, wherein the items follow the partial credit model patterns so that individuals with higher competences will score higher than those with lower competences (istiyono, 2017, p. 2). therefore, it is reasonable that the partial credit model is used for multiple-choice tests.

a cat is based on the principle that items must be selected so that they measure the testees’ competences. generally, an item is selected because it gives the most information for estimating the testee’s competence. then, based on the true/false response pattern, the competence level is re-estimated and the next item is selected on the basis of the newly estimated competence. these processes are continued until a certain precision of the testee’s estimated competence is reached (hambleton & zaal, 1991).

based on the discussion of these facts, there is a need to develop a test that will measure the testees’ competences in problem solving. the computerized adaptive test (cat) has been developed as a cbt alternative to pbt that provides better test items and shorter tests in accordance with each testee. cat is a testing system which is more advanced than cbt (hadi, 2013, p. 12). in accordance with suyoso, istiyono, and subroto (2017), computer-based evaluation is increasingly needed and can help teachers in conducting evaluation in their subject-matter teaching. in the 21st century, more emphasis is placed on the higher-order thinking cognitive domain, such as hots bloomian, hots marzonian, critical thinking, creative thinking, and problem solving (brookhart, 2010; heong et al., 2011; schraw & robinson, 2011).
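the adaptive cycle described above (select the most informative item, re-estimate the competence, stop at a target precision) can be sketched in a few lines of python. this is only an illustrative sketch under stated assumptions, not the physpross-cat implementation: it uses the dichotomous rasch (1-pl) model rather than the pcm, an expected a posteriori (eap) estimate with a standard normal prior, and hypothetical names (`run_cat`, `se_target`, `max_items`) and stopping values that do not come from the study.

```python
import math

def p_correct(theta, b):
    """rasch (1-pl) probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """fisher information of a rasch item at ability theta: p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

GRID = [g / 10.0 for g in range(-40, 41)]  # ability grid from -4.0 to 4.0

def estimate_ability(responses):
    """eap estimate and posterior sd from (difficulty, score) pairs, n(0, 1) prior (assumed)."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised standard normal prior
        for b, score in responses:
            p = p_correct(theta, b)
            w *= p if score == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.4, max_items=20):
    """administer items adaptively until the precision target or the item limit is reached."""
    theta, se = 0.0, float("inf")
    responses, unused = [], list(bank)
    while unused and len(responses) < max_items and se > se_target:
        # pick the unused item that is most informative at the current estimate
        b = max(unused, key=lambda d: item_information(theta, d))
        unused.remove(b)
        responses.append((b, answer(b)))
        theta, se = estimate_ability(responses)  # re-estimate after each response
    return theta, se, len(responses)

# hypothetical item bank and a hypothetical testee who solves every item easier than 1.0
bank = [-3.0 + 0.25 * i for i in range(25)]
theta, se, n_items = run_cat(bank, lambda b: 1 if b < 1.0 else 0)
```

the eap estimate is used here instead of maximum likelihood because it stays finite when a testee has answered every item so far correctly (or incorrectly), which is common in the first steps of an adaptive test.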
in cbt, testees interact directly with the computer containing the test items of the subject matter. they work on answering test items through the computer as they do in pbt through writing. the number of items is the same as that in pbt, and item characteristics do not function as they do in cat (pakpahan, 2016, pp. 26–27). the use of cat does not require items in a great number since the computer is able to give the items in accordance with the testees’ competence levels. on the contrary, pbt, which is developed by classical theories, needs items in a great number since it needs to measure the testees’ optimum competences repeatedly (gregory, 2014). according to weiss (2004, p. 82), cat is a viable technology that has the potential to give better assessment, in less testing time, for various applications in counseling and education. in these two fields, there are needs to measure individuals’ changes. there are many varieties of evaluation applications, and one that is able to make use of the advantages of good and efficient assessment is that which applies cat technologies.

method

the study was conducted in state senior high schools in sleman regency, yogyakarta province, during the even semester of the 2017-2018 academic year. the subjects of the study were 156 students of the physics department selected by a stratified random sampling technique taking the higher, medium, and lower groups into consideration based on the students’ scores of the national examination in physics. the sample size was determined from the population using the requirement of the 1-pl model, which calls for 150 to 250 students (linacre, 2006). data collection was conducted by a test that was used to map students’ competences in problem solving in the field of physics.
the research participants were asked to take the physpross-cat test, which was the product of this research and development. the physpross-cat consists of items developed in the form of multiple-choice items with reasons. the material covers equilibrium of rigid bodies, elasticity and hooke’s law, static fluids, dynamic fluids, and temperature and heat. the development of the instrument was based on the revised curriculum 2013 and on the aspects and sub-aspects of problem-solving skills (ministry of education and culture, 2013). the aspects included identification, planning, implementation, and evaluation. the sub-aspects included identifying, differentiating, planning, formulating, sequencing, connecting, applying, checking, and criticizing. the test was developed into four sets of test items, 180 in total with nine anchor items. the test items had the characteristics that fulfilled the requirements for testing. these requirements were as follows: (a) based on the results of the content validation by the evaluation experts, the test was content-wise valid with an aiken’s v value of 0.97; (b) based on the empirical evidence, the test fit the partial credit model (pcm) for polytomous data with four categories, with an infit mnsq mean and standard deviation of 1.00±0.25; (c) based on the cronbach alpha reliability estimate, all items were regarded as reliable with a coefficient of 0.93; (d) based on the levels of difficulty, the test was regarded as good with a range of 1.23 to 1.50; and (e) based on the information function and sem, the test was stated to be able to estimate competences in the range between -2 and 1.6. the scoring of the test used the partial credit model (pcm), which is a development of the 1-pl model and belongs to the rasch family. meanwhile, the results of the physics problem-solving test using the computerized adaptive test (cat) were categorized into levels adapted from azwar (2010).
the categories are shown in table 1.

table 1. intervals of students’ problem-solving skills

no  skill interval                       level
1   mi + 1.5sbi < θ                      very high
2   mi + 0.5sbi < θ ≤ mi + 1.5sbi        high
3   mi – 0.5sbi < θ ≤ mi + 0.5sbi        medium
4   mi – 1.5sbi < θ ≤ mi – 0.5sbi        low
5   θ ≤ mi – 1.5sbi                      very low

x > xi + 1.8 sbi                         very good
xi + 0.6 sbi < x ≤ xi + 1.8 sbi          good
xi – 0.6 sbi < x ≤ xi + 0.6 sbi          moderate
xi – 1.8 sbi < x ≤ xi – 0.6 sbi          poor
x ≤ xi – 1.8 sbi                         very poor

notes:
xi (ideal mean) = ½ (ideal maximum score + ideal minimum score)
sbi (ideal standard deviation) = 1/6 (ideal maximum score – ideal minimum score)
x = empirical score

the model testing was conducted by implementing several indicators. the first indicator was the kaiser-meyer-olkin (kmo) measure; if the kmo was bigger than 0.5, then the data could be analyzed further (ghozali, 2009, p.394). the second indicator was the factor load, which shows how strongly each item loads on its factor. the criteria implemented, as proposed by tabachnick and fidell (1983), classified a factor load bigger than 0.71 as excellent, 0.63 as very good, 0.55 as good, 0.45 as moderate, and 0.32 as poor. for the item load, the benchmark was 0.55; if the item load was bigger than 0.55, then the item could be used. the third indicator was the eigenvalue, which should be bigger than 1 (one); this value was the overall representation of the relevance among the factors that were selected as the indicators. if the cumulative percentage was bigger than 50%, the researcher might state that the factor selection was appropriate.

the data gathering was conducted by means of a document checklist and questionnaire. the subjects in the first experiment were 41 respondents from the qa team members, teachers, employees, school committee members, and the students. these respondents came from three schools. the second experiment involved 120 respondents from five schools. last but not least, the operational/implementation experiment involved 258 respondents from 10 schools.
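the five-category conversion based on the ideal mean (xi) and the ideal standard deviation (sbi) in the notes above can be illustrated with a short python sketch. the function names and the 1 to 5 score range in the usage lines are assumptions made for illustration only; the source does not state the ideal minimum and maximum scores that were used.

```python
def ideal_stats(min_score, max_score):
    """return (xi, sbi): the ideal mean and ideal standard deviation of a score range."""
    xi = 0.5 * (max_score + min_score)
    sbi = (max_score - min_score) / 6.0
    return xi, sbi

def categorize(x, min_score, max_score):
    """map an empirical score x onto the five categories defined by xi and sbi."""
    xi, sbi = ideal_stats(min_score, max_score)
    if x > xi + 1.8 * sbi:
        return "very good"
    if x > xi + 0.6 * sbi:
        return "good"
    if x > xi - 0.6 * sbi:
        return "moderate"
    if x > xi - 1.8 * sbi:
        return "poor"
    return "very poor"

# hypothetical 1-to-5 rating scale: xi = 3.0, sbi = 4/6
labels = [categorize(x, 1, 5) for x in (4.5, 4.0, 3.0, 2.0, 1.5)]
```

checking the thresholds from the top down means each score needs only one comparison per category, and the closed upper bound of every interval (x ≤ ...) is handled by falling through to the next test.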
research and evaluation in education
an evaluation model of educational quality assurance... 199
sugiyanta, soenarto

findings and discussion

results of product testing

based on the theoretical studies and the preliminary studies conducted at three junior high schools, there are 12 essential indicators in the implementation of the internal qa system at the schools. this result was then consulted with experts and educational practitioners in a focus group discussion (fgd). at the preliminary stage, the fgd was conducted in order to obtain feedback on both the evaluation procedures and the preliminary design of the evaluation model, covering the constructs of qa evaluation, instrument form, data source, data gathering method, and evaluation procedures. from the results of the assessment of the indicators, 12 indicators were found to be in the very important category and four indicators in the important category for the implementation of educational qa. on the other hand, for the aspect of qa performance, 13 indicators were found to be in the very important category and six indicators in the important category. the draft of the model was then tested in stages.

first experiment

the first experiment was conducted in order to gain feedback from the practitioners and the users of the qa evaluation instrument at schools regarding the feasibility of the evaluation model. the educational qa evaluation instrument was the result of revising the preliminary draft following the preliminary review and the fgd, and it was validated by the experts. the first experiment was conducted at three schools, namely state junior high school 1 banguntapan, state junior high school 3 tempel, and state junior high school 1 berbah. the respondents in the first experiment were the school development team (tim pengembang sekolah, tps) or the qa team, consisting of the principal, teachers, employees, school committee members, students, and students’ parents.
quantitatively, from the assessment of the model feasibility, the instrument, and the evaluation manual in the first experiment, it was found that: (1) the mean score of the evaluation model assessment was equal to 3.6 (low category); (2) the mean score of the instrument clarity assessment was equal to 3.653 (low category); and (3) the mean score of self-evaluation assessment was equal to 3.615 (low category). based on these results, the elements in each assessment were improved and the second experiment was conducted.

second experiment

the second experiment was the main field experiment whose objective was to gain feedback from the wider field, especially in relation to the evaluation model and evaluation instrument. the analysis of the fit of the measurement model was conducted on two evaluation instruments, namely: (1) the qa system implementation instrument and (2) the qa performance instrument. the analysis was conducted by implementing the exploratory factor analysis (efa) technique with the assistance of the spss 17.0 program.

based on the results of the analysis of the qa system implementation instrument, it was found that the kmo score was equal to 0.551 at the significance level of 0.000. the kmo score was bigger than the required score (0.50), which implied that the data could be analyzed further. then, based on the multivariate correlation test with bartlett, it was found that sig 0.000 was smaller than the alpha 0.05; as a result, the researcher could conclude that there was correlation among the variables. in other words, the data in the second experiment were feasible for further analysis.
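the kmo measure reported above compares the squared correlations between variables with their squared partial (anti-image) correlations. the study obtained it from spss 17.0; the following pure-python sketch only illustrates the standard formula, with hypothetical function names and a small made-up correlation matrix.

```python
def invert(matrix):
    """invert a square matrix by gauss-jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def kmo(corr):
    """overall kaiser-meyer-olkin measure of sampling adequacy for a correlation matrix."""
    n = len(corr)
    inv = invert(corr)
    sum_r2 = 0.0  # squared off-diagonal correlations
    sum_p2 = 0.0  # squared off-diagonal partial correlations
    for i in range(n):
        for j in range(n):
            if i != j:
                partial = -inv[i][j] / (inv[i][i] * inv[j][j]) ** 0.5
                sum_r2 += corr[i][j] ** 2
                sum_p2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_p2)

# made-up example: three variables that all correlate 0.5 with each other
example = [[1.0, 0.5, 0.5],
           [0.5, 1.0, 0.5],
           [0.5, 0.5, 1.0]]
kmo_value = kmo(example)  # 9/13, about 0.69, which is above the 0.50 threshold
```

a value close to 1 means the partial correlations are small relative to the raw correlations, so the variables share enough common variance for factor analysis; values below 0.50 indicate the opposite.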
the correlation among the variables, based on the coefficient of the measure of sampling adequacy (msa) in the anti-image correlation, showed that almost all items in all variables were bigger than 0.5; as a result, the variables could be predicted and could be analyzed further (santoso, 2014, p.69). the amount of variance of the variables that could be explained by the factors that were designed could be seen from the communality value. from the results of the analysis, it was found that the 40 items of the qa system implementation instrument had communality scores that were bigger than 0.50. in other words, the variables within the evaluation model could be explained by the factors that were designed. the total cumulative variance from the results of the analysis was 77.514%, which implied that the nine factors that were designed, with their various item distributions, explained 77.514% of the variance of the variables in the study.

research and evaluation in education 200 − reid, 2(2), december 2016

from the results of the testing of the qa performance instrument, it was found that the kmo score was equal to 0.860 at the significance level of 0.000 and, therefore, the data could be analyzed further. the results of the multivariate correlation test with bartlett showed that the significance value was equal to 0.000, which was smaller than the alpha of 0.05. therefore, the researcher could conclude that there was a correlation among the variables. the results of the correlation among the variables showed that the coefficient of the measure of sampling adequacy (msa) in the anti-image correlation in almost all items within all variables was bigger than 0.5; therefore, the researcher could conclude that these variables could be predicted and could be analyzed further. the amount of variance of the variables that could be explained by the nine factors that were designed could be seen from the communality value.
from the 55 items, there were 50 items (90.90%) of the qa performance instrument whose communality value was bigger than 0.50. the finding implied that the variables within the evaluation model could be explained by the factors that were designed. the total cumulative variance from the results of the analysis was equal to 87.995%, which implied that the factors that were designed explained 87.995% of the variance of the variables in the study.

operational experiment

the operational experiment was conducted after the researcher improved the items that were not relevant to the content and the factors that were designed; as a result, the items were grouped into four factors in accordance with the theoretical model. in this stage, the experiment involved subjects on a wider scale, namely 10 schools located in four counties and one municipality of the province of yogyakarta special region. twelve respondents were selected from each school, and the respondents consisted of principals, teachers, employees, students, and parents/foster parents. overall, the third experiment involved 128 respondents consisting of principals, school qa team members, teachers, and employees, and 130 respondents consisting of school committee members and parents in order to gain additional information.

based on the results of the experiment, it was found that the kmo score was bigger than the required kmo score (0.50). such a coefficient value belonged to the meritorious or beneficial category so that the data could be analyzed further. in addition, based on the multivariate correlation test with bartlett, it was apparent that sig 0.000 was smaller than the alpha 0.05. thereby, it can be concluded that there was a correlation among the variables. as a result, the data were feasible for further analysis. in addition, the communality value showed the amount of variance of the variables that could be explained by the four existing factors.
the correlation among the variables, based on the coefficient of the measure of sampling adequacy (msa) in the anti-image correlation, in all items from all variables, was bigger than 0.5 and, therefore, it can be concluded that these variables could be predicted and be analyzed further. the amount of variance of the variables that could be explained by the four factors was apparent from the communality value. based on the results of the analysis, there were nine items whose communality value was under 0.50, namely: the planning factor, items 3, 14 and 15; the implementation factor, items 18 and 21; the monitoring factor, items 31, 32 and 37; and the action factor, item 47. meanwhile, the communality value of the other 39 items was bigger than 0.50. the cumulative percentage of the analysis results for the four factors was quite good, namely 56.526%. the cumulative percentage showed that the four factors in the qa evaluation model explained about 56.526% of the variance in the instrument. the percentage met the requirement proposed by tabachnick and fidell (1983) that if the cumulative percentage was bigger than 50%, then factor selection would be appropriate. the four existing factors had an eigenvalue > 1, which showed that the selected factors could be used as the indicators of a characteristic or a trait. thereby, it can be concluded that there were four factors in the constructs of the educational qa system implementation instrument, namely planning, implementation, monitoring and evaluation, and the action of improvement, and that these factors could be explained by the variables that were observed within the study.

the size of the item load on the factor was shown by the factor load of each variable in the component matrix table.
based on the results of the component matrix rotation, there were still six items whose factor load was under 0.55. based on these results, it can be concluded that 42 out of 48 items (87.50%) of the instrument had a factor load which was bigger than 0.55. the finding implied that the item load on the factor was good so that in general the items within the instrument were valid and could be used. on the other hand, the items that were not valid were omitted and were not put into the final product of the evaluation instrument. the above results of the analysis showed that the qa system implementation instrument had four factors, namely planning, implementation, monitoring and evaluation, and the action of improvement; this was proven by the factor load of each factor variable, which was bigger than 0.50. the results of the rotated component matrix also proved that the instrument items were concentrated according to the factors that were hypothesized theoretically. the reliability of the qa system implementation instrument, in terms of the reliability coefficient value, increased from the first experiment to the operational experiment.

table 2. the recapitulation of the reliability coefficient for the qa system implementation instrument

no.  name of factor               coefficient of cronbach alpha
                                  first stage test    final test
1    planning                     0.937               0.940
2    implementation               0.899               0.906
3    monitoring and evaluation    0.927               0.930
4    action of improvement        0.861               0.872

the testing of the constructs of the qa performance instrument consisted of a latent variable with seven observed variables. the statement was supported by the results of factor analysis. the results of the analysis showed that the kmo score of the qa performance variables was equal to 0.785. the score was bigger than the required score (0.50) and, as a result, the analysis belonged to the meritorious or beneficial category. based on the score, the data on the qa performance could be analyzed further.
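the cronbach alpha coefficients recapitulated in table 2 follow the usual formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items in a factor. a minimal python sketch is given below; the function name and the small score matrix are made up for illustration and are not data from the study.

```python
from statistics import variance

def cronbach_alpha(scores):
    """cronbach alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # sample variance of the respondents' total scores
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# made-up scores of four respondents on three items
scores = [[1, 2, 3],
          [2, 3, 3],
          [3, 3, 4],
          [4, 5, 5]]
alpha = cronbach_alpha(scores)  # about 0.964
```

alpha rises when items vary together, because the variance of the total score then grows faster than the sum of the individual item variances.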
the multivariate correlation test by means of bartlett also showed that p=0.000. the score was smaller than 0.05; therefore, it can be concluded that there was correlation among the variables. in other words, the data from the third experiment were feasible for further analysis. the measure of sampling adequacy (msa) coefficient in the anti-image correlation in all items for all variables was bigger than 0.5. accordingly, it can be concluded that these variables could be predicted and analyzed further (santoso, 2014, p.69). last but not least, the communality value showed the amount of variance of the variables that could be explained by the seven existing factors. from the results of the analysis, it was found that, of the 55 items, there were five items whose score was lower than 0.50. the five items were found in the following factors: resource development factor, item 6; program development factor, item 22; school community satisfaction factor, items 36 and 37; and school community behavioral change factor, item 52. on the other hand, the score of the other items was higher than 0.50. these findings implied that the factors of the educational qa performance were: (1) resource; (2) program and activity development; (3) people participation; (4) customer satisfaction; (5) knowledge, attitude and skill change; (6) behavioral change; and (7) social, economic and environmental development. all of these factors could be explained by the variables that were observed in the study. based on the above results, it could be stated that factors 3, 4, 6 and 7 had valid observed variables. then, the invalid items were omitted and they were not included in the instrument. based on the above results, it could be stated that 42 out of the total instrument items had a load factor that was bigger than 0.55.
this implied that the item load on the factor was good so that in general the items in the instrument were valid and could be used. furthermore, for the sake of improving the instrument, the invalid items were omitted and were not put in the final product of the evaluation instrument. the reliability of the educational qa performance instrument increased from the first experiment to the final stage experiment. in detail, the alpha coefficient of each factor in the qa performance instrument is presented in table 3. the results of the final stage experiment showed that all of the factors in the evaluation instrument had a cronbach alpha coefficient that was higher than 0.7. from all of the items in the seven factors under analysis, all of the items had a strong correlation with each factor that was hypothesized. thereby, it can be stated that all of the variables in the factors were quite reliable and all of the items from each factor had a high reliability level so that these items could be used as the items in the educational qa performance instrument.

the model feasibility and the instrument clarity

the aspect of model practicality in the first experiment gained many criticisms from the schools because the instrument was too complicated for the evaluation implementation since the evaluation involved many parties, starting from the school committee members, employees, students and parents. in addition, within the data analysis, the results of the evaluation were considered complicated if the evaluation was implemented manually. table 4 presents the assessment results of the evaluation model feasibility. the results of the improvement showed an increase in the model practicality level, as shown by the results of the evaluation in the second experiment and operational experiment; in both experiments, the results of the evaluation were in the ‘good’ category.
thereby, the model of the educational qa evaluation could be implemented without any revision.

table 3. the recapitulation of the reliability coefficient for the qa performance instrument
no  name of factor                                          cronbach's alpha coefficient
                                                            first stage test   second stage test
1   resource development                                    0.870              0.980
2   program and activity development                        0.967              0.973
3   school community participation                          0.843              0.843
4   school community satisfaction                           0.891              0.896
5   knowledge, attitudes and skills change                  0.930              0.930
6   school community behavioral change                      0.772              0.838
7   school social, economic and environmental development   0.855              0.855

table 4. the model feasibility (score and category)
no  aspect of assessment                             first experiment    second experiment   operational experiment
1.  evaluation model clarity                         3.6667 (good)       3.9375 (good)       3.8788 (good)
2.  evaluation procedure clarity                     3.4667 (moderate)   3.8125 (good)       3.8030 (good)
3.  evaluation model practicality                    3.4667 (moderate)   3.8125 (good)       3.8030 (good)
4.  evaluation model benefits                        3.8000 (good)       4.2083 (good)       4.0606 (good)
5.  model use feasibility (time, cost and efforts)   3.6667 (good)       3.8125 (moderate)   3.8182 (good)
    average                                          3.6133 (good)       3.9427 (good)       3.8727 (good)

the assessment of the evaluation instrument clarity

as seen in table 5, in the first experiment, out of the five aspects of the instrument clarity, two aspects were good and three aspects were not good. the aspects that were good were the criteria clarity and the instrument manual clarity. these two aspects did not need any fundamental revision. the aspects of indicator clarity, of statement item readability, and of font shape and size relevance were not good. the improvement made after the first experiment concerned the aspects of indicator clarity and of statement item readability.
the two aspects were designed in a simpler manner and were adjusted to the respondents’ characteristics. in addition, the fonts were set in an ordinary layout so that they would be more readable. the improvement showed good results in the second experiment and the operational experiment, and the results were in the ‘good’ category. based on the results of the final experiment, the evaluation instrument could be stated as feasible as an instrument for evaluating educational quality assurance at schools.

table 5. the assessment of the instrument clarity (score and category)
no.  aspect of assessment                                  first experiment   second experiment   operational experiment
1.   criteria clarity                                      3.6000 (good)      4.0000 (good)       3.9697 (good)
2.   instrument manual clarity                             3.6000 (good)      4.0625 (good)       4.0455 (good)
3.   indicator clarity                                     3.7333 (worse)     4.0625 (moderate)   4.2121 (very good)
4.   statement item readability                            3.6000 (worse)     4.1458 (good)       4.1515 (good)
5.   relevance between the font size and the font shape    3.7333 (worse)     4.1667 (good)       4.2121 (very good)
     average score                                         3.6533 (worse)     4.0875 (good)       4.1182 (good)

the assessment of the evaluation manual

the assessment of the evaluation manual feasibility was conducted from the first experiment onwards. based on the results of the assessment as shown in table 6, the aspects that were not in the good category were improved. the results of the first and second experiments show an improvement in the quality of the evaluation manual.

table 6. the assessment of the evaluation manual
no.  aspect of assessment                                                      first experiment   second experiment
1.   relevance between the content and the development model                  3.60 (good)        4.0208 (good)
2.   manual clarity                                                            3.60 (good)        4.0417 (good)
3.   evaluation stage clarity                                                  3.73 (good)        4.0833 (good)
4.   recommendation direction and objective of evaluation results clarity     3.60 (good)        4.1042 (good)
5.   sentence comprehensiveness                                                3.73 (good)        4.1250 (good)
     average score                                                             3.65 (good)        4.0750 (good)

the final product review

the model of the educational qa evaluation was an evaluation model for the process of fulfilling the national education standards through the implementation of the educational qa system and the educational qa performance. based on the results of the experiment, the analysis, and the implementation of the model at schools, the evaluation model of the educational qa was fit for evaluating the implementation of the qa system and performance within the schools in terms of implementing the educational qa programs.

the following are the characteristics, the strengths and the weaknesses of the evaluation model of educational qa that was developed within the study.
the model characteristics

according to the evaluation objectives, there are several characteristics of the evaluation model of the educational qa that distinguish it from other evaluation models, namely: (1) the evaluation model is implemented in order to evaluate the process of fulfilling the national education standards or the standards that are stipulated by the schools through the implementation of the qa system and of qa performance; (2) the evaluation model could be implemented by the schools, the school supervisors, the public and the related parties to identify the level of the schools’ qa in fulfilling the standards implemented; (3) this evaluation model consists of two aspects, namely the qa system implementation and the qa performance; (4) this evaluation model has criteria with four levels of qa that represent the level of the school’s qa in the process of fulfilling the standards stipulated by the schools; and (5) this evaluation model is open and transparent; in other words, the data gathering is conducted openly by involving all components of the school community, and the results and the recommendations for improvement are presented transparently. these criteria show the effectiveness of the schools’ qa program and the schools’ commitment to fulfilling the quality promises.
the model strengths

the evaluation model has several strengths, namely: (1) the evaluation results could be directly used for improving the schools’ management in implementing the qa programs because the recommendations for improvement are based on the data of qa implementation; (2) the evaluation model could be used for portraying the process of fulfilling the national education standards or the standards stipulated by the schools; and (3) the evaluation model is equipped with the recommendation formats, the school budget plan and the formats of the evaluation results report, which are very useful for the schools in following up the evaluation results.

the model weaknesses

the evaluation model developed in the study still has several weaknesses, namely: (1) the evaluation model development is limited to the province of yogyakarta special region; (2) the model implementation is limited to the schools that have applied the educational qa programs; (3) the data sources are limited to the documents and the respondents from the school community and do not involve the members of surrounding communities; (4) the evaluation model is not able to portray the overall quality culture, which is the final objective of quality assurance; (5) the aspects under evaluation are still limited to the main components of quality assurance; and (6) the questionnaire and the document checks of the qa are still limited to the information provided by the respondents and the documents and are not equipped with an observation instrument.
conclusions and recommendations

conclusions

based on the data description and the data analysis, the following conclusions can be drawn: (1) the appropriate evaluation model for evaluating the educational qa of junior high schools consists of an evaluation of the educational qa system and an evaluation of the educational qa performance; (2) the constructs of the educational qa system implementation instrument consist of four dimensions, namely planning, implementation, monitoring, and evaluation and the action of the improvement, and based on the exploratory factor analysis the factor load of all variables is bigger than 0.50 and belongs to the good category; (3) the constructs of the educational qa performance instrument consist of seven dimensions, namely resource development, program and activity development, school community participation, school community satisfaction, knowledge, attitude and skills change, school community behavioral change, and social, economic and environmental change, and based on the exploratory factor analysis the factor load of all variables is bigger than 0.50 and belongs to the good category; and (4) the feasibility of the evaluation model for the educational qa at junior high schools belongs to the good category based on the expert validation, the user validation, and the practitioner validation as well as the evidence found in the field study.
recommendations

based on the conclusions, the following recommendations can be proposed: (1) the evaluation model (epmp) could be implemented as an alternative both for the schools in implementing the self-evaluation and for the related parties such as the office of education, the institution of educational qa (lembaga penjaminan mutu pendidikan, lpmp) and the centre for development and empowerment of teachers and education personnel (pusat pengembangan dan pemberdayaan pendidik dan tenaga kependidikan, p4tk) in viewing the efforts of fulfilling the quality/standards by measuring the implementation of the qa system and performance; (2) the results of evaluation could be turned into material for managerial supervision by the school supervisors, especially at junior high schools; (3) the qa evaluation could be developed online so that the time and space limitations could be minimized; (4) the development of the qa system implementation is limited to the four main components of the qa system and the qa performance is still limited to the seven dimensions found in the exploratory factor analysis; as a result, these findings provide an opportunity to develop a more comprehensive evaluation model of qa; (5) the qa system implementation instrument and the qa performance instrument are still limited to the questionnaire and the document check on the implementation; therefore, future researchers might develop more comprehensive instruments; and (6) the data source in the evaluation model is still limited to the school community and the documents, so future researchers might develop an evaluation model of qa with more representative and varied data sources.

references

borg, w.r., & gall, m.d. (2003). educational research: an introduction. london: longman.
edmond, r.r. (1979). effective school for the urban poor. educational leadership, 37, 15-27.
ghozali, i. (2009). aplikasi analisis multivariat dengan program spss [multivariate analysis application with spss program] (2nd ed.). semarang: universitas diponegoro.
kaplan, r.m., & saccuzzo, d.p. (1982). psychological testing: principles, application, and issues. monterey: brooks/cole.
loder, c.p.j. (ed.). (1990). qa and accountability in higher education. london: kogan page.
rockwell, k., & bennett, c. (2004). targeting outcomes of programs: a hierarchy for targeting outcomes and evaluating their achievement. faculty publications: agricultural leadership, education & communication department. retrieved on december 2, 2009, from http://digitalcommons.unl.edu/aglecfacpub/48/.
santoso, s. (2014). seri solusi bisnis berbasis ti: menggunakan spss untuk statistik multivariat. jakarta: elex media komputindo.
sudijono, a. (2003). pengantar evaluasi pendidikan [an introduction to educational evaluation]. jakarta: pt raja grafindo persada.
tabachnick, b.g., & fidell, l.s. (1983). using multivariate statistics. new york, ny: harper & row.
widoyoko, s. (2013). pengembangan model evaluasi kualitas dan output pembelajaran ips di smp. jurnal penelitian dan evaluasi pendidikan, 11(1). doi:http://dx.doi.org/10.21831/pep.v11i1.1417

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(2), 2017, 133-143
available online at: http://journal.uny.ac.id/index.php/reid

research article

discrepancies in assessing undergraduates’ pragmatics learning

oscar ndayizeye
higher teacher-training school of burundi (ecole normale supérieure (ens) du burundi)
boulevard du 28 novembre, b.p.
6983 bujumbura, burundi
email: ndaosca@yahoo.fr
submitted: 15 june 2017 | revised: 28 december 2017 | accepted: 03 january 2018

abstract

the purpose of this research was to reveal the level of implementation of authentic assessment in the pragmatics course at the english education department of a university. the discrepancy evaluation model (dem) was used. the instruments were a questionnaire, documentation, and observation. the results of the research show that the effectiveness of the definition, installation, process, and production stages in logits is, respectively, -0.06, -0.14, 0.45, and 0.02 on the aspect of the assessment methods’ effectiveness in uncovering students’ ability. such values indicate that the level of implementation fell respectively into the ‘very high’, ‘high’, ‘low’, and ‘very low’ categories. the students’ success rate is in the ‘very high’ category with an average score of 3.22. however, the overall implementation of the authentic assessment fell into the ‘low’ category with an average score of 0.06. the discrepancies leading to such a low implementation are the unavailability of the assessment scheme and of the scoring rubric, the minimal (only 54.54%) diversification of assessment methods, the infrequency of the lecturer’s feedback on the students’ academic achievement, and the non-use of portfolio assessment.

keywords: authentic assessment, program evaluation, pragmatics, rasch model

how to cite item: ndayizeye, o. (2017). discrepancies in assessing undergraduates’ pragmatics learning. reid (research and evaluation in education), 3(2), 133-143. doi:http://dx.doi.org/10.21831/reid.v3i2.14487

introduction

writing, for some people, springs from something else, and the motivation to write this article dates back to 2014, when the authors audited a pragmatics course in the english language and literature study program, faculty of languages and arts of a university. during that time, they observed many things, among which were the use of (a) a classification by yule (1996, pp.
47–48); (b) students’ classroom presentations, during which each student was given a sheet used to comment on the presenters’ content clarity and the language use in general, and after which students were given a chance to comment on or read aloud their reflections on the previous presentations; (c) a detailed syllabus downloadable from the university’s staff e-data, giving details on the assessment scheme of that course, whose assessment comprised students’ attendance, class participation, assignments, a mid-semester exam (which actually was a take-home exam), and a final exam; and (d) a course book written by yule (1996), entitled pragmatics. as the authors remarked, the characteristics listed above are those that, according to yusuf (2015, pp. 292–293), indicate authentic assessment. however, with these pre-survey insights alone, the authors could not tell whether what they observed was really an authentic assessment being implemented in a pragmatics course. in 2017, wishing to discover more about authentic assessment, as the authors had observed that such assessment was quasi-absent in the assessment of linguistics-related courses in the first author’s country, they decided to go back to the faculty of languages and arts, specifically to the 5th semester in which the pragmatics course was administered in the english language and literature study program of the university, to investigate the issue. in (higher) education, the solutions to assessment-related problems can be investigated from a series of angles, such as how lecturers may track plagiarism in students’ assessment tasks, the development of fair assessment criteria/rubrics, the implementation of authentic assessment, and the impact of students’ right to take educators to court and how this impinges on assessment.
the list of these perplexing assessment-related issues in the indonesian (higher) education system or in the first author’s country of origin is far from being exhaustive. assessment is a process that is an integral part of the logic in which the lecturers’ and their students’ roles are to be played maximally for learning to take place. the normal flow is that the lecturers give assessment tasks and the students do them, and ideally this flow goes on until the students graduate. the problem arises when the two main parties in the teaching-learning process have different perceptions of some issues. for example, the views on assessment sometimes diverge, as lecturers might view it as a motivation for learning while their students might see it as devoid of any motivation to improve learning and as merely grade-oriented; this is also the observation of fry, ketteridge, and marshall (2009, p. 133). even among assessors, divergence exists. one group of academics still strives to use tests (exams) in which students give short answers, while another advocates real-life assignments that build students’ competency, knowledge and interest. the academics in the latter group even label short-answer exams as the traditional practice of assessment. real-life assignment advocates also stress how this type of activity motivates learning via well-timed and consistent feedback. whichever view is held, it is urgent to see the role of authentic assessment in language classes and how feedback might enhance learning improvement and outcomes in higher education. something obvious is that assessment at this level of education should enhance the students’ deep learning approach (joughin, 2009, p. 19). getting students to use such an approach requires that the assessment tasks be well prepared. it should be noted that assessment has drawn the attention of many academicians and education practitioners.
some academicians, including mardapi (2008, p. 5, 2012, p. 12) and fook and sidhu (2010, p. 153), regard assessment as an integral or central part of teaching-learning processes. mardapi, for instance, goes further, saying that the efforts to improve the quality of education can be realised through the enhancement of the quality of learning and the quality of its assessment system. the national research council [nrc] (1996, p. 5) in diranna et al. (2008, p. 8) also insists that assessment and learning are inseparable, as they are two sides of the same coin, which means that the two are mutually inclusive. the choice of assessment methods has to balance some considerations. diranna et al. (2008, pp. ix–x) insist that the assessment model should be able (a) to effectively demonstrate how students ‘represent knowledge’ and build knowledge in the course they are learning; (b) to display students’ real performance; and (c) to be a good choice of ‘an interpretation method’ that allows correct inferences about students’ performance. if the choice of assessment model does not balance the aspects raised above, assessment may not achieve its end in education. fry et al. (2009, p. 198) also review how, in the beginning, research into assessment practices in higher education was not welcomed by academicians: they considered such research either as unnecessary, as loaded with deliberate disrespect, or as just one way of treading on their academic space/autonomy. this can simply be considered as ‘fearing the unknown’, as research can lead to the lessening of practices that negatively affect a given educational system, as brown and glasner (1999, p. 28) stress. the literature shows that research does much to demonstrate to academics that they are not so accomplished as to need no improvement or new career insights.
authentic assessment is also related to the notions of the assessor’s compliance with assessment principles, formative feedback, scoring rubrics, and alignment between learning activities and assessment methods, to name but a few. it is crucial that some of these key terms be defined in the context of this research. to begin with, assessment was defined by the university of queensland, australia (2007) in joughin (2009, p. 14) as having to do with any work (which may include an assignment, examination, performance or practicum) that is to be completed by a student as a requirement. assessment is carried out for different reasons: (1) to permit the grading of a student; (2) to fulfil educational purposes, like motivating students’ learning and providing necessary feedback to students; and (3) to serve as a student’s official achievement record that might be used as proof for certification. the afore-mentioned definition is very clear, for it discloses some of the forms the students’ tasks can take, i.e. assessment can be carried out through exams, assignments, practical tasks, and performance. it equally details that assessment has various purposes, i.e. educational purposes and keeping an official record of students’ achievement, certifying their competence, and grading them. the educational purpose of assessment will be discussed further later. more about the purpose of assessment is proposed by irons (2008, p. 13). according to him, assessment can serve the purpose of promoting learning through providing helpful feedback, i.e. technically put, through formative assessment and formative feedback. feedback, as it appears in the previous sentence, also needs defining. it is closely related to comments on students’ work made in order to enhance learning and high learning achievement. according to irons (2008, p.
13), formative feedback has to do with any piece of information, or simply a process or activity, that is meant to afford or accelerate student learning, and this is achieved through comments based on students’ outcomes in the formative or summative assessment. the effectiveness of feedback depends, among other things, on whether it helps clarify what good performance is (goals, criteria, expected standards) or whether it provides opportunities to close the gap between current and desired performance. it is also important to give an account of what authentic assessment is, since the whole study revolves around it. the first view is put forward by mueller (2014) in suarta, hardika, sanjaya, and arjana (2015, p. 47), who defines authentic assessment as a form of assessment in which learners demonstrate competence, or a combination of knowledge, skills, and attitude, in order to complete an essential task in a real-world situation. based on this opinion, one can simply say that authentic assessment urges students to make use of their competence or to combine what they already know with their existing skills to solve a real-world problem. mardapi (2012, pp. 166–167) also gives an account of what authentic assessment really is. mardapi stipulates that in this form of assessment, learners present or do a given assignment; critical thinking is built in the way that students are assessed based on their ability to ‘construct’ or ‘apply’ knowledge in a real-world setting; and the evidence of what students are able to do is direct, i.e. it can be observed, and this turns authentic assessment into a learner-centered one. the core idea here is that authentic assessment engages students in real-world tasks that incite the use of critical thinking in constructing knowledge. another aspect worth underlining is that authentic assessment has a series of methods that a teacher has to handle given the class size and the students’ level of study and ability.
teachers also use authentic assessment methods with the aim of aligning teaching-learning activities and tasks with the assessment method chosen. the diversification of assessment techniques in authentic assessment is demonstrated in the choice offered to teachers. the latter might choose to use students’ classroom presentations, classroom discussions, individual assignments, group assessments, quizzes, examinations, students’ portfolios, students’ self-assessment and/or peer-assessment, projects, and performance assessment (yusuf, 2015, pp. 292–293). assessment, especially in higher education, is also maximally effective if it complies with a series of principles. in the indonesian higher education context, the ministry of research, technology, and higher education has issued principles, which can be read in the higher education curriculum book, i.e. buku kurikulum di pendidikan tinggi (tim kurikulum dan pembelajaran, 2014, p. 67). according to this reference, any assessment should be educative, authentic, objective, accountable, and transparent. in higher education, the literature about the alignment of tasks and course objectives, and about the assessment methods that enhance learning improvement and outcomes through feedback, is still limited. the angle of the assessment issue that is still unexploited is how the pragmatics course is assessed authentically, given the role it is expected to play for students who will become english language teachers. one among other reasons why only a few pragmatics course assessment studies are available is given in mcnamara and roever (2006, p. 54), who comment that assessing a student’s ability in the pragmatics of a given language is somewhat difficult.
this is due to the fact that the assessor has to reconcile the authenticity of the tasks to be used with practicality, given that the costs required to align assessment tasks and practice are huge. however, the fact that some researchers did not explore this angle does not mean that it cannot be explored. rubrics are also great tools to be used in authentic assessment contexts. the rubric formats used in indonesia, or at least those mentioned in official texts about assessment, are of two types, i.e. descriptive and holistic, and lecturers may choose whichever seems comprehensible to students, and efficient and effective in assessing students’ knowledge, skills, and competencies. the types and formats of rubrics together with their definitions are available in the book by tim kurikulum dan pembelajaran (2014, pp. 69–71), in which: (1) a rubric is an assessment guide that describes the criteria used by a lecturer in assessing the student’s achievement level in his/her assignment/task. in addition, the rubric lists the expected performance characteristics which are manifested/demonstrated in the process and the students’ work, and it also becomes a sort of reference to assess each of those performance characteristics; (2) a descriptive rubric provides descriptions of the assessment characteristics or benchmarks on each given value scale; (3) a holistic rubric has only one value scale, i.e. the highest scale. the content of the description of the dimensions is the criteria of a performance at the highest scale. if the student does not meet these criteria, the lecturer comments by giving the reasons why the student cannot get the maximum score in his/her tasks. it should be noted that the low quality of rubrics, i.e. any rubric which is not clear or is simply wrongly constructed, ends in doubts about the scoring integrity of the assessor concerned. further, christie et al. (2015, p.
31) investigate how assuring the quality of assessment grading tools affects student motivation and learning. the study shows how australian and us lecturers’ practice of not using scoring rubrics to assess the quality of students’ work tends to turn the final judgment of students’ learning into a questionable one. the lecturers involved in that study tend to use common sense in assessment scoring instead of written rubrics, which could negatively affect, as the authors observed, the lecturer’s integrity in grading students’ work. with such conviction in mind, this study investigated the still-unexploited angle of assessment issues, that is, how the pragmatics course is assessed authentically given its importance for student teachers of the english language. this research was solely concerned with the implementation of authentic assessment in higher education. some related aspects such as alignment, feedback, and compliance with the assessment principles are also tackled. the problem was formulated around the wish to know the extent to which authentic assessment was implemented in the pragmatics course taken by semester five students in the english language and literature study program. since such an assessment has its own indicators, the problem also includes: (1) how the assessment standard is indicated in the curriculum being implemented in the pragmatics course, (2) the proof of alignment between students’ tasks and the assessment methods in the pragmatics course, (3) which pragmatics course assessment methods provide more feedback to the students, (4) what the compliance with the authentic assessment principles in assessing students’ tasks in the pragmatics course is like, and (5) what the authentic assessment implementation in the pragmatics course is like.
carrying out this program evaluation was beneficial, firstly, to the theoretical literature, which it broadens as far as the evaluation of the implementation of authentic assessment in teaching the pragmatics course to indonesian students who are expected to be teachers of the english language is concerned. equally, this work is meant to broaden the literature regarding the use of the discrepancy evaluation model (dem) in foreign language assessment, especially in english as a foreign language (efl) settings. secondly, it is also beneficial in practical terms, because the students who are taking the pragmatics course might suggest some new ideas to the pragmatics course lecturer with a view to adjusting the course administration. furthermore, broader space is also open to other researchers to investigate the realms of authentic activities and assessment that might develop efl student teachers’ pragmatic competence, especially the pragma-linguistic and socio-pragmatic competencies. the research questions in this study were based on the problem formulated and the dem stages, i.e. pragmatics course program definition, installation, process, and product (fernandes, 1984; fitzpatrick, sanders, & worthen, 2011, pp. 156–157). those questions are: (a) to which degree did the assessment that was carried out in the pragmatics course comply with the authentic assessment standard as indicated in the curriculum? (b) what is the proof of alignment between the assessment methods used in the pragmatics course and the students’ learning activities? (c) what were the most consistent feedback-providing assessment methods among the ones used in the pragmatics course assessment? (d) what were the possible necessary inputs for the implementation of the authentic assessment carried out in the pragmatics course? (e) to which extent had the authentic assessment been implemented in the pragmatics course?
method
this research is a program evaluation that employed provus’s discrepancy evaluation model (dem). it was carried out at a university located in the special region of yogyakarta, indonesia. the population of this study was the semester 5 pragmatics course takers. the research employed a non-probability sampling method with the saturated sampling technique (in which the sample is the whole population), with n=31.
procedure
the core of the procedure is the determination of: (1) the standard (s), i.e. how the pragmatics course assessment should be conducted, based on the ministry of research, technology, and higher education assessment principles as stated in the higher education curriculum book, i.e. buku kurikulum pendidikan tinggi (tim kurikulum dan pembelajaran, 2014, pp. 67–74), and on the university’s english language and literature study program curriculum (2014), and then (2) the performance (p) measure, i.e. given the pragmatics course inputs/resources, the pragmatics course assessment characteristics were observed and the assessment process was scrutinised. this was followed by the evaluation per se, i.e. the determination of discrepancies (d) by comparing performance (p), i.e. how the program performs, with the standard (s), i.e. how it should behave.
data, instruments, and data collecting technique
in the pragmatics course program evaluation, both quantitative and qualitative data were collected. three instruments were used to collect the data in this study: a questionnaire, an observation guide, and documentation. through the questionnaire, data were collected about the assessment techniques, the most feedback-providing technique, compliance with assessment principles, resources, and the effectiveness of each assessment technique in uncovering the students’ ability.
through documentation, information about the pragmatics course objectives, assessment standards, the rubrics used, and students’ final learning outcomes was gathered. the observation instrument helped the authors gather information about the main inputs (curriculum, lecturer, and students), the assessment methods used, details about the assessment process, and teaching-learning facilities.
data analysis techniques
two types of analysis were carried out: (descriptive) quantitative analysis through the rasch model with the winsteps software version 3.73.0, and qualitative analysis following the technique of miles, huberman, and saldaña (2014, pp. 12–13), consisting of (1) data reduction or condensation, (2) data display, and (3) conclusion drawing/verification.
evaluation criteria
table 1 shows the criteria for the level of authentic assessment implementation.
table 1. (dis)agreement and authentic assessment level of implementation
interval            categories
x < -0.99           strongly agree/very high
-0.99 ≤ x ≤ 0       agree/high
0.1 ≤ x ≤ 1.01      disagree/low
x ≥ 1.01            strongly disagree/very low
(developed based on sumintono and widhiarso (2015, p. 40))
note: x stands for each statement’s ‘item measure’ value in logits as analysed through winsteps version 3.73.0.
meanwhile, table 2 provides the information about the categorization of students’ scores.
table 2. categorizing the students’ scores
score x                  categories    criteria
x ≥ m + 1.sd             very high     x ≥ 3
m ≤ x < m + 1.sd         high          2.5 ≤ x < 3
m - 1.sd ≤ x < m         low           2 ≤ x < 2.5
x < m - 1.sd             very low      x < 2
source: mardapi (2008, p. 123)
note:
m : mean of students’ final scores in the pragmatics course
x : each single student’s score out of 4 (because the score scale is 4-1)
sd : standard deviation, obtained through sd = (4-1) × 1/6, as the score scale is 4-1
in order to accept that a given method was used, it has to satisfy the criteria: mean = 1 (or close to 1, that is 0.9) and sd ≤ 0.31.
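the categorization rules in tables 1 and 2 can be sketched in a few lines of python. this is only an illustrative sketch, not the procedure actually run in winsteps. the function names are hypothetical, and two boundary choices are assumptions: measures strictly between 0 and 0.1 (which table 1 leaves uncovered) are treated as ‘disagree/low’, and a measure of exactly 1.01 is treated as ‘very low’. the value m = 2.5 is taken from the criteria column of table 2 rather than computed from data.

```python
# Illustrative sketch of the Table 1 and Table 2 categorization rules.
# Assumption: measures in the uncovered gap (0, 0.1) count as "disagree/low",
# and a measure of exactly 1.01 counts as "strongly disagree/very low".

def implementation_level(measure):
    """Map an item measure (in logits) to a Table 1 category."""
    if measure < -0.99:
        return "strongly agree/very high"
    if measure <= 0:
        return "agree/high"
    if measure < 1.01:
        return "disagree/low"
    return "strongly disagree/very low"


SCALE_MAX, SCALE_MIN = 4, 1
SD = (SCALE_MAX - SCALE_MIN) / 6  # (4-1) x 1/6 = 0.5, as in the note to Table 2
M = 2.5  # assumption: the value implied by the criteria column of Table 2


def score_category(score):
    """Map a final score on the 4-1 scale to a Table 2 category."""
    if score >= M + SD:
        return "very high"
    if score >= M:
        return "high"
    if score >= M - SD:
        return "low"
    return "very low"


print(implementation_level(-0.79))  # strongly agree/very high? no: agree/high
print(score_category(2.66))         # high
```

note that the comment on the first print line spells out the check: -0.79 is not below -0.99, so it falls in ‘agree/high’.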
similarly, to determine the diversification of assessment methods and the students’ success rate in the pragmatics course, the following criteria were used:
<50% : low
50%-65% : average/minimal
66%-81% : high
≥ 82% : very high
findings and discussion
before the results and discussion are presented, it should be underlined that item measure values for quantitative data are expressed in logits. in the rasch model as applied in the social sciences, the further the item measure value in logits rises above 0, the more the respondents disagree with the statements presented to them. conversely, an item measure value equal to 0 or negative indicates that the respondents agreed with the statement. in short, logit values from -2 up to 0 indicate that the respondents agreed with the statements concerned. the discussion starts with the quantitative data, followed by the qualitative data. concerning the quantitative data, at the program definition stage, the resources/inputs recognized by the pragmatics course takers as essential included: the lecturer, course objectives, the classroom’s capacity to accommodate all the students, class cleanliness, sufficiency of chairs, adjustable luminosity, functional fans, and the lcd projector, as their measure values in logits are respectively -0.79, -0.26, -0.57, -0.16, -0.79, 1.00, -1.00, and -1.23. at the pragmatics course installation stage, the performance of the program is compared with the standard, i.e. how it should behave. this activity is aimed at finding discrepancies. at the pragmatics program process stage (the assessment process), the program shows indicators of good performance in terms of the assessment principles of being educative and authentic, and the alignment of learning activities with the assessment used.
among the measure values related to the positive indicators of good performance, the following are the most illustrative: -0.26, -0.16, -0.16, -0.16, and -0.57. the first two values relate respectively to the assessment principles of being educative and authentic, while the last three relate to the statements about alignment. alignment was accepted as having been observed by the lecturer of pragmatics. by doing so, she complied with the guideline provided in the study program curriculum and the higher education (he) curriculum book (tim kurikulum dan pembelajaran, 2014), citing the ministry of education and culture’s decree number 49 of 2014, article 20, about he in indonesia, sections 1 and 4 about assessment in he. nevertheless, the core activity at the dem program installation stage is finding discrepancies. those registered are non-compliance with the assessment principles of objectivity, accountability, and, implicitly, feedback. the item measure values associated with those three principles are above 0.1. they fit the criterion 0.1≤x≤1.01, which indicates that the respondents disagreed that the three principles had been optimized. portfolio assessment was not used, although it was recommended in the english language and literature study program curriculum and the higher education curriculum book (buku kurikulum pendidikan tinggi). since the study program curriculum describes the portfolio as a highly recommended assessment method that allows lecturers to keep an eye on every student’s knowledge process, the absence of the portfolio, added to the infrequency of feedback from the lecturer, was felt as a discrepancy.
the dem program process stage is concerned with the most used authentic assessment methods, the extent to which assessment methods were diversified, and which authentic assessment method provided the most feedback. of the eleven authentic assessment methods found in the literature, six were accepted as having been used in the pragmatics course. the criteria for determining that a given assessment method was used are mean = 1 (or close to 1, that is 0.9) and sd ≤ 0.31. the following authentic assessment methods satisfy them: students’ classroom discussion, individual assignments, quizzes, examinations, project assessment, and group assignments. their descriptive statistics (mean; sd) are respectively: (1;0), (1;0), (0.90; 0.31), (1;0), (1;0), and (1;0). compared to the pre-established criteria, these authentic assessment methods satisfied them fully. the second aspect examined was the diversification of authentic assessment methods. a simple calculation showed that the diversification was only average/minimal: with six of the eleven authentic assessment methods used, the diversification was (6×100)/11=54.54%. this percentage falls into the 50%-65% interval, which means that the diversification is simply ‘average/minimal’. on top of that, the respondents’ appreciation of group assignments is shown in two ways: (1) they agree that it provides them with valuable feedback; (2) they recommend it to the lecturer for a better administration of the pragmatics course in the future. this is indicated by the related item measure value in logits, which is -0.47. compared to the criteria set, this value shows that group assignments were accepted as having provided helpful feedback to the pragmatics course takers.
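the two checks applied above (whether a method counts as used, and how diversified the methods are) can be reproduced with the reported statistics. the following python sketch is illustrative only. the function names are hypothetical, and the band edges between 65% and 66% and between 81% and 82%, which the criteria leave open, are closed here by assumption. the exact quotient is 54.5454..., which rounds to 54.55 and appears truncated as 54.54 in the text.

```python
# Illustrative sketch: "method used" criterion and diversification band.
# Assumption: the gaps 65-66% and 81-82% in the criteria are assigned
# to the lower band.

def method_was_used(mean, sd, min_mean=0.9, max_sd=0.31):
    """A method counts as used when mean is 1 (or at least 0.9) and sd <= 0.31."""
    return mean >= min_mean and sd <= max_sd


def diversification_band(percent):
    """Map a percentage to the bands listed in the evaluation criteria."""
    if percent < 50:
        return "low"
    if percent < 66:
        return "average/minimal"
    if percent < 82:
        return "high"
    return "very high"


# (mean, sd) pairs as reported for the six methods in the findings.
reported = {
    "classroom discussion": (1.0, 0.0),
    "individual assignments": (1.0, 0.0),
    "quizzes": (0.90, 0.31),
    "examinations": (1.0, 0.0),
    "project assessment": (1.0, 0.0),
    "group assignments": (1.0, 0.0),
}
TOTAL_METHODS = 11  # authentic assessment methods found in the literature

used = [name for name, (m, s) in reported.items() if method_was_used(m, s)]
percent = len(used) * 100 / TOTAL_METHODS

print(len(used), round(percent, 2), diversification_band(percent))
# prints: 6 54.55 average/minimal
```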
such a finding is in line with bentley and warwick (2013). students appreciate group assignment assessment as they learn from their friends/peers and develop teamwork, communication, and interpersonal skills. furthermore, the respondents recommend the use of group assignments, one of the techniques of authentic assessment, to the pragmatics course lecturer. this is also the case in fook and sidhu’s (2010) study, which sought to examine the implementation of authentic assessment in higher education in malaysia, especially in the course ‘testing, assessment, and evaluation 752’ (tsl 752), taught in a master program at the faculty of education of a public university in selangor, malaysia. in both of these studies, authentic assessment was shown to be appreciated as enhancing learning, as it won acceptance from the respondents. students who are successful in the pragmatics course have scores ranging from 2.5 to 4, as described in the students’ academic guide, termed peraturan akademik (universitas negeri yogyakarta, 2014, p. 15). except for two students who were in irregular conditions, 29 out of 31 students got a score between 2.66 and 4. compared to the criteria pre-set in table 2, students’ scores fall in the ‘high’ and ‘very high’ categories. as far as the qualitative data are concerned, the analysis led to the observation that the pragmatics course lacked a clear assessment and scoring scheme and did not use the portfolio, although it is described as a highly recommended assessment method that allows lecturers to keep an eye on every student’s knowledge process. the infrequency of the lecturer’s feedback on students’ learning and assignments was also found. similar findings were reported in christie et al. (2015, p. 31).
that study demonstrated that the practice among australian and us lecturers of not using scoring rubrics to assess the quality of the learners’ work tends to make the final judgment of students’ learning questionable. simply put, if the respondents/students perceive that the objectivity and accountability principles were not maximized in that course, they might have suspected the scoring integrity. in general, the evaluation result of each stage is presented in table 3.
table 3. holistic evaluation of authentic assessment implementation
no   dem stage/component            average   category
1    program definition stage       -0.06     high
2    program installation stage     -0.14     high
3    program process stage           0.45     low
4    program product stage           0.02     low
     average for the 4 dem stages    0.06     low
the students’ final scores in the pragmatics course are averaged and categorized as follows:
average : 3.22
category : very high
therefore, the pragmatics course definition and product (based on the students’ scores aspect) are respectively in the ‘high’ and ‘very high’ categories, as the average item measure value for the dem definition stage is -0.06, while the average of the students’ final scores is 3.22. the performance of the pragmatics course over the resources/inputs is also in the ‘high’ category. such performance is not maximal, as shown by the dem process stage, whose average item measure value is 0.45, falling in the ‘low’ category. another aspect of the dem product stage (concerned with the effectiveness of the assessment methods used in uncovering the students’ knowledge, ability, and competence) is in the ‘low’ category, with an average item measure value of 0.02.
conclusion and suggestions
conclusion
overall, the implementation of authentic assessment is in the ‘low’ category.
the definition and installation stages are in the ‘high’ category. one aspect of the pragmatics course product stage is in the ‘low’ category because the process itself is marred by some impediments and is in the ‘low’ category. the diversification of the assessment methods is still ‘average/minimal’. that conclusion rests on the following main findings. firstly, the compliance of the pragmatics course assessment with the curriculum assessment standard is found to be in the ‘high’ category. however, at the dem installation stage, the discrepancies registered are: (a) little compliance with the assessment principles of feedback, objectivity, and accountability; (b) the lack of a pragmatics assessment plan and scoring rubrics; (c) the lack of tasks and assessment methods that push students towards further research in the field of pragmatics; (d) ineffective monitoring of students’ learning due to not using portfolio assessment. secondly, the proof of alignment between students’ learning activities and assessment methods is that: (a) the students’ intended learning outcomes are in line with the study program curriculum; (b) the problem-solving skills engaged by the students during the learning activities resemble those required to solve assessment tasks. thirdly, the most consistent feedback-providing assessment method is group assignments. meanwhile, the other assessment methods used include: (a) students’ classroom discussion, (b) individual assignments, (c) quizzes, (d) examinations, and (e) project assessment.
fourthly, the inputs found to be necessary for the implementation of authentic assessment in the pragmatics course include: (a) the lecturer, (b) the course objectives, (c) a classroom that is clean and big enough to accommodate all the students, (d) enough chairs, (e) adjustable luminosity, and (f) functional fans and lcd projector. fifthly, the level of implementation of authentic assessment in the pragmatics course is reflected in the dem product stage, which includes two aspects: (a) the effectiveness of the assessment methods in uncovering the students’ ability, which is in the ‘low’ category, and (b) the students’ final scores in the pragmatics course, which are in the ‘very high’ category.
implications
based on the conclusions, the implications for practice are: (1) unless teachers/lecturers choose activities that push students to use the available learning resources, students will always perceive such expensive resources or services as of little importance to their learning; (2) unless used-up teaching/learning resources are replaced, students see them as non-existent; (3) a lecturer’s teaching effort and high academic competence, without a clear assessment scheme and a scoring rubric, might compromise the scoring integrity of that lecturer; (4) lecturers may use many assessment methods, and there may be alignment between students’ learning activities and the assessment methods for the expected outcomes, yet the assessment methods providing valuable feedback to students may still be very few; (5) a high student success rate in a course, as indicated by students’ final scores, does not imply that the whole assessment practice has been flawless.
suggestions
suggestions for the university administration, lecturers, and educational researchers or education practitioners are as follows.
(1) the university’s administration should regularly check the used-up learning resources in the classrooms and replace those in bad condition. (2) the pragmatics course lecturers are advised to (a) apply a more student-centred teaching approach (more interactive, with more chances for students to talk); (b) choose students’ learning activities that push them to learn how to use the resources provided by the university (it would be unfortunate if the university pays much for external journals and internet hotspot maintenance while the students still say that those resources do not improve their pragmatics course learning); and (c) explain the tentative or provisional assessment scheme and scoring rubric, and give students opportunities to ask about them. (3) other researchers are advised to (a) carry out other studies to evaluate the implementation of authentic assessment in the english language and literature study program in particular and all the fla (foreign language assistant) departments in general, and (b) conduct other research related to the lecturers’ teaching strategies/techniques, methods, and learning activities. (4) a model for applying item response theory or any model linked to it (e.g. the rasch model) in assessment practices in indonesian higher education should be developed.
references
bentley, y., & warwick, s. (2013). an investigation into students’ perceptions of group assignments. journal of pedagogic development, 3(3), 11–19. retrieved from https://journals.beds.ac.uk/ojs/index.php/jpd/article/view/199/310
brown, s. a., & glasner, a. (1999). assessment matters in higher education: choosing and using diverse approaches. buckingham: society for research into higher education & open university press.
christie, m.
f., grainger, p., dahlgren, r., call, k., heck, d., & simon, s. (2015). improving the quality of assessment grading tools in master of education courses: a comparative case study in the scholarship of teaching and learning. journal of the scholarship of teaching and learning, 15(5), 22–35. https://doi.org/10.14434/josotl.v15i5.13783
diranna, k., osmundson, e., topps, j., barakos, l., gearhart, m., cerwin, k., … strang, c. (2008). assessment-centered teaching: a reflective practice. thousand oaks, ca: corwin press.
fernandes, h. j. x. (1984). evaluation of educational programs. jakarta: national education planning evaluation and curriculum development.
fitzpatrick, j. l., sanders, j. r., & worthen, b. r. (2011). program evaluation: alternative approaches and practical guidelines. boston, ma: pearson education.
fook, c. y., & sidhu, g. k. (2010). authentic assessment and pedagogical strategies in higher education. journal of social sciences, 6(2), 153–161. https://doi.org/10.3844/jssp.2010.153.161
fry, h., ketteridge, s., & marshall, s. (2009). a handbook for teaching and learning in higher education: enhancing academic practice (3rd ed.). new york, ny: routledge.
irons, a. (2008). enhancing learning through formative assessment and feedback. london: routledge.
joughin, g. (ed.). (2009). assessment, learning and judgement in higher education: a critical review. wollongong: springer. https://doi.org/10.1007/978-1-4020-8905-3_2
mardapi, d. (2008). teknik penyusunan instrumen tes dan nontes. yogyakarta: mitra cendekia.
mardapi, d. (2012). pengukuran penilaian dan evaluasi pendidikan. yogyakarta: nuha medika.
mcnamara, t. f., & roever, c. (2006). language testing: the social dimension. oxford: blackwell publishing.
miles, m. b., huberman, a. m., & saldaña, j. (2014). qualitative data analysis: a methods sourcebook (3rd ed.). thousand oaks, ca: sage.
suarta, i. m., hardika, n. s., sanjaya, i. g. n., & arjana, i. w. b. (2015).
model authentic self-assessment dalam pengembangan employability skills mahasiswa pendidikan tinggi vokasi. jurnal penelitian dan evaluasi pendidikan, 19(1), 46–57. https://doi.org/10.21831/pep.v19i1.4555
sumintono, b., & widhiarso, w. (2015). aplikasi pemodelan rasch pada asesmen pendidikan. cimahi: trim komunikata.
tim kurikulum dan pembelajaran. (2014). buku kurikulum pendidikan tinggi. jakarta: directorate of learning and student affairs, directorate general of higher education, ministry of education and culture.
universitas negeri yogyakarta. (2014). buku peraturan akademik universitas negeri yogyakarta (revised ed.). yogyakarta: uny press.
yule, g. (1996). pragmatics. oxford: oxford university press.
yusuf, a. m. (2015). asesmen dan evaluasi pendidikan: pilar penyedia informasi dan kegiatan pengendalian mutu pendidikan. jakarta: prenada media group.
copyright © 2018, reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 4(2), 2018, 105-116
available online at: http://journal.uny.ac.id/index.php/reid
technology-enhanced pre-instructional peer assessment: exploring students’ perceptions in a statistical methods course
yosep dwi kristanto
department of mathematics education, universitas sanata dharma
paingan, maguwoharjo, depok, sleman, yogyakarta 55282, indonesia
e-mail: yosepdwikristanto@usd.ac.id
submitted: 25 august 2018 | revised: 20 november 2018 | accepted: 22 november 2018
abstract
there has been strong interest among higher education institutions in implementing technology-enhanced peer assessment as a tool for enhancing students’ learning. however, little is known about how to use the peer assessment system in pre-instructional activities. this study aims to explore how technology-enhanced peer assessment can be embedded into pre-instructional activities to enhance students’ learning.
therefore, the present study was an explorative descriptive study that used a qualitative approach to attain the research aim. this study used a questionnaire, students’ reflections, and an interview to collect students’ perceptions of the intervention. the results suggest that technology-enhanced pre-instructional peer assessment helps students prepare for acquiring new content and becomes a source of motivation for improving their learning performance in the main body of the lesson that follows. a set of practical suggestions is also proposed for designing and implementing technology-enhanced pre-instructional peer assessment.
keywords: peer assessment, pre-instructional activities, perceptions, statistical methods, higher education
introduction
there has been strong interest among higher education institutions in implementing peer assessment as a tool for enhancing students’ learning. indeed, the growth of computer technology has played a significant role in improving peer assessment applications in various educational settings (yang & tsai, 2010). this is also the case in mathematics learning. mathematics education researchers have shown substantial evidence of the benefits of technology-enhanced peer assessment for students’ learning (chen & tsai, 2009; peter, 2012; willey & gardner, 2010). specifically, tanner and jones (1994) posit that peer assessment helps students to reflect by reviewing the works of others and recalling their own works. the reflection process through which students recall their existing mental context is a fundamental component of learning (lee & hutchison, 1998; van woerkom, 2010; wain, 2017). therefore, this process of reflection meets the purpose of pre-instructional activities. in pre-instructional activities, students are expected to link their prior knowledge with the new content to be learned (dick, carey, & carey, 2015).
for this reason, it is reasonable to stimulate the reflection process by conducting peer assessment in pre-instructional activities. however, the literature shows little use of peer assessment in pre-instructional activities, though scott (2017) utilized simulated peer assessment to improve numerical problem-solving skills as a prerequisite for learning biology. the questions and the solutions used in scott’s study were not genuine students’ works but were constructed by the researcher. therefore, the present study tries to shed light on how to embed technology-enhanced peer assessment into pre-instructional activities to enhance students’ learning. this paper investigates students’ perceptions in an attempt to portray students’ learning.
technology-enhanced peer assessment
in understanding peer assessment, this study refers to the definition proposed by topping (1998), who defined peer assessment as a process in which students measure the learning achievement of their peers. in the process, students have two different roles, namely assessors and assessees. as assessors, they evaluate and, in many cases, provide feedback on the works of their fellow students. as assessees, they receive marks and feedback on their works and may act upon them. recent studies found that peer assessment has positive impacts on students’ learning. several studies demonstrate that peer assessment can benefit students in the assessment task, i.e. the quality of the assessment they provide (ashton & davies, 2015; gielen & de wever, 2015; jones & alcock, 2014; patchan, schunn, & clark, 2018). furthermore, peer assessment also affects students’ acquisition of knowledge and skills in the core domain.
in their study, hwang, hung, and chen (2014) show that peer assessment effectively promotes students’ learning achievement and problem-solving skills. in particular, gains in learning achievement were also shown in a statistics class (sun, harris, walther, & baiocchi, 2015). one possible rationale for such benefits of peer assessment for students’ learning is the exposure to the works of their peers. when students view their peers’ works, they compare and contrast the works with their alternative solutions. this process of comparing and contrasting has the potential to facilitate students’ learning (alfieri, nokes-malach, & schunn, 2013; reinholz, 2016). even though peer assessment has a number of advantages in facilitating learning, it also has several issues. the major concern in peer assessment is its validity as well as reliability (cho, schunn, & wilson, 2006). in his review, topping (1998) found disagreement on the degree of validity and reliability of peer assessment: some studies report high validity and reliability (haaga, 1993; stefani, 1994; strang, 2013), while others report otherwise (cheng & warren, 1999; mowl & pain, 1995). however, the issues regarding validity and reliability can be reduced by providing students with assessment rubrics (hafner & hafner, 2003; jonsson & svingby, 2007), since rubrics make expectations and criteria explicit. another issue regarding the peer assessment system is the administrative workload (hanrahan & isaacs, 2001). when implementing peer assessment in their class, instructors should at least manage the students’ submission, assessment, and grading evaluation. fortunately, these functions can be administered by using technology (kwok & ma, 1999). technology can be used to record and assemble the results of scoring and commentary efficiently. in addition, technology also enables the teacher to provide immediate feedback based on the automated score calculation.
in the spirit of making the most of peer assessment’s benefits and addressing its problems, peer feedback can be employed to accompany the peer assessment process. in peer feedback, students hold discussions with each other regarding performance and standards (liu & carless, 2006). they comment on or annotate the draft or final assignments of their peers to give advice for improving the assignments. when feedback comes with grading, it can be used to explain and justify the grade. it is also used to pose thought-provoking questions. the presence of thought-provoking questions can foster the assessees’ reflection on their assignments.
pre-instructional activity: theory and practice
from the instructional design perspective, gagné, briggs, and wager (1992) posit that instruction should be designed systematically to affect students’ development. thus, instructional activities should be designed to facilitate students’ learning. one major component of these activities is pre-instructional activities. they are done prior to beginning formal instruction and are important for motivating the students, informing them of the learning objectives, and stimulating recall of prerequisite skills. this study will not theoretically discuss all of the pre-instructional activities in depth. instead, it will briefly present examples of pre-instructional activities that appear in the literature. pre-instructional activities can be carried out using different strategies. this also applies to mathematics learning. loch, jordan, lowe, and mestel (2014), in a calculus of variations and advanced calculus class, use screencasts to help students revise the prerequisite knowledge regarding calculus techniques.
further, some scholars (jungić, kaur, mulholland, & xin, 2015; love, hodge, corritore, & ernst, 2015) use peer instruction as a pre-instructional strategy. the lesson introduction can also be done by simply telling the students about the prerequisites or testing them on entry skills (conner, 2015).
method
this study was explorative descriptive research employing a qualitative approach to explore how technology-enhanced peer assessment can be embedded into pre-instructional activities to enhance students’ learning. the following sections give details of the research setting, data collection, and data analysis.
research setting
the research was conducted at a private university in yogyakarta, indonesia, to investigate students’ perceptions of the peer assessment system in a statistical methods class. the class was conducted in a multimedia laboratory in which each student has a computer to assist them in learning statistics. the author was the instructor of the class. the class utilized exelsa, a moodle-based learning management system (lms) developed by the university, for course administration. in exelsa, the students can access learning materials, post to a forum, discuss a certain topic with their peers, submit their assignments, and assess and give feedback on their peers’ works. the class was conducted biweekly with 24 meetings of instruction, one meeting for the midterm exam, and one meeting for the final exam. each meeting consisted of 100 minutes of learning activities. in three out of the twenty-four meetings, the class began with a peer assessment activity. therefore, students had to submit their assignments before the class started. the assignments used in peer assessment were on the topics of one-way and two-way anova. the assignments were done individually and required microsoft excel and spss statistics for processing and analyzing the real data given in the problems. more details of the assignments will be described in the findings section.
the peer assessment system used in this study was the workshop module (dooley, 2009) provided by the lms. the peer assessment took place during the pre-instructional activities. the peer assessment system has five phases: setup, submission, assessment, grading evaluation, and reflection. in the setup phase, the instructor sets the introduction, provides submission instructions, and creates an assessment form. after all of the components are set up, the instructor can activate the submission phase. in this phase, students can submit and edit their assignments. optionally, they can also add a note to their assignments. however, students could only submit and edit their assignments before the class started. right after the class started, the instructor activated the assessment phase. in this phase, each student was randomly assigned to review the assignments of two peers. thus, each student also has two assessors. in reviewing their peers’ assignments, students used a rubric to obtain a more objective assessment. the grading strategy used in the peer assessment system was the number of errors, in which students grade each criterion by answering a yes-or-no question and optionally provide comments on the criterion. after all of the assignments were reviewed, the instructor switched on the grading evaluation phase, in which the submission and assessment grades of each student were calculated automatically. in the end, students could directly see the score and feedback provided by their peers and reflect on them; this last step is the reflection phase. the peer assessment process can be seen in figure 1.

figure 1. peer assessment phases

data collection

the data collection process in this study was conducted between may and june 2018 and was carried out in three phases.
in the first phase, the researcher asked the students to write a reflection about their learning experiences in the course. the researcher prompted the students to use the gibbs reflective model (gibbs, 1988). one learning experience that the students were asked to reflect on was their experience in the peer assessment activity. this phase of the data collection process was administered through the lms. in the second phase, a questionnaire adapted from brindley and scoffield (1998) was used to examine students’ perceptions of peer assessment. the questionnaire consists of three sections. the first section asked for the students’ personal data, while the second section asked about students’ perceptions of peer assessment. the last section invited students to assess how useful the peer assessment process was. the second phase was done in the week right before the final exam and administered through google forms. the third phase was conducted by interviewing three students about their general opinion of the learning process. the three students were purposively chosen to represent the range of students’ achievement. these students were interviewed together so that they would feel comfortable, since the interviewer was their lecturer. the interview was recorded with the approval of these students to prevent data loss.

in addition, logs of the three peer assessment activities in the lms were generated and downloaded. the log files record the students’ activities in the peer assessment system. once downloaded, the log data were sorted in microsoft excel to determine how long each student took to complete each assessment task. moreover, the data were also used to find the total time frame of the assessment phase in each meeting.

data analysis

data from the questionnaire and the logs were analyzed using descriptive statistics. students’ responses to each item of the questionnaire were described as a proportion or a mean value, whereas the log data were described as a mean value for each meeting.
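as an illustration of the log analysis described above, the following python sketch computes how long each student spent on each assessment task and averages the durations per meeting. the record layout (student id, meeting, task, start and end timestamps) and all values are hypothetical; the actual structure of the lms log export is not described in this paper, and the study itself carried out this step in microsoft excel.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical log records: (student, meeting, task, start, end).
# The real LMS export format is not given in the paper.
LOGS = [
    ("s01", 1, "assessment 1", "2018-05-07 09:00:00", "2018-05-07 09:21:00"),
    ("s01", 1, "assessment 2", "2018-05-07 09:21:00", "2018-05-07 09:27:00"),
    ("s02", 1, "assessment 1", "2018-05-07 09:01:00", "2018-05-07 09:20:00"),
    ("s02", 1, "assessment 2", "2018-05-07 09:20:00", "2018-05-07 09:25:00"),
]

FMT = "%Y-%m-%d %H:%M:%S"


def duration_minutes(start, end):
    """Elapsed time between two timestamps, in minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


def mean_duration_per_task(logs):
    """Mean duration in minutes for each (meeting, task) pair."""
    grouped = defaultdict(list)
    for _student, meeting, task, start, end in logs:
        grouped[(meeting, task)].append(duration_minutes(start, end))
    return {key: round(mean(values), 2) for key, values in grouped.items()}


for (meeting, task), minutes in sorted(mean_duration_per_task(LOGS).items()):
    print(f"meeting {meeting}, {task}: {minutes:.2f} min")
```

grouping by meeting and task mirrors the way the log data are summarized in this study, i.e., as one mean value per assessment task in each meeting.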
data from students’ reflections and the interview were examined and categorized by the researcher. the categories are derived from wen and tsai’s (2006, pp. 33–34) study, i.e., positive attitude, online attitude, understanding-and-action, and negative attitude. the data were labeled with the corresponding codes and analyzed with the atlas.ti software package (for more information about conducting qualitative data analysis with atlas.ti, see friese, 2014).

research participants

in total, 34 students were enrolled in the author-taught course under study: eight male and 26 female students. most students were in their junior year, with only five students in their senior year. all of the students were prospective mathematics teachers.

findings and discussion

findings

pre-instructional activities profiles

the three meetings used technology-enhanced peer assessment in the pre-instructional activities. at the beginning of each session, the instructor informed the students about the learning objectives to be achieved and linked the objectives with the previous assignments. the students were then asked to assess their peers’ works through the lms. during the peer assessment process, the instructor moved about the classroom, observed students’ progress on the assessment task, provided guidance if necessary, and answered questions as they arose. after the peer assessment process was complete, the instructor gave the students the opportunity to reflect on the score and feedback they received. this reflection was the end of the pre-instructional activities.

the assignments to be submitted before each meeting are described as follows. the first meeting required students to submit an assignment on the topic of one-way anova.
the assignment asked students to investigate whether there is a difference in the mean height of football players across positions, i.e., forward, midfielder, defender, and goalkeeper. in the assignment, the instructor provided real data obtained from various sources. in this meeting, two students did not submit their assignment, and two other students submitted their assignment but did not attend the class.

in the second meeting, the students were to submit a one-way anova problem from the accompanying textbook (bluman, 2012, p. 632). the problem asked them to determine the effective method for lowering blood pressure by examining the mean blood pressure of individuals in three samples categorized by the method they follow. the peer assessment process used in this meeting was slightly different from that of the previous meeting. in the assessment phase, the students had to assess an example submission provided by the instructor as assessment practice before they assessed their peers’ works. three students did not submit their assignment in this meeting.

in the third meeting, students were to submit their assignment for the 'car crash test measurements' problem from the accompanying textbook (triola, 2012, p. 643). in this problem, the students were instructed to test for an interaction effect, an effect from car type and car size. one student did not submit their assignment in this meeting, and three students submitted their assignment but did not attend the class.

the mean duration of the assessment tasks carried out by all students in each meeting was calculated and is reported in table 1. on average, the period from the opening of the assessment phase until its closing was 43.50 minutes. the table reveals a sharp decrease from the mean duration of the first assessment task to that of the second in each meeting.
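the column means in table 1 can be reproduced from the per-meeting values. the short python sketch below is only an arithmetic check of the reported figures; the variable and function names are illustrative. the total column is entered as reported rather than derived, since it covers the whole assessment phase from opening to closing and not just the sum of the individual tasks.

```python
from statistics import mean

# Values copied from Table 1 (minutes), in the order meeting 1, 2, 3.
# None marks meetings that had no example assessment.
TABLE_1 = {
    "example assessment": [None, 26.83, None],
    "assessment 1": [20.03, 12.69, 10.77],
    "assessment 2": [5.83, 6.72, 4.83],
    "total": [39.97, 59.80, 30.72],
}


def column_means(table):
    """Mean of each column over the meetings that have a value."""
    return {
        name: round(mean(v for v in values if v is not None), 2)
        for name, values in table.items()
    }


means = column_means(TABLE_1)
for name, value in means.items():
    print(f"{name}: M = {value:.2f}")

# Relative drop from the first to the second assessment task, per meeting.
drops = [
    round(100 * (first - second) / first, 1)
    for first, second in zip(TABLE_1["assessment 1"], TABLE_1["assessment 2"])
]
print("drop from assessment 1 to assessment 2 (%):", drops)
```

the last lines express the decrease from the first to the second assessment task as a percentage, which is one way to quantify the sharp decrease noted above.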
in particular, the decreasing trend also applied in the second meeting, when the students first reviewed an example submission. in this meeting, the students reviewed the example in nearly half an hour (26.83 min), the first peer’s work in almost a quarter of an hour (12.69 min), and the second peer’s work in just under seven minutes (6.72 min).

table 1. mean of assessment phase time-frame in minutes

            example assessment   assessment 1   assessment 2   total
meeting 1   –                    20.03          5.83           39.97
meeting 2   26.83                12.69          6.72           59.80
meeting 3   –                    10.77          4.83           30.72
mean        m = 26.83            m = 14.50      m = 5.79       m = 43.50

students’ perception

to investigate the students’ perceptions of peer assessment, this study employed both quantitative and qualitative data. the quantitative data were obtained from the questionnaire, while the qualitative data were obtained from the students’ reflections, the questionnaire, and the interview.

the questionnaire results show that most of the students (86.21%) in this study had previous experience of peer assessment. approximately three out of four students perceived the necessity of assessing their peers. further, only 27.59% of the students fully understood the expectations imposed on them when reviewing their peers’ works, whereas the rest had only a moderate understanding. in other words, all students understood, at least moderately, what was expected of them in the assessment tasks.

four items of the questionnaire were rating-scale questions used to explore the students’ perceived difficulty, fairness, pressure, and benefit of peer assessment. the means of the students’ responses to these items are shown in table 2. the students gave high ratings to the fairness and responsibility of their marking (m = 4.07) and to the benefits of peer assessment they received (m = 4.21).
with regard to the grading task, they tended to report some difficulty in assessing their peers’ works (m = 3.24). they also felt moderate pressure when doing the assessment task (m = 3.03). the sources of the pressure varied: more than half came from their role (62.07%), almost a third from their experiences (31.03%), and the rest from their peers (6.90%).

table 2. students’ perceptions scale on peer assessment

question                                                           mean
how difficult was assessing your peers’ work?                      3.24
how fair and responsible were you in assessing your peers’ work?   4.07
how much pressure did the experience put you under?                3.03
how beneficial was the peer-assessment to you?                     4.21

table 3. categories of students’ perceptions

category                   code (frequency)
positive attitude          helping learning (42), providing feedback (5), enabling interaction (3), sustainability (3), helping instructor (2), engaging (1), motivating (1)
online attitude            anonymity (1), efficiency (1), transparency (1)
understanding-and-action   grading strategy (8), action for improvement (7), assessment criteria (2)
negative attitude          credibility (15), no feedback (6), underestimating self-ability (3)

the students’ written reflections and the interview were used to examine the students’ perceptions as well. the perceptions were grouped into the four defined categories and are presented in table 3. the main theme of the students’ statements was the helpfulness of peer assessment in enhancing their learning. regarding this theme, first, students stated that peer assessment helps them engage in a reflective process, viz., reflecting on their mistakes shown by peers as well as reflecting on and reviewing their own works by comparing and contrasting them with peers’ works.
second, the students perceived the peer assessment process as a tool for knowledge building, since they have to review their knowledge when assessing others. they added that assessing their peers encouraged them to discuss with their friends when they were undecided about their assessment. this discussion led them to construct new knowledge in order to provide marking and feedback in the assessment task. third, the students thought that the peer assessment process develops their evaluative judgment skills regarding their own or others’ works when they provide feedback to peers. finally, the process of reviewing peers’ works gives critical understanding and develops higher-level learning skills, such as analyzing and evaluating. quotations from five students that reflect the benefits of peer assessment in enhancing their learning are given below:

in my opinion, the peer assessment is useful. (it is) because it encourages me to review my own works if there is a mismatch between my own works and peers. so, (i) learned twice at once regarding the works. (s6)

… because i don’t know (it is right or wrong) … i ask for help to my friend and found that my insight was improved. (s15)

this (peer) assessment was good to provide feedbacks to peers’ works as well as to be responsible with my marking. (s31)

(peer assessment) help us to think critically in assessing friends’ works. (s12)

… we also must evaluate the answer of our friends which indirectly makes us reviewing the topics so that we can know/analyze where the friends’ mistakes are. (s29)

assessment credibility is another major theme of students’ perceptions of peer assessment. on the one hand, the students agreed that peer assessment gives the instructor other perspectives to provide more accurate grading and timely feedback. on the other hand, the students also questioned their peers’ ability to assess their works.
it is possible that their peer assessors made an inaccurate assessment if the assessors’ own works were inaccurate, since the assessors often referred to their own works when undertaking an assessment task. underrating self-ability is also a source of credibility issues. when the students feel incompetent in the subject-specific tasks, they are afraid of not being able to provide appropriate judgments. reliability is the students’ next concern about peer assessment. they found that their assessors gave different grades on the same item. hence, they questioned their peers’ understanding of the rubric criteria given by the instructor. the following are the students’ statements related to the credibility of peer assessment.

peer assessment is very useful as if the instructor makes an error on assessment, it can be remedied by peers’ grading. (s8)

… however, the peer assessment doesn’t work optimally when the assessor lacks understanding on what being assessed. (moreover) the accuracy of each student’s assessment is different from one another. (s34)

… maybe the assessors’ opinions are different from each other, since there are two friends that get different scores although their answers are more or less the same. (s19)

the students thought that feedback is an important component of peer assessment. corrective feedback provided by peers helped the students to identify the errors in their works, whereas suggestive feedback was useful for making improvements later on. the importance of feedback was also reflected in students’ responses when they did not receive feedback. they believed that the assessors’ task was not only to give marks but also to provide constructive comments. some of the students’ comments regarding the importance of feedback are as follows.

the one who said ‘no’ also comment.
it is a constructive thing for us (to know) our mistakes that (the location of) the mistakes are in here, in here, and in here … there is a friend (that not only) said ‘correct’ but also give a comment, (you) should write like this and like this. so, that’s the positive. it’s like constructing (the understanding of) us. (s15)

sometimes there is a friend who said that our answer was not correct, but does not give a single comment. that’s it. so, we do not know where it goes wrong. (s24)

other aspects of peer assessment did not escape the students’ attention. with regard to the number-of-errors grading strategy, they perceived that it provided few options for marking peers’ works. instead of answering yes or no for each criterion, they would prefer a rating-scale strategy. however, they thought that the peer assessment process can facilitate discussion among students as well as student-instructor interaction. other benefits of technology-enhanced peer assessment also emerged. students stated that such an assessment model was transparent and efficient as well as engaging and motivating.

discussion

the aim of this study was to explore how technology-enhanced peer assessment can be embedded into pre-instructional activities to enhance the students’ learning. this paper interprets the students’ perceptions in an effort to investigate students’ learning experiences. in general, the research results show that technology-enhanced peer assessment holds significant promise as an effective pre-instructional strategy. the learning benefits provided by peer assessment meet the purpose of the pre-instructional strategy.

the findings of the present study show that the process of assessing and commenting on the works of others facilitates the students’ learning.
this finding is in line with the results of prior studies of peer assessment (hanrahan & isaacs, 2001; sun et al., 2015). one possible explanation of this finding can be derived from the comparative thinking perspective (alfieri et al., 2013; silver, 2010). when students review peers’ works, they compare and contrast them with their own works. if they doubt their own works, they ask others or the instructor for help. this process of comparing and contrasting helps them to rehearse their own understanding, which is useful in preparing them to gain new knowledge related to it.

the findings also suggest that peer assessment stimulates reflective thinking that drives action for improvement. similar to the results of other studies (davies & berrow, 1998; liu, lin, chiu, & yuan, 2001), the peer assessment process leads the students to think critically and reflect on the quality of their own works compared to others’. this evaluative process helps the students to devise a plan for improving their learning products later on. as feedback receivers, the students also take advantage of the feedback to enhance their learning. in other words, peer assessment can become a source of students’ motivation to improve their learning performance in the main body of the lesson that follows (jenkins, 2005).

the study also shows the importance of feedback in students’ learning. as a salient element of peer assessment, peer feedback helps students take an active role in their learning (liu & carless, 2006). when the students provide corrective feedback on their peers’ works, they develop an objective attitude in conducting their assessment task (nicol & macfarlane-dick, 2006). by providing suggestive feedback, the students think critically about the drawbacks of their peers’ works even when the works are correct (chi, 1996). as feedback receivers, the students use peers’ comments to improve their works.
moreover, peers’ comments have the potential to spark cognitive conflict when the comments contradict the student’s prior knowledge. from the socio-cognitive perspective, cognitive conflict is fundamental in facilitating students’ learning when it is successfully resolved (nastasi & clements, 1992).

however, the results of this study also reveal resistance to peer assessment. many students in this study had negative attitudes toward the fairness of peer grading. a similar result can also be found in the literature (cheng & warren, 1999; davies, 2000; liu & carless, 2006). the negative perceptions come from the students’ skepticism about the expertise of their fellow students. even when a rubric was provided, the students thought that some of their peers were not really fair in their marking. another issue arose from the grading strategy used in the assessment task. the correct and not-correct dichotomy into which students had to categorize their peers’ work is considered to be inflexible (sheatsley, 1983). the students want a more flexible grading strategy in order to be more confident in assessing their peers.

last but not least, the study has several limitations to be considered. the first limitation of the current study relates to its exploratory design in investigating the students’ learning experience. future studies with a larger sample and a longer period are needed to verify the evidence found in this study. second, this study only focuses on implementing peer assessment. comparative studies are needed to compare the effectiveness of peer assessment and other strategies, such as advance organizers and overviews, for use in pre-instructional activities.
finally, design-based studies could contribute to the future literature by providing peer assessment designs that optimize the learning transition from the lesson introduction to the main body of the lesson.

conclusion and suggestions

the contribution of this study is to show the potential of technology-enhanced peer assessment as a pre-instructional activity. the results of the current study, in general, suggest that technology-enhanced pre-instructional peer assessment helps the students to prepare for acquiring the new content of the following lesson. it is also found that peer feedback has a significant role in the peer assessment process in facilitating students’ learning.

based on the findings of the present study, the author proposes a set of suggestions for designing and implementing technology-enhanced pre-instructional peer assessment. first, training should be provided to students so that they can provide and manage feedback as well as act upon it effectively. second, discussions between students and the instructor about the assessment criteria are needed in order to improve students’ understanding of what is to be assessed in their fellow students’ works. if necessary, the instructor can also invite students to develop the assessment criteria. third, the instructor should monitor students’ attitudes toward the grading strategy. this monitoring aims to determine the suitability of the grading strategy for the students, tasks, and learning context. finally, the instructor should use the features of the assignment used in peer assessment (e.g., its content and context) as a link to the main body of the lesson that follows.

acknowledgment

the researcher would like to thank the students who participated in this study and lppm of universitas sanata dharma, which supported this study. in addition, the researcher expresses gratitude to russasmita sri padmi, who kindly agreed to edit this manuscript.

references

alfieri, l., nokes-malach, t. j., & schunn, c. d. (2013).
learning through case comparisons: a meta-analytic review. educational psychologist, 48(2), 87–113. https://doi.org/10.1080/00461520.2013.775712
ashton, s., & davies, r. s. (2015). using scaffolded rubrics to improve peer assessment in a mooc writing course. distance education, 36(3), 312–334. https://doi.org/10.1080/01587919.2015.1081733
bluman, a. g. (2012). elementary statistics: a step by step approach (8th ed.). new york, ny: mcgraw-hill.
brindley, c., & scoffield, s. (1998). peer assessment in undergraduate programmes. teaching in higher education, 3(1), 79–90. https://doi.org/10.1080/1356215980030106
chen, y., & tsai, c. (2009). an educational research course facilitated by online peer assessment. innovations in education and teaching international, 46(1), 105–117. https://doi.org/10.1080/14703290802646297
cheng, w., & warren, m. (1999). peer and teacher assessment of the oral and written tasks of a group project. assessment & evaluation in higher education, 24(3), 301–314. https://doi.org/10.1080/0260293990240304
chi, m. t. h. (1996). constructing self-explanations and scaffolded explanations in tutoring. applied cognitive psychology, 10(7), 33–49. https://doi.org/10.1002/(sici)1099-0720(199611)10:7<33::aid-acp436>3.0.co;2-e
cho, k., schunn, c. d., & wilson, r. w. (2006). validity and reliability of scaffolded peer assessment of writing from instructor and student perspectives. journal of educational psychology, 98(4), 891–901. https://doi.org/10.1037/0022-0663.98.4.891.supp
conner, k. (2015). investigating diagnostic preassessments. mathematics teacher, 108(7), 536–542.
davies, p. (2000). computerized peer assessment. innovations in education and training international, 37(4), 346–355. https://doi.org/10.1080/135580000750052955
davies, r., & berrow, t. (1998).
an evaluation of the use of computer supported peer review for developing higher-level skills. computers & education, 30(1), 111–115.
dick, w., carey, l., & carey, j. o. (2015). systematic design of instruction (8th ed.). boston, ma: pearson.
dooley, j. f. (2009). peer assessments using the moodle workshop tool. in proceedings of the 14th annual acm sigcse conference on innovation and technology in computer science education (vol. 41, pp. 344–344). new york, ny: acm. https://doi.org/10.1145/1562877.1562985
friese, s. (2014). qualitative data analysis with atlas.ti (2nd ed.). london: sage.
gagné, r. m., briggs, l. j., & wager, w. w. (1992). principles of instructional design (4th ed.). fort worth, tx: harcourt brace college.
gibbs, g. (1988). learning by doing: a guide to teaching and learning methods. london: feu.
gielen, m., & de wever, b. (2015). structuring the peer assessment process: a multilevel approach for the impact on product improvement and peer feedback quality. journal of computer assisted learning, 31(5), 435–449. https://doi.org/10.1111/jcal.12096
haaga, d. a. f. (1993). peer review of term papers in graduate psychology courses. teaching of psychology, 20(1), 28–32. https://doi.org/10.1207/s15328023top2001_5
hafner, j., & hafner, p. (2003). quantitative analysis of the rubric as an assessment tool: an empirical study of student peer-group rating. international journal of science education, 25(12), 1509–1528. https://doi.org/10.1080/0950069022000038268
hanrahan, s. j., & isaacs, g. (2001). assessing self- and peer-assessment: the students’ views. higher education research & development, 20(1), 53–70. https://doi.org/10.1080/07294360123776
hwang, g.-j., hung, c.-m., & chen, n.-s. (2014). improving learning achievements, motivations and problem-solving skills through a peer assessment-based game development approach. educational technology research and development, 62(2), 129–145.
jenkins, m. (2005).
unfulfilled promise: formative assessment using computer-aided assessment. learning and teaching in higher education, (1), 67–80.
jones, i., & alcock, l. (2014). peer assessment without assessment criteria. studies in higher education, 39(10), 1774–1787. https://doi.org/10.1080/03075079.2013.821974
jonsson, a., & svingby, g. (2007). the use of scoring rubrics: reliability, validity and educational consequences. educational research review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
jungić, v., kaur, h., mulholland, j., & xin, c. (2015). on flipping the classroom in large first year calculus courses. international journal of mathematical education in science and technology, 46(4), 508–520. https://doi.org/10.1080/0020739x.2014.990529
kwok, r. c. w., & ma, j. (1999). use of a group support system for collaborative assessment. computers & education, 32(2), 109–125.
lee, a. y., & hutchison, l. (1998). improving learning from examples through reflection. journal of experimental psychology: applied, 4(3), 187–210. https://doi.org/10.1037/1076-898x.4.3.187
liu, e. z.-f., lin, s. s. j., chiu, c.-h., & yuan, s.-m. (2001). web-based peer review: the learner as both adapter and reviewer. ieee transactions on education, 44(3), 246–251. https://doi.org/10.1109/13.940995
liu, n.-f., & carless, d. (2006). peer feedback: the learning element of peer assessment. teaching in higher education, 11(3), 279–290. https://doi.org/10.1080/13562510600680582
loch, b., jordan, c. r., lowe, t. w., & mestel, b. d. (2014). do screencasts help to revise prerequisite mathematics? an investigation of student performance and perception. international journal of mathematical education in science and technology, 45(2), 256–268. https://doi.org/10.1080/0020739x.2013.822581
love, b., hodge, a., corritore, c., & ernst, d. c. (2015).
inquiry-based learning and the flipped classroom model. primus: problems, resources, and issues in mathematics undergraduate studies, 25(8), 745–762. https://doi.org/10.1080/10511970.2015.1046005
mowl, g., & pain, r. (1995). using self and peer assessment to improve students’ essay writing: a case study from geography. innovations in education and training international, 32(4), 324–335. https://doi.org/10.1080/1355800950320404
nastasi, b. k., & clements, d. h. (1992). social-cognitive behaviors and higher-order thinking in educational computer environments. learning and instruction, 2(3), 215–238. https://doi.org/10.1016/0959-4752(92)90010-j
nicol, d. j., & macfarlane-dick, d. (2006). formative assessment and self-regulated learning: a model and seven principles of good feedback practice. studies in higher education, 31(2), 199–218.
patchan, m. m., schunn, c. d., & clark, r. j. (2018). accountability in peer assessment: examining the effects of reviewing grades on peer ratings and peer feedback. studies in higher education, 43(12), 2263–2278. https://doi.org/10.1080/03075079.2017.1320374
peter, e. e. (2012). critical thinking: essence for teaching mathematics and mathematics problem solving skills. african journal of mathematics and computer science research, 5(3), 39–43. https://doi.org/10.5897/ajmcsr11.161
reinholz, d. (2016). the assessment cycle: a model for learning through peer assessment. assessment & evaluation in higher education, 41(2), 301–315. https://doi.org/10.1080/02602938.2015.1008982
scott, f. j. (2017). a simulated peer-assessment approach to improve students’ performance in numerical problem-solving questions in high school biology. journal of biological education, 51(2), 107–122. https://doi.org/10.1080/00219266.2016.1177571
sheatsley, p. b. (1983). questionnaire construction and item writing. in p. h. rossi, j. d. wright, & a. b. anderson (eds.), handbook of survey research (pp. 195–230). new york, ny: academic press.
silver, h. f.
(2010). compare & contrast: teaching comparative thinking to strengthen student learning (a strategic teacher plc guide). alexandria, va: association for supervision & curriculum development.
stefani, l. a. j. (1994). peer, self and tutor assessment: relative reliabilities. studies in higher education, 19(1), 69–75. https://doi.org/10.1080/03075079412331382153
strang, k. d. (2013). determining the consistency of student grading in a hybrid business course using a lms and statistical software. international journal of web-based learning and teaching technologies (ijwltt), 8(2), 58–76. https://doi.org/10.4018/jwltt.2013040103
sun, d. l., harris, n., walther, g., & baiocchi, m. (2015). peer assessment enhances student learning: the results of a matched randomized crossover experiment in a college statistics class. plos one, 10(12), e0143177. https://doi.org/10.1371/journal.pone.0143177
tanner, h., & jones, s. (1994). using peer and self-assessment to develop modelling skills with students aged 11 to 16: a socio-constructive view. educational studies in mathematics, 27(4), 413–431. https://doi.org/10.1007/bf01273381
topping, k. (1998). peer assessment between students in colleges and universities. review of educational research, 68(3), 249–276. https://doi.org/10.3102/00346543068003249
triola, m. f. (2012). elementary statistics technology update (11th ed.). boston, ma: addison-wesley.
van woerkom, m. (2010). critical reflection as a rationalistic ideal. adult education quarterly, 60(4), 339–356. https://doi.org/10.1177/0741713609358446
wain, a. (2017). learning through reflection. british journal of midwifery, 25(10), 662–666. https://doi.org/10.12968/bjom.2017.25.10.662
wen, m. l., & tsai, c.-c. (2006). university students’ perceptions of and attitudes toward (online) peer assessment.
higher education, 51(1), 27–44. https://doi.org/10.1007/s10734-004-6375-8
willey, k., & gardner, a. (2010). investigating the capacity of self and peer assessment activities to engage students and promote learning. european journal of engineering education, 35(4), 429–443. https://doi.org/10.1080/03043797.2010.490577
yang, y.-f., & tsai, c.-c. (2010). conceptions of and approaches to learning through online peer assessment. learning and instruction, 20(1), 72–83.

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(1), 2017, 64-76
available online at: http://journal.uny.ac.id/index.php/reid
research article
a construct of the instrument for measuring junior high school mathematics teacher’s self-efficacy
* 1 rachmadi widdiharto; 2 badrun kartowagiran; 3 sugiman
*centre for the development and empowerment of mathematics teachers and educational personnel (pppptk matematika) of yogyakarta, jl. kaliurang km. 6, sambisari, condong catur, depok, sleman 55281, yogyakarta, indonesia
*email: rachmadiw@yahoo.com
submitted: 06 april 2017 | revised: 15 july 2017 | accepted: 15 july 2017

abstract
the aim of this study was to develop a construct of the instrument for junior high school mathematics teacher self-efficacy and its mapping in the special region of yogyakarta. the population was 816 junior high school mathematics teachers, and a sample of 274 teachers was selected through the proportionate random sampling technique. the data were analyzed using confirmatory factor analysis (cfa) with lisrel 8.80 software, through the first order and the second order stages. the analysis of the four dimensions yielded 11 fit items for the dimension of personal efficacy (pe), 12 fit items for the dimension of general teaching efficacy (gte), 13 fit items for the dimension of subject matter teaching efficacy (ste), and 8 fit items for the dimension of outcome efficacy (oe).
afterward, the 54 items selected in the first order stage were examined in the second order cfa, which showed that the model was fit to the data and retained 25 fit items. the loading factors for the dimensions pe, gte, ste, and oe were 0.46, 0.84, 0.89, and 0.92, respectively, and the mapping of the mathematics teacher self-efficacy level shows 43.07% in the low category, 55.47% in the medium category, and 1.46% in the high category.
keywords: self-efficacy, construct, mathematics, junior high school
how to cite item: widdiharto, r., kartowagiran, b., & sugiman, s. (2017). a construct of the instrument for measuring junior high school mathematics teacher's self-efficacy. reid (research and evaluation in education), 3(1), 64-76. doi:http://dx.doi.org/10.21831/reid.v3i1.13559

introduction
the government continually works to improve teacher quality through the fulfillment of the s-1/d-iv academic qualification, teacher certification, block grants for continuing study, the revitalization of the teachers working group (kelompok kerja guru/kkg) for elementary school teachers and the subject-matter teacher forum (musyawarah guru mata pelajaran/mgmp) for junior and senior high school teachers, and the bermutu program (better education through reformed management and universal teacher upgrading) (jalal et al., 2009, p. 124). however, these efforts have not yet given satisfactory results, because teachers' teaching practices do not support students' achievement in mathematics. the results of the timss (trends in international mathematics and science study) video study 2007 (leung & ragatz, 2010, p.
33) state that most junior high school mathematics teachers in indonesia use 76% of their time for problem activity and 24% for non-problem activity, while the figures are 82% and 18% in japan and 85% and 15% in hong kong. the report of the training need assessment and recruitment (pppptk matematika, 2007, p. 46), based on a sample of 268 teachers in 15 provinces, showed that approximately 61.78% of the teachers had difficulty with mathematics learning associated with problem solving. in further interviews, the respondents tended to avoid delivering learning materials they considered difficult. when they had no choice but to deliver such material, they lacked confidence in their performance or in their teaching practice. sumardyono (2011, p. 244), in a study of math anxiety among 89 participants of mathematics teacher training at pppptk matematika in 2010, from the district of banjarmasin, south kalimantan, used an adaptation of the mathematics anxiety rating scale (mars) and showed that the level of anxiety increased gradually from teachers at higher levels of education to teachers at lower levels. that is, high school teachers had a lower level of anxiety than junior high school mathematics teachers or primary school teachers. meanwhile, preliminary research conducted by the researchers with 38 junior high school mathematics teachers in java who attended pppptk matematika training activities in 2012, adopting the teacher efficacy scale (tes) from tschannen-moran, hoy, and hoy (1998), found that 13.16% of them were in the high efficacy category and approximately 86.84% were in the medium category.
the teachers were also assessed with the mathematics teacher's efficacy belief instrument (mtebi) developed by enoch & smith (1997), which showed that most (almost all) of their mathematics beliefs were in the medium category, and no participant was in the low or high category. hastuti et al. (2009) mention that teacher certification improves teachers' welfare and likely improves teacher quality because teachers can concentrate more and become more motivated. however, they were not convinced of this, because improving quality and performance is a matter of personal commitment. sadtyadi and kartowagiran (2014, p. 291) mention that an assessment carried out while the teacher is teaching can hardly describe the teacher's actual performance, because teachers tend to be better prepared when they are being monitored than when they are not. given that teacher competence is still not optimal and that teachers themselves lack confidence in carrying out the tasks they are responsible for, it is difficult to expect them to teach the material to their students well. pajares (1996, p. 544) refers to bandura’s opinion, which defines self-efficacy as a belief about one's ability to successfully perform certain tasks in certain situations. self-efficacy is also defined as an assessment of a person's ability to organize and execute the courses of action required to complete a given type of work, primarily, for mathematics teachers, the duties associated with fostering students’ mathematical power (kastberg, d’ambrosio, mcdermot, & saada, 2005, p. 10).

self-concept
burn (1984) states that self-concept is a composite image of what we think we are, what we think we can achieve, what we think others think of us, and what we would like to be. most social psychologists, one of whom is rokeach (burn, 1984, p.
52), agree that self-concept, as a set of self-attitudes, appears to embody four components: (a) a belief, knowledge, or cognitive component, (b) an affective or emotional component, (c) an evaluation, and (d) a predisposition to respond. an attitude is a relatively enduring organization of beliefs around an object or situation that predisposes a person to respond in a preferred way. thus the self-concept is more of a hypothetical construct. in other words, it is a useful way to predict the attitude or behavior of a person, but one must be careful not to ‘filter’ or judge the construct as something that exists in the real world.

self-efficacy
another concept related to an individual's beliefs or self-representation is self-efficacy. bandura in keller (2010, p. 146) mentions another concept related to the belief in personal agency, i.e. self-efficacy, which typically refers to a person's belief that he or she can succeed in performing a given task. in line with this definition, bandura also describes self-efficacy as people's judgments of their capabilities to organize and execute courses of action required to attain designated types of performances. based on those opinions, self-efficacy can be defined as a belief or judgment about a person's ability to organize and execute the courses of action required to complete a given type of work. keller (2010, p. 146) states that a person's self-efficacy comprises a combination of beliefs related to three questions: am i capable of doing the things that are necessary for success, developing a plan that will lead to success, and persisting in my effort long enough to achieve success?
thus, the strength of a person's self-efficacy can be expected to determine whether planned behavior is repeated or modified, how much effort will be made, and how long one will persist in the face of obstacles and challenging experiences.

mathematics teacher competency
law no. 14 year 2005 of the republic of indonesia about teachers and lecturers, article 32, states that the promotion and development of the teaching profession as referred to in paragraph (1) includes pedagogical competence, personal competence, social competence, and professional competence. in relation to the competence of mathematics teachers, there are some opinions that highlight the mastery of substance, achievement in performing or teaching in the classroom, peer assessment, or preparation of a portfolio. fennema and franke (turnuklu & yesildere, 2007, p. 2) mention several components of mathematical knowledge that a mathematics teacher should possess: knowledge of mathematics, knowledge of mathematical representations/symbols, knowledge of the students, and knowledge about teaching and decision making. another opinion, from kulm and wu (turnuklu & yesildere, 2007, p. 3), holds that beliefs reciprocally underlie the substance of pedagogical content knowledge. pedagogical content knowledge comprises three components: content knowledge, teaching practice, and knowledge of the curriculum, each of which interacts reciprocally with the others. in the practice of teaching, a teacher must understand the thinking of his/her students (knowing students' thinking). this understanding is translated into five components, namely: addressing students' misconceptions, engaging students in math learning, student learning, promoting student thinking in mathematics, and building on students' math ideas.
thus, a mathematics teacher's beliefs form the basis for the substance of pedagogical knowledge, which ultimately leads to active student participation, the anticipation of misconceptions, and the building of mathematical ideas.

mathematics teacher self-efficacy
related to teacher efficacy, several studies support the theory that belief in one's ability is the best predictor of task-completion behavior (bandura, 1996, 1997; pajares, 1996, in leder, pehkonen, & torner, 2002, p. 216). referring to bandura’s concept of self-efficacy as confidence in one's ability to organize and carry out the actions needed to generate the expected result, philippou and christou (leder et al., 2002, p. 217) mention that teaching efficacy can be understood as teachers' belief in their ability to organize and create effective learning environments. the activities and actions of teachers depend more on what they believe than on what they know or the competence they have attained. similarly, hoy and spero (2005, p. 29) describe teachers' sense of efficacy as teachers' judgments about their capabilities to promote student learning. gibson and dembo (leder et al., 2002, p. 218) classify teacher self-efficacy into two factors: general teaching efficacy (gte) and personal teaching efficacy (pte). gte refers to teachers’ general feeling that their teaching and the education system will be able to grow and develop students' academic achievement despite negative influences external to the teachers. meanwhile, personal teaching efficacy (pte) reflects teachers' conviction in their own ability to bring about significant learning and student achievement. furthermore, philippou and christou (leder et al., 2002, p.
217) state that efficacy beliefs about the teaching of mathematics are mostly, but not entirely, shaped by one's experience and knowledge of mathematics and its pedagogy. the process skills of mathematics teachers should also be developed, i.e. skills in reasoning, understanding of concepts, relationships between concepts, representation, communication, and problem solving. in relation to teachers’ efficacy and competence, tschannen-moran et al. (1998) proposed an integrated, cyclical model of teacher efficacy. furthermore, with reference to lee (2009, p. 15) on the teacher's self-concept, as well as the opinion of gibson and dembo (1984) on outcome efficacy, a system approach is used to develop the model presented in figure 1.

method
research on the construct of the instrument for measuring junior high school mathematics teacher self-efficacy is a kind of developmental research (borg & gall, 1983, p. 775) that aims to obtain a construct of dimensions or factors related to the self-efficacy of mathematics teachers, especially junior high school mathematics teachers in the special region of yogyakarta. this study was conducted over four months (from september to december 2013) in four regencies, namely sleman, bantul, kulonprogo, and gunungkidul, as well as in the municipality of yogyakarta in the special region of yogyakarta.

research design
the development of the construct in this study adapted borg & gall’s model, which was simplified from 10 steps into 6 steps. the modified development procedure is presented in figure 2.

figure 1. development of “the cyclical nature of teacher efficacy” (tschannen-moran et al., 1998) and “outcome efficacy” (gibson & dembo, 1984; soodak & podell, 1996) through the system approach.
[figure 1 diagram; recoverable labels: teacher’s self concept (lee, 2009); sources of efficacy information; cognitive process; analyzing the teaching task; assessing personal teaching competence; teacher efficacy; consequences of teacher efficacy; performance; consequences of performance; stages: input, process, output, outcome]

figure 2. research design on the development of the instrument of junior high school mathematics teacher self-efficacy

there are six stages in this research design.
first, a construct was designed. this activity consisted of context analysis, relevant literature review, and prototype designing. at this stage, a preliminary instrument of mathematics teacher efficacy was developed. based on the relevant theory and literature, it consisted of four dimensions, namely: personal efficacy (pe), general teaching efficacy (gte), subject-matter teaching efficacy (ste), and outcome efficacy (oe). the prototype instrument had 94 items in total: pe (25 items), gte (26 items), ste (33 items), and oe (10 items). a likert scale with ratings of 1–4 was used.
the second stage was validation by experts. this activity was a focus group discussion (fgd) involving eight experts or specialists from universities: two mathematics education experts, three psychometric experts, one educational psychologist, and two experts on teacher training. the aspects assessed included the blue-print and indicators, the clarity of the instrument, and the model development. from the fgd results, the content validity coefficient computed with aiken's validity index (aiken, 1985, p. 132; azwar, 2013, p. 134) was 0.71, meaning that the instrument could be used for collecting data.
the third stage was the limited testing. this activity involved 32 people: 22 mathematics teachers (from three districts in the province), seven principals, and three supervisors.
the readability test yielded a score of 4.13, which means that the instrument could be used.
the fourth stage was the revision or improvement. based on the expert judgement in the fgd and the limited testing, the instrument was revised in accordance with the input and advice from the experts.
the fifth stage was the extended testing, in which the test subjects were 274 mathematics teachers in the special region of yogyakarta.
the sixth stage was the final product and its use. the data from the extended testing were analyzed using lisrel 8.80 through the first order and the second order cfa in order to obtain a construct that fit the data. the researchers then used the resulting instrument to map the level of mathematics teacher self-efficacy.

population and sample
the population of this research was junior high school mathematics teachers, according to the 2012 data of the provincial education department of yogyakarta; it consisted of 816 teachers. using the proportionate random sampling technique (cochran, 2010, p. 85), the researchers established a sample of 274 teachers: 38 from yogyakarta city, 85 from sleman, 70 from bantul, 38 from kulonprogo, and 43 from gunungkidul. the sample size in cfa is determined by the number of observed variables or items. according to hair, black, babin, and anderson (2006), a sample size of 100-200 is recommended for maximum likelihood (ml) estimation.
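
the proportionate allocation described above can be sketched in python. this is only an illustration: the article reports the population total (816) and the per-district sample sizes, but not the per-district population sizes, so the stratum populations below are hypothetical placeholders, and the largest-remainder rounding is one common choice rather than the authors' documented procedure.

```python
from math import floor

def proportional_allocation(strata_sizes, n):
    """Allocate a total sample of n across strata in proportion to stratum
    size, using largest-remainder rounding so the parts sum exactly to n."""
    total = sum(strata_sizes.values())
    quotas = {k: n * size / total for k, size in strata_sizes.items()}
    alloc = {k: floor(q) for k, q in quotas.items()}
    leftover = n - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_fraction = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc

# Per-district sample sizes as reported in the article.
reported_sample = {"yogyakarta city": 38, "sleman": 85, "bantul": 70,
                   "kulonprogo": 38, "gunungkidul": 43}

# HYPOTHETICAL stratum populations (sum = 816); not taken from the article.
hypothetical_population = {"yogyakarta city": 113, "sleman": 253, "bantul": 209,
                           "kulonprogo": 113, "gunungkidul": 128}

allocation = proportional_allocation(hypothetical_population, 274)
sampling_fraction = 274 / 816  # roughly one teacher in three
```

with real stratum sizes from the provincial data, the same function would give the per-district quotas before teachers are drawn at random within each district.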
[figure 2 flowchart: construct design (context analysis, designing prototype) → expert validity (focus group discussion, fgd) → limited testing, readability test (math teachers, principals, supervisors) → instrument revision (initial product) → extended testing (junior high school math teachers) → final product of the instrument of junior high school math teacher self-efficacy and its use]

data analysis technique
the data were analyzed using first order and second order confirmatory factor analysis (cfa) with lisrel 8.80. the model fit was evaluated with the p-value of the chi-square (χ²) statistic and the root mean square error of approximation (rmsea). the model was declared fit if the chi-square (χ²) was not significant, i.e. p-value ≥ 0.05, meaning that there is no significant difference between the model and the data (joreskog & sorbom, 2003, p. 128). for the rmsea, a value ≤ 0.05 indicates a close fit and a value ≤ 0.08 indicates a good fit. furthermore, the fit instrument construct was used to map the level of mathematics teacher self-efficacy according to the score of mathematics teacher self-efficacy (x): x < μ − 1σ (low category), μ − 1σ ≤ x < μ + 1σ (medium category), and μ + 1σ ≤ x (high category).

findings and discussion
the first order analysis
dimension of personal efficacy (pe)
this dimension consists of three indicators: mathematics self-concept, math anxiety, and internalizing the source of efficacy. pe initially consisted of 25 items; the number of items was reduced gradually in the first order analysis to obtain a fit model. items v2, v5, v6, v7, v9, and v10 were eliminated because the t-value of the loading factor < 1.96.
items v24 and v25 were also eliminated because of negative loading factor values. items v3, v13, v18, v19, v20, and v23 were eliminated because their error terms shared variance with other items, which prevented the dimension from reaching an acceptable goodness of fit. the result of the first order cfa showed that the model was fit to the data, with chi-square = 53.61; df = 44; p-value = 0.15201; and rmsea = 0.020. the number of items decreased from 25 to 11, with loading factors of 0.19 to 0.64. the items retained in the pe dimension were v1, v4, v8, v11, v12, v14, v15, v16, v17, v21, and v22.

dimension of general teaching efficacy (gte)
this dimension consisted of four indicators: pedagogy content knowledge, classroom management, student engagement, and parental involvement. gte initially consisted of 26 items; the number of items was reduced gradually in the first order analysis to obtain a fit model. items v29 and v30 were eliminated because the t-value of the loading factor < 1.96. items v26 and v43 were also eliminated because their error variance was greater than their loading factor, which prevented an acceptable goodness of fit. the result of the first order cfa showed that the model was fit to the data, with chi-square = 66.59; df = 54; p-value = 0.11670; and rmsea = 0.029. the number of items decreased from 26 to 12, with loading factors of 0.22 to 0.71. thus, the fit items in the gte dimension were v28, v31, v35, v38, v41, v44, v45, v47, v48, v49, v50, and v51.

dimension of subject-matter teaching efficacy (ste)
this dimension consisted of three indicators: knowledge of junior high school mathematics content, teaching strategies, and fostering student mathematical power. ste initially consisted of 33 items; the number of items was reduced gradually in the first order analysis to obtain a fit model.
items v52, v56, v57, v58, v66, and v67 were eliminated because the t-value of the loading factor < 1.96. items v53, v54, v59, v60, v61, v62, v65, v68, v69, v70, v71, v74, and v78 were removed because their error variance was much greater than their loading factor, which prevented an acceptable goodness of fit. the result of the first order cfa showed that the model was fit to the data, with chi-square = 24.49; df = 20; p-value = 0.2216; and rmsea = 0.029. the number of items decreased from 33 to 13, with loading factors from 0.22 to 0.40. thus, 13 items were left in the ste dimension: v55, v63, v72, v73, v75, v76, v77, v79, v80, v81, v82, v83, and v84.

dimension of outcome efficacy (oe)
the dimension of outcome efficacy (oe) consisted of three indicators, namely: student achievement, building mathematics attitude, and continuing study. oe initially consisted of 10 items, and the number was reduced gradually in the first order analysis to obtain a fit model. all items had a t-value of loading factor > 1.96 and no negative loading factor value. however, items v86 and v91 were eliminated because their error variance was much greater than their loading factor, which prevented an acceptable goodness of fit. the result of the first order cfa showed that the model was fit to the data, with chi-square = 14.60; df = 9; p-value = 0.10256; and rmsea = 0.048. the number of items decreased from 10 to 8, with loading factors from 0.32 to 0.42. the eight items retained were v85, v87, v88, v89, v90, v92, v93, and v94.

the second order analysis
based on the items obtained in each dimension in the first order analysis, the second order cfa was carried out.
several simulations and iterations among these dimensions were done to obtain a fit model, such as: pe and gte; ste and oe; and pe, gte, and ste. finally, iterating over all four dimensions (pe, gte, ste, and oe) produced a construct model that was fit to the data, as presented in figure 3.

figure 3. path diagram of the second order analysis output

table 1. results of the second order cfa of the instrument of junior mathematics teachers’ self-efficacy with 25 items

item       loading factor   t-value   r2     result
dimension: pe
item 8     0.40             -         0.19   reference item
item 11    0.61             2.68      0.37   item fit
item 14    0.58             2.62      0.30   item fit
item 15    0.48             2.47      0.25   item fit
item 16    0.65             2.63      0.42   item fit
item 17    0.66             2.67      0.44   item fit
item 22    0.42             2.53      0.21   item fit
dimension: gte
item 28    0.57             -         0.33   reference item
item 31    0.42             6.41      0.18   item fit
item 35    0.59             7.20      0.35   item fit
item 44    0.55             6.05      0.30   item fit
item 45    0.62             6.70      0.38   item fit
item 47    0.52             5.48      0.27   item fit
item 48    0.64             7.65      0.41   item fit
dimension: ste
item 63    0.62             -         0.39   reference item
item 72    0.48             6.04      0.23   item fit
item 73    0.56             8.11      0.31   item fit
item 75    0.52             7.68      0.28   item fit
item 77    0.66             8.71      0.43   item fit
item 81    0.56             8.35      0.31   item fit
item 84    0.64             8.42      0.41   item fit
dimension: oe
item 87    0.52             -         0.28   reference item
item 89    0.78             8.17      0.60   item fit
item 90    0.76             8.00      0.57   item fit
item 94    0.69             7.31      0.48   item fit

the test of the second order cfa measurement model on the 54 items resulted in p-value = 0.12824 (p > 0.05) and rmsea = 0.019 (rmsea < 0.05). since both the p-value and rmsea criteria were met, it could be concluded that the model was fit to the data. the rmsea value of 0.019 indicates that the model fits the data very well.
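
as a rough cross-check of the fit statistics, the rmsea point estimate can be recomputed from a reported chi-square, its degrees of freedom, and the sample size. the sketch below uses the common formula sqrt(max(χ² − df, 0) / (df (n − 1))); whether lisrel 8.80 uses n or n − 1 in the denominator is an assumption here, so small rounding differences from published values are possible.

```python
from math import sqrt

def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from a model chi-square, its degrees of
    freedom, and the sample size n (assumes the n - 1 convention)."""
    return sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

N = 274  # extended-testing sample size reported in the article

second_order = rmsea(297.58, 271, N)  # second order CFA
gte = rmsea(66.59, 54, N)             # first order CFA, GTE dimension
oe = rmsea(14.60, 9, N)               # first order CFA, OE dimension

print(round(second_order, 3), round(gte, 3), round(oe, 3))
# prints: 0.019 0.029 0.048
```

a value at or below 0.05 would then be read as a close fit and a value at or below 0.08 as a good fit, following the criteria in the data analysis section.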
all 25 items are therefore valid indicators for measuring the construct of self-efficacy of junior high school mathematics teachers. these results also showed that the 25 items measured one latent variable, namely the self-efficacy of mathematics teachers. it was concluded that the self-efficacy measurement instrument for mathematics teachers met the unidimensionality assumption. table 1 lists all fit items resulting from the second order cfa for measuring junior high school mathematics teacher self-efficacy. based on the t-values of the second order cfa, all of the items were fit to measure junior high school mathematics teacher self-efficacy because every t-value was greater than 1.96. table 1 also shows that item 89 makes the highest contribution to the measuring instrument, with a loading factor of 0.78, while item 15 makes the smallest contribution, with a loading factor of 0.38.

mapping of mathematics teacher self-efficacy
the level of mathematics teacher self-efficacy was obtained by interpreting the individual scores of the 274 mathematics teachers on the 25 fit items. the scores obtained from the questionnaire are raw scores, which first need to be converted into standard z-scores with μ = 0 and σ = 1. however, because z-scores can be negative, for ease of reading and interpretation they are then converted into t-scores with μ = 50 and σ = 10. the conversion was carried out with a simple ms excel program, and the categorization described in the data analysis section was applied; the resulting mapping of mathematics teacher self-efficacy for each dimension is shown in table 2. meanwhile, the percentage of respondents’ mtse level is shown in figure 4.
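
the scoring and categorization steps can be sketched as follows. the raw totals are made-up numbers for illustration only, the use of the population standard deviation is an assumption (the article does not say which one the excel program used), and the cut-offs follow the rule stated in the data analysis section: below μ − 1σ is low, at or above μ + 1σ is high, and everything in between is medium.

```python
from statistics import mean, pstdev

def to_t_scores(raw_scores):
    """Convert raw scores to z-scores (mean 0, sd 1) and then to
    T-scores (mean 50, sd 10)."""
    mu, sigma = mean(raw_scores), pstdev(raw_scores)
    return [50 + 10 * (x - mu) / sigma for x in raw_scores]

def categorize(score, mu=50.0, sigma=10.0):
    """Three-level categorization around the mean of the T-score scale."""
    if score < mu - sigma:
        return "low"
    if score < mu + sigma:
        return "medium"
    return "high"

# HYPOTHETICAL raw totals on 25 items rated 1-4 (possible range 25-100).
raw = [62, 70, 75, 75, 78, 80, 81, 83, 85, 96]
t_scores = to_t_scores(raw)
levels = [categorize(t) for t in t_scores]
```

counting the labels in `levels` per category gives a frequency table of the same shape as the mapping reported for the 274 teachers.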
discussion
the second order analysis gave chi-square = 297.58; df = 271; p-value = 0.12824; and rmsea = 0.019; 25 of the 54 items were retained, with loading factors (λ) for the dimensions pe, gte, ste, and oe of 0.46, 0.84, 0.89, and 0.92, respectively. thus it can be said that the model was fit to the data.

dimension of personal efficacy (pe)
the dimension of personal efficacy contains three indicators; from the initial 11 items (the first order) it decreased to 7 items (the second order). those three indicators include (a) mathematics self-concept, with the item descriptor: efficacy of the ability to provide the information students need in learning mathematics (v8); (b) mathematics anxiety, with the item descriptors: efficacy on the readiness of the teachers when they teach mathematics (v11), tranquility or comfort during mathematics learning (v14), the level of concern about material that has not been mastered (v15), difficulty in concentrating while teaching mathematics (v16), and concern when other people observe their teaching (v17); (c) the internalization of the source of efficacy, with the item descriptor: efficacy regarding social persuasion such as invitations, suggestions, and verbal advice from a colleague that can push them to perform a task (v22).

table 2. frequency recapitulation of mathematics teacher self-efficacy (mtse) level for each dimension

category       pe    gte   ste   oe    mtse
low (l)        82    94    94    93    93
medium (m)     192   179   181   170   178
high (h)       0     2     4     1     4
total number   274

figure 4. graphic of percentage mtse level
[figure 4 values: 64.96%, 33.94%, 1.46%]

dimension of general teaching efficacy (gte)
the general teaching efficacy dimension contains four indicators; from the initial 12 items (the first order) it decreased to 7 items (the second order).
three of those indicators were retained: (a) pedagogy content knowledge of mathematics, with the item descriptor: efficacy of the ability to apply appropriate learning strategies in classroom practice (v28); (b) classroom management, with the item descriptors: efficacy toward the ability to encourage students to obey the rules in class (v31) and to explain the steps that students must perform in learning inside and outside the classroom (v35); (c) students’ engagement, with the item descriptors: efficacy toward the ability to help students be actively involved in fun and meaningful learning (v44), maintain or restore the students’ attention so that they stay focused on the material presented (v45), enhance students' understanding (v47), and assure the students that they can complete the lesson tasks at school well (v48). the gte dimension actually has one more indicator, namely promoting parental involvement in helping children learn mathematics. in the first order analysis of gte, all three items representing this indicator were retained, but in the second order cfa they had to be eliminated for the model to fit. in this study, the respondents might consider that this indicator was not required to measure mathematics teacher self-efficacy. in other words, the respondents believed that parental involvement might not help their children's learning; out-of-school learning guidance, additional lessons, or private lessons would take over this role.

dimensions of subject-matter teaching efficacy (ste)
the dimension of ste (subject-matter teaching efficacy) consists of three indicators. the initial 13 items (the first order) decreased to 7 items (the second order).
those three indicators are: (a) the strategy of mathematics teaching, with the item descriptor: efficacy toward the ability to guide students in using image and symbol representations for mathematics learning (v63); (b) fostering students' mathematical power, with item descriptors covering efficacy toward the ability to capture gaps between students' capability and the competencies expected (v72), guide students in examining the true relationship between one statement and others (v73), guide students in developing a conjecture from available premises (v75), design learning that encourages students to appreciate the benefits of mathematics (v77), manage the provision of questions to students (v81), and provide questions to students relating the ideas of mathematics to their applications (v84).

in fact, there is one more indicator in the ste dimension, namely acquiring mathematics content knowledge. in the first-order analysis of ste, one out of five items represented this indicator, but in the second-order cfa this item had to be eliminated for the model to fit. it means that, in this study, this indicator was not required to measure mathematics teachers' self-efficacy. in other words, assessing the acquisition of mathematics content knowledge only by asking through a questionnaire is not enough; using a test to measure this domain is more reasonable. the existence of teacher competency testing would support the absence of this indicator.

dimension of outcome efficacy (oe)

the dimension of outcome efficacy (oe) consists of three indicators; its initial eight items (first order) decrease to four items (second order).
those three indicators include: (a) student achievement, with the item descriptor: efficacy toward the ability to guide students to succeed in the mathematics contest or mathematics olympiad at the district level (v87); (b) building mathematics attitudes, with the item descriptors: belief that the mathematics learning done by the teacher promotes students' critical, logical, and consistent thinking (v89), and efficacy toward the belief that learning is done to guide the students to be honest, disciplined, and responsible (v90); (c) continuing study, with the item descriptor: efficacy toward the belief that learning is done to equip students to practice problem solving in their future life (v94).

based on the discussion, a framework of a construct for junior high school mathematics teacher self-efficacy in the special region of yogyakarta can be made, consisting of four dimensions. the first dimension is personal efficacy (pe), with a loading factor of 0.41, consisting of the indicators mathematics self-concept, mathematics anxiety, and internalization of the source of efficacy. the second dimension is general teaching efficacy (gte), with a loading factor of 0.84, consisting of the indicators pedagogy content knowledge, classroom management, and student engagement. the third dimension is subject-matter teaching efficacy (ste), with a loading factor of 0.89, consisting of the indicators teaching strategies and fostering students' mathematical power. the fourth dimension is outcome efficacy (oe), with a loading factor of 0.92, consisting of the indicators student achievement, building mathematics attitude, and continuing study.

mapping of mathematics teacher self-efficacy

figure 4 shows that the percentage of mathematics teacher self-efficacy is 1.46% in the high category, 64.96% in the medium category, and 33.94% in the low category.
with the hope that mathematics teachers ideally reach the high category, as many as 98.54% of mathematics teachers still need their efficacy enhanced to reach that category. in order to give an idea of the profile of mathematics teacher self-efficacy in these categories, and to make it easier to follow up the results of the efficacy measurement in the continuing professional development of mathematics teachers, table 3 gives a general description of the mtse (mathematics teacher self-efficacy) profile, referring to the mtse indicators and items which fit and are significant.

table 3. general description of mathematics teacher self-efficacy

mtse category: low (score: 24.96 – 49.92)
a. not sure: the importance of understanding and the role of mathematics self-concept, overcoming math anxiety, and internalization of the sources of self-efficacy, in carrying out the task of teaching responsibility.
b. not sure: able to master knowledge of the pedagogical substance, classroom management, and students' engagement.
c. not sure: able to perform mathematical learning strategies and foster students' mathematical power.
d. not sure: able to improve student achievement, students' mathematical attitudes, as well as provisions for the continuation of the study at the next level.

mtse category: medium (score: 49.93 – 74.88)
a. sure: the importance of understanding and the role of mathematics self-concept, overcoming math anxiety, and internalization of the sources of self-efficacy, in carrying out the task of teaching responsibility.
b. sure: able to master knowledge of the pedagogical substance, classroom management, and students' engagement.
c. sure: able to perform mathematical learning strategies and foster students' mathematical power.
d. sure: able to improve student achievement, students' mathematical attitudes, as well as provisions for the continuation of the study at the next level.

mtse category: high (score: 74.89 – 100.00)
a. very sure: the importance of understanding and the role of mathematics self-concept, overcoming math anxiety, and internalization of the sources of self-efficacy, in carrying out the task of teaching responsibility.
b. very sure: able to master knowledge of the pedagogical substance, classroom management, and students' engagement.
c. very sure: able to perform mathematical learning strategies and foster students' mathematical power.
d. very sure: able to improve student achievement, students' mathematical attitudes, as well as provisions for the continuation of the study at the next level.

conclusion and suggestions

based on the findings, some conclusions are drawn. first, the results of the second-order analysis of the construct of the instrument for measuring the self-efficacy of junior high school mathematics teachers in the special region of yogyakarta show that the model fits the data, as indicated by chi-square = 297.58; df = 271; p-value = 0.12824; rmsea = 0.019. of the 54 items, 25 are retained, with the factor loadings of the pe, gte, ste, and oe dimensions being 0.46, 0.84, 0.89, and 0.92, respectively.

the construct of the instrument for measuring the self-efficacy of junior high school mathematics teachers in the special region of yogyakarta consists of four dimensions. first, the dimension of personal efficacy (pe), with a loading factor of 0.41, with the indicators: mathematics self-concept, mathematics anxiety, and internalization of the source of efficacy. second, the dimension of general teaching efficacy (gte), with a loading factor of 0.84, with the indicators: pedagogy content knowledge, classroom management, and students' engagement.
third, the dimension of subject-matter teaching efficacy (ste), with a loading factor of 0.89, with the indicators: teaching strategies and fostering students' mathematical power. fourth, the dimension of outcome efficacy (oe), with a loading factor of 0.92, with the indicators: student achievement, building mathematics attitude, and continuing study.

the results of the mapping of the self-efficacy of mathematics teachers in the special region of yogyakarta show that 43.07% of the teachers are categorized as low, 55.47% as moderate, and 1.46% as high.

besides, some suggestions are proposed. first, further research or advanced analysis is needed to determine the relationship between mathematics teacher self-efficacy and teacher performance. second, further research or analysis is required to find out how instruments are constructed for senior or vocational high school mathematics teachers.

copyright © 2018, reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 4(1), 2018, 45-57
available online at: http://journal.uny.ac.id/index.php/reid

exploring the accuracy of school-based english test items for grade xi students of senior high schools

*1martin iryayo; 2agus widyantoro
1university of rwanda college of education, kg 11 ave, kigali, rwanda
2department of english education, universitas negeri yogyakarta, jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*corresponding author. e-mail: martiniryayo@gmail.com
submitted: 08 june 2018 | revised: 27 july 2018 | accepted: 01 august 2018

abstract

this study sets out to (1) explore the accuracy of school-based english test items developed by english teachers and (2) compare the content covered by the teachers with the students’ success level. this research used the quantitative approach.
the data sources are all grade xi students’ answers to the english test for the second semester of the 2016/2017 academic year and their english teachers’ responses to the questionnaire. in this cross-sectional survey, 241 grade xi students and six english teachers were selected by using the total population sampling technique. to analyze the data, the irt model was prioritized, using bilog-mg 3.0 and winsteps 3.7. the findings of the study indicate that (1) the test is valid, (2) it is reliable, (3) the majority of the items are moderately difficult, (4) more than half of all items have the power to discriminate the examinees, (5) some items show fully effective distractors, and (6) the test gives much information at a theta of -.40, which means that the test is difficult for the grade xi students. moreover, there is a wide gap between the content covered and the level of success.

keywords: ctt, discrimination power, distractor, information function, irt, theta, total population sampling

introduction

a well-constructed test is the best way to evaluate a student’s mastery in a particular field. gronlund (1993, pp. 205–206) stresses that tests not only help teachers to make instructional decisions with a direct influence on students’ learning, but also assist in a number of other ways. for instance, tests can increase students’ motivation. the purpose of tests is to obtain an accurate and fair assessment of students’ abilities. nevertheless, it is impossible for a test to evaluate skills or knowledge bases if it is influenced by irrelevant factors that could undermine the results. these potentially biasing factors include gender, ethnic, and cultural differences. if these biasing factors are not properly accounted for, the outcome of the test will unfairly represent the abilities of the examinees (gronlund, 1993, p. 207).
put differently, the results of a test are essentially meaningless if they are unfair to test takers because of cultural, gender, or ethnic origin biases. test results should present a truthful picture of the skills and knowledge that students have in the subject area or domain tested. successful instruction, curriculum planning, and evaluation of related programs cannot be accomplished without quality data on students' achievement. test scores that overestimate or underestimate students’ actual knowledge and skills cannot serve these important purposes (young, cummings, & st-onge, 2017).

accurate achievement data cannot be obtained when the test developers do not pay attention to the accuracy of the test components, because a test that is not well prepared obviously affects the students’ achievement even if they understand the material well. thus, the test must be as accurate as possible.

in standardized testing, there are several means for measuring students’ cognitive abilities. currently, multiple-choice tests are commonly used for this purpose (galsworthy et al., 2005). standardized scores are used by most schools to evaluate educational quality and student performance (brescia & fortune, 1989, pp. 1–5). as long as test scores are considered an important factor in assessing students’ performance, teachers should develop tests which are as fair as possible for examinees regardless of their race, gender, or any disability they may have (joint committee on testing practices of american psychological association, 2004). reviewing all items of a test is the most fruitful way to ensure that they are free from all irrelevant sources of variance, because item bias distorts the examinees’ scores.
there must be an empirical review of the items before administering them to the examinees in order to ensure the quality of their characteristics.

in achievement testing, it is possible to use different formats. multiple-choice (mc) items are broadly used for classroom assessment and they often account for a significant constituent of a student's grade in a course (dibattista & kurzawa, 2011). a normal mc item is made up of a question, known as the stem, and a list of alternatives, one of which is the right answer to the question. the test takers pick only the option they think fits the question asked. the correct answer is called the keyed option, while the remaining alternatives are referred to as distractors. for instructors, the mc test format has a variety of advantages: scoring mc items takes a short time, particularly when the examinees indicate their responses on an optically scannable mc answer sheet (a universally used form). for teachers of subjects with large enrolment, easy grading can make mc tests especially appealing. obviously, multiple-choice tests are advantageous, even though some flaws remain when measuring the students’ performance.

content validity, clarity, and reliability are the most crucial traits of achievement tests. the content validity of a test is judged by how accurately the test samples the range of knowledge, skills, and abilities expected from the testees during an examination period. the reliability of a test depends on its grading stability and its power to discriminate students on the basis of their different levels of performance (kartowagiran, 2012). well-developed multiple-choice test items are in general more valid, clearer, and more reliable than essay tests because they broadly represent the content in the syllabus, are able to distinguish all levels of performance, and make scoring consistency virtually guaranteed.
thus, validation is a starting point for dealing with multiple-choice test item quality. content validity (relevance) based on experts' judgement can be computed in different ways. the use of a pre-established acceptability criterion, the calculation of the average rating of each item's relevance, the quantification of item relevance (with three or more experts) by using coefficient alpha, and the computation of the kappa coefficient are the best-known techniques (polit & beck, 2006). with this approach, there should be a team of experts to judge whether an item on a scale is relevant to (or congruent with) the construct being measured. each rater computes the percentage of item relevance, then the average is taken across all raters (experts).

another way of evaluating the accuracy of mc test items is to study the answers that the examinees give, which is the analysis approach used in this research. specifically, teacher-developed test items administered to the examinees are analyzed on the basis of difficulty level, discrimination level, and effectiveness of the distractors (dibattista & kurzawa, 2011). in brief, before items are put in the item bank, the main characteristics stated above should be considered, because any item which is either too difficult or too easy, does not discriminate students, or has ineffective distractors does not qualify to be stored in the item bank.

test takers should be differentiated by their abilities. the discrimination capacity of an mc test item is its most important property because it reflects the extent to which more knowledgeable students are more likely than less knowledgeable students to select the keyed option (abadyo & bastari, 2015).
mc test item discriminatory capacity can be measured with the computation of its index, which reflects the correlation between the examinees’ total scores and the score received on the item to be considered (i.e. 1 stands for the keyed option, while 0 for the wrong answer). even more, there are items which are problematic because they produce negative discriminatory indexes, maybe due to the unclear wording or the existence of two correct alternatives rather than one (dibattista & kurzawa, 2011). with the presence of such items, there is a detraction from the overall accuracy of the test as a whole, because the number of less knowledgeable examinees who select the keyed option outweighs that of the knowledgeable examinees. with regard to the perspective of its functionality, there are two requirements for a distractor to be functioning: first, at least some examinees must select it, if they do not, the distractor is not plausible to them until they can be lured away from the correct answer, so such a distractor never contributes to the discrimination of the test takers. abdulghani, ahmad, ponnamperuma, khalil, and aldrees (2014) have suggested that at least 5% of examinees should select each of an item’s distractors, and this value is a common benchmark for the effectiveness of the distractors. the second requirement refers to the power of a distractor to distinguish high achievers from low achievers (stronger from weaker students), considering that the power of discrimination is clear when the correct answer is more often chosen by the students with high scores than their counterparts. 
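as a rough illustration of the two item statistics described above, the following python sketch computes a point-biserial discrimination index (the correlation between a 0/1 item score and the total score) and checks each distractor against the 5% benchmark. the function names, option labels, and toy data are hypothetical and not taken from the study; using a total score that still includes the item itself is a simplifying assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mi, mt = mean(item_scores), mean(total_scores)
    si, st = pstdev(item_scores), pstdev(total_scores)
    if si == 0 or st == 0:
        # no variation: the index is undefined
        return float("nan")
    cov = sum((x - mi) * (y - mt) for x, y in zip(item_scores, total_scores)) / n
    return cov / (si * st)


def distractor_report(choices, options, key, threshold=0.05):
    """for every non-keyed option, report whether it reaches the benchmark share."""
    n = len(choices)
    counts = Counter(choices)
    return {opt: counts[opt] / n >= threshold for opt in options if opt != key}


# toy data (hypothetical): five examinees, one item
item = [1, 1, 1, 0, 0]
total = [5, 4, 4, 2, 1]
r = point_biserial(item, total)

# toy data (hypothetical): twenty examinees, key "a", option "d" never chosen
answers = ["a"] * 12 + ["b"] * 5 + ["c"] * 3
report = distractor_report(answers, options="abcd", key="a")
```

a negative value of `r` would correspond to the problematic items mentioned above, and an option mapped to `False` in `report` would be a non-functioning distractor under the 5% benchmark.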
in line with the statements, opinions, and views of the different authors discussed above, many problems appear when school-based english test items are developed in the multiple-choice format: the content of some multiple-choice tests does not cover the material taught in the classroom; the main parts of multiple-choice test items (stem, key, and alternatives) are not built according to the criteria or guidelines; some teachers do not have enough skills to overcome this problem; the students' scores obtained from multiple-choice tests over the past couple of academic years are inconsistent because of a lack of item homogeneity; some individual items are not highly correlated with each other or with the whole test; and some english teachers are not mindful of the difficulty level of the items. at the end of a teaching session, they develop tests which are either too easy or too difficult. the ideal index of difficulty should fall between -2 and 2 (hambleton, swaminathan, & rogers, 1991, p. 13). it is quite problematic to have items with a difficulty index far below -2 or above 2, and some english teachers do not know how to develop multiple-choice items which can discriminate the participants. moreover, the distractors are powerless to attract the examinees because some are chosen by <5% of examinees (mkrtchyan, 2011). it is a problem to have items which cannot discriminate the achievers (ai less than 2).

like other scientific studies, this study aims at exploring the accuracy of school-based english test items developed by english teachers through (1) validity index, (2) reliability coefficients, (3) difficulty level, (4) discrimination power, (5) distractor effectiveness, and (6) the level of information given by the items and the whole test in general, and at comparing the content covered by the teachers with the students' success level.
the current study is expected to be beneficial. practically and even theoretically, the results of this study should be used by english test administrators, moderators, and even supervisors in order to make adequate policies on how to fairly and professionally prepare a suitable english test. this is very important because some teachers and other school academicians who develop test items for testing students do not yet have enough skills to examine the essential characteristics of a good item.

many researchers have worked on the accuracy or quality of achievement test items. charismana and aman (2016) conducted research on the quality of civic education final examination items in the whole regency of kudus, indonesia. the students involved in the study were grade viii students of junior high schools that apply curriculum 2013. the data were analyzed both qualitatively and quantitatively. the qualitative results show that 31 items are good whereas the remaining items are not. the quantitative results show that 24 items or 68.57% of all items are good, while 11 items or 31.42% of all items are not. as a result, approximately 15 items are recommended to be revised.

a study conducted by osadebe (2015) with 100 items administered to 1000 students found that the achievement test for the subject of economics has high face and content validity. the test item quality was evaluated through difficulty and discrimination indexes. a difficulty index (p-value) of 0.5 was used as the reference after applying the correction-for-guessing formula. the index of discrimination was computed with the point-biserial statistic, with a minimum boundary of .30. with the kr-20, the test was very highly reliable, with a coefficient of .95.
these findings support the use of this instrument to internally evaluate the students in order to be ready for the external testing (examination). according to the study by boopathiraj and chellamani (2013), which was aimed at analyzing test items in the subject of research with students enrolled in master of education (m.ed) program, they wanted to ensure the difficulty and discrimination levels of mc test items. a sample of 200 students from different colleges of education was established. the sample consisted of both genders. the findings indicate that a big number of items are not accepted, and there is a good discrimination index for some items, but some of them are rejected due to poor discrimination indexes. based on the statement above, most of the items have the difficulty level (bi) from -2 to 2 and discrimination index of (ai) > 2. sabri (2013) worked on a comprehensive test at a university in perak, involving 16 music students. with ms excel, he computed the difficulty level of 41 items. the reliability coefficients and discriminatory indexes were computed using ms excel and spss 17.7 respectively. the outcome of the research came up with the information that 44% of all items have the difficulty index of > .80, then 59% of the items have acceptable discriminatory power. there is no effective distractor. with kr-20, the coefficient of reliability is .717 while with kr-21 is .703. hence, it is reasonable to conclude that the items are reliable, moderately easy, 80% discriminate high from low achievers, but some distractors were chosen by less than 5% of examinees (implausible). quaigrain and arhin (2017) carried out a study about mc test items. the sample was made up of 247 students doing year-1 diploma in education at cape coast polytechnics. a test of 50 mc items was given to them in the subject of educational measurement. 
the results of the study show that the whole test has an internal consistency reliability of .77 (kr-20), a mean score of 29.23, a standard deviation of 6.36, a mean difficulty level (p-value) of 58.46% (sd = 21.23), a mean discrimination index (di) of .22 (sd = .17), and a mean de of 55.04 (sd = 24.09). as to di, 30 items (60%) are reasonably accepted. every item with a moderate difficulty level, high discriminatory power, and functioning distractors should still be part of the next testing to improve classroom assessment quality.

there is no study without innovation. the novelty of this study lies in the data analysis. apart from the variables that look similar to those of previous studies, the current research involves a new way of grading teachers on the basis of the content covered after the learning term. as the majority of the previous studies used classical test theory to analyze item accuracy, the researchers in this study used item response theory (irt) to obtain clearer and richer information on item quality, and recently released irt software was used for this purpose.

this study is expected to answer the following questions about the quality or accuracy of school-based english test items: (1) to what extent do the english test items represent the content or subject topics they intend to measure for grade 11?; (2) what proves that the english test assesses the underlying theoretical construct it is purported to measure?; (3) how convergent are the items making up the english test, so that they can be considered homogeneous?
do they complement each other?; (4) how reliable and informative are the english test items?; (5) what is the difficulty level of the items making up the english test?; (6) at what level do the english test items powerfully discriminate between high and low achievers?; and (7) how effective are the distractors in ensuring that the english test outcomes provide a more credible and objective picture of the knowledge of the examinees?

method

this study used the quantitative approach with a cross-sectional survey. it was carried out over a period of two months, from the end of may to mid-june 2017. the study took place across all senior high schools under the management of the muhammadiyah foundation. the schools are situated in bantul district, special region of yogyakarta, indonesia. in order to successfully reach the objectives of this study, schools which are homogeneous were considered.

population and sample

the population of this study was all grade 11 students of muhammadiyah high schools in the whole district of bantul, totalling 241 students. in order to have accurate results, all of the students were selected as participants. in a small community, it is possible to conduct a study with nearly the whole population and pay attention to whoever has moved through the network of the community (guyette, 1983). therefore, this study uses the purposive sampling technique with total population sampling.

data collection techniques

the technique used for data collection is documentation, whereby the researchers recorded the answers from all examinees. to obtain information on the content covered by each teacher during the learning session, a questionnaire was used. with regard to the validity and reliability of the instruments in this study, experts' judgement and cronbach's alpha indexes were computed.
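the two instrument checks named above can be sketched in python as follows. this is a minimal illustration, assuming the common form of aiken's index, v = Σ(r − lo) / [n(hi − lo)], for n expert ratings on a scale from lo to hi, and the usual formula for cronbach's alpha, k/(k − 1) · (1 − Σ item variances / total-score variance). the function names and the toy ratings and scores are hypothetical; they are not the study's data or its software.

```python
from statistics import pvariance


def aiken_v(ratings, lo, hi):
    """aiken's v for one item, from expert ratings on a lo..hi scale."""
    n = len(ratings)
    return sum(r - lo for r in ratings) / (n * (hi - lo))


def cronbach_alpha(scores):
    """cronbach's alpha; rows are respondents, columns are items."""
    k = len(scores[0])
    item_vars = [pvariance(col) for col in zip(*scores)]
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have no variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# toy data (hypothetical): five experts rate one item on a 1..5 scale
v = aiken_v([4, 5, 4, 5, 3], lo=1, hi=5)

# toy data (hypothetical): four respondents, three dichotomous items
alpha = cronbach_alpha([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```

for dichotomously scored items such as multiple-choice answers, this alpha coincides with the kr-20 coefficient, so the same function can serve both cases.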
data analysis

within the scope of this study, several variables are measured, including construct validity, internal consistency reliability, item difficulty level, level of discrimination, and the effectiveness of the distractors. it is, therefore, clear that both classical test theory (ctt) and item response theory (irt) are necessary in this analysis. table 1 displays the variables and related data analysis techniques.

table 1. data analysis techniques

no  variables                            analysis techniques
1   validity: content validity           expert judgments with aiken indices
2   reliability: internal consistency    jasp 0.8.2.0 = spss24
    reliability: information function    irt/bilog-mg 3
3   level of difficulty                  irt/bilog-mg 3
4   power of discrimination              irt/bilog-mg 3
5   distractor effectiveness             rasch/winsteps 3.73

an acceptable coefficient of reliability must be at least .70. this value helps to determine the level of error within measurement. the higher the reliability index, the lower the measurement error, and vice versa (mardapi, 2012, p. 128).

item discrimination (ai) is the power of an item to differentiate the examinees whose level of understanding is high from those whose level of understanding is low. the discrimination index is called the slope because it shows how quickly the probability of a correct response changes as the ability, or level of the trait, increases. according to hambleton and swaminathan (1985, p. 36), the discrimination index varies from 0 to 2.

item difficulty is another important variable. its index (bi) is always estimated from the scores of the students or examinees, which are obtained from the answers of all participants in a test.
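the roles of the two item parameters can be sketched with a short python example. this is only an illustration under stated assumptions: a two-parameter logistic form is assumed, no scaling constant (such as 1.7) is applied, and the function name `p_correct` and the parameter values are hypothetical rather than taken from the study.

```python
import math


def p_correct(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model.

    theta: examinee ability
    a: discrimination (slope of the curve)
    b: difficulty (ability at which the probability is 0.5)
    Assumption: no scaling constant is applied to the exponent.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # hypothetical items with the same difficulty but different slopes
    for a in (0.5, 1.0, 2.0):
        probs = [round(p_correct(theta, a, 0.0), 3) for theta in (-1.0, 0.0, 1.0)]
        print(f"a={a}: {probs}")
```

with the difficulty held fixed, a larger a makes the probability rise more steeply around theta = b, which is why a is read as the power to separate examinees of low and high ability.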
item difficulty depends on the ability of the examinees. the more testees answer an item correctly, the lower the difficulty level of that item, and vice versa. an item is good or accepted when its difficulty lies within the interval from -2 to +2 (hambleton et al., 1991, p. 13). the level of difficulty decreases as the b-parameter value approaches -2 and increases as it approaches +2.

item analysis with an irt model must fulfill the prescribed assumptions. the general assumptions of item response theory models are unidimensionality, local independence, and parameter invariance. unidimensionality is shown by the scree plot presented in figure 1. figure 1 shows that the unidimensionality assumption is fulfilled for the data of this study because there is one clearly dominant dimension. model fit is tested statistically with the chi-square statistic. the researchers chose the suitable model by considering the highest percentage of fitting items, as shown in figure 2 (stone, ye, zhu, & lane, 2009). figure 2 shows that the data in this study fit the two-parameter model best because it fits 36% (18 of 50) of all items. this result also supports the invariance assumption because when the data fit a model, the invariance criteria are automatically fulfilled (lord, 2012, p. 126).

figure 2. goodness of fit (gof)

local independence has two facets: local independence of the test takers’ answers and local independence of the test items (allen & yen, 2001, p. 241). the first facet means that a wrong or right answer of a test taker does not depend on the wrong or right answer of a fellow test taker on a given item. the second facet means that being wrong or right on one test item does not affect the answer to another item. this study focuses on the second facet of local independence because it is related to the test items.
the results show that the correlation of residuals for all items is close to 0.

figure 1. unidimensionality proof by scree plot

findings and discussion

the findings of this study are discussed based on the variables measured. the validity index, reliability coefficient, discrimination index, difficulty level, distractor effectiveness, information function, and the weight of content coverage and success were measured, and the results are reported in this section.

the findings about content validity with the aiken index are displayed in figure 3. figure 3 contains information about the validity of the items developed from the content expected to be covered by english teachers. the english test represented the content taught because the aiken index for each indicator is accepted, with a value bigger than .75. all items should be used because the overall index is .80. this result is supported by retnawati (2016), who states that if the index is lower than or equal to .40, the validity is low; if it is between .40 and .80, the validity is moderate; and if it is >.80, the validity is very high.

reliability is another important criterion for item accuracy. table 2 shows how reliable each item is. guttman’s lambda7 is an alternative to cronbach’s alpha. both coefficients were used to make a comparison. the reliability coefficients are really good. based on both cronbach’s and guttman’s indices, the values range from .80 to .95. all items are regarded as perfectly reliable because any item reliability greater than .70 is considered perfect, with the lowest and highest boundaries being .00 and 1.0, respectively. with this finding, there is no doubt that the students’ answers to each item of the test are consistent. hence, the test was measuring what it was purported to measure.

table 2. internal consistency reliability

item    α     λ6      item    α     λ7
item1   0.88  0.92    item26  0.89  0.93
item2   0.88  0.92    item27  0.88  0.92
item3   0.88  0.92    item28  0.88  0.92
item4   0.88  0.92    item29  0.88  0.92
item5   0.88  0.92    item30  0.88  0.92
item6   0.88  0.93    item31  0.88  0.92
item7   0.88  0.92    item32  0.88  0.92
item8   0.88  0.92    item33  0.88  0.92
item9   0.88  0.92    item34  0.88  0.92
item10  0.88  0.92    item35  0.88  0.92
item11  0.88  0.92    item36  0.88  0.92
item12  0.88  0.92    item37  0.88  0.92
item13  0.88  0.92    item38  0.88  0.92
item14  0.88  0.92    item39  0.88  0.92
item15  0.88  0.92    item40  0.88  0.92
item16  0.88  0.92    item41  0.88  0.92
item17  0.88  0.92    item42  0.88  0.92
item18  0.88  0.92    item43  0.88  0.93
item19  0.88  0.92    item44  0.88  0.92
item20  0.88  0.92    item45  0.88  0.92
item21  0.88  0.92    item46  0.88  0.92
item22  0.88  0.92    item47  0.88  0.92
item23  0.87  0.92    item48  0.88  0.92
item24  0.88  0.92    item49  0.88  0.92
item25  0.88  0.92    item50  0.88  0.92

figure 3. aiken index (0.0 to 1.0)

the level of difficulty is crucial to ensuring the quality of test items. the results of b-parameter estimation for all english test items are summarized in table 3.

table 3. difficulty index (bi)

comment   frequency  %
good      47         94
not good  3          6
total     50         100

the parameter estimation for all 50 items shows that only three items (6%) are classified as ‘not good’: items 1, 40, and 46. an item's difficulty index is classified as good if it lies in the range from -2 to 2 and as not good if it falls outside that range. this result is in line with mardapi (1991, p. 11), who states that the item difficulty level is a function of the ability of a test taker. an item is said to be good if its difficulty level (bi) satisfies -2 ≤ b ≤ +2. an item with a difficulty level close to or below -2 is in the easy category.
in contrast, an item with a difficulty level (bi) close to or above +2 is in the difficult category. figure 4 shows more about the accuracy of the test items based on the b-parameter.

figure 4. item accuracy based on bi-index

apart from the difficulty level, test items must be able to discriminate students by their abilities. the discrimination index for each of the 50 items is shown in figure 5. in terms of the discrimination index (ai), items 5, 10, 24, 35, 43, and 47 (12%) discriminate test takers at a low level because their a-indexes vary from .35 to .64. items 6, 16, and 27 (6%) discriminate the examinees at a very low level because their a-indexes vary from .01 to .34. however, the overall a-index, 1.206, shows that the english test moderately discriminates the examinees. hence, all items with low discrimination indexes should be revised, while those with very low discrimination indexes should be replaced.

figure 5. discrimination power

the results presented in figure 5 are interpreted following baker (2001, p. 34), who classifies the discrimination index (ai) as follows: 0.01 – 0.34 very low; 0.35 – 0.64 low; 0.65 – 1.34 moderate; 1.35 – 1.69 high; 1.70 and above very high. the discrimination index (ai) is connected to the distractors' power to attract the examinees. the results can be seen in figure 6.

figure 6. distractor functionality
notes: 0-4: number of distractors (functioning ≥ 5% or not functioning < 5%); 1-50: number of items

it was found that items 1 and 40 (4%) do not have any functioning distractor; items 12, 13, 14, 15, 31, and 32 (6 items, 12%) have 50% of distractors that are not functioning effectively; items 2, 4, 18, 19, 22, 23, 25, 30, 33, 34, 38, 39, 42, 43, and 47 (15 items, 30%) have 25% of distractors that are not functioning; and items 3, 5, 6, 7, 8, 9, 10, 11, 16, 17, 20, 21, 24, 26, 27, 28, 29, 35, 36, 37, 41, 44, 45, 46, 48, 49, and 50 (27 items, 54%) have distractors that are functioning at 100%. in general, the english test for grade xi students during the second semester of the academic year 2016/2017 has only 27 perfect items, two items that should be removed, and 21 items that should be repaired. figure 6 represents the power of distractors within the test. these findings are supported by abdulghani, ahmad, ponnamperuma, khalil, and aldrees (2014), who suggest that at least 5% of examinees should select each of an item’s distractors, and this value is a common benchmark for the effectiveness of distractors.

the information function is another indicator of test item accuracy. in irt, the information function stands for reliability. in this study, a plot was used to show the amount of information the test could give, as presented in figure 7. the maximum information occurs at a student ability of -.04. the red line, on the other hand, shows the standard error of measurement (sem): the higher the information curve rises, the lower the error of measurement. in fact, the majority of grade xi students have a low ability because the test gives much information on the left side of 0 on the latent trait.
we can see that the test is fit for the students whose abilities vary from -2.2 to 1.4. this is supported by istiyono, mardapi, and suparno (2014). as seen in figure 8, around 70% (169 students) of all students (241) have a low ability to answer the questions. therefore, there is no easy item for the students because their abilities are relatively low, -.40.

figure 7. information function (if)

figure 8. proportion of students' abilities (gaussian fit to ability scores for group 1; horizontal axis: ability; vertical axis: frequency)

as previously described, the content covered and the level of success were compared to see whether the students could understand that content well. table 4 and figure 9 give more information on the issue. from table 4, it is easy to evaluate, compare, and classify teachers at the end of a teaching term/period. every teacher has a syllabus that contains the whole material, and every english teacher has objectives to be achieved by the end of the term. a is given to an english teacher who reaches the target, b to a teacher who reaches an acceptable level, c to a teacher who needs to improve his/her teaching topics, d to a teacher who does not cover the content satisfactorily, and f to a teacher who covers only a very minimal amount of content. a = 82.5 to 100% of the content covered, b = 62.5 to 82.4% of the content covered, c = 42.5 to 62.4% of the content covered, d = 22.5 to 42.4% of the content covered, and f = 20% and below of the content covered. apart from the teacher categorization criteria above, the new teacher project, as cited by seidel, stürmer, blomberg, kobarg, and schwindt (2011), suggests a way to give scores to teachers.
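the letter-grade bands described above can be sketched as a small python function. this is an illustration only: the function name `coverage_grade` is hypothetical, the default of 61 indicators follows the "indicators covered/61" column of table 4, and, because the stated bands leave values between 20% and 22.5% unassigned, the sketch assumes that anything below 22.5% is graded f.

```python
def coverage_grade(covered, total=61):
    """Letter grade for the share of syllabus indicators a teacher covered.

    covered: number of indicators covered
    total: number of indicators in the syllabus (61 in table 4)
    Assumption: percentages below 22.5 are graded "F", which closes the
    gap between the stated "D" band (from 22.5%) and "F" band (up to 20%).
    """
    if total <= 0 or not 0 <= covered <= total:
        raise ValueError("covered must lie between 0 and total")
    pct = 100.0 * covered / total
    # lower cut-offs of the bands, checked from the highest grade down
    for cutoff, grade in ((82.5, "A"), (62.5, "B"), (42.5, "C"), (22.5, "D")):
        if pct >= cutoff:
            return grade
    return "F"


if __name__ == "__main__":
    # indicator counts as listed in table 4
    for n in (53, 52, 49, 44, 37):
        print(n, coverage_grade(n))
```

applied to the counts listed in table 4, the function returns the same letter grades as the table.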
in the report called rating a teacher observational tool, teachers can be put into categories: ‘complete coverage’ when the tool of evaluation covers all the elements in the curriculum, ‘partial coverage’ when the test does not cover some components of the syllabus, and ‘inadequate coverage’ when the evaluation tool covers less than 50% of all indicators in the syllabus. the number ‘3’ stands for the first category, ‘2’ for the second, and ‘1’ for the third. based on the answers of the teachers, all six teachers were categorized.

figure 9 makes it easy to see the gap between the content covered and the success level of grade xi students. for some english teachers (engt.bl, engt.sw, engt.pl), content and success are in line, but the rest of the teachers (engt.py, engt.im, engt.ks) show a wide gap between the content covered and the success of students on the english test. information from figure 9 implies that there is a remarkable difference between rural and urban muhammadiyah senior high schools. for the rural schools, the content covered by english teachers does not explain the success level of students on the test developed from that content, but for the urban schools, there is a correlation between the content covered and the success level of the students.

table 4. classification of english teachers

id code  indicators covered/61  scale  grade  comment           category  comment
engt.bl  53.00                  3.5    a      reached target    2         partially covered
engt.py  53.00                  3.5    a      reached target    2         partially covered
engt.im  52.00                  3.4    a      reached target    2         partially covered
engt.sw  49.00                  3.2    b      acceptable        2         partially covered
engt.ks  44.00                  2.9    b      acceptable        2         partially covered
engt.pl  37.00                  2.4    c      need improvement  2         partially covered

figure 9. content covered vs success

conclusion, implications, and suggestions

based on the results of this study and their discussion in the previous section on the accuracy of the multiple-choice items of the english test, the following conclusions can be made: (1) the items represent the content taught to the students during the second semester of academic year 2016/2017. (2) all items are internally consistent. (3) most of the items (47/50) have an acceptable difficulty level, but there are two items which are very easy and one which is very difficult. (4) a large number of items (42/50) have good discrimination indexes, but nine items are unable to discriminate the high achievers from the low achievers. (5) as many as 27 items have effective distractors, but 23 items still show powerless distractors. (6) some english teachers tried their best to cover the content expected to be taught to the students, but some others did not cover at least 50% of the content, and therefore, there is still a gap (for some schools) between the content covered and the success level of the students on the test developed from that content. (7) the test is obviously difficult for more than 70% of the students, who have an ability of -.40, and fits the students whose abilities range from -2.2 to 1.2.

as in other scientific studies, some implications are put forward. improving the construction and development of english test items for grade xi students of muhammadiyah senior high schools in bantul district needs both qualitative and quantitative review. it is necessary to test the quality of each item. this process contributes to the identification of weaknesses within the test because the quality level of a test is completely determined by the quality of its items.
in general, the quantitative analysis indicates that the english test is not accurate. teachers should try out the items and then analyze the results with relevant and practical techniques, such as item analysis with classical test theory and item response theory. the choice of analysis technique depends on the purpose and the number of examinees, along with other technical considerations. an analysis with classical test theory needs only a small sample (30 participants at minimum), whereas item response theory is used for a large number of respondents.

for better school-based assessment in the future, the following suggestions are given: (1) all items of medium quality should be revised and re-measured until they fulfill the criteria of a good item; items of bad quality should be dropped or completely replaced. (2) teachers should conduct tryouts and item analysis before testing. (3) teachers should develop items that suit the content already taught to the students; they should also give the blueprint to them. (4) before a set of items is chosen, it is necessary to conduct qualitative analysis with expert judgment, which helps english teachers obtain information on the item characteristics in terms of construction, language, and content in general. (5) item response theory is needed to identify the characteristics of items; senior high school teachers should be trained in irt-related programs. (6) a test item bank for the english subject should be created at the district level (bantul) to help teachers practice assessing students' achievements. (7) schools should hold routine trainings on evaluation, assessment, and measurement, which will help to increase the ability of english teachers in evaluating learning outcomes.
the management office of muhammadiyah schools should be vigilant to remote areas in terms of education and technology.

references

abadyo, a., & bastari, b. (2015). estimation of ability and item parameters in mathematics testing by using the combination of 3plm/grm and mcm/gpcm scoring model. reid (research and evaluation in education), 1(1), 55–72. https://doi.org/10.21831/reid.v1i1.4898

abdulghani, h. m., ahmad, f., ponnamperuma, g. g., khalil, m. s., & aldrees, a. (2014). the relationship between non-functioning distractors and item difficulty of multiple choice questions: a descriptive analysis. journal of health specialties, 2(4), 148–151. https://doi.org/10.4103/1658-600x.142784

allen, m. j., & yen, w. m. (2001). introduction to measurement theory (1st ed.). long grove, il: waveland press.

boopathiraj, c., & chellamani, k. (2013). analysis of test items on difficulty level and discrimination index in the test for research in education. international journal of social science & interdisciplinary research (vol. 2).

brescia, w., & fortune, j. c. (1989). standardized testing of american indian students. college student journal, 23(2), 98–104.

charismana, d. s., & aman, a. (2016). analisis kualitas tes ujian akhir semester ppkn smp di kabupaten kudus. jurnal evaluasi pendidikan, 4(1), 1–9.

dibattista, d., & kurzawa, l. (2011). examination of the quality of multiple-choice items on classroom tests. canadian journal for the scholarship of teaching and learning, 2(2), 1–23. https://doi.org/10.5206/cjsotl-rcacea.2011.2.4

galsworthy, m. j., paya-cano, j. l., liu, l., monleón, s., gregoryan, g., fernandes, c., … plomin, r. (2005). assessing reliability, heritability and general cognitive ability in a battery of cognitive tasks for laboratory mice. behavior genetics, 35(5), 675–692. https://doi.org/10.1007/s10519-005-3423-9

gronlund, n. e. (1993). how to make achievement tests and measurements. needham heights, ma: allyn and bacon.

guyette, s. (1983). community-based research: a handbook for native americans. los angeles, ca: american indian studies center, university of california.

hambleton, r. k., & swaminathan, h. (1985). item response theory: principles and applications. boston, ma: kluwer nijhoff.

hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. newbury park, ca: sage publications.

istiyono, e., mardapi, d., & suparno, s. (2014). pengembangan tes kemampuan berpikir tingkat tinggi fisika (physthots) peserta didik sma. jurnal penelitian dan evaluasi pendidikan, 18(1), 1–12. https://doi.org/10.21831/pep.v18i1.2120

joint committee on testing practices of american psychological association. (2004). code of fair testing practices in education. washington, dc, united states of america.

kartowagiran, b. (2012). penulisan butir soal. a paper presented in the seminar on question items analysis and writing for civil servant resources of dikrekinpeg, in kawanua aerotel hotel.

lord, f. m. (2012). applications of item response theory to practical testing problems. new york, ny: routledge.

mardapi, d. (1991). konsep dasar teori respons butir: perkembangan dalam bidang pengukuran pendidikan. cakrawala pendidikan, 3(x), 1–16.

mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha medika.

mkrtchyan, a. (2011). distractor quality analyze in multiple choice questions based on information retrieval model. edulearn11 proceedings, 1624–1631.

osadebe, p. u. (2015). construction of valid and reliable test for assessment of students. journal of education and practice, 6(1), 51–56.

polit, d. f., & beck, c. t. (2006). the content validity index: are you sure you know what’s being reported? critique and recommendations. research in nursing & health, 29(5), 489–497. https://doi.org/10.1002/nur.20147

quaigrain, k., & arhin, a. k. (2017). using reliability and item analysis to evaluate a teacher-developed test in educational measurement and evaluation. cogent education, 4(1), 1301013. https://doi.org/10.1080/2331186x.2017.1301013

retnawati, h. (2016). analisis kuantitatif instrumen penelitian. yogyakarta: parama publishing.

sabri, s. (2013). item analysis of student comprehensive test for research in teaching beginner string ensemble using model based teaching among music students in public universities. international journal of education and research, 1(12), 1–14.

seidel, t., stürmer, k., blomberg, g., kobarg, m., & schwindt, k. (2011). teacher learning from analysis of videotaped classroom situations: does it make a difference whether teachers observe their own teaching or that of others? teaching and teacher education: an international journal of research and studies, 27(2), 259–267. https://doi.org/10.1016/j.tate.2010.08.009

stone, c. a., ye, f., zhu, x., & lane, s. (2009). providing subscale scores for diagnostic information: a case study when the test is essentially unidimensional. applied measurement in education, 23(1), 63–86. https://doi.org/10.1080/08957340903423651

young, m., cummings, b.-a., & st-onge, c. (2017). ensuring the quality of multiple-choice exams administered to small cohorts: a cautionary tale. perspectives on medical education, 6(1), 21–28. https://doi.org/10.1007/s40037-016-0322-0

reid (research and evaluation in education), issn 2460-6995
reid (research and evaluation in education), 3(1), 2017, 50-63
available online at: http://journal.uny.ac.id/index.php/reid

research article

the structural equation modeling of reading interest psycho-behavioural constructs: how are they related across different modes of reading?
* 1 nur hidayanto pancoro setyo putro; 2 jihyun lee
*faculty of languages and arts, universitas negeri yogyakarta
jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*email: noersabar@yahoo.com
submitted: 03 april 2017 | revised: 13 may 2017 | accepted: 13 may 2017

abstract

the present study examines the relationships between the psycho-behavioral constructs underlying undergraduate students’ reading interest. the a priori framework for conceptualizing the subcomponents of reading interest is based on two modes of reading (printed-text-based and internet-based) and three types of psycho-behavioral motives/intentions of reading (affective, cognitive, and behavioral). participants in this study were students (m = 20.14 years old) from an indonesian university (n = 993). exploratory and confirmatory factor analyses show the salience of 10 factors across reading modes and psycho-behavioral domains of reading. in the most acceptable sem models that explore the relationships among the sub-components of reading interest, students' reading interest in the print mode precedes their interest in reading online materials. implications of these findings are discussed for theory development and practice.

keywords: reading interest, affect, cognition, behavior, factor analysis

how to cite item: putro, n. & lee, j. (2017). the structural equation modeling of reading interest psycho-behavioural constructs: how are they related across different modes of reading?. reid (research and evaluation in education), 3(1), 50-63. doi:http://dx.doi.org/10.21831/reid.v3i1.13530

introduction

over the past twenty years, a growing body of literature has suggested that the current young generation has adopted multiple modalities in reading. much of this work has highlighted the increasing practice of reading online materials among school-aged and university students (coiro & dobler, 2007; karim & hasan, 2007; liu & huang, 2008; mckenna, conradi, lawrence, jang, & meyer, 2012).
other research revealed the emerging practice of reading from social media platforms, e.g., facebook and twitter (junco, 2012; kirschner & karpinski, 2010). although these studies have indicated an increasing trend towards reading online materials and social media reading, there is also a large volume of published studies showing that reading printed materials has not been completely eclipsed (e.g., buzzetto-more, guy, & elobaid, 2007; liu, 2005).

as multimodal literacy becomes more widespread (walsh, 2010), a great deal of previous research into reading motivation has focused on how frequent access to the internet is associated with low interest in reading from printed materials and a decline in academic achievement (alterman, 2007; dewaal, schönbach, & lauf, 2005; kirchhoff, 2010; lee & leung, 2008; mokhtari, reichard, & gardner, 2009). similarly, there is a growing body of literature suggesting that social media weaken young people's motivation to read books and lead to low academic achievement (junco, 2012; kirschner & karpinski, 2010).

while much research has explored the relationships between the amount of time spent on digital reading and the amount of time spent on reading printed books, little is known about how the psycho-behavioural constructs of reading interest within different modes (i.e., printed, online, and social media) are related to one another. consequently, there is a need to further our understanding of the interrelationships among the constructs within reading interest. this study is designed to address this lacuna.

the following sections briefly review some evidence suggesting the relationships among interest in reading in print settings, interest in reading online materials, and interest in social media reading.
the first section outlines previous studies indicating the relationships among reading printed materials, reading online materials, and reading social media. the next section reviews the literature and explores the evidence for interrelationships among the constructs within reading interest. clearly, this is largely an unexplored area. this brief literature review is then followed by the results from an exploration of the relationships and interrelationships between the dimensions of reading interest, based on the data from 993 undergraduate students from one indonesian university. the discussion then builds on the findings from both the survey analyses and the brief review of the literature. it also suggests some steps for exploring the issues raised further in current thinking and practice.

relationship between different modes of reading

previous research that examined the relationship among reading printed materials, reading online materials, and reading social media has shown mixed results. many studies that claimed a close and positive relationship between reading in print settings and reading online materials have based their argument on the amount of time people spent on reading across different types of settings (oecd, 2011; veenhof, 2006). for instance, veenhof (2006) investigated the social impact of internet use based on the statistics canada 2005 general social survey data, and found that readers who spent more time reading from the internet also spent more time reading printed books. similarly, the pisa 2009 project (oecd, 2011) showed that students who read more frequently from online sources also read printed materials more frequently. a study by tenopir, volentine, and king (2013) also reported that readers who used social media more frequently also read scholarly materials more frequently.
while much research has shown close positive relationships between reading patterns involving different modes of reading, other studies demonstrated contrasting results. when it comes to the time spent on reading, a negative relationship was also documented between the amount of time spent on reading online materials and that spent on reading printed materials (dewaal, schönbach, & lauf, 2005; lee & leung, 2008). for instance, the study by lee & leung (2008) investigated the replacement effects of the internet and found that use of the internet for reading is negatively related to reading printed newspapers (r = -.23) and magazines (r = -.39), indicating that those who frequently read online are less likely to read printed materials. another line of studies emphasizing the differences has pointed to the widespread trend of reading newspapers online. online newspapers have now become the preferred news source for young people over printed newspapers (alterman, 2007; kirchhoff, 2010).

together, these studies suggest that there has been a partial shift in the mode of reading from all-print settings to online/digital/social media reading. young generations, in particular, have adopted reading online materials as an alternative mode to conventional reading in print settings. these studies also indicate that reading can happen in three different formats: print, online, and social media. as such, reading interest in the present study is also investigated from these three modes of reading.

relationships among the psycho-behavioural constructs of reading interest

following putro's work (2017), reading interest in the present study is conceptualized to incorporate both mode and psycho-behavioral dimensions (i.e., affective, cognitive, and behavioral).
specifically, putro (2017) claimed that each of these psycho-behavioral dimensions is situated in a particular mode of reading. interest in reading printed materials involves three psycho-behavioral constructs (i.e., elaboration, enjoyment, and competence experience); interest in reading online materials involves five psycho-behavioral constructs (i.e., value, confidence, enjoyment, competence experience, and flow); and interest in reading social media comprises two psycho-behavioral constructs (i.e., sense of belonging and enjoyment). while there have not been systematic reviews or empirical studies that examined the links between all these psycho-behavioral constructs of reading interest, the relationships between certain pairs or groups of the constructs (e.g., enjoyment and flow) have been explored, and their close links, either conceptual or empirical, have been demonstrated. the following segments provide some evidence on the empirical relationships between the ten constructs of reading interest.

enjoyment, flow, and competence

a few studies (e.g., shernoff, csikszentmihalyi, schneider, & shernoff, 2003; sherry, 2004; weber, tamborini, westcott-baker, & kantor, 2009) have considered the role of enjoyment and competence, knowledge, or cognitive abilities in generating flow experiences. people experience being completely absorbed in an activity that they find intensely enjoyable (e.g., shernoff et al., 2003; sherry, 2004). without enjoyment, an intense experience of flow is unlikely to occur (csikszentmihalyi, 1997; sherry, 2004). further, shernoff, csikszentmihalyi, schneider, and shernoff (2003) reported that when students found classroom activities interesting, easy to concentrate on, and enjoyable, their flow condition was also high.
some scholars also argue that competence is important to sustain enjoyment and to transfer it to the flow condition (carroll & loumidis, 2001; csikszentmihalyi, 1997; sherry, 2004).

enjoyment, competence, and achievement

while causal relationships among these constructs have proven hard to demonstrate, the constructs are at least interrelated. for example, confidence in performing a task can lead to a higher competence level, but the reverse relationship, i.e., competence leading to feelings of confidence, is also highly likely (clanton et al., 2014; dunst & dempsey, 2007; pajares & johnson, 1994). not surprisingly, students’ self-evaluation of competence is significantly correlated with their achievement and confidence (pajares & johnson, 1994). examples of domains demonstrating the close relationships of competence, confidence, and achievement are abundant: writing (pajares & johnson, 1994), reading (mcgeown et al., 2015), and general cognitive abilities (stankov & lee, 2008).

enjoyment, competence, value, and achievement

many empirical studies have been conducted within the framework of expectancy-value theory (eccles, 1983; wigfield, 1994; wigfield & eccles, 2002; wigfield & tonks, 2002) to investigate how students’ enjoyment, values, and perceived competence beliefs are related to the attainment of academic outcomes. it appears that competence beliefs, enjoyment (intrinsic value), and utility value positively reinforce each other (chouinard, karsenti, & roy, 2007; cocks & watt, 2004; wilson et al., 2008). while achievement-related outcomes are treated as the final destination in this achievement-motivational theory, a more realistic picture would include reciprocal relationships (marsh & martin, 2011) among these constructs, especially when developmental perspectives (wigfield & eccles, 2002) are taken into account.

elaboration, enjoyment, value, competency, and sense of belonging
empirical studies were able to demonstrate the links between students’ use of elaboration strategy in reading and enjoyment in reading (lau & ho, 2016). even when people read for a targeted purpose (e.g., doing homework, conducting a project), reading with elaboration can be a useful strategy in attaining the goals. the program for international student assessment (pisa) data also showed that students’ use of elaboration strategies is linked to competency beliefs, anxiety, and interest (schleicher, 2016); students who used elaboration strategies more frequently in their reading reported higher self-competence beliefs in their ability, less anxiety, and more interest in reading. students who are confident in their abilities in learning tended to report using more elaboration strategies as well (perry & smart, 2007). in recent studies, people who tended to use the elaboration strategy were also found to have greater intrinsic motivation, sense of belonging, competence, and autonomy (sundar, 2015).

a caveat should be registered: different studies appear to use slightly different labels for the same constructs. in this study, enjoyment and intrinsic value are considered interchangeable, as are relatedness and sense of belonging. value in this study refers to the perception of usefulness, i.e., utility value, perceived value, and perceived utility value. competence means self-beliefs in one’s own capability in completing a task, and is used interchangeably with confidence, competence beliefs, and perceived competence. experiences of competence refer to memories of prior experiences of achievement or mastery of skills or tasks.

method

the participants were undergraduate students in an indonesian university, a medium-sized university with about 25,000 students enrolled in 2014.
a total of 993 undergraduate students volunteered to participate in the study. the survey data were collected between the 17th of august and the 16th of november 2014. seventy-one percent of the participants were female students. the majority of these students were in their second year (45%) and third year (35%). students’ reading interest across the three modes (print, online, social media) was measured with 36 items from the reading interest scale developed by putro (2017), in which the 36 items converged into 10 factors: elaboration in print settings, enjoyment in print settings, competence experience in print settings, utility value in online reading, confidence in online reading, enjoyment in online reading, competence experience in online reading, flow in online reading, sense of belonging in social media reading, and enjoyment in social media reading. in the present study, these 10 factors are referred to as dimensions of reading interest. each survey item was worded to refer to a particular reading mode. the survey respondents were asked to rate their interest in reading in three different formats, i.e., reading in print settings, reading online materials, and reading through social media. all items were measured on a 5-point response scale, ranging from strongly disagree (1) to strongly agree (5), with a middle point of neither disagree nor agree (3).

statistical analysis

the main analyses of the present study were confirmatory factor analysis (cfa) and structural equation modeling (sem). cfa was used to confirm the structure of the reading interest dimensions. sem was used to test the relationships between the psycho-behavioral constructs of reading interest within and across modes of reading. mplus version 7.2 (muthén & muthén, 1998-2012) was used for both the cfa and sem results reported in this study.
the maximum likelihood estimation with robust standard errors (mlr) was used to adjust for non-normality of the survey responses, as suggested in bentler (2005). the model fit indices were the comparative fit index (cfi > .90), tucker-lewis index (tli > .90), root mean square error of approximation (rmsea < .05), and standardized root mean square residual (srmr < .05), where the cut-off scores in parentheses indicate a good model fit (see also byrne, 2006). in addition, a ratio of 3 or less between the chi-square statistic (χ²) and the degrees of freedom (df) was used as an acceptable model fit criterion (see wang & wang, 2012) instead of the significance of χ². the cronbach’s α scale reliability for each factor was calculated with spss version 21.

findings and discussion

nature of reading interest

the result of the confirmatory factor analysis (cfa) showed that the model in which the 36 items converged into 10 factors had a very good fit (χ² = 984.12, df = 549, χ²/df = 1.8, rmsea = .03, srmr = .04, cfi = .97, and tli = .97).

table 1. confirmatory factor analysis on reading interest

item | factor | loading
1. i always connect what i read in printed materials to my background knowledge. | 1 | .81
2. when i read in printed settings, i always try to understand the materials better by relating to my personal experiences. | 1 | .80
3. when i read in printed settings, i always figure out how the information fits in with what happens in my real life. | 1 | .78
4. i enjoy reading printed materials | 2 | .84
5. reading printed materials makes me feel good. | 2 | .77
6. i feel happy if i receive a printed book as a present. | 2 | .70
7. i had good marks because i liked reading printed materials. | 3 | .88
8. my reading in print settings skill continues to help me get good grades. | 3 | .85
9. i did well in school due to my ability in reading printed materials. | 3 | .83
10. i did well in my courses because i read printed materials fast. | 3 | .69
11. reading online materials helps me think about new concepts and ideas. | 4 | .72
12. reading online materials advances my general knowledge. | 4 | .69
13. new ideas come to my mind when i read online. | 4 | .69
14. i learn about what is going on in the world from reading online materials. | 4 | .65
15. reading online materials makes me feel linked to the world. | 4 | .65
16. i obtain a great deal of information whenever do reading online materials. | 4 | .61
17. reading online materials is very easy for me. | 5 | .88
18. i never have problems in reading online materials. | 5 | .76
19. when i read from screen (e.g., computer screen, cell-phone, etc.), i am a good reader. | 5 | .70
20. reading online materials is one of my favourite activities. | 6 | .84
21. reading online materials makes me feel relaxed. | 6 | .81
22. i always try to read online for my own enjoyment. | 6 | .69
23. i did well in my studies at university because of reading online materials. | 7 | .91
24. i did well in school because of my reading online materials ability. | 7 | .91
25. i had good grades because i liked reading online materials. | 7 | .85
26. my academic achievement has been influenced by my ability in reading online materials. | 7 | .83
27. i feel fascinated when i read online. | 8 | .93
28. when i read online, i forget about other things. | 8 | .82
29. time goes faster when i read online. | 8 | .66
30. i feel linked to others who read the same things from social media sites (e.g. facebook, whatsapp). | 9 | .86
31. reading from social media (e.g. facebook, whatsapp) makes me feel connected to the world. | 9 | .81
32. reading from social media (e.g. facebook, whatsapp) makes me feel belonged to a certain group. | 9 | .76
33. reading from social media (e.g. facebook, whatsapp) makes me communicate better with others. | 9 | .71
34. social media reading is one of my favourite activities (e.g. facebook, whatsapp). | 10 | .82
35. most of the knowledge i obtained is from my social media reading. | 10 | .68
36. once i read social media sites (e.g. facebook, whatsapp), i always reading for hours. | 10 | .65

factor | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
cronbach’s α | .84 | .80 | .89 | .82 | .82 | .82 | .95 | .84 | .87 | .76

the standardised factor loadings were all significant and substantial, ranging from β = .61 to β = .91 across all 36 items. alpha coefficients for scores on the 10 reading interest dimensions ranged from .76 (enjoyment in social media reading) to .95 (competence experience in online reading), indicating reasonably good internal consistency for each of the scales. the standardized factor loadings of the cfa results are presented in table 1, together with the cronbach’s α of each factor.

relationships between the psycho-behavioral constructs of reading interest

subsequent to the cfa, the model building strategy was first to construct a model that tests the relationships among the psycho-behavioural dimensions situated in a particular mode (e.g., reading online materials). the dimensions representing the other modes of reading (i.e., print and social media) were then added to build a more comprehensive model of reading interest that represented all three modes of reading.

reading online materials

model a (see figure 1) was constructed to examine the relationships among the reading interest dimensions within the context of online reading. because there are more online reading variables (five) than print reading variables (three) and social media reading variables (two), the model was built with the dimensions related to online reading first.
this model reflects four propositions: (a) enjoyment in reading online materials facilitates flow in online reading (shernoff, csikszentmihalyi, schneider, & shernoff, 2003; sherry, 2004; csikszentmihalyi, 1997); (b) competence experience in reading online materials is positively linked to enjoyment in online reading (e.g., carroll & loumidis, 2001; sherry, 2004); (c) confidence in reading online materials is moderately related to enjoyment in online reading (e.g., clark & de zoysa, 2011); and (d) the perceived value in reading online materials is positively related to enjoyment and competence in reading online materials (e.g., wilson et al., 2008).

figure 1. model a: the relationships among the psycho-behavioural constructs within the context of online reading
notes: val_o: utility value in online reading; conf_o: confidence in online reading; enj_o: enjoyment in online reading; com_o: competence experience in online reading; flw_o: flow in online reading.

table 2. standardised path coefficients, standard errors, ratios of estimates to standard errors (est./s.e.), and p-values for model a

path | estimate | s.e. | est./s.e. | sig.
enjoyment in reading online materials to flow in online reading | .54 | .03 | 16.17 | .00
utility value in reading online materials to enjoyment in online reading | .34 | .04 | 7.68 | .00
confidence in reading online materials to enjoyment in online reading | .50 | .05 | 10.79 | .00
utility value in reading online materials to competence experience in online reading | .49 | .03 | 14.22 | .00
competence experience in reading online materials to confidence in online reading | .42 | .04 | 11.89 | .00

figure 2. model b: the relationships among the psycho-behavioural constructs within the context of online reading and reading in print settings
notes: ela_p: elaboration in print settings; enj_p: enjoyment in print settings; com_p: competence experience in print settings; val_o: utility value in online reading; conf_o: confidence in online reading; enj_o: enjoyment in online reading; com_o: competence experience in online reading; flw_o: flow in online reading.

table 3. standardised path coefficients, standard errors, ratios of estimates to standard errors (est./s.e.), and p-values for model b

path | estimate | s.e. | est./s.e. | sig.
enjoyment in reading online materials to flow in reading online materials | .54 | .03 | 16.14 | .00
utility value in reading online materials to enjoyment in online reading | .34 | .04 | 7.84 | .00
confidence in reading online materials to enjoyment in online reading | .50 | .05 | 11.06 | .00
utility value in reading online materials to competence experience in online reading | .31 | .04 | 8.61 | .00
competence experience in reading online materials to confidence in online reading | .42 | .04 | 11.94 | .00
competence experience in reading in print to competence experience in online reading | .39 | .04 | 9.95 | .00
elaboration in reading in print to utility value in online reading | .68 | .03 | 23.16 | .00

the model that showed the best fit to the data is presented in figure 2. model b yielded good fit (χ² = 1295.34, df = 577, χ²/df = 2.24, rmsea = .04, srmr = .08, cfi = .95, and tli = .95), with fit indices better than those of the other models tested. the standardized path coefficients among the seven latent variables were all significant and substantial, ranging from β = .31 to β = .68. in fact, it is an extension of model a with additional pathways from ‘competence in reading in print settings’ to ‘competence in reading online materials’, and from ‘elaboration in reading in print settings’ to ‘utility values in online reading’.
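the fit statistics reported for the cfa and for model b can be compared with the cut-offs given in the statistical analysis section. the following is a minimal sketch, not the authors’ procedure: the names `FitIndices` and `check_fit` are hypothetical, and the ratio criterion is read as χ²/df of 3 or less, an assumption that matches the χ²/df values the text reports.

```python
from dataclasses import dataclass


@dataclass
class FitIndices:
    """Fit statistics as reported in the text (hypothetical container)."""
    chi2: float
    df: int
    rmsea: float
    srmr: float
    cfi: float
    tli: float


def check_fit(fit: FitIndices) -> dict:
    """Compare each index with the cut-offs stated in the method section.

    Assumption: the ratio criterion is chi-square / df <= 3.
    """
    return {
        "chi2/df <= 3": fit.chi2 / fit.df <= 3.0,
        "rmsea < .05": fit.rmsea < 0.05,
        "srmr < .05": fit.srmr < 0.05,
        "cfi > .90": fit.cfi > 0.90,
        "tli > .90": fit.tli > 0.90,
    }


# Values copied from the reported CFA and Model B results.
cfa = FitIndices(chi2=984.12, df=549, rmsea=0.03, srmr=0.04, cfi=0.97, tli=0.97)
model_b = FitIndices(chi2=1295.34, df=577, rmsea=0.04, srmr=0.08, cfi=0.95, tli=0.95)

for name, fit in [("cfa", cfa), ("model b", model_b)]:
    print(name, round(fit.chi2 / fit.df, 2), check_fit(fit))
```

with the reported values, every criterion is met for the cfa. for model b, the χ²/df, rmsea, cfi, and tli criteria are met, while the reported srmr of .08 lies above the .05 cut-off stated earlier.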
among the 10 factors, there was one more variable related to reading in print settings, namely ‘enjoyment in reading in print settings’. various attempts were made to include this variable, but adding it worsened the overall model fit, and the potential pathways, such as from ‘enjoyment in reading in print settings’ to ‘enjoyment in online reading’ (β = .16, p < .01) and from ‘enjoyment in reading in print settings’ to ‘flow in online reading’ (β = .03, p > .05), showed weak or non-significant links. therefore, the variable was dropped from the final model, and it was concluded that model b is the best representation of the variables related to the two reading settings (i.e., print and online reading). it also shows that reading interest in print settings precedes interest in reading online materials.

reading online materials, social media reading, and reading in print settings

model c (see figure 3) was constructed to examine the relationships among the reading interest dimensions from the three different types of reading modes (i.e., reading online materials, reading in print settings, and social media reading). of the 10 factors, two variables are related to reading in social media: enjoyment in social media reading and sense of belonging through social media reading. models were built to reflect the literature suggesting that: (a) confidence in reading is moderately related to enjoyment in reading (e.g., clark & de zoysa, 2011; mcgeown et al., 2015) and (b) elaboration is related to sense of belonging (e.g., sundar, 2015).

figure 3. model c: the relationships among the psycho-behavioural constructs of interest in online reading, social media reading, and reading in print settings
notes: ela_p: elaboration in print settings; enj_p: enjoyment in print settings; com_p: competence experience in print settings; val_o: utility value in online reading; conf_o: confidence in online reading; enj_o: enjoyment in online reading; com_o: competence experience in online reading; flw_o: flow in online reading; bel_s: sense of belonging in social media reading; enj_s: enjoyment in social media reading.

table 4. standardised path coefficients, standard errors, ratios of estimates to standard errors (est./s.e.), and p-values for model c

path | estimate | s.e. | est./s.e. | sig.
enjoyment in reading online materials to flow in reading online materials | .52 | .03 | 15.16 | .00
utility value in reading online materials to enjoyment in online reading | .33 | .04 | 7.65 | .00
confidence in reading online materials to enjoyment in online reading | .51 | .05 | 11.18 | .00
utility value in reading online materials to competence experience in online reading | .31 | .04 | 8.64 | .00
competence experience in reading online materials to confidence in online reading | .43 | .04 | 12.04 | .00
competence experience in reading in print to competence experience in online reading | .39 | .04 | 10.12 | .00
elaboration in reading in print to utility value in online reading | .72 | .03 | 26.00 | .00
elaboration in reading in print to sense of belonging in social media reading | .41 | .04 | 10.25 | .00
confidence in online reading to enjoyment in social media | .22 | .50 | 4.53 | .00

after several options were tested, a final model was chosen in which ‘elaboration in print settings’ is significantly related to ‘sense of belonging in social media reading’ (β = .41, p < .01) and ‘confidence in online reading’ is significantly linked to ‘enjoyment in social media’ (β = .22, p < .01).
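the est./s.e. column of table 4 can be read as a z statistic, which explains why every entry in the sig. column is printed as .00. the sketch below is illustrative only: it assumes a two-sided test against the standard normal distribution, and the names `two_sided_p` and `MODEL_C_Z` are hypothetical.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal.

    Uses the identity p = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


# est./s.e. values copied from table 4 (model c); keys use the figure notes.
MODEL_C_Z = {
    "enj_o -> flw_o": 15.16,
    "val_o -> enj_o": 7.65,
    "conf_o -> enj_o": 11.18,
    "val_o -> com_o": 8.64,
    "com_o -> conf_o": 12.04,
    "com_p -> com_o": 10.12,
    "ela_p -> val_o": 26.00,
    "ela_p -> bel_s": 10.25,
    "conf_o -> enj_s": 4.53,
}

p_values = {path: two_sided_p(z) for path, z in MODEL_C_Z.items()}

for path, p in p_values.items():
    # Every p-value is far below .005, so it rounds to .00 at two decimals.
    print(f"{path}: z = {MODEL_C_Z[path]:.2f}, p = {p:.2e}")
```

under this assumption, even the smallest ratio in the table (4.53) gives a p-value well below .01, in line with the significance levels quoted in the text.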
model c yielded good fit (χ² = 1292.38, df = 579, χ²/df = 2.23, rmsea = .04, srmr = .08, cfi = .95, and tli = .95). the standardised path coefficients in model c were all significant and substantial, ranging from β = .21 to β = .68 across all nine latent variables. this model reveals weak to moderate relationships between interest in social media reading and interest in reading online materials and printed materials. table 4 shows the standardised parameter estimates and standard errors of all pathways included in model c.

discussion

despite extensive research on reading interest, the relationships among the psycho-behavioral dimensions of that construct remain unclear. the aim of this study was to examine how the dimensions of reading interest within and across modes of reading are related to one another. noteworthy findings from the final models of the relationships among the dimensions of reading interest are considered in this section.

the first important finding is that the dimensions of interest in reading in print settings preceded those of interest in reading online materials, suggesting the importance of interest in reading in print settings for the development of interest in reading online materials. this finding supports the idea that reading in print settings is positively linked to, or may even facilitate, reading online materials (e.g., coiro, 2011a, 2011b; coiro & dobler, 2007; schmar-dobler, 2003). it may partly be explained by the fact that, to get the most from reading online materials, readers need to be proficient in reading in print settings and to be able to use their reading-in-print strategies when reading in online settings. fluent in-print readers need to learn additional practices and strategies, such as how to use web-based search engines and how to locate information efficiently and effectively, by adapting the strategies they use when they read in print settings. to better understand what they read online, undergraduate students also need to connect what they already know from reading in print settings to what they read online.

in contrast to earlier findings (e.g., de waal et al., 2005; mokhtari et al., 2009), however, this study found no evidence of negative relationships between interest in reading online materials and interest in reading in print settings. a possible explanation is that previous studies relied on the frequency of either reading online materials or reading in print settings as the measure of reading interest. these studies concluded that there is a negative relationship between reading online materials and reading in print because the time individuals spent reading online materials reduced the time they spent reading printed materials: time spent on one activity cannot be spent on another (see mokhtari et al., 2009; valkenburg & peter, 2007). this study, however, did not rely on frequency of reading as the measure of reading interest, which might account for the different results.

a moderate relationship between elaboration in reading in print settings and sense of belonging in social media reading was also documented in this study. this relationship may partly be explained by the nature of social media reading itself; that is, an activity performed to establish interactions among readers who share a common interest. the source of this interest may be what they read in print settings (e.g., interest in reading printed novels or comics). this result supports previous research findings that reading in print settings is related to social media reading (e.g., cheung, chiu, & lee, 2011; tenopir et al., 2013).
the particular relationship between elaboration and sense of belonging has also been documented by sundar (2015).

third, considering the relationships between particular reading interest dimensions, an important finding is that enjoyment in reading seems to be the only variable directly and consistently connected to flow in reading. in the three models (a, b, and c), it is evident that enjoyment is the sole predictor of flow across modes of reading. this result appears to be consistent with other research showing that flow occurred only when individuals continued to follow their sense of enjoyment in a particular object of interest (csikszentmihalyi, 1997; shernoff et al., 2003). a possible explanation is that the flow experience in reading occurs only when an individual finds the reading activity intrinsically enjoyable. thus, people who know the value of the reading material and are confident in their reading skills will not experience flow if they do not find the reading activity enjoyable. this finding has important implications for the use of enjoyment in reading as one of the key predictors of flow in reading in future measurement of reading interest.

this study also found that utility value in reading was significantly connected to enjoyment in reading online materials. this result is in line with those of previous studies (e.g., nakamura & csikszentmihalyi, 2009; wilson et al., 2008) and may help us to understand why some students are reluctant to read when they cannot perceive the value of what they need to read. it may be explained by the fact that people find a reading activity enjoyable when they believe the activity is valuable or worth doing. in other words, the value or benefits expected from reading a text may help the reader to find the reading activity pleasurable.
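because enjoyment is the only direct predictor of flow in the models, any association between the other online reading dimensions and flow runs through enjoyment. one hedged way to illustrate such chains is the usual path-tracing rule, under which the indirect effect along a route is the product of its standardised coefficients. these products are not reported in the study; the sketch below only multiplies the model c estimates from table 4, and the names `MODEL_C_PATHS` and `indirect_effect` are hypothetical.

```python
from math import prod

# Standardised path coefficients copied from table 4 (model c).
# Keys are (from, to) pairs using the abbreviations in the figure notes.
MODEL_C_PATHS = {
    ("val_o", "enj_o"): 0.33,
    ("conf_o", "enj_o"): 0.51,
    ("enj_o", "flw_o"): 0.52,
    ("val_o", "com_o"): 0.31,
    ("com_o", "conf_o"): 0.43,
}


def indirect_effect(route, paths=MODEL_C_PATHS):
    """Product of the coefficients along a chain of directed paths.

    Illustrative only: assumes a recursive model and the tracing rule;
    it gives no standard error or significance test for the product.
    """
    return prod(paths[(a, b)] for a, b in zip(route, route[1:]))


# Utility value -> enjoyment -> flow.
value_to_flow = indirect_effect(["val_o", "enj_o", "flw_o"])
# Competence experience -> confidence -> enjoyment.
competence_to_enjoyment = indirect_effect(["com_o", "conf_o", "enj_o"])

print(round(value_to_flow, 3), round(competence_to_enjoyment, 3))
```

a product of this kind is only a descriptive summary of the reported coefficients; testing an indirect effect formally would require the original data or the full model output.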
another interesting finding is that enjoyment in reading online materials is significantly linked to confidence in reading online materials and that confidence in reading online materials is significantly predicted by competence experience in reading online materials. this result supports the idea that enjoyment in reading is strongly influenced by both competence and confidence in reading (e.g., clark & de zoysa, 2011). the significant correlations among confidence, competence, and enjoyment may be explained as follows: improvement in individuals’ competence in reading leads to improvement in their confidence in reading. improvement in their confidence may in turn lead to an increase in the pleasure or enjoyment derived from reading, as individuals will only find the activity enjoyable when they are confident that their skills meet the associated reading challenges (shernoff et al., 2003).

the results of this study also show that elaboration in reading is strongly connected with both utility value in reading online materials and enjoyment in recreational reading. this result is consistent with findings from other studies that elaboration is moderately related to enjoyment (e.g., frenzel, goetz, stephens, & jacob, 2009) and utility value (e.g., brockman, 2006). one possible reason is that linking individuals’ prior knowledge with what they are reading facilitates the creation of a balance between what they already know and the challenge of the reading process. in turn, this leads them to perceive the reading activity as enjoyable and as valuable or worth doing.
these findings suggest that, to help learners get the most from what they read, teachers need to activate students’ prior knowledge before gradually changing the level of reading challenge, so that students enjoy the reading process and recognize the value of what they read.

this study also found a weak but significant relationship between confidence in reading online materials and enjoyment in social media reading. this result supports the ideas of dunst and dempsey (2007), who found that confidence in parenting led to enjoyment in parenting. further, research has shown that confidence in reading is moderately related to enjoyment in reading (mcgeown et al., 2015). this relationship may be partly explained by the fact that when people believe they are good at a particular activity (i.e., confident in their abilities), they are more likely to enjoy performing the activity (boyd & yin, 1996; carroll & loumidis, 2001; durik, vida, & eccles, 2006). thus, undergraduate students’ belief in their abilities in reading online materials appears to lead them to enjoy reading through social media platforms such as facebook and twitter.

conclusion and suggestions

this study provides evidence of how the psycho-behavioral constructs of reading interest are related to one another. given that the constructs within interest in reading in print settings are connected to those in reading online materials, educators need to encourage students to use both reading modes to help them get the best from their reading. online materials help students search for information efficiently and effectively, whereas printed materials facilitate deep understanding.
the existence of moderate to strong relationships among elaboration in reading in print settings, utility value in reading online materials, confidence in reading online materials, and enjoyment in reading online materials suggests that educators can enhance students’ reading interest (particularly their enjoyment in reading) by connecting reading tasks to real-life experience, assigning value to the reading activity, and developing students’ confidence in reading. although testing of the final model of reading interest dimensions yielded an acceptable fit with the data, other models might also yield an acceptable fit. evidence from other types of investigations is required to confirm these models and to test their application.

references

alterman, e. (2007). out of print: the death and life of the american newspaper. caligrama (são paulo. online), 3(3). doi: 10.11606/issn.1808-0820.cali.2007.67395
bentler, p. m. (2005). eqs 6 structural equations program manual. encino, ca: multivariate software, inc.
boyd, m. p., & yin, z. (1996). cognitive-affective sources of sport enjoyment in adolescent sport participants. adolescence, 31(122), 383-395.
brockman, g. (2006). what factors influence achievement in remedial mathematics classes? (doctoral dissertation). university of southern california, los angeles, ca.
buzzetto-more, n., guy, r., & elobaid, m. (2007). reading in a digital age: ebooks are students ready for this learning object? interdisciplinary journal of elearning and learning objects, 3, 239-250.
byrne, b. m. (2006). structural equation modeling with eqs: basic concepts, applications, and programming. mahwah, nj: lawrence erlbaum associates.
carroll, b., & loumidis, j. (2001). children’s perceived competence and enjoyment in physical education and physical activity outside school. european physical education review, 7(1), 24-43. doi: 10.1177/1356336x010071005
cheung, c. m., chiu, p.-y., & lee, m. k. (2011). online social networks: why do students use facebook? computers in human behavior, 27(4), 1337-1343. doi: 10.1016/j.chb.2010.07.028
chouinard, r., karsenti, t., & roy, n. (2007). relations among competence beliefs, utility value, achievement goals, and effort in mathematics. british journal of educational psychology, 77(3), 501-517. doi: 10.1348/000709906x133589
clanton, j., gardner, a., cheung, m., mellert, l., evancho-chapman, m., & george, r. l. (2014). the relationship between confidence and competence in the development of surgical skills. journal of surgical education, 71(3), 405-412. doi: 10.1016/j.jsurg.2013.08.009
clark, c., & de zoysa, s. (2011). mapping the interrelationships of reading enjoyment, attitudes, behaviour and attainment: an exploratory investigation. london: national literacy trust.
cocks, r. j., & watt, h. m. (2004). relationships among perceived competence, intrinsic value and mastery goal orientation in english and maths. the australian educational researcher, 31(2), 81-111. doi: 10.1007/bf03249521
coiro, j. (2011a). predicting reading comprehension on the internet: contributions of offline reading skills, online reading skills, and prior knowledge. journal of literacy research, 43(4), 352-392. doi: 10.1177/1086296x11421979
coiro, j. (2011b). talking about reading as thinking: modeling the hidden complexities of online reading comprehension. theory into practice, 50(2), 107-115. doi: 10.1080/00405841.2011.558435
coiro, j., & dobler, e. (2007). exploring the online reading comprehension strategies used by sixth-grade skilled readers to search for and locate information on the internet. reading research quarterly, 42(2), 214-257. doi: 10.1598/rrq.42.2.2
csikszentmihalyi, m. (1997). flow and the psychology of discovery and invention. new york, ny: harper collins.
de waal, e., schönbach, k., & lauf, e. (2005). online newspapers: a substitute or complement for print newspapers and other information channels? communications, 30(1), 55-72. doi: 10.1515/comm.2005.30.1.55
dunst, c. j., & dempsey, i. (2007). family-professional partnerships and parenting competence, confidence, and enjoyment. international journal of disability, development and education, 54(3), 305-318. doi: 10.1080/10349120701488772
durik, a. m., vida, m., & eccles, j. s. (2006). task values and ability beliefs as predictors of high school literacy choices: a developmental analysis. journal of educational psychology, 98(2), 382-393. doi: 10.1037/0022-0663.98.2.382
eccles, j. s. (1983). expectancies, values, and academic behaviors. in j. spence (ed.), achievement and achievement motives: psychological and sociological approaches (pp. 75-146). san francisco, ca: freeman.
frenzel, a. c., goetz, t., stephens, e. j., & jacob, b. (2009). antecedents and effects of teachers’ emotional experiences: an integrated perspective and empirical test. in p. schutz (ed.), advances in teacher emotion research: the impact on teachers’ lives (pp. 129-151). dordrecht: springer.
junco, r. (2012). too much face and not enough books: the relationship between multiple indices of facebook use and academic performance. computers in human behavior, 28(1), 187-198. doi: 10.1016/j.chb.2011.08.026
karim, n. s. a., & hasan, a. (2007). reading habits and attitude in the digital age. the electronic library, 25(3), 285-298. doi: 10.1108/02640470710754805
kirchhoff, s. m. (2010, september). us newspaper industry in transition. paper presented in congressional research service report for congress.
kirschner, p. a., & karpinski, a. c. (2010). facebook® and academic performance. computers in human behavior, 26(6), 1237-1245. doi: 10.1016/j.chb.2010.03.024
lau, k.-l., & ho, e. s.-c. (2016). reading performance and self-regulated learning of hong kong students: what we learnt from pisa 2009. the asia-pacific education researcher, 25(1), 159-171. doi: 10.1007/s40299-015-0246-1
lee, p. s., & leung, l. (2008). assessing the displacement effects of the internet. telematics and informatics, 25(3), 145-155. doi: 10.1016/j.tele.2006.08.002
liu, z. (2005). reading behavior in the digital environment: changes in reading behavior over the past ten years. journal of documentation, 61(6), 700-712. doi: 10.1108/00220410510632040
liu, z., & huang, x. (2008). gender differences in the online reading environment. journal of documentation, 64(4), 616-626. doi: 10.1108/00220410810884101
marsh, h. w., & martin, a. j. (2011). academic self-concept and academic achievement: relations and causal ordering. british journal of educational psychology, 81(1), 59-77. doi: 10.1348/000709910x503501
mcgeown, s. p., johnston, r. s., walker, j., howatson, k., stockburn, a., & dufton, p. (2015). the relationship between young children’s enjoyment of learning to read, reading attitudes, confidence and attainment. educational research, 57(4), 389-402. doi: 10.1080/00131881.2015.1091234
mckenna, m. c., conradi, k., lawrence, c., jang, b. g., & meyer, j. p. (2012). reading attitudes of middle school students: results of a u.s. survey. reading research quarterly, 47(3), 283-306. doi: 10.1002/rrq.021
mokhtari, k., reichard, c. a., & gardner, a. (2009). the impact of internet and television use on the reading habits and practices of college students. journal of adolescent & adult literacy, 52(7), 609-619. doi: 10.1598/jaal.52.7.6
muthén, l. k., & muthén, b. o. (1998-2012). mplus user’s guide (7th ed.). los angeles, ca: muthén & muthén.
nakamura, j., & csikszentmihalyi, m. (2009). the concept of flow. in c. r. snyder & s.j.
lopez, oxford handbook of positive psychology (pp. 89-105). new york, ny: oxford university press. oecd. (2011). pisa 2009 results: students on line: digital technologies and performance (vol. 6). paris: oecd publishing. pajares, f., & johnson, m. j. (1994). confidence and competence in writing: the role of self-efficacy, outcome expectancy, and apprehension. research in the teaching of english, 28(3), 313-331. doi: 10.2307/40171341. perry, r. p., & smart, j. c. (2007). the scholarship of teaching and learning in higher education: an evidence-based perspective. dordrecht: springer science & business media. putro, n. (2017). reading interest in a digital age (doctoral dissertation). university of new south wales, sydney. schleicher, a. (2016). teaching excellence through professional learning and policy reform: lessons from around the world. international summit on the teaching profession. oecd publishing. doi: 10.1789/9789264252059.en schmar-dobler, e. (2003). reading on the internet: the link between literacy and technology. journal of adolescent & adult https://doi.org/10.1348/000709910x503501 reid (research and evaluation in education) the structural equation modeling of reading interest... 63 nur hidayanto pancoro setyo putro & jihyun lee literacy, 47(1), 80-85. doi: 10.2307/40026906 shernoff, d. j., csikszentmihalyi, m., shneider, b., & shernoff, e. s. (2003). student engagement in high school classrooms from the perspective of flow theory. school psychology quarterly, 18(2), 158-176. doi: 10.1521/scpq.18.2.158.21860 sherry, j. l. (2004). flow and media enjoyment. communication theory, 14(4), 328-347. stankov, l., & lee, j. (2008). confidence and cognitive test performance. journal of educational psychology, 100(4), 961-976. doi: 10.1037/a0012546 sundar, s. s. (2015). the handbook of the psychology of communication technology. chicester: john wiley & sons. http://unsw.eblib.com/patron/fullrec ord.aspx?p=1895448 tenopir, c., volentine, r., & king, d. w. (2013). 
social media and scholarly reading. online information review, 37(2), 193216. doi: 10.1108/oir-04-2012-0062 valkenburg, p. m., & peter, j. (2007). online communication and adolescent wellbeing: testing the stimulation versus the displacement hypothesis. journal of computer‐mediated communication, 12(4), 1169-1182. doi: 10.1111/j.10836101.2007.00368.x veenhof, b. (2006). the internet: is it changing the way canadians spend their time?. statistics canada. available at http://www.statcan.gc.ca/pub/56f0004 m/56f0004m2006013-eng.htm. walsh, m. (2010). multimodal literacy: what does it mean for classroom practice? australian journal of language and literacy, 33(3), 211–239. wang, j., & wang, x. (2012). structural equation modeling: applications using mplus. chicester: john wiley & sons. weber, r., tamborini, r., westcott‐baker, a., & kantor, b. (2009). theorizing flow and media enjoyment as cognitive synchronization of attentional and reward networks. communication theory, 19(4), 397-422. doi: 10.1111/j.14682885.2009.01352.x wigfield, a. (1994). expectancy-value theory of achievement motivation: a developmental perspective. educational psychology review, 6(1), 49-78. doi: 10.2307/23359359 wigfield, a., & eccles, j. s. (2002). the development of competence beliefs, expectancies for success, and achievement values from childhood through adolescence. in a. wigfield, & j. eccles (eds.), the development of achievement motivation (pp. 91-120). san diego, ca: academic press. doi: https://doi.org/10.1016/b978012750053-9/50006-1 wigfield, a., & tonks, s. (2002). adolescents’ expectancies for success and achievement task values during the middle and high school years. in t. urdan & f. pajares (eds.), academic motivation of adolescents (pp. 53-82). greenwich, ct: information age. wilson, n., bouhuijs, p., conradie, h., reuter, h., van heerden, b., & marais, b. (2008). perceived educational value and enjoyment of a rural clinical rotation for medical students. 
rural remote health, 8(3), 999. available at: http://www.rrh.org.au/articles/subvie wnew.asp?articleid=999. http://unsw.eblib.com/patron/fullrecord.aspx?p=1895448 http://unsw.eblib.com/patron/fullrecord.aspx?p=1895448 http://www.rrh.org.au/articles/subviewnew.asp?articleid=999 http://www.rrh.org.au/articles/subviewnew.asp?articleid=999 copyright © 2018, reid (research and evaluation in education) issn 2460-6995 reid (research and evaluation in education), 4(1), 2018, 22-34 available online at: http://journal.uny.ac.id/index.php/reid a factor analysis of an instrument for measuring physical abuse experience of students at school *1safrudin amin; 2badrun kartowagiran; 3pracha inang 1faculty of literature and culture, universitas khairun 1jl. pertamina kampus ii unkhair gambesi kota ternate selatan, 97719, indonesia 2faculty of engineering, universitas negeri yogyakarta jl. colombo no. 1, karangmalang, depok, sleman 55281, yogyakarta, indonesia 3faculty of education, burapha university 169 longhaad bangsaen road, saensook, mueang, chonburi 20131, thailand *corresponding author. e-mail: safrudinamin@gmail.com submitted: 24 may 2018 | revised: 04 august 2018 | accepted: 08 august 2018 abstract violence in schools is increasingly reported by the mass media. it indicates that its prevalence is escalating. an instrument which has a proper psychometric property is needed to investigate the phenomenon. the study aims to develop an instrument for measuring physical abuse experienced by students in schools and explore the construct of the instrument. to pursue those objectives, the content validitity, construct validity, and reliability analysis on the developed instrument were measured. its content validity was confirmed through expert judgment, construct validity was proven through exploratory factor analysis, and reliability was estimated through cronbach’s alpha coefficient. 
experts considered that the content of all items was relevant, though they also suggested some improvements in wording for greater clarity. the exploratory factor analysis on 31 items indicates that seven items need to be dropped and that the remaining 24 items are divided into three factors called (1) victimized by friends, with factor loadings ranging from 0.44 to 0.69, (2) victimizing friends, with factor loadings ranging from 0.45 to 0.66, and (3) being victimized by teachers, with factor loadings ranging from 0.57 to 0.68. the reliability of the instrument was 0.874. based on this result, the developed instrument consists of three factors with good validity and reliability.

keywords: physical abuse, student, school, validity, reliability

introduction

studies on abuse against children have been conducted by individuals as well as organizations. research findings show an increasing incidence of violence against children in indonesia. this alarming trend, however, has not attracted serious attention from those in power (idris, 2015). one of the most important concerns is violence that takes place at school. a survey conducted by plan international and the international center for research on women (icrw) shows that 84% of indonesian children experience abuse in school. this figure is higher than that for the asian region, which is 70% (qodar, 2015). the data released in june 2015 by the commission of indonesian children protection (komisi perlindungan anak indonesia – kpai) show that from 2011 to april 2015, violence against children grew significantly. in 2012, a survey in nine provinces demonstrated that 87.6% of students experienced abuse in school (setyawan, 2015). the data published by kpai in november 2017 show that violence against children in school is mounting. as many as 84% of students, or roughly eight out of ten, have experienced abuse in school.
among them, 45% of male students report that their teachers or school staff are the perpetrators (setyawan, 2017).

before proceeding further, it is important to clarify the terms ‘violence’, ‘abuse’, ‘maltreatment’, ‘bullying’, and the like. many studies by organizations or individuals have used the terms interchangeably, treating them as referring to the same phenomena, or have treated some concepts as parts of other concepts. the un secretary-general’s study defines violence against children in line with article 19 of the crc, which treats ‘abuse’, ‘maltreatment’, and ‘exploitation’ as parts of violence (unicef, 2014b, p. 2). the world health organization (who) equates the concepts of ‘abuse’ and ‘maltreatment’ (unicef, 2014a, p. 19). it states that child abuse or maltreatment ‘…constitutes all forms of physical and/or emotional ill-treatment, sexual abuse, neglect or negligent treatment or commercial or other exploitation, resulting in actual or potential harm to the child’s health, survival, development or dignity in the context of a relationship of responsibility, trust or power’.

unicef (2014a, p. 21) compiled an inventory of studies on violence against children and grouped studies using different terms such as ‘physical violence’, ‘physical abuse’, and ‘physical maltreatment’ into one category, the physical dimension of violence, because they deal with roughly the same phenomena. here, ‘violence’, ‘abuse’, and ‘maltreatment’ are regarded as identical. unicef (2014a, p. 21) also includes ‘bullying’ as part of ‘violence’. nansel et al. (2001) define bullying as aggressive behavior which is intended to harm or disturb, which is committed by a more powerful person or group against those who are less powerful, and which occurs repeatedly over time.
the characteristics of intention to harm others and asymmetric power between the perpetrators and their victims overlap with the definition of ‘violence’ held by who, which also emphasizes the intention to harm others by using power (unicef, 2014a, p. 19). nansel et al. (2001, p. 2094) also state that bullying behavior could be verbal, psychological, or even physical. rivers and smith (1994, p. 362) find that physical abuse in bullying could take the form of ‘…direct physical behaviours such as hitting, kicking, and stealing’. nspcc (2016, p. 7) uses the term ‘physical bullying’ to refer to ‘kicking, hitting, biting, pinching, hair pulling, and making threats’. this shows that the physical forms of ‘bullying’ coincide with ‘physical abuse’, and it is reasonable that unicef includes ‘bullying’ as a part of ‘violence’.

this research adopts who’s definition of child abuse mentioned earlier since it recognizes that abuse or maltreatment does not always result in actual harm but may also take the form of potential harm. however, as we go further to discuss physical abuse in this section, it will become clearer that our point of emphasis is not on the effects of violent acts as asserted by who, but on the acts of violence themselves.

apart from the conceptual problem, many experts conclude that any kind of abuse against children, whether committed by teachers or fellow students in school or taking place outside school, has a destructive impact on children’s academic performance in school, in addition to other negative impacts faced by the children. hyman and perone (1998, p. 19) explain that many studies have found that children who experience psychological maltreatment during their preschool and school age have lower academic performance. likewise, their ability and social competence are also low compared to those of students who have not experienced such maltreatment. this is in line with ajema, muraya, karuga, and kiruki (2016, p.
2) who conclude that violence in school and the associated fear, anxiety, and injuries contribute to poor education and health outcomes. according to them, violence in school can undermine children’s capacity and potential to benefit fully from their education because they tend to be absent, unwilling to continue their study, and weakly motivated to pursue academic achievement. nansel et al. (2001) summarize that bullying has a significant correlation with academic achievement. both the perpetrators and victims show low academic achievement compared to students who are not involved in abuse. quoting some studies, simpson (2015, p. 18) also confirms that abuse such as bullying can badly affect students’ academic performance. cohn and canter (2003) find that bullying causes the victims to face difficulties in dealing with academic challenges in school, and that being either a perpetrator or a victim is strongly correlated with dropping out. in addition, based on some studies, the united nations secretary-general’s study (2006, p. 130) also synthesizes that ‘physical and psychological punishment, verbal abuse, bullying and sexual violence in schools are repeatedly reported as the reasons for absenteeism, dropping-out, and lack of motivation for academic achievement’.

so far, violence against children has been reflected in various terms, such as ‘child abuse’, ‘violence against children’, ‘maltreatment’, ‘bullying’, and some more. however, the aspects of abuse are rather well accepted by different organizations and scholars. choo, dunne, marret, fleming, and wong (2011) divide child abuse or the victimization of children into four categories: physical abuse, sexual abuse, emotional abuse, and neglect. in line with that notion, law no.
35 of 2014 of the republic of indonesia also states that the aspects of child abuse are physical, psychological, sexual, and neglect.

moreover, experts describe the physical aspect of abuse in detail. muthmainnah (2014, p. 446) states that ‘physical abuse occurs when an adult (parent, educator, caregiver, etc.) injures a child physically such as hitting, pinching, kicking, slapping, etc.’ clark, clark, and adamec (2007, p. 203) define physical abuse as ‘an act of commission by a parent or other persons that may or may not be accidental and that results in physical injury.’ besides, who claims that: ‘physical abuse of a child is that which results in actual or potential harm from an interaction or lack of an interaction, which is reasonably within the control of a parent or person in a position of responsibility, power or trust. there may be a single or repeated incidents’ (unicef, 2014a, p. 20).

the adjective ‘physical’ in ‘physical abuse’ has given rise to various derivative terms such as physical violence, physical assault, physical harassment, physical victimization, physical maltreatment, physical bullying, and the like, but all refer to threats or harmful actions that target the victim’s body, whether or not they cause physical injury. this definition emphasizes the acts of violence or abuse rather than the results of the acts. this position is fully reflected in the instrument developed in this study.

in the context of research on child abuse in school, this instrument development is considered to be crucial for two reasons. first, there is clear evidence of an increasing number of child abuse cases in school, including physical abuse, which has potentially destructive impacts on students. second, studies on violence in school frequently do not make public the detailed psychometric properties of their instruments.
in order to be useful, the instrument developed must have good validity and reliability to ensure its accuracy and internal consistency.

method

this study selected 584 respondents, who were grade ix students of three junior high schools in ternate, north maluku, indonesia. they were asked to fill out a questionnaire concerning their experiences of physical abuse in their schools. out of the total sample, 577 responses were suitable for analysis. this research adopted several items from previous studies (choo et al., 2011; straus, hamby, boney-mccoy, & sugarman, 1996; straus, hamby, finkelhor, moore, & runyan, 1998; unicef, 2014a), then tailored and modified them to meet its specific objectives.

the items in the questionnaire were scored using a modified likert scale. the respondents were asked to choose one of the responses offered. the response categories were: never = 1, seldom = 2, sometimes = 3, frequently = 4, and always = 5. all items were cast in positive terms. the 31 items addressed three different aspects assumed to be the aspects of child physical abuse: abuse committed by teachers, abuse committed by fellow students, and abuse committed by respondents against other students.

the content validity was confirmed through expert judgment to ensure its relevance to the construct to be measured. three reviewers reviewed the first draft of the instrument and provided their input to improve the quality of the instrument. each item was accompanied by five alternative responses, and each reviewer scored the item by choosing one of them: 1 = irrelevant, 2 = rather relevant, 3 = relevant enough, 4 = relevant, and 5 = very relevant.
experts considered the content of all 31 items to be either relevant or very relevant, and suggested some improvements in wording for more clarity. the construct validity was examined through exploratory factor analysis (efa). the exploratory analysis employed orthogonal rotation carried out with the varimax approach. the reliability of the instrument was estimated using coefficient alpha. both analyses were performed using spss 23 for windows.

findings and discussion

exploratory factor analysis

as in many previous studies, construct validity can be proven by employing confirmatory factor analysis (cfa) (widdiharto, kartowagiran, & sugiman, 2017) and/or exploratory factor analysis (clemens, carey, & harrington, 2010). this study employed exploratory factor analysis (efa) in order to explore the dimensions or factors in the instrument based on the empirically collected data (kartowagiran, 2008, p. 188).

the results of the initial check show that the instrument has a value of 0.89 on the kaiser-meyer-olkin (kmo) measure of sampling adequacy. this value is higher than the minimum required value of 0.5. the significance value (sig.) of bartlett’s test is 0.000 < 0.05. both values suggest that the data collected by using this physical abuse instrument were suitable for factor analysis. the next analysis involved factor extraction, factor rotation, interpretation of the result, reliability estimation, and naming the factors.

the purpose of factor extraction is ‘to determine the number of initial subsets or factors that appear to represent the dimensions of the construct which is being measured’ (pett, lackey, & sullivan, 2003, p. 85). there are several factor extraction methods available for factor analysis, and some of them are available in statistical software such as statistical package for the social sciences (spss) or statistical analysis system (sas).
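as a rough illustration of the reliability estimate mentioned in the method, coefficient alpha can be computed directly from a respondents-by-items score matrix. the sketch below is a minimal python version, not the spss procedure used in the study; the function name `cronbach_alpha` and the small `demo` data set are hypothetical and only show the usual formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # sample variance of each respondent's total score
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# hypothetical responses on the 1-5 scale (4 respondents x 3 items),
# invented for illustration; these are NOT the study's data
demo = [[1, 2, 1], [3, 3, 2], [4, 4, 5], [5, 4, 5]]
print(round(cronbach_alpha(demo), 3))
```

applied to the study's own response matrix, the same formula would yield the kind of coefficient the article reports (0.874), although the exact spss settings are not described in the text.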
this article prefers principal component analysis (pca) to other methods, although some methodologists are not convinced of the use of pca for various reasons. costello and osborne (2005, p. 2), for example, write that ‘component analysis is only a data reduction method’ and that it does not have ‘regard to any underlying structure caused by latent variables’. although many criticisms stand against the use of principal component analysis, kline (2008, p. 74) sees that ‘principal factor analysis seems to be a sensible choice’ in factor analysis. besides, principal component analysis is the most popular method, probably because some statistics software packages use it as their default (costello & osborne, 2005, p. 2), and its result is easier to interpret than that of other methods (pett et al., 2003, p. 102).

there are some common approaches to determining the number of extracted factors to be retained (fabrigar & wegener, 2012, pp. 53–67). this study applied three of them which were considered to be the most common procedures, namely, eigenvalues greater than 1, percentage of variance explained, and the scree plot. these methods are the most frequently used in determining the factor solution in the form of unrotated factor solutions. although these approaches sometimes ‘do not provide meaningful and easily interpretable clusters of items’ (pett et al., 2003, p. 131), they are most commonly used in the stage of factor extraction before proceeding to factor rotation.

one of the results of factor extraction is the table of total variance explained. in this study (see table 1), seven factors or components with eigenvalues > 1 were formed. factor one has an eigenvalue of 7.550, factor two has an eigenvalue of 1.983, and factor three has an eigenvalue of 1.827.
the fourth factor has an eigenvalue of 1.528, the fifth factor has 1.218, the sixth factor has 1.064, and the seventh factor has 1.019. using this procedure, the number of components with eigenvalues > 1 would be counted as the number of the extracted factors which later are specified into the model (fabrigar & wegener, 2012, p. 55).

table 1. total variance explained in seven-factor model

            initial eigenvalues           extraction sums of sq. loadings   rotation sums of sq. loadings
component   total   % of var.   cum. %    total   % of var.   cum. %        total   % of var.   cum. %
1           7.550   24.354      24.354    7.550   24.354      24.354        3.783   12.203      12.203
2           1.983    6.395      30.749    1.983    6.395      30.749        2.913    9.398      21.601
3           1.827    5.894      36.643    1.827    5.894      36.643        2.700    8.708      30.309
4           1.528    4.929      41.572    1.528    4.929      41.572        2.129    6.868      37.177
5           1.218    3.928      45.499    1.218    3.928      45.499        1.801    5.808      42.985
6           1.064    3.431      48.931    1.064    3.431      48.931        1.528    4.928      47.913
7           1.019    3.286      52.216    1.019    3.286      52.216        1.334    4.303      52.216
8           0.978    3.156      55.372
9           0.917    2.958      58.331
10          0.888    2.865      61.196
11          0.843    2.719      63.915
12          0.800    2.581      66.496
13          0.790    2.549      69.045
14          0.759    2.448      71.493
15          0.720    2.322      73.815
16          0.665    2.146      75.961
17          0.653    2.107      78.069
18          0.628    2.026      80.095
19          0.610    1.967      82.062
20          0.587    1.893      83.955
21          0.562    1.813      85.769
22          0.534    1.721      87.490
23          0.527    1.700      89.190
24          0.509    1.641      90.831
25          0.484    1.561      92.392
26          0.475    1.531      93.923
27          0.433    1.395      95.318
28          0.391    1.261      96.580
29          0.376    1.213      97.793
30          0.368    1.186      98.979
31          0.317    1.021     100.000
extraction method: principal component analysis.

the second approach is the percentage of variance explained by each component. the table of total variance explained shows seven components, each of which explains a different share of the variance.
component one accounted for 24.354% of the variance, component two accounted for 6.395%, and component three explained 5.894%. the remaining four factors explained 4.929%, 3.928%, 3.431%, and 3.286% of the variance, respectively. this seven-factor solution explained 52.216% of the variance in total.

the last approach used to determine the number of extracted factors was the scree plot (see figure 1). the scree test basically examines ‘the graph of the eigenvalues and looking for the natural bend or breaks point in the data where the curve flattens out’ (costello & osborne, 2005, p. 3). although the interpretation of the scree plot is subjective in nature (fabrigar & wegener, 2012, p. 58; kline, 2008, p. 75), gorsuch prefers the scree plot to the eigenvalue > 1 rule as the criterion (pett et al., 2003, p. 120). with reference to that criterion, it is difficult to assume the formation of seven factors, since only four factors lie above the bend of the scree plot.

in terms of the variance explained, although many researchers stop the factor extraction process when the total variance explained reaches 50-80%, there are no definite guidelines for a particular threshold (pett et al., 2003, p. 116). hair et al. (1995) suggest retaining factors until the last factor accounts for no less than 5% of the explained variance (pett et al., 2003, p. 116). although this criterion is intended for natural science, it is quite relevant in this study, as can be seen in the rest of this article.

figure 1. scree plot of seven-factor model

by applying that criterion to the above seven-factor model, it appeared that only the first three factors met the 5% criterion, despite the fact that seven factors had eigenvalues > 1.
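the two numerical retention rules discussed above can be checked against the eigenvalues reported in table 1. the short python sketch below is only an illustration under stated assumptions: it uses the first eight eigenvalues from table 1, relies on the fact that in a principal component analysis of a correlation matrix the total variance equals the number of items (31 here), and uses hypothetical names such as `percent_variance`; it is not the spss output itself.

```python
# first eight initial eigenvalues as reported in table 1 (31 items in total)
EIGENVALUES = [7.550, 1.983, 1.827, 1.528, 1.218, 1.064, 1.019, 0.978]
N_ITEMS = 31


def percent_variance(eigenvalue, n_items=N_ITEMS):
    """Share of total variance (in %) explained by one component."""
    return 100.0 * eigenvalue / n_items


# rule 1: keep components with eigenvalue > 1
kaiser_count = sum(1 for ev in EIGENVALUES if ev > 1.0)

# rule 2: keep components explaining at least 5% of the variance
five_percent_count = sum(1 for ev in EIGENVALUES if percent_variance(ev) >= 5.0)

# cumulative variance explained by the first seven components
cumulative_seven = sum(percent_variance(ev) for ev in EIGENVALUES[:7])

print(kaiser_count, five_percent_count, round(cumulative_seven, 1))
```

small differences from the percentages printed in table 1 are expected because the eigenvalues are rounded to three decimals.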
another problem was that the scree plot of this seven-factor model did not clearly show seven factors; instead, it showed only four. when a straight line was drawn with a ruler through the lower values of the plotted eigenvalues, only four factors lay above the line.

given the inconsistent outcomes of the above three approaches in determining the number of factors, and the difficulties in interpreting their results, we relied, like many other researchers, on rotation to improve the meaningfulness and interpretability of the factors generated. this study used orthogonal rotation carried out with the varimax approach. varimax maximizes the variance of the squared loadings within each factor.

the factor analysis output includes the component matrix and the rotated component matrix. in the component matrix, the loadings of every item on every factor are shown, without restricting them to high loadings only. consequently, it includes all item-to-factor correlations, which makes it overwhelming and rather difficult to interpret. in the rotated component matrix, only high loadings appear in each factor or component. at this point, a researcher can decide the value of factor loading allowed in the factor solution. in our seven-factor model, following sadtyadi and kartowagiran (2014, p. 295), we suppressed factor loadings with absolute values below 0.50 and retained those above 0.50. this helped to show only high factor loadings (> 0.5) for each item. as a result, the loadings that appeared were not overwhelming. pedhazur and schmelkin state that ideally, each item has a high and meaningful factor loading on one factor only and each factor has high or meaningful loadings for only some of the items (quoted in pett et al., 2003, pp. 132–133).

the output of the rotated solution clearly showed that items were grouped into seven components or factors.
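to make the rotation step more concrete, the sketch below shows one common way to carry out a varimax rotation with numpy, using the widely used iterative scheme based on singular value decomposition. it is an assumption-laden illustration, not the routine spss uses internally (spss, for example, applies kaiser normalization by default, which is omitted here); the function name `varimax` and the small `demo_loadings` matrix are hypothetical.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of an items x factors loading matrix.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)            # start from the unrotated solution
    objective = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # gradient-like matrix of the varimax criterion
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        target = L.T @ (rotated ** 3 - rotated @ col_ss / p)
        u, s, vt = np.linalg.svd(target)
        R = u @ vt           # closest orthogonal matrix
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break            # criterion stopped improving
        objective = new_objective
    return L @ R, R


# hypothetical unrotated loadings (6 items x 2 factors), for illustration only
demo_loadings = np.array([
    [0.70, 0.30], [0.65, 0.35], [0.60, 0.25],
    [0.55, -0.45], [0.50, -0.50], [0.45, -0.40],
])
rotated, R = varimax(demo_loadings)
print(np.round(rotated, 2))
```

because the rotation is orthogonal, each item's communality (the row sum of squared loadings) is unchanged; only the distribution of loadings across factors changes, which is what makes the rotated matrix easier to read.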
except for three items with factor loadings of less than 0.5, the items were distributed across the seven components or factors. there was no cross-loading item in this solution, but several problems occurred.

the first problem emerged because items carrying different conceptual meanings were grouped together, particularly in components 4 and 5. in other words, some items loaded on irrelevant factors and failed to load on conceptually appropriate factors. this was considered an indication of an incorrect factor structure (costello & osborne, 2005, p. 5).

the second problem was that two components, 6 and 7, were supported by only two items and one item, respectively. costello and osborne (2005, p. 5) state that ‘a factor with fewer than three items is generally weak and unstable’.

the unsatisfying seven-factor solution led us to seek another solution which was expected to be conceptually meaningful and easy to interpret. looking back at some indications in the seven-factor model, particularly in the rotated component matrix, only the first three components or factors were easy to interpret and turned out to be conceptually more appropriate. in addition, by using the 5% criterion of variance extracted proposed by hair et al. (pett et al., 2003, pp. 116–118), we found that only the first three factors met the criterion. this is a strong indication of the existence of a three-factor solution. we also linked this indication of three factors to the initial constructs in the physical abuse questionnaire and mapped the main issues in it. the instrument, in fact, contains three main issues i.e.
abuse committed by teachers against respondents, abuse committed by fellow students against respondents, and abuse committed by the respondents against their fellow students. based on these considerations, a factor analysis extracting three factors was conducted. the lowest factor loading allowed was set at ≥ 0.40 by suppressing factor loadings of less than 0.40. the results showed that each of the three factors met the 5% criterion (as proposed by hair et al. in the earlier discussion). the variance explained by the three factors, however, was only 36.643%, lower than the variance explained by the seven-factor model. to address this low explained variance, we tried to accommodate more items by lowering the minimum factor loading to 0.30. the resulting solution, however, became more difficult to interpret. the analysis, therefore, returned to the ≥ 0.40 threshold. with this threshold, four items had no loading displayed because their factor loadings were less than 0.40, and three items loaded on inappropriate factors (fewer than in the seven-factor model). these seven problematic items were then eliminated, so the number of items declined from 31 to 24.

after these problematic items were dropped, the sampling adequacy measured by kaiser-meyer-olkin (kmo) changed to 0.891. bartlett’s test was still significant (p = 0.000 < 0.05). the elimination of some items did not have a negative impact on the data as a whole because both the kmo value and the significance level indicated that the data were suitable for factor analysis. the decision to eliminate those problematic items, in fact, improved the factor structure, given that the variance explained increased from 36.643% to 41.428%.
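for reference, the kmo measure mentioned above compares the squared correlations among the items with their squared partial correlations: the closer the ratio is to 1, the more suitable the data are for factor analysis. the sketch below is a minimal pure-python illustration; the function names and the small correlation matrix are hypothetical and are not taken from the study's data.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix."""
    inverse = invert(corr)
    n = len(corr)
    sum_r2 = sum_partial2 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Partial correlation of items i and j given all other items.
                partial = -inverse[i][j] / math.sqrt(inverse[i][i] * inverse[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_partial2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_partial2)


# Hypothetical example: four items that all correlate 0.5 with one another.
example_corr = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
example_kmo = kmo(example_corr)  # 0.8 for this matrix, up to floating-point error
```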
another effect of eliminating the problematic items was that the number of factors with eigenvalues > 1 decreased from seven to five. although this decrease was accompanied by an increase in variance explained, the decision on the number of factors to retain rested mainly on the criterion that each factor accounts for no less than 5% of the variance, as proposed by hair et al. (pett et al., 2003, p. 116). in addition, the appearance of the scree plot and theoretical considerations of the original constructs contained in the questionnaire were also the basis for our decision.

with regard to the 5% criterion, the data in table 2 clearly show the formation of three factors: only the first three factors each account for more than 5% of the variance extracted. factor one accounts for 26.288% of the variance and has an eigenvalue of 6.309, factor two accounts for 8.016% of the variance and has an eigenvalue of 1.924, and factor three accounts for 7.123% of the variance and has an eigenvalue of 1.710. the fourth and fifth factors, although they have eigenvalues of 1.165 and 1.050 respectively, which are higher than 1, contribute only 4.853% and 4.374% to the explained variance, less than 5%. this leads to their exclusion from the factors retained. as a whole, the three factors account for 41.428% of the variance.

table 2. total variance explained in three-factor model

| component | initial eigenvalues: total | initial: % of variance | initial: cumulative % | extraction sums of squared loadings: total | extraction: % of variance | extraction: cumulative % | rotation sums of squared loadings: total | rotation: % of variance | rotation: cumulative % |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6.309 | 26.288 | 26.288 | 6.309 | 26.288 | 26.288 | 4.038 | 16.823 | 16.823 |
| 2 | 1.924 | 8.016 | 34.305 | 1.924 | 8.016 | 34.305 | 3.004 | 12.518 | 29.341 |
| 3 | 1.710 | 7.123 | 41.428 | 1.710 | 7.123 | 41.428 | 2.901 | 12.087 | 41.428 |
| 4 | 1.165 | 4.853 | 46.281 | | | | | | |
| 5 | 1.050 | 4.374 | 50.655 | | | | | | |
| 6 | .937 | 3.904 | 54.559 | | | | | | |
| 7 | .884 | 3.683 | 58.243 | | | | | | |
| 8 | .851 | 3.545 | 61.788 | | | | | | |
| 9 | .807 | 3.362 | 65.150 | | | | | | |
| 10 | .781 | 3.253 | 68.403 | | | | | | |
| 11 | .740 | 3.082 | 71.485 | | | | | | |
| 12 | .690 | 2.876 | 74.361 | | | | | | |
| 13 | .634 | 2.640 | 77.001 | | | | | | |
| 14 | .608 | 2.532 | 79.533 | | | | | | |
| 15 | .598 | 2.494 | 82.027 | | | | | | |
| 16 | .581 | 2.419 | 84.446 | | | | | | |
| 17 | .570 | 2.375 | 86.821 | | | | | | |
| 18 | .529 | 2.206 | 89.027 | | | | | | |
| 19 | .522 | 2.175 | 91.201 | | | | | | |
| 20 | .489 | 2.036 | 93.238 | | | | | | |
| 21 | .440 | 1.831 | 95.069 | | | | | | |
| 22 | .435 | 1.811 | 96.881 | | | | | | |
| 23 | .378 | 1.574 | 98.454 | | | | | | |
| 24 | .371 | 1.546 | 100.000 | | | | | | |

extraction method: principal component analysis.

another method used to help decide the number of factors to keep is the scree plot. although some methodologists criticize the use of the scree plot (fabrigar, wegener, maccallum, & strahan, 1999, pp. 278–279), it is one of the most widely used approaches in exploratory factor analysis. costello and osborne (2005, p. 3) even state that ‘the best choice for researchers is the scree test’. admittedly, one of the main problems with the scree plot is that its interpretation tends to be subjective. some researchers, however, provide guidelines. costello and osborne (2005, p. 3) assert that ‘the number of data points above the “break” (i.e., not including the point at which the break occurs) is usually the number of factors to retain’. pett et al. (2003, p. 119) advise ‘that point where the factors curve above the straight line drawn [with a ruler] through the smaller eigenvalues identifies the number of factors’. the scree plot presented in figure 2 is an output based on the data processed through the spss package.
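the two extraction criteria compared above can be reproduced directly from the eigenvalues in table 2. the sketch below is illustrative only; the variable and function names are hypothetical, and the percentages are recomputed from the rounded eigenvalues, so they can differ from the spss output in the third decimal place.

```python
# Initial eigenvalues of the 24 components, copied from table 2.
eigenvalues = [
    6.309, 1.924, 1.710, 1.165, 1.050, 0.937, 0.884, 0.851,
    0.807, 0.781, 0.740, 0.690, 0.634, 0.608, 0.598, 0.581,
    0.570, 0.529, 0.522, 0.489, 0.440, 0.435, 0.378, 0.371,
]


def percent_of_variance(values):
    """Share of total variance per component.

    In a principal component analysis of a correlation matrix, the total
    variance equals the number of items, so each share is eigenvalue / n.
    """
    n_items = len(values)
    return [100.0 * value / n_items for value in values]


percents = percent_of_variance(eigenvalues)

# Criterion 1: retain components with eigenvalues greater than 1.
kaiser_count = sum(1 for value in eigenvalues if value > 1.0)

# Criterion 2: retain components that each explain more than 5% of variance.
five_percent_count = sum(1 for share in percents if share > 5.0)

# Cumulative variance explained by the first three components.
cumulative_three = sum(percents[:3])

print(kaiser_count, five_percent_count, round(cumulative_three, 2))
```

with 24 items, the 5% rule is equivalent to requiring an eigenvalue above 1.2, which is why it is stricter than the eigenvalue > 1 rule here.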
following the aforementioned guidelines for interpreting the scree plot, figure 2 clearly presents three factors above the break, or above the straight line drawn horizontally from the lowest eigenvalue. in other words, the scree plot shows a result similar to that of the 5% criterion and is also consistent with the original constructs containing three main themes in the questionnaire of physical abuse. except for the criterion of eigenvalues > 1, all of these criteria confirm the three-factor solution in the factor extraction.

the decision to involve theoretical considerations or the original construct in determining the number of factors to retain followed the recommendations of nunnally and bernstein (1994), paraphrased by pett et al. (2003, p. 125) as follows:

how many factors should we extract... two... three... four? there is no easy solution to this decision. nunnally and bernstein (1994) caution the researcher against using rigid guidelines for determining the ultimate number of factors to extract. whatever solution we arrive at should not be solely based on statistical criteria; it also needs to make theoretical sense. the ultimate criteria for determining the number of factors are factor interpretability and usefulness both during the initial extraction procedures and after the factors have been rotated to achieve more clarity.

figure 2. scree plot of three-factor model

it was the aim of reaching factor interpretability and usefulness that led us to involve our original construct, besides statistical inputs, in determining the factors.
this aim also became the basis for repeatedly refining the solution and examining the results to find a more suitable solution which best explained the data and disclosed the structure of constructs behind the measured variables. the use of orthogonal rotation with varimax generated a factor loading matrix in which the items were grouped neatly under each factor. this is, therefore, more interpretable. there are two things worth noting here. first, the requirement of an adequate number of items loading on each factor is fulfilled. referring to views proposed by nunnally and bernstein (1994), pett et al. (2003, p. 125) write ‘if the extracted factors serve to describe characteristics that variables have in common, then, by definition, there need to be at least two items for each extracted factor’. further, costello and osborne (2005, p. 3) propose at least three items for each factor. second, the factor loadings of the items, ranging from 0.44 to 0.69, are good enough and even very good (comrey & lee, 2009, p. 243). the detailed factor structure and item loadings can be found in table 3.

there are some guidelines for interpreting the construct validity of this instrument based on the information presented in the factor structure matrix. some researchers employ a factor loading of ≥ 0.30 for each item (mccauley, ruderman, ohlott, & morrow, 1994, p. 548). comrey and lee (2009, p. 243) propose a value higher than 0.30, saying ‘whereas loadings of 0.30 and above have commonly been listed among those high enough to provide some interpretive value, such loadings certainly cannot be relied upon to provide a very good basis for factor interpretation’. in addition, some even use factor loading > 0.50 (kartowagiran & jaedun, 2016, p. 133; wijanto, 2008, p. 193). according to costello and osborne (2005, p. 3), the item loading table ‘has the best fit to data’ if item loadings are above 0.30, there are no or few item cross-loadings, and there are no factors with fewer than three items.
to meet those criteria, this study uses factor loading ≥ 0.40. in more detail, out of 24 valid items with factor loadings above 0.40, 14 are above 0.60, six are between 0.50 and 0.60, and the remaining four are between 0.40 and 0.50. there is no cross-loading item in the matrix, which means that each item loads on a single factor. in addition, there are more than three items loading on each factor. since all the requirements proposed by the above methodologists are fulfilled, it can be confidently affirmed that the construct validity of this instrument has been reached satisfactorily.

table 3. factor loading matrix of physical abuse experience among school students

| no | item wordings | factor 1 | factor 2 | factor 3 |
|---|---|---|---|---|
| fisik17 | has any student pushed your body or head harshly? | .690 | | |
| fisik11 | has any student hit you by using any blunt objects (examples: wood, rattan, or others)? | .688 | | |
| fisik16 | has any student tweaked your ears? | .662 | | |
| fisik13 | has any student thrown something solid at you (such as a book or other stuff)? | .646 | | |
| fisik10 | has any student slapped you? | .622 | | |
| fisik15 | has any student pulled your hair harshly? | .607 | | |
| fisik14 | has any student pinched you because he/she got angry with you? | .581 | | |
| fisik12 | has any student kicked you (not as a joke or in sport)? | .555 | | |
| fisik18 | has any student injured you? | .468 | | |
| fisik19 | has any student scratched you? | .441 | | |
| fisik30 | have you scratched other students? | | .665 | |
| fisik31 | have you bitten other students? | | .658 | |
| fisik26 | have you pulled another student’s hair harshly? | | .627 | |
| fisik27 | have you tweaked another student’s ears? | | .548 | |
| fisik25 | have you pinched other students because you were angry with them? | | .529 | |
| fisik28 | have you pushed another student’s body or head harshly? | | .528 | |
| fisik29 | have you injured other students? | | .499 | |
| fisik24 | have you thrown something solid at other students (such as a book or other stuff)? | | .452 | |
| fisik5 | has your teacher pinched you because he/she is angry? | | | .687 |
| fisik2 | has your teacher hit you by using any blunt objects (examples: wood, rattan, or others)? | | | .676 |
| fisik8 | has any teacher punished you by asking you to position your body in a way that was physically unpleasant for you? | | | .636 |
| fisik7 | has your teacher tweaked your ears? | | | .632 |
| fisik1 | has your teacher hit or slapped you? | | | .628 |
| fisik4 | have your teachers thrown something (such as a book or other stuff) at you? | | | .578 |

extraction method: principal component analysis. rotation method: varimax with kaiser normalization. a. rotation converged in 5 iterations.

usually, the factor’s name is drawn from the name of the item with the highest factor loading. in this study, however, naming is much easier since the items grouped in each factor share common themes. the ten items that load on factor 1 have one common theme despite their different contents. every item contains a specific act of abuse such as pushing, hitting, and ear tweaking, but all refer to the same topic, that is, abuse committed by fellow students. in factor 2, each of the eight items deals with specific content, but the main theme uniting them is that the perpetrators of the abuse are the respondents themselves, who abuse other students. following the same pattern of interpretation, the six items loaded on factor 3 share a common theme, apart from their differences, namely abuse committed by school teachers.
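the loading pattern in table 3 can be checked against the criteria cited earlier (a minimum loading of 0.40 and at least three items per factor) with a short script. this is a sketch for illustration only; the dictionary layout and variable names are hypothetical, while the loadings themselves are copied from table 3.

```python
# Rotated loadings from table 3, grouped by the factor each item loads on.
loadings = {
    "factor 1": [0.690, 0.688, 0.662, 0.646, 0.622, 0.607, 0.581, 0.555, 0.468, 0.441],
    "factor 2": [0.665, 0.658, 0.627, 0.548, 0.529, 0.528, 0.499, 0.452],
    "factor 3": [0.687, 0.676, 0.636, 0.632, 0.628, 0.578],
}

all_loadings = [value for values in loadings.values() for value in values]

# How many items fall into each loading band.
bands = {
    "0.60 and above": sum(1 for value in all_loadings if value >= 0.60),
    "0.50 to 0.59": sum(1 for value in all_loadings if 0.50 <= value < 0.60),
    "0.40 to 0.49": sum(1 for value in all_loadings if 0.40 <= value < 0.50),
}

# Every retained item should reach the 0.40 cutoff used in the study.
all_above_cutoff = all(value >= 0.40 for value in all_loadings)

# Each factor should be supported by at least three items.
items_per_factor = {name: len(values) for name, values in loadings.items()}
enough_items = all(count >= 3 for count in items_per_factor.values())

# Smallest and largest loading within each factor.
loading_ranges = {name: (min(values), max(values)) for name, values in loadings.items()}

print(bands, items_per_factor, loading_ranges)
```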
based on the mapping of the common themes reflected by the groups of items in each factor, it is reasonable to name the first factor, containing items on abuse by fellow students, victimized by friends; the second factor, covering items on abuse by respondents towards other students, victimizing friends; and the third factor, carrying items on abuse by teachers, being victimized by teachers. these names become the new identities of the factors, while the identity of each individual item is no longer important. these identities, according to kachigan, can be used to communicate with other people who are interested in using the instrument for their own research or in applying the results of the studies that have used the instrument (pett et al., 2003, p. 210).

reliability

besides instrument validity, instrument reliability is also important to estimate. the reliability test is part of instrument construction to make sure the instrument composed of the retained factors has good internal consistency. reliability indicates the extent to which an instrument is free from measurement error. in order to ensure the reliability of an instrument that has some subscales (factors), some methodologists and researchers emphasize estimating the coefficient alpha of each factor or subscale (amir, 2015, p. 227; pett et al., 2003, p. 188). other methodologists, however, recommend estimating the reliability of each subscale as well as the entire scale. parsian and am (2009, p.
5), referring to nunnally and bernstein (1994) and devon et al. (2007), state that ‘if an instrument contains two or more subscales, cronbach’s alpha should be computed for each subscale as well as the entire scale.’ for this reason, to estimate the reliability of the instrument on students’ experience of physical abuse in school, we first generated the coefficient alpha for all items of the three factors together (the entire scale) and then generated the coefficient alpha of each of the three derived factors independently. the lowest acceptable reliability coefficient used here is 0.65 (cohen & swerdlik, 2009, p. 151; nurmin & kartowagiran, 2013, p. 189).

the result of the reliability estimation shows that the reliability of the overall physical abuse scale (with the 24 items combined) is 0.874, which is satisfactory. the coefficient alpha would not be substantially affected by dropping any item: if any single item were deleted, the coefficient of the entire scale would remain higher than 0.80. coefficient alpha for factor one with all 10 items is 0.830. this is stable, since removing any item would not seriously affect the coefficient, which would remain above 0.80. factor two, with eight items, has a coefficient alpha of 0.735, and cronbach’s alpha of factor three is 0.766. in short, the reliability estimation shows that both the entire scale and each subscale of the instrument have a good reliability coefficient.

conclusion and suggestions

the exploration of the construct of physical abuse or violence against children in schools and the development of an instrument for measuring such abuse have revealed three factors behind the construct: (1) victimized by friends, (2) victimizing friends, and (3) being victimized by teachers. the factor loadings of the items grouped in the victimized by friends factor range from 0.44 to 0.69. the items in the victimizing friends factor have loadings ranging from 0.45 to 0.66.
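as a reminder of how the coefficients reported in the reliability section are obtained, cronbach's alpha and the "alpha if item deleted" check can be sketched as follows. the response data below are hypothetical and serve only to show the calculation; they are not the study's data, and the function names are illustrative.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(scores):
    """Alpha recomputed with each item removed in turn."""
    k = len(scores[0])
    return [cronbach_alpha([row[:i] + row[i + 1:] for row in scores])
            for i in range(k)]


# Hypothetical responses: four respondents answering three items.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 4],
]

alpha = cronbach_alpha(responses)
alpha_without_each_item = alpha_if_item_deleted(responses)

# The study treats coefficients of 0.65 and above as acceptable.
acceptable = alpha >= 0.65
```

the same function can be applied once to all items (the entire scale) and once to the items of each factor (the subscales), which mirrors the two-level estimation described in the reliability section.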
the items included in the factor of being victimized by teachers have factor loadings ranging from 0.57 to 0.68. all of these indicate that this instrument has good construct validity. the reliability of the instrument was estimated through cronbach’s alpha coefficient. it is categorized as good since the reliability coefficient of the first factor is 0.830, that of the second factor is 0.735, and that of the third factor is 0.766. the alpha coefficient of the entire instrument is 0.874. in short, the final result of this instrument development is an instrument for measuring students’ experience of physical abuse in schools, which consists of three factors with 24 items and has good validity and reliability.

researchers who are interested in studying students’ experience of physical abuse in schools can use this instrument. likewise, those who want to evaluate policies concerning child-friendly schools or any related policies on preventing physical abuse in schools can make use of it. furthermore, the instrument is also open to those who want to confirm it through further analysis using confirmatory factor analysis (cfa).

references

ajema, c., muraya, k., karuga, r., & kiruki, m. (2016). childhood experience of abuse in kajiado county-kenya. kenya: lvtc health.
amir, m. t. (2015). merancang kuesioner: konsep dan panduan untuk penelitian sikap, kepribadian, dan perilaku. jakarta: prenada media group.
choo, w.-y., dunne, m. p., marret, m. j., fleming, m., & wong, y.-l. (2011). victimization experiences of adolescents in malaysia. journal of adolescent health, 49(6), 627–634. https://doi.org/10.1016/j.jadohealth.2011.04.020
clark, r. e., clark, j. f., & adamec, c. a. (2007). the encyclopedia of child abuse (3rd ed.). new york, ny: facts on file library of health and living.
clemens, e. v., carey, j. c., & harrington, k. m. (2010). the school counseling program implementation survey: initial instrument development and exploratory factor analysis. professional school counseling, 14(2), 125–134.
cohen, r. j., & swerdlik, m. (2009). psychological testing and assessment: an introduction to test and measurement. new york, ny: mcgraw-hill.
cohn, a., & canter, a. (2003). bullying: facts for schools and parents. retrieved from http://www.naspcenter.org/factsheets/bullying_fs.html
comrey, a. l., & lee, h. b. (2009). a first course in factor analysis (2nd ed.). new york, ny: psychology press.
costello, a. b., & osborne, j. w. (2005). best practices in exploratory factor analysis: four recommendations for getting the most from your analysis. practical assessment, research & evaluation, 10(7), 1–9.
fabrigar, l. r., & wegener, d. t. (2012). exploratory factor analysis. oxford: oxford university press.
fabrigar, l. r., wegener, d. t., maccallum, r. c., & strahan, e. j. (1999). evaluating the use of exploratory factor analysis in psychological research. psychological methods, 4(3), 272–299.
hyman, i. a., & perone, d. c. (1998). the other side of school violence: educator policies and practices that may contribute to student misbehavior. journal of school psychology, 36(1), 7–27.
idris, f. (2015, december 30). 2015, tahun buram kekerasan anak. republika. retrieved from https://www.republika.co.id/berita/koran/opini-koran/15/12/30/o05vk812-2015-tahun-buramkekerasan-anak
kartowagiran, b. (2008). validasi dimensionalitas perangkat tes ujian akhir nasional smp mata pelajaran matematika 2003-2006. jurnal penelitian dan evaluasi pendidikan, 12(2), 177–195. https://doi.org/10.21831/pep.v12i2.1426
kartowagiran, b., & jaedun, a. (2016). model asesmen autentik untuk menilai hasil belajar siswa sekolah menengah pertama (smp): implementasi asesmen autentik di smp. jurnal penelitian dan evaluasi pendidikan, 20(2), 131–141. https://doi.org/10.21831/pep.v20i2.10063
kline, p. (2008). an easy guide to factor analysis. new york, ny: routledge.
law no. 35 of 2014 of republic of indonesia concerning amendments to law no. 23 of 2002 concerning child protection (2014).
mccauley, c. d., ruderman, m. n., ohlott, p. j., & morrow, j. e. (1994). assessing the developmental components of managerial jobs. journal of applied psychology, 79(4), 544–560. https://doi.org/10.1037/0021-9010.79.4.544
muthmainnah, m. (2014). membekali anak dengan keterampilan melindungi diri. jurnal pendidikan anak, 3(1). retrieved from https://journal.uny.ac.id/index.php/jpa/article/view/3053
nansel, t. r., overpeck, m., pilla, r. s., ruan, w. j., simons-morton, b., & scheidt, p. (2001). bullying behaviors among us youth: prevalence and association with psychosocial adjustment. jama, 285(16), 2094–2100.
nspcc. (2016). what children are telling us about bullying. child bullying report 2015/2016.
nurmin, n., & kartowagiran, b. (2013). evaluasi kemampuan guru dalam mengimplementasi pembelajaran tematik di sd kecamatan salahutu kabupaten maluku tengah. jurnal prima edukasia, 1(2), 184–194. https://doi.org/10.21831/jpe.v1i2.2635
parsian, n., & am, t. d. (2009). developing and validating a questionnaire to measure spirituality: a psychometric process. global journal of health science, 1(1), 2–11. https://doi.org/10.5539/gjhs.v1n1p2
pett, m. a., lackey, n. r., & sullivan, j. j. (2003). making sense of factor analysis: the use of factor analysis for instrument development in health care research. thousand oaks, ca: sage publications.
qodar, n. (2015, march 15). survei icrw: 84% anak indonesia alami kekerasan di sekolah. liputan6.com. retrieved from https://www.liputan6.com/news/read/2191106/survei-icrw-84-anak-indonesiaalami-kekerasan-di-sekolah
rivers, i., & smith, p. k. (1994). types of bullying behaviour and their correlates. aggressive behavior, 20(5), 359–368.
sadtyadi, h., & kartowagiran, b. (2014). pengembangan instrumen penilaian kinerja guru sekolah dasar berbasis tugas pokok dan fungsi. jurnal penelitian dan evaluasi pendidikan, 18(2), 290–304. https://doi.org/10.21831/pep.v18i2.2867
setyawan, d. (2015, june 14). kpai: pelaku kekerasan terhadap anak tiap tahun meningkat. komisi perlindungan anak indonesia (kpai). retrieved from http://www.kpai.go.id/berita/kpaipelaku-kekerasan-terhadap-anak-tiaptahun-meningkat
setyawan, d. (2017, november 23). kekerasan anak di sekolah semakin memprihatinkan. komisi perlindungan anak indonesia (kpai). retrieved from http://www.kpai.go.id/berita/kekerasan-anak-di-sekolah-semakinmemprihatinkan
simpson, s. b. (2015). bullying perceptions: understanding students with and without disabilities. retrieved from https://mds.marshall.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1974&context=etd
straus, m. a., hamby, s. l., boney-mccoy, s., & sugarman, d. b. (1996). the revised conflict tactics scales (cts2). journal of family issues, 17(3), 283–316. https://doi.org/10.1177/019251396017003001
straus, m. a., hamby, s. l., finkelhor, d., moore, d. w., & runyan, d. (1998). identification of child maltreatment with the parent-child conflict tactics scales: development and psychometric data for a national sample of american parents. child abuse & neglect, 22(4), 249–270. https://doi.org/10.1016/s0145-2134(97)00174-9
unicef. (2014a). measuring violence against children: inventory and assessment of quantitative studies. new york, ny: division of data, research and policy.
unicef. (2014b). violence against children in east asia and the pacific: a regional review and synthesis of findings. bangkok: unicef eapro.
united nations secretary-general’s study. (2006). violence against children in schools and educational settings. violence against children: united nations secretary-general’s study. geneva. retrieved from https://www.unicef.org/violencestudy/4. world report on violence against children.pdf
widdiharto, r., kartowagiran, b., & sugiman, s. (2017). a construct of the instrument for measuring junior high school mathematics teacher’s self-efficacy. research and evaluation in education, 3(1), 64–76. https://doi.org/10.21831/reid.v3i1.13559
wijanto, s. h. (2008). structural equation modelling (sem) dengan lisrel 8.8. yogyakarta: graha ilmu.

copyright © 2018, reid (research and evaluation in education) issn 2460-6995

reid (research and evaluation in education), 4(1), 2018, 79-93
available online at: http://journal.uny.ac.id/index.php/reid

continuing professional development (cpd) for junior high school mathematics teachers: an evaluation study

*1pika merliza; 2heri retnawati
1,2department of mathematics education, universitas negeri yogyakarta
jl. colombo no. 1, depok, sleman 55281, yogyakarta, indonesia
*corresponding author. e-mail: pikamerlizasoemali@gmail.com
submitted: 06 march 2018 | revised: 01 august 2018 | accepted: 15 november 2018

abstract

responding to the importance of evaluating continuing professional development programs for teachers, this study aims to describe the implementation and difficulty of continuing professional development (cpd) of junior high school (jhs) mathematics teachers in bandar lampung, indonesia. this research used a descriptive approach employing a quantitative-qualitative method with a sequential explanatory strategy. the population of the research was 181 junior high school mathematics teachers who have already become civil servants.
the samples were 63 teachers for the quantitative research, selected using stratified random sampling and proportional random sampling techniques, while eight teachers for the qualitative research were selected using a purposive sampling technique. these eight teachers were selected because they were the only teachers handling the cpd program. the data were collected through a test, questionnaires, a checklist sheet, document study, and interviews. data analysis was conducted using categorized performance trends, divided into five groups: very good/difficult, good/difficult, fair, poor/easy, and very poor/very easy. the data were analyzed using descriptive techniques; the quantitative analysis was performed using the mean and standard deviation, whereas the qualitative analysis was carried out through data reduction, data display, and conclusion drawing. the research results show that the majority of teachers’ cpd implementation is very poor, while the difficulty of engaging in cpd is categorized as fairly difficult.

keywords: junior high school mathematics teachers, continuing professional development (cpd), teaching experience

introduction

as a developing country, indonesia aspires to become a developed country which is independent, unified, sovereign, just, and prosperous (preamble of the 1945 constitution of the republic of indonesia). various attempts are made to realize the ideals of the nation, one of which is the effort to alleviate poverty and unemployment, which are conspicuous problems of developing countries. high levels of poverty and unemployment have a bad impact on various aspects of life, such as increasing violence, theft, robbery, depression, political instability, and many others. according to yacoub (2012, p. 178), if members of a community have jobs and earnings (instead of being unemployed), and the earnings are sufficient to meet the necessities of life, they are considered not poor.
it can be inferred that when unemployment is low, poverty is also low. one of the factors contributing to high unemployment is the low quality of human resources able to compete at both national and global levels. nationally and globally competitive human resources have opportunities to get jobs to fulfill their life necessities, which decreases the level of poverty of the nation. the low quality of the nation’s human resources is a product of the poor quality of education. this is due to the fact that education has a major influence on various aspects of sustainable development and is a supporting factor contributing to sustainability and peace, directly helping to decrease poverty and promoting health, gender equality, and a sustainable environment (unesco, 2014, p. 25). the importance of education to human life is one of the factors underlying unesco’s continued promotion of the idea of lifelong learning, which started in 1972 (tuijnman & boström, 2002, p. 95).

the enhancement of abilities, skills, and attitudes determines the quality of human resources of a nation. thus, it is the responsibility of every individual to become a lifelong learner, to keep developing themselves, and to continue enhancing their competence and expertise along with the development of science and technology. this responsibility applies to everyone in a profession, including teachers. teachers are required to conduct professional development throughout their career in relation to their role and responsibility (gray, 2005, p. 5).

based on research related to the implementation of cpd for teachers, nuraeni and retnawati (2016, pp. 137–138) reveal that the effectiveness of the subject-matter teachers forum (mgmp) still belongs to the low category.
nuraeni and retnawati (2016) suggest that the mgmp activities established in each regency and province have allegedly not functioned optimally in facilitating mathematics teachers’ self-development. furthermore, with regard to the performance of post-certification teachers, the facts show that not all certified teachers in indonesia have good competence and performance (world bank in jalal et al., 2009, p. 7). this is in line with research conducted by abubakar (2015, p. 116) on the impact of certification on madrasah aliyah teachers’ competence in kendari, south east sulawesi, indonesia, which states that teacher certification has not had a positive impact on competence improvement, either in the teachers’ subject area or educational unit. this is supported by the findings of kardiyem (2013, p. 17) that the overall performance of certified teachers in senior vocational schools in grobogan regency, central java, indonesia is in the ‘not good’ category. various obstacles are faced by the teachers, including low achievement motivation, limited time, lack of knowledge, and perceptions of government regulations. in addition, the teachers’ level of competence and skills before and after certification remains the same. the teachers make little effort to improve their competence and tend to perform the same as before getting the certificate. as reported by nuraeni and retnawati (2016, p. 130) in their research on teacher performance in professional development in wonosobo regency, the teachers are still categorized as very poor in professional development, and certified teachers have little awareness of their professional development. further, fahmi, maulana, and yusuf (2011, p. 15) emphasize that teacher certification was expected to improve teachers’ quality; in fact, however, it does not contribute positively to the improvement of the students’ learning process. this condition is contradictory to law no.
14 of 2005 of the republic of indonesia on teachers and lecturers, which states that in performing professional duties, teachers are obligated to improve and develop their academic qualifications and competencies in a sustainable manner in line with the development of science and technology. all teachers must be professional in their work; they must master the competencies needed to achieve educational aims. competence is an important component supporting the performance of teachers in carrying out their duties and roles. according to hamilton-ekeke (2013, p. 15), teacher competence is the ability of a teacher to help learners reach higher levels of learning. competence is required for teachers to carry out their professional responsibilities; hence, the effectiveness of the teacher’s role as a learning agent depends on the teacher’s level of competence. the teacher’s level of competence is related to professional and pedagogical knowledge. in indonesia, the teachers’ professional competency standard consists of professional, pedagogic, social, and personality competence (regulation of the minister of national education no. 19 of 2005 on national education standard). the four competencies must be mastered by all professional teachers in indonesia, including mathematics teachers. mathematics is one of the important subjects that equip learners to face a fully competitive life. alnoor and yuanxiang (2000, p. 1) explain that mathematics is a necessary tool in the field of science and technology, since it aims not only to teach arithmetic but also to provide opportunities for learners to become scientists exploring concepts related to everyday life.
the goals of mathematics education, which requires logical, analytical, systematic, critical, and creative thinking as well as the ability to cooperate (regulation of the minister of education and culture no. 20 of 2016 on the competence standard of primary and secondary education graduates), are very useful in preparing a highly competitive generation. this means that high-quality mathematics learning requires competent, qualified teachers. it is essential for teachers to continue improving their competence to support their careers. this is a demand not only for in-service mathematics teachers but also for pre-service teachers. according to the world bank (2010, p. 18), ‘mathematics teachers have scored poorly on competency exams, raising concerns about the quality of their instruction’. further, teacher competency test results indicate a significant impact on learning practices, as seen from the learners’ achievement (ünal, demir, & kiliç, 2011, p. 3252). for example, the national examination (ujian nasional, un) results of junior high schools in bandar lampung in the 2015/2016 academic year, released by the national education standards board (badan standar nasional pendidikan, 2015), show that the average mathematics score is 53.99. this score is lower than those of other subjects such as indonesian, english, and science, which are 71.84, 64.28, and 63.66, respectively. the low average score in mathematics is assumed to result from non-routine, increasingly challenging questions that students cannot solve through calculation alone, because they also require higher order thinking skills involving reasoning, analyzing, synthesizing, and evaluating. in this situation, junior high school mathematics teachers have major tasks and roles in establishing the learning situation.
teachers are the determining factor in students' learning experiences in the classroom and prepare students to have higher order thinking skills (unicef, 2007, p. 93). furthermore, teachers are the central actors in significantly improving their own competence. teachers' knowledge and skills are factors that influence the success of classroom learning. amid rapid technological advances, and given the demand for a high standard of education, teachers as professionals must continue learning (in-service learning) in order to improve their competence. continuing professional development (cpd) is continuous learning for teachers, who are the main vehicle in the effort to bring about the desired changes related to the success of the learners (ministry of national education of republic of indonesia, 2010, p. 9). in teaching, the form of development that can be undertaken is in-service training. the content of cpd for mathematics teachers is believed to revitalize the teachers' skills in designing the teaching and learning process, to increase their enthusiasm for instruction, and to help maintain their scientific knowledge (joubert, back, de geest, hirst, & sutherland, 2010, p. 1765). the concept of cpd for teachers has been implemented in many countries around the world. in finland, the concept of cpd for teachers is based on the idea of ‘lifelong learning’. the government has applied the concept of cpd from the education of prospective teachers at university level onward. under the cpd agenda, teachers should take part in inter-school teacher training for seven days a year. in addition, as another form of cpd implementation, the finnish government requires all teachers to pursue higher education and attain at least a master degree as the minimum standard of teachers’ education level (layne, 2016, p. 9).
in the uk, the types of cpd include: (1) workshops held at school or outside the school, (2) certified courses, (3) courses held at the university, (4) teacher collaboration activities, (5) conferences, (6) guiding, training, or observing fellow teachers in teaching (lesson study), (7) joining committees, (8) teacher learning communities, and (9) self-learning (opfer & pedder, 2010, p. 241). in indonesia, cpd for teachers is set forth in the regulation of the minister for the state apparatus empowerment and bureaucracy reform no. 16 of 2009, enforced in 2013, which requires teachers who hold an educator certificate to implement cpd and earn the corresponding credit numbers. teachers need the credit numbers to achieve a higher rank, which automatically affects their salary. cpd in indonesia consists of: (1) self-development, (2) scientific publications, and (3) innovative works. cpd activities help teachers improve their knowledge, skills, and competencies according to their professional standards. adey et al. (as cited in wermke, 2011, p. 669) state that three important facts form the basis of cpd for teachers: (1) cpd activities improve teachers’ understanding of learning and their beliefs; (2) cpd for teachers influences classroom instructional practice in the form of feedback on cpd activities; (3) cpd for teachers is based on intuitive knowledge that influences the teachers’ attitudes and intuition, which are shaped in part by the training process. thus, cpd can benefit teachers' beliefs as well as classroom instructional practices. in addition, desimone (2009, pp.
183–184) states that the benefits of professional development for teachers are: (1) cpd can improve the teachers' knowledge and skills and/or change their attitudes and beliefs; (2) these knowledge, skills, attitudes, and beliefs help improve the content knowledge that affects pedagogical knowledge in teaching practice; (3) cpd can change the way of teaching in ways that have a positive impact on the learners' learning outcomes. however, research on cpd activities in some areas of indonesia shows that the implementation of cpd is still not encouraging. based on the results of nuraeni and retnawati (2016, pp. 137–138), the professional development of mathematics teachers of vocational senior high schools in wonosobo regency is in the ‘poor’ category. this is in line with the finding of kartowagiran (2011, p. 463) that post-certification teachers’ performance in professional development is still unsatisfactory. moreover, research conducted by noorjannah (2015, p. 107) found fraud by teachers in cpd, especially in paper writing: 70% of the teachers employ writing services for action research, for promotion, or for certain activities such as certification. according to aina, bambang, retni, afreni, and sadikin (2015, p. 31), the number of scientific papers published by teachers is low because of their difficulties in writing. the difficulties are related to teaching hours, writing ethics and techniques, and a lack of experience in expressing ideas in writing. according to supriyanto (2015, p. 111), teachers basically do not produce scientific writing for cpd on a regular basis, and the policy of requiring scientific papers for promotion has not been received positively.
furthermore, based on the research findings of wibowo and jailani (2014), the cpd of junior high school mathematics teachers in wonosobo is categorized as ‘medium’, ‘low’, and ‘very low’ in terms of understanding, implementation, and difficulties, respectively. these categorizations may be caused by the difficulties teachers face in implementing cpd. qablan, mansour, alshamrani, and aldahmash (2015, p. 627) mention the obstacles to implementing cpd, including excessive workload, teaching hours, location of the activities, time of teaching, personal circumstances, funding, and the activities themselves. furthermore, based on data gained from interviews with junior high school mathematics teachers in bandar lampung regarding cpd activities, it is found that teachers do not focus on their duty to teach mathematics in class. this means that teachers have tried to conduct cpd but encounter various difficulties. thus, this research aims to find out how cpd is implemented and what difficulties junior high school mathematics teachers in bandar lampung face in conducting cpd.

method

this research used a descriptive approach employing a quantitative-qualitative (mixed) method. the study was conducted at junior high schools in bandar lampung, indonesia. data collection was conducted from february to march 2017 through direct meetings between the researchers and the respondents at their respective schools and at the mathematics subject-matter teachers forum (musyawarah guru mata pelajaran, mgmp) for junior high school (jhs) teachers in bandar lampung. the population in this research was 181 mathematics teachers of junior high schools in bandar lampung in the academic year of 2016/2017.
samples were identified using a stratified random sampling procedure based on the teachers’ teaching experience and then selected through a proportional random sampling technique. the sample for the quantitative study consisted of 63 mathematics teachers, whereas for the qualitative study, eight respondents were chosen using a purposive sampling technique.

research procedure

the research used a mixed methods research design in the form of a sequential explanatory design, in which the quantitative and qualitative data were not collected at the same time. the researchers first collected and analyzed the quantitative data; the results were then used as the basis for collecting and analyzing the qualitative data. the flow is presented in figure 1.

figure 1. research setting of sequential explanatory design (creswell & clark, 2011, p. 68)

the results of the analysis from both studies were combined and compared to determine which qualitative data confirmed, expanded, or even contradicted the results of the quantitative data. furthermore, the results of both data sets were presented in a table to draw conclusions.

data, instruments, and data collection techniques

the instruments employed in this study consist of a checklist and a document study, which were used to measure the implementation of cpd for teachers. the checklist consists of: (1) 27 items on cpd implementation in the self-development sub-section; (2) 23 items on cpd implementation in the scientific publication sub-section; and (3) 21 items on cpd implementation in the innovative works sub-section. the document study was developed from the checklist to check the physical evidence of involvement in cpd activities in each aspect. furthermore, the difficulties aspect was measured with a 15-item questionnaire using a semantic differential scale with intervals of 1-7.
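the proportional allocation step of the sampling described above (63 teachers drawn from a population of 181, stratified by teaching experience) can be sketched as follows. the source does not report the strata or their sizes, so the three strata below are hypothetical placeholders, and the function name `proportional_allocation` and the largest-remainder rounding rule are assumptions made only for illustration.

```python
from math import floor

def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Rounding uses the largest-remainder rule so the parts add up
    exactly to sample_size (an assumption, not the authors' rule).
    """
    total = sum(strata_sizes.values())
    exact = {k: sample_size * v / total for k, v in strata_sizes.items()}
    alloc = {k: floor(q) for k, q in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Hand the remaining seats to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc

# Hypothetical strata sizes that sum to the reported population of 181.
strata = {"under 10 years": 60, "10 to 20 years": 71, "over 20 years": 50}
allocation = proportional_allocation(strata, 63)
print(allocation)
```

within each stratum, the allocated number of teachers would then be drawn at random.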
the items offered choices ranging from very easy to very difficult and were complemented by 15 open-ended questions related to the efforts made to overcome the difficulties. in addition, an interview guide was used to measure all aspects of cpd. the validity of the instruments took the form of content validity (face and logical validity) and was assessed through the judgment of two experts. the reliability of the difficulty instrument was 0.964 and the standard error of measurement (sem) was 15.263, which means that it was in the very good reliability category.

data analysis technique

the quantitative data analysis was presented in tables based on the tendency of the respondents’ answers toward one of the criteria in each sub-variable. in categorizing the results of the quantitative descriptive analysis, the researchers employed the categorization adapted from widoyoko (2013, p. 238), which is presented in table 1.

table 1. assessment categories

formula                                          category
$X > M_i + 1.8\,SB_i$                            very good
$M_i + 0.6\,SB_i < X \le M_i + 1.8\,SB_i$        good
$M_i - 0.6\,SB_i < X \le M_i + 0.6\,SB_i$        fair
$M_i - 1.8\,SB_i < X \le M_i - 0.6\,SB_i$        poor
$X \le M_i - 1.8\,SB_i$                          very poor

note: $M_i$ = ideal mean score; $SB_i$ = ideal standard deviation; $X$ = score of the respondents or actual score

qualitative descriptive analysis technique

the qualitative data obtained through interviews and document study were analyzed using the interactive model, following this flow: data reduction, data display, conclusion drawing, and verification (miles, huberman, & saldaña, 2014, pp. 12–13). the data analysis techniques are presented in figure 2.

figure 2.
interactive model analysis scheme

at the data reduction stage, the activities were simplified and the key points were selected to create a core summary without changing the message. furthermore, at the data display stage, the processed data were briefly displayed in a table to make the units related to the steps of the research easy to understand. the last stage is the conclusion stage, in which the researchers drew conclusions based on the tendency of the same or similar data categories. the conclusions at this stage are provisional, so the researchers needed to verify them against the data categorization and the quantitative data to make them more credible.

findings and discussion

findings

implementation of cpd to mathematics teachers of junior high school

based on the results of the cpd checklist sheet for mathematics teachers, which covers the self-development, scientific publication, and innovative works aspects, the implementation of cpd activities has an actual mean score (x) of 48.42 (very poor category), an ideal mean score (mi) of 162.5, and an ideal standard deviation of 54.17, with a maximum score of 325 and a minimum score of 65, as presented in table 2.

table 2. score of cpd implementation to mathematics teacher of junior high school

category       total   percentage
very good      0       0%
good           0       0%
fairly good    0       0%
less good      3       5%
very poor      60      95%
total          63      100%

table 2 shows that the performance of 95% of junior high school mathematics teachers in bandar lampung in implementing cpd activities is in the ‘very poor’ category, meaning that the majority of the teachers are still little involved in cpd activities, either individually or within the mathematics teachers’ learning community/institutions, in both formal and non-formal learning. furthermore, the overall results of the implementation of cpd activities are obtained based on the details of each aspect.
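a minimal sketch of how the table 1 cut-offs assign a category, applied to the overall figures reported above (x = 48.42, mi = 162.5, ideal standard deviation = 54.17). it assumes the two lower cut-offs are mi minus 0.6 and mi minus 1.8 ideal standard deviations; the function name `classify` is hypothetical and not taken from the source.

```python
def classify(x, mi, sbi):
    """Return the table 1 category for an actual score x.

    mi is the ideal mean score and sbi the ideal standard deviation.
    The checks run from the highest cut-off down, so each branch
    implicitly carries the upper bound of its interval.
    """
    if x > mi + 1.8 * sbi:
        return "very good"
    if x > mi + 0.6 * sbi:
        return "good"
    if x > mi - 0.6 * sbi:
        return "fair"
    if x > mi - 1.8 * sbi:
        return "poor"
    return "very poor"

# Overall cpd implementation score reported in the findings.
overall = classify(48.42, 162.5, 54.17)
print(overall)  # very poor
```

with these inputs the lowest cut-off is 162.5 - 1.8 × 54.17 ≈ 64.99, so a score of 48.42 falls in the very poor category, which matches the category reported in the text.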
in the self-development aspect, the actual score (x) is 33.32 (very poor category), the ideal mean score (mi) is 81, and the ideal standard deviation is 18.00, with a maximum score of 135 and a minimum score of 27. in the scientific publication aspect, the actual score (x) is 6.93 (very poor category), the ideal mean score (mi) is 69, and the ideal standard deviation is 15.33, with a maximum score of 115 and a minimum score of 23. in the innovative works aspect, the actual score (x) is 8.68 (very poor category), the ideal mean score (mi) is 45, and the ideal standard deviation is 10.00, with a maximum score of 75 and a minimum score of 15. the detailed scores for each aspect are presented in table 3.

difficulties of cpd implementation to jhs mathematics teachers

based on the teachers’ responses to the questionnaire, the actual score (x) is 505.39 (quite difficult category), the ideal mean score (mi) is 452, and the ideal standard deviation is 113.00, with maximum score of 113 and minimum score of 452. to make it clear, the difficulty criteria are presented in table 4. based on table 4, 70% of jhs mathematics teachers have difficulties in conducting cpd activities, while the remaining 30% have no difficulty being involved in cpd. this means that more than half of the respondents have difficulties in actively participating in the activities in all aspects. in more detail, 8% of the teachers are in the difficult category, 62% are in the quite difficult category, and 30% have no difficulty conducting cpd. the overall data on the difficulty categories for each cpd aspect are presented in table 5. based on table 5, less than 30% of the teachers stated that they have no difficulties in engaging in each aspect of cpd.
in the aspect of self-development, most of the respondents (54%) are in the fair category, whereas in the aspect of scientific publication, they are mostly in the difficult category (52%), and in the aspect of innovative works, they are mostly in the very difficult category (40%). meanwhile, a comparison of the number of teachers who have difficulties in conducting cpd is presented in figure 3.

table 3. cpd implementation score of self-development, scientific publications, and innovative works

category     self-development   scientific publications   innovative works
             total   (%)        total   (%)               total   (%)
very good    0       0%         0       0%                0       0%
good         0       0%         0       0%                0       0%
fair         0       0%         0       0%                0       0%
less good    2       3%         0       0%                0       0%
very poor    61      97%        63      100%              63      100%
total        63      100%       63      100%              63      100%

table 4. score of cpd difficulties of mathematics teachers

category         total   percentage
very difficult   0       0%
difficult        5       8%
fair             39      62%
easy             19      30%
very easy        0       0%
total            63      100%

table 5. score of difficulties of cpd implementation on self-development, scientific publishing and innovative works

category         self-development   scientific publications   innovative works
                 total   (%)        total   (%)               total   (%)
very difficult   8       13%        0       0%                25      40%
difficult        8       13%        33      52%               21      33%
fair             34      54%        27      43%               15      24%
easy             13      20%        3       5%                2       3%
very easy        0       0%         0       0%                0       0%
total            63      100%       63      100%              63      100%

figure 3. diagram of comparison of questionnaire results of cpd implementation to mathematics teachers in bandar lampung (bar chart; vertical axis from 0 to 35; categories: very difficult, difficult, fair, easy, very easy; series: self-development, scientific publication)

based on figure 3, for the majority of mathematics teachers in bandar lampung, the difficulty of implementing cpd in scientific publication is in the difficult category. besides, the difficulty faced by mathematics teachers in self-development is in the fair category.
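the ideal mean scores and ideal standard deviations reported for the three cpd aspects are consistent with a common convention in which the ideal mean is the midpoint of the score range and the ideal standard deviation is one sixth of that range. the source does not state this formula, so the sketch below treats it as an assumption; the function name `ideal_mean_sd` is hypothetical. it only reproduces the aspect-level figures reported in the findings.

```python
def ideal_mean_sd(max_score, min_score):
    """Ideal mean and ideal standard deviation from the score range.

    Assumed convention (not stated in the source):
    mi = (max + min) / 2 and sbi = (max - min) / 6.
    """
    mi = (max_score + min_score) / 2
    sbi = (max_score - min_score) / 6
    return mi, sbi

# (maximum score, minimum score) as reported for each aspect.
aspects = {
    "self-development": (135, 27),
    "scientific publication": (115, 23),
    "innovative works": (75, 15),
}

for name, (highest, lowest) in aspects.items():
    mi, sbi = ideal_mean_sd(highest, lowest)
    print(f"{name}: mi = {mi:.2f}, sbi = {sbi:.2f}")
```

under this assumption the three aspects give (81, 18.00), (69, 15.33), and (45, 10.00), which are the values quoted in the text.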
for each aspect of the activities, the following information was obtained on the difficulties of jhs mathematics teachers in bandar lampung:
(1) in self-development, teachers’ participation in functional training and collective activities is constrained by the schedule and location of the activities and by the limited number of participants, while the difficulties in attending a master program are related to information, family permission, and funding;
(2) difficulties in conducting action research (publications): some teachers have difficulties in identifying the problem, preparing good action research, elaborating research procedures, finding relevant studies, and finding expert guidance;
(3) difficulties in making the classroom action research report: most teachers have difficulties in preparing the stages of the report, namely writing the background, identifying the research problem, engaging with the research objectives, collecting relevant studies, getting expert guidance, determining the target achievement of the classroom action research, writing the findings and discussion, drawing conclusions, and finding motivation to write;
(4) difficulties in creating popular scientific papers: some teachers complain about the difficulty of identifying ideas/issues, composing good sentences, finding appropriate references, and finding collaborative teacher colleagues, expert guidance, and motivation;
(5) difficulties in writing educational books (including textbooks, translated books, teacher manuals, enrichment books, and modules/diktat (handouts)): most teachers have difficulty finding relevant references, expert guidance, funding, and motivation, as well as fulfilling their teaching responsibilities and credit report;
(6) difficulties in writing journal articles and publishing books: most teachers emphasize that they do not yet know the benefits of publication, so they lack the motivation to do it.
furthermore, in the aspect of innovative works, the difficulties include: (1) difficulties in developing/creating/modifying media and teaching aids and in developing visual aids with information technology (it), which are related to teaching hours and to the lack of it skills and motivation; and (2) difficulties in taking part in the activities of creating instructional guidance/directions, questions, and standard guidelines at the regency/municipality/province level, where most teachers have difficulty participating because only some teachers are given the opportunity to attend.

document analysis and interview results

document analysis was conducted on 54 of the research samples, while interviews were conducted with eight respondents from the research samples. the document analysis was conducted by checking the physical evidence of teacher involvement in each aspect (self-development activities, scientific publications, and innovative works), such as scientific works, master degree certificates, certificates of participation in scientific activities (training, seminars, workshops, the subject-matter teachers forum (musyawarah guru mata pelajaran – mgmp), and/or courses), action research reports, the created/developed/modified mathematics instructional media, and other appropriate physical evidence from the last five years. the results of the document analysis indicated that, in self-development, some teachers have been upgrading their educational qualifications to master level. meanwhile, in terms of activity certificates, most teachers have not been involved in mgmp activities in the past two years.
regarding activity certificates, some respondents are not much concerned about certificates of their involvement in the activities; they are more concerned with the knowledge gained from the activities. in addition, the document study on scientific publications reveals that most of the respondents have taken part in writing classroom action research reports, but no scientific articles or books were found. furthermore, in the aspect of innovative works, it was found that most teachers have created/modified simple instructional media needed for the sub-materials of mathematics learning. moreover, some respondents took part in the activities of creating guidance/directions and mathematics question sheets at the regency/municipality level. based on the results of the interviews with eight respondents, it is known that the respondents have taken part in various forms of cpd activities. in fact, most teachers have been actively involved in the regular monthly mgmp activities, although some teachers are still less actively involved in the forum. the scheduling of cpd activities at the same time as their teaching is one of the difficulties the teachers often complain about. another activity the teachers take part in is the national instruction program for teachers held by the center for development and empowerment of mathematics teachers and education personnel (pusat pengembangan dan pemberdayaan pendidik dan tenaga kependidikan matematika – p4tk matematika). it is held based on the results of the teacher competence test (uji kompetensi guru – ukg) (pppptk matematika, 2016), after which some teachers are appointed as national instructors. in addition, some teachers have been actively searching for information on teacher scholarships as well as upgrading their qualifications to master degree level with their own funds.
in the aspect of scientific publications, most teachers conduct writing activities based on the results of action research, which is the prerequisite for obtaining the credit numbers needed for promotion to a higher career level. most of the respondents understand the stages of action research and how they are reported, although they do not know whether their reports are correct. in addition, teachers point to the importance of expert guidance to guide or check their writing. then, regarding writing activities, some teachers are known to have often written simple works such as action research proposals, but the proposals have not yet been completed because of some constraints, namely their teaching schedule, the absence of a guiding expert, the absence of positive support from their working environment, and a lack of motivation to write. further, regarding action research activities, some teachers state the importance of collaboration, in terms of both implementation and reporting. furthermore, in creating innovative works, the difficulties the teachers face in developing teaching media are the lack of motivation and it skills, their teaching schedule, and the absence of colleagues to collaborate with in carrying out the activities. meanwhile, the difficulty in taking part in the activity of creating mathematics instructional guidance and question sheets is that there is no offer or opportunity to do so. the interview results are presented in table 6.

table 6. data analysis of interview results

aspect: cpd implementation
reduction and presentation results: most respondents conduct cpd only in the monthly mgmp activity. some respondents have pursued, are pursuing, or will pursue a master program, either by trying to find scholarships or with personal funding. some respondents often try to write scientific papers related to action research, but the papers have not yet been finished. no teacher has yet been found who has published popular scientific writings or books related to mathematics learning. some respondents became active in conducting cpd after the announcement of the ukg results.
conclusion (interpretation): the implementation of cpd is still very limited. most respondents rely on the activities organized monthly by the mgmp. furthermore, in terms of scientific publication, most teachers have not started writing scientific papers, except for action research reports. meanwhile, in innovative works, teachers have started to create simple innovative works. the implementation of teachers’ cpd in the self-development aspect is better than in the other aspects.

aspect: cpd difficulties
reduction and presentation results: some respondents are constrained from participating in self-development activities by the teaching schedule, lack of information, and the limited number of participants. some respondents have difficulty related to the duration of using computers in writing and in creating it-based teaching media. most of the respondents have difficulties writing scientific papers related to action research because of family responsibilities, as well as the absence of expert guidance and feedback on the writing results. some respondents need fellow teachers to collaborate with on the creation of scientific publications and innovative works.
conclusion (interpretation): the difficulties that hinder teachers in implementing cpd are as follows: (1) in the aspect of self-development: the teachers’ teaching schedule, the lack of information on cpd activities, and the limited number of participants admitted; (2) in the aspects of scientific publications and innovative works: problems related to the lack of time, motivation, expert guidance, and feedback/response on the results of their writing/works.
discussion

based on the research findings, the implementation of cpd for mathematics teachers in bandar lampung is categorized as very poor in the aspects of self-development, scientific publications, and innovative works. these findings are supported by the results of the document study and interviews, which make it clear that most teachers are more active in self-development activities than in scientific publications and innovative works. the findings are in accordance with the finding of wibowo and jailani (2014, p. 209) that only a few mathematics teachers have engaged in cpd. further, regarding the details of cpd implementation in each aspect, kasmayadi (2016, p. ii) states that many mathematics teachers have been involved in cpd activities in the self-development aspect, but they are still poor in the scientific publications and innovative works aspects. thus, it must be realized that mathematics teachers need to continuously improve their competence through self-development activities, scientific publications, and innovative works, which affect the mathematics learning process and the achievement of the learners (badri, alnuaimi, mohaidat, yang, & al rashedi, 2016, p. 1; powell, terrell, furey, & scott-evans, 2003, p. 389; ünal et al., 2011, p. 3252). another thing to consider about scientific publications and innovative works as aspects of cpd is that they require a special assessment team which will follow up or give feedback to teachers by assessing and reviewing their works, especially those related to the results of action research. teachers need feedback and review in professional development activities, especially in action research (chval, abell, pareja, musikul, & ritzka, 2008, p. i; kaur, bhardwaj, & wong, 2017, p.
172) because both feedback and review are important factors in encouraging development (new trier township high school, 2012, p. 4). in the scientific publication aspect, teachers’ cpd activities are dominated by writing classroom action research reports, because submitting such a report is a requirement for gaining the credit points needed for promotion as a civil servant. meanwhile, other activities such as being a speaker in a seminar/conference are conducted less often. only some teachers have become speakers, for example in a national instruction program provided by p4tk mathematics. in addition, no teachers were found to be writing and publishing popular articles, textbooks, modules/dictates, or translated books.
furthermore, based on the difficulties faced by junior high school mathematics teachers in bandar lampung, the teachers fall into the fairly difficult category in the aspects of self-development and scientific publications, and into the difficult category in the aspect of innovative works. looking at the difficulties of cpd implementation across all aspects, it is clear that in the self-development aspect, i.e. participation in functional training and collective activities or joining domestic and overseas scholarship programs for teachers, teachers do not face difficulties related to support from the school principal, especially for functional training and collective activities. this is supported by the interviews, in which all respondents stated that their principals fully support them in conducting cpd, especially in being involved in the monthly mgmp activity. the schools even schedule no teaching hours (a day off from teaching) on the regular day of the mgmp activity. furthermore, the difficulties in pursuing a higher degree through a master program are related to the teachers’ teaching schedule, minimal family support, and funding.
moreover, the difficulties that teachers face in the scientific publications aspect are: (1) the difficulty of becoming speakers in scientific forums; some teachers’ difficulties are related to the schedule and location of the agenda, and the lack of ideas and motivation; (2) difficulties in conducting action research; some teachers find it fairly difficult to identify the problem, prepare the action research, arrange the research procedures, and find relevant studies, and they find it very difficult to obtain experts’ guidance; (4) difficulties in writing popular scientific papers; some teachers complain about the difficulty of identifying ideas/issues, composing good sentences, and finding appropriate references, collaborative teacher colleagues, expert guidance, as well as motivation; (5) difficulties in writing educational books (which include textbooks, translation books, enrichment books, teacher manuals, modules/dictates); most teachers have difficulty finding relevant references, expert guidance, funds, and motivation, and fulfilling their teaching responsibilities and credit report at the same time; and (6) difficulties in writing journal articles and publishing books; most of these difficulties stem from teachers not yet knowing the benefits of publication, so they have no motivation to do it.
in scientific publication activities, the most common reasons for the difficulty are motivation, expert guidance, and collaboration. based on the interview results, respondents state that some teachers often join training on writing scientific works, but they still lack writing practice. some of the interview respondents emphasized that they need collaboration with colleagues and expert guidance to conduct action research and write the report.
this result is supported by the finding that some teachers are not yet proficient in, or do not yet know, the writing procedures, such as how to write popular scientific articles and good textbooks. further, the respondents hope that there will be training to improve these skills. this situation is in line with the findings of research conducted by kasmayadi (2016, p. 176), which suggests that certain training, such as functional training, is needed to facilitate cpd and other related topics.
furthermore, in terms of the innovative works aspect, the difficulties faced by the teachers include: (1) difficulties in developing/creating/modifying learning media and teaching aids with it, due to their full teaching hours, so that they have no opportunities to explore the possibility of developing media and kits with or without it. some teachers have actually tried to create simple media commonly used in learning mathematics; (2) difficulties in attending the regency/municipality/province/national activity on creating mathematics learning guidance and question sheets; the difficulties are related to the opportunity to participate in such activities because only some teachers can participate in them. furthermore, in relation to innovative works, most teachers are ‘users’: they prefer the commonly used mathematics materials that are already available in their working environment. the main factors are limited time and funds (wibowo & jailani, 2014, p. 209).
teachers’ responses in the interviews are similar to the finding of research conducted by kasmayadi (2016, pp. 175–176), which reports that some teachers state that they have difficulty implementing cpd.
in the self-development aspect, the reasons are: there is no offer to join training/course activities, they do not know how to develop themselves, they have limited time due to their teaching duties or other additional tasks, they do not know the form of the activities (colloquium/panel discussion), and they do not receive the information. in the aspect of scientific publication, the reasons are: there is no offer to become a speaker, they do not have the material and ideas of what to write, they have limited time, and they are not confident to write. meanwhile, in the innovative work aspect, the reasons for the difficulty of cpd implementation are their lack of computer skills, limited time, and lack of motivation and ideas. regarding participation in activities for preparing questions/standards/guidelines, the problem is the absence of offers to join those activities.
conclusion and suggestions
conclusion
based on the results from the cpd checklist sheet, the implementation of cpd activities of junior high school mathematics teachers in bandar lampung is in the very poor category. the majority of teachers are still little involved in cpd activities, either individually or in the forum of subject-matter teachers (mgmp). furthermore, the aspect of self-development is categorized fair, very poor, the scientific publication aspect is categorized very poor, and the innovative works aspect is also in the very poor category. the difficulties of cpd implementation of jhs mathematics teachers are categorized fair. furthermore, the difficulties in the self-development aspect are categorized fair, in the scientific publications aspect are categorized difficult, and in the innovative works aspect are categorized very difficult.
the difficulties faced in self-development include: (1) teachers’ minimal participation in functional training and collective activities, caused by the schedule and location of the activities and the limit on the participants who can join the activities; and difficulty in pursuing a master degree program because of the lack of information, family permission, and funding; (2) difficulty in conducting action research (publications), especially in identifying the problem, preparing good action research, arranging research procedures, and finding relevant studies and expert guidance. most teachers have difficulties in creating popular scientific papers, writing journal articles, and publishing books. the difficulties faced in the innovative works aspect include: (1) developing learning media and teaching aids because of teachers’ lack of it skills and motivation; and (2) joining the activities because of limited opportunity to participate in the activities.
based on document analysis and interviews, the implementation of cpd is still very limited. most respondents rely on the activities organized monthly by mgmp. furthermore, in scientific publication, most teachers have not started writing scientific papers, apart from action research reports. meanwhile, in innovative works, teachers have started to create simple innovative works. the obstacles faced by teachers in implementing cpd include their teaching schedule, the lack of information, and the limit on participants. in the aspects of scientific publications and innovative works, the obstacles are related to time, motivation, expert guidance, and feedback on the results of the writing/works.
suggestions
based on the discussion and conclusion of the research, the following suggestions for cpd implementation are proposed: (1) jhs mathematics teachers must improve their ability in classroom action research, paper writing or scientific works, and innovative works; these activities can lead to better learning outcomes for the learners; (2) school principals need to provide constant support and motivation for teachers to actively engage in various types of cpd activities; (3) the government needs to provide funds to facilitate teachers’ engagement in cpd, especially in joining self-development training; (4) universities, as educator producers, should add organized coaching programs and equip students with writing and research abilities.
references
abubakar, a. (2015). dampak sertifikasi guru terhadap kualitas pendidikan pada madrasah aliyah di kota kendari. alqalam, 21(1), 117–128. https://doi.org/10.31969/alq.v21i1.204
aina, m., bambang, h., retni, s. b., afreni, h., & sadikin, a. (2015). pelatihan penulisan karya tulis ilmiah bagi guru-guru sma 8 kota jambi. jurnal pengabdian pada masyarakat, 30(3), 29–32.
alnoor, a. g., & yuanxiang, g. (2000). assessment mathematics teacher’s competencies. wuhan: central china normal university.
badan standar nasional pendidikan. (2015). aplikasi pamer un 2015/2016. jakarta: bsnp (badan standar nasional pendidikan).
badri, m., alnuaimi, a., mohaidat, j., yang, g., & al rashedi, a. (2016). perception of teachers’ professional development needs, impacts, and barriers: the abu dhabi case. sage open, 6(3), 1–15. https://doi.org/10.1177/2158244016662901
chval, k., abell, s., pareja, e., musikul, k., & ritzka, g. (2008). science and mathematics teachers’ experiences, needs, and expectations regarding professional development. eurasia journal of mathematics, science & technology education, 4(1), 32–43.
creswell, j. w., & clark, v. l. p. (2011). designing and conducting mixed methods research. thousand oaks, ca: sage publications.
desimone, l. m. (2009). improving impact studies of teachers’ professional development: toward better conceptualizations and measures. educational researcher, 38(3), 181–199. https://doi.org/10.3102/0013189x08331140
fahmi, m., maulana, a., & yusuf, a. a. (2011). teacher certification in indonesia: a confusion of means and ends. working papers in economics and development studies (wopeds). bandung: center for economics and development studies (ceds), padjadjaran university.
gray, s. l. (2005). an enquiry into continuing professional development for teachers. cambridge: esmée fairbairn foundation.
hamilton-ekeke, j.-t. (2013). conceptual framework of teachers’ competence in relation to students’ academic achievement. international journal of networks and systems, 2(3), 15–20.
jalal, f., samani, m., chang, m. c., stevenson, r., ragatz, a. b., & negara, s. d. (2009). teacher certification in indonesia: a strategy for teacher quality improvement. jakarta: departemen pendidikan nasional republik indonesia.
joubert, m., back, j., de geest, e., hirst, c., & sutherland, r. (2010). professional development for teachers of mathematics: opportunities and change. in proceedings of cerme 6 (pp. 1761–1770). lyon, france: inrp.
kardiyem. (2013). analisis kinerja guru pascasertifikasi (studi empiris pada guru akuntansi smk se-kabupaten grobogan). journal of economic education, 2(1), 18–23.
kartowagiran, b. (2011). kinerja guru profesional (guru pasca sertifikasi). cakrawala pendidikan, 30(3), 463–473. https://doi.org/10.21831/cp.v3i3.4208
kasmayadi, w. (2016). model asesmen pengembangan keprofesian berkelanjutan guru sekolah menengah atas. doctoral dissertation. universitas negeri yogyakarta, yogyakarta.
kaur, b., bhardwaj, d., & wong, l. f. (2017). teaching for metacognition project: construction of knowledge by mathematics teachers working and learning collaboratively in multitier communities of practice. in b. kaur, o. n. kwon, & y. h. leong (eds.), professional development of mathematics teachers: an asian perspective (pp. 169–187). boston, ma: springer.
law no. 14 year 2005 of republic of indonesia about teachers and lecturers (2005).
layne, h. (2016). teacher education and teacher’s professional development in finland: myths and realities. in international conference on teacher education and professional development (pp. 8–12). yogyakarta: lppmp yogyakarta state university.
miles, m. b., huberman, a. m., & saldaña, j. (2014). qualitative data analysis: a methods sourcebook (3rd ed.). thousand oaks, ca: sage.
ministry of national education of republic of indonesia. (2010). pembinaan dan pengembangan profesi guru buku i: pedoman pengelolaan keprofesian berkelanjutan (pkb) dan angka kreditnya. jakarta: direktorat jenderal peningkatan mutu dan tenaga kependidikan.
new trier township high school. (2012). characteristic of professional practice at new trier high school. northfield, il: new trier township high school district 203.
noorjannah, l. (2015). pengembangan profesionalisme guru melalui penulisan karya tulis ilmiah bagi guru profesional di sma negeri 1 kauman kabupaten tulungagung. jurnal humanity, 10(1), 97–114.
nuraeni, z., & retnawati, h. (2016). the post-certification performance of mathematics teachers. the online journal of new horizons in education, 6(2), 130–142.
opfer, v. d., & pedder, d. (2010). benefits, status and effectiveness of continuous professional development for teachers in england. the curriculum journal, 21(4), 413–431. https://doi.org/10.1080/09585176.2010.529651
powell, e., terrell, i., furey, s., & scott-evans, a. (2003). teachers’ perceptions of the impact of cpd: an institutional case study ed. journal of in-service education, 29(3), 389–404. https://doi.org/10.1080/13674580300200282
pppptk matematika. (2016). hasil ukg 2015. unpublished. pppptk matematika, yogyakarta.
qablan, a., mansour, n., alshamrani, s., & aldahmash, a. (2015). ensuring effective impact of continuing professional development: saudi science teachers’ perspective. eurasia journal of mathematics, science and technology education, 11(3), 619–631. https://doi.org/10.12973/eurasia.2015.1352a
regulation of the minister for the state apparatus empowerment and bureaucracy reform no. 16 of 2009 on teachers’ functional position and their credit points (2009). republic of indonesia.
regulation of the minister of education and culture no. 20 of 2016 on the competence standard of primary and secondary education graduates (2016). republic of indonesia.
regulation of the minister of national education no. 19 of 2005, on national education standard (2005). republic of indonesia.
supriyanto, a. (2015). harapan, kenyataan dan strategi peningkatan kemampuan guru dalam penulisan karya tulis ilmiah. in prosiding seminar nasional pengembangan keprofesian menuju guru profesional (pp. 109–114). malang: universitas negeri malang.
tuijnman, a., & boström, a.-k. (2002). changing notions of lifelong education and lifelong learning. international review of education, 48(1/2), 93–110.
ünal, h., demir, i., & kiliç, s. (2011). teachers’ professional development and students’ mathematics performance: findings from timss 2007. procedia social and behavioral sciences, 15, 3252–3257. https://doi.org/10.1016/j.sbspro.2011.04.280
unesco. (2014). education strategy 2014–2021. paris: unesco.
unicef. (2007). a human rights-based approach to education for all. new york, ny: united nations educational, scientific, and cultural organization.
wermke, w. (2011). continuing professional development in context: teachers’ continuing professional development culture in germany and sweden. professional development in education, 37(5), 665–683. https://doi.org/10.1080/19415257.2010.533573
wibowo, e., & jailani, j. (2014). analisis kesulitan guru matematika smp dalam pengembangan profesi di kabupaten wonosobo. jurnal riset pendidikan matematika, 1(2), 202–215. https://doi.org/10.21831/jrpm.v1i2.2676
widoyoko, e. p. (2013). evaluasi program pembelajaran: panduan praktis bagi pendidik dan calon pendidik. yogyakarta: pustaka pelajar.
world bank. (2010). transforming indonesia’s teaching force. washington, dc: human development east asia and pacific region, world bank.
yacoub, y. (2012). pengaruh tingkat pengangguran terhadap tingkat kemiskinan kabupaten/kota di provinsi kalimantan barat. jurnal eksos, 8, 176–185.

reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 3(1), 2017, 12-27
available online at: http://journal.uny.ac.id/index.php/reid
research article
practitioner-informed improvements to early childhood intervention performance checklists and practice guides
* 1 carl j. dunst; 2 deborah w. hamby; 3 linda l. wilson; 4 marilyn espe-sherwindt; 5 donna e. nelson
*orelena hawks puckett institute 128 south sterling street morganton, nc 28655 united states
*email: cdunst@puckett.org
submitted: 25 may 2017 | revised: 25 july 2017 | accepted: 27 july 2017
abstract
results from four early childhood practitioner field-tests of performance checklists and early intervention practice guides are reported. findings from the first field-test were used to make changes and improvements in the checklists and practice guides evaluated in the second and third field-tests, and findings from the latter two field-tests were used to improve the checklist and practice guide evaluated in the fourth field-test.
the results indicated that changes made in response to practitioners’ suggestions and feedback were associated with (1) progressive increases in the practitioners’ social validity judgments of the checklists, practice guides, and checklist-practice guide correspondence; and (2) progressive decreases in the number of practitioner suggestions and feedback for improving the early intervention materials. the field-test research demonstrates the importance of practitioner input, suggestions, and feedback for improving the usefulness of early childhood intervention practices.
keywords: early childhood intervention, performance checklists, practice guides, social validity, practitioner appraisals
how to cite item: dunst, c., hamby, d., wilson, l., espe-sherwindt, m., & nelson, d. (2017). practitioner-informed improvements to early childhood intervention performance checklists and practice guides. reid (research and evaluation in education), 3(1), 12-27. doi:http://dx.doi.org/10.21831/reid.v3i1.14158
introduction
early childhood intervention for infants, toddlers, and preschoolers with identified disabilities or developmental delays and their families, as well as for young children who are at-risk for poor developmental outcomes for biological or environmental reasons, is now common practice throughout the world (e.g., farrell, kagan, & tisdall, 2016; groark, eidelman, maude, & kaczmarek, 2011; guralnick, 2005; odom, hanson, blackman, & kaul, 2003; sukkar, dunst, & kirkby, 2017). early childhood intervention includes the experiences and learning opportunities afforded young children to promote acquisition of functional behavior (bailey & wolery, 1992; dunst & espe-sherwindt, 2017) and the supports and resources provided to or procured by parents and other family members to strengthen family functioning (cowan, powell, & cowan, 1998; dunst, 2017b).
the field of early childhood intervention has a relatively short but rich history (dunst, 1996; mclean, sandall, & smith, 2016; meisels & shonkoff, 2000). in the 50+ years since hunt (1961) first noted that experiences early in a child’s life could alter developmental outcomes, and later, that responsive caregiving was an important factor in shaping those outcomes (hunt, 1987), considerable advances have been made in terms of understanding which experiences under which conditions have which kinds of outcomes and benefits (e.g., britto, engle, & super, 2013; farrell et al., 2016; odom & wolery, 2003; reichow, boyd, barton, & odom, 2016). early childhood intervention practitioners now have many choices and options in terms of the intervention practices they can use in their work with young children and their families.
many factors influence practitioners’ adoption and use of different kinds of early childhood intervention practices, including, but not limited to, personal beliefs about practice-outcome relationships and one’s ability to use a practice competently and confidently (bruder, dunst, & mogro-wilson, 2011; trivette, dunst, hamby, & meter, 2012). these beliefs include the social validity appraisals of early childhood intervention practices and their intended outcomes (kazdin, 2005). the practical importance of social validity appraisals is that these types of judgments can help explain why a practitioner does or does not use an intervention practice.
subjective judgments of the importance and acceptability of intervention goals, practices, and outcomes likely influence practitioners’ adoption and use of different kinds of intervention procedures (foster & mash, 1999). according to strain et al.
(2012), intervention practices are not likely to be used by practitioners (or parents) if the practices themselves are not viewed as socially valid and worth the time and effort to use. dunst, raab, and hamby (2016), for example, found that parents’ social validity judgments of interest-based child language learning practices were directly related to parents’ fidelity of use of the practices and indirectly related to child language development, mediated by fidelity of use of the practices.
the study described in this paper is part of a line of research and practice on (a) the development of evidence-informed early childhood intervention performance checklists and both practitioner and parent practice guides and (b) the influences of practitioner feedback and suggestions on the improvement of both sets of materials. the study involved four field-tests that solicited practitioner social validity judgments of selected checklists and practice guides as well as suggestions for the improvement of both products. each field-test involved practitioner review and evaluation of a different performance checklist and a different practice guide, where feedback and suggestions were used to inform improvements in both sets of products. the findings from the first field-test were used to make changes in the checklists and practice guides in the second and third field-tests, and findings from the latter two field-tests were used to inform improvements to the checklist and practice guide in the fourth field-test. preliminary findings from this line of research and practice indicated that changes made to the checklists and practice guides in response to practitioner evaluations of the intervention materials were associated with stronger social validity appraisals of revised checklists and practice guides (dunst, 2017a).
evidence-informed checklists and practice guides
the performance checklists and practice guides that were the focus of field-test research were developed at the early childhood technical assistance center at the university of north carolina – chapel hill, united states (www.ectacenter.org). checklists include lists of the key characteristics of a method or procedure that operationally defines the active ingredients of desired performance. early childhood intervention checklists include the key characteristics of intervention practices that are used to produce observable changes or improvement in child or family functioning. practice guides include descriptions of everyday intervention activities that can be used to affect changes in child or family functioning.
performance checklists
early childhood intervention performance or procedural checklists provide concrete reminders for using intervention practices in a competent manner (gawande, 2009; wilson, 2013). the checklists were developed using a conceptualization-operationalization-measurement framework (babbie, 2009; dunst, 2017c; dunst, trivette, & raab, 2015) where research findings from primary research syntheses and reviews (dunst, in preparation) informed checklist indicator selection or development. performance checklists differ from other types of checklists by specifying a “list of tasks or steps required to complete a procedure [intervention practice] successfully” (wilson, 2013, p. 4). according to gawande (2009), these kinds of checklist indicators provide practitioners concrete reminders for how to implement an intervention practice consistently, reliably, and competently.
twenty-nine performance checklists were developed by first using the division for early childhood recommended practices (division for early childhood, 2014) to identify internally consistent sets of practice indicators for different types of intervention practices, where the final selection of checklist practice indicators was informed by research evidence. the checklists were all formatted in the same way because “applying organizations to new learning causes learners to focus on the meaning” [intent] of the checklist indicators (schwartz, 2014, p. 107). each checklist includes (1) a brief description of a checklist practice and how the checklist can be used, (2) a list of evidence-informed practice indicators, (3) a rating scale for doing a self-evaluation or coach-facilitated evaluation of the use of the practice indicators, and (4) space for recording notes about a practitioner’s experience using the checklist practices. appendix a shows the performance checklist that was the focus of practitioner evaluation in the fourth field-test. the reader is referred to dunst (2017) for a more detailed description of the procedures used to develop the checklists.
practice guides
two sets of practice guides were developed using the checklist indicators as the sources of intervention activities: one set for parents and other primary caregivers and a second set for early childhood intervention practitioners. the practice guides were also all formatted in the same way. each practice guide includes: (1) a description of a practice and its intended outcome, (2) examples of activities for using a practice, (3) videos of parents or practitioners using the practice, (4) a vignette of parents or practitioners implementing a practice, (5) functional outcome indicators for determining if the practice had expected benefits, and (6) a link to external resources for additional ideas (activities) for using a practice.
appendix b shows the practice guide for the family capacity-building practices checklist used in the fourth field-test. the practice guides are modeled after the ones that have been extensively field-tested and evaluated by parents, practitioners, and technical assistance providers in previous research and intervention studies (e.g., dunst, masiello, meter, swanson, & gorman, 2010; dunst, trivette, gorman, & hamby, 2010; trivette, dunst, & hamby, 2010).
hypotheses
the analyses focused on two primary and two secondary hypotheses. the two primary hypotheses were:
h1: the social validity judgments of the performance checklists and practice guides will increase linearly as a result of changes and improvements made in response to practitioners’ evaluations as evidenced by the sizes of effect for the linear increases and associated improvement indices.
h2: the number of practitioner suggestions for improving the checklists and practice guides will decrease linearly as a result of changes and improvements made in response to practitioners’ evaluations as evidenced by the sizes of effect for the linear increases and associated improvement indices.
the two secondary hypotheses were:
h3: the sizes of effects and improvement indices for field-test 1 vs. field-test 4 will be larger than those for field-tests 2 + 3 vs. field-test 4 as a result of the progressive changes and improvements made in response to practitioners’ evaluations of the checklists and practice guides.
h4: the sizes of effects and improvement indices for field-tests 1 + 2 + 3 vs. field-test 4 will provide the best estimates of the cumulative changes made in response to practitioners’ evaluations of the checklists and practice guides.
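The comparisons named in the hypotheses can be written as contrast weights applied to the four field-test means. The sketch below is only an illustration: the paper does not list its coefficients, so the weight vectors are one plain reading of the wording in H1, H3, and H4, and the four means are hypothetical numbers invented for the example, not results from the study.

```python
# Hypothetical mean social validity ratings for field-tests 1-4.
# Illustrative numbers only; they are NOT taken from the study.
hypothetical_means = [4.0, 4.2, 4.3, 4.6]

# Assumed contrast weights (one possible reading of the hypotheses).
# Each set of weights sums to zero, which is what makes it a contrast.
assumed_contrasts = {
    "linear trend across field-tests (H1)": [-3, -1, 1, 3],
    "field-test 1 vs. field-test 4 (H3)": [-1, 0, 0, 1],
    "field-tests 2 + 3 vs. field-test 4 (H3)": [0, -0.5, -0.5, 1],
    "field-tests 1 + 2 + 3 vs. field-test 4 (H4)": [-1 / 3, -1 / 3, -1 / 3, 1],
}


def contrast_estimate(weights, group_means):
    """Return the weighted sum of group means for one contrast."""
    if len(weights) != len(group_means):
        raise ValueError("need one weight per group mean")
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    return sum(w * m for w, m in zip(weights, group_means))


for label, weights in assumed_contrasts.items():
    estimate = contrast_estimate(weights, hypothetical_means)
    print(f"{label}: {estimate:.3f}")
```

With weights scaled this way, each pairwise contrast reads as the difference between the field-test 4 mean and the average of the earlier field-tests it is compared with.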
thus, the four hypotheses were tested by a priori linear and orthogonal contrasts for between-field-test comparisons in the analyses of the field-test research data.
method
the participants consisted of 67 practitioners from an early head start program in one state and two early childhood intervention programs in other states. the three programs have a history of using innovative practices where the program practitioners are knowledgeable about contemporary evidence-informed early childhood intervention practices. there were no between-group differences in the percentage of participants in the different field-test studies, χ² = 6.68, df = 6, p = .3516, nor were there differences in the percentage of participants from each type of early childhood program in the field-tests, χ² = 2.77, df = 6, p = .8375.

table 1. background characteristics of the field-test participants

characteristics                                          number   percent
education degree
  aa                                                       14       20
  ba/bs                                                    24       35
  ma/ms                                                    26       40
  ph.d/ed.d                                                 3        5
discipline
  early childhood                                          42       63
  early childhood special education/special education     16       24
  other(a)                                                  9       13
years of experience
  <1                                                        4        6
  2-5                                                      13       19
  6-10                                                     15       22
  11-15                                                    10       15
  16-20                                                    14       21
  21+                                                      11       16
primary practitioner role
  child-focused                                            25       37
  family-focused                                           42       63

note: (a) speech and language pathologists, child and family specialists, early interventionists, and social workers/family workers.

the background characteristics of the participants are shown in table 1. the majority of practitioners (75%) had either bachelor’s or master’s degrees. most of the practitioners had degrees in early childhood education or early childhood special education/special education. the participants’ median years of experience ranged between 6 and 10 with 78% having from 6 to 20+ years of experience. nearly two-thirds of the participants worked primarily with parents and their children (family-focused) and 37% worked primarily with children (child-focused).
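The chi-square results reported for the participant comparisons can be sanity-checked from the test statistic and degrees of freedom alone. The sketch below uses only the Python standard library and relies on the closed-form upper-tail probability that exists when the degrees of freedom are even; the function name is a hypothetical helper introduced here, and the cell counts behind the tests are not given in the text, so only the step from statistic to p-value is reproduced.

```python
import math


def chi2_upper_tail_even_df(statistic: float, df: int) -> float:
    """Upper-tail probability P(X >= statistic) for a chi-square variable.

    Only covers even df, where the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    half = statistic / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)


# (chi-square, df, p) triples exactly as reported in the text above.
reported = [(6.68, 6, 0.3516), (2.77, 6, 0.8375)]

for statistic, df, p_reported in reported:
    p = chi2_upper_tail_even_df(statistic, df)
    print(f"chi2 = {statistic}, df = {df}: p = {p:.4f} (reported {p_reported})")
```

Small differences in the last decimal place are expected, because the reported statistics are themselves rounded to two decimals.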
there were no between field-test differences for any of the participant background characteristics shown in table 1, χ²s = 0.17 to 10.73, dfs = 3 to 15, ps = .1004 to .9817.

procedure

the performance checklists and practice guides that were the focus of the field-test evaluations are shown in table 2. the four topic areas included child, parent-child, parent, and family-focused intervention practices. both the checklists and practice guides included different kinds of interventions for (a) using everyday activities as sources of child learning opportunities and (b) parent sensitivity and responsiveness to child behavior in the activities as the primary caregiver practice to reinforce child competencies and sustain child engagement in the activities.

the checklist in the first field-test included practice indicators for strengthening caregiver and child relationships that focused on bidirectional, reciprocal interactions between interactive partners (eshel, daelmans, cabral de mello, & martines, 2006). the practice guide for the checklist indicators included a number of different socially interactive games that caregivers could use to engage young children in your turn-my turn interactive episodes (e.g., dunst, pace, & hamby, 2007).

table 2. performance checklists and practice guides that were the focus of practitioner social validity judgments and feedback

field-test   topic area    performance checklists           practice guides
1            interaction   adult-child interactions         social games
2            environment   natural learning opportunities   it’s natural
3            instruction   naturalistic instruction         learning comes naturally
4            family        family capacity-building         everyday child learning

the checklist in the second field-test included indicators for identifying everyday activities that provide the most opportunities for child learning (dunst, bruder, trivette, & hamby, 2006).
the practice guide included ideas and strategies for engaging a child in the activities (dunst, raab, & trivette, 2013b). the checklist in the third field-test included indicators for using naturalistic teaching practices for reinforcing child behavior initiations and elaborations while engaged in everyday activities (dunst, raab, & trivette, 2011). the practice guide included different kinds of intervention activities and strategies for using natural reinforcing consequences for reinforcing child behavior (e.g., dunst, raab, & trivette, 2013a). the checklist in the fourth field-test included methods for strengthening family capacity to provide a child everyday learning opportunities (swanson, raab, & dunst, 2011). the practice guide included a set of step-by-step instructions for practitioners to use to encourage and support parent-mediated everyday child learning (e.g., raab, dunst, & trivette, 2013).

the checklist and practice guide in the first field-test had not been subjected to prior review and feedback, and practitioner evaluations of both products were used as the baseline for evaluation of subsequent revisions and improvements to the checklists and practice guides in the second and third field-tests. the changes to the checklists in response to practitioner feedback and suggestions included clarifying the purpose of the checklist instructions and intended users (practitioners), rewording the checklist indicators to improve meaning and intent, clarifying how to use the checklist indicators to plan intervention sessions with parents, and clarifying how to use the rating scale to do a self-evaluation of how many and how well the checklist indicators were used with a child or parent.
the changes to the practice guides included adding captions to the videos of parents or practitioners using the practices, adding additional intervention activities to the practice guides, including suggestions for making adaptations to the practice guide activities (where appropriate), and clarifying how to use the outcome indicators for evaluating the benefits of the practice guide activities.

feedback and suggestions on the second and third field-tests were used to make additional changes to the checklist and practice guide in the fourth field-test. the changes to the checklist included clarifying the difference between using the checklist indicators for a priori intervention planning and doing a post hoc self-evaluation of the use of the indicators and clarifying the instructions for how to use the checklist indicators for completing a self-evaluation. the changes to the practice guide included additional specificity in terms of the focus and intent of the practice guide as well as the practice guide activities.

field-test survey

the survey included four sections: (1) practitioner social validity judgments of the checklists, practice guides, and correspondence between practice guides and checklist indicators; (2) open-ended questions asking for suggestions to improve the checklists and practice guides; (3) levels of experience needed for a practitioner to understand and use checklists and practice guides; (4) background information about the field-test participants (table 1). each field-test involved an emailed invitation sent to the directors of each program that included instructions for participation in the field-tests, pdfs of the checklists and practice guides, and a url link to the survey. the program directors were asked to forward the emailed invitation to their staff. participation in the field-tests was voluntary, and the field-test research was considered exempt from human subjects review because practitioners were asked only to evaluate materials designed for routine early childhood intervention. the surveys were completed online using qualtrics survey software.

the social validity items for the performance checklists, practice guides, and checklist-practice guide correspondence (four per section) were developed by using foster and mash’s (1999) framework for developing indicators for measuring the importance and acceptability of intervention practices and outcomes. in addition, the social importance of the checklists and practice guides was measured in terms of the subjective value attributed to the intervention materials (e.g., the checklist items are easy to understand and follow; the practice guide activities would be engaging to most children). the social acceptability of the checklists and practice guides was measured in terms of judgments about the fit of the practices to everyday life (e.g., the checklist indicators would be easy to use with a parent or child; the practice guide would be worth my time and effort to use).

the social validity items were each rated on a 5-point scale ranging from do not agree at all (with the survey items) to agree a great deal (with the survey items). the items were adopted from the ones used in field-tests of other intervention practices (e.g., dunst et al., 2007; dunst, trivette, et al., 2010). principal component factor analyses of the three sets of items in each field-test, with orthogonal rotation, each produced a single-factor solution, indicating that summated scores were warranted as measures of social validity judgments (spector, 1992).
the average coefficient alpha for the checklist indicators was .89 (range = .81 to .97), the average alpha for the practice guide indicators was .85 (range = .77 to .91), and the average alpha for the correspondence between the checklists and practice guides was .92 (range = .85 to .95). the alphas in all 12 analyses reached acceptable levels of internal reliability (nunnally & bernstein, 1994).

the open-ended questions for improving the checklists asked for suggestions about the (1) checklist instructions, (2) checklist indicators, (3) self-evaluation scale, and (4) any other suggestions for improvement. the open-ended questions for improving the practice guides asked for suggestions about the (1) practice guide format, (2) practice guide activities, (3) videos of the practices, (4) child outcomes, and (5) any other suggestions to improve the practice guides.

methods of analysis

3 between field-test anovas with preplanned linear and between-group contrasts were used to evaluate the effects of changes to the checklists and practice guides on participants’ social validity judgments. the independent variable was the different field-tests (field-test 1 vs. field-tests 2 + 3 vs. field-test 4). the linear contrasts and between-field-test comparisons permitted tests of the four study hypotheses. the dependent measures in the three anovas were the summated social validity scores for the performance checklists, practice guides, and correspondence between the checklists and practice guides.

the primary metrics for testing the study hypotheses were cohen’s d effect sizes and associated improvement indices (what works clearinghouse, 2014). effect sizes rather than statistical significance tests are the preferred metric for substantive interpretation because effect sizes and not p-values are the best estimates of the magnitude of the differences between two groups or contrasts (coe, 2002).
as a general rule, effect sizes between .20 and .49 are considered small, those between .50 and .79 are considered medium, those between .80 and 1.19 are considered large, and effect sizes equal to or greater than 1.20 are considered very large. improvement indices are measures of the practical importance of the changes made to the checklists and practice guides (durlak, 2009). the indices convert effect sizes into a percentile change (gain) score by a target group. these indices vary from -50 to 50 where a positive difference between later and earlier field-tests provides a measure of the amount of improvement that occurred as a result of changes made to the checklists and practice guides. zcalc was used to evaluate the improvement indices (neill, 2006).

primary analyses of the practitioners’ social validity judgments were supplemented by computing the percent of indicators rated a 4 or 5 on the 5-point scale to ascertain the overall levels of agreement with the indicators. as found in consumer sciences research, the larger the percent of indicators rated a 4 or 5 on a 5-point scale, the stronger the endorsement of a product, practice, or service (mackiewicz & yeats, 2014; reichheld, 2003). the mantel-haenszel test for linear trends was used to determine if there were progressive increases in the percent of practitioners rating the social validity items a 4 or 5 from the first to fourth field-tests (spss inc., 2005).

the effects of the changes made to the checklists and practice guides in response to practitioner suggestions were tested by both 3 between field-test anovas for the total number of practitioner suggestions and by 3 between field-test chi-square analyses for dichotomous responses for each open-ended section. the same linear contrasts for the social validity appraisals were made for evaluating changes in the practitioner suggestions.
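the conversion from an effect size to an improvement index described above can be sketched in a few lines of python. this is a minimal illustration, not the authors' zcalc spreadsheet: it assumes the usual what works clearinghouse definition, in which the index is the standard normal cumulative probability of d minus 50 percentile points. the function names are hypothetical, and the "negligible" label for effect sizes below .20 is an assumption added here, since the text does not name that band.

```python
from statistics import NormalDist


def improvement_index(d: float) -> float:
    """Convert Cohen's d into a percentile gain between -50 and 50.

    Assumes the index equals the standard normal CDF at d, expressed as a
    percentile, minus the 50th percentile of the comparison group.
    """
    return (NormalDist().cdf(d) - 0.5) * 100.0


def effect_size_label(d: float) -> str:
    """Label the magnitude of d using the cutoffs stated in the text."""
    size = abs(d)
    if size >= 1.20:
        return "very large"
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "negligible"  # assumed label; the text does not name this band


if __name__ == "__main__":
    # Linear-trend effect sizes as reported in Table 3.
    linear_trends = {
        "performance checklists": 0.59,
        "practice guides": 0.23,
        "pc/pg correspondence": 0.74,
    }
    for product, d in linear_trends.items():
        index = round(improvement_index(d))
        print(f"{product}: d = {d:.2f} ({effect_size_label(d)}), "
              f"improvement index = {index}")
```

under these assumptions the sketch gives indices of 22, 9, and 27 for the three linear-trend effect sizes, which agrees with the values reported in table 3.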
findings and discussion

social validity judgments

figure 1 shows the mean social validity scores for the four field-tests for each set of importance and acceptability judgments. the 3 between field-test anovas produced between-group differences for the practitioner social validity judgments of the performance checklists, f(2, 64) = 3.49, p = .0364, and checklist/practice guide correspondence, f(2, 64) = 4.94, p = .0101, but not for the practice guides, f(2, 64) = 0.42, p = .6562.

the results for linear contrasts and between-field-test comparisons are presented in table 3. there were small linear increases for the practice guides and medium linear increases for both the performance checklists and checklist/practice guide correspondence from the first to fourth field-tests, as evidenced by the sizes of effect for the linear trends. the effect sizes were associated with improvement indices of 9, 22, and 27 percent, respectively, in response to the progressive changes in the practice guides, checklists, and correspondence comparisons. the findings are consistent with hypothesis 1 that changes made in response to the practitioners’ feedback and suggestions would be related to improvements in the social validity judgments of the checklists and practice guides.

figure 1. mean practitioner social validity scores for the four field-tests.

table 3. linear contrasts and between-field-test comparisons and associated significance levels, effect sizes, and improvement indices

                                               field-test comparisons
product                        linear trend   1 vs. 4   2 + 3 vs. 4   1 + 2 + 3 vs. 4
statistical significance (p)
  performance checklists (pc)     .0107        .0040       .1410          .0052
  practice guides (pg)            .1802        .1845       .2908          .2090
  pc/pg correspondence            .0025        .0012       .1706          .0014
cohen’s d effect sizes
  performance checklists (pc)     0.59         0.93        0.22           0.67
  practice guides (pg)            0.23         0.29        0.18           0.21
  pc/pg correspondence            0.74         1.09        0.33           0.78
improvement indices
  performance checklists (pc)     22           32          9              25
  practice guides (pg)            9            11          7              8
  pc/pg correspondence            27           36          13             28

note: the linear trends and field-test comparisons all have numerator degrees of freedom = 1.

there were small (practice guides) to large (checklists and checklist/practice guide correspondence) effect sizes for the differences between the field-test 1 vs. field-test 4 social validity judgments (table 3). these were associated with improvement indices of 11, 32, and 36 percent, respectively, for the practice guides, checklists, and correspondence judgments. the effect sizes for the field-tests 2 + 3 vs. field-test 4 comparisons were small for both the performance checklists and checklist/practice guide correspondence. these between-field-test comparisons were associated with improvement indices of 9% for the checklist differences and 13% for the checklist/practice guide correspondence differences. the comparison of the two sets of results in table 3 shows, as hypothesized, that the sizes of effect and associated improvement indices for field-tests 1 vs. 4 are considerably larger than those for field-tests 2 + 3 vs. 4.

the cumulative effects of the progressive changes made in response to the practitioner evaluations are evidenced from the field-tests 1 + 2 + 3 vs. field-test 4 comparisons.
there were small (practice guides) to medium (checklists and checklist/practice guide correspondence) effect sizes for these between-field-test comparisons. the effect sizes were associated with improvement indices of 8, 25, and 28 percent, respectively. the results are consistent with the hypothesized relationships between changes made in response to practitioner feedback and suggestions and improvements in the social validity judgments of the intervention practices.

the percent of social validity items rated a 4 or 5 on each section of the survey for the different field-tests is shown in figure 2. there were linear increases in the percent of indicators rated a 4 or 5 for the performance checklists, χ² = 9.04, df = 1, p = .003, d = .79, practice guides, χ² = 5.88, df = 1, p = .015, d = .62, and checklist/practice guide correspondence, χ² = 10.98, df = 1, p = .001, d = .97. the effect sizes for the linear trends were medium to large and associated with improvement indices of 29, 23, and 33 percent, respectively. the smaller effect size for the linear increase in the social validity ratings of the practice guides was not unexpected given the fact that practitioner judgments of the practice guides were higher than those for the checklists on the first three field-tests. as shown in figure 2, 98% to 99% of the social validity items received the highest two ratings in the fourth field-test, which is noticeably higher than in the other three field-tests.

figure 2. percent of social validity items judged as acceptable and important by the field-test participants

practitioner suggestions

figure 3. percent of practitioners making suggestions for improving the performance checklists and practice guides

figure 3 shows the percent of practitioners who made suggestions for improving the checklists and practice guides in the different field-tests.
the 3 between-field-test anovas for the total number of practitioner suggestions produced between field-test differences for both the performance checklists, f(2, 64) = 7.11, p = .0016, and practice guides, f(2, 64) = 10.51, p = .0001. there were linear decreases in the number of suggestions for the checklists, f(1, 64) = 11.41, p = .0012, d = .85, and practice guides, f(1, 64) = 18.31, p = .0001, d = 1.07. both sizes of effect were large for the linear decreases in the number of practitioner suggestions. the effect sizes were associated with improvement indices of 30% and 36%, respectively. the patterns of results are consistent with the hypothesis that the practitioners would suggest fewer changes as a function of improvements made in response to their feedback and suggestions.

further examination of the suggestions for improving the checklists found linear decreases in the percent of practitioners who made suggestions for changes to the checklist instructions, χ² = 6.43, df = 1, p = .011, d = .77, the checklist indicators, χ² = 6.96, df = 1, p = .008, d = .83, and the self-evaluation rating scale, χ² = 2.54, df = 1, p = .0555, d = .45. the sizes of effect were medium, large, and small, respectively, and associated with improvement indices between 17% and 30%. there were also linear decreases in the percent of practitioners making suggestions to improve the practice guide format, χ² = 12.43, df = 1, p < .0001, d = 1.15, practice guide activities, χ² = 9.49, df = 1, p = .001, d = .87, practice guide outcome statements, χ² = 2.06, df = 1, p = .051, d = .38, and videos of parents or practitioners using the practices, χ² = 17.17, df = 1, p < .0001, d = 1.36. the effect sizes were small to very large and associated with improvement indices between 15% and 41%.
these findings, taken together, further support the hypothesized relationships between the changes made in response to the practitioners’ evaluations of the checklists and practice guides and fewer suggestions for improving the intervention practices.

discussion

the findings provide support for the two primary hypotheses that changes made to the performance checklists and practice guides in response to early childhood practitioners’ feedback and suggestions would be related to the study outcomes. the results showed that practitioners’ social validity ratings increased as a function of the improvements to both the performance checklists and practice guides and also to the checklist/practice guide correspondence. the results also showed that there were fewer suggestions for making changes to the checklists and practice guides as a function of using practitioner feedback to improve both sets of products.

the patterns of results also provide support for the two secondary hypotheses. the sizes of effect for the first vs. fourth field-tests were larger than those for the second and third vs. fourth field-tests (table 3). these results were expected because fewer suggestions for changes to the checklists and practice guides were made on the second and third field-tests compared to the first field-test (figure 3). the cumulative effects for the changes made in response to practitioners’ suggestions were evidenced by the sizes of effect for the first three field-tests vs. the fourth field-test. both the effect sizes for these comparisons and improvement indices (table 3) indicated that the progressive sets of changes made in response to practitioners’ suggestions were associated with the highest social validity ratings (figure 1) and fewest suggestions for change (figure 3) on the fourth field-test.
the fact that the effect sizes and improvement indices for the practice guides were smaller than those for the performance checklists and checklist-practice guide correspondence deserves comment in order to place the results in empirical context. the practice guides were modeled after the ones that had previously been field-tested with parents and practitioners where the results were used to improve the intervention materials (e.g., dunst, trivette, et al., 2010; trivette, dunst, masiello, gorman, & hamby, 2009). it was therefore not unexpected that the majority of social validity indicators for the practice guides on the first three field-tests were higher than those for the checklists and checklist-practice guide correspondence (figure 3). this was the case because the practice guide format and content were informed by lessons learned in previous research and practice.

the focus on the social validity of the performance checklists and practice guides was based on research indicating that subjective judgments of the importance and acceptability of intervention practices and outcomes are related to both adoption and fidelity of use of the practices (e.g., dunst et al., 2016; strain et al., 2012; trivette, raab, & dunst, 2014; wainer & ingersoll, 2013; wehby, maggin, moore partin, & robertson, 2011). as noted by strain et al. (2012), these “liking-implementation with fidelity relationships” (p. 197) are important because they help explain at least the likelihood of early childhood intervention practices being used as intended.

conclusion and suggestions

the study described in this paper has both strengths and limitations.
one strength is the fact that the procedures used to inform changes in the checklists and practice guides illustrate how consumer-level input can be used to improve social validity appraisals of the intervention materials constituting the focus of evaluation. another strength is establishing the inverse relationship between increases in social validity ratings and concomitant decreases in practitioner suggestions for changes. in another set of analyses in this line of research and practice, practitioners’ cognitive judgments of the performance checklists and practice guides were the only variable accounting for variations in the social validity ratings of the intervention materials (dunst & hamby, 2017).

one limitation of the study is that the field-tests were conducted in only three early childhood intervention programs. therefore, it is not known if practitioners in other early childhood intervention programs would judge the checklists and practice guides in the same or different ways. another limitation is the fact that only 4 out of 29 performance checklists and only 4 out of 67 practice guides were evaluated in the field-tests. whether other checklists and practice guides would be judged similarly is therefore not known.

advances in our understanding of the role social validity judgments play in practitioners’ and parents’ use of different kinds of early childhood intervention have broadened our knowledge of the antecedents for and conditions under which intervention practices are used with fidelity (leko, 2014; strain et al., 2012). one simple way of assessing practitioners’ and parents’ social validity appraisals is to ask the question “was using xyz practice worth your time and effort or was it more trouble than it was worth?” if a practitioner or parent responds that it was not worth the trouble to use, it is unlikely that the practice will be used with fidelity or used at all.
although the field-test process was used to inform improvements in early childhood intervention performance checklists and practice guides, the process itself could easily be used in other fields for achieving performance excellence. this is especially the case in professions where there needs to be practitioner buy-in to ensure actual performance mirrors desired performance.

acknowledgements

the field-test research described in this paper was supported, in part, by a subcontract to the orelena hawks puckett institute from the early childhood technical assistance (ecta) center, frank porter graham child development institute, university of north carolina-chapel hill. the ecta center is funded by the u.s. department of education, office of special education programs (grant #h326p120002). the opinions expressed, however, are those of the authors and no endorsement by the funder, university, or ecta center should be implied or inferred. special thanks to the practitioners who participated in the field-test evaluations and for providing assessments of and feedback on the performance checklists and practice guides.

references

babbie, e. r. (2009). the practice of social research (12th ed.). belmont, ca: wadsworth.
bailey, d. b., jr., & wolery, m. (1992). teaching infants and preschoolers with disabilities (2nd ed.). upper saddle river, nj: merrill.
britto, p. r., engle, p. l., & super, c. m. (eds.). (2013). handbook of early childhood development research and its impact on global policy. new york, ny: oxford.
bruder, m. b., dunst, c. j., & mogro-wilson, c. (2011). confidence and competence appraisals of early intervention and preschool special education practitioners. international journal of early childhood special education, 3(1), 13-37.
coe, r. (2002, september). it's the effect size, stupid: what effect size is and why it is important.
paper presented at the annual conference of the british educational research association, university of exeter, england. retrieved from http://www.leeds.ac.uk/educol/documents/00002182.htm
cowan, p. a., powell, d., & cowan, c. p. (1998). parenting interventions: a family systems perspective. in w. damon, i. e. sigel, & k. a. renninger (eds.), handbook of child psychology: vol. 4. child psychology in practice (5th ed., pp. 3-72). new york, ny: wiley.
division for early childhood. (2014). dec recommended practices in early intervention/early childhood special education. retrieved from http://www.dec-sped.org/recommendedpractices
dunst, c. j. (1996). early intervention in the usa: programs, models, and practices. in m. brambring, h. rauh, & a. beelmann (eds.), early childhood intervention: theory, evaluation, and practice (pp. 11-52). berlin, germany: de gruyter.
dunst, c. j. (2017a). early childhood practitioner judgments of the social validity of performance checklists and parent practice guides. journal of education and training studies, 5(3), 176-187. retrieved from http://redfame.com/journal/index.php/jets/article/view/2162 doi:10.11114/jets.v5i3.2162
dunst, c. j. (2017b). family systems early childhood intervention. in h. sukkar, c. j. dunst, & j. kirkby (eds.), early childhood intervention: working with families of young children with special needs (pp. 38-60). abingdon, oxfordshire: routledge.
dunst, c. j. (2017c). procedures for developing evidence-informed performance checklists for improving early childhood intervention practices. journal of education and learning, 6(3), 1-13. retrieved from http://www.ccsenet.org/journal/index.php/jel/article/view/66005 doi:10.5539/jel.v6n3px
dunst, c. j. (in preparation). research foundations for evidence-informed early childhood intervention performance checklists.
dunst, c. j., bruder, m. b., trivette, c. m., & hamby, d. w. (2006).
everyday activity settings, natural learning environments, and early intervention practices. journal of policy and practice in intellectual disabilities, 3, 3-10. doi:10.1111/j.1741-1130.2006.00047.x
dunst, c. j., & espe-sherwindt, m. (2017). contemporary early intervention models, research, and practice for infants and toddlers with disabilities and delays. in j. m. kauffman, d. p. hallahan, & c. p. pullen (eds.), handbook of special education (2nd ed., pp. 831-849). new york, ny: routledge.
dunst, c. j., masiello, t., meter, d., swanson, j., & gorman, e. (2010). technical assistance providers' evaluation of the center for early literacy learning practice guides. cellpapers, 5(3), 1-4. retrieved from http://www.earlyliteracylearning.org/cellpapers/cellpapers_v5n3.pdf
dunst, c. j., pace, j., & hamby, d. w. (2007). evaluation of the games for growing tool kit for promoting early contingency learning. asheville, nc: winterberry press.
dunst, c. j., raab, m., & hamby, d. w. (2016). interest-based everyday child language learning. revista de logopedia, foniatria y audiologia, 36, 153-161. doi:10.1016/j.rlfa.2016.07.003
dunst, c. j., raab, m., & trivette, c. m. (2011). characteristics of naturalistic language intervention strategies.
journal of speech-language pathology and applied behavior analysis, 5(3-4), 8-16. retrieved from http://www.baojournal.com/slp-aba%20website/index.html
dunst, c. j., raab, m., & trivette, c. m. (2013a). getting in step with responsive teaching. everyday child language learning tools, number 5, 1-4. retrieved from http://www.puckett.org/cecll/eclltools_5.pdf
dunst, c. j., raab, m., & trivette, c. m. (2013b). methods for increasing child participation in interest-based language learning activities. everyday child language learning tools, number 4, 1-6. retrieved from http://www.puckett.org/cecll/ecllreport_7_learnops.pdf
dunst, c. j., trivette, c. m., gorman, e., & hamby, d. w. (2010). further evidence for the social validity of the center for early literacy learning practice guides. cellpapers, 5(1), 1-3. retrieved from http://www.earlyliteracylearning.org/cellpapers/cellpapers_v5n1.pdf
dunst, c. j., trivette, c. m., & raab, m. (2015).
utility of implementation and intervention performance checklists for conducting research in early childhood education. in o. n. saracho (ed.), handbook of research methods in early childhood education: vol. 1. research methodologies (pp. 247-276). charlotte, nc: information age publishing.
durlak, j. a. (2009). how to select, calculate, and interpret effect sizes. journal of pediatric psychology, 34(9), 917-928. doi:10.1093/jpepsy/jsp004
eshel, n., daelmans, b., cabral de mello, m., & martines, j. (2006). responsive parenting: interventions and outcomes. bulletin of the world health organization, 84(12), 991-998. doi:10.1590/s0042-96862006001200016
farrell, a., kagan, s. l., & tisdall, e. k. m. (eds.). (2016). the sage handbook of early childhood research. thousand oaks, ca: sage publishing.
foster, s. l., & mash, e. j. (1999). assessing social validity in clinical treatment research: issues and procedures. journal of consulting and clinical psychology, 67, 308-319. doi:10.1037/0022-006x.67.3.308
gawande, a. (2009). the checklist manifesto: how to get things right. new york, ny: metropolitan books.
groark, c. j., eidelman, s. m., maude, s., & kaczmarek, l. (2011). early childhood intervention: shaping the future for children with special needs and their families. santa barbara, ca: praeger.
guralnick, m. j. (ed.) (2005). the developmental systems approach to early intervention. baltimore, md: brookes.
hunt, j. m. (1961). intelligence and experience. new york, ny: ronald press.
hunt, j. m. (1987). the effects of differing kinds of experiences in early rearing conditions. in i. uzgiris & j. m. hunt (eds.), infant performance and experience (pp. 39-97). urbana, il: university of illinois press.
kazdin, a. e. (2005). social validity. in b. s. everitt & d. c. howell (eds.), encyclopedia of statistics in behavioral science (vol. 4, pp. 1875-1876). chichester, england: john wiley & sons.
leko, m. m. (2014). the value of qualitative methods in social validity research.
remedial and special education, 35(5), 275286. doi:10.1177/0741932514524002 mackiewicz, j., & yeats, d. (2014). product review users' perceptions of review quality: the role of credibility, informativeness, and readability. ieee transactions on professional communication, 57(4), 309-324. doi:10.1109/tpc.2014.2373891 mclean, m., sandall, s. r., & smith, b. j. (2016). a history of early childhood education. in b. reichow, b. a. boyd, e. e. barton, & s. l. odom (eds.), handbook of early childhood special education http://www.baojournal.com/slp-aba%20website/index.html http://www.baojournal.com/slp-aba%20website/index.html http://www.puckett.org/cecll/eclltools_5.pdf http://www.puckett.org/cecll/eclltools_5.pdf http://www.puckett.org/cecll/ecllreport_7_learnops.pdf http://www.puckett.org/cecll/ecllreport_7_learnops.pdf http://www.puckett.org/cecll/ecllreport_7_learnops.pdf http://www.earlyliteracylearning.org/cellpapers/cellpapers_v5n1.pdf http://www.earlyliteracylearning.org/cellpapers/cellpapers_v5n1.pdf reid (research and evaluation in education) 24 − reid (research and evaluation in education, 3(1), 2017 (pp. 3-20). switzerland: springer international. meisels, s. j., & shonkoff, j. p. (2000). early childhood intervention: a continuing evolution. in j. p. shonkoff & s. j. meisels (eds.), handbook of early childhood intervention (2nd ed., pp. 3-31). cambridge, england: cambridge university press. neill, j. (2006). zcalc: for converting a standardised mean effect size (es) into a z score and in expressing various ways (version 0.1). retrieved from www.wilderdom. com/research/zcalc.xls nunnally, j. c., & bernstein, i. h. (1994). psychometric theory (3rd ed.). new york, ny: mcgraw-hill. odom, s. l., hanson, m. j., blackman, j. a., & kaul, s. (2003). early intervention practices around the world. baltimore, md: brookes. odom, s. l., & wolery, m. (2003). a unified theory of practice in early intervention/early childhood special education: evidence-based practices. 
journal of special education, 37, 164-173. doi:10.1177/00224669030370030601 raab, m., dunst, c. j., & trivette, c. m. (2013). adult learning procedure for promoting caregiver use of everyday child language learning practices. everyday child language learning reports, number 3, 1-9. retrieved from http://www.cecll.org/download/ecll report_3_adultlearning.pdf reichheld, f. f. (2003). the one number you need to grow. harvard business review, 81(12), 46-54. reichow, b., boyd, b. a., barton, e. e., & odom, s. l. (eds.). (2016). handbook of early childhood special education. switzerland: springer international. schwartz, b. l. (2014). memory: foundations and applications (2nd ed.). los angeles, ca: sage. spector, p. e. (1992). summated rating scale construction: an introduction. newbury park, ca: sage. spss inc. (2005). spss 14.0. statistical package for the social sciences. chicago, il: author. strain, p. s., barton, e. e., & dunlap, g. (2012). lessons learned about the utility of social validity. education and treatment of children, 35, 183-200. doi:10.1353/etc.2012.0007 sukkar, h., dunst, c. j., & kirkby, j. (eds.). (2017). early childhood intervention: working with families of young children with special needs. abingdon, oxfordshire: routledge. swanson, j., raab, m., & dunst, c. j. (2011). strengthening family capacity to provide young children everyday natural learning opportunities. journal of early childhood research, 9, 66-80. doi:10.1177/1476718x10368588 trivette, c. m., dunst, c. j., & hamby, d. w. (2010). acceptability and importance of adaptations to literacy learning practices for young children with disabilities. cellpapers, 5(4), 1-4. retrieved from http://www.earlyliteracylearning.org/ce llpapers/cellpapers_v5n4.pdf trivette, c. m., dunst, c. j., hamby, d. w., & meter, d. (2012). research synthesis of studies investigating the relationships between practitioner beliefs and adoption of early childhood intervention practices. 
practical evaluation reports, 4(1), 1-19. retrieved from http://www.puckett.org/practical%20 evaluation%20reports/cpe_report_v ol4no1.pdf trivette, c. m., dunst, c. j., masiello, t., gorman, e., & hamby, d. w. (2009). social validity of the center for early literacy learning parent practice guides. cellpapers, 4(1), 1-4. retrieved from http://www.earlyliteracylearning. org/cellpapers/cellpapers_v4n1.pdf trivette, c. m., raab, m., & dunst, c. j. (2014). factors associated with head start staff participation in classroomhttp://www.wilderdom.com/research/zcalc.xls http://www.wilderdom.com/research/zcalc.xls http://www.cecll.org/download/ecllreport_3_adultlearning.pdf http://www.cecll.org/download/ecllreport_3_adultlearning.pdf http://www.earlyliteracylearning.org/cellpapers/cellpapers_v5n4.pdf http://www.earlyliteracylearning.org/cellpapers/cellpapers_v5n4.pdf http://www.puckett.org/practical%20evaluation%20reports/cpe_report_vol4no1.pdf http://www.puckett.org/practical%20evaluation%20reports/cpe_report_vol4no1.pdf http://www.puckett.org/practical%20evaluation%20reports/cpe_report_vol4no1.pdf http://www.earlyliteracylearning.org/cellpapers/cellpapers_v4n1.pdf http://www.earlyliteracylearning.org/cellpapers/cellpapers_v4n1.pdf reid (research and evaluation in education) practitioner-informed improvements to early childhood intervention performance... 25 carl j. dunst, deborah w. hamby, linda l. wilson, marilyn espe-sherwindt, & donna e. nelson based professional development. journal of education and training studies, 2(4), 3245. doi:10.11114/jets.v2i4.449 wainer, a., & ingersoll, b. (2013). intervention fidelity: an essential component for understanding asd parent training research and practice. clinical psychology: science and practice, 20(3), 335-357. doi:10.1111/cpsp.12045 wehby, j. h., maggin, d. m., moore partin, t. c., & robertson, r. (2011). the impact of working alliance, social validity, and teacher burnout on implementation fidelity of the good behavior game. 
Appendix A. Early childhood intervention performance checklist

Appendix B. Early childhood intervention practice guide

Copyright © 2018, REiD (Research and Evaluation in Education), ISSN 2460-6995
REiD (Research and Evaluation in Education), 4(2), 2018, 155-163
Available online at: http://journal.uny.ac.id/index.php/reid

An evaluation of internship program by using Kirkpatrick evaluation model

*1 Lathifa Rosiana Dewi; 2 Badrun Kartowagiran
1,2 Department of Educational Research and Evaluation, Graduate School of Universitas Negeri Yogyakarta
Jl. Colombo No. 1, Depok, Sleman, Yogyakarta 55281, Indonesia
*Corresponding author. E-mail: lathifarosianadewi@gmail.com
Submitted: 20 December 2018 | Revised: 21 December 2018 | Accepted: 22 December 2018

Abstract

This study was aimed at evaluating an internship program using Kirkpatrick’s evaluation model.
The subjects of the study were students of batch 2015 and instructors. The Slovin formula was used to calculate the sample. A questionnaire and a teaching assessment sheet were used as instruments for collecting data. Content validity and exploratory factor analysis were used as evidence of the instruments’ validity. Reliability was estimated by Cronbach’s alpha. The results of this study showed that (1) in facility, the level of satisfaction was in the ‘very satisfactory’ category (77.01%); (2) in instructor, the level of satisfaction was in the ‘very satisfactory’ category (82.76%); (3) in schedule, the level of satisfaction was in the ‘satisfactory’ category (50.57%); (4) in material, the level of satisfaction was in the ‘very satisfactory’ category (89.66%); and (5) in students’ teaching abilities, the improvement was in the ‘very satisfactory’ category.

Keywords: program evaluation, internship, Kirkpatrick model

Introduction

Teachers’ quality determines the quality of education. Teachers are said to be qualified when they have the competencies to plan, teach, evaluate, guide, train, research, and conduct community service (Article 39 of Law of the Republic of Indonesia No. 20 of 2003). According to Jailani (2014), some teachers are not qualified to teach: 78.93% in public elementary schools, 71.06% in private elementary schools, 45.88% in public secondary schools, 39.01% in private secondary schools, 34.71% in public high schools, and 35.27% in private high schools. This may have unfavorable effects on educational practices in Indonesia. Meanwhile, teachers in Indonesia still have an important role in national education. Teachers, therefore, are expected to have good competencies. Teachers who have good competencies are believed to have good abilities in teaching. This statement is supported by Ardiansyah (2013), who states that teachers who have good competencies can teach well.
There are four competencies which a teacher should possess, namely pedagogic competence, personal competence, professional competence, and social competence. Pedagogic competence is the teacher’s competence in managing teaching and learning processes; managing the class and arranging the students’ seats are examples of pedagogic competence. Personal competence is the competence to influence students to have good attitudes. Professional competence is a teacher’s competence in mastering the material. The last is social competence, where teachers should be able to interact well with the students, other teachers, and parents. These competencies can lead to the success of the teaching and learning process. Hallo and Munadi (2014) similarly mention that teachers have important roles in the success of the teaching and learning process.

The success of the teaching and learning process cannot be realized if teachers do not have good competencies. This is why teachers should have good competencies while they are still prospective teachers. This is aimed at making them ready when they become teachers in the field. If they do not have good competencies, they will be teachers who only transfer knowledge. Nowadays, some teachers only transfer knowledge to the students. They just deliver the class material without knowing whether their students understand it or not. Teachers should play their role to teach, evaluate the teaching and learning process, and improve anything that needs to be improved. This should be realized by teachers from the very first time they are in the teaching and learning process. This can happen when they are trained to be teachers while they are at university.
Each university has a program called Teaching Training Internship (TTI). This is a program in which prospective teachers train to be real teachers while they are at university. TTI is held in the last semester of the curriculum. It trains the prospective teachers to teach and do anything real teachers do in the classroom, and it aims to build the prospective teachers’ characters so that they are ready to be teachers. Mardiyono (2006) argues that this program focuses on the prospective teachers’ abilities in teaching in the classroom and doing school administration. This means that prospective teachers learn not only how to manage the classroom and deliver the material, but also how to do school administration. This is in line with Kiggundu and Nayimuli (2009), who insist that teaching training is the activity that integrates the theory obtained in class with practice.

Some teacher training institutions implement this program; however, some do not. They have another program called an internship program. Both internship and TTI have the same characteristic in that they train prospective teachers to be real teachers. The aim of this internship program is to give students experience in teaching. This program lets the students in each batch teach in an assigned school. In the initial phase, they are trained to create lesson plans and develop class material. They then apply what they have learned in the teaching practice in the classroom. The program is mandatory, which means that each student teacher should take it in a year. There is an instructor who comes from the assigned school. The instructor is an English teacher at that school. The instructor should guide the student teacher in how to plan a lesson, create instructional material, manage the class, and do many other things that teachers do in the classroom. This program is divided into two parts. In the first semester, students should attend the debriefing.
Debriefing means that students are guided to create a lesson plan, develop instructional material, and complete classroom activities. At the end of the semester, students are expected to submit the lesson plan and class material. In the second semester, students do the teaching practice at the assigned school.

This program has been running for some years, but it has not been evaluated in an appropriate way. The evaluation process in the program merely records how many students attend and the strengths and weaknesses of the students. However, it has not been reported. Considering its importance, the program should be evaluated.

There are many approaches that can be applied to conduct a program evaluation. Fitzpatrick, Sanders, and Worthen (2011, p. 114) explain that the differences in evaluation approaches come from the background, experience, and worldview of the authors. This means that each approach is affected by its author, and that an author can choose the approach which is appropriate for the evaluation process. One of the evaluation models that can be used is Kirkpatrick’s evaluation model. This model aims to evaluate a training program. There are four levels in this evaluation model, namely reaction, learning, behavior, and result. Kirkpatrick and Kirkpatrick (2006, p. 21) mention that reaction assesses the level of satisfaction with the program; learning assesses what knowledge has been obtained and improved; behavior assesses the changes in the trainees’ behavior after the program; and result assesses the final result, focusing on the benefit for the institution.

Evaluation principles

An evaluation is a systematic process which gives information about program achievement. It means that evaluation gives information on whether the objective has been achieved or not.
Evaluation is a systematic process of gathering and interpreting data and information so that the results can be used as the basis for policy making, decision making, or creating another program. This information can be used to revise, stop, or continue the program (Abrory & Kartowagiran, 2014).

Evaluation is different from research in terms of objectives. While research is aimed at obtaining new theories, evaluation is not. People cannot get new theories from evaluation. What people obtain from evaluation is merely information about the success of a program. Besides, evaluation can give information on the impact, or effectiveness, of a program (Stufflebeam, Madaus, & Kellaghan, 2002). This indicates that evaluation uses the same methods as research, but the result is different. Evaluation does not create a new theory but information. The information is useful for policymaking.

In doing an evaluation, the evaluator should follow the applicable standards. This is in line with Yarbrough, Shulha, Hopson, and Caruthers (2011), who believe that there are four standards that should be followed, namely utility, accuracy, feasibility, and propriety. These standards are explained as follows. (1) Utility means that the information obtained from evaluation should be useful and practical. In other words, the information can be used as a basis for decision making and for the success of the program. (2) Accuracy means that the information gathered should fulfill the requirements of data gathering. In this case, the process of information gathering should follow sound research practice in terms of instrumentation, validity, reliability, measurement, and generality. (3) Feasibility means that an evaluation study should be proper in terms of both politics and cost-effectiveness. This means that, when doing an evaluation, everything should be considered.
Politics means that there is no vested interest while doing the evaluation. For example, policymaking requires evaluation and, thus, evaluation is developed. Besides, cost-effectiveness should be considered so that no cost is wasted. (4) Propriety means that evaluation should be done legally. This means that evaluation cannot be done in secret. The code of ethics of evaluation should be obeyed.

Evaluation is a process to measure a program, make a decision, and know the usefulness of a program. Evaluation is done when the decision maker or stakeholders are curious about the success of the program (Irambona & Kumaidi, 2015). Evaluation has an important role in the running of a program. Without evaluation, people do not know whether the program is successful or not, and so cannot decide on follow-ups.

Kirkpatrick’s evaluation model

Kirkpatrick’s evaluation model is employed to evaluate a training program. There are four stages in this evaluation model: reaction, learning, behavior, and result. These four stages can be described as follows (Kirkpatrick & Kirkpatrick, 2006, p. 21).

Reaction

In this stage, the researchers measure the level of participants’ satisfaction with the program. Training programs are considered successful if the trainees are happy with the program so that they are motivated to learn. The interest, attention, and motivation of participants in following the training are indicators of the success of the program. In this first stage, trainees are given a satisfaction questionnaire on matters relating to the training, such as materials, instructors, training environment, and consumption in the training.

Learning

Learning can be defined as a change of attitude, improvement of knowledge, and/or enhancement of the skills of the participants after the program.
There are three components to be measured in this evaluation: what knowledge has been learned, what attitude has changed, and what skills have been developed or improved. Measuring all three components requires a test.

Behavior

In this evaluation, what is assessed is the attitude change of the trainees after returning from the program. The focus at this level is whether or not the trainee applies what has been obtained from the program.

Result

Evaluation at this stage is the final stage. It is focused on the final results after the participants follow the program.

Internship

An internship is a program which is implemented in order to prepare prospective teachers to become teachers who have good skills. An internship is a professional preparation stage in which a student applies the knowledge gained in the field, under the supervision of several interested parties and within a certain period of time (Hamalik, 1990). Thus, an internship program is a program in which a student applies the knowledge that has been obtained. In education, internship can be interpreted as the application of the competences possessed by a teacher in school.

There are several objectives of holding an internship program in education, as expressed by Hamalik (1990). These include developing a more comprehensive view of education in the intern, equipping the intern with experience of the implementation and responsibility of education as a teacher, enabling the intern to gain knowledge from supervisors in school, and providing the intern with an overview of the professional code of ethics of a teacher.

In recent literature, internship is defined as experiential learning that integrates the theory and knowledge acquired in the classroom with practice (Kiser, 2016).
The purpose of holding an internship is to gain valuable experience in applying previously obtained knowledge and to make connections between that knowledge and the professional field, based on future career goals. Kiser (2016) mentions several important things in an internship, that is, the time spent during the internship, how the time is used, the quality of the internship, and the application of the previous learning.

Based on the importance of internship evaluation, the research objective is to find out the levels of satisfaction towards five components of the internship program: (1) facilities, (2) instructors, (3) scheduling, (4) content material, and (5) students’ improvement.

Method

The study was conducted at Muhammadiyah University of Yogyakarta (Universitas Muhammadiyah Yogyakarta, UMY). Of the four levels of Kirkpatrick’s model, only two were evaluated: reaction and learning. For the first level, the study was intended to find out the satisfaction level towards the program in terms of facilities, instructors, scheduling, and material. For the second, the study was intended to find out the improvement of the students’ teaching abilities.

The subjects of the study were students of the English Education Department, batch 2015, and some instructors. The sample consisted of 87 of 103 students. The number of respondents was calculated by using the Slovin formula. A questionnaire was used to gather data about the satisfaction level towards the internship program. It covered four aspects, namely facilities, instructors, material, and schedule. Meanwhile, data on the improvement of teaching abilities were obtained by using performance sheets. In addition, students and teachers in each school were interviewed to gather additional information.

The validity measures implemented in the study were content and construct validity.
Content validity confirms that the instrument measures what it is supposed to measure (Azwar, 2015, p. 111). The questionnaire and interview guidelines were judged by three experts, and the data were subjected to the Aiken formula. All instruments were valid because the Aiken value was higher than 0.7. This is in line with Azwar (2015, p. 149), who mentions that a coefficient can be said to be valid when its value is higher than 0.35. For the construct validity, factor analysis was used. There were four aspects in the questionnaire: facilities, instructors, schedule, and material. The results of the construct validity measures indicated that one item each in the facilities and schedule aspects should be dropped.

The questionnaire reliability was estimated using Cronbach’s alpha. There were 36 items. The reliability value was 0.844, so the questionnaire can be said to be reliable.

For the quantitative data of the students’ survey, the descriptive statistics proposed by Azwar (2017, p. 148), as presented in Table 1, were employed. After the quantitative data were analyzed, the results were interpreted qualitatively. The results of the quantitative analyses were then cross-checked with the students and teachers before a conclusion was made.

Table 1. Normal curve statistics for students’ satisfaction

Score (X)                            Category
X > M + 1.5 SD                       Very satisfactory
M + 0.5 SD < X ≤ M + 1.5 SD          Satisfactory
M − 0.5 SD < X ≤ M + 0.5 SD          Fairly satisfactory
M − 1.5 SD < X ≤ M − 0.5 SD          Less satisfactory
X ≤ M − 1.5 SD                       Not satisfactory

Notes:
M: ideal mean of the concerned component in this research, $M = \frac{1}{2}(\text{highest ideal score} + \text{lowest ideal score})$.
X: the total points scored by each respondent for the item/component being evaluated.
SD: ideal standard deviation of each component.
$SD = \frac{1}{6}(\text{highest ideal score} - \text{lowest ideal score})$

Students’ satisfaction toward facilities

In this section, the ideal minimum score for each student was 5 and the ideal maximum score was 25. Thus, the ideal mean was 15 and the ideal standard deviation was 3.33. The facilities were judged satisfactory if the mean score belonged to the first category (very satisfactory). The criteria are defined in Table 2.

Table 2. Evaluation criteria of facilities, instructor, schedule, and material

Score (X)                            Category
X > M + 1.5 SD                       Very satisfactory
M + 0.5 SD < X ≤ M + 1.5 SD          Satisfactory
M − 0.5 SD < X ≤ M + 0.5 SD          Fairly satisfactory
M − 1.5 SD < X ≤ M − 0.5 SD          Less satisfactory
X ≤ M − 1.5 SD                       Not satisfactory

Students’ satisfaction level toward instructor

There were 20 questions in the instructor aspect. Based on the criteria, the ideal minimum score was 20 and the ideal maximum score was 100. Thus, the ideal mean was 60 and the ideal standard deviation was 13.3. The instructor was considered to be satisfactory if the mean score belonged to the first category (very satisfactory). Then, the very satisfactory category was converted to a percentage.

Students’ satisfaction level toward schedule

The schedule aspect included two questions. The ideal minimum score of this aspect was 2 and the ideal maximum score was 10. Thus, the ideal mean of this aspect was 6 and the ideal standard deviation was 1.33. The schedule was judged to be satisfactory if the mean score belonged to the first category (very satisfactory).

Students’ satisfaction level toward material

There were seven questions in the material aspect. Based on the criteria, the ideal minimum score was 7 and the ideal maximum score was 35. Thus, the ideal mean was 21 and the ideal standard deviation was 4.67. The material was considered to be satisfactory if the mean score belonged to the first category (very satisfactory).
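The scoring rule described in this section can be illustrated with a minimal Python sketch. It is not the authors’ analysis script: the function names are hypothetical, and the five-point item scale (1 to 5) is an assumption inferred from the reported ideal minimum and maximum scores (for example, 5 and 25 for the five facility items). The ideal mean is taken as half the sum of the highest and lowest ideal scores and the ideal standard deviation as one sixth of their difference, which reproduces the values reported for each aspect (15 and 3.33, 60 and 13.3, 6 and 1.33, 21 and 4.67).

```python
def ideal_stats(n_items, min_point=1, max_point=5):
    """Return (ideal mean, ideal SD) for an aspect with n_items items.

    Assumes every item is scored from min_point to max_point
    (a 1-5 scale is inferred from the reported ideal scores).
    """
    lowest = n_items * min_point    # lowest ideal score
    highest = n_items * max_point   # highest ideal score
    ideal_mean = (highest + lowest) / 2
    ideal_sd = (highest - lowest) / 6
    return ideal_mean, ideal_sd


def categorize(score, ideal_mean, ideal_sd):
    """Map a respondent's total score X to one of the five categories."""
    if score > ideal_mean + 1.5 * ideal_sd:
        return "very satisfactory"
    if score > ideal_mean + 0.5 * ideal_sd:
        return "satisfactory"
    if score > ideal_mean - 0.5 * ideal_sd:
        return "fairly satisfactory"
    if score > ideal_mean - 1.5 * ideal_sd:
        return "less satisfactory"
    return "not satisfactory"


if __name__ == "__main__":
    # Number of items per aspect, as reported in the article.
    aspects = {"facilities": 5, "instructor": 20, "schedule": 2, "material": 7}
    for name, n_items in aspects.items():
        mean, sd = ideal_stats(n_items)
        print(name, mean, round(sd, 2))
```

Under these assumptions, a respondent’s total for the facilities aspect would fall in the ‘very satisfactory’ category only when it exceeds 20 (that is, 15 + 1.5 × 3.33). The share of respondents in each category can then be expressed as a percentage of the 87 respondents.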
Students’ teaching ability improvement

To measure the improvement of students’ teaching ability, the instructors were asked to fill in the performance sheet. There were ‘increase’ and ‘not increase’ categories. The instructors filled in the sheet by putting check marks. For the ‘increase’ category, there were five improvement categories, as shown in Table 3. Then, each category was converted to a percentage.

Table 3. Evaluation criteria of students’ teaching ability improvement

Score (X)                            Category
X > M + 1.5 SD                       Very high
M + 0.5 SD < X ≤ M + 1.5 SD          High
M − 0.5 SD < X ≤ M + 0.5 SD          Fairly high
M − 1.5 SD < X ≤ M − 0.5 SD          Less high
X ≤ M − 1.5 SD                       Not high

Findings and discussion

Students’ satisfaction is the most important aspect of any program. In this internship program, students’ satisfaction affects their motivation, and this can lead to the program’s success. Badu (2013) asserts that a program is effective when the training is fun and enjoyable so that students are highly motivated to learn.

Evaluation of reaction for the internship program was measured based on the students’ satisfaction toward the program. There were 34 statements in the questionnaire, grouped into four aspects, namely facilities, instructor, schedule, and material. Each aspect has a different number of statements: the facility aspect has five statements, the instructor aspect has 20, the schedule aspect has two, and the material aspect has seven.

The indicators that represent the level of satisfaction toward the program are comfort and suitability. Comfort means that the rooms were well equipped. This can be seen from the use of media, air conditioner, and air freshener. Suitability means the readiness of the room.
The two statements for this suitability factor concern the readiness of the room before it was used and the suitability of the room capacity for the number of students. The results showed that 77.01% of the students rated the facilities in the very satisfactory category, 21.84% in the satisfactory category, and 1.15% in the fairly satisfactory category. Each item in the facility aspect was then categorized as ‘very satisfactory’ or ‘satisfactory’. Four items (the use of air conditioner, media, room readiness, and room suitability for the number of students) were in the very satisfactory category, while the use of fresheners was only in the satisfactory category. Based on the interviews, students said that the room used for the coaching was well equipped. However, fresheners were used less. Vonny (2016) states that facilities can give satisfaction. This means that when students are asked about satisfaction, they will mention the facilities aspect as one of the indicators. The implication of this study is that the better the facilities, the higher the increase in satisfaction. From the study, it can be concluded that a program can be regarded as satisfying when the facilities are good. The internship program can be regarded as successful because more than 50% of the students stated that the room for coaching was equipped with good facilities.

The instructor plays one of the most important roles in a coaching program. Instructors should be selected carefully because they can have either good or bad effects on trainees. Instructors of internship programs need to be evaluated because they give important material before students do the teaching practice. There were 20 statements to measure the students’ level of satisfaction toward the instructor. These statements cover the instructor’s readiness before the coaching, the delivery strategy, the delivery of materials, the ability to communicate orally, the ability to communicate in writing, and the use of media.
A total of 82.76% of the students stated that the instructor aspect was in the very satisfactory category. They mentioned that the instructors’ abilities in delivering the material were good, so they could understand the material well. Besides, the instructors delivered the material in detail and in a fun way. Students enjoyed joining the coaching and could understand the material well. This study found that students felt satisfied with the use of media and teaching videos. This means that the instructor did not give them the teaching video only as an example. Some students reported that their instructor did not use the media often. This can lead to the conclusion that those instructors just talked in the class without doing anything else. Putri and Kartika (2016) similarly report that the highest level of satisfaction was attached to instructors who have good abilities in delivering material and who can be fun too. For example, the instructor used jokes while delivering the material.

The internship program was scheduled for eight sessions in the semester. Each group had a different schedule based on the agreement between students and instructors. This was revealed by the interviews with students and instructors. They said that the internship schedule was flexible, so each group had a different schedule. This means that one group may complete the internship program in only two months while others may not. This aspect originally included three items, but one item was deleted as a result of the factor analysis. The remaining items were the time to start the coaching and the time to end the coaching. These items can represent students’ satisfaction levels because the schedule is one of the crucial things. When the coaching does not follow the schedule, this can affect the students’ responses.
students’ level of satisfaction toward the schedule was only in the 'satisfactory' category, as stated by 50.57% of the students. students reported that instructors ran over time in each coaching session. students felt that this was useless because the coaching time did not give them any information. zahro and wu (2016) state that time allocation in a program should be evaluated so that the schedule can be improved for the next coaching. to anticipate instructors who do not keep to the schedule, there should be a team for monitoring the internship program. this is in accordance with rohani (2015), who mentions that a quality control team provides a good supervisory function. supervisors should, for example, check the coaching time each week. they cannot just come and go; they should be present throughout the coaching time. this is intended to decrease bias. that is, when students do their best to teach, no supervisor should fail to come and supervise, since this would disadvantage the students. this is in line with sahraini and madya (2015), who report that teachers who have good abilities in teaching will not be appreciated when the evaluation has no regular schedule.
material is one of the most important aspects of evaluation. the better the material, the better the impact it has on the trainees. there are seven statements, which are divided into two factors, namely material suitability with learning and material conformity to students’ needs. these items are material conformity with the lesson plan, the systematics of material delivery, the interrelationship within the material, the suitability of the material with the curriculum used in the partner school, the way of selecting teaching materials, the way of choosing learning strategies, and how to manage the class. in this study, the evaluation of the material was in the very satisfactory category, as high as 89.66%.
this was confirmed by the students’ interviews. they mentioned that the instructor gave them the lesson plan before coaching so that they knew what would be done in the coaching. besides, the instructor gave them suitable material, such as the curriculum, syllabus, and lesson plan used in each school. this led to the students’ understanding of what should be written in the lesson plan and what should be done in the teaching practice. in other words, the material was really useful for their needs in the teaching practice. the material in the internship program has been fitted to the students’ needs in both coaching and teaching practice. utomo and tehupeiory (2014) mention the same thing about the importance of aligning the material delivered to the students with the program objective. program coordinators should maintain this alignment. this means that the material which was suitable for the students’ needs should be kept, while material which was not used for the internship program could be considered for deletion.
evaluation in learning is conducted to assess what has been learned by students, what abilities have improved, and what has changed (kirkpatrick & kirkpatrick, 2006, p. 21). this evaluation only focused on the improvement of teaching ability. after coaching, students should do the teaching practice three times. they did the teaching practice in a class so that the instructor could grade them. the evaluation showed that the students’ teaching-practice abilities improved at the high-level category. this means that their teaching changed with each teaching practice. there were five schools, coded school 1, school 2, school 3, school 4, and school 5.
from the descriptive data, it can be interpreted that 79% of the students who did their teaching practice in school 1 showed improvement in their teaching abilities. students at school 2 showed a higher figure of 92%. at school 3, the improvement was 72%. students at school 4 improved their teaching ability by as much as 92%. students at school 5 showed the lowest percentage, 55%. this was supported by the qualitative data from the interviews with instructors. they stated that some students have learned well but the others have not. students who have not improved were those who did not change their way of teaching. on average, however, students’ teaching ability improved at the high-level category. more than 50% of the students in each school improved their teaching ability at the high-level category. the internship program can be said to be successful because there was an improvement in the students’ teaching abilities. this is in line with al yahya and norsiah (2013), who stated that ability improvement is an indication of success in a program.
conclusion and recommendations
conclusion
based on the findings of the reaction aspect, it can be concluded that three aspects were in the 'very satisfactory' category. these were facilities, instructor, and material. on the other hand, the schedule aspect did not reach the 'very satisfactory' category. this was mostly caused by the fact that the instructors ran over time in each coaching session. for the learning aspect, more than 50% of the students in each school were in the 'high level' category of improvement. it can be concluded that students understood the material well so that they could apply the material learned from the coaching in the teaching practice.
recommendations
some recommendations are proposed for program coordinators. the program coordinators should monitor the internship program from the beginning until the end.
this means that they should know what the strengths and weaknesses of the program are. coordinators can come to the coaching sessions in each school, or they can simply interview students about what has been missing in the program. besides, coordinators should evaluate the internship program periodically. evaluation can be done before the program, during the program, and at the end of the program. it is highly recommended that the coordinators have a team for such periodic evaluations. this can protect the program from various difficulties and weaknesses. coordinators, lecturers, and instructors can also create criteria for the success of the internship program; that is, there should be specific criteria to measure the success of the program. this will help them in giving a quality evaluation of the program.
references
abrory, m., & kartowagiran, b. (2014). evaluasi implementasi kurikulum 2013 pada pembelajaran matematika smp negeri kelas vii di kabupaten sleman. jurnal evaluasi pendidikan, 2(1), 50–59.
al yahya, m. s., & norsiah, b. m. (2013). evaluation of effectiveness of training and development: the kirkpatrick model. asian journal of business and management sciences (vol. 2).
ardiansyah, j. (2013). peningkatan kompetensi guru bidang pendidikan di kabupaten tana tidung. ejournal pemerintahan integratif, 1(1), 38–50.
azwar, s. (2015). reliabilitas dan validitas (4th ed.). yogyakarta: pustaka pelajar.
azwar, s. (2017). penyusunan skala psikologi. yogyakarta: pustaka pelajar.
badu, s. q. (2013). the implementation of kirkpatrick’s evaluation model in the learning of initial value and boundary condition problems. international journal of learning and development, 3(5), 74–88. https://doi.org/10.5296/ijld.v3i5.4386
fitzpatrick, j. l., sanders, j. r., & worthen, b. r. (2011). program evaluation: alternative approaches and practical guidelines (4th ed.). boston, ma: pearson education.
hallo, d., & munadi, s. (2014). evaluasi kinerja guru sma yang bersertifikat profesional di kabupaten halmahera barat. jurnal evaluasi pendidikan, 2(2), 111–122.
hamalik, o. (1990). sistem internship kependidikan teori dan praktek. bandung: mandar maju.
irambona, a., & kumaidi, k. (2015). the effectiveness of english teaching program in senior high school: a case study. reid (research and evaluation in education), 1(2), 114–128. https://doi.org/10.21831/reid.v1i2.6666
jailani, m. s. (2014). guru profesional dan tantangan dunia pendidikan. al-ta’lim journal, 21(1), 1–9. https://doi.org/10.15548/jt.v21i1.66
kiggundu, e., & nayimuli, s. (2009). teaching practice: a make or break phase for student teachers. south african journal of education, 29, 345–358.
kirkpatrick, d. l., & kirkpatrick, j. d. (2006). evaluating training programs (3rd ed.). san francisco, ca: berrett-koehler.
kiser, p. m. (2016). the human services internship: getting the most from your experience. boston, ma: cengage learning.
law of republic of indonesia no. 20 of 2003 on national education system (2003).
mardiyono, s. (2006). praktik pengalaman lapangan terpadu dalam peningkatan kualitas calon guru. jurnal cakrawala pendidikan, 25(1), 57–72. https://doi.org/10.21831/cp.v0i1.392
putri, y. e., & kartika, l. (2016). evaluasi efektivitas pelatihan marketing skills pada pt xyz. kolegial, 4(2), 11–22.
rohani, e. (2015). analisis kepuasan peserta pelatihan pertolongan pertama gawat darurat obstetri dan neonatal (ppgdon) di balai pengembangan tenaga kesehatan (bptk) mataram mengunakan metode servqual. media bina ilmiah, 9(2), 37–45.
sahraini, s., & madya, s. (2015). model evaluasi internal kompetensi guru bahasa inggris (model_eikgbi) sma. jurnal penelitian dan evaluasi pendidikan, 19(2), 156–167. https://doi.org/10.21831/pep.v19i2.5576
stufflebeam, d. l., madaus, g. f., & kellaghan, t. (2002). evaluation models: viewpoints on educational and human services evaluation. new york, ny: kluwer academic publishers.
utomo, a. p., & tehupeiory, k. p. (2014). evaluasi pelatihan dengan metode kirkpatrick analysis. jurnal telematika, 9(2), 37–41.
vonny, r. p. e. (2016). pengaruh pelatihan, fasilitas kerja dan kompensasi terhadap kepuasan kerja karyawan pada pt united tractors cabang manado. jurnal berkala ilmiah efisiensi, 16(3), 407–418.
yarbrough, d. b., shulha, l. m., hopson, r. k., & caruthers, f. a. (2011). the program evaluation standards: a guide for evaluators and evaluation users (3rd ed.). thousand oaks, ca: sage.
zahro, s., & wu, m. (2016). implementing of the employees training evaluation using kirkpatrick’s model in tourism industry a case study. international journal of innovation and applied studies, 17(3), 1042–1049.
this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 5(1), 2019, 21-29
available online at: http://journal.uny.ac.id/index.php/reid
multiple intelligence assessment in teaching english for young learners
ernawati (1); hana tsurayya (2); abdul rahman a. ghani (3, *)
(1, 3) department of educational research and evaluation, graduate school of universitas muhammadiyah prof. dr. hamka, jl. warung buncit raya no. 17, pancoran, south jakarta 12790, indonesia
(2) early childhood education center of taman anak soleh as salaam, jl. merapi raya, jatiwarna, pondok melati, bekasi, west java 17415, indonesia
(*) corresponding author. e-mail: abdulrahman.ghani@yahoo.co.id
submitted: 30 january 2019 | revised: 01 march 2019 | accepted: 11 march 2019
abstract
many schools in indonesia provide english as one of their subjects. english is taught from elementary school, and even in kindergarten. ironically, teaching english in most rural schools still relies on conventional methods such as memorizing and translating.
many teachers cannot afford to provide well-designed, meaningful exercises for students to use on a one-to-one learning basis. as a result, students seem to have no interest in learning english. for this reason, this study was conducted to identify students’ intelligences through multiple intelligence assessment in order to find an effective approach to teaching english for young learners. the participants are an english teacher and students at an early childhood education center. this research focuses on presenting a deep description of the multiple intelligence assessment used to identify students’ intelligences in order to find an effective way of teaching english for young learners. in collecting the data, three instruments were used: observation, interview, and document analysis. the findings of this study show that students have different interests and natures; some students love singing, others enjoy drawing, and others like role-playing. multiple intelligence assessment helps teachers identify students’ interests and guides them in building learning activities that attract students to learn english.
keywords: multiple intelligence, assessment, teaching english, young learners
permalink/doi: https://doi.org/10.21831/reid.v5i1.23376
introduction
in recent years, indonesian parents have tended to choose educational institutions which provide english as one of their subjects. english has become more popular because of its prestige as an international language. the purpose of national education is to develop students’ potential to be faithful and devoted to the almighty, noble, healthy, knowledgeable, skilled, creative, independent, and responsible citizens of a democratic country (law of republic of indonesia no. 20 of 2003). in order to reach these purposes, the government applies the 2013 curriculum, which cohesively integrates knowledge, behavior, and ability competences to create productive, creative, and innovative students who are able to compete in the globalization era.
during the last few years, the world of teaching has witnessed innovation in teaching english for young learners. in indonesia, as reported by musthafa (2010), the government decided to include english as local content. this has improved public awareness of learning english. english is taught from elementary school, and even in kindergarten. according to pinter (2006), language development starts well before children are able to say anything. cameron (2005) also states in her book that children learn a second language better than adults. these opinions have led many kindergartens in indonesia to offer english as one of their subjects, and indonesian parents encourage their children to learn english at an early age.
in teaching english to young learners, teachers use many different approaches. all of the approaches have the same goal, namely to make the learning process effective. therefore, teachers build activities to maximize students’ potential in learning a language. ironically, teaching english in most rural schools still relies on conventional methods such as memorizing and translating. many teachers cannot afford to provide well-designed, meaningful exercises for students to use on a one-to-one learning basis (musthafa, 2010). as a result, students seem to have less interest in learning english. maryanto (2005) states in his research that the standard competency of kindergarten teachers, developed with focus group discussion and delphi techniques, has 50 indicators. one of these competencies is developing fun and interesting learning. for those reasons, teachers need to find a way of teaching english which provides interesting and enjoyable activities suited to children’s interests and characteristics.
students, as the object of the learning process, have different natures. some students enjoy singing, others love drawing, while some of them like reading. according to gardner (1983), all humans exhibit a range of intelligences: linguistic, logical-mathematical, musical, spatial, bodily-kinesthetic, interpersonal, intrapersonal, and natural. these differences indicate that children have different ways of enjoying the learning process. pinter (2006) states that it is important for teachers to take into account that all children have stronger and weaker aspects of their multiple intelligences and preferred learning styles. multiple intelligences theory has been reported to be effective in teaching english to young learners. hassan and maluf (1999), who conducted a study on the application of multiple intelligences in a lebanese kindergarten, found that mi theory successfully improved students’ understanding in the learning process.
multiple intelligence assessment was conducted to identify the variety of students’ intelligences. after knowing the interests of each student, it will be easier for the teacher to find an effective approach to teaching them english. in detail, this study attempts to find out the activities and benefits of the implementation of multiple intelligence theory in teaching english for young learners. this study is expected to contribute to the development of multiple intelligence assessment in the learning process, especially in teaching english for young learners in indonesia. in addition, it can inspire teachers to create activities that help students improve their ability in learning a language and maximize their potential in the learning process.
teaching english for young learners
the term 'young learners' has been defined by pinter (2006) as children who start their primary schooling, either in kindergarten or elementary school.
wright (2001) states the specific age range of young learners, which is from five to 12 years. children construct knowledge through other people, and through interaction with adults. adults/teachers work actively with children in the zone of proximal development (zpd). the zpd is the difference between the child’s capacity to solve problems on his own and his capacity to solve them with assistance. the adult’s role is very important in a child’s learning process. like vygotsky, bruner focused on the importance of language in a child’s cognitive development. he shows how the adult uses 'scaffolding' to guide a child’s language learning through finely-tuned talk (cameron, 2005).
characteristics of young learners
young learners or children have some typical characteristics. piaget, as cited by cameron (2005), states that a child is an active learner. children have a huge curiosity about learning something new. they are often more enthusiastic and lively. however, they also lose interest more quickly and are less able to keep themselves motivated on tasks they find difficult (cameron, 2005). further, shin (2013), in her module entitled ‘teaching english to young learners’, has classified the characteristics of young learners in learning english.
assessment
assessment is an integral part of educational processes (ghani, 2008). according to fenton (1996), assessment is the collection of relevant information that may be relied on for making decisions.
in addition, davies (2000) states that ‘assessment for learning is ongoing, and requires deep involvement on the part of the learner in clarifying outcomes, monitoring on-going learning, collecting evidence and presenting evidence of learning to others.’ she further points out that assessment which directly supports learning has five key characteristics: (1) learners are involved so a shared language and understanding of learning is developed, (2) learners self-assess and receive specific, descriptive feedback about the learning during the learning, (3) learners collect, organize, and communicate evidence of their learning with others, (4) instruction is adjusted in response to ongoing assessment information, and (5) a safe learning environment invites risk-taking, encourages learning from mistakes, enables focused goal setting, and supports thoughtful learning. assessment can be designed to measure a wide range of abilities. for example, students’ higher-order thinking skills can be measured by developing several evaluation instruments, such as multiple choice, essay, performance evaluation, and rubric.
multiple intelligences
multiple intelligences theory was originally introduced by the harvard psychologist howard gardner. according to gardner (1999), multiple intelligence theory consists of seven intelligences: linguistic, logical-mathematic, musical, spatial, bodily-kinesthetic, interpersonal, and intrapersonal. in 1999, gardner, as cited by armstrong (2009), added an eighth, natural intelligence. this theory has become a tool that educators around the world seize with enthusiasm (hoerr, 2000). gardner, as cited by veenema, hetland, and chalfen (1997), defines intelligence as an ability to solve problems or create products that are valued in at least one culture, whereas helding (2009, p.
195) defines intelligence as a biopsychological potential. the specific explanation of multiple intelligences is as follows. the first one is linguistic or verbal intelligence, which involves the mastery of language. people with verbal intelligence tend to think in words and have highly developed auditory skills. they are frequently reading or writing. the second is logical-mathematic intelligence; it consists of the ability to detect patterns, reason deductively, and think logically. these are usually the children who do well in the traditional classroom because they are able to follow the logical sequencing behind the teaching and are, therefore, able to conform to the role of model student. the third one is spatial intelligence. this intelligence gives a person the ability to manipulate and create mental images in order to solve a problem. spatial thinkers 'perceive the visual world accurately, to perform transformations and modifications upon one’s initial perception' (gardner, 1999, p. 173). people with this kind of intelligence tend to learn most readily from visual presentations such as movies, pictures, videos, and demonstrations using models. the fourth is bodily-kinesthetic intelligence. it entails the ability to understand the world through the body. these people can use their bodies in very expressive, skilled ways for a distinct purpose. the fifth is musical intelligence. it makes use of sound to the greatest extent possible. those with musical intelligence have a firm understanding of pitch, rhythm, and timbre. the sixth one is interpersonal intelligence. it consists of the ability to understand, perceive, and discriminate between people’s moods, feelings, motives, and intelligences.
the seventh is intrapersonal intelligence, which deals more with the individual. it is the ability to know oneself and to understand one’s own inner workings. the last is natural intelligence. it involves the ability to understand nature’s symbols and to respect the delicate balance that lets us continue to live. people with this intelligence have a genuine appreciation of the aspects of nature and how they intertwine.
multiple intelligence activities
in teaching english to young learners, teachers are expected to give their best efforts to maximize students’ potential. lash (2004) believes that in order to assist the children in getting the most from their learning experiences, the teacher must first identify the areas of intelligence in which each child excels. by doing this, the teacher will be able to understand the children’s learning styles, and thus, know the best way to help the children integrate their experience into their body of knowledge. armstrong, as cited by lash (2004, pp. 14–15), explains that there are some activities that can help students to maximize their potential based on their dominant intelligence, as follows:
(1) linguistic: learners who fall into this category enjoy word games, drilling, creative writing, and reading for pleasure. they enjoy listening to stories being told aloud.
(2) logical-mathematic: they enjoy playing strategy games like chess and checkers. they are willing to spend lots of time working on logic puzzles, such as rubik’s cube. they enjoy putting things into categories and using reason to work through problems.
(3) spatial: art may be one of the activities in which spatially intelligent persons might like to spend lots of time. they enjoy jigsaw puzzles and other visual activities.
(4) bodily-kinesthetic: they are good at competitive sports. they need to touch things in order to learn more about them. these individuals are good at mimicking people’s gestures, mannerisms, or behaviors. they enjoy messy activities like working with clay or finger painting.
(5) musical: this one seems pretty obvious. musically gifted learners enjoy playing musical instruments, singing, or collecting cds. they are sensitive to environmental sounds and respond strongly to different kinds of music.
(6) interpersonal: they have lots of friends and enjoy socializing with others in large and small groups. they enjoy playing group games. they enjoy teaching others and are seen as natural leaders.
(7) intrapersonal: they have a realistic sense of their strengths and weaknesses. they react strongly when controversial topics are discussed. they have a sense of self-confidence.
(8) natural: persons with high naturalist intelligence enjoy being in natural environments. hiking and camping might be listed as their hobbies.
hoerr (2000), in his book entitled 'becoming a multiple intelligences school', presents a table on how to arrange and create activities based on children’s intelligences. this table is adapted from ‘succeeding with multiple intelligences’, by the new city school faculty, 2000 (hoerr, 2000).
method
this research focuses on presenting a deep description of the multiple intelligence assessment used by the teacher in teaching english for young learners. for this reason, the researchers used a descriptive-qualitative method. qualitative research is a holistic approach that involves discovery. it is also described as an unfolding model that occurs in a natural setting that enables the researchers to develop a level of detail (cresswell, 1994 quoted in williams, 2011). furthermore, keegan (2009) explains qualitative research as a research design that is primarily concerned with meaning rather than measuring.
there are several characteristics of qualitative research, namely: (1) the focus of the research is ‘quality’, (2) the aim is description, findings, and understanding, (3) the settings are natural, (4) the sample is small and purposive, and (5) the data collection consists of the researchers as the main instrument, interview, and observation (alwasilah, 2008, p. 92). the aim of the descriptive method is to examine the current event or phenomenon of the research (alwasilah, 2008). for this reason, the qualitative approach of the descriptive method is suitable for this research because it can be used to explain in detail the multiple intelligence activities used by the teacher in teaching english for young learners. in addition, this study employed the descriptive qualitative method because no treatment was given during the observation. this study only observes the phenomena that happened in the classroom; in detail, this study was set to investigate the implementation of multiple intelligence activities in teaching english for young learners.
the participants of this study are an english teacher and students at an early childhood education center of taman anak sholeh as-salaam, which is located in bekasi, west java, indonesia. the observation was held over two months, from february to april 2018. after collection, the data of the study were analyzed in several steps as proposed by huberman and miles, as cited by basrowi and suwandi (2008). first, data reduction: the researchers removed unnecessary information from the data obtained through observation. in this process, the data from observation and interview were transcribed. second, data analysis: the data from observation, interview, and document analysis were analyzed.
findings and discussion
findings
the findings are from the data gained through observation and interview. this section consists of two main points: (1) multiple intelligence activities in teaching english for young learners in the early childhood education center being the object of the study, and (2) the benefits and challenges of using multiple intelligence activities. based on the observation, the responses of students to several multiple intelligence activities are related to their interests. the data are displayed in table 1.
an interview was conducted to find out the teacher’s perspective on using multiple intelligence activities and the benefits of using the activities. in the interview, the teacher mentioned several benefits of using multiple intelligence activities for young learners. first, the various activities can stimulate students in learning, and they can be more active in the learning process. since the activities were quite different from those in the conventional approach, the teacher admitted that students seem to enjoy the learning process through songs, games, drilling, riddles, and so forth. second, the learning process can be more effective because students understand the material easily by doing several activities. third, it can motivate the students to learn english because the activities can cover their interests.
in spite of its benefits, some challenges are found in the use of multiple intelligence activities. the result of the interview shows that some students thought that the use of various activities was confusing. the students may be confused when they have to change from one activity to another. the other challenge of using multiple intelligence activities was limited time. the teacher said that sometimes the students need extra time to do activities such as games and riddles.
it means that using multiple intelligence activities needs a longer allocated time. from the findings, it can be inferred that the use of multiple intelligence activities in teaching english for young learners has several benefits and challenges according to the teacher’s perception. the advantages are that the activities can make students more active in the learning process, make the learning process more effective, and make students more motivated in learning english. meanwhile, the challenges are that a long time is needed for the teacher to prepare the activities, and some media are needed in order to make the activities run well.
table 1. multiple intelligence assessment applied in the early childhood education center of taman anak sholeh as-salaam
| teacher’s activities | students’ activities | type of multiple intelligence | students’ responses | assessment |
|---|---|---|---|---|
| singing the ‘kepala, pundak, lutut dan kaki’ (‘head, shoulders, knees, and feet’) song with body gestures | following the teacher doing the same activity | musical; bodily-kinesthetic | most students seem happy and enjoy the activity | ***** |
| showing animal pictures and mentioning the name of each animal | looking at the pictures and repeating after the teacher to mention the names of the animals | spatial; verbal-linguistic | most students seem happy and enjoy the activity | ***** |
| playing the riddle game and giving clues for students to guess the correct answer | listening to the clues and information given by the teacher | logic-mathematic; verbal-linguistic | some students look confused because they do not understand some words said by the teacher | *** |
| counting numbers in sequence and counting down the numbers | following the teacher doing the same activity | logic-mathematic | most students can follow the teacher well when they are asked to count from 1-10, but they look a bit confused when the teacher asks them to count down | ***** |
| showing a picture and playing the ‘what color is it?’ game | looking at the visualization of the new word | spatial; verbal-linguistic | most students seem excited to say the various colors in english | ***** |
| playing the ‘take the ball’ game, an activity referring to bodily-kinesthetic intelligence because the activities in it are related to physical movement | the students have to run quickly to get the right ball; the fastest group that takes the right ball is the winner | bodily-kinesthetic; spatial | most students seem excited to participate in this game | ***** |
| playing the ‘parts of body’ game, an activity referring to interpersonal intelligence | following the instruction by touching parts of the body of their desk mate | interpersonal | some students seem uncomfortable when they have to interact with and touch their desk mate | *** |
| showing the picture of a fish to the students and making a personal connection by asking them about the fish; the teacher asked students about their previous knowledge of the material object | giving opinions on the teacher’s questions | spatial; intrapersonal | some students need more stimulus to speak up | *** |
| asking students to look outside the class and asking them about the weather | looking outside the window, observing the weather, and giving some opinions | natural | most students seem happy and enjoy the activity | ***** |
note: *** = only some students participate actively in this activity; ***** = the activity successfully attracts most students to participate actively in the learning process
discussions
the teacher used drilling as the activity referring to linguistic intelligence. this activity was found in every meeting.
drilling is a strategy to improve pronunciation by imitating and repeating words, phrases, even whole utterances (thornbury, 2006). in the first meeting, the teacher showed animal pictures to students and mentioned the name of the picture. after that, the teacher asked the students to repeat the word. from the observation, it was discovered that the teacher taught a new word by saying it repeatedly in the drilling process. after saying the word out loud several times, the students finally remember the word. therefore, this activity can be an effective way to teach children about foreign language. students enjoy doing this activity, and saying the new words together with their friends make students get motivated to speak english in the classroom. however, students seem a bit confused when they have to repeat a long sentence. teacher should know how to divide the sentence into a shorter part, especially when drilling a lyric from a song. the teacher used riddle and counting numbers in a sequence as the activities referring to logical-mathematic intelligence. riddle was used by the teacher when teaching about an animal. the teacher prepared some animal pictures as the media. before the teacher showed the animal pictures to students, the teacher did a riddle by giving some clues to students about the animal. the students have to listen to the clues and information given by the teacher; after that, the students have to analyze the information and clues from the teacher. in this activity, they need to think logically before answering the riddle. if the students have known the answer, they can guess it by saying out loud and give the answer to the teacher. and the last, the teacher will show the picture and mention the name of the picture. the logical mind can be stimulated anytime information is put into some kind of rational framework (armstrong, 2009). in this activity, giving some clues and information about the animal is the stimulation of logical thinking. 
spatial intelligence has something to do with pictures, either the pictures in one’s mind or the pictures in the external world, such as photos, movies, drawings, graphic symbols, ideographic language, and so forth (armstrong, 2009). showing pictures and the ‘what color is it?’ game are the activities referring to spatial intelligence because these activities involve visualizing the objects and creating a mental image. showing pictures was found in the first, fourth, and fifth meetings, and the ‘what color is it?’ game was found in the sixth meeting. the teacher used some pictures as the media to teach the learning material. the pictures are related to the topic of each meeting. from the pictures, students get the visualization of the new word. wright (2001, p. 10) said that pictures can play a key role in motivating students, conceptualizing the language they want to use, giving them a reference, and helping the discipline of the activity. body answers and the ‘take the ball’ game are the activities referring to bodily-kinesthetic intelligence because these activities are related to physical movement. responding to the instruction with physical gestures was found in the third and fourth meetings, while the ‘take the ball’ game was found in the sixth meeting. playing the ‘parts of body’ game is an activity referring to interpersonal intelligence. in this activity, the students are supposed to work in pairs. the teacher gives the instruction first, and then the students follow the instruction by touching the named part of the body of their desk mate. the students have to listen to the teacher’s instruction carefully. personal connection is an activity referring to intrapersonal intelligence. this activity was found in the first meeting. in the beginning, the teacher showed the fish picture to the students and gave personal connection by asking them about the fish. the teacher asked students about their previous knowledge of the material object.
observing the weather is an activity referring to natural intelligence. this activity mostly happened at the beginning of the class before the teacher gave the lesson material to the students. in this activity, the teacher asked the students to look outside the class and asked them about the weather. the students seemed happy when the teacher asked them to look outside the window and observe the weather.

conclusion and suggestions

conclusion

this study was concerned with identifying students’ intelligence by multiple intelligence (mi) assessment and applying mi activities in the process of teaching english for young learners. according to the findings and discussions, there are several multiple intelligence activities employed by the teacher in teaching english for young learners. these activities are categorized into eight groups based on the eight multiple intelligences. it can also be concluded that the teacher has to know students’ characteristics, interests, and abilities to create interesting and suitable activities for students. the variety of activities may improve students’ attention and motivation in learning english. moreover, the result of the interview shows that there are several benefits in using mi activities in teaching english for young learners. however, in implementing mi activities in teaching english for young learners, the teacher found several challenges.

suggestions

after concluding the analysis, the researchers would like to propose some suggestions related to the research conducted. since this study involved only one teacher as a respondent, further study is suggested to involve more teachers as respondents.
furthermore, implementing multiple intelligence activities in teaching english for young learners can be a recommendation to be used by kindergarten english teachers. teachers should find many activities of teaching english which may cover students’ interests and intelligence. to improve teachers’ knowledge of multiple intelligence activities, the teachers can attend seminars, workshops, or training. in addition, teachers can get more information about multiple intelligence activities by reading books or searching on the internet.

references

alwasilah, a. c. (2008). pokoknya kualitatif: dasar–dasar merancang dan melakukan penelitian kualitatif. jakarta: pt. dunia pustaka jaya.
armstrong, t. (2009). multiple intelligences in the classroom. alexandria, va: ascd.
basrowi, & suwandi. (2008). memahami penelitian kualitatif. jakarta: rineka cipta.
cameron, l. (2005). teaching languages to young learners. cambridge: cambridge university press.
davies, p. (2000). computerized peer assessment. innovations in education and training international, 37(4), 346–355. https://doi.org/10.1080/135580000750052955
fenton, r. (1996). performance assessment system development. alaska educational research journal, 2(1), 13–22.
gardner, h. (1983). frames of mind: the theory of multiple intelligences. new york, ny: basic books.
gardner, h. e. (1999). intelligence reframed: multiple intelligences for the 21st century. new york, ny: basic books.
ghani, a. r. a. (2008). pengaruh tes formatif dan kemandirian belajar terhadap hasil belajar ekonomi siswa sma. jurnal penelitian dan evaluasi pendidikan, 12(2), 162–176. https://doi.org/10.21831/pep.v12i2.1425
hassan, k. el, & maluf, g. (1999). an application of multiple intelligences in a lebanese kindergarten. early childhood education journal, 27(1), 13–20. https://doi.org/10.1023/a:1026063322136
helding, l. (2009). howard gardner’s theory of multiple intelligences. journal of singing, 66(2), 193–199.
hoerr, t. r. (2000).
becoming a multiple intelligences school. alexandria, va: association for supervision and curriculum development.
keegan, s. m. (2009). qualitative research: good decision making through understanding people, cultures and markets. london: kogan page.
lash, m. d. (2004). multiple intelligences and the search for creative teaching. paths of learning, autumn(22), 13–15.
law of republic of indonesia no. 20 of 2003 on national education system (2003).
maryanto, a. t. t. (2005). pengembangan instrumen analisis kompetensi tutor pendidikan anak usia dini (paud). jurnal penelitian dan evaluasi pendidikan, 7(2), 241–252. https://doi.org/10.21831/pep.v7i2.2023
musthafa, b. (2010). teaching english to young learners in indonesia: essential requirements. educationist, 4(2), 120–125.
pinter, a. (2006). teaching young language learners. oxford: oxford university press.
shin, j. k. (2013). teaching english to young learners. baltimore county: english language center.
thornbury, s. (2006). how to teach speaking. new york, ny: longman.
veenema, s., hetland, l., & chalfen, k. (eds.). (1997). multiple intelligences: the research perspective, a brief overview of the theory. in the project zero classroom: approaches to thinking and understanding. cambridge, ma: harvard graduate school of education and project zero.
williams, c. (2011). research methods. journal of business & economics research (jber), 5(3), 65–72. https://doi.org/10.19030/jber.v5i3.2532
wright, a. (2001). pictures for language learning. cambridge: cambridge university press.

this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 5(2), 2019, 144-151
available online at: http://journal.uny.ac.id/index.php/reid

the respondent factors on the digital questionnaire responses

*1muhardis; 2burhanuddin tola; 3herwindo hariwibowo
1,2,3department of educational research and evaluation, universitas negeri jakarta
jl. r. mangun muka, rawamangun, pulo gadung, jakarta timur, dki jakarta 13220, indonesia
*corresponding author. e-mail: adi_perdana2000@yahoo.com
submitted: 2 september 2019 | revised: 19 november 2019 | accepted: 27 november 2019

abstract

progress in the field of technology often facilitates human work. one example is the development of questionnaire modes. questionnaires are now often built on digital platforms, which make it easy for evaluators to design and disseminate them and to conduct scoring. because they are computer-based, they can reach respondents no matter how far away the respondents are, as long as they are connected to the internet. however, this progress is accompanied by several obstacles. for example, respondents may make an error in responding, intending to choose the 'yes' option but pressing the 'no' button instead. this is very different from filling in paper-and-pencil questionnaires, in which respondents are sure where they put a checkmark on the answer choices. this problem is what the researchers found when distributing digital questionnaires to participants of the national question writing program based on the 'siap' (sistem inovatif aplikasi penilaian) application. on conditional questions (if you choose 'no', please stop), some respondents who had chosen 'no' still responded to the next questions. this makes the data obtained unreliable.
after conducting a more in-depth analysis, the researchers found that respondent factors, in the form of psychological factors, are the cause, such as the new experience of accessing the application, understanding of the application, stress, and personal health. notably, the respondents who have problems are those of productive age (30 to 39 years old), with more than five years of teaching experience, a postgraduate education, and female.

keywords: computer-based assessment, digital questionnaire, respondent factors, siap application

permalink/doi: https://doi.org/10.21831/reid.v5i2.26943

introduction

progress in the field of assessment is increasingly rapid. assessment is no longer carried out only in conventional ways, such as the paper-based test (pbt), but also in the form of the computer-based test (cbt). many experts argue that well-designed computer-based assessments are more comprehensive, more accurate, and able to describe the profile of each test participant (nezami & butcher, 2000). computer-based assessment is also the first step toward the test model of the future (yanxia, 2017), provided that other aspects of the application that can cause the computer to make an error in processing the response given are not ignored (mcarthur & choppin, 1984). errors are predictable and deserve more attention in the application of cbt, not only errors from the application side but also those from the respondent's side. web-based questionnaires, as another form of application of cbt, which have been used in australia (nulty, 2008), also need to pay attention to these errors. before a questionnaire is applied, it is better for prospective respondents to be given socialization and training on how to use the application (sevillano-garcía & vázquez-cano, 2015).
web-based questionnaires, also known as digital questionnaires, need to pay attention to aspects such as the questions asked and the visual design so that respondents are more careful and make no mistakes when answering. excessive use of media, such as quality-check reminders, takes a long time, which can make the respondents frustrated and leave the questionnaire prematurely (reja, manfreda, hlebec, & vehovar, 2003). it also happened to the web-based questionnaire used by the center for educational assessment or pusat penilaian pendidikan (puspendik) as an instrument for measuring the level of satisfaction with the implementation of the national question writing application based on siap. as a new program that has only been operating since 2017, siap needs to be evaluated to see its effectiveness. evaluation of the participants’ satisfaction level was done using a web questionnaire integrated with the siap application. on the questionnaire’s initial page, it was stated that the questionnaire was arranged to increase the capability of siap in providing information technology services to users, especially writers, reviewers, and material administrators. there are two types of responses provided in this questionnaire: yes/no answers and five-level likert-scale agreement ratings (siap application development questionnaire, n.d.). it is interesting to find, based on the responses, that there were respondents who chose the response 'no' but still answered follow-up questions that they should not have answered. an example can be seen in table 1.

table 1. communication results with the person in charge of the subject matter

choice of answer | number of respondents (n=295) | percentage (%)
yes | 200 | 67.80
no | 92 | 31.19
abstain | 3 | 1.02

table 1 is the result of question data regarding communication with the person in charge (pic) of the subject matter. of the 295 respondents, 92 people answered that they did not communicate with the pic of the subject matter. however, of the 92 people, 30 of them (32.61%) still answered the follow-up questions, namely regarding clarity of indicators, improvements, and rejections. it shows that web-based questionnaires require the ability of 'extensive reading' rather than 'speed reading'. in fact, more than 60% of respondents are postgraduate students. thus, the habit of delaying the task of reading during education (onwuegbuzie et al., 2004) or the habit of skimming (cheng & tsai, 2017) can be the cause. the respondents who graduated from the bahasa indonesia major did not show a much different pattern. in addition, most of them are male. indeed, the results of the study by parquette (1952) show that men find it difficult to understand a text on the first reading. data that should not have been filled in but still received a response, as mentioned earlier, are called outliers. these data cause the shrinking parameter to be zero so the model used becomes inappropriate, or there is a bias in the parameters (martel, 2015). further, these outliers can also influence subsequent observations, which has implications for the resulting parameter estimators (mcquarrie & tsai, 2003). it may cause fatal errors if the next statistical analysis cannot be performed optimally only because of the outliers. this study focuses on the context 'attached' to the outliers: explaining the context behind the outliers’ emergence. the context, in this case, refers to the things attached to the respondent, such as gender, age, education level, and length of teaching time.
in contrast to studies conducted by psychometricians, which focused only on the location of outliers in the data (ali, 1994), statistical behavior tests (prescott, 1978), and outlier patterns in contingency tables (rapallo, 2012), this study looks at outliers from a qualitative perspective, using statistical data on a nonparametric scale.

method

data on the outliers and respondents' identity (context) were obtained from secondary data (harris, 2001) from the web questionnaire (siap application development questionnaire, n.d.) in the siap application. the respondents are 295 national question writers from the national selection results in 2016 and 2017. the sample consisted of all respondents who answered 'no' but still answered follow-up questions for three questions. they were drawn from the 106 respondents who answered 'no' to the question about taking a long time to access the siap application, the 92 respondents who answered 'no' to the question about communication with the subject matter pic, and the four respondents who answered 'no' to the question about communication with the siap administrator. the questionnaire used as the data source is a digital questionnaire consisting of five parts: the first part contains the identity of the respondents (name, subject, post, city, province, age, gender, the experience of accessing the siap application, length of teaching, level of education, suitability for subjects taught); the second part contains questions about the time needed to access the siap application; the third part contains questions about communication with the pic of the subject matter; the fourth part contains questions about communication with the siap administrator; and the last part contains suggestions.
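the selection rule described above (respondents who chose 'no' on a conditional question but still answered its follow-up questions) can be sketched in code. the following is a minimal python sketch, not the authors' actual procedure; the record layout and the field names `gate` and `follow_ups` are assumptions made for illustration, and the sample records are invented.

```python
from typing import Optional

# hypothetical record layout (an assumption, not the siap data format):
# "gate" holds the yes/no answer, "follow_ups" holds the follow-up answers,
# and None means the item was left blank.


def is_violation(record: dict) -> bool:
    """True if the respondent chose 'no' but still answered a follow-up."""
    answered_any = any(a is not None for a in record["follow_ups"])
    return record["gate"] == "no" and answered_any


def violation_rate(records: list) -> Optional[float]:
    """Share (in %) of 'no' respondents who still answered follow-ups."""
    no_group = [r for r in records if r["gate"] == "no"]
    if not no_group:
        return None  # no 'no' answers, so the rate is undefined
    flagged = sum(is_violation(r) for r in no_group)
    return round(100 * flagged / len(no_group), 2)


# invented sample data, for illustration only
sample = [
    {"id": 1, "gate": "yes", "follow_ups": [4, 5, 3]},
    {"id": 2, "gate": "no", "follow_ups": [None, None, None]},
    {"id": 3, "gate": "no", "follow_ups": [2, None, 4]},
    {"id": 4, "gate": None, "follow_ups": [None, None, None]},
]

outlier_ids = [r["id"] for r in sample if is_violation(r)]
print(outlier_ids, violation_rate(sample))
```

the rate is computed against the respondents who answered 'no' rather than against all respondents, because only those respondents were instructed to stop.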
the stages of research are guided by evaluative methods (ross & cronbach, 1976; fíncher, 1981), namely: reading the responses intensively, especially responses to questions that have follow-up questions; recording on an identity data card the context (gender, latest education, and length of teaching time) of respondents who are outliers; coding based on context; grouping the data; giving meaning to the context of the outliers; drawing a conclusion; and compiling reports.

findings and discussion

the analysis was done on questionnaires distributed to national question writers about the satisfaction level with the siap application, communication with the subject matter pic, and communication with the administrator. the questions on the siap application include the need to access the siap application for a long time, interactive features, and overall satisfaction with the siap application. the questions on communication with the pic of the subject matter cover the explanation of the indicators, remedial comments, and explanations of rejection. the questions on communication with the administrator relate to the call center, whatsapp, and email.

table 2. response to the siap access time, communication results with the pic of the subject matter and the siap administrator

category | siap access time (n=295) | communication with the pic of the subject matter (n=295) | communication with siap admin (n=295)
number of respondents answered yes | 189 | 200 | 266
number of respondents answered no | 106 | 92 | 4
number of abstain | - | - | -
respondents answered follow-up questions | 86 | 30 | 3
a. gender | | |
male | 44 | 13 | 1
female | 42 | 17 | 2
b. level of education | | |
graduate | 46 | 17 | -
post graduate | 40 | 13 | 3
c. age | | |
25 to 29 years old | 6 | - | -
30 to 39 years old | 37 | 14 | 1
40 to 49 years old | 36 | 11 | 1
50 years old or more | 6 | 4 | 1
d. length of teaching time | | |
1 year | 19 | 6 | 1
2 years | 10 | 3 | -
3 years | 9 | 4 | -
4 years | 2 | 2 | -
5 years or more | 45 | 15 | 2

note: "-" marks a cell with no value in the source table; values in partially filled rows are assigned to columns from left to right.

based on the results of the analysis, information was obtained on the percentage of respondents who answered 'no' but still answered the follow-up questions for the three types of questions, respectively 81.13% (for the question on taking a long time to access the siap application), 32.61% (for the question on communication with the pic of the subject matter), and 75% (for the question on communication with the siap administrator). the details can be seen in table 2.

taking a long time to access the siap application

according to research conducted by several experts on computer-based applications, reported access figures for opening applications are 4.7 seconds (hastomo & yuhana, 2013), 0.93 seconds (shubhi, yuliana, & winarno, 2011), and a transfer rate of 6.959 kb/s (mulyana & sholekan, 2010). however, each application certainly has different access times. the software specifications, the time of access (morning, noon, or night) (mulyana & sholekan, 2010), and the availability of wifi (pangesti, 2017) are some of the factors that affect the speed of loading or data access. this also applies to the siap application accessed by the question writers who became the respondents of this study. of the 295 respondents who participated, 189 respondents answered 'yes' and 106 respondents answered 'no'. it means that most respondents need a long time to access siap. of the 106 respondents who answered 'no', 86 (81.13%) still responded to the next questions. these 86 respondents consist of 42 female and 44 male respondents.
it seems to support the results of the study by parquette (1952) that men tend to find it difficult to understand a text on the first reading. although the difference is only two respondents, it can be said that men need more time to understand a text than women. in fact, in the research conducted by qian, buchmann, and zhang (2018), women tend to have good educational adaptability. however, according to respondents number 31 and 82 (both women of the same age), who wrote about their experience of accessing the siap application, the problem they experienced when accessing it was the network. they do not have trouble as long as the signal is strong when they open the application and the siap application itself is not in maintenance status. indeed, there is a respondent (number 71, male) who said it takes two minutes at the beginning of the login, but it becomes smooth during the process of writing, saving, and submitting questions. for this reason, he did not consider that it took a long time to access the siap application. sometimes, the application is still relatively new to the respondents, or they are still unfamiliar with using applications, which makes them technologically awkward (based on the answers of respondents number 37, 59, and 154; all three are male and in the same age group, namely 30 to 39 years old). it is undeniable that most people need a long time to understand or learn something new. the same is the case with the results of research by onwuegbuzie et al. (2004) conducted on undergraduate graduates in america.

communication with the person in charge of the subject matter

the siap application provides an opportunity for question writers to communicate with the pic of the subject matter through the chat feature. however, some participants did not use the facility. of the 295 respondents, 200 people (68%) answered 'yes', 92 people (31%) answered 'no', and three people (1%) did not answer.
of the 92 respondents who answered 'no', 30 respondents still answered further questions. male respondents numbered 13 people (43%) and female respondents numbered 17 people (57%). it indicates that more females are not willing to communicate with the pic of the subject matter. this condition is contrary to consistency in continuing to answer questions; in this case, women can be said to be inconsistent, which can be caused by psychological factors (hornung, 1977), such as stress due to communicating with the pic of the subject matter, who is regarded as socially elevated. in addition to stress, personal or family health conditions can also become the trigger. for example, respondent number 264 (female, aged 40 to 49 years old) said that she could not focus because she was caring for her mother, who was hospitalized. the same is true for the answer of respondent number 164 (female, aged 50 years old or more), who said that she accessed the siap application when she was not ready because she was ill. it is possible that female respondents did communicate with the pic of the subject matter but received a rejection (in the form of a neglected chat, a demand to improve the questions submitted, or, at the extreme, the rejection of the question). it seems that in this case, the stimulus and response factors (hovland, 1948) determine whether communication can be established well or not. the fear of having the questions they wrote rejected seems to be confirmed by the respondents' answers in the question section on the experience of accessing the siap application.
respondent number 238, female, postgraduate education, with five years teaching experience (respondents included in the sample who answered follow-up questions) wrote that she was happy and worried about the questions she wrote and submitted. she was afraid that the problem would be rejected so she was very aware of writing questions and submitting them. errors in running an application can also cause unauthorized written questions. these errors include mistakes in placing stimulus, choice of answers, and key answers and reasons. subconsciously, those female respondents accidentally choose the option 'no'. it is likely the inability of females to suppress expressive language tendencies (maynard, 1988) which is more dominant than male language behavior. she means to answer 'yes', but accidentally clicks the choice of 'no'. communication with the siap administrator the results of the studies have provided information that technological advances in the field of communication cause various changes in society, such as changes in the speed of information exchange (zamroni, 2009). administrators as parties that facilitate the questions writers with various information, such as, the time of the assignment, indicators that must be done, the number of questions that must be written, must be able to take advantage of technological advances in communicating. to support this, in addition to providing communication facilities with the pic of the material directly, the siap application also provides communication facilities between the question writer and the siap administrator. of the 295 respondents who responded, 266 people (90.16%) used this facility. the number of male respondents was 112 people and female respondents amounted to 154 people, while the number of respondents who answered 'no' was four people and those who did not answer are 25 people. of the four respondents who answered 'no', only three people participated in answering further questions. 
the interesting thing is that, of the three questions that have follow-up questions, the question on communication with the admin received the most non-responses, namely 25 people. they were 13 female and five male respondents with undergraduate education, and four female and three male respondents with post-graduate education. the most likely explanation is that the respondents who did not provide answers regarding communication with the siap administrator had less memorable experiences. for example, respondent number 285 (female, 50 years old or more, post-graduate education), answering the question on criticism and suggestions, requested that the siap administrator be faster and more responsive in answering the questions submitted by the participants. they need immediate answers because the time given by the system for writing questions is not too long, especially for those who communicate via private whatsapp messages (respondent number 225, female, undergraduate). another reason identified was the respondents' unpreparedness when they got a call from the siap administrator telling them about question assignments. often the administrator suddenly contacts the question writer to assign several indicators that must be written within a short period of time (answers of respondents number 66, 198, 202, and 290). upset? most people will feel annoyed if they have just received information but are required to act on it immediately with a short deadline. as an administrator, it is appropriate to provide the best service to service recipients. everyone who serves as an administrator must have high commitment and motivation in carrying out their duties (moenir, 2010).
if the administrator can provide optimal service and meet customer demands, this can be regarded as the organization's greatest success on the service side. conversely, if the service provided is not satisfactory, the system and its implementation mechanism need to be improved.

conclusion

the web-based or digital questionnaire still has many disadvantages. the presence of outliers in the data, for example, cannot be avoided. there are always factors that cause respondents to give unexpected data, such as psychological factors when responding to questionnaires. this can be anticipated if the digital questionnaire is accompanied by an interview to obtain in-depth information about the aspects being evaluated. a web-based questionnaire can still be used, but it will be better targeted if the respondent is accompanied while entering data on the device used (such as a mobile phone or tablet computer). this gives respondents an opportunity to ask about questions that are difficult to understand. interviewers can also supervise carefully, especially by taking preventive action on conditional questions (referred to in this paper as follow-up questions). outliers do not have to be deleted or treated by statistical averaging; they can be used as a data source for in-depth interviews with the respondents who gave those responses. this can be a new field for researchers working in the constructivist paradigm, who can explain outliers from another point of view. even in psychometrics, outliers can be used as new data to examine the relationships between nonparametric-scale variables found in the respondents, for example, whether length of teaching experience, gender, and age are related to communication with the subject matter pic and the siap administrator.
in addition, in formulating questionnaires (both paper-based and web-based), it is necessary to consider how new the format is to respondents, such as responding in a new form (digital questionnaire) or to layered questions (questions that lead to follow-up questions), because not all respondents were in a position to focus when answering, even though they were highly educated, of productive age, and had longer experience.

references

ali, m. a. (1994). identifying the location shift outliers that matter. sankhyā: the indian journal of statistics, series a, 56(3), 500–511. retrieved from https://www.jstor.org/stable/25051015
cheng, y. h., & tsai, c. c. (2017). online research behaviors of engineering graduate students in taiwan. journal of educational technology & society, 20(1), 169–179. retrieved from https://scholar.lib.ntnu.edu.tw/en/publications/online-research-behaviors-of-engineering-graduate-students-in-tai
fincher, c. (1981). air between forums: the literature of program evaluation. research in higher education, 14(3), 277–280. retrieved from https://www.jstor.org/stable/40195416
harris, h. (2001). content analysis of secondary data: a study of courage in managerial decision making. journal of business ethics, 34, 191–208. https://doi.org/10.1023/a:1012534014727
hastomo, f., & yuhana, u. l. (2013). perancangan dan pembuatan perangkat lunak aplikasi android untuk pengolahan data transaksi pada perusahaan telekomunikasi “x” dengan menggunakan pentaho. jurnal teknik pomits, 2(1), 77–82. retrieved from http://ejurnal.its.ac.id/index.php/teknik/article/view/2733
hornung, c. a. (1977). social status, status inconsistency and psychological stress. american sociological review, 42(4), 623–638. https://doi.org/10.2307/2094560
hovland, c. i. (1948). social communication. proceedings of the american philosophical society, 92(5), 371–375. retrieved from https://www.jstor.org/stable/3143048
martel, a. r. (2015). the detection of outliers in nondestructive integrations with the generalized extreme studentized deviate test. publications of the astronomical society of the pacific, 127, 258–265. retrieved from https://iopscience.iop.org/article/10.1086/680382/meta
maynard, d. w. (1988). language, interaction, and social problems. social problems, 35(4), 311–334. https://doi.org/10.2307/800590
mcarthur, d. l., & choppin, b. h. (1984). computerized diagnostic testing. journal of educational measurement, 21(4), 391–397. https://doi.org/10.1111/j.1745-3984.1984.tb01042.x
mcquarrie, a. d., & tsai, c.-l. (2003). outlier detections in autoregressive models. journal of computational and graphical statistics, 12(2), 450–471. retrieved from https://www.jstor.org/stable/1391204
moenir, a. s. (2010). manajemen pelayanan umum di indonesia. jakarta: bumi aksara.
mulyana, a., & sholekan, s. (2010). aplikasi mini market online berbasis web. bandung: universitas telkom.
nezami, e., & butcher, j. n. (2000). objective personality assessment. in g. goldstein & m. hersen (eds.), handbook of psychological assessment (3rd ed., pp. 413–435). https://doi.org/10.1016/b978-008043645-6/50094-x
nulty, d. d. (2008). the adequacy of response rates to online and paper surveys: what can be done? assessment & evaluation in higher education, 33(3), 301–314. https://doi.org/10.1080/02602930701293231
onwuegbuzie, a. j., mayes, e., arthur, l., johnson, j., robinson, v., ashe, s., … collins, k. m. t. (2004). reading comprehension among african american graduate students. the journal of negro education, 73(4), 443–457. https://doi.org/10.2307/4129628
pangesti, b. n. a. (2017). analisa kecepatan transfer data pada perancangan hotspot sederhana dengan system single sign on di perkantoran. fountain of informatics journal, 2(1), 1–7. https://doi.org/10.21111/fij.v2i1.814
parquette, w. s. (1952). intensive reading. the english journal, 41(2), 78–82. https://doi.org/10.2307/809208
prescott, p. (1978). examination of the behaviour of tests for outliers when more than one outlier is present. journal of the royal statistical society, series c (applied statistics), 27(1), 10–25. retrieved from https://www.jstor.org/stable/2346221
qian, y., buchmann, c., & zhang, z. (2018). gender differences in educational adaptation of immigrant-origin youth in the united states. demographic research, 38, 1155–1188. https://doi.org/10.4054/demres.2018.38.39
rapallo, f. (2012). outliers and patterns of outliers in contingency tables with algebraic statistics. scandinavian journal of statistics, 39(4), 784–797. https://doi.org/10.1111/j.1467-9469.2012.00790.x
reja, u., manfreda, k. l., hlebec, v., & vehovar, v. (2003). open-ended vs. close-ended questions in web questionnaires. developments in applied statistics / metodološki zvezki, 19, 159–177.
ross, l., & cronbach, l. j. (1976). handbook of evaluation research. educational researcher, 5(10), 9–19. https://doi.org/10.3102/0013189x005010009
sevillano-garcía, m. l., & vázquez-cano, e. (2015). the impact of digital mobile devices in higher education. educational technology & society, 18(1), 106–118.
shubhi, m. m. f., yuliana, m., & winarno, i. (2011). sistem monitoring jaringan menggunakan brew (binary runtime environment for wireless). retrieved from http://repo.pens.ac.id/1077/1/paper_sidang_ta.pdf
siap application development questionnaire. (n.d.). no title. retrieved from https://docs.google.com/forms/d/e/1faipqlsemvx0qsqlom5uecjs55ave2n5khmcrjsqztdxhp4egct-n-a/viewform
yanxia, y. (2017).
test anxiety analysis of chinese college students in computer-based spoken english test. journal of educational technology & society, 20(2), 63–73. retrieved from https://www.jstor.org/stable/90002164
zamroni, m. (2009). perkembangan teknologi komunikasi dan dampaknya terhadap kehidupan. jurnal dakwah: media komunikasi dan dakwah, 10(2), 195–211. https://doi.org/10.14421/jd.2009.10205

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 5(2), 2019, 85-94
available online at: http://journal.uny.ac.id/index.php/reid

creative thinking ability and cognitive knowledge: big five personality

*1jonni sitorus; 2nirwana anas; 3ermaliana waruhu
1badan penelitian dan pengembangan provinsi sumatera utara, jl. sisingamangaraja no.198, siti rejo i, kota medan, sumatera utara 20216, indonesia
2faculty of tarbiyah science and teacher training, universitas islam negeri sumatera utara, jl. william iskandar ps. v, kenangan baru, kab. deli serdang, sumatera utara 20371, indonesia
3sekolah dasar negeri 014648 padang mahondang, padang mahondang, pulau rakyat, kabupaten asahan, sumatera utara 21273, indonesia
*corresponding author. e-mail: sitorus_jonni@yahoo.co.id
submitted: 04 january 2019 | revised: 27 may 2019 | accepted: 05 august 2019

abstract

this research aims at describing the ability and level of student’s creative thinking and student’s cognitive knowledge. it is qualitative research to search for data and information. operationally, this research was conducted in three steps: (1) giving a big five personality test to 215 students to determine their personality type, (2) giving a creative thinking test to the 215 students to measure the ability and level of their creative thinking, and (3) choosing one student randomly from each personality type to be interviewed to explore their cognitive knowledge.
the results show that every student has a creative thinking ability, but the level of creative thinking varies. the category of student’s creative thinking ability based on the big five personality is 'moderate' or 'high'. the level of student’s creative thinking based on the big five personality is 'very creative', 'creative', 'quite creative', or 'less creative'. the student’s cognitive knowledge based on the big five personality comprises drawing, designing, ascertaining, dividing, reasoning, analogy, imagining, utilizing, solving, understanding, determining, mentioning, and using trial and error.

keywords: creative thinking ability, cognitive knowledge, big five personality, creative thinking level, novel answer

permalink/doi: https://doi.org/10.21831/reid.v5i2.22848

introduction

creativity is not only a result but can also foster other cognitive functions, such as cognitive thinking (silvia, 2008). most cognitive theory models focus on the ability to think and solve problems creatively. the cognitive theory model provides a place for psychometric procedures to understand various cognitive abilities, including creative thinking ability (batey & furnham, 2006). creative thinking is the process of understanding difficulties, problems, information gaps, missing elements, and inconsistencies; formulating the problem clearly; making guesses or formulating hypotheses about the deficiencies; examining the hypotheses and possibly revising and re-examining them or redefining the problem; and ultimately communicating the results. creative thinking is an individual ability, based on the individual's uniqueness, to generate worthwhile and novel ideas. the formulation of creativity that emphasizes creative thinking ability is known as the major impetus for research on creativity (santrock, 2003). the ability to think creatively is closely related to intelligence or cognitive traits (setiawan, 2016).
cognitive traits include fluency, flexibility, originality, and elaboration, and there are also many affective traits (setiawan, 2017; wolfradt & pretz, 2001): being curious, having the courage to take risks, feeling challenged by plurality, and being imaginative. the primary aptitude traits related to creativity, typically called the characteristics of creative thinking ability (carson, peterson, & higgins, 2005), are: sensitivity to problems; fluency, which includes word, expressional, and ideational fluency; flexibility, which includes spontaneous and adaptive flexibility; originality; elaboration; and redefinition. creativity as an associative function is the ability to connect objects, experiences, knowledge, and prior information into something new (mumford, 2003). batey, furnham, and safiullina (2010) state that creativity has both positive and negative relationships with the dimensions of the big five personality. creativity is positively correlated with the extraversion and openness dimensions and negatively related to agreeableness, conscientiousness, and neuroticism. individuals high in openness are creative, curious, original, and imaginative, and have broad interests. individuals high in neuroticism tend to be anxious, nervous, emotional, and insecure, and to feel incompetent.

creative thinking ability

creative thinking is a process of constructing ideas to gain something new in insight, approach, perspective, or way of understanding a problem (grieshober, 2004; isaksen, dorval, & treffinger, 2000; martin, 2009; mcgregor, 2007). some indicators of creative thinking are fluency, flexibility, novelty, productivity, impact, success, efficiency, and coherence (briggs & davis, 2008; martin, 2009; santrock, 2007; sternberg, 2012).
creative thinking is a combination of logical and divergent thinking that is based on intuition but carried out consciously, with attention to fluency, flexibility, and novelty (pehkonen & törner, 2004; siswono, 2004). everyone has the potential to think creatively, but the level of creative thinking differs from person to person (alenikov, 2002; neethling, 2000). siswono (2004) classifies five creative thinking levels: level 4 (very creative), the student can solve the problem by finding more than one novel solution; level 3 (creative), the student can solve the problem by finding only one novel solution; level 2 (quite creative), the student can solve the problem by finding more than one flexible solution; level 1 (less creative), the student can solve the problem by finding only one flexible solution; and level 0 (not creative), the student is unable to solve the problem. fluency traits include producing many ideas, answers, solutions, or questions fluently, providing many ways or suggestions for doing things, and always thinking of more than one answer. flexibility traits include generating varied ideas, answers, or questions, being able to see a problem from different perspectives, searching for many different alternatives or directions, and being able to change the approach or way of thinking. originality traits include generating something new and unique, thinking of unconventional ways to express oneself, and being able to make unusual combinations of parts or elements. novelty does not mean an idea that is entirely new, but one that is new for the student (briggs & davis, 2008). the concept of novelty must be related to the student’s own state of knowledge and cannot be generalized to all conditions. choi (2004) states that novelty relates to new experience, where the level of novelty is a function of the mismatch between past and present experience.
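the five levels of siswono (2004) described above can be read as a simple decision rule. the sketch below is only an illustration in python, not the instrument used in the study: the function name creative_thinking_level and the arguments novel and flexible are hypothetical, and it assumes that novel answers take precedence over flexible ones, which the ordering of the levels suggests but the text does not state explicitly.

```python
LEVEL_NAMES = {
    4: "very creative",
    3: "creative",
    2: "quite creative",
    1: "less creative",
    0: "not creative",
}


def creative_thinking_level(novel: int, flexible: int) -> int:
    """Return level 0-4 from the number of novel and flexible answers.

    Assumption: novel answers are checked first, so a student with one
    novel answer is placed at level 3 regardless of flexible answers.
    """
    if novel < 0 or flexible < 0:
        raise ValueError("answer counts cannot be negative")
    if novel > 1:
        return 4   # more than one novel solution
    if novel == 1:
        return 3   # exactly one novel solution
    if flexible > 1:
        return 2   # more than one flexible solution
    if flexible == 1:
        return 1   # exactly one flexible solution
    return 0       # no novel or flexible solution


# hypothetical student with two novel and two flexible answers
level = creative_thinking_level(novel=2, flexible=2)
print(level, LEVEL_NAMES[level])
```

checking the novel count before the flexible count keeps the rule hierarchical, so each student falls into exactly one level.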
in problem-solving, novelty is the student’s ability to solve a problem by giving several different and correct answers, or one unusual answer, adjusted to the student’s knowledge level. a different answer is one that looks different and does not follow a certain pattern.

big five personality

personality is a dynamic organization of psychophysical systems forming an individual's unique characteristics (feeling, thought, behavior, physique, intelligence, or mood), which are settled in a person and used to adjust to the environment (feist & feist, 2006). one approach to measuring personality type in psychology is the big five personality, which has five dimensions: extraversion, agreeableness, conscientiousness, neuroticism, and openness (friedman & schustack, 2008). raymond b. cattell was one of the first theorists to measure personality; his work was later developed into the basic form of personality structure now better known as the big five personality. the characteristics of the big five personality are: (1) extraversion: a high-score individual tends to be affectionate, cheerful, talkative, gregarious, and loving. conversely, a low-score individual tends to be self-contained, quiet, passive, and lacking the ability to express feelings; (2) agreeableness: a high-score individual tends to be trusting, generous, receptive, and kind-hearted. a low-score individual tends to be suspicious, stingy, unfriendly, irritable, more aggressive, critical, and less cooperative; (3) conscientiousness: a high-score individual tends to be hardworking, meticulous, timely, and diligent.
a low-score individual tends to be disorganized, lax, lazy, aimless, and quick to give up when facing difficulty; (4) neuroticism: a high-score individual tends to be anxious, temperamental, self-pitying, self-conscious, emotional, and prone to stress disorders. a low-score individual tends to be happier and content, calm, ordinary, self-satisfied, and unemotional; (5) openness: this refers to how an individual adjusts to new situations and ideas. a high-score individual tends to tolerate and absorb information easily, to be focused, and to be alert to feeling, thought, and impulsivity. a low-score individual tends to be narrow-minded and conservative, and does not like change. batey et al. (2010) state that there are positive and negative linkages between creativity and the dimensions of the big five personality. creativity is positively associated with the extraversion and openness dimensions and negatively related to agreeableness, conscientiousness, and neuroticism.

cognitive knowledge

anderson et al. (2001) state that the cognitive taxonomy, as a revision of bloom’s taxonomy, comprises remembering, that is, recognizing and recalling; understanding, that is, interpreting, exemplifying, classifying, summarizing, comparing, and explaining; applying, that is, executing and implementing; analyzing, that is, differentiating, organizing, and attributing; evaluating, that is, checking and critiquing; and creating, that is, generating, planning, and producing (krathwohl, 2002; smith, 2008).
moreover, de lange (2003) asserts that student’s cognitive knowledge in the process of mathematics learning is to produce great ideas for solving mathematical problems; create their own mathematical models to solve problems creatively; come up with various solutions to a problem; express ideas; connect mathematics concepts with everyday life; and use mathematics and a mathematical mindset in everyday life and in various sciences through practice and mathematical activities carried out in a logical, rational, critical, creative, accurate, honest, effective, and efficient manner. according to galbraith and stillman (ee & widjaja, 2013; stillman, 2015), students' cognitive knowledge when given problems is to understand and structure the problems; simplify and interpret the context; assume, formulate, and perform the mathematization process; verify the results by comparing, critiquing, validating, communicating (rahayu, 2015), justifying, and reporting in writing; and revise incorrect answers based on the verification results.

method

this study is qualitative research. the researchers used basic statistics (mean and percentage) to obtain the data on student’s creative thinking ability and then interviewed some students to obtain the data on student’s cognitive knowledge. the research was conducted in march 2017 in seven primary schools in north sumatra province, indonesia.

population and sample

the research population is 611 sixth-grade students from seven primary schools in north sumatra province. the research sample is 215 students chosen randomly, which meets the minimum of 10% of the population (cohen, manion, & morrison, 2007). the sample consists of 98 female students and 117 male students. they are around 12-13 years old.
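the sample figures above can be checked against the 10% minimum with a short calculation. this is a minimal sketch in python using only the numbers quoted in the text; the variable names are hypothetical.

```python
population = 611          # sixth-grade students in the seven schools (from the text)
female, male = 98, 117    # sample composition (from the text)
minimum_share = 0.10      # minimum sample share cited from cohen et al. (2007)

sample = female + male
sample_share = sample / population

print(f"sample = {sample}, share of population = {sample_share:.1%}")
print("meets the 10% minimum:", sample_share >= minimum_share)
```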
research instrument and data collection technique

data were collected in two ways: tests and in-depth interviews. the research instruments are a creative thinking test, a big five personality test, and interview guidelines. the researchers used standard big five personality and creative thinking tests, so they did not need to be validated. the creative thinking test consists of open-ended problem-solving items on two-dimensional figures for class vi of primary school, used to measure the ability and level of student’s creative thinking. the big five personality test was used to determine student’s personality type. the interview guidelines were used to explore student’s cognitive knowledge.

data analysis technique

qualitative and quantitative data were analyzed qualitatively in several phases: coding the data and information obtained from the interviews and tests; determining the similarities in the data and information across different contexts; elaborating on the differences in the data and information; classifying and categorizing the data and information; and looking for relationships between the categories.

research procedure

first, the researchers gave a big five personality test (mayer, 2003, 2005) to 215 students to determine their personality type. the results are presented in table 1. second, the researchers gave a creative thinking test to the 215 students to measure the ability and level of their creative thinking. creative thinking ability is measured from the fluency of the student’s answers. the researchers scored creative thinking ability without differentiating between the creative thinking indicators.
one correct answer scores 1, two correct answers score 2, and so on. the researchers then converted the scores to categorize the student’s creative thinking ability on 'scale 5', namely: very low (0-54), low (>54-64), moderate (>64-79), high (>79-89), and very high (>89-100). the creative thinking level is measured from the flexibility and novelty of the student’s answers. according to siswono (2004), the creative thinking level is categorized into five, namely: level 4 (very creative), the student is able to solve the problem by giving more than one novel answer; level 3 (creative), the student is able to solve the problem by giving only one novel answer; level 2 (quite creative), the student is able to solve the problem by giving more than one flexible answer; level 1 (less creative), the student is able to solve the problem by giving only one flexible answer; and level 0 (not creative), the student is unable to solve the problem. third, the researchers chose one student randomly from each personality type, as shown in table 1, as a key informant to be interviewed about their cognitive knowledge. the six students chosen as research informants must have a creative thinking level of 'very creative' or 'creative'. the researchers interviewed them using exploratory and confirmatory approaches.

table 1. student’s personality type based on big five personality
no. | trends of student’s personality type | number of students (person) | percentage (%)
1 | extraversion | 28 | 13.02
2 | agreeableness | 21 | 9.77
3 | extraversion + agreeableness + openness | 53 | 24.65
4 | extraversion + conscientiousness + openness | 48 | 22.33
5 | extraversion + neuroticism + openness | 41 | 19.07
6 | agreeableness + openness | 24 | 11.16
 | total | 215 | 100
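the 'scale 5' cut-offs given in the research procedure can be written as a small lookup. the sketch below is an illustration in python only; the function name ability_category is hypothetical, and it assumes the input is the already converted 0-100 value, since the text does not describe how raw scores are converted.

```python
def ability_category(score: float) -> str:
    """Map a converted 0-100 score to the 'scale 5' category.

    Cut-offs follow the text: very low (0-54), low (>54-64),
    moderate (>64-79), high (>79-89), very high (>89-100).
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score <= 54:
        return "very low"
    if score <= 64:
        return "low"
    if score <= 79:
        return "moderate"
    if score <= 89:
        return "high"
    return "very high"


# two of the mean values reported in table 2, used here as sample inputs
for mean_value in (65.82, 83.73):
    print(mean_value, ability_category(mean_value))
```

because each band is open at its lower bound and closed at its upper bound, a score of exactly 79 is still 'moderate' and a score of exactly 89 is still 'high'.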
findings and discussion

findings

student’s creative thinking ability based on the big five personality is shown in table 2. the creative thinking ability of students with the personality types extraversion, agreeableness, or extraversion + neuroticism + openness is below the overall mean value (75.50), while that of students with the personality types extraversion + agreeableness + openness, extraversion + conscientiousness + openness, or agreeableness + openness is above the overall mean value (75.50). the mean difference in student’s creative thinking ability between personality types is shown in table 3. student’s creative thinking level based on the big five personality can be seen in table 4.

table 2. student’s creative thinking ability based on big five personality
no. | trends of student’s personality type | mean value | category of creative thinking ability
1 | extraversion | 65.82 | moderate
2 | agreeableness | 70.04 | moderate
3 | extraversion + agreeableness + openness | 83.73 | high
4 | extraversion + conscientiousness + openness | 78.66 | moderate
5 | agreeableness + openness | 80.51 | high
6 | extraversion + neuroticism + openness | 74.22 | moderate
 | overall mean value | 75.50 | moderate

table 3. mean difference in student’s creative thinking ability for each personality type trend
no. | trends of student’s personality type | compared with | mean difference
1 | extraversion | agreeableness | 4.22
1 | extraversion | extraversion + agreeableness + openness | 17.91
1 | extraversion | extraversion + conscientiousness + openness | 12.84
1 | extraversion | agreeableness + openness | 14.69
1 | extraversion | extraversion + neuroticism + openness | 8.4
2 | agreeableness | extraversion + agreeableness + openness | 13.69
2 | agreeableness | extraversion + conscientiousness + openness | 8.62
2 | agreeableness | agreeableness + openness | 10.47
2 | agreeableness | extraversion + neuroticism + openness | 4.18
3 | extraversion + agreeableness + openness | extraversion + conscientiousness + openness | 5.07
3 | extraversion + agreeableness + openness | agreeableness + openness | 3.22
3 | extraversion + agreeableness + openness | extraversion + neuroticism + openness | 9.51
4 | extraversion + conscientiousness + openness | agreeableness + openness | 1.85
4 | extraversion + conscientiousness + openness | extraversion + neuroticism + openness | 4.44
5 | agreeableness + openness | extraversion + neuroticism + openness | 6.29

table 4. the number of students based on creative thinking level (person)
no. | trends of student’s personality type | level 4 (> 1 novel answer) | level 3 (1 novel answer) | level 2 (> 1 flexible answer) | level 1 (1 flexible answer) | level 0 (none) | total
1 | extraversion | 6 | 18 | 3 | 1 | - | 28
2 | agreeableness | 3 | 12 | 4 | 2 | - | 21
3 | extraversion + agreeableness + openness | 20 | 15 | 18 | - | - | 53
4 | extraversion + conscientiousness + openness | 18 | 19 | 3 | 8 | - | 48
5 | extraversion + neuroticism + openness | 22 | 17 | 2 | - | - | 41
6 | agreeableness + openness | 14 | 8 | 1 | 1 | - | 24
 | total | 83 | 89 | 31 | 12 | - | 215
note: - = no student

referring to table 4, students with the personality types extraversion, agreeableness, extraversion + conscientiousness + openness, or agreeableness + openness are very creative, creative, quite creative, or less creative. students with the personality types extraversion + agreeableness + openness or extraversion + neuroticism + openness are very creative, creative, or quite creative. overall, there are 83 very creative students (38.60%), 89 creative students (41.40%), 31 quite creative students (14.42%), and 12 less creative students (5.58%). no student is uncreative. the flexible and novel answers on the students' answer sheets, as their creative products, can be counted and classified as follows. one of the problems solved by students is to divide a rectangle into two equal-area parts of unique and varied forms.
for instance, students divided a rectangle into two equal-area rectangles; two equal-area triangles; two equal-area trapezoids; and two equal-area two-dimensional figures with zigzag and circular shapes. this means that the student gave five correct alternative answers: rectangle, triangle, trapezoid, zigzag two-dimensional shape, and circular two-dimensional shape. the number of the student’s flexible answers is two, namely the triangle and the trapezoid, because their shapes are different from the original one but not unique. the number of the student’s novel answers is two, namely the zigzag and the circular two-dimensional shapes, because they are unique. the rectangle divided into two equal-area rectangles is neither a flexible nor a novel answer because the shape is the same as the original one. to obtain data and information on student’s cognitive knowledge through in-depth interviews, the researchers chose one student from each personality type, as shown in table 1. the student with personality type 'extraversion' is called student s1; 'agreeableness' is called student s2; 'extraversion + agreeableness + openness' is called student s3; 'extraversion + conscientiousness + openness' is called student s4; 'extraversion + neuroticism + openness' is called student s5; and 'agreeableness + openness' is called student s6. based on the interview with student s1, she could draw and design various two-dimensional figures of unique shapes by cutting, folding, and measuring the rectangle into two parts of equal size and area. when the researchers asked her how she ascertained that the two parts were equal, she said that if a two-dimensional figure is divided into two parts of equal size, they must have equal area; this means that she used mathematical reasoning.
when the researchers asked her how she was able to draw and design the two-dimensional polygon figures, she said that she used her imagination; this means that she imagined relevant things to find creative ideas. based on the interview with student s2, he divided a rectangle into two unique two-dimensional figures of equal area by utilizing his intuition. he had no relevant previous learning experience. he also did not cheat or ask his friends. this means that he solved the problem on his own. student s3 divided a rectangle into two trapezoids of equal area as one of the answers on his answer sheet. he said that a trapezoid has a pair of opposite sides that are parallel. this means that he really understands the concept of a trapezoid. students s4 and s5 divided a rectangle into two triangles of equal area as one of their alternative answers. they determined the area of the two-dimensional figures by using a formula. they understand the concept of the triangle, mentioning that one of the angles of a right triangle is 90°. they also divided a rectangle into two unique two-dimensional figures of equal area as another answer by using trial and error. they also utilized their intuition to find creative ideas. student s6 divided a rectangle into two two-dimensional figures of equal area as one of her alternative answers by utilizing her prior knowledge and previous learning experience. according to her, her teacher had once taught and given a similar problem, meaning that she used an analogy to solve the problem. the descriptions of student’s cognitive knowledge are summarized in table 5.

table 5. student’s cognitive knowledge
no. | student’s cognitive knowledge
1 | draw and design various two-dimensional figures in unique shapes
2 | ascertain that the two parts of a unique two-dimensional figure have equal area
3 | divide two-dimensional figures into two parts of equal size
4 | reasoning
5 | imagine relevant things to create creativity and find creative ideas
6 | utilize intuition, prior knowledge, and previous learning experience
7 | solve the problem on his/her own
8 | understand the concept of two-dimensional figures
9 | determine the area of two-dimensional figures by using a formula
10 | mention the characteristics of a two-dimensional figure
11 | use trial and error to determine unique two-dimensional figures of equal area
12 | analogy

discussion

one of the findings of this research is that the students have the creativity and the creative thinking ability to find novel answers. this is in line with the opinion of munandar (1999) that creativity is the ability to create new combinations based on existing data, information, or elements, and to find as many answers as possible to one problem, where the emphasis is on the quantity, usability, and diversity of answers. creativity is the ability that reflects originality in answers. the ability to draw unique and novel two-dimensional figures constructed from the student’s creative ideas, as found in this research, is also in line with the opinion of isaksen et al. (grieshober, 2004) that creative thinking is an idea-building process that emphasizes the indicators of fluency, flexibility, novelty, and elaboration. creative thinking tends toward the acquisition of insights, approaches, perspectives, or new ways of understanding a mathematics problem (mcgregor, 2007). according to martin (2009), an individual with inflexible thinking does not easily change his/her ideas or views even when he/she is aware of a contradiction with a new idea.
according to sharp (briggs & davis, 2008), novelty does not mean that an idea is really new, but that it is new for the students. this is also found in this research: one student's answer was a pentagon, which is not really an original two-dimensional figure arising from the student's new idea, but he was the only student in the class who drew such a figure. it means that the student's answer is categorized as a new one compared to the other students' answers. when a student finds a solution to a problem for the first time, he has found something new, at least for himself.

every student has a creative thinking ability, but the level of creative thinking varies. this can be seen in the extraordinary creations of certain people in technology and knowledge. on the other hand, some people cannot be creative; they have no knowledge or skills at all, or they only use others' creativity. this indicates that the level or degree of creativity differs from one person to another. the level of someone's creative thinking can be viewed as a continuum from the lowest to the highest. if an individual is taken randomly, we can place him/her on this continuum. however, because the number of distinct individuals is considerable, the approach used to determine the degree of creative thinking is a discrete and hierarchical classification.

as the research findings show, students' personality types influence the ability and level of their creative thinking. this is in line with the opinion of ivcevic and mayer (2006), who state that personality types can differentiate someone's creative thinking ability. an individual's creativity may differ based on his/her personality. personality can be defined as a system of psychological attributes that describes how someone feels, thinks, interacts with the social world, and regulates behavior (funder, 2001; mayer, 2005).

in the last few decades, the big five has become the dominant model for describing broad personality traits. the openness trait of the big five is theoretically and empirically defined as a general disposition toward creativity. creativity is also related to narrower traits in the areas of emotion and motivation, cognition, social expression, and self-regulation. emotion and motivation in the creative thinking process offer an opportunity to be creative and can be a source of creative ideas. for example, intrinsically motivated people engage in activities because they enjoy creating or enjoy the opportunity for expression. another trait related to creativity is hypomania, which can enhance creativity, creative potential (e.g., self-perceived creativity), and creative behavior (e.g., involvement in creative activities). this mood increases the awareness, fluency, and flexibility of thinking.

cognitive knowledge enables someone to generate creativity. one element of the students' cognitive knowledge found in this research is reasoning ability. the students use their reasoning ability, which can be improved by creative and innovative learning approaches that require them to be more active and skilled in the learning process. the learning approach has a significant influence on the improvement of students' mathematical reasoning ability, both overall and within subgroups of students. the students' ability to determine the area of two-dimensional figures in the research findings is in line with the opinion of van de walle, karp, and bay-williams (2008). according to them, a common mistake made by students is using an incorrect formula because they misconceive the height and base of a two-dimensional figure.
according to bahr and bossé (2008), students must learn mathematics with understanding, actively building new knowledge from previous experience and knowledge. learning with understanding is important for enabling students to solve the new problems that they will inevitably face in the future. the students' mathematical intuition found in this research is in line with opinions which state that creative thinking in mathematics is a combination of logical and divergent thinking based on intuition, with the indicators of fluency, flexibility, and novelty. one of the characteristics of a creative person is intuition. an individual needs two mathematical thinking skills, namely intuition and analytic thinking.

conclusion

based on the findings and discussion, the researchers conclude the following points. the category of the students' creative thinking ability based on the big five personality is 'moderate' or 'high'. the level of the students' creative thinking based on the big five personality is 'very creative', 'creative', 'quite creative', or 'less creative'. the students' cognitive knowledge based on the big five personality comprises drawing, designing, ascertaining, dividing, reasoning, analogy, imagining, utilizing, solving, understanding, determining, mentioning, and using trial and error.

references

alenikov, a. (ed.). (2002). the future of creativity. bensenville, il: scholastic testing press.
anderson, l. w., krathwohl, d. r., airasian, p. w., cruikshank, k. a., mayer, r. e., pintrich, p. r., … wittrock, m. c. (2001). a taxonomy for learning, teaching, and assessing: a revision of bloom’s taxonomy of educational objectives. new york, ny: longman.
bahr, d. l., & bossé, m. j. (2008). the state of balance between procedural knowledge and conceptual understanding in mathematics teacher education. international journal of mathematics teaching and learning, 1–28. retrieved from http://hdl.lib.byu.edu/1877/2880
batey, m., & furnham, a. (2006). creativity, intelligence, and personality: a critical review of the scattered literature. genetic, social, and general psychology monographs, 132(4), 355–429. https://doi.org/10.3200/mono.132.4.355-430
batey, m., furnham, a., & safiullina, x. (2010). intelligence, general knowledge and personality as predictors of creativity. learning and individual differences, 20(5), 532–535. https://doi.org/10.1016/j.lindif.2010.04.008
briggs, m., & davis, s. (2008). creative teaching: mathematics in the early years and primary classroom. london: routledge.
carson, s. h., peterson, j. b., & higgins, d. m. (2005). reliability, validity, and factor structure of the creative achievement questionnaire. creativity research journal, 17(1), 37–50. https://doi.org/10.1207/s15326934crj1701_4
choi, j. n. (2004). individual and contextual predictors of creative performance: the mediating role of psychological processes. creativity research journal, 16(2–3), 187–199. https://doi.org/10.1080/10400419.2004.9651452
cohen, l., manion, l., & morrison, k. (2007). research methods in education (6th ed.). new york, ny: routledge.
de lange, j. (2003). mathematics for literacy. in the national council on education and the disciplines (ed.), quantitative literacy: why numeracy matters for schools and colleges (pp. 75–89). retrieved from https://www.maa.org/sites/default/files/pdf/ql/whynumeracymatters.pdf
ee, d. n. k., & widjaja, w. (2013). mathematical modelling in the primary school: elements in teacher education. 5th redesigning pedagogy international conference. singapore: national institute of education.
feist, j., & feist, g. j. (2006). theories of personality. boston, ma: mcgraw-hill education.
friedman, h. s., & schustack, m. w. (eds.). (2008). the personality reader (2nd ed.). boston, ma: allyn and bacon.
funder, d. c. (2001). personality. annual review of psychology, 52(1), 197–221. https://doi.org/10.1146/annurev.psych.52.1.197
grieshober, w. e. (2004). continuing a dictionary of creative term and definitions. buffalo, ny.
isaksen, s. g., dorval, k. b., & treffinger, d. j. (2000). creative approaches to problem solving: a framework for change (2nd ed.). dubuque, ia: kendall/hunt.
ivcevic, z., & mayer, j. d. (2006). creative types and personality. imagination, cognition and personality, 26(1), 65–86. https://doi.org/10.2190/0615-6262-g582-853u
krathwohl, d. r. (2002). a revision of bloom’s taxonomy: an overview. theory into practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2
martin, p. n. (2009). societal transformation and reference services in the academic library: theoretical foundations for reenvisioning reference. library philosophy and practice, (may), 1–8. retrieved from https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1265&context=libphilprac
mayer, j. d. (2003). structural divisions of personality and the classification of traits. review of general psychology, 7(4), 381–401. https://doi.org/10.1037/1089-2680.7.4.381
mayer, j. d. (2005). a tale of two visions: can a new view of personality help integrate psychology? american psychologist, 60(4), 294–307. https://doi.org/10.1037/0003-066x.60.4.294
mcgregor, s. l. t. (2007). international journal of consumer studies: decade review (1997-2006). international journal of consumer studies, 31(1), 2–18. https://doi.org/10.1111/j.1470-6431.2006.00566.x
mumford, m. d. (2003). where have we been, where are we going? taking stock in creativity research. creativity research journal, 15(2–3), 107–120. https://doi.org/10.1080/10400419.2003.9651403
munandar, u. (1999). kreativitas dan keterbakatan, strategi mewujudkan potensi kreatif dan bakat. jakarta: gramedia pustaka utama.
neethling, k. (2000). the beyonders. in e. p. torrance (ed.), on the edge and keeping on the edge (pp. 153–166). bensenville, il: scholastic testing press.
pehkonen, e., & törner, g. (2004). methodological considerations on investigating teachers’ beliefs of mathematics and its teaching. nomad nordic studies in mathematics education, 9(1), 21–49.
rahayu, s. (2015). pembelajaran matematika dengan pendekatan pmri memang beda. buletin, vi(february). retrieved from http://www.pmri.or.id/main.php
santrock, j. w. (2003). psychology (7th ed.). new york, ny: mcgraw-hill.
santrock, j. w. (2007). child development (11th ed.). new york, ny: mcgraw-hill.
setiawan, r. (2016). construct of creative thinking assessment on divergent and convergent ability. international journal of advance research and innovative ideas in education, 2(4), 1034–1041. retrieved from http://ijariie.com/formdetails.aspx?menuscriptid=1904
setiawan, r. (2017). the influence of income, experience, and academic qualification on the early childhood education teachers’ creativity in semarang, indonesia. international journal of instruction, 10(4), 39–50. https://doi.org/10.12973/iji.2017.1043a
silvia, p. j. (2008). another look at creativity and intelligence: exploring higher-order models and probable confounds. personality and individual differences, 44(4), 1012–1021. https://doi.org/10.1016/j.paid.2007.10.027
siswono, t. t. (2004). pendekatan pembelajaran matematika. jakarta: departemen pendidikan nasional republik indonesia.
smith, m. k. (2008). howard gardner, multiple intelligences and education. retrieved from the encyclopedia of pedagogy and informal education (infed.org) website: https://infed.org/mobi/howard-gardner-multiple-intelligences-and-education/
sternberg, r. j. (2012). the assessment of creativity: an investment-based approach. creativity research journal, 24(1), 3–12. https://doi.org/10.1080/10400419.2012.652925
stillman, g. a. (2015). applications and modelling research in secondary classrooms: what have we learnt? in s. j. cho (ed.), selected regular lectures from the 12th international congress on mathematical education (pp. 791–805). https://doi.org/10.1007/978-3-319-17187-6
van de walle, j. a., karp, k. s., & bay-williams, j. m. (2008). elementary and middle school mathematics: teaching developmentally. boston, ma: allyn and bacon.
wolfradt, u., & pretz, j. e. (2001). individual differences in creativity: personality, story writing, and hobbies. european journal of personality, 15(4), 297–310. https://doi.org/10.1002/per.409

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 5(2), 2019, 120-129
available online at: http://journal.uny.ac.id/index.php/reid

psychometric characteristic of positive affect scale within the academic setting

*1kartika nur fathiyah; 2asmadi alsa; 3diana setiyawati
1faculty of education, universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
2,3faculty of psychology, universitas gadjah mada, jl. sosio humaniora bulaksumur, karangmalang, depok, sleman, yogyakarta 55281, indonesia
*corresponding author. e-mail: kartika.fip_uny@yahoo.co.id
submitted: 9 july 2019 | revised: 5 september 2019 | accepted: 12 september 2019

abstract

this analysis is one of several stages that must be passed before testing the structural model. the study was initiated because of the limited information on the measurement of positive affect within academic settings. the research method used in this study was a quantitative method. the study involved 724 students of state junior high schools in sleman, yogyakarta.
the instrument development consisted of guideline arrangement, language feasibility testing, content validation through expert judgment, trials to measure the item discrimination index, item selection based on the item discrimination results, item representation for each indicator, and the construct validity test for the selected items. the measurement model was tested using structural equation modeling (sem) with the assistance of amos version 21. the results of the study show that the validity analysis of the positive affect scale within the academic setting produced items that can reveal the constructs or latent concepts appropriately.

keywords: positive affect scale, validity analysis, academic setting

permalink/doi: https://doi.org/10.21831/reid.v5i2.25992

introduction

affect, which takes positive and negative forms, plays a significant role in people's lives (nath & pradhan, 2012). affect usually refers to one's emotion that is recognized and described as pleasantness or unpleasantness (watson, clark, & tellegen, 1988). negative affect provides short-term benefits by facilitating specific behavioral tendencies in the form of responses, while positive affect brings long-term benefits (fredrickson, 1998). the negative form includes tension, hopelessness, fear, and irritation, while the positive form covers spirit, strength, activeness, desire, and stamina (yik, russell, & steiger, 2011). high positive affect reflects a state of high energy, vigor, and alertness that makes an individual excited, fully concentrated, and pleasantly engaged. on the other hand, low positive affect is characterized by sadness and fatigue (watson et al., 1988). positive affect means a person's tendency to have a variety of positive emotional experiences (watson et al., 1988).
as a trait (a tendency of an individual that is relatively stable), positive affect is associated with more frequent and intense episodes of positive emotion experienced by individuals. as a state (a person's condition at a certain time), positive affect is a beneficial emotional experience at a particular time (watson & tellegen, 1985). positive affect is a key component of assessment of and effective coping with stressful situations (folkman, 2008), and an antidote to negative emotions that can reduce their harmful influence (fredrickson, tugade, waugh, & larkin, 2003). it develops a mental readiness to grow and step out of unpleasant situations, and it builds up sources of psychological coping for facing stressors (fredrickson, mancuso, branigan, & tugade, 2000). positive affect can also maintain physical and psychological health (danner, snowdon, & friesen, 2001) and build personal resources and well-being (fredrickson & joiner, 2002).

from a neurobiological perspective, positive affect occurs due to a temporary, phasic release of large amounts of dopamine into the synaptic clefts. the dopamine is then distributed through the midbrain dopaminergic system to the striatum, limbic area, and prefrontal cortex (ashby, isen, & turken, 1999). several studies have found that positive affect improves performance through frontostriatal dopaminergic interactions among healthy individuals (demanet, liefooghe, & verbruggen, 2011). studies in various settings have revealed the role of positive affect in improving individual outcomes (samios, abel, & rodzik, 2013; lyubomirsky, king, & diener, 2005; steptoe, dockray, & wardle, 2009).
in the academic field, the role of positive affect is considered very meaningful (schutz & lanehart, 2002; goetz, pekrun, hall, & haag, 2006) because it affects teaching and learning (schutz & lanehart, 2002), students' subjective well-being, process quality, learning achievement, teacher interaction with students, and the effectiveness of the learning process (goetz et al., 2006). those roles indicate the importance of positive affect within the academic setting. however, information on positive affect in the academic setting is very limited (linnenbrink-garcia & pekrun, 2011), and the topic tends to receive little attention from researchers (pekrun et al., 2010). therefore, linnenbrink (2006) and seligman, ernst, gillham, reivich, and linkins (2009) suggest that psychological studies within the academic context should be expanded, especially the development of a positive affect scale within the academic setting as an instrument to measure and support optimal school functioning. this research aims to advance the study of the role of positive affect within the academic setting by developing a proper instrument.

many studies of positive affect in various settings, including the academic setting, have used the concept of watson et al. (1988), which is considered less specific. as a result, positive affect cannot be optimally explored in its context. in the academic setting, pekrun (1992) identifies positive affect or emotions relevant to motivation, the learning process, and student performance. he classifies positive affect in the academic setting into task-related and social positive affect.
regarding the task, positive affect comes from (a) the process, as pleasure while undergoing the academic process; (b) prospective affect, which arises before the academic process takes place, as anticipatory joy (a happy feeling when imagining the results to be achieved) and hope regarding the academic activities; and (c) retrospective affect, which arises after the academic process takes place, shown by joy about the achieved success (joy of success) and by satisfaction, pride, and relief after undergoing the academic process. meanwhile, the social aspect concerns the positive affect that appears because of social interactions during the academic process. its indicators are gratitude, empathy, admiration, and sympathy or love.

the specific measurement model discusses the relation between latent variables (constructs) and measurement indicators by conducting a construct validity analysis of the instrument to reveal how well the measurement indicators measure the latent concept. construct validity testing includes exploratory and confirmatory factor analysis. exploratory factor analysis (efa) is for situations where the relation between observed and latent variables is not known, so exploration is required to determine how and how closely the observed variables relate to the underlying (latent) factors. conversely, in confirmatory factor analysis (cfa), the factor structure is assumed to be known (dachlan, 2014). because the indicators in this research have been theorized by pekrun (1992), the analysis was done using cfa. thus, this paper aims to confirm whether the positive affect scale within the academic setting fits the data obtained, in accordance with the underlying theory.

method

the research method used in this study was a quantitative method. the study was conducted among junior high school students in sleman, yogyakarta. the subjects were 724 students: 359 in the field trial stage and 365 in the empirical data collection. the data at each stage were collected from different subjects. the analysis in the field trial used the item discrimination test, while the validation analysis used the empirical data collection, which also served for the model testing. the instrument development consisted of guideline arrangement, language feasibility testing, content validation through expert judgment calculated with aiken's v formula, the item discrimination index, item selection based on the item discrimination results, item representation for each indicator, the construct validity test of the selected items, and validity and reliability testing. aiken's formula is as follows (aiken, 1985).

$$V = \frac{\sum s}{n(c - 1)}$$

notes:
lo = the lowest validity assessment score (equal to 1)
c = the highest validity assessment score (equal to 4)
r = the score given by an assessor
n = the number of assessors
s = r - lo = r - 1

the testing of the positive affect scale in the academic setting employed structural equation modeling (sem) with the assistance of amos version 21. to assess the goodness of fit (gof) of the model, following dachlan (2014), several criteria were used: chi-square and p-values, cmin/df, gfi (goodness of fit index), agfi (adjusted goodness of fit index), cfi (comparative fit index), tli (tucker-lewis index), and rmsea (root mean square error of approximation).

findings and discussion

findings

the initial step of the study in carrying out the validity test of the positive affect scale was to make the guidelines for the instrument. the guidelines were arranged by referring to pekrun's (1992) theory regarding the general taxonomy of positive emotions relevant to motivation, the learning process, and student performance. the scale contains two aspects: (1) task and (2) social. the positive affect scale comprises statements related to school activities. the students were asked to respond to each statement based on their experience, feeling, and thought. the scale contains statements that support the construct (favorable) and statements that do not (unfavorable). there were two models of answer choices for responding to the statements. the first model covers frequency/intensity with the options 'never', 'rarely', 'sometimes', 'often', and 'always', scored from 1 (never) to 5 (always), while the second model covers appropriateness with the options 'very inappropriate', 'inappropriate', 'sometimes', 'appropriate', and 'very appropriate', scored from 1 (very inappropriate) to 5 (very appropriate). the trial version of the positive affect scale comprised 30 items. the details of the aspects, indicators, and number of items are shown in table 1, while the scale is presented in figure 1.

table 1. the guideline of positive affect scale
aspect   indicator       sub-indicator       number of test items
task     process         joy                 3
         prospective     anticipatory joy    3
                         hope                3
         retrospective   joy about success   3
                         satisfaction        3
                         pride               3
social                   gratitude           3
                         empathy             3
                         admire              3
                         love                3
total number of test items                   30

after preparing the guidelines, language feasibility was tested to ensure that the sentences in the scale were understandable to the readers and conveyed the meaning intended by the researchers (azwar, 2016).

figure 1. the positive affect scale statements
positive affect scale

instruction: the following statements are about your experiences, feelings, and thoughts related to school activities. please respond to each statement with a cross mark (x) based on your condition, using the following possible answers.

nv = never (never experiencing)
rr = rarely (rarely experiencing)
sm = sometimes (sometimes experiencing)
oft = often (often experiencing)
alw = always (always experiencing)

each statement below is answered with one of: nv, rr, sm, oft, alw.

a. the frequency of experiencing the following items in school
1. you are enthusiastic in completing the school assignments
2. you feel comfortable at the school
3. you feel happy when imagining the school assignments have been finished
4. you feel happy when imagining the school graduation
5. you miss your school friends
6. you want to do the best for your school
7. you care about your friends who experience learning difficulties
8. you feel happy when your friends attain academic success
b. the frequency of expectation towards the following items in school
1. you expect to complete your assignments as best as you can
2. you expect to graduate with the highest score
c. the frequency of happiness due to the following items
1. you succeed in finishing a difficult test/exercise item
2. you gain better school results than in the previous semester
d. the frequency of satisfaction due to the following items
1. the teachers' teaching strategies
2. the test score
e. the frequency of proud feeling due to the following items
1. your academic achievement
2. your learning progress
f. the frequency of relief feeling due to the following items
1. you gain scores above the minimum completeness criteria
2. you have finished all your school assignments
g. the frequency of being grateful due to the following items
1. you have kind friends in the school
2. you are taught by caring teachers
h. the frequency of admiring due to the following items
1. teachers' explanation in the classroom
2. the effective learning strategies of classmates for achieving high academic outcomes

the respondents of the test were seven junior high school students from various levels (two students from the seventh grade, three from the eighth grade, and two from the ninth grade). they also came from various types of schools: state, private, and islamic-based schools. each respondent was asked to examine the items and assess the extent to which the items presented on the scale could be understood.

after language feasibility was ensured, content validation followed. it was carried out through judgment by experts with scientific capacity relevant to the measured issue, aimed at finding out whether the items were in line with the measured aspects. the assessment focused on the appropriateness between the items, their indicators, and the measured variables, on the writing procedures, and on the evaluation of high social desirability (azwar, 2016). the expert judgments were then calculated using aiken's v formula to obtain a content validity coefficient based on the measured construct (azwar, 2016). the scores obtained from aiken's v formula range from 0 to 1; a larger coefficient indicates that the item has better content validity (azwar, 2016). the items in the instrument were assessed by 21 experts with at least a master's degree in psychology.
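the content validity calculation described above can be illustrated with a short sketch. the python function below computes aiken's v for a single item as the sum of s = r - lo over all assessors divided by n(c - lo), which equals n(c - 1) when lo = 1. the function name, the five-point rating range (lo = 1, c = 5), and the 21 example ratings are assumptions made only for illustration; they are not the experts' actual ratings from this study.

```python
def aikens_v(ratings, lo=1, c=5):
    """Aiken's V for one item.

    ratings: scores given by the assessors (hypothetical data here)
    lo:      lowest possible rating (assumed to be 1)
    c:       highest possible rating (assumed to be 5)
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < lo or r > c for r in ratings):
        raise ValueError("every rating must lie between lo and c")
    n = len(ratings)                        # number of assessors
    s_total = sum(r - lo for r in ratings)  # sum of s = r - lo
    return s_total / (n * (c - lo))         # n(c - 1) when lo = 1


# Hypothetical example: 21 assessors, 15 rate the item 5 and 6 rate it 4.
example_ratings = [5] * 15 + [4] * 6
v = aikens_v(example_ratings)
print(f"Aiken's V = {v:.2f}")
```

with these hypothetical ratings the sum of s is 78 and the denominator is 21 x 4 = 84, so the coefficient is about 0.93. a coefficient computed this way would then be compared with the minimum acceptable value taken from aiken's table for the given number of assessors.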
the suitability between each item and its indicator was rated on five points: 1 is 'very inappropriate', 2 is 'inappropriate', 3 is 'moderate', 4 is 'appropriate', and 5 is 'very appropriate'. based on aiken's coefficient table, taking p = 0.01 (1% margin of error) for 21 assessors, the minimum score for an item to be accepted was 0.71. the aiken coefficients from the content validation ranged from 0.88 to 0.96, with a mean of 0.91. thus, according to the experts, the items are suitable for their indicators, which means that the positive affect scale is considered to have good content validity.

the next step after the expert judgment was the item discrimination test. this test was done to obtain items with a high discrimination index, that is, items that distinguish individuals or groups who have the measured attribute from those who do not. the approach employed item-total consistency, which shows the suitability between the function of an item and the function of the scale (azwar, 2016). the score of each item was correlated with the total score. a high correlation indicates that the item functions in line with the scale as a whole. items with a correlation of less than 0.30, according to azwar (2016), can be interpreted as having a low discrimination index (invalid), so they can be deleted. based on the item discrimination test using pearson's item-total correlation with the assistance of spss v.21, the discrimination index of the retained positive affect scale items ranged from 0.330 to 0.652. these items were then selected for construct validity testing through confirmatory factor analysis. the item selection was done by choosing the two items with the highest discrimination index for each indicator while considering how well the items represent the indicator. the selected items in the positive affect scale for the confirmatory tests can be seen in table 2.
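the item discrimination step can be sketched as follows. the python code below computes the pearson correlation between each item and the total score and keeps the items whose correlation is at least 0.30. the function names and the small response matrix are hypothetical, and the total score here includes the item itself (an uncorrected item-total correlation), which is an assumption because the text does not state whether a correction was applied.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate every item with the (uncorrected) total score.

    responses: one row per respondent, one column per item.
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals)
            for j in range(n_items)]


def select_items(responses, cutoff=0.30):
    """Return the indices of items whose item-total correlation >= cutoff."""
    return [j for j, r in enumerate(item_total_correlations(responses))
            if r >= cutoff]


# Hypothetical data: 5 respondents, 3 items scored 1-5.
responses = [
    [1, 1, 3],
    [2, 2, 3],
    [3, 3, 2],
    [4, 4, 3],
    [5, 5, 2],
]
print(select_items(responses))
```

in this hypothetical matrix the first two items rise with the total score and are retained, while the third item correlates negatively with the total and is dropped. a corrected item-total correlation (excluding the item from the total) would be a stricter variant of the same idea.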
after obtaining the selected item, the construct validity was done by cfa to test the validity of the scale’s indicators as the measurement of latent construct. the construct validity provides the belief that the indicators taken from the sample really illustrate the actual scores in the population. thus, this analysis confirms empirically based on the sample data to provide theoretical truths for latent variables. table 2. the selected items in the positive affect scale for confirmatory tests aspect indicator sub indicator old version number revised version number task process joy a2, a3 a1, a2 prospective anticipa-tory joy a4, a5 a3, a4 hope b1, b2 b1, b2 retrospective joy about success c1, c2 c1, c2 satisfaction d1, d2 d1, d2 pride e1, e2 e1, e2 relief f1, f3 f1, f2 social gratitude g2, g3 g1, g2 empathy a7, a8 a7, a8 admiration h1, h3 h1, h2 sympathy a11, a12 a5, a6 psychometric characteristic of positive affect scale within... kartika nur fathiyah, asmadi alsa, & diana setiyawati copyright © 2019, reid (research and evaluation in education), 5(2), 2019 125 issn 2460-6995 table 3. the criteria of goodness-of-fit (gof) parameter critical scores experts chi-square closer to 0 is better arbuckle (2013), kline (2011) chi-square/df < 2 byrne (2001) probability ≥ 0.05 kline (2011) gfi ≥ 0.90 kline (2011), dachlan (2014), ghozali (2017) agfi ≥ 0.90 kline (2011), dachlan (2014), ghozali (2017), cfi ≥ 0.90 kline (2011), dachan (2014), ghozali (2017) tli ≥ 0.90 arbuckle (2012), dachlan (2014), ghozali (2017) rmsea ≤ 0.05 kline dachlan (2014) |------------------- 175.338 |* 191.820 |*** 208.302 |******* 224.784 |************ 241.266 |******************* 257.748 |****************** 274.230 |***************** n = 1000 290.712 |*************** mean = 266.192 307.194 |********* s. e. = 1.225 323.676 |******* 340.158 |**** 356.639 |** 373.121 |* 389.603 |* 406.085 |* |------------------- figure 2. 
the results of bootstrapping data in positive affect scale

the construct validity can be analyzed from the factor load values (squared multiple correlation) of the indicators of the latent constructs (ghozali, 2017). model fit was assessed with gof measures, namely cmin, df, p, gfi, agfi, tli, and rmsea. these gof standards follow kline (2011), arbuckle (2013), and ghozali (2017). the criteria of gof can be seen in table 3. the minimum factor load on the latent construct for retaining an item in the positive affect scale was 0.40. this is based on hair, black, babin, anderson, and tatham (2010), who state that the minimum factor load with 200 subjects or more is 0.40. in line with this idea, hair et al. (2010) and netemeyer, bearden, and sharma (2003) state that an item should have a factor load of 0.40-0.90, while items with values below 0.40 should be disregarded.

the confirmatory analysis of the positive affect scale used the maximum likelihood (ml) estimation method. the requirement that must be fulfilled for ml is multivariate normality (byrne, 2010). the multivariate normality test on the positive affect scale gave a critical ratio (c.r.) of 35.069. because this value was outside the range of -2.58 to +2.58, the data were declared non-normal, so they did not meet the assumption of multivariate normality. to handle the non-normal data, the bootstrap procedure was applied. the bootstrapping results on the positive affect scale data, with 1000 bootstrap samples, a 95% percentile confidence level, and a 95% bias-corrected confidence interval, can be seen in figure 2. figure 2 shows that the mean chi-square value over the 1000 bootstrap samples in the positive affect scale was 266.192; the values clustered around a center of about 266 in a roughly normal shape, because there were several values
above and under 266 that were comparable. after the non-normality had been handled through bootstrapping, the confirmatory analysis was conducted. the preliminary results indicated that the positive affect scale measurement model did not meet the gof criteria, as presented in figure 3. this was indicated by a relatively high chi-square of 1286.932, chi-square/df=209, p=0.00 (critical score p ≥ 0.05), gfi, agfi, tli, and cfi values that were still far below 0.9 (critical value ≥ 0.9), and rmsea=0.119 (critical value ≤ 0.05). to achieve the gof criteria, only items with a loading factor of at least 0.5 were retained. thus, item d2 was deleted since it did not meet this criterion (loading factor=0.49). further items were removed by following the modification suggestions of the amos program: items affect c2, f2, e1, a6, g2, e1, a4, e2, a8, h2, a5, d1, b2, and g1 were removed since they shared variance with other items (cross-loading), as shown by relatively high modification index (mi) values. based on these modifications, the positive affect scale reached the measurement fit values shown in figure 4, i.e. chi-square of 15.602 with p=0.76; chi-square/df=1.734; gfi=0.986; agfi=0.967; tli=0.978; cfi=0.987; and rmsea=0.045, in accordance with the established gof criteria. a summary of the analysis of the positive affect scale factors based on the gof criteria can be seen in table 4. six items were selected in table 4: affect a1, a3, b1, and c1 (related to the assignment aspect) and affect a7 and h1 (representing the social aspect). after the modifications, the items in the positive affect scale were empirically confirmed against the established gof criteria.

figure 3.
analysis of confirmatory factors on positive affectivity scale which is not in accordance with the criteria of goodness of fit (gof)

figure 4. cfa on positive affect scale based on goodness of fit (gof)

table 4. summary of cfa on positive affect scale based on gof

| aspect | item | scores of loading factors | significance (p) |
|---|---|---|---|
| assignment | affecta1 | 0.691 | significant |
| | affecta3 | 0.531 | significant |
| | affectb1 | 0.628 | significant |
| | affectc1 | 0.616 | significant |
| social | affecta7 | 0.699 | significant |
| | affecth1 | 0.506 | significant |

discussion

this research is one of several stages before testing the structural model of the positive affect scale as one of the research instruments. the results of the study indicate that the positive affect scale in the academic setting is able to produce items that reveal the latent constructs or concepts appropriately. there are six selected items, of which four represent the assignment aspect and two relate to the social aspect. in relation to the positive affect instrument developed by watson et al. (1988), the positive affect instrument for the academic domain generated through this research enriches the study of previous positive affect instruments. the instrument of watson et al. (1988) was general for all domains, while the instrument resulting from this study more specifically reveals the positive affect that develops in academic settings. thus, the discussion of positive affect in academic settings becomes more detailed and clearer according to context. this study can make a useful contribution given the limited number of studies on affect in the academic setting, as stated by linnenbrink-garcia and pekrun (2011) and pekrun et al. (2002).
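the comparison of the reported fit indices with the table 3 cut-offs can be sketched in python as follows. this is only an illustration under stated assumptions: the dictionary keys and the function name are hypothetical, the thresholds are transcribed from table 3, and the index values are copied as reported in the results. the chi-square/df entry of the initial model is left out because the reported value (209) looks like the degrees of freedom rather than a ratio; that reading is an assumption, not something the source states.

```python
import operator

# cut-offs transcribed from table 3: index -> (comparison, threshold)
GOF_CRITERIA = {
    "chi2/df": ("<", 2.0),
    "p": (">=", 0.05),
    "gfi": (">=", 0.90),
    "agfi": (">=", 0.90),
    "cfi": (">=", 0.90),
    "tli": (">=", 0.90),
    "rmsea": ("<=", 0.05),
}

_OPS = {"<": operator.lt, ">=": operator.ge, "<=": operator.le}


def check_fit(indices):
    """return {index: True/False} for every reported index with a cut-off."""
    result = {}
    for name, value in indices.items():
        if name in GOF_CRITERIA:
            symbol, threshold = GOF_CRITERIA[name]
            result[name] = _OPS[symbol](value, threshold)
    return result


# values as reported in the text for the modified six-item model
final_model = {
    "chi2/df": 1.734, "p": 0.76, "gfi": 0.986, "agfi": 0.967,
    "tli": 0.978, "cfi": 0.987, "rmsea": 0.045,
}
# only the numerically reported indices of the initial model
initial_model = {"p": 0.00, "rmsea": 0.119}

final_ok = check_fit(final_model)
initial_ok = check_fit(initial_model)
```

under these assumptions every index of the modified model passes its cut-off, while both indices that are reported numerically for the initial model fail.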
it is expected that psychological dynamics within the academic context can be investigated comprehensively in order to build appropriate and efficient solutions to various educational problems.

conclusion

it is concluded that the validity testing of the positive affect scale within the academic domain can produce items that reveal the latent constructs or concepts appropriately. correct and proper information on psychological dynamics within the academic context can support the creation of appropriate and efficient solutions for various educational problems and thereby improve the quality of education. further studies are expected to cover a wider area to see whether the research findings can be applied to other unexamined subjects and contexts.

references

aiken, l. r. (1985). three coefficients for analyzing the reliability and validity of ratings. educational and psychological measurement, 45(1), 131–142. https://doi.org/10.1177/0013164485451012
arbuckle, j. l. (2013). ibm® spss® amos™ 22 user’s guide. chicago, il: amos development corporation.
ashby, f. g., isen, a. m., & turken, a. u. (1999). a neuropsychological theory of positive affect and its influence on cognition. psychological review, 106(3), 529–550. https://doi.org/10.1037/0033-295x.106.3.529
azwar, s. (2016). konstruksi tes kemampuan kognitif. yogyakarta: pustaka pelajar.
byrne, b. m. (2010). structural equation modeling with amos: basic concept, application, and programming (2nd ed.). new york, ny: routledge taylor & francis group.
dachlan, u. (2014). panduan lengkap struktural equation modeling tingkat dasar: metodologi, konsepsi, aplikasi (dengan amos) (1st ed.). semarang: lentera ilmu.
danner, d. d., snowdon, d. a., & friesen, w. v. (2001).
positive emotions in early life and longevity: findings from the nun study. journal of personality and social psychology, 80(5), 804–813. https://doi.org/10.1037/0022-3514.80.5.804
demanet, j., liefooghe, b., & verbruggen, f. (2011). valence, arousal, and cognitive control: a voluntary task-switching study. frontiers in psychology, 2(336), 1–9. https://doi.org/10.3389/fpsyg.2011.00336
folkman, s. (2008). the case for positive emotions in the stress process. anxiety, stress and coping, 21(1), 3–14. https://doi.org/10.1080/10615800701740457
fredrickson, b. l. (1998). what good are positive emotions? review of general psychology, 2(3), 300–319. https://doi.org/10.1037/1089-2680.2.3.300
fredrickson, b. l., & joiner, t. (2002). positive emotions trigger upward spirals toward emotional well-being. psychological science, 13(2), 172–175. https://doi.org/10.1111/1467-9280.00431
fredrickson, b. l., mancuso, r. a., branigan, c., & tugade, m. m. (2000). the undoing effect of positive emotions. motivation and emotion, 24(4), 237–258. https://doi.org/10.1023/a:1010796329158
fredrickson, b. l., tugade, m. m., waugh, c. e., & larkin, g. r. (2003). what good are positive emotions in crisis? a prospective study of resilience and emotions following the terrorist attacks on the united states on september 11th, 2001. journal of personality and social psychology, 84(2), 365–376. https://doi.org/10.1037/0022-3514.84.2.365
ghozali, i. (2017). model persamaan struktural konsep dan aplikasi dengan program amos 24 update bayesian sem (7th ed.). semarang: badan penerbit universitas diponegoro.
goetz, t., pekrun, r., hall, n., & haag, l. (2006). academic emotions from a social-cognitive perspective: antecedents and domain specificity of students’ affect in the context of latin instruction. british journal of educational psychology, 76(2), 289–308. https://doi.org/10.1348/000709905x42860
hair, g., black, b., babin, b., anderson, r., & tatham, r. (2010).
multivariate data analysis (7th ed.). upper saddle river, nj: pearson.
kline, r. (2011). principles and practice of structural equation modeling (3rd ed.). new york, ny: the guilford press.
linnenbrink-garcia, l., & pekrun, r. (2011). students’ emotions and academic engagement: introduction to the special issue. contemporary educational psychology, 36(1), 1–3. https://doi.org/10.1016/j.cedpsych.2010.11.004
linnenbrink, e. a. (2006). emotion research in education: theoretical and methodological perspectives on the integration of affect, motivation, and cognition. educational psychology review, 18(4), 307–314. https://doi.org/10.1007/s10648-006-9028-x
lyubomirsky, s., king, l., & diener, e. (2005). the benefits of frequent positive affect: does happiness lead to success? psychological bulletin, 131(6), 803–855. https://doi.org/10.1037/0033-2909.131.6.803
nath, p., & pradhan, r. k. (2012). influence of positive affect on physical health and psychological well-being: examining the mediating role of psychological resilience. journal of health management, 14(2), 161–174. https://doi.org/10.1177/097206341201400206
netemeyer, r. g., bearden, w. o., & sharma, s. (2003). scaling procedures: issues and applications. https://doi.org/10.4135/9781412985772
pekrun, r. (1992). the impact of emotions on learning and achievement: towards a theory of cognitive/motivational mediators. applied psychology, 41(4), 359–376. https://doi.org/10.1111/j.14640597.1992.tb00712.x
pekrun, r., goetz, t., titz, w., & perry, r. p. (2002). academic emotions in students’ self-regulated learning and achievement: a program of qualitative and quantitative research. educational psychologist, 37(2), 91–105. https://doi.org/10.1207/s15326985ep3702_4
pekrun, r., goetz, t., titz, w., perry, r.
p., pekrun, r., goetz, t., … perry, r. p. (2010). academic emotions in students’ self-regulated learning and achievement: a program of qualitative and quantitative research. (october 2014), 37–41. https://doi.org/10.1207/s15326985ep3702
samios, c., abel, l. m., & rodzik, a. k. (2013). the protective role of compassion satisfaction for therapists who work with sexual violence survivors: an application of the broaden-and-build theory of positive emotions. anxiety, stress & coping, 26(6), 610–623. https://doi.org/10.1080/10615806.2013.784278
schutz, p. a., & lanehart, s. l. (2002). introduction: emotions in education. educational psychologist, 37(2), 67–68. https://doi.org/10.1207/s15326985ep3702_1
seligman, m. e. p., ernst, r. m., gillham, j., reivich, k., & linkins, m. (2009). positive education: positive psychology and classroom interventions. oxford review of education, 35(3), 293–311. https://doi.org/10.1080/03054980902934563
steptoe, a., dockray, s., & wardle, j. (2009). positive affect and psychobiological processes relevant to health. journal of personality, 77(6), 1747–1776. https://doi.org/10.1111/j.1467-6494.2009.00599.x
watson, d., clark, l. a., & tellegen, a. (1988). development and validation of brief measures of positive and negative affect: the panas scales. journal of personality and social psychology, 54(6), 1063–1070. https://doi.org/10.1037/0022-3514.54.6.1063
watson, d., & tellegen, a. (1985). toward a consensual structure of mood. psychological bulletin, 98(2), 219–235. https://doi.org/10.1037/0033-2909.98.2.219
yik, m., russell, j. a., & steiger, j. h. (2011). a 12-point circumplex structure of core affect. emotion, 11(4), 705–731.
https://doi.org/10.1037/a0023980

copyright © 2018, reid (research and evaluation in education) issn 2460-6995
reid (research and evaluation in education), 4(2), 2018, 94-104
available online at: http://journal.uny.ac.id/index.php/reid

“my lecturer’s expressionless face kills me!” an evaluation of learning process of german language class in indonesia

*1primardiana hermilia wijayati; 2rofi’ah; 3ahmad fauzi mohd ayub
1,2german department, faculty of letters, universitas negeri malang, jl. semarang no.5, sumbersari, kota malang, jawa timur 65145, indonesia
3faculty of educational studies, universiti putra malaysia, 43400 serdang, selangor, malaysia
*corresponding author. e-mail: primardiana.hermilia.fs@um.ac.id
submitted: 19 december 2018 | revised: 20 december 2018 | accepted: 21 december 2018

abstract

this qualitative study aimed at evaluating the circumstances in plenary class that provoke learners’ speaking anxiety. to meet the objectives, this study investigated students of a german as a foreign language (gfl) course who were experiencing speaking anxiety symptoms in the plenary class. the research was a narrative qualitative study, and the data were collected through observation and interview. the result of this study reveals that learners’ speaking anxiety occurred in particular circumstances of the plenary class, such as unfamiliar topic, still class, students’ unpreparedness for spontaneous speaking, expressionless face of the lecturer, and students’ fear of native speaker lecturers.

keywords: speaking anxiety, german as a foreign language (gfl), foreign language anxiety (fla), sozialform

introduction

german is one of the foreign languages learned by several learners in indonesia, including in universitas negeri malang. unlike english, german is not an international language that students have been learning since elementary school or even kindergarten. it is also not as familiar as english.
in the german department of universitas negeri malang, the students have various backgrounds. some students have prior knowledge of german from high school, while others have none at all and started to learn german at the university. students’ german language knowledge is standardized through the gemeinsamer europäischer referenzrahmen (ger) (2004) or modern language division (2001). according to ger, german skills are divided into three levels, i.e. the basic level, which consists of a1 (breakthrough) and a2 (waystage); the independent level, which consists of b1 (threshold) and b2 (vantage); and the competent level, which consists of c1 (effective operational efficiency) and c2 (mastery) (glaboniat, müller, rusch, schmitz, & wertenschlag, 2005). in universitas negeri malang, german was taught at a different level in each semester. students learned german i (a1) in the first semester, german ii (a2) in the second semester, german iii (a2b1) in the third semester, german iv (b1) in the fourth semester, and then german b2 (deutsch auf b2 niveau) in the fifth semester. the students had to pass the lower level first in order to reach the higher level (department of german letters, 2016). in the deutsch auf b2 niveau class, the students did not learn the whole level; they only learned the beginning or the basis of the b2 level. however, students’ german language skill at this level should be good, since the students had already reached b1 and b2, which are independent levels. the point is that students at this level should be able to communicate in german fluently. however, according to the preliminary research, the students were quite passive and looked anxious to speak, especially when they had to speak in the plenary class or in front of their classmates and teacher.
it indicated that they suffered from classroom speaking anxiety. speaking is one of the language skills that should be acquired by students in a foreign language class. speaking as a productive activity is very important for them to communicate with each other, not only in academic but also in interpersonal contexts (lightbown & spada, 2006). speaking is often considered the most difficult language skill because students need to go through a complicated process in order to speak correctly and understandably (mclaren, madrid, & bueno, 2006). speaking combines cognitive and psychological aspects. in order to speak successfully, students need to have sufficient language knowledge and a good psychological (mental) state. the cognitive aspect consists of bottom-up and top-down processes (bashir, azeem, ashiq, & dogar, 2011; saville-troike, 2006). the bottom-up process involves language knowledge such as vocabulary, pronunciation, and grammatical patterns. meanwhile, the top-down process involves content knowledge about a topic and cultural knowledge of the spoken language. furthermore, the psychological aspect or mental state also affects students’ speaking skill. one of the psychological aspects that affect speaking skill is anxiety (ansari, 2015; muhaisen & al-haq, 2012). speaking anxiety in a language class is manifested in several ways. some studies show that speaking anxiety increases students’ monitor use (dulay, burt, & krashen, 1982; el-sakka, 2016). students cannot speak fluently because they are self-conscious. this situation worsens their speaking ability (shabani et al., 2013; von wörde, 2003), so they cannot reach their full potential in speaking. some studies identify the causes of students’ speaking anxiety, such as lack of fluency, poor knowledge of vocabulary, unfamiliar topic, and negative feedback (awan, azher, anwar, & naz, 2010; barahmeh, 2013; nazir, bashir, & raja, 2014).
this phenomenon can be seen in almost every language class, including german language classes. studies on speaking anxiety in german language classes have revealed similar findings. students can suffer from fear of speaking in a german language class. the main causes are, for example, fear of negative feedback, low language proficiency, and shyness (fischer & modena, 2005). that fear of speaking leads to speaking anxiety. this anxiety affects students’ language ability and worsens their linguistic mastery because they cannot think clearly under those circumstances (sevinç & backus, 2017). these findings come from german as a second language classes. because speaking anxiety has a huge effect, it is important to investigate speaking anxiety and its forms. the context of this research differs from that of previous studies, namely german as a foreign language (gfl) in indonesia. a familiar sight in german as a foreign language classes in indonesia is the following: the lecturer asks the students a question and they respond in unison, like a choir. however, when the lecturer asks a student to raise a hand and speak in front of the class, the students keep quiet as if the class had become a 'graveyard'. this is because most of the students are passive and anxious to speak in front of the class (cansrina, 2015). this pattern is a kind of culture in german as a foreign language classes in indonesia. based on the preliminary research, the students’ passiveness became a serious problem in the classroom. it also had negative effects on their performance. during the class, the few students who spoke actively were always the same persons. thus, the lecturer had to ask or even force the other students to speak. otherwise, they would only speak with their classmates when they had to interact with each other in pair work or group work. the students’ passiveness and fear of speaking, as mentioned before, show that there were speaking anxiety symptoms among them.
this situation normally happened in the plenary class. the plenary class is one interaction form or sozialform, a term defined as a didactic arrangement of the interaction pattern between students and teachers and among students, which consists of plenary class (frontalunterricht/pleno), individual work (einzelarbeit), pair work (partnerarbeit), and group work (gruppenarbeit) (kiper, meyer, & topsch, 2002). this study focused only on the learners’ speaking anxiety in the plenary class. classroom speaking anxiety is a kind of unpleasant feeling suffered by foreign language learners when they are asked to speak in the classroom. speaking anxiety is defined as a feeling of fear, nervousness, and lack of self-confidence during speaking which is associated with visual signs (horwitz, 2001; horwitz, horwitz, & cope, 1986; tseng, 2012; wilson, 2006; zhiping & paramasivam, 2013). basic (2011) also states that anxiety is a sort of fear manifested by visual signs. speaking anxiety is a part of foreign language anxiety (fla) experienced by foreign language learners (bashir et al., 2011; horwitz, 2001; horwitz et al., 1986). thus, anxiety in the speaking skill is a problem experienced by most students in foreign language classes (arnaiz & guillén, 2012; basic, 2011; horwitz, 2001; horwitz et al., 1986; marwan, 2007; tseng, 2012; wilson, 2006; zhiping & paramasivam, 2013). it is caused by the complexity of the speaking skill (basic, 2011). this is why researchers have conducted deeper studies of speaking anxiety with various focuses and results. nowadays, there are a number of speaking anxiety studies in english as a foreign language (efl) classes as well as in german as a foreign language (gfl) classes.
tseng (2012) explains that several factors can cause speaking anxiety in english classes, such as parents’ and teachers’ demands for students to get good grades in english at school, lack of confidence in students’ ability to learn english, fear of making mistakes and of subsequent punishment or ostracism, i.e. fear of embarrassment for not being perfect, being conditioned in childhood to believe that english is extremely difficult, and fear of foreigners and their behavior. all of this shows that english triggers anxiety because of its role as an international language. however, the causes of speaking anxiety in another foreign language, such as german, may be different. zhiping and paramasivam (2013) looked for the causes of speaking anxiety in an international class in malaysia where the students came from nigeria, iran, and algeria. their findings revealed that particular factors provoke speaking anxiety (e.g. fear of being in public and shyness, fear of negative evaluation, and fear of speaking inaccurately). in addition, students’ speaking anxiety level varies; it depends on the student and also on their culture. therefore, the causes of speaking anxiety among the students were closely related to cultural differences, since they came from different countries. research on speaking anxiety in german as a foreign language (gfl) classes has been done by fischer and modena (2005), who investigated speaking anxiety at modena university in italy, and by cansrina (2015). the results indicate that motivation matters a great deal for the success of students’ speaking skill. students with high motivation as well as self-confidence in learning german can speak german well. meanwhile, students who suffer from speaking anxiety and are afraid of negative evaluation have low speaking skill. gnjidić (2016), in his study, found that anxiety and fear are the biggest obstacles to learning german for croatian students.
when students have a high anxiety level, they concentrate poorly on producing and expressing their ideas through speaking (fischer & modena, 2005; inozemtseva, 2017). a local study by cansrina (2015) divided the causes of students’ speaking anxiety in german literature at padjajaran university into three aspects, i.e. personal, social-didactic, and cultural aspects. the causes of speaking anxiety in the personal aspect are thinking too much about grammar and fear of negative evaluation. in the social-didactic aspect, students feel anxious to speak when the topics are unfamiliar. meanwhile, in the cultural aspect, students’ fear is provoked by the behavior of indonesian teachers since elementary school, i.e. the students have to keep silent in class. slightly different from the studies by fischer and modena (2005), cansrina (2015), gnjidić (2016), and inozemtseva (2017), this study investigated speaking anxiety in particular circumstances of the interaction forms in the classroom. the interaction forms of the classroom consist of plenary class and individual, pair, and group work, but this study focused on speaking anxiety that occurs only in the plenary class. based on the authors’ teaching experience and the preliminary research, it can be assumed that the students were more passive and anxious in the plenary class than in individual, pair, or group work activities. that is why this study focused only on speaking anxiety in the plenary class.

method

this qualitative research aimed to evaluate the learning process of the german language class in indonesia by trying to reveal which circumstances of the plenary class provoked speaking anxiety among german learners in universitas negeri malang.
the respondents in this study were students of universitas negeri malang who had the zids or zertifikat indonesische deutschstudierende (certificate of german skill for indonesian students), and who attended the deutsch auf b2 niveau (german level b2) class as well as the deutsche literatur (german literature) class. these students were selected because they had sufficient input of german. ideally, they should be able to communicate in german fluently. the data of the study were collected through observation and interview. the observation was conducted in the deutsch auf b2 niveau class and the deutsche literatur class to observe and note the symptoms of learners’ speaking anxiety during the plenary class. deutsch auf b2 niveau was taught by two indonesian lecturers, while deutsche literatur was taught by a german lecturer. in addition, the interview was conducted to support and confirm the data. this study used participant observation in the form of passive participation. the researchers were not directly involved in the classroom activity; they acted as camera operators who observed and recorded the whole activity of the class with a video recorder. the video data were then analyzed using an observation sheet. the sheet contained several indicators for identifying students who showed speaking anxiety symptoms. after the observation, eight students who showed indications of speaking anxiety were interviewed. the researchers met the students one by one and interviewed them personally to obtain deeper data about their speaking anxiety. there were several questions on the interview sheet, but the questions could develop according to the information from the interviewee. this means that the interview was arranged to bring out the interviewees’ personal views (creswell, 2013; sugiyono, 2012).
in qualitative research, data analysis is a continuous process that needs continuous reflection throughout the study. the technique used in this study was adapted from spradley and consists of three kinds of analysis, i.e. domain analysis, taxonomy analysis, and componential analysis (spradley, 1980; wijayati, 1995). the data in this study were analyzed with these three techniques. in domain analysis, there is a term called cultural domain. it is a category of cultural meaning that includes smaller categories. a domain, as a cultural category, consists of three basic elements, i.e. cover term, included term, and semantic relationship (see figure 1). the cover term is the name of a cultural domain category, an included term is the name of a smaller category within the domain, and the semantic relationship is the term that links the cover term and the included terms (spradley, 1980, p. 89; wijayati, 1995, p. 32). the results of the domain analysis were then analyzed through taxonomy analysis. like domain analysis, it consists of categories arranged by semantic relationship; the difference is that taxonomy analysis focuses on the relationships that appear within the cultural domain. then the data were analyzed through componential analysis. componential analysis is a systematic search for the meaning components related to the structural category. it was used to find the meanings that the research object assigned to the cultural category (spradley, 1980; wijayati, 1995, p. 43).

figure 1. example of domain analysis

findings and discussion

plenary is one of the interaction forms that arrange the interaction pattern between teacher and students in the classroom, in which the teacher stands in front of the class, while the students sit facing the teacher.
in the plenary class, the teacher is the master of the class who conducts the learning process. the teacher controls the communication and the learning process of the class (nuhn, 2000). this kind of interaction form is easy to conduct because it does not need much preparation. moreover, the teacher can observe or even evaluate students’ progress in learning a foreign language directly. however, there are particular circumstances of the plenary class that provoke learners’ speaking anxiety. based on the study, learners’ speaking anxiety occurs during the plenary class in particular circumstances, e.g. when the topic is unfamiliar, when the students are unprepared for spontaneous talk, when nobody answers and the class is very still, when the lecturer is expressionless, and when the students are taught by a native speaker.

unfamiliar topic

in the plenary class, the students were willing to speak when the topic of the lesson was interesting and familiar to them, because they found it enjoyable and easy. on the contrary, when the topic of the lesson was unfamiliar, they were passive because they found it boring and difficult to speak about. this happened in all classes, no matter whether they were taught by a german or an indonesian lecturer. data (1)(13) are the data that show that issue.

(1) when i talk a kind of topic that i used to talk in my daily life, i think my vocabulary is relatively okay, so, it doesn’t matter. but when the topic is rather difficult, i feel nervous, ma’am. (mdcw)

datum (1) shows that the student felt nervous when he had to talk about a difficult topic. nervousness is one of the speaking anxiety symptoms that occur during speaking (horwitz, 2001; horwitz et al., 1986; lightbown & spada, 2006; spielberger, 1983; tseng, 2012; wilson, 2006; zhiping & paramasivam, 2013). a similar case happened to the following student, as recorded in datum (2).
[figure 1 content: cover term “speaking anxiety”; semantic relationship “are the symptoms of”; included terms “fear”, “nervous”, “passive”]

(2) in zids class, i got a topic to speak, namely online shop. at that time i really had no idea about that and i said, ‘i’m sorry, i have no idea about that, because i never shop online yet… i had no experience about that, so i just can’t tell about that topic.’ (fns)

the description above shows that the topic played an important role in classroom interaction. interesting, easy, and familiar topics helped the students to communicate easily, while difficult and unfamiliar topics provoked learners’ speaking anxiety. the students suffered from speaking anxiety when the topic did not interest them. difficult topics that require a larger vocabulary also provoked the students’ fear of speaking. in addition, unfamiliar topics, such as german culture or something that the students had never experienced themselves, triggered their speaking anxiety. these findings support cansrina’s (2015) explanation that students have no fear of speaking when the topic of the class is interesting.

unprepared students for spontaneous talk

the next circumstance that provoked learners’ speaking anxiety in the plenary class was when the students were unprepared to talk spontaneously. based on the observation, the students looked shy and smiled when they were called on and asked to speak. some of them showed tension and nervousness. other students showed repetitive gestures, such as scratching their hand and head or touching their face several times, which indicate that they were nervous. this suggests that they suffered from speaking anxiety, since, as basic (2011) says, anxiety is a sort of fear manifested by visual signs.
based on the interview results, the students were feeling tension when they were suddenly asked to speak. they were afraid because they have no preparation before. it made them speechless, as seen in datum (3). (3) when i’m suddenly asked to speak, my mind was blank, and i don’t even know what to speak ... first, it is because i don’t prepare it well, sometimes the grammar sounds odd either. it is a kind of a mix between tension and confusion. (dna) datum (3) shows that the student experienced mental block or losing an idea because of the sudden call to speak. mental block indicated that the students suffered anxiety. horwitz et al. (1986) call it communication apprehension. communication apprehension is a part of fla that causes speaking disruption such as stutter, mind blank or losing idea and words, and high intonation or on the contrary. in this study, the students experienced mental block because they were shocked as they were suddenly called to speak. in a long-established habit in indonesian classes, most of the teachers used to call the students in sequence (based on position or alphabetically). thus, the students could prepare what to speak while they were waiting for their turns. that is why they were shocked and nervous when they were unprepared for spontaneous talks. other data which show that students were feeling the tension and afraid to speak are presented in data (4) and (5). (4) when i’m not ready, it is disturbing to be asked to speak. also it is much better not to ask or freiwillig*. but when i already prepared, it is okay to be asked to speak. (hi) *freiwillig: free willing (to speak) (5) i feel more afraid when i am asked to speak, because when i don’t understand yet, i am not ready, so what do i have to speak? (fwep) data (4) and (5) show that the students were afraid to talk if they are not ready and do not understand the material yet. they were afraid to make mistakes, whether it is in the content or the grammar. 
fischer and modena (2005) and also zhiping and paramasivam (2013) state that the cause of speaking anxiety is the fear of making mistakes. in cansrina’s (2015) research, the fear of making mistakes was not a significant factor in provoking speaking anxiety, while, in this study, the fear of making mistakes was a major reason behind most of the learners’ speaking anxiety. in addition, the students were afraid of negative evaluation, so they were afraid to make mistakes when they answered the lecturer’s question. this also supports the findings of fischer and modena (2005), zhiping and paramasivam (2013), and cansrina (2015), who state that students are afraid of negative evaluation, especially from the lecturer. that is why they suffer from speaking anxiety.

quiet class

based on the observation in the deutsche literatur class, the lecturer asked for the students’ opinions about a part of a novel that they had read. nobody answered. the quieter the class, the more worried the students became. this can be seen in datum (6).

(6) i don’t really understand what that german lecturer wants. i mean, what he wants us to do. even if i know, but why do my classmates keep silent? the lecturer asked us to do this, but why they say nothing? so i keep silent too. (wdb)

as seen in datum (6), the students kept silent when nobody had the courage to answer first. they were afraid to express their ideas orally (basic, 2011; horwitz, 2001; horwitz et al., 1986; spielberger, 1983; tseng, 2012; wilson, 2006; zhiping & paramasivam, 2013). they also had no interest in speaking, which indicates that they suffered from speaking anxiety (horwitz et al., 1986). based on the observation, the students answered the lecturer’s question in chorus.
cansrina (2015) says that they did that because if they made mistakes, at least they were not alone. they did it together, so they felt safe. however, if no one had the courage to answer, it was better to keep silent. it seemed that if somebody talked alone in front of the class and made mistakes, then he/she would become the ‘defendant’, which would embarrass him/her. that circumstance triggered students’ shyness, fear, and tension about speaking. such circumstances created a negative atmosphere in the classroom, which had a negative impact on the students. thus, the negative atmosphere contributed to learners’ speaking anxiety. it means that the class needs a positive atmosphere, in line with cansrina’s (2015) statement that a positive class atmosphere contributes to learners’ speaking activity.

expressionless face of the lecturer

the next circumstance that provoked learners’ speaking anxiety in the classroom was the expressionless face of the lecturer. according to the interview, the expressionless face of the lecturers triggered students’ tension, as presented in data (7) and (8).

(7) it is even more frightening if the listener’s* face was expressionless. if they are nodding, it means ‘o, everything is alright’ (laugh), but when they show their flat expression, o my, it kills me! what should i do then? (fwep) *lecturer

(8) lecturers’ expression decides whether i can speak or not. if they ask me to speak with smiling face, i feel, well, still nervous, but not much. but when they are expressionless i’m afraid to speak in front of the class. (fns)

according to data (7) and (8), the expressionless face of the lecturer provoked the learners’ speaking anxiety. they were unsure how to interpret the lecturer’s expression, so they felt nervous and afraid to speak. when the lecturer’s face was expressionless, the students were frightened by him/her, so they were afraid to speak freely and focused more on language accuracy.
the learners’ fear caused by the expressionless face of the lecturer appeared because of the learners’ own perception. this result did not appear in other relevant studies of speaking anxiety. the students were not sure of their own answers, so they read the lecturers’ expression to judge whether their answers were right or wrong. thus, they were afraid of making mistakes and afraid of getting a negative evaluation from the lecturers (cansrina, 2015; fischer & modena, 2005; zhiping & paramasivam, 2013). logically, if they were not afraid of negative evaluation, they would not be afraid of making mistakes. it means that such perception came from the students themselves.

native speaker lecturer

in the plenary classes of the study, there was a difference between the classes conducted by indonesian lecturers and those conducted by german lecturers. based on the observation, the students suffered from speaking anxiety in particular circumstances, but they were still active enough when they were taught by indonesian lecturers. meanwhile, in the class conducted by a native speaker, the students were extremely passive. they did not want to speak until the lecturer directly asked a student to speak. when the lecturer asked the class, nobody would answer. when the lecturer repeated the question, the students discussed it with their classmates in a whisper. when they did not understand the question, instead of asking the lecturer directly, they asked their classmates. some students even avoided eye contact, which is one of the symptoms of speaking anxiety (cansrina, 2015). according to the interview outcomes, it was because the students had difficulty understanding what the native speaker said. his dialect and accent were a bit different.
when indonesian lecturers spoke, the students could understand their accent, because they had the same mother tongue as the students. (9) if the language used in the class is full german but the lecturer is indonesian, they still could express it and their accent is still like indonesians. but in deutsche literatur class that is conducted by a native speaker, it is so confusing, because we have to speak full german and his pronunciation sounds ‘extremely german’. sometimes i do not like to attend the class (laughing). (md) (10) indonesian lecturers may understand when we made grammar mistakes. but native speaker, i‘m afraid if they don’t understand what we said. i’m afraid so. (tn) according to data (9) and (10), the students found that indonesian lecturers could understand them well. their accent was easy to understand. the students assumed that indonesian lecturers knew their difficulties in grammar because they had the same mother tongue. in addition, if the students did not understand particular words, indonesian lecturers could explain it in indonesian language. that is why the students were feeling glad and safe when they were taught by indonesian lecturers. on the contrary, students were nervous when they were taught by a native speaker because they thought that a native speaker could not understand their difficulties and their culture as well as indonesian lecturers which are evidenced by datum (11). (11) when i talked with a native speaker i feel so nervous, because, eee, every time we speak slowly and stuttered, he shows different expressions (confused), but indonesian lecturers, just like our lecturers, know our behavior well. (tntr) in addition, the students thought that a native speaker was the owner of the language they learned. that is why they were being forced by themselves to make the native speaker understood what they said. they thought that the native speaker would notice every grammar mistake they made more than indonesian (lecturer). 
for that reason, they had to focus on grammar accuracy, which made them more nervous. this can be seen in data (12) and (13).

(12) mostly i feel nervous when i talk with german native speaker (german lecturer) because german is his mother tongue. i’m very afraid, whether my grammar is true or false. (wdb)

(13) when there are germans, i mean, outside of the university, actually i really want to speak with them. but, i’m afraid if i make mistake during speaking. they are foreigners who don’t understand us well. i’m afraid if i make mistakes and they find it odd or something like that. (fns)

according to data (12) and (13), the class conducted by a native speaker caused difficulty for the students. the students felt more nervous and were afraid to interact if the lecturer was a german native speaker because they were afraid to make grammar mistakes. in addition, the data show that the students had a fear of foreigners. according to the observation, the students avoided the chairs near the german lecturer. some chairs in front of the german lecturer were empty at the beginning of the class and were taken only by those who came late, as if it were a punishment for them. it shows that the students avoided taking a seat near the native speaker because they did not feel at ease and were afraid of a foreigner. all of the data mentioned reveal that the students suffered from speaking anxiety in the plenary class when the lecturer was a native speaker. this finding supports the finding of tseng (2012) that the cause of the learners’ speaking anxiety is a fear of foreigners and their behavior. in this study, the students were quite passive and afraid when they were taught by a native speaker, but they were not afraid of his/her behavior.
according to the interview, the students found that the german lecturer was nice and humble. however, the students found that the pronunciation and the accent of the german lecturer were quite different and sounded so difficult to understand, unlike the indonesian lecturers’ pronunciation that was easy to understand. the students were also afraid to make grammar mistakes and if the german lecturer did not understand them and their culture as well. when the students spoke german in front of a german lecturer, the fear of making mistakes intensified because the students assumed that the german lecturer was the owner of the language (german) who would easily notice when the students were making mistakes. that is why they focused on language accuracy. like what cansrina (2015) says, that learners’ speaking anxiety occurs because they think too much about grammar. that circumstance provoked students not to focus on meaning, but to focus on their fear of making mistakes. in addition, based on the observation in the deutsche literature class, the german lecturer’s teaching methods were not quite interesting to the students. they only read the stories in the books. the lecturer asked them the content or the main idea of the stories and their opinion about them. when nobody answered the lecturer’s question, the lecturer explained it by himself. on the other day, the german lecturer showed a german poem, explained the difficult vocabulary, and then asked the students to interpret it. every student kept silent. then the lecturer explained and interpreted the poem by himself, again. such methods were boring and too difficult for the students. that is why they had no desire to speak in the classroom. from all those data, it could be concluded that the students suffered speaking anxiety when the lecturer was a native speaker. they were afraid of making mistakes and getting a negative evaluation from the owner of the language. besides, they were also afraid of foreigner. 
in addition, their speaking anxiety increased when the native speaker lecturer’s teaching methods were not interesting and too difficult for them.

conclusion and suggestions

plenary is an interaction form that was often used in the classroom since the teacher could control the communication and the learning process. it was also easy to do (for the teacher/lecturer), and the teacher could observe and even evaluate the students’ progress in learning a foreign language directly. on the other hand, there were particular circumstances in the plenary class that provoked learners’ speaking anxiety, such as an unfamiliar topic, students being unprepared for spontaneous talks, a silent class in which nobody has the courage to talk, the expressionless face of the lecturers, and students’ fear of native speaker lecturers.

to decrease the learners’ speaking anxiety, the lecturers need to use particular strategies. the topic discussed in the class should be familiar and interesting so that the learners are interested in speaking. to avoid a silent class, the lecturers should have a questioning strategy, such as asking easy questions, reformulating the question, and giving some examples to the learners. finally, the lecturers have to appreciate and help the learners by paying close attention with a calm and smiling face to avoid learners’ nervousness. being taught by a native speaker is fine and can even be more useful, but the native speaker lecturer has to find strategies that could decrease the learners’ fear of foreigners, e.g. getting closer to the learners’ lives in the learning context, being humble, and using interesting methods such as games that could enhance the learners’ motivation and interest, so that the atmosphere of the class will be fun.

references

ansari, m. s. (2015).
speaking anxiety in esl/efl classrooms: a holistic approach and practical study. international journal of educational investigations, 2(4), 38–46. arnaiz, p., & guillén, f. (2012). foreign language anxiety in a spanish university setting: interpersonal differences. revista de psicodidáctica, 17(1), 5–26. awan, r.-n., azher, m., anwar, m. n., & naz, a. (2010). an investigation of foreign language classroom anxiety and its relationship with students’ achievement. journal of college teaching & learning, 7(11), 33–40. barahmeh, m. (2013). measuring speaking anxiety among speech communication course students at the arab american university of jenin (aauj). european social sciences research journal, 1(3), 229–248. bashir, m., azeem, m., ashiq, & dogar, h. (2011). factor effecting students’ english speaking skills. british journal of arts and social sciences, 2(1), 34–50. basic, l. (2011). speaking anxiety: an obstacle to second language learning? gävle: university of gävle. cansrina, g. (2015). ursachen von sprechangst im daf-unterricht ergebnisse einer untersuchung von indonesischen studentinnen an der universitas padjadjaran. jurnal ilmiah bahasa, sastra, dan budaya jerman, 2, 168–186. creswell, j. w. (2013). research design: qualitative, quantitative, and mixed methods approaches (4th ed.). thousand oaks, ca: sage publications. department of german letters. (2016). katalog jurusan sastra jerman. malang: fakultas sastra universitas negeri malang. dulay, h. c., burt, m. k., & krashen, s. (1982). language two. new york, ny: oxford university press. el-sakka, s. m. f. (2016). self-regulated strategy instruction for developing speaking proficiency and reducing speaking anxiety of egyptian university students. english language teaching, 9(12), 22–33. https://doi.org/10.5539/elt.v9n12p22 fischer, s., & modena. (2005). sprechmotivation und sprechangst im daf-unterricht. german as a foreign language gfl, 3, 31–45. gemeinsamer europäischer referenzrahmen (ger). (2004).
gemeinsamer europäischer referenzrahmen für sprachen: kurzinformationen. langenscheidt: landesverlag, linz. glaboniat, m., müller, m., rusch, p., schmitz, h., & wertenschlag, l. (2005). profile deutsch. langenscheidt: klett. gnjidić, v. (2016). l2 english and l3 german vocabulary learning strategies. zagreb. horwitz, e. (2001). language anxiety and achievement. annual review of applied linguistics, 21, 112–126. https://doi.org/10.1017/s0267190501000071 horwitz, e. k., horwitz, m. b., & cope, j. (1986). foreign language classroom anxiety. the modern language journal, 70(2), 125–132. https://doi.org/10.1111/j.1540-4781.1986.tb05256.x inozemtseva, n. (2017). sprechangst internationaler studierender in der fremdsprache deutsch. essen: universität duisburg-essen fakultät für geisteswissenschaften institut für deutsch als zweit- und fremdsprache. kiper, h., meyer, h., & topsch, w. (2002). einführung in die schulpädagogik. oldenburg: cornelsen. lightbown, p. m., & spada, n. (2006). how languages are learned (3rd ed.). oxford: oxford university. marwan, a. (2007). investigating students’ foreign language anxiety. malaysian journal of elt research, 3(1), 37–55. mclaren, n., madrid, d., & bueno, a. (2006). tefl in secondary education. granada: universidad de granada. modern language division. (2001). common european framework of reference for language: learning, teaching, assessment. strasbourg: cambridge university press. muhaisen, m. s., & al-haq, f. a.-a. (2012). an investigation of the relationship between anxiety and foreign language learning among 2nd secondary students in second amman directorate of education. international journal of humanities and social science, 2(6), 226–240. nazir, m., bashir, s., & raja, z. b. (2014).
a study of second language speaking-anxiety among esl intermediate pakistani learners. international journal of english and education, 3(3), 216–229. nuhn, h.-e. (2000). die sozialformen des unterrichts. pädagogik (weinheim), 52(2), 10–13. saville-troike, m. (2006). introducing second language acquisition. cambridge: cambridge university press. sevinç, y., & backus, a. (2017). anxiety, language use and linguistic competence in an immigrant context: a vicious circle? international journal of bilingual education and bilingualism, 1–19. https://doi.org/10.1080/13670050.2017.1306021 shabani, d. b., carr, j. e., pabico, r. s., sala, a. p., lam, w. y., & oberg, t. l. (2013). the effects of functional analysis test sessions on subsequent rates of problem behavior in the natural environment. behavioral interventions, 28(1), 40–47. https://doi.org/10.1002/bin.1352 spielberger, c. d. (1983). manual for the state-trait anxiety inventory. palo alto, ca: consulting psychologists press. spradley, j. p. (1980). participant observation (1st ed.). new york, ny: holt, rinehart and winston. sugiyono. (2012). metode penelitian pendidikan: penelitian kuantitatif, kualitatif, dan r&d. bandung: alfabeta. tseng, s.-f. (2012). the factors cause language anxiety for esl/efl learners in learning speaking. whampoa an interdisciplinary journal, 63, 75–90. von wörde, r. (2003). students’ perspectives on foreign language anxiety. inquiry, 8(1), 1–15. wijayati, h. w. (1995). analisis data penelitian etnografi. forum penelitian kependidikan, 7(1), 32–47. wilson, j. t. s. (2006). anxiety in learning english as a foreign language: its association with student variables, with overall proficiency, and with performance on an oral test. (doctoral thesis). universidad de granada, granada, spain. zhiping, d., & paramasivam, s. (2013). anxiety of speaking english in class among international students in a malaysian university. international journal of education and research, 1(11), 1–16.
this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 6(1), 2020, 32-40
available online at: http://journal.uny.ac.id/index.php/reid

alternative item selection strategies for improving test security in computerized adaptive testing of the algorithm

*iwan suhardi
faculty of engineering, universitas negeri makassar
jl. daeng tata raya, parang tambung, mannuruki, tamalate, kota makassar, sulawesi selatan 90224, indonesia
*corresponding author. e-mail: iwan.suhardi@unm.ac.id

submitted: 5 march 2020 | revised: 21 april 2020 | accepted: 29 april 2020

abstract

one of the ability estimation methods widely applied in the computerized adaptive testing (cat) algorithm is maximum likelihood estimation (mle). however, the maximum likelihood method has the disadvantage of being unable to find a solution for the ability estimate when a test-taker’s scores do not have a pattern. if a test-taker gets either a score of 0 or a perfect score, then the test-taker’s ability is usually estimated using the step-size model. however, the step-size model often results in item exposure, where certain items appear more often than other items. this threatens the security of the test because items that appear often are easier to recognize. this study tries to provide an alternative strategy by modifying the step-size model and randomizing the calculation results of the information function obtained. the results of the study show that the alternative item selection strategies make more varied items appear, which improves the security of tests on the cat.

keywords: item selection strategy, item exposure, step-size, adaptive testing

how to cite: suhardi, i. (2020). alternative item selection strategies for improving test security in computerized adaptive testing of the algorithm. reid (research and evaluation in education), 6(1), 32-40.
doi:https://doi.org/10.21831/reid.v6i1.30508.

copyright © 2020, reid (research and evaluation in education), 6(1), 2020. issn: 2460-6995 (online). https://creativecommons.org/licenses/by-sa/4.0/deed.id

introduction

the development of item response theory (irt), together with faster and higher-capacity computer technology, has enabled the development of computerized adaptive testing (cat) (haryanto, 2013, pp. 49–50). it is called “computerized” testing because the testing process no longer uses paper and pencil, but rather a computer. it is called “adaptive” testing because the items that appear are selected and adjusted to the ability of each test-taker individually. cat is a test in which the items presented to test-takers are determined based on their answers (winarno, 2013, p. 577).

the efficiency of cat compared to conventional testing models has been supported by several studies. eignor et al. concluded that, at the same level of measurement precision, adaptive tests required less than half the test length of a computer-based test (cbt) (eignor, stocking, way, & steffen, 1993; grist, 1989, p. 2; rudner, 1998, p. 2). mcbride and martin concluded that, to achieve the same level of reliability, conventional testing required 2.57 times more items than adaptive testing (mcbride & martin, 1983).

the method widely used to estimate the ability of test-takers is maximum likelihood estimation (mle). the maximum likelihood method has the disadvantage of being unable to find a solution when there are test-takers who get extreme scores, where all answers are incorrect or all answers are correct. to overcome this problem, the step-size method is generally employed.
however, the application of the mle and step-size model often leads to item exposure, that is, the frequent appearance of certain items given to test-takers. although cat is more efficient and reliable, the security of this testing is not guaranteed because certain items appear repeatedly. such items are easily recognized because they appear frequently, especially at the beginning of the item sequence. therefore, the conventional cat algorithm needs to be modified to minimize the appearance of these easily recognizable items. the procedures commonly used in developing conventional cat algorithms are elaborated as follows (thissen, 1990).

starting cat

cat generally starts with the selection of items with a moderate difficulty level (mills, 1999, p. 123; santoso, 2010, p. 70; vispoel, 1999). a test-taker who answers incorrectly is then given items with an easy difficulty level. conversely, a test-taker who answers correctly is given items with a hard difficulty level.

estimating the ability of the test-takers

the method commonly used to estimate the ability of test-takers is mle (baker, 1992; birnbaum, 1968). the estimation of the ability of test-takers using the maximum likelihood method is calculated using the newton-raphson iterative procedure (hambleton & swaminathan, 1985, p. 83). the newton-raphson procedure first subtracts the ratio of the first derivative to the second derivative of the log-likelihood from the initial value of θ, which yields a new θ. the procedure is then repeated using the new θ to calculate a new derivative ratio. the estimated value of θ at iteration (m + 1) can be expressed using the iterative relation presented in formula (1). meanwhile, the error value is a correction factor formulated as in formula (2), where u equals 1 if the student’s answer is correct and u equals 0 if the student’s answer is incorrect.
besides, p is the probability of a participant answering the item correctly, which is obtained by formula (3).

[formula (1)]
[formula (2)]
[formula (3)]

the iteration process is stopped when the error value falls below ε, a very small limiting number. in this study, an ε value of 0.0001 was used.

one problem with the application of the mle method in adaptive testing is that it cannot find a solution when a test-taker gets an extreme score, which is either a score of 0 or a perfect score. to overcome this inability to estimate ability while the responses do not yet have a pattern, the step-size method can be used (dodd, 1990). with the step-size method, the test-taker’s ability level is raised or lowered by a certain constant, for example a step size of 0.5, as long as the test-taker’s responses do not have a pattern.

selection of the next item

after the test-taker’s ability is successfully estimated, the cat algorithm selects the next item. lord recommended the use of the maximum item information procedure to select the next item (lord, 1977). this method guarantees a highly accurate estimation of the ability of test-takers (eignor et al., 1993). the item that has the greatest information function value at the test-taker’s current ability estimate is selected and presented. the item information function is calculated at each ability level with the equation in formula (4) (hambleton, swaminathan, & rogers, 1991, p. 107).

[formula (4)]

formula (4) shows that the information value depends only on the item parameters (for example the values of b, a, and c for the 3pl model) and the level of ability (θ).
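the estimation procedure described above (newton-raphson mle with a step-size fallback) can be sketched in python as follows. this is only a minimal sketch under stated assumptions, not the author's implementation: it assumes the 1pl model named in the method section, a plain logistic response function, and a hypothetical `scale` constant (the text does not say whether a scaling constant such as 1.7 is used). the names `p_correct` and `estimate_theta` are illustrative.

```python
import math


def p_correct(theta, b, scale=1.0):
    """Assumed 1PL probability of a correct answer; `scale` is a hypothetical constant."""
    return 1.0 / (1.0 + math.exp(-scale * (theta - b)))


def estimate_theta(responses, difficulties, theta0=0.0, step=0.5,
                   eps=1e-4, max_iter=50, scale=1.0):
    """Estimate ability from 0/1 responses and the matching item difficulties."""
    if not responses:
        return theta0
    # No mixed pattern yet (all correct or all incorrect): the MLE has no finite
    # solution, so move the previous estimate up or down by a fixed step size.
    if all(u == 1 for u in responses):
        return theta0 + step
    if all(u == 0 for u in responses):
        return theta0 - step

    theta = theta0
    for _ in range(max_iter):
        probs = [p_correct(theta, b, scale) for b in difficulties]
        # First and second derivatives of the log-likelihood under the assumed 1PL model.
        first = scale * sum(u - p for u, p in zip(responses, probs))
        second = -(scale ** 2) * sum(p * (1.0 - p) for p in probs)
        h = first / second      # correction term (the "error value")
        theta -= h              # Newton-Raphson update
        if abs(h) < eps:        # stop once the correction is negligible
            break
    return theta
```

for example, `estimate_theta([1, 1], [0.0, 0.3])` takes the step-size branch and returns 0.5, while a mixed pattern such as `[1, 1, 0]` on three items of difficulty 0 is handled by the newton-raphson loop.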
for each ability level (θ), the information function contribution of each item in the question bank can therefore be calculated. the test information function is the sum of the information functions of the test items and is written as in formula (5). the test information function illustrates the accuracy of the test set in estimating different levels of ability. the greater the information at a given ability level, the more accurately that ability is estimated by the test. the standard error of measurement (sem) is expressed by the equation in formula (6) (hambleton & swaminathan, 1985, p. 95).

[formula (5)]
[formula (6)]

termination of cat

cat termination uses two criteria: equal measurement precision and a fixed number of items. the equal measurement precision criterion aims to produce test scores with the same measurement error level for each test-taker. the standard error of measurement is limited to 0.30, which is equivalent to a reliability of 91% on conventional tests (thissen, 1990). under this criterion, the number of items that test-takers must work on can vary. however, to avoid a test process that may not converge, a fixed-number-of-items criterion is also used in the cat termination rules by limiting the maximum number of items that appear, for example, to 20 items.

giving score to the ability of the test-takers

the score of the ability estimate of the test-taker derives from the conversion of the value θ that is obtained by formula (7).

[formula (7)]

in this study, the assessment results of two cat models were compared: the conventional cat model (which takes the item with the largest information value i(θ)) and the alternative cat model (which takes several of the largest i(θ) values and then chooses randomly among them to determine the i(θ) value that will be used).
after that, the alternative cat model applied the step-size method with response time as an additional variable for as long as the test takers' responses did not yet have a pattern. the assumption underlying the response time variable is that test-takers with a high level of ability will answer items correctly in a shorter time than those with a lower level of ability. lidia martinez compared groups of test-takers who took a test using cbt and found that the groups that spent the shortest average time responding to the initial test item obtained a higher average score (martinez, 2009). phil higgins' results showed that in cbt, the higher the item difficulty index, the more time test-takers needed to answer and review the items (higgins, 2009). these findings suggest that the time a test taker needs to answer items correctly is correlated with the estimate of the test taker's ability level.

method

this study used a research and development (r&d) approach. the study began with the development of a question bank of 265 items based on the 1-parameter logistic item response theory (1pl irt) model. the bank originally contained 290 items before validation. the item characteristics, in the form of difficulty parameters for the 265 items, were obtained by analyzing the responses from a cbt-administered test with the bilog-mg software. a summary of the validation statistics of the question bank developed and used in this study is presented in table 1.

table 1.
summary of item statistics on question bank

general information based on 1pl irt: number of items = 265 items
criteria of item difficulty index (b):
hard category = 40 items
moderate category = 128 items
easy category = 97 items

in the 1pl irt model, the probability of a person with a certain ability (θ) answering an item correctly depends only on the difficulty level of the item (b). in this study, the ability of test-takers was estimated with the mle and step-size methods. two adaptive test designs were developed: the conventional cat model and the alternative cat model. the development of the cat software followed the incremental model (pressman, 2001, pp. 35–36).

in the conventional cat model, the first item is chosen at random from the items of moderate difficulty, that is, those with b values from -0.5 to 0.5. the ability level is estimated using the mle method. however, while the test-takers' responses do not yet have a pattern, their ability is estimated using the step-size method with a value of 0.5. the next item selected is the item that has the greatest information function value at the current ability estimate.

the alternative cat model follows the same principles as the conventional cat model. the difference lies in the selection of the second and subsequent items, which randomizes over the largest information function values in the 5-4-3-2-1 pattern.
under the 5-4-3-2-1 pattern, the second item is drawn at random from the five items with the largest information function values, the third item from the four items with the largest values, the fourth item from the three items with the largest values, and the fifth item from the two items with the largest values. for the sixth and subsequent items, the selection reverts to the maximum information criterion, as in the conventional cat model.

to estimate the ability of test-takers in the alternative cat model while their responses do not yet have a pattern, a step-size method is used with the addition of the response time variable. the test taker's initial ability level is set at θ0 = 0, and the step size is a constant k (in this study, k = 0.5). if the test taker answers incorrectly, the estimated ability level becomes θ0 – k, that is, 0 – 0.5 = -0.5. if the test taker answers correctly, the estimated ability level becomes θ0 + xk = 0.5x, where x is a positive constant multiplier whose value depends on the response-time category of the correct answer. table 2 shows a simulation of the procedure for estimating the test taker's ability level with a step size adjusted by the response time factor.

test takers were given 300 seconds to respond to each item. if the test taker gives no response within 300 seconds, the response is declared incorrect and an easier item is displayed. in this study, the test is terminated once the sem value has reached 0.30.
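the response-time-based step-size update can be sketched as below. the multipliers and time bands are those listed in table 2; the function names, and the treatment of a timeout as an ordinary incorrect response, are assumptions made for illustration.

```python
K = 0.5           # step size used in the study
TIME_LIMIT = 300  # seconds allowed per item

# (upper bound in seconds, multiplier x) for correct answers, following table 2
TIME_BANDS = [(30, 1.4), (60, 1.3), (90, 1.2), (120, 1.1)]


def multiplier(seconds):
    """Return the multiplier x for a correct answer given in `seconds`."""
    for upper, x in TIME_BANDS:
        if seconds <= upper:
            return x
    return 1.0  # very slow: 121 seconds or more


def step_size_update(theta, correct, seconds, k=K):
    """One step-size update: theta + x*k if correct, theta - k otherwise."""
    if seconds > TIME_LIMIT:
        correct = False  # no response within the limit counts as incorrect
    if correct:
        return theta + multiplier(seconds) * k
    return theta - k


theta = 0.0
for seconds in (34, 40):  # two correct answers, both in the "fast" band
    theta = step_size_update(theta, True, seconds)
print(round(theta, 2))
```

two correct answers given in 34 and 40 seconds move θ from 0 to 0.65 and then to 1.3, while any incorrect answer lowers θ by 0.5 regardless of the time taken.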
an sem value of 0.30 is equivalent to a reliability of 0.91 in conventional tests such as paper-and-pencil tests (thissen, 1990).

table 2. estimation of ability of test-taker in the response-time-based step-size method

annotation: θ0 = initial ability = 0; k = step size = 0.5; x = constant multiplier; θi = θi-1 + xk (for a correct response); θi = θi-1 – k (for an incorrect response)

response time of correct answer | x | consecutive correct answers on items 1, 2, 3 (θ1, θ2, θ3) | consecutive incorrect answers on items 1, 2, 3 (θ1, θ2, θ3)
very fast (≤ 30 seconds) | 1.4 | 0.7, 1.4, 2.1 | -0.5, -1.0, -1.5
fast (31 to 60 seconds) | 1.3 | 0.65, 1.3, 1.95 | -0.5, -1.0, -1.5
moderate (61 to 90 seconds) | 1.2 | 0.6, 1.2, 1.8 | -0.5, -1.0, -1.5
slow (91 to 120 seconds) | 1.1 | 0.55, 1.1, 1.65 | -0.5, -1.0, -1.5
very slow (≥ 121 seconds) | 1 | 0.5, 1.0, 1.5 | -0.5, -1.0, -1.5

table 3. results of testing the conventional cat model while the responses do not yet have a pattern

item 1 was taken randomly from the items of moderate difficulty (-0.5 ≤ b ≤ 0.5).

item | responses always correct: value of θ | responses always correct: item number | responses always incorrect: value of θ | responses always incorrect: item number
item 2 | 0.5 | 275 | -0.5 | 209
item 3 | 1 | 081 | -1 | 164
item 4 | 1.5 | 002 | -1.5 | 113
item 5 | 2 | 091 | -2 | 044
item 6 | 2.5 | 115 | -2.5 | 237

findings and discussion

before the answers have a pattern, the conventional cat model uses the step-size method with an interval of 0.5. this means that if the test taker always responds with the correct answer, the second and subsequent items that appear are the items with the largest information function value at ability levels (θ) of 0.5, 1, 1.5, 2, 2.5, and 3, respectively.
meanwhile, for test-takers who always respond with the incorrect answer, the second and subsequent items that appear are the items with the largest information function value at θ of -0.5, -1, -1.5, -2, -2.5, and -3, respectively. the results obtained with the conventional cat model are summarized in table 3. the study found that the items with list numbers 209, 164, 113, 044, 237, 275, 081, 002, 091, and 115 appeared more often than other items. items that appear often degrade the security of the test in the conventional cat model, because they may already be recognized by the test takers.

testing of the conventional cat model showed that 128 items had a moderate difficulty index, indicated by a difficulty index value (b) ranging from -0.5 to +0.5. this meant that the first item was drawn at random from a pool of 128 items. this was in accordance with the criterion applied in the conventional cat model design algorithm, namely that the initially selected items were items with a moderate difficulty index (-0.5 to +0.5).

after the first item was displayed and answered by the test taker, the second item was presented using the step-size method. if students responded to the first item with the correct answer, the second item displayed was the item with maximum information at θ = 0.5. however, if students responded with an incorrect answer, the second item displayed was the item with maximum information at θ = -0.5. thus, in the conventional cat, the second item could only be one of two items. in this study, the second item presented was question item number 275 (if the answer was correct) or question item number 209 (if the answer was incorrect).
the frequent appearance of item number 275 and item number 209 threatened the security of the cat, because test takers could become familiar with those questions. another case that often arises is that students' answers do not yet have a pattern, so that the step-size method is used. for example, if students always answered correctly, the items that appeared were those with maximum information at θ = 0.5, 1.0, 1.5, 2.0, and 2.5, namely item number 275 as the second item, item number 081 as the third item, item number 002 as the fourth item, item number 091 as the fifth item, and item number 115 as the sixth item. however, if students always answered incorrectly, the items that appeared were those with maximum information at θ = -0.5, -1.0, -1.5, -2.0, and -2.5, namely item number 209 as the second item, item number 164 as the third item, item number 113 as the fourth item, item number 044 as the fifth item, and item number 237 as the sixth item. in the conventional cat model, if the responses of the test takers have the same pattern, then the items that appear will also be the same. this is what makes the security level of the conventional cat model suboptimal.

once students' responses had a pattern (that is, they contained both correct and incorrect answers), the subsequent items were fairly varied, because the first item had been drawn from a relatively large pool (128 items).
however, when the maximum information function value is used to search for items that correspond to the estimated ability of test-takers, it is very possible that many items in the bank are never presented, because they never attain the maximum information value at any estimated ability level.

the alternative solution proposed was to use the step-size method based on the student's response time in answering correctly. student responses were grouped according to the time students spent answering the questions correctly. in the step-size method based on response time, the step-size formula was given an additional constant multiplier that depends on the response time: the faster the students answered correctly, the greater the constant multiplier.

an additional solution proposed was to randomize over the largest information function values. whereas the conventional cat model determined the next item from the single maximum information function value, the alternative cat model determined the next item by drawing at random from the largest information function values in groups of 5–4–3–2–1. one of the results of testing the alternative cat model is presented in table 4, from which the calculation procedure for the alternative cat model can be followed. the table shows that the items that appear in the alternative cat model are more varied than those in the conventional cat model. the algorithmic procedure in the alternative cat model can be explained as follows.

the first item that appeared was item number 239 with b = -0.416

the first item appeared in accordance with the criterion that the first item is taken randomly from the items of moderate difficulty, whose b values range from -0.5 to 0.5. item number 239 fulfilled this criterion. because the student's answers did not yet have a pattern, the ability level was estimated with the step-size method with a step of 0.5. the student's answer was scored as correct (value 1).
the time spent on the first item was 34 seconds, so it fell into the fast category (between 31 and 60 seconds) with a multiplier factor of 1.3. thus, the value of θ was 0.5 x 1.3 = 0.65.

table 4. results of alternative cat model testing

no. | item | b | response | time (seconds) | θ | iif | tif | sem
1 | 239 | -0.416 | 1 | 34 | 0.65 | 0.7224 | 0.7224 | 1.18
2 | 182 | 0.662 | 1 | 40 | 1.3 | 0.7223 | 1.4447 | 0.83
3 | 192 | 1.32 | 0 | 8 | 1.1809 | 0.7225 | 2.1672 | 0.68
4 | 042 | 1.181 | 0 | 49 | 0.8579 | 0.7225 | 2.8897 | 0.59
5 | 132 | 0.861 | 1 | 20 | 1.3204 | 0.7225 | 3.6122 | 0.53
6 | 192 | 1.32 | 0 | 26 | 1.1161 | 0.7225 | 4.3347 | 0.48
7 | 152 | 1.119 | 1 | 10 | 1.5224 | 0.7225 | 5.0572 | 0.44
8 | 002 | 1.524 | 0 | 14 | 1.3846 | 0.7224 | 5.7796 | 0.42
9 | 161 | 1.396 | 0 | 7 | 1.2399 | 0.7224 | 6.502 | 0.39
10 | 013 | 1.251 | 1 | 9 | 1.5831 | 0.7225 | 7.2245 | 0.37
11 | 127 | 1.579 | 1 | 15 | 1.9486 | 0.7217 | 7.9462 | 0.35
12 | 060 | 1.987 | 0 | 12 | 1.8848 | 0.7223 | 8.6685 | 0.34
13 | 062 | 1.867 | 0 | 17 | 1.8118 | 0.7222 | 9.3907 | 0.33
14 | 163 | 1.787 | 0 | 19 | 1.7339 | 0.7214 | 10.1121 | 0.31
15 | 124 | 1.687 | 1 | 14 | 2.0656 | 0.7214 | 10.8335 | 0.3

the second item that appeared was item number 182 with b = 0.662

the second item appeared because it was among the five items with the largest information function values at θ = 0.65, in accordance with the 5–4–3–2–1 randomization principle. from the five alternative values of the largest information function (see table 5), item number 182 was selected randomly. the second item was answered correctly (so the response value was 1). because the student's answers did not yet have a pattern, the estimated ability level was determined with the step-size method with a step of 0.5. the item was completed in 40 seconds, which fell into the fast category (between 31 and 60 seconds) with a multiplier factor of 1.3. thus, the value of θ = 0.65 + (0.5 x 1.3) = 1.3.

table 5.
the five alternative values of the largest information function

rank | information function | item | b
1 | 0.722495 | 153 | 0.647
2 | 0.722492 | 274 | 0.654
3 | 0.722474 | 202 | 0.643
4 | 0.722425 | 182 | 0.662
5 | 0.721861 | 003 | 0.685

the third item that appeared was item number 192 with b = 1.32

the third item appeared because it was among the four items with the largest information function values at θ = 1.3, in accordance with the 5–4–3–2–1 randomization principle. of the four alternative values for the largest information function (see table 6), item number 192 was randomly selected. the third item was answered incorrectly (so the response value was 0). because the student's answers now contained both correct and incorrect responses and therefore had a pattern, the method for estimating the level of ability was mle. the value of θ obtained was 1.1809.

table 6. the four alternative values for the largest information function

rank | information function | item | b
1 | 0.722349 | 053 | 1.317
2 | 0.722291 | 192 | 1.32
3 | 0.72227 | 179 | 1.321
4 | 0.722091 | 145 | 1.272

the fourth item that appeared was item number 042 with b = 1.181

the fourth item appeared because it was among the three items with the largest information function values at θ = 1.1809, in accordance with the 5–4–3–2–1 randomization principle. of the three alternative values for the largest information function (see table 7), item number 042 was randomly selected. the fourth item was answered incorrectly (so the response value was 0). because the student's answers had a pattern, the method for estimating the level of ability was mle. the value of θ obtained was 0.8579.

table 7.
the three alternative values for the largest information function

rank | information function | item | b
1 | 0.7225 | 042 | 1.181
2 | 0.7225 | 057 | 1.181
3 | 0.722449 | 021 | 1.171

the fifth item that appeared was item number 132 with b = 0.861

the fifth item appeared because it was one of the two items with the largest information function values at θ = 0.8579, in accordance with the 5–4–3–2–1 randomization principle. of the two alternative values for the largest information function (see table 8), item number 132 was randomly selected. the fifth item was answered correctly (so the response value was 1). because the student's answers had a pattern, the method for estimating the level of ability was mle. the value of θ obtained was 1.3204.

table 8. the two alternative values for the largest information function

rank | information function | item | b
1 | 0.722495 | 132 | 0.861
2 | 0.722474 | 242 | 0.865

the sixth item that appeared was item number 192 with b = 1.32

the sixth item appeared because it had the single largest information function value at θ = 1.3204, in accordance with the 5–4–3–2–1 randomization principle. with only one alternative value for the largest information function (see table 9), item number 192 was selected. the sixth item was answered incorrectly (so the response value was 0). because the student's answers had a pattern, the method for estimating the level of ability was mle. the value of θ obtained was 1.1161.

table 9. the one largest information function value

rank | information function | item | b
1 | 0.7225 | 192 | 1.32

the subsequent items (i.e., the seventh to fifteenth items) were determined in the same way, by taking the item that had the largest information function at the current value of θ.
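the walkthrough above combines two mechanics: a candidate pool that shrinks in the 5–4–3–2–1 pattern, and an update of θ once the responses contain both correct and incorrect answers. the sketch below illustrates both under stated assumptions: a scaling constant d = 1.7, hypothetical function names, and the textbook newton-raphson step for the 1pl likelihood. applied once from θ = 0, that step gives approximately 1.1809, 0.8579, and 1.3204 for the response patterns after items 3, 4, and 5, which matches table 4. the source does not say how many iterations were run, so this is an observation about the tabulated values rather than a description of the author's software.

```python
import math
import random

D = 1.7  # assumed scaling constant


def p_correct(theta, b):
    """1PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * (theta - b)))


def item_information(theta, b):
    """1PL item information: D^2 * P * (1 - P)."""
    p = p_correct(theta, b)
    return D * D * p * (1.0 - p)


def pool_size(position):
    """Candidate pool for the item at `position`: 2nd -> 5, ..., 6th onward -> 1."""
    if position < 2:
        raise ValueError("the first item is drawn from the moderate band instead")
    return max(1, 7 - position)


def pick_item(theta, bank, position, rng=random):
    """Draw one item at random from the top-ranked candidates by information."""
    ranked = sorted(bank, key=lambda i: item_information(theta, bank[i]), reverse=True)
    return rng.choice(ranked[:pool_size(position)])


def newton_step(theta, responses, difficulties):
    """One Newton-Raphson update of theta for the 1PL log-likelihood."""
    ps = [p_correct(theta, b) for b in difficulties]
    score = sum(u - p for u, p in zip(responses, ps))
    return theta + score / (D * sum(p * (1.0 - p) for p in ps))


def mle(responses, difficulties, theta=0.0, eps=0.0001, max_iter=50):
    """Iterate until theta changes by less than eps; None for extreme scores."""
    if len(set(responses)) < 2:
        return None  # all correct or all incorrect: the MLE does not exist
    for _ in range(max_iter):
        new_theta = newton_step(theta, responses, difficulties)
        if abs(new_theta - theta) < eps:
            return new_theta
        theta = new_theta
    return theta


bs = [-0.416, 0.662, 1.32]  # difficulties of items 1-3 in table 4
print(round(newton_step(0.0, [1, 1, 0], bs), 4))
```

the sketch does not exclude items that were already administered; table 4 lists item 192 at positions 3 and 6, and the source does not state whether repeats are filtered.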
the fifteenth item became the last item because the termination criterion had been met (sem = 0.3). the final ability estimate was converted to a numerical score of 85.

the alternative cat model was able to overcome a fundamental shortcoming of the conventional cat model, namely the frequent appearance of certain items. table 3 shows that in the conventional cat model the same items appear repeatedly, especially in the initial stages of cat execution. table 4, in contrast, shows that the alternative cat model allows many variations in the items presented, even when the patterns of students' answers are the same. the greater variety of items in the alternative cat model reduces the level of item exposure and therefore makes the cat more secure. the items that appeared in the alternative cat model had difficulty indices that were not much different from those that appeared in the conventional cat model, so the alternative model did not increase the test length or reduce the efficiency of estimating the ability of the test takers.

conclusion

from the results of this study, it can be concluded that the alternative cat model was able to decrease the level of item exposure on the cat, thereby increasing the security of the cat without increasing the test length or reducing the efficiency of the cat. the strategy adopted by the alternative cat model was to select items using the step-size method based on response time and randomization over the largest information function values with the 5–4–3–2–1 criterion, applying maximum likelihood estimation (mle) to estimate the ability level of the test takers. the strategy was able to present more varied items whose difficulty indices were still not much different for test takers with the same response patterns.

references

baker, f. b. (1992). item response theory: parameter estimation techniques.
new york, ny: marcel dekker.
birnbaum, a. (1968). some latent trait models and their uses in inferring an examinee's ability. in f. m. lord & m. r. novick (eds.), statistical theories of mental test scores (pp. 397–479). reading, ma: addison-wesley.
dodd, b. g. (1990). the effect of item selection procedure and stepsize on computerized adaptive attitude measurement using the rating scale model. applied psychological measurement, 14(4), 355–366. https://doi.org/10.1177/014662169001400403
eignor, d. r., stocking, m. l., way, w. d., & steffen, m. (1993). case studies in computer adaptive test design through simulation. https://doi.org/10.1002/j.2333-8504.1993.tb01567.x
grist, s. (1989). computerized adaptive tests. in eric digest no. 107. retrieved from https://files.eric.ed.gov/fulltext/ed315425.pdf
hambleton, r. k., & swaminathan, h. (1985). item response theory: principles and applications. boston, ma: kluwer nijhoff.
hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. newbury park, ca: sage publications.
haryanto, h. (2013). pengembangan computerized adaptive testing (cat) dengan algoritma logika fuzzy. jurnal penelitian dan evaluasi pendidikan, 15(1), 47–70. https://doi.org/10.21831/pep.v15i1.1087
higgins, p. (2009). candidate measured ability and use of time. retrieved from https://www.rasch.org/mra/mra-1009.htm
lord, f. m. (1977). a broad-range tailored test of verbal ability. applied psychological measurement, 1(1), 95–100. https://doi.org/10.1177/014662167700100115
martinez, l. (2009). time usage and candidate performance. retrieved from http://www.rasch.org/mra/mra-06-09.htm
mcbride, j. r., & martin, j. t. (1983). reliability and validity of adaptive ability tests in a military setting. in d. j.
weiss (ed.), new horizons in testing: latent trait test theory and computerized adaptive testing (pp. 224–236). new york, ny: academic press.
mills, c. n. (1999). development and introduction of a computer adaptive graduate record examinations general test. in f. drasgow & j. b. olson-buchanan (eds.), innovations in computerized assessment (pp. 117–135). mahwah, nj: lawrence erlbaum associates.
pressman, r. s. (2001). software engineering: a practitioner's approach (5th ed.). new york, ny: mcgraw-hill higher education.
rudner, l. m. (1998). an on-line, interactive, computer adaptive testing tutorial. retrieved from http://edres.org/scripts/cat
santoso, a. (2010). pengembangan computerized adaptive testing untuk mengukur hasil belajar mahasiswa universitas terbuka. jurnal penelitian dan evaluasi pendidikan, 14(1), 62–83. https://doi.org/10.21831/pep.v14i1.1976
thissen, d. (1990). reliability and measurement precision. in h. wainer, n. j. dorans, r. flaugher, b. f. green, r. j. mislevy, l. steinberg, & d. thissen (eds.), computerized adaptive testing: a primer (2nd ed., pp. 161–186). hillsdale, nj: erlbaum.
vispoel, w. p. (1999). creating computerized adaptive tests of music aptitude: problems, solutions, and future directions. in f. drasgow & j. b. olson-buchanan (eds.), innovations in computerized assessment (pp. 151–176). mahwah, nj: lawrence erlbaum associates.
winarno, w. (2013). pengembangan computerized adaptive testing (cat) menggunakan metode pohon segitiga keputusan. jurnal penelitian dan evaluasi pendidikan, 16(2), 574–592. https://doi.org/10.21831/pep.v16i2.1132

this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 6(1), 2020, 10-19
available online at: http://journal.uny.ac.id/index.php/reid

curriculum evaluation of french learning in senior high school

*1irma nur af'idah; 2amat jaedun
1graduate school, universitas negeri yogyakarta jl. colombo no.
1, karangmalang, depok, sleman, yogyakarta 55281, indonesia 2faculty of engineering, universitas negeri yogyakarta jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia *corresponding author. e-mail: irmanurafidah15.2017@student.uny.ac.id submitted: 5 november 2019 | revised: 20 december 2019 | accepted: 24 february 2020 abstract the research aims to describe the implementation of french language learning in high schools of sleman regency viewed from the components of planning, implementation, and results. this evaluation research uses a quantitative descriptive approach with a countenance model from stake. respondents in this research were teachers and students at three senior high schools. data collection techniques used in this study include research lesson plans, questionnaires, and documentation. the results of this research indicate that: (1) in the planning component, the quality of lesson plan preparation is very good and needs to be maintained because in the lesson plan review, the results obtained are 88.9% and the teacher questionnaire results of 26.6 are included in the excellent category; (2) in the implementation component, it has good results with the acquisition of a total score of 77 and a student questionnaire of 66.19; (3) in the component of the results, good results are obtained with an average value of students that is 86.38 and the results of the teacher questionnaire of 65.7 which is above 61 so that it falls into the good category. student scores are obtained from the results of the middle semester assessment and teacher questionnaire. keywords: curriculum evaluation, countenance model, french learning how to cite: af'idah, i., & jaedun, a. (2020). curriculum evaluation of french learning in senior high school. reid (research and evaluation in education), 6(1), 10-19. doi:https://doi.org/10.21831/reid.v6i1.28006. 
introduction

evaluation in education is very broad, since it includes various activities such as student assessment, measurement, testing, program evaluation, school personnel evaluation, school accreditation, and curriculum evaluation (anh, 2018, pp. 140–141). evaluation has an important role in every research as well as in academic studies. moreover, an evaluation must address the values that underlie the curriculum, pedagogy, and results, which are the main focus of educational values (lai & kushner, 2013, p. 24). evaluation involves research activities conducted by an evaluator to provide information on the subject and object of the evaluation (johnson & christensen, 2000, p. 7; mccormick & james, 2019, p. 13). evaluation is present whenever an educational process is carried out by the school and whenever the teacher shares the parents' task of educating (hasan, 2009, p. 3). teachers evaluate students to find out how well the students understand the subject matter that has been studied, in order to assess, correct, and improve a program systematically (tyler, 2013, p. 10). from this definition of evaluation follows the definition of curriculum evaluation: scientific research conducted systematically to improve the curriculum applied in education.

https://creativecommons.org/licenses/by-sa/4.0/deed.id https://doi.org/10.21831/reid.v6i1.28006 irma nur af'idah & amat jaedun copyright © 2020, reid (research and evaluation in education), 6(1), 2020 issn: 2460-6995 (online)

the curriculum comprises activities and learning experiences, as well as everything that affects the personal formation of students, both at school and outside of school, for which the school is responsible in achieving educational goals (arifin, 2011, p. 5). the curriculum as a learning plan is a facility in an educational program that serves as a guide and tool in teaching students.
the curriculum aims to achieve a field in a subject that adheres to the categorization in education (hamalik, 2008). these objectives make the curriculum a benchmark and foundation for implementing learning in schools. curriculum implementation is the operationalization of a curriculum concept that still exists only in written form into the actual form of learning, where learning in the classroom becomes the place to implement and test the curriculum to ensure that its implementation in schools goes well (majid & rochman, 2014, p. 23). one of the problems of the education system in indonesia is the frequent change of curriculum. the curriculum has been developed as a curriculum based on character and competence in order to produce a generation that is competent, innovative, productive, creative, and of good character.

the aforementioned description shows the need for an evaluation of french language curriculum implementation in high school to obtain information about the readiness, implementation, and results of the french language curriculum. readiness includes the readiness of books, teachers, infrastructure, and the condition of lesson plans in each school. the implementation includes the process and evaluation of learning french at school, and the implementation results are the learning outcomes of students. the researchers conduct this research on the implementation of the french language curriculum because french is a cross-interest subject that is attracting students' interest.
moreover, in the french language curriculum implementation, teachers experience constraints in making french language learning plans that are easy for students to understand in terms of material, readiness, and implementation. therefore, this study is focused on evaluating the implementation of the french language curriculum in senior high school. the curriculum is a system usually more contained in written form (hasan, 2009, p. 32). this dimension gains a lot of attention because its form can be seen and easily read and analyzed (arifin, 2011, p. 9). thus, the preparation of the curriculum must be in accordance with the components, rules, and structure in the curriculum. as a basic reference in the implementation of education, the curriculum plays an important and strategic role in the progress of a program, especially in the field of education (kurniawan, winarno, & dwiyogo, 2018). the components in the compiled curriculum must contain planning in the learning process and the development of students in the objectives, content, and teaching materials, which must be in accordance with educational objectives (arifin, 2011, pp. 6–7; dündar & merç, 2017, p. 137), so that, later, in the development and implementation of the curriculum in each subject, it will be following the rules and systems in the educational curriculum. it will be realized that all students will achieve academic success only if the curriculum is brought in line with the leadership skills and the education institution implements the right curriculum (sorenson, goldsmith, méndez, & maxwell, 2011, p. 5), but there are still many data found in the field that plans exist in the curriculum is still not specific and too general, so the curriculum implementers themselves still cannot understand the curriculum well. evaluation and curriculum have characteristics and roles in every education and social research (hasan, 2009, p. 32), so the two components do have quite dominant relationships. 
curriculum evaluation in the broad sense is not only about activities in the classroom but is also a comprehensive assessment process that involves all educational components, such as students, teachers, models and methods of teaching, administration, and facilities (ismail, 2015, p. 15). the purpose of the curriculum itself is to introduce academic discipline to students so that they can use their knowledge with discipline and wisdom (schiro, 2017, p. 25). the curriculum in indonesian education has undergone significant changes and developments, which are intended to enable the curriculum to improve the learning implemented in schools. the focus on the curriculum is to ensure that the program achieves the mission and goals it has set. the curriculum implemented in high schools in this decade is the 2013 curriculum.

the change from the education unit level curriculum or kurikulum tingkat satuan pendidikan (ktsp) to the 2013 curriculum in french should enable students to understand the basics of french, especially because french has become a cross-interest subject. however, the three schools that were observed actually experienced difficulties because the material to be studied is more complex. the teachers find it difficult to prepare learning material that is suited to the students' ability in learning french. the new challenges in the 2013 curriculum are an important task that must be resolved by the teacher so that students are able to understand french lessons well.
in addition, before teaching, a teacher must prepare approaches, strategies, techniques, and learning procedures so that the teaching runs effectively (dewantara, 2017, p. 20). the observations that have been made reveal problems regarding planning, learning, and student assessment results in teaching french. therefore, an evaluation of the 2013 curriculum in the french subject is needed to check its fit with the objectives of the 2013 curriculum. after the curriculum is evaluated, the next step is to find out how the improved curriculum is implemented: whether it has already led to an improvement in learning and the quality of education, or has not yet been carried out to the maximum. based on this background, the evaluation carried out in this research uses the stake countenance model, which covers planning, implementation, and results. this research focuses on the preparation of the learning implementation plan, the learning implementation, and the results obtained by students, so that an accurate evaluation of the implementation of french learning can be made. the research problem is formulated as follows: how is the curriculum of the french subject in high schools in sleman regency implemented in terms of planning, implementation, and learning outcomes? the purpose of this research is to describe the implementation of french language learning in high schools in terms of the planning, implementation, and results components.

method

the method of this research was curriculum evaluation. evaluation is a main part of the world of education because the curriculum keeps developing and changing according to the context of its era (hasan, 2009, p. 41). curriculum evaluation in this research was carried out on the implementation of the french subject curriculum in high school.
the evaluation model used was the stake countenance model. this model emphasizes two main things: describing and considering (judging). they are obtained through the evaluation stages: (1) the planning (antecedent) stage, which covers learning planning by looking at the readiness of learning in the preparation of lesson plans; (2) the implementation/process (transaction) stage, which covers the implementation of french learning in the preliminary, core, and closing activities; and (3) the results and assessment (outcomes) stage, which covers the measurement of french learning assessment results in the aspects of attitude, knowledge, and skills, as well as the suitability of the techniques, instruments, and follow-up used by the teacher in french learning. the countenance evaluation model is characterized by evaluating the interrelation (contingency) at each stage and the congruence between planning, implementation, and results in order to reach the consideration stage. consideration is given with reference to standards/criteria. the planning, implementation, and learning outcomes in this research are based on the regulation of the minister of education and culture of the republic of indonesia no. 22 of 2016. in addition to the regulation of the minister of education and culture no. 103 of 2014, the results also refer to the regulation of the minister of education and culture no. 4 of 2018 and the minimum completeness criteria or kriteria ketuntasan minimal (kkm). the sources of data/research respondents were students in class x of senior high school. the sampling technique used was a random sampling technique.
random sampling is a method of drawing members of the population at random using a random number table or generator (sarjono & julianita, 2011, p. 23). in this research, random sampling was conducted by selecting two classes in each school. the data were collected through lesson plan review, observations, questionnaires, and documentation. the questionnaire is the main instrument used in data collection. a likert scale is a scale used to measure the attitudes, opinions, and perceptions of a person or group of people towards an event or social situation: the variable to be measured is translated into indicators, and the indicators are then used as a starting point for compiling question/statement items (sarjono & julianita, 2011, p. 6). the questionnaire used was a likert scale with a rating scale of 1-4. there were two types of questionnaire respondents, namely three teachers and 145 students from three schools. the lesson plan review was used to find out the planning components that exist in implementing french learning in the three high schools in sleman regency where the research was conducted. the documentation used in this research is the students' grades, which are used in the results component. this research used content validity and construct validity. the content validity was estimated with aiken's validity index, and the construct validity used exploratory factor analysis with the help of spss. the content validity was assessed by five experts (expert judgment), including three lecturers who were experts in the field of language learning. of the 117 items developed from 71 indicators, one statement failed because it was not relevant, so 116 items were tested.
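the aiken index mentioned above can be illustrated with a short sketch. it uses the standard formula v = Σ(r − lo) / (n(c − 1)), where r is each expert's rating, lo is the lowest possible rating, n is the number of experts, and c is the number of rating categories. the function name, the 1-4 expert rating scale, and the example ratings are assumptions made for illustration only; the article does not report the experts' rating scale or the raw ratings.

```python
def aikens_v(ratings, lo, hi):
    """Aiken's V for a single item.

    ratings: one rating per expert (hypothetical example data below)
    lo, hi:  lowest and highest rating on the experts' scale (assumed)
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < lo or r > hi for r in ratings):
        raise ValueError("every rating must lie between lo and hi")
    # s = r - lo for each expert; hi - lo equals c - 1 rating steps
    total = sum(r - lo for r in ratings)
    return total / (len(ratings) * (hi - lo))


# Hypothetical ratings from five experts on an assumed 1-4 scale.
example = [4, 4, 3, 4, 3]
print(round(aikens_v(example, lo=1, hi=4), 2))  # 13/15, about 0.87
```

an item rated at the top of the scale by every expert gets v = 1, and an item rated at the bottom by every expert gets v = 0, so values close to 1 indicate stronger agreement that the item is relevant.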
the 116 validated items were used in the trials and in the research conducted on three teachers, 145 students, and three lesson plans. the construct validity in this research was established using factor analysis. factor analysis is a statistical method that is commonly used in the development of measuring tools to analyze the relationship between variables (azwar, 2018, p. 121). thus, factor analysis addresses the relationship between and the validity of the items in the instrument. the exploratory factor analysis (efa) procedure helps test developers recognize and identify the factors that form a construct by finding the largest score variance with the smallest number of factors, expressed as an eigenvalue > 1. construct validity according to nunnally and fernandes (retnawati, 2014, pp. 2–3) is validity which shows the extent to which the instrument reveals the theoretical ability or construct that it is intended to measure. construct validity is related to the provenance of the measurement result score. it can be established by testing that the construct of the instrument does exist and is empirically confirmed. the validity test used the kmo: the instrument is said to be valid if the kmo value is greater than 0.5 and the significance is less than 5%. all values on the diagonal of the anti-image correlation matrix must be greater than 0.5; an item with a value of less than 0.5 is removed (priyatno, 2009). factor analysis is used to test the correlation between variables. to test the correlation between variables, bartlett's test of sphericity and the kaiser-meyer-olkin (kmo) test were used. if the results are significant with a kmo value above 0.5, then there is a significant correlation among the variables.
the construct validity analysis in this research was applied to the student questionnaire, with the result that five statements were dropped from the 26 existing items. the final results of the kmo and bartlett's test and the rotated component matrix are as follows. the kmo measures sampling adequacy, that is, whether the data that have been collected are sufficient to be factored. this value compares the magnitude of the observed correlation coefficients with that of the partial correlation coefficients; a small kmo value indicates that the correlation between pairs of variables cannot be explained by other variables. if the sum of the squares of the partial correlation coefficients among all pairs of variables is small compared to the sum of the squares of the correlation coefficients, the kmo value will be close to 1. the kmo value is considered sufficient if it is more than 0.5. based on table 1, the kmo from the spss calculation is 0.815, which is greater than 0.5, and the significance of bartlett's test is 0.000, so the result is said to be good. it can therefore be said that sampling adequacy has been met and the data can be used for further analysis.
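the verbal description of the kmo above can be written compactly. the following is a sketch in standard notation; the symbols r_ij (observed correlation between items i and j), a_ij (partial correlation between items i and j), and p (number of items) are introduced here for illustration and do not appear in the original text.

```latex
\[
\mathrm{KMO} \;=\;
\frac{\sum_{i \neq j} r_{ij}^{2}}
     {\sum_{i \neq j} r_{ij}^{2} \;+\; \sum_{i \neq j} a_{ij}^{2}}
\]
% When the squared partial correlations a_{ij}^2 are small relative to
% the squared correlations r_{ij}^2, the ratio approaches 1.

\[
\mathit{df}_{\text{Bartlett}} \;=\; \frac{p\,(p-1)}{2}
\;=\; \frac{21 \cdot 20}{2} \;=\; 210
\]
```

the second expression is the usual degrees of freedom for bartlett's test of sphericity; with the 21 retained items it gives 210, which agrees with the df reported in table 1.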
the rotated component matrix shows that six factors underlie the 21 items: component 1, apperception and preparing a learning plan, covers items 1, 2, 3, 4; component 2, core activities, covers items 5, 10, 15, 16, 17, 18, 19; component 3, mastering the material taught, covers item 14; component 4, the use of media in learning, covers items 5, 6, 7; component 5, asking about the understanding and involvement of students, covers items 20, 21; and component 6, ending the learning, covers items 23, 24, 25, 26. instrument reliability in this study was estimated with the alpha coefficient. the reliability analysis was run in spss version 22.0 for windows. to find the alpha coefficient, the cronbach's alpha value for the reliability of all items in one variable was observed. reliability is said to be good if the coefficient is more than 0.7 (mardapi, 2017, p. 25). the reliability test results in this study were 0.77 and 1, which are more than 0.7. this shows that the reliability of the student questionnaire is good, so the questionnaire can be used to examine the implementation of the curriculum in french language learning in high school. the analysis technique used in this study is descriptive statistical analysis with the spss program through a quantitative approach. the scores are also categorized on the basis of the normal distribution (azwar, 2018, p. 148) into four categories: not good, poor, good, and very good. the category boundaries are defined from the average overall score and the standard deviation of the overall score, and they are compared with the score achieved by students.
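the four-category scheme can be sketched as a simple lookup. the cut-off values below are the ones the text reports for each questionnaire, and the example scores are the ones reported in the findings. the function and dictionary names are hypothetical, the label "poor" for the second category follows the wording the article uses for the results category, and placing a score that falls exactly on a cut-off in the higher category is an assumption, because the original boundaries are not stated consistently.

```python
from bisect import bisect_right

# Category labels from lowest to highest.
LABELS = ("not good", "poor", "good", "very good")

# Lower bounds of "poor", "good", and "very good" as reported in the text.
# The dictionary keys are illustrative names, not the authors' terms.
CUTOFFS = {
    "teacher_planning": (12.25, 17.75, 22.75),
    "teacher_implementation": (41.0, 52.0, 63.0),
    "student_implementation": (37.0, 52.75, 68.25),
    "teacher_results": (34.0, 42.5, 51.0),
}


def categorize(score, cutoffs):
    """Return the category label for a questionnaire score.

    Assumption: a score exactly equal to a cut-off is placed in the
    higher category.
    """
    return LABELS[bisect_right(cutoffs, score)]


# Scores reported in the findings section of the article.
reported = {
    "teacher_planning": 25.7,
    "teacher_implementation": 77,
    "student_implementation": 66.19,
    "teacher_results": 65.7,
}
for name, score in reported.items():
    print(name, score, categorize(score, CUTOFFS[name]))
```

with these cut-offs, the sketch places the teacher planning, teacher implementation, and teacher results scores in the very good category and the student implementation score in the good category, which matches the categories stated in the findings.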
in the planning category of the teacher questionnaire, a score of x < 12.25 is said to be not good; a score between 12.25 and 17.74 is said to be poor; a score between 17.75 and 22.75 is said to be good; and a score of more than 22.75 is said to be very good. in the implementation category of the teacher questionnaire, a score of x < 41 is said to be not good; a score between 41 and 52.00 is said to be poor; a score between 52.01 and 63.00 is said to be good; and a score of more than 63.01 is said to be very good. in the implementation category of the student questionnaire, a score of x < 37 is said to be not good; a score between 37 and 52.75 is said to be poor; a score between 52.76 and 68.25 is said to be good; and a score of more than 68.25 is said to be very good. in the results category of the teacher questionnaire, a score of x < 34 is said to be not good; a score between 34 and 42.5 is said to be poor; a score between 42.6 and 51 is said to be good; and a score of more than 51 is said to be very good.

table 1. kmo and bartlett’s test
kaiser-meyer-olkin measure of sampling adequacy        .815
bartlett’s test of sphericity    approx. chi-square    1091.993
                                 df                    210
                                 sig.                  .000

findings and discussion

planning in learning is done by using the lesson plan.
the lesson plan is an important component that the teacher must prepare before carrying out learning, because it describes how the teacher will carry out the lesson: opening the lesson, delivering material, using media, and using assessment instruments that suit the method and are given to students. in this research, three lesson plans were analyzed with 37 items in the lesson plan review instrument using a score of 0.0. the lesson plan review is the main instrument in planning (antecedent), and its score is calculated with formula (1). the results were analyzed with the planning criteria (arikunto, 2018, p. 35) presented in table 2.

score = (obtained score / maximum score) × 100% ………. (1)

table 2. lesson plan results
percentage      result
80 – 100%       88.9%
66 – 79%
56 – 65%
40 – 55%
< 40%

table 2 is based on the evaluation standard criteria by arikunto (2018). descriptive percentages are used to facilitate the analysis of the evaluation of the french language curriculum in high schools based on established standards. the results are then interpreted and presented with numbers at the description stage, without proceeding to the generalization stage. quantitative data analysis with descriptive techniques is used to process the questionnaire data for the evaluation. the lesson plan analysis yields a percentage score of 88.9%. compared with the planning criteria, this score falls within the 80–100% band, so the preparation of the lesson plans has good results. it can be said that the lesson plans of the three schools have a very good suitability of 88.9%. in this planning component, besides the lesson plan, there is also a teacher questionnaire instrument consisting of seven statement items with the following categorization.
the result obtained from the teacher questionnaire in the planning components of preparing lesson plans, designing learning, and evaluating french learning is 25.7, so it falls into the very good category. in the implementation (transaction) component of this research, 45 items of the teacher questionnaire and 26 items of the student questionnaire were used. the student questionnaire was filled in by 145 respondents, namely students from three schools, depok 1 high school, kalasan 1 high school, and angkasa adisucipto high school, located in sleman, yogyakarta. based on the results of the research as a whole, the implementation of french language learning in the three schools is included in the good category. of the 26 statements of the student questionnaire on the implementation of learning, five statements were dropped after the exploratory factor analysis (efa) test using spss. the student questionnaire on the implementation of french learning in senior high school obtained an average value of 66.19, so it is included in the good category. the next aspect is the presentation of student questionnaire results in the implementation component of french learning. in addition to the student questionnaire, the implementation component also uses a teacher questionnaire instrument of 21 statements with scores ranging from 1-4, like the student questionnaire. the teacher questionnaire yields a value of 77. this result is above 63, so it is included in the very good category. the results component in this research was examined using the mid-semester assessment, teacher questionnaires, and interviews.
the results component was examined by looking at the behavior and assessment results obtained by the teacher, with a total of 17 statement items. the results component of the teacher questionnaire yields a value of 65.7, so it is included in the very good category. planning has an important role in evaluating the implementation of the curriculum because it shows how lesson plans are prepared and how teachers respond in implementing the learning given to students, in this case in french as a cross-interest subject. the planning component is measured using the lesson plan review instrument and the teacher's questionnaire. the review of the lesson plans shows that their preparation is very good, reaching 88.9%. in the teacher questionnaire, the planning component achieved 93.52% success, so the planning is included in the excellent category. in the french lesson plans, all components meet the requirements for lesson plan preparation in line with the syllabus and the ministry of education and culture. based on the research that has been done, the planning component measured with the main instrument, namely the lesson plan review, is supported by a teacher questionnaire that obtained a very good result. these results are supported by the research of abrory and kartowagiran (2014), who found that planning in preparing lesson plans was in the good category even though the 2013 curriculum had just been applied. they are also supported by the study by lukum (2015), in which the learning plans were in the good category, showing that teachers are able to compile lesson plans well.
another relevant study was conducted by dewantara (2017), which shows that the planning of indonesian learning has been done well and is consistent with the standard policy process being applied. the main instrument used in this research is a questionnaire, namely the teacher questionnaire and the student questionnaire. interviews and observations were used as supporting instruments. in curriculum evaluation, implementation provides information as an input in decision making (hasan, 2009, p. 42), so the implementation must meet the criteria in order to achieve the results and objectives set. the implementation of french language learning in the three high schools in sleman regency obtained good results. a related study is that of prasojo, kande, and mukminin (2018), who state that the implementation of learning is still not in accordance with the process standard because it is hampered by the process of motivating learning, learning media, and the identification of students' abilities, even though the questionnaire results were already good. thus, a deeper review is needed. another study relevant to this research is that of kurniawan et al. (2018), who found that the implementation component is good but some components still do not meet the qualifications of the process standard. one of the main keys to success in the implementation of learning is the qualification of the educator. hence, educators who already have a lot of teaching experience still need self-development as lifelong learners and need to open themselves to various educational innovations that can support learning (sumual & ali, 2017, p. 348). these studies indicate that many factors affect achievement in the implementation of learning, so all indicators must be reviewed and considered.
the outcome component of this research was examined using the teacher questionnaire, with interviews as a supporting instrument. the teacher questionnaire shows that the preparation, reporting, remedial, and follow-up have been done well by the teacher, judging from the grades obtained by students. in this case, the teacher is greatly helped by the assessment criteria determined in the curriculum, namely the assessment of knowledge, attitude assessment, and skills assessment. guidelines regarding assessment in the 2013 curriculum for high schools are contained in the regulation of the minister of education and culture no. 4 of 2018. to find out the results obtained by students, teachers conduct assessments and plan follow-up activities in the form of remedial learning, enrichment programs, counseling services, and/or group and individual assignments in line with student learning outcomes. assessment of learning outcomes by educators is inseparable from the learning process. therefore, the assessment of learning outcomes shows the ability of teachers as professionals. according to the regulation of the minister of education and culture no. 4 of 2018, the purposes of conducting an assessment are to: (1) determine the level of mastery of competencies in attitudes, knowledge, and skills that have and have not been mastered by a student or group of students, to be improved in remedial learning and enrichment programs; (2) establish the mastery of learners' competencies within a certain period of time, i.e., daily, midterm, one semester, one year, and the period of study in the education unit; (3) establish improvement or enrichment programs based on competency mastery levels for those identified as learners who are slow or fast in learning and achieving learning outcomes; and (4) improve the learning process in the next semester. in terms of the output component, the assessment in learning french as a cross-interest lesson obtains good results, judging from the midterm examination that has been conducted. the scores gained by students vary because of their different characters. the average score obtained is 86.38, which is above the minimum completeness criteria or kriteria ketuntasan minimal (kkm) of 75 for this subject, so the learning carried out is said to be good. however, there are still students who have not yet met the kkm in the mid-semester assessment because of their diverse characteristics and abilities, even though the teacher has given special treatment. a related study for this component is that of lukum (2015), in which the students' assessment results reached 65% and were included in the sufficient category, but the kkm was still not achieved because the planning and implementation did not match the process standard. the implementation of learning needs to be improved and adjusted again to the process standard. moreover, this study still has shortcomings in the assessment because the results show that there are still students who have not reached the kkm even though the teacher has varied the learning to ensure that students can understand the subject matter well.
other research in line with this study is that of abrory and kartowagiran (2014), who found that the quality of student outcomes had not yet reached maximum results because the values of attitudes, knowledge, and skills did not yet conform to the planned targets. it can be concluded that when the learning process has not yet reached perfect results, each assessment carried out in learning needs to be developed further, as in this study. another previous study relevant to this research is that of sumual and ali (2017), who found that a learning outcome is very much determined by the teacher's experience and way of teaching and giving direction to students. the results in this study are included in the good category because the teachers are competent in teaching french. compared to other studies, this research is distinctive in that it examines planning through an adapted french language learning plan and examines student learning outcomes through the midterm examination in order to evaluate clearly the implementation of the french language curriculum in high school. the implementation of the 2013 curriculum is expected to create interesting and meaningful learning for students, especially in french as a cross-interest lesson that students are interested in. for this reason, in implementing the 2013 curriculum in the french language, schools need to continue to encourage the realization of national standards.
conclusion

based on the evaluation of the implementation of the french subject curriculum conducted at senior high schools, the following conclusions are drawn. the planning (antecedent) of french learning, as seen in the lesson plans and the teacher questionnaire, obtained a very good result. the implementation (transaction) of french learning, as seen in the teacher questionnaire and student questionnaire, is in the good category. in reality, however, the learning undertaken still does not follow the lesson plan procedurally and structurally, so the implementation of french learning is included in the good category but needs to be improved. the results (outcomes) of french learning based on the teacher's questionnaire are included in the very good category. hence, overall, french learning is included in the good category and still needs improvement.

references

abrory, m., & kartowagiran, b. (2014). evaluasi implementasi kurikulum 2013 pada pembelajaran matematika smp negeri kelas vii di kabupaten sleman. jurnal evaluasi pendidikan, 2(1), 50–59. retrieved from http://journal.student.uny.ac.id/ojs/index.php/jep/article/view/73

anh, v. t. k. (2018). evaluation models in educational program: strengths and weaknesses. vnu journal of foreign studies, 34(2), 140–150. https://doi.org/10.25073/2525-2445/vnufs.4252

arifin, z. (2011). konsep dan model pengembangan kurikulum. bandung: remaja rosdakarya.

arikunto, s. (2018). evaluasi program pendidikan. jakarta: rineka cipta.

azwar, s. (2018). reliabilitas dan validitas (4th ed.). yogyakarta: pustaka pelajar.

dewantara, i. p. m. (2017). stake evaluation model (countenance model) in learning process bahasa indonesia at ganesha university of educational. international journal of language and literature, 1(1), 19–29. https://doi.org/10.23887/ijll.v1i1.9615

dündar, e., & merç, a. (2017). a critical review of research on curriculum development and evaluation in elt. european journal of foreign language teaching, 2(1), 136–168. https://doi.org/10.5281/zenodo.437574

hamalik, o. (2008). manajemen pengembangan kurikulum. bandung: remaja rosdakarya.

hasan, s. h. (2009). evaluasi kurikulum. bandung: remaja rosdakarya.

ismail, f. (2015). the evaluation of curriculum implementation at tarbiyah faculty iain raden fatah palembang. jisae: journal of indonesian student assessment and evaluation, 1(1), 12–27. https://doi.org/10.21009/jisae.011.02

johnson, r. b., & christensen, l. (2000). educational research: quantitative, qualitative, and mixed approaches. thousand oaks, ca: sage publications.

kurniawan, r., winarno, m. e., & dwiyogo, w. d. (2018). evaluasi pembelajaran pendidikan jasmani, olahraga, dan kesehatan pada siswa sma menggunakan model countenance. jurnal pendidikan: teori, penelitian, dan pengembangan, 3(10), 1253–1264. https://doi.org/10.17977/jptpp.v3i10.11599

lai, m., & kushner, s. (eds.). (2013). a developmental and negotiated approach to school self-evaluation. https://doi.org/10.1108/s1474-7863(2013)14

lukum, a. (2015). evaluasi program pembelajaran ipa smp menggunakan model countenance stake. jurnal penelitian dan evaluasi pendidikan, 19(1), 25–37. https://doi.org/10.21831/pep.v19i1.4552

majid, a., & rochman, c. (2014). pendekatan ilmiah dalam implementasi kurikulum 2013 (e. kuswandi, ed.). bandung: remaja rosdakarya.

mardapi, d. (2017). pengukuran, penilaian, dan evaluasi pendidikan (2nd ed.). yogyakarta: parama publishing.

mccormick, r., & james, m. (2019). curriculum evaluation in schools. london: taylor & francis.

prasojo, l. d., kande, f. a., & mukminin, a. (2018). evaluasi pelaksanaan standar proses pendidikan pada smp negeri di kabupaten sleman. jurnal penelitian dan evaluasi pendidikan, 22(1), 61–69. https://doi.org/10.21831/pep.v22i1.19018

priyatno, d. (2009). 5 jam belajar olah data dengan spss 17 (j. widiyatmoko, ed.). yogyakarta: mediakom.

regulation of the minister of education and culture no. 103 of 2014 on learning in primary and secondary education. (2014).

regulation of the minister of education and culture no. 22 of 2016 on the process standard of primary and secondary education. (2016).

regulation of the minister of education and culture no. 4 of 2018 on the assessment of learning outcomes by the educational unit and the government. (2018).

retnawati, h. (2014). analisis kuantitatif instrumen penelitian. yogyakarta: parama publishing.

sarjono, h., & julianita, w. (2011). spss vs lisrel: sebuah pengantar, aplikasi untuk riset. jakarta: salemba empat.

schiro, m. s. (2017). teori kurikulum: visi-visi yang saling bertentangan dan kekhawatiran tanpa henti (b. sarwiji, ed.; e. sulistyowati, trans.). jakarta: indeks.

sorenson, r. d., goldsmith, l. m., méndez, z. y., & maxwell, k. t. (2011). the principal’s guide to curriculum leadership. thousand oaks, ca: corwin press.

sumual, m. z. i., & ali, m. (2017). evaluation of primary school teachers’ pedagogical competence in implementing curriculum. journal of education and learning, 11(3), 343–350.

tyler, r. w. (2013). basic principles of curriculum and instruction. chicago, il: university of chicago press.

https://doi.org/10.21831/reid.v6i1.28006 this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 6(1), 2020, 41-50 available online at: http://journal.uny.ac.id/index.php/reid

ngss-oriented chemistry test instruments: validity and reliability analysis with the rasch model

*1roudloh muna lia; 1ani rusilowati; 1wiwi isnaeni
1graduate school, universitas negeri semarang jl. kelud utara iii, petompon, gajahmungkur, kota semarang, jawa tengah 50237, indonesia
*corresponding author. e-mail: aliamoetz@yahoo.co.id
submitted: 10 february 2020 | revised: 27 april 2020 | accepted: 1 may 2020

abstract

an instrument for measuring test attributes must be valid and reliable. this study was carried out because the validity and reliability of the chemistry items given to testees need to be tested. this study aims to estimate the validity and determine the reliability of chemistry test instruments oriented to the next generation science standards (ngss). the research was conducted through a quantitative descriptive approach in two vocational schools with engineering programs, involving 130 testees. the instruments used were an ngss-oriented chemistry test containing 35 items and an expert validation questionnaire. the test participants' responses to the test instrument were collected through the documentation method. the items in the ngss test were presented to three subject-matter experts. the validities examined were content validity and construct validity. the reliability was tested through internal consistency and interrater consistency approaches. the results show that the content validity (aiken’s v) ranges from 0.50 to 1.00. the value of the unexplained variance is less than 10%, which means that it is well-categorized. this analysis is strengthened by cfa, which shows a good goodness of fit and a good measurement model fit. the parameters used to test model fit are cfi, nfi, rmsea, and the loading factor value. some of the resulting values are over 0.90, rmsea is 0.00, and the loading factor value of each item is more than 0.3.
all scales had alpha reliability above the criterion of 0.70. thus, the developed chemistry test items were proven to be valid and reliable instruments.
keywords: validity, reliability, ngss
how to cite: lia, r., rusilowati, a., & isnaeni, w. (2020). ngss-oriented chemistry test instruments: validity and reliability analysis with the rasch model. reid (research and evaluation in education), 6(1), 41-50. doi:https://doi.org/10.21831/reid.v6i1.30112.
copyright © 2020, reid (research and evaluation in education), 6(1), 2020. issn: 2460-6995 (online). https://creativecommons.org/licenses/by-sa/4.0/deed.id

introduction
government regulation no. 32 of 2013 states that the learning process in an education unit is carried out in an interactive, inspiring, enjoyable, and challenging way that motivates students to participate actively and provides sufficient space for initiative, creativity, and independence in line with students' talents, interests, and physical and psychological development. educators are required to carry out this mandate. learning will achieve the goals set if it suits the students' talents and interests. for students in the engineering program of a vocational school, a business economics subject would be a poor fit because it does not match their interests and areas of expertise; the same applies to the chemistry lessons taught at vocational high schools (vhs). chemistry subjects in the engineering skills program can support the development of learners' competencies if the material is adjusted to the students' area of expertise (wena, 2009 in banne, 2018, p. 45).
if chemistry is taught separately and is not associated with the productive subjects in the students' area of expertise, the chemistry subject becomes irrelevant (astuti, sunarno, & sudarisman, 2016). a questionnaire distributed to vocational students showed that as many as 76% of them stated that chemistry was a difficult subject. the reason is that students are less interested in chemistry lessons because they consider the subject unimportant for them (lia & isnaeni, 2018, p. 403). chemistry as an adaptive subject in vhs is expected to match the needs of the productive material. one way to present chemistry so that it matches the productive material in the learners' area of expertise is through the next generation science standards (ngss) (lia, 2019, p. 113). ngss provides the opportunity to include engineering in science (national research council, 2013, p. xviii). one of the assessment challenges in ngss is creating assignments that include the practical side of science and engineering (damelin, 2017). ngss offers a new standard combining content and practice in science and engineering (national research council, 2013). ngss creates a new vision for science education based on the idea that science is both a body of knowledge and a set of practices for developing that knowledge (penuel, harris, & debarger, 2015, p. 45). this teaching and learning approach is built on decades of research that identifies problems in learning in science classes and promising strategies to make learning more meaningful and effective for students (reiser, 2013). ngss-oriented chemistry learning was successfully developed by lia (2019). after the learning process has been implemented, it is followed by an assessment activity. assessment is an activity conducted to measure and assess the level of curriculum achievement (sudrajat, 2016, p. 1).
through assessment, any shortcomings in learning can be identified and evaluated. the assessment instrument that measures the question attributes used as students' evaluation material must be valid and reliable. therefore, further research on the development of the ngss learning model, namely the preparation of chemistry items, needs to be conducted. the ngss-oriented chemistry items developed offer a breakthrough that gives students a more meaningful assessment. assessment becomes more meaningful because it is associated with technical material from the field the students are pursuing. before the test was carried out, several practicums were oriented towards ngss, which made chemistry more appealing (lia, 2019, p. 113). the ngss-oriented chemistry items must meet two important requirements: a good level of validity and a good level of reliability. validity and reliability can be examined once the questions have been arranged. item analysis is carried out to obtain questions of adequate quality and to support the processing and interpretation of the assessment results (kadir, 2015, p. 71). reynolds, livingston, and willson (2010, p. 144) state that validity means the extent to which theoretical and empirical evidence supports the meaning and interpretation of test scores. in addition, dewi and sukadiyanto (2015, p. 230) explain that a valid test is a test that can measure accurately and thoroughly the symptoms which are to be measured. reliability is test consistency (bhakti, 2015; khumaedi, 2012). it means that a reliable test must give consistent results even if it is administered repeatedly at different times. this is in accordance with reynolds et al. (2010, p. 91), who explain that reliability is the accuracy or stability of the assessment results.
the measuring tools used by evaluators when carrying out evaluation activities must have accuracy, consistency, and stability so that the measurement results are accurate (amalia & susilaningsih, 2014). a set of tests must be accurate when it is used. it should also be consistent and stable in the sense that there is no change from one measurement time to another (utami, 2018, p. 5). this study aims to estimate the validity and determine the reliability of ngss-oriented chemistry test instruments that measure the level of understanding of chemistry material in engineering. research on the validity and reliability of test instruments has been conducted by mohamad, sulaiman, sern, and salleh (2015), kusaeri, sutini, suparto, and wardah (2019), and iskandar (2017). the current research differs from previous research in that it analyzes construct validity using a confirmatory factor analysis (cfa) modification together with the rasch model. it is expected that research on validity and reliability will increase knowledge in the field of teaching, especially in the evaluation of learning. the rasch model used in this study has several advantages: it can identify erroneous responses, predict scores for missing data, distinguish the abilities of respondents with the same raw score, and identify indications of guessing and cheating (sumintono & widhiarso, 2015, pp. 44–45). these advantages make the rasch model more accurate (lord in nurcahyo, 2016). rasch modeling can produce standard error of measurement values, which can improve the accuracy of calculations (ardiyanti, 2016, p. 261). sabekti and khoirunnisa (2018, p. 69) confirm that the rasch model is recommended for the development of test instruments.
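as a point of reference for the rasch model discussed above, the following minimal python sketch evaluates the response probability of the dichotomous rasch model, p = exp(θ − b) / (1 + exp(θ − b)), where θ is a testee's ability and b is an item's difficulty, both in logits. it assumes dichotomously scored items; the function name and the example values are illustrative assumptions and are not taken from the study.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    theta: person ability in logits (hypothetical value).
    b: item difficulty in logits (hypothetical value).
    """
    # Logistic function of the difference between ability and difficulty.
    return 1.0 / (1.0 + math.exp(-(theta - b)))


if __name__ == "__main__":
    # Illustrative abilities only; they are not estimates from the study.
    for theta in (-1.0, 0.0, 1.0):
        print(theta, round(rasch_probability(theta, b=0.0), 3))
```

when ability equals difficulty the probability is 0.5, and it rises as ability exceeds difficulty, which is the property that places persons and items on one common scale.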
assessing the appropriateness of the items' appearance and/or content validity is the first step. the assessment is carried out by a panel of experts, and chemistry teachers are also included in the expert panel (ismail, permanasari, & setiawan, 2016, p. 239). instruments that have been compiled and validated by experts are then validated empirically by trialling them in small classes (prabowo & ristiani, 2011, p. 80). the degree of agreement among the experts who assess the feasibility of an item can be estimated and quantified. this statistic is then used as an indicator of the content validity of the item and of the test. this study measured validity through a content validity coefficient, the v index proposed by aiken. construct validity was tested using cfa with the help of lisrel 8.8 software. the evidence of construct validity came from a first-order confirmatory factor analysis, which estimated the loading of each item on its latent variable. according to sitninjak and sugiarto in rusilowati (2014, p. 131), the validity of an observed variable can be seen from the factor loading of the variable on its latent variable. variables are considered to have good construct validity when the goodness of fit and the measurement model fit are met.

method
the study was conducted in two vocational high schools with engineering programs, with a total of 130 testees. the instruments were an ngss-oriented chemistry test of 35 items and a validation sheet. the testees' answers to the test were collected through the documentation method. three experts assessed the items, which yielded three completed questionnaire sheets. validity was estimated through content validity, validity in large class trials, and construct validity.
then, the reliability was estimated through internal consistency and inter-rater consistency approaches. to analyze content validity, aiken's v formula was used. construct validity was examined with cfa with the help of lisrel 8.8 software. internal consistency reliability was estimated with the spearman-brown formula in the small class trials and with cronbach's alpha from the rasch model in the large class trials. inter-rater reliability across the three raters was estimated using two-way anova with the ebel formula.

findings and discussion
validity test
content validity was estimated with aiken’s v index. the items of the ngss test were presented to three experts, who assessed the suitability of the material, the construction, the language, and the compatibility with ngss. the experts also filled out a questionnaire containing their conclusions on the ngss-oriented chemistry items. a summary of the expert agreement coefficients is shown in table 1.

table 1. coefficient data of expert agreement
| item number | aiken’s v index | criterion | conclusion |
| --- | --- | --- | --- |
| 1, 5, 6, 7, 8, 9, 11, 12, 14, 16, 17, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 | 1.0 | valid | eligible |
| 2, 3, 4, 13, 15, 18, 19, 20 | 0.8 | not valid enough | little revision |
| 21 | 0.7 | not valid enough | little revision |
| 10 | 0.5 | not valid enough | little revision |

figure 1. unidimensionality test

based on the results of the data analysis in table 1, 25 of the 35 items are valid and 10 items are not valid enough, which means that they needed some revision. compared with previous studies, the proportion of good items here is higher than in the research by hasnah (2017), in which only nine of 40 items were categorized as good.
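the aiken's v index used above can be sketched in a few lines of python. the sketch uses the usual form v = Σ(r − lo) / (n(c − 1)), where r is each rater's score, lo is the lowest possible score, c is the number of rating categories, and n is the number of raters. the text does not state the rating scale, so the 1 to 5 scale and the ratings below are hypothetical and serve only to illustrate the calculation.

```python
def aikens_v(ratings, lo, categories):
    """Aiken's V for a single item: sum(r - lo) / (n * (c - 1)).

    ratings: scores given by the raters (hypothetical values).
    lo: lowest possible score on the rating scale.
    categories: number of categories on the rating scale.
    """
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one rating is required")
    s_total = sum(r - lo for r in ratings)
    return s_total / (n * (categories - 1))


if __name__ == "__main__":
    # Hypothetical ratings from three experts on an assumed 1-5 scale.
    print(round(aikens_v([5, 5, 5], lo=1, categories=5), 2))
    print(round(aikens_v([4, 4, 5], lo=1, categories=5), 2))
```

full agreement at the top of the scale gives v = 1, and any lower or more scattered ratings pull the index down, which is how an index such as 0.5 can arise for a single item.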
construct validity was examined by combining the factor analysis of the rasch model with cfa (using lisrel 8.8 software). the first step in examining construct validity with the rasch model is the item polarity diagnostic output (hayati & lailatussaadah, 2016, p. 173). all items have a positive point measure correlation (pt. mea corr). a total of 14 items have strong or high correlation values. one item (question number 5) has a moderate correlation value (0.57). this is in accordance with othman, salleh, hussein, and wahid (2014, p. 117), who state that a high pt. mea corr (0.68–1.00) shows that an item can distinguish respondents’ ability. the pt. mea corr results are reinforced by the unidimensionality test, whose output table is presented in figure 1. the raw variance in figure 1 is high (73.2%). according to hakiki, fitri, and agung (2018, p. 42), results that meet the unidimensionality requirement at more than 60% fall into the special category. the instrument developed can therefore measure what it is supposed to measure. the unexplained variance values are, in order, 3.7; 3.0; 2.9; 2.5; and 2.2. the variances that cannot be explained by the instrument are thus all less than 10%, which indicates that the unidimensionality of the instrument falls into the good category (wibisono, 2014, p. 744). the rasch construct validity test only addresses the responses to the tested items; to examine the covariance among the test items, a cfa model run in a program such as lisrel, amos, or spss is needed. for specifying a model for a data set, the cfa procedures appear to be more advanced, simpler, and more user-friendly than those developed for rasch (irt).
the cfa model can calculate an accurate chi-square estimate of model fit and the related degrees of freedom (reise, widaman, & pugh, 1993, pp. 554–563). therefore, the researchers strengthened the construct validity test with the lisrel program. conceptually, a test built on ngss should take three components into account: disciplinary core ideas (dcis), science and engineering practices (seps), and crosscutting concepts (ccs). dcis depend heavily on the subject matter that the instrument covers. seps and ccs, in turn, are what characterize an ngss-oriented instrument. seps consist of six aspects with 15 indicators. ccs consist of three aspects with 14 indicators. the analysis of the ccs component, with its three aspects and 14 indicators, is summarized in the path diagram presented in figure 2. the cfa results support the ccs dimension, as evidenced by the factor loading values and the model fit parameters. all factor loading values are more than 0.3. factor loadings of less than 0.5 are removed (arifin, yusoff, & naing, 2012). the parameters used to test model fit are cfi, nfi, and rmsea. cfi and nfi are at or above 0.90 (cfi=0.92; nfi=0.90) and rmsea is 0.00. this agrees with the expectation that cfi and nfi values should be above 0.90 (zehir, akyuz, eren, & turhan, 2013, p. 9). rmsea is recommended to be under 0.05, though values up to 0.08 are acceptable (sohail & jang, 2017). in rusilowati (2014, p.
134), it is stated that the fit of the developed model to the empirical data can be judged, at a minimum, from three fit measures that represent the three different categories of model fit tests. when two of the three categories are significant, the developed model fits the data. all model fit values were acceptable, and according to the literature, the validity of the measurements in the current study met the criteria.

figure 2. ccs path diagram

the validity of the large class trial phase was analyzed with the rasch model through the item fit order output. the output is presented in table 2.

table 2. item fit
| item’s number | outfit mnsq | zstd | pt. mea corr |
| --- | --- | --- | --- |
| 1 | 2.11 | 3.9 | 0.68 |
| 3 | 1.99 | 3.2 | 0.75 |
| 6 | 1.47 | 1.7 | 0.75 |
| 2 | 1.33 | 1.4 | 0.73 |
| 7 | 1.33 | 1.7 | 0.75 |
| 9 | 1.16 | 0.8 | 0.74 |
| 12 | 1.08 | 0.5 | 0.80 |
| 5 | 1.03 | 0.2 | 0.57 |
| 8 | 0.94 | -0.2 | 0.69 |
| 10 | 0.80 | -1.0 | 0.87 |
| 4 | 0.81 | -0.8 | 0.82 |
| 14 | 0.77 | -1.0 | 0.83 |
| 13 | 0.56 | -2.0 | 0.90 |
| 15 | 0.59 | -1.9 | 0.78 |
| 11 | 0.48 | -1.6 | 0.86 |

the item fit information is useful for identifying indications of misconception (sumintono & widhiarso, 2015, p. 77). in table 2, based on mnsq, zstd, and pt. mea corr, it can be concluded that the 15 items are classified as valid, although one item, question number 1, indicates a misconception. its mnsq value is 2.11 and its zstd is 3.9, which represent unexpected data. the outlying mnsq and zstd values come from some testees' answers, in which "the oxidation-reduction reaction and the reason" were reversed; however, its pt. mea corr is still within the limit of more than 0.4 and less than 0.85. therefore, the 15 items have been used to measure the quality of education because these questions have been analyzed. it is in accordance with the opinion of pancoro (2011, p.
94) that test questions first need to be analyzed so that they have the same characteristics and can be used to measure the quality of education.

reliability test
the reliability test consists of (a) inter-rater reliability, (b) small class trial reliability, and (c) large class trial reliability. based on table 3, the reliability values of the tests are 0.17, 0.82, and 0.94. inter-rater reliability (among experts) is very low, the reliability of the small class trial is very high, and the reliability of the large class trial is in the special category. the three reliability tests are discussed as follows.

table 3. reliability data analysis
| trial phase | reliability | n of items |
| --- | --- | --- |
| expert (expert judgment) | 0.17 | 35 |
| small class | 0.82 | 25 |
| big class | 0.94 | 15 |

inter-rater reliability
inter-rater reliability is a preliminary part of a study (dockrell et al., 2012, p. 633). inter-rater reliability was calculated after the content validity among the three validators had been calculated. the level of agreement among the three validators can be expressed through the reliability coefficient between raters (assessors), obtained from a two-way anova with the ebel formula. the two-way anova output from spss 16.0 is presented in table 4.

table 4. output reliability of two-way anova
| source | mean square |
| --- | --- |
| rater | 0.495 |
| item | 0.159 |
| rater*item | 0.132 |

in table 4, rater refers to the assessors and item refers to the test items. the mean square of rater is 0.495, that of item is 0.159, and that of the interaction between rater and item (rater*item) is 0.132. these values are entered into the ebel formula and produce a reliability coefficient of 0.17. the reliability coefficient r is less than 0.2. the assessors are therefore still not consistent in assessing the contents of the instrument (rusilowati, 2014, p. 29). when the reliability coefficient obtained is not high enough, there are inconsistencies among raters (pinilih, budiharti, & ekawati, 2013, p. 25). the reason for the inconsistency in this research is the difference in viewpoints in evaluating the chemistry test instruments. for example, expert 1 puts more emphasis on the chemical content, while expert 3 is more inclined to evaluate the appearance and the suitability of the answers.

small class trial reliability
reliability for the small classes was estimated with the spearman-brown formula, computed with the anastes description application. according to table 3, the reliability coefficient of the small class test is 0.82. this coefficient lies in the range 0.8 ≤ r < 1.0, which indicates very high reliability.

big class trial reliability
in the big class stage, reliability was examined with the help of the winsteps 3.73 program. reliability in the rasch model is expressed through a separation index. the separation indexes reported are the item reliability and the person reliability, supplemented by the cronbach alpha (kr-20) reliability coefficient. the three coefficients are 0.91, 0.98, and 0.94. all three indicate very high reliability. separation reliability (item or person reliability) is categorized as high because the study sample and the item difficulty levels span a wide range and produce a small measurement error.
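two of the coefficients in this section follow from short closed-form expressions, and the python sketch below works through them under stated assumptions. for the inter-rater coefficient, it assumes the form of the ebel formula that gives the reliability of the averaged ratings, (ms_item − ms_error) / ms_item, with the rater*item mean square from table 4 taken as the error term; the text does not say which form was applied, but this one returns the reported 0.17. for the rasch output, it assumes the usual relation between a separation index g and its reliability, g² / (1 + g²), applied to the separation values 3.11 and 6.37 listed in table 5; the results, about 0.91 and 0.98, agree with the first two coefficients quoted above. the function names are illustrative and are not part of spss or winsteps.

```python
def ebel_average_reliability(ms_item: float, ms_error: float) -> float:
    """Reliability of the averaged ratings: (MS_item - MS_error) / MS_item.

    Assumed form of the Ebel formula; the source does not spell it out.
    """
    return (ms_item - ms_error) / ms_item


def separation_to_reliability(g: float) -> float:
    """Reliability implied by a Rasch separation index G: G^2 / (1 + G^2)."""
    return g ** 2 / (1.0 + g ** 2)


if __name__ == "__main__":
    # Mean squares from table 4; rater*item is treated as the error term.
    print(round(ebel_average_reliability(0.159, 0.132), 2))  # 0.17
    # Separation values from table 5 (person, then item).
    print(round(separation_to_reliability(3.11), 2))  # 0.91
    print(round(separation_to_reliability(6.37), 2))  # 0.98
```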
a broad item range means that the items span difficulty levels from the easiest to the most difficult. similarly, a broad sample means that the testees are spread from the most able to the least able (linacre, 2016, p. 256). the reliability output can be seen in table 5. in addition to the reliability coefficients, table 5 contains important information on the statistical summary of the testees' overall response patterns, namely (a) infit mnsq and zstd and outfit mnsq and zstd, and (b) separation.

infit mnsq zstd and outfit mnsq zstd
the infit mnsq and outfit mnsq values are 0.99 and 1.21, respectively, for persons and 0.98 and 1.10 for items. these are categorized as good values because the ideal value is 1 (the closer to 1 the better). the infit zstd and outfit zstd values for persons and items, in sequence, are 0.0, 0.2, -0.1, and 0.3. the ideal zstd value is 0.0, so the zstd values are close to ideal, except that the infit zstd for items is negative (not good).

table 5. output reliability of rasch model
| statistic | person | item |
| --- | --- | --- |
| infit mnsq (mean) | 0.99 | 0.98 |
| infit zstd (mean) | 0.0 | -0.1 |
| outfit mnsq (mean) | 1.21 | 1.18 |
| outfit zstd (mean) | 0.2 | 0.3 |
| separation | 3.11 | 6.37 |
| reliability | 0.88 | 0.97 |

kr-20 test reliability: 0.94

separation
the greater the separation value, the better the quality of the instrument in terms of the overall respondents and items.
the separation value for the items developed, obtained from the h formula, is 8.45. the score of 8.45 is rounded to 8, which means that the items fall into eight groups of varied items.

conclusion
this test instrument has been examined for content validity, construct validity, inter-rater reliability, and reliability with the rasch model. the test instrument has fulfilled content validity by expert judgment, as evidenced by agreement indexes (aiken's v) ranging from 0.50 to 1.00. the lowest score (0.5) is caused by inconsistency among the individual ratings. the raw variance in the rasch analysis of construct validity is 73.2%, which falls into the special category. the unexplained variance values are all less than 10% (3.7; 3.0; 2.9; 2.5; and 2.2, consecutively), which indicates that unidimensionality in the instrument is in the good category. the parameters used to test model fit are cfi, nfi, rmsea, and the factor loading values. cfi and nfi are at or above 0.90 (cfi=0.92; nfi=0.90), rmsea is 0.00, and the factor loading of each item is more than 0.3, which indicates that the variables have good validity with respect to the construct. the reliability coefficient increases at each step of the trial, i.e. 0.17, 0.82, and 0.94. the rasch analysis of item characteristics supports interpretation in terms of items, persons, and the instrument. thus, the chemistry test items developed have been shown to be valid and reliable and to have adequate characteristics.

references
amalia, n. f., & susilaningsih, e. (2014). pengembangan instrumen penilaian keterampilan berpikir kritis siswa sma pada materi asam basa. jurnal inovasi pendidikan kimia, 8(2), 1280–1389. retrieved from https://journal.unnes.ac.id/nju/index.php/jipk/article/view/4443
ardiyanti, d. (2016). aplikasi model rasch pada pengembangan skala efikasi diri dalam pengambilan keputusan karir siswa. jurnal psikologi, 43(3), 248–263.
https://doi.org/10.22146/jpsi.17801
arifin, w. n., yusoff, m. s. b., & naing, n. n. (2012). confirmatory factor analysis (cfa) of usm emotional quotient inventory (usmeq-i) among medical degree program applicants in universiti sains malaysia (usm). education in medicine journal, 4(2), 1–22. https://doi.org/10.5959/eimj.v4i2.33
astuti, r., sunarno, w., & sudarisman, s. (2016). pembelajaran ipa dengan pendekatan ketrampilan proses sains menggunakan metode eksperimen bebas termodifikasi dan eksperimen terbimbing ditinjau dari sikap ilmiah dan motivasi belajar siswa. proceeding biology education conference, 13(1), 338–345. retrieved from https://jurnal.uns.ac.id/prosbi/article/view/5742
banne, k. (2018). meningkatkan aktivitas belajar kimia (redoks) siswa kelas xii tkr smk negeri 1 sumarorong melalui penerapan model pembelajaran kooperatif tipe nht dengan materi berbasis kontekstual. jurnal mekom (media komunikasi pendidikan kejuruan), 5(1), 45–50. https://doi.org/10.26858/mekom.v5i1.8223
bhakti, y. b. (2015). pengaruh jumlah alternatif jawaban dan teknik penskoran terhadap reliabilitas tes. formatif: jurnal ilmiah pendidikan mipa, 5(1), 1–13. https://doi.org/10.30998/formatif.v5i1.168
damelin, d. (2017). using technology to enhance ngss-aligned assessment tasks for classroom formative use. retrieved from the concord consortium website: https://concord.org/newsletter/2017-spring/using-technology-enhance-ngss-aligned-assessment-tasks/
dewi, p. c. p., & sukadiyanto, s. (2015). pengembangan tes keterampilan olahraga woodball untuk pemula. jurnal keolahragaan, 3(2), 228–240. https://doi.org/10.21831/jk.v3i2.6254
dockrell, s., o’grady, e., bennett, k., mullarkey, c., mc connell, r., ruddy, r., … flannery, c. (2012).
an investigation of the reliability of rapid upper limb assessment (rula) as a method of assessment of children’s computing posture. applied ergonomics, 43(3), 632–636. https://doi.org/10.1016/j.apergo.2011.09.009
government regulation no. 32 of 2013 on national education standard. (2013).
hakiki, a. w., fitri, a. r., & agung, i. m. (2018). analisis properti psikometri subtes merkaufgaben (me) dengan rasch model. jurnal psikologi, 14(1), 40–49. https://doi.org/10.24014/jp.v14i1.4900
hasnah, h. (2017). analisis kualitas soal matematika ujian sekolah kelas xii ipa sma negeri di watansoppeng berdasarkan teori respon butir. pep educational assessment, 1(1), 27–33. retrieved from https://ojs.unm.ac.id/uea/article/view/3776
hayati, s., & lailatussaadah, l. (2016). validitas dan reliabilitas instrumen pengetahuan pembelajaran aktif, kreatif dan menyenangkan (pakem) menggunakan model rasch. jurnal ilmiah didaktika, 16(2), 169–179. https://doi.org/10.22373/jid.v16i2.593
iskandar, a. (2017). teknik analisis validitas konstruk dan reliabilitas instrument test dan non test dengan software lisrel. https://doi.org/10.31227/osf.io/nbhxq
ismail, i., permanasari, a., & setiawan, w. (2016). stem virtual lab: an alternative practical media to enhance student’s scientific literacy. jurnal pendidikan ipa indonesia, 5(2), 239–246. https://doi.org/10.15294/jpii.v5i2.5492
kadir, a. (2015). menyusun dan menganalisisi tes hasil belajar. al-ta’dib: jurnal kajian ilmu kependidikan, 8(2), 70–81. https://doi.org/10.31332/atdb.v8i2.411
khumaedi, m. (2012). reliabilitas instrumen penelitian pendidikan. jurnal pendidikan teknik mesin, 12(1), 25–30. retrieved from https://journal.unnes.ac.id/nju/index.php/jptm/article/view/5273
kusaeri, k., sutini, s., suparto, s., & wardah, f. (2019). the validity and inter-rater reliability of project assessment in mathematics learning. beta: jurnal tadris matematika, 12(1), 1–13. https://doi.org/10.20414/betajtm.v12i1.266
lia, r. m. (2019).
pengembangan butir soal kimia berorientasi ngss dan analisisnya menggunakan model rasch. master thesis, universitas negeri semarang, semarang.
lia, r. m., & isnaeni, i. (2018). evaluation of chemistry learning programs at vocational high school semarang on vehicle engineering field. proceedings of the international conference on science and education and technology 2018 (iset 2018), 403–407. https://doi.org/10.2991/iset-18.2018.82
linacre, j. m. (2016). a user’s guide to winsteps ministep rasch-model computer programs. chicago, il: winsteps.com.
mohamad, m. m., sulaiman, n. l., sern, l. c., & salleh, k. m. (2015). measuring the validity and reliability of research instruments. procedia social and behavioral sciences, 204, 164–171. https://doi.org/10.1016/j.sbspro.2015.08.129
national research council. (2013). next generation science standards: for states, by states. https://doi.org/10.17226/18290
nurcahyo, f. a. (2016). aplikasi irt dalam analisis aitem tes kognitif. buletin psikologi, 24(2), 64–75. https://doi.org/10.22146/buletinpsikologi.25218
othman, n. b., salleh, s. m., hussein, h., & wahid, h. b. a. (2014). assessing construct validity and reliability of competitiveness scale using rasch model approach. the 2014 wei international academic conference proceedings, 113–120. retrieved from https://www.westeastinstitute.com/wp-content/uploads/2014/06/suriamohd-salleh.pdf
pancoro, n. h. (2011). karakteristik butir soal ulangan kenaikan kelas sebagai persiapan bank soal bahasa inggris. jurnal penelitian dan evaluasi pendidikan, 15(1), 92–114. https://doi.org/10.21831/pep.v15i1.1089
penuel, w. r., harris, c. j., & debarger, a. h. (2015). implementing the next generation science standards. phi delta kappan, 96(6), 45–49.
https://doi.org/10.1177/0031721715575299
pinilih, f. w., budiharti, r., & ekawati, e. y. (2013). pengembangan instrumen penilaian produk pada pembelajaran ipa untuk siswa smp. jurnal pendidikan fisika, 1(2), 23–27. retrieved from https://jurnal.fkip.uns.ac.id/index.php/pfisika/article/view/2798
prabowo, a., & ristiani, e. (2011). rancang bangun instrumen tes kemampuan keruangan pengembangan tes kemampuan keruangan hubert maier dan identifikasi penskoran berdasar teori van hielle. kreano, jurnal matematika kreatif-inovatif, 2(2), 72–87. https://doi.org/10.15294/kreano.v2i2.2618
reise, s. p., widaman, k. f., & pugh, r. h. (1993). confirmatory factor analysis and item response theory: two approaches for exploring measurement invariance. psychological bulletin, 114(3), 552–566. https://doi.org/10.1037/0033-2909.114.3.552
reiser, b. j. (2013). what professional development strategies are needed for successful implementation of the next generation science standards. the invitational research symposium on science assessment, 1–23. retrieved from http://www.ets.org/media/research/pdf/reiser.pdf
reynolds, c. r., livingston, r. b., & willson, v. l. (2010). measurement and assessment in education (2nd ed.). upper saddle river, nj: pearson education.
rusilowati, a. (2014). pengembangan instrumen penilaian. semarang: unnes press.
sabekti, a. w., & khoirunnisa, f. (2018). penggunaan rasch model untuk mengembangkan instrumen pengukuran kemampuan berpikir kritis siswa pada topik ikatan kimia. jurnal zarah, 6(2), 68–75. https://doi.org/10.31629/zarah.v6i2.724
sohail, m. s., & jang, j. (2017). understanding the relationships among internal marketing practices, job satisfaction, service quality and customer satisfaction: an empirical investigation of saudi arabia’s service employees. international journal of tourism sciences, 17(2), 67–85. https://doi.org/10.1080/15980634.2017.1294343
sudrajat, d. (2016). portofolio: sebuah model penilaian dalam kurikulum berbasis kompetensi.
intelegensia, 1(2), 1–9. retrieved from http://ejurnal.unikarta.ac.id/index.php/intelegensia/article/view/257
sumintono, b., & widhiarso, w. (2015). aplikasi pemodelan rasch pada assessment pendidikan. cimahi: trim komunikata.
utami, b. n. (2018). praktik evaluasi penyuluhan pertanian. malang.
wibisono, s. (2014). aplikasi model rasch untuk validasi instrumen pengukuran fundamentalisme agama bagi responden muslim. jp3i (jurnal pengukuran psikologi dan pendidikan indonesia), 3(3), 729–750. https://doi.org/10.15408/jp3i.v3i3.10731
zehir, c., akyuz, b., eren, m. s., & turhan, g. (2013). the indirect effects of servant leadership behavior on organizational citizenship behavior and job performance: organizational justice as a mediator. international journal of research in business and social science (2147-4478), 2(3), 1–13. https://doi.org/10.20525/ijrbs.v2i3.68
https://doi.org/10.21831/reid.v6i1.30112

this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 5(1), 2019, 61-74
available online at: http://journal.uny.ac.id/index.php/reid

an analysis of javanese language test characteristic using the rasch model in r program

*1muchlisin; 2djemari mardapi; 3farida agus setiawati
1,2,3department of educational research and evaluation, graduate school of universitas negeri yogyakarta
jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
*corresponding author. e-mail: muchlisinjanuary@gmail.com
submitted: 28 february 2019 | revised: 02 may 2019 | accepted: 07 may 2019

abstract
one skill required to solve problems in the 21st century is communication. two international languages that are important in communication and taught at school are english and german. however, besides international languages, a local language such as the javanese language is also essential and needs to be maintained. the purpose of this study is to analyze the javanese language test characteristics.
this study was explorative research with secondary data collected by documentation of 220 students' responses to the 50 multiple-choice items of a javanese language test in the 11th grade of a vocational high school. data were analyzed using the rasch model with the r program. the rasch model fits the data with 42 items after three rounds of calibration. based on difficulty level, icc, and item reliability, 28 of the 42 items (66.67%) were good. this study finds that, in general, the javanese language test is in the moderate category of difficulty. hence, it is crucial to evaluate the javanese language test in order to build a better test that gives more accurate information about the examinees' ability. the evaluation of the javanese language test can be used to plan subsequent lessons and improve javanese language learning.
keywords: javanese language test, rasch model, r program
permalink/doi: https://doi.org/10.21831/reid.v5i1.23773

introduction
the 21st century requires a number of skills. one of these skills is communication (dede, 2010, pp. 7–8; trilling & fadel, 2009, p. 54; zubaidah, 2017, p. 1). we need language to carry out communication. some international languages are important, taught in school, and widely used in the world, such as english, german, and chinese. besides international languages, a local language such as the javanese language is important and needs to be maintained.

central java and yogyakarta special region, two provinces in indonesia, are very rich in javanese tradition and culture. one of these traditions is the javanese language, which people use to speak to each other in daily life. this is why javanese language lessons are still held at school, especially in java. at the end of every semester, a test is conducted to assess students' ability in the javanese language.
the assessment of the javanese language test can be carried out by analyzing the test characteristics, beginning with collecting information about the results of previous test scores (sumintono & widhiarso, 2015, p. 12). besides giving a score to the students, the students' responses can also be used to predict or explain the students’ ability and the item characteristics by analyzing the test characteristics based on item response theory (irt).

a test is very important for both teachers and students. a test can be used to classify weaknesses in terms of verbal skills, mechanical skills, etc. (allen & yen, 1979, p. 1). besides, a test is a powerful method of data collection with an impressive array for gathering numerical rather than verbal data (cohen, manion, & morrison, 2007, p. 414). a test is defined as a standardized procedure for sampling behavior and describing it with categories or scores (gruijter & van der kamp, 2008, p. 2). the essential features of a test are a standardized procedure, a focused behavioral sample, and a description in terms of scores or category mapping (gruijter & van der kamp, 2008, p. 2). the results of the test (scores) can be used to predict or explain the item and test performances (lord & novick, 2008, p. 358). thus, the javanese language test has to be analyzed in terms of its characteristics so that the next version of the test reaches its goal and gives more accurate information about the examinee’s ability.

a test has several uses. five uses of a test are classification, diagnosis and treatment planning, self-knowledge, program evaluation, and research (gregory, 2015, p. 29). a test can be a useful tool, but it can also be dangerous if misused (allen & yen, 1979, p.
5), depending on our professionalism in ensuring that the test is used as accurately and fairly as possible. many extraneous factors can influence the test (gregory, 2015, p. 31). several sources that may influence the test are the manner of administration, the test characteristics, the testing context, the examinee’s motivation and experience, and the scoring method (gregory, 2015, p. 31).

a test requires some planning, including identifying the purposes and the test specifications, selecting the contents, considering the form, writing the test, arranging the layout and the timing, and planning the scoring of the test (cohen et al., 2007, p. 418). we can make a good javanese language test by paying attention to this planning and to the influencing factors. besides, a test result that is accurate, rich, and beneficial for evaluation will be obtained by analyzing the characteristics of the items or of the javanese language test using item response theory (irt).

there are alternative ways to analyze test characteristics, including classical test theory (ctt) and item response theory (irt). in ctt, it is difficult to analyze a test with a large amount of calculation to get useful information (baker, 2001, p. 1). besides, ctt has some weaknesses: the result of the measurement depends on the characteristics of the test used, the item parameters depend on the examinees' ability, and the measurement error provided is limited to group measurement instead of individual information (mardapi, 2017, p. 187). in ctt, if the test is 'hard', the examinee's ability will appear low; if it is 'easy', the examinee's ability will appear higher (hambleton, swaminathan, & rogers, 1991, p. 2). therefore, ctt is considered not effective for analyzing the javanese language test.

the weaknesses of ctt can be overcome by irt. irt is one of the modern psychometric theories that provide useful tools for ability testing (harrison, collins, & müllensiefen, 2017, p. 1).
irt is a powerful tool used to solve a major problem of ctt (downing, 2003, p. 739). item response theory (irt) models, including the rasch model, express the relationship between the test participants' ability on a latent trait (e.g., javanese language skills) and the probability of mastering the given items (answering the items correctly) in the form of logistic models (finch & french, 2015, p. 181). irt has three assumptions (finch & french, 2015, p. 181; mardapi, 2017, p. 187): monotonicity, unidimensionality, and local independence.

ctt has served test development well over several decades, but irt has rapidly become the mainstream theoretical basis of measurement (embretson & reise, 2000, p. 3). the distinctive feature of irt is the specification of a mathematical function relating the probability of an examinee’s response on a test item to an underlying ability (embretson & reise, 2000, p. 8; finch & french, 2015, p. 177; gruijter & van der kamp, 2008, p. 133; hambleton & swaminathan, 1985, p. 9; ostini & nering, 2006, p. 2; reckase, 2009, p. 68; van der linden & hambleton, 1996, p. iii). in other words, the function describes, in probabilistic terms, how a person with low ability and a person with high ability give different responses (ostini & nering, 2006, p. 2).

irt is important because it deals with the relationship between ability (the examinee’s mental traits) and response (performance) to the item (lord & novick, 2008, p. 397). irt is used in many fields of education; it has potential benefits not only in social science but also in medical education (downing, 2003, p. 739). with irt, information about the test characteristics can be gained accurately, so an analysis of the javanese language test using irt needs to be conducted. one of the models in irt is the rasch model.
the rasch model was developed by georg rasch, a danish mathematician, in 1960 (hailaya, alagumalai, & ben, 2014, p. 301; jambulingam, schellhorn, & sharma, 2016, p. 50; mallinson, 2007, p. 1; young, levy, martin, & hay, 2009, p. 545). there are some points of view about the rasch model. the rasch model is a special case of the one-parameter logistic (1 pl) model in which the item discrimination value is set equal to 1 (finch & french, 2015, p. 181). discrimination shows the ability of an item to differentiate among examinees of different ability (finch & french, 2015, p. 181). the rasch model can be expressed as:

$$P(x_j = 1 \mid \theta, b_j) = \frac{e^{\theta - b_j}}{1 + e^{\theta - b_j}} \qquad (1)$$

in equation (1), $x_j$ is the response to item $j$, with 1 being correct in the context of an achievement test, $\theta$ represents an individual's ability, and $b_j$ is the difficulty level of item $j$.

analysis of the javanese language test using the rasch model has practical benefits. we can check whether the model fits the data. the rasch model can define the probability of a specified response in relation to the examinee’s ability and the item difficulty of a javanese language test (hailaya et al., 2014, p. 301; jambulingam et al., 2016, p. 50). with the rasch model, there is no need to weight items differentially to produce a total score that gives the maximum possible amount of information about the latent trait; the number-right score is the best possible total score to use (allen & yen, 1979, p. 260). the rasch model produces a latent-trait (javanese ability) scale and an item difficulty scale that have desirable properties. the analysis of the javanese language test using the rasch model can be done with the r program.

the characteristics of the javanese language test in the school have to be analyzed using the rasch model in irt with the r program to get some information. this information can be gained from the item characteristic curves (icc). the icc gives the probability that examinees at a given ability level answer each item correctly (hambleton & swaminathan, 1985, p. 13).
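as a hedged illustration of the rasch response function and of how an icc is read, the following base r sketch evaluates the probability of a correct response at a few ability values. the helper name rasch_prob and the ability grid are assumptions made here for illustration and are not taken from the study's scripts; the difficulty 0.9695 is the value that table 1 reports for item 23.

```r
# rasch response function: probability of a correct answer to an item
# with difficulty b for an examinee with ability theta.
# rasch_prob is a hypothetical helper name, not part of any package.
rasch_prob <- function(theta, b) {
  exp(theta - b) / (1 + exp(theta - b))
}

theta <- c(-3, 0, 0.9695, 3)  # illustrative ability values (assumption)
b_23  <- 0.9695               # difficulty of item 23 as reported in table 1

p <- rasch_prob(theta, b_23)
print(round(p, 3))
```

when the ability equals the item difficulty, the probability is exactly 0.5, which is why the difficulty of an item can be read off its icc at that height.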
besides the icc, there is other important information about the items or the test that we can get by using the rasch model in irt.

there are many studies of irt application. they compared the use of irt and ctt or studied the application of irt to analyze test characteristics. a study conducted by downing (2003) contrasts irt with ctt and explores the benefits of irt application in typical medical education settings. downing only compares these models and explores the benefits of irt theoretically; he does not go on to discuss the application of irt in an analysis. in this study, irt was used to analyze the test with the rasch model in the r program. essen, idaka, and metibemu (2017) analyze the model-data fit in irt using the bilog and irtpro programs. they used two programs to analyze the model-data fit, whereas in this study one model in one program was used to analyze the model-data fit, the item fit, the difficulty level of the items, the item characteristic curve (icc), the item information curve (iic), the test information curve (tic), the information given by each item, and the javanese ability distribution. more complex information would be revealed in this study.
the study of purnama (2017) was conducted to understand the characteristics of accounting vocational theory test items by irt using the bilog program. this study analyzes the characteristics of the javanese language test using the rasch model in the r program. purnama’s study analyzes the test using the 2 pl model, while this study employs the rasch model, which is a special case of the 1 pl model. purnama’s study did not use the icc to analyze the item characteristics, while in this study the icc is used. another study, conducted by setiawati, izzaty, and hidayat (2018b, 2018a), uses irt to analyze the test with the bilog program, while this study employs the r program.

a study by iskandar and rizal (2018) has some relevance to this study. both studies use a program to conduct the analysis. in their study, they analyze the validity, reliability, difficulty level, and other aspects, but not the item and test characteristic curves, the information functions, the average ability of examinees, etc. those aforementioned studies used ctt, while this study uses irt.

it is hoped that this study presents findings which contribute to analyzing the characteristics of the javanese language test, so that the javanese language test can be evaluated and improved. the javanese language test will be analyzed by irt. the analysis of the javanese language test will be more accurate and can be used to estimate the relationship between the examinee's ability and the examinee's response to the items of the javanese language test. analyzing the javanese language test using irt will produce an analysis not just of the overall test but also of the characteristics of individual items. the characteristic curves of the items and the test (icc and tcc), the information curves (iic and tic), and the other characteristics estimate how accurately the javanese language test gives us information.
based on these explanations, the researchers decided to analyze the javanese language test characteristics based on item response theory using the rasch model in the r program.

method
this study is explorative research, that is, research which aims at finding facts and characteristics systematically and accurately about the javanese language test (arikunto, 2010, p. 14). the characteristics of the javanese language test were analyzed using the rasch model in the r program. this research was conducted in yogyakarta from may to june 2018.

the data analyzed in this study are secondary data. the data were collected by the documentation method, that is, by collecting the answer sheets with 220 students' responses to the javanese language test in depok 1 vocational high school, yogyakarta. the javanese language test consists of 50 multiple-choice items. the instrument, the javanese language test, was made by the javanese language teacher. then, the researchers summarized the responses in a dichotomous data table. wrong responses are denoted by 0, and correct responses are denoted by 1. item number 1 was symbolized with b1, item number 2 with b2, item number 3 with b3, and so on. the data of the javanese language test were analyzed based on irt using the rasch model in the r program.

after the data were collected and analyzed using the rasch model in the r program, some findings were gained. they describe how the characteristics of the javanese language test tell us the probability of an examinee’s response on a test item given an underlying ability (javanese language ability). the researchers analyzed the model fit of the overall data, the difficulty level and item fit of the model, the icc, tcc, iic, and tic, the item information, the javanese language ability distribution, and the descriptive statistics for the javanese language ability.

the model fits the overall data.
the goodness-of-fit test was conducted to test whether the rasch model fits the overall data, whereas the item fit test was done to test whether the model fits the individual items as well. both indicate fit if the p-value is more than 0.05. if the goodness-of-fit test did not meet the fit criterion, then the item fit test was conducted, and the items that did not fit were removed. then, the goodness of fit of the remaining items was reanalyzed until the criterion was met, after which we could continue to the next analysis.

in practice, researchers set the category; e.g., a difficulty level is said to be good if it has a difficulty value ranging from -2.0 to 2.0 (hambleton & swaminathan, 1985, p. 107). in this study, an item is said to be a good item if it has a difficulty level from -3.0 to 3.0. the icc shows the relationship between examinee ability and the probability of a correct response, whereas the tcc shows the relationship between examinee ability and the true score (the sum of the correct-response probabilities). the iic and tic show the information that we can get from an item or the test at a certain examinee ability. the item information is useful for item selection. an item is considered reliable if its item information value is more than 0.5. the javanese language ability distribution and the descriptive statistics describe examinee ability in this test. all of this information is used to explore the javanese language test characteristics in this study.

findings and discussion
after the data were collected and analyzed, some results were gained. they describe how the characteristics of the javanese language test tell us the probability of an examinee’s response to a test item given an underlying ability (javanese language ability).
this can be seen from the model-data fit, the difficulty level and item fit, the icc, tcc, iic, tic, the distribution of javanese language ability, etc.

the first step of the analysis of the characteristics of the javanese language test is the assessment of the model fit for the rasch model. we have to make sure that the rasch model fits overall. the model is said to fit the data if the observed and the model-predicted frequencies of individuals for each response pattern are close to one another (finch & french, 2015, p. 189). to analyze the model fit (whether the model fits the overall data), we used the bootstrap chi-square procedure in the r program. the bootstrap chi-square test of overall model fit for a rasch model was conducted with the command GoF.rasch(model.rasch, B = 1000).

first, the researchers analyzed the model fit for all items (50 items). the result shows that the p-value is 0.006. a p-value of less than 0.05 means that the model does not fit the data. thus, the model did not fit the data (for all items). then the item fit was analyzed (whether the model fits the individual items as well) with the command item.fit(model.rasch, simulate.p.value = TRUE). there were three items that did not fit the model: item number 27, 32, and 35. the data for these three items were removed, and the researchers analyzed the model-data fit again.

in the second analysis of the model-data fit, the p-value was 0.017. it was still less than 0.05, which means that the rasch model did not fit the data. then the researchers analyzed the item fit for these 47 items. they found that item number 3, 11, 13, 36, and 48 did not fit the model. the data for these items were then removed. then, the researchers reanalyzed the model-data fit with the 42 remaining items. the third analysis of the model-data fit showed that the model fits the data.
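the calibration rounds described above follow a simple loop: test the overall fit, drop the misfitting items, and repeat until the overall p-value exceeds 0.05. the sketch below only replays the values reported in the text (overall p-values of 0.006, 0.017, and 0.053 and the two sets of removed items); it does not refit any model. the helper names overall_p, misfitting, and prune_until_fit are hypothetical, and in a real analysis the two stand-in functions would presumably be replaced by refitting the model and calling the overall and item-level fit tests on it.

```r
# replay of the reported calibration rounds; the numbers are the values
# stated in the text, not the output of a new analysis
reported_overall_p <- c("50" = 0.006, "47" = 0.017, "42" = 0.053)
reported_misfit <- list("50" = c(27, 32, 35), "47" = c(3, 11, 13, 36, 48))

# hypothetical stand-ins for the overall and item-level fit tests
overall_p  <- function(items) reported_overall_p[[as.character(length(items))]]
misfitting <- function(items) reported_misfit[[as.character(length(items))]]

prune_until_fit <- function(items, alpha = 0.05, max_rounds = 10) {
  rounds <- 0
  while (rounds < max_rounds) {
    rounds <- rounds + 1
    if (overall_p(items) > alpha) break         # overall fit reached
    items <- setdiff(items, misfitting(items))  # drop misfitting items
  }
  list(items = items, rounds = rounds)
}

result <- prune_until_fit(1:50)
print(result$rounds)
print(length(result$items))
```

the max_rounds guard is a design choice of this sketch: it keeps the loop from running forever if no further item can be removed while the overall test still rejects the model.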
this could be seen from the p-value of 0.053 (more than 0.05). finally, after three rounds of calibration, the researchers found that the rasch model fits the data without item number 3, 11, 13, 27, 32, 35, 36, and 48 (42 items remained to be analyzed). in other words, the researchers had obtained overall model fit for the rasch model and could then continue with the other analyses.

the researchers analyzed the difficulty level of the items and the item fit. the summary of the analysis is presented in table 1. the center of the item difficulty scale is 0; negative values represent relatively easy items, and positive values indicate relatively more difficult items (finch & french, 2015, p. 184). accordingly, the more negative the difficulty value, the easier the item, and the more positive the difficulty value, the more difficult the item.

from the rasch analysis of the difficulty level of the items, it is found that the easiest question is item number 20 (with difficulty level -15.7892) and the hardest question is item number 23 (with difficulty level 0.9702). in theory, the difficulty levels range from minus infinity to infinity. some items fall into the good category based on their difficulty level. there are 28 good items, and the remaining 14 items are not good based on the difficulty level. the items that are not good based on difficulty level are item number 5, 6, 7, 12, 14, 16, 17, 18, 19, 20, 25, 29, 38, and 46. thus, 66.67% of the 42 items are good in terms of difficulty level. hence, the test is in the moderate category based on the difficulty level.

table 1.
difficulty level of items and the items fit of the model

item no.   difficulty level of the item   item fit of the model (p-value)
1          -0.8355                        0.0792
2          -1.0570                        0.6634
4          -0.3796                        0.4554
5          -4.6802*                       0.7165
6          -4.6802*                       0.5149
7          -3.5262*                       0.3861
8          -1.0317                        0.1683
9          -2.8874                        0.2574
10         -1.5902                        0.3366
12         -5.0950*                       0.6832
14         -5.7976*                       0.6436
15         -2.6885                        0.9208
16         -4.1508*                       0.3465
17         -3.7959*                       0.0891
18         -3.9593*                       0.9208
19         -5.7976*                       0.1584
20         -16.0705*                      0.1881
21         -0.2267                        0.0396#
22         -0.5127                        0.9802
23         0.9695                         0.3960
24         -1.8959                        0.8713
25         -4.3832*                       0.8614
26         -1.3516                        0.7426
28         -1.7202                        0.9604
29         -3.1221*                       0.0693
30         -1.5902                        0.2970
31         -0.4016                        0.4356
33         -2.6282                        0.4059
34         -1.7202                        0.4653
37         -1.9713                        0.9406
38         -3.5263*                       0.3168
39         -2.0908                        0.1287
40         -2.0102                        0.1386
41         -1.1084                        0.1386
42         -1.5589                        0.2277
43         -1.4678                        0.2277
44         -2.0908                        0.8119
45         -2.9610                        0.3762
46         -3.3073*                       0.6436
47         -2.1756                        0.4158
49         -1.3235                        0.9505
50         -1.6541                        0.3366

notes: * item is not good based on the difficulty level; # item misfits the rasch model

the teacher should pay attention to the items in the not good category. all of the items that are not good based on the difficulty level are categorized as too easy. these items are not good because they are too easy for every examinee. this is indicated by their difficulty indexes, all of which are smaller than -3.0.

the rasch model fits the data overall, but there is one item that did not fit the rasch model, namely item number 21. we could not draw a conclusion about this item because it did not fit the model. this means that the characteristics of this item (item no. 21) estimated with the rasch model were not adequately accurate. the item characteristics, displayed in the form of curves for all items, can be seen in figure 1.
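the difficulty-based classification can be checked with a few lines of base r. the sketch below copies the item numbers and difficulty estimates from table 1 and applies the study's criterion that an item is good when its difficulty lies between -3.0 and 3.0; the variable names are illustrative assumptions, not the authors' code.

```r
# item numbers and difficulty estimates copied from table 1
item <- c(1, 2, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 17, 18, 19, 20, 21,
          22, 23, 24, 25, 26, 28, 29, 30, 31, 33, 34, 37, 38, 39, 40, 41,
          42, 43, 44, 45, 46, 47, 49, 50)
difficulty <- c(-0.8355, -1.0570, -0.3796, -4.6802, -4.6802, -3.5262,
                -1.0317, -2.8874, -1.5902, -5.0950, -5.7976, -2.6885,
                -4.1508, -3.7959, -3.9593, -5.7976, -16.0705, -0.2267,
                -0.5127, 0.9695, -1.8959, -4.3832, -1.3516, -1.7202,
                -3.1221, -1.5902, -0.4016, -2.6282, -1.7202, -1.9713,
                -3.5263, -2.0908, -2.0102, -1.1084, -1.5589, -1.4678,
                -2.0908, -2.9610, -3.3073, -2.1756, -1.3235, -1.6541)

# study criterion: good if the difficulty lies in [-3.0, 3.0]
good <- difficulty >= -3 & difficulty <= 3
too_easy <- item[difficulty < -3]

print(too_easy)                    # items flagged with * in table 1
print(sum(good))                   # number of good items
print(round(100 * mean(good), 2))  # percentage of good items
```

under this criterion, 28 of the 42 items are classified as good, and all flagged items lie below -3.0, that is, on the too-easy side.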
the item characteristic curve (icc) places the test participant's location on the measured latent trait on the x-axis and the ability to master an item on the y-axis (finch & french, 2015, p. 184). the latent trait refers to the javanese language ability, and the ability to master an item refers to the probability that the examinee responds correctly to the item. from the icc, we can read the probability of a correct answer to an item for someone with a certain ability. the command to get the icc for all items (42 items) together is plot(model.rasch, type = c("ICC")). it gives us the icc of every item in the test.

figure 1 shows the icc of the 42 items. it was difficult to interpret the curves when all icc were plotted together. the icc of item number 23 was located at the rightmost position on the x-axis (finch & french, 2015, p. 185). this means that item number 23 is the most difficult item. the easiest item could not be identified from the figure because the plot was too crowded. however, it is clear that item number 20 is the easiest item based on the difficulty level of the items. if the curves of these items are separated, we can see them more clearly. thus, the icc for item number 20, 23, and two other items can be compared. the icc for item number 20 and 23 and two other items are presented in figure 2.

figure 1. the icc of javanese language test

from figure 1, some of the icc are not good because the correct-response probability for examinees with low ability is high. these items are item number 5, 6, 7, 12, 14, 16, 17, 18, 19, 20, 25, 29, 38, and 46 (a total of 14 items). all of these items fit the model.

figure 2. the icc for items number 20, 23, 24, and 29

however, the difficulty levels of these items are not good. thus, these items (see figure 3) are not good based on the icc and difficulty level.

figure 3.
items with not good icc

to look at the icc of specific items, for example items number 20, 23, 24, and 29, we used the command plot(model.rasch, type = c("ICC"), items = c(17, 20, 21, 25)). it differs slightly from the command for all icc in that the positions of the specific items in the sorted list of the 42 remaining items are given. it draws the icc of the selected items in one graph so that they can be compared easily.

figure 2 presents some item characteristic information. for item number 20, regardless of the student’s ability, the probability of answering correctly is the same for all examinees, namely 1.0 (always correct). it indicates that item number 20 is too easy for every examinee. it means that an examinee with any javanese language ability will be able to respond to the item correctly (examinees with ability values from -4 through 4 could respond to this item correctly). for the hardest item (item number 23), an examinee with ability 1 has a probability of approximately 0.5 of answering this item correctly. to get a high probability of about 0.9 or more, the examinee should have a javanese language ability of almost 4. a higher javanese language ability is needed to increase the chance of answering this item correctly.

the test characteristic that relates ability to the true score is given by the tcc (test characteristic curve). the true score is the sum of the correct-answer probabilities. the tcc of the javanese language test is shown in figure 4.

figure 4. the tcc of the test

from figure 4, it is known that the test is in the easy category. an examinee with low ability (-3) has a true score of approximately 19, and an examinee with average ability (0) has a true score of approximately 35 (near the maximum true score, which is 42).
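the definition of the true score as the sum of the correct-response probabilities can be sketched in base r. the example below uses only the four items of figure 2 with their difficulties from table 1, so its totals are on a scale of 0 to 4 and are not comparable with figure 4, which is based on all 42 items. the helper names rasch_prob and true_score are assumptions for illustration.

```r
# rasch response probability (discrimination fixed at 1)
rasch_prob <- function(theta, b) 1 / (1 + exp(-(theta - b)))

# difficulties of items 20, 23, 24, and 29 as reported in table 1
b <- c(item20 = -16.0705, item23 = 0.9695, item24 = -1.8959, item29 = -3.1221)

# true score at ability theta = sum of correct-response probabilities
true_score <- function(theta, b) sum(rasch_prob(theta, b))

theta_grid <- c(-3, 0, 3)  # illustrative ability values (assumption)
tcc <- sapply(theta_grid, true_score, b = b)

print(round(rasch_prob(0, b), 3))  # per-item probabilities at theta = 0
print(round(tcc, 2))               # four-item true scores on the grid
```

because every icc increases with ability, the tcc increases as well, and an extremely easy item such as item 20 contributes almost a full point at every ability level.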
an examinee with ability value 0 (average ability) has a different probability for each item: a probability of 0.2 for item number 23, a probability of approximately 0.8 or more for item number 24 and 29, and a probability of 1.0 (correct response) for item number 20. figure 2 shows that item number 20 is easier than item number 24 and 29, and item number 24 and 29 are easier than item number 23. figure 1 shows that some icc are not good since the correct-response probability for examinees with low ability is high. these items are item number 5, 6, 7, 12, 14, 16, 17, 18, 19, 20, 25, 29, 38, and 46 (14 items). those items fit the model. the item characteristic of every item can be described in the same way as for item number 20, 23, 24, and 29, by separating its icc from the others so that it can be seen clearly.

in addition to the icc, we used the r program to plot the item information curve (iic). the iic describes the information function of an item. it refers to the degree to which an item reduces the uncertainty in the estimation of the javanese language ability (the latent trait) value for an individual (finch & french, 2015, p. 185). a high value of information in a specific range of the ability distribution indicates that the item provides relatively more information regarding the latent trait (javanese language ability) in that region than in another region of the distribution (finch & french, 2015, p. 186). based on the iic, we can see how reliable the item is in giving information.

all the iic are shown in figure 5. there are 42 iic, each showing the degree of information given by one item. the command to get the iic for all items in the test is plot(model.rasch, type = c("IIC")). the command for specific iic is plot(model.rasch, type = c("IIC"), items = c(17, 20, 24, 40)), which will produce the iic for item number 20, 23, 28, and 47.
the iic for the 42 items are shown in figure 5, and the iic for item number 20, 23, 28, and 47 are shown in figure 7. there are 42 iic that describe how reliable each item is in giving information about the javanese language ability value of an individual. there are just 42 iic, one for each of the 42 items for which the rasch model fits the data.

from figure 5, we can find the most accurate and inaccurate items in giving information about the examinee’s ability in the javanese language. these are shown by item number 20 and 23. the iic for these items are shown separately from the others in figure 7, together with item number 28 and 47.

figure 5. the iic of 42 items

some of the iic give maximum information for examinees with low ability (figure 6). these items are item number 5, 6, 7, 12, 14, 16, 17, 18, 19, 20, 25, 29, 38, and 46 (14 items). these items give little information for examinees with medium or high ability. these items are not good because they give maximum or high information only for low-ability examinees, and they are also not good based on the icc and the difficulty levels. therefore, we can conclude that these items are not good based on the icc, iic, and difficulty level.

figure 6. item with not good iic

figure 7. the iic for item number 20, 23, 28, and 47

figure 7 shows that the iic for item number 20 is the most inaccurate in giving information about the examinee’s javanese language ability. this item cannot give information accurately because, for an examinee of any ability, the information value provided by this item is 0. we cannot differentiate the examinees' ability. no information about the examinees' ability (in the javanese language) can be obtained if we use this item to measure them.
the iic for item number 23 shows that its maximum information of about 0.25 is obtained at an ability of approximately 1; in other words, item 23 provides maximum information for estimating θ (javanese language ability) around values of 1. item numbers 28 and 47 give maximum information about an examinee whose ability is about -2. the iic is different for every item, but this study shows the item information curve in detail only for item numbers 20, 23, 28, and 47. to look at the iic of any other item, we can plot it separately from the others.

item information curves show the information function for every item in the test. the total information can be obtained from the test information function. the test information function has several features: it is defined for a set of test items at each point on the ability scale, and the amount of information is influenced by the quality and number of the test items. one of its most important features is that the contribution of each item to the total information is additive (hambleton & swaminathan, 1985, p. 104). the test information curve, which shows the total information function, is given in figure 8. the command to get the test information curve is plot(model.rasch,type=c("iic"), items=c(0)).

figure 8. test information curve

figure 8 shows the estimate of the test information function. the tic presents how reliable the javanese language test is, and its interpretation is similar to that of the iic. the test provides maximum information for estimating θ around values of -2. thus, the test is best suited to examinees with low javanese language ability.
the test is less accurate in giving information about examinees with javanese language ability of 0 (average ability) or higher. the information function (iic or tic) has applications in test construction, item selection, assessment of measurement precision, test comparison, determination of scoring weights, and comparison of scoring methods (hambleton & swaminathan, 1985, p. 101). in item selection, we can select the items that provide accurate information on the examinee’s ability. an item whose iic shows no information at any theta (ability), like item number 20, should not be used in the test.

table 2. the information of each item in theta -3.0 until 3.0

item no.   information   percentage
1          0.88          87.60%
2          0.86          85.78%
4          0.90          89.93%
5          0.16          15.74%
6          0.16          15.66%
7          0.37          36.94%
8          0.86          86.01%
9          0.53          52.58%
10         0.79          79.35%
12         0.11          11.01%
14         0.06          5.82%
15         0.57          57.43%
16         0.24          24.03%
17         0.31          31.04%
18         0.28          27.67%
19         0.06          5.82%
20         0.00          0.09%
21         0.90          90.31%
22         0.89          89.44%
23         0.87          86.55%
24         0.74          74.38%
25         0.20          20.06%
26         0.83          82.61%
28         0.77          77.38%
29         0.47          46.78%
30         0.79          79.39%
31         0.90          89.86%
33         0.59          58.87%
34         0.77          77.38%
37         0.73          73.00%
38         0.37          37.05%
39         0.71          70.70%
40         0.72          72.27%
41         0.85          85.29%
42         0.80          79.81%
43         0.81          81.09%
44         0.71          70.70%
45         0.51          50.76%
46         0.42          42.14%
47         0.69          68.98%
49         0.83          82.95%
50         0.78          78.42%

the total information of the test across all values of the javanese language ability (latent trait) can be obtained by using the command information(model.rasch, c(-10,10)). the argument c(-10, 10) identifies the range of theta (ability) for which information is requested.
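the amount of information within an ability range can be read as the area under the information curve over that range. for a rasch item this area has a closed form, because the derivative of p(θ) is p(θ)(1 - p(θ)). the python sketch below assumes unit discrimination (so that the total area for one item is 1); the function names and difficulty values are hypothetical, and the sketch is not the implementation behind the r command.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def information_in_range(b, lower, upper):
    """area under the rasch item information curve between lower and upper.

    since dp/dtheta = p * (1 - p), the integral equals p(upper) - p(lower).
    """
    return logistic(upper - b) - logistic(lower - b)

# hypothetical item of average difficulty: most of its information lies in -3..3
print(round(information_in_range(0.0, -3.0, 3.0), 3))  # 0.905

# hypothetical extremely easy item: almost none of its information lies in -3..3
print(information_in_range(-10.0, -3.0, 3.0) < 0.001)  # True
```

under this assumption, the share of an item's information in a range is the same number expressed as a percentage, and the test-level amount is the sum over items.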
the total information provided by the test over the ability range from -10 to 10 equals 41.93, or 100%. in other words, practically all of the information the test can give lies within the ability range -10 to 10. if we request the ability range 0 to 10 with the command information(model.rasch, c(0,10)), the result is 5.9, or 14.08% of the total information provided by the javanese language test. in the standard normal distribution, the area in the range -3 to 3 equals about 99.7% of the total area. the total information given by the test in the ability range -3 to 3 is 24.98, or 59.58% of the total information. there is thus still a moderate amount of information that we can obtain by using this instrument to measure examinees with ability in this range.

besides the icc, the tic, and the total information, we can get the information given by each item in a certain range of ability (theta). in this study, the information given by each item in the ability range -3 to 3 is listed in table 2, together with the percentage of each item's total information that falls in this range. this information can be used for item selection. how reliable an item is depends on the percentage of information obtained from it in this range of theta, and we can set the criterion for a reliable item as needed. for example, if we compose a test, we should not use item number 20, because it gives very little information. if we set the criterion for a reliable item at more than 50%, we get 28 reliable items out of 42 (66.67%) that can be used. the remaining 14 unreliable items are not good. incidentally, these unreliable items are also categorized as not good based on the icc, iic, and difficulty level.
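the selection rule described above can be sketched in python using the percentages listed in table 2. the 50% threshold is the example criterion mentioned in the text; the variable names are hypothetical.

```python
# percentage of each item's information in theta -3.0 to 3.0 (values from table 2)
info_pct = {
    1: 87.60, 2: 85.78, 4: 89.93, 5: 15.74, 6: 15.66, 7: 36.94, 8: 86.01,
    9: 52.58, 10: 79.35, 12: 11.01, 14: 5.82, 15: 57.43, 16: 24.03, 17: 31.04,
    18: 27.67, 19: 5.82, 20: 0.09, 21: 90.31, 22: 89.44, 23: 86.55, 24: 74.38,
    25: 20.06, 26: 82.61, 28: 77.38, 29: 46.78, 30: 79.39, 31: 89.86, 33: 58.87,
    34: 77.38, 37: 73.00, 38: 37.05, 39: 70.70, 40: 72.27, 41: 85.29, 42: 79.81,
    43: 81.09, 44: 70.70, 45: 50.76, 46: 42.14, 47: 68.98, 49: 82.95, 50: 78.42,
}

threshold = 50.0  # example criterion for a "reliable" item
reliable = sorted(item for item, pct in info_pct.items() if pct > threshold)
unreliable = sorted(item for item, pct in info_pct.items() if pct <= threshold)

print(len(reliable), len(unreliable))  # 28 14
print(unreliable)
```

changing the threshold changes the split, so the criterion should be chosen to suit the purpose of the test being composed.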
to obtain latent trait (javanese language ability) estimates for the rasch model in the r program, we used the command theta.rasch0.7. a good test item has a difficulty index in the moderate category. the item discrimination power was also used as a consideration in deciding whether a test item is good or poor. fernandes (1984) states that a good test item is an item with a discrimination power of >0.2. he adds that a distractor is considered functioning when it is chosen by at least 2% of the total examinees. table 2 shows the analysis result of the item characteristic function based on the classical test theory. it shows the characteristics of the 40 test items in test package 1. in general, the items in test package 1 are in a poor category. a test item is said to be good when it meets three criteria, i.e. it has a moderate difficulty index and good discrimination power, and all of its distractors function well. in test package 1, only fourteen items (35%) are in a good category, while 26 items (65%) are in a poor category. table 2 also shows that 23 items (57.5%) are categorized as easy. table 3.
difficulty index of five test packages (classical test theory approach)

item number   package 1   package 2   package 3   package 4   package 5
1             0.945       0.952       0.641       0.651       0.954
2             0.874       0.839       0.516       0.914       0.868
3             0.929       0.903       0.730       0.638       0.829
4             0.134       0.129       0.897       0.063       0.033
5             0.638       0.508       0.468       0.638       0.829
6             0.709       0.508       0.754       0.533       0.638
7             0.961       0.903       0.817       0.330       0.967
8             0.724       0.702       0.833       0.829       0.638
9             0.937       0.902       0.683       0.868       0.914
10            0.701       0.637       0.714       0.954       0.661
11            0.748       0.637       0.889       0.586       0.638
12            0.480       0.306       0.873       0.645       0.408
13            0.827       0.742       0.770       0.816       0.697
14            0.953       0.847       0.079       0.875       0.941
15            0.449       0.435       0.460       0.737       0.474
16            0.740       0.750       0.675       0.474       0.737
17            0.913       0.895       0.921       0.941       0.875
18            0.827       0.863       0.619       0.697       0.816
19            0.646       0.694       0.881       0.408       0.645
20            0.748       0.398       0.651       0.638       0.586
21            0.961       0.919       0.714       0.428       0.961
22            0.709       0.661       0.849       0.632       0.632
23            0.102       0.266       0.651       0.743       0.566
24            0.307       0.839       0.675       0.658       0.493
25            0.268       0.782       0.722       0.250       0.349
26            0.433       0.331       0.849       0.349       0.250
27            0.764       0.766       0.611       0.493       0.658
28            0.921       0.935       0.556       0.566       0.743
29            0.748       0.726       0.683       0.632       0.632
30            0.449       0.460       0.206       0.961       0.428
31            0.654       0.734       0.968       0.283       0.684
32            0.898       0.863       0.556       0.664       0.862
33            0.535       0.718       0.278       0.553       0.618
34            0.654       0.480       0.302       0.645       0.724
35            0.732       0.815       0.325       0.822       0.724
36            0.882       0.895       0.325       0.724       0.822
37            0.591       0.629       0.786       0.724       0.645
38            0.346       0.653       0.857       0.618       0.553
39            0.693       0.718       0.754       0.862       0.664
40            0.465       0.323       0.484       0.684       0.283
average (b)   0.675       0.677       0.651       0.661       0.668

parallel tests viewed from the arrangement of item numbers... badrun kartowagiran, djemari mardapi, dian normalitasari purnama, & kriswantoro 174 copyright © 2019, reid (research and evaluation in education), 5(2), 2019 issn 2460-6995

based on the analysis from test package 1, the difficulty index of each item in test packages 2, 3, 4, and 5 was analyzed. the item difficulty index analysis using the classical test theory was done with the quest program.
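in the classical test theory, the difficulty index of an item is the proportion of examinees who answer it correctly. a minimal python sketch of that calculation is given below; it is not the procedure used by the quest program, and the small response matrix is hypothetical.

```python
def difficulty_index(responses):
    """proportion of correct answers per item.

    responses: one row per examinee, each row a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees for j in range(n_items)]

# hypothetical scores of four examinees on three items
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]
print(difficulty_index(responses))  # [1.0, 0.75, 0.25]
```

a higher value means that more examinees answered the item correctly, that is, the item is easier.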
the analysis result for the item difficulty index of each test package is shown in table 3, which lists the difficulty index of each item in the five test packages. package 1 is the original test package without any randomization, so package 2, package 3, package 4, and package 5, which had undergone randomization, were reconstructed with their items rearranged into the original order. table 3 shows that after the randomization of item numbers and alternative answers, the difficulty index in the five packages ranged from 0.102 to 0.968. this range is quite large, given that in the classical test theory the difficulty index can only range from 0 to 1. further, based on the result of the analysis shown in table 3, the characteristics of each test item in the five packages were analysed in terms of the difficulty index. the result is shown in table 4.

table 4 shows that, viewed from the difficulty index, the items in all five test packages are generally in the easy and moderate categories, while the number of items in the difficult category is only two or three per package. because the randomized packages were reconstructed into their original arrangement, the proportion of items in each category can be compared across packages. a deeper look reveals that some items changed difficulty category. for instance, item 6 was categorized as an easy item in package 1, but after the randomization it was categorized as a moderate item in package 2. another example is item 25, which was categorized as a difficult item in package 1 but as an easy item in package 2 and package 3. seen from the difficulty index category, then, many items change after the item numbers are randomized.
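the category shifts described above can be sketched in python. the cut-offs of 0.30 and 0.70 are an assumption inferred from how the items are grouped in the tables, not thresholds stated explicitly in the text; the four items and their values are taken from table 3 for package 1 and package 2.

```python
def difficulty_category(p, easy_above=0.70, difficult_below=0.30):
    """classify a classical difficulty index (assumed cut-offs, see lead-in)."""
    if p > easy_above:
        return "easy"
    if p < difficult_below:
        return "difficult"
    return "moderate"

# difficulty index of items 4, 6, 13, and 25 (values from table 3)
package_1 = {4: 0.134, 6: 0.709, 13: 0.827, 25: 0.268}
package_2 = {4: 0.129, 6: 0.508, 13: 0.742, 25: 0.782}

# items whose category differs between the two packages
shifted = [
    item for item in package_1
    if difficulty_category(package_1[item]) != difficulty_category(package_2[item])
]
print(shifted)  # [6, 25]
```

applied to all 40 items and to each randomized package in turn, the same comparison would give counts of shifted items like those reported in table 5, provided the assumed cut-offs match the ones the authors used.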
the percentages of the changes or shifts in the item difficulty category are shown in table 5. table 5 shows that the biggest shift in difficulty category is the shift of 24 items (60%) from package 1 to package 3, while the smallest shift is the shift from package 1 to package 2, i.e. 9 items (22.5%). based on the result of the analysis using the classical test theory approach, kruskal-wallis analysis was conducted to see whether there was any significant difference in the item difficulty index of the randomized test packages. the summary of the result of the analysis is in table 6.

table 4. characteristics of item difficulty index based on classical test theory

category: easy
package 1 (57.5%): 1, 2, 3, 6, 7, 8, 9, 10, 11, 13, 14, 16, 17, 18, 20, 21, 22, 27, 28, 29, 32, 35, 36
package 2 (60%): 1, 2, 3, 7, 8, 9, 11, 13, 14, 16, 17, 18, 21, 24, 25, 27, 28, 29, 31, 32, 33, 35, 36, 39
package 3 (47.5%): 3, 4, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 22, 25, 26, 31, 37, 38, 39
package 4 (30%): 2, 8, 9, 10, 13, 14, 15, 17, 23, 20, 35, 39
package 5 (40%): 1, 2, 3, 5, 7, 9, 14, 16, 17, 18, 21, 28, 32, 34, 35, 36

category: moderate
package 1 (35%): 5, 12, 15, 19, 24, 26, 30, 31, 33, 34, 37, 38, 39, 40
package 2 (35%): 5, 6, 10, 12, 15, 19, 20, 22, 26, 30, 34, 37, 38, 40
package 3 (45%): 1, 2, 5, 9, 15, 16, 18, 20, 23, 24, 27, 28, 29, 32, 34, 35, 36, 40
package 4 (62.5%): 1, 3, 5, 6, 7, 11, 12, 16, 18, 19, 20, 21, 22, 24, 26, 27, 28, 29, 32, 33, 34, 36, 37, 38, 40
package 5 (52.5%): 6, 8, 10, 11, 12, 13, 15, 19, 20, 22, 23, 24, 25, 27, 29, 30, 31, 33, 37, 38, 39

category: difficult
package 1 (7.5%): 4, 23, 25
package 2 (5%): 4, 23
package 3 (7.5%): 14, 30, 33
package 4 (7.5%): 4, 25, 31
package 5 (7.5%): 4, 26, 40

table 5. category shift of item difficulty index of five test packages

packages 1-2      packages 1-3      packages 1-4      packages 1-5
9 items (22.5%)   24 items (60%)    20 items (50%)    15 items (37.5%)

table 6 shows that the value of asymp. sig. for all groups of items whose difficulty index is tested among package 1, package 2, package 3, package 4, and package 5 is above 0.05. it means that there is no significant difference in the difficulty index of the items in the five test packages, so item number randomization has no effect on the item difficulty index.

after the effect of item number randomization on the difficulty index was scrutinized, the effect of the randomization on the discrimination index was analysed. the percentages of items with good and poor discrimination index are shown in table 7. table 7 shows that in each of test packages 1, 2, 3, 4, and 5, more than 60% of the items have a discrimination index in the good category. compared with test package 1, the number of items with a good discrimination index shifts after the randomization in test packages 2, 3, 4, and 5. however, a closer look reveals that the shift is small, affecting only two to four items.

parallel tests based on analysis using item response theory approach

before scrutinizing whether package 2, package 3, package 4 and package 5 are parallel to package 1 or not, the researchers need to describe the assumption test of the item response theory (irt), which is the unidimension assumption test (naga, 1992). the requirement for unidimension is aimed at sustaining invariance in irt. if a test item measures more than one dimension, then the answer to the item is a combination of different competencies of the examinees, and the contribution of each competency to the answer is unknown. unidimension assumption testing is carried out to reveal whether a test measures one trait. the unidimension assumption is tested by factor analysis and its empirical result. the kmo-msa value is sufficient if it is above 0.5 (field, 2009).
by looking at the first eigenvalue contribution to test variance, according to reckase (1979), the formation of eigenvalue factor has to have a value above 1. in the factor analysis, the first eigenvalue has to have the biggest value (dominant) compared to the second, third, and subsequent eigenvalues. the result of the analysis of unidimension assumption testing is shown in table 8.

table 6. result of kruskal-wallis analysis of the classical test theory

item numbers   asymp. sig.
1-5            0.810
6-10           0.885
11-15          0.819
16-20          0.760
21-25          0.418
26-30          0.882
31-35          0.344
36-40          0.760

table 7. category of discrimination power of five test packages

discrimination power   package 1          package 2          package 3          package 4          package 5
good                   29 items (72.5%)   27 items (67.5%)   33 items (82.5%)   33 items (82.5%)   32 items (80%)
poor                   11 items (27.5%)   13 items (32.5%)   7 items (17.5%)    7 items (17.5%)    8 items (20%)

table 8. unidimension assumption test

test package   kmo     sig.   eigenvalue factor 1   eigenvalue factor 2   category
package 1      0.469   0.00   3.637                 2.831                 multidimension
package 2      0.513   0.00   3.807                 2.223                 multidimension
package 3      0.608   0.00   5.891                 2.367                 unidimension
package 4      0.571   0.00   5.345                 2.483                 unidimension
package 5      0.580   0.00   5.003                 2.446                 unidimension

table 8 shows that, of the five packages whose unidimension assumption was analyzed, three packages are unidimensional (package 3, package 4, and package 5), while two packages are multidimensional (package 1 and package 2). the analysis was based on the sampling adequacy value (kmo) and the eigenvalues. the second assumption is the local independence assumption and parameter invariance. according to retnawati (2014, p.
7), this assumption is automatically satisfied once unidimensionality has been established. after the assumption testing, the test item characteristics were analysed using the irt. the fit of each item to the model was tested following the criterion of sumintono and widhiarso (2015, p. 81) that an item fits the model if the value of outfit mnsq is between 0.5 and 1.5. the items can also be ordered by difficulty index from the most difficult to the easiest. an item is categorized easy if its difficulty index is close to -2.00, moderate if its difficulty index ranges from -1.00 to +1.00, and difficult if its difficulty index is close to +2.00. the result of the analysis of item characteristics based on difficulty index is shown in table 9.

table 9. characteristics of items in package 1 based on the item response theory

item number   outfit mnsq   fit category   difficulty index   difficulty category   item category
1             1.77          not fit        0.390              moderate              poor
2             0.91          fit            0.270              moderate              good
3             0.97          fit            0.350              moderate              good
4             1.26          fit            0.270              moderate              good
5             0.96          fit            0.190              moderate              good
6             1.16          fit            0.200              moderate              good
7             0.69          fit            0.460              moderate              good
8             0.91          fit            0.210              moderate              good
9             0.75          fit            0.370              moderate              good
10            0.82          fit            0.200              moderate              good
11            0.88          fit            0.210              moderate              good
12            0.94          fit            0.190              moderate              good
13            1.24          fit            0.240              moderate              good
14            0.64          fit            0.420              moderate              good
15            1.03          fit            0.190              moderate              good
16            0.87          fit            0.210              moderate              good
17            0.62          fit            0.320              moderate              good
18            0.99          fit            0.240              moderate              good
19            1.01          fit            0.190              moderate              good
20            1.25          fit            0.210              moderate              good
21            0.96          fit            0.460              moderate              good
22            0.91          fit            0.200              moderate              good
23            1.39          fit            0.300              moderate              good
24            1.13          fit            0.200              moderate              good
25            0.97          fit            0.210              moderate              good
26            0.94          fit            0.190              moderate              good
27            0.81          fit            0.220              moderate              good
28            1.07          fit            0.340              moderate              good
29            0.94          fit            0.210              moderate              good
30            1.16          fit            0.190              moderate              good
31            0.80          fit            0.200              moderate              good
32            0.86          fit            0.300              moderate              good
33            0.83          fit            0.190              moderate              good
34            1.00          fit            0.200              moderate              good
35            1.13          fit            0.210              moderate              good
36            0.88          fit            0.280              moderate              good
37            1.28          fit            0.190              moderate              good
38            0.96          fit            0.200              moderate              good
39            1.00          fit            0.200              moderate              good
40            1.04          fit            0.190              moderate              good

table 9 shows that, in terms of model fit, 39 items fit and one item does not fit the rasch model because its outfit mnsq is outside the stated range. furthermore, in terms of the item difficulty index, all items fall into the moderate category, and therefore it can be concluded that only one of the 40 items is not good. later, based on the result of the analysis of package 1, the analysis of the difficulty index of the items in the other test packages was conducted. the item analysis of the five test packages using the item response theory resulted in the difficulty index of each item as shown in table 10. table 10 shows the difficulty value of each test item in the five test packages after the items were matched to the items in package 1.

table 10. difficulty index of five test packages based on the item response theory

item number   package 1   package 2   package 3   package 4   package 5
1             0.390       0.420       0.200       0.180       0.390
2             0.270       0.250       0.190       0.300       0.250
3             0.350       0.310       0.220       0.180       0.220
4             0.270       0.280       0.300       0.460       0.470
5             0.190       0.190       0.190       0.180       0.170
6             0.200       0.190       0.220       0.170       0.180
7             0.460       0.310       0.240       0.470       0.460
8             0.210       0.200       0.250       0.220       0.180
9             0.370       0.310       0.210       0.250       0.300
10            0.200       0.200       0.210       0.390       0.180
11            0.210       0.200       0.290       0.180       0.180
12            0.190       0.200       0.280       0.180       0.180
13            0.240       0.210       0.220       0.220       0.190
14            0.420       0.260       0.360       0.250       0.350
15            0.190       0.190       0.190       0.190       0.170
16            0.210       0.220       0.200       0.170       0.190
17            0.320       0.300       0.340       0.350       0.250
18            0.240       0.270       0.200       0.190       0.220
19            0.190       0.200       0.290       0.180       0.180
20            0.210       0.230       0.200       0.180       0.180
21            0.460       0.340       0.210       0.180       0.420
22            0.200       0.200       0.260       0.180       0.180
23            0.300       0.210       0.200       0.200       0.180
24            0.200       0.250       0.200       0.180       0.170
25            0.210       0.230       0.210       0.200       0.180
26            0.190       0.200       0.260       0.180       0.200
27            0.220       0.220       0.200       0.170       0.180
28            0.340       0.370       0.190       0.180       0.200
29            0.210       0.210       0.210       0.180       0.180
30            0.190       0.190       0.240       0.420       0.180
31            0.200       0.210       0.520       0.190       0.190
32            0.300       0.270       0.190       0.180       0.240
33            0.190       0.210       0.220       0.170       0.180
34            0.200       0.420       0.210       0.180       0.190
35            0.210       0.240       0.210       0.220       0.190
36            0.280       0.300       0.210       0.190       0.220
37            0.190       0.190       0.230       0.190       0.180
38            0.200       0.200       0.270       0.180       0.170
39            0.200       0.210       0.220       0.240       0.180
40            0.190       0.200       0.190       0.190       0.190
average       0.250       0.245       0.236       0.222       0.222

baker (2001, p. 11) divides the difficulty indices of items according to the irt into five categories: very easy, easy, moderate, difficult, and very difficult. an item is said to be very easy if its difficulty index value is lower than -2.00. an item is categorized easy if it has a difficulty index value close to -2.00. an item is categorized moderate if it has a difficulty index value ranging from -1.00 to +1.00. an item is categorized difficult if it has a difficulty index value close to +2.00, and very difficult if the difficulty index value is higher than +2.00.
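the five categories attributed to baker (2001) can be sketched as a small python function. the text only says "close to" -2.00 and +2.00 for the easy and difficult categories, so the exact boundaries used below (easy from -2.00 up to -1.00, difficult from +1.00 up to +2.00) are an assumption made for illustration.

```python
def irt_difficulty_category(b):
    """classify an irt difficulty value b (boundaries partly assumed, see lead-in)."""
    if b < -2.0:
        return "very easy"
    if b < -1.0:
        return "easy"
    if b <= 1.0:
        return "moderate"
    if b <= 2.0:
        return "difficult"
    return "very difficult"

# lowest and highest difficulty values that appear in table 10
print(irt_difficulty_category(0.170), irt_difficulty_category(0.520))  # moderate moderate
```

with these boundaries, every value listed in table 10 falls in the moderate category.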
based on the result of the analysis using the item response theory, all items in package 1, package 2, package 3, package 4 and package 5 have a difficulty index in a good category. this is in line with table 10, which shows that the difficulty indexes of all items lie between -1.00 and 1.00, which means that all items have a difficulty index in the moderate category. in addition to showing item characteristics based on the difficulty index according to the irt, table 10 also shows the average difficulty index of the 40 test items in the five test packages. the average difficulty index of the test items is 0.250 in package 1, 0.245 in package 2, 0.236 in package 3, 0.222 in package 4, and 0.222 in package 5. table 10 also shows that the difficulty indexes of the items in the five packages are not very different from each other. based on the result of the analysis using the item response theory, a test was done to see the significance of the differences in item difficulty index among the randomized test packages. the test was conducted using kruskal-wallis analysis. the summary of the analysis result is presented in table 11.

table 11. the result of the test using kruskal-wallis of the item response theory

item number   asymp. sig.
1-5           0.591
6-10          0.795
11-15         0.178
16-20         0.222
21-25         0.063
26-30         0.094
31-35         0.054
36-40         0.110

table 11 shows that the value of asymp. sig. for all groups of items compared among package 1, package 2, package 3, package 4, and package 5 is above 0.05. this means that there is no significant difference in the difficulty index of the five test packages. therefore, there is no effect of item number randomization on the item difficulty index.

discussion

mathematics is one of the school subjects which is tested in junior high school national examination.
hamdi, kartowagiran, and haryanto (2018) believe that students’ mathematics competence can be used to solve a variety of problems and difficulties they face in learning various sciences, especially natural science. this forms the basis for the importance of mathematics, so that it becomes one of the school subjects examined in the national examination. the mathematics test in the national examination consists of a number of parallel test packages. the packages are constructed with the same items but with randomized item numbers and alternative answers in order to distinguish one package from the others. the use of parallel test packages is expected to prevent students from cheating, so that their real mastery can be known. unparallel tests may result in error of measurement, that is, the result of the test does not show the real competence mastery of the students (purnama, 2017). this research analyses five test packages that differ in their item randomization in order to prove whether the randomized test packages are really parallel. whether or not a test is of good quality can be seen in the difficulty index of each item. a test item is said to be good if it is neither too difficult nor too easy, or in other words, if its difficulty index is moderate. the item difficulty index is usually related to the aim of the test (mehrens & lehmann, 1973, p. 195). this research applies the classical test theory and the item response theory approaches in the analysis of the test item difficulty index. the classical test theory approach is a very simple and easily understood approach to analyzing test items empirically (güler, uyanik, & teker, 2014), while the item response theory approach is used to cover the weaknesses of the classical test theory approach.
before a further analysis was conducted to find out whether a test remained parallel after its item numbers were randomized, the quality (characteristic function) of the items in package 1 was analysed, because package 1 is the original test package and serves as the reference for the analysis of the other four test packages. putro (2013) states that good test items have to meet at least three requirements, i.e. item difficulty index, discrimination power, and well-functioning distractors. the result of the analysis using the classical test theory shows that in general package 1 is in a poor category. this can be seen in the difficulty index, discrimination power, and the functioning of the distractors. viewed from the value of the difficulty index, it is obvious that many items are still in the easy category, which most students can answer correctly. the result of the analysis of the five test packages using the classical test theory approach shows that, in terms of the difficulty index, out of the 40 test items in each of the five test packages, 5% to 7.5% of the items are difficult items, 35% to 62.5% of the items are moderate or good items, and 30% to 60% of the items are easy items. viewed from the average of the item difficulty index as shown in table 3, all of the five test packages have an average difficulty index categorized as moderate or good. the value of the item difficulty index of the five test packages lies between 0.102 and 0.968. the higher the difficulty index, the easier the test item, and vice versa: the lower the item difficulty index, the more difficult the item (bichi, 2016).
this is in line with allen and yen (1979), who state that in test item measurement, the item difficulty index is related to the percentage of the examinees who can answer the item correctly. the difficulty index is the proportion of test takers who answer a particular question correctly out of all test takers. based on the classical test theory, it is known that there has been a shift in the category of the difficulty index of some items in package 2, package 3, package 4, and package 5 compared to that of the items in package 1. for example, test item 1 is in the easy category in package 1, but in package 3 and package 4 it is in the moderate category. another example is that test item 6 is in the easy category in package 1, but in package 2 it is in the moderate category. overall, the percentage of items whose difficulty category shifted is 22.5% in package 2, 60% in package 3, 50% in package 4, and 37.5% in package 5. this is due to a weakness of item analysis using the classical test theory approach, i.e. the size of the item characteristics (in this case the difficulty index) depends on the distribution of the competence of the test takers in the sample that is used (awopeju & afolabi, 2016). in line with this opinion, zaman, kashmiri, mubarak, and ali (2008) add that the comparison of test results of different test takers is one of the weaknesses of the classical test theory which is worth noting, because test takers must do items which are the same or really parallel. it is one of these weaknesses that necessitates the use of the irt. in the irt, the first thing to see is the assumption test. the unidimension assumption testing of the five test packages must first see the sufficiency of the sample. research findings show that the value of kmo-msa of package 1 is 0.469, package 2 is 0.513, package 3 is 0.608, package 4 is 0.571, and package 5 is 0.580.
according to field (2009), the value of kmo-msa is considered sufficient if it is above 0.5. from this result, it can be concluded that four packages have a sufficient sample, i.e. package 2, package 3, package 4, and package 5, because their kmo-msa values are >0.5. the result of the significance analysis using bartlett’s test of sphericity shows that each of the five test packages is at the significance level of 0.000. therefore, the requirement is met because the significance level is below 0.05. there are a number of ways to interpret the sufficiency of the unidimension assumption. one of the ways is by looking at the contribution of the first eigenvalue to test variance. the result of the above analysis shows that three test packages have a dominant factor whose eigenvalue is more than twice that of the second factor, i.e. package 3 with 5.891 against a second eigenvalue of 2.367, package 4 with 5.345 against a second eigenvalue of 2.483, and package 5 with 5.003 against a second eigenvalue of 2.446, where the first factor is the most dominant factor. in the factor analysis, the first eigenvalue should have the highest value (dominant) compared to the second, third, and subsequent eigenvalues. this is because the size of the variance is directly proportional to the size of the eigenvalue (field, 2009, p. 652; johnson & wichern, 2002, p. 441), and therefore, it can be concluded that the first factor in the factor analysis contributes the most compared to the other factors, and thus the unidimensionality assumption is met. a difficulty index (b) which lies in the range between -2 and 2 is good (surya & aman, 2016).
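the unidimensionality screening discussed above can be sketched in python with the values reported in table 8. combining the two criteria into a single rule (kmo above 0.5 and a first eigenvalue more than twice the second) is an assumption made for this illustration, and the function name is hypothetical.

```python
# kmo and the first two eigenvalues for each package (values from table 8)
packages = {
    "package 1": (0.469, 3.637, 2.831),
    "package 2": (0.513, 3.807, 2.223),
    "package 3": (0.608, 5.891, 2.367),
    "package 4": (0.571, 5.345, 2.483),
    "package 5": (0.580, 5.003, 2.446),
}

def screen_unidimensionality(kmo, eig1, eig2, kmo_min=0.5, ratio_min=2.0):
    """assumed rule: sufficient sample and a clearly dominant first factor."""
    sample_ok = kmo > kmo_min
    dominant = eig1 > ratio_min * eig2
    return "unidimension" if sample_ok and dominant else "multidimension"

results = {
    name: screen_unidimensionality(kmo, eig1, eig2)
    for name, (kmo, eig1, eig2) in packages.items()
}
for name, (kmo, eig1, eig2) in packages.items():
    print(name, round(eig1 / eig2, 2), results[name])
```

under this assumed rule, the resulting labels agree with the category column of table 8: package 2 has a sufficient kmo but no dominant first factor, while package 1 fails both checks.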
the result of the analysis using the irt approach shows that the five test packages have the difficulty index ranging from 0.170 to 0.470. the value of the difficulty index shows that all of the test items are in the moderate category, which lies between -1.00 and +1.00 (sumintono & widhiarso, 2015). it means that based on the result of the analysis using the irt, all test items in the five test packages have the same characteristics. the analysis of the characteristics of the item difficulty index was then followed by kruskal-wallis analysis to reveal the effect of the randomization of the item numbers and alternative answers on the item difficulty index. the kruskal-wallis analysis was conducted using the values of the difficulty index obtained using the classical test theory and item response theory approaches. the result of the analysis using the classical test theory approach shows that the randomization of the item numbers and alternative answers does not affect the item difficulty index, as shown by the value of asymp. sig. above 0.05. this result is in line with the finding of the research by santoso (2013), which states that the estimation of the competence and length of the test with randomized design is not significantly different from the test which was not randomized. the research finding applying the irt approach likewise shows that there is no significant difference in the difficulty index of the test items in the five test packages after the randomization. in the case of the classical test theory approach, the absence of an effect of the randomization of the item numbers and alternative answers may result from the characteristics of package 1, the original test package that did not undergo any randomization. package 1 has characteristics which tend to be poor. viewed from its difficulty index, more than 50% of the items are easy items, which most students, both those with high competence and those with low competence, can answer correctly.
it means that the test cannot distinguish students with high competence from those with low competence. a test that tends to be easy for students will not show any effect of randomization because they will tend to be able to do it.

in addition, a kruskal-wallis test was conducted on the difficulty index obtained using the classical test theory and item response theory. package 1, the original test package, was used to find out whether there is a difference in the difficulty index between items 1-10 and items 31-40. the result shows an asymp. sig. value of 0.082 when using the classical test theory and 0.054 when using the irt, both of which are above 0.05. it means that in the original test package, before randomization, the values of the item difficulty index are not in a wide range. this may be the reason for the absence of a difference in the difficulty index after randomization. further studies need to be done on test items which have good characteristics to see whether or not the randomization of the item numbers and alternative answers affects the item difficulty index.

conclusion

all of the five test packages have a good reliability index, lying between 0.96 and 0.97. package 1, package 2, and package 3 have a reliability index of 0.96, while package 4 and package 5 have a reliability index of 0.97. it can be concluded that, based on the value of the reliability index, the five test packages have equal reliability. based on the result of the analysis using the classical test theory, all five test packages have an average difficulty index ranging from 0.102 to 0.968. the result of the kruskal-wallis analysis of the five test packages shows that there is no difference in the difficulty index of the items in package 1, package 2, package 3, package 4, and package 5. thus, the randomization of the item numbers and alternative answers has no effect on the item difficulty index. the analysis of the test items using the item response theory shows that the average difficulty index of the five test packages ranges from 0.170 to 0.470. the result of the analysis of the difficulty index of the items in the five test packages shows that there is no difference in the difficulty felt by the students doing package 1, package 2, package 3, package 4, and package 5. this means that the randomization of item numbers has no effect on the item difficulty index. constructing parallel tests by randomizing the item numbers and alternative answers is therefore good to do, and this research has shown that applying this method results in parallel tests.

references

allen, m. j., & yen, w. m. (1979). introduction to measurement theory. los angeles, ca: wadsworth.
awopeju, o. a., & afolabi, e. r. i. (2016). comparative analysis of classical test theory and item response theory based item parameter estimates of senior school certificate mathematics examination. european scientific journal, esj, 12(28), 263–284. https://doi.org/10.19044/esj.2016.v12n28p263
azwar, s. (2013). reliabilitas dan validitas (4th ed.). yogyakarta: pustaka pelajar.
azwar, s. (2015). reliabilitas dan validitas. yogyakarta: pustaka pelajar.
baker, f. b. (2001). the basics of item response theory (2nd ed.). college park, md: eric clearinghouse on assessment and evaluation.
bichi, a. a. (2016). classical test theory: an introduction to linear modeling approach to test and item analysis. international journal for social studies, 2(9), 27–33. https://doi.org/10.26643/ijss.v2i9.6690
center for educational assessment. (2014).
laporan pengolahan ujian nasional tahun ajaran 2014/2015 (unpublished). jakarta: center for educational assessment of republic of indonesia.
fernandes, h. j. x. (1984). testing and measurement. jakarta: national education planning, evaluation, and curriculum development.
field, a. (2009). discovering statistics using spss (3rd ed.). london: sage publications.
güler, n., uyanik, g. k., & teker, g. t. (2014). comparison of classical test theory and item response theory in terms of item parameters. european journal of research on education, 2(1), 1–6.
hamdi, s., kartowagiran, b., & haryanto, h. (2018). developing a testlet model for mathematics at elementary level. international journal of instruction, 11(3), 375–390. https://doi.org/10.12973/iji.2018.11326a
johnson, r. a., & wichern, d. w. (2002). applied multivariate statistical analysis. englewood cliffs, nj: prentice-hall.
kronmüller, k.-t., saha, r., kratz, b., karr, m., hunt, a., mundt, c., & backenstrass, m. (2008). reliability and validity of the knowledge about depression and mania inventory. psychopathology, 41(2), 69–76. https://doi.org/10.1159/000111550
law no. 14 of 2005 of republic of indonesia about teachers and lecturers. (2005).
mardapi, d. (2014). pengukuran, penilaian, dan evaluasi pendidikan. yogyakarta: nuha litera.
mehrens, w. a., & lehmann, j. l. (1973). measurement and evaluation in education and psychology. new york, ny: holt, rinehart, and winston.
miller, m. d., linn, r. l., & gronlund, n. e. (2009). measurement and assessment in teaching (10th ed.). upper saddle river, nj: pearson.
naga, d. s. (1992). pengantar teori sekor pada pengukuran pendidikan. jakarta: gunadarma.
purnama, d. n. (2017).
characteristics and equation of accounting vocational theory trial test items for vocational high schools by subject-matter teachers' forum. reid (research and evaluation in education), 3(2), 152–162. https://doi.org/10.21831/reid.v3i2.18121
putro, n. h. p. s. (2013). karakteristik butir soal ulangan kenaikan kelas sebagai persiapan bank soal bahasa inggris. jurnal penelitian dan evaluasi pendidikan, 15(1), 92–114. https://doi.org/10.21831/pep.v15i1.1089
rasyid, h., & mansur, m. (2008). penilaian hasil belajar. bandung: cv wacana prima.
reckase, m. d. (1979). unifactor latent trait models applied to multifactor tests: results and implications. journal of educational statistics, 4(3), 207–230. https://doi.org/10.3102/10769986004003207
retnawati, h. (2014). teori respons butir dan penerapannya: untuk peneliti, praktisi pengukuran dan pengujian, mahasiswa pascasarjana. yogyakarta: nuha medika.
reynolds, c. r., livingston, r. b., & willson, v. l. (2009). measurement and assessment in education (2nd ed.). upper saddle river, nj: pearson.
rohmawati, r. (ed.). (2013). kurikulum 2013, 87 persen guru kesulitan cara penilaian. retrieved january 6, 2018, from https://unnes.ac.id/berita/87-persenguru-kesulitan-soal-penilaian-kurikulum-2013.html
sanjaya, w. (2010). kurikulum dan pembelajaran. jakarta: kencana.
santoso, a. (2013). pemilihan butir alternatif pada tes adaptif untuk peningkatan keamanan tes. jurnal kependidikan: penelitian inovasi pembelajaran, 43(1), 1–8. https://doi.org/10.21831/jk.v43i1.1953
sumintono, b., & widhiarso, w. (2015). aplikasi pemodelan rasch pada assessment pendidikan. cimahi: trim komunikata.
surya, a., & aman, a. (2016). developing formative authentic assessment instruments based on learning trajectory for elementary school. reid (research and evaluation in education), 2(1), 13–24. https://doi.org/10.21831/reid.v2i1.6540
werheid, k., hoppe, c., thone, a., muller, u., mungersdorf, m., & von cramon, d. y. (2002).
the adaptive digit ordering test clinical application, reliability, and validity of a verbal working memory test. archives of clinical neuropsychology, 17(6), 547–565. https://doi.org/10.1093/arclin/17.6.547
zaman, a., kashmiri, a.-u.-r., mubarak, m., & ali, a. (2008). students ranking, based on their abilities on objective type test: comparison of ctt and irt. educom international conference, 591–599. retrieved from https://ro.ecu.edu.au/ceducom/52/

copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online)
reid (research and evaluation in education), 7(1), 2021, 78-87
available online at: http://journal.uny.ac.id/index.php/reid

hierarchical linear modeling for determining the effect of ict literacy on mathematics achievement

suci paramitha liestari; muhardis*
center for assessment and learning, ministry of education and culture of the republic of indonesia
jl. gn. sahari eks komp. siliwangi no. 4, pasar baru, sawah besar, jakarta 10710, indonesia
*corresponding author. e-mail: adi_perdana2000@yahoo.com

article info
article history: submitted: 4 march 2021; revised: 29 june 2021; accepted: 29 june 2021
keywords: ict literacy; two level hlm; student and school variable; aksi 2019

abstract
this study aims to identify the effect of ict literacy on mathematics achievement in grade 8 by using the indonesian student competency assessment's (asesmen kompetensi siswa indonesia or aksi) 2019 questionnaire data. a multistage probability sample of 13,079 students was analyzed using a two-level hierarchical linear model (hlm) in which student achievement scores form the first level, nested within schools as the second level. the results of the analysis revealed that, at the student level, ses, the number of smartphones and computers that students have at home, the availability of ict at home and school, the use of ict for education, and perspective on the benefits of ict in daily life have a positive influence on mathematics literacy achievement, while the easiness of access to the use of digital devices in schools has a negative influence. at the school level, high mathematics literacy achievement is influenced by the location of the school and the number of certified teachers. school accreditation and completeness of learning facilities in schools are not factors that improve students' mathematics literacy achievement. however, the interaction between the easiness of access to the use of digital devices in schools and the completeness of learning facilities in schools has an influence in increasing students' mathematics literacy achievement. based on the variance components, the variation in students' mathematical literacy achievement explained by the student-level and school-level variables is 33.24 and 0.18, respectively.

this is an open access article under the cc-by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

how to cite: liestari, s., & muhardis, m. (2021). hierarchical linear modeling for determining the effect of ict literacy on mathematics achievement. reid (research and evaluation in education), 7(1), 78-87. https://doi.org/10.21831/reid.v7i1.39181

introduction

the 21st century, or the era of the global economy, forces the education sector to prepare students as the next generation to master life and career skills, learning and innovation skills, and information, media, and technology skills (trilling & fadel, 2009). one of these information, media, and technology skills is ict (information, communications, and technology) literacy. this literacy is defined as the skills to use digital technology, communication devices, and internet networks to access, organize, integrate, assess, and make information useful in an educated society (educational testing service, 2002; syarifuddin, 2014). until now, ict literacy has been better understood as computer literacy, internet literacy (juditha, 2017), or digital skills. this literacy must also be supported by support systems such as assessment standards, instruction and curriculum, professional development, and a learning environment.

in indonesia, the center for assessment and learning, under the research and development agency of the ministry of education and culture, has an important role as a support system that handles assessment standards. this contribution was realized in aksi (asesmen kompetensi siswa indonesia), or indonesian student competency assessment. aksi is a national education quality achievement mapping program consisting of aksi for schools and aksi surveys. instruments for the aksi survey consist of cognitive instruments and questionnaire instruments. cognitive instruments are in the form of questions on reading literacy, scientific literacy, and mathematical literacy. the questionnaire was distributed to school principals, teachers, and students (center for educational assessment of the ministry of education and culture, 2019a).

figure 1. example of the published math problem with the stimulus "arranging the chain" (center for educational assessment of the ministry of education and culture, 2019b)

figure 2. example of the published math problem with the stimulus title "online taxi" (center for educational assessment of the ministry of education and culture, 2019b)

the results of the 2019 aksi based on cognitive instruments show that, nationally, reading literacy is in the low (55.85%), moderate (38.01%), and good (6.14%) positions; scientific literacy is in the low (66.11%), moderate (33.12%), and good (1.78%) positions; and mathematical literacy is in the low (79.44%), moderate (18.98%), and good (1.58%) positions.
these data show that, of the three literacies, mathematical literacy has the largest share of students in the low category. in mathematics in particular, students can only solve routine simple math problems, basic computations in the form of direct equations, and basic concepts related to geometry and statistics. this can be seen from the published math problems. for example, a question with algebraic content (number patterns) has the stimulus "students are asked to click on a bracelet to arrange a chain and use a ruler in the application in the right corner to measure the length of the chain", as presented in figure 1. for another question with the stimulus title "online taxi", students are asked to click the simulation section and then press the "run" button, as presented in figure 2. these two questions show that students do not just make calculations based on mathematical skills but also need proficiency in using digital media. however, cognitive elements such as being accustomed or not (fishbein & ajzen, 1975) to using digital media devices also influence how students write their answers. today, digital media skills are better known as ict literacy.

with regard to ict literacy, the digital proficiency section of the student questionnaire shows that the availability of digital devices at schools and at home is 50% nationally. digital devices have been used in learning (50%) and support learning activities (49.86%). students also give a positive response to using digital devices in future learning (50.27%) (center for educational assessment of the ministry of education and culture, 2019c). based on this report, digital devices as support systems are no longer an obstacle in the learning environment.
it means that students' ict literacy makes it easier for them to find learning content from various sources by themselves. in line with the results of helaluddin (2019), technology-based learning is a solution for millennial students who prefer something new that is oriented towards the process of self-discovery (student-centered).

apart from the use of digital media, a support system that needs to be considered is professional development. based on the school questionnaire item on the percentage of certified teachers, the responses are as follows: 38.5% for teachers who are mostly not yet certified, 29.7% for teachers who are already certified, and 28.69% for certified teachers making up more than 80% of the total number of teachers in schools. this indicates that fewer than half of the teachers in the country are certified, even though teacher certification is one of the administrative requirements for teacher professionalism (law no. 14 of 2005 of republic of indonesia about teachers and lecturers).

professional development as a support system is also related to school accreditation. according to ban-s/m (badan akreditasi nasional, or national accreditation board for schools/madrasahs), school accreditation is categorized into accreditation a, b, c, and not accredited. the results of the questionnaire show that 53.65% of schools are in category a, 25.64% are in category b, 17.86% are in category c, and only 1.69% of schools are not accredited. this is an increase compared to the results of the accreditation carried out by ban-s/m in 2019 for 62,365 schools throughout indonesia (chaterine, 2019): accreditation a (25.34%), accreditation b (54.24%), accreditation c (18.15%), and not accredited (2.27%). this indicates that accreditation is indeed an indicator of improvement in the quality of education in order to achieve quality standards.
for teachers, the results of school/madrasah accreditation are motivation to improve themselves in providing the best service for students (nujumuddin, 2019).

ideally, the variables in the school questionnaire contribute more or less to the results of mathematical literacy (as well as other literacies). a more in-depth analysis is needed to determine which variables are predicted to affect the results of mathematical literacy based on the 2019 aksi ict literacy questionnaire. in this paper, the model used is hierarchical linear modeling (hlm). the information obtained from the results of the mathematical literacy cognitive instrument was used as level 1 variables, while the ict literacy questionnaire and the school questionnaire items on school location, teacher certification, and school accreditation were used as level 2 variables.

method

this study uses aksi survey data from 2019 for 8th-grade junior high school students. the samples were determined using multi-stage probability sampling. the first stage selects sample districts/cities, the second stage selects sample schools, and the third stage selects sample students. the sample used in this study consisted of 13,079 students. the data were obtained from the center for educational assessment of the ministry of education and culture (2019a).

aksi 2019 consists of two types of measurement, aksi for schools and the aksi survey. aksi for schools is a tool provided by the center for assessment and learning, ministry of education and culture of the republic of indonesia, in the form of a formative assessment module used to determine student abilities on essential topics in language, mathematics, and science lessons.
the aksi survey is a program for mapping educational attainment to monitor the quality of education at the national/regional level, which describes the achievement of students' abilities through a "longitudinal" survey. the instrument used in aksi for schools is a cognitive instrument consisting of mathematics, science, and reading test modules, while the instruments used in the aksi survey were cognitive instruments and questionnaires. the aksi survey questionnaire instrument consisted of questionnaires for school principals, teachers, and students (center for educational assessment of the ministry of education and culture, 2019a).

this study uses student questionnaires and students' cognitive results in the form of mathematics scores. the questionnaires used for data analysis were the student questionnaire measuring students' ict literacy and the school questionnaire. a total of eight questions were taken from the ict literacy questionnaire and four questions from the school questionnaire that were considered relevant to support this study. table 1 describes the variables used from the ict literacy questionnaire, while table 2 describes the variables used from the school questionnaire.

table 1. description for ict literacy questionnaire

variable | description
sex | this variable explains the gender of the student
ses | this variable describes the student's social-economic status based on parental education and income
ownership of goods at home | this variable describes the types of items students have at home
availability of ict at school and home | this variable describes the it devices in schools and at students' homes
student opinions about ict access in schools | this variable explains students' opinions about the easiness of access and the skills of teachers towards it in schools
ict usage for education | this variable explains the frequency of students using it for educational purposes
ict usage for entertainment | this variable explains the frequency of students using it for other than educational purposes
perspective on the benefits of ict in daily life | this variable describes the benefits of it for the student's daily life

table 2. descriptions for school questionnaire

variable | description
school location | this variable describes the conditions around the student's school environment
teacher certification | this variable describes the percentage of certified teachers in school
school accreditation | this variable describes the school accreditation
teaching and learning using ict devices | this variable describes teaching and learning activities using ict devices

this study aims to examine the effect of students' ict literacy on mathematics literacy scores in aksi 2019 using a hierarchical linear model (hlm). a two-level hlm was used to examine the effects of individual- and school-level variables on mathematics literacy scores. this analysis approach was chosen because the study data have a hierarchical structure with individual students nested within schools.
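before a two-level model of this kind is fitted, predictors are usually centered: student-level predictors around the mean of their own school (group-mean centering) and school-level predictors around the overall mean (grand-mean centering). the following python sketch shows both operations on a tiny hypothetical data set. the names school, ict_use, and pct_certified are illustrative assumptions and are not aksi variable names.

```python
from collections import defaultdict
from statistics import mean


def group_mean_center(values, groups):
    """Subtract each group's own mean from its members (level-1 predictors)."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    return [value - group_means[group] for value, group in zip(values, groups)]


def grand_mean_center(values):
    """Subtract the overall mean from every value (level-2 predictors)."""
    overall = mean(values)
    return [value - overall for value in values]


# Hypothetical data: six students nested in two schools.
school = ["a", "a", "a", "b", "b", "b"]
ict_use = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]      # student-level predictor
pct_certified = {"a": 30.0, "b": 70.0}         # school-level predictor

ict_use_c = group_mean_center(ict_use, school)
cert_c = dict(zip(pct_certified, grand_mean_center(list(pct_certified.values()))))

print(ict_use_c)  # deviations from each school's own mean
print(cert_c)     # deviations from the mean across schools
```

with group-mean centering, the intercept of each school can be read as that school's average score, which is why this choice matters for interpreting the model.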
in our study, models were applied in the order of model 1 (only dependent variables used), model 2 (model 1 + individual context variables), model 3 (model 2 + school context variables), model 4 (model 3 + school context variables), and the final model (model 4 + interaction of school and individual context variables) to separately identify the impact of the information-related variables. the analysis models, written in standard two-level notation, are as follows.

$$
\begin{aligned}
Y_{ij} &= \beta_{0j} + \sum_{q=1}^{7} \beta_{qj} X_{qij} + r_{ij} \\
\beta_{0j} &= \gamma_{00} + \sum_{s=1}^{4} \gamma_{0s} W_{sj} + u_{0j} \\
\beta_{qj} &= \gamma_{q0} + \sum_{s=1}^{4} \gamma_{qs} W_{sj} + u_{qj}
\end{aligned}
\qquad (1)
$$

in formula (1), $Y_{ij}$ is the mathematics score of student i going to junior high school j. in order to estimate the value, seven individual-level variables (X) and four school-level variables (W) were applied in order. group-mean centering was conducted for all X(s) and grand-mean centering for all W(s) except for the dummy variables. therefore, the intercept $\beta_{0j}$ of the model was the average mathematics score of the students in school j.

findings and discussion

analysis on variables affecting the mathematics score using hlm

we examined the results of the final hlm model to identify the significant variables affecting the students' mathematics literacy scores. the results are presented in table 3. first, the variation in students' mathematical achievement explained by the student-level and school-level variables is 33.24 and 0.18, respectively. regarding family social-economic status (ses), the score of students who had a good ses was 2.06 points higher than that of students who did not. similarly, the use of ict for education and students' perspectives on ict use in daily life have a positive effect on mathematics literacy scores. the score increased by 4.55 points and 3.28 points for every one-unit increase in computer usage time for education and daily life, respectively.
in addition, the number of smartphones and computers students have and the availability of digital devices at home and school have a positive effect on mathematics literacy scores. meanwhile, the easiness of access to the use of digital devices in schools has a negative effect: the scores decreased by 1.74 points for the easiness of access to the use of ict in schools. this means that students' access to ict in schools does not guarantee that their mathematics scores are better.

with regard to the school-level variables, the mathematics score of schools located in major cities was 0.55 points higher than that of schools located on islands or in isolated and rural areas. likewise, the more certified teachers there are in a school, the higher the students' mathematics literacy score, by 0.55 points, but this does not apply to school accreditation. a school with good accreditation does not guarantee that students will get better mathematics literacy scores, and the completeness of learning facilities in schools is likewise not a factor in improving students' mathematics literacy scores.

in addition, the study showed that there is an interaction between the easiness of access to the use of digital devices in schools and the completeness of learning facilities in schools. it means that the effect of the easiness of access to the use of ict in schools on mathematics literacy scores depends on the completeness of learning facilities in schools; the scores increased by 2.16 points for this interaction. on the other hand, the interaction between teacher certification and the use of ict for education does not affect the mathematics literacy score, and neither does the interaction between the use of digital devices for education and the completeness of learning facilities in schools, while the interaction between the use of ict for education and school location has a negative effect on the mathematics literacy score.
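the interaction reported above means that the slope of "easiness of access" is not a single number but changes with the completeness of learning facilities. the following python sketch illustrates that reading with the two coefficients quoted in the text. how the facilities variable is scaled and centered is not stated in the text, so the facilities values used below are purely hypothetical, and the function name access_slope is an illustrative assumption.

```python
# Coefficients quoted in the text (points on the mathematics literacy scale).
ACCESS_MAIN = -1.74          # easiness of access to ict in schools
ACCESS_X_FACILITIES = 2.16   # access x completeness of learning facilities


def access_slope(facilities):
    """Slope of 'easiness of access' at a given facilities value.

    The facilities scale is an assumption made for illustration only.
    """
    return ACCESS_MAIN + ACCESS_X_FACILITIES * facilities


# Hypothetical facilities values: below, at, and above the reference point.
for facilities in (-1.0, 0.0, 1.0):
    print(f"facilities = {facilities:+.1f}: slope = {access_slope(facilities):+.2f}")
```

under these assumptions, the negative slope of access shrinks and eventually turns positive as facilities become more complete, which is one way to read the statement that the effect of access depends on the facilities available.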
table 3. result of the hlm analysis determining the effect of ict literacy on mathematics achievement

in previous studies, inconsistent results have been reported for gender. many studies reported that female students have significantly higher ict literacy than male students do (ainley et al., 2010), whereas some studies showed that male students have a more positive attitude than female students. our study found that gender was not statistically significant in affecting mathematics scores in aksi 2019.

on the other hand, ict availability at home was found to be negatively associated with mathematics, reading, and science achievement, which aligns with previous research (lee & wu, 2012), and student ict availability at school was found to have no correlation with student mathematics, reading, and scientific literacy (hu et al., 2018). on the contrary, our study revealed that the availability of ict at home and school has a significant effect on mathematics literacy scores. in addition, the number of smartphones and computers per student had a positive effect on mathematics literacy scores. in some countries, access to ict and the internet is provided to students with the intention of providing ict-based learning opportunities to a greater extent than for personal uses. interestingly, in pisa 2003, the organisation for economic co-operation and development (2005) found that access to a computer at home had the largest impact on mathematics literacy (the major domain for pisa 2003). therefore, computer accessibility was included as a predictor in the current study.
meanwhile, in our study, the easiness of access to the use of digital devices in schools has a negative effect on mathematics literacy scores.

in terms of ict usage, in korea, the more students used computers for study or assignments, the more likely they were to attain lower levels instead of super-ordinate levels of ict literacy, which suggests that computer usage for study in korea is not strongly related to ict literacy (kim et al., 2014). this aligns with the research conducted by hu et al. (2018), who argue that, since the purpose of integrating ict into education is to facilitate student learning, the negative relationship between student ict academic use, either at school or outside school, and learning outcomes deserves serious attention. the results might indicate that ict was not used in a satisfactory way to enhance student learning. in contrast, our study revealed that using ict for education can increase mathematics literacy scores, in line with plumm (2008), who noted that the way in which students use ict for educational purposes may relate to achievement. with a large variety of software and internet applications accessible to students, some may be more beneficial to learning, while others may require more skill.

ict leisure use outside school was positively associated with reading and scientific literacy but had no significant correlation with mathematics scores (hu et al., 2018). similarly, in this study, the usage of ict for entertainment did not significantly impact mathematics scores. the usage of ict in daily life, however, can significantly affect mathematics literacy scores.

regarding the school level, students who go to schools located in a region with relatively high accessibility to information tend to have a high ict literacy level and get better mathematics scores.
in the study, teacher certification, which is one of the administrative requirements for teacher professionalism, has a significant impact on mathematics literacy scores. this is similar to the study conducted by suci and mayangsari (2017), which found a significant influence of teacher certification on student achievement in smkn 7 pandeglang. jamaliah and cahyaningsih (2020), in their literature review, found the same result, namely that certified teachers had a positive impact on student achievement. on the other hand, sukarti (2013) showed that there are non-significant differences between the learning outcomes of students taught by certified teachers and those of students taught by non-certified teachers. furthermore, a study conducted by siswandoko and suryadi (2013) indicated that teachers' certification has hardly ever been able to promote certificate holders' competencies. that study found that students' learning achievement was determined more by the social-economic status of the students' families than by the actual certification mechanism.

school accreditation is an indicator of improving the quality of education in order to achieve quality standards of education. in this study, it was included as a variable that is assumed to affect the mathematics literacy score in aksi 2019. however, the study found that accreditation has a negative impact on mathematics scores, which means that better school accreditation is not associated with better mathematics scores. this is in line with a study conducted by samad and mangindara (2019) in 8th grade in gowa, which revealed that there is no effect of school accreditation on student learning outcomes, but there is an influence of learning models on student learning outcomes. in contrast, mairing (2016) found that better school accreditation would have an impact on learning outcomes, especially in mathematics.

regarding school infrastructure, the interaction between the easiness of access to the use of digital devices in schools and the completeness of learning facilities in schools has an influence in increasing students' mathematics literacy scores. in addition, the higher the satisfaction level of students in school classes using ict, the higher the mathematics score of the students, which implies that classes using ict should be efficiently provided in school. however, the interaction between ict use for education and school location has a negative effect on mathematics scores, contrary to our expectations.

conclusion

based on the research findings and discussion, it is concluded that the social-economic status (ses) of students affects their mathematics literacy scores. students with good social-economic status have higher mathematics literacy scores than students without. students with good social-economic status also have better smartphone or computer access. this is in contrast to the easiness of ict access in schools, for which scores decreased by 1.74 points. it means that students' access to ict in schools does not guarantee that their mathematics literacy score is better.

apart from social-economic factors, other variables such as school location also have an effect on mathematics literacy scores. schools that are far from the city need guidance, because the mathematics literacy achievement of students in these schools is lower. coaching also includes teacher certification: having more certified teachers in a school has a positive impact. this is in contrast to school accreditation, which has been given a large role as a determining variable for student success.
in fact, schools with good accreditation do not always contribute positively to student achievement in mathematics literacy scores. concerning gender, the research shows that there is no significant relationship between gender and mathematics literacy scores.
references
ainley, j., fraillon, j., & freeman, c. (2010). national assessment program: ict literacy years 6 & 10 report, 2008. https://www.nap.edu.au/_resources/nap_ictl_2011_public_report_final.pdf
center for educational assessment of the ministry of education and culture. (2019a). asesmen kompetensi siswa indonesia. pusat penilaian pendidikan. https://pusmenjar.kemdikbud.go.id/aksi-2/
center for educational assessment of the ministry of education and culture. (2019b). contoh soal pada asesmen kompetensi siswa indonesia. pusat penilaian pendidikan. https://aksi.puspendik.kemdikbud.go.id/report/example
center for educational assessment of the ministry of education and culture. (2019c). respons siswa dalam penggunaan perangkat digital pada pembelajaran masa depan. pusat penilaian pendidikan. https://aksi.puspendik.kemdikbud.go.id/report/student/ict
chaterine, r. n. (2019, december 17). akreditasi sekolah 2019: 25% a, 54% b, 18% c, dan 2% tak terakreditasi. detik news. https://news.detik.com/berita/d-4825881/akreditasi-sekolah-2019-25-a-54-b-18-c-dan-2-tak-terakreditasi
educational testing service. (2002). digital transformation: a framework for ict literacy. educational testing service.
https://www.ets.org/research/policy_research_reports/publications/report/2002/cjik
fishbein, m., & ajzen, i. (1975). belief, attitude, intention, and behavior: an introduction to theory and research. addison-wesley. https://people.umass.edu/aizen/f&a1975.html
helaluddin, h. (2019). peningkatan kemampuan literasi teknologi dalam upaya mengembangkan inovasi pendidikan di perguruan tinggi. pendais, 1(1), 44–55. https://uit.e-journal.id/jpais/article/view/218
hu, x., gong, y., lai, c., & leung, f. k. s. (2018). the relationship between ict and student literacy in mathematics, reading, and science across 44 countries: a multilevel analysis. computers & education, 125, 1–13. https://doi.org/10.1016/j.compedu.2018.05.021
jamaliah, m., & cahyaningsih, u. (2020). pengaruh sertifikasi guru terhadap prestasi belajar siswa. prosiding seminar nasional pendidikan 2, 434–440. https://prosiding.unma.ac.id/index.php/semnasfkip/article/view/352
juditha, c. (2017). tingkat literasi teknologi informasi komunikasi masyarakat kota makassar. jurnal penelitian komunikasi, 14(1), 41–52. https://doi.org/10.20422/jpk.v14i1.167
kim, h.-s., kil, h.-j., & shin, a. (2014). an analysis of variables affecting the ict literacy level of korean elementary school students. computers & education, 77, 29–38.
https://doi.org/10.1016/j.compedu.2014.04.009
law no. 14 of 2005 of republic of indonesia about teachers and lecturers, (2005).
lee, y.-h., & wu, j.-y. (2012). the effect of individual differences in the inner and outer states of ict on engagement in online reading activities and pisa 2009 reading literacy: exploring the relationship between the old and new reading literacy. learning and individual differences, 22(3), 336–342. https://doi.org/10.1016/j.lindif.2012.01.007
mairing, j. p. (2016). kemampuan siswa kelas viii smp dalam memecahkan masalah matematika berdasarkan tingkat akreditasi. jurnal kependidikan: penelitian inovasi pembelajaran, 46(2), 179–192. https://journal.uny.ac.id/index.php/jk/article/view/9655
nujumuddin, n. (2019). dampak kebijakan akreditasi terhadap peningkatan kinerja guru madrasah (studi di mi nurul muhsinin desa batujai). jurnal penelitian keislaman, 15(1), 1–13. https://doi.org/10.20414/jpk.v15i1.1106
organisation for economic co-operation and development. (2005). are students ready for a technology-rich world? organisation for economic co-operation and development (oecd). https://doi.org/10.1787/9789264036093-en
plumm, k. m. (2008). technology in the classroom: burning the bridges to the gaps in gender-biased education? computers & education, 50(3), 1052–1068. https://doi.org/10.1016/j.compedu.2006.10.005
samad, m., & mangindara, m. (2019). pengaruh model pembelajaran, akreditasi sekolah, dan kecerdasan emosional terhadap hasil belajar matematika siswa kelas viii smp negeri di kabupaten gowa. equals: jurnal ilmiah pendidikan matematika, 2(2), 74–84. https://ejournals.umma.ac.id/index.php/equals/article/view/307
siswandoko, t., & suryadi, a. (2013). kompetensi, sertifikasi guru, dan kualitas belajar siswa sekolah dasar. jurnal pendidikan dan kebudayaan, 19(3), 305–314. https://doi.org/10.24832/jpnk.v19i3.290
suci, s. c., & mayangsari, w. (2017).
pengaruh sertifikasi guru terhadap prestasi belajar siswa pada sekolah menengah kejuruan (smk) negeri 7 pandeglang. publik: jurnal ekonomi dan publik, 13(2), 139–152. http://www.jurnal.stie-banten.ac.id/index.php/publik/article/view/84
sukarti, s. (2013). isu gender dan sertifikasi guru versus prestasi belajar siswa. jurnal pendidikan, 14(1), 38–43. https://doi.org/10.33830/jp.v14i1.353.2013
syarifuddin, s. (2014). literasi teknologi informasi dan komunikasi. jurnal penelitian komunikasi, 17(2), 153–164. https://doi.org/10.20422/jpk.v17i2.14
trilling, b., & fadel, c. (2009). 21st century skills: learning for life in our times. jossey-bass.
https://www.wiley.com/en-us/21st+century+skills%3a+learning+for+life+in+our+times-p-9780470553916
this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 6(2), 2020, 87-97
available online at: http://journal.uny.ac.id/index.php/reid
comparison of methods for detecting anomalous behavior on large-scale computer-based exams based on response time and responses
*1deni hadiana; 2bahrul hayat; 1burhanuddin tola
1graduate school, universitas negeri jakarta, jl. r. mangun muka, rawamangun, pulo gadung, kota jakarta timur, jakarta 13220, indonesia
2faculty of psychology, universitas islam negeri syarif hidayatullah jakarta, jl. kertamukti no. 5, cireundeu, ciputat timur, tangerang selatan, banten 15419, indonesia
*corresponding author. e-mail: denihadianadua@gmail.com
submitted: 18 april 2020 | revised: 25 june 2020 | accepted: 16 july 2020
abstract
this study aims to determine the anomalous index (indeks anomali or ia), which considers both response time and responses, and to compare it with response time effort (rte) or rapid guessing (tebakan cepat or tc) at various thresholds. the data are the response times and responses of 732 examinees on a natural science test consisting of 40 multiple-choice items with four answer choices. response times and responses are analyzed to obtain descriptive statistics and to calculate the tc and ia indexes using two threshold methods: the first method (m1) is visual identification, and the second method (m2) is based on the amount of time needed to respond to each item given the complexity of the item, as proposed by nitko. the performance of the ia and tc scores is compared in terms of validity and reliability.
the coefficient alpha of the iam1 score is 0.84 and that of the iam2 score is 0.82. both values fulfil the reliability requirement for index determination. the ia proposed in this study has a high correlation with rte, which is commonly used to determine the magnitude of solution behavior and rapid guessing. the correlation of iam1 with tcm1 is 0.86 and that of iam2 with tcm2 is 0.89; these high values show a strong relationship between ia and tc. determining the threshold time using three categories of multiple-choice items yields ia and tc distributions that are close to the normal distribution, so that it reflects natural empirical conditions.
keywords: anomalous index (ia), rapid guessing (tc), threshold, reliability, validity
how to cite: hadiana, d., hayat, b., & tola, b. (2020). comparison of methods for detecting anomalous behavior on large-scale computer-based exams based on response time and responses. reid (research and evaluation in education), 6(2), 87-97. doi:https://doi.org/10.21831/reid.v6i2.31260.
introduction
cognitive tests, such as the computer-based national exams (cbne), measure students' knowledge competency in certain subjects, according to the curriculum, after they complete a learning process of approximately three years in junior or senior high school. since 2015, cbne has been implemented in addition to the paper-and-pencil national exams, and since the 2018 national exams it has become the main mode. based on the center for educational assessment report on the results of the national exams, the share of junior high school examinees taking cbne increased nationally from 0.22% in 2015 to 3.72% in 2016, 32.26% in 2017, and 62.97% in 2018; in jakarta, indonesia, cbne was used by 100% of examinees in both 2017 and 2018. cbne is expected to increase the validity, reliability, and integrity of the exams.
https://creativecommons.org/licenses/by-sa/4.0/deed.id
https://doi.org/10.21831/reid.v6i2.31260
deni hadiana, bahrul hayat, & burhanuddin tola 88 copyright © 2020, reid (research and evaluation in education), 6(2), 2020 issn: 2460-6995 (online)
computer-based exams, such as cbne, have many advantages over paper-and-pencil tests. according to lee and chen (2011), computerized exams can provide richer information because, in addition to examinee responses, they record response times that reflect the amount of time examinees spend responding to each item. meanwhile, according to linacre and rudner, as quoted by georgiadou et al. (2006), the advantages of computer-based exams are test management flexibility, increased test security, increased motivation in information technology literacy, and time efficiency. a good test security procedure serves as quality control of test implementation, so that the validity of the test scores is guaranteed (lewis et al., 2014). test security is related to the performance (validity and reliability) of a test (cizek & wollack, 2016, p. 3). thus, invalid cbne items and unreliable cbne test scores produce results that cannot be used, especially when cbne scores serve strategic purposes such as selection to the next education level, mapping of education quality, and policy interventions to improve the quality of education. cbne results will be meaningful, on target, and effective if the scores obtained by cbne examinees are accurate, that is, if the scores truly reflect the examinees' ability. the score obtained in the cbne is closely related to the response pattern and the response time pattern of the examinees, since both can be used to identify anomalous data.
thus, research on response patterns and response time patterns for identifying anomalous data is urgent, because accurate analysis of responses and response times makes a real contribution to improving the quality of examinee ability estimation (fox et al., 2007). on a computer-based exam, the response and the response time spent by each examinee in processing and responding to each item can be obtained directly. based on the response time data, an examinee's anomalous response times can be detected by comparison with the response times of other examinees. examinees who answer items much more quickly than other examinees can be flagged as exhibiting anomalous behavior. anomalous behavior may occur for various reasons, among others because the examinee knew information about the item in advance, guessed rapidly, or responded randomly. anomalous behavior is closely related to test security, examinee integrity, item validity, test reliability, fairness, and examinee ability. thus, van der linden (2006), marianti et al. (2014), meijer and sotaridona (2006), widiatmo and wright (2015), and wise and kong (2005) conclude that if anomalous data are handled appropriately, for example by excluding them from the estimation of the ability parameter, better measurement of ability is obtained.
several methods can be used to identify anomalous data based on response time. wise and kong (2005) argue that examinees who put high effort into responding to each item show solution behavior. on the other hand, examinees who put low effort into responding show guessing behavior, responding to the items rapidly. this rapid guessing can be seen from response times so short that the examinee could not have read the item in full or considered it carefully.
this rapid guessing behavior underlies the determination of examinees' anomalous behavior, which wise and kong named response time effort (rte). before determining rte for dichotomous items, the solution behavior $SB_{ij}$ is calculated using formula (1).

$$SB_{ij} = \begin{cases} 1, & \text{if } RT_{ij} \ge T_i \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

$SB_{ij}$ is the solution behavior of examinee j on item i, $T_i$ is the threshold between rapid guessing behavior and solution behavior on item i, and $RT_{ij}$ is the response time of examinee j on item i. next, the rte is calculated using formula (2), in which k is the number of items in the test. rte scores range from 0 to 1 and reflect the proportion of items on which the examinee shows solution behavior.

$$RTE_j = \frac{\sum_{i=1}^{k} SB_{ij}}{k} \qquad (2)$$

an rte value close to 1 indicates higher effort in answering all items of the test; conversely, an rte value closer to 0 indicates lower effort. in this study, the term rapid guessing (tc) is used for an index that has an inverse relationship with rte: the lower the rte value (the closer it is to 0), the higher the tc value (the closer it is to 1). formulas (1) and (2) show that the rte calculation does not consider examinees' responses, either for items answered below the threshold or for items answered above it. examinees who answer an item correctly in less time than the threshold must be treated differently from examinees who answer it incorrectly in less time than the threshold. likewise, examinees who answer an item correctly in more time than the threshold must be treated differently from examinees who answer it incorrectly in more time than the threshold.
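as a minimal sketch of formulas (1) and (2), the python functions below compute the solution behavior per item and the rte of one examinee. the function names and the example numbers are hypothetical, and the relation tc = 1 - rte is an assumption used here only to express the inverse relationship described in the text; the source does not state the exact formula for tc.

```python
from typing import Sequence


def solution_behavior(response_time: float, threshold: float) -> int:
    """Formula (1): 1 if the response time reaches the threshold, else 0."""
    return 1 if response_time >= threshold else 0


def rte(response_times: Sequence[float], thresholds: Sequence[float]) -> float:
    """Formula (2): proportion of the k items answered with solution behavior."""
    k = len(thresholds)
    flags = [solution_behavior(rt, t) for rt, t in zip(response_times, thresholds)]
    return sum(flags) / k


def tc(response_times: Sequence[float], thresholds: Sequence[float]) -> float:
    """Rapid guessing index. ASSUMPTION: TC = 1 - RTE (not given in the source)."""
    return 1.0 - rte(response_times, thresholds)


# Hypothetical examinee: four items, response times and thresholds in seconds.
times = [55, 20, 130, 45]
item_thresholds = [40, 70, 120, 40]
print(rte(times, item_thresholds), tc(times, item_thresholds))
```

in this made-up example only the second item is answered faster than its threshold, so three of the four items count as solution behavior.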
examinees who answer an item correctly with a response time above the threshold show normal behavior. conversely, examinees who answer an item correctly with a response time below the threshold show anomalous behavior. based on these two conditions, we propose the normal index (indeks wajar or iw) and the anomaly index (indeks anomali or ia), which consider both the response time and the responses. mathematically, the iw is stated in formula (3), while the ia is stated in formula (4). (3) (4) $IW_j$ is the normal index of examinee j, and $PW_{ij}$ is the normal behavior of examinee j on item i, scored 1 when the examinee answers item i correctly ($B_i$) with a response time ($WR_{ij}$) above the threshold ($T_i$), that is, $PW_{ij} = 1$ if the answer is correct and $WR_{ij} > T_i$. $IA_j$ is the anomaly index of examinee j, and $PA_{ij}$ is the anomalous behavior of examinee j on item i, scored 1 when the examinee answers item i correctly with a response time below or equal to the threshold, that is, $PA_{ij} = 1$ if the answer is correct and $WR_{ij} \le T_i$. the relationship between $IW_j$ and $IA_j$ is shown in formula (5).

$$IW_j + IA_j = 1 \qquad (5)$$

iw scores closer to 1 indicate more normal behavior, while iw scores closer to 0 indicate more anomalous behavior. in contrast, ia scores closer to 1 indicate increasingly anomalous behavior, while ia scores closer to 0 indicate more normal behavior. the challenge in determining the rte, iw, and ia lies in determining an accurate threshold (t). the threshold must take into account the characteristics of the items, such as the number of words in an item, the presence of stimulus pictures, tables, and illustrations, whether the item requires calculation, and the difficulty of the item. nitko (as cited in naga, 2013, p. 47) states that simple multiple-choice items need 40 to 60 seconds to answer, complex multiple-choice items need 70 to 90 seconds, and multiple-choice items with calculation need 120 to 300 seconds. wise and kong (2005) determined the threshold based on the number of characters.
items with fewer than 200 characters have a threshold of three seconds, items with 200 to 1000 characters have a threshold of five seconds, and items with more than 1000 characters have a threshold of ten seconds. kong et al. (2007) apply several methods to determine the threshold: a common threshold for all items; item characteristics such as the number of characters and the presence of pictures; visual identification from response time frequency distribution graphs; and estimation using two-state mixture models. the rte developed by wise and kong (2005) does not consider examinees' responses to each item when determining the index of effort or of rapid guessing. thus, this study was conducted to determine the anomaly index (ia), which considers both response time and responses, and to compare it with rte at various thresholds.
method
log file data of 732 examinees, obtained from the educational assessment center, research and development agency, ministry of education and culture, were converted into structured spreadsheet data. the structured data consist of the response time in seconds and the dichotomous response score for 40 multiple-choice items in natural sciences. the data were then screened and processed with the help of minitab, microsoft excel, and spss. the data were analyzed quantitatively to obtain descriptive statistics on the response times and responses and to determine the tc index and the ia index under two methods of determining the threshold.
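the ia and iw defined around formulas (3) to (5) can be illustrated with the short python sketch below. because formulas (3) and (4) are not legible in this text, the normalising denominator is an assumption: the sketch divides by the number of correctly answered items, chosen only because it makes iw + ia = 1 as formula (5) requires. the function name and the example data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def anomaly_and_normal_index(
    correct: Sequence[int],
    response_times: Sequence[float],
    thresholds: Sequence[float],
) -> Optional[Tuple[float, float]]:
    """Return (IA, IW) for one examinee, or None if no item was answered correctly.

    PA = 1 for a correct answer with response time <= threshold (anomalous).
    PW = 1 for a correct answer with response time >  threshold (normal).
    ASSUMPTION: both sums are divided by the number of correct answers,
    so that IW + IA = 1 as in formula (5).
    """
    pa = 0
    pw = 0
    for b, wr, t in zip(correct, response_times, thresholds):
        if not b:
            continue  # incorrect answers contribute to neither index
        if wr > t:
            pw += 1
        else:
            pa += 1
    n_correct = pa + pw
    if n_correct == 0:
        return None  # index undefined under this assumption
    return pa / n_correct, pw / n_correct


# Hypothetical examinee: 1 = correct, 0 = incorrect; times and thresholds in seconds.
result = anomaly_and_normal_index([1, 1, 0, 1], [30, 80, 10, 120], [40, 70, 120, 120])
print(result)
```

in this made-up example two of the three correct answers were given at or below the threshold, so the sketch reports a higher ia than iw. an examinee could then be flagged when ia falls in a chosen range, such as the 0.74 to 1 range used later in the paper.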
the first method (m1) is visual identification (vi): on the response time frequency graph of each item, the threshold is placed where the frequency first drops sharply and then rises again. the second method (m2) uses the time criteria for working on multiple-choice items proposed by nitko, namely 40 seconds for simple multiple-choice items, 70 seconds for complex multiple-choice items, and 120 seconds for multiple-choice items with calculation. for this reason, before the index calculation, the items are grouped into three types: simple multiple-choice, complex multiple-choice, and multiple-choice with calculation. then, the results of the anomalous data analysis with the rte and ia indexes are compared.
findings and discussion
lindsey (2004, p. 197) explains the characteristics that must be considered in selecting a response time distribution: (1) response times must be positive; (2) short response times are more common than long ones, in other words the probability of a short response time is much larger than that of a long response time, so the distribution is positively skewed (van der linden, 2006). distributions that match these two characteristics are the lognormal, weibull, and gamma distributions (lindsey, 2004, pp. 203-206). figure 1 shows scatterplots of the mean against the standard deviation and of the median against the mean, for the 40 items and for the 732 examinees. all response times are positive, with a minimum response time of 1 second. this condition meets the response time criteria of lindsey (2004, p. 197).
figure 1. scatterplots of mean and standard deviation, mean, and median of items and examinees (panels 1 to 4; n = 40 items and n = 732 examinees; minimum response time = 1 second)
figure 2. distribution of response time for items number 4, 11, and 15
figure 3. frequency of response time of a selected item
in addition, figure 2 shows the distribution of response time for items number 4, 11, and 15. according to fox et al. (2007), van der linden (2006), lindsey (2004, p. 197), and wulansari et al. (2019), a response time distribution that is skewed to the right has only a small portion of its frequency in the right tail, meaning that short response times are more probable than long ones. this distribution, according to lindsey (as cited in wulansari, 2019, p. 140), has the characteristics of the lognormal, weibull, and gamma distributions. in this study, the anderson-darling test was performed to determine the distribution of the response time samples. wulansari (2019, p. 88) states that the distribution with the lowest anderson-darling value is the most appropriate distribution for the response time on each item.
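the idea of preferring the candidate distribution with the lowest anderson-darling value can be sketched in python as follows. this is not the authors' procedure (they report using statistical software): the sketch computes the anderson-darling statistic for a fitted lognormal distribution and, only as a contrast, for a fitted normal distribution. weibull and gamma fits are omitted because they need numerical parameter estimation. the function names and the example data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Iterable, Sequence


def anderson_darling(cdf_values: Iterable[float]) -> float:
    """A^2 statistic from the fitted CDF evaluated at every observation."""
    eps = 1e-12  # keeps log() finite if a CDF value rounds to 0 or 1
    u = sorted(min(max(v, eps), 1.0 - eps) for v in cdf_values)
    n = len(u)
    total = sum(
        (2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


def ad_lognormal(times: Sequence[float]) -> float:
    """Fit a lognormal by fitting a normal to the log response times."""
    logs = [math.log(t) for t in times]  # requires positive response times
    fitted = NormalDist(mean(logs), stdev(logs))
    return anderson_darling(fitted.cdf(x) for x in logs)


def ad_normal(times: Sequence[float]) -> float:
    """Contrast case: fit a normal distribution to the raw response times."""
    fitted = NormalDist(mean(times), stdev(times))
    return anderson_darling(fitted.cdf(t) for t in times)


# Made-up right-skewed response times (seconds) built from lognormal quantiles.
n_obs = 50
example_times = [
    math.exp(3.0 + 0.5 * NormalDist().inv_cdf((i + 0.5) / n_obs))
    for i in range(n_obs)
]
print(ad_lognormal(example_times), ad_normal(example_times))
```

for right-skewed data such as these, the lognormal fit gives the smaller statistic, which is the selection rule described above applied to two candidates.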
the comparison of the anderson-darling values for the lognormal, weibull, and gamma distributions on the 40 items shows that items number 12, 20, 21, and 23 match the characteristics of the gamma distribution, while the other 36 items match the characteristics of the lognormal distribution. information on the distribution is crucial when determining the response time threshold through visual inspection of the response time frequency graphs.
figure 3 provides information about the frequency of response time (seconds) for one of the items, used to determine the threshold time with m1. the threshold is set where the frequency drops sharply and then rises again; in the figure, the threshold is 19 seconds (circled). this threshold is a response time too short for the examinee to process the item and its options. this method was proposed by wise (2006). conceptually, according to kong et al. (2007), the threshold is a response time so short that examinees do not have sufficient time to process the item and determine the correct answer. visual identification of the 40 response time frequency graphs gives the following m1 thresholds in seconds for items number 1 to 40, respectively: 26; 14; 8; 17; 15; 17; 21; 22; 26; 8; 6; 26; 21; 20; 15; 7; 14; 14; 11; 23; 28; 22; 15; 19; 18; 22; 23; 20; 14; 17; 19; 11; 19; 18; 13; 5; 16; 21; 13; and 13. based on the results of an analysis of the characteristics of the 40 items, including the number of words, the presence of pictures, tables, and illustrations, and the cognitive level of the items,
items number 2, 3, 5, 6, 7, 14, 25, 26, 29, 30, 31, 32, 33, 36, and 37 are simple multiple-choice items and use a threshold of 40 seconds each. items number 1, 4, 17, 22, 24, 28, 35, 38, 39, and 40 are complex multiple-choice items and use a threshold of 70 seconds each. items number 8, 9, 10, 11, 12, 13, 15, 16, 18, 19, 20, 21, 23, 27, and 34 are multiple-choice items with calculation and use a threshold of 120 seconds each. the thresholds of the 40 items are shown in figure 4.
determining the threshold by visual identification of the response time graph does not consider the characteristics of the item, such as its complexity and whether it requires calculation. this method considers only the response time, in particular the point where the frequency first decreases sharply and then increases. as a result, this method produces less stable thresholds that do not reflect the level of each item, and it tends to produce low thresholds, so that only a low percentage of examinees are flagged as exhibiting anomalous behavior (hauser & kingsbury, 2009). if this threshold method is applied to detect anomalous behavior, very few examinees are detected, especially on complex items, items that require calculation, items that have pictures, and items that contain many words, as usually found in national testing such as in indonesia.
figure 4. threshold of 40 items (m2: nitko 1 = 40 seconds for simple multiple-choice items, nitko 2 = 70 seconds for complex multiple-choice items, nitko 3 = 120 seconds for multiple-choice items with calculation)
hauser and kingsbury (2009) state that using thresholds that are too short and applying them to all items regardless of item characteristics has many limitations, because each item differs from the others: each item has unique psychometric parameters, such as difficulty level, and unique surface characteristics, such as the number of words. therefore, determining the threshold by considering the characteristics of the item or of the item subgroup, as done in this study, is more reasonable.
wise and kong (2005) state that an index must have an adequate degree of reliability; according to them, a minimum coefficient alpha of 0.80 is acceptable for index determination. with a 95% ci, the coefficient alpha of the tcm1 and tcm2 indexes is 0.85 and 0.83, respectively, and that of the iam1 and iam2 indexes is 0.84 and 0.82, respectively. these values meet the reliability requirement for index determination.
figure 5 shows the frequency distributions of the tcm1, tcm2, iam1, and iam2 index scores. in figure 5, most of the tcm1 and iam1 index scores are close or equal to 0, meaning that most of the examinees do not show rapid guessing behavior but show solution behavior, or normal behavior, when responding to the items. this can be observed in the figure as a positively skewed (right-skewed) distribution. the iam1 mean is higher than the tcm1 mean because the determination of ia considers both the response time and the responses of examinees. this finding is in line with wise and kong (2005).
figure 5. frequency index of tcm1 (1), tcm2 (2); iam1 (3), iam2 (4).
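the coefficient alpha reported above for the index scores can be illustrated with the following python sketch. it applies the standard coefficient alpha formula to an examinee-by-item matrix of 0/1 indicators (for example, the per-item solution behavior or anomaly flags). the function name and the small example matrix are hypothetical, and the sketch does not reproduce the confidence interval reported in the text.

```python
from statistics import variance
from typing import Sequence


def coefficient_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Coefficient alpha for a matrix with one row per examinee, one column per item.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical per-item indicators for four examinees on three items.
flags = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = coefficient_alpha(flags)
print(round(alpha, 2), alpha >= 0.80)
```

the comparison with 0.80 mirrors the minimum value that the text attributes to wise and kong (2005) for index determination.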
besides, figure 5 shows that the tcm2 and iam2 index scores are mostly distributed around the middle (moderate) scores and are close to the normal distribution. this means that only a small number of examinees showed strongly anomalous or strongly normal behavior, while most showed a moderate degree of anomaly and rapid guessing. thus, this method is consistent with empirical conditions. the iam2 mean is greater than the tcm2 mean because ia takes both the response time and the responses into account. therefore, the m2 threshold yields distributions close to normal for both tc and ia.
the significant correlation between the tc index developed by wise and kong and the ia proposed in this study shows that the two indexes are conceptually strongly related (see table 1). the relationship is stronger when the same threshold method is used for both indexes, and the correlation between tcm2 and iam2 is higher than that between tcm1 and iam1. thus, m2 gives a stronger relationship than m1. using an index range of 0.74 to 1, iam2 detected 16 examinees and tcm2 detected eight examinees, while iam1 and tcm1 each detected one examinee. iam2 detected the most anomalous examinees because its calculation considers both the item characteristics (through the threshold) and the responses, so that the probability of detecting anomalies is higher, as shown in figure 5.
table 1. spearman's rho correlations of tcm1, tcm2, iam1, and iam2

                            tcm1      tcm2      iam1      iam2
tcm1   rsp                  1.000     .489**    .859**    .443**
       sig. (2-tailed)      .         .000      .000      .000
tcm2   rsp                  .489**    1.000     .415**    .884**
       sig. (2-tailed)      .000      .         .000      .000
iam1   rsp                  .859**    .415**    1.000     .419**
       sig. (2-tailed)      .000      .000      .         .000
iam2   rsp                  .443**    .884**    .419**    1.000
       sig. (2-tailed)      .000      .000      .000      .

**. correlation is significant at the 0.01 level (2-tailed).
figure 6. ia and tc index of range 0.74 to 1 (left); total time test and correctness proportion (right)
from figure 6, examinee 576 was detected as anomalous by iam2 and iam1 and as rapid guessing by tcm2 and tcm1, meaning that the examinee was detected by all indices in this study. analysis of the proportion of correct answers and the total test response time shows that examinee 576 had a low proportion of correct answers, below 0.25, and spent too little time answering all items: 1800 seconds, or 45 seconds per item, out of the 7200-second time allocation. thus, examinee 576 showed rapid guessing behavior, as did examinee 336. the anomalous behavior of examinees 576 and 336 was due to a lack of effort to process and answer the items, as predicted by meijer (2003). examinees 276 and 104 showed great effort by using the full test time allocation, but their results were not good, with proportions of correct answers below 0.5. examinee 292 had the highest proportion of correct answers, 0.75, and completed the test in about 3900 seconds. why, then, was this examinee detected as anomalous by iam2? first, this may be due to the quality of the items: for example, keywords in the stimulus may point to an option, or the item may be constructed so that the examinee does not have to process every word and symbol in it but can answer simply by connecting these keywords.
second, the distractors among the options may not work properly because they are not homogeneous and logical. there are various possible types of anomalous behavior that can be inferred from the analysis of responses and response times, including cheating, creative responses, careless responses, and lucky guesses (meijer, 1996). therefore, to confirm these types of anomalous behavior, it is better to also analyze item characteristics and psychometric parameters of the items, such as the discrimination index and difficulty level.

conclusion

a simple method for determining the solution behavior or rapid guessing behavior index (tc) is the erp proposed by wise and kong in 2005; this method is still often used, mainly for low-stakes tests. the erp method considers the response time only and does not consider the responses of examinees on each item. the ia method proposed in this study considers both the response time and the responses of each examinee on each item, and it is easily implemented. the coefficient alpha reliability of the iam1 score is 0.84, while that of the iam2 score is 0.82. both alpha coefficients fulfill the reliability requirements for the index determination. the ia proposed in this study has a high correlation with the erp, which is commonly used to determine the magnitude of solution behavior or rapid guessing behavior. the correlation of iam1 with tcm1 is 0.86, and the correlation of iam2 with tcm2 is 0.89. this high correlation shows that there is a strong relationship between the ia and the erp (tc). the determination of the threshold must consider the characteristics of the items, such as the presence of pictures and the number of words, and psychometric characteristics such as item difficulty.
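as a minimal sketch of the response-time-only index described in the conclusion, the snippet below computes a tc-style index as the proportion of items on which an examinee's response time reaches the item threshold, in the manner of the response time effort measure of wise and kong (2005). the function name, the thresholds, and the response times are hypothetical illustrations, not values from this study, and the ia index, which also uses the responses, is not reproduced here.

```python
from typing import Sequence


def response_time_effort(times: Sequence[float], thresholds: Sequence[float]) -> float:
    """Proportion of items answered with solution behavior.

    An item counts as solution behavior when the response time (seconds)
    is at least that item's threshold; shorter times count as rapid guessing.
    """
    if len(times) != len(thresholds) or not times:
        raise ValueError("times and thresholds must be non-empty and equal in length")
    solution_flags = [1 if t >= th else 0 for t, th in zip(times, thresholds)]
    return sum(solution_flags) / len(solution_flags)


# Hypothetical per-item thresholds (seconds) and two hypothetical examinees.
thresholds = [10.0, 10.0, 20.0, 30.0]
effortful = [45.0, 38.0, 60.0, 75.0]  # every response time reaches its threshold
rapid = [4.0, 12.0, 6.0, 9.0]         # only the second item reaches its threshold

print(response_time_effort(effortful, thresholds))  # 1.0
print(response_time_effort(rapid, thresholds))      # 0.25
```

an index near 1 suggests solution behavior on almost every item, while a low index suggests rapid guessing; how the item thresholds are set (for example by item group, as in this study) is left outside the sketch.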
the threshold was determined using three groups of multiple-choice items, namely simple multiple-choice, complex multiple-choice, and multiple-choice with calculation, resulting in ia and tc distributions that are close to the normal distribution and thus reflect natural empirical conditions. to conclude which type of anomaly an examinee shows, ia should be confirmed against the qualitative and psychometric attributes of the test items and the examinees' abilities. to refine this study, further research should be conducted on a more comprehensive threshold determination that considers item surface characteristics such as the number of words, the cognitive level of items, and the complexity of items, as well as psychometric characteristics such as difficulty level, discrimination index, and the ability of the examinees.

acknowledgment

the researchers express their gratitude to the center for educational assessment, research and development agency, ministry of education and culture, for providing support in the form of response time and response data for this study.

references

cizek, g. j., & wollack, j. a. (2016). handbook of quantitative methods for detecting cheating on tests (1st ed.). routledge. https://doi.org/10.4324/9781315743097

fox, j.-p., entink, r. k., & van der linden, w. (2007). modeling of responses and response times with the package cirt. journal of statistical software, 20(7). https://doi.org/10.18637/jss.v020.i07

georgiadou, e., triantafillou, e., & economides, a. a. (2006). evaluation parameters for computer-adaptive testing. british journal of educational technology, 37(2), 261–278. https://doi.org/10.1111/j.1467-8535.2005.00525.x

hauser, c., & kingsbury, g. g. (2009, november 4). individual score validity in a modest-stakes adaptive educational testing setting [paper presentation]. the annual meeting of the national council on measurement in education, san diego, ca. https://www.nwea.org/resources/individual-score-validity-modest-stakesadaptive-educational-testing-setting/

kong, x. j., wise, s. l., & bhola, d. s. (2007). setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. educational and psychological measurement, 67(4), 606–619. https://doi.org/10.1177/0013164406294779

lee, y.-h., & chen, h. (2011). a review of recent response-time analyses in educational testing. in psychological test and assessment modeling (vol. 53, issue 3). http://www.psychologie-aktuell.com/fileadmin/download/ptam/32011_20110927/06_lee.pdf

lewis, c., lee, y.-h., & davier, a. a. von. (2014). test security for multistage tests: a quality control perspective. in n. kingston & a. clark (eds.), test fraud (statistical detection and methodology) (1st ed.). routledge. https://doi.org/10.4324/9781315884677

van der linden, w. j. (2006). a lognormal model for response times on test items. journal of educational and behavioral statistics, 31(2), 181–204. https://doi.org/10.3102/10769986031002181

lindsey, j. k. (2004). statistical analysis of stochastic processes in time. cambridge university press. https://doi.org/10.1017/cbo9780511617164

marianti, s., fox, j.-p., avetisyan, m., veldkamp, b. p., & tijmstra, j. (2014). testing for aberrant behavior in response time modeling. journal of educational and behavioral statistics, 39(6), 426–451. https://doi.org/10.3102/1076998614559412

meijer, r. r., & sotaridona, l. (2006). detection of advance item knowledge using response times in computer adaptive testing (lsac research report series no. ct 03-03). law school admission council.

meijer, r. r. (1996). person-fit research: an introduction. applied measurement in education, 9(1), 3–8. https://doi.org/10.1207/s15324818ame0901_2

meijer, r. r. (2003). diagnosing item score patterns on a test using item response theory-based person-fit statistics. psychological methods, 8(1), 72–87. https://doi.org/10.1037/1082-989x.8.1.72

naga, d. s. (2013). teori sekor pada pengukuran mental (2nd ed.). nagarami citrayasa.

widiatmo, h., & wright, d. b. (2015, april). comparing two item response models that incorporate response times [paper presentation]. national council on measurement in education annual meeting, california, illinois, usa. https://www.researchgate.net/publication/283711098_comparing_two_item_response_models_that_incorporate_response_times

wise, s. l. (2006). an investigation of the differential effort received by items on a low-stakes computer-based test. applied measurement in education, 19(2), 95–114. https://doi.org/10.1207/s15324818ame1902_2

wise, s. l., & kong, x. (2005). response time effort: a new measure of examinee motivation in computer-based tests. applied measurement in education, 18(2), 163–183. https://doi.org/10.1207/s15324818ame1802_2

wulansari, a. d. (2019). model logistik dalam irt dengan variabel random waktu respon untuk tes terkomputerisasi [doctoral dissertation, universitas negeri yogyakarta]. eprints uny. http://eprints.uny.ac.id/id/eprint/66079

wulansari, a. d., kumaidi, & hadi, s. (2019). two parameter logistic model with lognormal response time for computer-based testing. international journal of emerging technologies in learning (ijet), 14(15), 138–158. https://doi.org/10.3991/ijet.v14i15.10580

this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 6(2), 2020, 109-118 available online at: http://journal.uny.ac.id/index.php/reid developing an instrument to measure student’s perception of the medical education curriculum from the perspective of communities of practice theory *1 yoga pamungkas susani; 2 gandes retno rahayu; 2 yayi suryo prabandari; 2 rossi sanusi; 2 harsono mardiwiyoto 1 faculty of medicine, universitas mataram jl. majapahit no. 62, dasan agung baru, selaparang, kota mataram, nusa tenggara barat 83125, indonesia 2 faculty of medicine, public health, and nursing, universitas gadjah mada jl. farmako, senolowo, sekip utara, depok, sleman, yogyakarta 55281, indonesia *corresponding author. e-mail: yoga.pamungkas.susani@gmail.com submitted: 5 may 2020 | revised: 27 october 2020 | accepted: 1 december 2020 abstract the concept of participation as a learning process is essential to foster professional identity development. faculties are expected to provide a curriculum that supports students' participation in the profession's context. curriculum evaluation is needed to assess the extent to which curriculum implementation supports participation. in this regard, this study aims to develop instruments that measure students' perceptions of the medical education curriculum. the blueprint for the instrument's development was based on the concept of participation in communities of practice theory. qualitative research, which involved 17 pre-clinical and clinical medical students as participants, was conducted to explore medical students' perception about formal learning activities that encourage participation. the results were used to generate the items. a series of review processes, item reduction, revisions, and analysis generated 20 items in four factors, namely: engagement support, imagination support, convergence, and feedback. this shows that the instrument is multidimensional. the instrument also has good discriminant validity and composite reliability. 
keywords: curriculum in action, communities of practice, participation, medical education

how to cite: susani, y., rahayu, g., prabandari, y., sanusi, r., & mardiwiyoto, h. (2020). developing an instrument to measure student's perception of the medical education curriculum from the perspective of communities of practice theory. reid (research and evaluation in education), 6(2), 109-118. doi:https://doi.org/10.21831/reid.v6i2.31500.

introduction

the environment in education shapes the student's learning process (genn, 2001a). learning can be understood through several concepts, and the learning concept that underlies curriculum development will affect the learning environment that is formed. there are two learning metaphors, namely learning as an acquisition process and learning as a participation process (bleakley, bligh, & browne, 2011; mann, 2011). in current medical education, there is increasing attention to the formation of professional identity, with participation as the learning process. sociocultural learning theory is considered to be fundamental in this condition (bleakley, 2006; mann, 2011). one of the sociocultural learning theories is situated learning theory (lave & wenger, 1991). communities of practice (also known as cop) (wenger, 1998), which evolved from the situated learning theory (lave & wenger, 1991), sees the learning process as one's process of participation in entering communities of practice. this theory rejects learning as merely an acquisition process. thus, in this theory, the learning process is bound by the situation or context.
in this way, the curriculum for the learning process is seen as chances that are provided by the educational program for students to participate. undergraduate students who start medical education can be seen as someone who starts to learn to become a member of the medical profession. they develop their identity as a physician. the professional identity formation is important for a future physician because it will influence their practice later as a physician (forsythe, 2005). participation is the source of professional identity formation (wenger, 1998). participation is a complex process of an individual that involves physical, emotions, and feelings in both individual or group activities, such as the sense of belonging, thinking, speaking, and engagement in activities related to their part in a community. there are three forms of participation, namely engagement, imagination, and alignment. these three are not separated concepts but related to each other. engagement is the main participation form in practice. engagement can appear as actions that are carried out either individually or in a group, for example, group discussion, being involved in professional activities, or, using and making an artifact in the professional community (wenger, 2009). imagination is a form of participation that aims to build perspective about self, about community, and about the outside world to conduct self-orientation, situation reflection, and possibility exploration. alignment is a process of choosing and developing commitment. alignment determines the participation conducted according to concepts or principles of the community and can ensure that the local activities are also aligned with other processes that are globally accepted. participation, as a part of learning process, should be supported by the faculty by implementing curriculums that do not emphasize on the teaching process. 
faculties need to encourage the students' opportunity in order to engage, imagine, and explore themselves as well as the community. evaluation of the curriculum in action was needed to ensure it. the curriculum in action is a way a curriculum is implemented in practice or reality (fish & colles, 2005). for measuring the curriculum implementation in providing this learning and participation facilities, a measurement instrument is needed. this instrument can be utilized to evaluate the educational process that supports participation. the perception from medical students is very essential for the evaluation because based on this theory, learners are the only one who experiences proper resources for themselves to be able to learn or participate in a professional community. method instrument development is conducted through several steps. the first step was determining the aim of the instrument and developing the blueprint. instrument development aims to enable measurement of perceptions of participation support in the medical education curriculum in action. the instrument measure students' perception of the curriculum they receive in terms of their opportunities to engage, imagine, and also to know their alignment in the medical education context. instrument development steps are presented in figure 1. ethical approval for the present study was obtained from the ethical committee, faculty of medicine, ugm. this research has received permission from the faculty of medicine, universitas gadjah mada (ugm) and faculty of medicine, universitas mataram. qualitative research was done to explore students' perceptions regarding the curriculum in action. these study results are utilized for instrument items. the qualitative exploration involved 17 pre-clinical and clinical medical students as participants. sampling took into consideration education year, gpa, sex, and their activeness in an organization. data collection was conducted through semi-structured interviews. 
aspects explored included support towards engagement, such as opportunities to interact with the medical professional community, opportunities to discuss medical problems, and opportunities to practice clinical skills. support for imagination included a clear explanation of medical profession roles, the contextual level of clinical explanations, a clear overview of the medical profession, and opportunities to reflect on students' experience. alignment support included the curriculum's ability to assure students of their capability during education according to medical profession roles and competencies, including the relevance of the material to medical practice, the clarity of the assessment system, and feedback for improving students' competencies. qualitative analysis was conducted by coding the interview transcripts. peer coding (using an independent second coder) and member checking were conducted to improve the reliability of the qualitative analysis. the codes were classified into subthemes and themes.

figure 1. the instrument development process

the results of this qualitative study and the literature review were elaborated to produce instrument items. codes were converted into sentences used as instrument items. the sentences were aligned with the measurement format and the dimensions within the instrument. this process resulted in 163 statement items. these items were then reduced by removing redundant statements, leaving a total of 103 items.
the instrument is a self-report instrument with a likert scale of 1-5 (1 is strongly disagree, 5 is strongly agree). the next step was developing guidelines for the respondents and designing the instrument layout. the guidelines cover the measurement purposes, the measured aspects, data confidentiality, and filling instructions. review with experts reduced overlapping items, leaving 58 items. revisions were made throughout, including rewording ambiguous sentences and sentences that did not fit the expected measurement (i.e., statements that should be answered on the highly disagree-highly agree scale but appeared to measure frequency or to pose a yes-no question). in the next step, the instrument was tested on several students to assess its readability. this step strengthens face validity by revealing any misinterpretation of statements within the instrument. the first test was conducted with five pre-clinical medical students and eight clinical students. feedback was obtained from individual written comments and from a focus group discussion. besides checking whether the students' interpretation as users matched the expected interpretation, the test aimed to provide information on the time needed to fill in the instrument, its length, and whether its layout was comfortable and easy to fill in. the second test was done in two phases: the first with seven pre-clinical students and the second with three pre-clinical and four clinical students. after each test, the instrument was revised in light of the students' feedback. the revised instrument was used in a survey as pilot testing. the survey was conducted among medical undergraduates and interns from the faculty of medicine, universitas mataram. out of 347 students, 303 filled in the questionnaire. the survey results underwent factor analysis to assess construct validity and reliability. the reliability limit is relative but should be as high as possible (azwar, 2004).
factor analysis was done with partial least square-structural equation model (pls-sem). pls-sem can be applied in a research project with limited participants and skewed distribution data (wong, 2013). in this research, construct convergent validity is shown by ave value > 0.5. beside good convergent validity, the instrument also needs to have good discriminant validity. discriminant validity indicates the items measuring a construct have a low correlation with other constructs. in this case, discriminant validity is shown by higher construct ave square root compared to correlation value to other constructs. reliability can be seen from the composite reliability value > 0.7. findings and discussion qualitative research resulted in codes classified into eight categories. these eight categories are aspects to take into a concern to support participation according to wenger (1998). these are mutuality, competence, continuity, all three support engagement; orientation, reflection, exploration, all three support imagination; convergence, jurisdiction, and also coordination, and all three support alignment. a total of 58 items in eight categories was tested through the survey. mutuality is the availability of adequate group activities with peer students, lecturers, physicians, other professions, or patients. in this concept, the more group activities there are, the easier students interact and learn. competence is an opportunity for students to show their competencies. continuity covers opportunity that allows values, principles, and information on medical professional community delivery from fulltime members to the new member. imagination support includes opportunity in the curriculum to provide an overview of the medical profession, reflect on experiences in the community, and explore self capabilities and possibilities in the community. 
alignment support includes the convergence of activities in the curriculum to achieve a learning process that prepares students to enter the professional community. jurisdiction is the feedback facilities for students to improve their competencies. coordination is facilitating students to coordinate with the faculty in curriculum improvement.

exploratory factor analysis (efa) was done to obtain the items' clustering tendency. in the efa, the kmo (kaiser-meyer-olkin) value met the condition > 0.5, i.e. 0.796, and bartlett's test was significant at < 0.05, i.e. 0.001. the result shows that the variables and samples used can be analyzed further. next, the measure of sampling adequacy (msa) values of all items were > 0.5, which means that the variables can be predicted and analyzed further. the items were extracted into eight factors (based on the categorization from the qualitative study). from the component matrix, items with a low loading factor (< .5) were taken out of the next analysis step. the remaining items were then confirmed with confirmatory factor analysis (cfa) by pls-sem. model 1 involved 34 items grouped in eight factors: five mutuality items, three actualization opportunity (competency) items, six orientation items, three reflection facility items, four exploratory items, eight convergency items, two jurisdiction items, three coordination items, and two opportunities with patients items. items with a loading factor < .7 were taken out of the analysis, so 23 items in eight factors were left, named as shown in table 1. the second-order cfa shows that the eight factors are the constructs of the curriculum in action. the vif value is < 3.3, so no collinearity problem is present in the model 2 instrument. a collinearity problem would indicate that items are redundant.
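the convergent validity and reliability criteria used in this study (ave > 0.5 and composite reliability > 0.7) can be illustrated from standardized loadings. the sketch below assumes the commonly used formulas, ave as the mean of the squared loadings and composite reliability as (sum of loadings)^2 / ((sum of loadings)^2 + sum of (1 - loading^2)); the function names are illustrative, the exact computation in the software used by the authors is not stated in the text, and the loadings are the two feedback item loadings (0.872 each) reported in table 4.

```python
from typing import Sequence


def average_variance_extracted(loadings: Sequence[float]) -> float:
    """Mean of the squared standardized loadings of one construct."""
    return sum(l * l for l in loadings) / len(loadings)


def composite_reliability(loadings: Sequence[float]) -> float:
    """(sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1.0 - l * l for l in loadings)
    return squared_sum / (squared_sum + error_variance)


feedback_loadings = [0.872, 0.872]  # feedback items, table 4
ave = average_variance_extracted(feedback_loadings)
cr = composite_reliability(feedback_loadings)

print(f"ave = {ave:.3f}, passes > 0.5: {ave > 0.5}")
print(f"composite reliability = {cr:.3f}, passes > 0.7: {cr > 0.7}")
```

with these loadings the sketch gives an ave of about 0.76 and a composite reliability of about 0.86, in line with the feedback values reported in table 2.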
the model 2 curriculum in action instrument has good reliability, with composite reliability coefficients > .7. convergent validity is good, with the ave of all constructs (the element factors of the curriculum in action construct) > .5. discriminant validity is very good, with the ave square root of every construct higher than its inter-construct correlation coefficients. the construct validity and reliability are shown in table 2. the instrument was then analyzed in third-order cfa. the mutuality, opportunity to interact with patients, and opportunity for self-actualization constructs were combined into one construct, namely engagement support; the orientation and reflection facility constructs were combined into the imagination support construct. the coordination construct had the lowest loading factor and indicator weight in the curriculum. thus, in model 3, the coordination construct was not included among the construct elements of the curriculum in action.

table 1. constructs in measurement instrument model 2

construct (from factor analysis) and definition | statement example | no. of items
mutuality: measures the availability of opportunities that allow practice sharing with peers and lecturers | the learning sessions facilitate me to share knowledge with my colleagues. | 4
actualization opportunity: measures the availability of opportunity to show competency | the curriculum provides me an adequate opportunity to apply my clinical skill towards the patient/simulated patient. | 2
opportunity to encounter patients: measures the availability of opportunity for students to face real-life patients | the curriculum provides me an adequate opportunity to interact with real patients | 2
orientation: measures the availability of opportunities for students to obtain orientation on medical practice | the curriculum provides me an adequate opportunity to directly observe a clinical practice. | 4
facility for reflection: measures support to students for reflection | when i encounter a disconnection between ideal medical practice with reality, the faculty help me to analyze it | 2
convergence: measures learning process suitability in the formation of knowledge and medical mindset | the curriculum helps ease me to apply the knowledge and skill according to patients’ problems | 4
feedback facility: measures curriculum in action to facilitate the availability of feedback on students’ competencies | the curriculum facilitates students to gain feedback from peers, nurses, patients, or residents | 2
coordination: measures curriculum in action in facilitating students’ coordination with faculty in curriculum improvement | students can provide feedback to the faculty for learning process improvement | 3

table 2. reliability and construct validity of subdimensions

 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
reliability
composite reliability | 0.843 | 0.864 | 0.833 | 0.847 | 0.891 | 0.784 | 0.719 | 0.911
cronbach's alpha | 0.752 | 0.686 | 0.733 | 0.759 | 0.816 | 0.449 | 0.219 | 0.804
convergent validity
ave | 0.573 | 0.761 | 0.556 | 0.580 | 0.731 | 0.645 | 0.562 | 0.836
discriminant validity*
convergence | 0.757 | 0.166 | 0.347 | 0.256 | 0.243 | 0.242 | 0.285 | 0.190
feedback | 0.166 | 0.855 | 0.111 | 0.081 | 0.295 | 0.227 | 0.211 | 0.164
mutuality | 0.347 | 0.111 | 0.746 | 0.266 | 0.225 | 0.095 | 0.331 | 0.190
orientation | 0.256 | 0.081 | 0.266 | 0.762 | 0.332 | 0.212 | 0.409 | 0.464
coordination | 0.243 | 0.295 | 0.225 | 0.332 | 0.872 | 0.252 | 0.345 | 0.395
facilities reflect | 0.242 | 0.227 | 0.095 | 0.212 | 0.252 | 0.803 | 0.202 | 0.142
actualization | 0.285 | 0.211 | 0.331 | 0.409 | 0.345 | 0.202 | 0.749 | 0.478
interact with patient | 0.190 | 0.164 | 0.190 | 0.464 | 0.395 | 0.142 | 0.478 | 0.681

1=convergence; 2=feedback; 3=mutuality; 4=orientation; 5=coordination; 6=facility for reflection; 7=opportunity for actualization; 8=opportunity to interact with patients.
*good discriminant validity is shown by every construct's ave square root (on the diagonal) being higher than its inter-construct correlation coefficients (off the diagonal).

table 3. reliability and convergent validity of four dimensions in the instrument

 | engagement | imagination | convergence | feedback | instrument
composite reliability | 0.790 | 0.755 | 0.843 | 0.864 | 0.815
cronbach's alpha | 0.600 | 0.350 | 0.752 | 0.686 | 0.697
ave | 0.560 | 0.606 | 0.573 | 0.761 | 0.526

the reliability of the model 3 instrument is good, with composite reliability coefficients > .7. convergent validity is good, with the ave of all constructs (the element factors of the curriculum in action construct) > .5 (table 3). these four factors are proven to be construct elements of the curriculum in action (significance of indicator weights < .001).

instrument development is inseparable from validity and reliability issues. validity depicts the conformity of what the instrument items measure with the measurement purposes.
validity is a continuum, meaning that the more proof showing the instrument is valid, the bigger the opportunity to obtain suitable or needed information. validity also shows degree, not only valid or invalid but will be better if classified as high validity or low validity (colton & covert, 2007). validity is also conceptualized in several ways. in this instrument development, the validation process resulted in information on content validity, face validity, and construct validity that also portrays convergence validity and discriminant validity. content validity is a degree that the instrument represents topics or processes that should be measured. in this instrument development, content validity is strengthened with literature review especially regarding the participation concept and the way environment supports participation according to literature. this step helped instrument development in terms of purpose development and limiting construct definition within the instrument. the literature review results became the foundation for instrument blueprint development. content validity was also supported by expert reviews. in this case, experts provided inputs especially in terms of content and language. face validity was strengthened by requesting inputs from students through repeated qualitative tests until the instrument was easily understood. the step of quantitative factor analysis of test results with surveys is a step to obtain information on instrument construct validity. in this study, an efa technique was utilized first to get a picture regarding the tendency of items to cluster and construct suitability in the instrument. cfa was used next to reconfirm the conformity of items and constructs. 
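the discriminant validity criterion applied in this study, each construct's ave square root being higher than its correlations with the other constructs, can be sketched as follows. the function name and the data layout are assumptions made for illustration; the values are the convergence, feedback, and mutuality entries of table 2.

```python
from typing import Dict, Tuple


def passes_discriminant_check(
    sqrt_ave: Dict[str, float],
    correlations: Dict[Tuple[str, str], float],
) -> Dict[str, bool]:
    """For each construct, True when its AVE square root exceeds the absolute
    correlation with every other construct it is paired with."""
    result = {name: True for name in sqrt_ave}
    for (a, b), r in correlations.items():
        if abs(r) >= sqrt_ave[a]:
            result[a] = False
        if abs(r) >= sqrt_ave[b]:
            result[b] = False
    return result


# Diagonal (AVE square root) and off-diagonal (correlation) values from table 2.
sqrt_ave = {"convergence": 0.757, "feedback": 0.855, "mutuality": 0.746}
correlations = {
    ("convergence", "feedback"): 0.166,
    ("convergence", "mutuality"): 0.347,
    ("feedback", "mutuality"): 0.111,
}

print(passes_discriminant_check(sqrt_ave, correlations))
```

a construct that fails this check shares more variance with another construct than with its own items, which would suggest merging or revising constructs.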
in this study, the analysis had to be done gradually to obtain items that load strongly on their construct (loading factor > 0.7), with significant indicator weights, good convergent validity, good discriminant validity, and no collinearity issues. collinearity issues occur when items are redundant or measure similar things. convergent validity indicates that the items in one construct relate to each other. model 1 analysis still resulted in items with loading factors < 0.70; therefore, model 2 analysis was needed. model 2 analysis provided good construct validity and reliability, but the convergent validity of the instrument was not good enough. third-order cfa was done to simplify the constructs and remove the construct that had the lowest indicator weight, i.e. the coordination construct. from the analysis, an instrument consisting of 20 items in four factors was obtained (table 4). this instrument is called the pasport ciame (participation support in curriculum in action of medical education) instrument.

students' perception of the educational environment defines their behavior in the learning process (genn, 2001a). the formal curriculum is an element of the educational environment. the implementation of the formal curriculum is called the curriculum in action, that is, the curriculum received and perceived by the students. many instruments for the measurement of educational climate have been developed (genn, 2001b). besides, soemantri, herrera, and riquelme (2010) have identified 31 instruments for measuring educational climate in the health profession education context.

table 4. loading factor of items

instrument items (translated from the original bahasa version) | loading factor

support of engagement
mutuality
the learning sessions facilitate me to share knowledge with my colleagues. | 0.707
the learning sessions help ease the interaction between me and my lecturers. | 0.731
the learning sessions provide opportunities for students to exchange ideas. | 0.791
the learning sessions allow students to discuss and exchange ideas with the lecturers. | 0.751
the opportunity for self-actualization
each learning session provides me an opportunity to express my understanding of the topics. | 0.749
the curriculum provides me an adequate opportunity to apply my clinical skill towards the patient/simulated patient. | 0.749
the opportunity to engage with the patients
the curriculum provides me an adequate opportunity to interact with real patients | 0.915
the curriculum provides me an adequate opportunity to interact with the community. | 0.915

support of imagination
orientation
the curriculum provides me an adequate opportunity to directly observe a clinical practice. | 0.736
the curriculum provides an adequate portrayal of the real condition of health service in the community | 0.779
activities at the clinical skill laboratory demonstrate my ability as a medical doctor. | 0.781
activities at the clinical skill laboratory allow me to perform as a real medical doctor. | 0.751
facility for reflection
when i encounter a disconnection between ideal medical practice with reality, the faculty help me to analyze it | 0.803
reflection activity is one of the learning activities applied as part of the curriculum. | 0.803

convergence
the curriculum facilitates me to better understand the lessons/discipline. | 0.787
the curriculum helps ease me to apply the knowledge and skill according to patients’ problems | 0.772
the previous learning process sufficiently practice appropriate thinking patterns to deal with current learning | 0.717
the learning process provides a strong scientific basis to comprehend the next level of learning | 0.752

feedback
the curriculum facilitates students to gain feedback from peers, nurses, patients, or residents | 0.872
the curriculum facilitates students to gain feedback from the lecturers | 0.872

at the undergraduate level, the dreem instrument (dundee ready educational environment measure) developed by roff et al. (1997) has been used to measure the educational environment. this instrument consists of 50 items classified into five constructs, i.e. perception of the learning process, learning system organization, academic self-perception, learning atmosphere, and social self-perception. dreem is widely used in many countries with varied purposes as well as varied validity and reliability reports (roff, 2005). several adopters reported unsupported construct validity (miles, swift, & leinster, 2012; yusoff, 2012). differing from dreem, the pasport ciame instrument focuses on the curriculum in action, which is the implementation of the formal curriculum as perceived by students, whereas dreem measures not only the curriculum but also the general educational environment. unlike earlier curriculum or educational environment measurement instruments, the pasport ciame instrument is based on the participation concept in the cop theory.
participation, in this context, is not only students' participation in lectures or other formal instructional processes, but also in activities related to the medical professional community context. in cop, participation as a learning process depends heavily on engagement in group activity, shared practice, the development of imagination, and the process of alignment in the medical profession work context (susani, rahayu, sanusi, prabandari, & harsono, 2015). these concepts are emphasized in this measurement. the main constructs in this instrument are support of engagement, support of imagination, convergence, and feedback. as stated in the introduction, there are three forms of participation, i.e. engagement, imagination, and alignment. support of engagement in this instrument is formed from mutuality, self-actualization, and opportunities for interaction with real patients and the community. the mutuality observed here is the togetherness with peers and lecturers, who are themselves members of the medical professional community. interaction with lecturers allows dialogue, discussion about patients, and sharing of experiences as doctors. not only the interaction with lecturers as clinical supervisors but also the continuous interaction with patients provides opportunities for students to participate and learn (hägg-martinell, hult, henriksson, & kiessling, 2017; steven, wenger, boshuizen, scherpbier, & dornan, 2014). good interaction with peers facilitates the adaptation of students to the learning environment (sari & susani, 2018). the support of imagination construct consists of orientation and facilities for reflection. both of them can help students to get an overview of the medical profession. convergence and feedback originate from the alignment concept, but factor analysis shows poor convergent validity if they are used as one construct of support of alignment.
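the convergent validity judgement above can be made concrete with a small calculation. the sketch below is an illustration, not the authors' procedure: it assumes that the loadings in table 4 are standardized and applies two statistics that are conventional in pls-sem reporting, average variance extracted (ave, the mean of the squared loadings) and composite reliability (cr). the cut-offs of 0.5 for ave and 0.7 for cr are common rules of thumb and are assumptions here, as are all function and variable names.

```python
from math import fsum

# Loadings copied from Table 4 (assumed to be standardized loadings).
TABLE4_LOADINGS = {
    "convergence": [0.787, 0.772, 0.717, 0.752],
    "feedback": [0.872, 0.872],
}


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings of one construct."""
    return fsum(l * l for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """CR: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = fsum(loadings)
    error = fsum(1.0 - l * l for l in loadings)  # error variance per item
    return total ** 2 / (total ** 2 + error)


def summarize(constructs, ave_cutoff=0.5, cr_cutoff=0.7):
    """Return AVE, CR and pass/fail flags per construct (cut-offs are assumptions)."""
    report = {}
    for name, loadings in constructs.items():
        ave = average_variance_extracted(loadings)
        cr = composite_reliability(loadings)
        report[name] = {
            "ave": round(ave, 3),
            "cr": round(cr, 3),
            "ave_ok": ave >= ave_cutoff,
            "cr_ok": cr >= cr_cutoff,
        }
    return report


if __name__ == "__main__":
    for name, stats in summarize(TABLE4_LOADINGS).items():
        print(name, stats)
```

with the table 4 loadings, this gives an ave of about 0.57 for convergence and about 0.76 for feedback when the two are kept as separate constructs, both above the assumed 0.5 cut-off. the values for a merged support of alignment construct cannot be derived from table 4, because the loadings would have to be re-estimated for the merged model.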
the pasport ciame instrument can be used to evaluate faculties in providing a curriculum that is capable of supporting participation as a students’ learning process. this instrument utilizes concepts that are independent of local culture; therefore, it might be used widely with language adaptation. this instrument can also be used in both the undergraduate program and the clinical program. further research needs to be done to strengthen information on its validity and reliability. several other validities, such as predictive validity or concurrent validity, were not examined.

conclusion

the metaphor 'learning as participation' has consequences in the determination and implementation of the curriculum for medical students. the pasport ciame instrument was developed to measure curriculum implementation that sees learning as participation. this instrument could be a tool to evaluate the magnitude of faculty support for students' participation. the pasport ciame instrument which was developed in this study is a multidimensional instrument and has good validity and reliability. it needs strengthening with other studies that reexamine validity and reliability, including predictive and concurrent validity. this instrument may be useful in assisting faculties to evaluate the availability of participation support as a students' learning process in the medical professional community.

references

azwar, s. (2004). reliabilitas dan validitas. pustaka pelajar.
bleakley, a. (2006). broadening conceptions of learning in medical education: the message from teamworking. medical education, 40, 150–157. https://doi.org/10.1111/j.1365-2929.2005.02371.x
bleakley, a., bligh, j., & browne, j. (2011).
medical education for the future, identity, power and location. springer.
colton, d., & covert, r. w. (2007). designing and constructing instruments for social research and evaluation (1st ed.). jossey-bass.
fish, d., & colles, c. (2005). medical education: developing a curriculum for practice. open university press.
forsythe, g. b. (2005). identity development in professional education. academic medicine, 80(10), s112-7.
genn, j. m. (2001a). amee medical education guide no. 23 (part 1): curriculum, environment, climate, quality and change in medical education: a unifying perspective. medical teacher, 23(4), 337–344. https://doi.org/10.1080/01421590120063330
genn, j. m. (2001b). amee medical education guide no. 23 (part 2): curriculum, environment, climate, quality and change in medical education: a unifying perspective. medical teacher, 23(5), 445–454. https://doi.org/10.1080/01421590120075661
hägg-martinell, a., hult, h., henriksson, p., & kiessling, a. (2017). medical students’ opportunities to participate and learn from activities at an internal medicine ward: an ethnographic study. bmj open, 7(2), e013046. https://doi.org/10.1136/bmjopen-2016-013046
lave, j., & wenger, e. (1991). situated learning: legitimate peripheral participation. cambridge university press.
mann, k. v. (2011). theoretical perspectives in medical education: past experience and future possibilities. medical education, 45(1), 60–68. https://doi.org/10.1111/j.1365-2923.2010.03757.x
miles, s., swift, l., & leinster, s. j. (2012). the dundee ready education environment measure (dreem): a review of its adoption and use. medical teacher, 34(9), e620–e634. https://doi.org/10.3109/0142159x.2012.668625
roff, s. (2005). the dundee ready educational environment measure (dreem)—a generic instrument for measuring students’ perceptions of undergraduate health professions curricula. medical teacher, 27(4), 322–325. https://doi.org/10.1080/01421590500151054
roff, s., mcaleer, s., harden, r.
m., alqahtani, m., ahmed, a. u., deza, h., … primparyon, p. (1997). development and validation of the dundee ready education environment measure (dreem). medical teacher, 19(4), 295–299. https://doi.org/10.3109/01421599709034208
sari, d. p., & susani, y. p. (2018). the role of senior peers in students’ transition to clinical clerkships. jurnal pendidikan kedokteran indonesia: the indonesian journal of medical education, 7(2), 143. https://doi.org/10.22146/jpki.39113
soemantri, d., herrera, c., & riquelme, a. (2010). measuring the educational environment in health professions studies: a systematic review. medical teacher, 32(12), 947–952. https://doi.org/10.3109/01421591003686229
steven, k., wenger, e., boshuizen, h., scherpbier, a., & dornan, t. (2014). how clerkship students learn from real patients in practice settings. academic medicine: journal of the association of american medical colleges, 89(3), 469–476. https://doi.org/10.1097/acm.0000000000000129
susani, y. p., rahayu, g. r., sanusi, r., prabandari, y. s., & harsono. (2015). medical student’s participation for developing professional identity. mededpublish, 5(7), 1–12. https://doi.org/10.15694/mep.2015.005.0007
wenger, e. (1998). communities of practice: learning, meaning, and identity. cambridge university press.
wenger, e. (2009). communities of practice and social learning systems: the career of a concept. in c. blackmore (ed.), communities of practice and social learning systems. springer verlag and the open university.
wong, k. k. (2013). partial least squares structural equation modeling (pls-sem) techniques using smartpls. marketing bulletin, 24, 1–32. https://doi.org/10.1108/ebr-10-2013-0128
yusoff, m. s. b. (2012).
the dundee ready educational environment measure: a confirmatory factor analysis in a sample of malaysian medical students. international journal of humanities and social science, 2(16), 313–321.

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 5(2), 2019, 130-143
available online at: http://journal.uny.ac.id/index.php/reid

evaluation of the implementation of batik-skills training program

*1hendro prasetyono; 2dedeh kurniasari; 3laila desnaranti
1,3department of economics education, universitas indraprasta pgri jl. nangka no 58c, tanjung barat, jagakarsa, jakarta selatan 12530, indonesia
2lembaga pelatihan dan keterampilan cosmos jl. kasuari e2/106, raya patriot-jakasampurna, bekasi, jawa barat 17145, indonesia
*corresponding author. e-mail: hen.dro23@yahoo.com

submitted: 8 march 2019 | revised: 3 october 2019 | accepted: 29 october 2020

abstract

the purpose of this study is to evaluate the implementation of the batik skills training program as a basis for recommending program improvements. the research used a qualitative approach with the context, input, process, and product evaluation model. samples were taken from institutes of skills and training in the areas of jakarta, bogor, depok, bekasi, and tangerang. the components that meet the evaluation criteria are all aspects of the context component, discipline, and the learning process, while the components of batik teachers' education qualifications, the use of educational facilities and infrastructure standards, curriculum components, program financing, evaluation of learning outcomes, mastery of theoretical competencies, practices, and impacts on program participants have not met the criteria. the batik skills training program needs to be continued with some improvement. it is recommended that the product component be improved, especially the impact felt by graduates.
keywords: training program implementation, batik skill, learning outcomes, institute of skills and training
permalink/doi: https://doi.org/10.21831/reid.v5i2.23918

introduction

the development of batik in indonesia today is impressive. batik has been very popular and has grown rapidly in the country since unesco recognized it in 2009 as a world cultural heritage originating in indonesia (pancapalaga, bintoro, pramono, & triatmojo, 2014). these developments make people realize that batik can create very promising jobs that require special skills. this is reinforced by data from the central java industry and trade department (2001), which recorded 11,391 batik production units spread across 146 production centers with a production value of €5.4 million (hunga, 2011). batik is a decorated textile made using the wax-resist method. wax can be used to create intricate designs using an instrument called canting, and batik made using this instrument is called batik tulis (lee, 2016).

batik development in indonesia has experienced various obstacles. one of the inhibiting factors in the development of the batik industry is the use of printing machines. it happens because of advancements in information technology and knowledge which result in instability in all aspects of industrial life. it is necessary to use management theory to characterize and understand the nature of the turbulence that has occurred (anderson, mason, hibbert, & rivers, 2017). one implementation of management theory in the education field is the development of batik skill practice in the form of non-formal education. non-formal education, as part of the education system, has the same task as formal education, which is to provide the best service to the community. alternative services programmed outside formal education can serve as a replacement, enhancer, and/or complement to the formal education system.

https://doi.org/10.21831/reid.v5i2.23918 evaluation of the implementation of batik-skills...
hendro prasetyono, dedeh kurniasari, & laila desnaranti copyright © 2019, reid (research and evaluation in education), 5(2), 2019 131 issn 2460-6995

according to field in hoppers (2006, p. 21), the term non-formal education is rarely used; the term lifelong learning has increasingly gained currency when referring to the totality of educational activities outside the school system. in the third form of education, the instructor and the learner are not in direct contact with each other. the learning material is provided for students by post and is written in very simple language so that students can easily understand and comprehend the material. diagrams and exercises are included in these courses to support and guide the students accordingly. some materials are communicated through tv channel programs and the internet (saif, reba, & din, 2017). non-formal education is usually not documented by a certificate or transcript. it occurs in educational institutions or public organizations, clubs, and circles, as well as during individual sessions with teachers or coaches (ivanova, 2016). non-formal forms of education are often characterized by the strong participation of volunteers, viewed as an expression of the protestant teaching of the priesthood of all believers (schweitzer, 2017).

the targets of non-formal education are increasingly diverse, such as serving the poor, those who have not completed basic education, dropouts from formal education, people who do not have access to formal education such as isolated tribes, rural communities, border areas, and outer island communities, and the development of a special skill in the form of training. the existence of non-formal education such as courses and training institutions helps the community to develop. to improve the quality of education and the quality of human resources in the field of education, the government has issued a policy. law of republic of indonesia no.
20 of 2003 on the national education system and the regulation of the minister of national education no. 19 of 2005 on the national education standards, clause 26 verse 5 state that courses and training are held for people who need knowledge, skills, life skills, and attitudes to develop themselves, professions, work, independent business, or continue their education to a higher level. one of non-formal education form which is widely adopted by the community is training. mahapatro (2010, p. 252) states that training is the systematic development of the attitude, knowledge, skill pattern required by a person to perform a given task or job adequately. mahapatro also state that development is ‘the growth of the individual in terms of ability, understanding, and awareness.’ training is required to properly motivate and prepare the workers for operating these mechanisms effectively (white, 2004, p. 641). there are three specific training objectives according to stredwick (2005, p. 376): (1) to develop the competencies of employees and improve their performance, (2) to help people grow within the organization in order that, as far as possible, in the future, the needs for human resources can be obtained from within the organization, (3) to reduce learning time for employees starting in new jobs on appointment, transfer or promotion, and ensure they become fully competent as quickly and economically as possible. besides, three components must be prepared in training: academic preparation, pedagogical skills, and teaching practice (younus & akbar, 2017). the training approach can be used in batik development since the training delivery approach, containing a combination of technical and entrepreneurial skills, was relevant in responding to the needs and objectives of adult trainees (mayombe, 2017). training can address gap issues between education and industry needs. 
perhaps one of the most ambitious challenges that instructors of introductory management courses currently have to deal with is their desire to make their courses more relevant and meaningful for today’s learners, including students with little or no work experience (wright & gilmore, 2012 in durant, carlon, & downs, 2017). this is in accordance with kyriakides and campbell in govender, grobler, and mestry (2016), who note that school quality improvement seems to support a culture of school audits, the philosophy of which is that governments are now formally placing increased expectations on school leadership/management teams to integrate self-evaluation into both their strategic and operational structures and procedures.

the batik training program is designed to equip training participants with the knowledge and ability to design, make, organize, and package batik products as needed by the industry. batik will thus be preserved and remain in accordance with the needs of the industrial world. hence, it is necessary to conduct research to evaluate the success rate of the batik skills training program. this is in accordance with the opinion of posavac and carey (1980, p. 8) regarding the six reasons for the need of evaluation: '(1) fulfillment of accreditation requirements; (2) accounting funds; (3) answering requests for information; (4) making administrative decisions; (5) assisting staff in program development; (6) learning about unintended effects'. an evaluation is a systematic process that provides information about program achievement (dewi & kartowagiran, 2018).
program evaluation is a systematic and ongoing process to collect, describe, interpret and present the information to be used as a basis to make decisions, develop policies, and preparing the next program (prasetyono, 2016). meanwhile, evaluation in training according to torrington, hall, and taylor (2005, p. 402) 'evaluation is straight forward when the output of the training is clear to see, such as reducing the number of dispatch errors in a warehouse or increasing someone’s typing speed'. the essence of 'evaluation is a comparison, and surveys are one (but not the only) way to collect information useful for comparing programs or for comparing individual performance’ (langbein & felbinger, 2006, p. 192). evaluation is about a particular initiative. it is generally carried out to assess the initiative, and the results are not generalizable. evaluations are designed to improve an initiative and to provide information for decision making at the program or policy level; the research aims to prove whether there is a cause and effect relationship between two entities in a controlled situation (harris, 2010, p. 2). evaluation is different from monitoring. according to singh (2007, p. 54), evaluation is different from monitoring in many ways. monitoring usually provides information regarding the performance of process indicators, whereas evaluation assesses the performance of impact indicators. monitoring is an internal process where all concerned project staff devises a monitoring system, while evaluation is usually done by an external agency to assess the project’s achievements. evaluation is a selective exercise that attempts to systematically and objectively assess progress towards the achievement of an outcome. the purpose of the evaluation according to edwards, scott, and raju (2007, p. 58) is 'the ultimate goal for the evaluation team is to deliver the most useful and accurate information to key stakeholders most cost-effectively and realistically possible'. 
program evaluation is a comprehensive approach that involves three general steps: (1) developing program theory, (2) formulating and prioritizing evaluation questions, and (3) answering evaluation questions (donaldson & scriven, 2008, pp. 109–110). the purpose of this research is to evaluate the implementation of the batik skills training program as a recommendation for improving the batik training program in indonesia based on the program evaluation result with the context, input, process, product (cipp) model. the cipp model is suitable for evaluating formal and non-formal education programs, such as training (madaus & stufflebeam, 2002), because evaluation can answer the variation of the statement and determine the success in viewing the quality of education (prasasti & istiyono, 2018).

method

the research used a qualitative approach. qualitative research is defined as the collection, analysis, and interpretation of comprehensive narrative and visual (i.e., non-numerical) data to gain insights into a particular phenomenon of interest (gay, mills, & airasian, 2012, p. 7). the program evaluation model selected is cipp. corresponding to the letters in the acronym cipp, the model's core concepts are context, input, process, and product evaluation. context evaluation assesses needs, problems, and opportunities as a base for defining goals and priorities and judging the significance of outcomes. input evaluation assesses alternative approaches to meeting needs as a means of planning programs and allocating resources. process evaluation assesses the implementation of plans to guide activities and later to help explain outcomes. the evaluative method was used in order to evaluate the process of product development (istiyani, zamroni, & arikunto, 2017).
product evaluation identifies intended and unintended outcomes both to help keep the process on track and determine effectiveness (stufflebeam, madaus, & kellaghan, 2002, p. 279).

the research was conducted at six skills and training institutions (sti) under the asosiasi profesi batik dan tenun nusantara or association of batik and tenun nusantara professions 'bhuana' (apbtn 'bhuana'). the apbtn 'bhuana' innovates in the planning of non-formal education programs for batik courses according to national education standards. apbtn 'bhuana', under the directorate of community education and training course of the ministry of national education of the republic of indonesia, has great influence and authority in conducting batik training in indonesia. the six stis are sti cosmos (bekasi city), sti lesha (bekasi regency), sti tradisiku (bogor regency), sti asri (depok city), sti kris (dki jakarta), and sti datik (tangerang regency). there were 18 key informants in the interviews and 120 respondents for the questionnaire.

the data were collected using interviews, questionnaires, observations, and document studies. the research instrument was developed based on the four components in the cipp evaluation model. the components and aspects of program evaluation with the cipp model are presented in table 1. the program evaluation criteria were divided into interview points, questionnaire items, observation statements, and document review items, 66 items in total. expert analysis was used in testing the instrument validity and reliability.

table 1. evaluation components and aspects of the batik training program

component and aspects
1. the purpose of batik skills training
   a. formulation of goals
   b. basic formulation
   c. the foundation of objective formulation deliberation
2. the program design
   a. formulation of the problem
   b. the use of standard learners
   c. the use of educational standards
   d. the use of educator and of education staff standard
   e. the use of the curriculum program
   f. the determination of material program
   g. the design of learners tasks
   h. financing
3. program implementation
   a. the rules for learner discipline
   b. calendar of events and learning schedule
   c. syllabus and learning process plan
   d. the learning process
   e. program supervision
   f. implementation of learning evaluation
   g. supervision of learning evaluation
4. program implementation results
   a. mastery of theoretical competencies
   b. mastery of practice competencies
   c. mastery of attitudinal competence
   d. graduation certificates
   e. opening new entrepreneurs
   f. open your own training
   g. become an educator
   h. being a batik assessor

data obtained from interviews, document studies, and observations were analyzed using a qualitative approach. qualitative data analysis consists of data reduction, data display, and verification (miles & huberman, 1992, pp. 16–17). the survey questionnaire consisted of four-point likert scale items and qualitative items that were developed from and linked to the reviewed literature. the questionnaire was piloted with five teachers to get feedback and make necessary changes; however, their responses were not included in the collected data and they were asked not to take the survey again (mcminn, kadbey, & dickson, 2015). the unit of analysis for program implementation in this evaluation study is the process and quality of the training. according to dane and schneider and dusenbury, et al. in huff, preston, and goldring (2013), their coaches are differentiated in two key dimensions of implementation: dose and the quality of the program delivery.

findings and discussion

findings

the first component is the context component in the evaluation with the cipp model consisting of three aspects of evaluation.
the first aspect is the policy background. based on the results of interviews and study documents on the law of republic of indonesia no. 20 of 2003, regulation of the minister of national education no. 19 of 2005, and regulation of the minister of national education no. 47 of 2010 on organizing courses, it is found that the batik skills program conducted by sti under the apbtn 'bhuana' had a policy background in accordance with the law. the second aspect is the formulation of the objectives of the batik skills training program. based on the results of interviews and document studies regarding the teaching design of each sti, the results of the formulation of program objectives are in line with the law and learning objectives. the third aspect is the objectives of the batik skills program. based on interviews and studies of sti administrative documents, the results of the process of formulating the objectives of the program involved various central parties ranging from government elements to representatives from all regions in indonesia. besides, the program's objectives are in accordance with community needs, especially in increasing employment and creating national batik business development. the second component is the input component in the evaluation with the cipp model consisting of six aspects of evaluation. the first aspect is the problem formulation of the batik skills program. based on the results of interviews with managers, teachers, and participants of each sti, the results of the formulation of problems in each sti are adjusted to the conditions in the field and respond to challenges ahead, it's just not specific enough. the second aspect is the standard of prospective students in the batik skills program. based on the results of interviews with the study participants' biodata documents, it was found that there was a mismatch between the minimum educational background of program participants on the condition of being a program participant. 
thus, it can be concluded that 50% of the standard for prospective program participants is met. the third aspect is the teacher's academic qualifications for the batik skills training program. based on the results of interviews and studies of training instructors' qualification documents, 67% of the educational qualifications of the trainers meet the educational qualification standards and about 33% have not met the educational standards. all teachers have certificates as batik instructors issued by the ministry of education and culture and the ministry of manpower but have no educational background in art majors. 80% of the teachers have a high school educational background and 65% of the teachers have at least three years of experience as a batik instructor. the fourth aspect is the use of facilities and infrastructure standards. based on direct observations and photos of new infrastructure facilities, only 50% of stis meet the standards. the fifth aspect is the batik skills training program curriculum. based on the results of interviews and study of learning curriculum documents, the curriculum used has adopted the indonesian national qualification framework or kerangka kualifikasi nasional indonesia (kkni) up to level 3. the sixth aspect is the financing of the batik skills training program. based on interviews and studies of sti financial report documents, each sti still relied on the money from the training participants as the main source of income.

the third component is the process component in the evaluation with the cipp model consisting of three aspects of evaluation. the first aspect is the training rules.
based on the results of the distribution of questionnaires to 120 respondents from six stis, interviews, and document studies, it is found that the results of the activity regulations have been implemented in accordance with activity planning, although not 100% of all stis have implemented them. the second aspect is the learning process of the batik skills program. based on the results of the distribution of questionnaires to 120 respondents from six stis, interviews, and participants observation, the sub-aspect of teachers provide instructional objectives when starting training gains the achievement rate of 70%. teaching sub-aspect explained the material with discussion methods with the level of achievement of 87.5%. as for the sub-aspects of the instructor in explaining the material using learning tools, the achievement level is 100%. the third aspect is the evaluation of learning outcomes in the batik skills training program. based on the results of the distribution of questionnaires to 120 respondents from six stis, interviews, and participant observation, the highest answer for the subaspect of assignment at home is the sti tradisiku, whereas the lowest number of answers is sti lesha and sti datik. the results of the recapitulation of the sub-aspects questionnaires show that the institution which most often achieved the highest number of answers to quizzes are sti tradisiku, and the lowest number of answers is sti datik and sti cosmos. the fourth component is the product component in the evaluation with the cipp model consisting of four aspects of evaluation. the first aspect is the mastery of theoretical competence. based on the results of the recapitulation of questionnaires, document studies, and interviews, the average score of quizzes and theory tests was 78. meanwhile, for the sub-aspects of mastery of the theory of materials in making batik, 60% was achieved. sub mastery of material aspects on various techniques in making batik by 42.57% was achieved. 
for the sub-aspect of mastery of material regarding the variety of tools needed in the new batik process, 88.3% is achieved.

the second aspect is the mastery of practice competencies. based on the recapitulation of questionnaires, observations, and interviews, all participants obtain a minimum score of 'good'.

the third aspect is the proof of graduation. based on the recapitulation of questionnaires, document studies, and interviews, all participants receive a competency test graduation certificate.

the fourth aspect is the impact on program participants. based on the recapitulation of questionnaires, document studies, and interviews, it is found that for the sub-aspect of 'the program graduates are able to open a batik production business', the success rate is 20%, with the highest number of alumni coming from sti datik and sti cosmos and the lowest from sti kris and sti asri. for the sub-aspect of 'the program graduates can open a batik selling business', the success rate is 50%, with the highest number of alumni coming from sti tradisiku and the lowest from sti lesha. for the sub-aspect of 'the program graduates are able to open a batik course', the success rate is 16.7%, with the highest number of alumni coming from sti asri and the lowest from sti kris and sti datik. for the sub-aspect of 'the program graduates can open a batik workshop', the success rate is 23.3%, with the highest number of alumni coming from sti asri and sti cosmos and the lowest from sti kris and sti datik. for the sub-aspect of 'the program graduates can open a tiered batik training', the success rate is 16.7%, with the highest number of alumni coming from sti cosmos and the lowest from sti lesha and sti tradisiku. for the sub-aspect of 'the program graduates can become batik educators in schools', the success rate is 86.7%, with the highest number of alumni coming from sti lesha, sti tradisiku, and sti datik and the lowest from sti asri and sti cosmos.

for the sub-aspect of 'the program graduates are financially capable', only 20% of the respondents are financially capable and have started their own batik production business; 50% of the respondents have opened a batik selling business and either have their own store or cooperate with third parties; 13.3% of the respondents are financially capable and have opened a batik course, the majority being constrained by funds and complex permits; 23.3% of the respondents are financially capable and have opened batik training workshops; 16.7% of the respondents are financially capable of opening tiered batik training, the toughest constraint being the license and standards that must be fulfilled; 86.7% of the respondents answered that they are capable and have become batik educators in schools; and 93.3% of the respondents answered that they can and have become batik educators for the general public.

discussion

the first component in the evaluation of the cipp model is the context component. the first aspect of the context component is the policy background. according to ball, policies ‘create circumstances in which the range of options available in deciding what to do are narrowed or changed’ (ward et al., 2016). the batik skills program has a clear legal basis. law of republic of indonesia no.
20 of 2003 on national education system, clause 26 verse 5, through to the regulation of the minister of national education no. 19 of 2005 on national standard of education, clause 1 verse 18, explain that education evaluation is an activity of controlling, guaranteeing, and determining the quality of education for the various education components at every path, level, and type of education as a form of educational accountability. this is regulated in the implementation permit for nonformal and early childhood education, number: 421.10/1572/dik/k.018. it is followed by all stis, which therefore have a policy base in accordance with the law and with the theory of thrupp and robert (2003, p. 195).

the second aspect is the basic formulation of program objectives. the basic formulation of program objectives is based on the analysis of learning conducted by nonformal education. it is in accordance with the law of republic of indonesia no. 20 of 2003 on national education system, clause 26 verse 5, that courses and training are held for people who need knowledge, skills, life skills, and attitudes to develop themselves, develop their profession, work, run an independent business, and/or continue to higher education. it is also in line with the definition of objectives in evaluation programs outlined by edwards et al. (2007, p. 58), that the ultimate goal is to provide accurate information to key stakeholders in the most cost-effective and realistic manner possible. the basic formulation of the program should be able to provide accurate information based on the problems that arise in society.

the third aspect is the purpose of the program. the process of formulating the goals of the batik skills program involved various parties, ranging from the director of course and training, directorate general of early childhood education and community education, and the chairman of the apbtn 'bhuana' to representatives of every sti in indonesia.
the objectives of the program are as follows: (1) preserving batik art, (2) improving the quality and quantity of batik, (3) developing batik as one of the professions in education, like a professional teacher, (4) improving the welfare of batik makers through competency certificates, (5) preparing prospective professional educators before they work for the community, and (6) creating entrepreneurship and employment opportunities for people in batik. context evaluation measures needs based on objectives and priorities and also assesses the significance of the results (stufflebeam et al., 2002, p. 279). context evaluation assesses activities on batik skills which determine the situation and background that affect the types of strategic objectives to be developed in the system. this idea is in line with frye and hemmer (2012), who describe context as a study that identifies and defines program goals and priorities by assessing needs, problems, assets, and opportunities relevant to the program.

the second component in the evaluation is the input component. its first aspect is problem formulation. the problem formulation at each sti matches the conditions in the field and is oriented to the future, although it is not specific enough. in brief, each sti requires the right strategy and plan. this is in line with the understanding of strategy by fidler (2002, p. 10) as the direction and scope of an organization over the long term, which achieves advantages for the organization through its configuration of resources within a changing environment, to meet the needs of markets and to fulfill stakeholder expectations.
meanwhile, strategy in management education, according to de kluyver and pearce ii (2009), gamble, thompson and peteraf (2013), mintzberg, lampel, quinn, and ghoshal (2003), as well as thompson, strickland, and gamble (2010), as cited by albert and grzeda (2015), is described as a strategic framework that includes an analytical process relying on several prescribed tools and expects the student to arrive at a list of strategic options and subsequent recommendations for implementation. this strategy is included in the plan which will be achieved by the training program graduates. the findings are in accordance with the concept of planning stated by yukl (2010, p. 72), that planning is a broadly defined behavior that includes making decisions about objectives, priorities, strategies, organization of the work, assignment of responsibilities, scheduling of activities, and allocation of resources among different activities according to their relative importance.

the second aspect is the standard of prospective learners. each sti has met the criteria in the standard of prospective learners. the ages of prospective learners range from the youngest, a second-grade high school student, to the oldest, who is 53 years old. the educational backgrounds are diverse, ranging from junior high school to undergraduate (s1) education. this is in accordance with the regulations, under which trainees must be at least 17 years of age and may have diverse educational backgrounds.

the third aspect is the trainers’ educational qualification. the findings, as compared to the prevailing regulations, need to be reanalyzed according to the regulation of the minister of national education no.
19 of 2005, clause 29 verse 4, which states that educators at senior high school or madrasah aliyah (islamic-based senior high school), or other equivalent forms of education, must have: (a) a minimum education qualification of diploma-four (d-iv) or bachelor (s1); (b) a higher education background with an education program appropriate to the subject being taught; and (c) a professional teacher certificate for senior high school. meanwhile, educators at course and training institutions consist of teachers, mentors, trainers or instructors, and examiners. therefore, the faculty of course and training institutions may come from faculty in higher education institutions. according to sullivan, mackie, massy, and sinha (2012, p. 24), higher education qualifies graduates for jobs or additional training as well as increasing their knowledge and analytic capacities. these benefits of undergraduate, graduate, and professional education manifest as direct income effects, increased social mobility, and health, as well as other indirect effects.

the fourth aspect is the condition of the facilities and infrastructure. the findings show that the institute of skills and education (sti) cosmos, sti lesha, and sti tradisiku have met the minimum standards of educational facilities and infrastructure, while the other stis have not met the criteria. referring to the regulation of the minister of national education no. 19 of 2005, clause 42 verses 1 and 2, the minimum standards of facilities and infrastructure that must be owned by universities as institutes of teachers' education or lembaga pendidikan tenaga kependidikan (lptk) are as follows:
(1) each educational unit shall have facilities including furniture, educational equipment, educational media, books, and other learning resources, consumables, and other equipment necessary to support a regular and continuous learning process.
(2) each educational unit is required to have infrastructure covering land, classrooms, head unit room, educator room, administrative room, library room, laboratory space, workshop space, production unit space, canteen, power and service installation, sports venues, places for worship, playgrounds, creative venues, and other space or places needed to support the regular and ongoing learning process.
the findings are in line with the theories of fry, ketteridge, and marshall (2009, p. 308) that 'most educational organizations now work under the pressure of a system in which space in their buildings and infrastructure is measured and accounted for in relation to student numbers and activities'. based on the aforementioned findings, the researchers conclude that the infrastructure of three of the six stis organizing batik training is adequate.

the fifth aspect is the batik training program curriculum. all stis have a competency-based curriculum design. the six stis have a standard design of batik graduation levels: level 1, 2, and 3. all stis have skill program material. based on all these findings, it can be concluded that the program curriculum aspect has met the criteria.

the sixth aspect is training program financing. this finding is not fully in line with the existing provisions. the standard of financing is the standard that regulates the components and the amount of operating unit cost of education applicable for one year (fry et al., 2009, p. 3). meanwhile, the financing standard according to the regulation of the minister of national education no.
19 of 2005 clause 62 is elaborated as follows: educational financing consists of investment cost, operating cost, and personal cost. (1) the investment unit cost of education covers the cost of providing facilities and infrastructure, human resource development, and fixed working capital. (2) the personal cost covers the tuition fees that must be paid by the learners in order to be able to follow the learning process regularly and continuously. the operating cost of the educational unit includes (a) educators' and education personnel's salaries and all allowances attached to the salary, (b) educational materials or equipment, and (c) indirect education operating costs in the form of electric power, water, telecommunication services, maintenance of facilities and infrastructure, overtime pay, transportation, consumption, taxes, insurance, and so forth. (3) the standard operating cost of the education unit shall be stipulated by ministerial regulation based on the bsnp proposal.

every program must have good financial management. shattock (2003, p. 30) states that the more financial management emphasizes integrity, frugality, a concern for the pennies rather than the pounds, and a reluctance to borrow, the more it will command internal respect and provide a secure financial base for acting opportunistically and responding quickly to environmental change. conservative financial control mechanisms, however, can create unnecessary layers of hierarchy and bureaucracy and can choke initiative.

within the input component of the evaluation of the batik skills program, the problem formulation and the students’ standard have met the criteria. however, the batik teachers' qualifications, the use of educational facilities and education standards, the curriculum, and the program financing have not fulfilled the criteria.

evaluation at the process stage aims to identify the program achievements and the obstacles encountered by the stis in running the batik program.
the evaluation studied the program implementation, emphasizing the process components (tuytens & devos, 2014). the further evaluation process focuses on the coaching and professional growth of the teacher during at least two evaluation conferences (one formative and one summative). at the end of this process, an evaluation report is handed to the teacher. in this report, the teacher receives a final conclusion (two possibilities, namely: satisfactory or unsatisfactory) about his or her performance.

the first aspect is the discipline in training. each sti has discipline regulations for its trainees, but only 80% of the participants have been informed of their contents.

the second aspect is the process of teaching and learning. teachers state the instructional goals when they start the training, with a 70% achievement level, while explanation of the material using a two-way model reaches an achievement level of 87.5%. in addition, the use of learning tools in teaching the material reaches an achievement level of 100%.

the third aspect is learning evaluation. not all stis provide assignments at home. only sti tradisiku and sti asri give home assignments, and sti tradisiku is the only one that always gives its trainees quizzes. all stis conduct formative, summative, and also competency tests. stufflebeam and shinkfield (2007, p. 294) said that process evaluation is an ongoing check on a plan’s implementation plus documentation of the process, including changes in the plan and key omissions or poor execution of certain procedures. one of the goals is to provide staff and managers feedback about the extent to which staff is carrying out planned activities on schedule, as planned, and efficiently. this opinion is supported by patton (1980, p.
60), who notes that the 'process' focus in an evaluation implies an emphasis on looking at how a product or outcome is produced rather than looking at the product itself; that is, it is an analysis of the processes whereby a program produces the results it does. process evaluation is developmental, descriptive, continuous, flexible, and inductive. evaluation of the process component emphasizes how the product is produced rather than the product itself. the process stage in this research involves the aspects of strategy implementation and the use of facilities, capital, or resources in real activities in the field. thus, it is concluded that, within the process component of the batik skills training program implementation, the training rules and the learning process have met the criteria, while the evaluation of learning outcomes has not met the evaluation criteria.

product evaluation is the next stage of program evaluation in the cipp model. it is directed at the things that show the changes that occur in the input and the kind of results produced. product evaluation serves to interpret success in achieving objectives, assess the data set, compare the established criteria with the results obtained in the field, weigh the considerations associated with the context, inputs, and processes, and formulate a rational interpretation. the products of this program are batik professionals, so the measured product component is the competence of the batik skill training program graduates.

the first aspect is the program graduates' mastery of batik theory. at all the stis, the average score from the quizzes and theoretical exams is above 70. based on the previously mentioned findings, the mastery of the theory of batik materials and of various techniques has not been fulfilled, whereas the mastery of theory about the tools required in batik has met the criteria.

the second aspect is the program graduates' mastery of practice competencies.
all stis have implemented these criteria, and the majority of trainees score above 80%. participants who score below 80 take a remedial exam held by the sti. trainees who do not pass the remedial exam are required to repeat the training. across all stis, 86.7% of participants can use the canting, 100% can use the batik cap, 100% can prepare the wax, 87.5% can perform the coloring process, and only 82.5% can perform the pelorotan process.

the third aspect is the proof of graduation. all participants get a certificate of program graduation, but not all participants get a certificate of batik competence. if they want to get a certificate of batik competence, they must take the test conducted by place competency test (pct). this means that after passing the program, a participant will not always be recognized as a batik professional. he or she needs to take another exam recognized by the ministry of education and culture.

the fourth aspect is the impact on the participants of the batik skill program. it is the final stage of the series of evaluations in the product component. impact evaluation focuses on the impact felt by the participants after the program and serves to analyze that influence. rossi, freeman, and lipsey (1999, p. 25) likewise hold that impact evaluation focuses on questions about the impact felt by the participants after the program.
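the pass and remedial rule reported for the practice-competency aspect (a score below 80 leads to a remedial exam, and a failed remedial exam leads to repeating the training) can be sketched as a small decision function. this is only an illustrative sketch: the function and outcome names are hypothetical, and it assumes that a score of exactly 80 counts as a pass and that the remedial exam uses the same pass mark, neither of which is stated in the findings.

```python
from enum import Enum
from typing import Optional

# Threshold reported in the text. Treating exactly 80 as a pass, and reusing
# the same threshold for the remedial exam, are assumptions of this sketch.
PASS_MARK = 80


class Outcome(Enum):
    PASSED = "passed"
    REMEDIAL_REQUIRED = "remedial exam required"
    PASSED_AFTER_REMEDIAL = "passed after remedial exam"
    REPEAT_TRAINING = "repeat the training"


def practice_outcome(first_score: float,
                     remedial_score: Optional[float] = None) -> Outcome:
    """Classify a trainee's practice-competency result (hypothetical helper)."""
    if first_score >= PASS_MARK:
        return Outcome.PASSED
    if remedial_score is None:
        # Below the pass mark and no remedial exam taken yet.
        return Outcome.REMEDIAL_REQUIRED
    if remedial_score >= PASS_MARK:
        return Outcome.PASSED_AFTER_REMEDIAL
    # Failed the remedial exam as well: the training must be repeated.
    return Outcome.REPEAT_TRAINING
```

for example, `practice_outcome(72, 85)` describes a trainee who fails the first practice test but passes the remedial exam, while `practice_outcome(72, 60)` describes one who has to repeat the training.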
the evaluation questions around which impact assessment is organized relate to such matters as whether the desired program outcomes were attained, whether the program was effective in producing a change in the targeted social conditions, and whether the program impact included unintended side effects. these questions assume a set of operationally defined objectives and criteria for success. it is also important to know whether the measured impact is positive or negative, what kind of impact it is, how large it is, and whether it can bring positive change for the participants' future, so that it can be analyzed for further program improvement. stufflebeam et al. (2002, p. 229) said that to assess performance beyond goals, evaluators need to search for unanticipated outcomes, both positive and negative. they might conduct hearings or group interviews to generate hypotheses about the full range of outcomes and follow these up with an effort to confirm or disconfirm the hypotheses.

the findings on the impact aspect show that the most important thing for an individual after a program is to get a better job or income. this is in line with sullivan et al. (2012, p. 24), who state that the benefits of undergraduate, graduate, and professional education manifest as direct income effects, increased social mobility, health, and also other indirect effects. measures have been created to monitor changes in these outputs, narrowly defined: numbers of degrees, time to degree, degree mix, and the like. attempts have also been made to estimate the benefits of education using broader concepts, such as the accumulation of human capital. for estimating the economic returns to education, a starting point is to examine income differentials across educational attainment categories and institution types, attempting to correct for other student characteristics.
all the evaluation stages, from context evaluation, input evaluation, and process evaluation to product evaluation, form a unity that cannot be separated; they have to be carried out together, depending on the conditions in the field. the findings are in accordance with stufflebeam and shinkfield (2007, p. 294), who state that the purpose of product evaluation is to measure, interpret, and judge an enterprise’s achievements. its main goal is to ascertain the extent to which the evaluand met the needs of all the rightful beneficiaries. based on the findings, the number of stakeholders involved as respondents has met the requirements. according to bamberger, rugh, and mabry (2006, p. 271), program impact and quality cannot be determined without understanding the diverse experience of stakeholders. the perceptions of many must be sought out.

the impact aspect in the evaluation of the implementation of the batik skills program shows many positive influences on the program participants but gives no certainty about their future, because those who have passed the program have not been able to create their own jobs or find work as batik craftsmen. within the product component of the program evaluation, the mastery of theoretical competence has not been fulfilled, whereas the mastery of practice competence and the program pass mark have been fulfilled. the impact aspect has not provided the program participants with a promising future once they have completed their training.

conclusion

referring to the research findings and discussion, all evaluation components in the implementation of the training program are carried out well, except for the results of program implementation. the result that must be improved is the impact that the participants obtain after attending the training at the sti.
if sti graduates can neither open a batik business nor work as batik craftsmen, the batik skills training program may not develop. if it does not develop, there will be no generation to continue the batik skills. it is recommended that the ministry of national education collaborate with the ministry of cooperatives and small and medium enterprises, local governments, and related agencies to provide business opportunities to sti graduates in the form of soft loan assistance, entrepreneurship training, and facilitation in obtaining business licenses.

acknowledgments

the researchers would like to thank the association of batik and tenun nusantara professions ‘bhuana’ (apbtn ‘bhuana’) for giving permission and supporting the authors to conduct this research.

references

albert, s., & grzeda, m. (2015). reflection in strategic management education. journal of management education, 39(5), 650–669. https://doi.org/10.1177/1052562914564872
anderson, l., mason, k., hibbert, p., & rivers, c. (2017). management education in turbulent times. journal of management education, 41(2), 303–306. https://doi.org/10.1177/1052562916682208
bamberger, m., rugh, r., & mabry, l. (2006). real world evaluation. london: sage publications.
dewi, l. r., & kartowagiran, b. (2018). an evaluation of internship program by using kirkpatrick evaluation model. reid (research and evaluation in education), 4(2), 155–163. https://doi.org/10.21831/reid.v4i2.22495
donaldson, s. i., & scriven, m. (2008). evaluating social programs and problems. mahwah, nj: lawrence erlbaum associates, inc.
durant, r. a., carlon, d. m., & downs, a. (2017). the efficiency challenge: creating a transformative learning experience in a principles of management course. journal of management education, 41(6), 852–872.
https://doi.org/10.1177/1052562916682789
edwards, j. e., scott, j. c., & raju, n. s. (2007). evaluating human resources programs. san francisco, ca: john wiley & sons.
fidler, b. (2002). strategic management for school development: leading your school’s improvement strategy. https://doi.org/10.4135/9781446219614
fry, h., ketteridge, s., & marshall, s. (eds.). (2009). a handbook for teaching and learning in higher education: enhancing academic practice (3rd ed.). new york, ny: routledge.
frye, a. w., & hemmer, p. a. (2012). program evaluation models and related theories: amee guide no. 67. medical teacher, 34(5), e288–e299. https://doi.org/10.3109/0142159x.2012.668637
gay, l. r., mills, g. e., & asian, p. (2012). educational research: competencies for analysis and applications. boston, ma: pearson.
govender, n., grobler, b., & mestry, r. (2016). internal whole-school evaluation in south africa. educational management administration & leadership, 44(6), 996–1020. https://doi.org/10.1177/1741143215595414
harris, m. j. (2010). evaluating public and community health programs. san francisco, ca: jossey-bass.
hoppers, w. (2006). non-formal education and basic education reform: a conceptual review. paris: international institute for educational planning unesco.
huff, j., preston, c., & goldring, e. (2013). implementation of a coaching program for school principals. educational management administration & leadership, 41(4), 504–526. https://doi.org/10.1177/1741143213485467
hunga, a. i. r. (2011). uncover the invisible: home-workers in micro-small-medium industries based on “putting-out” system (the case study of the batik and batik convection industry in a sragen-surakarta-sukoharjo cluster of indonesia). the international journal of interdisciplinary social sciences, 5(9), 311–322.
istiyani, d., zamroni, z., & arikunto, s. (2017). a model of madrasa ibtidaiya quality evaluation. reid (research and evaluation in education), 3(1), 28–41. https://doi.org/10.21831/reid.v3i1.13902
ivanova, i. v. (2016). non-formal education: investing in human capital. russian education & society, 58(11), 718–731. https://doi.org/10.1080/10609393.2017.1342195
langbein, l., & felbinger, c. l. (2006). public program evaluation. new york, ny: m. e. sharpe.
law of republic of indonesia no. 20 of 2003 on national education system. (2003).
lee, t. (2016). defining the aesthetics of the nyonyas’ batik sarongs in the straits settlements, late nineteenth to early twentieth century. asian studies review, 40(2), 173–191. https://doi.org/10.1080/10357823.2016.1162137
madaus, g. f., & stufflebeam, d. l. (2002). evaluation in education and human services. basel: springer nature.
mahapatro, b. b. (2010). human resource management. new delhi: new age international limited.
mayombe, c. (2017). integrated non-formal education and training programs and centre linkages for adult employment in south africa. australian journal of adult learning, 57(1), 105–125. retrieved from https://www.ajal.net.au/integrated-non-formal-education-and-trainingprograms-and-centre-linkages-for-adultemployment-in-south-africa/
mcminn, m., kadbey, h., & dickson, m. (2015). the impact of beliefs and challenges faced, on the reported practice of private school science teachers in abu dhabi. journal of turkish science education, 12(2), 69–79. https://doi.org/10.12973/tused.10141a
miles, m. b., & huberman, a. m. (1992). qualitative data analysis. beverly hills, ca: sage publication.
pancapalaga, w., bintoro, v. p., pramono, y. b., & triatmojo, s. (2014). the chrome-tanned goat leather for high quality of batik.
journal of the indonesian tropical animal agriculture, 39(3), 188–193. https://doi.org/10.14710/jitaa.39.3.188-193
patton, m. q. (1980). qualitative evaluation methods. beverly hills, ca: sage publication.
posavac, e. j., & carey, r. g. (1980). program evaluation: methods and case study. englewood cliffs, nj: prentice-hall.
prasasti, i. h., & istiyono, e. (2018). developing an instrument of national examination of equivalency education package c of mathematics subject. reid (research and evaluation in education), 4(1), 58–69. https://doi.org/10.21831/reid.v4i1.15556
prasetyono, h. (2016). graduate program evaluation in the area leading educational, outlying and backward. journal of education and practice, 7(36), 109–116. retrieved from https://www.iiste.org/journals/index.php/jep/article/view/34641
regulation of the minister of national education no. 19 of 2005, on national standard of education. (2005).
regulation of the minister of national education no. 47 of 2010, on the competence standard of training graduates. (2010).
rossi, p. h., freeman, h. e., & lipsey, m. (1999). evaluation: a systematic approach (2nd ed.). thousand oaks, ca: sage publications.
saif, p., reba, a., & din, j. u. (2017). a comparitive study of subject knowledge of b.ed graduates of formal and nonformal teacher education systems. journal of education and educational development, 4(2), 270–283. https://doi.org/10.22555/joeed.v4i2.1354
schweitzer, f. (2017). researching nonformal religious education: the example of the european study on confirmation work. hts teologiese studies / theological studies, 73(4), 1–9. https://doi.org/10.4102/hts.v73i4.4613
shattock, m. (2003). managing successful universities. london: open university press.
singh, k. (2007).
quantitative social research methods. new delhi: sage publications. stredwick, j. (2005). an introduction to human resource management. oxford: elsevier. stufflebeam, d. l., madaus, g. f., & kellaghan, t. (2002). evaluation models: viewpoints on education and human services evaluation (2nd ed.). boston, ma: kluwer academic publisher. stufflebeam, d. l., & shinkfield, a. j. (2007). evaluation theory, models and applications. san francisco, ca: jossey-bass. sullivan, t. a., mackie, c., massy, w. f., & sinha, e. (eds.). (2012). improving measurement of productivity in higher education. https://doi.org/10.17226/ 13417 thrupp, m., & robert, w. (2003). education management in managerialist times: beyond the textual apologists. philadelphia, pa: open university press. torrington, d., hall, l., & taylor, s. (2005). human resource management sixth edition (6th ed.). london: pearson education. tuytens, m., & devos, g. (2014). the problematic implementation of teacher evaluation policy. educational management administration & leadership, 42(4_suppl), 155–174. https://doi.org/ 10.1177/1741143213502188 ward, s., bagley, c., lumby, j., hamilton, t., woods, p., & roberts, a. (2016). what is ‘policy’ and what is ‘policy response’? an illustrative study of the implementation of the leadership standards for social justice in scotland. educational management administration & leadership, 44(1), 43–56. https://doi. org/10.1177/1741143214558580 white, c. (2004). strategic management. new york, ny: palgrave macmillan. younus, f., & akbar, r. a. (2017). comparison of evaluation methods of teaching practice of formal and nonformal teacher education institutions of punjab. bulletin of education and research, 39(1), 159–173. retrieved from http://pu.edu.pk/images/journal/ier/p df-files/12_39_1_17.pdf yukl, g. a. (2010). leadership in organizations. upper saddle river, nj: prentice hall. this is an open access article under the cc-by-sa license. 
reid (research and evaluation in education), 6(1), 2020, 51-65 available online at: http://journal.uny.ac.id/index.php/reid item parameters of yureka education center (yec) english proficiency online test (epot) instrument 1endrati jati siwi; *1rosyita anindyarini; 1sabiqun nahar 1yureka education center yogyakarta jl. palem hijau no. 120, sidoarum, godean, sleman, yogyakarta 55264, indonesia *corresponding author. e-mail: anin@eurekatour.com submitted: 1 april 2020 | revised: 29 april 2020 | accepted: 15 may 2020 abstract yureka education center (yec) is one of the institutions which has developed an online-based english proficiency test. the test is called the english proficiency online test (epot) which follows the toefl itp (institutional testing program) framework. thus, this study aimed to analyze the characteristics of epot instruments consisting of listening, structure, and reading subtests, which later the quality of each epot test item is identified. this study used a descriptive quantitative approach by describing the characteristics of epot test items in terms of item difficulty index, item discrimination index, test information’s function, and test measurement’s errors. the data were collected through epot trials conducted by 2,652 online test-takers as participants from 20 provinces in indonesia. the collected data were then analyzed using the item response theory (irt) approach using the bilog program on all logistic parameter models which began with the item compatibility test against the model. based on the results of the analysis, all subtests match the 3-pl model. most of epot’s test items had a good range of difficulty index and discrimination index. the epot information’s function shows that accurate items are used on the 3-pl model for a certain capability range. this study is expected to point out that the epot test could be used as an alternative english proficiency test that is easy to use and useful. 
keywords: analysis, parameter, epot, listening, structure, reading

how to cite: siwi, e., anindyarini, r., & nahar, s. (2020). item parameters of yureka education center (yec) english proficiency online test (epot) instrument. reid (research and evaluation in education), 6(1), 51-65. doi:https://doi.org/10.21831/reid.v6i1.31013.

copyright © 2020, reid (research and evaluation in education), 6(1), 2020. issn: 2460-6995 (online). license: https://creativecommons.org/licenses/by-sa/4.0/deed.id

introduction

in this era of globalization, better known as the era of free trade, each individual is required to develop reliable skills, especially in communication. english currently plays a major role in global communication between countries. therefore, each individual is expected to master english actively, both orally and in writing. in indonesia, english is one of the foreign languages learned at school. nowadays, foreign languages, especially english, play an important role in careers. the working world highly appreciates people who have good english ability (handayani, 2016, p. 106). english ability is needed for various job positions, such as teachers, employees, receptionists, security guards, and programmers, as well as by job seekers. many companies and government agencies, including in the selection process for civil servant candidates (calon pegawai negeri sipil or cpns), require english proficiency, which can be proved, among other ways, by a test of english as a foreign language (toefl) certificate (arnani, 2019).

in addition to functioning as a requirement for studying abroad and applying for work, toefl in indonesia has an additional function as a test instrument. this gives several institutions a chance to develop and organize tests measuring an individual’s english proficiency level.
sharpe states that there are 180 countries that take the toefl test every year in language institutions spread throughout the world (sharpe, 2002, p. 3). yureka education center (yec) is one of the institutions which develop english proficiency tests as a test instrument following one of ets products, toefl itp (institutional testing program). english proficiency online test (epot) is a toefl prediction test which has been developed by yec since 2018. as the name implies, epot measures an individual’s english proficiency level in three aspects which are listening, structure and written expression, and reading skills which can be done online. epot gives several benefits for the test takers. one of the benefits is that the test can be done almost anywhere and anytime, as long as the test takers are connected to the internet. moreover, the result of epot can be delivered instantly after the test ends. test takers will receive a digital certificate sent to their registered email. epot is a web-based proficiency test, therefore, the test takers are not required to download any software or applications. they can take the test using a web browser on their laptops or personal computers. epot has a test structure which refers to toefl itp, consisting of three sections, namely: listening comprehension, structure and written expression, and also reading comprehension. epot is held for 115 minutes. the exercises are in multiple-choice with four answer choices. table 1 is a comparison table of the number of questions and estimation time between toefl itp and epot yec. to find out the quality of epot yec test items, it is necessary to prove that each epot’s test item is also capable of measuring someone’s english proficiency as toefl itp. the researchers verified each epot’s test item using item response theory (irt) since the developed epot’s test items do not depend on the ability of the test takers and vice versa. 
this means that the items’ difficulty and discrimination levels do not depend on the test-takers (anderson & morgan, 2008, p. 76; olufemi, 2013, p. 378; yang & kao, 2014, p. 171). in addition, fan states that analysis using irt places more emphasis on item-level information, whereas analysis in classical test theory places more emphasis on test-level information (fan, 1998, p. 359). thus, an analysis using irt gives more detailed and accurate results (pollard, dixon, dieppe, & johnston, 2009, p. 3).

epot’s items produce dichotomous scores in the form of correct (1) and incorrect (0). dichotomous data can be analyzed using a latent linear model, perfect scale model, latent distance model, normal ogive parameter model, or logistic parameter model (de ayala, 2009, p. 120; van der linden & hambleton, 1996, p. 18). this analysis of epot’s test items uses the logistic parameter model because the mathematical calculation is simpler with a logistic distribution than with a normal distribution (chung, 2005, p. 41).

table 1. the comparison between toefl itp and epot yec

section                                     toefl itp                    epot yec
section 1: listening comprehension          50 questions (35 minutes)    50 questions (35 minutes)
section 2: structure & written expression   40 questions (25 minutes)    40 questions (25 minutes)
section 3: reading comprehension            50 questions (55 minutes)    50 questions (55 minutes)

several previous studies about item analysis to measure the cognitive skills of the students used classical test theory.
still, the analysis using classical test theory did not yield enough information to find out the effectiveness of test items. the reason was the existing assumptions that could not be met. item statistics depended on the test takers’ characteristics and standard error of estimator score which applied to all of the test takers. therefore, there was no estimator score for each of the test-takers and test items. nowadays, there are several studies which are using irt because this theory is considered to be more detailed and valid to reveal the test items' quality. the main advantages of irt are that (1) the item parameters are invariant function or the response curve unchanged; and (2) the item selection can be done based on the amount of item information and test information (hambleton, swaminathan, & rogers, 1991, p. 7). according to naga, there are two types of parameters that are related to one another. in this case, participant characteristic parameters can be known if the parameter characteristics of the items are known or also known as a logistic model estimation. this model estimation is then developed into a logistic model one-to-three parameter. likewise, the parameter features of the items can be measured if the parameter characteristics of the participants are known as the maximum likelihood estimation or the estimation of the maximum probability of occurrence (naga, 1992). according to the logistic distribution, irt model is classified based on the number of test item’s parameter into three types namely one-parameter logistic model (1-pl), two parameters logistic model (2-pl), and also three-parameter logistic model (3-pl) (hambleton, 1989, p. 148; hambleton et al., 1991, p. 7; magis, 2013, p. 305). 
the 1-pl model only has one parameter which is the level of difficulty; the 2-pl model has two parameters, namely, the level of item difficulty and discrimination index; while the 3-pl model displays the parameter of difficulty index, discrimination index, and also pseudoguessing. item difficulty index (b) shows the difficulty level of an item. item discrimination index (a) shows how each test item differentiates test takers' ability in answering that test item. meanwhile, pseudo-guessing (c) shows the probability of test-takers with low ability to correctly answer a test item. in order to apply the theory, the researchers need to determine a suitable model with the analyzed data. for statistical model selection, from the three models, then the compatibility of the items was made based on the chi-square values. if an item has a probability of the chisquare value ≥0.05, then that item is considered fit or compatible with the model. for this reason, the logistic model in data that has the most compatible items will be chosen as the model for data analysis (retnawati, 2014, p. 25). a research of the test of english proficiency (toep) developed by direktorat pendidikan sma or the directorate of senior secondary education has been done by several researchers using three-parameter logistics (3pl). it was in contrast with test items developed by private english courses. currently, there are many institutions which offer online toefl prediction test which can be easily accessed. however, the quality of test items they developed cannot be validated since it was not tested and evaluated properly. there were many test takers like college students or fresh graduates who have taken these tests to find out their english proficiency. as one of the institutions which develop toefl prediction like test called english proficiency online test (epot) and an online course, yec makes serious efforts to analyze its test items using the irt approach. 
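the three-parameter logistic model described above can be illustrated with a short python sketch. this is a minimal illustration, not the bilog-mg implementation: the function name p_correct, the scaling constant d, and the example item values are assumptions introduced here for demonstration only. setting c = 0 gives the 2-pl form, and additionally fixing a to a common value for all items gives the 1-pl form.

```python
import math


def p_correct(theta, a, b, c=0.0, d=1.0):
    """probability that a test taker with ability theta answers an item correctly.

    a: discrimination index (slope)
    b: difficulty index (threshold)
    c: pseudo-guessing (lower asymptote)
    d: scaling constant (assumption: 1.0 here; 1.7 is often used to
       approximate the normal ogive metric)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


# hypothetical item, not taken from the epot data
a, b, c = 1.0, 0.0, 0.25

# at theta == b the probability lies halfway between c and 1
print(p_correct(0.0, a, b, c))  # 0.625

# a very low-ability test taker still succeeds with probability close to c
print(p_correct(-10.0, a, b, c))
```

the lower asymptote is what distinguishes the 3-pl curve from the other two: even when ability is far below the item difficulty, the probability of a correct answer does not fall below c.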
this study was conducted to analyze and describe the parameters of epot’s test items based on the logistic parameter model that suits the responses of epot’s test-takers.

method

the study is aimed at finding out the parameters, or characteristics, of epot’s test items through the trial results. the parameters of epot’s test items can be observed from the difficulty, discrimination, and pseudo-guessing level of each test item. there were 2,652 participants from 20 provinces throughout indonesia who became the research subjects. most of them were fresh graduates who wanted to apply for a job and students who wanted to continue their studies. a simple random sampling technique was used to gather samples from the population. the samples were picked randomly, disregarding any differences within the population. this method is used if the members of a population are considered homogeneous (sugiyono, 2014). the samples were fresh graduates at the bachelor level with a minimum age of 23 years. most of them took epot because they needed a toefl certificate to apply for job vacancies or to continue their studies. others took epot to test their proficiency level since epot’s framework is equivalent to the toefl itp. all of the research subjects took epot online through the official yureka education center website, yec.co.id. a set of epot test consists of 50 listening comprehension questions, 40 questions of structure and written expression, and 50 questions of reading comprehension. the test should be completed in 115 minutes. previously, epot’s validity and reliability had been tested.
the content validity testing was done by three english experts, who examined the content and structure of the test. the results showed that four test items were not valid since their aiken’s v index was less than 0.67 (azwar, 2017, p. 113). these four items were then revised and tested again to achieve a good aiken’s v index. the distribution of aiken’s v values is shown in figure 1.

figure 1. aiken’s v value distribution

the face validity test was conducted by two experts on learning media. the experts examined the test appearance and the compatibility of the item context with the aim of the test. as a result, for the test appearance, yec needed to add a button to change the audio volume; recheck the audio playback; change the placement of the test instructions; fix the placement of the test items; make the font size consistent; and fix the use of capital, italic, and bold type. after the revision was done and the appearance of the test was improved, the face validity could be considered met (azwar, 2017, p. 43).

the reliability test of epot showed a cronbach’s alpha score of 0.908. this means that 90.8% of the observed score variance reflects true score variance. according to the literature, a reliability score of 0.908 shows that epot’s test instrument has good reliability (gliem & gliem, 2003; guilford, 1956). therefore, the developed epot test instrument is assumed to be highly reliable. the results of the reliability test are shown in table 2.

table 2. reliability index

cronbach's alpha    cronbach's alpha based on standardized items    n of items
.908                .910                                            140

the item analysis on epot used the logistic parameter model. in irt, an item’s difficulty level can be labeled as good if its value is in the range of -2 to 2 (de ayala, 2009, p. 15; fan, 1998; hambleton et al., 1991, p. 13).
theoretically, the item discrimination index lies on the scale -∞ ≤ a ≤ ∞, but in practice the a value is in the range of 0 to 2 (hambleton et al., 1991, p. 15). meanwhile, an item’s c value is considered good if it is in the range of 0 to 1/k, where k is the number of answer choices (hulin, drasgow, & parsons, 1983). after the three logistic parameter models were compared, the 3-pl model was considered the most suitable model for the epot trial data.

the item analysis used the bilog-mg software, a computer program for maximum likelihood estimation that can fit one-, two-, or three-parameter models. bilog-mg can estimate the parameters of multiple-choice items and the latent abilities of large numbers of test-takers (crocker & algina, 1986, p. 354; hambleton et al., 1991, pp. 43–50; yen & fitzpatrick, 2006, pp. 131–132). the bilog-mg output provides the item difficulty index (b) or threshold, the item discrimination index (a) or slope, and the pseudo-guessing (c) or asymptote. the difficulty index, discrimination index, and pseudo-guessing values are shown in graphs. in addition, the item characteristic curve (icc) graphs show the quality of several items, and the test information curve (tic) graph shows the quality of epot.

findings and discussion

epot consists of three sections, namely listening comprehension, structure and written expression, and reading comprehension. the summary of difficulty index, discrimination index, and matched items can be seen in table 3. if the data are fitted to the 1-pl model, only 71 items from listening, structure, and reading have a chi-square probability ≥ 0.05.
in the 2-pl model, 117 items have a chi-square probability ≥ 0.05. meanwhile, in the 3-pl model, 123 items have a chi-square probability ≥ 0.05 and can therefore be considered fit items. in conclusion, the logistic model that best fits the epot test-takers’ responses is the 3-pl model. the 3-pl model was also selected because the test-taker sample fulfilled the requirements for its use. other than that, the result reinforces the assumption that proficiency tests using multiple-choice formats are examples of situations where the 3-pl model is suitable. test takers tend to choose the answer they find most attractive if they cannot find the correct answer, so the guessing factor is considered in this study (huriaty, 2019, pp. 35–36).

table 3. summary of item parameters’ characteristics and matched item analysis

section     model   item’s description   number of good items/fit items   percentage
listening   1pl     b                    49                               98%
                    fit item             27                               54%
            2pl     a                    45                               90%
                    b                    48                               96%
                    fit item             45                               90%
            3pl     b                    46                               92%
                    a                    50                               100%
                    c                    10                               20%
                    fit item             48                               96%
structure   1pl     b                    34                               85%
                    fit item             25                               62.5%
            2pl     a                    35                               87.5%
                    b                    40                               100%
                    fit item             26                               65%
            3pl     b                    40                               100%
                    a                    39                               97.5%
                    c                    12                               30%
                    fit item             27                               67.5%
reading     1pl     b                    44                               88%
                    fit item             19                               38%
            2pl     a                    46                               92%
                    b                    45                               90%
                    fit item             46                               92%
            3pl     b                    44                               88%
                    a                    49                               98%
                    c                    3                                6%
                    fit item             48                               96%

the first section, listening, consists of 50 questions with a duration of 35 minutes. based on the test-takers’ response data, the epot listening items have various difficulty index, discrimination index, and pseudo-guessing values, which can be seen in figure 2, figure 3, and figure 4.

figure 2. difficulty index of epot listening
figure 3. discrimination index of epot listening
figure 4. pseudo guessing values of epot listening

according to figure 2, 46 items out of 50 have a good difficulty index, while four items are considered poor. those four items are number 4 (b = -2.473), number 36 (b = 2.068), number 40 (b = 2.572), and number 49 (b = 2.552). numbers 36, 40, and 49 are considered too difficult because b > 2, while number 4 is considered too easy because b < -2. as a result, the response patterns for these items tend to be poor and cannot reveal the difficulty index parameter.

figure 3 shows that the items in the listening section have various discrimination index values that are well distributed. all 50 test items show a good discrimination index, in the range of 0 to 2. accordingly, the epot listening test items can distinguish test takers with high ability from those with low ability. on the other hand, figure 4 shows that the listening section has 43 items with good pseudo-guessing. it means only 14% of all items can be answered correctly because of an element of guessing. the next analysis is the item fit analysis on listening, illustrated in the form of item characteristic curves (icc) as presented in figure 5 and figure 6.

figure 5. an icc example of listening item number 1
figure 6. an icc example of listening item number 2

figure 5 and figure 6 are examples of test-takers’ response patterns toward epot listening test items number 1 and 2.
figure 5 shows a graph of the relationship between test takers’ ability and the parameter estimates of item number 1, with b = -0.983; a = -0.542; and c = 0.500. figure 6 illustrates the relationship between test takers’ ability and the parameter estimates of item number 2, with b = 0.195; a = -0.925; and c = 0.500.

the epot structure section consists of 40 items to be completed in 25 minutes. according to the test-takers’ response data, the 40 items of epot structure also have various difficulty index, discrimination index, and pseudo-guessing values. these findings can be seen in figure 7, figure 8, and figure 9.

figure 7. difficulty index of epot structure
figure 8. discrimination index of epot structure
figure 9. pseudo guessing value of epot structure

figure 7 shows that all 40 epot structure items have a good difficulty level. figure 8 shows that 39 items have a good discrimination index. however, one item, number 12, has a poor discrimination index (a = -0.395), which means it cannot distinguish between test takers with low and high ability. meanwhile, figure 9 shows that the structure section has 35 items with good pseudo-guessing. in other words, only 12.5% of all items can be answered correctly because of the guessing element. the next analysis is the item fit analysis on structure, illustrated in the form of icc as presented in figure 10 and figure 11. figure 10 shows the relationship between test takers’ ability and the parameter estimates of structure item number 1, with b = 0.793; a = -0.746; and c = 0.500. meanwhile, figure 11 shows the relationship between test takers’ ability and the parameter estimates of structure item number 2, with b = 0.879; a = 0.893; and c = 0.500.
figure 10. an icc example of structure item number 1
figure 11. an icc example of structure item number 2

the last section is reading comprehension. the epot reading section consists of 50 items to be completed in 55 minutes. according to the test takers’ responses, the 50 items of epot reading also have various difficulty index, discrimination index, and pseudo-guessing values. these can be seen in figure 12, figure 13, and figure 14.

figure 12. difficulty index of epot reading
figure 13. discrimination index of epot reading
figure 14. pseudo guessing value of epot reading

based on figure 12, 45 items have a good difficulty index, and the remaining five items are considered poor. these five items are number 5 (b = -2.657), number 9 (b = 2.264), number 22 (b = -2.407), number 23 (b = 2.771), and number 49 (b = -2.547). items number 5, 22, and 49 are considered too easy since their difficulty level is < -2, while items number 9 and 23 are considered too difficult because their difficulty level is > 2. thus, the response patterns for these items tend to be poor, and these items cannot reveal the difficulty index parameter.

figure 13 shows that all of the items in the epot reading section have a good discrimination index, in the range of 0 to 2, so that all epot reading test items can distinguish test takers with low ability from those with high ability. meanwhile, figure 14 shows that the epot reading section has only 43 items with good pseudo-guessing. it means 14% of all items can be answered correctly because of the guessing element.
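the screening of the reading items above can be reproduced with a few lines of python. this is a minimal sketch: the function names are hypothetical, the default cut-offs follow the ranges stated in the method section (-2 to 2 for b, 0 to 2 for a), and the labels use the usual irt convention that a larger b means a harder item. only the b and a values reported in the text are used.

```python
def classify_difficulty(b, low=-2.0, high=2.0):
    """label an item by its difficulty index b."""
    if b < low:
        return "too easy"
    if b > high:
        return "too difficult"
    return "good"


def good_discrimination(a, low=0.0, high=2.0):
    """true when the discrimination index a lies in the practical range."""
    return low <= a <= high


# b values of the flagged reading items, as reported in the text
flagged_reading = {5: -2.657, 9: 2.264, 22: -2.407, 23: 2.771, 49: -2.547}

for item, b in sorted(flagged_reading.items()):
    print(item, b, classify_difficulty(b))

# structure item number 12 (a = -0.395) falls outside the practical range
print(good_discrimination(-0.395))
```

keeping the cut-offs as parameters makes it easy to rerun the screening if a different acceptable range is adopted.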
the next analysis is the item fit analysis on the epot reading section, illustrated in the form of icc as shown in figure 15 and figure 16. figure 15 shows the relationship between the test takers’ ability and the parameter estimates of reading item number 1, with b = 0.536; a = 0.181; and c = 0.455. in addition, figure 16 depicts the relationship between the test takers’ ability and the parameter estimates of reading item number 2, with b = 0.899; a = 0.291; and c = 0.484.

figure 15. an icc sample of reading item number 1
figure 16. an icc sample of reading item number 2

the next discussion is about the information function analysis and the standard error of measurement (sem). the epot information function value shows epot’s reliability and measurement accuracy. the epot information function describes a curve that starts low, increases to its highest value in the middle, and falls again far from the midpoint. the curve’s width shows the range of ability over which the measurement is effective.

the test information function (tif) is effective if its curve extends above the sem line without an intersection point. however, the analysis of epot’s items yields tif and sem curves that intersect. the following three figures show the total information curve (tic) for the 1-pl, 2-pl, and 3-pl models.

figure 17. epot’s tic for 1-pl model
figure 18. epot’s tic for 2-pl model
figure 19. epot’s tic for 3-pl model

figure 17, figure 18, and figure 19 show the tic, which consists of the tif line, the sem line, and their intersections. the tic illustrates the total information produced at each level of ability. the dotted line shows the sem; the greater the information function, the smaller the measurement error. the three graphs show the tif curve above the sem curve with two intersection points, which means that the information obtained from the measurement results is only accurate for abilities within a certain range.

this research’s finding shows that the 3-pl irt model provides the highest tif compared to the 1-pl and 2-pl models. this is because the average discrimination index of epot’s items under the 3-pl model (a = 0.948) is higher than under the 1-pl (a = 0.777) and 2-pl (a = 0.460) models. in an irt model that accommodates the discrimination index, the larger the discrimination index, the greater the tif obtained (setiawati, izzaty, & hidayat, 2018, p. 17; yang & kao, 2014, pp. 173–174; zięba, 2013, p. 96). the presence of the discrimination index parameter raises the item information of the models that include it; as a result, the 1-pl model gives the lowest information because it does not accommodate the discrimination index parameter.

based on the previous analysis, 93% of the listening, structure, and reading test items have a good difficulty index, between -2 and 2. there are 10 test items that were considered poor because they were too difficult or too easy. these items were still used to vary the test items.
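the inverse relationship between the information function and the measurement error described above can be sketched in python. this is a minimal illustration under stated assumptions: it uses the standard 3-pl item information formula with a scaling constant of 1 and takes sem as one over the square root of the test information; the function names and the three example items are hypothetical and are not epot estimates.

```python
import math


def p_correct(theta, a, b, c):
    """3-pl probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    """3-pl item information: a^2 * (q / p) * ((p - c) / (1 - c))^2."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return a ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items):
    """test information is the sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def sem(theta, items):
    """standard error of measurement at ability theta."""
    return 1.0 / math.sqrt(total_information(theta, items))


# hypothetical (a, b, c) triples for illustration only
items = [(1.2, -1.0, 0.2), (0.9, 0.0, 0.25), (1.5, 1.0, 0.2)]

for theta in (-3.0, 0.0, 3.0):
    info = total_information(theta, items)
    print(theta, round(info, 3), round(sem(theta, items), 3))
```

with these example items the information peaks near the middle of the ability scale and the sem is smallest there, which is the pattern the tic figures describe.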
as stated by hingorjo and jaleel (2012), test items with an average difficulty index are more desirable, test items with easy level can be placed in the beginning question as warming up, and the difficult item should be reviewed to avoid language confusion. in addition, out of the 140 epot’s test items, one item of structure test and one item of the reading test had a discrimination index of > 2. the two items are not modified since the gap between the scores and also the standard score is not significant. meanwhile, the pseudo-guessing index showed that only 19 test items can be answered correctly by the test takers, which rely solely on guessing. the results of tif and sem curved almost perfectly and interacted at two intersection points. the results of the study pointed out that the irt 3-pl model provides higher test information function than the 1-pl and 2-pl model. the reason was the average of the epot’s 3-pl discrimination index was higher than the 1-pl and 2-pl model. conclusion item analysis can give useful information related to the item characteristics of a test set. english proficiency online test (epot) is a set of english proficiency test developed by yec and has gone through several processes of testing and evaluation on its test items. the testing and evaluation are using a 3-pl model to show the characteristics of the test, consisting of difficulty index, discrimination index, and pseudo-guessing index. based on the results of epot’s item analysis using the irt 3-pl model, it can be concluded that most of the items have a good difficulty index. several items that have poor difficulty index are still used to vary the test items. moreover, epot’s test items are also able to effectively distinguish test takers' ability and improve test takers’ reliability (nelson, 2001; wells & wollack, 2003). several test items that have poor discrimination index are not modified as the gap between the scores, and the standard score is not significant. 
as for the pseudo-guessing index, only a few test items can be answered correctly by test takers who rely on guessing. in conclusion, epot has test items of sufficient quality and effectiveness, and it can be employed as a toefl prediction test.

references

anderson, p., & morgan, g. (2008). developing tests and questionnaires for a national assessment of educational achievement (v. greaney & t. kellaghan, eds.). https://doi.org/10.1596/978-0-8213-7497-9
arnani, m. (2019, november 14). cpns 2019, 9 instansi ini wajibkan toefl, berapa skornya? kompas.com. retrieved from https://www.kompas.com/tren/read/2019/11/14/120925265/cpns-2019-9-instansi-ini-wajibkan-toefl-berapa-skornya?page=all
azwar, s. (2017). reliabilitas dan validitas (4th ed.). yogyakarta: pustaka pelajar.
chung, h. (2005). calibration and validation of the body self-image questionnaire using the rasch analysis. master thesis, university of georgia, athens, ga.
crocker, l. m., & algina, j. (1986). introduction to classical and modern test theory. fort worth, tx: harcourt brace jovanovich.
de ayala, r. j. (2009). the theory and practice of item response theory. new york, ny: guilford press.
fan, x. (1998). item response theory and classical test theory: an empirical comparison of their item/person statistics. educational and psychological measurement, 58(3), 357–381. https://doi.org/10.1177/0013164498058003001
gliem, j. a., & gliem, r. r. (2003). calculating, interpreting, and reporting cronbach’s alpha reliability coefficient for likert-type scales. midwest research-to-practice conference in adult, continuing, and community education, 82–88. columbus, oh: the ohio university.
guilford, j. p. (1956). fundamental statistics in psychology and education (3rd ed.). new york, ny: mcgraw-hill.
hambleton, r. k. (1989). principles and selected applications of item response theory. in r. l. linn (ed.), educational measurement (3rd ed., pp. 147–200). new york, ny: macmillan.
hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. newbury park, ca: sage publications.
handayani, s. (2016). pentingnya kemampuan bahasa inggris dalam menyongsong asean community 2015. jurnal profesi pendidik, 3(1), 102–106. retrieved from http://ispijateng.org/wp-content/uploads/2016/05/pentingnya-kemampuan-berbahasa-inggrissebagai-dalam-menyongsong-asean-community-2015_sri-handayani.pdf
hingorjo, m. r., & jaleel, f. (2012). analysis of one-best mcqs: the difficulty index, discrimination index and distractor efficiency. jpma: the journal of the pakistan medical association, 62(2), 142–147. retrieved from https://jpma.org.pk/article-details/3255?article_id=3255
hulin, c. l., drasgow, f., & parsons, c. k. (1983). item response theory: application to psychological measurement. homewood, il: dow jones-irwin.
huriaty, d. (2019). analisis karakteristik parameter butir berdasarkan model logistik 3 parameter. lentera: jurnal pendidikan, 14(2), 33–40. https://doi.org/10.33654/jpl.v14i2.885
magis, d. (2013). a note on the item information function of the four-parameter logistic model. applied psychological measurement, 37(4), 304–315. https://doi.org/10.1177/0146621613475471
naga, d. s. (1992). pengantar teori sekor pada pengukuran pendidikan. jakarta: gunadarma.
nelson, l. (2001). item analysis for test and surveys using lertap 5. perth: curtin university of technology.
olufemi, a. s. (2013). item response theory as a basis for measuring latent trait of interest. greener journal of social sciences, 3(7), 378–382. https://doi.org/10.15580/gjss.2013.7.062513691
pollard, b., dixon, d., dieppe, p., & johnston, m. (2009).
measuring the icf components of impairment, activity limitation and participation restriction: an item analysis using classical test theory and item response theory. health and quality of life outcomes, 7, 1–20. https://doi.org/10.1186/1477-7525-7-41
retnawati, h. (2014). teori respons butir dan penerapannya: untuk peneliti, praktisi pengukuran dan pengujian, mahasiswa pascasarjana. yogyakarta: nuha medika.
setiawati, f. a., izzaty, r. e., & hidayat, v. (2018). analisis respons butir pada tes bakat skolastik. jurnal psikologi, 17(1), 1–17. https://doi.org/10.14710/jp.17.1.1-17
sharpe, p. j. (2002). how to prepare for the toefl test: test of english as a foreign language (10th ed.). jakarta: binarupa aksara.
sugiyono, s. (2014). metode penelitian pendidikan: pendekatan kuantitatif, kualitatif, dan r&d. bandung: alfabeta.
van der linden, w. j., & hambleton, r. k. (1996). handbook of modern item response theory. https://doi.org/10.1007/978-1-4757-2691-6
wells, c. s., & wollack, j. a. (2003). an instructor’s guide to understanding test reliability. madison, wi: university of wisconsin.
yang, f. m., & kao, s. t. (2014). item response theory for measurement validity. shanghai archives of psychiatry, 26(3), 171–177. https://doi.org/10.3969/j.issn.1002-0829.2014.03
yen, w., & fitzpatrick, a. (2006). item response theory. in r. l. brennan (ed.), educational measurement (4th ed., pp. 111–153). westport, ct: american council on education and praeger.
zięba, a. (2013). the item information function in one and two-parameter logistic models – a comparison and use in the analysis of the results of school tests. didactics of mathematics, 10(14), 87–96. https://doi.org/10.15611/dm.2013.10.08

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 6(2), 2020, 98-108
available online at: http://journal.uny.ac.id/index.php/reid

factor analysis: competency framework for measuring student achievements of architectural engineering education in indonesia

*1 rihab wit daryono; 2 v. lilik hariyanto; 2 husaini usman; 2 sutarto
1 graduate school, universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
2 faculty of engineering, universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
*corresponding author. e-mail: rihab0561pasca.2019@student.uny.ac.id
submitted: 25 june 2020 | revised: 1 november 2020 | accepted: 16 november 2020

abstract

this study aims to test the validity and estimate the reliability of an instrument for measuring student achievement in the department of architectural engineering education in indonesia. the cluster random sampling technique was used to select the sample, which consisted of 103 vocational education students. this study uses a survey method to examine and analyze the factor structure of students' competency achievements. the collected empirical data were analyzed with descriptive and inferential statistics using exploratory factor analysis (efa). the efa test is intended to reveal the factors that can be formed from the instrument established for measuring the competency achievements of vocational students. the results of the efa showed that the instrument had good construct validity.
the results of this research show that the instrument for measuring the achievement of student competencies has good reliability and consists of 30 competency items covering ten competency aspects, namely general competencies, technical drawing, statically structures, basic building construction, land measurement engineering, software application and building interior design, road and bridge construction, estimated construction costs, building construction and utility, and creative and entrepreneurship product competencies.

keywords: achievement competency, vocational education, architectural engineering education, exploratory factor analysis

how to cite: daryono, r., hariyanto, v., usman, h., & sutarto, s. (2020). factor analysis: competency framework for measuring student achievements of architectural engineering education in indonesia. reid (research and evaluation in education), 6(2), 98-108. doi:https://doi.org/10.21831/reid.v6i2.32743.

introduction

the quality of human resources (hr) is the key to a nation's competitiveness; it determines which nations can develop and survive in global competition. being innovative, creative, and technology literate, and having multiple intelligences, are the hallmarks of superior human resources (slamet, 2013). manpower issues are always related to human resources; thus, the quality of human resources needs to be improved and developed to obtain a competent workforce with high morale that will sustain industry and strengthen the country's economy (widodo & pardjono, 2013). the 2019 global competitiveness index of the world economic forum (wef) indicates that one of indonesia's shortcomings is the skills pillar, on which it ranks 65th of 141 countries. the pillar includes an index of the current and future skills of the workforce, which is still low and weakens indonesia's competitiveness (forum económico mundial, 2019).

furthermore, based on the insead data (2019) “the global talent competitiveness index”, indonesia was ranked 67th of 125 countries. this also shows that the provision of human resources to improve competitiveness in educational skills is still weak (lanvin & monteiro, 2019). data from the human development report 2019 show that the indonesian hr quality index ranks 111th out of 189 countries. the quality of human resources is important and needs follow-up so that indonesian human resources are able to compete in the era of globalization, technological development, and other global challenges (united nations development programme, 2019). thus, what indonesia does not yet have is quality human resources.

the growth of the construction sector makes indonesia a major construction market among the asean countries (daryono et al., 2020). indonesia has the largest construction market compared to neighboring countries. the rise in the construction market until 2025 will increase the need for a professional workforce (kesai et al., 2018). besides, the world of work continues to change, creating new challenges for employers and employees (suarta et al., 2018). the progress of industry has advanced the employment sector and increased the number of workers in the construction sector. indonesian employment data show that the construction sector was in the top four, with 18.98%, in august 2019. employment by business sector shows the labour absorption capacity of the construction sector (central bureau of statistics, 2020). preparing vocational high school graduates in line with industrial qualifications and technological developments is one of the goals of vocational education.
therefore, students must be equipped with competencies in line with the needs of the business and industrial world (diwangkoro & soenarto, 2020; fakhri & munadi, 2019). educational development based on community needs will produce competent graduates (muaini, zamroni, & dwiningrum, 2019). thus, quality vocational education is able to adapt to economic development and the progress of science and technology. essentially, strengthening the global economy requires the support of successful vocational education.

vocational high school institutions in indonesia are called sekolah menengah kejuruan. vocational high schools (vhs) are organized to prepare students for work after completing vocational education. zhang (2009) explains that one of the basic conditions for vocational education to succeed is increasing students' skills. if students can work immediately after graduating, the problem of unemployment in indonesia will decrease. joo (2018) states that four factors contribute to increasing the employment rate of vocational education graduates: professional teachers, an industry-equivalent curriculum, leadership spirit, and link and match. vocational education aims to make education more professional and to improve economic and social development, yet the increase in the employment rate of vocational education graduates has not been optimal (xiao, 2009).

the first factor behind the low absorption of building engineering graduates is the weak teaching and learning process. vocational schools have not maximized the joint learning of soft skills and hard skills; schools emphasize hard skills only, for the competencies that are required to work.
the learning process, as explained by sutarto and jaedun (2018), “emphasizes authentic-learning and assessment that promote higher-order thinking skills: creative, innovative, and problem-solving in real life”, to support the competencies of graduates who are ready to work in the construction industry. for a nation to develop in the new world, it is necessary to prepare superior, quality human resources with multiple and broad skills, as well as multilingual literacy, so that it can develop sustainably (widarto et al., 2012; triyono et al., 2018).

the second factor is the graduate competency standards applied in the learning process. some competencies are not applied due to time constraints in the implementation of the learning process. research by manap, hassan, and syahrom (2017) concludes that the constraints include the lack of equipment in vhs, teaching strategies for increasing students' readiness to work in industry that have not been carried out to the maximum, and a lack of competent and qualified workers. the competency framework is a tool that determines the competencies that individuals need to address current challenges and support sustainable development (lai, hamisu, & salleh, 2019).

this research was conducted because of the problem of the competitiveness of vocational students. this situation is influenced by the irrelevance of school competencies to the current state of development of construction industry technology.
the main solution is to grow and develop student competitiveness by increasing student competency through vocational education (rahdiyanta, nurhadiyanto, & munadi, 2019), because this condition causes low competitiveness in entrepreneurial and technical skills in both national and international markets (sukardi, wildan, & fahrurrozi, 2019; ismail & hassan, 2019). thus, it is crucial to conduct research to determine the framework of vocational education competencies in line with industrial competencies. this is done to obtain information on the ideal work competency standards that vocational students must master to become competent workers in the field of construction services, and on whether the competencies provided in vocational schools match the competencies needed by the construction service industry.

method

participant characteristics

this research is a descriptive survey study, drawing conclusions from hypothesis testing to answer the problems studied (creswell, 2012; ingleby, 2012). this research was conducted at nine vhs in central java and yogyakarta, indonesia. because the vhs in central java and yogyakarta are stratified, the sample was collected using cluster sampling and stratified random sampling (creswell, 2014). the respondents, vocational education students in the department of architectural engineering, were determined by the cluster random sampling technique, resulting in 103 students. a sample size in the range of 100 is acceptable (maccallum et al., 1999; sugiyono, 2019).

measures and covariates

the questionnaire was prepared based on the regulation of the minister of education and culture no. 34 of 2018 concerning the national standards for vocational secondary education, skkni no. 374 of 2013 for implementing buildings and public facilities, skkni no. 85 and 205 of 2015, and no. 340 of 2013.
the questionnaire consisted of ten aspects of competence, each containing five competency indicators. the resulting 50 items were measured on a four-point likert scale: 4 'very good', 3 'good', 2 'not good', and 1 'very not good'. the competency aspects used to measure the achievement of student competencies in the department of architectural engineering are shown in table 1.

table 1. the instrument for measuring the achievement of vocational student competencies

| no | competency aspects | item |
|---|---|---|
| 1 | general competencies | a 1-5 |
| 2 | technical drawing competencies | b 1-5 |
| 3 | statically structures competencies | c 1-5 |
| 4 | basic building construction competencies | d 1-5 |
| 5 | land measurement engineering competencies | e 1-5 |
| 6 | software application and building interior design competencies | f 1-5 |
| 7 | road and bridge construction competencies | g 1-5 |
| 8 | estimated construction costs competencies | h 1-5 |
| 9 | building construction and utility competencies | i 1-5 |
| 10 | creative and entrepreneurship product competencies | j 1-5 |
| | number of items | 50 |

research method

construct validity was tested by factor analysis using the exploratory factor analysis (efa) method (barbara & linda, 2010). factor analysis is intended to confirm and simplify the structure by grouping variables that are coherent with each other into one factor (hidayat et al., 2018). the reliability test was based on the cronbach alpha value. furthermore, the component factor analysis using the efa method was intended to ensure the validity and confirmation of the construct.

data analysis

the first data analysis is a descriptive statistical test that includes multicollinearity, normality, and data reduction using spss 25.0.
the normality of the data on the constructs of the measurement model was tested based on the skewness and kurtosis values, with a recommended range of -1.96 to +1.96 at a significance level of 0.05 for each question item (kline, 2005; hair et al., 2010). multicollinearity was assessed from the inter-correlation matrix, with a value of ≤ 0.90 indicating no multicollinearity (tabri & elliott, 2012). in addition, all items that met the factor analysis criteria were analyzed using efa in order to determine the factors measuring the achievement of student competencies.

exploratory factor analysis (efa)

data analysis was performed using spss to reveal how many factors can be formed, and thereby to identify the correlating factors and the contribution of each variable to the factors (kumar, 2012). the analysis results are based on the kaiser-meyer-olkin (kmo) value, the bartlett's test value, the measure of sampling adequacy (msa), the communality values, the total variance explained in relation to eigenvalues, the factor loadings, and the scree plot. furthermore, to assess the hypothesized constructs, construct validity was examined using convergent and discriminant validity.

findings and discussion

findings

preliminary analysis

the first step in this study is a preliminary analysis intended to examine the data obtained from the survey. the data were obtained from 103 vocational education students majoring in architecture engineering in indonesia and cover ten aspects consisting of five questions each. all 50 questions are called competency items. table 2 shows the multicollinearity and normality data.
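the skewness and kurtosis screen described above can be sketched as follows. this is a minimal illustration, assuming moment-based skewness and excess kurtosis; spss reports bias-corrected versions, so its values differ slightly on small samples. the function names and the response data are hypothetical.

```python
def skew_kurtosis(values):
    """moment-based skewness and excess kurtosis of one item's scores.

    assumes the scores are not all identical (non-zero variance).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0  # excess kurtosis: 0 for a normal distribution
    return skew, kurt


def passes_screen(values, limit=1.96):
    """keep an item only if both statistics fall within -limit..+limit."""
    skew, kurt = skew_kurtosis(values)
    return abs(skew) <= limit and abs(kurt) <= limit


# hypothetical four-point likert responses for two items
item_ok = [3, 4, 4, 3, 2, 4, 3, 4, 3, 4]
item_extreme = [4] * 19 + [1]

print(passes_screen(item_ok), passes_screen(item_extreme))
```

an item whose responses pile up at one end of the scale with a single outlier, like the second hypothetical item, fails the screen and would be excluded before the factor analysis.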
because the 20 items not presented in table 2 have skewness and kurtosis values outside the range of -1.96 to +1.96 and a significance level of ≤0.05 (hair et al., 2010; hidayat et al., 2018), these 20 items are declared not normally distributed and are excluded from the subsequent factor analysis. after these items were removed, the 30 remaining, normally distributed items were analyzed again with descriptive statistics, as presented in table 2. the 30 items reach normality, with skewness values ranging from -1.612 to -0.641 and kurtosis values ranging from -0.935 to +1.857. furthermore, the descriptive statistics reveal that the mean ranges from 3.408 to 3.786, the standard deviation from 0.412 to 0.760, and the variance from 0.170 to 0.577. the total scores range from 351.0 to 390.0 out of a maximum of 412.0.

regarding multicollinearity, the correlations among the ten competency aspects range from 0.306 to 0.632. sequentially, the pearson correlations and sig. (2-tailed) for each variable a to j are as follows: (0.327; 0.001), (0.413; 0.000), (0.607; 0.000), (0.632; 0.000), (0.616; 0.000), (0.306; 0.002), (0.620; 0.000), (0.554; 0.000), (0.483; 0.000). these results indicate that the discriminant validity of each competency variable is achieved because the inter-correlation matrix values are ≤0.90 (kline, 2005; hidayat et al., 2018) and the correlations are significant at the 0.01 level (2-tailed).

table 2. results of statistical descriptive analysis and data normality (n = 103)

| no | variable | mean | sd | var. | skew | kurtosis | ∑ |
|---|---|---|---|---|---|---|---|
| 1 | a1 | 3.631 | 0.56 | 0.314 | -1.221 | 0.543 | 374 |
| 2 | a2 | 3.738 | 0.44 | 0.195 | -1.098 | -0.811 | 385 |
| 3 | a3 | 3.476 | 0.65 | 0.428 | -1.086 | 1.009 | 358 |
| 4 | b1 | 3.408 | 0.76 | 0.577 | -1.254 | 1.283 | 351 |
| 5 | b2 | 3.447 | 0.73 | 0.544 | -1.384 | 1.857 | 355 |
| 6 | b3 | 3.534 | 0.62 | 0.389 | -0.998 | -0.029 | 364 |
| 7 | c1 | 3.602 | 0.56 | 0.320 | -1.074 | 0.185 | 371 |
| 8 | c2 | 3.524 | 0.60 | 0.370 | -1.163 | 1.695 | 363 |
| 9 | c3 | 3.524 | 0.65 | 0.428 | -1.268 | 1.372 | 363 |
| 10 | d1 | 3.553 | 0.59 | 0.348 | -0.943 | -0.082 | 366 |
| 11 | d2 | 3.621 | 0.54 | 0.296 | -1.057 | 0.109 | 373 |
| 12 | d3 | 3.505 | 0.62 | 0.390 | -1.128 | 1.427 | 361 |
| 13 | e1 | 3.563 | 0.53 | 0.288 | -0.645 | -0.797 | 367 |
| 14 | e2 | 3.670 | 0.49 | 0.243 | -0.982 | -0.438 | 378 |
| 15 | e3 | 3.689 | 0.50 | 0.255 | -1.287 | 0.620 | 380 |
| 16 | f1 | 3.563 | 0.53 | 0.288 | -0.645 | -0.797 | 367 |
| 17 | f2 | 3.660 | 0.56 | 0.266 | -1.118 | 0.151 | 377 |
| 18 | f3 | 3.660 | 0.51 | 0.266 | -1.118 | 0.151 | 377 |
| 19 | g1 | 3.621 | 0.56 | 0.316 | -1.170 | 0.415 | 373 |
| 20 | g2 | 3.602 | 0.56 | 0.320 | -1.074 | 0.185 | 371 |
| 21 | g3 | 3.621 | 0.52 | 0.277 | -0.916 | -0.322 | 373 |
| 22 | h1 | 3.660 | 0.53 | 0.285 | -1.265 | 0.644 | 377 |
| 23 | h2 | 3.689 | 0.50 | 0.255 | -1.287 | 0.620 | 380 |
| 24 | h3 | 3.563 | 0.66 | 0.445 | -1.450 | 1.657 | 367 |
| 25 | i1 | 3.786 | 0.41 | 0.170 | -1.418 | 0.012 | 390 |
| 26 | i2 | 3.767 | 0.44 | 0.200 | -1.612 | 1.481 | 388 |
| 27 | i3 | 3.728 | 0.44 | 0.200 | -1.041 | -0.935 | 384 |
| 28 | j1 | 3.456 | 0.63 | 0.407 | -0.987 | 1.012 | 356 |
| 29 | j2 | 3.612 | 0.54 | 0.299 | -1.009 | -0.002 | 372 |
| 30 | j3 | 3.476 | 0.63 | 0.409 | -1.056 | 1.118 | 358 |

table 3. reliability analysis of competency items

| no. | competency aspects | item | ca | overall ca |
|---|---|---|---|---|
| 1 | general competencies | 3 | 0.7 | 0.9 |
| 2 | technical drawing competencies | 3 | 0.7 | |
| 3 | statically structures competencies | 3 | 0.7 | |
| 4 | basic building construction competencies | 3 | 0.7 | |
| 5 | land measurement engineering competencies | 3 | 0.7 | |
| 6 | software application and building interior design competencies | 3 | 0.8 | |
| 7 | road and bridge construction competencies | 3 | 0.9 | |
| 8 | estimated construction costs competencies | 3 | 0.8 | |
| 9 | building construction and utility competencies | 3 | 0.8 | |
| 10 | creative and entrepreneurship product competencies | 3 | 0.7 | |

reliability of instrument

reliability is the stability and consistency of the scores obtained.
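as a minimal sketch of how a cronbach alpha (ca) value such as those in table 3 is computed, the block below applies the standard formula alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores). the function names and the response data are hypothetical and do not reproduce the study's data.

```python
def variance(xs):
    """sample variance (n - 1 in the denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def cronbach_alpha(item_scores):
    """item_scores: one list of respondent scores per item, all of equal length."""
    k = len(item_scores)
    # total score of each respondent across the k items
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1.0 - item_var_sum / variance(totals))


# hypothetical responses of five students to the three items of one aspect
aspect = [
    [4, 3, 4, 2, 3],
    [4, 3, 3, 2, 4],
    [3, 3, 4, 2, 3],
]

alpha = cronbach_alpha(aspect)
print(round(alpha, 3), "reliable" if alpha >= 0.7 else "not reliable")
```

the final line applies the ≥ 0.7 cut-off used in this study.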
an instrument is said to be reliable if its question items obtain the same scores when it is administered to several groups at different times and places (hidayat et al., 2018). the reliability of an item is based on the cronbach alpha value. according to lin (2002), if the cronbach alpha value is ≥ 0.7, the item on the instrument is reliable; conversely, if the cronbach alpha value is < 0.7, it is not reliable. the results of the instrument's reliability are shown in table 3.

the cronbach alpha results of this research instrument are in the reliable category. overall, the cronbach alpha values obtained were ≥ 0.7, which meets the value recommended by lin (2002). general competencies, technical drawing, statically structures, basic building construction, land measurement engineering, and creative and entrepreneurship product competencies obtain a value of α = 0.7, while software application and building interior design, estimated construction costs, and building construction and utility competencies obtain a value of α = 0.8, and road and bridge construction obtains a value of α = 0.9 (≥0.7; lin, 2002). therefore, the instrument for measuring the achievement of student competencies has a good level of reliability.

exploratory factor analysis (efa)

in this efa test, the study considers the items that have passed the multicollinearity, normality, and reliability tests. all 30 items that passed are grouped into ten aspects of competency. the efa test criteria are based on the kmo index value, bartlett's test, the measure of sampling adequacy (msa), communalities, factor loadings, eigenvalues, and the scree plot.
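of the criteria listed above, bartlett's test of sphericity has a compact closed form. the sketch below assumes the usual statistic chi2 = -(n - 1 - (2p + 5) / 6) * ln|r| with df = p(p - 1) / 2, where r is the p x p correlation matrix and n the sample size. the helper names and the small correlation matrix are hypothetical, and the determinant routine is a plain gaussian elimination written for illustration only.

```python
import math


def determinant(matrix):
    """determinant by gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def bartlett_sphericity(corr, n_obs):
    """return (chi-square, degrees of freedom); fails for a singular matrix."""
    p = len(corr)
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6.0) * math.log(determinant(corr))
    df = p * (p - 1) // 2
    return chi2, df


# hypothetical correlation matrix of three items, n = 103 respondents
corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]

chi2, df = bartlett_sphericity(corr, 103)
print(round(chi2, 3), df)
```

a large chi-square relative to its degrees of freedom indicates that the correlation matrix differs from an identity matrix, so the items share enough variance for factor analysis.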
the results of the kmo measure of sampling adequacy obtained a value of 0.886, that is more than 0.70, then the coverage of each factor is satisfactory. the bartlett's test of sphericity approx. chi-square obtained a value of 1932.501; df = 35; sig. = 0.000. the scree plot pattern was used to reduce variance to several factors. the point at which the slope of the line begins to change is where the limit of the number of factors that can take. this point is called the inflection point. in figure 1, after the 10th point, the line begins to change in tilt and the variations explained are less and less. thus, it can reduce 30 items to ten factors. the next step of identifying the extraction of community values, eigenvalues, percentage variants, and loading factors is shown in table 4. the value of communalities indicates the value of the variable under study whether it can explain the magnitude of the effective contribution (%) of each variant to the factor formed for each item. the results of the communalities in this instrument range from 0.751 to 0.909 (≥0.50) can be categorized as adequate variants in the instrument. msa values range from 0.709 to 0.961 (≥ 0.70). the rotated component matrix shows the loading factor on each factor. the results of data analysis, it is recommended for all items to measure of competency achievement. this value is obtained from high loading factors ranging from 0.461 to 0.899 (>0.40). in addition, table 5 shows a summary of the results of the efa value of the competency framework to determine the competency attainment of architectural engineering students in indonesia. figure 1. scree plot of achieving student competency framework https://doi.org/10.21831/reid.v6i2.32743 rihab wit daryono, v. lilik hariyanto, husaini usman, & sutarto 104 copyright © 2020, reid (research and evaluation in education), 6(2), 2020 issn: 2460-6995 (online) table 4. msa, communalities, factor loading var. item msa comm. 
factor loading components 1 2 3 4 5 6 7 8 9 10 a a1 0.741a 0.751 0.828 a2 0.709a 0.770 0.760 a3 0.795a 0.717 0.772 b b1 0.821a 0.888 0.899 b2 0.861a 0.820 0.739 b3 0.813a 0.870 0.837 c c1 0.882a 0.784 0.501 c2 0.926a 0.782 0.558 c3 0.875a 0.758 0.514 d d1 0.939a 0.767 0.705 d2 0.878a 0.909 0.794 d3 0.873a 0.880 0.850 e e1 0.887a 0.719 0.717 e2 0.846a 0.794 0.710 e3 0.872a 0.873 0.677 f f1 0.931a 0.764 0.503 f2 0.930a 0.779 0.461 f3 0.889a 0.762 0.600 g g1 0.925a 0.879 0.760 g2 0.865a 0.884 0.718 g3 0.865a 0.861 0.709 h h1 0.870a 0.829 0.590 h2 0.912a 0.864 0.649 h2 0.927a 0.803 0.711 i i1 0.961a 0.809 0.706 i2 0.948a 0.797 0.658 i3 0.883a 0.797 0.761 j j1 0.861a 0.885 0.785 j2 0.894a 0.804 0.667 j3 0.854a 0.816 0.775 table 5. measurement indicators in the exploratory factor analysis test index value results recommendation decision index kmo 0.886 0.50 < x ≤ 0.8 fit 0.80 < x ≤ 1.0 fit bartlett's test p <0.000 p <0.05 fit msa 0.709 – 0.961 >0.07 fit factor loading 0.461 – 0.899 0.4 – 0.9 fit eigenvalues >2.228 ≥1.0 fit discussion after efa test was done to show that the analyzed student data involve ten factors: general competencies, technical drawing competencies, statically structures, basic building construction, land measurement engineering, software application and building interior design, road and bridge construction, estimated construction costs, building construction and utility competencies, and creative and entrepreneurship product competencies. this is in line with the original fact structure, although 20 competency items have fallen from 20 items since the data do not meet the descriptive statistical requirements and normality of data so there are still 30 items of competency items. this study is in line with previous studies by hidayat et al. (2018), hazriyanto and ibrahim (2019), lai et al. (2019), and nashir, mustapha, ma’arof, and rui (2020). https://doi.org/10.21831/reid.v6i2.32743 rihab wit daryono, v. 
Table 6. Competency framework for measuring student achievements of architectural engineering education

Construct A: General competencies
  A1  Having a habit of behaving honestly in carrying out their job duties
  A2  Being able to complete work according to the criteria set in the workplace
  A3  Having the ability to generate innovative work ideas according to their expertise
Construct B: Technical drawing competencies
  B1  Presenting the types, functions and engineering drawing tools
  B2  Drawing orthogonal (2D) and pictorial (3D) projections
  B3  Drawing symbol, notation, and dimensional rules on engineering drawings
Construct C: Statically structures competencies
  C1  Presenting factors influencing the building structure based on design and loading criteria
  C2  Presenting various styles and ways of arranging styles in building structures
  C3  Calculating the forces (moment, shear, normal force, and rod force) on the building structure
Construct D: Basic building construction competencies
  D1  Implementing occupational health and safety in building works
  D2  Presenting the types of construction work (buildings, roads, bridges and irrigation)
  D3  Carrying out concrete, steel, wood, earth and stone construction work
Construct E: Land measurement engineering competencies
  E1  Carrying out measurement principles of land measurement techniques
  E2  Performing maintenance techniques and checking optical types
  E3  Carrying out the operation of tools for levelling, theodolite, and stake out work
Construct F: Software application and building interior design competencies
  F1  Presenting data on the needs of interior design work
  F2  Creating 2D and 3D construction drawings with colour schemes and artificial lighting
  F3  Creating interior designs with elements, materials, models and accessories in every room
Construct G: Road and bridge construction competencies
  G1  Presenting road and bridge classification
  G2  Presenting road and bridge pavement material specifications
  G3  Drawing of the detailed construction of roads and bridges
Construct H: Estimated construction costs competencies
  H1  Presenting materials specifications for building, road and bridge construction work
  H2  Calculating the estimated cost of construction work
  H3  Checking the results of the estimated construction costs
Construct I: Building construction and utility competencies
  I1  Creating floor plans, cuts, and building construction drawings
  I2  Creating detailed building construction drawings
  I3  Making isometric drawings of clean water and dirty water installations, electricity installations, air conditioners, and lightning rods
Construct J: Creative and entrepreneurship product competencies
  J1  Presenting the attitudes and behaviour of entrepreneurs
  J2  Creating worksheets/work drawings for making prototypes of services
  J3  Creating and testing product/service prototypes

This study aims to test the instrument for measuring the achievement of vocational education student competencies in architectural engineering study programs in Indonesia. The instrument has a high level of overall reliability: the Cronbach's alpha value is 0.9 (≥0.7), which meets the categorization criteria (Lin, 2002). The highest reliability is found in the basic building construction competency (Cronbach's alpha = 0.9), and the lowest in the creative and entrepreneurial product competency (Cronbach's alpha = 0.7). Therefore, the teaching of creative and entrepreneurial product competency to vocational students must be strengthened so that students achieve deeper entrepreneurial competencies. Factor analysis shows ten factors comprising 30 competency items, three items per factor. Each item shows a satisfactory loading with a value of 0.474 to 0.987 (≥0.40).
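The reliability figures above are Cronbach's alpha values. As a minimal sketch of how such a value is obtained, assuming the item scores are arranged with one row per respondent and one column per item, the standard formula can be written as below. The function name `cronbach_alpha` and the demonstration scores are hypothetical; they are not the study's data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    rows = [[float(v) for v in row] for row in scores]
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(rows[0])
    if any(len(row) != k for row in rows):
        raise ValueError("every respondent must answer the same items")
    # Sample variance of each item (column) and of the total score per respondent.
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores for five respondents on three items (illustration only).
demo_scores = [[4, 4, 3], [3, 3, 3], [5, 4, 4], [2, 2, 1], [4, 5, 4]]
alpha = cronbach_alpha(demo_scores)
print(f"alpha = {alpha:.3f}, meets 0.7 threshold: {alpha >= 0.7}")
```

Applied per construct (three items each) and to all 30 items together, a function of this kind would yield the per-competency and overall values that the text compares against the 0.7 threshold.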
Moreover, from the EFA analysis, the KMO index = 0.886; Bartlett's test p <0.000; MSA = 0.709-0.961; factor loading = 0.461-0.899; and eigenvalues >2.228. These values meet the recommended minimum criteria (Hidayat et al., 2018; Hazriyanto & Ibrahim, 2019). The results of the analysis of convergent validity and discriminant validity have met the multivariate analysis requirements. Thus, the developed instrument is suitable for measuring the level of mastery of vocational education student competencies in the building department consisting of architectural engineering study programs, which mainly involve these ten factors. Further studies are expected to develop a competency framework model and to apply and evaluate it in continuous learning. The list of competencies to measure student achievement in architectural engineering education in Indonesia is shown in Table 6.

Conclusion

This study confirms the validity and reliability of the measurement instrument for the achievement of the competency level of students of architectural engineering vocational schools in Indonesia. The instrument test for measuring the achievement of student competencies shows good reliability. The EFA analysis shows that the instrument has good construct validity, consisting of 30 competency items covering ten competency aspects, namely general competencies, technical drawing, statically structures, basic building construction, land measurement engineering, software application and building interior design, road and bridge construction, estimated construction costs, building construction and utility, and creative and entrepreneurship product competencies.

References

Barbara, G. T., & Linda, S. F. (2010). Using multivariate statistics (6th ed.). Pearson.
Central Bureau of Statistics of Indonesia. (2020). Berita resmi statistik: Keadaan ketenagakerjaan Indonesia Agustus 2020. Central Bureau of Statistics.
Creswell, J. W. (2014). Research design: Qualitative, quantitative, and mixed methods approaches. Sage Publications.
Creswell, J. W. (2012). Planning, conducting, and evaluating quantitative and qualitative research (Vol. 66). Pearson.
Daryono, R. W., Yolando, A. P., Jaedun, A., & Hidayat, N. (2020). Competency of vocational schools required by construction industry in consultants' supervisor. Journal of Physics: Conference Series, 1456(1), 1–10. https://doi.org/10.1088/1742-6596/1456/1/012057
Diwangkoro, E., & Soenarto, S. (2020). Development of teaching factory learning models in vocational schools. Journal of Physics: Conference Series, 1456(1), 1–5. https://doi.org/10.1088/1742-6596/1456/1/012046
Fakhri, A. A., & Munadi, S. (2019). The evaluation of industrial internship for vocational school of mechanical engineering in Tegal. American Journal of Educational Research, 7(11), 806–809. https://doi.org/10.12691/education-7-11-8
Forum Económico Mundial. (2019). Índice global de competitividad 2019. World Economic Forum. Retrieved from http://www3.weforum.org/docs/wef_theglobalcompetitivenessreport2019.pdf
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7th ed.). Prentice Hall.
Hazriyanto, & Ibrahim, B. (2019). The factor analysis of organizational commitment, job satisfaction and performance among lecturers in Batam. Journal of Technical Education and Training, 11(1), 151–158. https://doi.org/10.30880/jtet.2019.11.01.19
Hidayat, R., Zamri, S. N. A. S., & Zulnaidi, H. (2018). Exploratory and confirmatory factor analysis of achievement goals for Indonesian students in mathematics education programmes. Eurasia Journal of Mathematics, Science and Technology Education, 14(12), 1–12. https://doi.org/10.29333/ejmste/99173
Ingleby, E. (2012).
Research methods in education. Professional Development in Education, 38(3), 507–509. https://doi.org/10.1080/19415257.2011.643130
Ismail, A. A., & Hassan, R. (2019). Technical competencies in digital technology towards industrial revolution 4.0. Journal of Technical Education and Training, 11(3), 55–62. https://doi.org/10.30880/jtet.2019.11.03.008
Joo, L. (2018). Vol. 1: The excellence of technical vocational education and training (TVET) institutions in Korea: Yeungjin College case study. International Education Studies, 11(7), 136. https://doi.org/10.5539/ies.v11n7p136
Kesai, P., Soegiarso, R., Hardjomuljadi, S., Setiawan, M. I., Abdullah, D., & Napitupulu, D. (2018). Indonesia position in the globalization of construction industry. Journal of Physics: Conference Series, 1114(1), 012133. https://doi.org/10.1088/1742-6596/1114/1/012133
Kline, R. B. (2005). Principles and practice of structural equation modeling. The Guilford Press.
Kumar, K. (2012). A beginner's guide to structural equation modeling (3rd ed.). In Journal of the Royal Statistical Society: Series A (Statistics in Society), 175(3), 828–829. https://doi.org/10.1111/j.1467-985x.2012.01045_12.x
Lai, C. S., Hamisu, M. A., & Salleh, K. M. (2019). Development of competency framework for Nigerian TVET teachers in tertiary TVET institutions. Journal of Technical Education and Training, 11(1), 11–18. https://doi.org/10.30880/jtet.2019.11.01.002
Lanvin, B., & Monteiro, F. (Eds.) (2019). The global talent competitiveness index 2019: Entrepreneurial talent and global competitiveness. INSEAD, the Adecco Group, and Tata Communications.
Lin, Y. K. (2002). Using minimal cuts to evaluate the system reliability of a stochastic-flow network with failures at nodes and arcs.
Reliability Engineering and System Safety, 75(1), 41–46. https://doi.org/10.1016/s0951-8320(01)00110-7
MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4(1), 84–99. https://doi.org/10.1037/1082-989x.4.1.84
Manap, N., Hassan, N. S., & Syahrom, N. S. (2017). Preparation of vocational college graduates as skilled workforce in the local construction industry. Journal of Technical Education and Training, 9(2), 69–80.
Muaini, M., Zamroni, Z., & Dwiningrum, S. I. A. (2019). The concept of vocational high school development industry-based in Central Lombok Regency. Journal of Physics: Conference Series, 1179(1), 1–5. https://doi.org/10.1088/1742-6596/1179/1/012054
Nashir, I. M., Mustapha, R., Ma'arof, N. N. M. I., & Rui, T. J. (2020). Modified Delphi technique: The development of measurement model for innovative instructional leadership in technical and vocational education systems. Journal of Technical Education and Training, 12(1 Special Issue), 24–37. https://doi.org/10.30880/jtet.2020.12.01.003
Rahdiyanta, D., Nurhadiyanto, D., & Munadi, S. (2019). The effects of situational factors in the implementation of work-based learning model on vocational education in Indonesia. International Journal of Instruction, 12(3), 307–324. https://doi.org/10.29333/iji.2019.12319a
Regulation of the Minister of Education and Culture No. 34 of 2018 on the national standards for vocational secondary education/vocational madrasah aliyah. (2018).
Slamet, P. (2013). Pengembangan SMK model untuk masa depan. Jurnal Cakrawala Pendidikan, 5(1), 14–26. https://doi.org/10.21831/cp.v5i1.1256
Suarta, I. M., Suwintana, I. K., Sudana, I. G. P. F. P., & Hariyanti, N. K. D. (2018). Employability skills for entry level workers: A content analysis of job advertisements in Indonesia. Journal of Technical Education and Training, 10(2), 49–61. https://doi.org/10.30880/jtet.2018.10.02.005
Sugiyono. (2019).
Metode penelitian kuantitatif, kualitatif, dan R&D. Alfabeta.
Sukardi, S., Wildan, W., & Fahrurrozi, M. (2019). Vocational education: A missing link for the competitive graduates? International Education Studies, 12(11), 26–35. https://doi.org/10.5539/ies.v12n11p26
Sutarto, H. P., & Jaedun, M. P. D. (2018). Authentic assessment competence of building construction teachers in Indonesian vocational schools. Journal of Technical Education and Training, 10(1), 91–108. https://doi.org/10.30880/jtet.2018.10.01.008
Tabri, N., & Elliott, C. M. (2012). Principles and practice of structural equation modeling. Canadian Graduate Journal of Sociology and Criminology, 1(1). https://doi.org/10.15353/cgjsc-rcessc.v1i1.25
Triyono, M. B., Trianingsih, L., & Nurhadi, D. (2018). Students' employability skills for construction drawing engineering in Indonesia. World Transactions on Engineering and Technology Education, 16(1), 29–35.
United Nations Development Programme. (2019). Human development report 2019: Inequalities in human development in the 21st century Jordan (pp. 1–10). United Nations Development Programme. Retrieved from http://hdr.undp.org/en/data
Widodo, W. N., & Pardjono, P. (2013). Pengembangan model pembelajaran soft skills dan hard skills untuk siswa SMK. Jurnal Cakrawala Pendidikan, XXXI(3), 409–423. https://doi.org/10.21831/cp.v0i3.1139
Xiao, X. (2009). Construction and practice of the new business specialty talent cultivation mode. International Education Studies, 2(4), 56–60. https://doi.org/10.5539/ies.v2n4p56
Zhang, W. (2009). Issues of practical teaching in vocational-technical schools in China and their countermeasures. International Education Studies, 2(4), 75–78.
https://doi.org/10.5539/ies.v2n4p75

This is an open access article under the CC-BY-SA license.

REiD (Research and Evaluation in Education), 6(2), 2020, 150-159
Available online at: http://journal.uny.ac.id/index.php/reid

Evaluation of education and training programs in Solo Technopark Central Java in Indonesia

*1 Sudiyatno; 2 Iswahyuni Wulandari
1 Faculty of Engineering, Universitas Negeri Yogyakarta, Jl. Colombo No. 1, Karangmalang, Depok, Sleman, Yogyakarta 55281, Indonesia
2 SMK Negeri 1 Gelumbang, Jl. Raya Prabumulih Km. 50, Gelumbang, Muara Enim, Sumatera Selatan 31171, Indonesia
*Corresponding author. E-mail: sudiyatno@uny.ac.id
Submitted: 21 December 2020 | Revised: 31 December 2020 | Accepted: 31 December 2020

Abstract

This study aims to evaluate the implementation of a training program for youth in Solo Technopark, Central Java, Indonesia, and to obtain feedback and recommendations for increasing its effectiveness. The evaluation method is based on the four levels of the Kirkpatrick model and assesses: (1) participants' reaction to the training program (to the contents, facilitators, and facilities); (2) training participants' learning (knowledge and basic skills); (3) change in participants' behavior; and (4) success of the training participants based on competence skills, rate of graduate employability, and employer's satisfaction. The respondents at the reaction and learning levels were 47 training participants from four optional courses. The respondents at the behavior and results levels were 59 alumni, three training instructors, and one employer. Data were collected using questionnaires, interviews, documentation, and an observation checklist. The data were analyzed using the quantitative descriptive analysis method.
The study shows that: (1) at the reaction level, most training participants are satisfied with the training program, instructors, and facilities; (2) at the learning level, most participants are in the good category for knowledge and basic skills; (3) at the behavior level, participants can implement the attitudes and major skills at a good level; (4) at the results level, alumni have high skills at workplaces and have been employed in various industries, but the alumni's employability rate is less than 50%. To raise this low rate of employability in the labor market, Solo Technopark needs to improve alumni networking and collaboration with industries and employers.

Keywords: education and training, evaluation program, Kirkpatrick model, Solo Technopark

How to cite: Sudiyatno, S., & Wulandari, I. (2020). Evaluation of education and training programs in Solo Technopark Central Java in Indonesia. REiD (Research and Evaluation in Education), 6(2), 150-159. doi:https://doi.org/10.21831/reid.v6i2.36794.

Introduction

Indonesia has a golden opportunity to improve the welfare of its population by optimizing its demographic bonus, in the form of a larger percentage of the population of productive age (15 to 64 years). The Population and Family Planning Agency estimates that Indonesia will have a population of 305.65 million or even more in the year 2035. The population of productive age in 2035 will reach 207.5 million people, or as much as 67.9% of the population (Central Bureau of Statistics, 2013). Therefore, the government must educate and train them to become a skilled workforce. Conversely, if the government fails to educate them into a skilled workforce and there are not enough jobs available, the opposite will happen. The unemployment rate will increase sharply, which can result in an increase in various kinds of social crimes, such as theft, burglary and robbery.
The Central Bureau of Statistics stated that the educational background of workers in Indonesia in August 2019 was dominated by elementary and junior high school graduates (45%), followed by vocational and high school graduates (38%) and college and university graduates (17%). In February 2020, the workforce numbered 137.91 million and the number of open unemployed was 6.88 million (Central Bureau of Statistics, 2020). On the other hand, vocational schools (sekolah menengah kejuruan or SMK), which are expected to produce skilled workers, turned out to have the highest unemployment rate compared to other groups, 8.49% in early 2020 (Suprayitna, 2020).

With the large number of workers with low education and the number of unemployed SMK graduates, in the future the government needs to: (1) prioritize the development of labor-intensive industries so that they can accommodate a large number of workers; and (2) provide alternative education and non-formal training in order to rapidly increase the competence of prospective workers. The government, through the Minister of Industry, has identified several labor-intensive industries, such as textiles, footwear and garments, that are capable of absorbing a workforce of 225 thousand people per year, or 56% of the absorption of 400 thousand workers per year (Putra, 2019). Therefore, the government has encouraged investors to build labor-intensive industries, not capital-intensive industries, especially in the provinces, so that they can reduce the rate of urbanization and simultaneously increase economic output and labor absorption (Winardi, Priyarsono, Siregar, & Kustanto, 2019).
For example, the furniture industry that uses rattan and wood as base materials was developed in Palu, Central Sulawesi, and in Kendal, Central Java, supported by the construction of the Furniture Industry Polytechnic to increase the competence of its human resources. Based on data from the Central Bureau of Statistics, in 2017 the furniture industry had 1,918 medium- and large-scale business units, capable of absorbing up to 200 thousand workers (Ministry of Industry of the Republic of Indonesia, 2019).

The government, through the Department of Manpower, has organized several forms of non-formal education and training in the regions, including training organized by the work training agency. Currently, there are 305 work training agencies that are able to accommodate as many as 275 thousand participants, mostly primary school, junior high school, and senior high school/vocational high school graduates, in various vocational fields, such as business and management, tourism, electronic engineering and others, which are served online via the official page, https:/kemnaker.go.id/training.

Besides, the government also organizes non-formal education and training programs through the development of science techno parks (STPs) in several regions. In 2019, there were 22 STPs spread across several provinces. Solo Technopark in Solo, Central Java, developed since 2009, is an integrated technology area that serves as a development center for micro, small and medium enterprises (MSMEs), vocational skills training, and innovation, combining elements of science and technology development with market, industry and business needs to strengthen regional competitiveness. Solo Technopark provides four training fields, namely manufacturing mechanics, manufacturing welding, manufacturing automation and manufacturing design.
The government needs to assess how effective the training programs at the techno park are, as one of the education and training institutions relied on to support the availability of skilled labor, how well they suit the needs of the labor market, and the level of user absorption of graduates. Several studies that have been conducted have not specifically focused on the performance of the graduates. For example, Ramadhani (2015) evaluated the implementation process of the business and technology incubator program at Solo Technopark. Mukhlish (2018) examined the model of collaboration between government, industry and universities developed in several institutions, including Solo Technopark. In addition, Muhammad et al. (2017) conducted a survey on the quality of a number of technopark embryos in Indonesia. Rahayu and Nurharjadmo (2017) generally evaluated the implementation of the Solo Technopark development program. Pitaloka and Humaedi (2020) explained that a science technology park can habituate the culture of science and technology in the community, and revealed STP development phenomena in the regional context in Indonesia.

This research differs from the previous studies because it more specifically explains the evaluation results of training outcomes based on the opinions of participants, alumni and employers. Kirkpatrick's four-level evaluation model is chosen because the model provides one technique for appraisal of the evidence for any reported training program and could be used to evaluate whether a training program is likely to meet the needs and requirements of both the organization implementing the training and the staff who will participate in it (Smidt, Balandin, Sigafoos, & Reed, 2009).
Thus, the results become valuable inputs in improving the performance of non-formal education and training programs, especially at Solo Technopark.

Method

This research was conducted at Solo Technopark from July to September 2019. The subjects of this study were the training participants, training alumni, and industrial employers of the alumni. The respondents were selected by purposive sampling. The evaluation model used in this study was the Kirkpatrick model. This model consists of four levels: reaction, learning, behaviour, and results. Data were collected through interviews, observation, questionnaires, and documentation. The respondents of the training program were 47 training participants for the reaction and learning levels and 59 alumni for the behaviour and results levels. The scores collected from the questionnaires were graded into five categories based on the ideal mean and standard deviation, namely very low, low, intermediate, high, and very high, and the data were analyzed quantitatively. Meanwhile, the data collected from observations and interviews were analyzed using a qualitative technique.

Indicators measured in the reaction aspect were training participants' motivation and their satisfaction with the subject contents, the instructors and the facilities. Indicators measured in the learning aspect were understanding of the theories and the degree of practical skills. Indicators measured in the behaviour aspect were changes in the skills and attitudes of training participants after completing the program. Indicators measured in the results aspect were the impact after attending the training program, the degree of absorption of graduates, and the impact of alumni in the workplace.
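The five-category grading described in the method can be sketched as follows. The text does not state the cut-off points, so this sketch assumes a commonly used convention: ideal mean Mi = (max + min) / 2, ideal standard deviation SDi = (max - min) / 6, and category boundaries at Mi ± 0.6 SDi and Mi ± 1.8 SDi. The function name `categorize` and the example questionnaire (10 items scored 1 to 4) are hypothetical.

```python
def categorize(score, min_score, max_score):
    """Grade a total score into five categories from the ideal mean and SD.

    Assumed convention (not stated in the source text):
      Mi  = (max + min) / 2
      SDi = (max - min) / 6
      boundaries at Mi +/- 0.6 * SDi and Mi +/- 1.8 * SDi
    """
    if not min_score < max_score:
        raise ValueError("min_score must be smaller than max_score")
    ideal_mean = (max_score + min_score) / 2
    ideal_sd = (max_score - min_score) / 6
    if score > ideal_mean + 1.8 * ideal_sd:
        return "very high"
    if score > ideal_mean + 0.6 * ideal_sd:
        return "high"
    if score > ideal_mean - 0.6 * ideal_sd:
        return "intermediate"
    if score > ideal_mean - 1.8 * ideal_sd:
        return "low"
    return "very low"


# Hypothetical questionnaire: 10 items scored 1-4, so totals range from 10 to 40.
for total in (35, 30, 25, 20, 12):
    print(total, categorize(total, min_score=10, max_score=40))
```

With boundaries defined this way, the percentage of respondents in each category is the share of total scores that falls in each band.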
Findings and Discussion

Evaluation of a training program using Kirkpatrick's four-level model focuses on the development of training outcomes in training participants, which include reaction, learning, behavior, and results. The evaluation of reaction concerns how the participants felt about the training or the learning experience. The evaluation of learning measures the increase in participants' knowledge during the training activities. The evaluation of behavior concerns the extent to which learning is applied back at work, where the participants do their jobs. The evaluation of results concerns the effect of the participants on the business or environment. The reaction and learning criteria are considered internal, because they focus on what occurs within the training program. The behavioral and results criteria focus on changes that occur outside (and typically after) the program, and are thus seen as external criteria (Praslova, 2010). Furthermore, Grohmann and Kauffeld (2013) stated that all four levels are important for training evaluation, because organizations can use the reaction level as an indicator of customer satisfaction, and the learning level is assumed to be a requirement for behavior change. Behavior-level results can demonstrate how the training contents are actually applied to the job, so that they are organizationally usable, while the results level shows how the training contributes to organizational success.

Level 1: Reaction

The objective of the reaction measurement is to evaluate how each participant reacts to the training program. Questions were developed to find out whether the trainees enjoyed their experience and whether they found the subject contents and facilities used in the program useful for their work.
The aspects of trainee reaction measured were: (1) motivation for attending the program; and (2) response to the subject contents, to the facilitators/instructors, and to the equipment and facilities provided in the training program.

In the Solo Technopark training program, 48.90% of participants had a very high level of motivation to take part in the training, 44.70% had a high level, and 6.40% had a moderate level. Among the statements in the motivation questionnaire, the highest average percentage was obtained by the statement "I feel happy in participating in training": 20 respondents were very satisfied and 27 respondents were satisfied. The lowest average percentage was obtained by the statement "ask if you have difficulty": eight respondents were very satisfied, 38 respondents were satisfied, and one respondent was not satisfied. Based on these data, the participants were satisfied with regard to their motivation to take part in the training, but they need to be given the opportunity to ask questions when they experience difficulties.

Participants' satisfaction with the subject matter of the training was very high for 63.8% of participants, high for 34.0%, and moderate for 2.1%.
Among the questionnaire statements on satisfaction with the training material, the highest average percentage was obtained by the statement "the training subjects provided are very useful for dealing with competition in the work place": 36 respondents were very satisfied and 11 respondents were satisfied. The lowest average percentage was obtained by the statement "material is well mastered": seven respondents were very satisfied, 39 respondents were satisfied, and one respondent was not satisfied. Based on these data, the participants were satisfied with the training material. However, the lowest average shows that not all of the material is well mastered, which must be taken more seriously.

Participants' satisfaction with the training instructors was very high for 66.00% of participants, high for 29.80%, and moderate for 4.3%. Among the questionnaire statements on satisfaction with the training instructors, the highest average percentage was obtained by the statement "instructor gives motivation to participants": 24 respondents were very satisfied, 22 respondents were satisfied, and one respondent gave a different answer. The lowest average percentage was obtained by the statement "question and answer time": eight respondents were very satisfied, 34 respondents were satisfied, and five respondents were not satisfied.
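The item-level comparisons in these reaction results rank statements by an "average percentage". One plausible reading, sketched below, is the mean score of a statement expressed as a percentage of the maximum possible score. The numeric scoring (very satisfied = 4, satisfied = 3, not satisfied = 2, very dissatisfied = 1) and the function name `average_percentage` are assumptions; only the answer counts are taken from the two training-material statements reported above.

```python
# Assumed numeric scoring of the four answer options (not stated in the source).
SCORES = {
    "very satisfied": 4,
    "satisfied": 3,
    "not satisfied": 2,
    "very dissatisfied": 1,
}


def average_percentage(counts):
    """Mean item score as a percentage of the maximum possible score."""
    respondents = sum(counts.values())
    if respondents == 0:
        raise ValueError("no responses")
    total = sum(SCORES[answer] * n for answer, n in counts.items())
    return 100 * total / (respondents * max(SCORES.values()))


# Answer counts reported for two training-material statements (n = 47 each).
useful = {"very satisfied": 36, "satisfied": 11}
mastered = {"very satisfied": 7, "satisfied": 39, "not satisfied": 1}

print(f"{average_percentage(useful):.1f}")    # 94.1
print(f"{average_percentage(mastered):.1f}")  # 78.2
```

Under this assumed scoring, the statement about usefulness scores higher than the statement about mastery, which matches the ordering reported in the text.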
Participants' satisfaction with the education and training facilities was very high for 87.20% of participants and high for 12.80%. Among the questionnaire statements on satisfaction with education and training facilities, the highest average percentage was obtained by the statement "all material available practical tools": 25 respondents were very satisfied and 22 respondents were satisfied. The lowest average percentage was obtained by the statement "lunch time available": eight respondents were very satisfied, 23 respondents were satisfied, 12 respondents were not satisfied, and four respondents were very dissatisfied. Based on these data, the participants were satisfied with the education and training facilities.

It is found that the satisfaction of participants in the training program has a direct impact on motivation and enthusiasm for learning. Chang and Chang (2012) also mentioned that the level of learning motivation directly affects learning satisfaction. Based on the results, it is concluded that the participants' satisfaction regarding their motivation to take part in the training falls in the good category. This means that the training participants have motivation and enthusiasm for learning.

Level 2: Learning

The objective of the learning measurement is to evaluate the degree to which participants have learned in the aspects of knowledge and basic skills.
In this program, learning aspects are assessed based on the level of mastery of the subject contents and basic skills. The results of the measurement are shown in Figure 1.

For the theoretical training material, 59.60% of participants had a high level of understanding, 27.7% had a very high level, and 12.80% had a moderate level. Among the questionnaire statements on participants' understanding of learning theory, the highest average percentage was obtained by the statement "I understand the material kinds of bench work equipment well": 13 respondents said they understood very well, 33 respondents said they understood, and one respondent did not understand the material. The lowest average percentage was obtained by the statement "I understand the basic theory of electrical welding": 12 respondents said they understood very well, 21 respondents said they understood, and 14 respondents said they did not understand. Based on these data, participants understand the theoretical training material. However, judging from the lowest average, the basic material on electrical welding needs to be strengthened.

The mastery level of basic skills is shown in Figure 1. Most participants (68.10%) had a high level of understanding of the overall practice material, 19.10% had a very high level, and 12.80% had a moderate level. The detailed degree of participants' understanding of the practical subject contents in the training program is explained as follows.
practice of bench work

in bench work skills, 25.5% of participants were highly skilled, 58.6% were skilled, and 14.9% had moderate skills. in the distribution of respondents’ answers to the questionnaire statements on bench work skills, the statement with the highest average percentage was “conducting a scroll using a tap” procedure: 15 respondents stated that they understood very well and 32 stated that they understood. meanwhile, the statement with the lowest average percentage was “flow filing procedure”: six respondents said they understood very well, 25 said they understood, and 12 said they did not understand.

practice of the grinding tool

in tool grinding operation skills, 17.0% of participants were very skilled, 66.0% were skilled, 10.6% had moderate skills, and three respondents had low skills. in the distribution of respondents’ answers to the questionnaire statements on grinding tool operation skills, the statement with the highest average percentage was “carrying out the hand chiseling procedure”: eight respondents said they understood very well, 33 said they understood, and five said they did not understand the material.
meanwhile, the statement with the lowest average percentage was “drill bit sharpening procedures”: ten respondents said they understood very well, 22 said they understood, 13 said they did not understand, and two stated they did not understand the material.

practice of turning

in turning (lathe) skills, 29.8% of participants were highly skilled, 59.6% were skilled, and 10.6% had moderate skills. in the distribution of respondents’ answers to the questionnaire statements on turning operation skills (lathe), the statement with the highest average percentage was “i can do the average turning procedure well”: 20 respondents stated that they understood very well, 26 stated that they understood, and one did not understand the material. meanwhile, the statement with the lowest average percentage was “lathe cartel procedure”: ten respondents said they understood very well, 23 said they understood, ten said they did not understand, and four said they did not understand the material.

practice of technical drawing

in technical drawing skills, 12.8% of participants were highly skilled, 70.2% were skilled, 8.5% had moderate skills, and 8.5% had low skills.
in the distribution of respondents’ answers to the questionnaire statements on technical drawing skills, the statement with the highest average percentage was “applying equipment and standardization of images”: nine respondents stated that they understood very well, 33 stated that they understood, and five said they did not understand the material. the statement with the lowest average percentage was “procedure of drawing a size on a workpiece”: six respondents stated that they understood very well, 26 stated that they understood, 14 stated that they did not understand, and one stated that they really did not understand the material.

practice of basic electrical welding

figure 1 shows that, in basic electrical welding skills, 23.4% of participants were highly skilled, 29.8% were skilled, 12.8% had moderate skills, and 34.0% had low skills. in the frequency distribution of respondents’ answers to the questionnaire statements on the subject matter and basic electrical welding skills, the statements with the highest average percentage were “weld plate 8 mm with a horizontal position/1f” and “weld plate 8mm with a horizontal position-horizontal/2f”: 14 respondents stated that they understood very well, 17 said they understood, and 16 stated that they had moderate skills.
the statement with the lowest average percentage was “weld plate 8 mm in a vertical position/3f”: ten respondents said they understood very well, 14 said they understood, 21 said they did not understand, and two stated that they really did not understand the material. thus, the training participants are skilled in practical learning. however, judging from the lowest averages, there are some matters that the respondents do not yet understand.

figure 1. percentage of mastery of basic skills

kirkpatrick (2006) explained that three things are taught in an education and training program: knowledge, skills, and behavioral attitudes. gough (2018) agreed that tvet focuses on the process of gaining knowledge and skills for the world of work through cooperative relationships. thus, it is concluded that the training participants understood both the theoretical material and the skills/practices. participants are said to have learned if they have experienced an increase in knowledge, an increase in skills, and a change in attitude. without those three things in the training participants, the training program can be said to be a failure.

level 3: behaviour

change in behavior is an improvement in the knowledge, attitude, and practical skills of the alumni in their workplaces as a result of the training program. it is measured after the participants complete the program. there were 59 respondents doing an internship program who were assessed on attitude and practical skills. the respondents came from four different courses: 25 from manufacturing mechanics, seven from welding, 15 from automation mechanics, and 12 from manufacturing design.
the behavior changes in attitude skills shown by the participants after attending the training courses are as follows: 27.1% of participants had a very high level of skill and 72.9% had a moderate level. according to the alumni responses, the percentage distribution of the alumni’s practical skills is shown in figure 2.

in mechanical manufacturing skills, the majority of alumni (64.0%) had a high level of skill, 24.0% had a very high skill level, 8.0% had a moderate skill level, and 4.0% had a low skill level. in this course, several weaknesses were found in the alumni’s skills in basic subjects, such as electrics, pneumatics, and hydraulics.

the behavior of alumni after the training in welding skills is as follows: 28.6% had a very high skill level, 42.9% had a high skill level, 14.3% had moderate skills, and 14.3% had a low skill level. in the frequency distribution of respondents’ answers on welding skills, the statement with the highest average percentage was “understanding argon gas welding equipment (tig welding)”: three respondents stated that they were very skilled, three said they were skilled, and one said they were not skilled. meanwhile, the statement with the lowest average percentage was “skills to weld soft steel plates with argon gas welding (tig welding)”: one respondent stated that they were very skilled, three stated that they were skilled, two said they were unskilled, and one said they were not very skilled.

figure 2.
degree of participant’s behavior

in terms of the behavior of alumni after the training related to the mastery of automation skills, the largest share of participants (46.7%) had a high level of skill, 20.0% had a very high skill level, 26.7% had a moderate skill level, and 6.7% had a low skill level. in the distribution of respondents’ answers to the statements on automation skills, the statement with the highest average percentage was “i master the use with one cylinder”: five respondents stated that they were very skilled, nine stated that they had moderate skill, and one stated that he had medium skills. further, the statement with the lowest average percentage was “hardware programming skills”: one respondent stated that they were very skilled, six stated that they were skilled, and eight stated that they were unskilled.

in terms of the behavior of alumni after the training in manufacturing design skills, the majority of participants (66.7%) had a high level of skill, 25.0% had a very high skill level, and 8.3% had a moderate skill level. the statements with the highest average percentage were “drawing a picture of pieces with auto-cad”, “drawing 2d & 3d images with auto-cad”, and “mastering the basics of autodesk-inventor/solid work program”: four alumni said they were very skilled and eight said they were skilled.
meanwhile, the statement with the lowest average percentage was “drawing assembled kinematic simulations in catia”: two respondents stated that they were very skilled, three said they were skilled, and seven stated that they were not skilled. based on these data, participants are skilled in the skills of their majors. however, some skills need to be improved, such as drawing cavities and kinematic assembling in catia.

level 4: results

assessment of results is a measurement of the primary goal of a program. level four determines the overall success of the training model by measuring factors such as lowered spending, higher returns on investments, improved quality of products, fewer accidents in the workplace, more efficient production time, and a higher quantity of sales.

in this research, the results of the program were measured in two aspects: the impact of the alumni at the workplace and the employability rate of the alumni. based on the results of interviews with the employers and a study of productivity documents, it can be concluded that, according to the employers, the alumni of the solo technopark training program have high quality knowledge and skills, and are able to learn new skills and adapt to new environments quickly. the alumni’s capability increases the quality and quantity of products and helps products to be completed on target, so that production costs can be reduced.

data on the employability rate are based on a review of the alumni documents collected in the solo technopark office. there were 949 alumni listed in the document since 2016.
there are 61 participants known to have resigned, 457 alumni employed in 60 different workplaces, 60 alumni taking other courses, and two alumni studying in universities. the workplaces of as many as 429 alumni are not known. the data show that the employability rate of the alumni in 2019 is less than 50% (457 out of 949 alumni).

conclusion

based on the results and discussion of the evaluation of training programs in solo technopark using the four-level kirkpatrick model, it can be concluded that, according to respondents, the solo technopark training participants’ level of satisfaction with the subject contents, instructors, and training facilities is mostly moderate. at the level of learning, most participants of the training program master the knowledge and basic skills well. at the level of behavior, most participants of the program have mastered knowledge, attitude, and skills at a high level. at the results level, according to the employers, the alumni have high quality knowledge and skills, and are able to learn new skills and adapt to new environments quickly. besides, based on the alumni document, it is known that the employability rate of the alumni is less than 50%. therefore, solo technopark should have a better tracer study to improve the quality of alumni data by improving the networking of alumni and collaboration with many more industries and employers.

acknowledgment

this paper and the research behind it would not have been possible without the exceptional support of our colleague in solo technopark, sutrisno kusuma. his enthusiasm, knowledge, and exacting attention to detail have been an inspiration and kept our work on track. furthermore, this work was fully supported by the postgraduate program of universitas negeri yogyakarta.

references

central bureau of statistics. (2013). prediksi penduduk indonesia 2010-2035. central bureau of statistics of the republic of indonesia.
retrieved december 3, 2020, from https://www.bappenas.go.id/files/5413/9148/4109/proyeksi_penduduk_indonesia_2010-2035.pdf

central bureau of statistics. (2020). keadaan pekerja di indonesia agustus 2020. central bureau of statistics of the republic of indonesia. retrieved december 4, 2020, from https://www.bps.go.id/publication/2020/11/30/351ae49ac1ea9d5f2e42c0da/keadaan-pekerja-di-indonesiaagustus-2020.html

chang, i. -y. & chang, w. -y. (2012). the effect of student learning motivation on learning satisfaction. international journal of organizational innovation, 4(3), 281–305.

gough, s. (2018). technical and vocational education and training. mpg books group.

grohmann, a., & kauffeld, s. (2013). evaluating training programs: development and correlates of the questionnaire for professional training evaluation. international journal of training and development, 17(2), 135–155.

kirkpatrick. (2006). evaluating training program. berrett-koehler.

ministry of industry of the republic of indonesia. (2019). berbasis padat karya dan orientasi ekspor, pemerintah pacu industri furnitur. retrieved december 5, 2020, from https://kemenperin.go.id/artikel/20405/

muhammad, n. a., muhyiddin, m., faisal, a., & anindito, i. a. (2017). studi pembangunan science and technopark (stp) di indonesia. jurnal perencanaan dan pembangunan, 1(1), 14–31. https://doi.org/10.36574/jpp.v1i1.6

mukhlish, b. m. (2018). kolaborasi antara universitas, industri dan pemerintah dalam meningkatkan inovasi dan kesejahteraan masyarakat: konsep, implementasi dan tantangan. jurnal administrasi bisnis terapan, 1(1), 31–43. https://doi.org/10.7454/jabt.v1i1.27

pitaloka, a. a., & humaedi, m. a. (2020).
science and technology park (stp): transformation to quadruple helix approach for habituation of science and technology in indonesia science technology park (stp). jurnal sosioteknologi, 19(1), 201–217. https://doi.org/10.5614%2fsostek.itbj.2020.19.1.14

praslova, l. (2010). adaptation of kirkpatrick’s four level model of training criteria to assessment of learning outcomes and program evaluation in higher education. educational assessment, evaluation and accountability, 22(3), 215–225. https://doi.org/10.1007/s11092-010-9098-7

putra, e. p. (2019). menjawab persoalan kebutuhan tenaga kerja terampil. republika. retrieved october 4, 2019, from https://republika.co.id

rahayu, a. t., & nurharjadmo, w. (2017). evaluasi implementasi program pengembangan solo technopark. jurnal waca publik, 1(6), 48–57.

ramadhani, a. d. p. (2015). evaluasi proses pelaksanaan program inkubator bisnis dan teknologi solo technopark di kota surakarta. thesis, universitas sebelas maret, surakarta.

smidt, a., balandin, s., sigafoos, j., & reed, v. a. (2009). the kirkpatrick model: a useful tool for evaluating training outcomes. journal of intellectual & developmental disability, 34(3), 266–274. https://doi.org/10.1080/13668250903093125

suprayitna, i. (2020). 6,88 juta pengangguran di indonesia paling banyak lulusan smk. suara.com. retrieved may 5, 2020, from https://suara.com/bisnis/

winardi, w., priyarsono, d., siregar, h., & kustanto, h. (2019). peranan kawasan industri dalam mengatasi gejala deindustrialisasi. jurnal ekonomi dan pembangunan indonesia, 19(1), 84–95.
https://doi.org/10.21002/jepi.v19i1.834

copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online)
reid (research and evaluation in education), 7(1), 2021, 13-22
available online at: http://journal.uny.ac.id/index.php/reid

assessment and feedback practices in the efl classroom

ida ayu made sri widiastuti*
universitas mahasaraswati denpasar
jl. kamboja no. 11a, dangin puri kangin, denpasar utara, kota denpasar, bali 80233, indonesia
*corresponding author. e-mail: idaayuwidia@unmas.ac.id

introduction

the most recent recommendations from the 2013 curriculum suggested that assessment should be conducted to measure the students' knowledge, attitudes, and skills. to assess the three domains, the 2013 curriculum recommends five assessment characteristics: mastery learning, authentic, employing predetermined assessment criteria, and using various assessment techniques. to assess the attitude domain, teachers use direct or indirect observation, self-assessment, peer assessment, and journals. competency achievement is assessed by monitoring the learning process, learning progress, and competency achievement in conjunction with the students' potency and ability, which are expected to improve. the assessment of knowledge is carried out by measuring the students’ mastery, including factual, conceptual, and procedural knowledge at various levels of thought processes (widiastuti, 2018, pp.2-5). assessment can also provide feedback to teachers so that they can correctly plan and carry out the learning process. moreover, assessment helps teachers obtain useful feedback on what, how much, and how well their students are learning (taras, 2005, pp.468-471).
article info
article history: submitted: 13 january 2021; revised: 5 march 2021; accepted: 17 march 2021
keywords: assessment; feedback; practices; efl classroom

abstract
the present study explored the implementation of language classroom assessment and feedback within the english as a foreign language (efl) classroom. this research was conducted using a qualitative research design, and the obtained data were analyzed descriptively. data were collected by conducting direct classroom observation using observation checklists and in-depth interviews with the three professional english instructors using an observation guide. the data were transcribed and coded according to their categories. the findings of the study indicated that the students' learning progress was assessed through short question-answer and completion tests. the teachers provided only verbal feedback, and only occasionally. the teachers hardly took any follow-up action to modify their way of teaching. consequently, the students' learning achievement improved only slowly and gradually. therefore, teachers were recommended to utilize various classroom language assessments to assess the students' learning. the teachers should provide both verbal and written feedback for their students to enhance their achievement continually. this is an open access article under the cc-by-sa license.

assessment is conducted through a series of steps, including planning, assessment preparation, information collection through an amount of evidence showing the students' achievement, processing, and utilization of information about the students' competence. assessment in the 2013 curriculum is intended primarily to “diagnose” students' learning progress toward accomplishing the standards. diagnostic assessment is administered to map out the students' ability in areas of language dimension. additionally, a diagnostic assessment is also conducted to
identify the students' weaknesses to enable teachers to practice better learning treatment. assessment is carried out to measure the holistic achievement of learning competencies (widiastuti et al., 2020). summative assessment is often called assessment of learning because its specific purpose is merely to know students' strengths and weaknesses (stiggins, 2002; mcmillan et al., 2013).

how to cite: widiastuti, i. (2021). assessment and feedback practices in the efl classroom. reid (research and evaluation in education), 7(1), 13-22. doi:https://doi.org/10.21831/reid.v7i1.37741
https://creativecommons.org/licenses/by-sa/4.0/

in general, assessment can be classified into two distinct types: summative and formative assessment. both are administered to collect information on the students' learning achievement, but their specific purposes create differences in how they are practiced. to measure the students’ learning achievement, teachers usually conduct the summative and formative assessment at the end of the school term in one semester. additionally, summative assessment helps teachers organize upcoming lessons and determine how the next lesson should be designed and practiced in the classroom. meanwhile, formative assessment is intended to measure the students' learning progress and to give teachers a clear picture of the students' achievement in a particular learning unit.
this is an ongoing assessment conducted at the completion of every learning unit, and its result can be used as a learning device to improve teaching and learning practices, which is why formative assessment is frequently termed assessment for learning (stiggins, 2002; mcmillan et al., 2013, p.2; derrick & ecclestone, 2006, p.3). moreover, black and wiliam (1998, p.25) emphasize that assessment is carried out with the aim of obtaining data about student abilities. feedback is given to help the students attain higher achievement in learning. the primary assessment is carried out by the teacher through observation during a certain period of time, and attitude assessment is not carried out on every basic competency (widiastuti & saukah, 2017, p.52). formative assessment is used to assist students in learning by giving them the opportunity to assess their work based on feedback on their learning progress, for various types of teacher-made tests and performance assignments such as student portfolios. therefore, assessment is critical in learning because it allows teachers to provide practical instruction in evaluating students' competence (tante, 2018, p.43).

feedback is also significant in education. it is one of the most powerful influences on learning and an essential aspect of effective teaching and learning. it is a significant element of assessment and a basis for advancing the students’ achievement. it can make the students move forward and make improvements in their learning. appropriate formative feedback may improve students' achievement and teachers' pedagogical practices (black & wiliam, 1998, p.22; dunn & mulvenon, 2009, p.2). to speed up the students’ improvement in learning, feedback is vital and should be provided by the teachers.
feedback is a part of formative assessment implementation because every formative assessment should be followed by formative feedback. it is used to enhance the students’ progress in learning and helps the teachers and students to know their existing position. feedback also enables the students to improve their ability and the quality of their work. it also gives students chances to modify the upcoming classes (box et al., 2015, p.22). consequently, after implementing formative assessment, the teachers should provide formative feedback to give the students a clear overview of their learning.

several studies have focused mainly on investigating the process of assessment implementation by teachers in the classroom (e.g. taras, 2005; stiggins, 2002; ahsan, 2009). there was, however, very limited information on how corrective feedback was given by the teachers. thus, this study was important to reveal the assessment and feedback practices carried out by efl teachers in real classroom settings. considering the aforementioned phenomenon, this paper aims to provide information on the assessment and feedback practices of junior high school teachers, focusing on the english teachers' assessment practices, their understanding of assessment, and their feedback after implementing formative assessment.

method

a qualitative research approach was employed in examining and interpreting the data of this study. participants in this study were obtained by conducting an interview with several junior high school english teachers. data were collected from three junior high school english teachers selected on the basis of their teaching qualifications and at least two years of teaching experience.
the study participants were also required to meet the criteria of a qualified teacher, as shown by their teaching certificate, and to have the same understanding of classroom assessment. data were collected using in-depth interviews and classroom observation. constant comparative techniques were applied to each participant's interviews to identify the teachers' assessment and feedback practices. the collected data were first transcribed and then interpreted critically and argumentatively. to ensure the validity of the data, triangulation was conducted by matching the data collected from the interviews with the data collected from the observation. prior to triangulating the data, the transcriptions of the interviews were checked several times to ensure their agreement with the recordings, and then the teachers were asked to read the transcriptions to reconfirm what they had said during the interviews. the transcriptions were then carefully matched with the observation notes to ensure that all the data were valid and reliable.

findings and discussion

assessment used by the teachers in efl classes

the result of the interview with the english teachers showed that the english teachers in this study did not know the english curriculum. they thought that the curriculum was similar to the lesson plan or syllabus they used. the teachers in this study have a significantly clear understanding of the importance of administering assessments. it was found that they were able to carry out some assessments. they often referred to formative assessment as a daily assessment. they also administered summative assessments at the end of the school term. when they were asked about the purpose of conducting a formative assessment, they gave a sufficiently comprehensive description. the assessment was mainly administered because it was required by the school rather than for learning and teaching improvement.
therefore, formative assessment was conducted to test the students' ability and show their score for a particular topic. the formative assessment was conducted, but the result was not adequately used to improve teaching and learning processes. most of the teachers stated that the assessments were conducted based on the assessment guidelines stated in the curriculum. teachers in this study clearly understand the purpose of conducting classroom assessments. the teachers conducted two main types of assessment in assessing the students' learning ability: formative and summative.

summative and formative assessment practiced in english classes

summative assessment is conducted at the end of each semester to measure the students' knowledge, whereas formative assessment was conducted at the end of every learning unit and is known as "daily repetition."

"i think any assessment is important to be administered in the classroom. i do a summative assessment in the classroom at the end of each semester intending to know my students' learning achievement and formative assessment at the end of each learning unit." (teacher b)

"i administered a formative assessment to find out whether or not the students understand the lesson. i usually give the students assessment at the end of every learning unit, which i usually give after finishing teaching by giving them a short test. this i do in order i know the one unit of the assessment which i have done usually conducted at the end of every unit of learning students' understanding about the lesson and give the students score for that learning unit. i also administer a summative test at the end of every semester to give the students score for one semester."
(teacher a)

the english teachers administer a summative assessment to get information about the students' ability. the main objective of administering summative assessment is to assess the students' ability by scoring the student learning outcomes for one school term. in contrast, the formative assessment was conducted at the end of every learning unit to know the students' progress in the classroom. there is a very clear distinction between the purposes of summative and formative assessment. both are important to carry out for the sake of a higher quality of education. the interview excerpts above demonstrated that teachers in this research comprehend the contrast between formative and summative assessment implementation. during the interview, the teachers also explained that they are willing to learn more about formative and summative assessment. this indicates that the teachers did not yet fully understand those two assessments.

it was noted that the teachers had varying comprehension of how formative assessment is implemented. the teachers understood that formative assessment is really useful to enhance the students' performance, but they did not really use the results of formative assessment to modify their teaching strategy or teaching style in order to improve learning. the teachers merely used the results of the formative assessment to assess the students' performance. additionally, the teachers did not usually communicate the formative assessment results, accompanied by feedback, to the students. based on the interviews, the teachers did not know exactly the purpose of conducting the formative assessment. they conducted the assessment based on the school's prevailing policies, so the assessment results were not followed up maximally. in this study, it was found that formative assessment was conducted by using several types of tests.
"i often administered formative assessment; however, i did not use the results of the assessment to modify my teaching strategy. i do this to meet the policy in the 2013 curriculum." (teacher a) "formative assessment for me is very important to do. when the formative assessment time comes, i ask my students to continue doing the coursebook exercises. this is easier for me because i don't need to write a new test, and the students are already familiar with the test or exercise format. i score the students' work as soon as possible and then return them to the students. by returning their works after being marked, the students know their ability straight away." (teacher b) the aforementioned interview excerpt indicated that the teacher conducted a formative assessment in their teaching practice because the teacher believed that formative assessment is helpful to identify the students' learning progress. however, the teacher did not adequately prepare the assessment to ensure that the assessment is conducted in line with the learning objective and learning competence. the teacher in this study asked the students to do extra exercises from the coursebook. in practice, this could be acceptable if the coursebook is designed to suit the learning standard, and the tests are designed to meet the learning objectives and learning competence. based on the classroom observation, the coursebook was mainly designed to develop the students' language competence, and it was not for assessing the students. it can then be said that the teacher is required to construct the test accordingly. additionally, the teacher explained that the formative assessment was not used to modify the teaching style. the formative assessment's primary purpose is to enhance the learning process. formative assessment result gives a significant input for the teacher in order to figure out a new teaching strategy if the students' ability in that particular test was not sufficiently accepted. 
poor student achievement is caused not merely by the students' weaknesses but also by the teacher's inappropriate teaching strategy. that is why formative assessment should be respected for its nature, which is to improve teaching and learning quality. it can be said that formative assessment is a golden solution to teaching and learning challenges because it allows teachers and students to work together to progressively improve the quality of learning. concerning summative assessment, the teachers in this study had almost similar views regarding its purpose and how and when it should be administered. the summative test is administered at the end of every semester or school academic term. this assessment is mainly intended to give a clear picture of the students' achievement for a certain period after a particular semester's learning syllabus is completed.
"summative assessment is administered at the end of the semester. i usually work with the school principal, in this case, the vice-principal of education affairs. he sets the assessment schedule, and i just prepare the test and administer it to the students. to make it easier, i usually download test items from the internet and compile them into a set of 50 multiple-choice items." (teacher c)
this interview excerpt indicates that the teacher administered a summative assessment at the end of the semester or school academic term to measure the students' ability over a period of time. the teacher's test construction procedure seemed unacceptable since most test items were taken from the internet. downloading test items from the internet is not in line with the characteristics of good test construction.
in any test construction, the teacher needs to develop the test based on the learning objectives and learning competence that guided the teaching during the learning process. taking test items from the internet is likely to cause new problems in testing since such items are not necessarily valid and reliable for assessing the students in every school. therefore, test items should be constructed with respect to the syllabus content for that semester or school academic term.
"for me, summative assessment is not really important. i usually combine and compile the students' daily work and daily assessments to figure out their abilities. i can score their ability by critically judging the results of their work and daily assessments. daily assessment shows me their progressive achievement, so it is not necessary to do a summative assessment, which requires money and special time. it is a good way to assess students, but for me, it is not feasible and impractical." (teacher a)
the interview excerpt showed that this teacher was not keen on conducting summative assessment because it was considered impractical and time-consuming. scoring the students through a collection of daily work and daily assessments was considered more practical and economical. this teacher seemed to favor measuring the students based on daily assessments and daily assignments. this approach might give accurate scores if the class size is ideally small and the teacher has adequate time to critically analyze all the students' work and daily assessments. however, it has recently become compulsory for schools to administer summative assessment in order to assess the students' learning achievement appropriately. on the other hand, the teachers encountered problems in using formative assessment to improve teaching and learning quality because of the limited time they had.
the teachers did not carry out follow-up action after implementing the formative assessment. this was revealed in the interview with teacher c.
"i cannot use formative assessment results well because, after the daily test, i have to continue the learning activities by discussing the next material in order to finish all the material before the end of the semester." (teacher c)
based on the interviews and observations involving the three english teachers, it was found that several types of tests were used in assessing the students' learning ability in the classroom. this study indicated that the teachers used several different types of questions to assess the students' comprehension. one type of activity used to assess the students' comprehension was the question-and-answer session. however, the question-and-answer session consisted only of the teacher asking questions to the students; the students were not given an opportunity to ask the teacher any questions. the teacher asked questions about the material that had been previously taught. the students raised their hands and answered the teacher's questions according to their abilities. most of the teacher's questions were answered by students sitting at the front, and only some students sitting in the middle and at the back answered the teacher's questions. this finding is in line with the findings of ahsan (2009, p.234) and rahman (2018, p.278), in whose studies students actively answered the teachers' questions. however, the questions mainly targeted the lower levels of the cognitive domain and merely measured the students' ability to recall the learning material.
the questions were in the form of open-ended questions, closed-ended questions, and yes/no questions; some were directed to individual students and some to the whole group. the closed-ended questions asked by the teachers allowed students to answer using one or two words. most students answered the easier questions, but many found it difficult to answer more complicated and comprehensive questions. similarly, the data obtained from the observation confirmed that the three english teachers used several types of questions in assessing the students' ability, such as closed-ended questions, open-ended questions, and yes/no questions. the teachers asked closed-ended questions to get a brief and specific answer from the students. in this study, the teachers gave the students several closed-ended questions related to the material being discussed.
"i usually give closed-ended questions if i ask my students about years, such as birth year, production year, etc. i also do so when i ask about a place or when i need a specific answer." (teacher c)
"a closed-ended question is usually given if i need a short answer to my question. students usually answer the question briefly. but this type of question does not elicit an adequate explanation." (teacher b)
"i usually give an open-ended question to my students, so they can explain the answers they provide. from the students' answers, i know how far my students understand the material being studied. it takes more time, so not all students can get the question."
besides eliciting a specific answer, the purpose of asking closed-ended questions is to minimize the time allotted to the test, so that each student has the opportunity to answer all the questions presented by the teacher. time allotment for learning is very limited because many topics and learning competences need to be taught to the students. the teachers consider that closed-ended questions take less time to answer.
ideally, adequate time should be allocated to ensure that all students are assessed properly with sufficient tests. an assessment that is not conducted appropriately may distort the judgment of the students' ability. additionally, it makes it harder to grade the students' achievement. furthermore, an open-ended question is a type of teacher question that aims to give the students opportunities to elaborate on their answers. in this case, the student can provide more than one answer, along with an explanation, in response to the teacher's question. in this study, the teachers usually asked open-ended questions to obtain a lot of information about the students' learning achievement. the questions were asked to gauge the students' understanding in the classroom; however, this type of question has a weakness because not all students can answer it well. the students could not respond to the teachers appropriately. they could usually mention the desired answer but could not explain it in detail. one of the teachers used open-ended questions to determine the students' ability. he asked students to mention and explain the types of paragraph writing. some students could mention them well, but they could not explain them appropriately. the teacher then gave the same question to other students without giving any feedback. the questions also varied in their target: some were given to individual students and some to the whole class. individual questions are usually given to determine the ability of each student. the teachers usually invited each student to answer the questions they asked. if the student could not answer, the question was given to another student. some teachers, in this case, did not explain whether the answer given by the student was appropriate or not.
based on the classroom observations, it was also found that the teacher asked questions to the whole class after explaining particular material or showing a picture to the students. the teachers asked questions related to the picture. students who were able to answer raised their hands and answered the teacher's questions. in this case, not all students could answer the teacher's questions. one teacher in this study said that he asked questions so that the students could convey their understanding by giving a detailed explanation in response. based on the data obtained, the type of question most often asked by the teachers to assess the students' achievement was the closed-ended question, whether asked individually or to the whole class. the teachers did this to obtain specific answers and to minimize the time needed. another finding of this study was that when the teacher asked open-ended questions, most students replied in a closed-ended style; that is, the student tried to answer the teacher's question with only one or two words. according to the teacher, this was what the students usually did in the classroom, and the teacher could not give a further explanation because of the limited assessment time. based on the classroom observations, the teachers rarely gave feedback after their students responded to their questions. based on the results of this study, the technique most used by the teachers to assess the students' ability was questioning. this study found that the teachers asked more closed-ended questions and rarely asked open-ended questions.
on the other hand, the students still answered in a closed-ended style even when the teacher asked open-ended questions. this might be caused by the students' limited ability to elaborate on their answers. therefore, students need to practice their english more intensively in order to express their opinions elaborately. studies by other researchers revealed similar findings (burns & myhill, 2004, p.38; ahsan, 2009, p.234; yang, 2010, p.184). those findings indicated that students answered the teachers' questions in one or two words when they were asked closed or yes/no questions. yang (2010, p.184) also showed that students tried to respond more comprehensively when they were asked open-ended questions. however, the students answered at greater length only when the teachers pushed them to elaborate. this study showed that students were willing to speak more english when they felt that their answers needed further explanation. teachers often use open-ended questions to encourage students to speak english more intensively, using more than one or two words. moreover, black and wiliam (2009, p.11) suggested that teachers need to plan their questions before asking them in order to avoid asking the same questions.
the aforementioned data indicate that the teachers conducted two types of assessment to measure the students' learning achievement. formative assessment was usually administered at the end of every learning unit, and summative assessment was administered at the end of the school term. the teachers in this study seemed to understand the principles of assessment; however, most of them had poor ability to conduct the assessment properly and to make use of the assessment results. in the practice of formative assessment, for example, the teachers did not use the results to improve their teaching processes.
most of the teachers focused on establishing the students' achievement scores after the assessment.
feedback practices in efl classes
based on the classroom observations and interviews conducted with the three teachers in this study, it was found that the teachers gave feedback to the students after administering different types of assessment in the classes; however, providing feedback to students was not a regular practice. feedback is an important component of the teaching and learning process. the data revealed that the teachers gave two forms of feedback, namely written and oral feedback. according to teacher a, feedback was an interaction between teachers and students and among the students themselves. feedback helps students understand the mistakes they have made, so that they reach the target faster. feedback was given not only to the whole class but also to individual students. in this study, the teachers mainly provided verbal feedback to the students and written feedback on writing tasks.
"i usually give feedback to the students, both spoken and written feedback. this is important to develop the students' learning achievement and make them ready for the upcoming lesson. my feedback is sometimes in the form of a very strong suggestion to ensure my students do them properly."
the feedback given by the teachers in this study was sometimes positive and sometimes negative. the teachers provided positive feedback to improve the students' performance. they paid serious attention, took notes, and responded directly to the students' answers. sometimes the teachers provided negative feedback. in general, it was revealed that the teachers asked the students questions and sometimes gave feedback on their responses.
for example, teacher c taught reading skills in the classroom. teacher c invited the students to read aloud; when a student made a mistake in reading, the teacher corrected the mistake. another finding of this study was that other students also assisted the reader when he or she made a mistake, so the students also gave feedback to each other. in this study, the teachers stated that feedback should be provided to the students regularly because it could inspire them. they mentioned that they gave feedback to the students; however, many teachers may not understand the role of feedback for students during learning. the teachers rarely gave feedback because they focused only on finishing the topic being discussed. this is also influenced by a teaching culture in which the teacher gives explanations to students without allowing them to do something themselves. as a result, the teacher focuses on delivering the material and rarely provides feedback to the students. the teachers gave feedback when students did test or repetition questions, during which the students had no chance to ask questions. usually, the teachers provided feedback in the form of scores without any comments. students need written feedback that includes not only scores but also comments from the teacher.
"sometimes i give feedback to our students; however, the feedback is not regularly provided after conducting assessment because sometimes i think it is not important to give feedback since the questions have been answered by the students." (teacher c)
"of course, feedback is very important. i provide feedback in my classes. feedback will help the students to correct their mistakes. students get inspiration from the feedback given by the teacher." (teacher a)
feedback can have a good effect on students because it makes them feel that the teacher pays attention to them. besides, feedback makes students more active.
it can improve students' learning ability because the students understand the mistakes they make and get encouragement from the teacher to continue to enhance their ability and correct their errors. feedback has an important role for both students and teachers. feedback is an important factor that supports students in succeeding in the learning process (decristan et al., 2015, p.1136). feedback can take the form of corrections and suggestions for students, and it may contain criticism or encouragement for better performance. feedback is also important for the teacher in correcting the mistakes that have been made.
conclusion
assessment is essential and should be conducted and administered according to the prime purposes of assessment practices. summative assessment is administered to determine the students' ability over a certain period of time. in contrast, formative assessment was administered at the end of every learning unit to measure the students' progressive achievement toward a particular learning competence. the teachers mostly used open-ended and closed-ended questions to assess the students' learning achievement. feedback needs to be provided to the students so that they gain higher learning achievement and become active learners. positive corrective feedback leads to successful learning; meanwhile, the results of formative assessment could also be used by teachers to reflect on their teaching style and to modify their fossilized teaching strategy into a more effective way of teaching.
this implies the need for a closer and stronger relationship between universities and schools in organizing more intensive and practical development programs for teachers to enrich their competence in carrying out assessment and feedback practices in the classroom.
references
ahsan, s. (2009). classroom assessment culture in secondary schools of dhaka city. teacher’s world, 33–34(9), 231–244.
black, p., & wiliam, d. (1998). assessment and classroom learning. assessment in education: principles, policy & practice, 5(1), 7–74. https://doi.org/10.1080/0969595980050102
black, p., & wiliam, d. (2009). developing the theory of formative assessment. educational assessment, evaluation and accountability, 21(1), 5–31. https://doi.org/10.1007/s11092-008-9068-5
box, c., skoog, g., & dabbs, j. m. (2015). a case study of teacher personal practice assessment theories and complexities of implementing formative assessment. american educational research journal, 52(5), 956–983. https://doi.org/10.3102/0002831215587754
burns, c., & myhill, d. (2004). interactive or inactive? a consideration of the nature of interaction in whole class teaching. cambridge journal of education, 34(1), 35–49. https://doi.org/10.1080/0305764042000183115
decristan, j., klieme, e., kunter, m., hochweber, j., büttner, g., fauth, b., hondrich, a. l., rieser, s., hertel, s., & hardy, i. (2015). embedded formative assessment and classroom process quality: how do they interact in promoting science understanding? american educational research journal, 52(6), 1133–1159. https://doi.org/10.3102/0002831215596412
derrick, j., & ecclestone, k. (2006). formative assessment in adult literacy, language and numeracy programmes: a literature review for the oecd. centre for learning, teaching and assessment through the lifecourse, university of nottingham.
dunn, k. e., & mulvenon, s. w. (2009).
a critical review of research on formative assessment: the limited scientific evidence of the impact of formative assessment in education. practical assessment, research & evaluation, 14(7), 1–11. https://doi.org/10.7275/jg4h-rb87
mcmillan, j. h., venable, j. c., & varier, d. (2013). studies of the effect of formative assessment on student achievement: so much more is needed. practical assessment, research & evaluation, 18(2), 1–15. https://doi.org/10.7275/tmwm-7792
rahman, m. m. (2018). exploring teachers practices of classroom assessment in secondary science classes in bangladesh. journal of education and learning, 7(4), 274–283. https://doi.org/10.5539/jel.v7n4p274
stiggins, r. j. (2002). assessment crisis: the absence of assessment for learning. phi delta kappan, 83(10), 758–765. https://doi.org/10.1177/003172170208301010
tante, a. c. (2018). primary school teachers’ classroom-based assessment feedback culture in english language. international journal of educational research review, 3(4), 32–47. https://doi.org/10.24331/ijere.425151
taras, m. (2005). assessment summative and formative some theoretical reflections. british journal of educational studies, 53(4), 466–478.
https://doi.org/10.1111/j.1467-8527.2005.00307.x
widiastuti, i. a. m. s. (2018). teachers’ classroom assessment and grading practices. shs web of conferences, 42(00052), 1–7. https://doi.org/10.1051/shsconf/20184200052
widiastuti, i. a. m. s., mukminatien, n., prayogo, j. a., & irawati, e. (2020). dissonances between teachers’ beliefs and practices of formative assessment in efl classes. international journal of instruction, 13(1), 71–84. https://doi.org/10.29333/iji.2020.1315a
widiastuti, i. a. m. s., & saukah, a. (2017). formative assessment in efl classroom practices. bahasa dan seni: jurnal bahasa, sastra, seni dan pengajarannya, 45(1), 50–63. https://doi.org/10.17977/um015v45i12017p050
yang, c. c. r. (2010). teacher questions in second language classrooms: an investigation of three case studies. asian efl journal, 12(1), 181–201.
retrieved from https://www.asian-efl-journal.com/main-editions-new/teacher-questions-in-second-language-classrooms-an-investigation-of-three-case-studies/
this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 6(2), 2020, 130-141
available online at: http://journal.uny.ac.id/index.php/reid
evaluation of the implementation of professional development efforts in improving the professionalism of geography teachers
*1 alfin nuramalia yuniandita; 1 mukminan
1 faculty of social science, universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
*corresponding author. e-mail: alfinnuramalia.2018@student.uny.ac.id
submitted: 2 november 2020 | revised: 25 december 2020 | accepted: 27 december 2020
abstract
this study aims to determine how teacher professional development efforts are implemented and how the government and/or related agencies participate in supporting teacher professional improvement, so that sustainable professional development efforts can be formulated to improve the professionalism of high school geography teachers in brebes regency. this evaluation study used the discrepancy model. the method used was descriptive quantitative, and the data were analyzed using descriptive percentage techniques. the data in this evaluation study were collected through questionnaires, interviews, and observations.
the sample in this study consisted of 30 high school geography teachers in brebes regency, and the sampling technique used was disproportionate stratified random sampling. the results show that, although the geography teachers had a good understanding of the duties and obligations that they had to fulfill as professional teachers, sustainable professional development efforts had not been fully implemented. the sustainable professional development efforts carried out by the geography teachers fall within the "good enough" criterion, with an average score of 66.83 (out of a maximum score of 120). of the several programs that support teachers’ professional development, only a few are routinely carried out by geography teachers, including internal coaching by supervisors, geography subject forum activities, and workshops. besides, several professional development programs deserve special attention from the government due to the lack of teacher participation, namely apprenticeships, short courses, distance learning, level training, and further education, given that these programs provide opportunities for teachers to update and develop their knowledge.
keywords: sustainable professional development, professional improvement, geography teachers
how to cite: yuniandita, a., & mukminan, m. (2020). evaluation of the implementation of professional development efforts in improving the professionalism of geography teachers. reid (research and evaluation in education), 6(2), 130-141. doi:https://doi.org/10.21831/reid.v6i2.35455.
introduction
the world of education is facing various challenges in the era of globalization. global development demands the preparation of competent human resources (hr) who are able to compete in the global world.
the era of globalization is marked by various changes over a relatively short period of time in various fields, especially in the development of science and technology. as people are increasingly required to keep up with global developments, formal education in schools is believed to be the main factor in equipping people with knowledge that continues to evolve throughout life, because through education, people acquire a scientific and structured mindset based on existing facts; besides, students are helped to understand and recognize the science and knowledge that continue to develop (zubaedi, 2011, p.178).
https://doi.org/10.21831/reid.v6i2.35455 alfin nuramalia yuniandita & mukminan copyright © 2020, reid (research and evaluation in education), 6(2), 2020 131 issn: 2460-6995 (online)
concerning the development of education in indonesia, there have been various changes up to 2020, for example, changes to the education curriculum. these changes are basically aimed at improving the education system in indonesia to suit the functions and objectives of national education as stipulated in the law of the republic of indonesia no. 20 of 2003, article 3. therefore, education needs to be more directed so that it can provide students with educational and learning instruments that enable a learning process that is more open, creative, and adaptive to change (mukminan, 2008, p.2). efforts to realize national education goals are inseparable from the learning process because it is the core of the education process in general, and teachers have a very important role in it. amri and rohman (2013, p.9) state that the learning process is influenced by several factors: teacher factors, student factors, infrastructure factors, and environmental factors.
through the learning process, a series of activities takes place between the teacher and students in an educational situation to achieve the goals that have been set, so the teacher plays an important role in it. through the learning process, a teacher can determine what behavior is expected from students after they participate in learning (nugroho, 2013, p.2). basically, the learning process is a process of receiving and processing information that allows changes in behavior in both biological and emotional aspects (barron et al., 2015). based on this concept and on learning at school, the process of receiving information is two-way, and teachers play an important role in providing information for students. teachers progressively shift from teacher-centered knowledge transfer to student-centered activity that incorporates students' inquiry abilities. teachers have a role in designing and implementing a more modern educational curriculum in accordance with the era, so that it can meet students' needs to explore the environment and observe it through various fields of scientific study (chang et al., 2015, p.178). according to the regulation of the minister of national education no. 16 of 2007 concerning academic qualification standards, the professional competencies that geography subject teachers must possess include: first, mastering the nature of the scientific structure, scope, and objects of geography; second, differentiating geographic approaches; third, mastering geography material broadly and deeply; and fourth, showing the benefits of the geography subject. moreover, teachers' competence and performance must be improved so that their strategic and determinant role makes education successful. as explained by anggraini et al.
(2020), the importance of increasing teacher competence and performance is that teachers with good quality will produce quality generations who are ready to face all difficulties and challenges in life. uerz et al. (2018, p.18) explain that the important aspects for measuring teachers' competence are technology ability, pedagogical competence, and the use of educational technology about teaching and learning and professional competence in learning. meanwhile, based on the law of the republic of indonesia concerning teachers and lecturers, it is stated that an educator has at least four competencies, namely: pedagogical competence, social competence, personality competence, and professional competence. teacher competence describes the theoretical and empirical knowledge of a teacher and is related to aspects of skills, personality, aware-ness, and willingness of teachers to self-development and how teachers contribute to student education and self-development (semradova & hubackova, 2014, p.437). akhmetova et al. (2013, p.77) explain that professional competence is a standard that integrates the subject approach and subject matter in an educational unit. agree with that, lauermann and könig (2016, p.9) explain that teachers’ professional development, which covers their professional knowledge, skills, beliefs, and motivation, is an integral aspect that a professional teacher must own. meanwhile, niemi et al. (2016, p.471) argue that professional competence is related to teachers' main duties and responsibilities both at school and in the community. https://doi.org/10.21831/reid.v6i2.35455 alfin nuramalia yuniandita & mukminan 132 copyright © 2020, reid (research and evaluation in education), 6(2), 2020 issn: 2460-6995 (online) jailani (2014, p.4) explained that, basically, professional teachers are teachers who have a collective and complete awareness of their position as educators. 
a teacher's professionalism is an important component of effective learning (kulshersta and pandey, 2013, p.33). in addition, increasing teacher competence and performance can be realized through various practices and professional development training. jayatilleke and mackie (2013, p.312) report that professional development can take the form of training or practice in accordance with the relevant discipline. visković and jevtić (2017, p.1573) likewise state that professional development can be pursued through professional training that combines theoretical knowledge, practice, and discussion, as well as through workshops.

improving the quality of education is needed along with the development of science and technology. as the component that plays the most important role in the learning process, teachers must continue to improve their understanding of the demands of professional standards so that they can carry out their duties optimally and with direction. this was also confirmed by the research results of rohmah (2016), who stated that sustainable professional development must be carried out in accordance with the demands of professional standards and with teachers' needs in supporting their professional development. the need referred to is the effort to achieve and/or raise competence above the standard competence of the teaching profession, which in turn affects a teacher's promotion or functional position. sustainable professional development, as mandated in the regulation of the minister of state apparatus empowerment and bureaucratic reform no. 16 of 2009 concerning teacher functional positions and credit score, is the implementation of teacher competency development to meet the needs and requirements of professional standards, carried out gradually and continuously to increase teachers' professionalism.
the efforts for sustainable professional development can include: (1) self-development; (2) carrying out scientific publications; and (3) finding and/or creating innovative work. more specifically, mudlofir (2012, pp.133-134) explains as follows.

…strategies and techniques to increase teacher professionalism can be reached through the following activities: (1) in-house training (iht), (2) apprenticeship programs, (3) school partnerships, (4) distance learning, (5) tiered training, (6) short courses at universities or other educational institutions, (7) internal guidance by schools, (8) further education, (9) discussion of educational issues, (10) seminars, (11) workshops, (12) research, (13) writing books/teaching materials, (14) making learning media, and (15) making technological works/works of art.

meeting the demands of professional qualification standards requires related agencies to support and facilitate efforts for sustainable professional development. an evaluation is needed to determine how the implementation of sustainable professional development improves teacher professionalism, and to show which efforts still need to be optimized. based on the background that has been described, this research was specifically carried out to know: (1) the implementation of sustainable professional development by high school geography teachers in brebes regency; (2) the role of the government in supporting the sustainable professional development of high school geography teachers in brebes regency; and (3) the efforts to improve the professionalism of geography teachers.

method

the research method used was the descriptive quantitative method with percentage descriptive analysis techniques.
the researchers intended to find out and describe how sustainable professional development for senior high school geography teachers is implemented in brebes regency. this study was descriptive research using a survey method because the data studied were based on facts that occur in the field. the study results described quantitative data related to the state of the subject or the phenomenon of a population. the data consisted of teacher understanding of the duties and responsibilities of being a professional teacher, and an overview of the implementation of sustainable professional development for high school geography teachers in brebes regency.

the evaluation model used in this study was the discrepancy model based on malcolm provus. evaluation with the discrepancy model was carried out to determine the suitability between the standards that had been set for the implementation of teacher professional competence and the implementation in the field (muryadi, 2017, p.4). the standard measured in this evaluation study was whether or not the professional development efforts of geography teachers corresponded to the regulation of the minister of national education no. 16 of 2007. program achievement effectiveness was determined from the suitability of the research data with predetermined standard indicators for the implementation of sustainable professional development, as regulated in the regulation of the minister of state apparatus empowerment and bureaucratic reform no. 16 of 2009 on teacher functional position and credit score. the research was carried out in 17 senior high schools (sekolah menengah atas or sma) in brebes regency, consisting of 11 public sma and six private sma.
the target objects in this study included all teachers who taught high school geography subjects in brebes regency. the sample size in this evaluation study was 30 respondents. baley (as cited in mahmud, 2011, p.33) stated that research using statistical data analysis must have a sample of at least 30 respondents. table 1 shows the specific details of the classification of the research respondents. the data collection method used was the data source triangulation technique involving the geography subject teachers, the principal, the geography subject teacher supervisor, and the head of the geography subject teachers forum. the instruments used in data collection were closed questionnaires that had been tested for validity and reliability, interviews, and observations. the research results were processed using percentage descriptive analysis techniques.

the questionnaire employed to measure teachers' understanding of professional competence was prepared based on the standards set in the regulation of the minister of national education no. 16 of 2007 concerning teacher qualification and competency standards. the standards for professional competence achievement indicators are shown in table 2. data analysis was performed using the descriptive percentage method with scoring techniques based on the likert scale. the assessment categories ran from positive to negative and were classified into four, namely: (1) very good; (2) good; (3) pretty good; (4) not good (see table 3 and table 4). thus, the professional development efforts that still need to be developed to improve geography teachers' professionalism can be identified. the formula used for descriptive percentage analysis is presented in formula (1), where n = the score obtained and N = the total score (ali, 2013, p.201).

\[ \% = \frac{n}{N} \times 100\% \qquad (1) \]

table 1. classification of research respondents
school status | school accreditation | government employees | non-government employees | total
public | a | 12 | 11 | 23
public | b | 1 | − | 1
public | c | − | − | 0
private | a | 1 | 1 | 2
private | b | − | 3 | 3
private | c | − | 1 | 1
total | | 14 | 16 | 30
(the two middle columns give the teacher employment status.)
source: research, 2020

table 2. criteria of success for teacher professional development understanding
professional competence indicators follow the regulation of the minister of national education no. 16 of 2007.
1. mastering the scientific material, structure, concepts, and mindsets that support the subjects being taught
− being able to understand and apply the nature of the scientific structure, scope, and geographic objects in geography learning.
− being able to understand and apply the educational foundation both philosophically and psychologically.
2. mastering competency standards and basic competencies of the subjects being taught
− being able to handle the subject or field of study assigned to him.
− being able to understand and apply appropriate teaching methods.
− being able to carry out learning following the learning objectives and design.
3. developing creative learning materials
− being able to determine and provide geography subject matter creatively according to the students' level of development.
− being able to use and modify various learning tools and media and other learning facilities.
4. developing professionalism sustainably by taking reflective action
− being able to carry out learning evaluations.
− being able to take advantage of the results of self-reflection to improve professionalism.
− being able to carry out classroom action research for professional improvement.
− being able to follow the information on science and technology developments as an effort for self-development and professionalism.
5. utilizing information and communication technology to develop themselves
− being able to take advantage of information and communication technology in communicating.
− being able to take advantage of information and communication technology for self-development.
source: the regulation of the minister of national education no. 16 of 2007

table 3. frequency distribution criteria of professional competence understanding of high school geography teachers in brebes regency
no. | interval score (x) | percentage (%) | criteria
1. | 55.25 < x ≤ 68 | 81.25 < % ≤ 100 | very good
2. | 42.50 < x ≤ 55.25 | 62.50 < % ≤ 81.25 | good
3. | 29.75 < x ≤ 42.50 | 43.75 < % ≤ 62.50 | pretty good
4. | 17 ≤ x ≤ 29.75 | 25 ≤ % ≤ 43.75 | not good
source: research data, 2020

table 4. frequency distribution criteria of professional development efforts of high school teachers in brebes regency
no. | interval score (x) | percentage (%) | criteria
1. | 97.5 < x ≤ 120 | 81.25 < % ≤ 100 | very good
2. | 75 < x ≤ 97.5 | 62.50 < % ≤ 81.25 | good
3. | 52.5 < x ≤ 75 | 43.75 < % ≤ 62.50 | pretty good
4. | 30 ≤ x ≤ 52.5 | 25 ≤ % ≤ 43.75 | not good
source: research data, 2020

findings and discussion

professional competence in its implementation must be balanced with the continuous development of these competencies as an effort to increase teacher competence and performance. besides, development is carried out to maintain professional competence following developments in science, technology, arts and culture and/or sports. teachers are actively guided in sustainable professional development by learning from various sources, which adds insight and experience that support their professional development. based on the research results, the teachers' understanding of the duties and responsibilities of being a professional teacher met the good criteria, with an average score of 54.7 (from a maximum score of 68).
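as an illustration of formula (1) and the percentage bands in table 3 and table 4, the calculation can be sketched as follows. the function names are hypothetical, and the boundary handling (upper bounds inclusive, the minimum of 25% included in "not good") is an assumption read from the interval notation in the tables rather than something the study states explicitly.

```python
# Sketch of formula (1) and the percentage bands of Table 3 / Table 4.
# Function names and boundary handling are illustrative assumptions.

def descriptive_percentage(score: float, total_score: float) -> float:
    """Formula (1): obtained score n as a percentage of the total score N."""
    if total_score <= 0:
        raise ValueError("total_score must be positive")
    return score / total_score * 100


def classify(percentage: float) -> str:
    """Map a percentage onto the four categories (upper bounds inclusive)."""
    if not 25 <= percentage <= 100:
        raise ValueError("percentage must lie between 25 and 100")
    if percentage > 81.25:
        return "very good"
    if percentage > 62.50:
        return "good"
    if percentage > 43.75:
        return "pretty good"
    return "not good"


# Average understanding score reported in the findings: 54.7 out of 68.
understanding = descriptive_percentage(54.7, 68)
print(round(understanding, 2), classify(understanding))
```

with the reported average of 54.7 out of 68, the sketch gives roughly 80.44%, which lies in the "good" band and agrees with the category stated in the findings.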
a good professional understanding means that the teachers understood the main duties and responsibilities of a professional teacher. an overview of professional understanding is described in table 5 and figure 1. if teachers understood their professional duties, they also understood the criteria that ought to be met as a professional teacher. this covered an understanding of a teacher's main tasks, from compiling lesson plans and implementing learning to carrying out learning evaluations. this was also supported by information in an interview with the head of the geography subject teachers forum in brebes regency, who stated, “in terms of mastery of geography subjects, the geography teachers in brebes regency are surely very good and correspond to the standards.” it was also supported by the statement of mrs. evi as the supervisor of geography teachers at high school in brebes regency as follows.

the mastery of geography subjects is quite good, it is also seen on the recap of the results of teacher competency test scores. even though the scores are not high enough, yet it has reached at least the minimum threshold criteria, so, there must be further professional development.

however, even though the teachers' understanding of their professional competence was categorized as good, the implementation of sustainable professional development efforts ought to be maximized in order to improve teacher professionalism. susan (2012, p.36) explains that although the student learning process in practice had developed well over the last few decades, the approach to educators' professional development had lagged. thus, consistency from related agencies was needed in the implementation of professional development.

table 5. understanding of the professional competence of high school geography teachers in brebes regency
no. | interval score (x) | percentage (%) | criteria | f | %
1. | 55.25 < x ≤ 68 | 81.25 < % ≤ 100 | very good | 14 | 47
2. | 42.50 < x ≤ 55.25 | 62.50 < % ≤ 81.25 | good | 16 | 53
3. | 29.75 < x ≤ 42.50 | 43.75 < % ≤ 62.50 | pretty good | 0 | 0
4. | 17 ≤ x ≤ 29.75 | 25 ≤ % ≤ 43.75 | not good | 0 | 0
amount | | | | 30 | 100
source: research data, 2020

figure 1. diagram of understanding of teacher professional competence
source: research data, 2020

it is also confirmed by the research results of ghaikhorst et al. (2014, p.53) that professional development programs had a very positive impact on teachers' knowledge and self-improvement. in addition, kabadayi (2016, p.8) explained that to achieve professional standards, teachers were required to equip themselves with professional skills and competencies so that they could carry out their main duties and obligations as professional teachers.

efforts to develop high school geography teachers' professional competence in brebes regency consisted of two types, namely, efforts provided by agencies and efforts carried out by the geography teachers themselves. professional development efforts facilitated by related agencies or institutions included the implementation of in house training (iht) or education and training, internships, distance learning, tiered training, short courses, internal coaching, further education, seminars, workshops, research, equivalency programs, supervision, the subject teachers forum, and teacher symposiums. meanwhile, the efforts that geography teachers made independently included reading journals/scientific papers, following actual news, participating in professional organizations, and collaborating with peers. in connection with the efforts made for professional development, either individually or through agencies, table 6 presents the results of research on the involvement of geography teachers in professional development/improvement efforts.
the research results described in table 6 show that the sustainable professional development efforts made by geography teachers met the "pretty good" criteria, with an average score of 66.83 (out of a maximum score of 120). based on the research data, it could also be seen that further education was among the efforts with the lowest scores. this showed that few teachers decided to pursue further education, even though further education updates a teacher's educational qualifications and thereby demonstrates professional competence. further, kunter (2013, p.283) stated that professional competence developed through various active advances in the learning process, and that individual characteristics greatly affected a teacher's attitude in taking advantage of these opportunities.

table 6. implementation of sustainable professional development efforts
no. | sustainable professional development efforts | criteria | total score (n) | % (of Σn = 120)
1. | in house training (iht) | good | 82 | 68.33
2. | internship | not good | 30 | 25.00
3. | distance learning | not good | 35 | 29.17
4. | tiered training | not good | 30 | 25.00
5. | short course | not good | 41 | 34.17
6. | internal training | very good | 98 | 81.67
7. | further education | not good | 31 | 25.83
8. | seminar | pretty good | 68 | 56.67
9. | workshop | good | 88 | 73.33
10. | research | good | 78 | 65.00
11. | equalization program | pretty good | 54 | 45.00
12. | supervision | good | 83 | 69.17
13. | geography subject teachers forum | good | 85 | 70.83
14. | teachers symposium | pretty good | 73 | 60.83
15. | reading journals / scientific papers | good | 83 | 69.17
16. | following the latest news | good | 77 | 64.17
17. | participating in professional organizations | good | 77 | 64.17
18. | collaborating with peers | good | 90 | 75.00
average | | pretty good | 66.83 | 55.70
source: research data, 2020

of all the professional development programs that teachers must implement, only a few had been implemented properly, including internal coaching activities that were usually given directly by the principal to geography teachers. this internal coaching was very important in evaluating teachers' professional competence, considering that the principal, as a supervisor, understood the teachers' competence in implementing learning activities. it is also confirmed by shagrir (2012) that mentoring activities motivated teachers to self-evaluate and improve the quality of their teaching because mentoring/counseling allowed teachers to get guidance in meeting the demands of professional standards.

professional development efforts could include activities that were carried out independently or organized by related agencies. the research data showed that geography teachers had never carried out apprenticeship and tiered training. this was a special concern for the government in further improving the effectiveness of the sustainable professional development program. the inadequate implementation of sustainable professional development was also supported by the statement of the chairman of the geography subject teachers forum of brebes regency, mr. budi raharjo, who explained that:

the teachers' participation in professional development so far is still very low, it can be seen from how disciplined they prepare classroom action research reports. thus, only subject teachers forum activities are still actively participated in by the teachers until now.
there may be some teachers who are actively participating in seminars or other professional development activities but it is very rare.

it was also supported by mrs. evi as the supervisor of geography teachers at senior high school in brebes regency as follows.

professional development that was carried out independently by teachers was still a concern due to the lack of teachers' awareness. it must be based on the demands of the profession to attract teachers to actively participate in scientific activities. apart from that, there were only a few teachers who actively participated, or even none of them.

meanwhile, table 7 and figure 2 show the frequency distribution of the professional development efforts of senior high school geography teachers in brebes regency. based on the results, the largest share of professional development efforts, 50%, fell in the good category in terms of both implementation and participation of geography teachers. the professional development programs whose implementation met the good criteria included the iht program, workshops, the geography subject teachers forum, reading scientific journals or articles, following the latest news, participating in professional organizations, and collaborating with peers. meanwhile, the smallest share, 5.56%, fell in the very good category, namely the professional development effort in the form of an internal coaching program. this was very unfortunate, since the very good category should ideally contain the largest share of programs; instead, it contained the smallest share compared to the other criteria.

table 7. frequency distribution of professional development efforts of high school teachers in brebes regency
no. | interval score (x) | percentage (%) | criteria | f | %
1. | 97.5 < x ≤ 120 | 81.25 < % ≤ 100 | very good | 1 | 5.56
2. | 75 < x ≤ 97.5 | 62.50 < % ≤ 81.25 | good | 9 | 50.00
3. | 52.5 < x ≤ 75 | 43.75 < % ≤ 62.50 | pretty good | 3 | 16.67
4. | 30 ≤ x ≤ 52.5 | 25 ≤ % ≤ 43.75 | not good | 5 | 27.78
total | | | | 18 | 100
source: research data, 2020

figure 2. diagram of professional development efforts of high school teachers in brebes regency
source: research data, 2020

with regard to the research data, several efforts for sustainable professional development still needed to be improved, including apprenticeship programs and tiered training. teachers' minimal participation in various sustainable professional development programs became a particular concern for both schools and other relevant agencies. mak and pun (2014, p.5) also emphasized that sustainable professional development for teachers requires commitment from various related elements, such as support from schools, parents, and the wider educational community. smylie (2014, p.108) revealed that there was a lack of clarity in the policies governing the teacher evaluation system, so its effect on the professional development of teachers was still weak. thus, the monitoring of policies on the teacher evaluation system must be given more attention so that they can be implemented to the fullest. when the teacher evaluation process is designed and implemented correctly, following the objectives of learning and professional development, it can influence the quality of teaching and enhance student learning achievement (looney, 2011, p.440). delvaux et al. (2013, p.1) emphasized that teacher evaluation had a very important role in sustainable professional development. teachers' professional development efforts required the school committee's participation in their implementation so that they could run optimally.
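the frequency distribution in table 7 can be reproduced from the per-program total scores in table 6. the sketch below is only an illustration under stated assumptions: the dictionary keys and the function name are hypothetical, interval upper bounds are treated as inclusive, and tiered training is entered as 30, the value implied by its reported 25.00% and by the reported average of 66.83.

```python
from collections import Counter

# Total scores per effort from Table 6 (maximum 120 each).
# Assumption: tiered training is entered as 30, the value implied by its
# reported 25.00% and by the reported average score of 66.83.
SCORES = {
    "in house training": 82, "internship": 30, "distance learning": 35,
    "tiered training": 30, "short course": 41, "internal training": 98,
    "further education": 31, "seminar": 68, "workshop": 88, "research": 78,
    "equalization program": 54, "supervision": 83, "teachers forum": 85,
    "teachers symposium": 73, "reading journals": 83, "latest news": 77,
    "professional organizations": 77, "collaborating with peers": 90,
}


def category(score: float) -> str:
    """Classify a total score with the Table 4 intervals (upper bounds inclusive)."""
    if score > 97.5:
        return "very good"
    if score > 75:
        return "good"
    if score > 52.5:
        return "pretty good"
    return "not good"


# Count programs per category and express each count as a share of all programs.
counts = Counter(category(score) for score in SCORES.values())
shares = {name: round(n / len(SCORES) * 100, 2) for name, n in counts.items()}
mean_score = sum(SCORES.values()) / len(SCORES)

for name in ("very good", "good", "pretty good", "not good"):
    print(name, counts[name], shares[name])
print("average score:", round(mean_score, 2))
```

under these assumptions the counts are 1, 9, 3, and 5 programs out of 18, matching table 7, and the average score is 66.83. note that the unit of analysis here is the program, not the individual teacher.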
in addition, programs that supported professional development efforts required supervision from school committees and/or related institutions to evaluate their implementation. kadarwati (2016, p.119) explained that the school committee had a role and responsibility for improving teacher professionalism, including: (1) motivating teachers to carry out their responsibilities; (2) assisting teachers in implementing learning following the applicable curriculum provisions; (3) guiding teachers in carrying out learning evaluations; (4) carrying out supervision in the academic/teaching field; (5) equipping teachers with skills and knowledge to support professional development. meanwhile, berdiati (2020, p.38) explained that in addition to school committees, supervisors also play an important role in sustainable professional development for teachers, namely: (1) as a motivator, who guides teachers to develop their professionalism; (2) as a facilitator; (3) as a supervisor in the academic field. besides, to develop professionalism, collaboration between educators at the secondary school level and educators at the university level was needed, thus forming a new concept of professionalism in teaching (herbert and rainford, 2014, p.243).

conclusion

sustainable professional development is needed to improve teachers' professionalism. referring to the research results, the implementation of professional development for high school geography teachers in brebes regency had not been maximal. this can be seen from the results of the study: the average score for participation in professional development programs reached only 55.70% of the maximum, which falls in the "pretty good" category.
of all the programs in professional development efforts, there were only a few that most geography teachers routinely joined, including the iht program, internal coaching, seminars, workshops, research, supervision by school supervisors and principals, the subject teachers forum, and teacher symposiums. in this case, the role of the government and related agencies is needed to provide support and encouragement for professional development programs.

references

akhmetova, n. s., omarova, n. n., kuznetsova, s. v., & sheveleva, a. n. (2013). professional competence teacher: theoritical aspect. education and science without borders, 4(7), 76-79.
ali, m. (2013). penelitian kependidikan prosedur & strategi. angkasa.
amri, s., & rohman, m. (2013). strategi dan desain pengembangan sistem pembelajaran. prestasi pustakarya.
anggraini, f., mirizon, s., & inderawati, r. (2020). professional development of novice english teacher junior high school. jurnal pendidikan progresif, 10(2), 233-249. https://doi.org/10.23960/jpp.v10.i2.202009
barron, a. b., hebets, e. a., cleland, t. a., fitzpatrick, c. l., hauber, m. e., & stevens, j. r. (2015). embracing multiple definitions of learning. trends in neurosciences, 38(7), 405–407. https://doi.org/10.1016/j.tins.2015.04.008
berdiati, i. (2020). peran pengawas dalam pengembangan keprofesian berkelanjutan bagi guru. jurnal diklat keagamaan, 16(1), 38-49.
chang, y.-l., wu, s.-c., & wu, h.-h. (2015). they are learning: changes through teacher professional development of inquiry curriculum design and implementation. procedia social and behavioral sciences, 177, 178-182. https://doi.org/10.1016/j.sbspro.2015.02.375
delvaux, e., vanhoof, j., tuytens, m., vekeman, e., devos, g., & van petegem, p. (2013). how may teacher evaluation have an impact on professional development? a multilevel analysis. teaching and teacher education, 36, 1-11. https://doi.org/10.1016/j.tate.2013.06.011
herbert, s., & rainford, m. (2014). developing a model for continuous professional development by action research. professional development in education, 40(2), 243-264. https://doi.org/10.1080/19415257.2013.794748
jailani, m. s. (2014). guru profesional dan tantangan dunia pendidikan. al-ta'lim, 21(1), 1-9. https://doi.org/10.15548/jt.v21i1.66
jayatilleke, n., & mackie, a. (2013). reflection as part of continuous professional development for public health professionals: a literature review. journal of public health (united kingdom), 35(2), 308-312. https://doi.org/10.1093/pubmed/fds083
kabadayi, a. (2016). a suggested in-service training model based on turkish preschool teacher's conceptions for sustianable development. journal of teacher education for sustainability, 18(1), 5-15. https://doi.org/10.1515/jtes-2016-0001
kadarwati, a. (2016). the improvement of teaching-learning quality through academic supervisison with a classroom visit technique. jurnal studi sosial, 1(2), 103-120.
kulshersta, a. k., & pandey, k. (2013). teacher training and professional competencies. professional competencies, 1(4), 29-33.
kunter, m. (2013). motivation as an aspect of professional competence: research findings on teacher enthusiasm. cognitive activation in the mathematics classroom and professional competence of teachers, 273–289. https://doi.org/10.1007/978-1-4614-5149-5_13
lauermann, f., & könig, j. (2016). teachers' professional competence and well being: understanding the links between general pedagogical knowledge, self-efficacy and burnout. learning and instruction, 45, 9–19. https://doi.org/10.1016/j.learninstruc.2016.06.006
law of the republic of indonesia no. 14 of 2005 on teacher and lecturer. (2005).
law of the republic of indonesia no. 20 of 2003 on the national education system. (2003).
looney, j. (2011). developing high-quality teachers: teacher evaluation for improvement. european journal of education, 46(4), 440–455. https://doi.org/10.1111/j.1465-3435.2011.01492.x
mahmud, m. (2011). metode penelitian pendidikan. pustaka setia.
mak, b., & pun, s. h. (2014). cultivating a teacher community of practice for sustainable professional development: beyond planned efforts. teachers and teaching, 21(1), 4-21. https://doi.org/10.1080/13540602.2014.928120
mudlofir, a. (2012). pendidik profesional. pt raja grafindo persada.
mukminan, m. (2008). menuju program studi pendidikan geografi fise uny yang unggul. jurnal geomedia, 2(6), 1-12. https://doi.org/10.21831/gm.v6i2.14538
muryadi, a. d. (2017). model evaluasi program dalam penelitian evaluasi. jurnal ilmiah penjas, 3(1), 1-16. retrieved from http://ejournal.utp.ac.id/index.php/jip/article/view/538
niemi, h., nevgi, a., & aksit, f. (2016). active learning promoting student teachers' professional competences in finland and turkey. european journal of teacher education, 39(4), 471–490. https://doi.org/10.1080/02619768.2016.1212835
nugroho, d. h. (2013). strategi pembelajaran geografi. ombak.
regulation of the minister of national education no. 16 of 2007 on teacher qualification and competency standards. (2007).
regulation of the minister of state apparatus empowerment and bureaucratic reform no. 16 of 2009 on the teacher functional positions and credit score. (2009).
rohmah, w. (2016). upaya meningkatkan pengembangan keprofesian berkelanjutan dalam peningkatan profesionalisme guru. seminar nasional pendidikan. universitas muhammadiyah surakarta.
semradova, i., & hubackova, s. (2014). responsibilities and competences of a university teacher. procedia social and behavioral sciences, 159, 437–441. https://doi.org/10.1016/j.sbspro.2014.12.403
shagrir, l. (2012). bagaimana proses evaluasi mempengaruhi pengembangan profesional lima guru di pendidikan tinggi. jurnal beasiswa pengajaran dan pembelajaran, 12(1), 23-35.
smylie, m. a. (2014). teacher evaluation and the problem of professional development. teacher evaluation, 26(2), 97-111.
susan, m. (2012). sustainable professional development. district administration, 48(10), 36-41.
uerz, d., volman, m., & kral, m. (2018). teacher educators' competences in fostering student teachers' proficiency in teaching and learning with technology: an overview of relevant research literature. teaching and teacher education, 70, 12–23. https://doi.org/10.1016/j.tate.2017.11.005
visković, i., & jevtić, a. v. (2017). development of professional teacher competences for cooperation with parents. early child development and care, 187(10), 1569–1582. https://doi.org/10.1080/03004430.2017.1299145
zubaedi. (2011). desain pendidikan karakter. kencana.

copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online)
reid (research and evaluation in education), 7(1), 2021, 66-77 available online at: http://journal.uny.ac.id/index.php/reid
evaluation of reading pleasure character in smpn 19 and smp adhyaksa 1 of jambi city
tanti1; dwi agus kurniawan2*; muhammad sofyan zain2; febrina rosa winda2; rini siski fitriani2
1universitas islam negeri sulthan thaha saifuddin jambi jl. arif rahman hakim no.
111, simpang iv sipin, telanaipura, kota jambi, jambi 36361, indonesia
2universitas jambi
jl. raya jambi muara bulian km. 15, mendalo indah, jambi luar kota, jambi 36361, indonesia
*corresponding author. e-mail: dwiagus.k@unja.ac.id

article info
article history
submitted: 18 october 2020
revised: 22 june 2021
accepted: 29 june 2021
keywords
reading pleasure; junior high school; character

abstract
this study was conducted to determine the reading pleasure of junior high school students through four indicators. the sample comprised 281 students from grades vii, viii, and ix in two junior high schools, selected by purposive sampling. the research used an explanatory design. the quantitative instrument was a reading pleasure questionnaire consisting of 40 statements, while the qualitative instrument took the form of interviews with ten students, four teachers, and two heads of the library. data were collected by survey (field research). the results indicate that general attitudes towards reading are on the neutral scale, reading preferences are on the neutral scale, the effects of reading on ability are on the agree scale, and students' negative views of reading are on the disagree scale. the interviews with teachers show that teachers have reminded students always to read books outside face-to-face activities in class; the interviews with the heads of the library show that junior high school students have high enthusiasm for reading and borrowing books in the library; and the interviews with male and female students show that female students have better reading pleasure than male students.

this is an open access article under the cc-by-sa license.
https://creativecommons.org/licenses/by-sa/4.0/

how to cite:
tanti, t., kurniawan, d., zain, m., winda, f., & fitriani, r. (2021). evaluation of reading pleasure character in smpn 19 and smp adhyaksa 1 of jambi city. reid (research and evaluation in education), 7(1), 66-77. doi:https://doi.org/10.21831/reid.v7i1.35147

introduction
education in indonesia prioritizes the formation of character, attitudes, and values in society so that students have a sense of nationalism and a desire to compete on the international stage (sujana, 2019, p. 31). in reality, students' character in everyday life does not yet show a good adoption of the national character. according to purnomo (2014, pp. 74–75), the concept of character education in indonesia is already beneficial, but the problem of poor student character lies in the process of transferring values. whether students' character at school is good or not depends heavily on the teacher, who is both a medium for transferring values and a role model for students, and on parents, who as supervisors and mentors educate their children's character more intensively at home (purnomo, 2014, pp. 73–74). also, according to ramdhani (2014, pp. 34–35), a person's character is influenced by interactions in everyday life. students' character therefore needs supervision by teachers and parents so that students do not fall into the wrong company.

according to setiawan (2013, p. 55), character is a way of thinking and behaving in social and everyday life. the pleasure of reading is a character that is closely related to academic activities.
reading pleasure is understood as the habit of reading various materials that benefit the reader (hasan, 2010, p. 10). thus, the character of reading pleasure is a habit of liking reading activities that is manifested through ways of thinking and behaving in everyday life so that it benefits oneself. this character is one of the 18 characters applied at the basic education level (ningsih et al., 2016, p. 231). the application of character education is strengthened by hasan (2010) through a book of training materials for the development of national character and culture education published by the ministry of national education of the republic of indonesia. this book is also the basis for this research because, according to the book, the character of reading pleasure needs to be applied in learning in junior high schools; besides, reading activities are never separated from learning activities.

in academic activities, reading is an exercise that is needed and always present (acheaw, 2016, p. 214). reading activities that are not limited to fiction and non-fiction show that a student has good reading pleasure (clark & rumbold, 2006, p. 7). according to alexander and jarman (2018, p. 78), reading pleasure reflects a person's mood and personal preferences. mood is determined by attitudes (elen et al., 2013, p. 922) and by views or perceptions of something (zadra & clore, 2011). meanwhile, the personal preferences examined in this study are students' reading preferences. general attitudes towards reading, reading preferences, students' views on the effects of reading on their abilities, and negative views on reading are therefore the indicators that need to be explored in order to find out how much junior high school students enjoy reading.
in today's era, obtaining information is an obligation so as not to be left behind by others, and one way to obtain information is by reading (rahadian et al., 2014, p. 28). howard (2011, pp. 53–54) divides the positive impact of students' reading pleasure into educational, social, and personal perspectives. from an educational perspective, reading for pleasure helps improve literacy and thinking skills and helps young people clarify and explore career goals. from a social perspective, it helps young teens understand historical and current events, helps them develop compassion and empathy, empowers them to develop and act on their beliefs, and helps them understand the consequences of risky behavior. from a personal perspective, the benefits of pleasure reading for teens are entertainment, relaxation, reassurance, creative release, and a means of escape. meanwhile, according to ikawati (2013, p. 11), reading has no negative impact in the aspect of science, since reading is the key to basic knowledge. thus, considering that reading has no negative impact on learning activities, the character of reading pleasure needs to be improved in various aspects of learning.

improving the character of reading pleasure requires efforts from the government, schools, librarians, and the community (kasiyun, 2015, pp. 86–89). one of the efforts made by the indonesian government in the field of education and culture so that schools can apply the character of reading pleasure is the school literacy movement (gerakan literasi sekolah) program, set out in the guidebook for the school literacy movement in junior high schools (retnaningdiyah et al., 2016).
it should be underlined that the literacy program aims to make students able to understand the contents of what they read, that is, to improve students' reading skills, but in the process they need to apply the character of reading pleasure. the literacy program is carried out by making a habit of reading for 15 minutes before lessons. the government and schools feel that the literacy movement is not optimally implemented. according to widodo (2020, p. 15), the reason is that teachers in junior high schools do not understand the essence of the program, and its application has not been socialized by the related parties, even though the goal is to improve students' character. thus, for the character of reading pleasure to increase, the government needs to seriously maximize its programs for increasing the character of reading pleasure among students in schools.

apart from the government, schools play a role in increasing students' reading pleasure. according to dewayani (2018, p. 4), schools need to foster intrinsic rather than extrinsic motivation to read by providing a variety of quality reading sources, creating a school environment that is rich in reading, running various reading activity programs, facilitating student reading clubs, and having teachers who like and exemplify reading activities at school. according to syamsuri et al. (2020, p. 149), what generally results in students not having enough pleasure in reading is that schools do not make good use of the library. some things the principal can do, according to sriwahyuni (2018, pp. 175–177), are to become a role model for students by visiting the library to read, direct teachers to make use of learning activities in the library, reward students who diligently borrow books and visit the library, increase the collection of books in the library, pay attention to the convenience of the library for students, and develop librarians through seminars and upgrading activities.

increasing students' reading pleasure is not only the task of schools and the government; students' parents must also cultivate the pleasure of reading in their children. however, according to tahmidaten and krismanto (2019), the culture of turning off the television from six to nine at night and replacing it with reading activities occurs in only a small part of society. according to lilawati (2020, p. 555), a culture like this is less attractive because of the parents' level of education. in general, parents with at least a high school education are willing to implement reading activities at home, while parents who are not educated or whose education is below secondary school do not. because students' parents have different levels of education, educating parents to implement reading activities at home is very much needed today.

the role of the teacher as an educator is expected to be able to change the character of students who do not like reading into one that likes reading. this aims to improve students' ability to understand subjects and to change their attitudes in everyday life for the better. to improve the character of reading pleasure, teachers must be able to provide good role models for students to imitate and make students aware of the importance of reading (aulawi, 2012, p. 126). also, to increase students' reading pleasure, habituation must begin from an early age, when children begin to speak and understand what they are saying (artana, 2016, p. 11).
the teacher can also improve the character of reading pleasure by motivating students to realize the importance of reading (halidjah, 2011). students really need a good understanding of the concepts taught by their teachers. therefore, students must have a character that likes reading in learning activities.

previous research on reading pleasure includes the following. alexander and jarman (2018) showed that reading non-fiction in science is a good source of pleasure for students. gilbert and fister (2011, p. 490), based on a survey of academic librarians and a small number of college writing instructors, reported that students enjoy reading for pleasure much more than previous reports indicate. clark and rumbold (2006, p. 24) concluded that reading for pleasure offers many benefits and that encouraging a love of reading and good intrinsic motivation to read is a desirable goal. garces-bacsal et al. (2018) found that teachers who identify themselves as unfaithful readers can provide strategies to motivate their students to read, can enumerate strategies that promote more engaged reading, and can provide literacy learning practices that are deemed useful in promoting reading engagement.

among these similar studies, none has evaluated the character of reading pleasure in junior high school, and the previous studies used different methods and models for measuring reading pleasure. this evaluation research is therefore warranted in order to know how much junior high school students enjoy reading.
method
this research was conducted at state junior high school (sekolah menengah pertama negeri or smpn) 19 jambi city with a sample of 145 students and at smp adhyaksa 1 jambi city with a sample of 136 students from classes vii, viii, and ix, so the total sample collected from the two schools was 281 students. the purposive sampling technique used in this study follows gay et al. (2012, p. 141), who describe purposive sampling as a technique that relies on the researchers' knowledge and experience of the research sample, which is believed to represent the population. the schools were chosen because the number of students was adequate and the students' learning resources were sufficient.

this research is a mixed-methods study with an explanatory sequential design, carried out in two stages: quantitative data collection and analysis, with the results of the quantitative analysis formulated first, followed by qualitative data collection and then interpretation of the results of the study.

the qualitative data collection instrument consisted of interview sheets that were used to obtain in-depth data about the reading pleasure of students in junior high schools. the interviews were semi-structured (kumar, 2011, p. 145). the researchers conducted interviews at smpn 19 jambi city and smp adhyaksa 1 jambi city, interviewing two teachers per school (four teachers in total), one head of library administrators per school (two in total), and five students per school (ten students in total). teachers were interviewed because they interact with and observe students in class, and library administrators because they manage library administration; the ten students were selected randomly.

the quantitative data collection instrument used to determine students' reading pleasure in this study was a questionnaire consisting of four indicators: general attitudes towards reading, student reading preferences, the effect of reading on students' abilities, and students' negative views of reading. the questionnaire was adapted from ögeyik & akyay (2009, pp. 74–76) and contains 40 statements. it uses a five-point scale: strongly agree is given a score of 5, agree 4, neutral 3, disagree 2, and strongly disagree 1. the distribution of statements per indicator is shown in table 1.

table 1. indicators and distribution statements of reading pleasure
indicators                                    num. of statement items                           total of statement items
general attitude toward reading               1, 6, 9, 17, 18, 20, 25, 26, 39, 40               10
student reading preferences                   2, 4, 7, 12, 13, 15, 19, 22, 24, 29, 31, 34, 38   13
the effect of reading on students' abilities  5, 8, 11, 16, 21, 27, 30, 32, 33, 36              10
students' negative views of reading           3, 10, 14, 23, 28, 35, 37                         7
total items                                                                                     40

table 2. interval of general attitude toward reading
interval     category
10 – 18      strongly disagree
18.1 – 26    disagree
26.1 – 34    neutral
34.1 – 42    agree
42.1 – 50    strongly agree

table 3.
interval of student reading preferences
interval       category
13 – 23.4      strongly disagree
23.5 – 33.8    disagree
33.9 – 44.2    neutral
44.3 – 54.6    agree
54.7 – 65      strongly agree

table 4. interval of the effect of reading on students' abilities
interval     category
10 – 18      strongly disagree
18.1 – 26    disagree
26.1 – 34    neutral
34.1 – 42    agree
42.1 – 50    strongly agree

table 5. interval of students' negative views of reading
interval       category
7 – 12.6       strongly disagree
12.7 – 18.2    disagree
18.3 – 23.8    neutral
23.9 – 29.4    agree
29.5 – 35      strongly agree

the quantitative data were processed with descriptive statistics, which analyze data by describing the data collected as they are, without intending to make general conclusions or generalizations (muchson, 2017). in this study, data were processed with ibm spss statistics. the percentage of respondents is then computed for each category according to the predetermined intervals for each indicator, as presented in table 2, table 3, table 4, and table 5.

findings and discussion
in this study, the researchers examined the level of reading pleasure of junior high school students using four indicators: general attitudes towards reading, students' reading preferences, the effect of reading on students' abilities, and students' negative views of reading. the results of data analysis for the general attitude indicator are shown in table 6.

table 6. descriptive statistics of general attitude toward reading
no.  statistical description  value
1.   std. deviation           4.93
2.   mean                     33.5
3.   mode                     32
4.   median                   34
5.   max                      46
6.   min                      20
7.   total                    281

based on table 6, the scores of the 281 respondents, all of which produced valid data, have a minimum of 20 and a maximum of 46, with a mean of 33.5, a median of 34, a mode of 32, and a standard deviation of 4.93. to find out students' tendency to answer strongly disagree, disagree, neutral, agree, or strongly agree, interval scores are needed to differentiate them, as shown in table 7.

table 7. general attitude toward reading
interval     category            total   %
10 – 18      strongly disagree   0       0
18.1 – 26    disagree            29      10.3
26.1 – 34    neutral             136     48.3
34.1 – 42    agree               108     38.4
42.1 – 50    strongly agree      8       2.8
total                            281     100

based on the calculated statistical results, 48.3% of students answered neutral, 38.4% agreed, 10.3% disagreed, 2.8% strongly agreed, and 0% strongly disagreed. the highest percentage is held by neutral answers, namely 48.3%, although agree answers are also quite high at 38.4%. the largest group of students is therefore neutral, with the distribution leaning towards agreement, so some students do have a positive attitude. even so, in percentage terms the students' attitude is neutral. comparing the mode and the median shows that mode < median, so the most frequent score, namely 32, is below the middle value of the data but still falls in the neutral interval, 26.1–34.
comparing the mean and the median shows that median > mean, so the students' average score is below the middle value of the data but still falls in the neutral interval, 26.1–34. with a standard deviation of 4.93, the general attitude towards reading has the narrowest data distribution of the four indicators. from the data analyzed, it can be concluded that students' general attitude towards reading is still undecided, because it falls in the neutral interval. according to khir et al. (2019, p. 94), attitudes towards reading can be influenced by motivation from within and outside oneself. the influence from within comes from the experience of reading, while the influence from outside comes, for example, from the teacher. according to nootens et al. (2019, p. 9), attitudes towards reading in junior high school students are not as good as when students are still in elementary school. this cannot be denied because the higher the school level, the higher the level of reading skills students must master (kholiq & luthfiyati, 2018, p. 3).

after obtaining the general attitudes towards reading, the researchers analyzed the reading preference indicator, which shows students' preferences for various types of reading. the processed data are presented in table 8.

table 8. descriptive statistics of students reading preferences
no.  statistical description  value
1.   std. deviation           6.82
2.   mean                     42.7
3.   mode                     39
4.   median                   43
5.   max                      58
6.   min                      18
7.   total                    281

based on table 8, the scores of the 281 respondents, all of which produced valid data, have a minimum of 18 and a maximum of 58, with a mean of 42.7, a median of 43, a mode of 39, and a standard deviation of 6.82.
to find out students' tendency to answer strongly disagree, disagree, neutral, agree, or strongly agree, interval scores are needed to differentiate them, as shown in table 9.

table 9. students reading preferences
interval       category            total   %
13 – 23.4      strongly disagree   4       1.4
23.5 – 33.8    disagree            13      4.6
33.9 – 44.2    neutral             153     54.4
44.3 – 54.6    agree               99      35.2
54.7 – 65      strongly agree      12      4.3
total                              281     100

based on the calculated statistical results, 54.4% of students answered neutral, 35.2% agreed, 4.6% disagreed, 4.3% strongly agreed, and 1.4% strongly disagreed. neutral answers hold the highest percentage at 54.4%, and agree answers hold the second-highest at 35.2%. most students therefore answered neutral, and some agreed. because the neutral percentage is above 50%, it can be concluded that students are still neutral about their reading preferences. comparing the mean and the median shows that the median is greater than the mean, but the mean still falls in the neutral interval, 33.9–44.2. comparing the mode and the median shows that the mode is smaller than the median, but the mode also still falls in the neutral interval, 33.9–44.2. the standard deviation of 6.82 is the highest among the indicators, so the distribution of reading preference data is the widest compared to other indicators.
thus, students' answers on reading preferences are the most varied, because many students have different opinions. this is also evidenced by the minimum value of 18 and the maximum of 58. even so, it can still be concluded that overall student reading preferences are neutral, that is, students are still undecided about prioritizing reading. reading preferences are currently starting to shift from conventional to digital, so school libraries must be ready for this (munandar & irwansyah, 2019, p. 95). almost all types of reading are now moving from conventional models to digital models, so students' reading preferences have begun to change (singer & alexander, 2017, p. 12). the present research differs from those studies: we looked at the type of reading in general, for example scientific reading or novels, whether digital or printed, and did not differentiate between digital and print.

table 10. descriptive statistics of the effect of reading on students' abilities
no.  statistical description  value
1.   std. deviation           5.92
2.   mean                     37.7
3.   mode                     36
4.   median                   37
5.   max                      50
6.   min                      20
7.   total                    281

based on table 10, the scores of the 281 respondents, all of which produced valid data, have a minimum of 20 and a maximum of 50, with a mean of 37.7, a median of 37, a mode of 36, and a standard deviation of 5.92. to find out students' tendency to answer strongly disagree, disagree, neutral, agree, or strongly agree, interval scores are needed to differentiate them, as shown in table 11.

table 11.
the effect of reading on students' abilities
interval     category            total   %
10 – 18      strongly disagree   0       0
18.1 – 26    disagree            8       2.8
26.1 – 34    neutral             71      25.5
34.1 – 42    agree               136     48.3
42.1 – 50    strongly agree      66      23.4
total                            281     100

based on the calculated statistical results, 48.3% of students answered agree, 25.5% answered neutral, 23.4% answered strongly agree, 2.8% answered disagree, and 0% answered strongly disagree. the highest percentage of answers is agree at 48.3%, the second-highest is neutral at 25.5%, and the third-highest is strongly agree at 23.4%. together, agree and strongly agree account for 71.7% of students. these data indicate that students recognize that reading has an effect on their abilities. comparing the mode and the median shows that mode < median, so the most frequent score is smaller than the middle value, but the mode still falls in the agree interval, 34.1–42. comparing the mean and the median shows that mean > median, so the students' average score is greater than the middle value, and the mean also falls in the agree interval, 34.1–42. the standard deviation shows that the data distribution is the second highest compared to other indicators, which can also be seen from the minimum value of 20 and the maximum of 50. from the statistical results, it can be concluded that students agree that reading affects their ability.

table 12. descriptive statistics students' negative views of reading
no.  statistical description  value
1.   std. deviation           5.22
2.   mean                     16.5
3.   mode                     19
4.   median                   17
5.   max                      28
6.   min                      7
7.   total                    281

based on table 12, the scores of the 281 respondents on the negative view indicator, all of which produced valid data, have a minimum of 7 and a maximum of 28, with a mean of 16.5, a median of 17, a mode of 19, and a standard deviation of 5.22. to find out students' tendency to answer strongly disagree, disagree, neutral, agree, or strongly agree, interval scores are needed to differentiate them, as shown in table 13.

table 13. students' negative views of reading
interval       category            total   %
7 – 12.6       strongly disagree   64      22.8
12.7 – 18.2    disagree            110     39.3
18.3 – 23.8    neutral             78      27.6
23.9 – 29.4    agree               29      10.3
29.5 – 35      strongly agree      0       0
total                              281     100

based on the calculated statistical results, 39.3% of students answered disagree, 27.6% answered neutral, 22.8% answered strongly disagree, 10.3% answered agree, and 0% answered strongly agree. the highest percentage is disagree at 39.3%, the second-highest is neutral at 27.6%, and the third-highest is strongly disagree at 22.8%. thus, in percentage terms, students do not agree with having a negative view of reading. comparing the mean and the median shows that median > mean, and both fall in the disagree interval, 12.7–18.2. comparing the mode and the median shows that mode > median; the mode of 19 lies just inside the neutral interval, 18.3–23.8, while the median remains in the disagree interval. the standard deviation shows that the data distribution is the third-highest compared to other indicators, which can also be seen from the minimum of 7 and the maximum of 28. from the statistical results, students do not agree with having a negative view of reading.
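the interval categories used for the four indicators follow from the number of statements per indicator and the 1 to 5 scoring. the following is a minimal python sketch, not the authors' spss procedure: the function names `category_bounds` and `categorize` are hypothetical, and the equal-width rule is an assumption inferred from the published intervals. it reproduces the category boundaries and classifies the mean scores reported in the descriptive statistics tables.

```python
# five response categories, ordered from lowest to highest total score
LABELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]


def category_bounds(n_items, low=1, high=5):
    """Return the upper bound of each category for an indicator with n_items statements.

    Assumption: the possible score range [n_items * low, n_items * high]
    is split into five equal-width intervals.
    """
    minimum = n_items * low
    maximum = n_items * high
    width = (maximum - minimum) / len(LABELS)
    return [round(minimum + width * (i + 1), 1) for i in range(len(LABELS))]


def categorize(score, n_items, low=1, high=5):
    """Map a total score (or a mean of total scores) to its category label."""
    if not n_items * low <= score <= n_items * high:
        raise ValueError("score outside the possible range for this indicator")
    for label, upper in zip(LABELS, category_bounds(n_items, low, high)):
        if score <= upper:
            return label
    return LABELS[-1]


# number of statements per indicator and the reported mean score
indicators = {
    "general attitude toward reading": (10, 33.5),
    "student reading preferences": (13, 42.7),
    "effect of reading on students' abilities": (10, 37.7),
    "students' negative views of reading": (7, 16.5),
}

for name, (n_items, mean) in indicators.items():
    print(name, category_bounds(n_items), categorize(mean, n_items))
```

under these assumptions the four reported means are classified as neutral, neutral, agree, and disagree respectively, in line with the conclusions drawn in the text. the same function places a score of 19 on the seven-item indicator in the neutral category.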
furthermore, the researchers conducted interviews with four teachers from two schools, two heads of library, and six students, with the following results.

teacher interview

interviews with four junior high school teachers were conducted to find out whether the teachers made an effort to instruct students to read and to use school books as scientific reference material. the researchers first asked whether, after the learning activities, students were instructed to reread independently at home the material that had been taught. all of the teachers answered yes: after learning, the students were instructed to reread the material at home so that they could understand more about what had been taught. the researchers then asked whether students have more than one reading source. all of the teachers again answered yes. students have one or more books as a reading source because the school provides one textbook, and the rest are borrowed as additional references. next, the researchers asked whether students present theories based on clear sources during discussion forums. all of the teachers answered that most of the students had clear sources, namely the available reading books and, for some, the internet. finally, the researchers asked, do you require students to read books? the teacher replied that he required students to read books at home. based on the results of the interviews with junior high school teachers, the teachers instruct students to review independently at home the material that has been taught, which shows an effort by the teachers to keep students reading at home.
students also use the same textbooks as reading sources during the learning activities. the teachers also allow books from various sources, as well as reading sources obtained from the internet. it can be concluded that the teachers make efforts to support students in enjoying reading.

head of library interview

interviews with the heads of library were intended to determine how often books are borrowed each day, what students do in the library, and why students come to the library. the researchers interviewed two heads of library, one from a private and one from a public junior high school. when asked whether many students borrow books every day, the head of the public junior high school (smp negeri) library said that a lot of books were borrowed, about 50 reading books, while the head of the private junior high school library also answered that many students borrow books every day. when asked what students generally do in the library, the head of the public junior high school library replied that students usually use the library to read and borrow books, while the head of the private junior high school library answered that some students come just to sit around, but most students come and read books. when asked whether students mostly come to the library because they are told to by their teachers or of their own accord, the two heads of library gave a similar answer: some students are asked by their teachers to go to the library, but most come of their own accord.

it can be concluded that the number of library users is large, which shows that students' enthusiasm for reading is very high. students generally come to the library to read and borrow books; some are sent by their teachers, and others come of their own accord.

student interview

interviews with ten students were intended to determine the level of students' reading pleasure.
the researchers first asked, do you often read books before learning starts, and why do you do that? of the ten students, nine often read and one rarely read before learning began. the students who answered often said that they read to increase their knowledge and to improve their ability in their subject areas, while the student who answered rarely said that there were other activities which he thought were more important. then the researchers asked, do you like reading more than one subject book? all students answered that they liked reading more than one subject book. then the researchers asked, do you use the library as a source of reading material? of the ten students, six used the library as a place to find reading material while four did not, so the researchers asked why. students who use the library as a place to find reading material said that libraries are fun because they can focus there, the atmosphere is comfortable, and the books are interesting. meanwhile, students who did not use the library said that the library was far from home, that they were afraid of being fined for returning books late, that they were afraid of losing the books, and that they were happier looking for reading material on the internet. based on the results of the interviews with ten students, students read quite often. students read books from school and internet sources.
four students did not use the library as a place to find reading material because of their fears about using the library and the difficulty of getting to it. it can be concluded that most students have a positive view of reading and enjoy reading.

conclusion

the results of this study indicate that the reading pleasure of junior high school students is quite good, although still not optimal, in terms of four indicators. students' attitudes towards reading and their reading preferences are still neutral, their views on the effect of reading on their abilities are fairly good, and the results for negative views of reading show that students do not regard reading as a negative activity. the interviews with students indicate that students have positive views of reading as a way to develop their abilities in learning and that they consider reading a very important activity. the teachers' promotion of reading pleasure at school is good because they make an effort to instruct students to read books at home at the end of each lesson. moreover, based on the interviews with the heads of library, students are enthusiastic and happy to read and borrow books from the library of their own accord. this is supported by the statements of students who frequently visit the library, and it is evidenced by students who actively read and borrow books from the library and who take the initiative to use more than one source book in learning. the role of teachers and adequate school facilities such as libraries are supporting factors in cultivating a fondness for reading among junior high school students.

copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online)
reid (research and evaluation in education), 7(1), 2021, 46-56
available online at: http://journal.uny.ac.id/index.php/reid

evaluation of learning process: knowledge of ict integration among preservice english language teachers

dyah setyowati ciptaningrum; nur hidayanto pancoro setyo putro; nila kurnia sari*; nurqadriyanti
hasanuddin
universitas negeri yogyakarta
jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
*corresponding author. e-mail: nilakurniasari@uny.ac.id

article info
article history: submitted: 06 march 2020; revised: 28 june 2021; accepted: 29 june 2021
keywords: technological and pedagogical content knowledge; tpack; preservice teacher

abstract
this research aims at investigating the knowledge of ict integration among the students of the english language education study program at a particular university in yogyakarta. by employing the quantitative method, this ex-post facto research involved 70 english department students who had taken microteaching as respondents in a survey to gather the data using a questionnaire. the questionnaire was taken from the model developed by one of the researchers based on the tpack framework. there are five domains in the questionnaire. four domains measure tpack perceptions: technological knowledge (tk), technological content knowledge (tck), technological pedagogical knowledge (tpk), and technological pedagogical content knowledge (tpack). one domain measures the pre-service teachers' perceptions of their ict-related learning experiences. demographic questions are included to identify the characteristics of the respondents in order to understand gender differences or relationships between teachers who have access to technologies at home and those who do not. the closed-ended questions applied a five-point likert-type scale ranging from 1=strongly disagree to 5=strongly agree. in addition, the questionnaire contains one open-ended question in the tpack domain. descriptive analysis and manova were conducted for the quantitative data analysis. the result of this study showed that the pre-service students possess a high self-confidence in the application of tk (technological knowledge), tpk (technological pedagogical knowledge), and tck (technological content knowledge). however, their tpack appears to need further development. the results from manova showed that the pre-service teachers' ict-related learning experiences significantly differed by the students' gender, with male students reporting more experiences than female students.

this is an open access article under the cc-by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

how to cite: ciptaningrum, d., putro, n., sari, n., & hasanuddin, n. (2021). evaluation of learning process: knowledge of ict integration among pre-service english language teachers. reid (research and evaluation in education), 7(1), 46-56. doi:https://doi.org/10.21831/reid.v7i1.30521

introduction

ict has the potential to assist learners in the process of acquiring english language proficiency. for example, it can bring them immersion-like language learning experiences. ict also has the tools to provide foreign language learners with authentic language learning materials, authentic cultural context, and meaningful human interactions which can enhance motivation for language practice and the development of students' language skills (golonka et al., 2014; kern et al., 2017). in addition, ict has made the customization of learning possible (ahmed et al., 2020), so it also offers a customized language learning experience. another advantage of using ict in english language teaching (elt) is that students are facilitated to acquire the 21st century learning skills. these skills become an important aspect of employability since what companies need from their workers today is the ability to be flexible and adaptable; to have initiative and self-direction; to show social and cross-cultural skills; to be productive and accountable; and to show leadership capacity and responsibility (keengwe, 2013; koh et al., 2015). ict needs to be used as a tool to foster meaningful learning where learners are engaged in a critical thinking process (bryan & wang, 2013).

despite all the advantages, it is still quite a challenge for educators to fully optimize the application of technology in practice. in fact, the use of ict in the classroom has been criticized as being too focused on transferring information as opposed to helping learners construct knowledge (turgut, 2017). similarly, research by hutchison and reinking (2011) and tour (2015) reported that in practice, language teachers have not been fully utilizing technology to reinforce the delivery of instruction and the application of curriculum, and that the use of technology in teaching language has been superficial. a survey study on k-12 foreign language teachers in the u.s. by rosetta stone and project tomorrow (2013) revealed that the majority of the teachers had been incorporating technology in their classroom. however, up to 88% of them had been using technology only for searching for resources on the internet. this is far from realizing the full potential of ict in the foreign language class because, as ding et al. (2019) pointed out, ict may present a 'unique opportunity' for language learners to learn cultures and 'interpersonal communication'. thus, the use of ict in language class should be more than just for accessing and presenting information.
the potential of ict in elt can only be realized if foreign language teachers/instructors can use ict well (alkamel & chouthaiwale, 2018; mei et al., 2018; taopan et al., 2020). it means that merely knowing how to operate ict hardware and software is inadequate for efl teachers. this knowledge needs to be connected with curriculum and good pedagogy (raygan & moradkhani, 2020; tondeur et al., 2013). as turgut (2017, p. 4) pointed out, it is necessary for teachers and preservice teachers to be well-trained in the use of technology in regard to the following domains: '...knowledge, skills, and attitudes'. teachers' ability to design learning environments with authentic learning tasks that utilize ict as a tool, where students are engaged in collaborative activities that address individual students' learning styles, needs, and interests, is essential to successful ict integration (majumdar, 2015).

the indonesian ministry of education has mentioned that indonesian teachers, including english language teachers, need to integrate ict in the learning and teaching process (regulation of the minister of national education no. 16 of 2007; regulation of the minister of national education no. 41 of 2007; regulation of the minister of national education no. 78 of 2009). if the ministry of national education requires indonesian teachers to integrate ict in the learning and teaching process, then the role of efl pre-service teacher education becomes crucial as it serves as the initial and primary source of efl teachers' knowledge. efl teacher education programs need to extend efl teachers' knowledge base to incorporate a set of knowledge on how to link technology with the nature of the english language as the subject matter and with the elt strategies that are suitable for their classroom and school context (ahmadi, 2018; gönen, 2019). what efl teachers learned during their pre-service study would influence the way they teach as in-service teachers.
previous research revealed that there is a correlation between ict knowledge and how preservice teachers apply technology in their classroom. as reported by kartchava and chung (2015) and turgut (2017), preservice teachers who were highly trained in the use of ict to facilitate classroom activities are more inclined to use technology to reinforce their own classes. this finding highlights the importance of ensuring that preservice teachers receive the best ict education prior to entering the workforce. on this note, teachers who believe that the learning process should be learner-centered were also found to be more receptive towards incorporating technology into their classes (deng et al., 2014).

have indonesian efl pre-service teachers, i.e. the indonesian efl pre-service teachers at this particular university in yogyakarta, been prepared to teach using ict? there have been a number of studies that measure the development of teachers' tpack (chai et al., 2010; ersanli, 2016; koehler et al., 2007; kurt et al., 2014; kwangsawad, 2016; öz, 2015). however, little is known about the tpack level of indonesian efl pre-service teachers. this study aims at examining the perception of indonesian efl pre-service teachers at this particular university in yogyakarta of their knowledge in applying ict-mediated instruction through the lens of mishra and koehler's model of tpack. their level of tpack is measured and observed. the result of the study will inform the development of a particular instructional design to assist efl pre-service teachers in designing ict-mediated instruction. this study will also expand knowledge on pre-service teachers' tpack in developing countries through a particular case.
while the introduction section gives an overview of this research, this section reviews previous research on teachers' tpack. this literature review aims to establish the relationship of this research to existing theory and other related research. it also informs the rest of the design of this research.

krashen's (1985) input hypothesis has been the foundation for recognizing the contribution of ict to language learning. in this theory, interaction provides opportunities for second language learners to understand messages (comprehensible input) and so develop their second language (brown, 2000). swain and lapkin (1995) criticised this input hypothesis by arguing that output was central in developing learners' competence in the target language. the output hypothesis holds that language learners realize the gaps in their linguistic knowledge based on the external and internal feedback on the language that they produce. pica (1994) stated that in the process of interaction, language learners have various opportunities to receive comprehensible input and to test their linguistic knowledge based on the interlocutors' feedback. this kind of interaction, which provides comprehensible input and output for language learners, is present in the total-immersion situation (goodwin, 1991). in this situation, language learners receive the language exposure needed to develop their linguistic knowledge, and they are forced to use the language actively if they want to survive in this environment because their surrounding context is written in the target language and the people speak the target language.

another concept that underpins the application of ict in language learning is the collaborative learning principle, which has its roots in vygotsky's sociocultural theory of mind. knowledge acquisition, in this sociocultural learning theory, is not an individual pursuit; its development is mediated by symbolic means (lantolf, 1994).
language and technology are two examples of symbolic means, as they allow humans to organize their mental and physical activities. learning occurs as a result of interaction between the learner's biological factors and sociocultural factors in the learner's environment. social interaction, then, has a significant role in creating an environment in which to learn language. collaborative learning provides a means toward students' cognitive development since assistance from other people who are more skilful and experienced is available (warschauer, 1997).

since the 1980s, a rich body of research has been conducted on the role of ict in language learning. the early uses of call were mainly limited to drill and practice activities, and with the rapid growth of technologies, communicative and integrated uses of call emerged (liu et al., 2002). as new technologies have given way to the widespread use of web 2.0 tools, recent studies (e.g., amir et al., 2011; kılıçkaya & krajka, 2012; mirzaei et al., 2016; wang et al., 2014) have also confirmed ict to be effective in influencing language learning.

tpack, which stands for technological pedagogical content knowledge, is a framework developed by mishra and koehler (2006). this framework is an extension of shulman's (1986) concept of pedagogical content knowledge (pck). as teaching practice developed, shulman's (1986, 1987) notion of pedagogical content knowledge as teachers' knowledge base needed to be expanded to include knowledge of ict use in education in order to address the constraints on ict integration mentioned earlier. thus, mishra and koehler (2006) added one more dimension to the framework, technological knowledge, and found that further types of knowledge exist at the intersections of the three domains of knowledge, as presented in figure 1.

figure 1. tpack framework by mishra and koehler (2006)

technology knowledge (tk) refers to the skills to use technology. teachers need to show the ability to use standard technology like the black/white board, textbooks, and visual aids, or new technology like the internet and digital video. included in this knowledge are teachers' skills to operate computer systems and hardware, and to use software tools like word processors, powerpoint, spreadsheets, web browsers, e-mail, and instant messaging. digital technology is continuously changing, so it is imperative for teachers to have the ability to keep up with and adapt to the changes in technology. in addition, teachers need to decide whether the technology supports or hinders the attainment of the purpose of the lesson (koehler et al., 2014).

technological content knowledge (tck) includes the ability to select the appropriate technology tool to deliver the subject matter, since technology can support or impede the learning of the subject matter. the nature of the ideas in the subject matter drives the selection process. tck for foreign language teachers can be defined as “the body of knowledge that teachers have about their target language and its culture and how technology is used to represent this knowledge” (van olphen, 2008, p. 113).

technological pedagogical knowledge (tpk) is the interaction between technology and pedagogy. teachers have a repertoire of teaching strategies, and they should be able to skilfully select the one that best represents the idea in the subject matter and suits the students' context or characteristics such as age, fluency/mastery level of the topic, learning style, or background knowledge. with technology, the complexity increases. teachers need to understand how technology can change teaching and learning.
there are different technology tools that can be used for a task. the selection of the appropriate tool is “based on its fitness, strategies for using the tool's affordances, and knowledge of pedagogical strategies and the ability to apply those strategies for use of technologies. this includes knowledge of tools for maintaining class records, attendance, and grading, and knowledge of generic technology-based ideas such as webquests, discussion boards, and chat rooms” (mishra & koehler, 2006, p. 1028).
technological pedagogical content knowledge (tpack) is the heart of effective teaching using technology. it requires “an understanding of how to represent concepts with technologies, pedagogical techniques that use technologies in constructive ways to teach content; knowledge of what makes concepts difficult or easy to learn and how technology can help students learn; knowledge of students' prior knowledge and theories of epistemology; and knowledge of how technologies can be used to build on existing knowledge and to develop new epistemologies or strengthen old ones” (mishra & koehler, 2008, p. 10).
using mishra and koehler's concept of tpack, van olphen (2008, p.
117) believes that meaningful technology integration in language teaching entails the following conditions: (1) an understanding of how linguistic and cultural concepts can be represented using technology, (2) educational approaches to language teaching that draw from socio-constructivist philosophies to develop students' language and cultural competence, (3) an awareness of what facilitates or hinders the acquisition of language and the development of language competence and how technology, specifically call or cmc, can revamp common problems that students ordinarily face, (4) an awareness of students' previous knowledge, and particularly a knowledge of second language acquisition and cognitive development theories, and (5) an understanding of how current and emerging technologies can be used to advance present knowledge and to develop new epistemologies and sustain previous ones.
the three domains in the tpack framework are not mutually exclusive. tseng (2016) explained that the relationship between the three components (technological, pedagogical, and content) in the theoretical framework of tpack is 'dynamic' and 'transactional' and that it is especially important for teachers to be aware of this nature. a lack of awareness of the nature of tpack may result in treating the domains as separate bodies of knowledge when, in fact, all tpack domains overlap and are interchangeable.
method
the research took place at the english language and education department at a university in yogyakarta. the initial sample included 70 pre-service teachers (fifth-semester students) who sat in micro-teaching classes from march to august 2017. a tpack survey was administered at the end of the semester to obtain a description of their level of tpack before they left for a teaching practicum at selected schools. a questionnaire set had been developed and had undergone a validity and reliability study (ciptaningrum, 2017).
this questionnaire set consists of three questions in the technological knowledge (tk) section, ten questions in the technological content knowledge (tck) section, and six questions in the technological pedagogical knowledge (tpk) section. all of these questions are closed-ended. for technological pedagogical content knowledge (tpack), there are six closed-ended questions and one open-ended question. all questions were uploaded to the google form platform. the questionnaire was pilot tested three times and revised before the final version was uploaded. at the beginning of july 2017, an invitation to participate in the survey was sent to the whole cohort through the whatsapp group of each class. of the 110 students enrolled in this subject, 70 responded, constituting a response rate of 63.6%.
the data were analyzed with descriptive statistics, applying central tendency measures to find the mean. the means were then classified using the qualitative data conversion for likert-scale questionnaires, which places the tpack level of the pre-service teachers in one of five categories: 'very poor', 'poor', 'fair', 'good', or 'very good'. then, prior to the manova test, box's test of equality of covariance matrices was conducted to check the assumption of homogeneity of covariance across the groups, using p < .001 as a criterion.
findings and discussion
findings
the mean score of each question under technological knowledge (tk) showed that the pre-service teachers are generally confident in their tk: q1 (m = 4.12, sd = 0.74), q2 (m = 3.74, sd = 1.03), and q3 (m = 3.97, sd = 0.99). different types of technology, both hardware and software, are familiar to them since opportunities to 'play around' with different technologies are available.
similarly, they also have confidence in using technology to help them build the content knowledge required of future english language teachers (tck). the mean scores for the questions in this section are: q4 (m = 3.98, sd = 0.94), q5 (m = 3.39, sd = 0.91), q6 (m = 4.02, sd = 0.85), q7 (m = 4.47, sd = 0.65), q8 (m = 4.60, sd = 0.68), q9 (m = 4.28, sd = 0.76), q10 (m = 4.21, sd = 0.83), q11 (m = 3.82, sd = 0.91), q12 (m = 4.05, sd = 0.86), and q13 (m = 4.07, sd = 0.82).
as for tpk, the data show that the pre-service teachers are also confident in their knowledge of using ict for pedagogical reasons. the mean scores for the tpk questions are: q14 (m = 4.11, sd = 0.82), q15 (m = 3.77, sd = 0.93), q16 (m = 3.72, sd = 0.89), q17 (m = 3.78, sd = 0.84), q18 (m = 3.98, sd = 0.87), and q19 (m = 3.95, sd = 0.85).
the tpack questions yield the following results: q21 (m = 3.97, sd = 0.68), q22 (m = 3.71, sd = 0.81), q23 (m = 3.82, sd = 0.72), q24 (m = 3.50, sd = 0.79), q25 (m = 3.60, sd = 0.84), and q26 (m = 3.47, sd = 0.84). this indicates that the pre-service teachers are less confident about their knowledge of effective ict integration.
in short, the pre-service teachers' tk, tck, and tpk in integrating ict into english language teaching and learning are classified as 'good', with mean scores after qualitative data conversion of 3.94, 4.08, and 3.88, respectively. however, their tpack falls under the category 'fair', with a mean score of 3.67.
next, box's test of equality of covariance matrices was run to check the assumption of homogeneity of covariance across the groups, using p < .001 as a criterion. the results showed that box's m (19.55) was not significant (p = .343), indicating that there are no significant differences between the covariance matrices.
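the summary figures in this section can be cross-checked with a few lines of arithmetic. the sketch below is a reading aid, not the authors' analysis script. it averages the item means reported above to recover the domain means, assuming every respondent answered every item, and it applies the standard identities that tie the multivariate statistics together when an effect compares two groups (so that only one nonzero eigenvalue exists), using the wilks' lambda and degrees of freedom reported for gender in table 1. the names item_means, reported_domain_means, domain_means, and two_group_manova are illustrative assumptions and do not come from the article.

```python
from statistics import mean

# item means copied from the findings (closed-ended items only)
item_means = {
    "tk": [4.12, 3.74, 3.97],
    "tck": [3.98, 3.39, 4.02, 4.47, 4.60, 4.28, 4.21, 3.82, 4.05, 4.07],
    "tpk": [4.11, 3.77, 3.72, 3.78, 3.98, 3.95],
    "tpack": [3.97, 3.71, 3.82, 3.50, 3.60, 3.47],
}

# domain means as stated in the article
reported_domain_means = {"tk": 3.94, "tck": 4.08, "tpk": 3.88, "tpack": 3.67}


def domain_means(items):
    """average the item means within each domain."""
    return {domain: mean(values) for domain, values in items.items()}


def two_group_manova(wilks_lambda, df_hypothesis, df_error):
    """derive the other multivariate statistics from wilks' lambda.

    valid only when the effect has a single nonzero eigenvalue,
    which is the case when two groups are compared.
    """
    pillai = 1.0 - wilks_lambda
    hotelling = (1.0 - wilks_lambda) / wilks_lambda
    return {
        "pillai_trace": pillai,
        "hotelling_trace": hotelling,
        "roy_largest_root": hotelling,  # equals the single eigenvalue
        "partial_eta_squared": pillai,
        "f": hotelling * df_error / df_hypothesis,
    }


if __name__ == "__main__":
    for domain, value in domain_means(item_means).items():
        reported = reported_domain_means[domain]
        print(f"{domain}: computed {value:.3f}, reported {reported:.2f}")

    # wilks' lambda = .781 with 5 and 64 degrees of freedom (gender effect)
    for name, value in two_group_manova(0.781, 5, 64).items():
        print(f"{name}: {value:.3f}")
```

because the inputs are the rounded values printed in the article, the recomputed figures can differ from the published ones by about 0.01; gaps of that size reflect rounding rather than an inconsistency in the reported results.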
subsequently, manova was conducted with gender as the independent variable and the five dimensions of knowledge of ict integration as the dependent variables. the results show a significant association between gender and the five dimensions of knowledge of ict integration. in other words, there was a statistically significant difference in the five dimensions of knowledge of ict integration based on the students' gender, f(5, 64) = 3.58, p < .01; wilks' λ = 0.781. the multivariate effect size, as presented in table 1, was estimated at .219, which implies that 21.9% of the variance in the canonically derived dependent variable, i.e., the five dimensions of knowledge of ict integration, was accounted for by gender.
table 1. results of multivariate tests
effect | statistic | value | f | hypothesis df | error df | sig. | partial eta squared
gender | pillai's trace | .219 | 3.582(b) | 5.000 | 64.000 | .006 | .219
gender | wilks' lambda | .781 | 3.582(b) | 5.000 | 64.000 | .006 | .219
gender | hotelling's trace | .280 | 3.582(b) | 5.000 | 64.000 | .006 | .219
gender | roy's largest root | .280 | 3.582(b) | 5.000 | 64.000 | .006 | .219
a. design: intercept + gender
b. exact statistic
table 2. results of tests of between-subjects effects
source | dependent variable | type iii sum of squares | df | mean square | f | sig.
gender | tk | .089 | 1 | .089 | .042 | .839
gender | tck | 34.300 | 1 | 34.300 | 1.074 | .304
gender | tpk | 1.289 | 1 | 1.289 | .095 | .758
gender | tpck | 4.629 | 1 | 4.629 | .360 | .551
gender | trle | 68.014 | 1 | 68.014 | 13.768 | .000
a closer look at the results of the tests of between-subjects effects, as presented in table 2, showed that gender has a statistically significant effect on the pre-service english language teachers' technology-related learning experiences (f(1, 68) = 13.77; p < .01), with male pre-service teachers reporting more experiences (m = 17.00, sd = 1.66) than their female counterparts (m = 14.54, sd = 2.33).
discussion
studies on teachers' learning about the use of ict (e.g., alelaimat et al., 2020; ciptaningrum, 2018; yet & noordin, 2017) conclude that ict professional development (pd) for teachers comprises standardized, one-size-fits-all programs that show teachers how to use technology hardware and software. pre-service teachers at the english language and education department at this university in yogyakarta have received training on how to use an online learning management system for learning purposes. in addition, most of them tend to use ict on a daily basis. they have also taken subjects on content and pedagogy as part of the curriculum at this institution. this explains their confidence in their tk, tck, and tpk.
tpk and, most importantly, tpack matter because, in planning their lessons, teachers should not only select the technology tools that can best represent difficult concepts, but also understand how these tools can afford pedagogical techniques that are student-centred and authentic, facilitate meaningful learning (mishra & koehler, 2008; van olphen, 2008), and develop students' english language ability, not only in reading and listening but also in speaking and writing. besides, van olphen (2008) outlined the components of tpack in the context of foreign language education (see the five conditions listed earlier). thus, the development of students' communicative skills in the target language benefits when technology is used not only to represent linguistic concepts, but also to help teachers introduce culture to their students (blake, 2013) and to support educational approaches to pedagogy that draw from socio-constructivist philosophies (hidayati, 2016; kern et al., 2017).
this tpack development needs to be the framework which underlies the english language pre-service teachers' ict pd.
regarding gender differences, there is no statistically significant difference between the male and female students in their tpack. however, in terms of technology-related learning experiences, the male students reported more learning experiences than the female students. they learn the use of technology from various sources: friends, technology courses inside and outside the university, books or magazines on technology, and workshops, seminars, or conferences on technology. most of them also commented that they learn to use new apps by themselves through youtube tutorials and that they are not afraid to make mistakes. the gender difference in technology-related learning experiences found in this research tends to support empirical studies reporting that women participate less in technology than men because of sociocultural attitudes about the role of women in society (antonio & tuffley, 2014). however, further investigation needs to be conducted to confirm this.
conclusion
pre-service teachers at this english language and education department, in a university in yogyakarta, have higher confidence in using technology hardware and software, as well as in using ict to deliver the subject matter. they also understand how technology can change teaching and learning. however, it seems that they need more assistance in learning how ict can be integrated effectively to enhance students' learning and improve their english language skills. further studies might need to confirm the findings of this study qualitatively. furthermore, studies on developing an instructional model of ict integration in english language classrooms need to be conducted before analyzing the results of implementing the model.
references
ahmadi, m. r. (2018).
the use of technology in english language learning: a literature review. international journal of research in english education, 3(2), 115–125. https://doi.org/10.29252/ijree.3.2.115
ahmed, r., al-kadi, a., & hagar, t. (2020). enhancements and limitations to ict-based informal language learning: emerging research and opportunities (r. ahmed, a. al-kadi, & t. hagar (eds.)). igi global. https://doi.org/10.4018/978-1-7998-2116-8
alelaimat, a. m., ihmeideh, f. m., & alkhawaldeh, m. f. (2020). preparing preservice teachers for technology and digital media integration: implications for early childhood teacher education programs. international journal of early childhood, 52(3), 299–317. https://doi.org/10.1007/s13158-020-00276-2
alkamel, m. a. a., & chouthaiwale, s. s. (2018). the use of ict tools in english language teaching and learning: a literature review. veda's journal of english language and literature-joell, 5(2), 29–33. http://joell.in/wp-content/uploads/2018/04/29-33-the-use-of-ict-tools-in-english-language.pdf
amir, z., ismail, k., & hussin, s. (2011). blogs in language learning: maximizing students' collaborative writing. procedia social and behavioral sciences, 18, 537–543. https://doi.org/10.1016/j.sbspro.2011.05.079
antonio, a., & tuffley, d. (2014). the gender digital divide in developing countries. future internet, 6(4), 673–687. https://doi.org/10.3390/fi6040673
blake, r. (2013). brave new digital classroom: technology and foreign language learning. georgetown university press.
brown, h. d. (2000). principles of language learning and teaching. longman.
bryan, v. c., & wang, v. x. (eds.). (2013).
technology use and research approaches for community education and professional development. igi global. https://doi.org/10.4018/978-1-4666-2955-4
chai, c. s., koh, j. h. l., & tsai, c.-c. (2010). facilitating preservice teachers' development of technological, pedagogical, and content knowledge (tpack). journal of educational technology & society, 13(4), 63–73. https://www.learntechlib.org/p/52307/
ciptaningrum, d. s. (2017). the development of the survey of technology use, teaching, and technology-related learning experiences among pre-service english language teachers in indonesia. journal of foreign languange teaching and learning, 2(2), 11–26. https://doi.org/10.18196/ftl.2220
ciptaningrum, d. s. (2018). the story of “julie”: a life history study of the learning experiences of an indonesian english language teacher in implementing ict in her classroom. in s. madya, f. hamied, w. a. renandya, c. coombe, & y. basthomi (eds.), elt in asia in the digital era: global citizenship and identity proceedings of the 15th asia tefl and 64th teflin international conference on english language teaching (pp. 495–503). routledge. https://www.taylorfrancis.com/chapters/edit/10.1201/9781351217064-71/story-julie-life-history-study-learning-experiences-indonesian-english-language-teacher-implementing-ict-classroom-ciptaningrum
deng, f., chai, c. s., tsai, c.-c., & lee, m.-h. (2014). the relationships among chinese practicing teachers' epistemic beliefs, pedagogical beliefs and their beliefs about the use of ict. journal of educational technology & society, 17(2), 245–256.
ding, a.-c. e., ottenbreit-leftwich, a., lu, y.-h., & glazewski, k. (2019). efl teachers' pedagogical beliefs and practices with regard to using technology. journal of digital learning in teacher education, 35(1), 20–39. https://doi.org/10.1080/21532974.2018.1537816
ersanli, c. y. (2016). improving technological pedagogical content knowledge (tpack) of preservice english language teachers.
international education studies, 9(5), 18–27. https://doi.org/10.5539/ies.v9n5p18
golonka, e. m., bowles, a. r., frank, v. m., richardson, d. l., & freynik, s. (2014). technologies for foreign language learning: a review of technology types and their effectiveness. computer assisted language learning, 27(1), 70–105. https://doi.org/10.1080/09588221.2012.700315
gönen, s. i̇. k. (2019). a qualitative study on a situated experience of technology integration: reflections from pre-service teachers and students. computer assisted language learning, 32(3), 163–189.
https://doi.org/10.1080/09588221.2018.1552974
goodwin, j. (1991). teaching pronunciation. in m. celce-murcia (ed.), teaching english as a second or foreign language (pp. 117–138). heinle & heinle thomson learning.
hidayati, t. (2016). integrating ict in english language teaching and learning in indonesia. jeels (journal of english education and linguistics studies), 3(1). https://doi.org/10.30762/jeels.v3i1.173
hutchison, a., & reinking, d. (2011). teachers' perceptions of integrating information and communication technologies into literacy instruction: a national survey in the united states. reading research quarterly, 46(4), 312–333. https://doi.org/10.1002/rrq.002
kartchava, e., & chung, s. (2015). pre-service and in-service english as a second language teachers' beliefs about the use of digital technology in the classroom. studies in english language teaching, 3(4), 355–383. https://doi.org/10.22158/selt.v3n4p355
keengwe, j. (ed.). (2013). research perspectives and best practices in educational technology integration. igi global. https://doi.org/10.4018/978-1-4666-2988-2
kern, r., ware, p., & warschauer, m. (2017). network-based language teaching. in second and foreign language education (pp. 197–209). springer international publishing. https://doi.org/10.1007/978-3-319-02246-8_30
kılıçkaya, f., & krajka, j. (2012). can the use of web-based comic strip creation tool facilitate efl learners' grammar and sentence writing? british journal of educational technology, 43(6), e161–e165. https://doi.org/10.1111/j.1467-8535.2012.01298.x
koehler, m. j., mishra, p., kereluik, k., shin, t. s., & graham, c. r. (2014). the technological pedagogical content knowledge framework. in handbook of research on educational communications and technology (pp. 101–111). springer new york. https://doi.org/10.1007/978-1-4614-3185-5_9
koehler, m. j., mishra, p., & yahya, k. (2007).
tracing the development of teacher knowledge in a design seminar: integrating content, pedagogy and technology. computers & education, 49(3), 740–762. https://doi.org/10.1016/j.compedu.2005.11.012
koh, j. h. l., chai, c. s., benjamin, w., & hong, h.-y. (2015). technological pedagogical content knowledge (tpack) and design thinking: a framework to support ict lesson design for 21st century learning. the asia-pacific education researcher, 24(3), 535–543. https://doi.org/10.1007/s40299-015-0237-2
krashen, s. d. (1985). the input hypothesis: issues and implications. addison-wesley longman ltd.
kurt, g., akyel, a., koçoğlu, z., & mishra, p. (2014). tpack in practice: a qualitative study on technology integrated lesson planning and implementation of turkish pre-service teachers of english. elt research journal, 3(3), 153–166. https://dergipark.org.tr/tr/download/article-file/63644
kwangsawad, t. (2016). examining efl pre-service teachers' tpack trough self-report, lesson plans and actual practice. journal of education and learning (edulearn), 10(2), 103–108. https://doi.org/10.11591/edulearn.v10i2.3575
lantolf, j. p. (1994).
sociocultural theory and second language learning: introduction to the special issue. the modern language journal, 78(4), 418–420. https://doi.org/10.2307/328580
liu, m., moore, z., graham, l., & lee, s. (2002). a look at the research on computer-based technology use in second language learning. journal of research on technology in education, 34(3), 250–273. https://doi.org/10.1080/15391523.2002.10782348
majumdar, s. (2015). emerging trends in ict for education & training. gen. asia pacific reg. iveta. https://www.stthomascollegebhilai.in/wp-content/uploads/2016/10/emergingtrendsinictforeducationandtraining.pdf
mei, b., brown, g. t. l., & teo, t. (2018). toward an understanding of preservice english as a foreign language teachers' acceptance of computer-assisted language learning 2.0 in the people's republic of china. journal of educational computing research, 56(1), 74–104. https://doi.org/10.1177/0735633117700144
mirzaei, a., rahimi domakani, m., & rahimi, s. (2016). computerized lexis-based instruction in efl classrooms: using multi-purpose lexisboard to teach l2 vocabulary. recall, 28(1), 22–43. https://doi.org/10.1017/s0958344015000129
mishra, p., & koehler, m. j. (2006). technological pedagogical content knowledge: a framework for teacher knowledge. teachers college record, 108(6), 1017–1054. https://doi.org/10.1111/j.1467-9620.2006.00684.x
mishra, p., & koehler, m. j. (2008). introducing technological pedagogical content knowledge. annual meeting of the american educational research association research on schools, neighborhoods, and communities: toward civic responsibility, 1–15.
öz, h. (2015). assessing pre-service english as a foreign language teachers' technological pedagogical content knowledge. international education studies, 8(5), 119–130. https://doi.org/10.5539/ies.v8n5p119
pica, t. (1994). research on negotiation: what does it reveal about second-language learning conditions, processes, and outcomes? language learning, 44(3), 493–527.
https://doi.org/10.1111/j.1467-1770.1994.tb01115.x
raygan, a., & moradkhani, s. (2020). factors influencing technology integration in an efl context: investigating efl teachers' attitudes, tpack level, and educational climate. computer assisted language learning, 1–22. https://doi.org/10.1080/09588221.2020.1839106
regulation of the minister of national education no. 16 of 2007 on the teacher's standard of academic qualification and competence, (2007).
regulation of the minister of national education no. 41 of 2007 on the process standard for primary and secondary educational units, (2007).
regulation of the minister of national education no. 78 of 2009 on the implementation of international-standard school on primary and secondary educational levels, (2009).
rosetta stone and project tomorrow. (2013). world language teachers and their use of technology. rosetta stone. https://resources.rosettastone.com/assets/lp/9999999999/resources/speaking-the-language-of-the-21st-century.pdf
shulman, l. s. (1986). those who understand: knowledge growth in teaching. educational researcher, 15(2), 4–14. https://doi.org/10.3102/0013189x015002004
shulman, l. s. (1987). knowledge and teaching: foundations of the new reform. harvard educational review, 57(1), 1–23.
https://doi.org/10.17763/haer.57.1.j463w79r56455411
swain, m., & lapkin, s. (1995). problems in output and the cognitive processes they generate: a step towards second language learning. applied linguistics, 16(3), 371–391. https://doi.org/10.1093/applin/16.3.371
taopan, l. l., drajati, n. a., & sumardi, s. (2020). tpack framework: challenges and opportunities in efl classrooms. research and innovation in language learning, 3(1), 1–22. https://doi.org/10.33603/rill.v3i1.2763
tondeur, j., roblin, n. p., van braak, j., fisser, p., & voogt, j. (2013). technological pedagogical content knowledge in teacher education: in search of a new curriculum. educational studies, 39(2), 239–243. https://doi.org/10.1080/03055698.2012.713548
tour, e. (2015). digital mindsets: teachers' technology use in personal life and teaching. language learning & technology, 19(3), 124–139.
https://www.lltjournal.org/item/2923
tseng, j.-j. (2016). developing an instrument for assessing technological pedagogical content knowledge as perceived by efl students. computer assisted language learning, 29(2), 302–315. https://doi.org/10.1080/09588221.2014.941369
turgut, y. (2017). tracing preservice english language teachers' perceived tpack in sophomore, junior, and senior levels. cogent education, 4(1), 1368612. https://doi.org/10.1080/2331186x.2017.1368612
van olphen, m. (2008). world language teacher education and educational technology: a look into ck, pck, and tpck. annual meeting of the american educational research association research on schools, neighborhoods, and communities: toward civic responsibility.
wang, y., chung, c. j., & hattingh, e. (2014). a blending approach in technology integrated esl writing instruction. in m. searson & m. ochoa (eds.), proceedings of site 2014--society for information technology & teacher education international conference (pp. 1139–1144). association for the advancement of computing in education (aace). https://www.learntechlib.org/primary/p/130922/
warschauer, m. (1997). computer-mediated collaborative learning: theory and practice. the modern language journal, 81(4), 470–481. https://doi.org/10.2307/328890
yet, t. s., & noordin, n. b. (2017). the use of ict among pre-service english language teachers. international journal of english language education, 5(1), 100–112. https://doi.org/10.5296/ijele.v5i1.10779
this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 6(1), 2020, 78-86
available online at: http://journal.uny.ac.id/index.php/reid
an analysis of the suitability of students' civic knowledge and disposition in the topic of citizen's rights and obligations
*dwi riyanti
institute for educational development, universitas ahmad dahlan
jl. ringroad selatan, kragilan, tamanan, banguntapan, bantul, yogyakarta 55191, indonesia
*corresponding author. e-mail: dwiriyanti.ysu@gmail.com
submitted: 10 may 2020 | revised: 26 may 2020 | accepted: 23 june 2020
abstract
civic education has been taught since primary education, but it has not had a significant impact and lacks strength and function. this is evidenced by the number of young people who have not understood the civic education topic of citizens' rights and obligations (hak dan kewajiban warga negara) or implemented it in their daily lives. thus, civic knowledge and civic disposition have not developed as expected. this leaves civic education lecturers with the major task of maximizing civic knowledge and civic disposition and linking them comprehensively. this study discussed the suitability of civic knowledge and civic disposition in the topic of “hak dan kewajiban warga negara” (citizens' rights and obligations). this study was descriptive research with a qualitative approach, conducted at universitas negeri yogyakarta (uny), universitas ahmad dahlan (uad), smkn 1 kalikajar (vocational high school), smpn 1 magelang (junior high school), and sd kembangsongo (elementary school); it focused on the civic education subject and used purposeful sampling. the data were collected through interviews and documentation. the data were then analyzed through the triangulation technique. the results show that students have not fully implemented civic knowledge and civic disposition in their daily lives.
this was attributable to the students and lecturers of civic education, whereas the topic itself met the curriculum criteria at the elementary school, junior high school, and senior high school/vocational high school levels, although not all topics were taught at these levels.
keywords: civic knowledge, civic disposition, civic education
how to cite: riyanti, d. (2020). an analysis of the suitability of students' civic knowledge and disposition in the topic of citizen's rights and obligations. reid (research and evaluation in education), 6(1), 78-86. doi:https://doi.org/10.21831/reid.v6i1.31621.
introduction
sunarso, sartono, dwikusrahmadi, and sutarini (2016, p. 6) explain the report of the sessions of bpupki and ppki, which states that education in indonesia must be able to prepare students to become citizens who have a strong commitment to maintaining the unity of the republic of indonesia, which embodies the essence of modern nationalism. this means that, in the modern era, the formation of nation and state is based on the nationalism of people who have a strong determination to build a future together with a diverse population.
a country that adheres to the concept of democracy will exhibit the development of civil society, a concept that positions society to gather its strength in order to maintain the freedom, diversity, and independence of the community in the face of state and government power. despite this independence, the two have a mutual relationship (alam, 2014, p. 196). ferguson, hume, and adam smith began to identify the concept of civil society with a civilized society oriented to the material organization (jb & darmawan, 2016, p. 40). therefore, to realize
therefore, to realize the civil society concept, a state with good civic competence is required. civic education has been taught since elementary school in indonesia. in addition, citizenship knowledge can be obtained from non-formal education by reading news in both print and electronic media. therefore, citizenship knowledge is not only theoretical but also practical, drawing directly on current events in the scope of citizenship. even though citizenship knowledge can be obtained through non-formal education, formal education still accounts for a very significant share of students' knowledge of citizenship (galston, 2007, p. 627). currently, civic education has not been able to have a significant impact; it is still not functioning properly and remains powerless, although in the reformation era civic education is expected to revitalize itself so that it can carry out its vision and mission. charles, in print, ellickson-brown, and baginda (1999, pp. 133–135), believes that the contents of civic education can be arranged in three models, namely a formal curriculum implemented in learning, an informal curriculum that can be implemented in extracurricular activities, and a hidden curriculum, such as ethical development, that can be developed in daily actions. with these three models, it is expected that students can acquire citizenship knowledge and internalize it in everyday life. according to branson (1999, pp. 8–25), there are three aspects of civic education, namely civic knowledge, civic skills, and civic disposition. one of these aspects is civic knowledge. this material/topic is the substance related to the rights and obligations that citizens should know. 
every individual must possess this knowledge because it can have a positive effect and reflects democratic values in society. described specifically, the material/topic of citizenship knowledge includes several things, namely knowledge of the structure and political system of government, national identity, free and impartial justice, the constitution in use, and the values that live in society. civic disposition is a citizenship competency that combines civic knowledge and civic skills. it is a component related to a citizen's character in the scope of democracy and can be measured through the level of citizen awareness. this includes how citizens understand their rights and obligations by complying with applicable laws, thinking critically, expressing opinions, having good morals, being responsible, being a good listener, being disciplined, and upholding human dignity (feriandi & harmawati, 2018, p. 77). it is also consistent with the purpose of national education, which is to develop students’ potential to be faithful and devoted to the almighty, noble, knowledgeable, skilled, creative, independent, and responsible citizens of a democratic country (ernawati, tsurayya, & ghani, 2019, p. 21). civic disposition in the formal curriculum can play an important role in shaping the character of students. moreover, this is supported by law no. 12 of 2012 of the republic of indonesia on higher education, article 35, paragraph 3, under which the higher education curriculum must contain compulsory subjects, one of which is civic education. therefore, through this subject, students can gain civic knowledge and civic skills so that they can form civic disposition. in addition to the formal curriculum, civic disposition can be formed through activities undertaken by students, such as ukm or unit kegiatan mahasiswa (students’ extracurricular activities) and the state defense training activities for students that have been carried out by universitas ahmad dahlan. 
fraillon, schulz, and ainley report that the iccs study of the civic education situation in five asian countries, namely indonesia, hong kong, the republic of korea (south korea), taiwan, and thailand, found that the civic knowledge of class viii students in indonesia and thailand is lower than that in the other sampled countries. in addition, there are still many traffic violators among students themselves (ainley, fraillon, & schulz, 2012, p. 3). this is in line with wardhani, who states that operasi progo of polresta jogja cited and reported 977 violators, 327 of whom were students. therefore, there are several points for educators to address in improving students' civic knowledge in order to produce a good civic disposition (wardhani, 2019). according to setiawan and suardiman (2018, p. 12), social attitude can be identified through positive or negative trust, feeling, and attitude toward a particular entity and has three categories, namely emotional, cognitive, and attitude. in connection with the matters above, students have not fulfilled these categories, and civic knowledge and civic disposition have not been achieved optimally. this indicates that lecturers still have a number of tasks in improving students' civic knowledge so that students attain a good civic disposition. once civic disposition has been well developed, citizens will behave in ways that support their political participation, and the political system will function proportionally to advance dignity and the public interest (sunarso et al., 2016, p. 15). 
therefore, this study discussed and analyzed the suitability of civic knowledge and civic disposition in the topic of citizen's rights and obligations in the subject of civic education. method this research was a descriptive study that used a qualitative approach. it aimed to find and analyze the suitability of civic knowledge and civic disposition, especially in the topic of rights and obligations of citizens, among students of universitas negeri yogyakarta (uny) and universitas ahmad dahlan (uad). the study collected various kinds of data and made effective use of time in the field (creswell, 2016, p. 254). the procedures undertaken were interviews and documentation. the interviews were conducted from february 15 to april 17, 2020, with ten informants classified into three categories, namely (a) civic education experts and civic education lecturers, (b) civic education teachers in elementary, junior high, and vocational schools, and (c) five students at uny and uad. this qualitative study began from assumptions and interpretations that could shape and influence the matter being studied (creswell, 2015, p. 59). in validating the data, the researcher used source triangulation, which draws on different sources of information to build a coherent justification for themes (creswell, 2016, p. 269). following miles and huberman (1994, pp. 10–12), data analysis consisted of data reduction, data display, and conclusion drawing. this study used a purposeful sampling technique with two considerations in determining the research subjects: the decision on which subjects to choose and the specific sampling strategy (creswell, 2015, p. 215). therefore, the researcher chose specific research subjects to shed light on the problem being investigated, namely the suitability of civic knowledge and civic disposition in the subject of civic education for students at uny and uad. 
based on these criteria, the researcher involved several subjects as informants, namely experts on civic education, lecturers of civic education, teachers of civic education, and students at uny and uad who were enrolled in civic education classes. the research was conducted at universitas negeri yogyakarta (uny), universitas ahmad dahlan (uad), smpn 1 magelang, smk negeri 1 kalikajar, and sd negeri kembangsongo. the researcher chose the civic education subject to determine the extent of the suitability between civic knowledge and civic disposition in the topic of rights and obligations and to identify its relationship to the curriculum of elementary school, junior high school, and senior high school/vocational high school. the study was conducted at uny and uad to determine the extent of this suitability in a private university, a state university, and an islamic university. smpn 1 magelang, smk negeri 1 kalikajar, and sd negeri kembangsongo were chosen because their teachers were alumni of the civic education study program at uny. to obtain a clear description of the suitability of civic knowledge and civic disposition in tertiary institutions, especially at uny and uad, as well as the compatibility between the subject/course and the civic education curriculum in elementary/junior high school/vocational high school, this study determined the research subjects by using a purposive sampling technique. creswell (2015, p. 217) states that purposeful sampling is used to select specific and qualified subjects who can provide an overview of the problem under investigation. 
the following are the characteristics of the subjects, classified by role. (1) two civic education experts and civic education lecturers were involved in determining the extent of conformity between civic knowledge and civic disposition in citizens' rights and obligations. (2) three civic education teachers in elementary, middle, and vocational schools were involved in studying the compatibility between the college subject and the civic education curriculum in elementary/middle school/vocational school. (3) five students at uny and uad who had taken civic education subjects were involved to find out the suitability between civic knowledge and civic disposition. in collecting data, this study applied two techniques, namely interview and documentation. the interviews were structured, with the issues and questions determined in advance. the interview results constituted the primary data obtained from the research subjects. meanwhile, documentation served to support and supplement the primary interview data. it was taken from data and records related to the compatibility between civic knowledge and civic disposition in the civic education subject. findings and discussion civic knowledge according to cogan in winarno (2013, p. 4), civic education is a subject/course that prepares young people to take an active role in the life of the nation and state. civic education aims to prepare students to become active, critical, rational, and creative in addressing citizenship issues. rosnawati, kartowagiran, and jailani (2015, p. 187) also state that critical thinking helps a person process and use information to solve problems. besides, the goal of civic education is for the younger generation to participate actively and take responsibility intelligently in social activities concerning the nation and state (winarno, 2013, p. 95). 
in this study, the learning outcomes are examined by focusing on one aspect of civic knowledge, namely whether students know, understand, and internalize the material/topic in the civic education subject. the material/topic involves the rights and obligations of citizens. this is a teaching and learning target that students must achieve, because civic knowledge is basically a matter related to the rights and obligations that citizens should carry out (budimansyah, 2010, p. 49). interviews with students at uny and uad showed that they already knew about the rights and obligations of citizens. they also understood the elements of the rights and obligations as regulated and clearly stipulated in the 1945 constitution of the republic of indonesia. the interview with adw, a student in the automotive study program at uny, on february 15, 2020, indicated that he could explain the rights and obligations in detail. in addition, other students could explain the articles in the 1945 constitution of the republic of indonesia that regulate indonesians' rights and obligations. citizens are people who live in a certain area and stand in a relationship with the state. in this relationship, citizens have obligations to the state, and they also have rights that must be granted and protected by the state. citizens' rights are everything that citizens must obtain from the state (government). obligations are all things that must be carried out by citizens of the state. 
under the 1945 constitution, citizens' rights are set out in article 27 (1, 2, 3), article 28 (a, b, c, d, e, f, g, h, i, j), article 29 (2) on the freedom of religion, article 30 on defence and national security, and article 31 on obtaining education. the rights and obligations of citizens appear in article 27 (1), which establishes citizens' equal rights in law and government and the obligation to uphold the law and government; in article 27 (2), which establishes the right of citizens to work and to a living fit for humanity; and in article 27 (3), which establishes the rights and obligations of citizens to participate in efforts to defend the state. the interview results indicated that the student understood the mutual relationship between citizens and the government. the mutual relationship means that both citizens and the government have their own rights and obligations. thus, the rights and obligations in the 1945 constitution of the republic of indonesia regulate not only citizens' obligations and rights but also the rights and obligations of the state (government). in the same context, a similar opinion was expressed on february 16, 2020, by far, one of the students at universitas ahmad dahlan, who said that there is a mutual relationship between citizens and the state. he added that a right is something that gives a citizen free access to perform an action or express a statement (opinion/argumentation) because he has done his duty as a citizen, whereas the obligations of citizens are everything that a citizen himself must do. both interviews indicated that students from both uny and uad understood the meaning of the rights and obligations of citizens and the state. 
therefore, the civic knowledge competency is achieved. civic knowledge relates to what citizens must know. its content is the compulsory knowledge that citizens should recognize and comprehend (budimansyah, 2010, p. 29). according to feriandi and harmawati, civic knowledge is seen not only from the cognitive aspect but also from other aspects, such as social services and lecture discussions on citizenship issues. students carry out social services in orphanages and in communities with low economic levels. in addition, students are accustomed to being told in class about current topics in the community. thus, students can explore the civic knowledge they have gained in the subject they have learned (feriandi & harmawati, 2018, p. 78). civic disposition civic disposition is a very basic and essential competency. it is considered the spearhead of the development of civic knowledge and civic skills. quigley, buchanan jr., and bahmueller (1991, p. 11) explain civic disposition as "...those attitudes and habits of mind of the citizen that are conducive to the healthy functioning and common good of the democratic system". this means that citizens' attitudes and habits of mind support the healthy functioning and common good of a democratic order. similarly, branson (1999, p. 23) states that both public and private characters are important in developing a constitutional democratic system. branson (1999, pp. 23–25) elaborates that public and private characters can be described as follows. (1) becoming an independent member of the community. (2) being able to fulfill the responsibilities of being a citizen in the economic and political fields. (3) being able to respect the dignity of all individuals regardless of social status and so on. (4) being able to participate actively in the affairs of citizenship effectively, responsibly, and wisely. 
(5) being able to develop the function of democracy in a healthy way. as explained above, the private and public character of citizenship is very important for the survival of the nation and state. in this case, the researcher examines the competence of civic disposition in terms of citizens' rights and obligations. the results show that students at both universitas negeri yogyakarta and universitas ahmad dahlan have implemented civic disposition in their daily lives, although not optimally. as a citizen, i have the right to get appropriate education services of my choice, and i can get them. as a citizen, i am obliged to help maintain harmony, and i do so by always showing tolerance toward differences, for example being tolerant in the campus environment, remembering that the campus consists of a variety of religions, cultures, ethnicities, and so on. (an interview with sm on february 16, 2020). this statement indicates that the civic disposition related to citizens' rights and obligations has appeared in the student’s personality: the student not only claims rights but also carries out obligations as a citizen by maintaining tolerance in a campus environment with a diversity of religions, cultures, and ethnicities. thus, the effort to establish students’ civic disposition on the material of "citizen's rights and obligations" has been achieved, although not fully. 
a similar result can also be identified in an interview on april 6, 2020, with da, a student at uny, who stated that pursuing education in a tertiary institution is an implementation of every citizen's right to receive education; that worshipping according to one's religion and choosing one's religion are likewise rights; that the fulfilment of food and clothing is a form of citizens’ right to a decent living; that complying with the tax system fulfils tax and legal obligations; that taking the pancasila and civic education courses fulfils the obligation to defend the country; and that avoiding sara, or suku, agama, ras, and antargolongan (ethnicity, religion, race, and inter-group), implements the obligation to respect the rights of others. another interview on april 6, 2020, with dcn, a student at uad, gave a similar result: students have indeed exercised their rights as citizens by taking part in the expression of public opinion, such as participating in demonstrations. therefore, it can be concluded that not all students have internalized civic disposition, because some students still have not implemented their rights and obligations properly. the suitability of civic knowledge and civic disposition in the material of “rights and obligations” somantri (2001, p. 116) states that civic education is a scientific and psychological effort to make learning accessible to students so that the internalization of pancasila morals and citizenship knowledge, aimed at realizing personal integrity in everyday behavior, rests on the national education goals. thus, civic education in higher education is expected to prepare students as young people to participate in the life of the nation and state. students, as the young generation, are given an understanding of national ideals and of how to act to overcome problems through civic education. 
therefore, students can make responsible decisions to overcome personal and national problems. the 1945 constitution is the foundation of formal values, norms, and moral education in indonesia and is implemented through civic education. this is also set out in law no. 12 of 2012 of the republic of indonesia on higher education, article 35, which states that universities are required to teach religion, pancasila (five pillars of the nation), citizenship, and indonesian at both undergraduate and diploma levels. the nomenclature of civic education has continually undergone transformation. it was initially a civic course and was later transformed into the civic education course, although a number of topics typical of the earlier course remain, such as the concepts of pancagatra and trigatra. in this context, s, an expert in civic education at uny, stated that after the reformation, the nomenclature changed from gallantry (manliness) to civic education. although it has changed, typical topics of dignity such as pancagatra and trigatra remain, which indicates that the subject still carries the atmosphere and nuance of gallantry (manliness). furthermore, s said: although the material of democracy, human rights, and knowledge on state institutions are included in the civic education subject, the metamorphosis of that authority tends to be considered to represent military typology in the current state defence framework. civic knowledge aspects are related to democracy, human rights, local government in tertiary institutions especially in state universities issued by the general director of higher education in 2004. because the civics education subject in tertiary institutions is still new, it was previously becoming a gallantry (manliness). the figure is kunto wibisono; the development team for the transformation of civic education from gallantry (manliness) in higher education. 
for me, there are actually many things that must be clarified from the epistemology both from the scientific framework and from the scope of activities. on the other hand, building the character of citizenship in the university’s civic education subject on post-citizenship, we can also see the difference after it was regulated in 2012; when there was a circular from the general director of higher education mentioning that one of the concepts of law no. 12 of 2012 on tertiary education required tertiary institutions to teach a minimum of four compulsory subjects in tertiary and diploma colleges. those subjects were religion, indonesian, pancasila, and civic education. (an interview with s on march 9, 2020) the statement indicates that civic education still tends to represent a military typology in the concept of state defense, while its civic knowledge aspects relate to democracy, human rights, and government. the epistemological concept must also be clarified. another perspective was given by c, an expert in civic education at uny, on march 9, 2020. he stated that, for civic disposition to match civic knowledge, theory is not the first priority; the priority is rather the understanding among citizens that each citizen has to associate with others for the sake of living in a democratic country. in addition, c emphasized that civic knowledge and civic disposition should be able to develop citizens' intelligence so that students, as young people, can hold to the values of virtue with good character. 
attitude can be controlled through intelligence, although there are two ways to control attitudes, namely habit (acquired from early childhood education to junior high school) and an intelligent understanding of the importance of attitude, which high school and college students must attain. he also argued that civic disposition is presently introduced only as concepts and theories, so it will never be formed. thus, civic knowledge should reach a higher level of understanding than mere theory. with regard to the material on citizens' rights and obligations, the suitability between civic knowledge and civic disposition is still not fully internalized, although many students actively participate in student organizations on campus. this participation shows that they have internalized their right as citizens to associate (form a community). in the same context, hh, one of the lecturers of civic education at uad, stated in an interview on march 1, 2020, that the material on rights and obligations is appropriate, although students are still not aware of its importance, so experience in the field is needed. civic education has not been very effective in shaping citizens' character because it still needs to be strengthened with supplements from outside civic education. in line with this, s, interviewed on march 9, 2020, stated that there was a need to reorganize the substance of the civic education subject in higher education around current issues of defending the country, deradicalization, and efforts to minimize intolerance. one of the values built in civic education is how to live together and take responsibility for the nation and the state amid dynamic challenges. further, s argued that state defense material also gave an impression of denying the existing civic education models. 
defending the state is similar to defending the nationality model by gallantry (manliness). any physical activity is how to differentiate it. in this context, the topic of state defense in the national defense institute is like revitalizing the spirit of dignity that once existed in the civic education subject. it is similar to the view that civic education merely reaches the cognitive domain, and some studies have examined state defense materials for civil servant candidates in education and training. further, civic knowledge and disposition in the material of rights and obligations have not been fully successful, and this is attributed to individual factors. the interview on april 17, 2020, with syt, a lecturer of civic education at uny, found that the gap between civic knowledge and disposition on citizen's rights and obligations was caused by individual factors. this is reinforced by the opinion of ffh, a lecturer of civic education at uny, who stated that student factors are a cause of the incomplete match between civic knowledge and disposition (from an interview on march 4, 2020). apart from student factors, c additionally pointed out that the match was influenced by lecturer factors: lecturers did not necessarily have a civic education background, and even lecturers from the graduate program of civic education were considered not to share the same perspective on attitude (from an interview on march 9, 2020). perspectives on attitude varied, and there should be a robust test for them. besides, the course was not conducted intensively. another weakness was that it was rare for someone to pursue a field of expertise during the course (from an interview on march 9, 2020). 
thus, it is time for lecturers to engage deeply in their field so that they can gain more current and relevant views and contribute substantively and productively to the development of science. the compatibility of the civic education subject in tertiary institutions with that in elementary/junior high school/senior high school/vocational high school must also be assessed, because the assessment is needed to verify and validate teachers' competency. tm, a teacher at state junior high school 1 magelang and an alumnus of civic education at uny, stated that most topics were appropriate and that class vii put more emphasis on constitutionality (from an interview on march 5, 2020). in line with this, ww, a teacher at kembangsongo elementary school and an alumnus of civic education at uny, argued that there was conformity even though the elementary school still applied the k13 curriculum, which is elaborated in the form of themes, sub-themes, and basic competencies (from an interview on april 6, 2020). based on the interviews, most of the assumptions have met the criteria of analysis, although not all materials/topics obtained from university lectures have been taught in elementary/junior high school/senior high school/vocational high school. ep, a teacher at state vocational high school 1 kalikajar, confirmed that not all materials/topics obtained from university lectures had been taught and that teachers delivered the materials/topics by using a variety of methods, such as discovery learning, project-based learning, and problem-based learning (from an interview on march 3, 2020). conclusion the suitability between civic knowledge and disposition in the subject of civic education in the topic of "citizen's rights and obligations" has not been fully implemented as expected. 
this is shown by the fact that the students understand the mutual relationship between citizens and government but have not fully applied that understanding in their daily lives. field study, such as democratic learning, is necessary so that students not only understand the topic but also implement it. apart from student factors, there are lecturer factors: lecturers of the civic education subject do not necessarily have a background in civic education, and those who come from the civic education study program have had little enrichment in attitude and perspective. as a result, civic knowledge and civic disposition have not been optimally developed and matched. this means that civic education has not been effective in forming citizenship character in youths, like the supplementary program for civil servant candidates, which is also expected to form citizens' character. the study of the suitability of civic education in higher education with the curriculum of elementary school, junior and senior high school, and vocational high school indicated that suitability exists, although not all topics discussed in higher education are previously taught in elementary school, junior and senior high school, and vocational high school. references ainley, j., fraillon, j., & schulz, w. (2012). iccs 2009 asian report: civic knowledge, attitudes, and engagement among lower-secondary students in five asian countries. retrieved from https://research.acer.edu.au/civics/17 alam, b. (2014). antropologi dan civil society: pendekatan teori kebudayaan. antropologi indonesia, 30(2), 193–200. https://doi.org/10.7454/ai.v30i2.3564 branson, m. s. (ed.), syarifudin, s. (trans.). (1999). belajar civic education dari amerika. 
yogyakarta: lembaga kajian islam dan sosial (lkis). budimansyah, d. (2010). pembelajaran pendidikan kesadaran kewarganegaraan multidimensional. bandung: genesindo. creswell, j. w. (2015). penelitian kualitatif & desain riset: memilih di antara lima pendekatan (a. l. lazuardi, trans.). london: sage publications. creswell, j. w. (2016). research design: pendekatan metode kualitatif, kuantitatif, dan campuran (a. fawaid & r. k. pancasari, trans.). london: sage publications. ernawati, e., tsurayya, h., & ghani, a. r. a. (2019). multiple intelligence assessment in teaching english for young learners. reid (research and evaluation in education), 5(1), 21–29. https://doi.org/ 10.21831/reid.v5i1.23376 feriandi, y. a., & harmawati, y. (2018). analisis penguasaan kompetensi kewarganegaraan pada mahasiswa ppkn universitas pgri madiun. jurnal citizenship: media publikasi pendidikan pancasila dan kewarganegaraan, 1(2), 76– 83. https://doi.org/10.12928/citizen ship.v1i2.13620 galston, w. a. (2007). civic knowledge, civic education, and civic engagement: a summary of recent research. international journal of public administration, 30(6–7), 623–642. https://doi.org/10.1080/01900690701 215888 jb, m. c., & darmawan, l. (2016). wacana civil society (masyarakat madani) di indonesia. jurnal sosiologi reflektif, 10(2), 35–64. https://doi.org/10.14421/jsr. v10i2.1157 law no. 12 of 2012 of republic of indonesia on higher education. , (2012). miles, m. b., & huberman, m. (1994). qualitative data analysis: an expanded sourcebook (2nd ed.). thousand oaks, ca: sage publications. print, m., ellickson-brown, j., & baginda, a. r. (eds.). (1999). civic education for civil society. london: asean [association of south east asian nations] academic press. quigley, c. n., buchanan jr., j. h., & bahmueller, c. f. (1991). civitas: a framework for civic education. calabasas, ca: center for civic education. rosnawati, r., kartowagiran, b., & jailani, j. (2015). 
a formative assessment model of critical thinking in mathematics learning in junior high school. reid (research and evaluation in education), 1(2), 186–198. https://doi.org/10. 21831/reid.v1i2.6472 setiawan, a., & suardiman, s. p. (2018). assessment of the social attitude of primary school students. reid (research and evaluation in education), 4(1), 12–21. https://doi.org/10.21831/reid.v4i1.192 84 somantri, m. n. (2001). menggagas pembaharuan pendidikan ips. bandung: remaja rosdakarya. sunarso, s., sartono, k. e., dwikusrahmadi, s., & sutarini, y. c. n. (2016). pendidikan kewarganegaraan: pkn untuk perguruan tinggi. yogyakarta: uny press. wardhani, c. m. (2019). operasi keselamatan progo, polresta yogya tilang 977 pelanggar, 327 merupakan pelajar dan mahasiswa (a. nugroho, ed.). retrieved from tribun jogja website: https://jogja.tribunnews.com/2019/05 /13/operasi-keselamatan-progo-polres ta-yogya-tilang-977-pelanggar-327-meru pakan-pelajar-dan-mahasiswa winarno, w. (2013). pembelajaran pendidikan kewarganegaraan: isi, strategi, dan penilaian. jakarta: bumi aksara. https://doi.org/10.21831/reid.v6i1.31621 copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online) reid (research and evaluation in education), 7(1), 2021, 35-45 available online at: http://journal.uny.ac.id/index.php/reid from an international english language assessment framework to a teacherbased assessment: a study of primary english teachers’ agentive perspectives and projections santi farmasari* universitas mataram jl. majapahit no.62, gomong, selaparang, kota mataram, nusa tenggara barat 83115, indonesia *corresponding author. e-mail: santifarmasari@unram.ac.id introduction impacts resulted from the termination of the international standard school (iss) system in indonesia in early 2013 are still being felt until today, especially by former iss. the iss founding through the enactment of republic of indonesia act no. 
20 of 2003 was intended to inform and ultimately improve the education quality in indonesia (kustulasari, 2009; sundusiyah, 2011) by integrating internationally certified education standard along with the national education curriculum (nec). the abolition of iss system created confusion at the micro-level in terms of adjustments that need to be taken to maintain the international standard teaching and assessment (damarjati, 2013), especially for english subject as an iss image (farmasari, 2020). the confusions brought by the absence of further guidelines for teaching was aggravated when english is excluded from the 2013 primary school nec, resulting in the absence of an assessment framework for english. the absence of an assessment framework has reinforced teachers to seek for solution to fill the assessment gap. in this case, teacher-based assessment (tba) is believed to article info abstract article history submitted: 18 february 2021 revised: 14 may 2021 accepted: 31 may 2021 keywords agentive perspectives; projection; teacher-based assessment; teacher agency scan me: this study explores four primary english teachers' agentive perspectives about assessment in their schools. given the challenges brought by the abolition of international standard school (iss), the changed status of the school, and the exclusion of english from the 2013 national education curriculum (nec), a teacher-based assessment was the only solution to the unavailability of assessment guidelines and the unsuitability of assessment materials and methods. employing teacher agency theory, this study examines the agentive sides of the teachers' perspectives as they would represent the teachers' strategic solutions toward the school's emerging problems. the teacher-based assessment was expected to accommodate the school's context, the students, and the subject taught. 
this instrumental case study's data was collected through semi-structured interviews and focus group discussion, which was then analyzed through six phases of thematic analysis and by employing nvivo12 software. the study indicates that the challenges brought by the changed educational policies reinforced collaborative work amongst the teachers. the teachers' perspectives also represent their agentive projections toward english language assessment which was heavily shaped by the teachers' previous assessment experiences during iss and their ultimate teaching objectives. the findings are expected to provide insightful knowledge about how english teachers responded to shifted educational policies and projected to accommodate the school's specific contexts of assessment. the findings are also expected to explain how people's perspectives can be examined from an agency perspective. this is an open access article under the cc-by-sa license. how to cite: farmasari, s. (2021). from an international english language assessment framework to a teacher-based assessment: a study of primary english teachers’ agentive perspectives and projections. reid (research and evaluation in education), 7(1), 35-45. doi:https://doi.org/10.21831/reid.v7i1.38850 https://creativecommons.org/licenses/by-sa/4.0/ https://doi.org/10.21831/reid.v7i1.38850 https://doi.org/10.21831/reid.v7i1.38850 santi farmasari page 36 copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online) be a strategic solution to respond to the assessment reference unavailability. tba also aligns with english as an elective subject (sulistiyo et al., 2020) at primary school since its design and enactment are at school and teacher level. 
besides, tba has been acknowledged to be a more appropriate scheme of assessment at school level as it enables teachers to address and integrate the specific contexts of the school and region into the assessment they develop (dunlea et al., 2020; janssens & meier, 2013).

literature shows that studies related to iss implementation in indonesia were mainly done during the iss implementation, focusing on policy studies (e.g. kustulasari, 2009; lumbanraja, 2009; soepriyanti, 2004); issues of national identity (e.g. sakhiyya, 2011); teachers’ professionalism (e.g. sundusiyah, 2011); economic consequences (e.g. coleman, 2011); and the legal aspect (e.g. rosser & curnow, 2014). studies on how school stakeholders, especially teachers, respond and adapt their academic practices to the changed policy in the aftermath of the abolition, such as this study, remain scarce but much needed. because iss used to implement internationally certified assessment for final-year students, and formative assessment was also designed as preparation for the final international assessment, english teachers would have brought these previous assessment experiences into their current tba practices (buchanan, 2015). moreover, because the english teaching inputs that iss students previously received were socially, culturally, pedagogically and linguistically richer than those in the nec (farmasari, 2020; mahyuni et al., 2010), substantial adjustments to the assessment system are needed. given this problematic situation, english teachers are required to exercise their capacity to respond to the problems by taking into account the specific contexts of the school and of the students (biesta & tedder, 2007; emirbayer & mische, 1998; priestley et al., 2015). the problematic situations have provided opportunities for teachers to exhibit and exercise their agentive capacity in seeking strategic solutions toward emerging situations.
as a starting point for examining agentive capacity, this study explores english teachers' perspectives about moving from an international assessment framework to a tba amidst the school's given assessment contexts (x. yan et al., 2018). while studies about people's perspectives have predominantly examined people's attitudes and views toward specific issues (jerald & shah, 2018; zein, 2017), this study offers an angle in which perspectives can be examined from a teacher agency point of view. this theory examines whether the participants' perspectives are agentive or less agentive by examining their responses towards the cultural, structural and material contexts of assessment in the school (priestley et al., 2016). as visualized in figure 1, the changed policy of iss and english teaching in primary school is part of teacher agency's practical-evaluative element, with an interplay between cultural, structural, and material aspects of assessment in the school.

figure 1. teacher agency model (priestley et al., 2016, p. 30)

given the insightful international background from the previous international assessment experiences and the practical-evaluative aspects of the assessment in the school, this study was conducted to find out (1) what agentive perspectives the participants possess concerning the practical-evaluative aspects of assessment in the school, and (2) what projections can be identified from the participants’ agentive perspectives. the findings of this study are expected to provide insightful knowledge into how english teachers in a former iss respond to the changed educational policies and to provide examples of how agentive perspectives represent projective educational practices, particularly in the assessment of english language in the context of primary schools in indonesia.
the context of the study

the school where the study was undertaken used to be an international standard school (iss) which implemented both the national curriculum and an international curriculum, i.e. the cambridge primary checkpoint (cpc), in which english was included as an international subject. the english teaching and assessment complied with the cpc scheme, where students had to sit an international assessment in their final year. after the abolition of iss in early 2013, former iss were not given any further guidelines (damarjati, 2013), except for a return to and complete implementation of the national curriculum. the school and the teachers faced a dilemma: whether to comply entirely with the policy mandate or maintain the school's international curriculum, and what to do with the students accustomed to the teaching pedagogy and assessment of the international standard curriculum. one of the participating teachers commented that, “if we returned to the national curriculum, the contents of the available teaching materials and assessment were too low for our students…” (pripa). these conditions had reinforced the teachers’ personal capacity to seek a strategic solution in response to the change (priestley et al., 2015, 2016). therefore, it is essential to explore the english teachers’ agentive perspectives about assessing the students’ english achievement when moving from an international assessment framework to a teacher-based assessment.

method

type of study and the participants

this study is an instrumental case study since it examined a single specific case of the english teachers' agentive perspectives. this research aims to explore an issue based on a lived experience in a “bounded system” (creswell, 2014, p. 73). besides, it draws on a qualitative interpretative paradigm which analyses cases in diverse schools of thought in the social sciences, such as social constructionism, phenomenology, and symbolic interactionism (collins, 2010).

table 1. the participants’ profile

| participants | age | gender | role leadership experience (years) | teaching experience (years) | employment status |
|---|---|---|---|---|---|
| pripa | 43 | male | 6 | 19 | government teacher¹ |
| gesi | 29 | female | 6 | 9 | honorary teacher² |
| miki | 30 | male | 5 | 8 | honorary teacher |
| daru | 23 | female | 5 | 8 months | honorary teacher |

¹ a government officer/teacher is a permanent position where a teacher is employed and paid by the national government.
² an honorary teacher is a non-permanent position where a teacher is employed and paid by the school.

this study was conducted in a former iss located in a provincial city in the south-eastern part of indonesia. the school was selected purposively as it satisfies the purpose of the study, and the participants possessed understandings of the case study or the research problems and the core phenomenon in the study (creswell, 2014; silverman, 2016). the participants of this study were four english teachers who were teaching grade 5 and 6 classes (as presented in table 1).

over one month (mid-august to mid-september 2018), the empirical data about the participants’ agentive perspectives were collected through semi-structured interviews and a focus group discussion (fgd). the teachers were interviewed twice in order to explore their agentive perspectives and assessment projections. the second interview followed up on the teachers’ responses in the first interview. the fgd was conducted once, following the interviews, by referring to the interview data. the interview and fgd protocols and the questions were developed by following the teacher agency theory proposed by priestley et al. (2016), representing the cultural, structural, and material contexts of english language assessment in the school.
data analysis

the data of this study were analyzed thematically by adopting the six phases of thematic analysis from braun and clarke (2006), as presented in table 2. the data were then analyzed inductively in nvivo 12 pro software. all of the codes and relevant data extracts were collated together for the later analysis stages.

table 2. steps to thematic analysis

| steps | phases | activities | tools |
|---|---|---|---|
| 1 | familiarisation with the data collected | developing understanding of the meaning of each data; discovering how each data related to the corresponding research questions | microsoft word |
| 2 | developing codes across the data sets | creating codes inductively; reducing and revising through a refinement process of the data | nvivo12 pro |
| 3 | searching for themes | a careful line-by-line reading of the excerpts; looking for repetition of words, key-words-in-context (kwic) and shifts in content | nvivo12 pro |
| 4 | reviewing for themes | developing a coding table with lists of the themes, codes and their relationship to ensure the integrity, consistency and representativeness of the data | microsoft word |
| 5 | defining and naming themes | defining, naming and finalizing each theme; writing their descriptions and illustrating them in a matrix | microsoft word |
| 6 | reporting | reporting the findings | microsoft word |

findings and discussion

influences from the past assessment practice

buchanan (2015) states that the contexts of teachers’ previous teaching experiences heavily shape how teachers exercise agency. as teachers engaged with the material resources from the previous practice and adjusted them to the students’ current context (stritikus, 2003), the teachers’ agentive perspectives may influence their analysis, interpretation and adaptation of the available resources for their respective contexts (smagorinsky et al., 2011).
during the iss period with the cpc program, the english teachers received great support from the school in the form of professional development and supervision, which helped them understand which assessment forms would contribute to the expected outcomes. this critical support from the school is expected to still be embedded in them; thus it can affect agency (sachs, 2016), as they were not only given opportunities to exercise their agentive role but also supported with relevant knowledge and skills (verberg et al., 2016). however, since the abolition of the iss policy, the school's changed status, and the exclusion of english from the 2013 nec, similar professional development programs for teachers were rarely conducted. in response to this, two participants, miki and gesi, perceived that they needed to pursue professional development independently to support their assessment practices. this agentive perspective resulted from miki’s and gesi’s personal networks beyond the school culture, which they enacted voluntarily yet purposefully (chisholm et al., 2019).

as a result of insightful experience from the past assessment practice, miki perceived that her past assessment practice needed to be adopted as it aligned with the school's english learning outcomes, i.e. assessing communication skills using alternative assessment forms. in support of miki’s preferences, pripa stated that he perceived that the students’ real-life needs should be integrated into assessment, as they used to be during the iss period with the cpc assessment. implementing authentic assessment principles (cheng et al., 2010), pripa perceived that incorporating authentic use of english in the classroom context would shape students’ perspectives of the use of english in their daily life.
from these perspectives, the teachers’ past assessment practice has given the teachers valuable insights into their current assessment practice. the assessment design and the materials provided in the cpc assessment packs (assessment samples, rubrics, and scoring system) helped the teachers develop their current assessment tasks, assessment criteria, and scoring, as gesi stated, “…the cpc pack was complete…there was a teacher’s book with suggestions for assessment tasks, scoring…i learnt from it”. with the current english teaching policy in primary school and the absence of suitable assessment materials for english, the teachers’ experiences with the cpc benefit their practice by guiding them in developing and implementing the assessment. even though they admitted that their practice had become more challenging, the teachers understand that their role in teacher-based assessment (tba) is critical, as explained below.

teachers’ pivotal role in tba

because teachers have the opportunity to address the specific contexts of assessment in their respective settings, their role in tba becomes very critical (brookhart, 2011). when explicitly asked about her perspective on her role in tba, miki initially defined what she understands about teacher-based assessment before commenting on her role, “teacher-based assessment is an assessment scheme where teachers are given authority as well as responsibilities to plan, develop and conduct an appropriate assessment.” miki's definition of tba implied that she viewed her role and responsibilities in the assessment as pivotal, as she needed to design and implement assessment by herself for her respective class due to the unavailability of assessment guidelines for primary school english.
she further mentioned that teachers are “the right people on the right job”, which is echoed by daru: “… the role of teachers as the key persons in the development and implementation had to be substantially recognized…” they believe that the right persons to assess students are their teachers, as they possess knowledge about the school, the subject and the students, so that the assessment can be directly monitored and improved.

another participant, pripa, held a view of teachers' role in tba that echoed the position that teachers should act not only as assessment developers but also as assessment enactors and evaluators (black & wiliam, 2009; davison, 2019). further, pripa believed that the assessment developed by teachers would be more effective than the assessment created by external parties, as he stated, “tba provides opportunities for teachers to use their knowledge about their class to develop a more direct and suitable assessment for students…” pripa implied that tba provides opportunities for teachers to employ a more direct assessment of students' learning in the four language skills based on the school's goals and the students' needs. as part of teacher assessment literacy (webb, 2002), this perspective can be viewed as the teachers’ processes in collecting data directly about their students’ learning.

as a result, the perspectives of the four teachers, miki, gesi, pripa, and daru, about their pivotal role represent their agentive intention. the teachers resisted using external english tests in the school as the tests presumably included under-representative items to assess the students’ learning achievement (bachman, 2002; z. yan & cheng, 2015). to achieve the maximum benefit from tba, teachers’ role in ensuring that the tasks are effectively aligned with the pre-determined learning objectives is significant.

beyond the teachers’ critical role in tba, gesi highlighted another vital role of teachers which can help improve assessment quality: building a positive relationship with students and parents. as she stated, “students and parents need to be involved during the preparation process…after the assessment, i always inform the results to parents so that we can work together to improve students’ learning.” teachers need to consider what role parents can play in assisting their children (ho, 2006), such as evaluating their children's progress based on teachers' evidence (cheng et al., 2010). even though this involvement requires enormous time and vigorous commitment from both teachers and parents, its contribution to improving students' learning is promising (azevedo et al., 2010; cheng et al., 2010). in a similar vein, miki perceived that her role in maintaining effective communication about students’ learning progress with both students and parents was essential and influential in achieving her goal: “…the first thing is my relationship with students and then parents”. moreover, simpson (2017) and lasky (2005) recognized a positive relationship with students, as a sociocultural process, as a manifestation of agency, as it helps students improve their assessment performance.

further, when specifically asked about their role and responsibilities in tba after the abolition of iss, three teachers remarked that the situation after the iss abolition had encouraged them to work collaboratively to maintain the quality of assessment.
they believed that the collaborative work facilitated opportunities to learn from and aspire to each other, while at the same time respecting each other’s capacity.

learning from, aspiring and respecting each other for a quality assessment practice

as stated earlier, the situation amidst the changed national policy of iss, the school's changed status, and the exclusion of primary english teaching from the 2013 nec had become a driving force for collaborative work amongst the english teachers. collaborative work was required as the teachers needed to have intensive discussions and analysis of the school's assessment contexts. the work was initiated to create a more suitable english curriculum, teaching and assessment, and better english language assessment quality. the teachers admitted that the collaborative work was a place for reciprocal learning, aspiration and respect towards each teacher’s capacity. the beneficial effect of collaborative work is achievable when teachers possess a sincere motivation to learn, even from younger members (novotný & brücknerová, 2014). miki stated, “i am passionate about working with other teachers, senior and junior, as they must have different knowledge and skills which will enrich my own [knowledge and skills].” the extract above shows miki’s perspective on working with other teachers in an intergenerational environment as a learning opportunity. rubin and land (2017) stated that the situational collaborative framework within a school context would create a bi-influential process where the induced and inducing teachers could influence each other’s practices positively. further, priestley et al.
(2015) also highlighted that agency in a school context is achieved when teachers engage with colleagues in a collaborative way, maximizing their contributions in response to a problematic situation while taking into account the “cultures and structures of schooling” (p.195) they belong to.

a similar perspective was revealed by daru, the most junior teacher, who values his working team as knowledge sources. he tried to position himself as a learner. he enjoyed working with other senior teachers as they regarded him as a partner. he admitted that this acknowledgment had built his confidence when working with the team, and he was not reluctant to deliver his ideas during the teachers’ meetings. novotný and brücknerová (2014) maintained that, for every member of a team to learn from each other, all participants must welcome each other as working partners, even though the members are from different generations.

the most senior teacher, pripa, perceived his experience working with the three junior teachers as positive. as his juniors frequently consulted him, he could see their potential through their discussions during the meetings. he trusted that his juniors were committed to maintaining and improving the quality of english teaching and assessment, including its related programs, by stating, “i believe they [the junior teachers] are competent, they are good team. they’ve been working hard to maintain the quality of english teaching, they work on the right track and i will always support them…” (pripa). the reciprocal trust between pripa and his junior fellows would have brought a positive atmosphere to the teachers' collaborative work in developing and implementing the tba.
as vaughn and faircloth (2011) stated, collaborative work amongst teachers exhibits teachers' agentive actions as they negotiate obstacles to achieve their goals. collaborative work also facilitates teachers' capabilities to solve specific problems in their educational settings (biesta & tedder, 2007).

despite the positive opportunities for reciprocal learning, the four teachers repeatedly admitted that their work is quite challenging due to factors such as the limited resources available for assessment, supervision and professional development programs in the school, as one of the teachers stated, “…we had to work hard to develop our own curriculum and assessment, but we don’t have someone we can discuss with, a supervisor. …it is very difficult for us now…” (miki). this extract represents the challenges faced by the teachers in their collaborative work. the decision taken by the constitutional court of the republic of indonesia (ccri) to terminate the iss implementation without a follow-up scenario for the former iss put them in a difficult situation, which was aggravated when the local public university's cooperation for the supervision programs was also terminated. the teachers' collaborative work was expected to be fruitful in terms of seeking strategic solutions to the challenges they faced (rubin & land, 2017). the challenges had reinforced their agentive perspectives as the teachers need to project actions “by means of their environment, rather than in their environment” (biesta & tedder, 2007, p. 137).

the teachers’ projections of english language assessment

as the second dimension of the teacher agency model (priestley et al., 2016), the projective dimension in this study refers to the teachers’ aspirations or goals toward their tba practice. data indicate that the teachers’ projections were influenced by the teachers’ past practice and professional development and their cultural factors, i.e., their perspectives (priestley et al., 2015).
when asked about their tba projections, the four teachers highlighted the importance of implementing assessment for learning (afl) and addressing the students’ characteristics as young language learners. they believed that assessment should inform them about their students’ learning progress, and teachers should use the assessment result to improve their instructions as pripa stated, “my goal is simple, assessment for learning, i will assess the students’ progress towards the intended outcomes, and i will use the result to improve my teaching” (pripa). similarly, miki highlighted the importance of achieving the english teaching objectives in the school, that was enabling students to use english actively for their daily communication, “i aim my assessment to improve the students’ speaking skill because it is our goal [the english teaching goal in the school]”. miki’s projection may have been influenced by her role as a coordinator of english subject in the school where she endeavoured to achieve english teaching goals. concerning assessment, miki further stated that “the assessment will inform me about the students’ problems to address them in my teaching”. miki’s projections of her tba practice show how she related assessment to teaching, reflecting an essential principle of afl (willis, 2011). this also reflects miki’s agentive projection in which she oriented her practice by responding to the contextual conditions (of the students) in her assessment practice (biesta & tedder, 2007). interestingly, daru’s (miki’s teaching partner) short-term tba projection was similar to miki’s: improving the students’ oral language skills. stating that, “my assessment goals are… students can use english orally in their daily activities based on the materials taught”. this similar projection may result from daru and miki’s daily interactions, where they discussed and shared their practices and inspired each other (york-barr & duke, 2004). 
miki, who was more senior in terms of length of teaching and experience with the cpc, may have influenced daru regarding the importance of improving students’ communication skills through various methods, not only the traditional paper-based tests commonly practised in indonesian schools (saefurrohman & balinas, 2016). gesi described this as a common practice in the past with the cpc: “in the past [with the cpc], we used lots of methods [of assessment] …we used portfolio, peer-assessment, performance-based, project based…” (gesi). gesi’s comment shows her projection to incorporate various assessment methods as informed by her past assessment practice, supporting buchanan’s (2015) argument that teachers’ past teaching experiences heavily shape their current practices.

conclusion

to summarize, from the four teachers’ perspectives about english language assessment in the school, some potential capitals can significantly contribute to their agentive actions during the practical-evaluative work of assessment. these capitals include the teachers’ experiences with the cpc in the past, the teachers’ understanding of their key role, and the teachers’ concerns about and commitment to maintaining the symbol of quality of education in the school through collaborative work. after the abolition of iss, the teachers’ situation was instrumental and encouraging for their sense of agency (seen from their perspectives) in finding strategic solutions for the challenges and ways to succeed within their current assessment contexts despite the struggles they faced. furthermore, the situation also increased the teachers’ awareness of the surrounding contexts, which afforded their projective strategies regarding constraints and opportunities.
given all the constraints, the participating teachers believed that the circumstances provided opportunities for them to determine better techniques for assessing the students’ english proficiency, rather than just copying the prescribed curriculum and assessment guidelines of other subjects.

references

azevedo, r., johnson, a., chauncey, a., & burkett, c. (2010). self-regulating learning with metatutor: advancing the science of learning with metacognitive tools. in m. s. khine & i. m. saleh (eds.), new science of learning: computers, cognition, and collaboration in education. https://asu.pure.elsevier.com/en/publications/selfregulated-%0alearning-with-metatutor-advancing-the-science-of-l
bachman, l. f. (2002). some reflections on task-based language performance assessment. language testing, 19(4), 453–476. https://doi.org/10.1191/0265532202lt240oa
biesta, g., & tedder, m. (2007). agency and learning in the life course: towards an ecological perspective. studies in the education of adults, 39(2), 131–149. https://doi.org/10.1080/02660830.2007.11661545
black, p., & wiliam, d. (2009). developing the theory of formative assessment. educational assessment, evaluation and accountability, 21(1), 5–31. https://link.springer.com/article/10.1007/s11092-008-9068-5
braun, v., & clarke, v. (2006). using thematic analysis in psychology. qualitative research in psychology, 3(2), 77–101. https://doi.org/10.1191/1478088706qp063oa
brookhart, s. m. (2011). educational assessment knowledge and skills for teachers. educational measurement: issues and practice, 30(1), 3–12. https://doi.org/10.1111/j.1745-3992.2010.00195.x
buchanan, r. (2015). teacher identity and agency in an era of accountability. teachers and teaching, 21(6), 700–719.
https://doi.org/10.1080/13540602.2015.1044329
campbell, c. m., & o’meara, k. (2014). faculty agency: departmental contexts that matter in faculty career. research in higher education, 55, 49–74. https://doi.org/10.1007/s11162-013-9303-x
cheng, l., andrews, s., & yu, y. (2010). impact and consequences of school-based assessment (sba): students’ and parents’ views of sba in hong kong. language testing, 28(2), 221–249. https://doi.org/10.1177/0265532210384253
chisholm, j., alford, j., halliday, l., & cox, f. (2019). teacher agency in english language arts teaching: a scoping review of the literature. english teaching: practice & critique, 18(2), 124–152. https://doi.org/10.1108/etpc-05-2019-0080
coleman, h. (2011). allocating resources for english: the case of indonesia’s english medium in international standard schools. in h. coleman (ed.), dreams and realities: developing countries and the english language. the british council. https://www.researchgate.net/publication/313576556_allocating_resources_for_engli%0ash_the_case_of_indonesia’s_english_medium_international_standard_schools
collins, h. (2010). creative research: the theory and practice of research for the creative industries. ava academia.
creswell, j. w. (2014).
educational research: planning, conducting and evaluating quantitative and qualitative research. pearson.
damarjati, d. (2013, april 10). mendikbud belum putuskan status sekolah kategori mandiri pasca rsbi. detik news. http://news.detik.com/read/2013/01/13/131732/
davison, c. (2019). using assessment to enhance learning in english language education. in second handbook of english language teaching (pp. 433–454). springer. https://doi.org/10.1007/978-3-030-02899-2_21
dunlea, j., fouts, t., joyce, d., & nakamura, k. (2020). eiken and teap: how two test systems in japan have responded to different local needs in the same contexts. in l. i.-w. su, c. j. weir, & j. r. w. wu (eds.), english language proficiency testing in asia: a new paradigm bridging global and local contexts. routledge.
emirbayer, m., & mische, a. (1998). what is agency? american journal of sociology, 103(4), 962–1023. https://www.jstor.org/stable/pdf/10.1086/231294.pdf?refreqid=excelsior%3a11085527158b64c5f26989ebfc406102
farmasari, s. (2020). exploring teacher agency through english language school-based assessment: a case study in an indonesian primary school [queensland university of technology]. https://eprints.qut.edu.au/205615/
ho, e. (2006). social disparity of family involvement in hong kong. school community journal, 16(2), 7–26. https://www.fed.cuhk.edu.hk/~hkcisa/articles/ho_2006_soc_disparity.pdf
janssens, g., & meier, v. (2013). establishing placement test fit and performance: serving local needs. colombian applied linguistics journal, 15(1), 100–112. https://doi.org/10.14483/udistrital.jour.calj.2013.1.a07
jerald, g., & shah, p. m. (2018). the impact of cefr-aligned curriculum in the teaching of esl in julau district: english teachers’ perspectives. international journal of innovative research and creative technology, 4(6), 121–125. https://www.ijirct.org/viewpaper.php?paperid=ijirct1801023
kustulasari, a. (2009).
the international standard school project in indonesia: a policy document analysis [ohio state university]. http://rave.ohiolink.edu/etdc/view?acc_num=osu1242851740
lalani, s. s., & rodriguesa, s. (2012). teachers’ perception and practice of assessing the reading skills of young learners: a study from pakistan. journal on english language teaching, 2(4), 23–33. https://doi.org/10.26634/jelt.2.4.2068
lasky, s. (2005). a sociocultural approach to understanding teacher identity, agency and professional vulnerability in a context of secondary school reform. teaching and teacher education, 21(8), 899–916. https://doi.org/10.1016/j.tate.2005.06.003
lumbanraja, s. (2009). expanding international education in indonesia: an analytical map of government and ngo construction of education policy.
university of pittsburgh.
mahyuni, m., farmasari, s., & harmayani, h. (2010). tingkat kelayakan tes bahasa inggris sd buatan guru di kota mataram.
novotný, p., & brücknerová, k. (2014). intergenerational learning among teachers: an interaction perspective. studia paedagogica, 19(4), 27–40. https://doi.org/10.5817/sp2014-4-3
priestley, m., biesta, g. j. j., & robinson, s. (2015). teacher agency: what it is and why it matters. in r. kneyber & j. evers (eds.), flip the system: education from the bottom up (pp. 134–148). routledge.
priestley, m., biesta, g. j. j., & robinson, s. (2016). teacher agency: an ecological approach. bloomsbury. http://ebookcentral.proquest.com/lib/qut/detail.action?docid=2146745
rosser, a., & curnow, j. (2014). legal mobilisation and justice: insights from the constitutional court case on international standard school in indonesia. the asia pacific journal of anthropology, 15(4). https://doi.org/10.1080/14442213.2014.916341
rubin, j. c., & land, c. l. (2017). this is english class: evolving identities and a literacy teacher’s shifts in practice across figured worlds. teaching and teacher education, 68, 190–199. https://doi.org/10.1016/j.tate.2017.09.008
sachs, j. (2016). teacher professionalism: why are we still talking about it. teachers and teaching, 22(4), 413–425. https://doi.org/10.1080/13540602.2015.1082732
saefurrohman, s., & balinas, e. s. (2016). english teachers classroom assessment practices. international journal of evaluation and research in education, 5(1), 82–92. http://doi.org/10.11591/ijere.v5i1.4526
sakhiyya, z. (2011). interrogating identity: the international standard school in indonesia. pedagogy, culture & society, 19(3), 345–356. https://doi.org/10.1080/14681366.2011.607841
silverman, d. (2016). qualitative research. sage publications.
simpson, a. (2017). teachers negotiating professional agency through literature-based assessment. literacy, 51(2), 111–119. https://doi.org/10.1111/lit.12114
smagorinsky, p., wilson, a.
a., & moore, c. (2011). teaching grammar and writing: a beginning teacher’s dilemma. english education, 43(3), 262–292. https://www.jstor.org/stable/23017093?seq=1#metadata_info_tab_contents
soepriyanti, h. (2004). bilingual education in international standard school in indonesia: an analysis of policy [the university of sunshine coast]. http://research.usc.edu.au/vital/access/manager/repository/usc:13218
stritikus, t. t. (2003). the interrelationship of beliefs, context, and learning: the case of a teacher reacting to language policy. journal of language, identity & education, 2(1), 29–52. https://doi.org/10.1207/s15327701jlie0201_2
sulistiyo, u., haryanto, e., widodo, h. p., & elyas, t. (2020). the portrait of primary school english in indonesia: policy recommendations. education 3-13, 48(8), 945–959. https://doi.org/10.1080/03004279.2019.1680721
sundusiyah, a. (2011). teachers in international standard schools: what is missing? what can be improved? what does it take in education counts? in a. sakhiyya, a. arsana, & r. mikha (eds.), the contribution of indonesian students studying overseas for education in indonesia. insight media.
terosky, a., o’meara, k., & campbell, c. (2014). enabling possibility: women associate professors’ sense of agency in career advancement. journal of diversity in higher education, 7, 58–72. https://doi.org/10.1037/a0035775
vaughn, m., & faircloth, b. s. (2011). understanding teacher visioning and agency during literacy instruction. in p. l. dunston, l. b. gambrell, k. headley, s. k. fullerton, p. m. stecker, v. r. gilles, & c. bates (eds.), 60th yearbook of the literacy research association (pp. 156–164). literacy research association. https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=1020&context=ed_human_%0advlpmnt_pub
verberg, c. p. m., tigelaar, d. e. h., veen, k., & verloop, n. (2016). teacher agency within the context of formative teacher assessment: an in-depth analysis. educational studies, 42(5), 534–552. https://doi.org/10.1080/03055698.2016.1231060
webb, n. l. (2002). assessment literacy in a standards-based urban education setting. the american educational research association annual meeting, 1–23. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.573.676&rep=rep1&type=pdf
willis, j. (2011). affiliation, autonomy and assessment for learning. assessment in education: principles, policy & practice, 18(4), 399–415. https://doi.org/10.1080/0969594x.2011.604305
yan, x., zhang, c., & fan, j. j. (2018). “assessment knowledge is important, but...”: how contextual and experiential factors mediate assessment practice and training needs of language teachers. system, 74. https://doi.org/10.1016/j.system.2018.03.003
yan, z., & cheng, e. c. k. (2015). primary teachers’ attitudes, intentions and practices regarding formative assessment. teaching and teacher education, 45, 128–137. https://doi.org/10.1016/jt.tate.2014.10.002
york-barr, j., & duke, k. (2004). what do we know about teacher leadership?: findings from two decades of scholarship. review of educational research, 74(3), 255–316. https://doi.org/10.3102/00346543074003255
zein, m. s. (2017).
professional development needs of primary efl teachers: perspectives of teachers and teacher educators. professional development in education, 43(2), 293–313. https://doi.org/10.1080/19415257.2016.1156013

reid (research and evaluation in education), 7(1), 2021, 1-12
available online at: http://journal.uny.ac.id/index.php/reid

the characteristics of chemistry test items on nationally standardized school examination in yogyakarta city

ummul karimah1*; heri retnawati1; deni hadiana2; pujiastuti3; eri yusron1
1universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
2center for assessment and learning of the ministry of education and culture, jl. gunung sahari eks komp. siliwangi no. 4, pasar baru, sawah besar, jakarta 10710, indonesia
3sekolah menengah atas negeri 7 yogyakarta, jl. mt. haryono no. 47, suryodiningratan, mantrijeron, yogyakarta 55141, indonesia
*corresponding author. e-mail: ummulkarimah01234@gmail.com

article info
article history: submitted: 22 april 2020; revised: 8 november 2020; accepted: 19 december 2020
keywords: difficulty index; discrimination index; reliability

abstract
this study aimed to describe the characteristics of chemistry test items in the nationally standardized school examination or ujian sekolah berstandar nasional (usbn), comprising the discrimination index, difficulty index, and reliability of the questions. the data were collected by documenting the answers of 194 students. the research was descriptive exploratory with quantitative and qualitative approaches. quantitative data analysis was performed by using the classical test theory approach and item response theory with a one-parameter logistic model. meanwhile, qualitative data analysis was conducted to describe the items categorized as difficult and bad. the results show that, according to classical test theory, the usbn chemistry items have an average difficulty level of 0.57, which is categorized as moderate. regarding the discrimination index, the average obtained is 0.146, which belongs to the good category. based on item response theory, the average difficulty index of -0.00086 is categorized as moderate. the estimated reliability of the questions is 0.48, which is included in the moderate category. the results of the qualitative analysis show that, according to classical test theory, the items which belong to the difficult category are the items on salt hydrolysis, acid-base titration, the concept of periodicity, and the manufacture and use of chemical compounds. meanwhile, according to item response theory, items that belong to the difficult category were found in acid-base titration, the colligative solution phenomenon, and the concept of periodicity of the main metal elements.

this is an open access article under the cc-by-sa license. https://creativecommons.org/licenses/by-sa/4.0/

how to cite: karimah, u., retnawati, h., hadiana, d., pujiastuti, p., & yusron, e. (2021). the characteristics of chemistry test items on nationally-standardized school examination in yogyakarta city. reid (research and evaluation in education), 7(1), 1-12. doi:https://doi.org/10.21831/reid.v7i1.31297

introduction

assessment is one of the indicators used to measure the success of a testing process. good assessment contributes to good education quality. the results of the assessment can determine the success of the future education process (mardapi, 2016, p.10). one of the educational processes is the learning process in the classroom. the teacher is one of the factors that determine students’ learning success (akbar, 2016, p.6). competent, skilled, and highly dedicated teachers can enhance students’ achievement in the classroom. however, one of the difficulties in the learning process is that chemistry is a subject students tend to dislike, because it deals with many abstract concepts. according to gabel (ristiyani & bahriah, 2016, p.19), this abstractness makes chemistry a complex subject. coll and taylor (ristiyani & bahriah, 2016, p.19) added that the difficulty in understanding chemical concepts stems from the inability to connect the macroscopic and microscopic worlds, also described as the triangle of macroscopic, submicroscopic, and symbolic levels (taber, 2013, p.156). these concepts include the concept of the mole, atomic structure, kinetic theory, thermodynamics, electrochemistry, chemical change and reactivity, the balancing of redox reaction equations, and stereochemistry.

one example of a test which aims to determine students’ ability in chemistry in the third year of high school is the usbn.
according to minister of education and culture regulation number 4 of 2018, article 1 paragraph 5, usbn is a form of test which aims to measure the achievement of students' competencies, conducted by education units with reference to the graduate competency standards in order to obtain recognition of learning achievements, in this case in chemistry. usbn question scripts consist of 20% to 25% items provided by the ministry and 75% to 80% items written by teachers and discussed in the teacher working group (kelompok kerja guru or kkg), the subject teacher forum (musyawarah guru mata pelajaran or mgmp), the tutors forum, and the salafiah islamic boarding school teacher working group (kelompok kerja or pokja pps), as regulated in article 11 paragraph 1 of the same regulation. this shows that teachers play an important role in preparing the evaluation instrument for the usbn, which serves as a standard for the graduation of students in schools.

the preparation time for the usbn appears relatively short, because the question grid is distributed by the ministry at short notice, leaving the teachers who write the questions with little time. besides, the fact that teachers have not fully mastered the preparation of good assessment instruments (retnawati, 2015, p.400) is one of the determinants of the quality of usbn questions, especially in yogyakarta city. a good-quality assessment is one whose items fulfil the requirements of good item characteristics. these characteristics can be seen from the level of difficulty, discrimination index, reliability, and measurement error parameters (herkusumo, 2011, p.457). according to classical test theory (herkusumo, 2011, p.457), the level of difficulty of an item is the percentage of students who answer it correctly, divided into easy, moderate, and difficult categories. the discrimination index shows the ability of an item to distinguish high-ability students from low-ability students.
items with a low discrimination coefficient are often taken to mean that high-ability students are no more likely to answer the question correctly than low-ability students, which makes them bad items. item reliability indicates how far an assessment instrument can be trusted to give consistent results even when it is used repeatedly. an inconsistent instrument lowers the quality of the assessment.

this study is relevant to the research conducted by ariani et al. (2019, p.42), who also analyzed usbn questions, but in mathematics and with a focus on student errors in solving usbn problems. these errors consisted of errors in reading, understanding, transformation, process skills, and writing answers. item analysis research was also conducted by yustika et al. (2015, p.1330), who analyzed the characteristics of chemistry items, but in the school examination or ujian akhir sekolah (uas). based on the background of the problem and the review of previous studies, it can be concluded that the analysis of item characteristics, using both classical test theory and item response theory, is important. thus, this study aimed to describe the characteristics of chemistry test items on the usbn in yogyakarta city.

method

this research is a descriptive exploratory study with quantitative and qualitative approaches. quantitative data analysis is used to describe the characteristics of items using classical test theory and item response theory. qualitative data analysis is used to describe the questions that belonged to the difficult and not good categories. respondents in this study were 194 students who took the usbn for chemistry in yogyakarta city.
data were collected by documenting student responses or answers. the data source was the usbn administration in yogyakarta city and comprised 35 multiple-choice items, the students' answers, and the answer keys. quantitative data analysis used anbuso 4.0 to determine the level of difficulty and discrimination index of the questions based on classical test theory. meanwhile, the quest program was used as a supporting tool to find the level of difficulty and reliability based on item response theory. the steps in the qualitative analysis were identifying items that belonged to the difficult category, outlining the problem-solving procedures, and analyzing the factors that might cause students’ difficulties and problems.

data analysis with classical test theory was used to determine the level of difficulty and discrimination index of the questions, while item response theory was used to determine the level of difficulty and reliability of the usbn questions in yogyakarta city. the level of difficulty and discrimination index were identified from the students' answers in the usbn. based on the classical test theory analysis, a question was categorized as difficult if its difficulty index was below 0.3, as easy if the index was more than 0.7, and as moderate otherwise (kolte, 2015, p.321). for the discrimination index, a question with a coefficient above 0 was categorized as good, while one with a coefficient below 0 was categorized as bad (kolte, 2015, p.321).
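the two classical test theory indices described above can be computed directly from a scored response matrix. the sketch below is illustrative only: it assumes dichotomous (0/1) scoring, takes the difficulty index as the proportion of correct answers, and takes the discrimination index as the correlation between an item score and the total score on the remaining items. the source does not state which discrimination formula anbuso 4.0 applies, so that choice, the function names, and the small response matrix are all assumptions made for illustration.

```python
from math import sqrt


def difficulty_index(item_scores):
    """Proportion of examinees who answered the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def discrimination_index(responses, item):
    """Assumed definition: correlation of the item score with the rest score."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


def label_discrimination(d):
    """Above 0 is 'good', below 0 is 'poor'; treating exactly 0 as poor is an assumption."""
    return "good" if d > 0 else "poor"


# hypothetical data: 6 examinees (rows) x 4 items (columns), not from the study
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

for item in range(4):
    p = difficulty_index([row[item] for row in responses])
    d = discrimination_index(responses, item)
    print(item + 1, round(p, 3), round(d, 3), label_discrimination(d))
```

in this made-up matrix the last item is answered correctly only by the examinees with the lowest rest scores, so its discrimination coefficient is negative, which is the pattern the method section labels as a bad item.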
based on the item response theory analysis, an item was categorized as difficult if its difficulty index was more than 2, as easy if it was less than -2, and as moderate if it fell in between (sarea & hadi, 2015, p.38).

findings and discussion

data analysis in this study used classical test theory and item response theory. the reliability estimate based on item response theory was 0.48, which falls in the less reliable category (azwar, 2008, p.113). the level of difficulty obtained from the classical test theory analysis using anbuso 4.0 is presented in table 1.

table 1. difficulty index coefficient based on classical test theory

| no | difficulty index coefficient | note | no | difficulty index coefficient | note | no | difficulty index coefficient | note |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.758 | easy | 13 | 0.418 | moderate | 25 | 0.361 | moderate |
| 2 | 0.521 | moderate | 14 | 0.232 | difficult | 26 | 0.464 | moderate |
| 3 | 0.448 | moderate | 15 | 0.397 | moderate | 27 | 0.057 | difficult |
| 4 | 0.485 | moderate | 16 | 0.727 | easy | 28 | 0.582 | moderate |
| 5 | 0.335 | moderate | 17 | 0.201 | difficult | 29 | 0.680 | moderate |
| 6 | 0.897 | easy | 18 | 0.624 | moderate | 30 | 0.381 | moderate |
| 7 | 0.789 | easy | 19 | 0.840 | easy | 31 | 0.809 | easy |
| 8 | 0.773 | easy | 20 | 0.314 | moderate | 32 | 0.768 | easy |
| 9 | 0.789 | easy | 21 | 0.510 | moderate | 33 | 0.541 | moderate |
| 10 | 0.892 | easy | 22 | 0.804 | easy | 34 | 0.536 | moderate |
| 11 | 0.768 | easy | 23 | 0.510 | moderate | 35 | 0.887 | easy |
| 12 | 0.557 | moderate | 24 | 0.289 | difficult | | | |

based on the difficulty index coefficients, the average difficulty of the usbn items is 0.57, which falls in the moderate category under classical test theory. overall, the chemistry usbn items in yogyakarta city were therefore of moderate difficulty for students. ranked by difficulty index coefficient, the most difficult items are items 27, 17, 14, and 24, in that order.
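the categorization of table 1 can be reproduced mechanically from the cut-offs given in the method section (below 0.3 difficult, above 0.7 easy, otherwise moderate). the sketch below is a minimal check rather than the authors' procedure: the coefficients are copied from table 1, while the variable and function names are hypothetical.

```python
from collections import Counter

# difficulty index coefficients for items 1..35, copied from table 1
TABLE_1 = [
    0.758, 0.521, 0.448, 0.485, 0.335, 0.897, 0.789, 0.773, 0.789, 0.892,
    0.768, 0.557, 0.418, 0.232, 0.397, 0.727, 0.201, 0.624, 0.840, 0.314,
    0.510, 0.804, 0.510, 0.289, 0.361, 0.464, 0.057, 0.582, 0.680, 0.381,
    0.809, 0.768, 0.541, 0.536, 0.887,
]


def classify_difficulty(p):
    """Classical test theory cut-offs stated in the method section."""
    if p < 0.3:
        return "difficult"
    if p > 0.7:
        return "easy"
    return "moderate"


# number of items per category and their share of the 35 items, in percent
counts = Counter(classify_difficulty(p) for p in TABLE_1)
shares = {cat: round(100 * n / len(TABLE_1), 2) for cat, n in counts.items()}

# mean difficulty and the four hardest items (lowest proportion correct first)
mean_p = sum(TABLE_1) / len(TABLE_1)
hardest = sorted(range(1, len(TABLE_1) + 1), key=lambda i: TABLE_1[i - 1])[:4]

print(counts, shares, round(mean_p, 2), hardest)
```

with the values as listed in table 1, this gives 13 easy, 18 moderate, and 4 difficult items (37.14%, 51.43%, and 11.43% of 35 items), a mean of 0.57, and items 27, 17, 14, and 24 as the hardest.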
table 2. discrimination index coefficient based on classical test theory

| no | discrimination index coefficient | note | no | discrimination index coefficient | note | no | discrimination index coefficient | note |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.027 | good | 13 | -0.086 | poor | 25 | 0.146 | good |
| 2 | 0.051 | good | 14 | 0.227 | good | 26 | 0.110 | good |
| 3 | 0.141 | good | 15 | 0.054 | good | 27 | -0.072 | poor |
| 4 | 0.167 | good | 16 | 0.264 | good | 28 | 0.129 | good |
| 5 | 0.152 | good | 17 | 0.100 | good | 29 | 0.273 | good |
| 6 | 0.065 | good | 18 | 0.072 | good | 30 | 0.109 | good |
| 7 | 0.229 | good | 19 | 0.323 | good | 31 | 0.247 | good |
| 8 | 0.094 | good | 20 | -0.041 | poor | 32 | 0.184 | good |
| 9 | 0.182 | good | 21 | 0.283 | good | 33 | 0.234 | good |
| 10 | 0.146 | good | 22 | 0.331 | good | 34 | 0.282 | good |
| 11 | 0.316 | good | 23 | 0.016 | good | 35 | 0.161 | good |
| 12 | -0.007 | poor | 24 | 0.213 | good | | | |

based on the discrimination index coefficients in table 2, the average discrimination index of the usbn items was 0.146, which falls in the good category. overall, the chemistry test items in the usbn in yogyakarta city were therefore good items for students to work on. ranked by discrimination index coefficient, the most unfavourable items are items 13, 27, 20, and 12, in that order. based on the difficulty and discrimination index coefficients of the usbn questions, the data can be recapitulated as in table 3.

table 3.
data recapitulation based on classical test theory analysis difficulty index coefficient discrimination index coefficient easy (11.43%) moderate (51.43%) difficult (37.14%) good (88.57%) poor (11.43%) 13 items 18 items 4 items 31 items 4 items number 1, 6, 7, 8, 9, 10, 11, 16, 19, 22, 31, 32, and 35 number 2, 3, 4, 5, 12, 13, 15, 18, 20, 21, 23, 25, 26, 28, 29, 30, 33, and 34 number 14, 17, 24, and 27 number 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, and 30 number 12, 13, 20, and 27 based on the results of the analysis with the help of anbuso 4.0, four questions belonged to the difficult category. these problems consisted of material on salt hydrolysis, calculation of changes in reaction enthalpy, the process of making and use of chemical compounds, and the periodicity of the main metal elements. table 4. difficulty index coefficient based on item response theory no difficulty index coefecient note no difficulty index coefecient note no difficulty index coefecient note 1 -1.33 moderate 13 0.2 moderate 25 0.44 moderate 2 -0.23 moderate 14 1.1 moderate 26 0.01 moderate 3 0.07 moderate 15 0.29 moderate 27 2.74 difficult 4 -0.08 moderate 16 -1.16 moderate 28 -0.52 moderate 5 0.56 moderate 17 1.28 moderate 29 -0.95 moderate 6 -2.36 easy 18 -0.67 moderate 30 0.36 moderate 7 -1.51 moderate 19 -1.85 moderate 31 4.06 difficult 8 -1.41 moderate 20 0.67 moderate 32 2.48 difficult 9 -1.51 moderate 21 -0.22 moderate 33 -0.32 moderate 10 -2.31 easy 22 -1.6 moderate 34 1.98 moderate 11 -1.41 moderate 23 -0.2 moderate 35 2.95 difficult 12 -0.38 moderate 24 0.8 moderate https://doi.org/10.21831/reid.v7i1.31297 ummul karimah, heri retnawati, deni hadiana, pujiastuti, & eri yusron page 5 copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online) problems included in the unfavourable category were four items with material on acid-base titration, the solution colligative phenomenon, and the concept 
of periodicity of the main metal elements. following the classical test theory analysis, the difficulty coefficients according to item response theory are presented in table 4. the average difficulty of the usbn items is -0.00086, which falls in the moderate category under item response theory. this means that, overall, the chemistry usbn questions in yogyakarta city were of moderate difficulty for students. judging from the difficulty index coefficient, the most difficult items, in descending order, are items 31, 35, 27, and 32. the recapitulation of the difficulty index analysis based on item response theory is presented in table 5.

table 5. recapitulation of difficulty index based on item response theory
category | proportion | number of items | item numbers
easy | 5.71% | 2 items | 6 and 10
moderate | 82.86% | 29 items | 1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 33, and 34
difficult | 11.43% | 4 items | 27, 31, 32, and 35

based on the results of the quantitative data analysis, there were eight difficult items (four under each theory) and four bad items. item 27 fell into all three groups: it was difficult under both classical test theory and item response theory, and it was also a bad item. the quantitative analysis was followed by a qualitative analysis of the questions in order to describe the causes of the problems in the difficult and bad categories. several examples of difficult and bad items based on classical test theory and item response theory are as follows.

item 27
item 27 was the most difficult item based on classical test theory analysis and the third hardest based on item response theory. this item was also categorized as a bad question. the item is presented as follows.
consider the physical and chemical properties of the following three elements from an unknown period:
element | boiling point | electrical conductivity | ionization energy | form
k | 280°c | does not conduct electricity | 1,012 kj/mol | solid
l | 2470°c | conducts electricity | 0,579 kj/mol | solid
m | 58°c | does not conduct electricity | 1,21 kj/mol | gas
arrange them in order of increasing atomic number .....
a. m – l – k
b. m – k – l
c. k – m – l
d. l – k – m
e. l – m – k

the item presented a table of the physical and chemical properties of three elements from one period of the periodic table, listing the element initials, boiling points, electrical conductivity, ionisation energies, and physical states. students were asked to arrange the three elements in order of increasing atomic number. the difficulty index coefficient for this item was 0.057, in the difficult category. interestingly, this item was also in the poor category, with a discrimination coefficient of -0.072. the negative value indicates a poor item, because the information in the table was incomplete for answering the question. boiling point and electrical conductivity should not be needed to determine the atomic number of an element. periodic properties comprise ionisation energy, atomic radius, and electronegativity, which serve as indicators for inferring an element's atomic number and position. because the table did not include atomic radii and electronegativity, the information was insufficient, which made the item both difficult to answer and poor in quality.

item 17
after item 27, the next most difficult item based on classical test theory was item 17.
the item is presented as follows.

the average bond energy data are:
c – c = 348 kj/mol | o – h = 463 kj/mol
c – h = 414 kj/mol | o = o = 495 kj/mol
c – o = 358 kj/mol | c = o = 799 kj/mol
spiritus, which contains ethanol, is burned according to the reaction equation:
the resulting enthalpy change for burning 23 grams of ethanol is …… (ar h = 1; c = 12; o = 16)
a. -2500 kj
b. -1250 kj
c. -625 kj
d. +625 kj
e. +1250 kj

this item presented the reaction equation for the combustion of spiritus containing ethanol. students were asked to calculate the enthalpy change of the combustion reaction for 23 grams of ethanol, given the average bond energy data. the difficulty index coefficient of this item was 0.201, categorized as difficult. the item was difficult because it involved relatively long calculation steps and demanded accuracy, while the time available for each item was relatively short.

item 14
after items 27 and 17, the next most difficult item based on classical test theory analysis was item 14. the item is presented as follows.

consider the hydrolysis reaction data for the following salts:
no | salt formula | type of hydrolysis | hydrolysis reaction | estimated ph
1 | mgcl2 | total | mgcl2(aq) + 2h2o(l) → mg(oh)2 + 2h+(aq) + 2cl-(aq) | 7
2 | nach3coo | partial | nach3coo(aq) + h2o(l) → na+(aq) + ch3cooh(aq) + oh-(aq) | >7
3 | nh4ch3coo | total | nh4ch3coo(aq) + h2o(l) → nh4oh(aq) + ch3cooh(aq) | 7
4 | (nh4)2so4 | partial | (nh4)2so4(aq) + 2h2o(l) → nh4oh(aq) + h2so4(aq) | <7
5 | k2co3 | not hydrolyzed | k2co3(aq) + 2h2o(l) → 2k+(aq) + h2co3(aq) + 2oh-(aq) | >7
the pair of data in which the formula, the type of hydrolysis, and the hydrolysis reaction are correctly matched is ..
a. 1 and 2
b. 1 and 3
c. 2 and 3
d. 3 and 4
e. 3 and 5

this item presented a data table of the hydrolysis reactions of several salts, with columns for salt formula, type of hydrolysis, hydrolysis reaction, and estimated ph.
students were asked to select the two rows in which the entries in each column were correctly matched. the difficulty index coefficient of this item was 0.232, which placed it in the difficult category. the item was difficult because students needed a strong conceptual understanding to avoid confusing the characteristics of salts that undergo total hydrolysis, partial hydrolysis, or no hydrolysis.

item 24
item 24 was the most difficult item because it had the lowest difficulty index compared to other items. the item is presented as follows.

[figure residue, reaction equation of item 17: ethanol + 3 o = o → 2 o = c = o + 3 h – o – h]

the following table contains the formulas, manufacturing processes, and uses of chemical compounds:
no | compound formula | making process | usability
1 | na2co3 | castner-kellner method, by electrolyzing nacl | used as a water softener in washing clothes
2 | naoh | solvay process | used to remove rust stains on iron
3 | naclo | hooker process, in which chlorine is passed through cold, dilute hydroxide solution | used in bleach liquid
4 | nh3 | haber process with a high temperature of ± 450 °c and pressure between 200-400 atm | used as a raw material for the manufacture of the rocket fuel hydrazine
5 | h2so4 | contact process with vanadium pentoxide (v2o5) catalyst | used as a catalyst in the reaction of making alkenes
the correct pairs of compound formulas, manufacturing processes, and uses of compounds are ...
a. (1), (2), and (3)
b. (1), (3), and (4)
c. (2), (3), and (5)
d. (2), (4), and (5)
e. (3), (4), and (5)

this item presented a table containing the formulas of chemical compounds, the manufacturing process, and the uses of each chemical compound.
students were asked to match each chemical compound with its manufacturing process and its use. the difficulty coefficient for this item was 2,289, which belonged to the difficult category. the item was difficult because it required not only understanding but also specific memorization of the processes for making the compounds, whose names are taken from the names of their inventors.

item 20
the next item with a negative discrimination coefficient was item 20. item 20 is presented as follows.

consider the following illustration of the composition of the solution!
the most appropriate statement is ...
a. the vapour pressure of solution (a) is higher than that of solution (b)
b. the freezing point of solution (a) is higher than that of solution (b)
c. the boiling point of solution (a) is higher than that of solution (b)
d. the boiling point of solution (a) is the same as that of solution (b)
e. the freezing point of solution (a) is the same as that of solution (b)

this item presented an illustration of the composition of solutions of a non-volatile solute: solution a contains three substances and solution b contains five substances. students were asked to identify the most appropriate of the five statements listed in the answer choices. judged by the difficulty index coefficient, this item belongs to the moderate category, but judged by the discrimination coefficient, it is a bad question. the problem is similar to the one that occurs in item 13. it arises because students' grasp of the colligative properties of solutions is still weak when the material is presented as an image. hence, even able students can be distracted by the statements in the answer choices, and the quality of the question is poor.

item 31
item 31 had the highest difficulty coefficient among the four difficult items based on item response theory. item 31 is presented as follows.
consider the following organic chemical reactions!
(1) ch3 – ch2 – oh + hcl (concentrated) → ch3 – ch2 – cl + h2o
(2) ch3 – ch2cl + ch3ok → ch2 = ch2 + kcl + ch3oh
(3) ch3 – ch = ch2 + hcl → ch3 – ch2 – ch2cl
(4) ch3 – ch3 + cl2 → (uv) ch3 – ch2cl + hcl
(5) ch3cooh + ch3oh → ch3cooch3 + h2o
the elimination reaction is shown in reaction equation number …
a. (1)
b. (2)
c. (3)
d. (4)
e. (5)

the item presented several reaction equations, and students were asked to choose the equation that shows an elimination reaction. an elimination reaction removes two substituents from a molecule. it is usually characterized by the change of a single bond into a double bond with the release of a small molecule. among the answer choices, reaction equation (2) shows this pattern. the item was probably difficult for students because of their lack of understanding of the concepts of organic chemistry.

based on the results of the quantitative analysis using classical test theory and item response theory, students' difficulties in the chemistry subject were found in the material on the nature of periodicity, salt hydrolysis, enthalpy of reaction, manufacture and use of chemical compounds, organic chemical reactions, formation of petroleum fractions, and the structure, nomenclature, properties, and classification of macromolecules. judging from the presentation of the items, students had difficulty in understanding the concepts of the nature of periodicity, salt hydrolysis, and organic chemical reactions.
another difficulty was in calculating the change in reaction enthalpy, while in other materials students had difficulty in answering the items because of their limited knowledge, since the material depended on memorization. based on the analysis results, many students had difficulty in answering questions related to chemical concepts and calculations.

one of the chemical concepts that students in yogyakarta city found difficult in the usbn was salt hydrolysis. chemical equilibrium, acid-base solutions, and buffer solutions are the basic material for studying salt hydrolysis. a lack of understanding of these concepts leads students to form misconceptions about salt hydrolysis. this is consistent with maratusholihah et al. (2017, p.919), who state that chemical equilibrium and acid-base material are prerequisites for buffer solutions and salt hydrolysis. other studies also revealed that students experience difficulties in understanding chemical materials that require an understanding of the prerequisite concepts of chemical equilibrium (indriani et al., 2017, p.10).

the misconceptions were caused by material, teacher, or student factors. material that tends to be abstract and complex makes it difficult for students to understand salt hydrolysis. in addition, chemistry learning is often oriented more towards the chemical calculations needed to answer problems, and so sets aside the basic concepts that should be understood first. this is consistent with research conducted by maratusholihah et al. (2017, p.919), which revealed that studying buffer solutions and salt hydrolysis requires integration of the macroscopic, microscopic, and symbolic aspects of chemistry.
errors caused by the lack of integration of these aspects lead to misconceptions among students, which stem from students' daily experiences, subject textbooks, and the teacher's instruction. one solution that can reduce misconceptions among students is to improve the teaching method so that learning is more enjoyable and students do not get bored in the learning process. this was studied by akbar (2016, p.6), who looked for solutions to students' difficulties in studying salt hydrolysis based on teachers' reflections. the results show that the teachers suggested extending the material to the modern applicative level, increasing the practicum, administering a pretest with a maximum of three items, using assessment tools with multiple-choice questions, and constructing questions related to material outside hydrolysis. this material can be drawn from the prerequisites for salt hydrolysis, namely chemical equilibrium, acid-base, and buffer solutions. these studies indicate that the teacher has a very large role in resolving students' misconceptions, especially about chemical concepts.

difficulties with other chemical concepts also occur in the material on the nature of periodicity. students have difficulty in determining and ordering atomic numbers based on the periodic properties listed. the nature of periodicity is a basic concept that must be understood when students study chemistry. this difficulty has rarely been reported or studied by researchers, because the material is not generally categorized as difficult and students seldom complain about it. therefore, teachers and education experts can examine the problem as a new reference for teachers in teaching.
in addition, chemical concepts combined with chemical calculations were also a cause of the difficulty of the usbn items in yogyakarta city. the chemical calculations presented in the items were the enthalpy calculations of reactions in the thermochemistry chapter. this material required a deep understanding of the concepts and calculations of chemical reactions. thermochemistry is one of the chemistry topics considered relatively difficult and less attractive to students because it involves many calculations. errors in calculation are often caused by poor understanding of concepts. the concept of thermochemistry is very closely related to previous material, such as the concept of chemical equilibrium. a weak grasp of chemical equilibrium becomes an additional obstacle to studying thermochemistry. this is in line with research conducted by sugiawati (2013, p.27), which found misconceptions about thermochemistry that made it difficult for students to identify a reaction equation and caused errors in calculating the Δh of reactions in the questions. one cause is teaching that focuses directly on chemical calculations and pays little attention to building students' conceptual understanding. many studies have addressed these problems by introducing innovations to overcome the difficulties of learning thermochemistry. one study was conducted by dirgahayuning (2017, p.14), who implemented an active learning strategy, learning start with question (lsq), to achieve student learning completeness. this strategy encourages students to ask questions actively in the learning process: students are required to formulate questions before the teacher's explanation, with an emphasis on reading and questioning skills before entering class.
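the length of the calculation behind item 17 can be made concrete with a short worked sketch. it uses the average bond energies listed in the item and assumes that the reaction equation (lost in the source layout) is the complete combustion c2h5oh + 3 o2 → 2 co2 + 3 h2o; the bond counts below follow from that assumption and from the usual structural formula of ethanol. the function and variable names are illustrative only and do not come from the article, and the article does not state the answer key.

```python
# Average bond energies (kJ/mol) as listed in item 17.
BOND_ENERGY = {
    "C-C": 348, "C-H": 414, "C-O": 358,
    "O-H": 463, "O=O": 495, "C=O": 799,
}

def total_bond_energy(bond_counts):
    """Sum of (bond energy x number of bonds) for a set of bonds."""
    return sum(BOND_ENERGY[bond] * n for bond, n in bond_counts.items())

# Assumed reaction: C2H5OH + 3 O2 -> 2 CO2 + 3 H2O
# Bonds broken: ethanol (1 C-C, 5 C-H, 1 C-O, 1 O-H) plus 3 O=O.
bonds_broken = {"C-C": 1, "C-H": 5, "C-O": 1, "O-H": 1, "O=O": 3}
# Bonds formed: 2 CO2 (4 C=O) plus 3 H2O (6 O-H).
bonds_formed = {"C=O": 4, "O-H": 6}

# Enthalpy change per mole of ethanol: bonds broken minus bonds formed.
delta_h_per_mol = total_bond_energy(bonds_broken) - total_bond_energy(bonds_formed)

# Scale to 23 g of ethanol using Ar H = 1, C = 12, O = 16.
molar_mass = 2 * 12 + 6 * 1 + 16      # 46 g/mol
moles = 23 / molar_mass               # 0.5 mol
delta_h = delta_h_per_mol * moles

print(f"per mole: {delta_h_per_mol} kJ/mol, for 23 g: {delta_h} kJ")
```

under these assumptions the bonds broken total 4724 kj and the bonds formed total 5974 kj, giving -1250 kj per mole and -625 kj for 23 g, which corresponds to option c. the sketch also shows why the item is demanding under time pressure: students must count ten kinds of bond contributions correctly and then convert mass to moles before reaching the answer.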
in addition to the concepts of hydrolysis, the nature of periodicity, and the enthalpy of reaction, some chemistry topics require memorization and draw on students' daily experience. one example is the material on the process of making chemical compounds. after students understand the basic concepts of elements and compounds, they also need to know how these compounds are made. difficulties also occur in the material on organic chemical reactions, the formation of petroleum fractions, and the structure, nomenclature, properties, and classification of macromolecules. this material is difficult because not all students can memorize well and because students' knowledge is limited. the problem is rarely reported and has not been studied deeply by researchers, so teachers and education experts can examine it as a new reference in learning chemistry.

the problems in chemistry include a lack of understanding of concepts, limited accuracy in calculation, and the need to memorize certain materials. in addition to improving the learning system by designing better learning strategies, teachers can also motivate students to be more active in learning. learning in class is sometimes still focused on the teacher, so some students feel bored, sleepy, and indifferent. according to stuckey et al. (2013, p.27), the teacher must realize that not all students have the same interests and must pay attention to the consequences. many students are still not intrinsically motivated by science learning. these conditions lower students' learning motivation and affect their cognitive achievement (lubis & ikhsan, 2015, p.192). student learning motivation will increase with more structured learning. one way to increase students' motivation and learning outcomes is to give structured assignments accompanied by feedback in direct learning (sabriani, 2012, p.41).
these structured tasks can help students and make learning easier, since they provide a clear learning syntax. in addition to structured tasks, feedback such as awards can also increase students' learning motivation.

another factor that can support the process of learning chemistry is instilling in students the significance of science learning. according to fitriani et al. (2014, p.2), meaningfulness in learning can also grow from students' good chemical literacy skills. juntunen and aksela (2013, p.150) studied chemistry teaching that combines socio-scientific issues, life cycle analysis (lca), and inquiry-based learning (ibl); this approach has the potential to foster the goals of modern education, including scientific literacy. good chemical literacy skills can be improved by connecting chemical concepts to everyday life, so that learning becomes more meaningful and students retain the material in long-term memory rather than learning it only to answer questions.

these problems lie not only with students but can also be caused by teacher factors. this is shown by the results of the analysis with the help of quest: the reliability of the usbn questions in yogyakarta city was 0.48, categorized as less reliable. according to azwar (2008, p.113), the coefficient of reliability lies in the range of 0.00 to 1.00. the closer the coefficient is to 1.00, the more reliable the instrument; the closer it is to 0.00, the less reliable.
this makes the assessment standards less appropriate for measuring students' ability in the chemistry subject, especially for questions of a national standard such as the usbn, because psychometric evidence is very important in constructing an instrument (arjoon et al., 2013, p.536). these problems can be addressed by increasing teachers' insight into preparing and conducting assessments, for example through teacher training. teachers can also conduct classroom studies to find ways of making learning more effective and efficient, which can improve students' motivation and learning outcomes.

conclusion

the results show that, in terms of difficulty based on classical test theory, the usbn instrument for the chemistry subject consisted of 13 easy items, 18 moderate items, and four difficult items, and in terms of the discrimination index it consisted of 31 good items and four bad items. based on item response theory, the item difficulty index consisted of two easy items, 29 moderate items, and four difficult items. the estimated reliability was 0.48, which indicates that the questions are less reliable. items in the difficult category covered the properties of periodicity, salt hydrolysis, reaction enthalpy, manufacture and use of chemical compounds, organic chemical reactions, the formation of petroleum fractions, and the structure, nomenclature, properties, and classification of macromolecules. meanwhile, items in the bad category covered acid-base material, acid-base titration, the colligative properties of solutions, and the concept of periodicity. these materials were categorized as difficult and poor because of students' lack of conceptual understanding, lack of accuracy in calculation, and the difficulty of memorizing certain chemical materials. the poor items in this usbn were of the type that asks students to choose the right answer from a given phenomenon.
these problems can be addressed by adopting appropriate learning strategies, providing structured assignments, and providing feedback to students, so that the learning process becomes more meaningful in students' daily lives and increases their motivation and learning achievement. the first step in addressing the easy, bad, and unreliable items is to review teachers' ability to prepare and conduct assessments. these results can be used as a consideration for improvement, for example by conducting special training for teachers, especially on national assessments for students' graduation standards. the results also invite teachers and education experts to conduct research on more effective and efficient classroom learning of chemistry, especially for the manufacture and use of chemical compounds, the concept of periodicity, and organic chemical reactions, which are still considered difficult for students and are still very rarely studied. this innovation is needed to improve the quality of education, especially in the chemistry subject, in yogyakarta city.

acknowledgment

the authors thank the center of assessment and instruction, indonesia, for supporting this research in terms of funding and data retrieval. the research was also supported by the chemistry teachers in yogyakarta city, who provided the usbn result data.

references

akbar, s. a. (2016). desain didaktis pembelajaran hidrolisis didasarkan hasil refleksi diri guru melalui lesson analysis. jurnal edukasi kimia, 1(1), 6–11. http://ojs.serambimekkah.ac.id/jek/article/view/161
ariani, l. h. d., maimunah, m., & roza, y. (2019). analisis kesalahan siswa dalam menyelesaikan soal usbn matematika sma. edumath, 8(1), 42–48.
https://ejournal.stkipjb.ac.id/index.php/math/article/view/1117
arjoon, j. a., xu, x., & lewis, j. e. (2013). understanding the state of the art for measurement in chemistry education research: examining the psychometric evidence. journal of chemical education, 90(5), 536–545. https://doi.org/10.1021/ed3002013
azwar, s. (2008). reliabilitas dan validitas. pustaka pelajar.
dirgahayuning, a. (2017). penerapan strategi pembelajaran aktif learning start with question untuk mencapai ketuntasan belajar siswa pada pokok bahasan termokimia kelas xi ipa 6 sma negeri 5 pekanbaru. perspektif pendidikan dan keguruan, 8(2), 13–20. https://journal.uir.ac.id/index.php/perspektif/article/view/775
fitriani, w., hairida, h., & lestari, i. (2014). deskripsi literasi sains siswa dalam model inkuiri pada materi laju reaksi di sman 9 pontianak. jurnal pendidikan dan pembelajaran, 3(1), 1–13. https://jurnal.untan.ac.id/index.php/jpdpb/article/view/4432
herkusumo, a. p. (2011). penyetaraan (equating) ujian akhir sekolah berstandar nasional (uasbn) dengan teori tes klasik. jurnal pendidikan dan kebudayaan, 17(4), 455–471. https://doi.org/10.24832/jpnk.v17i4.41
indriani, a., suryadharma, i. b., & yahmin, y. (2017). identifikasi kesulitan peserta didik dalam memahami kesetimbangan kimia. j-pek (jurnal pembelajaran kimia), 2(1), 9–13. https://doi.org/10.17977/um026v2i12017p009
juntunen, m., & aksela, m. (2013). life-cycle analysis and inquiry-based learning in chemistry teaching. science education international, 24(2), 150–166.
kolte, v. (2015). item analysis of multiple choice questions in physiology examination. indian journal of basic and applied medical research, 4(4), 320–326. https://www.ijbamr.com/assets/images/issues/pdf/september%202015%20320-326.pdf.pdf
lubis, i. r., & ikhsan, j. (2015). pengembangan media pembelajaran kimia berbasis android untuk meningkatkan motivasi belajar dan prestasi kognitif peserta didik sma. jurnal inovasi pendidikan ipa, 1(2), 191–201.
https://doi.org/10.21831/jipi.v1i2.7504
maratusholihah, n. f., rahayu, s., & fajaroh, f. (2017). hidrolisis garam dan larutan penyangga. jurnal pendidikan: teori, penelitian dan pengembangan, 2(7), 919–926. http://journal.um.ac.id/index.php/jptpp/article/view/9645/4559
mardapi, d. (2016). pengukuran, penilaian, dan evaluasi pendidikan. parama publishing.
retnawati, h. (2015). hambatan guru matematika sekolah menengah pertama dalam menerapkan kurikulum baru. jurnal cakrawala pendidikan, 34(3), 390–403. https://doi.org/10.21831/cp.v3i3.7694
ristiyani, e., & bahriah, e. s. (2016). analisis kesulitan belajar kimia siswa di sman x kota tangerang selatan. jurnal penelitian dan pembelajaran ipa, 2(1), 18–29. https://doi.org/10.30870/jppi.v2i1.431
sarea, m. s., & hadi, s. (2015). analisis kualitas soal ujian akhir semester mata pelajaran kimia sma di kabupaten gowa. jurnal evaluasi pendidikan, 3(1), 35–43. http://journal.student.uny.ac.id/ojs/index.php/jep/article/view/1223
sabriani, s. (2012).
penerapan pemberian tugas terstruktur disertai umpan balik pada pembelajaran langsung untuk meningkatkan motivasi dan hasil belajar siswa (studi pada materi pokok struktur atom kelas x6 sma negeri watampone). chemica: jurnal ilmiah kimia dan pendidikan kimia, 13(2), 39–46. https://ojs.unm.ac.id/index.php/chemica/article/view/625
stuckey, m., hofstein, a., mamlok-naaman, r., & eilks, i. (2013). the meaning of ‘relevance’ in science education and its implications for the science curriculum. studies in science education, 49(1), 1–34. https://doi.org/10.1080/03057267.2013.802463
sugiawati, v. a. (2013). penggunaan strategi konflik kognitif dalam pembelajaran tps untuk mereduksi miskonsepsi siswa pada materi termokimia. jurnal nalar pendidikan, 1(1), 26–31. https://ojs.unm.ac.id/nalar/article/view/1935
taber, k. s. (2013). revisiting the chemistry triplet: drawing upon the nature of chemical knowledge and the psychology of learning to inform chemistry education. chemistry education research and practice, 14(2), 156–168. https://doi.org/10.1039/c3rp00012e
yustika, a., susatyo, e. b., & nuswowati, m. (2015). uji kriteria instrumen penilaian hasil belajar kimia. jurnal inovasi pendidikan kimia, 8(2), 1330–1339.
https://journal.unnes.ac.id/nju/index.php/jipk/article/view/4438

reid (research and evaluation in education), 7(1), 2021, 23-34
available online at: http://journal.uny.ac.id/index.php/reid
copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online)

evaluating the implementation of the character education strengthening program of vocational high schools in yogyakarta city

edhy susatya*; budi santosa; andriyani; dwi ariyani
universitas ahmad dahlan
jl. pramuka no.42, pandeyan, umbulharjo, yogyakarta 55161, indonesia
*corresponding author. e-mail: edhy.susatya@yahoo.com

article info
article history
submitted: 20 january 2021
revised: 14 april 2021
accepted: 14 april 2021
keywords
character education; implementation; vocational high school

abstract
the research aims to evaluate the character education strengthening program (cesp) implementation in vocational high schools (vhs) throughout yogyakarta city. the evaluation covers the functions of (1) planning, (2) implementation, and (3) evaluation. in this descriptive qualitative research, the researchers collected information related to the implementation of cesp. the subjects are school principals, vice principals, and teachers, determined using the snowball sampling technique in four vocational high schools in yogyakarta city. the evaluation uses the discrepancy model, which looks for the gap between planning and implementation. data were collected through observation, interviews, and documentation, validated using source and collection triangulation, and analyzed using a descriptive technique carried out during and after data collection within a certain period. the findings show that: (1) the cesp planning consists of the elements of initial assessment, cesp socialization, vision and mission, policy design, and cesp design, with an average score of 2.74, meaning that it is in a good category; (2) the cesp implementation consists of the elements of cesp development in learning, school culture development, community participation, and implementation of the cesp main values, with an average score of 2.98, meaning that it is in a good category; and (3) the cesp evaluation has an average score of 2.50, meaning that it is in a good category. the results of this study are recommended as a consideration for mapping the cesp implementation, determining education policies related to character education, and developing cesp models, as a reference for character research, and as material for discussions regarding character education.

this is an open access article under the cc-by-sa license. https://creativecommons.org/licenses/by-sa/4.0/

how to cite:
susatya, e., santosa, b., andriyani, a., & ariyani, d. (2021). evaluating the implementation of the character education strengthening program of vocational high schools in yogyakarta city. reid (research and evaluation in education), 7(1), 23-34. doi:https://doi.org/10.21831/reid.v7i1.38029

introduction

the human development index (hdi) treats education, together with life expectancy and per capita income, as one of the indicators that determine the success of the development of human quality. according to hdi by country 2020, indonesia was in 117th place out of 190 countries. with a total population of 273,523,615 people, indonesia has an hdi value of 0.69, the same as vietnam and bolivia. hdi classifies the success rate of human quality development into four levels, namely: (1) very high, with a value of 0.8–1.0, (2) high, with a value of 0.7–0.79, (3) medium, with a value of 0.55–0.70, and (4) low, with a value below 0.55. with a value of 0.69, indonesia is at the medium level (world population review, 2020). the success of education is affected by many aspects, one of which is the implementation of character education.

negative behaviours are currently widespread among students, including acts of violence, vandalism, student brawls, drug abuse, corrupt behavior, plagiarism, cheating on exams, and social unrest. manasikana and anggraeni (2018) describe that in 2010, there were at least 128 cases of student brawls. this figure increased sharply in 2011 to 330 cases of student brawls that killed 82 students. in the first semester (january-june) of 2012, some brawls killed 139 students. the national narcotics agency estimates that the number of drug addicts increased by 2.8% in 2015 and that drug use is beginning to reach primary school-age children. the indonesian child protection commission (icpc) explained that the metro jaya police department arrested 171 students during a demonstration against the job creation law on tuesday, october 20, 2020. the 171 students are in the age range of junior high school, high school, and vocational high school (vhs). they were detained in several places for committing vandalism, social unrest, arson, and fighting against officers (egeham, 2020). these data and problems show how important character education is in educational institutions.
character education, which is contained in moral and religious teaching materials, which emphasizes the left brain aspect, has not been able to foster strong character and student creativity. the concept of character education must be integrated with the school climate, culture, and learning. as stated by nast et al. (2020), character education is intentional efforts to develop in young people core ethical and performance values, which are widely affirmed across all cultures. to be effective, character education must include all stakeholders in a school community and must permeate school climate, culture, teaching, and learning. character is one of the keys to the success of education. ki hadjar dewantara said that education is an effort to promote the growth of moral (character), mind (intellect-competence), and physic (skills-literacy). the three parts must not be separated from each other so that we can advance the perfection of a child's life. thus, the success of education depends on the parties' ability to combine character, competence, and literacy skills in one educational concept. strong character combined with high competence will produce a human resource that is strong, competitive, has integrity, and is reliable at work (ministry of education and culture, 2017). character education develops cultural values and character in students so that it becomes the basis for thinking, behaving, and acting in developing themselves as individuals, members of society, and citizens. character education is the development of values that come from ideology, history of the indonesian nation, religion, culture, and the values stated in national education goals. mujiyati et al. (2019) believe that the problem-based local history module effectively improves the students' critical thinking ability. through this module, the students are directed to understand various aspects affecting a problem and relate it to the knowledge they have owned. 
character education is developed based on the sources of values contained in (1) religion as a source of divine norms, (2) pancasila (five pillars of the nation) as a proven source of character education able to ward off all forms of challenges and divisions, (3) culture as the basis for interpreting events, phenomena, and incidents in society, and (4) the goals of national education as the basis for formulating the goals of character education (ministry of education and culture, 2017). character consists of knowing the good, desiring the good, and doing the good. in this case, habits of the mind, habits of the heart, and habits of action are needed (zubaedi, 2011). the objectives of the character education strengthening program (cesp) are (1) to build and equip students as indonesia's golden generation in 2045 with the spirit of pancasila, (2) to develop a national education platform that places character education as the main soul in education, and (3) to revitalize and strengthen the potential and competence of educators, teaching staff, students, community, and family environment. character education aims to improve the quality of educational processes and outcomes that lead to a complete, integrated, and balanced formation of students' character and noble character, in accordance with the competency standards of graduates in each school (presidential regulation no. 87 of 2017). character education requires a value internalization process. it must be implemented in all subjects taught in class, underlie all student activities, and become the deoxyribonucleic acid (dna) of all learning activities. to demonstrate their abilities, think rationally, and analyze something they see, students must learn and develop competence, tolerance, and togetherness. jannah et al.
(2018) describe that character education is implemented by integrating it into the school curriculum and through learning methods. widiyani et al. (2020) state that the goal of the school partnership program, which is carried out by signing an mou with partner institutions, is to improve quality in fostering students' interests, talents, and achievements. riantoni and nurrahman (2020) explain that there is a significant relationship between students' honesty character and integrated science learning outcomes, in the very strong category, with a pearson correlation of 0.919. the implementation of the cesp must consider, among other things, age and learning models, be based on diverse backgrounds and individual differences, and establish collaboration among all education providers. age is a consideration for the implementation of character education because vocational high school students are aged 15-21 years and have various levels of cognitive ability. vocational school students are at the stage of moving from pedagogy to andragogy, so the learning strategy strongly influences the success of education. learning models that are suitable, effective, and efficient are the capital of successful education. wibowo et al. (2018) state that integrated learning is modeled by conducting a curriculum study on the basic competencies of adaptive subjects and inserting the basic competencies of productive subjects into adaptive subjects. after the integrated learning model is prepared, the learning model applied to the class is a competency-based learning model. the diversity of backgrounds and individual differences of students is the basis for the implementation of the cesp.
handayani and wulandari (2017) write that applying character values to all students must be based on the diversity of backgrounds (religion, culture, socio-economic status, ethnicity, language, and intellectual abilities) in the school environment. this is confirmed by mayhew and rockenbach (2021) that development occurs through exposure to and participation in college experiences that help students achieve the outcomes related to religious, spiritual, and worldview development. the collaboration of education providers (school principals, teachers, and families) ensures the success of character education. the principal must be able to create an environment that is conducive to improving the quality of character education and creating a good school cultural climate. effendi (2020) explains that the steps of the principal's transformational leadership role based on cultural, humanistic, and nationalism approaches effectively optimize the implementation of character education strengthening in schools. teachers, as the spearhead of character education, must have professional competence, as stated by bunyamin (2016) that teacher professional competence is the ability of a teacher to manage the teaching and learning process, the ability to manage learning supported by classroom management, and the mastery of learning materials, teaching strategies, and use of learning media. wardoyo et al. (2020) write that teachers understand pedagogical and professional competencies by applying learning strategies and methods relevant to students' characteristics, integrating strengthening character education, literacy, high order thinking skills, and 21th-century skills. meanwhile, family is an inseparable part of character education, as shown by asbari et al. (2019) that parenting style and genetic personality have a positive effect and significantly contribute to children's character building. 
vocational high schools (vhs) are formal education institutions that provide vocational education at the secondary education level under the directorate of vocational education. vhs prioritizes the professional development of student competencies in accordance with certain jobs. for this reason, a vocational learning model integrated with the business and industrial sectors is needed to provide work experience to students. barabasch and keller (2020) state that some factors contribute to positive learning experiences for apprentices and support the development of the competencies essential in the modern workplace. they include taking the initiative, acting autonomously, communicating challenges and seeking advice, critical thinking, self-management, the ability to work in different teams, and the apprentices' management of their own learning process. misbah et al. (2020) state that school principals, teachers, and students noticed the realization of comprehensive cbe framework principles in the study program to differing degrees, except for the principle of flexibility, which is largely absent.

the vhs collaboration program with the business and industrial sectors aims to bring graduates closer to work situations and to develop independence and social assimilation. mortaki (2012) reveals that vocational education helps in the social assimilation of various social groups. the development of vocational education and training is based on anticipating quantitative long-term demand for labor and educational needs and on qualitative anticipation of the needs for skilled workers at the national level. mgaiwa and poncian (2016) state that cooperation or partnerships in education can solve several problems in financing, management, access, and quality of education.
this is in accordance with the theory that in a partnership, there are two or more parties who form a cooperation bond based on agreement and mutual need in order to create the active participation of partnership members and to increase the capacity and capability of a particular field so that it can get better results (wheeler et al., 2018). unfortunately, nowadays, many schools are shackled by an education system that prioritizes material content and demands standardization so that they forget the character aspect as the basis of education. widowati and retnowati (2016) research shows that the implementation of character education in state senior high schools in yogyakarta is in a good category, but there is a gap among exemplary teachers, unavailability of an honesty canteen, and unavailability of facilities for finding lost items. the integration of character education in school subjects is in a good category, but the learning implementation plan made by the teacher does not include student character assessment. school culture in the implementation of character education is also in the good category, but the implementation of al-qur'an recital every morning for muslim students has not been carried out. 
the data on the results of the evaluation of the implementation of cesp in special needs schools in 2017 show that: (1) not all schools have developed the capacity of the environment, school committees, and other learning resources to support the implementation of the cesp; (2) there are still many schools that do not have instruments and program documentation of cesp, there are still many schools whose principals, teachers, and school committee members have not carried out regular and sustainable monitoring activities, and there are still schools that have not included students in cesp evaluations; (3) there are several obstacles in the implementation of cesp, including: lack of facilities and infrastructure, the absence of continuous cesp training for educators, lack of understanding of the parents and schools about the values of cesp; (4) schools have implemented cesp programs in schools but are not yet organized, well documented, and do not formally have a legal umbrella in the implementation of cesp in schools (ministry of education and culture, 2017). based on the description of the concept and the research results on the implementation of cesp programs, the researchers conducted an evaluation study on the implementation of character education in vocational high schools in yogyakarta city, which had never been done so far. the research is focused on evaluating the cesp in terms of the functions of planning, implementation, and evaluation. the research sample is four vocational high schools in yogyakarta city. the benefits of the research results can be expected to be used as material for mapping the implementation of cesp in vocational high schools, determining education service policies related to cesp, and developing a cesp model as a reference for character education research, and the material for discussion about cesp. method this research is descriptive qualitative research. 
in this study, the researchers collected the information related to the implementation of cesp at the time the study was conducted, without making changes to the subject under study. the evaluation uses the discrepancy model, which is to find the gap between planning programs and implementation. the evaluation stages include: identifying character education goals, analyzing the three functions of cesp activities (planning, implementation, and evaluation), compiling grids, retrieving data, and processing data. the research subjects are school principals, vice principals, and teachers. the determination of research informants was carried out by using the snowball sampling technique. the research object is the implementation of cesp with a sample of vocational high schools in yogyakarta city. the research was conducted from june to december 2020.

the data collection, in an effort to meet the credibility of the implementation of cesp, used natural setting techniques (natural conditions) as data sources, namely: participant observation, in-depth interviews, and documentation. the instruments consist of main instruments and supporting instruments. the main instruments are the researchers themselves (human instruments), while the supporting instruments include interview guides, observation sheets, and documentation checklists. the interview instrument consists of 63 items, based on the evaluation and supervision instrument of the ministry of education and culture's cesp program. the data collection techniques, instruments, and data sources are presented in table 1.

table 1. techniques of data collection
functions: planning, implementation, evaluation
techniques of data collection: interviews, observation, document checking
instruments: interview guide, observation sheet, checklist
data sources: principals, vice principals, and teachers

the data analysis technique in this study used the qualitative descriptive analysis, which was carried out at the time the data collection took place and after completing data collection within a certain period. the credibility of the data collected is measured through triangulation techniques so that valid data are obtained (arikunto, 2017).

findings and discussion

findings

the data analysis was performed by grouping 63 questions into three functional groups: planning, implementation, and evaluation. the planning function contains 22 questions, including initial assessment, cesp socialization, vision and mission, policy design, and cesp design. the implementation function contains 32 questions, consisting of the elements: developing cesp in teaching, developing school culture, community participation, and implementing the main values of cesp. meanwhile, the evaluation function contains nine questions that contain the elements of cesp evaluation implementation. in detail, the function grid, elements, and assessment indicators are shown in table 2.

table 2. grids of functions, elements, and assessment indicators
functions        elements                                indicators
planning         1. initial assessment                   item numbers 1-5
                 2. cesp socialization                   item numbers 6-8
                 3. vision, mission, and formulation     item numbers 9-13
                 4. cesp policy design                   item numbers 14-17
                 5. cesp program design                  item numbers 18-22
implementation   1. developing cesp in teaching          item numbers 23-28
                 2. developing school culture            item numbers 29-43
                 3. community participation              item numbers 44-49
                 4. implementing main values of cesp     item numbers 50-54
evaluation       cesp evaluation                         item numbers 55-63

indicators of planning function

the planning function consists of five elements: assessment, socialization, vision and mission, policy design, and program design. the initial assessment is indicated by the indicators of identifying learning resources and infrastructure, human resources who understand cesp, the potential for school culture, sources of funding for cesp development, and school governance. cesp socialization is indicated by the indicators of cesp dissemination to education stakeholders, prioritizing cesp's main values, and determining the distinctive values of schools. the vision and mission are indicated by the indicators of the formulation of vision and mission according to the main values of character, core values, school branding, the main values according to the 21st-century competencies, and the vision, mission, and branding contained in the curriculum. the cesp policy design is indicated by the indicators of the formation of a cesp implementation team, school regulations that support the implementation of cesp, services for students with disabilities, and norms and regulations for the growth of core character values. the cesp design is indicated by the indicators of the compilation of cesp development, featured programs integrating main values, class-based featured programs, student facilitation programs, and school branding support activities.

implementation function indicators

the implementation function consists of developing cesp, developing school culture, community participation, and implementing the main values.
developing cesp in teaching is proven by the indicators of the integration of the main values in the lesson plans, learning materials with life issues, proper methods, inculcating the main values of character, teachers being class managers and role models, and sustainable capacities. developing school culture is evidenced by the indicators of the creation of featured cultural traditions, main values of character, local wisdom, learning culture, guidance and counseling that supports cesp, literacy culture, reading corners, featured traditions, reflecting school culture, school branding, religious values, nationalism values, independence value, mutual cooperation value, and integrity value. community participation is proven by the indicators of parents' support of cesp, the school committee's active role in cesp, community involvement in cesp, utilization of external learning sources, raising community funds, and sustainable community input and criticism. the implementation of the main values of cesp is proven by the indicators of the development of the religious dimension, the spirit of nationalism, the independence of students, the spirit of mutual cooperation, and the value of student integrity.

evaluation function indicators

the evaluation of cesp is measured by some indicators of the instrument's construction for successful cesp. these indicators include monitoring activities, feedback mechanisms, follow-up monitoring, activities involving all elements in implementation, use of infrastructures, utilization of media, and increasing academic and non-academic achievement.

discussion

the results of the interviews in the form of the answers to questions were analyzed based on interview guidelines, which were then matched with the results of observations (field checking) and document checking. for example, the initial assessment is measured by the indicators of identification of learning resources and infrastructure.
question: does the school identify learning resources and infrastructure? an answer of not identifying learning sources was given a score of 0; identifying fewer than three valuable learning sources, a score of 1; identifying three to six valuable learning sources, a score of 2; identifying six to nine valuable learning sources, a score of 3; and identifying more than nine valuable learning sources, a score of 4. respondents' answers are checked against the evidence in the field and the documents that the school owns, for example, checking learning resources by looking at management documents and checking infrastructures by reviewing rooms, equipment, and supporting facilities.

the indicator value is analyzed to find the mean score of the elements. the mean score of the elements is used to determine the category. there are four categories: “not implemented” for a mean score of 0.0-1.0; “quite implemented” for a mean score of 1.1-2.0; “well implemented” for a mean score of 2.1-3.0; and “very well implemented” for a mean score of 3.1-4.0. based on the process of grouping the items, assigning values to each element indicator, taking the mean value of the elements, taking each school's average score, and taking the mean score across the schools, the research findings are as follows.

research findings of the planning function

the score of the indicator of each element of the planning function consisting of assessment, socialization, vision and mission, policy design, and program design for the four schools is presented in table 3. furthermore, the average value of each element of the planning function is shown in table 4.

table 3. values of element indicators
schools   initial assessment   socialization   vision and mission   policy design   program design
          a  b  c  d  e        a  b  c         a  b  c  d  e        a  b  c  d      a  b  c  d  e
vhs s1    4  4  4  2  4        3  3  2         3  3  3  4  4        3  3  0  4      3  0  0  4  3
vhs s2    2  2  1  0  1        1  0  0         1  1  0  0  3        0  0  1  0      0  0  0  1  1
vhs s3    4  4  4  4  4        3  3  3         4  4  4  4  3        2  4  4  4      4  4  4  4  4
vhs s4    4  3  4  4  4        3  4  4         4  3  3  3  4        4  4  3  4      4  4  3  4  3

table 4. average scores of planning elements
elements                           vhs s1   vhs s2   vhs s3   vhs s4
initial assessment                 3.67     1.00     4.00     3.60
cesp socialization                 2.67     0.33     3.00     3.67
vision, mission, and formulation   3.40     1.00     3.60     3.60
cesp policy design                 3.33     0.33     3.50     4.00
cesp program design                2.01     0.40     4.00     3.60
school average scores              3.02     0.61     3.62     3.70
vhs average score                  2.74

table 4 shows that the planning of cesp at vhs 1 has a mean score of 3.02, which means that it is in a very good category. at vhs 2, it has a mean score of 0.61, which means that it is in a poor category. vhs 3 has a mean score of 3.62, which means that it is in a very good category, while vhs 4 has a mean score of 3.70, which means it is in a very good category. the mean score in the planning for cesp in vocational high schools in yogyakarta city is 2.74, which means that it is in a good category.

based on the respondents' answers, there are answers with a zero score (extreme score) in two vocational high schools, namely: (1) vhs 1 in terms of the service of students with disabilities, the featured program integration into learning, and the featured program of cesp; (2) vhs 2 in terms of the identification of learning resources, identification of cesp fundraising, formulation of cesp's main value priorities, cesp's distinctive values, school branding, formation of a cesp implementation team, regulations that support cesp implementation, cesp implementation rules, cesp program development, integration of featured programs in learning, and class-based featured programs.
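the category bands and averages described above can be sketched in python as follows. this is a minimal illustration, not the authors' procedure: the function name `categorize`, the dictionary layout, and the choice to read the four bands as 0-1, 1-2, 2-3, and 3-4 with each band including its upper bound (so that a mean such as 3.02 still lands in a band) are assumptions. the element means are copied from table 4, and the school figures are recomputed as plain means of the five elements, so they can differ from the printed values in the last decimal.

```python
from statistics import mean

# Upper bounds of the four category bands named in the text.
# Treating each band as "up to and including its upper bound" is an
# assumption; the text leaves values such as 3.02 between printed bands.
BANDS = [
    (1.0, "not implemented"),
    (2.0, "quite implemented"),
    (3.0, "well implemented"),
    (4.0, "very well implemented"),
]


def categorize(score: float) -> str:
    """Map a mean score on the 0-4 scale to its category label."""
    if not 0.0 <= score <= 4.0:
        raise ValueError("score must lie between 0 and 4")
    for upper, label in BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: 4.0 is the last upper bound")


# Planning element means per school, as printed in Table 4.
planning = {
    "vhs 1": [3.67, 2.67, 3.40, 3.33, 2.01],
    "vhs 2": [1.00, 0.33, 1.00, 0.33, 0.40],
    "vhs 3": [4.00, 3.00, 3.60, 3.50, 4.00],
    "vhs 4": [3.60, 3.67, 3.60, 4.00, 3.60],
}

# School average = mean of the element means; overall = mean across schools.
school_means = {school: mean(scores) for school, scores in planning.items()}
overall = mean(list(school_means.values()))

for school, value in school_means.items():
    print(f"{school}: {value:.2f} -> {categorize(value)}")
print(f"overall: {overall:.2f} -> {categorize(overall)}")
```

under this reading, a score that falls between two printed bands, for example 3.05, is assigned to the higher band.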
the results of the analysis show that the planning of cesp at vhs 2 is not implemented, which can be seen from the answers, with many having zero scores. the results of the analysis were corroborated by the results of the interviews with the acting principal of vhs 2 who said, “this school has not implemented cesp optimally because there are many obstacles in planning, starting from assessment to designing activities, and until now there is no definitive school principal”.

the findings on implementation function

the implementation function, which consists of elements of developing cesp in teaching, developing school culture, community participation, and implementation of the main values of cesp in the four schools, has varied results. the findings of the research at vhs 1 show that the score of 4 is obtained on the indicators including inculcating the main values of character, classroom management and teacher exemplary, sustainable capacity, learning culture, guidance and counseling support for cesp, featured traditions, reflecting school culture, school branding, religious values, nationalism, independence, integrity, and the spirit of mutual cooperation. a score of 3 is obtained on the indicators, including the content of learning materials, accuracy of methods, implementation of parental support, community involvement, utilization of learning resources, sustainable community input and criticism, creation of featured traditions, and local wisdom. a score of 2 is obtained on the indicators, including the school committee's role and raising community funds. a score of 1 is obtained on the reading corner indicator. a score of 0 is obtained on the literacy culture indicator.
the research findings at vhs 2 show that the scores of 4 and 3 are not obtained on any indicator. the score of 2 is obtained on the indicators including the content of learning materials, classroom management and teacher exemplary, main values of character, learning culture, guidance and counseling support, religious values, the active role of the school committee, utilization of external learning resources, sustainable community input and criticism, religious dimensions, student independence, and spirit of mutual cooperation. the score of 1 is obtained on the indicators including the integration of the main values in the lesson plan, the accuracy of the method, inculcation of the main values of character, sustainable capacity, creation of cultural featured traditions, featured traditions, reflecting school culture, nationalism, independence, mutual cooperation, and student integrity. a score of 0 is obtained on the indicators, including local wisdom, literacy culture, reading corner, school branding, implementation of parental support, community involvement, and raising community funds. the findings of the research at vhs 3 show that the score of 4 is obtained on the indicators including the integration of main values in lesson plans, the content of learning materials, accuracy of methods, inculcation of the main values of character, class management, and teacher exemplary, sustainable capacity, creation of featured traditions, main character values, local wisdom, learning culture, featured traditions, reflecting school culture, school branding, religious values, nationalism, independence, mutual cooperation, integrity, the active role of the school committee, community involvement, utilization of external learning resources, sustainable community input and criticism, the spirit of nationalism, student independence, and also the spirit of mutual cooperation. 
a score of 3 is obtained on the indicators including guidance and counseling support for cesp, the implementation of cesp's main values, the development of the religious dimension, and student integrity. a score of 2 is obtained on the indicators including the parental support for cesp. additionally, a score of 1 is obtained on the indicators of literacy culture and reading corner.

the findings of the research at vhs 4 show that the score of 4 is obtained on the indicators including the integration of the main values in lesson plans, the content of learning materials, accuracy of the method, inculcation of the main values of character, class management and teacher exemplary, creation of cultural featured traditions, main values of character, learning culture, reading corners, featured traditions, reflecting school culture, school branding, religion, nationalism, independence, mutual cooperation, integrity, community involvement, utilization of external learning resources, raising public funds, sustainable community input and criticism, the spirit of nationalism, and student integrity. the score of 3 is obtained on the indicators including sustainable capacity, local wisdom, guidance and counseling support for cesp, literacy culture, parental support, the active role of school committees, development of a religious dimension, the spirit of mutual cooperation, and student independence. the scores of 2, 1, and 0 were not obtained at vhs 4.

furthermore, the mean score of each element of the implementation function is shown in table 5.
table 5 shows that the implementation of cesp at vhs 1 has a mean score of 3.28, which means it is in a very good category; at vhs 2, it has a mean score of 1.26, which means that it is in a sufficient category; at vhs 3, it has a mean score of 3.70, which means that it is in a very good category; vhs 4 has a mean score of 3.68, which means that it is in a very good category. the average score of the implementation of cesp in vocational high schools in yogyakarta city is 2.98, which means that it is in a good category.

table 5. the mean value of the elements in the implementation function
elements                           vhs 1   vhs 2   vhs 3   vhs 4
developing cesp in teaching        3.67    1.33    4.00    3.83
developing school culture          3.33    1.01    3.53    3.80
community participation            2.67    1.00    3.67    3.67
implementing main values of cesp   3.20    1.60    3.60    3.40
school average score               3.28    1.26    3.70    3.68
vhs average score                  2.98

based on the respondents' answers, there are answers with zero scores at two vocational high schools (vhs), namely: (1) at vhs 1 on the development of a reading corner, (2) at vhs 2 on the development of local wisdom, development of a reading culture, literacy programs, reflecting school branding, involving parents in cesp, community involvement in cesp, and raising community funds. the analysis results show that the implementation of cesp at vhs 2 is not well done, which can be seen from many zero-score answers.
especially in relation to the reading corner element, almost all schools have not provided facilities. this was confirmed by the head of administration of vhs 3, who said, “our vocational school has not provided a reading corner, due to difficulties in funding and facilities, and students' interest in reading is currently facilitated in the library.”

research findings on the evaluation function

the evaluation function scores in the four schools are shown in table 6. they are measured by nine indicators: the composition of the cesp success instrument, monitoring activities, feedback mechanisms, follow-up monitoring, activities involving all elements in implementation, infrastructure use, media utilization, increasing academic achievement, and increasing non-academic achievement. the mean scores of each element of the evaluation function are in table 7.

table 6. element indicator scores (cesp evaluation)
schools   a   b   c   d   e   f   g   h   i
vhs 1     0   3   3   3   3   4   3   3   3
vhs 2     0   0   0   0   0   2   2   1   1
vhs 3     2   3   3   3   4   4   4   3   3
vhs 4     4   4   4   3   3   2   3   3   3

table 7. average score of elements in the evaluation function
elements                                         vhs 1   vhs 2   vhs 3   vhs 4
evaluation of cesp implementation in teaching    2.78    0.67    3.33    3.22
private vhs average score                        2.50

table 7 shows that the cesp evaluation function has a mean score of 2.78 at vhs 1 (good category), 0.67 at vhs 2 (poor category), 3.33 at vhs 3 (very good category), and 3.22 at vhs 4 (very good category). the mean score of the evaluation of cesp in vocational high schools in yogyakarta city is 2.50, which is in the good category.
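the per-school means in table 7 can be recomputed from the nine indicator scores in table 6. the sketch below assumes each mean is the plain average of indicators a to i; the dictionary names are illustrative. with the scores as transcribed here, the vhs 3 row averages to 3.22 rather than the reported 3.33, and it is not possible to tell from this text which of the two tables carries the slip.

```python
from statistics import mean

# Indicator scores a-i for the evaluation function, as transcribed in Table 6.
table6 = {
    "vhs 1": [0, 3, 3, 3, 3, 4, 3, 3, 3],
    "vhs 2": [0, 0, 0, 0, 0, 2, 2, 1, 1],
    "vhs 3": [2, 3, 3, 3, 4, 4, 4, 3, 3],
    "vhs 4": [4, 4, 4, 3, 3, 2, 3, 3, 3],
}
# Means reported in Table 7.
reported = {"vhs 1": 2.78, "vhs 2": 0.67, "vhs 3": 3.33, "vhs 4": 3.22}

# Plain average of the nine indicators (an assumption about the method).
recomputed = {school: round(mean(scores), 2) for school, scores in table6.items()}

for school, value in recomputed.items():
    note = "" if value == reported[school] else "  <- differs from table 7"
    print(school, value, reported[school], note)
```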
the indicator scores of 62 and 63 show that the cesp increases academic and non-academic achievement by 62.5%. the analysis results are supported by the explanation of the principals of vhs 1 and vhs 4, who said, “there is an increase in academic and non-academic achievements after cesp was implemented. national exam scores have increased, and many students have won various competitions, both sports and arts, and culture.” this finding is supported by manasikana and anggraeni (2018), who report that character education can improve the quality of education, as evidenced by the increase in the quality of human resources.

based on the respondents' answers, there are zero-score answers at two vhss, namely: (1) at vhs 1 on the assessment indicators, and (2) at vhs 2 on the assessment indicators, implementation monitoring, evaluation mechanisms, follow-up evaluations, and involvement of school elements in cesp implementation. the analysis results show that the evaluation of cesp at vhs 2 is not implemented, as seen from the many zero-score answers. this is confirmed by the explanation of the acting principal of vhs 2, who said, “we still lack human resources who master character education, and no one has participated in character training, so it is still difficult to do a detailed and correct cesp evaluation.”

the description of the analysis results shows the constraints and implications of cesp in the four schools, which are generally related to teachers and students. in this regard, the supporting data on the number of classes, teachers, and students in the four schools where the research was conducted are shown in table 8.

table 8. school data
school   total classes   teachers   students
vhs 1    21              44         672
vhs 2    12              31         277
vhs 3    36              85         910
vhs 4    12              32         210

table 8 shows that two schools are classified as medium schools, with a student body ranging from 500 to 1000, and two schools are classified as small schools (fewer than 500 students).
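the counts in table 8 also allow a quick look at staffing relative to enrolment. the sketch below derives students per teacher for each school and applies the size cut-off quoted above; the names are illustrative, and treating every school with 500 or more students as "medium" is an assumption that holds only because no school in the table exceeds 1000 students.

```python
# (teachers, students) per school, copied from Table 8.
schools = {
    "vhs 1": (44, 672),
    "vhs 2": (31, 277),
    "vhs 3": (85, 910),
    "vhs 4": (32, 210),
}


def students_per_teacher(teachers, students):
    """Number of students for each teacher."""
    return students / teachers


ratios = {name: round(students_per_teacher(t, s), 1) for name, (t, s) in schools.items()}

# Size classes follow the cut-off quoted in the text (under 500 students = small).
sizes = {name: ("small" if s < 500 else "medium") for name, (_, s) in schools.items()}

for name in schools:
    print(name, ratios[name], sizes[name])
```

the resulting figures run from about 7 to about 15 students per teacher.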
a teacher-student ratio of 1:7 to 1:15 is fairly good for vocational schools. in addition, the implementation of cesp in the four schools is depicted in figure 1.

figure 1. cesp implementation per school (category, from unimplemented to excellent, for vhs 1 to vhs 4 under the planning, implementation, and evaluation functions)

figure 1 shows that the implementation of cesp in three private vocational high schools (pvhs) in yogyakarta city has been carried out very well (75%), and in one pvhs it has been less than optimal (25%). pvhs 2, in particular, needs guidance and mentoring in the implementation of cesp. many factors affect the inadequacy of cesp. one factor is that the principal's appointment is not yet definitive, so more in-depth research is needed. in addition, the school data show no effect on, or correlation with, the successful implementation of cesp. the successful implementation of cesp depends on the seriousness of all education stakeholders and the school principal's leadership.

conclusion

the planning for the character education strengthening program (cesp) at vocational high schools in yogyakarta city, consisting of the initial assessment, cesp socialization, vision and mission, policy design, and program design, is in a good category, with a score of 2.74. meanwhile, the implementation of cesp, consisting of developing learning, developing school culture, community participation, and implementing the main values, is in a good category, with a score of 2.98. these two scores are in line with the evaluation component, which is also in a good category, with a score of 2.50.
the findings also show that the school data have no effect on, or correlation with, the seriousness and success of the implementation of cesp. in addition, the instrument data show extreme answers (score of 0) in two schools, namely vhs 1 and vhs 2, most of them at vhs 2. the implication of this research is that in the early stages, the vhss felt burdened and had difficulty implementing cesp in terms of planning, implementation, and evaluation. however, at the end of the program, cesp was shown to improve students' academic and non-academic achievement. the results of this study are recommended as consideration for mapping the implementation of cesp, determining education office policies related to character education, and developing cesp models, as a reference for research on character, and as material for discussions regarding character education.

references

arikunto, s. (2017). pengembangan instrumen penelitian dan penilaian program. pustaka pelajar.
asbari, m., nurhayati, w., & purwanto, a. (2019). the effect of parenting style and genetic personality on children character development. jurnal penelitian dan evaluasi pendidikan, 23(2), 206–218. https://doi.org/10.21831/pep.v23i2.28151
barabasch, a., & keller, a. (2020). innovative learning cultures in vet – 'i generate my own projects.' journal of vocational education & training, 72(4), 536–554. https://doi.org/10.1080/13636820.2019.1698642
bunyamin, b. (2016). teacher professionalism: a study on teacher's professional and pedagogic competence at vocational high schools in the northern coastal of jakarta. ijer indonesian journal of educational review, 3(1), 77–84. http://journal.unj.ac.id/unj/index.php/ijer/article/view/1203
effendi, y. r. (2020). model pendekatan kepemimpinan transformasional kepala sekolah berbasis nilai-nilai budaya, humanistik, dan nasionalisme dalam penguatan pendidikan karakter. jurnal pendidikan karakter, 11(2), 161–179.
https://doi.org/10.21831/jpk.v10i2.31645
egeham, l. (2020, october 21). kpai sebut 171 pelajar diamankan polda metro jaya terkait demo 20 oktober. liputan 6. https://www.liputan6.com/news/read/4388101/kpai-sebut-171-pelajar-diamankan-polda-metro-jaya-terkait-demo-20-oktober
handayani, n., & wulandari, t. (2017). implementasi pendidikan karakter berbasis multikultural di smk negeri 2 mataram. istoria: jurnal pendidikan dan ilmu sejarah, 13(2). https://doi.org/10.21831/istoria.v13i2.17650
jannah, i. n., chamisijatin, l., & husamah, h. (2018). implementasi pendidikan karakter dalam pembelajaran ipa di smpn xy kota malang. jurnal biotek, 6(1), 1–14. https://doi.org/10.24252/jb.v6i1.4243
manasikana, a., & anggraeni, c. w. (2018). pendidikan karakter dan mutu pendidikan indonesia. prosiding seminar nasional pendidikan iii 2018 (pendidikan akuntansi fkip ums), 102–110. https://publikasiilmiah.ums.ac.id/handle/11617/10206
mayhew, m. j., & rockenbach, a. n. (2021). interfaith learning and development. journal of college and character, 22(1), 1–12. https://doi.org/10.1080/2194587x.2020.1860778
mgaiwa, s. j., & poncian, j. (2016).
public–private partnership in higher education provision in tanzania: implications for access to and quality of education. bandung: journal of the global south, 3(1), 1–21. https://doi.org/10.1186/s40728-016-0036-z
ministry of education and culture. (2017). pedoman supervisi penguatan pendidikan karakter. directorate of special education and service ministry of education and culture of the republic of indonesia.
misbah, z., gulikers, j., dharma, s., & mulder, m. (2020). evaluating competence-based vocational education in indonesia. journal of vocational education & training, 72(4), 488–515. https://doi.org/10.1080/13636820.2019.1635634
mortaki, s. (2012). the contribution of vocational education and training in the preservation and diffusion of cultural heritage in greece: the case of the specialty “guardian of museums and archaeological sites.” international journal of humanities and social science, 2(24), 51–58. https://www.cabdirect.org/cabdirect/abstract/20133197829
mujiyati, n., warto, w., & sutimin, l. a. (2019). developing a problem-based local history module to improve the critical thinking ability of senior high school students. reid (research and evaluation in education), 5(1), 30–40. https://doi.org/10.21831/reid.v5i1.13334
nast, t., elias, m. j., & yuan, m. (2020). the 11 principles of character: overview. journal of character education, 16(2), 11–18.
presidential regulation no. 87 of 2017 concerning the strengthening of character education, (2017).
riantoni, c., & nurrahman, a. (2020). analisis tingkat hubungan karakter jujur siswa terhadap hasil belajar ipa terpadu. jurnal pendidikan edutama, 7(2), 1–7. https://doi.org/10.30734/jpe.v7i2.512
wardoyo, c., satrio, y. d., & ratnasari, d. a. (2020). an analysis of teachers' pedagogical and professional competencies in the 2013 curriculum with the 2017-2018 revision in accounting subject. reid (research and evaluation in education), 6(2), 142–149.
https://doi.org/10.21831/reid.v6i2.35207
wheeler, l., guevara, j. r., & smith, j.-a. (2018). school–community learning partnerships for sustainability: recommended best practice and reality. international review of education, 64(3), 313–337. https://doi.org/10.1007/s11159-018-9717-y
wibowo, p. a., kuat, t., & sayuti, m. (2018). integrated learning based on competence in vocational high school. journal of vocational education studies, 1(2), 71–76. https://doi.org/10.12928/joves.v1i2.699
widiyani, w., susatya, e., widodo, h., & suyatno, s. (2020). partnership program in increasing the quality of education. international journal of education humanities and social science, 3(3), 167–183. https://ijehss.com/link2.php?id=124
widowati, r., & retnowati, t. h. (2016). evaluasi implementasi pendidikan karakter di sma negeri yogyakarta. jurnal evaluasi pendidikan, 4(1).
world population review. (2020). human development index (hdi) by country. world population review. https://worldpopulationreview.com/country-rankings/hdi-by-country
zubaedi, z. (2011). desain pendidikan karakter: konsepsi dan aplikasinya dalam lembaga pendidikan. kencana.

this is an open access article under the cc-by-sa license.
reid (research and evaluation in education), 6(2), 2020, 160-173 available online at: http://journal.uny.ac.id/index.php/reid estimating the ability of pre-service and in-service teacher profession education (tpe) participants using item response theory *1 lian gafar otaya; 2 badrun kartowagiran; 3 heri retnawati; 4 siti salina mustakim 1 faculty of tarbiyah and teacher training, institut agama islam negeri sultan amai gorontalo jl. gelatik, heledulaa, kota timur, kota gorontalo, gorontalo 96135, indonesia 2 faculty of engineering, universitas negeri yogyakarta jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia 3 faculty of mathematics and natural sciences, universitas negeri yogyakarta jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia 4 faculty of educational studies, universiti putra malaysia persiaran masjid, 43400 serdang, selangor, malaysia *corresponding author. e-mail: lianotaya82@iaingorontalo.ac.id submitted: 24 november 2020 | revised: 31 december 2020 | accepted: 31 december 2020 abstract this research generally aimed to describe the characteristic of the ability of pre-service and in-service tpe participants using the item response theory, irt. the research subject comprised 516 participants divided into 239 participants of the pre-service tpe program and 277 participants of the in-service tpe program using the purposive sampling technique. data were collected through the technique of observation and documentation. in estimating the item parameter and ability parameter, the irt model polytomous was implemented, which was furthermore described. 
this finding shows that assessors can directly recognize where students in the tpe program stand on the ability scale, between the highest and the lowest grade, based on the item characteristics. the finding therefore not only supports the implementation of the tpe program in indonesia but is also expected to be applicable in revising the assessment of teachers’ performance, the supervision of teachers, field teaching practice, and assessment in other teaching fields, so that it can be used as an evaluation in revising the assessment model.

keywords: teacher professional education, ability, pre-service, in-service, item response theory

how to cite: otaya, l., kartowagiran, b., retnawati, h., & mustakim, s. (2020). estimating the ability of pre-service and in-service teacher profession education (tpe) participants using item response theory. reid (research and evaluation in education), 6(2), 160-173. doi:https://doi.org/10.21831/reid.v6i2.36043.

introduction

the teacher profession education (tpe) program is strongly related to the professionalism of teachers, because it gives them the opportunity to master knowledge related to the teaching profession and provides learning experiences that improve teachers' competences as demanded. besides, it can deepen the knowledge, theoretical concepts, and experience needed to become a professional teacher (caena, 2011; galih & iriani, 2018; oviyanti, 2016; petrie & mcgee, 2012). this shows the importance of profession education for teachers, and it will remain an essential issue in achieving high learning quality.
https://doi.org/10.21831/reid.v6i2.36043 lian gafar otaya, badrun kartowagiran, heri retnawati, & siti salina mustakim copyright © 2020, reid (research and evaluation in education), 6(2), 2020 161 issn: 2460-6995 (online)

high learning quality is the key component in the agenda of educational reform (hammond & moore, 2018). some findings show that the quality of education or of the learning process depends heavily on the quality of teachers (bahcivan & cobern, 2016; gerritsen, plug, & webbink, 2017; kartowagiran, 2012; le cornu, 2016; retnawati, apino, & anazifa, 2018). some studies also found a strong relation between what teachers do and the achievement of students: if teachers perform well, the achievement of students will also be good. the effort to improve the performance of teachers can therefore be made through the evaluation of the quality of teachers (fahmi, maulana, & yusuf, 2011; steinberg & garrett, 2016; stronge, 2018; sulisworo, nasir, & maryani, 2017; suswantar & retnawati, 2016). it is thus important to develop the professionalism of teachers in indonesia.

one kind of profession education conducted to develop the professionalism of teachers in indonesia is the teacher profession education (tpe) program. the tpe program is a government program that aims to produce teachers or teacher candidates who master all required competences, namely pedagogical, professional, social, and personality competences. the program is expected to produce teachers or teacher candidates who are fully competent, that is, qualified and of good character, in addition to having the other professional competences required.
besides, the tpe program is an absolute requirement for teachers to obtain the experiences that support their professionalism as stated in the national education standard, especially to achieve an educator certificate (amadi, 2013; anita & rahman, 2013; hotimah & suyanto, 2017; ningrum, 2012; nurmaliah, 2018). the tpe program in indonesia was initiated by the government to respond to the problems of national education, such as: (1) shortage, the lack of teachers especially in remote and rural areas, (2) unbalanced distribution, (3) under qualification, (4) low competence or incompetent teachers, and (5) mismatch, the irrelevance between the academic qualification and the course taught (kemenristekdikti, 2017, 2018). this is also supported by opinions stating that profession education for teachers can help them master the learning materials and can support their readiness to be professional teachers (gerdeman, garrett, & monahan, 2018; hotimah & suyanto, 2017; robertson, 2017; wahyudin, 2016). therefore, to be a professional teacher, it is essential to follow the tpe program, even though there are still problems in its implementation. those problems concern the way to improve the competence mastery of tpe participants. if many graduates of the tpe program still do not meet the demanded requirements, then it should be questioned whether the assessment conducted has reached the components that can describe the whole competences of tpe participants. thus, research is needed that can estimate the competence mastery of tpe participants by using item response theory (irt). the most important reason to estimate the competence mastery of tpe participants using the irt approach is that the assessment has so far been estimated using the raw score. it is conducted by summing the scores in every aspect into a total score, which is divided by the maximum score; the score obtained is then compared with the passing grade of the tpe program, which is 76 (good).
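the raw-score procedure just described can be sketched in a few lines. the example is illustrative only: it assumes the result is expressed on a 0 to 100 scale, that "at least 76" counts as passing, and that each item is rated from 0 to 4; none of these details is stated in the text, and the two rating profiles are hypothetical.

```python
PASSING_GRADE = 76  # passing grade quoted in the text


def raw_percentage(item_scores, max_per_item):
    """Total score divided by the maximum possible score, on a 0-100 scale."""
    max_total = max_per_item * len(item_scores)
    return 100 * sum(item_scores) / max_total


def passes(item_scores, max_per_item):
    """Assumed decision rule: pass when the percentage reaches the passing grade."""
    return raw_percentage(item_scores, max_per_item) >= PASSING_GRADE


# Two hypothetical participants rated on 25 items with a maximum of 4 per item.
uneven_profile = [4] * 13 + [2] * 12   # very strong on some items, weak on others
even_profile = [3] * 24 + [4]          # consistently moderate

print(raw_percentage(uneven_profile, 4), passes(uneven_profile, 4))
print(raw_percentage(even_profile, 4), passes(even_profile, 4))
```

both hypothetical profiles end up with the same raw percentage and the same decision, even though their item-level patterns differ, which is the kind of information a raw total discards.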
this kind of assessment is relative and, under the classical theory approach, cannot differentiate students with good, average, and low ability based on the components of every aspect assessed. measurement using the classical theory approach has some limitations: the true score depends heavily on the measurement, and test results cannot be compared, because errors under the classical theory approach are random (not systematic), with no relation between the true score and the error score. the observed score and the true score change depending on the difficulty level and the scoring, so both depend on the measured characteristics of the students; the observed score is the only score that can be seen, while the true score and the error score are latent (istiyono, mardapi, & suparno, 2014; mardapi, 2017; retnawati, 2011, 2016; sumintono & widhiarso, 2014). as a result, the assessment mostly used in the field is unable to obtain accurate information.

method

this study used a descriptive-explorative approach which aimed to describe the characteristics of the ability of in-service and pre-service tpe participants. the abilities assessed were the students’ ability in composing lesson plans, measured with the lesson plan instrument, and their ability in implementing learning, measured with the learning assessment instrument, based on the assessment by lecturers in the workshop, field teaching practice, or competency test. these assessments were then analyzed to estimate the item parameters of each instrument and the ability parameters of the tpe participants using the polytomous irt model, and the results were described.
sample

the subjects in this research were the groups of the tpe program, divided into the in-service and pre-service tpe programs of 2019. the program was conducted at three state universities in indonesia. the total number of subjects was 516 participants, comprising 239 participants of pre-service tpe and 277 participants of in-service tpe. the subjects were selected using a purposive sampling technique, with the consideration that the subjects taken were in proportion to the number of participants in each program.

instrument and procedures

the data collected in estimating the ability of the participants of the tpe program were divided into two data groups. the first data group is the estimation of the ability of tpe participants in composing lesson plans. the second data group is the estimation of the ability of tpe participants in implementing the learning process. both data groups were collected through observation and documentation. the observation technique was used to assess the material mastery of tpe participants, which was assessed using the lesson plan assessment instrument and the learning assessment instrument. furthermore, the documentation technique was used to assess lesson plans composed by tpe participants in the workshop, field teaching practice, or performance practice in the competency test. the lesson plan assessment instrument consisted of 25 items measured by four indicators: the formulation of competency achievement indicators comprising six items (items 1-6); organizing the materials, methods, media, and learning sources comprising six items (items 7-12); organizing the process, assessment, and learning evaluation comprising six items (items 13-18); and the implementation of the techno pedagogical content knowledge principle comprising seven items (items 19-25).
furthermore, the learning implementation assessment instrument consisted of 20 items measured by four indicators, namely: conducting an educated learning comprising four items (items 1-4), conducting a good learning comprising seven items (items 5-11), facilitating the development of self-potency and characters of participants comprising four items (items 12-15), and assessing and evaluating the learning comprising five items (items 16-20). both instruments fulfilled the validity requirements. based on the judgements of seven experts and with reference to aiken’s v table, all items in the instruments were valid because they met the required aiken index of > 0.75 (aiken, 1980, 1985). besides, the estimation of reliability using the inter-rater reliability technique can be seen in table 1.

table 1. the estimation of inter-rater reliability
no   instrument                criteria   reliability coefficient   explanation
1    lesson plan               ≥0.70      0.84                      reliable
2    learning implementation   ≥0.70      0.81                      reliable

based on table 1, it is concluded that all instruments scored by the raters have a reliable inter-class coefficient. an instrument was stated as reliable if the coefficient was ≥ 0.70 (nunnally & bernstein, 1994).

data analysis

the ability of the tpe participants was analyzed using the polytomous item response theory approach and was estimated using the partial credit model (pcm) through the r program with the extended rasch modeling (erm) package. furthermore, the accuracy of the assessment of tpe participants’ ability was described using the item information function.
the information function expresses the contribution of the instrument's items in revealing the latent trait measured by the instrument and is connected with the standard error of measurement (sem). hambleton et al. (1991) and retnawati (2014) stated that the information function value has an inverse relation with sem: the bigger the information value, the smaller the sem, and vice versa. hence, the information function in irt gives information about the estimate of the ability level of tpe participants; the smaller the standard error, the more accurate the assessment in predicting the ability. thus, in this research, the item information function value served to provide information about the estimate of the ability level of tpe participants under the model selected.

findings and discussion

the abilities of the in-service and pre-service tpe participants focused on in this research were the ability in composing lesson plans and the ability in conducting learning. both were assessed by lecturers in the workshop, field teaching practice, and the competency test through the performance practice. the assessment of the ability of tpe participants yielded the ability parameter (θ). the estimation results of the ability of the tpe participants are elaborated by group as follows.

the ability of composing the lesson plan

the ability of pre-service and in-service tpe participants in preparing lesson plans is estimated by the partial credit model (pcm) through the r program with the extended rasch modeling (erm) package. the results of the analysis are obtained in the form of item characteristics, which are presented in full in table 2.

table 2.
the result of the analysis of item characteristics in learning planning assessment
item   location   threshold δ1   threshold δ2   threshold δ3   threshold δ4
a1     0.77       -1.54          0.54           1.33           2.75
a2     0.62       -1.96          0.45           1.54           2.47
a3     0.70       -1.95          0.48           1.34           2.94
a4     0.61       -1.66          -0.03          1.51           2.62
a5     0.61       -1.83          0.29           1.39           2.60
a6     0.79       -1.06          0.19           1.60           2.43
b7     0.73       -1.35          0.23           1.33           2.71
b8     0.60       -1.67          0.02           1.38           2.69
b9     0.70       -1.31          0.15           1.42           2.57
b10    0.67       -1.02          -0.31          1.51           2.52
b11    0.72       -1.36          0.16           1.61           2.50
b12    0.74       -1.06          -0.27          1.47           2.84
c13    0.81       -1.34          0.09           1.64           2.85
c14    0.78       -1.43          0.32           1.37           2.85
c15    0.82       -1.35          0.15           1.47           3.02
c16    0.81       -1.41          0.25           1.49           2.93
c17    0.77       -1.40          0.26           1.41           2.82
c18    0.65       -1.75          0.08           1.54           2.74
d19    0.57       -1.68          0.09           1.27           2.62
d20    0.66       -1.19          0.02           1.22           2.61
d21    0.67       -1.30          0.18           1.26           2.54
d22    0.52       -1.54          -0.06          1.13           2.56
d23    0.51       -1.68          0.19           1.03           2.50
d24    0.49       -1.57          0.08           0.88           2.59
d25    0.65       -1.05          -0.15          1.19           2.63

based on table 2, the location parameters of the items vary from 0.49 to 0.82. in addition, the threshold parameters δi form four groups or four intersections. a threshold is a parameter for the level of difficulty participants face in obtaining a certain score when responding to item i. viewed in terms of the chance of achieving a score, the threshold coefficient δi is different for each category: the higher the achievement category, the higher the threshold coefficient δi. this means that, in the lesson plan assessment, the higher the location coefficient, the more difficult the item, with the thresholds distributed across the categories of achievement level.
the higher the threshold, the more difficult it is to reach. participants with low ability can only reach the low category thresholds, participants with medium ability can reach only the intermediate category thresholds, and participants with high ability can reach the high category thresholds as well. embretson and reise (2000) stated that item location reflects the level of ease or difficulty of the item, while a threshold is the boundary between certain categories to be achieved. another result of the item analysis is the item characteristic curve, which is drawn to make it easier to understand the relationship between each threshold δi, as a level of difficulty, and the participant's ability to reach a certain score or category. figure 1 gives an example: the characteristic curve of item a4 (item 4) of the lesson plan assessment, which evaluates the clarity of the formulation of competency achievement indicators using verbs that can be measured or observed. related to the item calibration results in table 2, item 4 has a location parameter of 0.61 with threshold δ1 of -1.66, threshold δ2 of -0.03, threshold δ3 of 1.51, and threshold δ4 of 2.62. graphically, a threshold δi can be interpreted as the intersection of the curves of adjacent categories. from figure 1, it is clear that achieving category 2, or obtaining score 2, in item 4 requires an ability (θ) of about -0.03 to 1.51. in addition to the item characteristic curve, another thing that can be explained is the value of the information function.
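before turning to the information function, the reading of the a4 thresholds given above can be checked numerically. the sketch below assumes the standard partial credit model parameterization, in which the probability of category x is proportional to the exponential of the sum of (θ − δk) over the first x thresholds; the function names are illustrative, and the original estimates were produced with the r package rather than with this code.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model."""
    # Cumulative sums of (theta - delta_k); category 0 corresponds to an empty sum.
    exponents = [0.0]
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def most_likely_category(theta, thresholds):
    """Index of the category with the highest probability at ability theta."""
    probs = pcm_probabilities(theta, thresholds)
    return probs.index(max(probs))


a4_thresholds = [-1.66, -0.03, 1.51, 2.62]  # item a4, Table 2

# Ability values below, between, and above the thresholds.
for theta in (-2.0, -1.0, 0.5, 2.0, 3.0):
    print(theta, most_likely_category(theta, a4_thresholds))

# At theta equal to the second threshold, categories 1 and 2 are equally likely.
at_second = pcm_probabilities(-0.03, a4_thresholds)

# The tabulated location of a4 matches the mean of its four thresholds.
a4_location = round(sum(a4_thresholds) / len(a4_thresholds), 2)
```

under these assumptions, category 2 is the most likely score for abilities between -0.03 and 1.51, and the mean of the four thresholds reproduces the location of 0.61 listed for a4.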
the information function provides maximum information at certain abilities (θ). the value of the information function (ift) of the lesson plan assessment, linked to the standard error of measurement (sem), is shown in figure 2.

figure 1. the curve of item characteristic 4 of the learning planning assessment

figure 2 presents the information function curve accumulated over the 25 items that assess the ability of tpe participants in preparing lesson plans. it shows the curves of the information values (ift) and the measurement errors (sem) meeting at -4.3 and 0.7 on the ability scale. conversely, when the ability is less than -4.3 or more than 0.7, the instrument has a measurement error greater than the information provided. figure 2 also shows an instrument information function value of 16.36 at ability (θ) -1.8. with an information function value of 16.36, the measurement error coefficient (sem) obtained is 0.24, which indicates that the instrument has a higher information value than measurement error. the overall distribution of the estimated abilities of tpe participants in compiling a complete learning plan is shown in figure 3.

figure 2. the converse relation of ift and sem from the learning planning assessment

figure 3. the distribution of tpe participants’ ability in composing lesson plans

figure 3 shows that the estimated abilities of tpe participants in compiling learning plans can overall be said to be good. this is shown by the estimated abilities being dominated by abilities (θ) of 1 to 3.
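the link between the information value of 16.36 and the standard error reported for figure 2 can be checked with the usual irt relation, in which sem is the inverse square root of the information. the sketch below assumes that relation; the function name is illustrative.

```python
import math


def sem_from_information(information):
    """Standard error of measurement as the inverse square root of the information."""
    return 1.0 / math.sqrt(information)


peak_information = 16.36  # value reported at theta = -1.8
sem_at_peak = sem_from_information(peak_information)
print(round(sem_at_peak, 3))

# Under this relation the two quantities are equal only where both are 1.
sem_at_unit_information = sem_from_information(1.0)
```

the computed value is about 0.247, in line with the reported 0.24. under the same relation, the information value exceeds the sem wherever the information is greater than 1.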
when the estimated abilities of tpe program students in preparing learning plans are grouped into pre-service and in-service tpe participants, the results are as presented in figure 4 and figure 5. figure 4 shows that the estimated ability of pre-service tpe participants in developing lesson plans can overall be said to be good, because ideally the expected ability of tpe participants is 1 or more. this is shown by the estimated abilities of pre-service tpe participants in compiling lesson plans being dominated by abilities (θ) of 1 to 3: of the 239 participants assessed, 26% were at ability 1 ≤ θ < 2, 42% at 2 ≤ θ < 3, and 29% at 3 ≤ θ < 4. figure 5 shows the estimated ability of in-service tpe participants, which also tends to lie at abilities (θ) of 1 to 3: of the 277 participants assessed, 24% were at ability 1 ≤ θ < 2, 49% at 2 ≤ θ < 3, and 21% at 3 ≤ θ < 4. the estimation shows that the ability of both in-service and pre-service tpe participants in composing lesson plans is already good.

figure 4. the distribution of pre-service tpe participants’ ability in composing lesson plans

figure 5. the distribution of in-service tpe participants in composing lesson plans

the ability to conduct learning

the second ability which became the unit of analysis was the ability of the teacher profession education (tpe) participants in the learning process.
as with the ability in composing lesson plans, this ability was estimated with the partial credit model (pcm) using the r program with the extended rasch modeling (erm) package. the resulting item characteristics are shown in full in table 3. based on the analysis in table 3, the location parameter of the items varied from 0.58 to 0.85. furthermore, each item has four threshold parameters δi, also known as four intersections. a threshold is the parameter of the level of difficulty that participants face in obtaining a certain score when responding to item i. in terms of the chance of achieving each score, the coefficient of the threshold parameter δi differs for every category: the higher the category of achievement, the higher the threshold coefficient δi. in the assessment of the learning process, the higher the location coefficient, the harder the item. the higher the threshold, the more difficult it is to achieve; therefore, participants with low ability could only achieve the low thresholds, participants with medium ability could only achieve the medium thresholds, and participants with high ability could achieve the high thresholds. another thing that can be explained from the item analysis with the partial credit model (pcm) is the item characteristic curve, which describes the relation between each threshold δi, a level of difficulty, and the ability of participants to achieve certain scores. for example, figure 6 presents the item characteristic curve of item c15, which concerns training students to communicate politely with others and to use appropriate gestures in communication. figure 6 is thus an example of an item characteristic curve from the learning implementation assessment, namely that of item 15.
in relation to the item calibration results in table 3, item 15 has a location parameter of 0.64 with threshold parameters δ1 = -1.36, δ2 = -0.13, δ3 = 1.29, and δ4 = 2.78. graphically, a threshold δi can be interpreted as the intersection of the curves of adjacent categories. from figure 6, it can be explained that to achieve category 2, or to obtain a score of 2, on item 15, the ability (θ) should be around -0.13 to 1.29. in addition to the item characteristic curve, another thing which can be explained is the information function. the ift in relation to the standard error of measurement (sem) is presented in full in figure 7.

table 3. the analysis of item characteristics of learning assessment

item   location   threshold δ1   threshold δ2   threshold δ3   threshold δ4
a1     0.75       -1.50           0.59           0.95           2.97
a2     0.58       -1.27          -0.10           0.87           2.86
a3     0.65       -1.00          -0.15           0.88           2.91
a4     0.69       -1.17          -0.01           1.16           2.79
b5     0.63       -1.44           0.07           1.03           2.88
b6     0.60       -1.26          -0.22           1.14           2.75
b7     0.73       -1.23           0.04           1.22           2.89
b8     0.71       -1.52           0.14           1.13           3.08
b9     0.70       -1.10          -0.12           1.18           2.85
b10    0.68       -1.07          -0.13           1.07           2.87
b11    0.68       -0.90          -0.37           1.09           2.90
c12    0.62       -1.45           0.19           0.77           2.99
c13    0.58       -1.39           0.10           0.78           2.84
c14    0.64       -1.04          -0.13           0.97           2.77
c15    0.64       -1.36          -0.13           1.29           2.78
d16    0.62       -1.26          -0.19           1.05           2.91
d17    0.65       -1.18           0.02           1.10           2.68
d18    0.67       -1.25           0.19           0.93           2.84
d19    0.64       -1.21          -0.30           1.16           2.95
d20    0.85       -0.91           0.17           1.09           3.07

figure 6. the distribution of tpe participants’ ability in conducting a learning

figure 7. the converse relation of ift and sem from the learning assessment

figure 7 presents the information function curve accumulated from the 20 items of the learning assessment.
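the accumulation of item information over the 20 items can be sketched as follows. this is an illustration under stated assumptions, not the computation behind figure 7: it assumes the partial credit model with unit discrimination, for which the information of an item at θ is the variance of its score, and it takes the thresholds of table 3 as absolute step difficulties. the function names are hypothetical, and the numbers the sketch produces are not claimed to reproduce the values read from figure 7.

```python
import math

# Thresholds (delta1..delta4) per item, copied from table 3.
TABLE3_THRESHOLDS = {
    "a1": [-1.50, 0.59, 0.95, 2.97], "a2": [-1.27, -0.10, 0.87, 2.86],
    "a3": [-1.00, -0.15, 0.88, 2.91], "a4": [-1.17, -0.01, 1.16, 2.79],
    "b5": [-1.44, 0.07, 1.03, 2.88], "b6": [-1.26, -0.22, 1.14, 2.75],
    "b7": [-1.23, 0.04, 1.22, 2.89], "b8": [-1.52, 0.14, 1.13, 3.08],
    "b9": [-1.10, -0.12, 1.18, 2.85], "b10": [-1.07, -0.13, 1.07, 2.87],
    "b11": [-0.90, -0.37, 1.09, 2.90], "c12": [-1.45, 0.19, 0.77, 2.99],
    "c13": [-1.39, 0.10, 0.78, 2.84], "c14": [-1.04, -0.13, 0.97, 2.77],
    "c15": [-1.36, -0.13, 1.29, 2.78], "d16": [-1.26, -0.19, 1.05, 2.91],
    "d17": [-1.18, 0.02, 1.10, 2.68], "d18": [-1.25, 0.19, 0.93, 2.84],
    "d19": [-1.21, -0.30, 1.16, 2.95], "d20": [-0.91, 0.17, 1.09, 3.07],
}


def pcm_probabilities(theta, thresholds):
    """Category probabilities P(X = 0..m | theta) under the partial credit model."""
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # stabilises the exponentials
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, thresholds):
    """PCM item information, computed as the variance of the item score at theta."""
    probs = pcm_probabilities(theta, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))


def total_information(theta, items=TABLE3_THRESHOLDS):
    """Instrument information: the sum of the item information values."""
    return sum(item_information(theta, t) for t in items.values())


def sem(theta, items=TABLE3_THRESHOLDS):
    """Standard error of measurement, assuming SEM = 1 / sqrt(information)."""
    return 1.0 / math.sqrt(total_information(theta, items))


if __name__ == "__main__":
    for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(theta, round(total_information(theta), 2), round(sem(theta), 2))
```

summing item information is what makes an instrument with more items more precise at a given ability, which is why the information curve is reported for the whole set of items rather than item by item.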
figure 7 shows the curves of the information value and the measurement error. the two curves meet at -2.9 and 1.8 on the ability scale. between these two abilities, the instrument has a higher information value than its measurement error. conversely, when the ability is less than -2.9 or more than 1.8, the assessment has a bigger measurement error than the information given. another thing that can be explained from figure 7 is the maximum information function value, which is 13.3 at ability (θ) -0.6. the bigger the information value, the smaller the sem, and vice versa. for the information function value of 13.3, the sem obtained is 0.27. the estimated ability of tpe participants in the learning process is shown in figure 8. figure 8 shows that the estimated ability of tpe participants in conducting the learning process is good. this is shown by the estimates being dominated by abilities (θ) of 1 to 3, given that the ideal ability expected of tpe participants is 1 or more. when the estimated abilities of tpe participants are grouped into in-service and pre-service tpe participants, the results are as described in figure 9 and figure 10.

figure 8. the distribution of tpe participants in conducting the learning

figure 9. the distribution of pre-service tpe participants’ ability in conducting the learning

figure 10. the distribution of in-service tpe participants in conducting the learning

figure 9 presents the estimated ability of pre-service tpe participants in conducting the learning process, which is stated as good, because the ideal ability expected of tpe participants is 1 or more.
this is shown by the estimated ability of pre-service tpe participants in conducting the learning process being dominated by abilities (θ) of 1 to 3. of 239 participants, 21% were at ability 1 ≤ θ < 2, 41% at 2 ≤ θ < 3, and 19% at 3 ≤ θ < 4. figure 10 shows the estimated ability of in-service tpe participants, with good results shown by participants at ability levels (θ) of 2 to 3: of 277 participants, 25% were at ability level 2 ≤ θ < 3 and 50% were at ability level 3 ≤ θ < 4. from the estimation, it can be explained that the ability of in-service tpe participants is higher than the ability of pre-service tpe participants.

one of the findings of this research is the description of the ability of in-service and pre-service tpe participants obtained using the irt approach. empirically, the assessment of tpe participants’ ability shows good results, with the ability of participants dominated by ability levels (θ) of 1 to 3. the ability of tpe participants in the learning assessment was likewise dominated by ability levels (θ) of 1 to 3. this is supported by retnawati and munadi (2013), who state that the ideal ability parameter is 1 or more. besides, the ability in composing lesson plans and conducting the learning process tended to be higher among the in-service tpe participants. this finding is attributed to the fact that the participants of the in-service tpe program already had more teaching experience than the participants of the pre-service tpe program. dewey (1997) states that experience is the whole process of living, especially interacting with many things from inside and outside, and that each interaction influences further interactions. dewey’s point of view became the basis for reflecting on the continuous experiences of tpe participants, especially in improving their competences.
paterson (2010) stated that education not only exists in someone’s life but is also the process which forms a better version of someone. accordingly, the experiences obtained by the tpe participants throughout the whole learning process become essential experiences for them to implement as professional teachers in the future. this finding showed that, because the abilities of pre-service and in-service tpe participants were estimated using the irt approach, the assessor could directly determine the position of tpe participants’ ability in composing lesson plans. the irt approach assumes that a latent variable represented by a unidimensional continuum can provide accurate information about the latent attribute or ability possessed by someone (de ayala, 2009; hu et al., 2017). it was also in line with baker (2001), who stated that one of the aims of irt is to find the position of participants on the ability scale. through this information, the assessor could recognize the ability of the participants. besides, the assessor could also compare the ability of participants when determining scores based on that ability scale (θ). this finding also showed that the estimation had a high information function value and a small standard error of estimation, meaning that the ability estimates produced were more accurate. the findings obtained in estimating the ability of tpe participants showed that the irt approach could increase the accuracy of measuring the achievement of tpe participants, especially in the ability of composing lesson plans. this was also shown by the accuracy of the assessment as measured by the information function value and the standard error of estimation.
this is in line with baker (2001), who stated that if the parameter can be estimated carefully, it is easier to discover information about the parameter value. it was essential for the assessor to estimate the ability of tpe participants, because the precision with which the position of a participant’s ability is predicted depends on where that ability lies on the ability scale. thus, the assessment of tpe participants should be directed to the polytomous irt approach, because the ability of tpe participants ranges along a continuum from the easiest to the most difficult. tpe participants tried to understand or master the expected abilities, so their mastery would lie at some position on the continuum and was not limited to the position of the lowest or highest ability. if the ability of tpe participants is measured with the irt approach, the measurement results lie between the lowest and the highest margin of a continuum. through the evaluation of the assessment process using the irt approach, it is expected that qualified teachers can be produced who master the competences and can implement them in the learning process. this effort can realize the improvement of the competences of professional teachers. it is reinforced by opinions such as those of biktagirova and valeeva (2014), pollard (2014), liu (2015), and galih and iriani (2018) that the professionalism of a teacher must be improved not only when teaching in class, but also before and after class. an educator certificate alone is not enough to become a professional teacher; a professional teacher should also improve professionalism continuously by fulfilling responsibilities and duties and by conducting self-reflection when making decisions, so as to make a better teaching and learning process in the future. besides, loughland and alonzo (2019) state that the criteria of teachers’ success in the learning process depend heavily on the expectation of fulfilling students’ needs.
thus, teachers need to evaluate the learning process as a reflection for evaluating and improving their own ability. teachers have an important role in improving students’ critical thinking, social and interpersonal communication, confidence, learning interest, and active participation, and also in helping students prepare themselves to be good citizens. realizing this depends greatly on the moral imperative of teachers in responding positively to the guidance model conducted (hammond & moore, 2018; kuş & öztürk, 2019). furthermore, it needs self-awareness from all teachers to always develop their ability to become qualified teachers (creemers, kyriakides, & antoniou, 2012; gareis & grant, 2014; good, 2008; goodwin, 2010; rabadi-raol, 2019; zhu et al., 2017). as stated by sheridan and tindall-ford (2018), the assessment of the ability of teachers or teacher candidates becomes more significant for evaluating and improving the learning process in the future.

conclusion

the finding of this research shows that the tpe participants’ mastery is good. it is an absolute requirement to conduct a tpe assessment which measures not only the participants’ academic mastery, but also their learning achievement and competency mastery. by recognizing the ability of every tpe participant, mainly the ability in composing lesson plans and the ability in conducting a learning process, the assessment obtained will be more objective, accurate, and accountable in estimating the mastery of tpe participants.
the positive things obtained from the findings are: first, the assessment is designed using irt polytomous model to determine the level or category achieved by participants based on the response given, so it can collect more information of item characteristic and estimate the tpe participants’ ability based on the ability scale. secondly, it can collect more detail information of tpe participants, it can describe the steps mastered by tpe participants, because the steps assessed from the tpe participants are correct in certain steps, but incorrect in the other steps. thus, estimating the tpe participants’ ability using irt approach is the choice which possibly gives information of their ability. the higher the parameter of the tpe participants’ ability, the bigger the chance they have to do the step by step correctly as the item assessed. third, the applicability of this assessment is not only used to assess the mastery of tpe participants, but also can be implemented in the assessment of teacher performance, teacher supervision, field teaching practice, and the assessment of other teaching assessments. references aiken, l. r. (1980). content validity and reliability of single items or questionnaires. educational and psychological measurement, 40(4), 955-959. doi: 10.1177/001316448004000419 aiken, l. r. (1985). three coefficients for analyzing the reliability and validity of ratings. educational and psychological measurement, 45(1), 131-142. doi: 10.1177/0013164485451012 amadi, m. n. (2013). in-service training and professional development of teachers in nigeria: through open and distance education. paper presented at the annual meeting of the bulgarian comparative education society. anita, n., & rahman, a. (2013). penilaian peserta ppg sm-3t prodi ppkn unesa terhadap pelaksanaan program pendidikan profesi guru (ppg) tahun 2013. kajian moral dan kewarganegaraan, 3(1), 409-423. bahcivan, e., & cobern, w. w. (2016). 
investigating coherence among turkish elementary science teachers' teaching belief systems, pedagogical content knowledge and practice. australian journal of teacher education (online), 41(10), 63-86. doi: 10.14221/ajte.2016v41n10.5 baker, f. b. (2001). the basics of item response theory. clearinghouse on assessment and evaluation. biktagirova, g. f., & valeeva, r. a. (2014). development of the teachers' pedagogical reflection. life science journal, 11(9), 60-63. caena, f. (2011). literature review quality in teachers’ continuing professional development. education and training, 20, 2-20. creemers, b., kyriakides, l., & antoniou, p. (2012). teacher professional development for improving quality of teaching. springer science & business media. de ayala, r. j. (2009). the theory and practice of item response theory. guilford. dewey, j. (1997). experience and education. simon & schuster inc. embretson, s. e., & reise, s. p. (2000). item response theory for psychologists. lawrence erlbaum associates. fahmi, m., maulana, a., & yusuf, a. a. (2011). teacher certification in indonesia: a confusion of means and ends. center for economics and development studies (ceds) padjadjaran university, 3(1), 1-18. galih, a., & iriani, c. (2018). persepsi mahasiswa program pendidikan profesi guru (ppg) pendidikan sejarah terhadap program ppg. jurnal pendidikan sejarah, 7(1), 66-83. gareis, c. r., & grant, l. w. (2014). the efficacy of training cooperating teachers. teaching and teacher education, 39, 77-88. doi: 10.1016/j.tate.2013.12.007 gerdeman, d., garrett, r., & monahan, b. (2018). teacher professional learning through teacher network programs: a multiple case study investigation. american institutes for research, 1-28. gerritsen, s., plug, e., & webbink, d.
(2017). teacher quality and student achievement: evidence from a sample of dutch twins. journal of applied econometrics, 32(3), 643-660. good, t. l. (2008). 21st century education: a reference handbook (vol. 1). sage publications. goodwin, a. l. (2010). globalization and the preparation of quality teachers: rethinking knowledge domains for teaching. teaching education, 21(1), 19-32. doi: 10.1080/10476210903466901 hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. sage publication inc. hammond, l., & moore, w. m. (2018). teachers taking up explicit instruction: the impact of a professional development and directive instructional coaching model. australian journal of teacher education, 43(7), 110-133. doi: 10.14221/ajte.2018v43n7.7 hotimah, h., & suyanto, t. (2017). strategi pendidikan profesi guru (ppg) unesa dalam mengembangkan kompetensi pedagogik dan profesional peserta ppg pasca sm-3t. kajian moral dan kewarganegaraan, 5(01), 242-256. hu, b., qin, l., sullivan, m., & templin, j. (2017). contemporary approaches to psychometrics: item response theory and diagnostic classification models/ enfoques contemporáneos sobre psicometría: los modelos de la teoría de respuesta al ítem y los modelos de clasificación de diagnósticos. cultura y educación, 29(3), 461-491. istiyono, e., mardapi, d., & suparno, s. (2014). penerapan partial credit model pada tes pilihan ganda termodifikasi merupakan model alternatif asesmen fisika yang adil. paper presented at the prosiding kongres dan konferensi ilmiah himpunan evaluasi pendidikan (hepi) tahun 2014, bali. kartowagiran, b. (2012). model penilaian kinerja guru. paper presented at the seminar nasional hepi penelitian dan evaluasi pendidikan, pascasarjana universitas negeri yogyakarta. kemenristekdikti. (2017). pedoman penyelenggaraan pendidikan profesi guru. direktorat jenderal pembelajaran dan kemahasiswaan direktorat jenderal kelembagaan. kemenristekdikti. (2018).
pedoman penyelenggaraan pendidikan profesi guru tahun 2018. direktorat jenderal pembelajaran dan kemahasiswaan direktorat jenderal kelembagaan. kuş, z., & öztürk, d. (2019). social studies teachers’ opinions and practices regarding teaching controversial issues. australian journal of teacher education, 44(8), 15-36. doi: 10.14221/ajte.2019v44n8.2 le cornu, r. (2016). professional experience: learning from the past to build the future. asia-pacific journal of teacher education, 44(1), 80-101. doi: 10.1080/1359866x.2015.1102200 liu, k. (2015). critical reflection as a framework for transformative learning in teacher education. educational review, 67(2), 135-157. doi: 10.1080/00131911.2013.839546 loughland, t., & alonzo, d. (2019). teacher adaptive practices: a key factor in teachers’ implementation of assessment for learning. australian journal of teacher education, 44(7), 18-30. doi: 10.14221/ajte.2019v44n7.2 mardapi, d. (2017). pengukuran, penilaian, dan evaluasi pendidikan. parama publishing. ningrum, e. (2012). membangun sinergi pendidikan akademik (s1) dan pendidikan profesi guru (ppg). jurnal geografi gea, 12(2), 49-55. nunnally, j. c., & bernstein, i. h (1994). psychometric theory. tata mcgraw-hill education. nurmaliah, c. (2018). analisis kemampuan peserta program pendidikan profesi guru (ppg) dalam workshop subject specific pedagogy (ssp) di fkip unsyiah. paper presented at the prosiding seminar nasional biotik, program studi pendidikan biologi universitas islam negeri ar-raniry banda aceh. oviyanti, f. (2016). tantangan pengembangan pendidikan keguruan di era global. nadwa, 7(2), 267-282. paterson, r. w. k. (2010). values, education and the adult. routledge. petrie, k., & mcgee, c. (2012).
teacher professional development: who is the learner? australian journal of teacher education, 37(2), 59-72. pollard, a. (2014). reflective teaching: in schools. bloomsbury publishing. rabadi-raol, a. (2019). quality of teacher education and learning: theory and practice. journal of education for teaching, 45(1), 115-117. doi: 10.1080/02607476.2018.1541342 retnawati, h. (2011). mengestimasi kemampuan peserta tes uraian matematika dengan pendekatan teori respons butir dengan penskoran politomus dengan generalized partial credit model. prosiding semnas penelitian pendidikan dan penerapan mipa, uny, 53-62. retnawati, h. (2014). teori respons butir dan penerapannya: untuk peneliti, praktisi pengukuran dan pengujian, mahasiswa pascasarjana. nuha medika. retnawati, h. (2016). validitas reliabilitas dan karakteristik butir. parama publishing. retnawati, h., apino, e., & anazifa, r. d. (2018). impact of character education implementation: a goal-free evaluation. problems of education in the 21st century, 76(6), 881-899. retnawati, h., & munadi, s. (2013). mengestimasi parameter butir dan kemampuan guru menggunakan model parsial kredit dan parsial kredit tergeneralisasi. lumbung pustaka universitas negeri yogyakarta. robertson, s. (2017). a class act: changing teachers work, the state, and globalisation. routledge. sheridan, l., & tindall-ford, s. k. (2018). fitting into the teaching profession: supervising teachers’ judgements during the practicum. australian journal of teacher education, 43(8), 46-64. doi: 10.14221/ajte.2018v43n8.4 steinberg, m. p., & garrett, r. (2016). classroom composition and measured teacher performance: what do teacher observation scores really measure? educational evaluation and policy analysis, 38(2), 293-317. doi: 10.3102/0162373715616249 stronge, j. h. (2018). qualities of effective teachers. ascd. sulisworo, d., nasir, r., & maryani, i. (2017). identification of teachers’ problems in indonesia on facing global community. 
international journal of research studies in education, 6(2), 81-90. doi: 10.5861/ijrse.2016.1519 sumintono, b., & widhiarso, w. (2014). aplikasi model rasch untuk penelitian ilmu-ilmu sosial (edisi revisi). trim komunikata publishing house. suswantar, i. s. d., & retnawati, h. (2016). penilaian kinerja guru sma swasta di kabupaten sukoharjo dan faktor-faktor yang mempengaruhi. jurnal evaluasi pendidikan, 4(1), 36-44. wahyudin, d. (2016). manajemen kurikulum dalam pendidikan profesi guru (studi kasus di universitas pendidikan indonesia). jurnal kependidikan: penelitian inovasi pembelajaran, 46(2), 259-270. zhu, x., goodwin, a. l., & zhang, h. (2017). quality of teacher education and learning: theory and practice. springer nature.

this is an open access article under the cc-by-sa license.

reid (research and evaluation in education), 5(2), 2019, 103-119 available online at: http://journal.uny.ac.id/index.php/reid

an authentic assessment model to assess kindergarten students’ character

*1umi faizah; 2darmiyati zuchdi; 3yasir alsamiri
1department of islamic early childhood education, sekolah tinggi pendidikan islam bina insan mulia yogyakarta jl. jembatan merah no. 116k, prayan, depok, sleman, yogyakarta 55283, indonesia
2department of social sciences education, universitas negeri yogyakarta jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
3college of education, university of hail p.o. box 2440, hail, 81481, kingdom of saudi arabia
*corresponding author. e-mail: umifaizah74@gmail.com
submitted: 25 april 2019 | revised: 19 august 2019 | accepted: 3 september 2019

abstract

the aim of character development is essential to know, especially for kindergarten pupils.
it might help teachers in developing students and offering helps and services. thus, an assessment model is needed to identify children's character development easily and accurately. this study aims to (1) develop an assessment model for evaluating kindergarten pupils’ character which is considered valid, reliable, and fulfill the criteria of goodness of fit statistic, and (2) know the characteristics of an authentic assessment model instrument to assess the achievements of early childhood characters in kindergarten. this research used plomp’s research and development model. data were collected using a questionnaire, documentation, interview, observation, and focus group discussion. the validation was proven by the expert judgment with the aiken’s v formula and the reliability was estimated with cronbach’s alpha. the validation construct and reliability were examined using exploratory factor analysis, followed by confirmatory factor analysis to ensure the result. furthermore, the results of this research show that (1) the assessment model developed is asoka. this model is considered valid and reliable as it meets the criteria of goodness and fits the statistic; (2) the characteristics of an authentic assessment model instrument to assess the achievements of early childhood characters in kindergarten are: (a) with the index of aiken's v analysis results of 0.901, the content validity is considered high; (b) the instrument construct validity has fulfilled the criteria of goodness of fit statistic; (c) as seen from alpha cronbach value coefficient 0.914, its instrument reliability is considered good enough. keywords: asoka model, authentic assessment, pre-school student permalink/doi: https://doi.org/10.21831/reid.v5i2.24588 introduction education in indonesia is still encountering an unfavorable situation, particularly when it is being related to the low quality of the educational process and outcomes as well as the nation’s weak characters (abidin, 2012). 
many issues reflect weak character, one of which is brawling among students, which is deeply worrying. a worse example is nglithih (viciously hurting someone with sharp objects), which may cause death. the various conflicts arising today are triggered not only by the economic crisis, but also by the moral crisis. given these circumstances, the educational institution is the first institution to be questioned. one of the reasons that can be put forward is that educational institutions are the most effective means of strengthening the character of the nation. besides, character is also a benchmark of educational success.

https://doi.org/10.21831/reid.v5i2.24588 an authentic assessment model to assess kindergarten... umi faizah, darmiyati zuchdi, & yasir alsamiri. copyright © 2019, reid (research and evaluation in education), 5(2), 2019. issn 2460-6995

based on the historical research of all countries in the world, education has two main purposes: to guide young people to be intelligent and to have virtuous behavior (lickona, 1991, p. 5). the purposes of education are not limited solely to intelligence; character is another important purpose of education. education is supposed to be able to create highly intelligent students with the best character. in addition, education is the most essential aspect of improving the quality of indonesian people. in its basic role, education is a path of human quality improvement that emphasizes the formation of basic qualities, such as faith, piety, personality, intelligence, and so forth (naim, 2012, p. 25). education also has a very strategic value in improving the quality of the nation. thus, breakthroughs to develop the character of the nation through character education programs need to be reformulated.
as might have known, character education is not a new thing in the national education system, since the objectives of the national education, as stated in all laws, substantively contain character education, despite the different formulas. the characters which are developed within the system are not just a horizontal relationship between individuals with other individuals, yet it also deals with the vertical relationships between individuals and allah swt. in this case, faith becomes the core of a man while he is controlled by his belief/faith (majid & andayani, 2011, p. 65). thus, in islamic educational institutions, faith becomes the target of education. similarly, in the formulated character education references, religious values become the main target that must be developed for each learner. conceptual character education manuscripts have been designed in such a way to produce the character of human beings. it is reflected in the 2005-2025 national long-term development plan which places character education as the first mission to realize the vision of national development. hence, character building is meant to make particular groups realize the well-mannered society with noble characters. concerning the effort to realize character education as mandated in the national long-term development plan, the character building has been stipulated in the function and objectives of national education, as in the law of republic of indonesia no. 20 of 2003 on national education system. thus, the national long-term development plan and the law on national education system are solid foundations for implementing the operation of cultural education and nation character as the priority program of national education ministry 2010-2014, as outlined in the national character education action plan. based on the data obtained on implementation of the character education in kindergarten or raudhatul athfal, character education had not been implemented optimally. 
this is in line with the findings of zuchdi (2006, pp. 92–93), who concluded that the context of school education has not fully supported the implementation of character education, especially at the kindergarten level, where only four skills had been developed, namely greeting, being friendly, helping others, and asking for help politely. in achieving the mission of developing people’s character, each related institution, from families and educational institutions to communities, should play its role according to its capacity. concerning character development in kindergartens as the initial formal education, schools and teachers attempt to integrate character education values through lessons and the school environment, and there should be assessment systems that can monitor the character development of early learners in kindergartens.

learners’ cognitive development has been the focus of almost all teaching strategies. in this context, it is not easy to develop the affective and psychomotor aspects of learning (suyadi, 2013, p. 189). the affective learning strategy is useful for improving learners’ attitudes during learning activities (hamruni, 2009, p. 20). it is developed from behavioral psychology, using the stimulus-response (s-r) concept to form new behavior (attitude). thus, the affective strategy aims to develop values in character education. in other words, the affective aspect strongly influences learners’ feelings or positive emotions, so teachers will view learning as ‘a process to become’ rather than ‘the results’ (suyadi, 2013, p. 190). thus, an assessment model that can be used to track character building in the learning process is needed.
to serve this purpose, authentic assessment is the most suitable approach. authentic assessment is based on real-life contexts and requires multiple approaches to solving a problem. it involves performance measurement that reflects learners’ competencies as observed in their learning, achievement, motivation, and attitude (o’malley & pierce, 1996, p. 4). the competencies to achieve are attitude, knowledge, and skill, based on the recommended national standards (law of republic of indonesia no. 20 of 2003 section 35 verse 1). in other words, this type of assessment monitors and measures learners’ competencies in multiple problem-solving situations in a real-life context. it is expected that this authentic assessment model can provide a sound assessment system whose instruments are easy for teachers to use and accurately measure learners’ performance in kindergartens (taman kanak-kanak or tk) and raudlatul athfal (ra).

during the preliminary study, a survey and interviews were conducted with 21 participants, consisting of tk and ra principals and teachers, from june to september 2014 in several kindergartens in sleman regency, yogyakarta. the results show that 70% of teachers had not implemented a proper assessment of learners’ learning, 40% of teachers regarded assessment only as a burdensome administrative duty, and 80% of tk and ra teachers did not have proper assessment instruments. assessment in character education should be carried out in an integrated and continuous manner, that is, through observation, task completion, conversation, and task submission, to reach a conclusion on the achievement of the indicators of character values. the preliminary study also revealed that the majority (65%) of tk/ra teachers determined the level of children’s character achievement based on their personal perception, knowledge, and interpretation, not on daily recorded data.
as a result, teachers could not confidently determine the achievement level of children’s character development, particularly whether it had developed optimally or not. besides, although character values have been integrated into the daily activity plans, teachers remain hesitant in deciding the achievement level of children’s character development because the indicators used in the assessment tend to be too broad and not specified in detail. thus, an assessment model is needed that can be applied specifically to identify children’s character development easily and accurately, so that teachers can provide assistance and services that help students develop their potential and character according to each child’s developmental needs. with appropriate assessment options, teachers will be able to detect each child’s developmental achievements appropriately. based on those descriptions, the model developed in this study is an authentic assessment model to assess the character achievement level, including: (1) the theoretical formation of tk/ra children’s character dimensions in the construct of the assessment model; (2) the development of authentic assessment model instruments from the construct of an authentic character assessment model.

authentic assessment in early childhood learning in tk/ra

early age children, including those in tk and ra, are in a period of growth and development. early childhood consists of various activities involving movement, games, and habituation. thus, the assessment of early age children is done by observing their growth and developmental stages and then comparing them with the indicators. previous studies (edgington, 2004, p. 149; suyanto, 2005, p. 194) state that an assessment of early age and kindergarten children is a process of observing, recording, and documenting the children’s performance and work (jamaris, 2004, p.
119), skills, attitudes, and performance. an early childhood assessment aims not necessarily at measuring achievement and scholastic success, but rather at observing the level of developmental progress and the abilities that children have shown in their various actions, attitudes, performance, and appearance. thus, assessing early age children’s character in kindergartens does not serve to compare the children, but to see and comprehend the development of each child. jamaris (2004, p. 134) suggests that an observation must focus on the child’s behaviors, which are then compared with those expected at their age. the assessment should be sustainable and holistic, authentic, individual, natural, multi-source, and multi-context (suyanto, 2005, pp. 195–196).

authentic assessment can be used to measure character by observing children’s performances and comparing them with the children’s developmental level during the observation. to gain accurate data on children’s development, the observation may be done in a school setting during in- or out-of-class activities. assessment of early childhood is conducted in an authentic manner with real, functional, and natural activities (suyanto, 2005, p. 196). it is done to get an overview of the real development of children’s abilities by presenting valid and comprehensive data through detailed record-keeping of children’s creativity, their strengths and weaknesses, and significant events in their lives (edgington, 2004, p. 147; jamaris, 2004, p. 119).
an authentic assessment in learning is a process or formal effort to collect information on the important variables of learning as material for evaluation and decision making by teachers to improve the learning process and students’ learning outcomes (herman, aschbacher, & winters, 1992, p. 95; popham, 1995, p. 5). from this description, it is understood that an authentic assessment involves learners in purposeful and meaningful tasks. the term ‘authentic assessment’ was first introduced by wiggins in 1988 in the journal phi delta kappan in an article entitled ‘authentic assessment’ (zaenul, 2001, p. 4). it is also known as alternative assessment, in contrast with its more widely known counterpart, the traditional assessment in the form of a paper-and-pencil test. the idea of alternative assessment then drew more serious attention and became the turning point for the widespread discussion of authentic assessment.

an authentic assessment is considered an effort to integrate the measurement of learning achievement with the overall learning process. it must be noted that assessment itself is a part of the learning process as a whole. therefore, tk/ra teachers should learn about the purpose of authentic assessment and be able to apply it in the learning process to make it more effective. the term ‘authentic assessment’ is sometimes used interchangeably with ‘alternative assessment’, which refers to the process of assessing students’ behavior and performance on a multidimensional basis in real situations. in other words, alternative assessment uses nontraditional approaches to measure students’ performances and learning outcomes. an authentic assessment typically involves a task for learners to perform and assessment criteria or a rubric to assess the task performance. arends (1997, p.
284) defines an authentic assessment as a process to assess students’ performance in carrying out certain tasks in real situations. from those statements, it is concluded that the assessment of young learners’ character in tk/ra is categorized as an authentic and classroom-based assessment. a classroom-based assessment can reveal the learners’ real conditions in the classroom (stiggins, 1991, p. 8). this authentic and classroom-based model is appropriate for assessing the child’s character achievement, which consists of 14 characters. an authentic assessment is a comprehensive assessment process (encompassing all aspects of learning) that is continuous and inseparable from the learning process, aiming to determine the progress and achievement of students and to improve the planning, process, and achievement of learning.

character values developed in tk/ra

the character values developed in tk/ra are based on the opinions of experts (baron, 2000, 2005; gardner, 1996; thorndike, hagen, & sattler, 1986). further, through a focus group discussion (fgd) involving two experts in character education and five other experts, three aspects were selected, namely the spiritual, personal, and social aspects. in the development of the character values, which were validated by expert judgment, the three aspects were developed into 14 character values: faithful, worshipping ritual, humane, honest, patient and modest, brave and confident, disciplined, creative, independent, caring/empathy, tolerant, responsible, cooperative, and polite and humble. each of these characters is explained as follows.

faithful

being faithful is the first character instilled in every muslim child. it can be seen from the way muslims welcome their newborn by reciting the adzan in the baby’s right ear and the iqomah in the left ear.
it is evidence that the first and foremost value developed by muslims is belief in god (allah) (marzuki, 2015, p. 32). being faithful in this context is linked with rukun iman (the pillars of faith).

worshipping ritual (hablun minallah)

worshipping ritual is a part of sharia (islamic law). the prophet muhammad saw. taught that after tauhid (believing that allah is the one and only god) comes sharia in the form of worship and muamalah (humanity) (marzuki, 2015, p. 45). worship can be defined as the rules regulating the direct (ritual) relationship between human beings and allah (ash-shiddieqy, 2009). in other words, character building through worshipping rituals is simplified into implementing the six pillars of faith and the five pillars of islam. in this study, worship and muamalah are treated as separate character values that must be instilled in students of tk/ra. based on islamic teaching, everything done by muslims can be counted as worship when it is intended as a form of devotion to allah swt. thus, the term worshipping ritual is used. worshipping ritual is associated with the implementation of the five pillars of islam (rukun islam).

humanity (hablun minannas)

the islamic character is divided into two parts: the character of khalik (allah swt.) and the character of beings (other than allah) (marzuki, 2015, p. 32). the character of beings can be broken down into several types, one of which is the character of fellow humans (hablun minannas). muamalah means treatment or action towards others, or a relationship of interests (munawwir, 1997); it also means action between humans and beings other than humans. in practice, such activity is difficult to distinguish from character in the social aspect, but in this study, performing muamalah (hablun minannas) means doing activities that involve other people and relate to behavior in islamic teachings, namely those contained in the qur’an and/or al-hadits.
honesty

honesty literally means having a straight heart, not lying, and not cheating. honesty is an important value that everyone must have. it is shown not only verbally but also in everyday behavior (naim, 2012, p. 132). honest character must be instilled from an early age by using various approaches and setting exemplary behaviors.

patience and modesty

being patient means being able to refrain from anger. patience is a positive character that must be instilled in children early on. it is the ability to control oneself. armed with patience, a child will be able to resist inner impulses and think before acting. it will guide the child to do the right things and make the child less likely to take actions that end in bad results (naim, 2012, p. 56). by instilling patience, it is expected that children will be able to wait without easily getting upset and will be willing to wait their turn in an orderly way (queuing). modesty is a way of life that is not excessive. modesty is the inner attitude of a person who fully believes that god enlarges the sustenance of his servants, so he becomes a servant of god who is impervious and satisfied with what has been earned so far (munawar-rachman, 2015, p. 373).

bravery and confidence

being brave and confident are important characters that should be instilled early on. being courageous is often associated with self-confidence because it is believed that courage grows from a positive self-image. a child with a positive self-image will have the courage to try tackling difficult things or challenges. the child is confident that he is able and, therefore, he is willing to try. confidence is an attitude of believing in one’s own ability. confidence removes worry in carrying out one’s actions.
confidence fosters a sense of freedom to do as one desires and at the same time fosters a sense of responsibility for one’s actions. it also fosters a sense of achievement and the ability to recognize one’s own advantages and disadvantages. louster (2002, p. 4) describes a person with self-esteem as a selfless person who does not need the encouragement of others and is always optimistic and happy. according to schiller and bryant (2012, pp. 76–77), self-confidence is someone’s ability to weigh choices and decide on one of them freely and consciously.

self-discipline

self-discipline is shown when students respect and follow a system that requires people to obey the provisions, orders, and regulations in force. in other words, self-discipline is an attitude of obeying established regulations and provisions without any expectation of rewards (naim, 2012, pp. 142–143). self-discipline is an intended influence that helps children deal with their environment. it develops from the need to maintain a balance between individuals’ tendencies and intentions to achieve their goals and environmental expectations (semiawan, 2008, pp. 27–28). self-discipline in this study means children’s willingness to adhere correctly to the rules.

creativity

creativity is one of the most essential character values. by possessing a creative character, a child may experience a dynamic life. his mind constantly develops, and he constantly carries out activities to explore valuable things (naim, 2012, p. 152). creativity is an attitude and action reflected in innovation in many aspects of problem-solving. thus, a creative person can always find new and better ways, with more diverse results than the previous ones (suyadi, 2013, p. 8). creativity is thus an essential character for early-aged children.

self-independence

self-independence is the ability to act on one’s own without relying on other people (marzuki, 2015, p. 98).
it is an essential character that needs to be instilled from an early age. by instilling a sense of independence, it is hoped that children will take the initiative in doing things they should be able to do by themselves. as a result, they will be skillful in life.

caring/empathy

caring is an attitude that shows concern for the problems, situations, and conditions occurring in the children’s surroundings by becoming involved in them. being empathic is reflected in treating others the way children themselves want to be treated. it is in line with schopenhauer (1997, p. 190), who states that caring is based on a principle: ‘treat others the way you want to be treated’. children who care are those who are moved to do something to inspire, change, and do good deeds for their surroundings. caring usually comes from loving. to develop a sense of caring, as with other moral values, learning needs to involve approaches that develop three aspects of character, i.e. knowing, feeling, and acting (lickona, 1991, p. 312).

tolerance

tolerance is one of the essential characters to be instilled from an early age. by instilling tolerance at an early age, children are expected to accept diversity and believe that god creates a variety of humans with various strengths and shortcomings. tolerance is a permissive attitude toward disagreement or the ability to consider different opinions, attitudes, or ways of life (naim, 2012, p. 138). tolerance develops from willingness and awareness to respect differences.

responsibility

responsibility is an essential character that needs to be instilled at an early age. the concept of responsibility in this study is the effort made when completing tasks. the tasks should be done as well as possible, and whoever makes the effort should bear any possible risk.
moreover, he should be able to solve any problem. in its literal meaning, responsibility is the ability to respond to something. a people-oriented attitude is implied in the definition, and it shows the attitude of actively responding to others’ needs (lickona, 2004, p. 44). responsibility is defined as maximizing one’s capabilities in attempting to do something (munawar-rachman, 2015, p. 345). with a responsible character, it is expected that children can maximize their effort in doing their tasks.

cooperation

cooperation in this research includes mutual cooperation and active participation in work assigned to groups. the sense of cooperation is shown in how children do group tasks well, help others who have not finished their tasks, and invite others to play together. cooperation is developed on the principle of mutual respect and affection. cooperation also means helping each other in kindness and devotion, as suggested by the islamic provision ta’awun ‘ala al birr wa al taqwa (munawar-rachman, 2015, p. 259).

politeness and humbleness

politeness and humbleness are among the main characters that should be instilled in early childhood. the concepts of politeness and humbleness in this paper include the character of prioritizing values in behaving and respecting others in the ways children speak and act (miskawaih, 1994, p. 81). humbleness teaches children to take others’ knowledge, strength, and other qualities into account. humbleness means being polite and not arrogant (munawar-rachman, 2015, p. 230). human glory is measured by the quality of one’s humbleness in life.

method

this research and development study employs the stages suggested by plomp (1997, p. 5), covering five phases: (1) the preliminary investigation phase, (2) the design phase, (3) the realization/construction phase, (4) the test, evaluation, and revision phase, and (5) the implementation phase.
there are two main stages in this research: the pre-research stage and the development stage. the pre-research activities include investigation, design, and construction, while the development stage covers the activities of testing, evaluating, and revising. applied to this study, the r&d model by plomp comprises preliminary investigation, assessment model planning and design, assessment instrument development, testing, evaluation and revision of the assessment items, and assessment implementation. the development procedure of asoka (authentic character assessment) can be seen in figure 1.

figure 1. development procedure of the asoka measure

findings and discussion

findings

the theoretical validation result of asoka model

in the evaluation phase of the first asoka model, to establish validity, the content of the instrument was reviewed by experts, consisting of assessment experts, character education experts, paud/tk/ra experts, and islamic education experts, to gather expert judgment. experts on children’s developmental psychology were not included in this phase since the researchers assumed that they were represented by the paud/tk/ra experts. the readability test was administered to tk/ra teachers and some tk/ra principals. the validation test was done to determine whether the assessment measure could measure early age children’s character development, especially at the tk/ra level. the product evaluations from expert judgment are described as follows.

content validity in the asoka model from experts

the first product validation of the asoka model was carried out through expert judgment using the delphi technique.
validation is intended to make sure that the developed asoka model instruments can be used to detect the character attainment level of tk/ra children. the delphi technique was chosen because it is easier to carry out, yields more in-depth input, and keeps the input focused on the problem under study. content validation by the experts was analyzed using aiken’s v formula. the results show that the asoka instrument has good representation in terms of the accuracy of the indicators for each aspect and the accuracy of the items for each indicator. for the accuracy of the indicators for the aspects assessed, the aiken validity index identifies one indicator with a lower index than the others (<0.76), while for the accuracy of the items for the indicators, it identifies six items, namely items 1, 2, 3, 4, 10, and 22, with lower indices than the other items (<0.76). the six items were then dropped and replaced with new items after a long discussion with the subject-matter experts (smes). the new items were reassessed by the smes and obtained a content validity index value (v) ≥ 0.76, so it can be concluded that all 65 items in the asoka instrument met content validity. the results of the discussion and the input from the experts, as well as the final index obtained, were then consulted with the promoter and co-promoter. some changes after consultation with experts through the delphi method are as follows.
(1) changes to the spiritual aspect: it initially consisted of three indicators, namely faith, worship, and honesty, and was then refined into faith, worship ritual (hablun minallah), and performing muamalah (hablun minannas). adding muamalah to the spiritual aspect may be somewhat confusing with the social aspect, so specific indicators were made for muamalah that differ from those of the social aspect. (2) changes in indicators, consisting of additions and combinations: many indicators were added to the character of faith by including the six pillars of faith as a whole, and the indicators of the creative character were expanded from three to five. (3) changes to the number of item statements: from 56 items previously to 65 items after summarizing all the input from the seven experts. (4) changes to the terms for the achievement of children’s character: the previous terms, not yet developing (belum berkembang or bb), starting to develop (mulai berkembang or mb), developing as expected (berkembang sesuai harapan or bsh), and developing very well (berkembang sangat baik or bsb), were changed to not yet appearing (belum muncul or bm), appearing with stimulation (muncul dengan stimulus or ms), emerging inconsistently (muncul belum konsisten or mbk), and emerging consistently (muncul konsisten or mk).

the full explanation of the three aspects of the character developed is as follows: (1) spiritual aspects, including the characters of faith, ritual worship (hablun minallah), and muamalah (hablun minannas). (2) personal aspects, including honesty, patience and modesty, bravery and confidence, discipline, creativity, and independence.
(3) social aspects, including caring/empathy, tolerance, responsibility, cooperation, and politeness and humbleness.

readability test results by practitioners

the readability test of the asoka instrument aims to ensure that prospective users understand (1) the clarity of the instructions, the scope of the asoka components, the language used, and the overall writing and appearance of the assessment, (2) the clarity of the character aspects, (3) the clarity of the character indicators, (4) the communicative formulation of the statements, (5) the use of easily understood sentences and words, (6) the clarity of the assessment rubrics, and (7) the writing conventions related to letter form, font size, and instrument format or layout. the readability test involved practitioners representing prospective users, namely tk/ra teachers and heads. readability was assessed using a modified likert scale with four choices: score 1 (cannot be used), score 2 (can be used with little improvement), score 3 (can be used without improvement), and score 4 (ideal to use). when compared with the guidelines for feasibility categorization, the model falls into the good classification, which indicates that the readability of the developed instrument can be classified as good or feasible to use.

results of asoka model try-out i

try-out i involved 106 learners of group b from six tk/ra in yogyakarta: ra dwp uin sunan kalijaga, ra nurul dzikri, tk islam tunas melati, tk aisyiyah bustanul athfal taruna alquran, ra masyitoh ngeposari gunung kidul, and tkip salsabila pandowoharjo-sleman. the sample for the limited test was determined by proportional random sampling. the results of the first try-out of the asoka model are reported below for each aspect.
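the aiken’s v index used in the content validation reported earlier can be illustrated with a short python sketch. it assumes the usual form of the index, v = Σ(r − lo) / [n(c − 1)], where r is each expert’s rating, lo is the lowest rating category, c is the number of categories, and n is the number of raters. the 1-5 rating scale and the ratings themselves are hypothetical, since the source does not report them; only the seven raters and the 0.76 cutoff come from the text.

```python
def aikens_v(ratings, lowest=1, highest=5):
    """Aiken's V for one item: sum(r - lowest) / (n * (c - 1)).

    With integer categories lowest..highest, c - 1 equals highest - lowest.
    The 1-5 scale is an assumption, not taken from the study.
    """
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one rating is required")
    return sum(r - lowest for r in ratings) / (n * (highest - lowest))


CUTOFF = 0.76  # threshold reported in the text

# Hypothetical ratings from seven experts for two illustrative items.
hypothetical_items = {
    "item_a": [5, 4, 5, 4, 4, 5, 4],
    "item_b": [3, 4, 3, 4, 3, 4, 4],
}

for name, ratings in hypothetical_items.items():
    v = aikens_v(ratings)
    status = "keep" if v >= CUTOFF else "revise or replace"
    print(f"{name}: V = {v:.3f} -> {status}")
```

under these assumptions, an item rated like item_b would fall below the cutoff, which mirrors how the six low-index items were replaced and then reassessed by the experts.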
spiritual aspects

the achievement in the spiritual aspect is represented by three tk/ra with different characteristics, namely ra nurul dzikri (ra-nd), tk islam tunas melati (tk-tm), and tk plus (full day) salsabila (tkp-sb). the distribution of the assessment results on children’s character achievement in the three institutions, as a representation of tk/ra in islamic education institutions, can be seen in the histogram in figure 2.

figure 2. histogram of the distribution of the assessment of spiritual character achievement for each component

in figure 2, it can be seen clearly that the average achievement of the children’s character in the spiritual aspect is dominated by mbk (emerging inconsistently), at 65%-75%, whereas the attainment of mk (emerging consistently)/culture is only about 10%. it indicates that the children’s spirituality is still developing and has not appeared consistently; for this reason, it needs to be continually improved with various strategies so that it becomes a main character possessed by each student.

personal aspects

the assessment results for the achievement of the children’s character in the personal aspects can be seen in the distribution in figure 3. in figure 3, it can be seen clearly that the average achievement of the children’s character in the personal aspect is dominated by mbk (muncul belum konsisten or emerging inconsistently), at 75%, while in the mk (muncul konsisten or emerging consistently) category the lowest achievement is in the creative character, at only around 4%, and the highest is in the independent character, at 45%.
it is a challenge for teachers to grow the creative character of each student through a variety of planned stimulation in learning.

figure 3. histogram of personal character aspect achievement for each component

social aspects

the assessment results for the achievement of the children’s character in the social aspects can be seen in the distribution in figure 4.

figure 4. histogram of the distribution of the assessment results of children’s character achievement in the social aspects

results of analysis of validity and reliability of asoka instruments

construct validity testing examines the extent to which a scale reflects and represents the concept being measured (hair, black, babin, & anderson, 2010, p. 710). the construct validity of the character dimensions was analyzed using exploratory factor analysis (efa). this analysis points to the factors that can explain the correlations between variables. each variable has a factor loading that represents it. the required value of the factor loadings in efa can be determined based on the number of samples in the study (hair et al., 2010, p. 117). the adequacy of the number of observations can be identified through the kaiser-meyer-olkin (kmo) parameter, which should be > 0.5. correlations between the variables can be identified by bartlett’s test of sphericity, which must be significant with a p-value < 0.05. the magnitude of the correlation between variables can be seen from the measure of sampling adequacy (msa), which should be > 0.5. the communalities of the items have an acceptable limit of above 0.30 (mooi & sarstedt, 2011, p. 212).
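the two screening statistics named above can be sketched in plain python. the sketch assumes the standard definitions: the overall kmo compares squared correlations with squared partial correlations taken from the inverse of the correlation matrix, and bartlett’s statistic is −[(n − 1) − (2p + 5)/6]·ln|r| with p(p − 1)/2 degrees of freedom. the 3×3 correlation matrix is a made-up toy example, not the study data, and the function names are illustrative only.

```python
import math


def invert_with_det(matrix):
    """Gauss-Jordan inverse and determinant of a small square matrix."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    a = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        scale = a[col][col]
        det *= scale
        a[col] = [v / scale for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a], det


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv, _ = invert_with_det(corr)
    p = len(corr)
    r2 = q2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                r2 += corr[i][j] ** 2
                q2 += partial ** 2
    return r2 / (r2 + q2)


def bartlett_df(p):
    """Degrees of freedom of Bartlett's test for p variables."""
    return p * (p - 1) // 2


def bartlett_sphericity(corr, n_obs):
    """Chi-square statistic and degrees of freedom of Bartlett's test."""
    p = len(corr)
    _, det = invert_with_det(corr)
    chi2 = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(det)
    return chi2, bartlett_df(p)


# Toy correlation matrix (hypothetical), with the try-out sample size of 106.
toy_corr = [[1.0, 0.5, 0.5],
            [0.5, 1.0, 0.5],
            [0.5, 0.5, 1.0]]
print(kmo(toy_corr))
print(bartlett_sphericity(toy_corr, n_obs=106))
print(bartlett_df(14))  # degrees of freedom for 14 character dimensions
```

for 14 character dimensions the degrees of freedom are 14 × 13 / 2 = 91, which agrees with the df reported for bartlett’s test in table 1. the significance value would additionally require the chi-square distribution, which is left out of this sketch.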
the results of the correlation test between variables are presented in the output of kmo and bartlett's test in table 1.

table 1. the kmo and bartlett's test of the asoka instrument on try-out i

| kmo and bartlett's test | statistic | value |
|---|---|---|
| kaiser-meyer-olkin measure of sampling adequacy | | 0.868 |
| bartlett's test of sphericity | approx. chi-square | 1525.141 |
| | df | 91 |
| | sig. | .000 |

in table 1, the kmo msa (kaiser-meyer-olkin measure of sampling adequacy) has a value of 0.868. this kmo msa value is good since it is greater than 0.5 (kmo > 0.50). it indicates that all character dimensions have met the adequacy requirement for the number of observations (data). bartlett's test of sphericity yields a chi-square value of 1525.141 with 91 degrees of freedom and a significance of less than 0.001. based on the anti-image correlation (aic), no item has an msa value of less than 0.50, as shown in table 2.

table 2. values of aic

| aspect | item | aic value |
|---|---|---|
| spiritual | s1 | 0.774 |
| spiritual | s2 | 0.742 |
| spiritual | s3 | 0.867 |
| personal | p4 | 0.830 |
| personal | p5 | 0.845 |
| personal | p6 | 0.921 |
| personal | p7 | 0.889 |
| personal | p8 | 0.895 |
| personal | p9 | 0.877 |
| personal | p10 | 0.800 |
| social | sos11 | 0.863 |
| social | sos12 | 0.915 |
| social | sos13 | 0.923 |
| social | sos14 | 0.916 |

table 3. total variance explained value

| component | initial eigenvalues: total | initial eigenvalues: % of var. | initial eigenvalues: cum. % | rotation sums of squared loadings: total | rotation sums of squared loadings: % of var. | rotation sums of squared loadings: cum. % |
|---|---|---|---|---|---|---|
| 1 | 6.510 | 46.541 | 46.541 | 4.172 | 29.700 | 29.799 |
| 2 | 2.421 | 17.202 | 63.834 | 3.634 | 25.954 | 55.753 |
| 3 | 1.165 | 8.322 | 72.156 | 2.296 | 16.403 | 72.156 |
| 4 | .695 | 4.962 | 77.188 | | | |
| 5 | .571 | 4.079 | 81.197 | | | |
| 6 | .463 | 3.305 | 84.502 | | | |
| 7 | .437 | 3.122 | 87.623 | | | |
| 8 | .377 | 2.692 | 90.315 | | | |
| 9 | .287 | 2.047 | 92.362 | | | |
| 10 | .273 | 1.950 | 94.312 | | | |
| 11 | .256 | 1.830 | 96.143 | | | |
| 12 | .216 | 1.557 | 97.699 | | | |
| 13 | .177 | 1.267 | 98.966 | | | |
| 14 | .145 | 1.034 | 100.000 | | | |

figure 4.
scree plot of the asoka instrument in tryout i

based on table 2, no items have an msa value below 0.50, so all asoka instrument items are included in the next process. the number of factors formed can then be determined from the total variance explained values, which are summarized in table 3. table 3 shows the total variance explained and the variance that can be explained by factor 1, factor 2, and factor 3. together, these three factors explain 72.156% of the variance. because the eigenvalue cutoff is set at 1, only components with eigenvalues > 1 are retained, namely components 1, 2, and 3. in the cumulative sub-column of the initial eigenvalues column, it can be seen that the reduction of the 14 items analyzed yields three factors with eigenvalues above 1. with a kmo msa value of 0.868 (> 0.50), the requirement to continue the analysis is fulfilled. this shows that there are three factors in the character achievement of early childhood in tk/ra, in accordance with the estimated indicators. thus, the asoka model instrument can be said to be valid in terms of construct validity. the percentage of variance in early childhood character achievement in tk/ra explained by the first factor is 46.541%, by the second factor 17.292%, and by the third factor 8.322%. cumulatively, the three factors account for 72.156%. the scree plot of the eigenvalues is illustrated in figure 4. figure 4 shows the decreasing trend of the eigenvalues, which is used to determine subjectively the number of factors formed. the decline of the eigenvalues in the scree plot indicates that the factors formed lead to three character dimensions.
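the eigenvalue-greater-than-1 rule and the percentage-of-variance figures discussed above can be retraced with a short sketch. the eigenvalues below are copied from the initial eigenvalues column of table 3; the function names are illustrative assumptions. because the analysis starts from a correlation matrix, each of the 14 standardized items contributes a variance of 1, so a component's share is its eigenvalue divided by 14. the recomputed shares may differ from the printed ones in the second decimal place, since the published eigenvalues are rounded.

```python
# initial eigenvalues as printed in table 3 (14 components)
eigenvalues = [6.510, 2.421, 1.165, 0.695, 0.571, 0.463, 0.437,
               0.377, 0.287, 0.273, 0.256, 0.216, 0.177, 0.145]


def retained_components(values, cutoff=1.0):
    """components kept under the eigenvalue-greater-than-cutoff rule."""
    return [v for v in values if v > cutoff]


def percent_variance(values):
    """share of total variance per component; total variance = number of items."""
    total = len(values)
    return [100.0 * v / total for v in values]


retained = retained_components(eigenvalues)
shares = percent_variance(eigenvalues)
cumulative = sum(shares[:len(retained)])

print(len(retained))         # 3 components have eigenvalues above 1
print(round(cumulative, 1))  # 72.1, close to the 72.156% reported
```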
overall model fit

the goodness of fit of the measurement model with the field data was tested using the second-order confirmatory factor analysis (cfa) technique. the second-order cfa aims to determine the validity of the indicators developed by the researchers. an indicator is said to be valid if its factor loading is higher than 0.3. construct validity with the second-order cfa technique is used to test the fit of the character achievement assessment model (ghozali & fuad, 2008, p. 137; joreskog & sorbom, 1999, p. 115). in this approach, the analysis is done directly in two stages, i.e., from the variable to the indicator, then from the indicator to the item. in addition, the second-order cfa also tests whether or not the data fit the model that was formed previously.

based on the standardized second-order cfa test output, the statistical values used as the goodness-of-fit criteria are as follows: df = 69; chi-square p-value = 0.06624 (p-value > 0.05); rmsea = 0.05 (rmsea < 0.08); cfi = 0.99 (cfi ≥ 0.9); nfi = 0.9; and rmr = 0.018. the results of the character measurement model are shown in table 4. table 4 shows the goodness-of-fit test results of the model. judged from the factor loadings, the indicators are all above 0.3. this indicates that all indicators constructing the authentic assessment components of early childhood character in tk/ra are valid. the character construct measurement model has met the goodness-of-fit statistics, so it is stated to be a good measurement model. the asoka instrument test is done directly, and the spss output shows kmo msa > 0.5.
thus, all asoka dimensions have fulfilled the requirement of a sufficient number of observations (data). in addition, bartlett's test of sphericity shows a significance of less than 0.05 (p-value < 0.05), indicating a significant correlation between the observed variables of all dimensions. it can be concluded that the observation data on the character dimensions of tk/ra students qualify for confirmatory factor analysis. the overall goodness-of-fit evaluation of each character dimension based on the second-order cfa test shows that the model fits the data. the model is considered to match the field data if at least three of the seven commonly used measures are fulfilled, namely (1) chi-square (p-value), (2) goodness of fit index (gfi), and (3) root mean square error of approximation (rmsea). the model is said to fit if the chi-square has a significance level (p-value) ≥ 0.05; gfi ≥ 0.90; and rmsea is 0.05 < rmsea ≤ 0.08 (browne & cudeck, 1993). thus, it can be interpreted that the indicators specified to measure each dimension (spiritual, personal, social) together measure what they are intended to measure. furthermore, all goodness-of-fit (gof) measures show that the model matches the field data well, so it can be concluded that, overall, the asoka model fits on all character dimensions.

an instrument for assessing students' character had been developed and shown to be valid and reliable, and the feasibility of the model was then tested by trying it out based on the model usage guideline. the asoka model usage guideline takes the form of a handbook complete with the assessment instruments and procedures as well as guidelines on writing the assessment reports. this handbook was tried out with 15 teachers/potential users from 15 tk/ra in yogyakarta.
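the goodness-of-fit screening reported in this section for tryout i can be sketched as a simple threshold check. the statistics are those reported in the text and in table 4; the cutoffs follow the ones stated in the article, and where the article gives two cutoffs the more lenient one is used here (rmsea ≤ 0.08, cfi ≥ 0.90), which is an assumption of this sketch. the dictionary keys and the function name are illustrative only.

```python
# values reported for the second-order cfa in tryout i
reported = {
    "chi_square": 87.44,
    "df": 69,
    "p_value": 0.06624,
    "rmsea": 0.05,
    "cfi": 0.99,
    "rmr": 0.018,
}


def check_fit(stats):
    """return a pass/fail flag for each goodness-of-fit criterion."""
    return {
        "chi_square < 2*df": stats["chi_square"] < 2 * stats["df"],
        "p_value > 0.05": stats["p_value"] > 0.05,
        "rmsea <= 0.08": stats["rmsea"] <= 0.08,
        "cfi >= 0.90": stats["cfi"] >= 0.90,
        "rmr < 0.05": stats["rmr"] < 0.05,
    }


results = check_fit(reported)
for criterion, passed in results.items():
    print(f"{criterion}: {'good fit' if passed else 'poor fit'}")
```

keeping each criterion as a separate flag makes it easy to see which measure fails when a respecified model is screened with the same cutoffs.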
table 4. goodness of fit test results of asoka model on tryout i

| no | gof measure | target fitness level | estimated value | fitness level |
|---|---|---|---|---|
| 1 | chi-square (χ²) | < 2df | 87.44 | good fit |
| 2 | p-value | > 0.05 | 0.06 | good fit |
| 3 | rmsea | ≤ 0.05 | 0.05 | good fit |
| 4 | nfi | ≥ 0.90 | 0.96 | good fit |
| 5 | nnfi | ≥ 0.90 | 0.99 | good fit |
| 6 | cfi | ≥ 0.92 | 0.99 | good fit |
| 7 | rmr | < 0.05 | 0.018 | good fit |

discussion

in developing this character assessment model, the researchers adopted plomp's (1997, p. 5) approach, modified into four phases: (1) initial investigation, (2) design, (3) prototype construction, and (4) development. this development process produced a construction of the asoka model consisting of three aspects, namely the spiritual aspect, the personal aspect, and the social aspect. the spiritual aspect consists of three characters: (1) faithfulness, (2) worshipping ritual (hablum minallah), and (3) humanity (hablum minannas). the personal aspect consists of six characters, namely: (1) honesty, (2) patience and modesty, (3) bravery and confidence, (4) self-discipline, (5) creativity, and (6) self-independence. the social aspect consists of five characters: (1) caring/empathy, (2) tolerance, (3) responsibility, (4) cooperation, and (5) politeness and humbleness. these characters are the results of the development of the assessment model in its early stages to assess the character development of young learners in tk/ra. the method used in the asoka assessment model was an authentic assessment employing observations. this is in accordance with meisels, bickel, nicholson, xue, and atkins-burnett (2001, p. 75), who assert that the appropriate method for assessing kindergarten children is performance assessment embedded in the curriculum, often referred to as authentic assessment.
the assessment was done gradually and continuously so that progress in children's character development could be measured. in line with this, suyanto (2005, p. 189) suggested that assessment be done through real, functional, and natural activities, from the time the students get to school until they go home. the observation method employed in this assessment model is also appropriate, as azwar (2015, p. 90) states that the assessment of attitudes (characters) can be done through behavioral observations. when a child shows a behavior repeatedly (consistently), it can be said that the behavior has become part of the child's character. for example, if a child always prays before doing any activity such as eating, drinking, and even playing, it can be said that the child has a spiritual character, because by always praying before doing anything, he/she shows his/her belief in the existence of god/allah.

in addition, validity was proven and the reliability of the asoka model instruments was estimated. the content validity of the instrument was obtained from expert judgment through the delphi method, followed by analysis using aiken's formula. based on the analysis, the indicators and items of the instruments had aiken's indices from 0.714 to 1.000, meaning that the proposed indicators and items were valid. the criterion used to determine the validity level was that of retnawati (2016): an aiken's agreement index of 0.4-0.8 shows medium validity; an index of more than 0.8 shows high validity. in conclusion, all of the proposed indicators can be used to develop an authentic assessment instrument to assess the characters of young learners in tk/ra. the aiken index was chosen because it is considered accurate for measuring the content validity of an instrument.
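as an illustration of the content validity analysis described above, the sketch below computes aiken's v for a single item with the standard formula v = Σ(r - lo) / (n(c - 1)), where r is a rater's score, lo the lowest scale category, c the number of categories, and n the number of raters. the seven raters, the 1-4 scale, and the ratings themselves are hypothetical, since the article does not list the raw expert scores. the medium and high bands follow the criterion attributed to retnawati (2016); the "low" label for values under 0.4 is an assumption of this sketch.

```python
def aiken_v(ratings, lo, hi):
    """aiken's v for one item: sum(r - lo) / (n * (hi - lo))."""
    n = len(ratings)
    return sum(r - lo for r in ratings) / (n * (hi - lo))


def validity_level(v):
    """bands from the article: 0.4-0.8 medium, above 0.8 high.

    the 'low' label below 0.4 is an assumption, not stated in the article.
    """
    if v > 0.8:
        return "high"
    if v >= 0.4:
        return "medium"
    return "low"


# hypothetical ratings from seven experts on a 1-4 relevance scale
ratings = [4, 4, 3, 3, 3, 3, 2]
v = aiken_v(ratings, lo=1, hi=4)
print(round(v, 3), validity_level(v))
```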
the instrument is valid if it measures what it should measure based on the raters' agreement. construct validity is the result of testing related to the extent to which the scale reflects and acts like the concept being measured (hair et al., 2010, p. 710). the analysis of the construct validity of the character dimensions was performed using exploratory factor analysis (efa). this analysis resulted in a kmo value of 0.868 (kmo > 0.50); a chi-square of 1525.141 with 91 degrees of freedom and a significance of less than 0.001; anti-image correlation (aic) values of more than 0.50; and communality values of 0.611-0.791 (communality > 0.30), which fulfill the prerequisite criteria (mooi & sarstedt, 2011, p. 212). meanwhile, the instrument reliability was calculated using cronbach's alpha. the cronbach's alpha coefficient for the asoka model was 0.914, which is higher than 0.70. this requirement refers to the instrument reliability criteria (mardapi, 2017; nunnally, 1981, p. 115), which state that an instrument is considered reliable if its alpha reliability is 0.70 or higher. the validity and reliability of the asoka model instrument have thus been estimated, and the results show that the instrument is valid and reliable. the resulting assessment instrument is then used to assess the achievement of early childhood character development in tk/ra. it takes the form of a checklist containing character assessment indicators covering 14 characters described in 65 assessment items.

a further product developed in this study is a user manual for the asoka model. it serves as a guide for the users, that is, kindergarten teachers, in applying the asoka instruments.
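the cronbach's alpha reliability estimate mentioned above can be sketched with the standard formula alpha = k/(k - 1) x (1 - Σ item variances / variance of total scores), where k is the number of items. the small score matrix below is hypothetical and only illustrates the computation; it is not the asoka data, and the function name is an assumption.

```python
from statistics import variance


def cronbach_alpha(scores):
    """cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # sample variance of each respondent's total score
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# hypothetical scores: five respondents, three items (illustration only)
scores = [
    [3, 3, 3],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 1],
    [3, 4, 3],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
print("reliable" if alpha >= 0.70 else "not reliable")  # 0.70 cutoff cited in the text
```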
the results of the assessment are used as the basis for the feasibility of using the asoka model in tk/ra. assessment results from tk/ra teachers show that 72% assume that this model is good, meaning that it can be used without any revision. fourteen percent thought that the manual is good enough, meaning it could be used with little improvement, while the other 14% would judge that the manual is excellent, ideal to be used as an example for character assessment. some weaknesses in asoka's user manual have been improved. it is done to improve the quality of the function of the asoka model as a valid and reliable assessment model that is able to measure the achievement of early childhood character in tk/ra. the use of this asoka model is started from the making of the rpph (daily learning program plan) and ended by assessing the expected characters. thus, all the points of behavior that become indicators of the achievement of the child's character can be appraised properly. in assessing the achievement of the children’s characters, it is necessary to diversify the system and the method of the assessment because the use of a varied method will better guarantee the quality of the assessment result. thus, the construct of the asoka model developed by the researchers is one of the important assessment models used to help facilitate tk/ra teachers in performing their duty to do the assessment, which is inseparable from their two other tasks namely planning and conducting the learning process effectively. after passing the experiment with a wide sample, the results of the asoka model development fulfill the validity and reliability criteria and it is also considered as the correct measurement model because it fulfills the criteria of goodness-of-fit model. thus it can be stated that the developed instrument has a feasibility standard as an instrument to detect the level of achievement of early childhood character in tk/ra. 
conclusion

based on the findings and discussion, the conclusions of this study can be formulated as follows. (1) asoka is an authentic assessment model developed to assess the achievement of early childhood character in tk/ra. this model consists of an instrument and a manual for assessing the character of early childhood in tk/ra. the asoka model is effective because the indicators used to measure the spiritual, personal, and social aspects of the early childhood character constructs in tk/ra mostly have a factor loading greater than 0.30, while the reliability of the character constructs is shown by construct reliability (cr) coefficients > 0.70, i.e., cr = 0.72 on the spiritual aspect, cr = 0.79 on the personal aspect, and cr = 0.86 on the social aspect. the asoka model also meets the goodness-of-fit criteria, so it is considered an assessment model that can be used to detect the achievement of early childhood character in tk/ra. (2) the characteristics of the authentic assessment model instruments to assess early childhood character outcomes in tk/ra are as follows: (a) the content validity of the asoka (authentic character assessment) instrument is high. based on the aiken's v analysis, the indicators have aiken indices of 0.714 to 1.000, with an average index of 0.901; (b) for the construct validity of the asoka (authentic assessment for character) instrument, the second-order cfa approach yielded a fit model for assessing the character of early childhood in tk/ra, meaning that the developed asoka model meets the goodness-of-fit criteria; (c) the reliability of the developed instrument is high, with a cronbach's alpha coefficient of 0.914.

references

abidin, m. z. (2012). tingkat pendidikan di indonesia. seminar pendidikan karakter. bali: universitas udayana.
arends, r. i. (1997). classroom instruction and management. new york, ny: mcgraw-hill.
ash-shiddieqy, t. m. h. (2009). sejarah dan pengantar ilmu hadits. semarang: pustaka rizki putra.
azwar, s. (2015). skala pengukuran sikap manusia. yogyakarta: pustaka pelajar.
bar-on, r. (2000). emotional and social intelligence: insights from the emotional quotient inventory. in r. bar-on & j. d. a. parker (eds.), the handbook of emotional intelligence: theory, development, assessment, and application at home, school, and in the workplace (pp. 363–388). san francisco, ca: jossey-bass.
bar-on, r. (2005). the impact of emotional intelligence on subjective well-being. perspectives in education, 23(2), 1–22. retrieved from https://hdl.handle.net/10520/ejc87316
browne, m. w., & cudeck, r. (1993). alternative ways of assessing model fit. in k. a. bollen & j. s. long (eds.), testing structural equation models. new york, ny: sage publication.
edgington, m. (2004). the foundation stage teacher in action: teaching 3, 4, and 5 years olds. london: pcp press.
gardner, h. (1996). intelligence: multiple perspectives. fort worth, tx: harcourt brace college.
ghozali, i., & fuad, f. (2008). structural equation modeling. semarang: undip press.
hair, j. f., black, w. c., babin, b. j., & anderson, r. e. (2010). multivariate data analysis (7th ed.). upper saddle river, nj: prentice hall.
hamruni, h. (2009). strategi dan model-model pembelajaran aktif menyenangkan. yogyakarta: fakultas tarbiyah uin sunan kalijaga.
herman, j. l., aschbacher, p. r., & winters, l. (1992). a practical guide to alternative assessment. alexandria, va: association for supervision and curriculum development.
jamaris, m. (2004). assesmen pendidikan anak usia dini. seminar dan lokakarya nasional pendidikan anak usia dini. jakarta.
joreskog, k. g., & sorbom, d. (1999). lisrel 8: user's reference guide. chicago, il: scientific software international.
law of republic of indonesia no. 20 of 2003 on national education system. (2003).
lickona, t. (1991). educating for character: how our schools can teach respect and responsibility. new york, ny: state university of new york.
lickona, t. (2004). character matters: how to help our children develop good judgment, integrity, and other essential virtues. new york, ny: simon & schuster.
louster, p. (2002). tes kepribadian (c. g. sumeksto, trans.). yogyakarta: kanisius.
majid, a., & andayani, d. (2011). pendidikan karakter perspektif islam. bandung: remaja rosdakarya.
mardapi, d. (2017). pengukuran, penilaian, dan evaluasi pendidikan (2nd ed.). yogyakarta: parama publishing.
marzuki, m. (2015). pendidikan karakter islam. jakarta: amzah.
meisels, s. j., bickel, d. d., nicholson, j., xue, y., & atkins-burnett, s. (2001). trusting teachers' judgments: a validity study of a curriculum-embedded performance assessment in kindergarten to grade 3. american educational research journal, 38(1), 73–95. https://doi.org/10.3102/00028312038001073
miskawaih, a. a. a. i. (1994). menuju kesempurnaan akhlak (h. hidayat & e. hasan, trans.). bandung: mizan.
mooi, e., & sarstedt, m. (2011). a concise guide to market research. berlin: springer-verlag berlin heidelberg.
munawar-rachman, b. (ed.). (2015). pendidikan karakter: pendidikan menghidupkan nilai untuk pesantren, madrasah dan sekolah. jakarta: lsaf dan alive indonesia.
munawwir, a. w. (1997). kamus arab-indonesia (14th ed.). surabaya: pustaka progressif.
naim, n. (2012). character building: optimalisasi peran pendidikan dalam pengembangan ilmu dan pembentukan karakter bangsa. yogyakarta: ar-ruzz media.
nunnally, j. c. (1981). psychometric theory. new york, ny: mcgraw-hill.
o'malley, j. m., & pierce, l. v. (1996). authentic assessment for english language learning: practical approaches for teachers. new york, ny: addison-wesley.
plomp, t. (1997). educational design research: an introduction. in t. plomp & n. nieveen (eds.), educational design research. enschede: faculty of educational science and technology, university of twente.
popham, w. j. (1995). classroom assessment: what teachers need to know. boston, ma: allyn and bacon.
retnawati, h. (2016). proving content validity of self-regulated learning scale (the comparison of aiken index and expanded gregory index). reid (research and evaluation in education), 2(2), 155–164. https://doi.org/10.21831/reid.v2i2.11029
schiller, p., & bryant, t. (2012). 16 moral dasar bagi anak (s. sensusi, trans.). jakarta: pt elex media komputindo.
schopenhauer, a. (1997). menembus selubung sang maya. in f. m. suseno (ed.), 13 model pendekatan etika. yogyakarta: kanisius.
semiawan, c. r. (ed.). (2008). penerapan pembelajaran pada anak. jakarta: indeks.
stiggins, r. j. (1991). relevant classroom assessment training for teachers. educational measurement: issues and practice, 10(1), 7–12. https://doi.org/10.1111/j.1745-3992.1991.tb00171.x
suyadi, s. (2013). strategi pembelajaran pendidikan karakter. bandung: remaja rosdakarya.
suyanto, s. (2005). konsep dasar pendidikan anak usia dini. jakarta: departemen pendidikan nasional.
thorndike, r. l., hagen, e. p., & sattler, j. m. (1986). stanford-binet intelligence scale (4th ed.). chicago, il: riverside.
zaenul, a. (2001). alternative assessment. jakarta: direktorat jenderal pendidikan tinggi, departemen pendidikan nasional.
zuchdi, d. (2006). pendidikan karakter melalui pengembangan keterampilan hidup (life skills development) dalam kurikulum persekolahan. in laporan penelitian hibah pascasarjana. yogyakarta.
copyright © 2021, reid (research and evaluation in education), 7(2), 2021 issn: 2460-6995 (online)

reid (research and evaluation in education), 7(2), 2021, 168-176
available online at: http://journal.uny.ac.id/index.php/reid

developing student character assessment questionnaire on french subject in state high schools

nur hamidah assa'diyah*; samsul hadi
universitas negeri yogyakarta, indonesia
*corresponding author. e-mail: nurhamidah.2018@student.uny.ac.id

introduction

in the 21st century, technology advances rapidly. alongside positive impacts, e.g., facilitating communication and access to information from various countries, the advancement of technology also makes people forget that interaction and socialization with others must be maintained (lalo, 2018). the rapid advancement of technology has unconsciously changed human character. both positive and negative impacts are inevitable. for example, children and adolescents who fail to control themselves and are too dependent on technology (mobile phones) tend to live apathetically and pay less attention to their surroundings and community, which also keeps human character from growing properly. character education is expected to make humans have good character and care about the environment. many countries have implemented character education in schools and families, but character problems still often occur, especially among school students (lee & manning, 2014, p.284). setiawan (2013, p.53) explained that the implementation of character education has not been maximized. many efforts have been made to improve character education, but criminal acts, such as pornography, are still frequent; therefore, character education should be made a priority. good character will make humans useful for the environment (sugiarti, 2018, p.6).
article info

article history
submitted: 16 august 2021
revised: 3 november 2021
accepted: 24 december 2021

keywords
questionnaire development; character assessment; french

abstract

as time goes by, technological advances contribute to character crises, especially among students. this study is aimed at developing a student character assessment questionnaire for the french subject in state high schools of yogyakarta special region. the study employed the 4d development model comprising define, design, develop, and disseminate. the study involved 269 students of grades x, xi, and xii as the sample. the reliability test was performed using the cronbach's alpha formula, yielding 0.689, which is considered strong reliability. the validity of the questionnaire was tested using efa (exploratory factor analysis). the questionnaire used the likert scale. the result of the study shows that 15 items are valid and can show the character of students in the french subject.

this is an open access article under the cc-by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

how to cite: assa'diyah, n., & hadi, s. (2021). developing student character assessment questionnaire on french subject in state high schools. reid (research and evaluation in education), 7(2), 168-176. doi:https://doi.org/10.21831/reid.v7i2.43196

character education programs in schools have been part of human life since ancient times. the goal is to create and live a better life in one's own environment, the family environment, and the social environment. character education held in schools can be started by building awareness, feelings, care, knowledge, and trust, and by forming good habits (rokhman et al., 2014, p.1163). in living a better life, it is necessary to start by forming good attitudes and actions.
therefore, habit formation should be practiced in character education. bad habits must be abandoned and lead students to get used to behaving and acting well. character building and character education must become a single unit. character-building can also be done in a school environment or an educational environment following character building in the family and social environment. judiani (2016, p.281) explained that character education in schools is not a particular subject and rather taught implicitly in each subject. instilling of character values can be done at the education level. currently, the implementation of character education is intensively carried out to promote better character for the next national generation. character education is a system of instilling character values in the school environment. the instilling is applicable for all school residents. the components of character education values include knowledge, awareness or willingness, and actions to apply these values to god, others, the environment, and the nation; hence, school residents can become human beings (citra, 2012, p.238). the implanted character education will not only improve the character of students, but all school residents will also have good character since they are involved in the process of character building. darodjat and zuchdi (2016, p.25) said that character education in schools must be supported by student activities in the family and community, so students can have good character. regulations for the implementation of character education programs have been set by the president of the republic of indonesia since 2017. the program is inserted into subjects in schools, including local content or specialization subjects, french. local content subjects always run in schools because these subjects can shape the character of students who love local culture (hadi et al., 2019, p.46). 
referring to the presidential regulation of the republic of indonesia number 87 of 2017 concerning strengthening character education, the researchers harmonized this regulation with the syllabus of the high school french subject. the syllabus mentions the character values that must be applied in school subjects, including being communicative, independent, cooperative, responsible, and tolerant. the five character values follow article 3 of the presidential regulation, and their characteristics are shown in table 1. currently, many high schools in indonesia facilitate students to learn foreign languages, e.g., french. learning french helps students find information from books or the internet in french. for students who do not continue their education to a higher level, french can be a skill for getting a job as a tour guide.

table 1. character values

| no. | character values | characteristics |
|---|---|---|
| 1 | communicative | students have language skills. the material taught must encourage students to communicate fairly (rabawati et al., 2013) |
| 2 | independent | confident; considers the opinions and advice of others; able to make decisions; not easily influenced by others (wuryandani et al., 2016, p.210) |
| 3 | cooperation | conducting learning with two or more people by interacting with each other, combining energy, ideas, or opinions at a certain time to achieve learning objectives (yulianti et al., 2016, p.35) |
| 4 | tolerance | appreciating differences in religion, ethnicity, opinions, attitudes, and actions of others (komalasari & saripudin, 2017, p.41) |
| 5 | responsibility | carrying out duties and obligations towards themselves, society, the environment (nature, social, and culture), the nation, and god (gunawan, 2014, p.33) |

character values taught in french subjects will be useful for students. for example, the communicative character will help students who want to be tour guides avoid difficulties at work, and will help those who want to continue their education to a higher level understand lessons in class more easily. french has different speech acts from indonesian or english. the choice of tutoyer ou vouvoyer (addressing someone as tu or vous) in french shows politeness when speaking. the tolerance character taught in character education in french subjects can make students understand language and cultural differences between indonesia and france. based on an interview with french teachers in high schools of yogyakarta special region, the character education strengthening program is also applied to french subjects. therefore, the researchers wanted to develop an instrument to assess student character for the french subject following the character values implemented by the government in the syllabus. character assessment is an important part of character education. doing assessment is the same as looking at the level of success of the character education that has been taught (supriyadi, 2011, p.116). doing assessment in french subjects helps teachers know their success in instilling good character. a questionnaire is one of the study instruments. questionnaires are divided into two types, i.e., open and closed. an open questionnaire allows respondents to provide opinions according to their wishes and circumstances, while a closed questionnaire allows respondents to choose one answer according to their characteristics (apriliasari, 2015, p.4).
according to malinda (2016, p.3), the questionnaire instrument can be used as a tool to show validation by teachers and to determine student responses. one example is a questionnaire with a likert scale. such a questionnaire has five ratings with different values, starting from very bad with a score of 1, bad with a score of 2, fairly good with a score of 3, good with a score of 4, and very good with a score of 5. thus, this study developed a closed questionnaire with a likert scale to measure student character in the french subject. before being used, the questionnaire was validated so that the statements it contains are valid.

method

the study was a development study using the 4d development model comprising define, design, develop, and disseminate. the study aimed to develop a student character assessment questionnaire for the french subject of x, xi, and xii grades. the population was students of x, xi, and xii grades in high schools of yogyakarta special region. french is an optional subject, so not all schools and not all classes study it. the sample comprised 269 students selected using purposive sampling, with the criterion of being students of x, xi, and xii grades in high schools of yogyakarta special region that implement the french subject. this technique was used because not all members of the population meet the study criteria. the study developed a closed questionnaire using a five-point likert scale, comprising 1 = strongly disagree, 2 = disagree, 3 = uncertain, 4 = agree, and 5 = strongly agree.
strongly disagree means students have never practiced the character values mentioned in the french subject, disagree means students have practiced the character values but only once or twice, uncertain means students have occasionally practiced the character values, agree means students have often practiced the character values, while strongly agree means students practice the character values every time they attend the french subject. this questionnaire was employed to measure student character in the french subject. the validation was performed using efa (exploratory factor analysis). the reliability test was performed using the cronbach's alpha formula. based on the calculations, the reliability in this study is 0.689, which is considered strong reliability.

findings and discussion

based on the study, the results can be seen in table 2. the questionnaire was tested on students of x, xi, and xii grades in high schools of yogyakarta special region. the analysis was carried out with efa. the initial step of efa validation is to find the kmo value and bartlett's test. the calculation can be seen in table 3. referring to table 3, the kaiser-meyer-olkin measure of sampling adequacy was 0.854. thus, kmo met the requirement of a value > 0.5, which indicates that the sample was sufficient. the subsequent analysis looked for the msa values, which are shown in table 4.

table 2. points of student character education for the french subject in high schools of yogyakarta special region

| aspects | indicators | items |
| --- | --- | --- |
| communicative | students can introduce themselves using french | 1 |
| communicative | students can greet using french | 5 |
| communicative | students can express apologies, excuses, or gratitude using french | 7 |
| tolerance | students can express opinions using french | 4 |
| tolerance | students ask their friends' opinions when working on french assignments in groups | 11 |
| tolerance | students appreciate criticism and suggestions from friends when presenting french assignments in front of the class | 14 |
| cooperation | students can compose french dialogues using french | 2 |
| cooperation | students can practice french dialogue in groups | 9 |
| cooperation | students identify the contents of a french text in groups | 15 |
| independent | students can complete the french subject assignments given by the teacher individually and confidently | 3 |
| independent | students are sure to get satisfactory results when taking the french language exam on their own efforts | 6 |
| independent | students are not afraid to ask the teacher if they have difficulty learning french | 13 |
| responsibility | students work hard on the french assignments given by the teacher | 8 |
| responsibility | students submit french assignments on time | 10 |
| responsibility | students can present their french assignments in front of the class according to the teacher's instructions | 12 |

table 3. kmo value and bartlett's test

| indicator | value |
| --- | --- |
| kaiser-meyer-olkin measure of sampling adequacy | 0.854 |
| bartlett's test of sphericity, approx. chi-square | 1091.030 |
| bartlett's test of sphericity, df | 105 |
| bartlett's test of sphericity, sig. | 0.000 |

table 4. msa values

| items | values |
| --- | --- |
| 1 | 0.707 |
| 2 | 0.889 |
| 3 | 0.896 |
| 4 | 0.838 |
| 5 | 0.817 |
| 6 | 0.798 |
| 7 | 0.864 |
| 8 | 0.866 |
| 9 | 0.890 |
| 10 | 0.883 |
| 11 | 0.803 |
| 12 | 0.812 |
| 13 | 0.889 |
| 14 | 0.769 |
| 15 | 0.867 |

the analysis results showed that all items had msa values fulfilling the > 0.5 requirement. these results indicate that each item had a good correlation.
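the screening step described for tables 3 and 4 can be sketched in a few lines of python. this is only an illustrative check, not the authors' spss procedure: the values are copied from the tables, the function names `bartlett_df` and `items_failing` are hypothetical, and the degrees-of-freedom formula p(p - 1)/2 is the standard one for bartlett's test of sphericity with p items.

```python
# MSA value per item, copied from table 4.
MSA = {
    1: 0.707, 2: 0.889, 3: 0.896, 4: 0.838, 5: 0.817,
    6: 0.798, 7: 0.864, 8: 0.866, 9: 0.890, 10: 0.883,
    11: 0.803, 12: 0.812, 13: 0.889, 14: 0.769, 15: 0.867,
}
KMO = 0.854      # overall sampling adequacy from table 3
THRESHOLD = 0.5  # cut-off used in the article


def bartlett_df(n_items: int) -> int:
    """Degrees of freedom of Bartlett's test of sphericity: p(p - 1) / 2."""
    return n_items * (n_items - 1) // 2


def items_failing(values: dict, threshold: float) -> list:
    """Return the item numbers whose value does not exceed the threshold."""
    return sorted(item for item, v in values.items() if v <= threshold)


if __name__ == "__main__":
    print("bartlett df for 15 items:", bartlett_df(len(MSA)))
    print("kmo above threshold:", KMO > THRESHOLD)
    print("items failing the msa cut-off:", items_failing(MSA, THRESHOLD))
    print("lowest msa item:", min(MSA, key=MSA.get))
    print("highest msa item:", max(MSA, key=MSA.get))
```

with 15 items, the formula reproduces the df of 105 reported in table 3, and no item falls at or below the 0.5 cut-off.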
the lowest correlation value was item 1 with 0.707, while the highest correlation value was item 3 with 0.896. the subsequent analysis was then carried out. after confirming that all items met the msa requirement of > 0.5, the next step was to look for the communalities values in table 5. the analysis showed that all items had communalities values fulfilling the requirement of > 0.50. this shows that each item could explain the factor to be measured. item 1 had the highest role in explaining the factor. item 13 explained the factors less than the other items. all items could explain the factors. thus, further analysis was carried out.

the subsequent analysis examined the total variance explained to determine the number of factors. fifteen components can represent the items. in the "initial eigenvalues" column, it was found that five factors explained the items, with the factor criterion approaching one or > 1. the share explained by factor 1 was 4.574 / 15 x 100% = 30.49%; the share of factor 2 was 1.841 / 15 x 100% = 12.27%, factor 3 was 1.154 / 15 x 100% = 7.69%, factor 4 was 1.050 / 15 x 100% = 6.99%, and factor 5 was 1.025 / 15 x 100% = 6.83%. hence, the five factors together could explain 30.49% + 12.27% + 7.69% + 6.99% + 6.83% = 64.27% of the variables. since the eigenvalue criterion was set to 1, the components taken were those with a value > 1, i.e., components 1, 2, 3, 4, and 5. the next step was to find the values of the component matrix, as can be seen in table 6.

table 5. communalities values

| items | initial | extraction |
| --- | --- | --- |
| item 1 | 1.000 | 0.841 |
| item 2 | 1.000 | 0.654 |
| item 3 | 1.000 | 0.610 |
| item 4 | 1.000 | 0.628 |
| item 5 | 1.000 | 0.686 |
| item 6 | 1.000 | 0.611 |
| item 7 | 1.000 | 0.602 |
| item 8 | 1.000 | 0.574 |
| item 9 | 1.000 | 0.660 |
| item 10 | 1.000 | 0.571 |
| item 11 | 1.000 | 0.728 |
| item 12 | 1.000 | 0.613 |
| item 13 | 1.000 | 0.549 |
| item 14 | 1.000 | 0.718 |
| item 15 | 1.000 | 0.597 |

table 6. component matrix

| items | component 1 | component 2 | component 3 | component 4 | component 5 |
| --- | --- | --- | --- | --- | --- |
| item 1 | 0.172 | 0.064 | 0.039 | -0.020 | 0.898 |
| item 2 | 0.146 | 0.788 | 0.043 | 0.017 | 0.100 |
| item 3 | 0.403 | 0.655 | -0.061 | 0.040 | 0.115 |
| item 4 | 0.758 | 0.195 | -0.072 | -0.087 | -0.058 |
| item 5 | -0.053 | 0.689 | 0.402 | 0.205 | -0.077 |
| item 6 | -0.154 | 0.432 | 0.384 | 0.386 | 0.323 |
| item 7 | 0.676 | 0.056 | 0.357 | -0.008 | 0.120 |
| item 8 | 0.704 | 0.132 | 0.015 | 0.202 | 0.142 |
| item 9 | 0.494 | 0.602 | 0.162 | 0.117 | -0.114 |
| item 10 | 0.259 | 0.233 | 0.602 | 0.149 | -0.256 |
| item 11 | 0.131 | 0.074 | 0.821 | 0.009 | 0.179 |
| item 12 | -0.091 | 0.146 | 0.167 | 0.744 | -0.047 |
| item 13 | 0.191 | 0.473 | 0.267 | 0.465 | -0.030 |
| item 14 | 0.434 | -0.030 | -0.197 | 0.699 | 0.041 |
| item 15 | 0.717 | 0.128 | 0.235 | 0.085 | 0.062 |

table 7. aspect naming

| no. | aspect | indicator |
| --- | --- | --- |
| 1. | persistent | b 4 (students can express opinions using french); b 7 (students can express apologies, excuses, or gratitude using french); b 8 (students work hard on the french assignments given by the teacher); b 15 (students identify the contents of a french text in groups) |
| 2. | optimist | b 2 (students can compose french dialogues using french); b 3 (students can complete the french subject assignments given by the teacher individually and confidently); b 5 (students can greet using french); b 6 (students are sure to get satisfactory results when taking the french language exam on their own efforts); b 9 (students can practice french dialogue in groups); b 13 (students are not afraid to ask the teacher if they have difficulty learning french) |
| 3. | forbearance | b 11 (students ask their friends' opinions when working on french assignments in groups) |
| 4. | confident | b 14 (students appreciate criticism and suggestions from friends when presenting french assignments in front of the class); b 12 (students can present their french assignments in front of the class according to the teacher's instructions) |
| 5. | competent | b 1 (students can introduce themselves using french) |

after establishing that there are five factors, each item was classified into factor 1, 2, 3, 4, or 5. the classification was based on the highest factor loading, ignoring negative signs. for clarity, the result can be observed in table 7. based on the component matrix table, the five factors were named following the indicators of each questionnaire item. character can be assessed through the aspects of persistent, optimist, forbearance, confident, and competent.

learning outcomes measurement is fundamental to understanding the progress of the student learning process. student learning outcomes can be measured in terms of the cognitive, affective, and psychomotor aspects. many measurements still focus on the cognitive domain and ignore the affective and psychomotor domains, although the three domains are equally important to measure. the questionnaire is a student character assessment questionnaire for the french subject. it was developed using the 4d development procedure, including define, design, develop, and disseminate.

define

communicative

students are declared communicative if they have language skills. the subject taught by the teacher must encourage students to communicate naturally. this is required in 21st-century education.

independent

the characteristics of independent students are confidence in acting, considering opinions or advice from others, making decisions, and not being easily influenced by others. independence is the basis for students' self-development to compete and develop.

cooperation

cooperation can work well if two or more people interact with each other and combine energy, ideas, or opinions at a certain time in achieving learning objectives. the demands of the 21st century require students to collaborate; this collaborative skill is a challenge for students and, thus, one they must master.
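returning to the factor extraction and classification steps behind tables 6 and 7, the arithmetic can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' spss output: the eigenvalues and loadings are the rounded figures printed in the article, the names `percent_variance` and `assign_items` are hypothetical, and "ignoring negative signs" is read here as taking the absolute value of each loading.

```python
# Eigenvalues of the five retained components, as quoted in the text.
EIGENVALUES = [4.574, 1.841, 1.154, 1.050, 1.025]
N_ITEMS = 15

# Loadings copied from table 6 (item -> loadings on components 1..5).
LOADINGS = {
    1: [0.172, 0.064, 0.039, -0.020, 0.898],
    2: [0.146, 0.788, 0.043, 0.017, 0.100],
    3: [0.403, 0.655, -0.061, 0.040, 0.115],
    4: [0.758, 0.195, -0.072, -0.087, -0.058],
    5: [-0.053, 0.689, 0.402, 0.205, -0.077],
    6: [-0.154, 0.432, 0.384, 0.386, 0.323],
    7: [0.676, 0.056, 0.357, -0.008, 0.120],
    8: [0.704, 0.132, 0.015, 0.202, 0.142],
    9: [0.494, 0.602, 0.162, 0.117, -0.114],
    10: [0.259, 0.233, 0.602, 0.149, -0.256],
    11: [0.131, 0.074, 0.821, 0.009, 0.179],
    12: [-0.091, 0.146, 0.167, 0.744, -0.047],
    13: [0.191, 0.473, 0.267, 0.465, -0.030],
    14: [0.434, -0.030, -0.197, 0.699, 0.041],
    15: [0.717, 0.128, 0.235, 0.085, 0.062],
}


def percent_variance(eigenvalues, n_items):
    """Share of total variance per component: eigenvalue / n_items * 100."""
    return [e / n_items * 100 for e in eigenvalues]


def assign_items(loadings):
    """Group items under the component with the largest absolute loading."""
    groups = {}
    for item, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        groups.setdefault(best + 1, []).append(item)  # components are 1-based
    return {comp: sorted(items) for comp, items in sorted(groups.items())}


if __name__ == "__main__":
    shares = percent_variance(EIGENVALUES, N_ITEMS)
    print([round(s, 2) for s in shares], round(sum(shares), 2))
    print(assign_items(LOADINGS))
```

two caveats apply. recomputing from the rounded eigenvalues gives about 7.00% for factor 4 and a total near 64.29%, slightly different from the printed 6.99% and 64.27%, presumably because the article worked from unrounded eigenvalues. the highest-loading rule also places item 10 on component 3, although table 7 as printed does not list that item.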
responsibility

responsibility takes the form of carrying out duties and obligations as one should. this responsibility extends to oneself, society, the environment (nature, social, and culture), the nation, and god.

tolerance

tolerance means respect for differences. this can refer to differences in religion, ethnicity, opinions, attitudes, and actions.

design

five aspects were measured using the study questionnaire. these aspects were aligned with the presidential regulation regarding character values taught in the french subject. the five aspects are communicative, tolerance, cooperation, independence, and responsibility. each aspect was broken down into three indicators. each indicator was turned into one question item, resulting in 15 items.

develop

the five character values (communicativeness, tolerance, cooperation, independence, and responsibility) were each specified into three indicators. these indicators were arranged according to the theory of each aspect as described in the introduction and adapted to the government syllabus for the french subject. each indicator was written as one question. to fulfill the communicative aspect, students must introduce themselves, greet others, and express apologies, excuses, or gratitude using french. in the tolerance aspect, students are declared tolerant if they are able to express opinions using french, ask friends for their opinions, and appreciate friends' criticism and suggestions when presenting french assignments in front of the class. the cooperation aspect is fulfilled if students can compose dialogues, practice them, and identify the contents of a french text. the next aspect is independence.
independent students can complete assignments individually with confidence, and they believe that their work will yield satisfactory results. besides, they are not afraid to ask the teacher when experiencing difficulties during the teaching and learning process of the french subject. the last aspect is responsibility. students are responsible if they work seriously on their assignments, submit them on time, and present french assignments according to the teacher's instructions.

disseminate

the questionnaire was tested on students in x, xi, and xii grades at state high schools of yogyakarta special region. efa was employed to carry out the analysis with the help of spss. in order to make the research questionnaire valid, two rotations were performed. in every rotation, invalid items were removed in order to retain valid items. the first rotation demonstrated that the kmo value, 0.854, met the requirement of > 0.5. in addition, the results of the msa analysis showed that all 15 items were valid. therefore, further analysis could be carried out, i.e., the analysis of the communalities values, in which all items met the requirement of > 0.5. this means that each item could explain the factors to be measured. based on the total variance explained analysis, five factors could explain the items, with a factor criterion approaching one or > 1. then, each item was classified into one of the five factors, ignoring negative signs. these five factors are persistent, optimist, forbearance, confident, and competent.

conclusion

the study findings indicate that the 15 statements in the character assessment questionnaire in the french subject in high schools of yogyakarta special region are valid.
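the cronbach's alpha reliability test named in the method section can be sketched as below. this is a minimal illustration using only the python standard library: the function name `cronbach_alpha` and the small response matrix are hypothetical, not the study's data, and the sketch assumes the usual formula alpha = k / (k - 1) x (1 - sum of item variances / variance of total scores) with sample variances.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])            # number of items
    items = list(zip(*responses))    # transpose: one tuple of scores per item
    item_var_sum = sum(variance(scores) for scores in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


if __name__ == "__main__":
    # Hypothetical 5-point likert answers: 5 respondents x 3 items.
    sample = [
        [4, 5, 4],
        [3, 3, 4],
        [5, 5, 5],
        [2, 3, 2],
        [4, 4, 3],
    ]
    alpha = cronbach_alpha(sample)
    # 0.60 is the cut-off quoted in the conclusion of the article.
    print(f"alpha = {alpha:.3f}, meets 0.60 cut-off: {alpha > 0.60}")
```

the same function would be applied to the full 269 x 15 response matrix; the total-score variance must be non-zero, otherwise the ratio is undefined.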
the reliability test results using the cronbach's alpha formula showed 0.805 > 0.60. hence, the questionnaire met the reliability requirements. character can be assessed through five factors: persistent, optimist, forbearance, confident, and competent. the character values tested followed the provisions of the government syllabus for the french subject. the five character values are tolerance, communication, cooperation, independence, and responsibility. after identification, it was found that the tolerance character value appeared most often, while the independent character value rarely appeared in the french subject. this questionnaire can help teachers assess student character in the french subject at public high schools. by assessing student character, the teacher can determine the character values to be improved. future research should consider each character value to be measured so that character can be explained as a whole.

references

apriliasari, r. a. (2015). pengembangan modul materi jurnal penyesuaian perusahaan dagang berbasis pendekatan saintifik di kelas xi smk negeri 1 sooko mojokerto. jurnal pendidikan akuntansi (jpak), 3(3), 1–10. https://ejournal.unesa.ac.id/index.php/jpak/article/view/32748

citra, y. (2012). pelaksanaan pendidikan karakter dalam pembelajaran. jurnal ilmiah pendidikan khusus, 1(1), 237–249. http://ejournal.unp.ac.id/index.php/jupekhu

darodjat, d., & zuchdi, d. (2016). model evaluasi pembelajaran akidah dan akhlak di madrasah tsanawiyah (mts). jurnal penelitian dan evaluasi pendidikan, 20(1), 11–26. https://doi.org/10.21831/pep.v20i1.7517

gunawan, h. (2014). pendidikan karakter: konsep dan implementasi. alfabeta.

hadi, s., andrian, d., & kartowagiran, b. (2019). evaluation model for evaluating vocational skills programs on local content curriculum in indonesia: impact of educational system in indonesia. eurasian journal of educational research, 2019(82), 45–62. https://ejer.com.tr/evaluation-model-for-evaluating-vocational-skills-programs-on-local-content-curriculum-in-indonesia-impact-of-educational-system-in-indonesia/

judiani, s. (2016). implementasi pendidikan karakter di sekolah dasar melalui penguatan pelaksanaan kurikulum. jurnal pendidikan dan kebudayaan, 16(9), 280–289. https://doi.org/10.24832/jpnk.v16i9.519

komalasari, k., & saripudin, d. (2017). pendidikan karakter: konsep dan aplikasi living values education. refika aditama.

lalo, k. (2018). menciptakan generasi milenial berkarakter dengan pendidikan karakter guna menyongsong era globalisasi. jurnal ilmu kepolisian, 12(2), 68–75. http://www.jurnalptik.id/index.php/jik/article/view/23

lee, g. l., & manning, m. l. (2014). introduction: character education around the world: encouraging positive character traits. childhood education, 89(5), 283–285. https://doi.org/10.1080/00094056.2013.830879

malinda, s. (2016). pengembangan media audio visual sebagai media pengamatan dalam pembelajaran kurikulum 2013 materi jurnal penyesuaian kelas x akuntansi smk negeri 10 surabaya. jurnal pendidikan akuntansi (jpak), 4(3), 1–7. https://ejournal.unesa.ac.id/index.php/jpak/article/view/32748

rabawati, k., sutama, m., & gosong, m. (2013). penerapan pendekatan komunikatif dalam pembelajaran bahasa indonesia siswa kelas xi smk negeri 1 denpasar. jurnal ilmiah pendidikan dan pembelajaran ganesha, 2. https://ejournal-pasca.undiksha.ac.id/index.php/jurnal_bahasa/article/view/581

rokhman, f., syaifudin, a., & yuliati, y. (2014). character education for golden generation 2045 (national character building for indonesian golden years). procedia - social and behavioral sciences, 141, 1161–1165. https://doi.org/10.1016/j.sbspro.2014.05.197

setiawan, d. (2013). peran pendidikan karakter dalam mengembangkan kecerdasan moral. jurnal pendidikan karakter, 4(1), 53–63. https://journal.uny.ac.id/index.php/jpka/article/view/1287

sugiarti, i. i. s. (2018). character education (study on sukan jaya activities for strengthening discipline in thamavitya mulniti school yala southern thailand). undergraduate thesis, state institute on islamic studies purwokerto, central java. http://repository.iainpurwokerto.ac.id/3783/2/iis%20sugiarti_character%20education%20%28study%20on%20sukan%20jaya%20activities%20for%20strengthening%20discipline%20in%20.pdf

supriyadi, e. (2011). pendidikan dan penilaian karakter di sekolah menengah kejuruan. jurnal cakrawala pendidikan, 2, 110–123. https://doi.org/10.21831/cp.v0i2.7590

wuryandani, w., fathurrohman, & ambarwati, u. (2016). implementasi pendidikan karakter kemandirian di muhammadiyah boarding school. jurnal cakrawala pendidikan, 15(2), 208–216. https://doi.org/10.21831/cp.v15i2.9882

yulianti, s. d., djatmika, e. t., & susanto, a. (2016). pendidikan karakter kerja sama dalam pembelajaran siswa sekolah dasar pada kurikulum 2013. jurnal teori dan praksis pembelajaran ips, 1(1), 33–38.
https://doi.org/10.17977/um022v1i12016p033

copyright © 2021, reid (research and evaluation in education), 7(1), 2021 issn: 2460-6995 (online)

reid (research and evaluation in education), 7(1), 2021, 57-65
available online at: http://journal.uny.ac.id/index.php/reid

english as a foreign language (efl) learning assessment in single-sex and co-educational classrooms

umi farisiyah 1*; badrun kartowagiran 1; aminuddin bin hassan 2
1 universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, depok, sleman, yogyakarta 55281, indonesia
2 universiti putra malaysia, jl. universiti 1 serdang, 43400 seri kembangan, selangor, malaysia
* corresponding author. e-mail: umifarisiyah.2020@student.uny.ac.id

introduction

a substantial body of scholarship provides compelling evidence that religious values and tenets are interwoven with education practices (eluu, 2016). one example illustrating this claim is the phenomenon of separation between female and male students in learning (gender segregation).
gender segregation gives rise to two kinds of classroom organization: a single-sex classroom (henceforth ss) and a co-educational classroom (henceforth ce). according to the u.s. department of education (2005), "ss education generally refers to education at the elementary, secondary or postsecondary level in which male or female students attend school or classroom exclusively with members of their own sex" (u.s. department of education, 2005, p. 1) and "coeducation, generally, refers to education at the elementary, secondary or postsecondary level in which male and female students attend school or classroom altogether with members of their group" (u.s. department of education, 2005, p. 1). following these definitions, a ss classroom consists of only female or only male students in the learning process, whereas a ce classroom consists of both male and female students.

article info

article history: submitted: 22 june 2021; revised: 27 june 2021; accepted: 29 june 2021

keywords: single-sex classroom; co-educational classroom; causal-comparative study; ex post facto

abstract

this study investigates the effects of single-sex and co-educational classrooms on english learning outcomes. it is a causal-comparative study with an ex post facto design. the sample is three classes consisting of 73 students (a boys' single-sex class, a girls' single-sex class, and a co-educational class) from a private secondary school in central java, indonesia. an integrated english test was used, aligning the 2013 curriculum with the cefr for english. it tested four english skills: listening, speaking, reading, and writing. the instrument was checked for face validity through expert judgment. item internal consistency for all skills was good, and reliability was also in the good category. this study indicates that organizing a single-sex classroom in the english learning process has a positive and significant effect on english achievement. being in a single-sex classroom benefitted the students' outcomes in learning english. this study also implies that teachers, especially english teachers, must understand their students' learning strategies in order to implement appropriate teaching strategies, because male and female students learn in different ways.

this is an open access article under the cc-by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

how to cite: farisiyah, u., kartowagiran, b., & hassan, a. (2021). english as a foreign language (efl) learning assessment in single-sex and co-educational classrooms. reid (research and evaluation in education), 7(1), 57-65. doi:https://doi.org/10.21831/reid.v7i1.41644

critically, gender segregation at school is believed to be one of the prominent means of dismantling inequality (smyth & steinmetz, 2008). implementing classroom organization based on gender segregation is assumed to decrease inequality between men and women. ergo, educational institutions work as an engine in gender inequality (barone, 2011). besides, gender segregation stresses the innate differences between male and female students (ecklund et al., 2012). created differently, men and women should also be treated in distinct ways to meet their needs. male and female students have different ways of seeing, hearing, thinking, and learning.
significantly, gender segregation affects learning achievement in stem and physical education (best et al., 2010; blake, 2012; bradley, 2009; douglas, 2011; laster, 2004; pahlke et al., 2014; parker & rennie, 2002; pendleton, 2015; whitlock, 2006; younger & warrington, 2002).

although there is significant evidence demonstrating the importance and impact of gender segregation in education, only limited research has examined its role in language learning, particularly in english language learning and achievement. mathers (2008) studied only the role of ss and ce classrooms in boys' attitudes and self-perception of competence in french, and showed that the self-perception of competence of students in the ss classroom was healthier for various reasons. the boys in the ss classroom were more willing to work hard, were not afraid of making mistakes and errors, and were better risk-takers. according to mathers (2008), those characteristics were crucial ingredients for developing french-speaking skills. those findings also suggest that the ss classroom environment is superior for boys in french communicative activities.

likewise, a study by aslan (2009) on the influence of gender and language learning strategies in learning english shows that the language learning strategies used had advantageous effects in improving english achievement. female students performed better than male students on achievement tests because they used more language learning strategies in learning english. furthermore, based on the statistical results, there was a significant connection between gender, language learning strategies, and achievement in english. thus, taking the studies by aslan (2009) and mathers (2008) together, there is a lack of knowledge and information on how gender segregation could impact learners' english learning process and achievement.
thus, there is a need to study gender segregation in english language learning, especially in indonesia. gender segregation in indonesia has been applied since formal education in the country was established. education in indonesia is widely believed to be influenced by religious values. islamic values predominantly underlie the foundation of education in indonesia. delivering religious values is also obligatory at every level of education. this situation is quite different from education policy in some countries, which rarely make religion a mandatory subject at school. given this condition, the education system in indonesia relates to islamic teaching. hence, several schools are governed by islamic values. one of these values is to differentiate between males and females. in islam, males and females are forbidden to be in one area without a lawful bond, such as marriage. for example, some schools under islamic boarding school (pesantren) policy tend to place their students in classes based on their sex. islamic schools adopt the policy of implementing gender segregation, and christian schools, notably catholic ones, and jewish schools are also noted to divide class members based on their gender. this is an interesting phenomenon for the researchers to investigate.

despite being implemented, gender segregation in education still encounters many problems in the field. the students' problems are as follows. the first problem is students' motivation. students admitted that being in the ce classroom made them feel less free to get involved in the learning process. they were more motivated and confident in the ss classrooms. the reason was that they felt more comfortable and freer in the english learning process. this is in line with the theory of adolescent development.
this theory states that senior secondary school students are in the age of adolescence and, based on vygotsky's theory of adolescent development, are in a transitional period from childhood to adulthood. in this period, adolescents experience changes in their biological, psychological, and social aspects (santrock, 2011).

in addition, according to jean piaget (santrock, 2011), the cognitive development of adolescents is at the formal operational stage, the last of the four cognitive stages. to explain more about the formal operational stage of adolescent cognitive development, vygotsky (santrock, 2011, p. 101) describes the zone of proximal development. it is a zone experienced by every individual in which they cannot solve complex tasks by themselves alone. those tasks can only be solved with support or supervision from adults or trained peers. based on this theory, vygotsky states that the school is a cultural agent that can determine the development of adolescent thinking. it emphasizes that the classroom atmosphere urgently needs attention in order to create a supportive school environment that stimulates adolescents to reach optimum thinking development. if the classroom environment does not support the learning process, the students are simply unmotivated to get involved.

furthermore, senior secondary school students are around sixteen and seventeen years old. they are called teenagers. teenagers experiencing puberty have more interest in attracting attention from one another (forbes & dahl, 2010). therefore, a ce classroom can have two kinds of effects. it can have a positive effect if going through that phase encourages the students to take part in english learning activities.
although their aim is to attract the attention of the person they are interested in, they will be more motivated during classroom activities. however, it can also harm shy students. they may even say nothing in class because they are afraid of being noticed. ss classrooms are likewise considered to have two kinds of effects on students. on the one hand, students can be focused and free in taking part in the english learning process. on the other hand, they can be bored and unmotivated in attending the class because everyone is of the same sex. given the developmental period that students of that age undergo, those who experience the positive effects will be greatly helped, because these effects can foster a favorable affective filter that stimulates acquisition of the subject. on the other hand, if they experience the adverse effects, they will be less motivated, because their reluctance will increase their anxiety about teaching and learning english.

another problem that emerged in implementing ss and ce classrooms is students' involvement in the english learning process. this condition is related to students' motivation. when students are motivated to learn something, they will be persistent, not anxious, and good risk-takers. in line with the results of the preliminary studies through survey and observation, one of the second language acquisition (sla) theories holds that many factors influence the speed of sla, namely internal and external factors. motivation is an internal factor, and environment is an external factor. those factors can accelerate or impede the sla process. students who attend ss classrooms, having lower anxiety about getting involved in the english learning process, will be active and perform well in every english learning activity. on the other hand, those in the ce classroom could not carry out the english learning activities as well as the ss students, because of the higher anxiety produced by the classroom environment. this problem is in line with the finding of the study by mathers (2008).
her study revealed that male students felt more willing to perform freely in the ss classroom than in the ce classroom. this finding indicates that the ss classroom provides a positive environment for the language learning process. in contrast, younger and warrington (2002) suggested that ss and ce classrooms had the same effects on the learning process and achievement.

the third problem in implementing ss and ce classrooms is english learning achievement. the achievement attained through the english learning process is related to the students' involvement during that process. being comfortable and confident in learning english stimulates their comprehension of the material they receive in the classroom. these factors boost students' english achievement. this also motivated the researchers to investigate further the effects (both positive and negative) of ss and ce classrooms. the research questions proposed are: are there any significant differences in students' mean scores on integrated english tests in ss and ce classrooms? and which kind of class offers students a better learning environment for their learning outcomes?

method

to investigate the research questions, a quantitative approach was used. a causal-comparative study was utilized to draw evidence from the students' english proficiency test in three settings: boy-single-sex, girl-single-sex, and co-educational classes. like experimental research, causal-comparative research involves comparing groups to see whether some independent variable has caused a change in a dependent variable. causal-comparative research sets up studies to control possible extraneous variables (lodico et al., 2010).
the students have already experienced the single-sex and co-educational classrooms, so these settings cannot be manipulated experimentally. the classroom setting (single-sex or co-educational), as the independent variable, is treated as the factor influencing senior high school students' english achievement. the study was conducted in an islamic senior high school in central java, indonesia, which has maintained this classroom setting for years. the participants in causal-comparative research already belong to groups based on their past experiences, and the researcher selects participants from these preexisting groups. an essential consideration in designing a causal-comparative study is whether the groups are similar (comparable) except for the independent variable on which they are being compared (lodico et al., 2010).

this study collected data to measure the mean difference between the three classes involved: the sf class, the sm class, and the ce class. twenty-four students were in the sf class, 25 students were in the sm class, and 24 students were in the ce class. in total, 73 students provided the quantitative data.

to collect data, an integrated test (a test covering four skills) was used. the test combined material from the english syllabus of the 2013 curriculum and the cefr (common european framework of reference) for english for a1 grade. the cefr for english was used because the framework addresses the need for english comprehensively. the detail of the english learning target is displayed in its can-do descriptions. furthermore, the concept of the cefr for english is in line with the 2013 curriculum. the test consisted of four skills (listening, speaking, writing, and reading) set based on the class level.
the second grade of senior high school is at a2 level. this is based on estimates by the association of language testers in europe (alte), of which cambridge english language assessment is a founding member, of the guided learning hours that learners typically need to progress between levels. 'guided learning hours' means time in lessons as well as the tasks you set them to do. the second grade of senior secondary school was estimated to have been learning english for approximately 180-200 hours, which falls in the a2 level (cambridge english language assessment, 2013, p. 4). the material of the tests was a combination of the two language curriculums. the test was administered to the sample in their english class.

the listening test consisted of 20 multiple-choice items with a pearson correlation value of .0663 and a cronbach alpha of 0.776. some items with low internal consistency were deleted; only items having a good correlation (>0.05) were used. the reading test was done by answering true or false questions. there were ten items based on a text. the pearson correlation value obtained was 0.441, and the cronbach alpha was 0.476. because the pearson correlation value was low, the form of the reading test instrument was changed into multiple choice. the speaking test's instruction consisted of five. it has good pearson correlation and cronbach alpha values (0.7128 and 0.769). the writing test consisted of four aspects to test. it has a high pearson correlation value (0.850) and cronbach alpha (0.884).

the test was given to the sample in two meetings (one meeting was 70 minutes). the listening, writing, and reading tests were conducted in the first meeting, and the speaking test was conducted in the following week.
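the cronbach alpha values reported for each test section can be illustrated with a minimal sketch. it assumes the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), and a small made-up score matrix. the function name `cronbach_alpha` and the data `demo_scores` are illustrative assumptions, not the study's item responses or the software the authors used.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents' item-score lists.

    scores[i][j] is the score of respondent i on item j.
    """
    k = len(scores[0])  # number of items
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # sample variance of each item across respondents
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # sample variance of the respondents' total scores
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# hypothetical 0/1 item responses (5 respondents, 4 items), for illustration only
demo_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(demo_scores), 3))
```

by a common rule of thumb, values around 0.7 or higher are read as acceptable internal consistency, which is how the listening, speaking, and writing sections (0.776, 0.769, 0.884) compare with the reading section (0.476).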
the listening test consisted of two parts about daily activities. there were 20 multiple-choice questions. it took 38 minutes. the recording was played twice. the writing test was done right after the students did the listening test. the students were asked to compose a passage about dr. allan's daily routine based on the chart. the time to complete the passage was 10 minutes. the reading test lasted 15 minutes. five minutes were used to read the passages about mr. miller's and ms. lucy's daily routines. ten minutes were used to answer ten true-false questions about the passages. for the speaking test, students were randomly assigned to groups of three. each group had to find a teacher and interview the teacher about her/his daily routine. the result of the interview was presented in front of the class without bringing any text. the time to deliver the speaking test was seven minutes.

findings and discussion

the aim of this study was to investigate whether there was a difference in english achievement among students enrolled in ss and ce classrooms. in addition, this study explored the potential influences of those classrooms on english learning achievement. descriptive and inferential statistics were used to analyze the impact of the ss and ce classrooms on students' gains in english learning, as indicated by integrated english test scores. the data obtained from the integrated english test are shown in table 1.

table 1. the mean score of the integrated english learning achievement test

| class | mean | std. error | sd |
|---|---|---|---|
| sf | 66.3958 | 1.66 | 8.11330 |
| sm | 65.8400 | 2.06 | 10.2933 |
| ce | 57.8125 | 1.90 | 9.30295 |

the number of students whose scores were analyzed was almost equal across the classes, and the means of the three groups differed. the sf class had the highest mean, followed by the sm class, and the lowest mean was from the ce class. the mean differences among the three groups were not large. table 1 displays the integrated english learning achievement test scores of the three classes of the sample. the sf classroom has the highest mean score, the average of the four english skill scores, at 66.40. the sm classroom ranks second with 65.84, and the lowest score is from the ce classroom, 57.81. this implies that being in the ss classroom or the ce classroom affects students' english learning achievement: the students attending the english learning process in the ss classroom gain a higher mean score than those in the ce classroom. this is consistent with the view of aslan (2009) that students in the ss classroom gain better english learning achievement.

broken down by skill, the ranking for the reading and speaking scores is the same as for the overall scores (see figure 1). this finding aligns with o’neill (2011), who found that ss classrooms contribute to better achievement on students' reading tests. this is because students in the ss classrooms feel freer and more psychologically secure. this positive feeling also affects students' confidence in performing the speaking test. in contrast, the listening and writing scores follow a different order. in writing, the ce classroom obtained a higher score than the sm classroom. this departs from the other three skills, in which the ce classroom is always in the last position. in listening, the sm classroom obtained the highest score.
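the standard errors in table 1 follow from the reported standard deviations and the class sizes given in the method section (24, 25, and 24 students) through the usual relation se = sd / √n. the short check below is a sketch under that assumption; the variable names are illustrative.

```python
from math import sqrt

# class: (n, standard deviation, reported standard error), taken from table 1
# and from the class sizes stated in the method section
table1 = {
    "sf": (24, 8.11330, 1.66),
    "sm": (25, 10.2933, 2.06),
    "ce": (24, 9.30295, 1.90),
}

standard_errors = {}
for name, (n, sd, reported) in table1.items():
    se = sd / sqrt(n)  # standard error of the mean
    standard_errors[name] = se
    print(f"{name}: computed se = {se:.2f}, reported se = {reported:.2f}")
```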
the sm classroom's lead in listening contradicts the claim by sax (2017) that girls are better at hearing than boys. this is because listening is not merely about hearing. in the listening test, concentration and focus are needed. female students, who are considered to be multitaskers, are easily distracted by other activities. this condition disrupts the female students' concentration and focus during the listening test.

though the overall integrated english learning achievement test score showed that the sf classroom scored highest and the sm classroom second, each english skill score showed different findings. figure 1 shows the overall english learning achievement test scores in detail.

figure 1. integrated english learning achievement test score

figure 1 shows that the scores for listening and writing skills follow a different order from the overall integrated english learning achievement test score. for the overall score, the order was the sf, sm, and ce classrooms. for listening, the sm classroom got the highest score (45.6) of the three classrooms, followed by the sf classroom (42.7), and the ce classroom was last (36.2). for writing, the ce classroom (69.4) ranked second of the three classrooms. the sm classroom was the lowest (67.4), though the difference was small.

to check the significance of the mean difference, an analysis of variance was conducted. before conducting the anova, its underlying assumptions were tested: the assumption of random sampling, the assumption of normality (kolmogorov-smirnov p values of .148, .200, and .200, all greater than α = .05), and the assumption of homogeneity of variances (f(2, 70) = .314, p = .732, which is greater than α = .05). all of the assumptions were met, so the anova analysis could proceed.

in the present study, the question concerned the main effect of ss and ce classrooms on the students' english achievement. to answer this question, a one-way anova was used.

table 2. the result of one-way anova

| source | sum of squares | df | mean square | f | sig. |
|---|---|---|---|---|---|
| between groups | 1113.5 | 2 | 556.7 | 6.4 | .003 |
| within groups | 6047.4 | 70 | 86.4 | | |
| total | 7160.885 | 72 | | | |

table 2 shows that the probability value (p) obtained was lower than α = 0.05 (0.003 < 0.05). it means that there was a significant difference among the single female, single male, and coeducational classrooms in english learning outcomes. this result answers the first research question. this result is also in line with previous research (best et al., 2010; blake, 2012; bradley, 2009; douglas, 2011; pahlke et al., 2014; parker & rennie, 2002; pendleton, 2015; whitlock, 2006; younger & warrington, 2002) which focused on ss and ce classrooms in science and pe, and especially with laster (2004), mathers (2008), and o’neill (2011), which concern language learning, as well as with aslan (2009), which delved into english learning.

the result of the study reveals that being in the ss classroom benefitted the students in their english learning outcomes. the result supports the previous study conducted by aslan (2009), who found that students in ss classrooms have low anxiety during the learning process since they are of the same sex as the rest of the class.
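as a check on table 2, the one-way anova can be reproduced from the summary statistics in table 1 alone. the sketch below assumes the class sizes stated in the method section (24, 25, and 24) and uses the closed-form tail probability of the f distribution that holds when the numerator has 2 degrees of freedom, p = (1 + 2f/df_within)^(−df_within/2). the variable names are illustrative, and this is not necessarily how the authors' statistical software computed the values.

```python
# summary statistics per class from table 1: (n, mean, standard deviation)
groups = {
    "sf": (24, 66.3958, 8.11330),
    "sm": (25, 65.8400, 10.2933),
    "ce": (24, 57.8125, 9.30295),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# between-groups: weighted squared distance of each class mean from the grand mean
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# within-groups: pooled squared deviations, recovered from each class's sd
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

# exact upper-tail probability of f(2, df_within); valid only because df_between == 2
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(f"ss_between = {ss_between:.1f}, ss_within = {ss_within:.1f}")
print(f"f({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.4f}")
```

under these assumptions the recomputed sums of squares, f ratio, and p value should agree with table 2 to the reported precision, which also confirms that the between-groups and within-groups sums add up to the reported total.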
this low anxiety encourages students in ss classrooms to be wholly engaged during the learning process, not reluctant to ask questions when they find difficulties in the material they learn, and free to express what they feel, which gives them psychological security.

furthermore, to investigate which of the three types of classrooms used as the medium of instruction was the most effective, the anova test was followed by the tukey hsd test. table 3 presents a summary of the result of the tukey hsd.

table 3. the result of tukey hsd

| (i) class | (j) class | mean difference (i-j) | std. error | sig. |
|---|---|---|---|---|
| sf | sm | .55583 | 2.66 | .976 |
| sf | ce | 8.58333* | 2.68 | .006 |
| sm | sf | -.55583 | 2.66 | .976 |
| sm | ce | 8.02750* | 2.66 | .010 |
| ce | sf | -8.58333* | 2.68 | .006 |
| ce | sm | -8.02750* | 2.66 | .010 |

*. the mean difference is significant at the 0.05 level.

table 3 illustrates that the ss classrooms were more effective than the ce classroom, especially the sf classroom. this can be seen from the probability value (p) for the sf class compared with the ce class, which was lower than the significance level (α), .006 < .05. this result indicated that the sf classroom was the most effective classroom for teaching english, although the difference between the sf and sm classrooms was not significant (p = .976). this finding is not in line with the research conducted by o’neill (2011), which stated that males gained higher scores than females did. nevertheless, it is in line with gurian (2011), who states that females have more maturity in the brain areas dealing with linguistic progression since they have more development in their frontal lobes and occipital lobes, where sensory processing happens. this supports the view that females use their senses in language better than males do. it is why females are considered to be more innovative in language than males.

conclusion

the highlight of this study is its contribution to knowledge in the field of gender segregation in efl students' learning process and achievement.
this study contributes the insight that the ss classroom significantly influences the english achievement of second graders of an islamic senior secondary school in central java, with a significance value of 0.003 for the mean score difference. it means that the english learning process in ss classrooms is more effective than in ce classrooms. therefore, english learning achievement improves more considerably. from the statistical results, the significant difference between the male and female groups of students in the ss classrooms and the male and female groups of students who received instruction in a ce classroom reveals that male students in an ss classroom perform better than male students in a ce classroom. female students in an ss classroom also perform better than female students in a ce classroom. the differences are found in the speaking and listening tests. the ss classroom is indicated as a better classroom organization due to its effectiveness in sla. the study's result can serve as a basis for founding and organizing language departments in schools. to create an effective language classroom, the students must be set under gender segregation. this study also implies that teachers, especially english teachers, must understand their students' ways of learning to implement the appropriate learning strategies, because male and female students learn in different ways.

references

aslan, o. (2009). the role of gender and language learning strategies in learning english [master thesis, middle east technical university, ankara]. https://etd.lib.metu.edu.tr/upload/12611098/index.pdf

barone, c. (2011). some things never change. sociology of education, 84(2), 157–176. https://doi.org/10.1177/0038040711402099

best, s., pearson, p. j., & webb, p. i. (2010). teachers’ perceptions of the effects of single-sex and coeducational classroom settings on the participation and performance of students in practical physical education. in a. rendimiento (ed.), congreso de la asociación internacional de escuelas superiores de educación física (pp. 1016–1027). university of wollongong. https://ro.uow.edu.au/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1673&context=edupapers

blake, c. d. (2012). single-sex education versus coeducation in north georgia public middle schools [doctoral dissertation, liberty university, lynchburg, va].

bradley, k. (2009). an investigation of single-sex education and its impact on academic achievement, discipline referral frequency, and attendance for first and second grade public school students [doctoral dissertation, mercer university, macon, ga].

cambridge english language assessment. (2013). introductory guide to the common european framework of reference (cefr) for english language teachers. cambridge university press. https://www.englishprofile.org/images/pdf/guidetocefr.pdf

douglas, d. d. (2011). single-gender versus coed instruction as a factor impacting reading achievement for male elementary school students [doctoral dissertation, walden university, minneapolis, mn].

ecklund, e. h., lincoln, a. e., & tansey, c. (2012). gender segregation in elite academic science. gender & society, 26(5), 693–717. https://doi.org/10.1177/0891243212451904

eluu, p. e. (2016). the role of religion in value education in nigeria. british journal of education, 4(9), 64–69. https://www.eajournals.org/journals/british-journal-of-education-bje/vol-4-issue-9-august-2016-special-issue/role-religion-value-education-nigeria/

forbes, e. e., & dahl, r. e. (2010). pubertal development and behavior: hormonal activation of social and motivational tendencies. brain and cognition, 72(1), 66–72. https://doi.org/10.1016/j.bandc.2009.10.007

gurian, m. (2011). boys and girls learn differently: a guide for teachers and parents. jossey bass.

laster, c. (2004). why we must try same-sex instruction. education digest: essential readings condensed for quick review, 70(1), 59–62.

lodico, m. g., spaulding, d. t., & voegtle, k. h. (2010). methods in educational research: from theory to practice (2nd ed.). jossey bass.

mathers, c. a. (2008). the role of single-sex and coeducational instruction on boys’ attitudes and self perceptions of competence in french language communicative activities [doctoral dissertation, boston college, chestnut hill, ma]. http://hdl.handle.net/2345/592

o’neill, l. (2011). the impact of single-sex education on male and female gains in mathematics and reading at the elementary level in a selected school in north carolina [doctoral dissertation, gardner-webb university, boiling springs, nc]. https://digitalcommons.gardner-webb.edu/education_etd/80

pahlke, e., hyde, j. s., & allison, c. m. (2014). the effects of single-sex compared with coeducational schooling on students’ performance and attitudes: a meta-analysis. psychological bulletin, 140(4), 1042–1072. https://doi.org/10.1037/a0035740

parker, l. h., & rennie, l. j. (2002). teachers’ implementation of gender-inclusive instructional strategies in single-sex and mixed-sex science classrooms. international journal of science education, 24(9), 881–897. https://doi.org/10.1080/09500690110078860

pendleton, m. (2015). a comparison of single gender and coeducational classrooms, student engagement, and achievement scores [doctoral dissertation, lindenwood university, st charles, mo].

santrock, j. w. (2011). educational psychology (5th ed.). mcgraw hill.

sax, l. (2017). why gender matters: what parents and teachers need to know about the emerging science of sex differences.

smyth, e., & steinmetz, s. (2008). field of study and gender segregation in european labour markets. international journal of comparative sociology, 49(4–5), 257–281. https://doi.org/10.1177/0020715208093077

u.s. department of education. (2005). single-sex versus coeducation schooling; a systematic review. office of planning, evaluation and development.

whitlock, s. e. (2006). the effects of single-sex and coeducational environments on the self-efficacy of middle school girls [doctoral dissertation, virginia polytechnic institute and state university, blacksburg, va]. https://vtechworks.lib.vt.edu/handle/10919/28041

younger, m., & warrington, m. (2002). single-sex teaching in a co-educational comprehensive school in england: an evaluation based upon students’ performance and classroom interactions. british educational research journal, 28(3), 353–374.
https://doi.org/10.1080/01411920220137449

research and evaluation in education iv volume 1, number 1, june 2015 subject indexes 3plm/grm model, 55, 65, 67, 68, 69 a ability, 8, 9, 18, 20, 21, 23, 26, 31, 33, 45, 46, 48, 49, 50, 51, 53, 55, 56, 57, 58, 59, 60, 61, 64, 68, 69, 70, 75, 80, 89, 90, 91, 92, 93, 94, 95, 96, 100, 101, 104, 105, 106, 107, 112 apprenticeship, 84, 85, 86, 95, 96, 99 c classroom and school context, 1, 4 community, 1, 3, 4, 9, 10, 11, 12, 17, 18, 23, 89, 91, 92, 95, 97 d dimension, 45, 46, 47, 48, 49, 50, 51, 52 e e-learning, 20, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44 estimation, 46, 47, 55, 56, 58, 59, 61, 63, 69, 74, 77, 79, 101, 107, 112 f family, 1, 4, 8, 10, 11 h historical thinking skills, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82 i implementation, 11, 13, 14, 19, 20, 21, 22, 23, 70, 75, 76, 80, 85, 88, 89, 91, 96, 98, 99, 112 individual student, 1, 4 instrument development, 73, 75 irt true score equating, 100 item parameter, 47, 55, 61, 64, 69, 76, 100, 101, 102, 103, 104, 107, 108 item parameter drift, 100, 101, 102 l listening, 45, 46, 48, 49, 50, 51, 52, 53 m mathematics test, 55, 102, 103, 104, 107, 108, 110 mcm/gpcm model, 55, 65, 67, 68, 69 p pcm, 58, 73, 78, 79, 80, 81, 82 polytechnic education, 84, 86, 87, 88, 90, 94, 95, 96, 97, 98, 99 polytomous, 56, 57, 58, 68, 73, 75, 77, 81, 82 professional, 14, 15, 16, 18, 22, 23, 33, 84, 85, 86, 90, 91, 93, 94, 95, 96, 97, 98 r recognition, 14, 84, 95, 96, 99 robust z method, 100, 101, 102, 103, 105, 106, 107, 108, 112 t team builder, 25, 31, 32, 33, 34, 35, 36, 38, 39, 40, 41, 42, 43 test, 14, 15, 20, 21, 25, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46, 48, 52, 53, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 73, 74, 75,
76, 77, 79, 80, 81, 82, 95, 96, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112 toep, 45, 46, 48, 49, 50, 51, 52, 53 total quality management, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23 v vocational character, 84, 85, 86, 88, 90, 95, 96, 97, 98, 99 vocational high school, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 29, 31, 32, 33, 34, 36, 39, 40, 41, 43 research and evaluation in education research and evaluation in education, e-issn: 2460-6995 v author indexes abadyo, 55 al-zuhdy, yosa abduh, 45 bastari, 55 brodjonegoro, satryo soemantri, 84 felestin, 13 handayani, peni, 84 keoviphone, chanthaboun, 1 mardapi, djemari, 74, 100 munadi, sudji, 45 ofianto, 73 patwary, mahmud al haq, 25 rahmawati, 100 retnawati, heri, 45 suhartono, 73 surjono, herman dwi, 25 triyono, mochamad bruri, 13 wibowo, udik budi, 1 research and evaluation in education vi volume 1, number 1, june 2015 authors’ biography abadyo. was born in lumajang, east java, indonesia, on 24 april 1952. currently works as a lecturer in mathematics department, malang state university, indonesia. attained his bachelor degree in 1980 from malang institute of teacher education and educational sciences (recently known as malang state university) majoring in mathematics education, his master degree in 1997 from gadjah mada university, indonesia, majoring in mathematics, and his doctoral degree in 2014 from the graduate school of yogyakarta state university, indonesia, majoring in educational research and evaluation. bastari. was born on 30 july 1966. currently works at the center of educational quality assurance, ministry of education and culture of republic of indonesia and as a lecturer of special education study program at the graduate school of yogyakarta state university, indonesia. chanthaboun keoviphone. currently works as an officer at vientiane provincial education and sport department, laos. received her master degree from graduate school of yogyakarta state university in 2014. 
djemari mardapi. was born in binjai, north sumatra, indonesia, on 1 january 1947. currently works as a senior lecturer in the graduate school and faculty of engineering of yogyakarta state university, indonesia. attained his bachelor degree in electricity education study program in 1973 and finished his master degree in 1984 on educational evaluation study program in yogyakarta institute of teacher education and educational sciences (recently known as yogyakarta state university). earned his doctoral degree on educational measurement and statistics from the university of iowa, iowa city, usa, in 1988. felestin. was born in antsirabe-nord, madagascar, on 16 november 1983. attained his bachelor degree in science of management, majoring in organization and administration of enterprise from university of antananarivo. his master degree was attained in 2013 from yogyakarta state university, indonesia, majoring in technology and vocational education, focusing on information communication and technology. heri retnawati. was born on 3 january 1973. currently works as a lecturer in the faculty of mathematics and natural sciences of yogyakarta state university, indonesia. attained her bachelor degree on mathematics education in 1996 in yogyakarta institute of teacher education and educational sciences (recently known as yogyakarta state university), master degree on educational research and evaluation in 2003, and also doctoral degree in the same major in 2008, in the same university. herman dwi surjono. currently works as a senior lecturer at the faculty of engineering and the graduate school of yogyakarta state university, indonesia. attained his doctoral degree in information technology in 2006 from southern cross university, australia. mahmud al haq patwary. currently works as a community and outreach executive at hubdhaka, dhaka, bangladesh. 
attained his bachelor degree from the department of science, mathematics and technology education (smte) of the institute of education and research, university of dhaka, bangladesh, and his master degree in instructional technology from the graduate school of yogyakarta state university, indonesia, with a perfect cgpa and was awarded cum laude.

mochamad bruri triyono. works as a lecturer at the faculty of engineering of yogyakarta state university, indonesia. attained his bachelor degree in 1983 from yogyakarta institute of teacher education and educational sciences (recently known as yogyakarta state university) majoring in mechanical engineering education, his master degree in 1996 from jakarta institute of teacher education and educational sciences (recently known as state university of jakarta) majoring in vocational technology education, and his doctoral degree in 2006 from state university of jakarta majoring in educational technology.

ofianto. was born in curup, bengkulu, indonesia, on 20 october 1982. currently works as a lecturer at the faculty of social science of padang state university, indonesia. attained his bachelor degree of history education in padang state university, and finished his master and doctoral education in the graduate school of yogyakarta state university, indonesia, both majoring in educational research and evaluation.

peni handayani. was born in ponorogo, east java, indonesia, on 1 september 1959. currently works as a lecturer in electronics engineering at bandung state polytechnic, indonesia. attained her bachelor degrees from two different universities: engineering teacher faculty of yogyakarta institute of teacher education and educational sciences (recently known as yogyakarta state university) in 1983 and bandung institute of technology in 1999.
earned her master degree from bandung institute of technology, majoring in computer technology in 2003, and her doctoral degree from yogyakarta state university in 2013, majoring vocational education. rahmawati. was born in banten, indonesia, on 23 august 1979. currently works at the center of education assessment of research and development organization of the department of education and culture. attained her bachelor degree in 2001 majoring electronic engineering from gadjah mada university, her master degree on research and evaluation method study program from university of massachusetts amherst in 2009, and her doctoral degree on educational research and evaluation study program from yogyakarta state university in 2014. satryo soemantri brodjonegoro. was born in delft, netherlands, on 5 january 1956. currently active as a visiting professor in mechanical engineering department at toyohashi university of technology, japan and at bandung institute of technology, indonesia. attained his doctoral degree from university of california at berkeley, usa, in 1985, majoring in mechanical engineering, and joined bandung institute of technology (itb) since then. sudji munadi. works as a professor in educational research and evaluation department of the graduate school of yogyakarta state university, indonesia. his expertise area includes measurement and evaluation in education. suhartono. works as a professor at the faculty of letter and culture of gadjah mada university in the history science department. attained his bachelor degree in 1966 majoring history science from the faculty of letter and culture gadjah mada university. continued his study in the post graduate training program in the leiden university, netherland (1978-1979). performed an archive study in the rijsarchief nederland institute for asian studies, leiden, netherland (2003), institute of oriental culture, tokyo university and institute of asian pacific studies waseda university, japan (2004). 
udik budi wibowo. has been working as a lecturer at the faculty of education and graduate school of yogyakarta state university since 1987. attained his bachelor degree in 1986 from gadjah mada university majoring in public administration science, his master degree in 1993 research and evaluation in education viii volume 1, number 1, june 2015 from indonesian university of education majoring educational administration, and his doctoral degree in 2010 in the same major and same university. yosa abduh al-zuhdy. works as a lecturer in english education department of the faculty of languages and arts, yogyakarta state university, indonesia. his expertise is english language education. research and evaluation in education research and evaluation in education, e-issn: 2460-6995 ix submission guidelines  the manuscript submitted is a result of a research or scientific assessment of an actual issue in the area of research, evaluation, and education in a broad sense, which has not been published elsewhere and is not being sent to other journals.  manuscript is accepted in english. any consistent spelling and punctuation styles may be used. please use single quotation marks, except where „a quotation is “within” a quotation‟. long quotations of 40 words or more should be indented without quotation marks.  a typical manuscript is approximately 8000 words or 12-18 pages including tables, references, captions and endnotes. manuscripts that greatly exceed this will be critically reviewed with respect to length. (a4; margins: top 3, left 3, right 2, bottom 2; double columns [except in abstract: single column]; single-spaced; font: garamond, 12).  manuscripts should be compiled in the following order: (1) title; (2) abstract; (3) keywords; (4) main text: introduction, research method, findings and discussion, conclusion and/or suggestions; (5) references.  the title of the manuscript should clearly represent the content of the article, and contain the keywords. 
 authors' name(s) should be written under the title (without any academic degree), along with the affiliation(s) and email address(es).  an abstract that does not exceed 300 words is required for any submitted manuscript. it is written narratively containing the aim(s), method, and the result(s) of the research.  each manuscript should have 3 to 5 keywords written under the abstract.  all tables and figures are adjusted to the paper length, and numbered referring to the text.  the citation and references are referred to american psychological association (apa) style, for example: .......... (switzgerald, 2014, p. 8) ............. mardapi (2015, pp. 13-14) [in text].  american psychological association (apa) style format is used.  the manuscript must be in *.doc or *.rtf, and sent to reid's management via online submission by creating account in this open journal system (ojs) [click register if you have not had any account yet; or click log in if you have already had an account].  authors‟ biography must be written narratively, containing each author‟s full name, degree(s) which were attained, place and date of birth, the last three educational levels which were taken, affiliation/department in which the author is currently working, phone number and email address.  all author(s)' names and identity(es) must be completely embedded in the form filled in by the corresponding author: email; affiliation; and each author's short biography (in the column of 'bio statement'). [if the manuscript is written by two or more authors, please click 'add author' in the 3rd step of 'enter metadata' in the submission process and then enter each author's data.]  (if any) the funding or grant-awarding bodies is acknowledged in the column of ‘contributors and supporting agencies’ when entering metadata in the open journal system (ojs ) of the journal. for single agency grants: "this work was supported by the [name of funding agency] under grant [number xxxx]." 
 all correspondences, information and decisions for the submitted manuscripts are given through email written in the manuscript and/or the emails used for the submission.  word template is available for this journal. if you have template queries, please contact reid.ppsuny@gmail.com mailto:reid.ppsuny@gmail.com this is an open access article under the cc-by-sa license. reid (research and evaluation in education), 5(2), 2019, 152-168 available online at: http://journal.uny.ac.id/index.php/reid the effectiveness of game-based science learning (gbsl) to improve students’ academic achievement: a meta-analysis of current research from 2010 to 2017 *1heru setiawan; 2shane phillipson 1sekolah menengah atas global mandiri jakarta jl. raya cakung cilincing km. 5, cakung timur, jakarta timur, dki jakarta 13910, indonesia 2faculty of education, monash university, australia 29 ancora imparo way, clayton vic 3800, australia *corresponding author. e-mail: heru.setiawan@teacher.globalmandiri.sch.id submitted: 10 november 2019 | revised: 6 december 2019 | accepted: 12 december 2019 abstract this study identifies the effectiveness of game-based science learning (gbsl) for improving students’ learning outcomes by conducting a literature review of the current research from 2010 to 2017. this study also explores the correlation between variation in school level and year of publication on gbsl effect size. data were collected from peer-reviewed journal articles published in educational databases including eric (educational research information centre), springer link, proquest education journal, and a+ education. seven inclusion criteria were used to select relevant studies. comprehensive meta-analysis (cma 2.0) was used to analyze the data. this study finds that (1) gbsl intervention has a statistically significant effect on students' learning outcomes with a higher average on the effect size of the experimental group (41.12) than the control group (37.07). 
the mean of the reviewed studies’ effect size is 0.667 in the medium category. (2) the implementation of gbsl in secondary school has a bigger average effect size than in elementary school. year of publication and effect size has a low positive correlation with a coefficient of correlation 0.40. keywords: game-based science learning, learning outcomes, meta-analysis permalink/doi: https://doi.org/10.21831/reid.v5i2.28073 introduction the young generation who was born in the 21st century is a digital native or the net generation (bennett, maton, & kervin, 2008). the millennial in this era also can be called a game generation (prensky, 2001). the trend of digital games’ use has been increasing in this era (corbett, 2010; mcgonigal, 2011). millions of people have been immersed in playing digital games either for entertainment or education (huang, hew, & lo, 2019). gee (2007) reported in his study that approximately 90% of students’ mobile phones connect to digital games. besides, many teachers use digital games as a medium of instruction in their classroom for engaging students during teaching and learning processes, or it is commonly called digital game-based learning (dgbl) (papastergiou, 2009; van eck, 2006). students also obtain feedbacks such as improvement, and win conditions after completing the goals (okeke, 2016, p. 1). the dgbl that specifically focus on science is called game-based science learning (gbsl). since 2006, the number of research investigating the effect of digital games in education has been increasing (chorney, 2012). some literature has been debating the effectiveness of gbsl in the last decade (hamari & keronen, 2017; quandt et al., 2015). the community of science education (physics, biology, chemistry, and general sciences) also https://doi.org/10.21831/reid.v5i2.28073 the effectiveness of game-based science learning... 
heru setiawan & shane phillipson copyright © 2019, reid (research and evaluation in education), 5(2), 2019 153 issn 2460-6995 concern with the potential of game-based learning. some researchers investigate the effectiveness of gbsl in some science subject matter such as newtonian mechanics (clark et al., 2011), human immunology (cheng, su, huang, & chen, 2014), and photosynthesis (culp, martin, clements, & lewis presser, 2015). they argue that science is challenging for some students because of abstract concepts and invisible objects. in addition, some research illustrated that rote memorization and decontextualized learning have potential drawbacks in the science context (honey & hilton, 2011; mayo, 2007). this issue has an impact on their learning outcomes which can be defined as skills, knowledge, and values as an outcome of students’ experiences (the us council for higher education accreditation (chea) cited in adam (2004, p. 4). learning outcomes can be knowledge, skills, or attitude. however, in this context, the learning outcome only refers to students’ learning outcomes in academic settings. thus, gbsl is the proper solution to this issue because digital games are highly engaging and motivating (huang et al., 2019; tsay, kofinas, & luo, 2018). several researchers demonstrated empirical evidence of the potential of this educational tool to enhance students’ learning outcomes in the various context of science subjects through comparing control and experiment group (such as bello, ibi, & bukar, 2016; fan, xiao, & su, 2015). however, a small number of sample of studies investigating the effect of gbsl on students' learning outcomes tended to have a more significant mean of effect sizes than studies with larger sample sizes (cheung & slavin, 2013). effect size refers to a quantitative measurement of the difference between the mean score of the control group and the treatment group (nakagawa & cuthill, 2007). 
meanwhile, the small sample size of the research cannot be used to generalize the effect of gbsl. in order to solve this issue, it needs further investigation of the effectiveness of gbsl in students’ achievement in sciences with a meta-analysis study to develop a better estimate of effect magnitude (king & he, 2005). meta-analysis is the process of converting the effects of several similar research into quantitative data so that these averages of the effect size and an overall determination can be made concerning the cumulative findings of several studies (glass, mcgaw, & smith, 1981). meta-analysis is a kind of retrospective observational study in which researchers make data recapitulation without any experimental manipulation (brockwell & gordon, 2001). several literature reviews of gamebased learning have been conducted both in the context of sciences and other subjects such as mathematics, language, history, and physical education. in 2006, vogel et al. (2006) used meta-analysis of digital games versus traditional teaching methods. the overall result of the meta-analysis was that treatment groups were reported higher learning outcomes and better attitudes toward learning than control groups. the report also analyses some moderator categories. he reported that gender, school level, and user type showed significant statistical results. meanwhile, learner control, type of activity, and realism do not appear to be influential. in the science context, li and tsai (2013), reviewed research articles regarding game-based science learning (gbsl) published from 2000 to 2011. the focus of the review is qualitative outcomes including research purposes and designs, the theoretical foundations, game design, and learning focus. based on the review, gbsl can provide effective learning in a collaborative problem-solving environment. 
however, that review considered only qualitative data and did not analyze the quantitative effect of gbsl interventions or their effect sizes. the previous research therefore leaves gaps in the literature. although several studies have reviewed the gbsl literature, few have tested the relative influence of gbsl on learning outcomes. there is also a lack of quantitative meta-analyses of gbsl. li and tsai (2013), whose review was qualitative, suggested that future investigations should conduct a quantitative content analysis of gbsl effectiveness, for example of students’ learning outcomes in science education. this is because digital games that can promote students’ engagement (annetta, minogue, holmes, & cheng, 2009; tsay et al., 2018) might also enhance students’ learning outcomes. other similar studies, such as vogel et al. (2006), also have a limitation. although they focus specifically on cognitive aspects in their analysis, the context of the study is broad and not specific to science education. based on this gap, a meta-analysis of the effect of digital games on students’ learning outcomes in science education (gbsl) needs to be conducted. thus, two central research questions (rqs) were addressed in this study: (1) rq1: is game-based science learning (gbsl) effective in enhancing students’ learning outcomes compared to the traditional method, as reported by studies from 2010 to 2017? (2) rq2: do moderator categories, namely the school level of participants (elementary or secondary school) and the year of publication, have any correlation with gbsl effect size? this research contributes to the literature in this field.
first, this study reviewed recent trends in gbsl research, especially for those in the field of science education who are interested in quantitative studies of gbsl for students’ learning outcomes. a metaanalysis of gbsl has been conducted by several researchers within a broader context such as mathematics, language, and other subjects (divjak & tomić, 2011; young et al., 2012), but lack of research conducted in science education. second, the consistency of the result of similar studies for several years will be investigated. therefore, consistency and inconsistency of findings of similar research will be found, and bias of one or more studies in this field could be detected (borg & gall, 1983). third, a meta-analysis uses a significant amount of data, and applying statistical methods by organizing some information comes from a broad cross-section whose function is to complement other purposes (glass et al., 1981). by the significant number of participants, the study develops a better estimate of effect magnitude (king & he, 2005). the larger sample size in conducting a meta-analysis could be found in one study that will create greater statistical power and more precise confidence intervals. this is because the study collects several similar studies to be analyzed quantitatively. it concentrates on the effect size of this empirical discovery which is relatively better than the other methods of quantitative approaches including narrative review, descriptive review, and vote counting (lipsey & wilson, 2001). moreover, through the substantial number of participants with different variables, the differences may exist because of differences that exist among the articles such as different subject populations, education level, gender, game type, etc. by using meta-analysis, different moderator variables can be investigated. vogel et al. (2006) state that analyzing moderator variables would give a clearer overview or more complex picture of reviewed studies. 
method
research strategies and data collection
the search of the literature was conducted from june to july 2017. data were collected from journal articles in educational databases including proquest education journal, springer link, a+ education, and eric (educational research information centre). these databases index high-impact, high-quality journal articles. the keywords are 'digital game, sciences, physics, biology, chemistry, secondary, high school, elementary.' the boolean operators 'and' and 'or' were used to combine the key terms. after the keyword search, the researchers read the abstracts and full texts. seven inclusion and exclusion criteria were applied to screen the eligible articles: publication year, unit, game intervention, research design, participant, outcome type, and language. the criteria are explained as follows. (1) publication year: all of the articles are peer-reviewed journal articles published between january 2010 and june 2017. (2) unit: the unit in elementary and secondary education in this study is science subjects including biology, physics, chemistry, and general sciences. other units such as technical subjects in vocational high school are excluded. subjects that match similar keywords but are unrelated to science, such as physical education, are also excluded. (3) game/intervention: a digital game in this study is defined as a digital experience in which participants play a game in computer software and receive feedback toward the goals in the form of a score, progress, and win conditions.
however, learning interventions that focus on creating a digital game for students are not included. the included studies compare digital games in science instruction with traditional methods. (4) research design: all journal articles included in this meta-analysis must use experimental and control groups, or game versus non-game conditions, and must report the sample size, mean, and standard deviation; studies that do not report these data were excluded. the experimental design ensures that the included studies provide data that can be compared in the statistical analysis. studies are considered experimental if individual students are randomly assigned to an instructional condition. (5) participant: the participants in the included studies are elementary and secondary school students. students with specific clinical criteria such as disabilities are excluded from this study. (6) outcome type: only quantitative (numerical) data are extracted in this study, specifically students' learning outcomes or the cognitive aspect. other research outcomes or qualitative data such as behavior, activity, participation, collaboration, engagement, and motivation are not extracted. (7) language: only articles published in english are included, regardless of the country in which the studies were conducted. the full texts that met the inclusion criteria were evaluated by annotating each article to extract the necessary information. this step used note cards containing the eligibility criteria evaluation rubric recommended by mertens (2015), which covers the research question, research design, data analysis, results, conclusion, and research evaluation. the preliminary selection identified 137 articles.
then, after the full texts were screened against the inclusion criteria to exclude non-eligible articles, 12 journal articles were selected, although this is a small number relative to some meta-analyses in this field. the data from the selected studies were then extracted for further analysis. first, the characteristics of the reviewed studies, including the year of publication, country of origin, school level of participants, science domain, game name, and the purpose of the study, were recorded in microsoft excel. the data were extracted through manual searches in each article and provide an overview of the characteristics of the reviewed studies. second, the key information corresponding to the research questions was extracted for each study. only quantitative (numerical) data are needed to answer the research questions and were used in the statistical analysis. the quantitative data extracted are the students’ achievement means, the standard deviations, and the number of participants in the control and treatment groups.
data analysis method
microsoft excel and comprehensive meta-analysis (cma 2.0) were used for statistical analysis after the quantitative data were extracted. first, the demographic characteristics of the reviewed studies were analyzed with descriptive statistics in microsoft excel, which presents data such as means, percentages, and frequencies. the data were also presented visually as column charts, bar charts, and histograms. then, cma 2.0 was used. several researchers have verified the accuracy of the analysis method (ones, viswesvaran, & schmidt, 1993). cma 2.0 is used to analyze the hedges' g effect size, the lower limit (ll), the upper limit (ul), the p-value, and the relative weight of all studies (borenstein, hedges, higgins, & rothstein, 2005).
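as a rough illustration of what cma 2.0 reports for each study, the sketch below computes hedges' g and its 95% confidence limits from the group means, standard deviations, and sample sizes. it assumes the standard pooled-standard-deviation formula with the small-sample correction factor j; the function name hedges_g and its arguments are illustrative and are not part of cma. the example input is the bello et al. (2016) row of table 2, and the result comes out close to the values listed for that study in table 3.

```python
import math


def hedges_g(mean_e, sd_e, n_e, mean_c, sd_c, n_c, z=1.96):
    """Return (g, lower, upper) for a two-group standardized mean difference.

    Assumption: pooled-SD Cohen's d with Hedges' small-sample correction;
    z=1.96 gives an approximate 95% confidence interval.
    """
    df = n_e + n_c - 2
    # Pooled standard deviation of the experimental and control groups
    pooled_sd = math.sqrt(((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_e - mean_c) / pooled_sd
    # Small-sample correction factor J
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Approximate variance of d, then scaled by J^2 for g
    var_d = (n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c))
    se_g = math.sqrt(j ** 2 * var_d)
    return g, g - z * se_g, g + z * se_g


# Example: bello et al. (2016), values taken from table 2
g, lower, upper = hedges_g(66.23, 7.07, 90, 46.6, 9.48, 90)
print(round(g, 3), round(lower, 3), round(upper, 3))
```

the same function can be applied row by row to table 2; rows whose reported sample sizes are inconsistent will not reproduce table 3.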
to give a clearer overview of the overall effect size, a forest plot was used to compare the effect of digital games with that of traditional methods (sutton, abrams, jones, sheldon, & song, 2000). a meta-analysis can use one of two effect models: the fixed-effect model or the random-effects model (borenstein, hedges, higgins, & rothstein, 2010). the choice of effect model is an essential factor in a meta-analysis (hedges & vevea, 1998), because an improperly chosen model causes inefficient estimation and incorrect conclusions (nickell, 1981). in this study, we use the random-effects model because the twelve studies were drawn from different populations, for example populations in different countries. sacks, berrier, reitman, ancona-berk, and chalmers (1998) conducted research under a similar condition. moreover, the effect sizes (es) reported by the studies vary. in the random-effects model, the true effect size may differ from one study to another (olejnik & algina, 2000). in addition to the estimation of the primary effect, secondary analyses were conducted to take advantage of the coded study characteristics and test the moderating effects. specifically, the secondary analyses tested the influence of grade level (elementary and secondary school) and year of publication. the results of the statistical analysis from cma 2.0 were used to address the research questions with the following method of interpretation. we address the first research question by comparing the experimental group and the control group. there is no difference between the control and experimental groups when their sample means are equal.
however, when the experimental group's mean score is higher than the control group's, the mean difference indicates that the gbsl intervention is more effective. the second research question is answered by investigating the effect of the moderator categories, year and school level, on gbsl effectiveness. we use descriptive analysis, comparing the average effect size at each school level (elementary and secondary school) to determine at which school level the game intervention is more effective. then, to analyze whether publication year has any correlation with game effectiveness, we use inferential statistics because it strives to make inferences and predictions (bryman, 2016). this statistical method improves on previous research that only looks at the pattern of effect size across the years. the data are presented as a scatterplot to illustrate the relationship between the two variables (cohen, manion, & morrison, 2007, p. 507). spearman's rank correlation coefficient (r) is also computed in microsoft excel to see the trend, because both variables are treated as ordinal. the degree of the correlation coefficient is interpreted as very high (0.9 to 1.0), high (0.7 to 0.9), moderate (0.5 to 0.7), low (0.3 to 0.5), or negligible (0 to 0.3).
detection of publication bias
detecting publication bias in the reviewed studies is crucial in a meta-analysis (rothstein, sutton, & borenstein, 2006). publication bias is the tendency to select articles for publication based on the statistical significance of their effects rather than the quality of the study (rothstein et al., 2006, p. 296). several pieces of evidence show that research with a higher effect size is more likely to be published (peters, sutton, jones, abrams, & rushton, 2006).
consequently, it will affect the review process: the meta-analysis may overestimate the effect size because it uses a biased sample of the target population. hence, to minimize this bias, a model is needed to detect whether studies are missing. one proper model is the funnel plot (sterne et al., 2011). in the funnel plot, the effect size is plotted on the x-axis, and a measure of study size, such as the number of participants, is plotted on the y-axis (sterne & egger, 2001). asymmetry is easily detected in the funnel plot, since the studies are distributed symmetrically when publication bias is absent (schmidt & hunter, 2014). the next problem is whether the observed overall effect is robust. to address this issue, some researchers use rosenthal’s fail-safe n. orwin (1983) suggested that rosenthal’s fail-safe n computes the number of missing studies that would have to be incorporated in the analysis to bring the overall effect below significance.
findings and discussion
overview of the reviewed studies
the publication years range from 2010 to 2017, which shows the development of research in this area over the last eight years. the highest number of publications is in 2015, with four publications (figure 1). the sample also reflects the presence of international studies. however, eight of the twelve studies (67%) were conducted in asia, seven in taiwan and one in singapore. among the remaining studies, spain is represented by two studies, while the others are from the u.s. and nigeria (figure 2). both elementary and secondary education are represented: seven studies are from elementary schools and five from secondary schools (figure 3).
subject areas are also well represented, with three studies in biology, seven in general sciences, and one each in physics and chemistry (figure 4). table 1 outlines the characteristics of the included studies that met all the eligibility criteria.
figure 1. the number of reviewed studies by year of publication
figure 2. the reviewed studies by country
figure 3. the reviewed studies based on the education level of participants
figure 4. the reviewed studies according to science domain
table 1. the background information of reviewed articles
authors | country | game name | school level | science domain
bello et al. (2016) | nigeria | n/a | secondary | sciences
chee & tan (2012) | singapore | alkhimia | secondary | chemistry
wrzesien & raya (2010) | spain | supercharged | elementary | sciences
anderson & barnett (2013) | usa | supercharged | secondary | physics
sung & hwang (2013) | taiwan | alien invasion | elementary | sciences
yien, hung, hwang, & lin (2011) | taiwan | nutrition supplement battle | elementary | biology
chu & hung (2015) | taiwan | kodu | elementary | sciences
su & cheng (2015) | taiwan | find insect | elementary | sciences
chen & hwang (2017) | taiwan | alien invasion | elementary | sciences
fan et al. (2015) | taiwan | the mmbcls | secondary | biology
furió, juan, seguí, & vivó (2015) | spain | iphone game | elementary | sciences
chen, yeh, & chang (2016) | taiwan | role play game (rpg) | secondary | biology
how effective is gbsl in enhancing students’ learning outcomes in sciences compared to the traditional method, as reported by studies from 2010 to 2017?
the first research question is answered by comparing the average means of the reviewed studies. the result of data extraction is presented in table 2, which compares the treatment and control groups of the twelve studies. the number of participants in the twelve studies is 954 students. most of the studies have an equal number of participants in the treatment and control groups, although some have slightly more participants in one group than in the other. there are 489 students in the control groups and 465 students in the experimental groups in total. the number of participants per study varies from 38 to 180 students. the standard deviations also vary, from the lowest of 0.93 to the highest of 23.54. the details of the data for each study are shown in table 2. based on table 2, the average learning outcome mean across all studies is higher for the experimental group (41.12) than for the control group (37.07). the mean difference analysis shows that one study, chu and hung (2015), has a negative mean difference between the experimental and control groups, whereas the other eleven studies have a positive mean difference. the highest mean difference among the studies is 19.63, while the lowest is -15.03. the standard deviations of the experimental and control groups also vary.
the analysis result of standardized mean difference effect size, variance, weight, and confidence interval (ci)
the random-effects model was used to obtain the composite effect size with comprehensive meta-analysis (cma). the summary of the final analysis for all studies is presented in table 3. we calculate hedges's g for each study separately to maintain consistency of measurement. in addition to the individual effects, we also present a 95% confidence interval (lower limit and upper limit) around each study and the relative weight (w).
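the pooled estimate under the random-effects model can be sketched as follows. this is a minimal python sketch that assumes the dersimonian-laird (method of moments) estimator of the between-study variance and recovers each study's standard error from the confidence limits in table 3; whether cma 2.0 was run with exactly these settings is an assumption, and the names STUDIES and random_effects are illustrative.

```python
import math

# (hedges' g, lower limit, upper limit) per study, as listed in table 3
STUDIES = [
    (2.338, 1.959, 2.716),    # bello et al. (2016)
    (0.571, 0.123, 1.020),    # chee & tan (2012)
    (0.233, -0.325, 0.792),   # wrzesien & raya (2010)
    (0.886, 0.536, 1.237),    # anderson & barnett (2013)
    (0.898, 0.381, 1.414),    # sung & hwang (2013)
    (0.624, 0.136, 1.113),    # yien et al. (2011)
    (-0.637, -1.153, -0.120), # chu & hung (2015)
    (0.741, 0.255, 1.228),    # su & cheng (2015)
    (0.402, -0.134, 0.938),   # chen & hwang (2017)
    (1.112, 0.500, 1.724),    # fan et al. (2015)
    (0.121, -0.503, 0.744),   # furió et al. (2015)
    (0.520, 0.143, 0.896),    # chen et al. (2016)
]


def random_effects(studies, z=1.96):
    """Pool effect sizes with a DerSimonian-Laird random-effects model.

    Returns (pooled, lower, upper, tau2, relative_weights_in_percent).
    """
    g = [s[0] for s in studies]
    # Back-calculate each within-study variance from its 95% limits
    var = [((s[2] - s[1]) / (2 * z)) ** 2 for s in studies]
    w = [1 / v for v in var]                      # fixed-effect weights
    fixed = sum(wi * gi for wi, gi in zip(w, g)) / sum(w)
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, g))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(g) - 1)) / c)       # between-study variance
    w_re = [1 / (v + tau2) for v in var]          # random-effects weights
    pooled = sum(wi * gi for wi, gi in zip(w_re, g)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    rel = [100 * wi / sum(w_re) for wi in w_re]
    return pooled, pooled - z * se, pooled + z * se, tau2, rel


pooled, lower, upper, tau2, rel = random_effects(STUDIES)
print(round(pooled, 3), round(lower, 3), round(upper, 3))
```

because the between-study variance is large relative to the within-study variances, the relative weights end up nearly equal across studies, which is the pattern visible in the relative weight column of table 3.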
the overall effect size of the twelve studies is g = 0.661, p = .003, with a 95% confidence interval between 0.232 and 1.090. it indicates a moderate overall effect for the synthesized gbsl interventions that is statistically different from a null effect. the largest effect size is that of bello et al. (2016), at 2.338. in contrast, the smallest is that of chu and hung (2015), at -0.637. the comparison of the smd effect sizes of all studies is presented as a forest plot in figure 5.

table 2. mean, standard deviation, and sample size of the studies on digital games versus control method
authors (year) | experiment mean | experiment sd | experiment n | control mean | control sd | control n | n total | mean difference
bello et al. (2016) | 66.23 | 7.07 | 90 | 46.6 | 9.48 | 90 | 180 | 19.63
chee & tan (2012) | 3.28 | 2.61 | 40 | 2 | 1.71 | 38 | 78 | 1.28
wrzesien & raya (2010) | 6.33 | 2.2 | 24 | 5.88 | 1.54 | 24 | 48 | 0.45
anderson & barnett (2013) | 6.3 | 1.2 | 32 | 5.9 | 1.27 | 32 | 136 | 0.4
sung & hwang (2013) | 57.26 | 16.87 | 31 | 43.07 | 14.24 | 31 | 62 | 14.19
yien et al. (2011) | 16.94 | 2.38 | 33 | 15.09 | 3.39 | 33 | 66 | 1.85
chu & hung (2015) | 56 | 23.54 | 30 | 71.03 | 23.04 | 29 | 59 | -15.03
su & cheng (2015) | 82.94 | 10 | 34 | 75.59 | 9.595 | 34 | 68 | 7.35
chen & hwang (2017) | 86.78 | 9.15 | 27 | 82.35 | 12.38 | 26 | 53 | 4.43
fan et al. (2015) | 88 | 9 | 23 | 76 | 12 | 23 | 46 | 12
furió et al. (2015) | 4.89 | 1.45 | 19 | 4.74 | 0.93 | 19 | 38 | 0.15
chen et al. (2016) | 18.51 | 2.71 | 43 | 16.63 | 4 | 77 | 120 | 1.88
mean / total | 41.12 | | Σ = 426 | 37.07 | | Σ = 456 | Σ = 954 |

the effectiveness of game-based science learning (gbsl)... heru setiawan & shane phillipson 160 copyright © 2019, reid (research and evaluation in education), 5(2), 2019 issn 2460-6995

figure 5. the forest plot’s comparison of the smd effect size of reviewed studies

table 3. effect sizes, confidence intervals, and relative weights of reviewed studies
name of the study | hedges's g | lower limit | upper limit | p-value | relative weight (w)
bello et al. (2016) | 2.338 | 1.959 | 2.716 | 0.00000 | 8.738
chee & tan (2012) | 0.571 | 0.123 | 1.020 | 0.01255 | 8.502
wrzesien & raya (2010) | 0.233 | -0.325 | 0.792 | 0.41332 | 8.089
anderson & barnett (2013) | 0.886 | 0.536 | 1.237 | 0.00000 | 8.821
sung & hwang (2013) | 0.898 | 0.381 | 1.414 | 0.00065 | 8.253
yien et al. (2011) | 0.624 | 0.136 | 1.113 | 0.01227 | 8.358
chu & hung (2015) | -0.637 | -1.153 | -0.120 | 0.01571 | 8.252
su & cheng (2015) | 0.741 | 0.255 | 1.228 | 0.00279 | 8.367
chen & hwang (2017) | 0.402 | -0.134 | 0.938 | 0.14151 | 8.177
fan et al. (2015) | 1.112 | 0.500 | 1.724 | 0.00036 | 7.873
furió et al. (2015) | 0.121 | -0.503 | 0.744 | 0.70453 | 7.826
chen et al. (2016) | 0.520 | 0.143 | 0.896 | 0.00682 | 8.743
random-effects model | 0.661 | 0.232 | 1.090 | 0.00253 |

table 4. mean effect size of gbsl based on school level
moderator | number of studies | % of study | d | n
elementary | 7 | 58.33% | 1.08 | 394
secondary | 5 | 41.67% | 0.34 | 560

do moderator categories including school level of participants (elementary and secondary school context) and year of publication have any correlation with gbsl effect size?

in addition to the overall effect size, subsequent moderator analyses were run by school level and by year of publication. firstly, we compared the two school levels, elementary and secondary (table 4). seven studies were conducted in an elementary school setting, with a mean effect size of 1.08. the other five studies were conducted in a secondary school setting, with a mean effect size of 0.34. this number shows that the effect size of gbsl in secondary school contexts is nearly two and a half times higher than the effect size for the elementary school samples. thus, the implementation of gbsl in secondary school tends to have a larger effect size than in the elementary school context.
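the random-effects summary row of table 3 can be approximated from the per-study values. the sketch below assumes the dersimonian-laird estimator of between-study variance, which is a common default but is not named in the article, and recovers each study's standard error from the width of its 95% confidence interval. all function and variable names are illustrative.

```python
import math

# (hedges's g, lower limit, upper limit) for the twelve studies in table 3
STUDIES = [
    (2.338, 1.959, 2.716), (0.571, 0.123, 1.020), (0.233, -0.325, 0.792),
    (0.886, 0.536, 1.237), (0.898, 0.381, 1.414), (0.624, 0.136, 1.113),
    (-0.637, -1.153, -0.120), (0.741, 0.255, 1.228), (0.402, -0.134, 0.938),
    (1.112, 0.500, 1.724), (0.121, -0.503, 0.744), (0.520, 0.143, 0.896),
]

def random_effects(studies, z=1.96):
    """pool effect sizes with a dersimonian-laird random-effects model (assumed)."""
    g = [s[0] for s in studies]
    # within-study variance recovered from the ci width: se = (upper - lower) / (2z)
    v = [((s[2] - s[1]) / (2 * z)) ** 2 for s in studies]
    w = [1 / vi for vi in v]                                  # fixed-effect weights
    fixed = sum(wi * gi for wi, gi in zip(w, g)) / sum(w)
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, g))   # cochran's q
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(g) - 1)) / c)                   # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]                      # random-effects weights
    pooled = sum(wi * gi for wi, gi in zip(w_re, g)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    rel_w = [100 * wi / sum(w_re) for wi in w_re]             # relative weights in %
    return pooled, pooled - z * se, pooled + z * se, rel_w

pooled, lower, upper, rel_w = random_effects(STUDIES)
print(f"pooled g = {pooled:.3f}, 95% ci = [{lower:.3f}, {upper:.3f}]")
print("relative weights (%):", [round(x, 2) for x in rel_w])
```

because the between-study variance is large relative to the within-study variances, the relative weights come out nearly equal, which is the pattern visible in the last column of table 3.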
secondly, we compared effect sizes according to the year of publication (table 5). the correlational analysis shows that year of publication has a low positive correlation with effect size, r = 0.40 (r² = 0.16). figure 6 is a scatter plot of the relationship between year of publication (x-axis) and effect size (y-axis). it shows that the average effect size was 0.23 in 2010 and approximately doubled to 0.55 in 2011. five years later, in 2016, the average effect size rose sharply to 2.54.

analysis for publication bias

among the various methods for assessing publication bias, rosenthal’s fail-safe n (orwin, 1983) has the advantage of focusing on the potential impact that unpublished or unidentified studies may have on the current estimated effect size. it provides an estimate of the number of hypothetical missing studies that would have to be identified in order to bring the calculated overall effect below the level of researcher-imposed substantive significance (easterbrook, gopalan, berlin, & matthews, 1991). it assumes that those missing studies have negligible effects. based on the analysis, 307 additional studies would be needed to raise the p-value to alpha (z for alpha = 1.959).

the other method used to analyze publication bias is the funnel plot, which has two diagonal lines that represent the 95% confidence interval and a vertical central line. the x-axis represents the effect size, and the y-axis represents the standard error. figure 7 illustrates the funnel plot of standard error (se) by hedges' g effect size. according to figure 7, nine studies fall within the two diagonal lines, that is, within the 95% confidence interval.
however, three studies fall outside the funnel, indicating that these studies were not as significant as the other nine studies.

table 5. mean effect size of gbsl based on year of publication
year of publication | number of studies | % of study | d | n
2010 | 1 | 8.33% | 0.23 | 48
2011 | 1 | 8.33% | 0.55 | 66
2012 | 1 | 8.33% | 0.75 | 78
2013 | 2 | 16.67% | 0.892 | 198
2014 | 1 | 8.33% | 0.77 | 68
2015 | 3 | 25.00% | 0.51 | 143
2016 | 2 | 16.67% | 2.54 | 300
2017 | 1 | 8.33% | 0.36 | 53

figure 6. scatter plot of the relationship between the year of publication and average effect size

figure 7. funnel plot of standard error (se) by hedges’s g effect size of reviewed studies

the performance of the result of this study with similar research

the results of this study align with similar meta-analytic reviews of gamification across various contexts, such as mathematics, language, and physical education, conducted over more than a decade, which have consistently found that game-based learning outperforms traditional learning (divjak & tomić, 2011; vogel et al., 2006; young et al., 2012). however, some notable differences in the statistical analysis emerge. first, the fail-safe number (nfs) found in this research, 307 studies, is much lower than in the previous meta-analysis: it is only approximately a fifth of the nfs of 1465 reported by vogel et al. (2006). second, the number of studies in this meta-analysis is only twelve, which is lower than in similar research in this field, such as divjak and tomić (2011) with 32 studies and young et al. (2012) with more than 300 articles. in addition, the findings of this research support those of li and tsai (2013) regarding the potential of gbsl to promote students’ learning.
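the fail-safe number compared above has a simple closed form, n_fs = (Σz / z_alpha)² - k, where the z are per-study z-values and k is the number of studies. the sketch below applies it to the values of table 3, deriving each z as g divided by a standard error recovered from the confidence interval. that derivation and the two-tailed z_alpha = 1.96 are assumptions, chosen because the article reports "z for alpha = 1.959"; the function name is hypothetical.

```python
# (hedges's g, lower limit, upper limit) for the twelve studies in table 3
STUDIES = [
    (2.338, 1.959, 2.716), (0.571, 0.123, 1.020), (0.233, -0.325, 0.792),
    (0.886, 0.536, 1.237), (0.898, 0.381, 1.414), (0.624, 0.136, 1.113),
    (-0.637, -1.153, -0.120), (0.741, 0.255, 1.228), (0.402, -0.134, 0.938),
    (1.112, 0.500, 1.724), (0.121, -0.503, 0.744), (0.520, 0.143, 0.896),
]

def fail_safe_n(studies, z_alpha=1.96):
    """rosenthal's fail-safe n: null studies needed to push the combined p above alpha."""
    # per-study z = g / se, with se recovered from the 95% ci width (assumption)
    z_values = [g / ((hi - lo) / (2 * 1.96)) for g, lo, hi in studies]
    k = len(z_values)
    return (sum(z_values) / z_alpha) ** 2 - k

print(f"fail-safe n is roughly {fail_safe_n(STUDIES):.0f}")
```

with these inputs the estimate should fall in the neighbourhood of the 307 studies reported by the authors; small differences are expected because the table values are rounded.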
li and tsai (2013) believe that gbsl can promote students’ engagement. therefore, students’ engagement and motivation might lead to an improvement in students’ learning outcomes in science.

conclusion and recommendation

conclusion

based on the results and discussion, some conclusions can be drawn. first, based on the investigated studies conducted from 2010 to 2017, the use of gbsl significantly improves students’ learning outcomes in elementary and secondary school. the mean learning outcome of the experimental groups across the studies is higher than that of the control groups, 41.12 against 37.07. the mean hedges' g random effect size of the reviewed studies is 0.661, which can be classified as a medium effect size. second, the school-level moderator is related to the effectiveness of digital games: the implementation of gbsl in secondary school has a greater effect size than in the elementary school context. also, year of publication and effect size have a low positive correlation, r = 0.40.

recommendation

the result of this study has implications for future studies. experimental research on gbsl in science education across various contexts is still needed. this is supported by the publication bias analysis, which showed that at least 307 studies in this area of research would be needed to bring the p-value to alpha. this research is complex, but the description of the process and result has been presented. furthermore, we used comprehensive meta-analysis 2.0 as trusted software for quantitative meta-analysis. however, our study has some limitations. the study only includes a small amount of research.
this may be because the topic is too specific: it only covers the effect of gbsl in one subject (science), and the outcomes focus specifically on cognitive aspects. there are many potential studies on gbsl in science education within the timeframe (2010-2017), but they were not included in this study because they did not pass screening against the seven inclusion and exclusion criteria determined in the research design. some studies lack complete data to be extracted, or their topic is not suitable for this research. for example, some studies use a case study design with only an experimental group and no control group (echeverría et al., 2011; spires, rowe, mott, & lester, 2011). other studies are not eligible because they focus on other outcomes such as engagement (annetta et al., 2009), collaboration and problem-solving (sánchez & olivares, 2011), and developing serious games (khalili, sheridan, williams, clark, & stegman, 2011; nilsson & jakobsson, 2011; ting, 2010). therefore, future studies should focus not only on cognitive or quantitative outcomes but also on affective or qualitative outcomes such as students’ engagement, motivation, self-efficacy, participation, collaboration, communication, and problem-solving skills. research reviewing qualitative outcomes can be conducted as a systematic review, narrative review, or descriptive review (for example, kim, munson, & mckay, 2012; li & tsai, 2013). the limited number of studies identified might also be due to the restrictive criteria for year of publication, database sources, context, and moderator categories. first, the included studies were conducted from 2010 to 2017. therefore, the result of this study does not capture studies outside this period. second, the review only includes some databases, namely eric, springer link, proquest, and a+ education.
future studies could extend the literature search to other educational databases such as isi web of science or to sources like google scholar, conference proceedings, and dissertations. there are many articles related to gbsl. third, regarding context, future studies could investigate effectiveness in different countries and at additional educational levels such as preschool. this is because we found that most of the research included in this meta-analysis was conducted in asia, and the preschool context has not been explored. lastly, for moderator categories, our research only focused on the school level of participants and the year of publication of the study. therefore, future research can explore different moderators such as gender (tsay et al., 2018; vogel et al., 2006), game genre (individual, peers, or groups), stream type or typical games (sjöblom, törhönen, hamari, & macey, 2017), learner control, and type of activity (vogel et al., 2006).

acknowledgment

this research is funded by lpdp (lembaga pengelola dana pendidikan). we also acknowledge the lecturer of research approaches in education and research project in education in the faculty of education, monash university, australia, who gave many suggestions for this article. lastly, we thank monash university library for providing resources for this article.

references

adam, s. (2004). a consideration of the nature, role, application, and implications for european education of employing ‘learning outcomes’ at the local, national, and international levels. united kingdom bologna seminar. 1-2 july 2004, heriot-watt university (edinburgh conference centre), edinburgh.
anderson, j. l., & barnett, m. (2013). learning physics with digital game
simulations in middle school science. journal of science education and technology, 22(6), 914–926. https://doi.org/10.1007/s10956-013-9438-8
annetta, l. a., minogue, j., holmes, s. y., & cheng, m.-t. (2009). investigating the impact of video games on high school students’ engagement and learning about genetics. computers & education, 53(1), 74–85. https://doi.org/10.1016/j.compedu.2008.12.020
bello, s., ibi, m. b., & bukar, i. b. (2016). effect of simulation techniques and lecture method on students’ academic performance in mafoni day secondary school maiduguri, borno state, nigeria. journal of education and practice, 7(23), 113–117. retrieved from https://www.iiste.org/journals/index.php/jep/article/view/32584
bennett, s., maton, k., & kervin, l. (2008). the ‘digital natives’ debate: a critical review of the evidence. british journal of educational technology, 39(5), 775–786. https://doi.org/10.1111/j.1467-8535.2007.00793.x
borenstein, m., hedges, l., higgins, j., & rothstein, h. (2005). comprehensive meta-analysis version 2. englewood cliffs, nj: biostat.
borenstein, m., hedges, l. v., higgins, j. p. t., & rothstein, h. r. (2010). a basic introduction to fixed-effect and random-effects models for meta-analysis. research synthesis methods, 1(2), 97–111. https://doi.org/10.1002/jrsm.12
borg, w. r., & gall, m. d. (1983). educational research: an introduction (4th ed.). new york, ny: longman.
brockwell, s. e., & gordon, i. r. (2001). a comparison of statistical methods for meta-analysis. statistics in medicine, 20(6), 825–840. https://doi.org/10.1002/sim.650
chee, y. s., & tan, k. c. d. (2012). becoming chemists through game-based inquiry learning: the case of “legends of alkhimia.” electronic journal of e-learning, 10(2), 185–198. retrieved from http://www.ejel.org/issue/download.html?idarticle=188
chen, c.-l.
d., yeh, t.-k., & chang, c.-y. (2016). the effects of game-based learning and anticipation of a test on the learning outcomes of 10th grade geology students. eurasia journal of mathematics, science and technology education, 12(5), 1379–1388. https://doi.org/10.12973/eurasia.2016.1519a
chen, c., & hwang, g. (2017). effects of the team competition-based ubiquitous gaming approach on students’ interactive patterns, collective efficacy and awareness of collaboration and communication. educational technology & society, 20(1), 87–98. retrieved from https://www.jstor.org/stable/jeductechsoci.20.1.87
cheng, m.-t., su, t., huang, w.-y., & chen, j.-h. (2014). an educational game for learning human immunology: what do students learn and how do they perceive? british journal of educational technology, 45(5), 820–833. https://doi.org/10.1111/bjet.12098
cheung, a. c. k., & slavin, r. e. (2013). the effectiveness of educational technology applications for enhancing mathematics achievement in k-12 classrooms: a meta-analysis. educational research review, 9, 88–113. https://doi.org/10.1016/j.edurev.2013.01.001
chorney, a. i. (2012). taking the game out of gamification. dalhousie journal of interdisciplinary management, 8(1), 1–14. https://doi.org/10.5931/djim.v8i1.242
chu, h.-c., & hung, c.-m. (2015). effects of the digital game-development approach on elementary school students’ learning motivation, problem solving, and learning achievement. international journal of distance education technologies, 13(1), 87–102. https://doi.org/10.4018/ijdet.2015010105
clark, d. b., nelson, b. c., chang, h.-y., martinez-garza, m., slack, k., & d’angelo, c. m. (2011).
exploring newtonian mechanics in a conceptually-integrated digital game: comparison of learning and affective outcomes for students in taiwan and the united states. computers & education, 57(3), 2178–2195. https://doi.org/10.1016/j.compedu.2011.05.007
cohen, l., manion, l., & morrison, k. (2007). research methods in education (6th ed.). new york, ny: routledge.
corbett, s. (2010, september 15). learning by playing video games in the classroom. the new york times. retrieved from https://www.nytimes.com/2010/09/19/magazine/19video-t.html
culp, k. m., martin, w., clements, m., & lewis presser, a. (2015). testing the impact of a pre-instructional digital game on middle-grade students’ understanding of photosynthesis. technology, knowledge and learning, 20(1), 5–26. https://doi.org/10.1007/s10758-014-9233-5
divjak, b., & tomić, d. (2011). the impact of game-based learning on the achievement of learning goals and motivation for learning mathematics - literature review. journal of information and organizational sciences, 35(1), 15–30. retrieved from https://jios.foi.hr/index.php/jios/article/view/182
easterbrook, p. j., gopalan, r., berlin, j. a., & matthews, d. r. (1991). publication bias in clinical research. the lancet, 337(8746), 867–872. https://doi.org/10.1016/0140-6736(91)90201-y
echeverría, a., garcía-campo, c., nussbaum, m., gil, f., villalta, m., améstica, m., & echeverría, s. (2011). a framework for the design and integration of collaborative classroom games. computers & education, 57(1), 1127–1136. https://doi.org/10.1016/j.compedu.2010.12.010
fan, k.-k., xiao, p., & su, c. (2015). the effects of learning styles and meaningful learning on the learning achievement of gamification health education curriculum. eurasia journal of mathematics, science and technology education, 11(5), 1211–1229. https://doi.org/10.12973/eurasia.2015.1413a
furió, d., juan, m.-c., seguí, i., & vivó, r. (2015). mobile learning vs.
traditional classroom lessons: a comparative study. journal of computer assisted learning, 31(3), 189–201. https://doi.org/10.1111/jcal.12071
gee, j. p. (2007). what video games have to teach us about learning and literacy (revised and updated ed.). new york, ny: palgrave macmillan.
glass, g. v., mcgaw, b., & smith, m. l. (1981). meta-analysis in social research. beverly hills, ca: sage publications.
hamari, j., & keronen, l. (2017). why do people play games? a meta-analysis. international journal of information management, 37(3), 125–141. https://doi.org/10.1016/j.ijinfomgt.2017.01.006
hedges, l. v., & vevea, j. l. (1998). fixed- and random-effects models in meta-analysis. psychological methods, 3(4), 486–504. https://doi.org/10.1037/1082-989x.3.4.486
honey, m. a., & hilton, m. (eds.). (2011). learning science through computer games and simulations. washington, dc: national academy of sciences.
huang, b., hew, k. f., & lo, c. k. (2019). investigating the effects of gamification-enhanced flipped learning on undergraduate students’ behavioral and cognitive engagement. interactive learning environments, 27(8), 1106–1126. https://doi.org/10.1080/10494820.2018.1495653
khalili, n., sheridan, k., williams, a., clark, k., & stegman, m. (2011). students designing video games about immunology: insights for science learning. computers in the schools, 28(3), 228–240. https://doi.org/10.1080/07380569.2011.594988
kim, h., munson, m. r., & mckay, m. m. (2012). engagement in mental health treatment among adolescents and young adults: a systematic review. child and adolescent social work journal, 29(3), 241–266. https://doi.org/10.1007/s10560-012-0256-2
king, w. r., & he, j. (2005). understanding the role and methods of meta-analysis in is research.
communications of the association for information systems, 16, 665–686. https://doi.org/10.17705/1cais.01632
li, m.-c., & tsai, c.-c. (2013). game-based learning in science education: a review of relevant research. journal of science education and technology, 22(6), 877–898. https://doi.org/10.1007/s10956-013-9436-x
lipsey, m. w., & wilson, d. b. (2001). practical meta-analysis (vol. 49). thousand oaks, ca: sage publications.
mayo, m. j. (2007). games for science and engineering education. communications of the acm, 50(7), 30–35. https://doi.org/10.1145/1272516.1272536
mcgonigal, j. (2011). reality is broken: why games make us better and how they can change the world. new york, ny: penguin.
mertens, d. m. (2015). research and evaluation in education and psychology (4th ed.). london: sage publication.
nakagawa, s., & cuthill, i. c. (2007). effect size, confidence interval and statistical significance: a practical guide for biologists. biological reviews, 82(4), 591–605. https://doi.org/10.1111/j.1469-185x.2007.00027.x
nickell, s. (1981). biases in dynamic models with fixed effects. econometrica, 49(6), 1417–1426. https://doi.org/10.2307/1911408
nilsson, e. m., & jakobsson, a. (2011). simulated sustainable societies: students’ reflections on creating future cities in computer games. journal of science education and technology, 20(1), 33–50. https://doi.org/10.1007/s10956-010-9232-9
okeke, g. n. (2016). the impact of digital games on high school students’ learning outcomes in mathematics education: a meta-analytic investigation. doctoral thesis, university of north texas, denton, tx.
olejnik, s., & algina, j. (2000). measures of effect size for comparative studies: applications, interpretations, and limitations. contemporary educational psychology, 25(3), 241–286. https://doi.org/10.1006/ceps.2000.1040
ones, d. s., viswesvaran, c., & schmidt, f. l. (1993).
comprehensive meta-analysis of integrity test validities: findings and implications for personnel selection and theories of job performance. journal of applied psychology, 78(4), 679–703. https://doi.org/10.1037/0021-9010.78.4.679
orwin, r. g. (1983). a fail-safe n for effect size in meta-analysis. journal of educational statistics, 8(2), 157–159. https://doi.org/10.2307/1164923
papastergiou, m. (2009). digital game-based learning in high school computer science education: impact on educational effectiveness and student motivation. computers & education, 52(1), 1–12. https://doi.org/10.1016/j.compedu.2008.06.004
peters, j. l., sutton, a. j., jones, d. r., abrams, k. r., & rushton, l. (2006). comparison of two methods to detect publication bias in meta-analysis. jama, 295(6), 676. https://doi.org/10.1001/jama.295.6.676
prensky, m. (2001). digital natives, digital immigrants part 1. on the horizon, 9(5), 1–6. https://doi.org/10.1108/10748120110424816
quandt, t., van looy, j., vogelgesang, j., elson, m., ivory, j. d., consalvo, m., & mäyrä, f. (2015). digital games research: a survey study on an emerging field and its prevalent debates. journal of communication, 65(6), 975–996. https://doi.org/10.1111/jcom.12182
rothstein, h. r., sutton, a. j., & borenstein, m. (eds.). (2006). publication bias in meta-analysis: prevention, assessment and adjustments. hoboken, nj: john wiley & sons.
sacks, h. s., berrier, j., reitman, d., ancona-berk, v. a., & chalmers, t. c. (1998). meta-analyses and large randomized, controlled trials. new england journal of medicine, 338(1), 59–62. https://doi.org/10.1056/nejm199801013380112
sánchez, j., & olivares, r. (2011). problem solving and collaboration using mobile serious games. computers & education, 57(3), 1943–1952.
https://doi.org/10.1016/j.compedu.2011.04.012
schmidt, f. l., & hunter, j. e. (2014). methods of meta-analysis: correcting error and bias in research findings. london: sage publications.
sjöblom, m., törhönen, m., hamari, j., & macey, j. (2017). content structure is king: an empirical study on gratifications, game genres and content type on twitch. computers in human behavior, 73, 161–171. https://doi.org/10.1016/j.chb.2017.03.036
spires, h. a., rowe, j. p., mott, b. w., & lester, j. c. (2011). problem solving and game-based learning: effects of middle grade students’ hypothesis testing strategies on learning outcomes. journal of educational computing research, 44(4), 453–472. https://doi.org/10.2190/ec.44.4.e
sterne, j. a. c., sutton, a. j., ioannidis, j. p. a., terrin, n., jones, d. r., lau, j., … higgins, j. p. t. (2011). recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. bmj, 343, d4002. https://doi.org/10.1136/bmj.d4002
sterne, j. a. c., & egger, m. (2001). funnel plots for detecting bias in meta-analysis. journal of clinical epidemiology, 54(10), 1046–1055. https://doi.org/10.1016/s0895-4356(01)00377-8
su, c.-h., & cheng, c.-h. (2015). a mobile gamification learning system for improving the learning motivation and achievements. journal of computer assisted learning, 31(3), 268–286. https://doi.org/10.1111/jcal.12088
sung, h.-y., & hwang, g.-j. (2013). a collaborative game-based learning approach to improving students’ learning performance in science courses. computers & education, 63, 43–51. https://doi.org/10.1016/j.compedu.2012.11.019
sutton, a. j., abrams, k. r., jones, d. r., sheldon, t. a., & song, f. (2000). methods for meta-analysis in medical research. chichester: john wiley & sons.
ting, y. l. (2010). using mainstream game to teach technology through an interest framework. journal of educational technology & society, 13(2), 141–152.
retrieved from https://scholar.lib.ntnu.edu.tw/en/publications/using-mainstream-game-to-teach-technology-through-an-interest-fra
tsay, c. h.-h., kofinas, a., & luo, j. (2018). enhancing student learning experience with technology-mediated gamification: an empirical study. computers & education, 121, 1–17. https://doi.org/10.1016/j.compedu.2018.01.009
van eck, r. (2006). digital game-based learning: it’s not just the digital natives who are restless. educause review, 41(2), 16–30. retrieved from https://er.educause.edu/articles/2006/1/digital-gamebased-learning-its-not-just-the-digital-natives-who-are-restless
vogel, j. j., vogel, d. s., cannon-bowers, j., bowers, c. a., muse, k., & wright, m. (2006). computer gaming and interactive simulations for learning: a meta-analysis. journal of educational computing research, 34(3), 229–243. https://doi.org/10.2190/flhv-k4wa-wpvq-h0ym
wrzesien, m., & raya, m. a. (2010). learning in serious virtual worlds: evaluation of learning effectiveness and appeal to students in the e-junior project. computers & education, 55(1), 178–187. https://doi.org/10.1016/j.compedu.2010.01.003
yien, j.-m., hung, c.-m., hwang, g.-j., & lin, y.-c. (2011). a game-based learning approach to improving students’ learning achievements in a nutrition course. tojet: the turkish online journal of educational technology, 10(2). retrieved from http://www.tojet.net/articles/v10i2/1021.pdf
young, m. f., slota, s., cutter, a. b., jalette, g., mullin, g., lai, b., … yukhymenko, m. (2012). our princess is in another castle: a review of trends in serious gaming for education. review of educational research, 82(1), 61–89.
https://doi.org/10.3102/0034654312436980

copyright © 2021, reid (research and evaluation in education), 7(2), 2021 issn: 2460-6995 (online)
reid (research and evaluation in education), 7(2), 2021, 118-131
available online at: http://journal.uny.ac.id/index.php/reid

empirical lecturers’ and students’ satisfaction assessment in e-learning systems based on the usage metrics

sulis sandiwarno*
universitas mercu buana, indonesia
*corresponding author. e-mail: sulis.sandiwarno@mercubuana.ac.id

article info

article history: submitted: 25 march 2021; revised: 14 november 2021; accepted: 6 december 2021
keywords: e-learning; satisfaction; usage-based metrics; sus

abstract

during the covid-19 pandemic, e-learning systems have been widely used to facilitate teaching and learning processes between lecturers and students. assessing lecturers’ and students’ satisfaction with e-learning systems has become essential in improving the quality of education for higher learning institutions. most existing approaches have attempted to assess users’ satisfaction based on the system usability scale (sus). on the other hand, different studies proposed usage-based metrics (completion rate, task duration, and mouse or cursor distance) which assess users’ satisfaction based on how they use and interact with the system. however, the cursor or mouse distance metric does not consider the effectiveness of navigation in e-learning systems, and such approaches measure either lecturers’ or students’ satisfaction independently. towards this end, we propose a lostness metric to replace the click or cursor distance metric for assessing lecturers’ and students’ satisfaction with using e-learning systems. furthermore, to obtain a deep analysis of users’ satisfaction, we combine the usage-based metrics (i.e., completion rate, task duration, and lostness) with the sus metric. the evaluation results indicate that the proposed approach can precisely predict users’ satisfaction with e-learning systems.

this is an open access article under the cc-by-sa license. https://creativecommons.org/licenses/by-sa/4.0/

how to cite: sandiwarno, s. (2021). empirical lecturers’ and students’ satisfaction assessment in e-learning systems based on the usage metrics. reid (research and evaluation in education), 7(2), 118-131. doi:https://doi.org/10.21831/reid.v7i2.39642

introduction

information technology has been widely used in education to help lecturers and students enhance communication and interaction during the learning process. e-learning extends the use of information technology to move the education process from conventional learning to electronic-based learning (caputi & garrido, 2015; sadikin, 2017; sadikin et al., 2016). hong et al. (2017) defined e-learning as online learning which provides a collaborative means to achieve knowledge, creation, and interactions among lecturers and students. the participation of lecturers and students is key to achieving a desirable outcome in higher-level learning (kim, 2013). an e-learning system helps lecturers as well as students to work and communicate collaboratively using web technology tools across different times and places (casamayor et al., 2009; gameel, 2017). moreover, an e-learning system provides a new approach to orienting the learner in learning processes and is convenient to use anytime and anywhere (navimipour & zareie, 2015). discussion is a form of interaction whereby users are responsible for learning activities and contribute to the e-learning system (asoodar et al., 2016a; haron et al., 2017; lin, 2018; zhang et al., 2017). to make communication in a forum successful based on students’ and lecturers’ feedback in the learning process, users carry out activities in the e-learning system such as knowledge sharing (uploading course materials) and problem-solving (horvat et al., 2015; koohang et al., 2016; sandiwarno, 2016). the quiz is also an indicator used by lecturers to see the performance of students in the learning process. in the e-learning system, lecturers can upload questions such as multiple choice or essay questions and then give students time to answer and finish them. quiz assessments are usually done weekly, as auto-graded and peer-graded assignments (sun, 2016).

to facilitate the learning process, several platforms, such as learning management systems (lms), have been proposed to support e-learning activities and help users. moodle is one of the most common lms platforms and is mostly used for assisting users in the learning process (liberona & fuenzalida, 2014). moodle is publicly available lms software and one of the pertinent e-learning systems widely used in learning institutions (ifinedo et al., 2018; muñoz et al., 2017). kerimbayev et al. (2017) used moodle to share materials and users’ knowledge to increase motivation in an online course.

assessing users’ satisfaction with the e-learning system is necessary because it highlights how satisfied users are with using the system. satisfaction is an emotional state of users that can be viewed as a judgement based on personal experiences with and beliefs about a product. moreover, satisfaction is an important indicator of the effectiveness of the learning process between lecturers and students. most previous approaches have attempted to assess users’ satisfaction in e-learning systems.
for instance, almarashdeh (2016) measured lecturers’ satisfaction using questionnaires that separate users by criteria, gender, and age. asoodar et al. (2016b) assessed users’ satisfaction with the learning process (i.e., course dimension, technology dimension, and system design) using an anonymous questionnaire and regression analysis. their evaluation results show that the approach can be used to explain and describe users’ satisfaction in the learning process. cohen and baruth (2017) used an anonymous questionnaire and analysis of variance (anova) to evaluate differences in users’ satisfaction among groups of online learners with different personalities. their results indicate that the approach can be used to evaluate users’ satisfaction. according to chen and adesope (2016), measuring users’ satisfaction in an e-learning system involves different aspects, such as technology, user criteria, and features of web-based systems. ku et al. (2013) contend and demonstrate that users’ satisfaction in an e-learning system can be measured through teamwork, which means dividing the learning participants into two teams of students.

moreover, several usability methods can be used to assess users’ satisfaction when they interact with e-learning systems, namely usage-based metrics (i.e., completion rate, task duration, and lostness). usability measures how easily users can use a product to achieve their goals effectively, efficiently, and with satisfaction. berkman et al. (2018) defined usability as a tool for evaluating software products from the users’ subjective perspective, using standardized questionnaires with confirmed reliability. harrati et al. (2016) described completion rate (notated as cr) as a metric that measures the percentage of users who successfully finish the activities of a specific task in the e-learning system.
a high completion rate indicates that users successfully completed the assigned tasks, whereas a low score implies that users did not complete some of the tasks. task duration (notated as td) is a metric that measures the total time users require to finish the tasks. task time is usually measured in minutes for long activities and in seconds for short activities (curcio et al., 2019). lostness is a metric that measures how efficiently participants navigate the web pages they visit, step by step, to complete a task (ahn et al., 2018; curcio et al., 2019). therefore, completion rate, task time, and lostness respectively describe to what extent users successfully finished each task, how long they took to complete it, and how far their navigation departed from the minimum number of steps needed to finish it.

harrati et al. (2016) assessed lecturers’ satisfaction based on usage-based metrics (i.e., completion rate, task duration, and cursor distance or mouse clicks) and the system usability scale (sus) metric. sus is a metric that assesses users’ satisfaction based on questionnaires. cursor distance is a metric that assesses the effort users make in the system by measuring how far they move the cursor across the screen. the authors measured the correlation between the completion rate and sus metrics using the pearson correlation coefficient (pcc) and found that the two are correlated. although the previous approaches have assessed users’ satisfaction and obtained good results, the sus questionnaire alone is not sufficient for expressing the level of users’ satisfaction.
additionally, the previous approaches assess lecturers and students separately. moreover, they do not evaluate the effectiveness of navigation in e-learning systems.

to this end, in this paper, we propose an approach that assesses both lecturers’ and students’ satisfaction in using an e-learning system, unlike other works, which consider lecturers or students separately. moreover, we propose a lostness metric, which is part of the usage-based metrics, to replace cursor distance or mouse clicks. the choice of this metric was motivated by previous work (ahn et al., 2018). to the best of our knowledge, this paper is the first attempt to assess users’ satisfaction with the lostness metric added. our proposed approach consists of two parts: (1) employing usage-based metrics to assess users’ satisfaction based on task modelling and (2) usability data analysis based on the sus metric. task modelling is used to capture users’ activities and track their navigation. in addition, we use well-known usability metrics to assess lecturers’ and students’ satisfaction. further, we analyze the correlation between the results of the usage-based metrics and the sus metric.

the main contributions of this study are summarized as follows. first, we propose a new way to assess lecturers’ and students’ satisfaction based on usage-based metrics with an added lostness metric. second, the proposed approach is evaluated with data from users of an e-learning system. third, we compare the usage-based metrics with the sus metric and examine their correlation. the evaluation results indicate that the correlation is significant.

the rest of this paper is organized as follows. in section 2, we highlight several related works in assessing users’ satisfaction. section 3 describes the research method of our study. section 4 presents the results and discussion.
in section 5, we conclude the paper and highlight future work.

method

in this section, we present an approach for assessing users’ satisfaction in an e-learning system (moodle). moodle version 3.6.2 was installed on a remotely accessible web server, together with a logger scriptlet integrated into the html pages of the website. to assess users’ satisfaction empirically, we divided users into two groups (trained and non-trained) in order to assess the influence of user training on the level of satisfaction. trained users are those who have experience with and are familiar with using an e-learning system, whereas non-trained users are those who have no experience with or are not familiar with using one. figure 1 shows the framework of the proposed approach, which has two steps. first, we collected the activity logs of users from the e-learning system, covering activities such as discussion forums, quizzes, and uploading educational materials (e.g., documents, music, and pictures), and recorded all user activities in a database. second, to support the satisfaction assessment, the users filled in a sus questionnaire. the following subsections present each of the key steps of the proposed approach in detail.

figure 1. the framework of the proposed approach

usage-based metrics

to assess the level of satisfaction based on usage-based metrics, we define a series of tasks (task descriptor) that lecturers and students generally carry out in the e-learning system. the task descriptors of lecturers and students are similar, but some tasks are different.
the task descriptor (task modelling) of lecturers consists of logging in, opening and choosing the course, creating a discussion forum, responding to the discussion forum, and uploading a quiz. the task descriptor of students consists of logging in, opening and choosing the course, responding to the discussion forum, uploading to the discussion forum, and responding to quizzes. the activities performed by lecturers and students are the same in task 1 and task 2. in task 1, users open the e-learning system by typing its address in the browser address bar. the start page shows a login form, which users fill in with their login credentials. this step is a validation process: users who are registered, that is, who hold credentials for the e-learning system, are given access to it. after a successful login, users reach the main menu of the e-learning system. to create a discussion forum or quiz for students, the lecturer opens the forum or quiz page, fills out the corresponding form, and uploads materials for discussion. after the lecturer creates a forum, students can interact with the lecturer. once students have posted to the forum, the lecturer can give feedback on their posts, and the students can in turn reply to the lecturer’s feedback.

to acquire data on the courses in the e-learning system, we collected the data logs of the e-learning system and embedded javascript code in it to collect data on the activities performed by lecturers and students in the course. the events were recorded by javascript.
to support the assessment of users’ satisfaction based on their activities, we employ common usage-based metrics: completion rate, task duration, and lostness.

completion rate

completion rate is a metric that measures the success of the activities performed by users, calculated by formula (1). it ranges from 0% (failure) to 100% (success) (harrati et al., 2016; tullis & albert, 2013).

$$\text{completion rate} = \frac{\text{number of successfully completed tasks}}{\text{total number of tasks undertaken}} \times 100\% \qquad (1)$$

task time

task time is a common way to measure the usability of a product. it is simply the time elapsed between the start of a task (st) and the end of a task (ft), usually expressed in minutes and seconds, calculated as in formula (2) (tullis & albert, 2013).

$$\text{task time} = ft - st \qquad (2)$$

lostness

lostness is calculated with formula (3), where n is the number of different web pages visited while performing the task, s is the total number of pages visited while performing the task, counting revisits, and r is the minimum number of pages that must be visited to finish the task.

$$\text{lostness} = \sqrt{\left(\frac{n}{s} - 1\right)^2 + \left(\frac{r}{n} - 1\right)^2} \qquad (3)$$

system usability scale (sus)

following alghannam et al. (2017), a sus questionnaire with 10 questions was used, in which each question addresses the sus concept, the odd-numbered items are positive statements, and the even-numbered items are negative statements.
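as a hedged illustration, the usage-based metrics of formulas (1) to (3) and the sus score can be sketched in python. this is a minimal sketch under stated assumptions, not the implementation used in the study: all function and parameter names are hypothetical, the lostness expression is assumed to be the measure of smith (1996) that the paper cites for its threshold, and the sus scoring assumes the standard convention in which an odd (positive) item contributes its response minus 1 and an even (negative) item contributes 5 minus its response.

```python
import math


def completion_rate(completed: int, attempted: int) -> float:
    """formula (1): percentage of undertaken tasks that were completed."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return 100.0 * completed / attempted


def task_time(start: float, finish: float) -> float:
    """formula (2): time elapsed between task start (st) and task end (ft)."""
    return finish - start


def lostness(unique_pages: int, total_pages: int, minimum_pages: int) -> float:
    """formula (3), assumed to be smith's (1996) lostness measure.

    unique_pages  -> n, different pages visited
    total_pages   -> s, all page visits, counting revisits
    minimum_pages -> r, fewest pages needed to finish the task
    """
    if unique_pages <= 0 or total_pages <= 0:
        raise ValueError("page counts must be positive")
    n, s, r = unique_pages, total_pages, minimum_pages
    return math.sqrt((n / s - 1) ** 2 + (r / n - 1) ** 2)


def sus_score(responses: list[int]) -> float:
    """standard sus scoring for ten likert responses (1 to 5)."""
    if len(responses) != 10 or any(not 1 <= x <= 5 for x in responses):
        raise ValueError("expected ten responses between 1 and 5")
    total = 0
    for item, x in enumerate(responses, start=1):
        # odd items are positive statements, even items are negative ones
        total += (x - 1) if item % 2 == 1 else (5 - x)
    return 2.5 * total


# hypothetical user: 6 distinct pages, 8 page visits, 4 pages strictly needed
print(round(lostness(6, 8, 4), 3))   # 0.417, below the 0.5 threshold
print(completion_rate(3, 4))         # 75.0
print(sus_score([3] * 10))           # 50.0
```

a perfectly efficient path (n = s = r) gives a lostness of 0, which is why lower values indicate better navigation.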
the respondents answer on a five-point likert scale that runs from strongly disagree (1) to strongly agree (5). each item’s score contribution ranges from 0 to 4. the sum of the contributions is multiplied by 2.5 to obtain the overall sus score, so the score of each respondent ranges from 0 to 100, as in formula (4), where x_i is the response to item i.

$$\text{sus} = 2.5 \times \left[ \sum_{i \in \{1,3,5,7,9\}} (x_i - 1) + \sum_{i \in \{2,4,6,8,10\}} (5 - x_i) \right] \qquad (4)$$

experimental setup

in this section, we explain the experimental setup of the two methods: the usage-based metrics and the sus metric.

usage-based metrics and sus metrics

for evaluation, we compared the proposed method against approaches that use only completion rate (notated as cr), task duration (notated as td), lostness (notated as l), or the sus metric. note that users are considered satisfied when cr is in the range of 70%-100% (harrati et al., 2016; tullis & albert, 2013), when lostness is less than 0.5 (smith, 1996; tullis & albert, 2013), and when the sus score is not less than 70% (alghannam et al., 2017).

data collection from e-learning

the data of this study are the logs of the activities performed by lecturers and students in online courses at a university located in jakarta, indonesia. the lecturers were grouped by gender, age, and academic qualification, whereas the students were grouped by gender and age. the total number of users is 1906, of which 50 are lecturers and 1856 are students.

table 1. the distribution of lecturers and students

lecturers                                           n-data   %
gender                   male                       30       60
                         female                     20       40
age distribution         25-35                      25       50
                         36-45                      10       20
                         46-56                      10       20
                         57-67                      5        10
academic qualification   junior lecturers (aa)      25       50
                         senior lecturers (l)       15       30
                         associate professor (lk)   5        10
                         professor (prof.)          5        10

students                                            n-data   %
gender                   male                       464      50
                         female                     464      50
age distribution         17-19                      464      50
                         20-22                      464      50

table 1 shows the distribution of lecturers and students. the lecturers consist of junior lecturers (asisten ahli or aa), senior lecturers (lektor or l), associate professors (lektor kepala or lk), and professors (prof.).

to acquire data on the courses in the e-learning system, we collected the data logs of the e-learning system (moodle) and embedded javascript in it to collect data on the activities performed by lecturers and students in the course. the events recorded by javascript include items such as how many clicks users make to reach the intended page; before the users used the e-learning system, we identified the minimum number of link clicks required to reach each part of the system.

findings and discussion

in this section, we present the results obtained from the experiment based on the usage-based metrics and the sus metric, and then discuss them.

findings

usage-based metrics evaluation

table 2 presents the completion rate, duration, and lostness results for the two groups of lecturers. on average, the trained lecturers completed the assigned activities successfully in every task except tasks 4 and 5: eight and 14 of the 50 lecturers failed to fully complete the assigned activities in task 4 and task 5, respectively. for the non-trained lecturers, the majority were able to complete the assigned activities, but some of them failed to fully complete the activities in all the tasks.
moreover, the results generally suggest that the trained lecturers spent less time completing the tasks (an average of 16 minutes for the longest task, task 5), whereas non-trained lecturers spent about 37 minutes on the same task. finally, the recorded lostness values suggest that the trained lecturers navigated the system more efficiently than the non-trained lecturers. in addition, table 3 provides the detailed results of the lecturers by biographical information. the comparison of the trained and non-trained lecturers on the three usage-based metrics is further depicted in figure 2.

table 2. lecturers’ satisfaction assessment

                    task 1   task 2   task 3   task 4   task 5
completion rate
  trained           100      100      100      97.5     93.8
  non-trained       64.9     73.2     68.9     55.8     77.5
  sd                24.82    18.95    21.99    29.49    11.53
duration
  trained           3.18     3.32     11.14    12.32    15.65
  non-trained       8.39     12.09    35.03    30.24    37.65
  sd                3.68     6.2      16.89    12.67    15.56
lostness
  trained           0.03     0.13     0.16     0.25     0.14
  non-trained       0.37     0.44     0.53     0.55     0.66
  sd                0.24     0.22     0.26     0.21     0.37

table 3. lecturers’ satisfaction assessment based on gender, age, and academic qualification

                    gender            age                                 academic qualification
                    male     female   26-35    36-45    46-56    57-67    aa       l        lk       prof.
completion rate
  trained           99.08    97.68    98.7     99.24    96.24    99.24    98.7     99       95.64    96.94
  non-trained       67.44    68.98    65.9     62.68    67.38    65.56    68.64    65.1     60.76    73.86
  sd                18.28    15.81    18.76    20.12    16.52    18.24    17.83    18.36    19.78    14.63
duration
  trained           9.31     9.64     9.32     8.87     8.92     8.92     9.32     8.94     9.14     8.42
  non-trained       24.48    24.98    24.51    24.73    24.92    24.93    24.51    24.76    24.48    25.27
  sd                12.67    12.91    12.78    12.61    12.73    13.01    12.78    12.7     12.54    13.01
lostness
  trained           0.19     0.09     0.17     0.19     0.07     0.14     0.17     0.19     0.12     0.05
  non-trained       0.5      0.53     0.53     0.5      0.65     0.57     0.53     0.5      0.61     0.67
  sd                0.19     0.25     0.2      0.2      0.3      0.26     0.2      0.2      0.28     0.34

figure 2. trained and non-trained lecturers’ comparison

table 4 reports the completion rate, duration, and lostness results for the two groups of students. a total of 1,856 students were divided into two groups (trained and non-trained) of 928 students each. generally, the results suggest that on average all of the 928 trained students completed the assigned activities successfully in all five tasks. furthermore, a total of 25 non-trained students (3% = 25/928) were unable to fully complete the assigned activities in all five tasks. more generally, most of the non-trained students did not fully complete the tasks, resulting in an average completion rate of 60%. moreover, the trained students spent less time completing the tasks (an average of 13.5 minutes for the longest task, task 5), whereas non-trained students spent about 31 minutes on the same task. finally, the reported lostness values indicate that the trained students navigated the system more efficiently than the non-trained students.
table 5 further provides the detailed results of the students by biographical data. the comparison of the trained and non-trained students on the three usage-based metrics is depicted in figure 3.

in summary, the results indicate a significant difference between trained and non-trained users (lecturers and students) on the three metrics (completion rate, duration, and lostness). the results suggest that trained users are significantly better at, and more satisfied with, using the e-learning system than non-trained users. furthermore, it is worth noting that trained students generally reported better results than trained lecturers, whereas non-trained students and non-trained lecturers reported almost similar results.

table 4. students’ satisfaction assessment

                    task 1   task 2   task 3   task 4   task 5
completion rate
  trained           100      100      100      100      100
  non-trained       60.4     64.6     46.1     51.5     59.58
  sd                28       25.03    38.1     34.29    28.58
duration
  trained           2.9      3.1      11.16    12.65    13.55
  non-trained       16.69    14.95    31.49    29.71    30.78
  sd                9.75     8.38     14.38    12.06    12.18
lostness
  trained           0.08     0.16     0.14     0.18     0.18
  non-trained       0.38     0.48     0.51     0.51     0.61
  sd                0.21     0.23     0.26     0.23     0.3

table 5. students’ satisfaction assessment based on gender and age

                    gender            age
                    male     female   18-20    21-22
completion rate
  trained           100      100      100      100
  non-trained       56.24    56.7     47.38    33.7
  sd                23.56    23.44    27.44    35.45
duration
  trained           8.49     8.85     8.49     10.34
  non-trained       26.35    27.12    24.32    25.13
  sd                10.25    11.09    10.25    11.09
lostness
  trained           0.14     0.14     0.12     0.12
  non-trained       0.49     0.49     0.45     0.59
  sd                0.2      0.2      0.19     0.27

figure 3. trained and non-trained students’ comparison

system usability scale

the system usability analysis aimed to quantify how students and lecturers perceive the usability of the e-learning system. table 6 presents the sus results for trained and non-trained lecturers by gender, age, and academic qualification. on average, the trained lecturers reported sus scores of more than 90%, whereas non-trained lecturers reported 69%, with an average standard deviation (sd) of 15.02. as table 6 shows, trained lecturers perceived the system as more usable than non-trained lecturers did.

table 6. lecturers’ sus analysis

                    gender            age                                 academic qualification
                    male     female   26-35    36-45    46-56    57-67    aa       l        lk       prof.
sus (%)
  trained           90       92.86    90       95       90.42    92.5     88       94.64    85       86.25
  non-trained       69.47    67.08    68.96    70       68.75    74.17    69.79    71.94    66.25    65.83
  sd                14.52    18.23    14.88    17.68    15.32    12.98    12.88    16.05    13.26    14.44

furthermore, table 7 reports the usability analysis of students by gender and age. the average sus score was 87.69% for trained students and 69.84% for non-trained students, with an average sd of 12.62. like the trained lecturers, the trained students reported higher sus scores than the non-trained students, which implies that trained students perceived the system as more usable. finally, the overall results for both lecturers and students suggest that lecturers rated the usability of the system higher than the students did. in other words, the lecturers were more satisfied with it than the students.

table 7. students’ sus analysis

                    gender            age
                    male     female   18-20    21-22
sus (%)
  trained           86.18    88.27    87.49    88.81
  non-trained       69.42    70.31    69.16    70.47
  sd                11.84    12.7     12.96    12.97

the cronbach’s alpha (α), which refers to the reliability of the assessment, is estimated at 0.865 for all task scores. this indicates that the sus questionnaire is a strongly reliable instrument for e-learning evaluation according to borkowska and jach (2017), who argued that the internal consistency measured by α should reach a value above 0.8.

comparison between sus and lostness

the study further compared sus and lostness to determine whether the perceived satisfaction with the system reflects users’ actual performance when using it. we compared sus scores against lostness because sus reveals how users rate the ease of using the system, whereas lostness reflects to what extent users were able to use the system in practice, by measuring the ease of navigating within it. we compared both lostness and completion rate with the sus metric and examined the correlation in each case. the pearson correlation coefficient (pcc) between lostness and sus averaged r = 0.658, and the pcc between completion rate and sus averaged r = 0.736. figure 4 depicts the comparison of sus and lostness for lecturers and students. the results generally suggest a close correlation between sus and lostness.

figure 4. lostness versus sus in lecturers and student activities

for example, in figure 4, the trained lecturers reported higher sus scores than the non-trained lecturers, which implies they perceived the system as easy to use, and their lostness scores were correspondingly lower. therefore, the sus assessment fairly evaluated the system. the same pattern was observed for trained and non-trained students, as can be seen in figure 4. moreover, it is worth noting that there were no significant differences between the results reported for sus and lostness across age, gender, and academic qualification.

discussion

this research presents an approach for evaluating satisfaction with an e-learning system based on the sus and usage-based metrics. the experimental results show that trained users are skilled and more experienced in using the e-learning system, while not all non-trained users have experience in using it. users’ satisfaction was measured with widely used metrics: completion rate, task time, lostness, and sus. table 6 shows that trained lecturers reported an average sus score of 90.46% and non-trained lecturers 69.2%. table 7 shows that trained students reported an average sus score of 87.69% and non-trained students 69.84%. combining lecturers and students, trained users reported an average of 89.08% and non-trained users an average of 69.53%. based on lostness in table 3, trained lecturers reported an average of 0.14 and non-trained lecturers 0.56. trained lecturers scored higher on sus than trained students: the average sus score was 90.5% for trained lecturers but 87.7% for trained students. further, the average lostness score was 0.14 for trained lecturers and 0.13 for trained students.
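the correlation analysis between sus and the usage-based metrics reported in the findings relies on the pearson correlation coefficient. a minimal sketch of that computation in python follows. the function name is hypothetical, and the input values are the per-subgroup figures for trained lecturers copied from table 3 (lostness) and table 6 (sus) purely for illustration; the paper does not state which data points entered its own pcc, so this sketch is not expected to reproduce the reported r values.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# illustrative inputs only: trained lecturers by subgroup (tables 3 and 6)
sus = [90, 92.86, 90, 95, 90.42, 92.5, 88, 94.64, 85, 86.25]
lostness = [0.19, 0.09, 0.17, 0.19, 0.07, 0.14, 0.17, 0.19, 0.12, 0.05]

r = pearson(sus, lostness)
print(f"r = {r:.3f}")
```

the coefficient always lies between -1 and 1; its sign shows whether the two metrics rise together or move in opposite directions.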
in summary, lostness and sus point in two directions: (1) based on lostness, trained students are more capable than trained lecturers in using the e-learning system, and (2) based on sus, trained lecturers are more satisfied with using the e-learning system than trained students. also, harrati et al. (2016) and tullis and albert (2013) argued that a sus result of 70-100% indicates satisfaction.

in summary, this research makes a number of contributions to assessing users’ satisfaction in e-learning systems and explores factors that can affect the satisfaction level and interaction performance of university lecturers and students when using educational technologies. firstly, the results of the conducted experiments confirm that the sus metric alone is insufficient to reveal the true acceptance and level of users’ satisfaction in using an e-learning system. the sus evaluation should be carried out in tandem with the usage-based metrics. this would help to cluster the different lecturers and students and to conduct a deeper analysis of the reported usability results based on the participants’ actual performance. the satisfaction results obtained from sus questionnaires administered to a set of users can potentially be interpreted differently, because users may express their level of acceptance in different ways. in other words, are the lecturers and students satisfied because the e-learning system is easy to adopt, or because they are experiencing a new technological product, a learning management system, that they enjoy and feel happy about regardless of the expected results?
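the argument for reading sus in tandem with the usage-based metrics can be illustrated with a small check. the sketch below is hypothetical: the thresholds are the ones stated in the experimental setup (completion rate of at least 70%, lostness below 0.5, sus of at least 70%), but the combination rule, the class, and the label strings are assumptions made for illustration and are not the procedure used in the study.

```python
from dataclasses import dataclass

# thresholds taken from the experimental setup of the paper
CR_MIN = 70.0        # completion rate, percent
LOSTNESS_MAX = 0.5   # lostness must be below this value
SUS_MIN = 70.0       # sus score, percent


@dataclass
class UserMetrics:
    completion_rate: float
    lostness: float
    sus: float


def assess(m: UserMetrics) -> str:
    """combine perceived (sus) and observed (usage) evidence; assumed rule."""
    usage_ok = m.completion_rate >= CR_MIN and m.lostness < LOSTNESS_MAX
    sus_ok = m.sus >= SUS_MIN
    if usage_ok and sus_ok:
        return "satisfied"
    if sus_ok:
        return "perceived satisfaction only"
    if usage_ok:
        return "performance only"
    return "not satisfied"


# group averages copied from tables 3 and 6
trained_male = UserMetrics(completion_rate=99.08, lostness=0.19, sus=90)
non_trained_57_67 = UserMetrics(completion_rate=65.56, lostness=0.57, sus=74.17)

print(assess(trained_male))        # satisfied
print(assess(non_trained_57_67))   # perceived satisfaction only
```

the second case shows why the questionnaire alone can mislead: non-trained lecturers aged 57-67 pass the sus threshold even though their completion rate and lostness fall outside the satisfied range.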
the experimental results reveal that the distinct usage-based metrics, including task duration, completion rate, and lostness, play an equally important part in expressing and analyzing the usability of lecturers’ and students’ interaction. regarding factors related to the participants themselves, younger users showed greater motivation and skill in using technological products, whereas older users struggled to use the e-learning system. this is in line with a number of recent studies that arrived at the same conclusion (bringula, 2013; wagner et al., 2014) and argued that age has a pronounced impact on users’ performance. moreover, the lecturers with the highest academic qualifications reported performance with high completion rates. this is intuitively due to the correlation between age and academic qualification. although mentes and turan (2012) reported that gender is a factor that affects users’ performance, our results show that both genders and all ages have almost the same usage-based metrics, with small variances, except that male users reported higher self-rated satisfaction with the e-learning system. we noted that the usage metrics show that university lecturers and students struggled to interact with the e-learning platform when dealing with web pages that have rich graphical navigation views and tools.
this suggests partially poor usability of the lecturers’ and students’ interface, which should be improved at the stages where most lecturers and students fail to complete the e-learning tasks. meanwhile, minimal interfaces proved preferable for achieving goals with ease and consistency, which implies a correlation between task complexity on the one hand and task duration and web navigation on the other, with respect to the number of elements and options contained in the e-learning interface. moreover, lecturers and students expressed their willingness to adopt e-learning in the future to support online teaching, although they clearly require more practice and guidance on how to use the e-learning system.

conclusion

in this paper, an empirical study was conducted to assess the satisfaction of lecturers and students with using an e-learning system. in the experiment, we adopted a widely used e-learning system (moodle) to track users’ activities and evaluate them. we used four key metrics (completion rate, task time, lostness, and sus) to assess users’ satisfaction and to quantify users’ performance in using the e-learning system. the findings of this study reveal that trained students and trained lecturers are more satisfied with using the e-learning system than non-trained lecturers and students. the findings therefore suggest that formal training in using the e-learning system is essential to obtain satisfied and experienced users. in future work, we aim to expand the usage-based metrics to assess the speed and accuracy of communication within the forum between lecturers and students.

references

ahn, j., kim, k., & proctor, r. w. (2018). comparison of mobile web browsers for smartphones. journal of computer information systems, 58(1), 10-18. https://doi.org/10.1080/08874417.2016.1180652
alghannam, b. a., albustan, s. a., al-hassan, a. a., & albustan, l. a.
(2017). towards a standard arabic system usability scale: psychometric evaluation using communication disorder app. international journal of human–computer interaction, 34(9), 1–6. https://doi.org/10.1080/10447318.2017.1388099
almarashdeh, i. (2016). sharing instructors experience of learning management system: a technology perspective of user satisfaction in distance learning course. computers in human behavior, 63, 249–255. https://doi.org/10.1016/j.chb.2016.05.013
asoodar, m., vaezi, s., & izanloo, b. (2016a). framework to improve e-learner satisfaction and further strengthen e-learning implementation. computers in human behavior, 63, 704–716. https://doi.org/10.1016/j.chb.2016.05.060
asoodar, m., vaezi, s., & izanloo, b. (2016b). framework to improve e-learner satisfaction and further strengthen e-learning implementation. computers in human behavior, 63, 704–716. https://doi.org/10.1016/j.chb.2016.05.060
berkman, m. i̇., karahoca, d., & karahoca, a. (2018). a measurement and structural model for usability evaluation of shared workspace groupware. international journal of human-computer interaction, 34(1), 35–56. https://doi.org/10.1080/10447318.2017.1326578
borkowska, a., & jach, k. (2017). pre-testing of polish translation of system usability scale (sus). advances in intelligent systems and computing, 521, 143-153. https://doi.org/10.1007/978-3-319-46583-8_12
bringula, r. p. (2013). influence of faculty- and web portal design-related factors on web portal usability: a hierarchical regression analysis. computers and education, 68, 187-198. https://doi.org/10.1016/j.compedu.2013.05.008
caputi, v., & garrido, a. (2015). student-oriented planning of e-learning contents for moodle. journal of network and computer applications, 53, 115–127. https://doi.org/10.1016/j.jnca.2015.04.001
casamayor, a., amandi, a., & campo, m. (2009). intelligent assistance for teachers in collaborative e-learning environments. computers and education, 53(4), 1147–1154. https://doi.org/10.1016/j.compedu.2009.05.025
chen, p.-h., & adesope, o. (2016). the effects of need satisfaction on efl online learner satisfaction. distance education, 37(1), 89–106. https://doi.org/10.1080/01587919.2016.1155962
cohen, a., & baruth, o. (2017). personality, learning, and satisfaction in fully online academic courses. computers in human behavior, 72, 1–12. https://doi.org/10.1016/j.chb.2017.02.030
curcio, k., santana, r., reinehr, s., & malucelli, a. (2019). usability in agile software development: a tertiary study. computer standards and interfaces, 64, 61-77. https://doi.org/10.1016/j.csi.2018.12.003
gameel, b. g. (2017). learner satisfaction with massive open online courses. american journal of distance education, 31(2), 98–111. https://doi.org/10.1080/08923647.2017.1300462
haron, h., aziz, n. h. n., & harun, a. (2017). a conceptual model participatory engagement within e-learning community. procedia computer science, 116, 242–250. https://doi.org/10.1016/j.procs.2017.10.046
harrati, n., bouchrika, i., tari, a., & ladjailia, a. (2016). exploring user satisfaction for e-learning systems via usage-based metrics and system usability scale analysis. computers in human behavior, 61, 463–471. https://doi.org/10.1016/j.chb.2016.03.051
hong, j. c., tai, k. h., hwang, m. y., kuo, y. c., & chen, j. s. (2017). internet cognitive failure relevant to users’ satisfaction with content and interface design to reflect continuance intention to use a government e-learning system. computers in human behavior, 66, 353–362. https://doi.org/10.1016/j.chb.2016.08.044
horvat, a., dobrota, m., krsmanovic, m., & cudanov, m. (2015). student perception of moodle learning management system: a satisfaction and significance analysis. interactive learning environments, 23(4), 515–527. https://doi.org/10.1080/10494820.2013.788033
ifinedo, p., pyke, j., & anwar, a. (2018). business undergraduates’ perceived use outcomes of moodle in a blended learning environment: the roles of usability factors and external support. telematics and informatics, 35(1), 93–102. https://doi.org/10.1016/j.tele.2017.10.001
kerimbayev, n., kultan, j., abdykarimova, s., & akramova, a. (2017). lms moodle: distance international education in cooperation of higher education institutions of different countries. education and information technologies, 22(5), 2125–2139. https://doi.org/10.1007/s10639-016-9534-5
kim, j. (2013). influence of group size on students’ participation in online discussion forums. computers & education, 62, 123-129. https://doi.org/10.1016/j.compedu.2012.10.025
koohang, a., paliszkiewicz, j., gołuchowski, j., & nord, j. h. (2016). active learning for knowledge construction in e-learning: a replication study. journal of computer information systems, 56(3), 238–243. https://doi.org/10.1080/08874417.2016.1153914
ku, h. y., tseng, h. w., & akarasriworn, c. (2013). collaboration factors, teamwork satisfaction, and student attitudes toward online collaborative learning. computers in human behavior, 29(3), 922–929. https://doi.org/10.1016/j.chb.2012.12.019
liberona, d., & fuenzalida, d. (2014). use of moodle platforms in higher education: a chilean case. communications in computer and information science, 446, 124–134. https://doi.org/10.1007/978-3-319-10671-7_12
lin, j. w. (2018). effects of an online team project-based learning environment with group awareness and peer evaluation on socially shared regulation of learning and self-regulated learning. behaviour and information technology, 37(5), 445–461.
https://doi.org/10.1080/0144929x.2018.1451558 mentes, a., & turan, a. h. (2012). assessing the usability of university websites: an empirical study on namik kemal university. turkish online journal of educational technology, 11(3), 61-69. http://www.tojet.net/articles/v11i3/1136.pdf muñoz, a., delgado, r., rubio, e., grilo, c., & basto-fernandes, v. (2017). forum participation plugin for moodle: development and discussion. procedia computer science, 121, 982–989. https://doi.org/10.1016/j.procs.2017.11.127 navimipour, n. j., & zareie, b. (2015). a model for assessing the impact of e-learning systems on employees’ satisfaction. computers in human behavior, 53, 475–485. https://doi.org/10.1016/j.chb.2015.07.026 sadikin, m. (2017). mining relation extraction based on pattern learning approach. indonesian journal of electrical engineering and computer science, 6(1), 50-57. https://doi.org/10.11591/ijeecs.v6.i1.pp50-57 sadikin, m., fanany, m. i., & basaruddin, t. (2016). a new data representation based on training data characteristics to extract drug name entity in medical text. computational intelligence and neuroscience, 3483528. https://doi.org/10.1155/2016/3483528 10.21831/reid.v7i2.39642 sulis sandiwarno page 131 copyright © 2021, reid (research and evaluation in education), 7(2), 2021 issn: 2460-6995 (online) sandiwarno, s. (2016). perancangan model e-learning berbasis collaborative video conference learning guna mendapatkan hasil pembelajaran yang efektif dan efisien. jurnal ilmiah fifo, 8(2), 191-200. https://doi.org/10.22441/fifo.v8i2.1314 smith, p. a. (1996). towards a practical measure of hypertext usability. interacting with computers, 8(4), 365–381. https://doi.org/10.1016/s0953-5438(97)83779-4 sun, j. (2016). multi-dimensional alignment between online instruction and course technology: a learner-centered perspective. computers and education, 101, 102–114. https://doi.org/10.1016/j.compedu.2016.06.003 tullis, t., & albert, b. (2013). 
measuring the user experience: collecting, analyzing, and presenting usability metrics (2nd ed.). elsevier. https://doi.org/10.1016/c2011-0-00016-9 wagner, n., hassanein, k., & head, m. (2014). the impact of age on website usability. computers in human behavior, 37, 270-282. https://doi.org/10.1016/j.chb.2014.05.003 zhang, s., liu, q., chen, w., wang, q., & huang, z. (2017). interactive networks and social knowledge construction behavioral patterns in primary school teachers’ online collaborative learning activities. computers and education, 104, 1–17. https://doi.org/10.1016/j.compedu.2016.10.011 copyright © 2021, reid (research and evaluation in education), 7(2), 2021 issn: 2460-6995 (online) reid (research and evaluation in education), 7(2), 2021, 145-155 available online at: http://journal.uny.ac.id/index.php/reid developing assessment instruments of debate practice in indonesian language learning septiana farida*; farida agus setiawati universitas negeri yogyakarta, indonesia *corresponding author. e-mail: sfarida2590@gmail.com introduction speaking skill is a second language skill acquired by humans before having reading and writing skills, which is practical oral communication carried out on every individual in the social environment (simarmata & sulastri, 2018). so far, speaking skills as part of communication have not been noticed, often ignored, and not taken seriously so that many students are unable and dare not speak (isnaniar, 2013; morelent, 2012). speaking skills play a significant role in giving birth to a generation that is intelligent, critical, creative, and cultured (isnaniar, 2013). in practice, speaking skills involve more complex aspects (sari et al., 2016) and support other language skills (simarmata & sulastri, 2018). the method used when speaking or in rhetoric is known as the art of speaking in dialogue or monologue. 
the art of speaking in the form of dialogue is a speaking activity in which two or more people take part in a conversation (midun, 2017, p. 14). its forms are debate, discussion, question and answer, negotiation, and conversation. the art of speaking in the form of monologue involves only one speaker, as in speeches, lectures, declamations, and remarks. each type of speaking practice needs to be studied carefully and its components considered whenever it is evaluated in indonesian language learning in schools.

article info

article history
submitted: 22 august 2021
revised: 19 november 2021
accepted: 8 december 2021

keywords
assessment; instrument; debate; practice; indonesian language

abstract

this study aims to develop an instrument for assessing debate practice in indonesian language learning for class x of senior high school (sekolah menengah atas or sma/madrasah aliyah or ma). the theoretical construct of the instrument was established after reviewing several theories, including theories of speaking skills that apply to debate practice, especially those based on the australian debating federation. the non-test instrument was developed following the mardapi model for non-cognitive instruments. ten material experts (two lecturers and eight class x indonesian language teachers in sma/ma in yogyakarta special region) reviewed the draft instrument, and their ratings were analysed using the aiken formula to prove the content validity of the instrument. the draft instrument was also tried out by two raters/evaluators to assess debate practice. the results of this trial were used to calculate inter-rater reliability using cohen's kappa. the assessment instrument was declared reliable based on the kappa inter-rater reliability value, which was 0.678.
the final instrument after the exploratory factor analysis consists of 33 items, with adjustments to the composition of the dimensions of the statement items.

this is an open access article under the cc-by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

how to cite: farida, s., & setiawati, f. (2021). developing assessment instruments of debate practice in indonesian language learning. reid (research and evaluation in education), 7(2), 145-155. doi:https://doi.org/10.21831/reid.v7i2.43338

speaking is not merely uttering meaningless words; it requires technique, clear thoughts, and content (midun, 2017, p. 14). the technical components are breathing, voice building, reading, and storytelling techniques. clear thoughts and content, in turn, determine the weight of the substance conveyed when speaking, namely whether it shows high creative and imaginative power or knowledge and objective evidence (midun, 2017, p. 2014). therefore, learning speaking skills occupies an essential place (isnaniar, 2013), and at the end of the learning a practical evaluation is required to observe all these components.

in principle, each language skill is evaluated differently in schools. reading and writing skills are used in non-face-to-face communication, and they are evaluated in writing through cognitive evaluation of the learning. educators carry out and prioritise cognitive evaluation over affective or psychomotor evaluation. cognitive evaluation is used as the benchmark for assessment and holds the principal place (poerwanti et al., 2008, p. 23).
this emphasis is also evident from the national exam grid, which focuses on evaluating cognitive aspects at the elementary, junior high, and high school levels. the indonesian national exam indicators only describe the evaluation of limited literary and non-literary reading and writing competencies and spelling editing (badan standar nasional pendidikan, 2018). listening and speaking skills are evaluated in practice while teaching is in progress (isnaniar, 2013; nurgiyantoro, 2001, p. 7). in reality, however, the evaluation of these skills is often forced into cognitive assessment through theoretical questions. when it is carried out in practice, it is done without specific instrument guidelines. speaking skills, as a basic form of communication, are often considered easy competencies, both to perform and to assess (isnaniar, 2013; morelent, 2012). evaluating speaking skills is not an easy thing to do (sari et al., 2016, p. 2); it takes the form of non-test instruments, such as observation sheets, questionnaires, or assessment rubrics, and requires accuracy in the evaluation process.

based on observations and unstructured interviews conducted with indonesian teachers at man 3 sleman, man 2 kulonprogo, sma n 6 yogyakarta, and sma n 9 yogyakarta, the assessment of students' speaking skills was carried out at a glance without special instruments. "at a glance" here means that the assessment is guided by general aspects, namely intonation, expression, gesture, and mastery of the material, without elaborating the indicators of each aspect. these general aspects are used in every assessment of speaking practice, whether speech, sermon, negotiation, declamation, debate, or drama. each student certainly has different strengths in each component and deserves different values/weights of appreciation. in addition, brown (2015, p. 51) reports that fifteen of the sixteen students in the study commented that the use of debate in the classroom could improve collaborative skills or critical thinking skills during learning. debate is one of the arts of speaking in the form of dialogue learned in class x sma/ma, and it does not yet have a structured assessment instrument. therefore, assessment instruments made specifically to support each form of speaking practice are needed.

debate is a very complex speaking skill competency. in addition to involving a full team of participants, the flow of the debate also requires adequate competence in speaking strategies. speakers not only need to master the given motions so that they can speak frankly and politely, but they must also be able to convince, be critical, and at the same time refute the opponent's opinion (salim, 2015, p. 100). in addition, debate requires a reflective and neutral attitude and a critical examination of the arguments or evidence used. the debater must assess the problem with solid analysis, not just rely on the opponent's interpretation (o’connor et al., 2018, pp. 90–91). thus, speaking skills in the practice of debate include various components. these components become the basis for assessing the competence of each student when practising debate. in line with this, it is necessary to develop an instrument for assessing the competence of debating practice, realized as a non-test instrument in the form of an observation sheet (ghorbani et al., 2018).

the research by viswesh et al. (2018) aims to evaluate students' ability to make evidence-based decisions and presentations through debate activities (methods).
the results of that study indicate the readiness of team performance and students' skills as perceived through debate. the process observed in that study is similar to the procedure followed in the present development research. the process in question covers the various components and indicators that become the points of assessment in the debate, including the preparation of the materials and the methods of arguing used to succeed in debating. in addition, based on relevant theory and research results, it can be assumed that the instrument of debating practice ability consists of three factors: matter, method, and manner.

the problem is how to develop an instrument that can be used to assess the practice of debate in a valid and reliable manner. the instrument is an observation sheet to be filled out by the teacher when evaluating students' debating practice in class. the instrument is developed based on the theory of speaking that applies to the course of debate. it is intended for assessing kd 4.12 of indonesian language learning in the even semester of class x sma/ma, which concerns the problems/issues, points of view, and arguments of several parties and the conclusions of an oral debate, in order to show the essence of the debate.

the result of this study is an instrument for assessing the practice of debate that has good content validity and construct validity and good inter-rater (cohen's kappa) reliability. the result of the exploratory factor analysis (efa) shows that this debate instrument consists of three factors: matter, method, and manner. the matter factor consists of items 1 to 12, the method factor consists of items 13 to 24, and the manner factor consists of items 25 to 38. these three factors can explain the variance of debate practice by 100%.

method

this study is development research on debate practice instruments using djemari mardapi's non-cognitive instrument development model (mardapi, 2017).
according to various sources, the assessment of debate practice in debate contests around the world refers to three dimensions: matter, method, and manner (d’cruz, 2003; latif, n.d.; quinn, 2005). the three dimensions are judged by the adjudicators or debate experts (jurors). each dimension has a component description related to general speaking skills. this speaking skill material is used as the assessment indicator. the instrument statements are formulated from the components contained in each dimension of the debate.

the compiled instrument was tested twice, in a limited trial and a field trial. the trials were preceded by expert judgment validation involving ten material experts in indonesian language learning. after the validity values were obtained, the product was revised and subjected to a limited trial (murti, 2011, p. 20). the limited trial was conducted on 24 students from four schools in yogyakarta special region and involved two assessors, namely the indonesian language teacher of each school and the researchers. the limited trial yielded four inter-rater reliability values calculated with the kappa formula. the average of the four reliability values was then calculated to obtain the final inter-rater reliability value. furthermore, the same instrument was used in a field trial on 246 class x students from four schools in yogyakarta special region. the results of the field trial were analysed using exploratory factor analysis (efa) to obtain construct validity. the product then underwent a second revision based on the outcomes of the factor analysis and suggestions for modification, resulting in the final instrument.

the test subjects were students of class x, namely 24 students in the limited trial and 246 students in the field trial. the trial subjects were determined by purposive sampling, namely the class x students who practised debating.
subject determination was assisted by indonesian teachers from four different schools, namely man 3 sleman, sma n 6 yogyakarta, sma n 9 yogyakarta, and man 2 kulonprogo. the subjects of the field trial were as follows: 66 students of sma n 6 yogyakarta, 54 students of sma n 9 yogyakarta, 72 students of man 3 sleman, and 54 students of man 2 kulonprogo.

findings and discussion

the product of this development research is an instrument sheet for assessing the practice of debate in indonesian language learning for class x sma/ma. the instrument sheet was developed to be used by indonesian teachers in evaluating students when they carry out debating practice. it takes the form of a checklist observation sheet containing statements about the points to be observed when students argue.

in the initial development, the instrument specifications were drawn up, which included preparing the statement items and entering them into the instrument grid. there were 44 statement items in total. the grid was developed from three dimensions of debating practice assessment: matter, method, and manner. these dimensions were determined from various literature on the evaluation of debate practice, including that of international debate associations. furthermore, each dimension is classified into components, which give more specific indicators to be arranged into operational statement items. the dimension of matter, or material, consists of three components: motions, arguments, and facts from statements. the method dimension is also divided into three components: delivering ideas, submitting objections, and providing responses.
the dimension of manner, or attitude, is classified into the components of expression, appearance, and vocals. each of these components is further classified into more specific indicators, which are then reduced to statement items. the naming of the components in each dimension and their arrangement may still change depending on the results of the exploratory factor analysis of the field trial data.

the indicators derived from each component are reduced to item statements in more detail. the statements are written coherently on the instrument sheet according to the order of the components (chai et al., 2019). the product is a checklist observation sheet using a dichotomous score. the item statements are observed in students while they carry out debate practice. if a statement is found or observed, the instrument sheet is given a check in the "yes" column. if the statement is not found or observed, the instrument sheet is given a check in the "no" column. the checks in these columns also determine the score obtained by students, because each statement rated "yes" has a score of 1 and each statement rated "no" has a score of 0. the total score of the observed items can be grouped into categories of students' debating practice ability. from this categorisation, the score or predicate of students' level of proficiency in debating practice is known.

content validity

expert judgment validation was carried out before the product was used for testing. it involved two indonesian language and literature education lecturers and eight class x indonesian language teachers. expert validation aimed to obtain content validity values calculated using the aiken's v index formula. in the first revision, six items were dropped.
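the aiken's v index mentioned above can be illustrated with a short sketch. it assumes the standard form of the index, v = Σ(r − lo) / (n(c − 1)), with n = 10 raters and c = 4 rating categories as stated in this study, and it assumes that the lowest category is scored 1. the ratings in the example are hypothetical and are not the experts' actual scores.

```python
def aikens_v(ratings, lowest=1, categories=4):
    """Aiken's V for one item: sum(r - lowest) / (n * (categories - 1))."""
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one rating is required")
    s = sum(r - lowest for r in ratings)
    return s / (n * (categories - 1))


# hypothetical ratings from 10 experts on a 1-4 scale (illustration only)
example_ratings = [3, 3, 3, 3, 3, 3, 3, 3, 3, 4]
v = aikens_v(example_ratings)

# cut-off reported in this study for 10 raters and 4 categories
CUTOFF = 0.73
item_is_kept = v > CUTOFF

print(f"V = {v:.3f}, kept = {item_is_kept}")
```

with these hypothetical ratings the sum of (r − 1) is 21, so v = 21/30 = 0.700, which is below the 0.73 cut-off and would lead to the item being dropped.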
these items did not pass the validity test because, according to aiken's table, which refers to the number of raters, the item validity coefficient must exceed 0.73. this value of 0.73 is the reference value for a 4-scale instrument with 10 raters at an error rate of 5%. the six items are detailed in table 1.

table 1. details of dropped items in the first revision

| item statement number | aiken's validity value | indicator | information |
|---|---|---|---|
| 4 | 0.667 | the substance of the motion | safe to drop; other items represent the indicators |
| 13 | 0.700 | substance of facts | |
| 17 | 0.333 | argument statement | |
| 21 | 0.600 | parry against opponents | |
| 29 | 0.700 | eye contact | |
| 37 | 0.633 | costume | |

construct validity

the developed instrument, which had been declared content-valid and revised, was then used in a limited trial, in which the kappa inter-rater reliability calculation showed it to be satisfactory. the instrument was then tested on 246 students from four different schools. in this trial, each student debated in their role as a member of the pro or con group and was directly assessed using the revised assessment instrument. the students' score data per item, in the form of dichotomous 1-0 scores, were recapitulated and analysed using the spss program.

results of factor analysis items 1 to 12

items 1 to 12 represent the dimension of matter, which consists of the components of motion, argument, and facts of the argument. the items are arranged coherently from each component. the motion component is divided into two indicators: the formulation of the motion with two statements and the substance of the motion with one statement. the argument component is divided into the essence of the argument with two item statements and each speaker's opinion with three item statements.
the facts component has two indicators: the identity of the facts with three statements and the substance of the facts with one statement. the results of the kmo and bartlett’s test and the total variance explained are shown in figure 1 and figure 2.

figure 1. kmo for items 1-12
figure 2. total variance explained for items 1-12
figure 3. scree plot of the matter dimension

from figure 3, it is known that four factors have an eigenvalue > 1. the values of the rotated component matrix were then examined; a magnitude > 0.5 indicates the component into which an item tends to be grouped. the four factors grouped the statement items and led to the factor names presented in table 2.

table 2. the naming of efa result factors of the matter dimension

| component | number of statement items | percentage variance | factor naming |
|---|---|---|---|
| 1 | 9, 10, 11, 12 | 33.879 | fact of argument |
| 2 | 4, 5 | 19.435 | argument statement |
| 3 | 3, 6, 7 | 17.586 | contents of speaker's argument |
| 4 | 2 | 10.970 | introduction to arguments |

the analysis results show that the components developed in the instrument are largely in line with the reality in the field, although some improvements need to be made. item 1 can be dropped because, apart from being represented by item 2, the “motion formulation” in debate practice has generally been formulated before the debate. therefore, item 1 does not have to be observed in the implementation of debate practice because it is automatically present. item 8 can be dropped because each speaker refers to the same concept of argument, so the speakers cannot be separated, and each speaker strengthens the other speakers' arguments. therefore, arguments need not be restricted explicitly between speakers.
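the percentages in table 2 can be checked with a few lines of arithmetic. the sketch below adds up the variance explained by the four matter factors and converts each percentage into an implied eigenvalue. it rests on two assumptions that the text does not state: that all 12 matter items entered the analysis, and that the percentages are initial (unrotated) values from a principal component analysis of a correlation matrix, in which the total variance equals the number of items.

```python
# percentage of variance per factor, copied from table 2 (matter dimension)
variance_pct = {
    "fact of argument": 33.879,
    "argument statement": 19.435,
    "contents of speaker's argument": 17.586,
    "introduction to arguments": 10.970,
}

# ASSUMPTION (not stated in the text): all 12 matter items were analysed
N_ITEMS = 12

# total share of variance explained by the four retained factors
cumulative_pct = sum(variance_pct.values())

# ASSUMPTION: unrotated pca on a correlation matrix, so that
# eigenvalue = percentage / 100 * number of items
implied_eigenvalues = {
    name: pct / 100 * N_ITEMS for name, pct in variance_pct.items()
}

# kaiser criterion: retain factors whose eigenvalue exceeds 1
all_above_one = all(ev > 1 for ev in implied_eigenvalues.values())

print(f"cumulative variance: {cumulative_pct:.3f}%")
for name, ev in implied_eigenvalues.items():
    print(f"{name}: implied eigenvalue {ev:.3f}")
print("all retained factors exceed 1:", all_above_one)
```

under these assumptions the four factors together account for about 81.87% of the variance of the matter items, and each implied eigenvalue exceeds 1, which is consistent with the retention rule described above.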
results of factor analysis items 13-24

items 13-24 coherently represent the method dimension, which consists of the components of how to convey arguments, how to submit rebuttals, and how to submit responses. the component of conveying arguments has two indicators: the argument statement with four item statements and the speaker's opinion with two item statements. the rebuttal component is divided into resistance against the opponent with two item statements and identification of reasons with two item statements. the component of delivering responses is divided into objective responses with two item statements and response structure with two item statements.

the factor analysis of the 12 items of the method dimension produced two statements with the smallest anti-image correlation values, namely item 16 with 0.314 and item 13 with 0.473. the two items were dropped, and the other 10 items were re-analysed in the same way. the kmo value generated after the second stage of analysis was 0.841. the results of the kmo and bartlett’s test and the total variance explained are shown in figure 4 and figure 5.

figure 4. kmo for items 13-24
figure 5. total variance explained for items 13-24
figure 6. scree plot of the method dimension

table 3. the naming of efa result factors of the method dimension

| component | number of statement items | percentage variance | factor naming |
|---|---|---|---|
| 1 | 18, 19, 20, 21, 22, 23, 24 | 49.769 | how to respond to arguments |
| 2 | 14, 15 | 12.412 | how to present an argument |
| 3 | 17 | 10.974 | arguing rules |

the scree plot in figure 6 shows that three factors are formed, as indicated by an eigenvalue > 1. the values of the rotated component matrix were then examined; a magnitude > 0.5 determines the component into which an item is grouped.
the three factors grouped the statement items and led to the factor names shown in table 3. the grouping of indicators in each component of the method dimension follows the factors formed in the factor analysis. the dominant factor of this dimension lies in how to convey responses, which in the development concept was divided into two indicators, namely responses and rebuttals. in the factor analysis results, however, these two indicators tend to group on one factor. since a rebuttal is also a form of response, the factor is named how to respond to arguments. the remaining items, 14, 15, and 17, occupy the same components as in the initial grid.

results of factor analysis items 25-38

items 25-38 contain coherent statements from the manner dimension, which is composed of the expression, appearance, and vocal components. the expression component is described by the indicators of eye contact with one item statement, gestures with two item statements, and facial expressions with two item statements. the appearance component is detailed by one statement each for the indicators of standing and costume. the vocal component consists of the voice and speed indicators with two item statements each and the pitch and pronunciation clarity indicators, which are also detailed in two item statements each.

the factor analysis of the 14 items of the manner dimension produced one statement with a rotated component matrix value < 0.3, namely item 38 with 0.091. this item was therefore dropped, and the other 13 items were re-analysed with the same steps. items 25 to 37, analysed by repeated factor analysis, resulted in a kmo value of 0.719. the results of the kmo and bartlett’s test and the total variance explained are shown in figure 7 and figure 8.

figure 7. kmo for items 25-38
figure 8. total variance explained for items 25-38
figure 9. scree plot of the manner dimension

table 4. the naming of efa result factors of the manner dimension

| component | number of statement items | percentage variance | factor naming |
|---|---|---|---|
| 1 | 32, 33, 34, 35, 36, 37 | 21.875 | vocal |
| 2 | 25, 26, 27 | 18.320 | appearance |
| 3 | 28, 29, 30 | 11.513 | expression |
| 4 | 31 | 8.999 | costume |

from figure 9, it is known that four factors are formed, namely four points with an eigenvalue > 1. the grouping of items into four components can be seen in the classification and naming of factors in table 4. items 32-37 were previously spread over several indicators of the vocal component, but after factor analysis, all of these items turned out to form a unity. therefore, items 32-37 form a single component, vocal. item 25 and items 26-27 are different indicators, but they belong to the same component. after the factor analysis, they can be categorized into the appearance component. furthermore, item 30 initially belonged to a different component from items 28-29. factor analysis, however, tends to group the three items, which can be categorized as the expression component. item 31 independently occupies the costume factor because it is a statement that describes the costume.
reliability

the instrument reliability value was obtained by averaging the inter-rater reliability, calculated with the kappa formula, for 24 subjects from four different schools, namely man 3 sleman, sma n 9 yogyakarta, sma n 6 yogyakarta, and man 2 kulonprogo. the scores given by the two assessors in the four schools were first categorized and then analyzed for their kappa values with the help of the spss program. the scores are categorized into five groups based on the categorization guide of azwar (2012, p. 140), namely by first calculating the minimum value, maximum value, range, mean, and standard deviation. the instrument is an observation checklist with 38 statements, each scored as present or absent, so the minimum value is 0 and the maximum value is 38. the range between the maximum and minimum values is 38. the mean is the sum of the maximum and minimum values divided by two, which is 19. the standard deviation is the range divided by six, which is 6.3. based on these benchmarks, the categorization shown in table 5 is obtained.

table 5. categorization of students' debate practice ability

predicate | value range | nominal score
very low | x ≤ 9.55 | 1
low | 9.55 < x ≤ 15.85 | 2
medium | 15.85 < x ≤ 22.15 | 3
high | 22.15 < x ≤ 28.45 | 4
very high | x > 28.45 | 5

the 24 subjects from the four schools obtained nominal score categories of 4 or 5. the kappa reliability values obtained from the data of man 3 sleman, sma n 6 yogyakarta, and sma n 9 yogyakarta were each 0.571. in contrast, the kappa reliability value obtained from the man 2 kulonprogo data is perfect, namely 1. the kappa reliability value of the developed assessment instrument is calculated as the average of these four values. the average kappa reliability value obtained is 0.678.
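the categorization benchmarks and the averaged kappa value reported above can be reproduced with a short calculation. the sketch below is illustrative only: the function and variable names are hypothetical, the standard deviation is rounded to 6.3 as in the text, and the cut points at 0.5 and 1.5 standard deviations from the mean are inferred from the boundaries in table 5 rather than stated by the authors.

```python
# Illustrative sketch; names are hypothetical and not taken from the study.

def category_bounds(min_score, max_score):
    """Return the four cut points that separate the five categories."""
    mean = (max_score + min_score) / 2            # 19 for a 0-38 checklist
    sd = round((max_score - min_score) / 6, 1)    # 6.3, rounded as in the text
    # Cut points at mean -1.5, -0.5, +0.5 and +1.5 SD (inferred from Table 5).
    return [round(mean + k * sd, 2) for k in (-1.5, -0.5, 0.5, 1.5)]


def nominal_score(x, bounds):
    """Map a raw checklist score to a nominal score from 1 (very low) to 5 (very high)."""
    return 1 + sum(x > b for b in bounds)


bounds = category_bounds(0, 38)

# Kappa values reported per school: three schools at 0.571 and one at 1.
kappas = [0.571, 0.571, 0.571, 1.0]
mean_kappa = sum(kappas) / len(kappas)

print(bounds)                     # [9.55, 15.85, 22.15, 28.45]
print(nominal_score(25, bounds))  # 4 (high)
print(round(mean_kappa, 3))       # 0.678
```

a raw score equal to a boundary falls in the lower category, which matches the "≤" signs in table 5.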
based on the categorisation of kappa reliability values by fleiss and cohen (1973), the debate practice assessment instrument developed falls in the sufficient category because its value is in the range of 0.61 to 0.75. this is corroborated by garson (2016, p. 65), who states that a kappa inter-rater reliability value between 0.6 and 0.79 belongs to the substantial (sturdy) category. therefore, it can be said that the instrument developed is reliable.

conclusion

the research and development carried out has produced an instrument for assessing debate practice in indonesian language learning for class x students of sma/ma, which has been tested for validity and reliability. before further research was conducted, the theoretical construct of the instrument was established by examining several theories about speaking skills in debate practice. after that, the non-test instrument was developed following the mardapi model, which covers non-cognitive instruments. the draft instrument was assessed by 10 experts: two lecturers of dialectical speaking skills at the indonesian language and literature education study program and eight class x indonesian language teachers from sma/ma in the yogyakarta special region. their assessments were analyzed with the aiken formula to prove the content validity of the instrument. the initial instrument consisted of 44 items. after field trials and factor analysis, however, a final instrument of 33 items was produced.

the research results are as follows. first, the debate practice assessment instrument has been tested and has an adequate content validity value. the content validity test with the aiken's v index formula produces a validity value of 0.73, which is classified as valid.
the first product revision was carried out after this calculation, namely dropping six statements so that the instrument totalled 38 items from the previous 44 items. second, the product reliability value of 0.678 falls in the reliable category. the reliability comes from the calculation of inter-rater reliability using the kappa coefficient, based on the average of four inter-rater reliability values. the four values were obtained from a limited trial with class x students from four different schools. thus, the number of statement items of the debate practice assessment instrument after the construct validity test with exploratory factor analysis (efa) is 33 items from three assessment dimensions: the matter dimension with four components and 10 statement items, the method dimension with three components and 10 statement items, and the manner dimension with four components and 13 statement items.

acknowledgment

the researchers would like to thank the validators, both the indonesian language lecturers for speaking sub-skills and the indonesian language teachers, who were willing to review the developed instrument items. thanks also go to the class x students of the 2018/2019 academic year at man 3 sleman, man 2 kulon progo, sma n 6 yogyakarta, and sma n 9 yogyakarta, who became the respondents/research samples for the debate practices carried out.

references

azwar, s. (2012). reliabilitas dan validitas (4th ed.). pustaka pelajar.
badan standar nasional pendidikan. (2018). kisi-kisi usbn dan un. badan standar nasional pendidikan. https://bsnp-indonesia.org/2018/11/bsnp-rilis-kisi-kisi-usbn-dan-un-2019/
brown, z. w. (2015). the use of in-class debates as a teaching strategy in increasing students’ critical thinking and collaborative learning skills in higher education. educationalfutures [online], 7(1). https://educationstudies.org.uk/?p=3685
chai, c. s., hwee ling koh, j., & teo, y. h.
(2019). enhancing and modeling teachers’ design beliefs and efficacy of technological pedagogical content knowledge for 21st century quality learning. journal of educational computing research, 57(2), 360–384. https://doi.org/10.1177/0735633117752453
d'cruz, r. (2003). the australia-asia debating guide (2nd ed.). the australian debating federation. https://www.dav.com.au/resources/aadg.php
fleiss, j. l., & cohen, j. (1973). the equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. educational and psychological measurement, 33(3), 613–619. https://doi.org/10.1177/001316447303300309
garson, g. d. (2016). partial least squares: regression & structural equation models. statistical publishing associates.
ghorbani, s., mirshah jafari, s. e., & sharifian, f. (2018). learning to be: teachers’ competences and practical solutions: a step towards sustainable development. journal of teacher education for sustainability, 20(1), 20–45. https://doi.org/10.2478/jtes-2018-0002
isnaniar. (2013). peningkatan kemampuan berbicara siswa kelas xi sma negeri 4 kota bengkulu tahun ajaran 2012-2013 dengan pendekatan komunikatif [universitas bengkulu]. http://repository.unib.ac.id/id/eprint/8515
latif, m. a. (n.d.). a comprehensive guide to debate adjudication. international islamic university malaysia. http://phaseduph.weebly.com/uploads/3/2/1/6/32162939/comprehensive_adjudication_guide.pdf
mardapi, d. (2017). pengukuran, penilaian, dan evaluasi pendidikan (2nd ed.). parama publishing.
midun, h. (2017). membangun budaya mutu dan unggul di sekolah. jurnal pendidikan dan kebudayaan missio, 9(1), 50–59. http://unikastpaulus.ac.id/jurnal/index.php/jpkm/article/view/117
morelent, y. (2012).
peningkatan kemampuan berbicara siswa melalui kegiatan bercerita berbasis karakter di sekolah menengah atas: studi kuasi eksperimen pada siswa kelas x sma banuhampu kabupaten agam [sekolah pascasarjana universitas pendidikan indonesia]. http://repository.upi.edu/7716/
murti, b. (2011). validitas dan reliabilitas pengukuran. in matrikulasi program studi doktoral, fakultas kedokteran, uns, 1-19. https://dokumen.tips/documents/validitas-reliabilitaspengukuran-prof-bhisma-murti-55cd8744673e9.html?page=19
nurgiyantoro, b. (2001). penilaian dalam pengajaran bahasa dan sastra. bpfe-ugm.
o’connor, a., carpenter, b., & coughlan, b. (2018). an exploration of key issues in the debate between classic and constructivist grounded theory. grounded theory review, 17(1). http://groundedtheoryreview.com/2018/12/27/an-exploration-of-key-issues-in-thedebate-between-classic-and-constructivist-grounded-theory/
poerwanti, e., widodo, e., masduki, pantiwati, y., & departemen pendidikan nasional. (2008). asesmen pembelajaran sd. direktorat jenderal pendidikan tinggi departemen pendidikan nasional.
quinn, s. (2005). debating. simon quinn. https://debate.uvm.edu/dcpdf/quinn_debating.pdf
salim, a. (2015). debate as a learning-teaching method: a survey of literature. tarbiya: journal of education in muslim society, 2(1), 97–104. https://doi.org/10.15408/tjems.v2i1.1665
sari, k. d. i., wendra, i. w., & wisudariani, n. m. r. (2016). pelaksanaan evaluasi pembelajaran keterampilan berbicara (bercerita) dengan materi cerpen pada siswa kelas ix d smp negeri 3 singaraja. jurnal pendidikan bahasa dan sastra indonesia undiksha, 5(3). https://ejournal.undiksha.ac.id/index.php/jjpbs/article/view/8688
simarmata, m. y., & sulastri, s. (2018). pengaruh keterampilan berbicara menggunakan metode debat dalam mata kuliah berbicara dialektik pada mahasiswa ikip pgri pontianak. jurnal pendidikan bahasa, 7(1), 49-62.
https://journal.ikippgriptk.ac.id/index.php/bahasa/article/view/826
viswesh, v., yang, h., & gupta, v. (2018). evaluation of a modified debate exercise adapted to the pedagogy of team-based learning. american journal of pharmaceutical education, 82(4), 345–353. https://doi.org/10.5688/ajpe6278

reid (research and evaluation in education), 7(2), 2021, 106-117
available online at: http://journal.uny.ac.id/index.php/reid

evaluation of the implementation of educational assessment standards at madrasah tsanawiyah modern islamic boarding school

nurul ngarifillaili1*; badrun kartowagiran1; umwari yvette2
1universitas negeri yogyakarta, indonesia
2university of kibungo, rwanda
*corresponding author. e-mail: aqbilinaa@gmail.com

introduction

education in indonesia is dominated by formal education from the government and non-formal education embedded in the community, i.e., pesantrens. the term pesantren comes from the word santri (student) with the prefix pe- and the suffix -an, denoting the place where students reside while studying religion (takdir, 2018, p. 156). pesantren culture is applied using its own particular method. based on this explanation, a pesantren is a place to study, particularly islam, offering many advantages in both its methods and its learning process.

the rapid development of pesantren should be responded to wisely. currently, many pesantren are developing in line with globalization. at first, pesantren were defined merely as places to study religion; today, however, the definition has expanded. many pesantren now teach not only religion but also general science. in the past, most pesantren were of the salaf or religious type; now they have changed to the khalaf (modern) type.
the pesantren began to develop slowly by establishing public schools so that the learning process became a combination of religious learning within the pesantren and learning in schools.

article info
article history
submitted: 4 september 2021
revised: 17 november 2021
accepted: 19 november 2021
keywords
educational assessment standards; mts within the modern islamic boarding school; evaluation program

abstract
this study aims to collect, analyze, and present evaluation results and assess them by comparing them with the evaluation indicators. the evaluation focuses on the achievement and implementation of seven components of the educational assessment standard in islamic middle schools (madrasah tsanawiyah or mts) of modern islamic boarding schools (pesantren) in the kebumen regency. the study employed a descriptive quantitative approach. the evaluation model was the discrepancy evaluation model. the study subjects were 227 grade viii students of the 2020/2021 academic year. data collection was performed using a questionnaire and brief interviews. the results show that 70.55% of students stated the assessment comprises three aspects, i.e., knowledge, attitude, and skills. the assessment principles of valid, objective, fair, integrated, open, thorough and sustainable, systematic, based on criteria, and accountable have been implemented well. about 67.82% of students assert that all principles have been reflected in the assessment by educators and education units. as many as 72.35% of students explained that the assessment had utilized an appropriate form to measure students’ competency achievement. the assessment at mts in modern pesantren in kebumen regency used an instrument following the regulation, where 74.82% of students revealed that the study instrument had followed the empirical validity requirement.

this is an open access article under the cc-by-sa license.

how to cite:
ngarifillaili, n., kartowagiran, b., & yvette, u. (2021).
evaluation of the implementation of educational assessment standards at madrasah tsanawiyah modern islamic boarding school. reid (research and evaluation in education), 7(2), 106-117. doi:https://doi.org/10.21831/reid.v7i2.43672
https://creativecommons.org/licenses/by-sa/4.0/

in indonesia, there are 2072 pesantren with primary schools, 2721 with middle schools, 224 with open middle schools, 1580 with high schools, 35 with vocational schools, and 176 with religious high schools. pesantrens continue to accept all advances, including building formal education institutions (istikomah, 2017, p. 57). pesantren that combine two curricula have more subjects than traditional ones. students must simultaneously learn two fields, i.e., general science and religious teachings. general subjects such as indonesian, mathematics, and english are taught in more lesson hours than others. therefore, the learning process in these subjects is expected to be better than in others.

in this study, modern pesantren are defined as those that use an integrated curriculum and apply the 2013 curriculum in their schools. in kebumen regency, ten madrasah tsanawiyah (mts) apply the 2013 curriculum. however, field observation revealed that mathematics teachers complained about applying the new curriculum in assessment. teachers consider the assessment complicated and hence troublesome when processing scores. teachers also acknowledge that the process of attitude assessment is affected by subjectivity (retnawati, 2015). based on this finding, although mathematics has more lesson hours than other subjects, its assessment process shows various problems. the 2013 curriculum implements a higher order thinking skills (hots) based assessment.
however, in practice, teacher assessment has not reached the hots level and still tends toward lower order thinking skills (lots). this is seen from teachers' knowledge concerning hots implementation, hots characteristics, and the steps to arrange hots questions. teacher perceptions of hots assessment are split between agreement and disagreement. this result reveals that the implementation of assessment has not met the government’s expectations. teachers have limitations in implementing a hots-based 2013 curriculum assessment.

evaluating the government’s policy on educational assessment is crucial, given that many problems in the assessment process still exist. the possible measures are identifying how educators and education units implement the educational assessment standards in the field and finding the discrepancies between field practice and the policy. in this case, the discrepancy evaluation model has characteristics that match the problem and is applicable. it is expected to standardize the assessment process and thereby improve school quality. based on this explanation, it is necessary to conduct an evaluation study of the implementation of educational assessment standards at madrasah tsanawiyah in modern islamic boarding schools (pesantren), particularly for the indonesian, mathematics, and english subjects.

method

the study approach was descriptive-quantitative. the evaluation model employed was the discrepancy evaluation model, since the evaluation study was defined as a process of checking the compatibility of a program with program standards and whether any discrepancy occurs between program aspects in the field and the predetermined standard. thus, the evaluation specifically compared actual conditions in the field with the conditions expected by the standard. the steps in this discrepancy evaluation are (wirawan, 2016, p.
140):
(a) developing a design and standards specifying the characteristics of the ideal implementation of evaluation objects,
(b) determining the information required to compare the actual implementation with the standard defining the evaluation object performance,
(c) capturing the evaluation object performance, including program implementation and quantitative and qualitative outcomes,
(d) identifying discrepancies between the standards and the actual implementation of evaluation objects,
(e) determining the cause of the discrepancies, and
(f) eliminating discrepancies by making changes to the evaluation object implementation.

table 1. mts data in kebumen regency

madrasah tsanawiyah | amount
public school | 8
private school (outside islamic boarding school) | 76
private school (including the modern islamic boarding school) | 10

this study was conducted at three mts in modern pesantren in kebumen regency, selected with the purposive sampling technique based on a large number of students. a large number of students was defined as over 100 students. another selection criterion was location, with the schools distributed across three different districts. table 1 presents the data of mts in kebumen regency. the criterion for determining the object of the study was mts in modern pesantren applying an integrated curriculum, i.e., the pesantren and governmental curricula. the schools in this category combine school and pesantren in the same complex, where all students live within the area. these schools are under pesantren institutions and follow both the pesantren curriculum and the school curriculum of the government, i.e., the ministry of religion.
the mts fulfilling the criteria amounted to ten; those with over 100 students are mts plus nururrohmah, mts yapika, and mts salafiyah wonoyoso. the population of grade viii students in the three mts was 587. the study respondents were selected using the random sampling technique, yielding 227 students. the sample size was determined from the cohen and morrison table with a 95% confidence level and a 0.05 confidence interval (cohen et al., 2018, p. 206). data were collected using a questionnaire and brief interviews with the schools.

instrument validation should involve content analysis and empirical analysis of the test scores and of test takers' responses to the items. the content analysis of the test relates to content validity, while empirical analysis is further required to determine construct validity. both analyses are indispensable in education so that the instrument meets the standard requirements (retnawati, 2016, p. 18). the study employed a content validity test with expert judgment and a construct validity test with efa (exploratory factor analysis) for the student questionnaire. the content validity index was determined with the aiken formula, shown in formula (1), where v = the aiken validity index, s = r − lo (the score given by a rater minus the lowest validation score), n = the number of panelists, and c = the number of categories.

$$v = \frac{\sum s}{n(c-1)} \tag{1}$$

the construct was measured by a trial on 44 students. the analysis utilized exploratory factor analysis (efa). the validity of a question item is determined using several criteria (retnawati, 2016, p.
43), including (a) a kaiser-meyer-olkin (kmo) score over 0.5, (b) a significance value of bartlett’s test of sphericity under 0.05, (c) an anti-image correlation over 0.5, (d) an eigenvalue in the total variance explained over 1.0, and (e) a rotated component matrix coefficient over 0.4, with the loading on that factor exceeding the loadings on other factors by at least 0.1, to determine the item group. in addition, the grid of the student questionnaire instrument is presented in table 2, and the result of the analysis carried out with efa is presented in table 3.

table 2. grid of student questionnaire instruments

no. | aspect | indicator | question item
1. | scope of assessment | teachers evaluate students’ knowledge | 1-4
 | | teachers evaluate students’ attitude | 5-7
 | | teachers evaluate students’ skills | 8-10
2. | principle of assessment | the assessment meets most principles (valid, objective, fair, integrated, open, thorough, and sustainable); the assessment does not exclude three other principles (systematic, based on criteria, and accountable) | 11-15
3. | form of assessment | teachers conduct daily tests, observation, assignments | 16-19
4. | instrument of assessment | teachers use an instrument of assessment including the knowledge, attitude, and skill aspects | 20-22

table 3. kmo value and bartlett’s test

kaiser-meyer-olkin measure of sampling adequacy | .567
bartlett’s test of sphericity, approx. chi-square | 613.395
bartlett’s test of sphericity, df | 231
bartlett’s test of sphericity, sig. | .000

table 4. msa value

items | value | items | value
1 | 0.731 | 12 | 0.574
2 | 0.489 | 13 | 0.569
3 | 0.816 | 14 | 0.513
4 | 0.384 | 15 | 0.347
5 | 0.323 | 16 | 0.501
6 | 0.593 | 17 | 0.589
7 | 0.539 | 18 | 0.586
8 | 0.415 | 19 | 0.533
9 | 0.651 | 20 | 0.514
10 | 0.242 | 21 | 0.565
11 | 0.684 | 22 | 0.607

after calculating kmo and bartlett’s test, the kaiser-meyer-olkin measure of sampling adequacy was 0.567.
therefore, the kmo met the requirement of a value > 0.5, indicating that the samples were sufficient. the subsequent analysis examined the msa values, as presented in table 4. the analysis shows that six items were reduced (excluded from the study) for not meeting the msa > 0.5 requirement. the analysis was then repeated without the reduced items, leaving the questionnaire with only 16 questions. the instrument reliability was calculated with the spss program based on the cronbach alpha coefficient. the reliability value is declared good if it is close to 1 or the coefficient is > 0.7 (hair et al., 2010, p. 21). based on the measurement results, the aiken validity index obtained was 0.97, categorized as highly valid. the construct validity analysis, referring to the previous requirements, shows that the student questionnaire with 16 question items is valid. the reliability reached 0.889, indicating that the student assessment questionnaire is reliable as a study instrument.

findings and discussion

following the regulation of the minister of education and culture no. 23 of 2016 on the educational assessment standards, educational assessment is a process of collecting and processing information to measure students’ competency achievements, including authentic assessment, self-assessment, portfolio-based assessment, tests, daily tests, mid-semester tests, end-semester tests, competency level tests, national tests, and islamic middle school tests. the assessment of competency achievement covers attitude, knowledge, and skill, assessed equally to determine each student's relative position against the predetermined standard (hidayah, 2020, p. 101). the educational assessment standards are criteria concerning the mechanisms, procedures, and instruments of student assessment.
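the two instrument screening computations reported in the method section, aiken's v for content validity and the msa > 0.5 rule for item retention, can be sketched as follows. this is a minimal illustration rather than the authors' procedure: the function name and the example ratings are hypothetical, the msa values are transcribed from table 4, and the threshold is applied once to the reported values, whereas the study ran the analysis in spss.

```python
# Illustrative sketch; function name and example ratings are hypothetical.

def aiken_v(ratings, lowest, categories):
    """Aiken's V = sum(s) / (n * (c - 1)), where s = r - lo for each rater."""
    s_total = sum(r - lowest for r in ratings)
    return s_total / (len(ratings) * (categories - 1))


# MSA values transcribed from Table 4 (item number -> MSA).
msa = {
    1: 0.731, 2: 0.489, 3: 0.816, 4: 0.384, 5: 0.323, 6: 0.593,
    7: 0.539, 8: 0.415, 9: 0.651, 10: 0.242, 11: 0.684, 12: 0.574,
    13: 0.569, 14: 0.513, 15: 0.347, 16: 0.501, 17: 0.589, 18: 0.586,
    19: 0.533, 20: 0.514, 21: 0.565, 22: 0.607,
}

# Keep only items whose MSA exceeds the 0.5 threshold used in the study.
retained = [item for item, value in msa.items() if value > 0.5]
dropped = [item for item, value in msa.items() if value <= 0.5]

print(dropped)        # [2, 4, 5, 8, 10, 15]
print(len(retained))  # 16

# Hypothetical example: three raters score one item 4, 5, 5 on a 1-5 scale.
print(round(aiken_v([4, 5, 5], lowest=1, categories=5), 3))  # 0.917
```

the six dropped items and the 16 retained items agree with the counts reported in the text.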
the assessment standard by educators, according to badan standar nasional pendidikan (bsnp), or the national education standards agency, includes the general standard, the planning standard, the implementation standard, and the standard for reporting assessment outcomes and utilizing assessment findings. meanwhile, the education report by education units has two fundamental standards, i.e., the standard for determining grades and the standard for determining graduation (salamah, 2018, p. 287). based on these explanations, the educational assessment standards comprise criteria that include general and specific standards for several education components. good learning should be accompanied by a good assessment process that compares outcomes with the predetermined standards or criteria.

evaluation research on the implementation of assessment standards in indonesian, english, and mathematics was performed in kebumen regency. islamic middle schools in the modern pesantren area were the primary target of this study. three mts were the subjects: mts plus nururrohmah, mts yapika, and mts salafiyah wonoyoso. the study scope components include: (a) student learning outcome assessment in primary and middle schools, including attitude, knowledge, and skills, (b) attitude assessment, an activity carried out by educators to acquire descriptive information regarding students’ behaviors, (c) knowledge assessment, an activity to measure students’ knowledge mastery, and (d) skill assessment, an activity to measure students’ ability to apply knowledge in carrying out particular assignments. furthermore, quantitative data from the likert scale can be interpreted into score ranges using the normal distribution criteria (mardapi, 2017, p.
10). the categories of the score ranges are presented in table 5. meanwhile, the assessment scope component results for the three mts in pesantren in kebumen regency are presented as percentages (%) in table 6. table 6 shows that 70.55% of students stated that assessment implementation in the three mts had covered three aspects, i.e., knowledge, attitudes, and skills. it indicates that the assessment scope component has been implemented well. the differences in the achievement of the assessment scope component among the three mts can be seen in figure 1.

table 5. assessment score range category

no. | value interval | category
1. | $x > \bar{x} + 1.5\,sb_x$ | very good
2. | $\bar{x} < x < \bar{x} + 1.5\,sb_x$ | good
3. | $\bar{x} - 1.5\,sb_x < x < \bar{x}$ | poor
4. | $x < \bar{x} - 1.5\,sb_x$ | very poor

table 6. component achievement result of assessment scope (%)

no. | subjects | mts nururrohmah | mts yapika | mts salafiyah
1. | indonesian | 64.7 | 71.42 | 72.15
2. | english | 63.55 | 82.54 | 69.62
3. | mathematics | 63.53 | 73.02 | 74.42
overall average | 70.55 (good)

figure 1. achievement diagram of the assessment scope

based on figure 1, the percentage of students acknowledging that the assessment had been good and covered three aspects is higher at mts yapika and mts salafiyah wonoyoso than at mts nururrohmah. the assessment scope is one of the critical standards in determining the quality of a school. the assessment focuses on three aspects: knowledge, attitudes, and skills. student learning outcomes, i.e., the achievement during the learning activities, are one of the critical benchmarks in the assessment scope (jihad & haris, 2013, p. 14).
the assessment of student learning outcomes conducted by educators and education units in the three mts in modern pesantren in kebumen regency was good. this is indicated by 70.55% of student respondents stating that the knowledge, attitude, and skill aspects had been reflected in the assessment activities in mts. according to khuriyah et al. (2016), from a managerial perspective, managing an institution such as a pesantren on the basis of tradition causes its management outputs to lack a clear strategic focus. personal dominance is too large and tends to be exclusive in its development. it shows that pesantren require improvement in their management. this condition is a distinct obstacle to implementing student assessment in mts in pesantren.

furthermore, the skill aspect has its own obstacles, as students could not be assessed objectively. specific skills that students should achieve have not been implemented due to obstacles such as limited space for movement and limited facilities. for example, there was no laboratory for the english subject; thus, it is challenging for students to show their skills to the maximum. improvement efforts are needed to maximize the assessment, especially in the implementation of educational assessments. the percentage can still be increased to reach the optimum result. the non-optimal part of the assessment scope is caused by several obstacles related to the madrasas being integrated with the dormitory. from the observations, the knowledge and attitude aspects have been implemented well. however, students' skills are limited by the pesantren rules regarding the procurement of tools and materials. generally, mts in pesantren differ from public schools, where students are free to enter and leave. mts are bound by both the pesantren rules and the school rules.

the next evaluation component is the assessment principle. based on the regulation of the minister of education and culture no.
23 of 2016, the nine components of the assessment principles evaluated are:
(a) valid, meaning that the assessment is based on data that reflect the measured ability;
(b) objective, meaning that the assessment is based on clear procedures and criteria and is not influenced by the subjectivity of the rater;
(c) fair, meaning that the assessment does not advantage or disadvantage students because of special needs or differences in religious, ethnic, cultural, customary, socioeconomic, and gender backgrounds;
(d) integrated, meaning that assessment is an inseparable component of learning activities;
(e) open, meaning that interested parties can know the assessment procedure, assessment criteria, and basis for decision making;
(f) comprehensive and continuous, meaning that the assessment covers all aspects of competence by using various appropriate assessment techniques to monitor and assess the development of students' abilities;
(g) systematic, meaning that the assessment is carried out in a planned and gradual manner by following standard steps;
(h) based on criteria, meaning that the assessment follows the achievement of the specified competence; and
(i) accountable, meaning that the assessment can be accounted for in terms of its mechanisms, procedures, techniques, and results.

table 7. component achievement result of assessment principle (%)

no. | subjects | mts nururrohmah | mts yapika | mts salafiyah
1. | indonesian | 58.82 | 63.49 | 69.62
2. | english | 65.88 | 69.84 | 73.42
3. | mathematics | 67.05 | 65.08 | 77.21
overall average | 67.82 (good)

figure 2. achievement diagram of the assessment principle

in general, the results of the achievement of the assessment principle components are shown in table 7.
they show that the assessment principle components implemented at mts in modern pesantren in kebumen regency were in the good category. this is indicated by 67.82% of students asserting that the nine principles had been implemented in the assessment process in mts. figure 2 presents a bar chart of the implementation of the assessment principles in the indonesian, english, and mathematics subjects in the three mts in kebumen. the bar chart shows that the achievement of the assessment principle was in the good category. at mts salafiyah, the percentage of respondents who stated that the assessment principles had been implemented was higher than at the other two mts. thus, with 67.82% of respondents acknowledging that the nine principles had been reflected in the assessment implementation, the assessment principle component in the three mts in kebumen regency was in the good category. however, this achievement remained lower than that of the assessment scope component. it is evident that the implementation of the assessment principles still had significant deficiencies. the lower achievement for the assessment principle component is due to several principles that were not appropriately implemented. although the nine assessment principles are easy to state in theory, they are complicated to implement. the objective and valid principles in particular still need to be improved in their implementation. objectivity is required of an educator in order to assess all students evenly. however, in practice, the assessment was still subjective toward specific students. the assessment of the knowledge aspect was tied to the attitude and other aspects. according to mardapi (2017, p. 5), learning outcomes in the three aspects are not summed or affected by each other because they measure different dimensions. ayu and marzuki (2017, p.
78) also wrote about the importance of the internal competencies possessed by educators, especially their character. however, in the assessment implementation in schools, educator subjectivity, e.g., combining different assessment aspects, still occurs, and efforts need to be made to eliminate it. furthermore, the validity principle should be improved so that the assessment reflects the ability being measured. in practice, students' ability has not been appropriately measured because the subjectivity of educators influences it. these two principles must be improved so that educational assessment is performed correctly and meets the applicable standards for improving the quality of a school. the assessment form component covers the assessment of learning outcomes by educators in the form of tests, observations, assignments, and other necessary forms. in addition to educators, education units carried out assessment in the form of school/mts exams. the results for the assessment form component in three mts in pesantren in kebumen regency are shown in table 8.
table 8. component achievement result of assessment form (%)
                   madrasah
no.  subjects      mts nururrohmah  mts yapika  mts salafiyah
1.   indonesian    65.87            77.77       67.08
2.   english       69.41            84.12       68.35
3.   mathematics   78.83            71.42       68.36
overall average: 72.35 (good)
figure 3. achievement diagram of the assessment form
based on table 8, 72.35% of students stated that the assessment in mts had used various forms according to the characteristics of the material being taught. this shows that the assessment implementation had been going well, although there is still room for improvement. the differences in achievement among student respondents are presented in figure 3.
from figure 3, it is observed that the assessment forms used at mts nururrohmah and mts yapika came closer to the standard than those at mts salafiyah wonoyoso. assessment of learning outcomes by educators was conducted through tests, observations, assignments, and other necessary forms. in contrast, the assessment by the education unit was in the form of mts or school exams. educators must provide an assessment form that matches the competencies to be measured. the achievement of the assessment form component in mts in modern pesantren in kebumen regency showed promising results: 72.35% of respondents stated that the assessment form applied complied with the regulation of the minister of education and culture no. 23 of 2016. these results still need to be improved for the implementation to be optimal. various obstacles remained a problem, particularly for educators. indonesian, english, and mathematics have different characteristics, so the assessment forms used must also differ. in practice, however, schools share the same tendency: educators still center their assessment on tests and assignments. in the linguistic field, assessment should not consist merely of tests and assignments; there must be other forms for measuring the skills students possess. utami (2018) stated that assessment in indonesian covers various abilities such as listening, speaking, reading, and writing. given these various linguistic abilities, educators must look for appropriate assessment forms; for example, portfolio assessment is essential. however, in practice, portfolio assessment has not been done for linguistic subjects. various obstacles to project and portfolio assessment still occur, primarily time constraints, because this kind of assessment centers on the development of students' abilities and therefore requires a more extended period.
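as a rough arithmetic check, the overall averages reported in tables 7 and 8 can be approximated from the nine subject-by-school percentages. the sketch below is an illustration rather than the authors' procedure: it assumes an unweighted mean in which every cell counts equally, and the names `TABLE_7_PRINCIPLE`, `TABLE_8_FORM`, and `unweighted_average` are hypothetical.

```python
from statistics import mean

# percentages copied from tables 7 and 8
# rows: indonesian, english, mathematics
# columns: mts nururrohmah, mts yapika, mts salafiyah
TABLE_7_PRINCIPLE = [
    [58.82, 63.49, 69.62],
    [65.88, 69.84, 73.42],
    [67.05, 65.08, 77.21],
]
TABLE_8_FORM = [
    [65.87, 77.77, 67.08],
    [69.41, 84.12, 68.35],
    [78.83, 71.42, 68.36],
]


def unweighted_average(table):
    """Mean of all cells; equal weight per cell is an assumption, not stated in the paper."""
    return mean(value for row in table for value in row)


principle_avg = unweighted_average(TABLE_7_PRINCIPLE)
form_avg = unweighted_average(TABLE_8_FORM)

print(f"assessment principle: {principle_avg:.2f}%")
print(f"assessment form:      {form_avg:.2f}%")
```

under this assumption the result agrees with the reported 67.82% for table 7 and lands within about 0.01 of the reported 72.35% for table 8. a gap of that size would be expected if the original averages were truncated or weighted by the number of respondents, which the text does not specify.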
going forward, efforts to assess various complex abilities still have to be improved so that the implementation of the assessment form component becomes optimal. the aim is for students' competence to be adequately measured. the last component evaluated in this study was the assessment instrument. the components of the assessment instrument evaluated include: (a) the assessment instruments used by educators are tests, observations, individual or group assignments, and other forms that follow the characteristics of students' competence and level of development, and (b) the assessment instrument used by the education unit in the form of a final evaluation and/or school/mts exam meets the requirements for substance, construction, and language and has evidence of empirical validity. the results of the achievement of the assessment instrument components in three mts in pesantren in kebumen regency are displayed in table 9. based on table 9, 74.82% of students stated that the assessment instrument complied with the regulation of the minister of education and culture no. 23 of 2016. this large percentage indicates that the assessment instrument had been appropriately implemented. the difference in the level of achievement of the three mts can be seen in figure 4. it shows that the assessment instrument at mts salafiyah followed the existing rules less closely than at the other two mts. based on a brief interview with the madrasah, many teachers there still do not have an educator certificate and are relatively young, so they are still inexperienced in compiling assessment instruments.
this is different from the other two mts, where the number of certified teachers is higher and most senior teachers have more extensive experience. the assessment instruments can be tests, observations, assignments, and other necessary instruments. an education unit instrument must meet requirements that include substance, construction, language, and empirical validity. in this case, the education units performed excellently because they prepared assessment instruments through school-level review and with various outside parties to guarantee instrument quality.
table 9. component achievement result of assessment instrument (%)
                   madrasah
no.  subjects      mts nururrohmah  mts yapika  mts salafiyah
1.   indonesian    72.94            85.71       67.09
2.   english       71.76            77.78       74.68
3.   mathematics   81.17            71.42       70.89
overall average: 74.82 (good)
figure 4. achievement diagram of the assessment instrument
table 10. overall average of the research object
no.  mts              average (%)
1.   mts nururrohmah  69.01
2.   mts yapika       83.21
3.   mts salafiyah    76.18
furthermore, the achievement of the assessment instrument component in the three study objects was in the good category, with 74.82% of respondents stating that the assessment had used various instruments that follow the characteristics of the subjects being assessed. however, this number can still be improved. educators still face various obstacles in making assessment instruments. among the obstacles in the field are assessment instruments that do not follow the characteristics of the aspect to be assessed. language subjects have a variety of components that are more complex than other subjects.
the instrument prepared should measure these aspects. for example, an assessment using projects and portfolios should be carried out to assess linguistic materials. educators only provide assessment instruments in the form of tests and assignments, leaving students' abilities not appropriately measured. therefore, educators need to be reminded to carry out the assessment process as well as possible so that the progress of students' abilities can be adequately measured. the implementation of the assessment by educators and education units in the three study objects had been carried out well despite many perceived obstacles. this is because both the assessment mechanism and the assessment instruments were prepared in advance together with other mts and were also supervised from a higher level. the assessment is processed internally within the school and together with other mts, including state mts. this supports the achievement of the assessment implementation at the education unit level. in the next calculation, the three mts gave different results, which are presented in table 10. based on table 10, mts yapika achieved the best result of the three mts. this shows that the assessment implementation at this mts was classified as good and followed the standards. the other mts scored lower than mts yapika. at mts yapika, the learning applied follows the 2013 curriculum, where the assessment is carried out thoroughly. for example, in skills assessment, students are given the freedom to prepare the equipment needed. the interests of the pesantren do not limit what is needed for learning. various activities to hone students' skills are also available in the school, such as scouting, drum band, and other activities such as performing arts. these activities have significant benefits for learning in schools because they train the courage and mentality of students. this provides additional support for the assessment implementation in schools.
in the other two mts, the regulations applied are more stringent, so students have less scope for developing their skills. furthermore, poor communication between mts and pesantren managers remained a big problem in implementing the assessment, and the competence of students had not been appropriately measured. also, at mts yapika, most of the educators came from the area surrounding the mts. the teachers seemed to be more active in teaching than those at the other mts. this is evidence that learning at mts yapika is more active, so that the assessment process becomes better. this study also found that the assessment principle component had a lower level of achievement than the other components. this is because assessment principles such as valid and objective tend to be challenging to implement. the assessment carried out by the teacher is not aligned with the competencies to be achieved. teachers have difficulty in assessing student attitudes. in assessing student attitudes, teachers must observe student behavior during learning. this is challenging since students do not entirely obey the rules conveyed by the teacher (suciati et al., 2017, p. 70). based on this study, the objective principle is challenging to implement in assessment. therefore, teachers sometimes still assess with high subjectivity since they are affected by their closeness to students.
conclusion
based on the study results, the components of the scope, principles, forms, and assessment instruments had been reflected in the assessment activities in mts. there were 70.55% of students asserting that the assessment had covered three aspects of the assessment scope, i.e., knowledge, attitudes, and skills.
in addition, the implementation of the assessment principles was in the good category, with 67.82% of students stating that the nine principles had been reflected in the assessment implementation by educators and education units. a total of 72.35% of students stated that the assessment had used the appropriate form to measure the achievement of student competence. assessment at mts in pesantren in kebumen regency utilized instruments that comply with regulations: 74.82% of students stated that the assessment instrument had met the linguistic requirements and had empirical validity. based on the study results, several suggestions can be made, such as increasing training for educators to raise their competence, especially in carrying out standardized assessments. another recommendation is further research to analyze the obstacles in the assessment implementation, especially in schools under pesantren, in order to find solutions for improving the achievement of the assessment components.
references
ayu, s. m., & marzuki, m. (2017). an assessment model of islamic religion education teacher personality competence. reid (research and evaluation in education), 3(1), 77–91. https://doi.org/10.21831/reid.v3i1.14029
cohen, l., manion, l., & morrison, k. (2018). research methods in education (8th ed.). routledge.
hair, j. f., black, w. c., babin, b. j., & anderson, r. e. (2010). multivariate data analysis: global perspective (7th ed.). pearson education.
hidayah, i. (2020). analisis standar penilaian pendidikan di indonesia. al-iman: jurnal keislaman dan kemasyarakatan, 4(1), 85–105. http://ejournal.kopertais4.or.id/madura/index.php/aliman/article/view/3851
istikomah, i. (2017). modernisasi pesantren menuju sekolah unggul. halaqa: islamic education journal, 1(2), 53–62. https://doi.org/10.21070/halaqa.v1i2.1246
jihad, a., & haris, a. (2013). evaluasi pembelajaran. multi pressindo.
khuriyah, k., zamroni, z., & sumarno, s. (2016).
pengembangan model evaluasi pengelolaan pondok pesantren. jurnal penelitian dan evaluasi pendidikan, 20(1), 56–69. https://doi.org/10.21831/pep.v20i1.7529
mardapi, d. (2017). pengukuran, penilaian, dan evaluasi pendidikan (2nd ed.). parama publishing.
regulation of the minister of education and culture no. 23 of 2016 concerning the educational assessment standards. (2016). https://bsnp-indonesia.org/wpcontent/uploads/2020/12/permendikbud_tahun2016_nomor023.pdf
retnawati, h. (2015). hambatan guru matematika sekolah menengah pertama dalam menerapkan kurikulum baru. jurnal cakrawala pendidikan, xxxiv(3), 390–403. https://doi.org/10.21831/cp.v3i3.7694
retnawati, h. (2016). analisis kuantitatif instrumen penelitian. parama publishing.
salamah, u. (2018). penjaminan mutu penilaian pendidikan. evaluasi: jurnal manajemen pendidikan islam, 2(1), 274–293. https://doi.org/10.32478/evaluasi.v2i1.79
suciati, r. m., nurhaidah, n., & vitoria, l. (2017). pelaksanaan penilaian hasil belajar siswa pada subtema hidup rukun dengan teman bermain di kelas ii sdn 14 banda aceh. jurnal ilmiah pendidikan guru sekolah dasar fkip unsyia, 2(1), 59–72. http://www.jim.unsyiah.ac.id/pgsd/article/view/2532
takdir, m. (2018). modernisasi kurikulum pesantren. ircisod.
utami, s. (2018). pengaruh kemampuan berbicara siswa melalui pendekatan komunikatif dengan metode simulasi pada pembelajaran bahasa indonesia. jurnal likhitaprajna, 18(2), 58–66. https://likhitapradnya.wisnuwardhana.ac.id/index.php/likhitapradnya/article/view/59
wirawan, w. (2016). evaluasi: teori, model, metodologi, standar, aplikasi dan profesi (3rd ed.). rajawali pers.
copyright © 2021, reid (research and evaluation in education), 7(2), 2021 issn: 2460-6995 (online)
reid (research and evaluation in education), 7(2), 2021, 177-185
available online at: http://journal.uny.ac.id/index.php/reid
profile of the ability of prospective biology teachers in making question instruments using bloom's taxonomy
tengku idris*; sepita ferazona; herlina safitri
universitas islam riau, indonesia
*corresponding author. e-mail: idrisbio@edu.uir.ac.id
introduction
the use of bloom's taxonomy in assessment is deeply ingrained in indonesia. all levels of education use bloom's taxonomy as an instrument to see the achievement of learning outcomes (bloom et al., 1956). the concept of bloom's taxonomy was developed by benjamin s. bloom in 1956. bloom's taxonomy consists of three domains, namely the cognitive, affective, and psychomotor domains (krathwohl, 2002). in 2001, bloom's taxonomy was revised by dividing it into two dimensions, namely the knowledge dimension, consisting of factual, conceptual, procedural, and metacognitive knowledge, and the cognitive process dimension, consisting of remembering, understanding, applying, analyzing, evaluating, and creating (waite et al., 2020). a taxonomy is a classification system that arranges things, principles, or concepts in a tiered hierarchy (van niekerk & von solms, 2013). these levels have led to disagreement among educational and psychological experts, some of whom argue that the human brain is not like a computer, which has steps and levels that must be passed through to achieve something (arievitch, 2020).
the use of bloom's taxonomy in education is very broad, covering various fields of knowledge and perspectives (pappas et al., 2013), such as accounting (kidwell et al., 2013), medicine (ghidinelli et al., 2021), reading activities (tangsakul et al., 2017), writing (baghaei et al., 2020), mathematics (radmehr & drake, 2017; risnawati et al., 2019), engineering (meda & swart, 2018), and science (lee et al., 2015).
article info
article history: submitted: 1 november 2021; revised: 6 december 2021; accepted: 24 december 2021
keywords: bloom's taxonomy; lots; hots; prospective teacher
abstract
the purpose of this study is to determine students' ability to make bloom's taxonomy questions. the study was conducted on students in the 5th semester of the 2020/2021 academic year who were taking evaluation courses and learning achievement techniques, with a sample of 62 people. this research is a descriptive study using a checklist sheet that has been content validated by experts as the main instrument, with questionnaires and interviews as supporting data. data were analyzed using the quantitative descriptive method. the results of the research show that students have very good abilities in making bloom's taxonomy questions in the lower order thinking skills category, with a percentage of 92.22% (c1 100%, c2 95%, and c3 81.67%), while the students' ability in making higher order thinking skills questions is in the very poor category, with a percentage of 44.17% (c4 54.17%, c5 46.67%, and c6 31.67%). in addition, based on the form of questions, students' ability to design essay questions (74.17%) is better than their ability to design multiple-choice questions (62.22%). the conclusion of this study is that the ability of students to make questions based on bloom's taxonomy is in the fairly good category, with a percentage of 78.06%.
this is an open access article under the cc-by-sa license. https://creativecommons.org/licenses/by-sa/4.0/
how to cite: idris, t., ferazona, s., & safitri, h. (2021). profile of the ability of prospective biology teachers in making question instruments using bloom's taxonomy. reid (research and evaluation in education), 7(2), 177-185. doi:https://doi.org/10.21831/reid.v7i2.44903
https://doi.org/10.21831/reid.v7i2.44903 10.21831/reid.v7i2.44903 tengku idris, sepita ferazona, & herlina safitri page 178 copyright © 2021, reid (research and evaluation in education), 7(2), 2021 issn: 2460-6995 (online)
in addition, bloom's taxonomy can be used to describe cognitive goals and thinking levels (van niekerk & von solms, 2013), to determine mastery of exam questions (ebadi & shahbazian, 2015; momsen et al., 2013), and to evaluate textbooks (parsaei et al., 2017; sahragard & alavi, 2016). the teacher is one of the spearheads of educational progress; thus, a teacher must be able to measure learning progress with the various instruments that have been studied, such as questions to measure cognitive aspects, performance assessments to measure psychomotor aspects, and non-test instruments such as observation sheets for affective aspects (johnson et al., 2021). a teacher's ability to make questions is an outcome of the learning evaluation courses taken at the home campus. according to hadiprayitno et al. (2019), most students (≥70%) had difficulty in learning biology material. this difficulty is in line with the difficulties of teachers in developing assessment instruments. the biology education study program, faculty of teacher training and education, islamic university of riau is one of the teacher education institutions that produce prospective biology teachers. because its role is to equip students for teaching, evaluation courses are required for prospective teacher students to help them measure the learning abilities of their own students later.
the course on evaluation and techniques for achieving biology learning outcomes is taken in semester 5, and one of its topics is bloom's taxonomy. the learning outcome of this material is that students are able to design, apply, evaluate, and create evaluation tools for measuring the cognitive, affective, and psychomotor domains using bloom's taxonomy. the purpose of this study was to determine the ability of prospective biology teacher students in the biology education study program to make questions based on bloom's taxonomy. this study has important implications for subsequent lectures, because its findings will serve as input to the learning process in the evaluation course in the biology education study program in particular and in evaluation courses at teacher education institutions in general.
method
this research is a descriptive study using a checklist sheet, interviews, and questionnaires as instruments. the research was conducted at the biology education study program, faculty of teacher training and education, islamic university of riau, academic year 2021/2022. the research sample comprises all students who took the course on evaluation and techniques for achieving biology learning outcomes in semester 5, as shown in table 1. the data used to answer the research problem are questions made by students on the topic of bloom's taxonomy. there are two kinds of questions, namely multiple choice and essay. meanwhile, to find out students' perceptions of their abilities, a limited questionnaire was given. the research instrument is a checklist sheet that has been content validated by experts. the data are calculated with formula (1), in which p = percentage, f = frequency, and n = number of samples (spuck et al., 1975). the results are then categorized based on the criteria of asrul et al. (2014), as shown in table 2.
table 1.
population and research sample
class  population  sample
5a     31          31
5b     31          31
total  62          62
$p = \frac{f}{n} \times 100\%$ (1)
table 2. categorization of students' ability to make cognitive questions
achievement  category
86 – 100%    very good
76 – 85%     good
60 – 75%     fairly good
55 – 59%     poor
≤ 54%        very poor
findings and discussion
findings
based on the results of the study, data were obtained on students' ability to make questions based on bloom's taxonomy in the lower order thinking skills (lots) category. based on table 3, the overall ability of students in making lots questions is in the very good category, with a percentage of 92.22%. in the remember category (c1), the two classes perform equally, each very good with an average of 100%, which means that all students are able to make c1 questions correctly. for understanding questions (c2), 95% of students were able to make questions correctly, both multiple choice and essay questions. meanwhile, for applying questions (c3), the students' ability reached 81.67%, in the good category. table 4 shows that the overall ability of biology education students in making hots-type questions is in the very poor category, with a percentage of 44.17%: 46.11% in class a and 42.22% in class b, both in the very poor category. among the higher order thinking levels, the ability to make c6 (creating) questions is the lowest, at 31.67%, while the ability to make analyzing questions (c4) is in the poor category, at 54.17%. the ability of students to make evaluating questions (c5) is in the very poor category, at 46.67%.
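the calculation in formula (1) and the bands in table 2 can be sketched in a few lines of python. this is an illustrative sketch, not the authors' procedure: it assumes formula (1) is the usual percentage p = f/n × 100, and because table 2 leaves gaps between bands (for example between 54% and 55%), it assumes each band starts at its printed lower bound and that any value above 54 counts as poor, which is how table 4 labels 54.17. the function names `percentage` and `categorize` are hypothetical.

```python
def percentage(f, n):
    """Formula (1) as assumed here: p = f / n * 100."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 100.0 * f / n


def categorize(p):
    """Map a percentage to a table 2 category.

    Boundary handling is an assumption: table 2 does not say how
    values falling between the printed bands should be treated.
    """
    if p >= 86:
        return "very good"
    if p >= 76:
        return "good"
    if p >= 60:
        return "fairly good"
    if p > 54:
        return "poor"
    return "very poor"


# hypothetical count, for illustration only: 31 correct items out of 62
print(percentage(31, 62), categorize(percentage(31, 62)))

# percentages taken from tables 3 and 4
for value in (92.22, 81.67, 54.17, 46.67):
    print(value, categorize(value))
```

with these assumptions the sketch assigns the same labels that tables 3 and 4 give to 92.22, 81.67, 54.17, and 46.67.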
based on figure 1, the students' ability in making questions based on bloom's taxonomy is in the fairly good category, with a percentage of 68.18%. the two classes do not differ much: both are in the fairly good category, with class a at 69.64% and class b at 66.94%. the ability of students in making lots questions is very good, while their ability in making hots questions is still lacking.
table 3. student ability to make lots questions
indicator   class a    class b    average    category
remember    100.00     100.00     100.00     very good
understand  95.00      95.00      95.00      very good
apply       83.33      80.00      81.67      good
average     92.78      91.67      92.22      very good
category    very good  very good  very good
table 4. student ability to make hots questions
indicator   class a    class b    average    category
analyze     58.33      50.00      54.17      poor
evaluate    48.33      45.00      46.67      very poor
create      31.67      31.67      31.67      very poor
average     46.11      42.22      44.17      very poor
category    very poor  very poor  very poor
figure 1. the ability to create bloom's taxonomy questions
table 5. students' ability in making questions based on the type of questions
level    multiple choice  category  essay   category
c1       100.00           vg        100.00  vg
c2       93.33            vg        96.67   vg
c3       73.33            fg        90.00   vg
c4       46.67            vp        61.67   fg
c5       33.33            vp        60.00   fg
c6       26.67            vp        36.67   vp
average  62.22            fg        74.17   fg
(vg = very good; fg = fairly good; vp = very poor)
based on table 5, the ability of students in making essay questions is higher than their ability in making multiple choice questions. for both types, the average ability of students is in the fairly good category, with a percentage of 62.22% for multiple choice questions and 74.17% for essay questions.
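the averages in tables 3 to 5 can be re-derived from the per-level percentages. the sketch below is only an arithmetic illustration under the assumption that each average is an unweighted mean of the listed values; the variable names are hypothetical, and the values are copied from the average column of tables 3 and 4 and from table 5.

```python
from statistics import mean

# average column of table 3 (lots) and table 4 (hots)
LOTS = {"remember": 100.00, "understand": 95.00, "apply": 81.67}
HOTS = {"analyze": 54.17, "evaluate": 46.67, "create": 31.67}

# table 5, levels c1..c6
MULTIPLE_CHOICE = [100.00, 93.33, 73.33, 46.67, 33.33, 26.67]
ESSAY = [100.00, 96.67, 90.00, 61.67, 60.00, 36.67]

lots_avg = mean(LOTS.values())
hots_avg = mean(HOTS.values())
# unweighted mean over all six cognitive levels (an assumption)
overall_avg = mean(list(LOTS.values()) + list(HOTS.values()))
mc_avg = mean(MULTIPLE_CHOICE)
essay_avg = mean(ESSAY)

print(f"lots {lots_avg:.2f}  hots {hots_avg:.2f}  overall {overall_avg:.2f}")
print(f"multiple choice {mc_avg:.2f}  essay {essay_avg:.2f}")
```

computed this way, the lots, hots, multiple choice, and essay means match the table values after rounding, and the six-level mean comes out at roughly 68.2, close to the 68.18% reported with figure 1.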
discussion
figure 1 shows that the average ability of prospective biology teacher students in the biology education study program, faculty of teacher training and education, islamic university of riau in making bloom's taxonomy questions is in the fairly good category, with a percentage of 78.06%. the ability to make questions is very good at the level of lower order thinking skills and very poor at the level of higher order thinking skills. bloom's taxonomy is a taxonomy created for learning purposes, namely to describe student learning outcomes. this taxonomy was first compiled by benjamin s. bloom, krathwohl, and colleagues in 1956. bloom's taxonomy is the most influential taxonomy in the world compared to other taxonomies; more than 60 countries have used it in education. it is considered the taxonomy that best meets educational needs and is very easy to measure because each level is equipped with operational verbs and the hierarchy is clear (anderson et al., 2001). the use of bloom's taxonomy in education is very broad, covering various fields of knowledge and perspectives (pappas et al., 2013). while making bloom's taxonomy questions, students faced several obstacles, including difficulty in distinguishing between understanding and applying questions. the available operational verbs are actually very helpful for students in making questions, but in practice the questions they made did not match the selected operational verbs.
in addition to the aforementioned constraints, in higher-level (hots) questions, some of the information presented in the questions is not sufficient, so the questions end up at level c2 or c3. to support students' ability to design and create questions using bloom's taxonomy, the supervisor provides direct examples of using operational verbs according to their level. the results of the questionnaire show that hots questions are difficult material for students, so they require special assistance and a longer time to study it. in designing and developing learning innovations, it is important to focus on supporting the process of constructing and reflecting on students' knowledge rather than conveying and memorizing principles and facts (hooshyar et al., 2019). table 3 presents the students' ability in making questions with lower order thinking skills (lots), which consist of questions on the ability to remember (c1), to understand (c2), and to apply (c3) (prakash & litoriya, 2021). based on these data, the average ability of students in making lots questions is in the very good category, with a percentage of 92.22%. this is in line with research by lee et al. (2015), which shows that the question items used by the majority of teachers are in the lots category. the ability of students to make lots questions is very important for later measuring their own students' basic abilities, which are a basis for higher thinking (krathwohl, 2002). the ability of students to develop evaluation tools is very important because it is the basis for teachers in measuring learning success.
the more difficult it is for the teacher to make questions, the more difficult students' learning will be, because learning activities cannot be measured properly. this is in line with research conducted by hadiprayitno et al. (2019), which showed that most students (≥70%) had difficulty in learning biology material. this difficulty is in line with the difficulties of teachers in developing assessment instruments. in addition to lots questions, prospective biology teacher students must also be able to make questions with high-level criteria, called hots (higher order thinking skills). higher order thinking is an activity or thinking process directed at manipulating ideas and information in a way that can provide new understanding for students (retnawati et al., 2018). there are three levels of higher order thinking according to bloom's taxonomy, namely (1) analyzing, which can take the form of analyzing and then classifying information, identifying the causes and effects of an event, and identifying questions; (2) evaluating, which means providing assessment and analysis using certain criteria, checking and speculating, and accepting or rejecting arguments with clear criteria; and (3) creating, which consists of sharing perceptions or points of view, modeling problem solving, and innovating (krathwohl, 2002). beyond hots as defined in bloom's taxonomy, several researchers and experts hold different views, such as dillo and scott (2002), miri et al. (2007), and zohar and dori (2003), who found that hots consists of the ability to analyze, synthesize, evaluate, develop skills to estimate, generalize, create thoughts, make decisions, and think critically and systematically (kwangmuang et al., 2021). according to dillo et al. and kwangmuang et al., hots consists of critical thinking, creative thinking, problem solving, and decision making (retnawati et al., 2018).
hots can also consist of only two components: critical and creative thinking (sulaiman et al., 2017). according to prakash and litoriya (2021), hots is an important element in education because of its benefits in improving student achievement, reducing weaknesses, interpreting, synthesizing, solving problems, and controlling information, ideas and daily activities. table 4 shows that the ability of students in making hots questions is in the very poor category, with a percentage of 44.17%. this poor result holds for all research subjects, in both class a and class b; seen level by level, hots questions at levels c4 to c6 are all in the poor category. the results of this study are in line with the findings of arti and hariyatmi (2015), who showed that the ability of teachers at sman wonosari klaten in making c4 (15.2%), c5 (3%), and c6 (3%) questions is still very low. from interviews with research subjects, it was found that the difficulty they faced was making questions that match the operational verbs of each level. in addition, providing enough information to be processed in hots questions is also not easy. to help students understand and be able to make hots questions, the course supervisor provides feedback on the assignments and presentations made by the presenting students. schut et al. (2020), deiglmayr (2018), and sargeant (2015) support this, namely that providing feedback makes the learning process more effective. from the input on learning collected after the lectures were over, specifically on materials considered difficult, one of which is hots, students hoped that the time would be increased and that the course supervisor would give a detailed explanation.
the questions commonly used by educators to measure the cognitive domain are limited-response types in the form of multiple choice questions, and essay questions. multiple-choice test questions can be used to measure more complex learning outcomes related to memory, understanding, application, analysis and evaluation (stringer et al., 2021). a multiple-choice test question consists of the part that carries the subject matter and the answer choices. the part that carries the subject matter can be stated as a question or an incomplete statement and is often called the stem, while the answer choices can be words, numbers or sentences and are often called options (chin et al., 2021). the answer choices consist of the correct or most correct answer, called the answer key, and plausible wrong answers, called distractors (decoys or foils), which someone may choose if they have not mastered the material asked in the question (hingorjo & jaleel, 2012; sahoo & singh, 2017). in addition, essay questions used to measure the cognitive domain consist of description questions whose answers are limited (restricted response essay items), or objective description questions, and description questions whose answers are less limited (extended response essay items), or non-objective description questions (ramesh & sanampudi, 2021). based on table 5, on average, students' ability to make essay questions is higher (74.13%) than their ability to make multiple choice questions (62.22%). seen level by level, questions at c1, c2 and c6 fall in the same category for both question types: very good (c1 and c2) and very poor (c6). c1 and c2 questions are in the easy category, so students can make both types of questions, whereas c6 questions are very difficult, so students cannot make either type.
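the anatomy of a multiple-choice item described above (stem, options, answer key, distractors) can be sketched as a small data structure. this is only an illustrative sketch: the class name `MultipleChoiceItem`, the `bloom_level` field, and the example item are hypothetical and are not taken from the study's instruments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceItem:
    """One multiple-choice item: a stem, its options, and the answer key."""

    stem: str          # the question or incomplete statement
    options: tuple     # every answer choice shown to the test taker
    key: str           # the correct (or most correct) option
    bloom_level: str   # hypothetical label such as "c1" ... "c6"

    def __post_init__(self):
        # The key has to be one of the options, otherwise the item is malformed.
        if self.key not in self.options:
            raise ValueError("answer key must be one of the options")

    def distractors(self):
        """Return the plausible wrong answers: every option except the key."""
        return [opt for opt in self.options if opt != self.key]


# Hypothetical example item, written only to illustrate the structure.
item = MultipleChoiceItem(
    stem="which organelle is the main site of atp production?",
    options=("ribosome", "mitochondrion", "golgi apparatus", "lysosome"),
    key="mitochondrion",
    bloom_level="c1",
)
print(item.distractors())
```

keeping the key separate from the distractors makes it explicit that writing a multiple-choice item requires both a correct answer and several plausible wrong ones, which is the extra work the students reported compared with essay questions.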
meanwhile, the categories differ between the two question types at levels c3, c4 and c5. these data indicate that students are better at making essay questions than multiple choice questions. according to the students, essay questions are easier to make because they are not tied to alternative answers, so the operational verbs can be used directly. the difficulty in making multiple choice questions lies not only in making the alternative answer choices; creating and providing the information to be processed is equally difficult.
conclusion
based on the research that has been done, it is concluded that the ability of prospective teacher students in making questions using bloom's taxonomy is in the fairly good category, with a percentage of 78.06%.
acknowledgment
the authors would like to thank the dppm of the islamic university of riau for providing the research funds, as well as the study program and students of biology education, faculty of teacher training and education, islamic university of riau, where the research was conducted.
references
anderson, l. w., krathwohl, d. r., airasian, p. w., cruikshank, k. a., mayer, r. e., pintrich, p. r., raths, j., & wittrock, m. c. (2001). a taxonomy for learning, teaching, and assessing: a revision of bloom's taxonomy of educational objectives. longman. https://www.uky.edu/~rsand1/china2018/texts/anderson-krathwohl%20-%20a%20taxonomy%20for%20learning%20teaching%20and%20assessing.pdf
arievitch, i. m. (2020). the vision of developmental teaching and learning and bloom’s taxonomy of educational objectives. learning, culture and social interaction, 25, 100274. https://doi.org/10.1016/j.lcsi.2019.01.007
arti, e. p. n., & hariyatmi, h. (2015).
kemampuan guru mata pelajaran biologi dalam pembuatan soal hot (higher order thinking) di sma negeri 1 wonosari klaten. undergraduate thesis, universitas muhammadiyah surakarta, sukoharjo. http://eprints.ums.ac.id/33446/
asrul, a., ananda, r., & rosinta, r. (2014). evaluasi pembelajaran. ciptapustaka media.
baghaei, s., bagheri, m. s., & yamini, m. (2020). analysis of ielts and toefl reading and listening tests in terms of revised bloom’s taxonomy. cogent education, 7(1), 1720939. https://doi.org/10.1080/2331186x.2020.1720939
bloom, b. s., engelhart, m. d., furst, e. j., hill, w. h., & krathwohl, d. r. (1956). taxonomy of educational objectives. d. mckay.
chin, h., chew, c. m., & lim, h. l. (2021). development and validation of online cognitive diagnostic assessment with ordered multiple-choice items for ‘multiplication of time.’ journal of computers in education, 8(2), 289–316. https://doi.org/10.1007/s40692-020-00180-7
deiglmayr, a. (2018). instructional scaffolds for learning from formative peer assessment: effects of core task, peer feedback, and dialogue. european journal of psychology of education, 33(1), 185–198. https://doi.org/10.1007/s10212-017-0355-8
ebadi, s., & shahbazian, f. (2015). exploring the cognitive level of final exams in iranian high schools: focusing on bloom’s taxonomy. journal of applied linguistics and language research, 2(4), 1–11. http://jallr.ir/index.php/jallr/article/view/58
ghidinelli, m., cunningham, m., monotti, i. c., hindocha, n., rickli, a., mcvicar, i., & glyde, m. (2021). experiences from two ways of integrating pre- and post-course multiple-choice assessment questions in educational events for surgeons. journal of european cme, 10(1), 1918317. https://doi.org/10.1080/21614083.2021.1918317
hadiprayitno, g., muhlis, & kusmiyati. (2019). problems in learning biology for senior high schools in lombok island. journal of physics: conference series, 1241(1). https://doi.org/10.1088/1742-6596/1241/1/012054
hingorjo, m. r., & jaleel, f.
(2012). analysis of one-best mcqs: the difficulty index, discrimination index and distractor efficiency. journal of the pakistan medical association, 62(2), 142–147. https://jpma.org.pk/pdfdownload/3255.pdf
hooshyar, d., lim, h., pedaste, m., yang, k., fathi, m., & yang, y. (2019). autothinking: an adaptive computational thinking game. in lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), 11937, 381-391. springer. https://doi.org/10.1007/978-3-030-35343-8_41
johnson, c., boon, h., & dinan thompson, m. (2021). cognitive demands of the reformed queensland physics, chemistry, and biology syllabus: an analysis framed by the new taxonomy of educational objectives. research in science education, 0123456789. https://doi.org/10.1007/s11165-021-09988-4
kidwell, l. a., fisher, d. g., braun, r. l., & swanson, d. l. (2013). developing learning objectives for accounting ethics using bloom’s taxonomy. accounting education, 22(1), 44–65.
https://doi.org/10.1080/09639284.2012.698478
krathwohl, d. r. (2002). a revision of bloom’s taxonomy: an overview. theory into practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2
kwangmuang, p., jarutkamolpong, s., sangboonraung, w., & daungtod, s. (2021). the development of learning innovation to enhance higher order thinking skills for students in thailand junior high schools. heliyon, 7(6), e07309. https://doi.org/10.1016/j.heliyon.2021.e07309
lee, y. j., kim, m., & yoon, h. g. (2015). the intellectual demands of the intended primary science curriculum in korea and singapore: an analysis based on revised bloom’s taxonomy. international journal of science education, 37(13), 2193–2213. https://doi.org/10.1080/09500693.2015.1072290
meda, l., & swart, a. j. (2018). analysing learning outcomes in an electrical engineering curriculum using illustrative verbs derived from bloom’s taxonomy. european journal of engineering education, 43(3), 399–412. https://doi.org/10.1080/03043797.2017.1378169
momsen, j., offerdahl, e., kryjevskaia, m., montplaisir, l., anderson, e., & grosz, n. (2013). using assessments to investigate and compare the nature of learning in undergraduate science courses. cbe life sciences education, 12(2), 239–249. https://doi.org/10.1187/cbe.12-08-0130
pappas, e., pierrakos, o., & nagel, r. (2013). using bloom’s taxonomy to teach sustainability in multiple contexts. journal of cleaner production, 48, 54–64. https://doi.org/10.1016/j.jclepro.2012.09.039
parsaei, i., alemokhtar, m. j., & rahimi, a. (2017). learning objectives in esp books based on bloom’s revised taxonomy. beyond words, 5(1), 14–22. http://journal.wima.ac.id/index.php/bw/article/view/1112
prakash, r., & litoriya, r. (2021). pedagogical transformation of bloom taxonomy’s lots into hots: an investigation in context with it education. wireless personal communications, 122, 725-736. https://doi.org/10.1007/s11277-021-08921-2
radmehr, f., & drake, m. (2017).
revised bloom’s taxonomy and integral calculus: unpacking the knowledge dimension. international journal of mathematical education in science and technology, 48(8), 1206–1224. https://doi.org/10.1080/0020739x.2017.1321796
ramesh, d., & sanampudi, s. k. (2021). an automated essay scoring systems: a systematic literature review. artificial intelligence review, 55, 2495-2527. https://doi.org/10.1007/s10462-021-10068-2
retnawati, h., djidu, h., kartianom, apino, e., & anazifa, r. d. (2018). teachers’ knowledge about higher-order thinking skills and its learning strategy. problems of education in the 21st century, 76(2), 215–230. https://doi.org/10.33225/pec/18.76.215
risnawati, r., andrian, d., azmi, m. p., amir, z., & nurdin, e. (2019). development of a definition maps-based plane geometry module to improve the student teachers’ mathematical reasoning ability. international journal of instruction, 12(3), 541–560. https://doi.org/10.29333/iji.2019.12333a
sahoo, d. p., & singh, r. (2017). item and distracter analysis of multiple choice questions (mcqs) from a preliminary examination of undergraduate medical students. international journal of research in medical sciences, 5(12), 5351-5355.
https://doi.org/10.18203/2320-6012.ijrms20175453
sahragard, r., & alavi, s. z. (2016). investigating the predominant levels of learning objectives in general english books. journal of english language teaching and learning studies, 8(17), 93-114. https://elt.tabrizu.ac.ir/article_4962.html
sargeant, j. (2015). reflecting upon multisource feedback as ‘assessment for learning.’ perspectives on medical education, 4(2), 55–56. https://doi.org/10.1007/s40037-015-0175-y
schut, a., van mechelen, m., klapwijk, r. m., gielen, m., & de vries, m. j. (2020). towards constructive design feedback dialogues: guiding peer and client feedback to stimulate children’s creative thinking. international journal of technology and design education, 32, 99-127. https://doi.org/10.1007/s10798-020-09612-y
spuck, d. w., hubert, l. j., & lufler, h. s. (1975). an introduction to educational policy research. education and urban society, 7(3), 211-219. https://doi.org/10.1177/001312457500700301
stringer, j. k., santen, s. a., lee, e., rawls, m., bailey, j., richards, a., perera, r. a., & biskobing, d. (2021).
examining bloom’s taxonomy in multiple choice questions: students’ approach to questions. medical science educator, 31(4), 1311–1317. https://doi.org/10.1007/s40670-021-01305-y
sulaiman, t., muniyan, v., madhvan, d., hasan, r., & rahim, s. s. a. (2017). implementation of higher order thinking skills in teaching of science. international research journal of education and sciences, 1(1), 1–3. https://www.masree.info/wp-content/uploads/2019/11/implementation-of-higher-order-thinking-skills-in-teaching-of-science.pdf
tangsakul, p., kijpoonphol, w., duy linh, n., & kimura, l. n. (2017). using bloom’s revised taxonomy to analyze reading comprehension questions in team up in english 1-3 and grade 9 english o-net tests. international journal of research -granthaalayah, 5(7), 31–41. https://doi.org/10.29121/granthaalayah.v5.i7.2017.2106
van niekerk, j., & von solms, r. (2013). using bloom’s taxonomy for information security education. ifip advances in information and communication technology, 406, 280-287. https://doi.org/10.1007/978-3-642-39377-8_33
waite, l. h., zupec, j. f., quinn, d. h., & poon, c. y. (2020). revised bloom’s taxonomy as a mentoring framework for successful promotion. currents in pharmacy teaching and learning, 12(11), 1379–1382.
https://doi.org/10.1016/j.cptl.2020.06.009
research and evaluation in education journal e-issn: 2460-6995
research and evaluation in education journal volume 1, number 1, june 2015 (45-54) available online at: http://journal.uny.ac.id/index.php/reid
factor analysis to identify the dimension of test of english proficiency (toep) in the listening section
1) heri retnawati; 2) sudji munadi; 3) yosa abduh al-zuhdy
1)2)3) yogyakarta state university, indonesia
1) retnawati.heriuny1@gmail.com; 2) sudji.munadi@gmail.com; 3) alzuhdy@gmail.com
abstract
this research is aimed at identifying the number of ability dimensions contained in the test of english proficiency (toep), particularly in the listening section. this study is explorative descriptive quantitative research. the data of the study are the responses of toep participants across indonesia in 2010 to some toep sets. the participants are grade ix students of senior high schools. the data were obtained from the documentation of the directorate of senior high schools founding of the national education ministry, jakarta.
the data analysis for identifying the dimensions was done by implementing exploratory and confirmatory factor analysis. the exploratory factor analysis was done using the spss computer program. the result of the research shows that all seven listening sets analyzed contain a dominant listening dimension when they are analyzed using the graphics method, explainable variance, and eigenvalue ratio.
keywords: dimension, toep, listening
introduction
the development of science, technology, and communication is increasingly rapid. this development poses a global challenge, which can be answered, among other ways, by preparing human resources who are able to communicate internationally. one of the competences needed in this case is english competence. in indonesia, an english test which is considered to be standardized is the test of english proficiency (toep), which has been calibrated and proved able to predict the participants’ competence in the international english language testing system (ielts) or the test of english as a foreign language (toefl). toep measures listening and reading ability. so far, english competence has been measured by means of tests. the result of the test is then analyzed using the assumptions of classical test theory or item response theory. one of the assumptions for analysis under both theories is that a test measures one competence only, known as unidimensionality. in designing, assembling, and analyzing test items, the approach employed is the unidimensional test approach, in which a test measures only one single dimension. in item response theory, there are assumptions to be fulfilled, namely local independence and unidimensionality (hambleton, swaminathan and rogers, 1991; hulin et al., 1983).
local independence occurs when the factors influencing achievement are held constant, so that a subject's responses to any pair of items are statistically independent of each other. unidimensionality means that each item of the test measures only one ability. the assumption of unidimensionality can only be shown if the test contains only one dominant component which measures the achievement of a subject. in reality, the unidimensionality assumption is hard to fulfil. this is in line with the opinion that most educational and psychological tests are to some degree multidimensional (bolt and lall, 2003; ackerman et al., 2003). unidimensional analysis of data which are in reality multidimensional will cause systematic errors in the test administration. as a result, the information gained will be misleading and detrimental to the participants of the test. the existence of multidimensional content in test components which are analyzed using a unidimensional model causes inaccurate ability estimation and gives misleading information. in relation to this, research is needed on the number of dimensions contained in the toep sets, especially the listening section, which can be used as a prerequisite for further analysis of the test sets using the item response theory approach. in a test, the representation of the content is vital to the test validity. through an evaluation of the test, which can be carried out by a subject-matter expert (sme), the items that form a test and their relevance to the planned domain can be revealed (sireci and geisinger, 1995). the result of the evaluation will show the dimensions which are consistent with the content structure, or dimensions which are not consistent with the content structure. this result also underlies multidimensional scaling (mds). multidimensional scaling positions the items in a space at certain coordinate locations. this space is determined by certain dimensions as its axes.
the relative distance between item pairs reflects the differences between the items (bolt, 2001). the nearer an item is to another, the greater the similarity in characteristics between the two items. based on the closeness of the distance, or the similarity of the characteristics, the items can be categorized according to their substance. this analysis is known as hierarchical cluster validity analysis (sireci and geisinger, 1995). tests in education and psychology which measure latent variables are multidimensional. if the item analysis uses a unidimensional approach, an inaccurate ability measurement is produced (wang, chen, and cheng, 2004). this happens because the unidimensional approach ignores the correlation between latent competences. the multidimensional measurement approach takes into account the relation between the latent competences, which increases measurement accuracy. another advantage of the item response theory is proposed by de la torre and patz (2005): analysis with this approach gives additional information which increases the accuracy of item parameter estimation. in this situation, unidimensionality is a special case of multidimensionality, namely when the correlation between latent variables is equal to zero. there are two types of dimensional structure, namely the double-dominant type with inter-dimension correlation, and one dominant dimension with some minor dimensions (kirisci, hsu and yu, 2001). in line with this, wang, chen and cheng (2004) state that there are two kinds of multidimensionality, namely inter-item multidimensionality and within-item multidimensionality. research usually employs instruments which involve a lot of items. in order to understand such data, factor analysis is usually used.
factor analysis is used to reduce the data by finding relationships among variables which are independent of each other (stapleton, 1997), which are then assembled into fewer variables to find the latent dimensional structure (anonymous, 2001; garson, 2006), known as factors. a factor is a new variable, also called a latent variable or construct variable, and is characterized by not being directly observable. factor analysis can be done in two ways, namely exploratory factor analysis and confirmatory factor analysis. the basis of both exploratory and confirmatory factor analysis is reducing a large number of variables. for instance, if the initial variables are $x_1, \ldots, x_q$, the set of latent factors to be found is $\xi_1, \ldots, \xi_n$ (with $q > n$). an observable variable depends on a linear combination of the latent factors $\xi_j$, which is expressed by
$$x_i = \lambda_{i1}\xi_1 + \lambda_{i2}\xi_2 + \cdots + \lambda_{in}\xi_n + \delta_i \qquad (1)$$
where $\delta_i$ (measurement error) is the unique part of $x_i$, which is assumed not to be correlated with $\xi_1, \xi_2, \ldots, \xi_n$. for $i \neq j$, $\delta_i \neq \delta_j$. the unique part comprises a special factor $s_i$ and a random measurement error $e_i$. in confirmatory factor analysis, the number of latent variables $\xi$ is smaller than in exploratory factor analysis. in factor analysis, there is a squared factor loading. this squared factor loading indicates the amount of variance in the observed variables which can be explained by the factors (van de geer, 1971). the variance of the observed variables which can be explained by a factor is usually expressed as a percentage of the total variance of all observed variables. exploratory factor analysis is a technique for detecting and assessing latent sources of variance or covariance in a measurement (joreskog & sorbom, 1993). exploratory factor analysis is defined as exploring empirical data to find and detect the characteristics of, and relationships among, variables without specifying a model for the data.
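the link between squared factor loadings and explained variance described above can be illustrated with a short computation. this is a minimal sketch under stated assumptions: the loading values are invented for illustration (they are not results from the toep data), the observed variables are assumed to be standardized so that the total variance equals the number of variables, and the factors are assumed to be orthogonal. the function names are hypothetical.

```python
# Hypothetical loading matrix: rows are observed variables x_1..x_4,
# columns are latent factors xi_1 and xi_2 (values invented for illustration).
loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.2, 0.6],
    [0.1, 0.9],
]


def communalities(lam):
    """Variance of each observed variable explained by all factors together."""
    return [sum(value * value for value in row) for row in lam]


def variance_explained_pct(lam):
    """Percentage of the total variance explained by each factor.

    Assumes standardized observed variables (variance 1 each), so the total
    variance equals the number of variables, and orthogonal factors.
    """
    n_variables = len(lam)
    n_factors = len(lam[0])
    return [
        100.0 * sum(row[j] ** 2 for row in lam) / n_variables
        for j in range(n_factors)
    ]


print(communalities(loadings))
print(variance_explained_pct(loadings))
```

summing squared loadings across a row gives the share of one variable's variance explained by the factors, while summing down a column and dividing by the number of variables gives the relative percentage attributed to one factor.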
in this analysis, the researchers do not have an a priori theory from which to form a hypothesis (stapleton, 1997). given its exploratory character, the result of exploratory factor analysis is weak. the result, which explains only relationships among variables, is not based on an applied theory. it depends only on the empirical data, and if the number of observed variables is large, the result will be difficult to interpret (stapleton, 1997). factor analysis is usually strongly related to questions about validity (nunnally, 1978). when the identified factors are linked, exploratory factor analysis answers the question of construct validity, that is, whether a score measures what it is supposed to measure. in exploratory factor analysis, the aim is to explain the variance in the observed variables which can be explained by latent factors. then, in order to interpret the result of exploratory factor analysis, rotation is conducted. there are two kinds of rotation which can be employed to interpret factors, namely orthogonal rotation (varimax, quartimax, and equamax) and non-orthogonal or oblique rotation (direct oblimin and promax). the result is the rotated factor loading matrix, in which each factor is then named based on its dominant items (wells & purwono, 2009). in order to assess english competence, communicative language competence needs to be defined. the various communicative competence models proposed by language experts basically share the same concept, which includes four main competences, namely grammatical/linguistic competence, strategic competence, sociocultural/sociolinguistic competence, and discourse competence (bachman, 1990; bachman & palmer, 1996; savignon, 1997).
the second model is a modification of the first model, developed by language experts (bachman, 1990; bachman & palmer, 1996) based on the results of research in the field of language evaluation. therefore, this second model may be more appropriate as the conceptual basis for developing evaluation systems in the language field. according to bachman & palmer, the term communicative competence is equated with the term language ability, a construct that is supposed to be measured by a language test. language ability covers two components: language knowledge and strategic competence (also known as metacognitive strategy). a language user needs the combination of these two competences to be able to produce or interpret discourse, both in taking a language test and in using the language in real life. the development of toep items refers to the taxonomy of language ability. in this case, munby (1981) identified micro-language abilities, which can be divided into listening, speaking, reading, and writing ability. thus, this research aims to identify the number of ability dimensions contained in the test of english proficiency (toep), particularly in the listening section.
method
this research employed a quantitative method with an explorative descriptive approach, because the multidimensional loading of the toep sets was to be identified. the study was conducted in yogyakarta special region from april 2013 until november 2013. the data in this study were the responses of toep participants across indonesia in 2007-2010 to some toep sets. the participants were grade ix senior high school students. the data were obtained from the documentation of the directorate of senior high schools founding of the national education ministry, jakarta.
the data sources were in the form of students’ answer sheets which had been documented in the form of computer data. the data which had been gained were in the form of toep participants’ responses and the sets of toep employing exploratory factor analysis, then the amount of dimensional loading in the test sets were estimated. the quantity of the dimension would be known by counting the amount of the factors contained in the test sets in the factor analysis, both exploratory and confirmatory. this exploratory factor analysis was done using a computer program, statistical package for social sciences (spss). findings and discussion findings in this research, the dimensionality of toep sets was proven through three ways, namely graphics, percentage of the explainable variance, and the ratio of the first and the second eigen value. analysis research and evaluation in education journal factor analysis to identify the dimension... 49 heri retnawati, sudji munadi, & yosa abduh al-zuhdy was done by employing spss to find out the eigen value, then, with the help of microsoft excel to draw the graphics, the percentage of the explainable variance and the ratio of the first and second eigen value were counted. each result is presented below. dimensional validation with graphics in the toep 1a set, for the listening section, there is one part of the graphics which is steep. this indicates that the 1a listening section measures one main dimension, that is, listening ability. more complete result is presented in figure 1. toep 1a listening figure 1. scree plot for the toep 1a set the same thing happens to the toep 2a set. for the listening section, there is one part of the graphics which is steep. it indicates that the listening section 2a measures one main dimension, that is, listening 2a set is proven to measure listening empirically. the result is presented in figure 2. toep 2a listening figure 2. 
scree plot for the toep 2a set

in the toep 2b set, there is also a steep part in the graphic. it indicates that the listening 2b set measures one main dimension. this result shows that listening section 2b is empirically proven to measure listening ability. this different dimension is reading dimension. the complete result is presented in figure 3.

toep 2b listening
figure 3. scree plot for toep 2b set

in the toep 3a set, the graphic for the listening section also has a steep part. it indicates that the listening 3a set measures one main dimension, that is, this set is empirically proven to measure listening ability. the result is presented in figure 4.

toep 3a listening
figure 4. scree plot for toep 3a set

in the toep 3b set, the graphic for the listening section has a steep part. it indicates that the listening section measures one main dimension, that is, the listening 3b set is empirically proven to measure listening ability. the result is presented in figure 5.

toep 3b listening
figure 5. scree plot for toep 3b set

the graphic for the toep 4a listening set shows the same thing. there is also one steep part, which indicates that the listening 4a set measures one main dimension, that is, the listening 4a set is empirically proven to measure listening ability. the complete result is shown in figure 6.

toep 4a listening
figure 6. scree plot for toep 4a set

a slightly different result occurs in the toep 4b set. in the graphic for the toep 4b listening section, one part is steep, another part is rather steep, and another part slopes gently. it indicates that listening 4b measures at least two dimensions, namely the listening dimension and another dimension.
the complete result is shown in figure 7.

toep 4b listening
figure 7. scree plot for toep 4b set

based on those results, graphically, there are two main domains measured by the toep sets. in the listening section, the main domain measured is listening. in the reading section, the main domain measured is reading.

dimensional validation with explainable variance percentage

factor analysis yields the eigenvalues, from which the percentage of explainable variance is obtained. the percentage of explainable variance shows how much of the variation in the measured scores is accounted for. an instrument is said to be unidimensional if the value is above 20%. based on the analysis result presented in table 1, the explainable variance percentage is still far below 20%. some sets, such as listening 1a, 2a, and 3b, are close to 15%, so they can be said to contain a dominant dimension.

table 1. explainable variance percentage from toep set

toep set    listening (l), %
1a          15.890
2a          15.500
2b          12.600
3a          11.329
3b          14.953
4a           9.872
4b           8.910

dimensional validation with the first and second eigenvalue ratio

the result of the first and second eigenvalue ratio analysis is presented in table 2. table 2 gives a result broadly similar to that of the explainable variance percentage. toep in listening section 1a, 2a, 3b, and reading 1a, 4b shows that those sets are unidimensional, while others contain dominant dimensions.

table 2.
result analysis of the first and second eigenvalue ratio

toep listening set    λ1       λ2       λ1/λ2
1a                    7.945    1.561    5.090
2a                    7.750    1.686    4.597
2b                    6.300    1.856    3.394
3a                    5.664    1.681    3.369
3b                    7.479    1.855    4.032
4a                    4.936    1.759    2.806
4b                    4.455    1.969    2.263

discussion

based on the results of the graphical method, the explainable variance, and the ratio of the first to the second eigenvalue, it can be said that the listening set is proven to contain one dominant dimension, namely listening, although another dimension is also measured. this other dimension is vocabulary. in doing the listening test, the participants need not only listening skill but also an understanding of the words they hear. this is related to the vocabulary and expressions that the students have already mastered. the listening section measures three components: responses, conversation, and mini talk. all three require vocabulary mastery, so it is understandable that the listening section measures not only the dominant dimension but also other dimensions. examples of responses, conversation, and mini talk are presented below.

an example of responses question

(man): hi, amir. long time no see. where have you been?
a. i see. i have a long story to tell you.
b. hi, budi. i was overseas for a short course.
c. yeah, steve has been out for a long time.
d. i have been waiting here for you since dawn.

an example of conversation question

man: sorry, i’m late.
woman: what happened? did you lose your way?
man: no. i had to work overtime finishing the report for tomorrow’s meeting. it’s a very busy time for us this week.
question: why was the man late?
in your test book, you read:
a. he lost his way.
b. he missed his bus.
c. he was in a meeting.
d. he had to work extra hours.

an example of mini talk:

the powerful healing properties of plants, spices, minerals, and fruit have been used for centuries.
ten everyday ingredients, gathered from all over the world, can be used to treat common ailments and injuries. there is no need for expensive prescriptions, you will find most of these remedies in your cupboard, or under the sink. the first ingredient is aloe vera. scientists are not sure how it works, but the gel you get when you cut a leaf of an aloe vera plant is rich in anti-inflammatory compounds as well as a chemical called bradykininase that acts as a topical painkiller. you can buy products containing aloe vera, but there’s no substitute for the real thing. the plant is easy to grow on a kitchen windowsill and thrives on neglect. to soothe sunburn, cuts, piles, and minor burns, wash the affected area thoroughly with soap and water. then cut a chunk off a leaf, slice it lengthways and squeeze out the gel. apply a generous coating to the injured area and repeat two or three times a day.

what does the talk mainly discuss?
a. the curing power that aloe vera offers.
b. ten ingredients that aloe vera consists of.
c. how to treat common ailments and injuries.
d. why there’s no substitute for aloe vera.

what can be said about the chemical called bradykininase?
a. it is anti-inflammatory.
b. it can relieve pains.
c. it can kill tropical animals.
d. it substitutes soap and water.

assumptions made during development seem to influence this analysis result. from the time the objectives of the test were formulated, the test was intended to measure only one dimension, that is, listening ability. in line with this, the analysis attests that the toep set measures only one dimension, namely english listening ability.
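the two numerical indices reported in tables 1 and 2 follow from simple arithmetic on the eigenvalues. the sketch below recomputes them from the published first and second eigenvalues of the listening sets. it assumes that each listening set has 50 items; this figure is not stated in the text and is only inferred from the relation between the first eigenvalue in table 2 and the percentage in table 1, so the recomputed percentages match table 1 only approximately. the names `EIGENVALUES`, `N_ITEMS`, and `dimensionality_indices` are illustrative.

```python
# first and second eigenvalues of the listening sets, copied from table 2
EIGENVALUES = {
    "1a": (7.945, 1.561),
    "2a": (7.750, 1.686),
    "2b": (6.300, 1.856),
    "3a": (5.664, 1.681),
    "3b": (7.479, 1.855),
    "4a": (4.936, 1.759),
    "4b": (4.455, 1.969),
}

# assumption: 50 items per listening set (inferred, not stated in the article)
N_ITEMS = 50


def dimensionality_indices(first, second, n_items=N_ITEMS):
    """return (explained variance in percent, ratio of first to second eigenvalue).

    the eigenvalues of a correlation matrix sum to the number of items,
    so the first factor accounts for first / n_items of the total variance.
    """
    explained_pct = 100.0 * first / n_items
    ratio = first / second
    return explained_pct, ratio


if __name__ == "__main__":
    for name, (first, second) in EIGENVALUES.items():
        pct, ratio = dimensionality_indices(first, second)
        print(f"{name}: explained variance = {pct:.2f}%, ratio = {ratio:.3f}")
```

a larger ratio means the first factor stands further apart from the second, which is how the article reads sets 1a, 2a, and 3b as the most clearly unidimensional.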
conclusion

all seven listening sets that were analyzed contain a dominant listening dimension when examined with the graphical method, the explainable variance, and the eigenvalue ratio. based on this result, it can be said that the listening section of toep contains one dominant dimension, namely listening. this result has implications for further analysis using unidimensional item response theory. accordingly, analysis with a logistic model can be applied to reveal the quality of the items quantitatively. further mapping of the factor loadings against the substance of the items is needed to find out which items measure more than one dimension. this can be a consideration for the toep developers.

references

ackerman, t. a., et al. (2003). using multidimensional item response theory to evaluate educational and psychological tests. educational measurement, 22, pp. 37-53.
anonymous. (2001). factor analysis. journal of consumer psychology, 10 (1&2), pp. 75-82. lawrence erlbaum.
bachman, l. f. (1990). fundamental considerations in language testing. oxford: oxford university press.
bachman, l. f. & palmer, a. s. (1996). language testing in practice. oxford: oxford university press.
bolt, d. m. (2001). conditional covariance-based representation of multidimensional test structure. applied psychological measurement, vol. 27, no. 3, pp. 244-257.
bolt, d. m. & lall, v. m. (2003). estimation of compensatory and noncompensatory multidimensional item response models using markov chain monte-carlo. applied psychological measurement, 27, pp. 395-414.
de la torre, j. & patz. (2005). making the most of what we have: a practical application of multidimensional item response theory in scoring. educational and behavioral statistics, 30, pp. 295-311.
garson, d. (2006). factor analysis. retrieved on 24 september 2006 from http://www2.chass.ncsu.edu/garson/pa765/index.htm
hambleton, r.
k., swaminathan, h. & rogers, h. j. (1991). fundamental of item response theory. newbury park, ca: sage publication inc.
hullin, c. l., et al. (1983). item response theory: application to psychological measurement. homewood, il: dow jones-irwin.
joreskog, k. & sorbom, d. (1993). lisrel88: structural equation modeling with the simplis command language. hillsdale, nj: scientific software international.
kirisci, l., hsu, t. & yu, l. (2001). robustness of item parameter estimation programs to assumptions of unidimensionality and normality. applied psychological measurement, 25, pp. 146-162.
munby, j. (1981). communicative syllabus design: a sociolinguistic model for defining the content of purpose-specific language programs. cambridge: cambridge university press.
nunally, j. (1978). psychometric theory (2nd ed.). new york, ny: mcgraw hill.
sireci, s. g., & geisinger, k. f. (1995). using subject-matter experts to assess content representation: an mds analysis. applied psychological measurement, vol. 19, no. 3, pp. 241-255.
stapleton. (1997). basic concepts and procedures of confirmatory factor analysis. retrieved on 25 september 2006 from http://ericae.net/ft/cfa.htm
van de geer, j. p. (1971). introduction to multivariate analysis for the social sciences. san francisco: w. h. freeman and company.
wang, w. c., chen, p. h., & cheng, y. y. (2004). improving measurement precision of test batteries using multidimensional item response models. psychological methods, vol. 9, no. 1, pp. 116-136.
wells, c. s. & purwono, u. (2009). assessing the fit of irt models to item response data. makalah pelatihan psikometri kerjasama pascasarjana uny dengan usaid [paper for the psychometrics training held jointly by the uny graduate school and usaid].
copyright © 2021, reid (research and evaluation in education), 7(2), 2021
issn: 2460-6995 (online)
reid (research and evaluation in education), 7(2), 2021, 88-105
available online at: http://journal.uny.ac.id/index.php/reid

an evaluation of the petty officer training program of the republic of indonesia police using the kirkpatrick model

junus simangunsong 1*; langgeng purnomo 2
1 ministry of education and culture, indonesia
2 human resources staffing of indonesian national police headquarter, indonesia
* corresponding author. e-mail: junussim@gmail.com

article info

article history
submitted: 25 december 2020
revised: 27 october 2021
accepted: 15 november 2021

keywords
kirkpatrick evaluation model; national identity; noken; petty officer

abstract

this study aims to evaluate the petty officer training program noken to determine the program's success based on predetermined character values. the program evaluation uses kirkpatrick's evaluation model with four levels: reaction, learning, behavior, and result. the data were obtained using a questionnaire, observations, interviews, and documentation and analyzed using quantitative and qualitative descriptive techniques. before the questionnaire and observation sheets were used, their readability was tested by the super team for the national identity formed by the republic of indonesia police headquarters and by experts in psychology and educational evaluation. the instrument readability test showed excellent results, with a score of 4.8 on a scale of 1-5. however, some revisions were needed, especially for the instruments for policymakers and management. the results of the evaluation of the two groups show that for the assessment at level 1: reaction, the participants' reaction to the organizing committee and resource persons is very high, and there is just a need to pay attention to the availability of facilities. at level 2: learning, the achievement of key competencies is very significant in its assessment. at level 3: behavior, there are behavioral changes including diligence in doing worship, discipline in attending classes, neat and clean clothes, working in groups, communicating well, and accuracy and speed in completing daily tasks. at level 4: result, the level of achievement of character values is very significant. the effect of conditioning and the teaching of materials contributes very significantly to applying national identity in everyday life.

this is an open access article under the cc-by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

how to cite: simangunsong, j., & purnomo, l. (2021). an evaluation of the petty officer training program of the republic of indonesia police using the kirkpatrick model. reid (research and evaluation in education), 7(2), 88-105. doi:https://doi.org/10.21831/reid.v7i2.36937

introduction

at present, the problems of the nation and country, both from outside and within the country, weaken social capital, which is marked by mutually suspicious relationships between citizens and groups of citizens. other problems include the spread of intolerance, hate speech, and identity politics. an example of this is the demonstration and riot in papua on august 29, 2019. this condition weakens national unity and brotherhood. therefore, to create a professional, modern, and reliable indonesian police in carrying out its main functions, the republic of indonesia police uses not only a community policing paradigm approach but also a national approach to support the realization of an action plan for the national resilience program.

in 2019, the republic of indonesia police chief's priority program was to create superior human resources of indonesian police through pro-active recruitment and recruitment with clean, transparent, accountable, and humanitarian principles, applied specifically to natives of papua and west papua (decree of the chief of indonesian national police no. kep/2513/xii/2019). the national police program is also an elaboration of the president's program, which is to create superior indonesian human resources, so that the recruitment of indonesian police members is expected to yield prospective petty officers who can carry out police duties in accordance with law no. 2 of 2002 concerning indonesian national police. furthermore, the preparation of the prospective members is called the petty officer training program of noken (potp noken) through the 2020 budget year admission. the participants of the 2020 potp noken are recruits of the indonesian police for the 2020 academic year who will take part in education for the formation of indonesian police officers and who are qualified to be professional, modern, and reliable police officers. this training program was designed by the human resources division of the republic of indonesia police headquarters, while the coaching and training was centered at the papua provincial police academy for two months from september to november 2020.
the number of the participants of the potp noken who had been recruited were 272, and 146 of them were from papua and 126 from west papua. however, due to the covid-19 pandemic, during the training the participants were divided into two groups: the regular group and the accelerated group. the regular group consisted of 257 participants while the accelerated group consisted of 15 participants. the potp noken aims to build the national character of the prospective petty officers, native to the regions of papua and west papua who are inspired by the identity of the indonesian nation so that they are able to act as a driving force in the realization of pancasila and become agents in strengthening the brotherly unity of the indonesian nation to support the realization of the country's goals. the potp noken curriculum was compiled by the module development team, the republic of indonesia police headquarters, which focuses on the character of the indonesian national identity (letter of assignment of the chief of indonesian national police no. sprin/1637/vi dik.2.1/2020/ssdm). the indonesian national identity consists of 12 values, namely: (1) faith, (2) humanity, (3) integrity, (4) humility, (5) tolerance, (6) brotherhood, (7) selflessness, (8) discipline, (9) mutual cooperation, (10) perseverance, (11) innovativeness, and also (12) communicativeness. the training program puts more emphasis on the humanitarian approach toward bringing about behavior change, not on physical pressure. the distinctive feature of the teaching is its nonmental violence use and sense of kinship and affection approach according to the characteristics of the unity of faith and humanity. 
the expected graduates are the indonesian citizens who have a view of themselves as the indonesian nation and the homeland of indonesia as a unitary unit of the unitary nation of the republic of indonesia which is imbued with national identity and prospective students of education for the formation of indonesian police officers in the fiscal year 2020. the teaching kits consist of: (1) values of national identity, (2) a dictionary of key competencies and behaviors, (3) teaching modules, and (4) videos and infographics. the graduate profile is characterized by five levels of key competencies, for example, tolerance has a level of behavior, including: (1) accepting differences, (2) likes to help others, (3) acting and inviting others, (4) consistently acting and inviting others, and (5) being a role model in the association among religious communities. the coaching and training is carried out through habituation and teaching of materials. habituation is an activity to apply the values of identity in everyday life. the implementation of the teaching of materials is divided into four stages, namely: (1) basic and personality formation stage, (2) debriefing national insights and police introduction stage, (3) self-development stage, and (4) stabilization stage. first, basic formation is the cultivation of noble values and character embodied in heart, exercise, thought, and intention through changing mindsets. personality is the stage of self-identification as an individual created by god almighty and as part of the indonesian people. this stage is also at the same time a self-concept that will become part of members of the family, community, and environment that prioritizes relationships and interactions with other people as part of indonesian society based on pancasila and the 1945 constitution of the republic of indonesia. 
second, the debriefing of national and police insights is the stage of inculcating the values of national insight originating from pancasila, the 1945 constitution, bhinneka tunggal ika, the unitary state of the republic of indonesia, the text of the proclamation of indonesian independence, the red and white flag, the indonesian language as the language of unity, the symbol of the state, and the national anthem. every indonesian citizen must hold these values in order to love the indonesian homeland, which is imbued with national identity, and to become superior human resources in the era of globalization and the industrial revolution 4.0. the profession introduction is the stage of introducing the main tasks of the indonesian police, the organization of the indonesian police, the implementation of general police duties, and the history of the indonesian police. character that is imbued with the identity of the indonesian nation is provided through the unity of faith and humanity so that the program participants can act as a driving force in the application of pancasila and become agents in strengthening the fraternal unity of the indonesian nation to support the realization of the country's goals (sumantri & setiawan, 2019). third, the self-development stage is the stage of providing participants with interpersonal skills, creativity, and innovativeness, in the form of understanding social media and the internet and how to use them for community service. this stage also provides motivation and inspiration so that they are confident to become wise citizens in living their daily lives.
finally, the stabilization stage is the stage of providing learning experiences in the form of direct work training in the community by solving problems or looking for problems that occur in the surrounding environment and doing social services and briefing lectures which are a summary of all previously studied subjects. the teaching model used in the potp noken is a combination of experiential learning and neuro linguistic programming (nlp). experiential learning is the learning process that involves each participant in contemporary and challenging activities designed based on a coaching program where participants will be actively involved to gain values and inspiration in a structured program, which aims to conduct a concrete experience process (kolb & kolb, 2005). nlp was developed at the university of california at santa cruz in 1970 (tosey et al., 2005). the founders and main authors are richard bandler, a student (originally) and john grinder, a professor of linguistics. neuro linguistic programming (nlp) is based on the idea that there is a relationship between neurological processes, language (linguistics) and behavioral patterns that originate from experience (programming). nlp is a learning method that activates conscious and unconscious brain power (conscious and subconscious mind) by using language (linguistics) in a sequence of mental processes (programming) that affects behavior to create positive and constructive meaning in our lives (hemmatimaslakpak et al., 2016). by studying these relationships, individuals are effectively transformed from their old ways of feeling, thinking, and behaving, into new and far more helpful forms of human communication (huehls, 2010; seysener, 2011). the purpose of this training is that the training participants have good strategies in dealing with stress and build a more positive perception. 
the coaching and training program is being implemented for the first time in the police environment, so it requires an evaluation whose purpose is to find out how successfully the training was implemented according to the predetermined stages. the program evaluation model specifically created for training is the kirkpatrick evaluation model (2006). this model was chosen because it has been widely used to evaluate training programs around the world. one example is the application of kirkpatrick's four-level model in the evaluation of education and training of instrument programs at the oil and gas training center in 2014 (ramadhon, 2016). kirkpatrick is an expert in evaluating training programs in the field of human resources development (hr). the evaluation model developed by kirkpatrick (1998) is known as the kirkpatrick four level evaluation model. according to kirkpatrick (1998), the evaluation of the effectiveness of a training program includes four levels of evaluation, namely: level 1: reaction, level 2: learning, level 3: behavior, and level 4: result. (1) reaction evaluation is an evaluation to determine the level of trainees' satisfaction with the implementation of a training. (2) learning evaluation is an evaluation to measure the additional level of knowledge, skills, and changes in trainees' attitudes after attending the training. (3) behavior evaluation is an evaluation to determine the level of change in the work behavior of trainees after returning to their work environment. (4) results evaluation is an evaluation to determine the result of changes in the work behavior of trainees on the level of organizational productivity. the four stages of evaluation are described in more detail as follows.
kirkpatrick's four level evaluation model

the four-level evaluation model was introduced in 1959 when donald l. kirkpatrick wrote a series of four articles entitled "techniques for evaluating training programs" published in training and development, the journal of the american society for training and development (astd). the articles describe a four-level evaluation formulated by kirkpatrick based on the concepts from his dissertation at the university of wisconsin, madison. donald l. kirkpatrick and kirkpatrick (2006) suggest three specific reasons for evaluating training programs, namely: to justify the existence of a training budget by showing how the training program contributes to organizational goals and objectives, to determine whether a training program is continued or not, and to obtain information on how to improve the training program in the future. the four-level evaluation method represents a sequence of stages for evaluating a training program. the sequence in question is that each level must be done in stages. this is because each level in the four-level model is important and each level has an effect on the next level. the four levels are as follows.

level 1: reaction

the reaction level evaluation is basically an evaluation of the participants' satisfaction with the various activities followed. the participants’ reaction can determine the level of achievement of the objectives of the implementation of the training. the training program is considered successful if the trainees are satisfied with all the elements involved in the implementation process. the success of the learning activity process cannot be separated from the trainees’ interest, attention, and motivation in participating in the training. trainees learn better when they react positively to the learning environment. there are two types of reaction instruments to evaluate level 1 reactions: the trainees’ reactions to the management and to the resource persons.
the purpose of the reaction level evaluation is to provide the organizers with valuable input on the training program for improving future training programs; to give suggestions and input to teachers regarding their level of effectiveness in teaching; to provide decision makers with information related to the implementation of the training program; and to provide resource persons with information that can be used as a basis for making teaching standards for future programs.

level 2: learning

at the learning level, trainees learn the knowledge or skills conveyed in teaching activities. measuring learning means determining one or more things related to the training objectives, such as what knowledge has been learned, what skills have been developed or improved, and what attitudes have changed. according to ramadhon (2016), the steps taken in evaluating at the learning level are: (a) evaluating the increase in knowledge, skills, and changes in attitude before and after training; (b) measuring attitudes using tests that have agreed indicators; (c) measuring knowledge using pretest and posttest; (d) measuring skills using performance tests; (e) taking appropriate action based on the results of the measurements. appropriate action here means confirming the findings against the evaluation results at the reaction level: the teacher may have been less communicative in delivering the material, the learning strategies may not have been in line with the expectations of the participants, or other factors at level-1 may have demotivated the participants, so that shortcomings found in the reaction evaluation can immediately get attention.

level 3: behavior

behavior according to d. l.
kirkpatrick (1998) is the extent to which changes in behavior arise because participants follow the training program. a level-3 evaluation is conducted to identify the extent to which the training materials are applied to the participants' jobs and workplaces. according to tan and newman (2013), behavioral evaluation measures what learned knowledge, skills, or attitudes can be applied or transferred to work. from this definition, it can be interpreted that the purpose of conducting an evaluation at the behavioral stage is to measure changes in work behavior that arise because an employee participates in a training program. to be able to apply the behavior change, according to d. l. kirkpatrick (1998), there are four necessary conditions, namely: (1) one must have the desire to change; (2) one must know what to do and how to do it; (3) one must work in a proper work environment; and (4) one should be rewarded for having changed. training programs can provide the first and second conditions by supporting attitude change in accordance with the training objectives and by providing materials related to knowledge, skills, or attitudes. however, the third condition, the right working environment, is directly related to the supervisor and the participant's environment.

level 4: result

the implementation of training programs, of course, aims to get good results, such as improving quality, productivity, or safety levels. evaluation of results, according to donald l. kirkpatrick and kirkpatrick (2006), can be defined as an end result that occurs as a result of trainees’ participation in the training program. the steps in conducting an evaluation at level-4 are: (1) do an evaluation at level-3 first; (2) give yourself time to see results emerging or achieved.
there is no specific time to evaluate the results, so the various factors involved must be considered in determining the time of the evaluation; (3) do it with a survey method using a questionnaire or interviews with training participants and company leaders; (4) take measurements both before and after the training program if possible; (5) perform re-evaluation at the appropriate time; (6) weigh the costs against the results obtained; (7) use secondary data, such as sales data, production data, and other data that support survey results in analyzing results. as explained earlier, the implementation of the four-level evaluation model must be done sequentially because each level is important and has an effect on the next level. for example, if an evaluation is carried out directly at level-3 (without conducting a level-2 evaluation) and the results indicate that only a few participants have changed their behavior in accordance with the training objectives, the conclusion drawn is that the training program is not good and should therefore be discontinued or modified. this is inappropriate, because behavior change is also influenced by other factors, such as workplace conditions and the leader of the trainees. another factor of no less importance is the result of the evaluation analysis at level-2, from which it can be traced whether the participants' inability to change their behavior is also caused by their lack of understanding of the training materials. the reason for the participants’ lack of understanding of the materials can then also be traced by looking at the results of the analysis at level-1, whether it is caused by their dissatisfaction with the implementation of the training or by the lack of the trainers’ quality, so that they are not motivated to learn.
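the backward tracing described above can be sketched as a small decision routine. this is only an illustration of the reasoning and not part of kirkpatrick's model or of this study's procedure: the function name `trace_weak_behavior_change`, the yes/no inputs, and the wording of the returned messages are all assumptions made for the example.

```python
def trace_weak_behavior_change(behavior_ok, learning_ok, reaction_ok):
    """suggest where to look when level 3 (behavior) results are weak.

    each argument is a hypothetical yes/no summary of one evaluation level:
    True means the level met the training objectives, False means it did not.
    """
    if behavior_ok:
        return "behavior changed as intended; continue to level 4 (result)"
    if learning_ok:
        # the materials were understood, so look outside the training itself
        return "check workplace conditions and the leader of the trainees"
    if reaction_ok:
        # weak learning that level-1 reactions do not explain
        return "weak learning is not explained by level-1 reactions; look for other causes"
    # dissatisfaction at level 1 may explain the weak learning at level 2
    return "review the implementation of the training and the trainers' quality"


if __name__ == "__main__":
    print(trace_weak_behavior_change(False, False, False))
```

the point of the sketch is that a weak level-3 result alone does not justify a verdict on the program: the answer depends on what levels 2 and 1 show.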
thus, with the sequential implementation of the four-level model, there is a better basis for the analysis to draw a conclusion. evaluation is also closely related to the assessment process, both learning outcomes assessment and process assessment. evaluation serves to develop a learning program that includes teaching and learning designs. evaluation also serves to determine the position of a learning program based on certain criteria, so that a program can be trusted, believed, and carried out sustainably, or, conversely, must be improved or perfected. the model was implemented here because stage 1-reaction aims to determine the participants' reactions to the training program, and stage 2-learning aims to determine the improvement of participants' competence in attitudes, knowledge, and skills. stage 3-behavior aims to assess changes in the behavior of the training graduates after returning to their workplace. stage 4-result aims to assess participants in terms of performance result after participating in the training, in this case the petty officer training program of noken (potp noken), that is, whether they made a better contribution to themselves during the education for the establishment of police officers. in general, this evaluation aims to explain the level of success of the potp noken. in the end, it is expected to improve and refine the program and to serve as a reference in determining program policies at the national police academies in other regional police offices. specifically, as an evaluation research, this evaluation aims to identify the various dimensions that can affect the effectiveness of the potp noken.
method

the petty officer training program of noken (potp noken) at the papua national police academy is carried out for prospective indonesian police officers who come from papua and west papua for 45 days. this research was conducted to get an overview of the implementation of the program using the kirkpatrick evaluation model. the model includes four levels: evaluating reaction, evaluating learning, evaluating behavior, and evaluating result. this research is quantitative and descriptive-qualitative in interpreting the four levels of the kirkpatrick model. the data on the four levels of kirkpatrick's model were collected using several instruments/questionnaires. for the first to fourth levels of the evaluation, the data were collected using closed, open, and short-answer questionnaires. the reliability of the questionnaires was first estimated with cronbach's alpha.

the subjects in this study are the participants of the potp noken, who received a training program on the values of the national identity. the total population of this study is 272 indigenous papuan male students, 146 of whom were sent by the district police office of papua and 126 by the district police office of west papua. because at that time there were 15 participants affected by covid-19, the program was postponed for 14 days. the accelerated group, which was affected by covid-19, was given additional time for evening learning and assignments. furthermore, the primary data were obtained through the program participants, namely 15 participants in the accelerated group and 60 in the regular group. the regular group was further divided into two groups consisting of 45 participants from papua and 15 from west papua. the secondary data were obtained from 15 instructors and two resource persons.
the research subjects among policy makers are the head of the police academy of papuan regional police office, the papuan police head of hr bureau, and the head of personnel management of bureau of personnel management, human resources division, republic of indonesia police headquarters. the research procedure is a mechanism that is carried out during the implementation of research. this research procedure followed that of sugiyono (2013) as follows.

literature survey

this stage involved collecting literature and information related to the research title. the main literature material was the general instructions set by the indonesian national police chief. in completing the main literature material, the researchers collected and identified modules and teaching materials to realize the objectives of the potp noken.

problem identification

at this stage, the researchers identified the problems to be discussed related to the quality management and the success or failure of the potp noken based on the literature and information that had been obtained. therefore, at the beginning of the training, a mapping of the values of the participants was carried out to see the extent of the participants' nationalism insight. after completing the potp noken, the participants were given the task of actualizing what they had obtained to the community or their immediate environment for one week until the potp noken began and the evaluation and monitoring were carried out by a small independent team.

literature study

at this stage, the researchers studied the literature to be used as a theoretical study of the values of national identity in this study.
the researchers studied the theories about the values of national identity, and how to apply it in everyday life, which was inspired by national identity. various library sources were used to see the perspective of the indonesian people regarding themselves and their environment, regional unity and integrity in the implementation of life as a community and nation associated with identity, national identity, and nationalism insight.

hypothesis formulation

at this stage, the researchers raised the initial question “is there a relationship between quality management and the satisfaction of the training participants, the success or failure of the potp noken and how big is the relationship.” when the potp noken took place, monitoring was carried out by the supervisor, who was immediately attached to each team. the teaching carried out by the trainers from preparation to completion of the program and the monitoring and evaluation were carried out by their respective supervisors.

determining variables and data sources

at this stage, the researchers determined the variables about the satisfaction of the program participants and the success or failure of the program. then, the researchers determined the variables of quality management aspects, namely human resources, materials, and equipment. furthermore, the researchers determined the kind of data needed based on the population, sample, and sampling method. the research subject is the subject that is intended to be studied, the center of attention or the target of the researchers. the subjects of this study are potp noken participants who had completed the program. the participants who become respondents are the people who are asked to provide information about a fact or opinion. this information can be submitted in written form, namely when filling out a questionnaire or when answering an interview.
the parties who are the respondents are the implementation division such as: program participants, trainers/instructors, caregivers, and program organizers, namely the head of the papua national police academy, the head of the papua police hr bureau, and the head of the human resources division of the republic of indonesia police headquarters. as many as 45 participants from papua and 15 participants from west papua in the regular group and all 15 participants in the accelerated group were established as the sample of this research. the sampling used the purposive sampling method, with the following criteria: program participants who had completed all teaching materials according to the module, and trainers, caregivers, and officials who carried out the quality management of the education and training.

determining and developing research instruments (questionnaire)

this stage was the determination of the research instrument, which was a questionnaire. the questionnaire was prepared based on indicators from each level of kirkpatrick's evaluation, divided into four parts, namely the identity of the data source, qualitative data, quantitative data, and essays. the questionnaire items were distributed as follows: part 1 contained six items of respondent's reaction data, part 2 was a 6-item learning questionnaire, part 3 was a 52-item outcome questionnaire, part 4 was a 21-item result questionnaire, and part 5 was a 29-item filled-in questionnaire. in this study, each aspect used its own assessment. the evaluation of learning programs in the potp noken was carried out in four aspects, namely evaluating reaction, evaluating learning, evaluating behavior, and evaluating results.
in conducting the assessment, the researchers maintained or prioritized objectivity, so that each assessment needed a rubric or assessment criteria. part 1 of the quality management questionnaire contained several questions regarding the personal data of respondents, such as the duties and responsibilities of the potp noken. the qualitative section of quality management consisted of three questions about human resources (hr), two questions about materials, and two questions about equipment. meanwhile, the success or failure questionnaire consists of six items in the form of in-depth interview transcripts. the measurement scale on the questionnaire uses the five-point likert scale. in part 2, the identification of the scale used is as shown in table 1.

table 1. qualitative scale identification

score   description
5       very important / very true / very high / always
4       important / true / high / often
3       quite important / quite true / quite high / sometimes
2       not quite important / not quite true / poor / seldom
1       never / wrong / very poor / never

part 3 is a questionnaire containing multiple-choice questions. the questionnaire consists of 13 questions for conditioning and 32 questions for the four aspects of kirkpatrick. the measurement scale uses a five-point scale, to level out answers that have units that are different from one another. the identification of the scale used is as shown in table 2.

table 2. quantitative scale identification

answer   score
a        1
b        2
c        3
d        4
e        5

part 4 contains essay questions, which respondents directly answer in accordance with the conditions that occurred during the implementation of the program. the field for quality management consists of four questions about hr, six questions about teaching materials, and two questions about equipment. the part about the success or failure of the potp noken consists of three questions about human resources, three questions about teaching materials, and one question about equipment.
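as an illustration only, the answer-to-score conversion of table 2 can be sketched in python. the function names and the sample responses are hypothetical and are not taken from the study; only the mapping a = 1 to e = 5 comes from table 2, and averaging the scores is an assumption made for the example rather than the authors' documented procedure.

```python
# Mapping taken from table 2; everything else here is a hypothetical sketch.
LETTER_SCORES = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}


def score_answers(answers):
    """Convert multiple-choice letters (a-e) into the scores of table 2."""
    scores = []
    for answer in answers:
        key = answer.strip().lower()
        if key not in LETTER_SCORES:
            raise ValueError(f"unexpected answer: {answer!r}")
        scores.append(LETTER_SCORES[key])
    return scores


def mean_score(answers):
    """Average score of one respondent (an assumed summary statistic)."""
    scores = score_answers(answers)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    example = ["e", "d", "E", "c"]  # made-up responses, not study data
    print(score_answers(example))
    print(mean_score(example))
```

a fixed lookup table keeps the conversion explicit and rejects any answer outside a to e instead of silently scoring it.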
to support this research, the researchers needed secondary data. the secondary data needed are regulations and supporting literature, notes from the observations of caregivers and trainers during the implementation of the potp noken, and the actions taken by management when problems and obstacles occurred. all of those parts were arranged in one bundle to be distributed to respondents.

field observation and licensing

at this stage, the researchers searched for data sources and permits from competent parties to fill out the questionnaire. before going into the field, the researchers asked the bureau of personnel management, human resources division, republic of indonesia police headquarters for permission to take data to the training program. then, in order to synchronize the potp noken with the regional hr development program, a meeting was held with the head of the hr bureau at the papua regional police office. finally, before the questionnaire was distributed to respondents, the researchers asked the head of spn papua and the class leader of each group for permission.

collecting data

at this stage, the researchers distributed the questionnaire to respondents. this was done in conjunction with observation and licensing to save time, cost, and effort. the primary data needed for this research were collected using the questionnaire given to the respondents. the data collection methods used were the questionnaire method and the literature/documentation method. the questionnaire method consists of distributing questionnaires to be filled out about the process of implementing the potp noken in relation to the variables under study.
the first step was to arrange permits with the potp noken organizers, in this case the papuan national police academy, in order to meet competent respondents to fill out the questionnaire. then a meeting agreement with the respondent was made. the questionnaire was given to respondents to be filled in according to the actual situation without direction from the researchers. after completion, the respondents submitted the questionnaire to the researchers. the literature/documentation method consists of collecting, identifying, and processing written data in the form of relevant books, regulations, activity reports, and other data relevant to the research. the sample respondents from the regular group were selected randomly to represent the classes. the questionnaire for participants consists of 86 items for the regular group and 88 items for the accelerated group. the questionnaire for the instructor consists of 29 items, including the entries. the guidelines for interviewing policy makers or quality management consist of seven questions for the head of the national police academy and eight questions for the head of bureau of personnel management and the head of personnel management, human resources division, republic of indonesia police headquarters. the evaluation instrument is accompanied by instructions for filling it out. the questionnaire was completed and interviews were carried out in three days, from 7 to 9 november 2020, at the national police academy of papua regional police office, at jalan tj. ria no. 1, tj. ria, north jayapura, jayapura city, papua.

data processing

the data processing consisted of giving variable codes, tabulations, and calculations using the excel program and spss version 22. then a second tabulation was carried out. the first stage of data tabulation was to group answers from respondents and it was carried out on all types of answers to the questionnaire.
data tabulation 2 was grouping the calculated data or output from spss 13.0 for parts 2 and 3 of the questionnaire. the data from the questionnaire were recapitulated using ms. excel, while those of the interviews were summarized in an interview result sheet. the analysis of the regular and accelerated groups based on the number and percentage of achievements was carried out on the levels of: (1) reaction, (2) learning, (3) behavior, and (4) result (donald l. kirkpatrick & kirkpatrick, 2006). the calculation of the correlation coefficient and cronbach's alpha for levels 3 and 4 was carried out to see the validity and reliability of the instrument used. the final step was to do multiple regression analysis to see the effect of conditioning and teaching materials on the participants’ achievement of national identity values using spss version 22.

data analysis

at this stage, the researchers analyzed the results of data processing based on the results of existing research and theories. analysis is an important part of scientific research methods because, by doing so, the data can be given meanings that are useful in solving problems. there are two approaches to information analysis based on the type of information obtained, namely quantitative analysis and qualitative analysis. quantitative analysis is an analysis based on the results of data calculations. this stage is to determine the strength and weakness of each relationship seen from the correlation value. if the correlation value is r < 0.5, then the correlation is weak or the variables are not correlated. if the correlation value is 0.5 < r < 1, then the correlation is strong. in addition, it is also necessary to check the direction of the correlation, negative or positive.
that is, if it is positive then the relationship is in the same direction, and if it is negative then the relationship is reversed or in the opposite direction. after that, the illogical relationships between variables were eliminated. thus, it will show what variables can be analyzed further. the qualitative analysis used next is a discussion of the results of the quantitative analysis. logical relationships are explained along with some existing theories and the results of processing data entry. if the results of the relationship analysis are in accordance with the existing theory, no further study will be conducted. however, if the opposite happens, then further discussion is needed about why it is not appropriate.

conclusion drawing

this stage was the final stage, namely drawing conclusions based on data analysis and checking whether they were in accordance with the aims and objectives of the study. the next stage was to relate the quality management variable to the construction failure variable. therefore, the final result of the analysis would show the factors that affected participants’ satisfaction and the effect of quality management on the success or failure of the potp noken.

findings and discussion

reaction level

the reaction level evaluation is basically an evaluation of the participants' satisfaction with the various activities that they followed. the reaction of the participants determined the level of achievement of the objectives of the implementation of the program. the potp noken is considered successful if the program participants are satisfied with all the elements involved in the implementation process. the success of the learning activity process could not be separated from the interest, attention, and motivation of the participants to participate in the training. they learned better when they reacted positively to the learning environment. there were two types of instruments to evaluate participants’ reactions.
participants' reactions to the regular group management

the aim was to reveal the satisfaction of the participants with the success of the learning process, which could not be separated from the interest, attention, and motivation related to (a) the importance of learning, (b) sufficient time, (c) mastery level of instructors/resource persons, (d) facilities, (e) care, and (f) learning methods.

participants' reactions to the accelerated group management

the aim was to determine the satisfaction of the participants with the success of the learning process, which could not be separated from the interest, attention, and motivation related to (a) the importance of training, (b) sufficient time, (c) mastery level of instructors/resource persons, (d) facilities, (e) care, (f) teaching methods, (g) time to do tasks, and (h) activity time in class.

the results of the assessment of the participants' evaluation of all aspects for the two groups are shown in table 3. all aspects of the reaction are in the "very good" category. based on the evaluation of participants' reactions to the potp noken, it can be said that it is effective and satisfactory. however, there are some notes that need further attention or are not in accordance with the expectations of participants, namely that facilities and trash bins are not available as needed. the head of the national police academy of papua district police office is also aware of this. according to the instructor, it is necessary to hold the potp noken because it can inculcate good behavior, discipline, and conscious loyalty.
the training participants are motivated by being given in-depth insight into the police profession, in which being a good police officer requires a lot of learning, practicing, and obeying the applicable rules.

table 3. recapitulation of reaction levels (percentages)

no.   regular group   accelerated group
1.    100%            100%
2.    98.33%          100%
3.    100%            100%
4.    90.00%          86.67%
5.    96.67%          93.33%
6.    100%            100%
7.    -               100%
8.    -               100%

based on the results of field observations, it was found that the participants were quite active in each session. however, a small number of participants were still late for class in some of the material presentation sessions, although they participated actively in those sessions. the results of the interview show that the participants were generally quite enthusiastic in participating in the training program. the results of the assessment of the participants' evaluation show that the resource persons are "very good", as evidenced by an overall average score of 100%. there are three resource persons: mr. rs, mr. k, and mr. lp. the resource persons mastered the material, provided enthusiasm, and changed the mindset, so that the participants were very enthusiastic at the stage of forming the basis and personality and a sense of nationalism insight. based on the results of the interviews with participants, it is known that the instructor explained how to design learning programs not only theoretically but also by playing, so that it was fun and all participants were always active.

learning level

at this stage, an evaluation of learning outcomes was carried out, which included the achievement of learning objectives and expected learning outcomes from the teaching-learning process. for the potp institutions, especially the national police academy of papua regional police office, this is very important, because the success and subsequent components are closely related to the components of this learning outcome.
the learning outcomes that were tested were in accordance with the subjects taught and the objectives of the learning at the potp. at the learning stage, it is expected that there will be changes in the program’s participants according to the objectives of the training program. the changes referred to are: (a) the habit of being always punctual, (b) always praying according to their respective religions and beliefs, (c) having the ability to complete tasks, (d) having relevant knowledge in performing daily tasks, (e) having insight in the context of the republic of indonesia, and (f) having informal knowledge.

table 4. recapitulation of learning level (percentages)

no.   regular group   accelerated group
1.    93.33%          93.33%
2.    98.33%          100%
3.    100%            100%
4.    100%            100%
5.    100%            100%
6.    98.33%          93.33%

the results of the assessment of the participants' evaluation of all aspects for the two groups are shown in table 4. all aspects of learning are in the "very good" category. based on the evaluation of participants' learning, it can be said that the potp noken is effective and successful. according to the instructor, the teaching was marked by participants working well in group assignments, being very critical and curious about teaching and learning activities at the national police academy, and having the confidence to act out in front of their peers. the effectiveness of learning programs is measured in terms of three aspects, namely: behavior, knowledge, and skills. without changing attitudes, increasing knowledge, and improving skills in students, the program is considered a failure. this evaluation is also called the assessment of learning outcomes (output).
therefore, the measurement of learning outcomes (learning measurement) means determining one or more of the following: (1) behavior change, (2) knowledge that has been learned, and (3) skills that have been developed or improved. the test used is a test battery that is in accordance with the training curriculum. all the materials provided are also in accordance with the needs and with the curriculum. the evaluation of the learning of the potp noken is an activity that is integrated in the program design. the authority of the indonesian national police headquarters covers the preparation of tests for the achievement of key behaviors and nationalism identity values and the determination of graduation through a set of multiple-choice tests. the categorization of graduation up to the issuance of a certificate of training completion is all carried out by the indonesian national police headquarters. the participants are declared to have passed if they already have level 2 key behaviors and a minimum score of 70 for each aspect of nationalism identity.

behavior level

behavior, according to donald l. kirkpatrick & kirkpatrick (2006), is defined as the extent to which behavior changes arise because participants follow a training program. a level-3 evaluation was conducted to identify the extent to which the material in the training was applied to the participants' jobs and workplaces. according to tan and newman (2013), behavioral evaluation measures which of the learned knowledge, skills, or attitudes are applied or transferred to the job. from the aforementioned definition, it can be interpreted that the purpose of conducting an evaluation at the behavioral stage is to measure changes in work behavior that arise because an employee participates in a training program.
in order to be able to apply the behavior change, based on the program curriculum, there need to be three conditions: (1) change of mindset so that they have the desire to change, (2) habituation in daily activities, and (3) inspiration from the materials of nationalism identity and role models or community figures. changes in mindset are characterized by: (a) direction on how to behave in life, (b) instructors’ motivation, (c) behavior towards fellow participants, and (d) behavior as part of indonesian society. at the behavior level, an assessment of the participants’ behavior during the potp noken was carried out to find out changes in their behavior after participating in the program. the results of the assessment of the participants' evaluation of all aspects for the two groups are shown in table 5. all aspects of behavior are in the "very good" category. based on the evaluation of participants' behavior, it can be said that there has been a very significant change.

table 5. behavior level recapitulation (percentages)

no.   regular group   accelerated group
1.    100%            93.33%
2.    100%            100%
3.    100%            93.33%
4.    98.33%          100%

according to the head of the human resources bureau of the papuan police district office, the mindset change occurred because: (1) the teaching involved papuan historical figures who struggled for the independence of the republic of indonesia, (2) the teaching is carried out with love and compassion, in accordance with the background and characteristics of the training participants, and (3) training participants have many talents to solve problems that exist in the community.
likewise, according to the instructor/caregiver, changes in behavior are due to the fact that the participants have known and understood the history of the indonesian nation, are increasingly disciplined and critical, understand their duties, and remain enthusiastic.

result level

the implementation of the training program, of course, aims to get good results, such as improving the quality, productivity, or safety levels. the evaluation at the result level aims to reveal whether the training program is useful in achieving organizational goals. the final results in the context of evaluation at the result level include increased production results, customer satisfaction, and increased teacher morale. the relationship between the positive results received and the training activities is complicated because many aspects other than the training also affect the results. the result of achieving the values of national identity is characterized by: (a) their relevance to the problems that exist in society, (b) knowledge needed in society and the state, (c) a result in changed attitudes in daily life, (d) a result in increased productivity in people's lives, and (e) achievement of the predetermined target.

table 6. result level recapitulation (percentages)

no.   regular group   accelerated group
1.    96.67%          100%
2.    100%            100%
3.    100%            100%
4.    98.33%          93.33%
5.    93.33%          93.33%

at the result stage, an assessment of the participants during the potp noken was carried out to find out the result after their participation in the training. the results of the assessment of the participants' evaluation of all aspects for the two groups are shown in table 6. all aspects of the result are in the “very good” category. based on the evaluation of the participants' result on the training, it can be said that they got very significant results.
this is in line with the statement by the head of the national police academy of papua district police that initially the training was estimated to reach only level 7 (standard), but the potp noken is very satisfying and it shows that papuan and west papuan youngsters can also learn. according to the instructor, the result in daily behavior is marked by awareness and sincerity to worship god almighty, returning lost items, and being punctual for activities at the national police academy. likewise, according to the head of personnel management, bureau of personnel management, human resources management, republic of indonesia police headquarters, the implementation of the curriculum and module for the potp noken went well, even exceeding the target. the level of achievement of key competencies and national identity values is above level 2, that is, the participants have behaved according to these values.

estimating the effect of habituation and teaching materials on the results of applying the national identity

to apply national identity, according to the curriculum, training participants habituate themselves to apply the values in daily activities. in addition to this habituation, the participants were also given materials and role models so that they were inspired by the teaching of the materials to behave in a way that reflects national identity. changes in behavior from the aspect of habituation are represented by 18 items for behavioral variables and 26 items for the aspect of the teaching of the materials. the validity and reliability of the final instrument of habituation and the teaching of materials were obtained with the help of the spss for windows version 22.0 program (priyatno, 2010).
based on the results of the validity test, the calculated r-value (corrected item-total correlation) for all items (indicators), in both the regular and accelerated groups, is above the r-table value (n = 60; 0.256). the calculation shows a good result because the minimum requirement for an item to be valid, a value greater than 0.239 (ghozali, 2006), is met. thus, it can be concluded that the items are valid. the item validity coefficient is expressed directly by the comparison between the number of item validity indices and the number of item reliability indices. the validity of the test with the criteria will be maximized if the test contains the items that have a validity index equivalent to their reliability index (allen & yen, 1979).

the result of the calculation of the reliability test shows that the indicators of behavioral change due to habituation have a very high reliability coefficient (cronbach's alpha = 0.884 in the regular group and 0.867 in the accelerated group). according to nunnally and bernstein (1994), for the index commonly used in social research, a cronbach's alpha (α) above 0.60 indicates that a construct or variable is reliable. likewise, the indicators of behavior change due to the teaching of the materials have a very high reliability coefficient (cronbach's alpha = 0.944 in the regular group and 0.961 in the accelerated group), which is also above the 0.60 threshold. the magnitude of the effect of habituation and teaching of the materials was estimated by using multiple regression analysis.
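the reliability coefficients reported above are cronbach's alpha values produced by spss. as a minimal sketch of how such a coefficient is computed from raw item scores, the python code below applies the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). the function names `cronbach_alpha` and `is_reliable` and the small score matrix are hypothetical illustrations, not the study's data or procedure; only the 0.60 threshold comes from the text.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    Every respondent must answer the same number of items (k >= 2),
    and at least two respondents are needed to compute a variance.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def is_reliable(alpha, threshold=0.60):
    """Decision rule quoted in the text: alpha above 0.60 counts as reliable."""
    return alpha > threshold


# Hypothetical 4 respondents x 3 items, for illustration only.
example_scores = [[4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2]]
example_alpha = cronbach_alpha(example_scores)
print(round(example_alpha, 3), is_reliable(example_alpha))
```

with the alpha values reported in the text, `is_reliable(0.884)` and `is_reliable(0.867)` both return `True` under the 0.60 threshold.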
in accordance with the rules in performing multiple regression analysis as stated by gujarati and porter (2009), a regression equation must have data that are normally distributed, free of heteroscedasticity, and free from multicollinearity in order to obtain a good and unbiased regression equation. the result of the data normality test that has been carried out shows that the data used in this regression equation are not normally distributed and have heteroscedasticity, and there is no multicollinearity so that it meets the requirements to perform multiple regression analysis properly. to solve the problem, achieve the goal, test the hypothesis, and find out whether the explanatory variable partially has a significant effect on the dependent variable, the researcher conducted a t-test. the results of the multiple regression analysis that have been carried out are as follows.

correlation coefficient

product moment correlation and multiple correlation can be used to determine the relationship between the independent variable and the dependent variable. according to sudjana (2005), the correlation coefficient between the variables x1 and y, and x2 and y can be found using the pearson correlation formula. to find out whether or not the calculated correlation coefficient is significant, it is necessary to compare it with the r table product moment at the significance level of 0.05 (95% confidence level). the rule of the significance test is: if rcount ≥ rtable, then h0 is rejected, meaning that there is a significant relationship, and if rcount < rtable, then h0 is accepted, meaning there is no significant relationship. the effect of habituation and teaching of materials on the results of applying the national identity was obtained with the help of the spss for windows version 22.0 program. table 7 shows the correlation of the results of habituation and the teaching of materials for the regular group.
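the decision rule stated above (reject h0 when rcount ≥ rtable) can be sketched in a few lines of python. this is an illustrative sketch only: the function names `pearson_r` and `reject_h0` are hypothetical, the r-table value must be looked up by the user for the relevant sample size and significance level, and the comparison follows the rule exactly as worded in the text (it does not take the absolute value of r).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def reject_h0(r_count, r_table):
    """Rule from the text: h0 is rejected when r_count >= r_table."""
    return r_count >= r_table


# Coefficient reported in table 7 (habituation vs. result) against the
# r-table value quoted earlier for n = 60.
print(reject_h0(0.657, 0.256))
```

the call at the end uses values quoted in the text, so the relationship between habituation and result is judged significant under this rule.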
table 8 shows the correlation of the results of habituation and the teaching of the training materials for the accelerated group.

table 7. regular group level of success in actualizing nation self-identity

correlations
                                             habituation   material inspiration   result
habituation            pearson correlation   1             .688**                 .657**
                       sig. (2-tailed)                     0                      0
                       n                     60            60                     60
material inspiration   pearson correlation   .688**        1                      .762**
                       sig. (2-tailed)       0                                    0
                       n                     60            60                     60
result                 pearson correlation   .657**        .762**                 1
                       sig. (2-tailed)       0             0
                       n                     60            60                     60
**. correlation is significant at the 0.01 level (2-tailed).

table 8. success level in applying national identity of accelerated group

correlations
                                             habituation   material inspiration   result
habituation            pearson correlation   1             .719**                 .763**
                       sig. (2-tailed)                     0.003                  0.001
                       n                     15            15                     15
material inspiration   pearson correlation   .719**        1                      .796**
                       sig. (2-tailed)       0.003                                0
                       n                     15            15                     15
result                 pearson correlation   .763**        .796**                 1
                       sig. (2-tailed)       0.001         0
                       n                     15            15                     15
**. correlation is significant at the 0.01 level (2-tailed).

table 7 shows the significance value of the h1 variable (habituation) = 0.00 < 0.01 so h0 is rejected, which means that the independent variable h1 partially has a positive effect on the y variable. the significance value of the h2 variable (material) = 0.00 < 0.01 so h0 is rejected, which means the independent variable h2 partially has a positive and significant effect on variable y. thus, there is heteroscedasticity in the regression equation. the result of the calculation shows that the correlation meets the requirement and is very strong because the minimum requirement that must be met is that the correlation coefficient is greater than 0.239 (ghozali, 2006).
table 8 shows the significance value of the h1 variable (habituation) = 0.00 < 0.01 so that h0 is rejected, which means that the independent variable h1 partially has a positive effect on the y variable. the significance value of the h2 variable (teaching of materials) = 0.00 < 0.01 so that h0 is rejected, which means the independent variable h2 partially has a positive and significant effect on variable y. therefore, it can be concluded that there is heteroscedasticity in the regression equation. the result of the calculation carried out shows that the correlation meets the requirement and is very strong because the minimum requirement that must be met is that the correlation coefficient is greater than 0.239 (ghozali, 2006).

estimating the regression equation

regression analysis is a statistical analysis that examines the relationship between two or more variables (pituch & stevens, 2016). in general, there are two kinds of relationship between two or more variables, namely a one-way relationship and a two-way relationship. to determine the form of the relationship, the researchers used regression analysis. regression analysis is used to see a one-way relationship between more specific variables, where the x variable functions as the independent variable, which is the affecting variable, and the y variable as the dependent variable is the affected variable. usually, variable x is also referred to as the independent variable or the respondent variable, and variable y is the dependent variable (sukestiyarno, 2015).

the estimate of the multiple regression equation between habituation and the teaching of materials with the results of applying the national identity was obtained by utilizing the spss for windows version 22.0 program.
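the regression estimates in this section come from spss. as a rough illustration of what such an estimation does, the sketch below fits y = a + b1*x1 + b2*x2 by ordinary least squares (the method behind standard linear regression) using the closed-form solution for two predictors, and returns the r square value as well. the function names `fit_two_predictors` and `predict` and the toy data are hypothetical; the study's raw data are not available here, so nothing below reproduces its tables.

```python
def fit_two_predictors(x1, x2, y):
    """Ordinary least squares for y = a + b1*x1 + b2*x2.

    Returns (a, b1, b2, r_square). Uses deviations from the means and the
    closed-form solution of the two-predictor normal equations.
    """
    n = len(y)
    if not (len(x1) == len(x2) == n) or n < 3:
        raise ValueError("x1, x2 and y must have the same length (at least 3)")
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    syy = sum(a * a for a in dy)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    # Share of the variance in y explained by the two predictors.
    r_square = (b1 * s1y + b2 * s2y) / syy
    return a, b1, b2, r_square


def predict(a, b1, b2, x1, x2):
    """Evaluate the fitted equation for given predictor values."""
    return a + b1 * x1 + b2 * x2


# Toy data generated from y = 1 + 2*x1 + 3*x2 (illustration only).
toy_x1 = [1, 2, 3, 4, 5]
toy_x2 = [2, 1, 4, 3, 6]
toy_y = [9, 8, 19, 18, 29]
print(fit_two_predictors(toy_x1, toy_x2, toy_y))
```

a fitted equation can then be evaluated with `predict`; for instance, `predict(0.788, 0.287, 0.556, x1, x2)` evaluates the regular-group coefficients reported in this section for chosen values of x1 and x2.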
table 9 shows the r-square result of habituation and teaching of materials in the regular group, while table 10 shows those in the accelerated group.

table 9. r-square in regular group

model summary (b)
model   r        r square   adjusted r square   std. error of the estimate
1       .784a    0.614      0.601               0.24498
a. predictors: (constant), material inspiration, habituation
b. dependent variable: result

coefficients (a)
                         unstandardized coefficients   standardized coefficients
model                    b         std. error          beta                        t       sig.
1 (constant)             0.788     0.437                                           1.801   0.077
  habituation            0.287     0.128               0.253                       2.233   0.029
  material inspiration   0.556     0.107               0.588                       5.186   0
a. dependent variable: result

table 10. r-square of accelerated group

model summary (b)
model   r        r square   adjusted r square   std. error of the estimate
1       .842a    0.709      0.66                0.2156
a. predictors: (constant), material inspiration, habituation
b. dependent variable: result

coefficients (a)
                         unstandardized coefficients   standardized coefficients
model                    b         std. error          beta                        t       sig.
1 (constant)             1.127     0.635                                           1.774   0.101
  habituation            0.345     0.196               0.395                       1.761   0.104
  material inspiration   0.411     0.18                0.512                       2.283   0.041
a. dependent variable: result

the value of r square (regular group) = 0.614 in table 9 shows that 61.4% of the variance in y, the success of the potp noken, can be explained by the changes in variables x1 (habituation) and x2 (teaching of materials). meanwhile, the remaining 38.6% is explained by other factors outside the model. thus, the estimation equation (regular group) is as in formula (1).

y = 0.788 + 0.287*x1 + 0.556*x2 + e    (1)

if x1 increases by one unit, and x2 remains constant, then y will increase by 0.287 unit.
if x2 increases by one unit, and x1 remains constant, y will increase by 0.556 unit, so x2 has more effect on y than x1 does because the regression coefficient of x2 is higher than that of x1. if x1 and x2 are zero, then the value of y is the constant a, which is 0.788.

the value of r square (accelerated group) = 0.709 in table 10 shows that 70.9% of the variance in y, the success of the potp noken, can be explained by the changes in variables x1 (habituation) and x2 (teaching of materials). meanwhile, the remaining 29.1% is explained by other factors outside the model. therefore, the estimation equation (accelerated group) is as in formula (2).

y = 1.127 + 0.345*x1 + 0.411*x2 + e    (2)

if x1 increases by one unit, and x2 remains constant, y will increase by 0.345 unit. if x2 increases by one unit, and x1 remains the same, y will increase by 0.411 unit, so x2 has more effect on y than x1 does because the regression coefficient of x2 is higher than that of x1. if x1 and x2 are zero, then the value of y is the constant a, which is 1.127.

conclusion

based on the research findings and discussion about the potp noken, which was implemented in two groups, the following conclusions are drawn. (1) the evaluation at the reaction stage shows that the potp noken participants are very satisfied with the service from the organizing committee, and the resource persons from the potp noken are effective, satisfactory, and fun.
however, there are some notes that need further attention or are not in accordance with the participants’ expectations, namely the inadequate facilities and the lack of trash bins. (2) the evaluation at the learning stage shows that the participants are very enthusiastic in participating in the learning process. the teaching process of the potp noken can be said to be effective and successful. according to the instructor, this is indicated by participants’ working well in group assignments, being very critical and curious about teaching and learning activities at national police academy, and having confidence to perform in front of their peers. all potp noken participants are declared to have passed by obtaining a certificate of completion of the training. (3) the evaluation at the behavioral stage shows that the alumni of the potp noken have experienced behavioral changes in discipline to attend classes, dress appearance, independence, service to students, attitude to team or group work, and speed and accuracy in completing assignments, all of which have changed for the better. based on the evaluation of participants’ behavior towards the training, there has been a very significant change. the change in the mindset occurred because: (a) the teaching involved papuan historical figures who struggled for the republic of indonesia independence, (b) the teaching with love and compassion is in line with the training participants’ background and characteristics, and (c) training participants have many talents to solve problems that exist in the community. the changes in behavior occur because participants have known and understood the indonesian history and they are increasingly disciplined and critical, understand their duties, and remain enthusiastic. (4) based on the evaluation of the result of the training, it can be said that the participants experienced very significant results.
the potp noken is very satisfying and it shows that the youngsters of papua and west papua can also learn. the result in daily behavior is marked by the participants’ awareness and sincerity to worship god almighty, honesty to return lost items, and punctuality for activities at the national police academy. the implementation of the potp noken curriculum and module went well, even exceeding the target. the level of participants’ achievement of key competencies and national identity values is above level 2, showing that the participants have acted according to these values. based on the estimation results of the regression equation, habituation and the teaching of materials are strongly related to the application of national identity. together, both approaches contributed very significantly, explaining more than 60% of the result for the two groups of training participants. from the results of the evaluation of training’s achievement, there is no significant difference between participants in the regular group and those in the accelerated group.

references

allen, m. j., & yen, w. m. (1979). introduction to measurement theory. brooks/cole.
decree of the chief of indonesian national police no. kep/2513/xii/2019 concerning the implementation of indonesian police pre-petty officer recruitment in the indonesian police petty officer recruitment in the 2020 budget year, (2019).
ghozali, i. (2006). aplikasi analisis multivariate dengan program spss. badan penerbit universitas diponegoro.
gujarati, d. n., & porter, d. c. (2009). basic econometrics (5th ed.). mcgraw-hill.
hemmatimaslakpak, m., farhadi, m., & fereidoni, j. (2016). the effect of neuro-linguistic programming on occupational stress in critical care nurses. iranian journal of nursing and midwifery research, 21(1), 38–44.
https://doi.org/10.4103/1735-9066.174754
huehls, f. (2010). literature review. international journal of educational advancement, 10(1), 48–55. https://doi.org/10.1057/ijea.2010.1
kirkpatrick, d. l. (1998). evaluating training programs: the four levels. berrett-koehler.
kirkpatrick, d. l., & kirkpatrick, j. d. (2006). implementing the four levels: a practical guide for effective evaluation of training programs. berrett-koehler.
kolb, a. y., & kolb, d. a. (2005). learning styles and learning spaces: enhancing experiential learning in higher education. academy of management learning & education, 4(2), 193–212. https://doi.org/10.5465/amle.2005.17268566
law no. 2 of 2002 concerning indonesian national police, (2002).
letter of assignment of the chief of indonesian national police no. sprin/1637/vi dik.2.1/2020/ssdm concerning the implementation of noken coaching and training tot for petty officer in the indonesian police recruitment of 2020 budget year, (2020).
nunnally, j. c., & bernstein, i. h. (1994). psychometric theory (3rd ed.). mcgraw-hill. https://books.google.co.id/books/about/psychometric_theory_3e.html?id=_6r_f3g58jsc&redir_esc=y
pituch, k. a., & stevens, j. p. (2016). applied multivariate statistics for the social sciences: analyses with sas and ibm’s spss (6th ed.). routledge. https://www.routledge.com/applied-multivariate-statistics-for-the-social-sciences-analyses-with-sas/pituch-stevens/p/book/9780415836661
priyatno, d. (2010). paham analisa statistik data dengan spss. mediakom.
ramadhon, s. (2016). penerapan model empat level kirkpatrick dalam evaluasi program pendidikan dan pelatihan aparatur di pusdiklat migas. swara patra: majalah ilmiah ppsdm migas, 6(1), 43–54. http://ejurnal.ppsdmmigas.esdm.go.id/sp/index.php/swarapatra/article/view/101
seysener, l. (2011). time line therapy®: an advanced technique from the science of neuro linguistic programming. australian journal of clinical hypnotherapy and hypnosis, 32, 40–48.
sudjana, s. (2005).
metode statistika. tarsito.
sugiyono, s. (2013). metode penelitian kuantitatif, kualitatif, dan r & d. alfabeta.
sukestiyarno, s. (2015). olah data penelitian berbantu spss. universitas negeri semarang.
sumantri, h., & setiawan, e. (2019). jati diri bangsa: kyai muchammad muchtar mu’thi sang mujadid wawasan kebangsaan (2nd ed.). organisasi shiddiqiyyah.
tan, k., & newman, e. j. (2013). the evaluation of sales force training in retail organizations: a test of kirkpatrick’s four-level model. the international journal of management, 30(2), 692–703. https://www.proquest.com/openview/db1da765340d369f440d54bf7271a384/1.pdf?pq-origsite=gscholar&cbl=5703
tosey, p., mathison, j., & michelli, d. (2005). mapping transformative learning: the potential of neuro-linguistic programming. journal of transformative education, 3(2), 140–167. https://doi.org/10.1177/1541344604270233

reid (research and evaluation in education), 7(2), 2021, 156-167
available online at: http://journal.uny.ac.id/index.php/reid
copyright © 2021, reid (research and evaluation in education), 7(2), 2021 issn: 2460-6995 (online)

students' competence in making language skill assessment rubric

memet sudaryanto1*; habib safillah akbariski2
1universitas jenderal soedirman, indonesia
2universitas sebelas maret, indonesia
*corresponding author. e-mail: memet.sudaryanto@unsoed.ac.id

introduction

learning is transforming information received by students in their respective learning environments. one indicator of learning achievement is assessment. skill improvement refers to the dimensions of attitude, knowledge, and skill. these three dimensions can be measured through specific assessments. assessment can show accurate results through valid measurement (retnawati, 2015).
the assessment attribute should be designed as well as possible so every assessment process can reflect students' skills. on the other side, learning output reflects the teaching conducted by teachers. learning outcomes can show the success of the teacher's learning process. in every learning process, a teacher should master some knowledge related to educational assessment, such as: (1) being able to select appropriate assessment procedures to make learning decisions, (2) being able to develop appropriate assessment procedures to make learning decisions, (3) being able to conduct a scoring and interpret the assessment results made, (4) being able to utilize assessment results to make decisions in education, (5) being able to develop proper assessment procedures and use assessment information, and (6) being able to communicate assessment results.

article info

article history
submitted: 23 september 2021
revised: 20 december 2021
accepted: 24 december 2021

keywords
assessment rubric; language skill; language education

abstract

this study aims to describe (1) the need for an assessment of language skills rubrics; (2) students' abilities in creating holistic and analytic rubrics; (3) the potential and relevance of the application of the rubric in the micro-teaching class; (4) the obstacles faced by students in compiling the rubric. this research used mixed methods to answer the research questions and a quantitative approach to measure students' understanding of the rubric. basic competence was assessed using the rubric, which also mapped students' competencies in designing the rubric. the qualitative rubric was used to describe students' difficulties in applying the rubrics produced in the learning process at school. the population of this study was 250 students of indonesian language education in central java selected using a random sampling technique.
the study shows that (1) students' knowledge of the components presented in the language skill assessment rubric is still low: most students know that the assessment rubric can be done analytically and holistically, but they cannot explain the differences and how to design the language skill assessment rubric. (2) the rubric produced by students in assessing reading and writing skills is good enough and relevant to be applied in the school: 76% for reading skills and 83% for writing skills. (3) the rubric produced in assessing listening and speaking skills is less optimal and not relevant enough: 78% of students are unable to make good rubrics for listening skills, and 49% are unable to make a rubric for speaking skills.

this is an open access article under the cc-by-sa license. https://creativecommons.org/licenses/by-sa/4.0/

how to cite: sudaryanto, m., & akbariski, h. (2021). students' competence in making language skill assessment rubric. reid (research and evaluation in education), 7(2), 156-167. doi:https://doi.org/10.21831/reid.v7i2.44005

assessment is an essential aspect of learning achievement. assessment refers to fact-based processes and outcomes to explain the characteristics of someone or something (barkaoui, 2016; griffin, 1991; pearce et al., 2009). aspects of the process and results in the assessment of learning achievement become part of a complex learning evaluation that must be revised continuously. assessment is related to students' scores and emphasizes the aspect of the valuation process in the information that refers to students' competencies. one of the assessment tools that teachers can use in measuring student competence is a rubric. teachers use the rubric to decide the assessment criteria for assignments.
however, a rubric is also helpful for students. it defines what is expected from students in the written form, especially in obtaining specific scores in an assignment (rukmini & saputri, 2017; mertler, 2000). in learning, rubrics are performance standards used to measure particular competencies from tasks that have been given to students (al-rabai, 2014). the rubrics that can be used in the authentic assessment are holistic and analytic rubrics (ratnaningsih, 2016). rubrics are used to measure students' knowledge, attitudes, or skills (rukmini & saputri, 2017). each rubric has specific features to make it easier for teachers to make assessments. according to stevens and levi (2013), a rubric contains four essential features: (1) the description of the assignment or a descriptive title of the assignment worked on by students; (2) the scale (and score), which describes the level of mastery (for example: exceeds the expectation, meets the expectation, does not meet the expectation); (3) the components/dimensions that students should notice in doing their assignments; and (4) the description of performance quality (performance descriptor) for each component/dimension at every level of mastery. each rubric has its advantages and disadvantages depending on the needs of the assessment carried out by the teacher. the advantage of an analytic rubric is the availability of feedback to students. an analytic rubric focuses specifically on every criterion, which makes this kind of rubric appropriate for remedial teaching and for giving feedback in remedial teaching (widiastuti, 2021).
therefore, an analytic rubric is suitable for formative assessments such as assignments, daily examinations, and homework. meanwhile, a holistic rubric provides the assessment with high reliability, and the scores obtained by students describe the standard or criterion that can be easily interpreted (al zumor, 2015). the holistic rubric is appropriately applied in summative assessment and when teachers only need information or students' learning output data to give students' final scores. however, the holistic rubric cannot give feedback to students. the selection of the two types of rubrics can be adjusted to teachers' needs, the type of assignments given, or the competencies/purposes of the learning. in learning indonesian, the needs of each rubric will be different depending on the measured language skills. indonesian language learning in curriculum 2013 contains four language skills divided into learning materials. every language skill should be measured using a standardized assessment rubric that is purposeful, meaningful, transparent, feasible, and generalizable according to the stage of development (sudaryanto et al., 2019a). the test of language skills is vulnerable to the subjectivity of assessors. therefore, to minimize subjectivity, the assessment rubric is needed. the rubric should represent every skill component students will achieve in the assignments given. accordingly, the construction of the rubric should be adjusted to the learning goals. learning objectives need to be measured using an assessment rubric to be objective and accurate.
based on these, (a) assessment should be based on comprehensive measurement results; (b) assessment should be an integral part of the teaching and learning process; (c) the assessment used should be clear to students and teachers; (d) the assessment must be comparable; (e) the assessment should pay attention to the existence of two kinds of assessment orientations, namely the norm-referenced and criterion-referenced assessments; (f) a distinction must be made between scoring and grading (ayhan & türkyılmaz, 2015).

rubrics must provide clear and consistent assessments (greenstein, 2012; stevens & levi, 2013). the teacher provides information about the assessment criteria. on the other hand, students can measure their achievements using rubrics and get practical and efficient feedback. as an instrument of learning reflection, the rubric developed by the teacher can be used as an assessment guide. rubrics are not just a learning assessment tool. thus, teachers should note that the assessment rubric must pay attention to the principles and be related to critical competencies or achievement indicators (anisa, 2017). the assessment rubric must contain three main aspects: job descriptions, rating scales, and assessment dimensions. meanwhile, in learning indonesian, the rubrics must include the standards of the language skills that are measured, such as listening, speaking, reading, and writing. it aims to make it easier for teachers to create assessments using rubrics. in addition, clear aspects and skill standards in rubrics can minimize the subjectivity of assessors. therefore, teachers and students need to understand that rubrics measure the material that students learn (the achievement of learning objectives), rather than the students' assignments or work themselves (anisa, 2017; zubairu et al., 2016).
the purpose of the assessment using rubrics needs to be considered by the teacher appropriately. for example, a language teacher assigns students to read a novel. after that, students are asked to present their understanding in front of the class. the activity aims to assess students' speaking skills (sudaryanto et al., 2019b). based on the activity, teachers should design rubrics that focus on the criteria of speaking skill rather than on the content of the novel being presented. this speaking skill assessment example shows that students must understand speaking skills and actively participate in conversations or discussions. this skill is an essential part of language competence (rimmer, 2006). the speaking skill can be tested either spontaneously or cautiously. the speaking skill test can measure students' understanding of the context received. the assessment dimensions contained in the analytic rubric, such as (1) content dimensions, (2) creativity dimensions, (3) language dimensions, and (4) interaction dimensions, are essential components that teachers must consider. as a complex language skill, speaking involves segmental and supra-segmental aspects. teachers are therefore obliged to include supra-segmental aspects in the assessment rubric, such as pause accuracy and intonation, which can be included in the pronunciation dimension (bukhari, 2016). different assignments may lead to different assessment results for the pronunciation dimension involving supra-segmental aspects. for example, in a stand-up comedy assignment, supra-segmental aspects create the humor, so teachers should attend to those aspects carefully and categorize them under the creativity assessment. comprehensively, every dimension can relate to each other. besides, the interaction dimension should be emphasized because this dimension differentiates speaking skills from other language skills.
teachers can add and adjust the dimensions of assessment rubrics depending on the assignments given. the suitability of the skills that the teacher wants to measure and the form of assessment are critical in determining the teacher's evaluation of the assessment results. frequently, teachers compose a rubric whose criteria do not reflect students' speaking mastery but focus on students' work (endrayanto & harumurti, 2014). for example, teachers focus on the novel's content, whether it is exciting or not, the neatness of the presentation, performance and delivery, or other objects beyond the teachers' purpose of measuring students' speaking skills. another case shows that the test of writing skills is prone to the subjectivity of the assessors. therefore, an assessment rubric is needed to minimize subjectivity (mertler, 2000). the assessment rubric of writing skills should be written according to a unified principle. this may require adjusting the assessment dimensions to the learning process. the assessment dimensions can be developed together with students so that they can give their opinions about the assessment principles, which teachers can accommodate (ratnaningsih, 2016).

the appropriate type of rubric to diagnostically measure students' writing skills is one that provides detailed information related to the strengths and weaknesses of students' skills. it is useful, especially in making a better teaching and learning process. the dimensions attached to the rubric include (1) the content dimension and (2) the language dimension that contains writing structure, sentence structure, punctuation, and diction (ayhan & türkyılmaz, 2015).
the dimension that needs to be emphasized in writing skill assessment is the language aspect, which relates to the writing structure, sentence structure, punctuation, and diction used. different types of text may have different structures, so this aspect is essential to notice. punctuation can be a focus of the assessment because it reveals students' skills in understanding and using punctuation in specific contexts. in addition, writing skills can be measured using a holistic rubric. a holistic rubric can be used to measure the effectiveness of students' writing; for example, it can assess narrative, persuasive, and exposition texts. a holistic rubric is then used to determine the quality of each text, primarily with the aim of achieving the main learning goals in writing (endrayanto & harumurti, 2014). possible questions in this context are: how well is the story delivered (narrative)? how well can the paper affect someone (persuasive)? how well can the writer explain something (exposition)? every leading dimension of the assessment rubric can be developed and adjusted by teachers, but it should refer to the learning goals that students must achieve. in this way, the assessment rubric can provide a detailed basis for consistent and fair assessment (al-rabai, 2014). besides, students are expected to understand that they can self-reflect and benefit from doing assignments, in addition to obtaining scores. this is essential because the assignment is related to the achievement of the learning goals. a rubric is an instrument that can make the assessment objective, which means that subjectivity can be minimized. the rubric also provides clear and consistent assessment (greenstein, 2012; stevens & levi, 2013). teachers give information regarding the scoring criteria.
on the other hand, students can measure their achievement using the rubric and gain practical and efficient feedback. as an instrument for learning reflection, the rubric developed by teachers can serve as assessment guidance. the rubric is not just a tool for assessing learning; it can also be built into the lesson plan. assessment rubrics can help students understand what is assessed (criteria) and the detailed descriptions of the various achievement scores. from that perspective, students can set their learning strategy. besides, it is easy for the facilitator to explain how students' performances are assessed (rukmini & saputri, 2017). the rubric provides students with essential information regarding their performances against the teachers' criteria. teachers also have opportunities to make rubrics together with other teaching partners, making the assessment more specific and more valid. the rubric also transparently provides information regarding the process of scoring. this background on the use of rubrics in assessment shows that it is important to discuss how indonesian language teacher candidates can create holistic and analytic rubrics. holistic and analytic rubrics are used to measure aspects of attitudes, knowledge, and skills in learning (kennedy et al., 2013). the rubric is developed based on listening, speaking, reading, and writing skills.
each skill is described in aspects such as (1) differentiating the range and scale of the skill among the competence item options, (2) determining the theoretical construction from research and from books that explain the aspects assessed in the developed instrument, (3) using language that is concrete, unambiguous, and easy for raters or teachers to understand, (4) differentiating the aspects/domains assessed based on relevant theories but measuring different skills, and (5) developing the items with attention to simplification while still covering the aspects assessed. thus, the aims of this study are to (1) describe the need for language skills assessment rubrics; (2) describe students' abilities in creating holistic and analytic rubrics; (3) describe the potential and relevance of applying the rubric in the micro-teaching class; and (4) describe the obstacles faced by students in compiling the rubric.
method
the method used in this study is a mixed-methods approach. the data were holistic and analytic rubrics made by 250 students of indonesian language education. students were asked to make rubrics for listening, speaking, reading, and writing skills. the sampling technique was stratified random sampling, stratifying students' data by public and private universities. to assess the rubrics, the researchers made a research instrument consisting of (1) determination and classification of scales; (2) making a description of the tasks; (3) the use of good and correct language; (4) construction of components/dimensions; and (5) performance descriptors.
the instrument's validity was tested using aiken's v to calculate the content-validity coefficient, based on the assessment of each item by a panel of five people, especially regarding how far the item represents the construct measured. students' skills in making rubrics were analyzed to identify the most mastered aspects. students obtained scores that described their abilities based on the constructed indicators. the quantitative approach was used in the inter-rater assessment, in which five indonesian language education lecturers acted as experts in assessing the holistic and analytic rubrics made by students (sudaryanto et al., 2020). the qualitative approach was used in describing students' skills, especially in language. the qualitative results were validated through triangulation of methods and data sources, namely the interview results and the observation analysis of rubric making in each university. the qualitative data were analyzed with an interactive technique through data collection, data presentation, reduction, and conclusion drawing. each process was conducted continuously, and data reduction was done together with data collection (miles & huberman, 2012).
findings and discussion
findings
the initial analysis of language skill rubric assessment
the development of a rubric is an effort to give objective assessments, namely assessments that give accurate information about the development and growth of students. teachers should see each student as a unique individual who differs from the others. a good rubric should accommodate every possible answer presented by students in listening, speaking, reading, and writing as the language skills. teachers develop rubrics holistically, analytically, or both in order to obtain an accurate result.
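the content-validity check with aiken's v described in the method section can be illustrated with a short python sketch. for one item rated by n panelists on a scale with c categories, v = sum(r - lo) / (n(c - 1)), where lo is the lowest possible rating. the 1-5 rating scale, the example ratings, and the function name `aikens_v` are assumptions made here for illustration; the study reports only that five panelists rated each item.

```python
def aikens_v(ratings, lowest=1, highest=5):
    """Aiken's V for a single item.

    ratings: one integer rating per panelist.
    lowest/highest: ends of the rating scale (assumed 1-5 here).
    """
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one rating is required")
    if any(not lowest <= r <= highest for r in ratings):
        raise ValueError("rating outside the scale")
    # For consecutive integer categories, c - 1 equals highest - lowest.
    return sum(r - lowest for r in ratings) / (n * (highest - lowest))


# Hypothetical ratings from a five-person panel for one rubric item.
example_v = aikens_v([4, 5, 4, 5, 4])
```

values close to 1 indicate that the panel agrees the item represents the construct; the threshold for accepting an item depends on the number of raters and categories and is not stated in the study.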
the holistic rubric is a construction that contains various levels of performance and can define the quality of the assignments, the quantity of the assignments, or both. based on in-depth interviews with lecturers and students regarding holistic rubrics, students' understanding was found to be quite high, especially regarding the rubric structure that is often used in learning both at school and at university. an analytic rubric is a construction consisting of criteria that are divided into various levels of performance. there is no fixed procedure for developing rubrics because rubrics are objective-dependent constructs. assessment rubrics provide many benefits to both students and teachers. the rubric provides input and feedback to help students improve their skills, and it is also a powerful way to clarify students' goals and skills. the main obstacle lies in formulating comprehensive, visible, and observable skill indicators. data regarding students' understanding of the rubric were collected through interviews and analysis of lecturers' assessment documents on the competence to create holistic and analytic rubrics. the topics asked about were (1) students' knowledge of the process of making rubrics, (2) students' knowledge of the structure and systematics of rubrics, (3) steps and procedures for making rubrics on the four language skills, and (4) the materials that must be constructed in a rubric. the initial analysis of the assessment of the language skills rubric shows the mapping of the problem based on the students' knowledge related to the assessment of the language skill rubric.
based on the initial analysis conducted by the researcher, 80% of respondents knew about the structure and systematics of holistic and analytic rubrics, 22% could not apply this understanding to a good rubric, and 17% felt they were able to make rubrics but had not mastered the construction of the four language skills. the initial analysis showed that most students already know about the structure and systematics of holistic and analytic rubrics. this knowledge is fundamental for students in making language skill rubrics. however, 22% of students cannot apply this understanding to a good rubric based on the five components contained in a rubric. the needs analysis results showed that 17% of students felt they could make rubrics but had not mastered the construction of the four language skills, namely reading, writing, listening, and speaking. meanwhile, this study specifically focuses on students' abilities in compiling language skills assessment rubrics. mastery of the construction of the four language skills also influences how students interpret, create, and develop good rubrics. preliminary data analysis shows that students' competence in making language assessment rubrics is limited to just knowing (80%). however, most students do not have good competence in creating and applying rubrics. meanwhile, learning indonesian cannot be separated from the four language skills. mastering the competence to make language assessment rubrics is urgent so that the assessment process can be more objective, directed, and systematic. therefore, teachers must master the rubric of language skills assessment so that the learning process can run optimally and achieve optimal learning competencies.
students' ability in making rubric
researchers measure the ability of students to make language skills assessment rubrics. the instrument made is a performance instrument consisting of five aspects.
students are asked to make an assessment rubric, and then the researcher and the lecturer evaluate the ability to make the rubric. the measurement was carried out using five competencies, namely (1) determination and classification of scales (and scores) that describe the level of mastery; (2) making a description of the tasks that are expected to be produced or carried out by students; (3) the use of good and correct language; (4) construction of components/dimensions that students must pay attention to in completing assignments; and (5) description of performance quality (performance descriptors) of components/dimensions at each level of mastery. the clustering of the measurement results on the five competencies of students' abilities in making language skills assessment rubrics can be seen in table 1. to describe the students' ability to make rubrics, the ability is divided into three categories: high, medium, and low. sorting is done through standard-setting using the angoff method. the procedures of the angoff method include (1) each expert examines the test items individually, (2) the expert estimates the percentage chance that the test taker answers the item being studied correctly, (3) the expert gives a final score based on the average percentage score of each item, (4) a passing grade is assigned based on the average percentage score of the items, and (5) panelists discuss the cut score obtained and then describe the competence.
table 1. the competence of students in making rubric
category | compt. 1 | compt. 2 | compt. 3 | compt. 4 | compt. 5 | total
high | 43 (17%) | 32 (13%) | 70 (28%) | 12 (5%) | 54 (22%) | 211 (17%)
medium | 56 (22%) | 90 (36%) | 92 (37%) | 43 (17%) | 45 (18%) | 326 (26%)
low | 151 (61%) | 128 (51%) | 88 (35%) | 195 (78%) | 151 (60%) | 713 (57%)
mean | 3.024 | 3.06 | 3.316 | 3.072 | 3.028 |
figure 1. mean of students' ability in constructing rubric
the data in table 1 show that the competence to use good and correct language gets the highest score, with an average achievement of 3.316, compared to the other four aspects. in this case, the ability to compile an assessment rubric with good and correct language becomes vital. students must follow specific systematic strategies in describing standards or indicators to avoid overlap or confusion. the high average for language use competence is also influenced by the scientific background of indonesian language education students. in the technical aspect of language use, only a few errors in spelling, punctuation, and capitalization appear. in interactive speaking activities, listening skills cannot be separated. some students develop rubrics that integrate listening with speaking, which implies a collaborative assessment process. in this way, the productive skills assessed during learning can also supplement some of the less interactive receptive-based assessments. the procedures and rubrics (see figure 1) presented can be used for a single assessment or multiple assessments by providing an average score for overall performance over the school term. if a one-to-one assessment proves impractical, rubrics can also be used by observing language interaction activities in class over time (mardapi & kartowagiran, 2019). in this case, students are aware that they are being assessed on an ongoing basis and will receive a score for each skill category and an overall score at the end of the lesson.
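the angoff procedure used above to sort students into high, medium, and low categories can be sketched in python. this is a minimal illustration under stated assumptions, not the authors' implementation: the expert estimates are invented, the function names `angoff_cut_score` and `classify` are hypothetical, and the use of two cut scores to obtain three categories is an assumption, since the study does not report how the category boundaries were derived.

```python
from statistics import mean


def angoff_cut_score(judgments):
    """Average the experts' per-item estimates into one cut score (0-1).

    judgments[e][i] is expert e's estimate of the probability that a
    borderline student completes item i correctly.
    """
    if not judgments or not judgments[0]:
        raise ValueError("at least one expert and one item are required")
    if any(len(row) != len(judgments[0]) for row in judgments):
        raise ValueError("every expert must rate the same set of items")
    # Each expert's score is the average of that expert's item estimates.
    per_expert = [mean(row) for row in judgments]
    # The panel cut score is the mean of the experts' averages.
    return mean(per_expert)


def classify(score, medium_cut, high_cut):
    """Sort a score into low / medium / high using two cut scores."""
    if medium_cut > high_cut:
        raise ValueError("medium_cut must not exceed high_cut")
    if score >= high_cut:
        return "high"
    if score >= medium_cut:
        return "medium"
    return "low"


# Hypothetical panel of two experts rating two items.
cut = angoff_cut_score([[0.6, 0.8], [0.4, 0.6]])
```

in practice the panel would discuss the computed value, as in step (5), before fixing the boundaries passed to `classify`.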
the lowest average scores are on the ability of students to determine and classify scales (and scores) that describe the level of mastery and to describe the quality of performance (performance descriptors) of the components/dimensions at each level of mastery. one factor affecting the average score for the two competencies is that students' ability to design analytic and holistic rubrics is still low.
the potential and relevance of the rubric implementation in micro-teaching
based on the study, the specifications and relevance of the rubrics developed by students to assess language skills are presented in figure 2.
figure 2. the relevance of language skill rubric (panels: listening, speaking, writing, reading)
writing skills get the highest percentage: 83% of the rubrics students develop are relevant because they are measurable and systematic. meanwhile, the listening skill rubric got the lowest percentage, 32%, because students could evaluate themselves. furthermore, 51% of the developed speaking rubrics are relevant because the teacher can compile teaching materials clearly, and 76% of the developed reading rubrics are relevant because the learning procedures can be carried out in an integrative and more systematic way.
the obstacles in making a rubric
table 2. the obstacle in making a rubric
aspect | types of difficulties | the causes of difficulties
competence 1 | unable to concretely determine the scale. | students do not master the difference between holistic and analytic rubrics and do not master the determination of the correct score.
competence 1 | unable to differentiate scales 2 and 3. | students do not master the level of performance in the rubric scale.
competence 2 | in listening theory, it is complicated to distinguish between environmental disturbances and low comprehension. | students do not focus on indicators of student achievement and do not specifically describe the tasks that they must do.
competence 2 | the speaking theory is too complex to understand, so it needs a long specification not to be biased. | students do not describe their assignments specifically based on achievement indicators.
competence 3 | the use of concrete language is still challenging to master, so the rubric is not easy to understand. | students are not accustomed to composing sentences with concrete language.
competence 3 | repetitive use of active verbs. | students do not master the variety of active verbs in bloom's taxonomy.
competence 3 | the diction used is not easily understood by others. | students do not adjust the word choice to the variety of languages mastered by students.
competence 4 | difficulty in explaining the components/dimensions that must be mastered by students and measured by the teacher. | students do not master the indicators of achievement that students must achieve.
competence 4 | some of the components were developed to measure the same ability. | students are not able to develop achievement indicators into assessment components.
competence 5 | no performance quality description (performance descriptor) is provided to make it easier for teachers to assess students' abilities. | students do not know the fifth competence as one of the components included in the assessment rubric.
table 2 shows the various difficulties and causes of difficulties experienced by students in compiling and developing language skills assessment rubrics. this difficulty mapping is based on measuring student competence in compiling rubrics. the obstacles in the preparation of the rubric were collected through interviews conducted with the respondent students.
discussion
teacher competence in conducting assessments is one of the mandatory requirements for prospective teachers to hone their professionalism and pedagogy.
assessment is an objective process of reconstructing students' abilities into numbers and competency descriptions so that the progress of their abilities is measured systematically and continuously. components in the assessment are measurement tools that reflect actual abilities but remain within the corridor of the competence to be measured (suwandi & sudaryanto, 2021). one of the critical components that must be mastered is the assessment rubric. making assessment tools is not easy. besides paying attention to validity, assessment tools must have reliability; the results of using assessment tools also need to be carefully considered. thus, the same assessment tool for the same competency can be used in different schools/students. likewise, other assessment tools include assessment guidelines, rubrics, and test/non-test instruments. preliminary analysis shows that students as prospective teachers feel quite enthusiastic about developing rubrics. with a good rubric, students can understand what is assessed (criteria) and how detailed the descriptions are for different achievement grades. with this understanding, students can develop learning strategies and efficiently achieve the expected goals and indicators (reddy & andrade, 2010). on the other hand, the competence of prospective teachers in developing rubrics is also assessed in detail because one significant factor in learning running well is a measurement process that is carried out ideally. based on the research findings, students' knowledge of the components presented in the rubric of language skills assessment is still low. indonesian language as a skill subject still contains aspects of attitude and knowledge that must be measured (suwandi et al., 2021).
students have not been able to master the specifications of linguistic constructs, which consist of listening, speaking, reading, and writing skills. in addition, students have not been able to explain the differences among the dimensions in the skill rubric, especially in relation to the schema of attitudes, knowledge, and skills. in the aspect of attitude, the rubrics made by students are still not sufficiently observable. some students make a knowledge/test rubric to describe the attitudes of the students they will teach, whereas it is more appropriate for an attitude rubric to use observation and self-assessment rubrics (miller, 2013). in addition, some knowledge rubrics created by students actually represent practical skills. students are still not fluent in composing command words and active verbs or in differentiating between scales so that rubrics can be easily used by assessors/teachers. most students know that assessment rubrics can be analytic or holistic but cannot explain the differences or design assessment rubrics for language skills. based on their learning experiences, students were asked to apply analytic and holistic rubrics to the four language skills. language skills involve receptive and productive learning processes, in which students listen when the speech partner speaks and read when others write. the improvement of language skills needs to be described in the teacher's active learning process and is expected to be listed in the rubric created. the aspect of attitude covers (1) students' learning activities that can be seen in the classroom, (2) how students work together in doing the tasks given by the teacher, (3) students' activities when paying attention to the teacher's explanation and presentation of material, and (4) students' enthusiasm in class and involvement in the teaching and learning process, discussions, and other activities. each student activity is expected to be measured in a rubric made by students.
the rubrics produced by students for assessing reading and writing skills are quite good and relevant to be applied in schools, namely 76% for reading skills and 83% for writing skills. reading and writing skills are included in the textual category to be repeated and improved as long as they are still short. the analysis of the test construction in the rubrics made by students shows that reading and writing rubrics have more complete references than the other rubrics. reading is essentially an activity of capturing information, both explicit and implied, in the form of literal, inferential, evaluative, and creative reading comprehension by utilizing reading experiences. students develop the process assessment as measured by the observation rubric, covering both observable and unobservable aspects. in this case, the perspective developed in the rubric does not only focus on student scores or achievements; it prioritizes the quality of the learning process built through student interaction during learning. the rubrics produced by students for assessing speaking and listening skills are still not optimal and not relevant enough; namely, 78% of students have not been able to make good rubrics on listening skills, and 49% of students have not been able to make speaking rubrics. compared to developing reading and writing rubrics, developing listening and speaking rubrics has a different level of difficulty for prospective teachers. apart from the very complex construction of the rubric, speaking competence can change according to the situation and condition of students at school (especially during online learning). at the same time, in listening skills, the difficulty lies in the mastery of listening.
based on research conducted by sudaryanto et al. (2019b), listening skills are heavily influenced by listening, so that the material is more readily accepted if the topic is liked. students can determine theoretical construction from research results and book theories that explain the aspects assessed in the developed instrument. based on the rubric developed by the teacher (in this context, students as prospective teachers), students are expected to acquire the basic knowledge and skills that are considered essential for continuing their studies and adjusting to construct the learning continuum. in different cases, the use of language requires special treatment to produce language that is concrete, unambiguous, and easily understood by raters and teachers. designing rubrics in easy-to-understand language is not easy because the choice of words will affect the rater's acceptance of the rubric. the mapped causes of student difficulties are related to (1) not mastering the difference between holistic and analytic rubrics, (2) not mastering the elements in the rubric, (3) not mastering the use of concrete language and the variety mastered by students, and (4) not mastering bloom's taxonomy. meanwhile, the types of difficulties experienced by students are related to (1) not being able to determine the rubric scale concretely, (2) difficulty in distinguishing between external disturbances and shared understanding in listening theory, (3) speaking theory that is too complex, (4) use of language that is difficult for others to understand, (5) minimal mastery of operative words in bloom's taxonomy, and (6) difficulty in explaining and applying rubric elements. the advanced competence in developing rubrics that is expected to be reflected in this research is the ability of students to distinguish aspects/domains that are assessed based on relevant theories but measure different abilities.
therefore, students can develop items by paying attention to simplification while still meeting the assessed aspects.
conclusion
as part of the assessment tool, a rubric is a set of criteria used to assess a student's work or assignment performance. the rubric becomes a guide in the assessment to provide more detailed information on the achievement grade. thus, the rubric helps teachers provide more objective assessments in line with the expected learning outcomes. based on the research findings, three conclusions can be drawn: (1) student knowledge of the components presented in the language skills assessment rubric is still low: most students know that the assessment rubric can be analytic or holistic but cannot explain the difference or how to design the assessment rubric for language skills; (2) the rubrics produced by students for assessing reading and writing skills are quite good and relevant to be applied in schools. meanwhile, the rubrics produced by students for assessing speaking and listening skills are still not optimal and not relevant enough; (3) students can determine theoretical construction from research results and book theories that explain the aspects assessed in the developed instrument. the types and causes of difficulties experienced by students in compiling rubrics for assessing language skills are related to the five competencies asked about through interviews. based on the research results, there are four conclusions.
the four conclusions are (1) language skills are the main components in learning indonesian, so teacher competence is needed in compiling rubrics for assessing language skills; (2) student knowledge of the components presented in the language skills assessment rubric is still low: most students know that the assessment rubric can be analytic or holistic but cannot explain the difference or how to design the assessment rubric for language skills; (3) the rubrics produced by students for assessing reading and writing skills are quite good and relevant to be applied in schools. meanwhile, the rubrics produced by students for assessing speaking and listening skills are still not optimal and not relevant enough; (4) the constraints experienced by students are related to mastery of the differences between holistic and analytic rubrics, rubric scales, rubric structures, language, and achievement indicators.
references
al-rabai, a. (2014). rubrics revisited. international journal of education and research rubrics, 2(5), 473–484. https://www.ijern.com/journal/may-2014/39.pdf al zumor, a. (2015). quality matters rubric potential for enhancing online foreign language education. international education studies, 8(4), 173–178. http://dx.doi.org/10.5539/ies.v8n4p173 anisa, a. a. (2017). students’ literature achievement: predictors investigation research. reid (research and evaluation in education), 3(2), 144-151. https://journal.uny.ac.id/index.php/reid/article/view/17498 ayhan, ü., & türkyılmaz, m. (2015). key of language assessment: rubrics and rubric design. international journal of language and linguistics, 2(2), 82–92. https://ijllnet.com/journals/vol_2_no_2_june_2015/12.pdf barkaoui, k. (2016). what changes and what doesn’t?
an examination of changes in the linguistic characteristics of ielts repeaters’ writing task 2 scripts. ielts research reports online series, 3, 1-55. https://www.ielts.org/for-researchers/research-reports/online-series-2016-3 bukhari, s. (2016). mind mapping techniques to enhance efl writing skill. international journal of linguistics and communication, 4(1), 58-77. https://doi.org/10.15640/ijlc.v4n1a7 endrayanto, h., & harumurti, y. (2014). aplikasi rubrik untuk penilaian belajar siswa. kanisius. greenstein, l. (2012). assessing 21st-century skills: a guide to evaluating mastery and authentic learning. corwin. https://us.corwin.com/en-us/nam/book/assessing-21st-century-skills griffin, p. (1991). literacy assessment: merging teaching, learning, and assessment. in the annual meeting of the international reading association, las vegas, nv. https://eric.ed.gov/?id=ed337746 kennedy, j. a., anderson, c., & moore, d. a. (2013). when overconfidence is revealed to others: testing the status-enhancement theory of overconfidence. organizational behavior and human decision processes, 122(2), 266-279. https://doi.org/10.1016/j.obhdp.2013.08.005 mardapi, d., & kartowagiran, b. (2019). pengembangan instrumen pengukur hasil belajar nirbias dan terskala baku. jurnal penelitian dan evaluasi pendidikan, 15(2), 326-341. https://doi.org/10.21831/pep.v15i2.1100 mertler, c. (2000). designing scoring rubrics for your classroom. practical assessment, research, and evaluation, 7, 25. https://doi.org/10.7275/gcy8-0w24 miles, m., & huberman, m. (2012). analisis data kualitatif: buku sumber tentang metode-metode baru (t. rohendi, trans.). universitas indonesia press. miller, n. (2013). measuring up to speech intelligibility. in international journal of language and communication disorders, 48(6), 601-612. https://doi.org/10.1111/1460-6984.12061 pearce, j., mulder, r., & baik, c. (2009). involving students in peer review: case studies and practical strategies for university teaching. 
centre for the study of higher education, the university of melbourne. https://melbourne-cshe.unimelb.edu.au/__data/assets/pdf_file/0006/3590943/involving-students-in-peer-review.pdf ratnaningsih, e. (2016). improving students' writing ability through the use of dictogloss technique. transformation, 12(2), 1-14. https://jurnal.untidar.ac.id/index.php/transformatika/article/view/186 reddy, y., & andrade, h. (2010). a review of rubric use in higher education. assessment & evaluation in higher education, 35(4), 435–448. https://doi.org/10.1080/02602930902862859 retnawati, h. (2015). the comparison of accuracy scores on the paper and pencil testing vs. computer-based testing. turkish online journal of educational technology, 14(4), 135–142. http://www.tojet.net/articles/v14i4/14413.pdf rimmer, w. (2006). measuring grammatical complexity: the gordian knot. language testing, 23(4), 497–519. https://doi.org/10.1191/0265532206lt339oa rukmini, d., & saputri, l. (2017).
the authentic assessment to measure students' english productive skills based on 2013 curriculum. indonesian journal of applied linguistics, 7(2), 263–273. https://doi.org/10.17509/ijal.v7i2.8128 stevens, d., & levi, a. (2013). introduction to rubrics: an assessment tool to save grading time, convey effective feedback, and promote student learning. stylus publishing, llc. sudaryanto, m., mardapi, d., & hadi, s. (2019a). multimedia-based online test on indonesian language receptive skills development. journal of physics: conference series, 1339(1). https://doi.org/10.1088/1742-6596/1339/1/012120 sudaryanto, m., mardapi, d., & hadi, s. (2019b). how foreign speakers implement their strategies to listen indonesian language? journal of advanced research in dynamical and control systems, 11(7), 355–361. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3567681 sudaryanto, m., ulya, c., rohmadi, m., & kuhafeesah, k. (2020). inter-rater assessment on listening media for foreign language speakers. in proceedings of the 2nd konferensi bipa tahunan by postgraduate program of javanese literature and language education in collaboration with association of indonesian language and literature lecturers, kebipaan, surakarta, central java, indonesia. https://doi.org/10.4108/eai.9-11-2019.2295064 suwandi, s., & sudaryanto, m. (2021). benefits and challenges of learning indonesian language with an environmental system: an action research at high school in surakarta. psychology and education journal, 58(2), 4403–4413. http://psychologyandeducation.net/pae/index.php/pae/article/view/2829 suwandi, s., sudaryanto, m., wardani, n., zulianto, s., ulya, c., & setiyoningsih, t. (2021). higher order thinking skills in indonesian language national exam in junior high school. jurnal kependidikan: penelitian inovasi pembelajaran, 5(1), 31-44. https://doi.org/10.21831/jk.v5i1.35457 widiastuti, i. (2021). assessment and feedback practices in the efl classroom. 
reid (research and evaluation in education), 7(1), 13-22. https://doi.org/10.21831/reid.v7i1.37741 zubairu, u. m., dauda, c. k., sakariyau, o. b., & paiko, i. i. (2016). academic performance and moral competence: a match made in heaven? reid (research and evaluation in education), 2(2), 206-219. https://journal.uny.ac.id/index.php/reid/article/view/8956 https://jurnal.untidar.ac.id/index.php/transformatika/article/view/186 https://doi.org/10.1080/02602930902862859 http://www.tojet.net/articles/v14i4/14413.pdf https://doi.org/10.1191/0265532206lt339oa https://doi.org/10.17509/ijal.v7i2.8128 https://doi.org/10.1088/1742-6596/1339/1/012120 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3567681 https://doi.org/10.4108/eai.9-11-2019.2295064 http://psychologyandeducation.net/pae/index.php/pae/article/view/2829 https://doi.org/10.21831/jk.v5i1.35457 https://doi.org/10.21831/reid.v7i1.37741 https://journal.uny.ac.id/index.php/reid/article/view/8956 research and evaluation in education research and evaluation in education, e-issn: 2460-6995 iv subject indexes b blended learning, 175, 179, 180, 181, 182, 183, 184, 185 bridging course, 146, 147, 148, 149, 153, 154, 155, 156 c cect model, 129, 131, 132, 133, 134, 135, 136, 137, 138, 140, 141, 142, 143 cipp, 114, 119, 120, 122, 125, 128, 146, 148, 149 competitive strategy, 199, 208 critical thinking, 161, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 217, 221 d dcp students, 146, 148, 150, 151, 153, 154, 155, 156 digital learning, 212, 213, 214, 218, 223 e effective learning, 158, 160 evaluation, 116, 117, 118, 119, 120, 121, 122, 1127, 128, 132, 133, 134, 136, 137, 138, 139, 140, 142, 146, 148, 149, 150, 154, 155, 156, 157, 165, 173, 185, 186, 187, 188, 189, 190, 191, 192, 194, 196, 200, 204, 205, 207, 215, 216, 219, 223, 224 excretory system, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223 f formative assessment, 165, 186, 188, 196 i instructional model, 175 interactive, 132, 156, 175, 177, 
178, 179, 180, 181, 182, 183, 184, 185, 197, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224 interactive media, 212, 220 l learning outcome competence level, 158, 172 learning simulation, 158 m malcolm baldrige, 199, 203 mathematics learning, 186 mechanical engineering, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144 metacognition level, 158, 171, 172 p program effectiveness, 114, 118 program objectives, 114, 119, 120, 121, 122, 124, 125, 126, 127 q quasi-experiment, 158, 166, 175, 180 s spu, 129, 133, 134, 135, 136, 137, 138, 139, 140, 141, 143, 144 strategic planning, 176, 178, 199, 200, 201, 202, 203, 204, 205, 206, 207, 211 student achievement, 118, 176, 212, 214, 224 students' needs and barriers, 114, 121, 124, 126, 127 t teaching methods, 114, 118, 120, 121, 122, 124, 125, 126, 127, 154, 155, 160 teaching materials, 114, 116, 119, 121, 123, 124, 125, 213 tve, 199, 201, 202, 203, 204, 206 v vhs students, 129, 131, 132, 136, 137, 138, 139, 140, 141, 142, 143, 145 virtual laboratory, 212, 214, 222, 223, 224 w web-based, 175, 177, 178, 179, 180, 181, 182, 183, 184, 185

volume 1, number 2, december 2015

author indexes
†budiyono, aris, 129 budijantoro, f. putut martin herry, 212 edidas, 158 effendi, hansi, 175, 179, 184 irambona, alfred, 114 isnaeni, wiwi, 212 jailani, 186 jama, jalius, 158 kartowagiran, badrun, 186, 189, 197 kumaidi, 114 mardapi, djemari, 121, 128, 146, 151, 156, 181, 185, 187, 188, 194, 198 marianti, aditya, 212 pardjono, 129 perez, beatriz eugenia orantes, 146 rosnawati, r., 186 setiawan, heru, 212 soenarto, 175 sofyan, herminarto, 175 sugiyono, 129 suharno, 199 sukamto, 199 sutarto, 199

authors' biography

aditya marianti.
has been working as a lecturer in animal physiology in the department of biology, faculty of mathematics and natural sciences, semarang state university, indonesia, since 1993. she attained her bachelor's degree in biology education from semarang institute of teacher education and educational sciences (currently known as semarang state university), her master's degree in biology from gadjah mada university, indonesia, and her doctoral degree in environmental sciences from diponegoro university, indonesia. she has written and published many scientific studies on environmental conditions and their impact on the surroundings, on animal physiology, and on other topics. her research has been published in both accredited and non-accredited national journals and in international journals, and has been presented in various national and international forums and seminars.

alfred irambona. was born on december 15th, 1979, in makamba, a southern province of burundi, central africa. after his high school education, he attained his bachelor's degree in the english teaching department of the institute of applied pedagogy at burundi national university. he finished his master's degree in educational research and evaluation at yogyakarta state university, indonesia, in 2015.

†aris budiyono. was born on 5 april 1967. he worked as a lecturer in the department of mechanical engineering, faculty of engineering, semarang state university. he attained his doctoral degree in 2015 from yogyakarta state university, majoring in vocational education. he passed away in 2015.

badrun kartowagiran. was born on 25 july 1953. he works as a lecturer in the faculty of engineering and graduate school of yogyakarta state university, indonesia. he attained his bachelor's degree in 1977 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), majoring in machine engineering education.
in 1992, he attained his master's degree in educational research and evaluation from jakarta institute of teacher education and educational sciences (now known as jakarta state university). in 2005, he completed his doctoral study in psychology/psychometrics at gadjah mada university, indonesia.

beatriz eugenia orantes perez. was born on february 17th, 1982, in tuxtla gutierrez, chiapas, mexico. she studied english education at the state university of chiapas (unach). she has worked as an english lecturer at the state university of chiapas. in 2015, she attained her master's degree in educational research and evaluation from yogyakarta state university, indonesia.

djemari mardapi. works as the current head of the department of educational research and evaluation in the postgraduate school of yogyakarta state university, and as a lecturer in the same department as well as in the department of engineering. prof. djemari earned his bachelor's degree in electrical engineering education at yogyakarta state university (formerly known as yogyakarta institute of teacher education and educational sciences) and a master's degree in educational research and evaluation at the same university. he then pursued his ph.d. in education, measurement, and statistics at the university of iowa. he has worked and collaborated on several research projects and has published many research-focused articles and books.

edidas. was born on 9 september 1963. he currently works as a lecturer in the department of electronics engineering, faculty of engineering, padang state university, indonesia. his undergraduate study was in electronics engineering education. his master's degree is in the field of computer engineering, and his doctoral degree is from yogyakarta state university, majoring in vocational education.

f. putut martin herry budijantoro.
currently works as a lecturer in ecology, environmental sciences, and innovative learning media in the department of biology, faculty of mathematics and natural science, semarang state university, indonesia. he attained his bachelor's degree in biology from gadjah mada university, indonesia, and his master's degree in environmental sciences from the same university. some of his research and scientific publications focus on environmental quality, organic and inorganic waste, and nature reserves, as well as on improving the quality of courses in ecology, environmental sciences, and learning media. besides his scientific publications in journals, he is also active in social research.

hansi effendi. currently works as a teaching staff member in the department of electrical engineering, faculty of engineering, padang state university, where he has been teaching since 2002. he was born in batusangkar on february 11th, 1979. he earned his undergraduate degree in 2001 from the electrical engineering study program of andalas university, padang. he then earned his graduate degree in 2009 from the computer science study program, graduate program, putera indonesia university, padang. he earned his postgraduate degree in 2015 from the vocational education technology study program, postgraduate program, yogyakarta state university.

herminarto sofyan. was born on 9 august 1954. he currently works as a senior lecturer in the department of automotive engineering education of the faculty of engineering and graduate school of yogyakarta state university, indonesia. he attained his bachelor's degree in 1978 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), majoring in mechanical engineering.
in 1986, he attained his master's degree from jakarta institute of teacher education and educational sciences (now known as jakarta state university), majoring in vocational education. he attained his doctoral degree in 2002 from jakarta state university, majoring in educational technology.

heru setiawan. was born in blora on april 24th, 1993. his interest in the field of education led him to attain his bachelor's degree in biology education from the faculty of mathematics and natural sciences, semarang state university, indonesia. he is currently active as a researcher in the field of biology education, working either independently or with funding from institutions or the directorate general of higher education of indonesia. he is also involved in community service activities. in addition, he has experience as a lecturer's assistant in the courses of general biology, genetics, and plant taxonomy. some of his research focuses on digital learning and technology.

jailani. was born on 27 september 1959. he currently works as a lecturer in the faculty of mathematics and natural science and the graduate school of yogyakarta state university. in 1984, he attained his bachelor's degree from surabaya institute of teacher education and educational sciences (now known as surabaya state university), majoring in mathematics education. in 1990, he graduated from malang institute of teacher education and educational sciences (now known as malang state university) with a master's degree in the same major. in 2002, he attained his doctoral degree from yogyakarta state university, majoring in educational research and evaluation.

jalius jama. was born on 5 february 1942. he works as a senior lecturer in the faculty of engineering, padang state university.
he attained his bachelor's degree from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), his master's degree from sam houston state university, usa, and his doctoral degree from ohio state university, usa.

kumaidi. was born on march 24th, 1952. he is currently a lecturer at yogyakarta state university and at several other universities, including muhammadiyah university of surakarta. he attained his bachelor's degree in mechanical engineering education at yogyakarta state university in 1976. in 1984, he attained his master's degree in educational measurement and statistics at iowa university, usa. in 1987, he attained his ph.d. in the same major at the same university. he has published widely.

pardjono. was born on 2 september 1953. he currently works as a senior lecturer in the faculty of engineering and graduate school of yogyakarta state university. he attained his bachelor's degree in 1977 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), indonesia, majoring in mechanical engineering education. in 1986, he attained his master's degree in industrial arts and technology education from state university of new york, usa. he attained his doctoral degree in cognitive education in 2000 from deakin university, australia.

r. rosnawati. was born in indragiri hulu on 20 december 1967. she currently works as a lecturer in the mathematics education study program, faculty of mathematics and natural sciences, yogyakarta state university, indonesia.
she attained her bachelor's degree from bandung institute of teacher education and educational sciences (now known as indonesia university of education) in 1991, majoring in mathematics education, her master's degree in mathematics from gadjah mada university, indonesia, in 2000, and her doctoral degree from yogyakarta state university in 2015, majoring in educational research and evaluation.

soenarto. was born on 4 august 1948. he works as a senior lecturer in the faculty of engineering and the graduate school of yogyakarta state university, indonesia. he attained his bachelor's degree in 1974 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), majoring in electrical engineering education. in 1984, he attained his master of science degree in industrial and vocational education from state university of new york, usa. in 1987, he also attained a master of arts degree in educational program evaluation from ohio state university, usa. he attained his doctor of philosophy degree in industrial vocational education in 1988 from ohio state university, usa.

sugiyono. was born on 14 december 1953. he works as a senior lecturer in the faculty of engineering, yogyakarta state university, indonesia. he attained his bachelor's degree in 1977 from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), majoring in mechanical engineering education. in 1985, he attained his master's degree from bandung institute of teacher education and educational sciences (now known as indonesian university of education), majoring in educational management. he attained his doctoral degree in 1990 in the same major at the same university.

suharno. was born in sragen on june 3rd, 1971. he earned his undergraduate degree in the engineering study program of national technology academy yogyakarta.
he then earned his graduate degree from the engineering study program of gadjah mada university and his postgraduate degree from the technological and vocational education study program of yogyakarta state university. he currently works as a teaching staff member (lecturer) in the engineering study program of surakarta state university.

sukamto. was born on 25 february 1947. he is a professor in the department of mechanical engineering education of the faculty of engineering, yogyakarta state university.

sutarto. was born in cilacap, central java, indonesia, on 1 september 1953. he currently works as a lecturer in the civil engineering and planning education study program, faculty of engineering, yogyakarta state university. he attained his bachelor's degree in civil engineering education from yogyakarta institute of teacher education and educational sciences (now known as yogyakarta state university), his master's degree in technology education from state university of new york (suny) at oswego, usa, and his doctoral degree in vocational education from ohio state university (osu), ohio, usa.

wiwi isnaeni. was born in banyumas, central java, in 1958. she has worked as a lecturer at semarang state university since 1985, focusing on animal physiology. in 1984, she attained her bachelor's degree in biology education from yogyakarta institute of teacher education and educational sciences (currently known as yogyakarta state university). in 1992, she attained her master's degree in science, majoring in biology, from gadjah mada university, indonesia. in 2014, she finished her doctoral degree in educational research and evaluation at yogyakarta state university.
submission guidelines

• the manuscript submitted is the result of research or a scientific assessment of a current issue in the area of research, evaluation, and education in a broad sense, which has not been published elsewhere and is not being sent to other journals.
• manuscripts are accepted in english. any consistent spelling and punctuation style may be used. please use single quotation marks, except where 'a quotation is "within" a quotation'. long quotations of 40 words or more should be indented without quotation marks.
• a typical manuscript is approximately 8000 words or 12-18 pages including tables, references, captions and endnotes. manuscripts that greatly exceed this will be critically reviewed with respect to length. (a4; margins: top 3, left 3, right 2, bottom 2; double columns [except in abstract: single column]; single-spaced; font: garamond, 12).
• manuscripts should be compiled in the following order: (1) title; (2) abstract; (3) keywords; (4) main text: introduction, research method, findings and discussion, conclusion and/or suggestions; (5) references.
• the title of the manuscript should clearly represent the content of the article, and contain the keywords.
• authors' name(s) should be written under the title (without any academic degree), along with the affiliation(s) and email address(es).
• an abstract that does not exceed 300 words is required for any submitted manuscript. it is written narratively, containing the aim(s), method, and result(s) of the research.
• each manuscript should have 3 to 5 keywords written under the abstract.
• all tables and figures are adjusted to the paper length, and numbered as referred to in the text.
• citations and references follow american psychological association (apa) style, for example: .......... (switzgerald, 2014, p. 8) ............. mardapi (2015, pp. 13-14) [in text].
• american psychological association (apa) style format is used.
• the manuscript must be in *.doc or *.rtf format, and sent to reid's management via online submission by creating an account in this open journal system (ojs) [click register if you do not have an account yet, or click log in if you already have an account].
• authors' biography must be written narratively, containing each author's full name, degree(s) attained, place and date of birth, the last three educational levels completed, the affiliation/department in which the author is currently working, phone number and email address.
• all author(s)' names and identity(es) must be completely entered in the form filled in by the corresponding author: email; affiliation; and each author's short biography (in the column of 'bio statement'). [if the manuscript is written by two or more authors, please click 'add author' in the 3rd step of 'enter metadata' in the submission process and then enter each author's data.]
• funding or grant-awarding bodies (if any) are acknowledged in the column of 'contributors and supporting agencies' when entering metadata in the open journal system (ojs) of the journal. for single agency grants: "this work was supported by the [name of funding agency] under grant [number xxxx]."
• all correspondence, information, and decisions regarding submitted manuscripts are sent to the email address written in the manuscript and/or the email address used for the submission.
• word template is available for this journal.
if you have template queries, please contact reid.ppsuny@gmail.com.

reid (research and evaluation in education), 7(2), 2021, 132-144
available online at: http://journal.uny.ac.id/index.php/reid
copyright © 2021, reid (research and evaluation in education), 7(2), 2021, issn: 2460-6995 (online)

the effect of external factors moderated by digital literacy on the actual use of e-learning during the covid-19 pandemic in islamic universities in indonesia

sumin1*; kahirol mohd salleh2; nurdin3
1institut agama islam negeri pontianak, indonesia
2universiti tun hussein onn malaysia, malaysia
3sekolah menengah kejuruan (smk) negeri taman fajar aceh timur, indonesia
*corresponding author. e-mail: amien.ptk@gmail.com

article history
submitted: 27 october 2021; revised: 9 december 2021; accepted: 9 december 2021

keywords
external factors; actual use; digital literacy; e-learning

abstract
the future of education is increasingly worrying due to the impact of the covid-19 pandemic since the end of 2019. restrictions on community activities and campus closures have forced university administrators to use e-learning. however, online learning has encountered many obstacles. barriers to the use of e-learning are thought to stem from external problems (online facilities and infrastructure) or from educators and students (internal factors), such as lack of literacy, low absorption, level of understanding, and other non-technical factors. this study aims to examine further the influence of external factors and digital literacy, and the moderating effect of digital literacy on the relationship between external factors (system design, user-friendliness, devices, internet, electricity) and the actual use of campus e-learning at islamic universities. this study found that: external factor variables have a significant positive effect on the actual use of e-learning. the digital literacy variable has a significant positive effect on the actual use of e-learning. the digital literacy variable weakens the influence of external factors on the actual use of e-learning at islamic universities in indonesia.

this is an open access article under the cc-by-sa license (https://creativecommons.org/licenses/by-sa/4.0/).

how to cite: sumin, s., salleh, k., & nurdin, n. (2021). the effect of external factors moderated by digital literacy on the actual use of e-learning during the covid-19 pandemic in islamic universities in indonesia. reid (research and evaluation in education), 7(2), 132-144. https://doi.org/10.21831/reid.v7i2.44794

introduction

since the initial appearance of a novel coronavirus in wuhan, china, at the end of 2019, which was later named sars-cov-2, the pandemic has infected nearly 200 million people. based on who data (03/08/2021), 223 countries have been affected by covid-19, with 198,778,175 confirmed cases, 4,235,559 deaths, and a total of 3,886,112,928 vaccine doses administered (unesco, n.d.). the covid-19 pandemic, both directly and indirectly, has had a wide impact on human life. in addition to its health impacts, covid-19 has harmed almost all sectors. the most significant impacts felt by almost all affected countries are on the economic, social, political, and educational sectors.

in the economic sector, the distribution of goods and services can only be carried out under strict restrictions or health protocols. culinary businesses and the micro, small and medium enterprises (msme) sector have had to close because of government restrictions imposed to curb the transmission of covid-19; some have even been forced to quit or go bankrupt because few customers come to make transactions, leaving them unable to pay off their investment loans. in the social sector, night entertainment venues, social and religious associations, and communities were forced to stop their activities to prevent the transmission of covid-19. in the political and government sectors, covid-19 has forced governments in affected countries to make policies to prevent and control covid-19. governments have even had to cut and refocus their budgets to provide social assistance, accelerate vaccination uptake, and break the chain of transmission of covid-19.

the impact of the covid-19 pandemic on the education sector can be seen in an official report of the united nations educational, scientific and cultural organization (unesco), which stated:

one year after the covid-19 pandemic, almost half of the world's students are still affected by partial or complete school closures, and more than 100 million additional children will fall below the minimum reading proficiency level as a result of the health crisis. prioritizing the restoration of education is critical to avoiding a generational disaster as highlighted at the ministerial summit in march 2021. (unesco, n.d.)

unesco supports countries in their efforts to reduce the impact of school closures, address learning loss and adapt education systems, especially for vulnerable and disadvantaged communities.
to mobilize and support sustainable learning, unesco has formed the global education coalition, which currently has 160 members working on three main themes: gender, connectivity, and teachers (unesco, n.d.).

the future of education is very worrying and is at stake due to the direct and indirect impacts of the current covid-19 pandemic. learning conducted through online media (e-learning or social media) has encountered many obstacles. barriers to the use of e-learning can come from external problems (online facilities and infrastructure) as well as from factors originating within educators and students (internal factors), such as lack of literacy, low absorption, level of understanding, and other non-technical factors.

in the research by putro et al. (2020), the five countries studied (indonesia, the philippines, nigeria, finland, and germany) are divided into groups of poor, developing, and developed countries. the problems of online learning during the covid-19 pandemic found by putro et al. (2020) can be classified into two groups, namely technical and non-technical problems. technical problems include inadequate internet network availability, unsupported electricity networks, limited availability of supporting equipment (facilities) for both educators and students, and unevenly distributed school access to software that supports online learning.

the non-technical problems include: students' level of understanding of the material taught online; the piling up of assignments submitted online; uneven mastery of information technology among educators and students; the uneven financial ability of students; the tendency of online learning to cause boredom in students; disruption of household work (because learning must be done from home); students' difficulties in adapting to online learning systems; reduced verbal communication between educators and students; disruption of the academic calendar (due to policy changes); the elimination of routine training and seminar activities; learning gaps; education budget cuts; interference from family, friends, and the environment (because learning is carried out from home); lack of response and good coordination in overcoming obstacles in online learning; lack of distance learning experience; the high cost of the internet; the poor condition of education in the pre-pandemic period; lack of parental support for and understanding of online learning; rapid changes that leave insufficient time to prepare appropriate learning designs; the gap between students with high ability and a good learning mentality and students without them; less effective online group assignments; fatigue; and the view that technology in education is not good for children.

the non-technical ability of educators and students in using online learning systems, as described by putro et al. (2020), cannot be separated from the digital literacy of educators and students. this is consistent with the results of the study by widana (2020, p.
8), who found that “…the digital literacy factor is one of the variables that significantly affect the ability of teachers to develop hots-based assessments…” widana's results are in line with the research by noh (2017), who found that “…bit literacy most influences information usage behavior, followed by virtual community literacy and technical literacy… bit literacy is related to the ability to use information including information retrieval, information acuity, information editing, information processing, and information utilization…” jan (2017, p. 31) found that “… digital literacy (dl), tablet and smartphone use, previous training in computer use and frequency of computer use significantly influence students' attitudes toward ict use”. on the other hand, jang et al. (2020) found that “digital literacy did not have a direct significant effect on the intention to use learning technology in finland...”

the researchers chose the state islamic universities (perguruan tinggi keagamaan islam negeri, or ptkin) as the research locus because the majority of students and lecturers at ptkin are graduates of traditional islamic boarding schools. according to azzahra and amanta (2021), most islamic boarding schools in indonesia use traditional learning systems, which makes it difficult to acquire digital literacy and use digital technology. there is no complete information regarding how many islamic boarding schools are equipped with facilities such as the internet and computers. thus, based on this research gap and the results of previous studies, this paper aims to further examine the influence of external factors moderated by digital literacy skills on the use of e-learning at state islamic universities in indonesia.
external factors

given the breadth of the definition of external factors, the researchers narrowed its meaning to the context of learning information technology. limiting the meaning of external factors is intended to facilitate the identification of variables and their indicators. the external factors referred to in this study are factors originating from outside the students themselves that can affect their ability to utilize information technology. external variables can be defined as factors outside the users of information technology, such as information system design, ease of understanding or learning, availability of supporting devices (smartphone/computer), an adequate internet network, and an adequate electricity network. according to davis et al. (1989, p. 985), perceived usefulness can be influenced by several external factors; educational programs, for instance, are designed to "capture" potential users so that they use information systems to increase their productivity. still according to davis et al. (1989, p. 985), learning based on the concept of feedback between educators and students is another type of external variable that can affect beliefs about the use of information technology.

digital literacy

according to krumsvik (2008) in liu et al. (2020), “computer-based literacy, media literacy, digital literacy, and digital competence are concepts that focus on the need to use technology in the digital era”. meanwhile, according to kaeophanuek et al. (2019), digital literacy is: …a set of competencies possessed by an individual to apply digital tools well in the digital era, easily accessing, applying, evaluating, analyzing and synthesizing data, as well as creating new knowledge. with that, students will be able to communicate and present content through various digital technologies. a good level of digital literacy will make it easier for students to achieve their goals.
if literacy is defined as a person's ability to read and interpret written sources of knowledge in a social group, then academic literacy is the ability to read, interpret, and produce information in a digital format that is valued in academia (kaeophanuek et al., 2019, p. 24). according to ferrari; kaeophanuek, et al.; and owen, et al. in kaeophanuek et al. (2019, p. 24), the indicators that form digital literacy variables can be identified as follows.
(a) information skills: the basis for information management, techniques, and various strategies involving digital information management, which includes the process of identifying problems, determining the search topics, methods, and strategies for accessing, analyzing, synthesizing content, systematizing, evaluating, interpreting, and applying information used in doing or solving problems correctly.
(b) use of digital tools: skills and competencies in learning how to use a wide variety of software and applied digital tools to successfully accommodate everyday life. it also relates to the ability to maintain, manage, use, and troubleshoot basic computers, as well as the ability to communicate, systematically manage either personal or network data, comply with ethical norms, and utilize technology for effective teamwork.
(c) digital transformation: skills in consolidating information to create, improve, design, and produce content and products, presenting information in the form of new information, and creating new knowledge and new digital innovations under collaborative learning. learners can reflect on their thoughts to improve their work and publish it following copyright laws.
actual system use

actual system use is the real condition of the application of information systems/information technology, measured in units of time or frequency of technology use. users will feel satisfied with information system or information technology services if they believe that using the system can increase work productivity, which is reflected in the real conditions of use (davis et al., 1989, p. 987; venkatesh & davis, 2000, p. 204). actual system usage can be evaluated in units of time, that is, how often users use information technology services and for how long. actual technology use is therefore measured by the accumulated time spent interacting with the technology and the number of times the technology is used (davis et al., 1989, p. 987; venkatesh & davis, 2000, p. 204).

according to li and lalani in hermawan (2021), “e-learning can be found from a variety of existing learning media, ranging from language applications, video conferencing tools, virtual tutoring, online learning software, moodle, and many more”. based on the expert opinions put forward above, an online learning system (e-learning) is a set of software designed as an online learning medium (using an internet network) that can be accessed via desktop computers and mobile devices (smartphones) and that contains learning plans, the learning process, and learning evaluation. learning plans can be designed and incorporated into e-learning based on the content set out in the curriculum document. the learning process through e-learning includes delivering material in various forms, such as textbooks, journals, presentation slides, audio and video, or learning website addresses, as well as recording the attendance of teachers and students and providing discussion rooms.
e-learning applications at certain universities or schools have been designed following the requirements of the quality standards of education with special criteria and conditions, which differ from social media or from e-learning based on content management systems offered by software developers, such as google classroom, moodle, and others.

according to davis et al. (1989), external variables have no direct effect on attitudes and behavior in using technology. the technology acceptance model proposed by davis et al. (1989), however, includes a mechanism that bridges external variables and attitudes toward using technology; it is triggered by individual differences in personality and characteristics, which are related to one's self-confidence and beliefs. research by jan (2017, p. 31) found that “… digital literacy (dl), tablet and smartphone use, previous training in computer use and frequency of computer use significantly influence students' attitudes toward ict use”. on the other hand, jang et al. (2020) found that digital literacy had no direct significant effect on the intention to use.

figure 1. thinking framework

the researchers have not found any previous studies that place digital literacy as a moderator of the influence of external factors on the actual use of information and computer technology (ict). the researchers therefore intend to extend the models previously proposed by experts by placing digital literacy as a moderating variable of the influence of external factors on the actual use of e-learning. based on the theories and results of previous research, this research model can be described as in figure 1.
method

this study uses a correlational research design and a quantitative approach. the correlational method is used to determine the coefficient of the effect of the predictor variables on the response variable (creswell, 2012, p. 45). the correlational research design is used to analyze the direct influence of external variables and digital literacy variables on the use of e-learning, as well as to measure the impact of the digital literacy variable, which is hypothesized to strengthen or weaken the influence of external variables on the use of e-learning. based on the theory that underlies the relationship between external factors and digital literacy on the use of learning information technology (e-learning), the researchers formulate the following hypotheses.
h1: there is a significant direct effect of external factors on the actual use of e-learning at islamic universities in indonesia;
h2a: there is a direct positive and significant influence of digital literacy on the actual use of e-learning at islamic universities in indonesia;
h2b: digital literacy can strengthen the influence of external factors on the actual use of e-learning at islamic universities in indonesia.

to answer the research questions, the researchers used a multivariate statistical analysis technique, namely structural equation modeling (sem) with the partial least squares (pls) approach. herman wold introduced sem-pls to model latent variables, which he called "soft modeling". the term refers to the flexibility of sem-pls, which does not require many assumptions and does not have to be based on a strong theory. sem-pls can be used for theory confirmation but can also be used to develop models (vinzi et al., 2010, p. 2).
the analysis of structural equation modeling with the partial least squares (pls) approach is carried out in two stages: first, an assessment of the measurement model, which in sem-pls is called the outer model; and second, an assessment of the structural model, which in sem-pls is called the inner model. the measurement model is defined through the equations in formula (1) and formula (2), in which $\lambda$ = factor loading from indicator to latent variable, $\delta$ = residual/error on the exogenous latent variable indicator, $\varepsilon$ = residual/error on the endogenous latent variable indicator, $x$ = exogenous latent variable indicators, and $y$ = endogenous latent variable indicators.

[figure 1 diagram labels: external factors (ef), digital literacy (dl), and the interaction dl*ef point to actual usage (au); paths h1, h2a, h2b]

$$x = \lambda_x \xi + \delta \tag{1}$$

$$y = \lambda_y \eta + \varepsilon \tag{2}$$

the quality of the measurement model can be assessed from several measures related to the validity and reliability of the instrument items (garson, 2016; hair et al., 2014; lohmöller, 1989), that is:
(1) convergent validity: an instrument item can be designated as a valid measurement if it has a loading factor value of ≥ 0.7;
(2) discriminant validity: a measurement item can be designated as a good measure if it is significant only in measuring the latent construct/variable in its own indicator block and is not significant in other latent variable indicator blocks; and
(3) composite reliability and internal consistency: instrument items meet composite reliability if they have a composite reliability coefficient (cr) > 0.7, and are considered to have good/ideal internal consistency if they have a cronbach's alpha coefficient > 0.7.

the structural model (inner model) can be defined through the mathematical equation in formula (3), where $\eta$ = endogenous latent variable, $\xi$ = exogenous latent variable, $\beta$ = parameter coefficient (factor loading) of the exogenous latent variable on the endogenous latent variable, and $\zeta$ = residual/error of the inner model. the inner model can be assessed from several statistical measures as follows: (a) model fit (fit index, coefficient of determination, and effect size), and (b) hypothesis testing (direct and indirect effect, total effect).

$$\eta = \beta \xi + \zeta \tag{3}$$

the sampling technique used in this paper is simple random sampling. the sample size is calculated using the formula from lemeshow et al. (1990), as shown in formula (4), where $z$ = standard normal table value (if the confidence level is 95%, then $z$ = 1.96), $p$ = the estimated proportion of the attribute in the population, assumed to be 0.5, and $e$ = the absolute precision required.

$$n = \frac{z^{2}\, p\,(1 - p)}{e^{2}} \tag{4}$$

if the confidence level (1-α) is set at 95% (α = 5%), the standard normal distribution table value is 1.96.
the anticipated population proportion is p = 0.80, assuming that the proportion of students using e-learning at ptkin is 80%. the expected absolute precision was set at e = 0.03999 (close to the 5% significance level), which gives a relative precision (ε) of e/p = 0.0499875 (close to the 5% significance level). the population size is 760,619. then, the sample size can be calculated based on formula (4), as follows.

$$n = \frac{1.96^{2} \times 0.80 \times (1 - 0.80)}{0.03999^{2}} \approx 384.35$$

rounded up, the ideal number of samples in this study was 385 respondents, taken randomly from a student population of 760,619 at 59 state islamic universities in indonesia. primary data in this study were obtained through a questionnaire, which was distributed randomly to the respondents in the target sample (students at state islamic universities in indonesia). the questionnaire was designed using the semantic differential scale. semantic differential scaling is an attitude scale arranged along a continuum, with very positive answers on the right and very negative answers on the left. the semantic differential uses seven response options for each instrument item. for positive statements, responses are graded from "strongly disagree" (item score = 1) to "strongly agree" (score = 7); for negative statements, they are graded in reverse, from "strongly agree" (item score = 1) to “strongly disagree” (score = 7) (rosenberg & navarro, 2018). convergent validity in the reflective measurement model can be checked from the standardized loading factor value, which shows the correlation between indicator scores and latent variables, as in table 1.
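the sample-size calculation can be checked with a short script. this is a minimal sketch, assuming that formula (4) from lemeshow et al. (1990) has the common form n = z²·p·(1 − p)/e² and using the values stated above (z = 1.96, p = 0.80, e = 0.03999). the function name `lemeshow_sample_size` is illustrative and does not come from the source.

```python
import math


def lemeshow_sample_size(z: float, p: float, e: float) -> int:
    """Sample size for estimating a proportion, rounded up to a whole respondent.

    Assumed form of formula (4): n = z^2 * p * (1 - p) / e^2
    z: standard normal value for the chosen confidence level
    p: anticipated population proportion (0 < p < 1)
    e: absolute precision required (e > 0)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if e <= 0.0:
        raise ValueError("e must be positive")
    return math.ceil(z ** 2 * p * (1.0 - p) / e ** 2)


# Values stated in the text: 95% confidence, 80% anticipated proportion.
n = lemeshow_sample_size(z=1.96, p=0.80, e=0.03999)
print(n)
```

with these inputs the unrounded result is about 384.35, so rounding up gives the 385 respondents reported in the text. the same function with p = 0.5 and e = 0.05 also returns 385.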
calculations using warppls 7.0 software on the 14 indicator items of the external factor variable, the 23 indicator items of the digital literacy variable, and the seven indicator items of the actual use of e-learning variable gave the following results. items x1.7, x1.8, x1.9, x1.10, x1.11, x1.12, and x1.13 of the external factors latent variable have standardized loading factor values < 0.7, so these seven items must be discarded because they are not valid in measuring the latent variable of external factors.

table 1. variables and indicators of measurement of research variables
| no. | variable | indicators | measuring scale |
| 1. | external factors (davis et al., 1989) | e-learning design; easy to understand (user-friendly); online device availability (smartphone/computer); adequate internet network availability; availability of adequate electricity grid | ordinal |
| 2. | digital literacy (kaeophanuek et al., 2019) | skills in managing information; skills in using digital equipment; digital transformation | ordinal |
| 3. | actual usage of e-learning (davis et al., 1989; venkatesh & davis, 2000) | usage time duration; frequency of use | ordinal |

table 2.
convergent validity checking
| item | ef | dl | au | indicator type | annotation |
| x1.1 | 0.777 | 0.192 | 0.052 | reflective | valid |
| x1.2 | 0.771 | -0.03 | 0.065 | reflective | valid |
| x1.3 | 0.762 | -0.06 | 0.122 | reflective | valid |
| x1.4 | 0.798 | -0.04 | -0.116 | reflective | valid |
| x1.5 | 0.739 | -0.04 | -0.176 | reflective | valid |
| x1.6 | 0.751 | -0.02 | 0.052 | reflective | valid |
| x2.3 | | 0.705 | -0.009 | reflective | valid |
| x2.8 | | 0.781 | -0.002 | reflective | valid |
| x2.11 | | 0.781 | 0.013 | reflective | valid |
| x2.12 | | 0.804 | -0.002 | reflective | valid |
| x2.14 | | 0.735 | -0.113 | reflective | valid |
| x2.17 | | 0.83 | 0.038 | reflective | valid |
| x2.19 | | 0.719 | -0.041 | reflective | valid |
| x2.20 | | 0.766 | 0.039 | reflective | valid |
| x2.21 | | 0.754 | 0.068 | reflective | valid |
| y1.1 | | | 0.847 | reflective | valid |
| y1.2 | | | 0.859 | reflective | valid |
| y1.3 | | | 0.743 | reflective | valid |
| y1.4 | | | 0.853 | reflective | valid |
| y1.5 | | | 0.805 | reflective | valid |

furthermore, of the 23 items measuring the digital literacy latent variable, 14 items have a standardized loading factor value of < 0.7, namely items x2.1, x2.2, x2.4, x2.5, x2.6, x2.7, x2.9, x2.10, x2.13, x2.15, x2.16, x2.18, x2.22, and x2.23. these items are not valid in measuring the digital literacy latent variable and must be excluded. for the actual use of e-learning variable, two of the seven questionnaire items have a standardized loading factor value of < 0.7, namely y1.6 and y1.7. these items must also be excluded from the measurement because they are not valid in measuring the actual use of e-learning. the results of measuring the variables after removing the items that do not meet the criteria for convergent validity are presented in table 2.
a measuring instrument is said to have good validity not only when it is able to measure the variables in the construct to be measured but also when it differs significantly from measurements in other construct indicator blocks. how far the instrument items differ from one indicator block to another can be seen from the results of discriminant validity testing. the results of the discriminant validity test in sem-pls using warppls software in this paper are presented in table 3. the results in table 3 show that, for every variable, the average variance extracted (ave) is below the square root of the ave, so it can be concluded that the measurement items of the external factor latent variable are valid only for measuring the external factor variable, and likewise the items used for the digital literacy variable and the actual use of e-learning. the results of instrument reliability testing using composite reliability and internal consistency (cronbach's alpha) in this paper are presented in table 4.

table 3. discriminant validity checking
| variable | ave | square root of ave | annotation |
| ef | 0.588 | 0.766812 | valid |
| dl | 0.585 | 0.764853 | valid |
| au | 0.677 | 0.8228 | valid |
source: primary data, processed with sem-pls (warppls 7.0), 2021.

table 4. composite reliability checking
| variable | composite reliability coefficients | internal consistency (cronbach's alpha coefficients) | annotation |
| ef | 0.895 | 0.859 | reliable |
| dl | 0.927 | 0.911 | reliable |
| au | 0.913 | 0.88 | reliable |

table 5. model accuracy criteria
| no. | model fit and quality indices | statistics | criteria (kock, 2019) | annotation |
| 1. | average path coefficient (apc) | 0.296 | p<0.05 | accepted |
| 2. | average r-squared (ars) | 0.52 | p<0.05 | accepted |
| 3. | average adjusted r-squared (aars) | 0.517 | p<0.05 | accepted |
| 4. | average block vif (avif) | 1.2 | accepted ≤ 5, ideal ≤ 3.3 | ideal |
| 5. | average full collinearity vif (afvif) | 1.621 | accepted ≤ 5, ideal ≤ 3.3 | ideal |
| 6. | tenenhaus gof (gof) | 0.609 | small ≥ 0.1, medium ≥ 0.25, large ≥ 0.36 | large |
| 7. | sympson's paradox ratio (spr) | 1 | accepted ≥ 0.7, ideal = 1 | ideal |
| 8. | r-squared contribution ratio (rscr) | 1 | accepted ≥ 0.9, ideal = 1 | ideal |
| 9. | statistical suppression ratio (ssr) | 1 | accepted ≥ 0.7 | ideal |
| 10. | nonlinear bivariate causality direction ratio (nlbcdr) | 1 | accepted ≥ 0.7 | ideal |

the results of the composite reliability test and cronbach's alpha internal consistency show that all variables have a coefficient value > 0.7. this indicates that all measurement items of the external factor, digital literacy, and actual use of e-learning variables are reliable and consistent in measuring their respective latent variables. the suitability of the model with the theory that underlies the relationship between variables can be seen from the model's accuracy indices, which indicate how accurately the research model confirms the results of previous studies. if the model accuracy indices are met (ideal and significant), the model developed is considered appropriate and has succeeded in confirming the results of previous studies. sem-pls using the warppls 7.0 application presents ten types of model accuracy indices, as seen in table 5. based on the ideal criteria and the accuracy index values obtained, it can be concluded that the model developed in this study is appropriate and does not contradict the results of previous studies. sem-pls is in principle a development of regression analysis, so the indicator blocks must be free of multicollinearity, which is indicated by a variance inflation factor (vif) value < 3.3. the vif values in this paper can be seen in table 6.
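the ave, square root of ave, and composite reliability values reported in tables 3 and 4 can be cross-checked from the standardized loadings of the retained items in table 2. the sketch below assumes the usual formulas for reflective constructs, ave = mean(λ²) and cr = (Σλ)² / ((Σλ)² + Σ(1 − λ²)); the source does not state which formulas warppls applied, and all function and variable names are illustrative.

```python
import math

# Standardized loadings of the retained items, copied from table 2.
LOADINGS = {
    "ef": [0.777, 0.771, 0.762, 0.798, 0.739, 0.751],
    "dl": [0.705, 0.781, 0.781, 0.804, 0.735, 0.830, 0.719, 0.766, 0.754],
    "au": [0.847, 0.859, 0.743, 0.853, 0.805],
}


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings (assumed formula)."""
    return sum(l * l for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """CR: (sum of loadings)^2 over itself plus the summed error variances."""
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1.0 - l * l for l in loadings)
    return squared_sum / (squared_sum + error_variance)


def summarize(loadings_by_construct):
    """Return AVE, sqrt(AVE), and CR for each construct."""
    summary = {}
    for name, loadings in loadings_by_construct.items():
        ave = average_variance_extracted(loadings)
        summary[name] = {
            "ave": ave,
            "sqrt_ave": math.sqrt(ave),
            "cr": composite_reliability(loadings),
        }
    return summary


for name, stats in summarize(LOADINGS).items():
    print(f"{name}: ave={stats['ave']:.3f} "
          f"sqrt_ave={stats['sqrt_ave']:.3f} cr={stats['cr']:.3f}")
```

under these assumptions the results agree with tables 3 and 4 to within rounding; for example, ef gives an ave of about 0.588 and a composite reliability of about 0.895. a full fornell-larcker check would also compare each square root of ave with the correlations between constructs, which are not reported in the source.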
the vif values of the latent variable indicator blocks in table 6 show that all latent variables, including the moderating variable between digital literacy and external factors, have a vif value < 3.3, which means that all indicators are free of multicollinearity. the effect size is a measure of the meaning of the research results at a practical level. the effect size criteria are as follows: an effect size (f) of 0.1 is small, f = 0.25 is medium, and f = 0.4 is large (kock, 2019). the effect size values of this paper can be seen in table 7. the values in table 7 show that the influence of external factors on the actual use of e-learning is moderate, the influence of digital literacy on the actual use of e-learning is moderate, and the influence of the interaction of digital literacy with external factors on the actual use of e-learning is relatively small. the coefficient of determination shows the magnitude of the contribution of the exogenous latent variables to the endogenous latent variable. the analysis gives an r-square value of 0.52, which means that 52.00% of the variation in the actual use of e-learning can be explained by external factors, digital literacy, and the moderation between digital literacy and external factors, while the remaining 48.00% is explained by other variables not included in this study. hypothesis testing using sem-pls can be assessed from the path coefficients, which are obtained by estimating the parameters of the research model. the results of the sem-pls parameter testing using the warppls 7.0 software are presented in table 8.

table 6. multicollinearity assumption checking
| variable | vif |
| ef | 1.757 |
| dl | 1.538 |
| au | 2.059 |
| dl*ef | 1.131 |

table 7. effect size
| variable | f | annotation |
| ef | 0.317 | medium |
| dl | 0.171 | medium |
| dl*ef | 0.032 | small |

table 8.
path parameter coefficient
| relationship between latent variables | path coefficients | p-value | annotation |
| ef --> au | 0.488 | <0.001 | accepted |
| dl --> au | 0.308 | <0.001 | accepted |
| dl*ef --> au | -0.092 | 0.034 | accepted |

the parameter coefficients of the sem-pls paths in table 8 can be explained as follows. the path coefficient for the relationship between external factors and the actual use of e-learning is 0.488 with a p-value < 0.05 (significant at the 5% level), so h01 is rejected. this indicates that the external factor variable has a significant positive effect on the actual use of e-learning at state islamic universities in indonesia during the covid-19 pandemic. the path coefficient for the relationship between digital literacy and the actual use of e-learning is 0.308 with a p-value < 0.05 (significant at the 5% level), so h02a is rejected. this means that the digital literacy variable has a significant positive effect on the actual use of e-learning at state islamic universities in indonesia during the covid-19 pandemic. the path coefficient for the moderation between external factors and digital literacy on the actual use of e-learning is -0.092 with a p-value of 0.034 (significant at the 5% level), so h02b is rejected. this means that the digital literacy variable weakens the influence of external factors on the actual use of e-learning at state islamic universities in indonesia during the covid-19 pandemic. the final model of this research can be seen from the path diagram of the sem-pls analysis output as presented in figure 2.

figure 2.
pls-sem path diagram (final research model)

the path diagram in figure 2 shows the parameter coefficient (β) and p-value (probability value) between the exogenous latent variables and the endogenous latent variable. the path parameter coefficients can be entered in the mathematical model according to the structural model equation in formula (3) as follows.

$$au = 0.49\,ef + 0.31\,dl - 0.09\,(dl \times ef) + \zeta$$

the mathematical equation for the au estimation model above can be explained as follows. the parameter coefficient ($\beta_1$) is 0.49, meaning that every 1-point increase in external factors significantly increases the actual use of campus e-learning by 0.49 points, assuming that the other variables that influence it remain the same. the parameter coefficient ($\beta_2$) is 0.31, meaning that every 1-point increase in digital literacy significantly increases the actual use of e-learning by 0.31 points, assuming that the other variables that influence it remain the same. the parameter coefficient ($\beta_3$) is -0.09, meaning that every 1-point increase in the interaction between external factors and digital literacy significantly reduces the influence of external factors on the actual use of e-learning by 0.09 points.

findings and discussion

the influence of external factors on actual use is indicated by the positive coefficient of the sem-pls structural model parameter, significant at the 5% level, which means that external factors have a significant positive effect on the actual use of e-learning at state islamic universities in indonesia during the covid-19 pandemic. this finding verifies the research findings of noh (2017) and widana (2020) but is not in line with davis et al. (1989), who state that external variables do not directly affect attitudes and behavior in the use of technology.
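to make the moderation result concrete, the sketch below evaluates the estimated structural equation with the path coefficients from table 8 (0.488, 0.308, and -0.092). it assumes standardized scores and a linear interaction term, and it only illustrates how the coefficients combine; it is not the warppls estimation procedure, and the function names are illustrative.

```python
# Path coefficients reported in table 8.
B_EF = 0.488    # external factors -> actual use
B_DL = 0.308    # digital literacy -> actual use
B_INT = -0.092  # (digital literacy x external factors) -> actual use


def predicted_au(ef: float, dl: float) -> float:
    """Predicted actual use for standardized ef and dl scores (error term omitted)."""
    return B_EF * ef + B_DL * dl + B_INT * ef * dl


def ef_slope(dl: float) -> float:
    """Effect of a one-unit change in ef on au at a given level of dl."""
    return B_EF + B_INT * dl


# Conditional effect of external factors at low, average, and high digital literacy.
for dl in (-1.0, 0.0, 1.0):
    print(f"dl = {dl:+.1f}: effect of ef on au = {ef_slope(dl):.3f}")
```

under these assumptions the effect of external factors falls from 0.580 at one standard deviation below the mean of digital literacy to 0.396 at one standard deviation above it. the effect is weakened but remains positive, which matches the negative sign of the interaction coefficient.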
the effect of digital literacy on actual use is indicated by the parameter coefficient in the structural model, which is positive and significant at the 5% level, meaning that digital literacy has a significant positive effect on the actual use of e-learning at state islamic universities in indonesia during the covid-19 pandemic. the findings of this study are in line with the results of jan (2017), who stated that “… digital literacy (dl), tablet and smartphone use, training, previous use of computers and frequency of computer use significantly influence students' attitudes towards ict use”. the results of this study are not in line with the research of jang et al. (2020, p. 1), who found that "digital literacy did not have a direct significant effect on the intention to use learning technology in finland...".

the moderating effect between digital literacy and external factors is indicated by the negative sem-pls structural path coefficient, significant at the 5% level, meaning that the digital literacy variable weakens the influence of external factors on the actual use of e-learning at state islamic universities in indonesia during the covid-19 pandemic. the findings on the moderating relationship of external factors with digital literacy show a unique phenomenon: when students are more skilled in managing information, using digital equipment, and transforming digital data and information, and adequate facilities and infrastructure are also available, the frequency and duration of their e-learning use decrease.
this finding is interesting to observe because it contradicts the general assumption that better understanding and literacy, supported by adequate facilities and infrastructure, will have a positive impact on the frequency and duration of ict use. in fact, the researchers' direct observations while teaching at state islamic universities (as participant-observers) and preliminary interviews with ptkin students and lecturers indicate that students who have stronger ict skills and adequate supporting facilities do not spend long periods in e-learning, because they can still follow the learning process through recordings of material that have been uploaded to e-learning. most of them therefore only record their attendance online and download the material to review offline on a computer or smartphone, which they consider more efficient in time and cost.

the findings of this study are supported by the results of ferri et al. (2020), who revealed several challenges in using online learning media during an emergency (the covid-19 pandemic), including technological, pedagogical, and social challenges. technological challenges relate to an inadequate internet connection and a lack of the necessary electronic devices. pedagogical challenges relate to the lack of digital skills of teachers and students, the lack of structured content compared with the number of online resources, the lack of interaction between students and teachers, and the lack of social presence and teacher cognition. social challenges relate to the lack of interaction between teachers and students.

conclusion

the results of this study lead to the following conclusions. external factor variables have a significant positive effect on the actual use of e-learning. the digital literacy variable has a significant positive effect on the actual use of e-learning.
the digital literacy variable weakens the influence of external factors on the actual use of e-learning on islamic college campuses in indonesia. the researchers realize that this paper has several limitations. first, there is no previous research that places digital literacy as a moderating variable of external factors on the actual use of e-learning. second, the researchers only adopted and adapted the actual use variable from the technology acceptance model developed by davis and added the role of digital literacy as a moderating variable of the influence of external factors on the actual use of e-learning. third, this research was conducted in a short time and without an instrument trial, so many instrument items had to be dropped after field testing. fourth, the population limit is unknown, so the proportion of samples taken is not evenly distributed across all state islamic universities.

acknowledgment

we would like to express our gratitude to the students of state islamic universities throughout indonesia who responded to the research instruments. we also express our appreciation to all lecturers and education staff of the pontianak state islamic institute of religion who helped distribute the research instruments to ptkin students in indonesia. we also thank the lecturers and postgraduate students of yogyakarta state university, batch of 2021, who helped guide the preparation of this paper.

references

azzahra, n. f., & amanta, f. (2021). promoting digital literacy skill for students through improved school curriculum. in policy brief no. 11, 1-13. center for indonesian policy studies (cips). http://hdl.handle.net/10419/249444 creswell, j. w. (2012).
educational research: planning, conducting, and evaluating quantitative and qualitative research (4th ed.). pearson. davis, f. d., bagozzi, r. p., & warshaw, p. r. (1989). user acceptance of computer technology: a comparison of two theoretical models. management science, 35(8), 982–1003. https://doi.org/10.1287/mnsc.35.8.982 ferri, f., grifoni, p., & guzzo, t. (2020). online learning and emergency remote teaching: opportunities and challenges in emergency situations. societies, 10(4), 86. https://doi.org/10.3390/soc10040086 garson, g. d. (2016). partial least squares: regression & structural equation models. statistical publishing associates. hair, j. j. f., hult, g. t. m., ringle, c. m., & sarstedt, m. (2014). a primer on partial least squares structural equation modeling. long range planning, 46(1), 184-185. https://doi.org/10.1016/j.lrp.2013.01.002 hermawan, d. (2021). the rise of e-learning in covid-19 pandemic in private university: challenges and opportunities. ijorer: international journal of recent educational research, 2(1), 86–95. https://doi.org/10.46245/ijorer.v2i1.77 jan, s. (2017). investigating the relationship between students’ digital literacy and their attitude towards using ict. international journal of educational technology, 5(2), 26-34. https://ecommons.aku.edu/pakistan_ied_pdck/304/ jang, m., aavakare, m., kim, s., & nikou, s. (2020). the effects of digital literacy and information literacy on the intention to use digital technologies for learning a comparative study in korea and finland. in its online event, 1-13. international telecommunications society. http://hdl.handle.net/10419/224858 kaeophanuek, s., na-songkhla, j., & nilsook, p. (2019). a learning process model to enhance digital literacy using critical inquiry through digital storytelling (cidst). international journal of emerging technologies in learning (ijet), 14(03), 22–37. https://doi.org/10.3991/ijet.v14i03.8326 kock, n. (2019). warppls user manual: version 6.0. 
scriptwarp systems. https://scriptwarp.com/warppls/ lemeshow, s., hosmer, d. w., jr., klar, j., & lwanga, s. k. (1990). adequacy of sample size in health studies. who. https://doi.org/10.1002/sim.4780091115 liu, z.-j., tretyakova, n., fedorov, v., & kharakhordina, m. (2020). digital literacy and digital didactics as the basis for new learning models development. international journal of emerging technologies in learning (ijet), 15(14), 4–18. https://doi.org/10.3991/ijet.v15i14.14669 lohmöller, j.-b. (1989). latent variable path modeling with partial least squares. physica-verlag hd. https://doi.org/10.1007/978-3-642-52512-4 noh, y. (2017). a study on the effect of digital literacy on information use behavior. journal of librarianship and information science, 49(1), 26–56. https://doi.org/10.1177/0961000615624527 putro, s. t., widyastuti, m., & hastuti, h. (2020). problematika pembelajaran di era pandemi covid-19 studi kasus: indonesia, filipina, nigeria, ethiopia, finlandia, dan jerman. geomedia: majalah ilmiah dan informasi kegeografian, 18(2), 50–64. https://journal.uny.ac.id/index.php/geomedia/article/view/36058 rosenberg, b., & navarro, m. a. (2018). semantic differential scaling. in b. b. frey (ed.), sage encyclopedia of research, measurement, and evaluation. sage publications. https://scholar.dominican.edu/books/144 unesco. (n.d.). education: from disruption to recovery. retrieved august 3, 2021, from https://en.unesco.org/covid19/educationresponse venkatesh, v., & davis, f. d. (2000). a theoretical extension of the technology acceptance model: four longitudinal field studies. management science, 46(2), 186–204. https://doi.org/10.1287/mnsc.46.2.186.11926 vinzi, v. e., chin, w. w., henseler, j., & wang, h. (2010). handbook of partial least squares: concepts, methods and applications.
springer berlin heidelberg. https://doi.org/10.1007/978-3-540-32827-8 widana, i. w. (2020). the effect of digital literacy on the ability of teachers to develop hots-based assessment. journal of physics: conference series, 1503(1), 012045. https://doi.org/10.1088/1742-6596/1503/1/012045
research and evaluation in education journal e-issn: 2460-6995 volume 1, number 1, june 2015 (73-83) available online at: http://journal.uny.ac.id/index.php/reid
an assessment model of historical thinking skills by means of the rasch model
1) ofianto; 2) suhartono 1) padang state university, indonesia; 2) gadjah mada university, indonesia 1) ofianto.anto@yahoo.com; 2) suhartono@ugm.ac.id
abstract
this study was conducted to produce a model and instruments of historical thinking skills in the history subject at the senior high school (shs) level and to identify shs students’ historical thinking skills. the study was conducted in two stages, namely model development and instrument development, together with a small-scale tryout and a large-scale tryout. the tests for the two tryouts consisted of six and five sub-test sets, respectively. each test set contained 20 anchor items. the samples for the two tryouts comprised 1573 and 2613 testees, respectively. the data were analyzed by means of the partial credit model (pcm) using the quest program. the overall tryout results indicate that, based on the criteria for an infit mnsq mean of 0.1 and a standard deviation of 1.0, the tests fit the pcm. the reliability coefficients of the tests for the tryouts are moderately good; the cronbach’s alpha coefficients are, respectively, 0.65 and 0.54. the lowest score of historical thinking skills is -.352 and the highest is +1.21 in an ideal range of -4.0 to +4.0. overall, the testees’ scores are not satisfactory. only 5.89% of the testees are above the expected median.
keywords: instrument development, test, historical thinking skills, polytomous, pcm
introduction
assessment is an important component of education. it is conducted to monitor the development of educational quality from one period to another (alen & yen, 1997, p. 2; griffin & nix, 1991, p. 4). to assess educational quality, teachers may use multiple assessment tools, in the form of tests and non-tests (mardapi, 2008, pp. 2-3). the use of multiple assessment tools is intended to portray learning results comprehensively. the assessment is thereby useful for viewing overall educational quality and provides important information for improving the learning process. an assessment technique in the form of a test is a measurement activity because, through a test, a teacher obtains numerical data for improving the learners’ capability (hargreaves & schmidt, 2002, pp. 69-95). one of the subjects taught from elementary school to senior high school is history. the history subject in schools aims to develop historical thinking skills (fogu, 2009, pp. 103-121), to encourage learners to be critical and analytical (winerburg, 2006, pp. 3-6), and to use knowledge about the past in order to comprehend life in the present and in the future. according to the ministry of national education regulation number 20 year 2007 (depdiknas, 2007, pp. 1-2) regarding the assessment standards for elementary and secondary education, the assessment of history learning results covers three aspects: academic, historical awareness, and nationalism.
in performing the assessment in schools, teachers should pay attention to the compatibility among the standards (the competencies), the contents (the curriculum contents), the assessment, and the learning strategies (ashby & shemit, 2005, pp. 150-163). the analysis of learning results also provides important information for improving the learning process; therefore, psychometric experts have developed analysis models known as test theory (rasch, 1961, pp. 321-334; rasch, 1977, pp. 58-93). the test theory that has been in use the longest is classical test theory (ctt) (van der linden & hambleton, 1997, pp. 4-5; hambleton & swaminathan, 1985, p. 5). ctt estimates contain many errors and provide little information. to overcome this fundamental weakness of ctt, experts developed item response theory (irt) (master, 1999, pp. 98-109). irt provides more information at the cost of more assumptions. irt comprises three models, namely the rasch model or one-parameter logistic model (1-pl), the two-parameter logistic model (2-pl), and the three-parameter logistic model (3-pl). based on a survey conducted by the researchers, the assessment done by history teachers had been an objective one and tended to demand that learners memorize facts. this has been investigated by several researchers such as bain (2005, pp. 179-214), barton & levstik (2003, pp. 358-261) and lee (2005, pp. 31-40). the results of their investigations show that the recent practice of history assessment has lingered on factual memory by means of multiple-choice tests. the researchers also found that the written test, one of the assessment tools used to date to uncover students’ capability or learning results, was constructed unsystematically.
as a result, many tests that teachers provide cannot uncover the learners’ actual capability. a study by mardapi et al. (1999, p. 45) found that many teachers did not pay attention to the test guidelines while writing test items; instead, they tended to use test items from commercially available books. in relation to that matter, teachers should accustom themselves to using other test forms, such as essays, that are more appropriate for the subject characteristics and for the learning objectives that have been formulated. one of the basic competencies (bc) in the content standards of the national curriculum for senior high schools/madrasah aliyah demands that learners be able to develop their ability to understand and implement the basic principles of inquiry, which is the application of historical thinking skills in the history subject. historical thinking skills may be defined as the scientific steps/process of studying history (seixas & peck, 2004, pp. 109-117; seixas, 2013, pp. 10-12). each process of historical thinking skills always involves a thinking process. thereby, historical thinking skills may also encourage the development of learners’ critical and creative thinking capabilities. based on this explanation, in order to measure historical thinking skills, the researchers chose to provide an essay test. therefore, the researchers needed to arrange an instrument of historical thinking skills that consists of a test and an assessment guideline.
as a result, the researchers were encouraged to perform a study on instrument development for measuring learners’ historical thinking skills, with an instrument that consists of a test and an assessment guideline.
method
the study was a developmental one and its aim was to develop a test of senior high school students’ historical thinking skills. the development procedures and phases followed the research and development approach proposed by borg & gall (1989, p. 227), with the stages adapted to the objectives and the importance of the study. the stages were as follows: (1) needs analysis and preliminary investigation; (2) model planning and design; (3) model experiment; (4) evaluation; (5) implementation; and (6) dissemination. the needs or problem analysis and the preliminary investigation were conducted in the form of direct observation/survey and a literature study. the results of these activities served as the basis for designing the initial draft of the test/assessment model. in the model design, the researchers developed a test of senior high school students’ historical thinking skills. according to oriondo & dallo-antonio (1984, p. 34), the stages of test development include: (1) test design and (2) test experiment. test design continued until the test draft was ready for the experiment. the activities of instrument/test design included: (a) arranging the learning continuum (lc); (b) preparing the guidelines of the historical thinking skills test/instrument; (c) writing the test items; and (d) improving/revising the test items and drafting the test/instrument. the scale used was polytomous, matching the essay form of the test. for the polytomous scaling, the researchers used a scale from 0 to 2 for three categories.
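the three-category scoring described above can be illustrated with a minimal python sketch of the partial credit model. the study itself used the quest program; the function names below are hypothetical, and the sketch assumes the usual pcm parameterization in which theta is the testee's ability and each item has two step difficulties (called delta-1 and delta-2 in the tables of this paper).

```python
import math


def pcm_category_probabilities(theta, deltas):
    """Return P(X = k | theta) for k = 0..len(deltas) under the partial credit model.

    theta  : testee ability in logits
    deltas : step difficulties of one item, e.g. [delta_1, delta_2] for scores 0-2
    (Hypothetical helper for illustration; not part of the Quest program.)
    """
    # Cumulative sums of (theta - delta_j); the empty sum for category 0 is 0.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(theta, deltas):
    """Expected item score for a testee of ability theta."""
    probabilities = pcm_category_probabilities(theta, deltas)
    return sum(score * p for score, p in enumerate(probabilities))
```

with this parameterization, a testee whose ability equals delta-1 is equally likely to score 0 or 1 on the item, and the expected score rises as theta increases.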
the item revision was conducted after the researchers performed a qualitative analysis of the drafted items. the qualitative analysis of the items was tied to the lc and the guideline. therefore, the researchers first reviewed the lc, the indicators, the guideline, and the items by means of focus group discussions (fgd). next, the researchers performed a limited experiment with the drafted instrument of historical thinking skills in order to obtain empirical data. the results of the experiment were analyzed using both the classical approach and item response theory (irt). the analysis was performed in order to view the quality of the test items before the instrument was rearranged for the expanded experiment or the implementation. furthermore, the researchers performed the activities of the test, evaluation, and revision stage. in this stage, the researchers experimented with the model that had been developed through the limited experiment. the data obtained from the experiment were analyzed to decide whether the model fit or not. the expanded experiment was performed after the limited experiment, that is, after the instrument had been revised. the results of the expanded experiment were analyzed to find how far the students had mastered the historical thinking skills. the final product of the model would be disseminated to the users and the policy makers in the schools, namely the teachers, the principals, and the heads of the education offices in the city/district and the province. the dissemination would be conducted in the form of distributing the research results to the sample schools. the product experiment was performed twice, namely: (1) in the form of a limited experiment; and (2) in the form of an expanded experiment.
the activities in the limited experiment were test implementation and results analysis. the activities in the expanded experiment were test implementation, results analysis, and results interpretation. the study was conducted in west sumatra province. the subjects were senior high school students. the senior high schools involved in the study ranged from popular ones located in the provincial capital to less popular ones located in sub-district capitals. the reason was that the researchers wanted to obtain maximum variability in the measurement results. the data gathered in the study were quantitative, in the form of test results, and consisted of data from the limited experiment and data from the expanded experiment. the data were gathered by administering a set of tests. to measure the quality of the test instrument, the researchers performed both a qualitative analysis by expert judgement of the aspects of content (materials), construction, and language, and a quantitative analysis by means of an experimental (empirical) process. the data resulting from the experiment were analyzed with the quest program. the objective of the analysis was to determine the quality of the test item parameters and the level of test reliability. the quality of the test item parameters was shown only by the item difficulty level because the analysis used the 1-pl model/the rasch model. on the other hand, the level of test reliability was indicated by the alpha coefficient. the data resulting from the expanded experiment were also analyzed with the quest program. the analysis was conducted to obtain information regarding the characteristics of the item parameters, the participants’ capability parameter, and the students’ mastery of historical thinking skills in the schools.
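the alpha coefficient used above as the reliability indicator can be computed from a testee-by-item score matrix. the following is a minimal python sketch of the standard cronbach's alpha formula, not the quest program's own routine; the function name and the toy score matrix in the usage note are assumptions made for illustration only.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores : list of rows, one row per testee, each row holding that
             testee's item scores (e.g. 0, 1 or 2 per essay item).
    (Hypothetical helper for illustration; the paper used Quest.)
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across testees.
    item_variances = [
        pvariance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Variance of the total scores across testees.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)
```

for example, `cronbach_alpha([[0, 0, 0], [1, 1, 1], [2, 2, 2]])` gives 1.0 (up to floating-point rounding), because the three made-up items rank the testees identically; items that disagree with one another pull the coefficient down.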
findings and discussions
findings
the finding of this study is an assessment model of historical thinking skills which belongs to the procedural model, namely a model whose procedures should be performed sequentially. the phases included the test preparation, the limited experiment, and the expanded experiment.
test preparation
the activities of test preparation began with the formulation of the learning continuum, the drafting of the test guideline, and the composition of test items for the historical thinking skills. then, the researchers reviewed the instrument with several experts. six test instruments were made in total. the six units shared 10 items as anchor or common items. the limited experiment was conducted in selected senior high schools and involved 1,572 learners from grade x and grade xi.
table 1. the characteristics of the senior high schools for the limited experiment of historical thinking skills test
| no. | name of senior high school | location | popularity based on the graduates accepted in the state university |
|---|---|---|---|
| 1 | 1 solok senior high school | solok city | popular in solok city |
| 2 | 1 payakumbuh senior high school | payakumbuh city | popular in payakumbuh city |
| 3 | 1 gunung talang senior high school | solok county | popular in solok county |
| 4 | 1 batu sangkar senior high school | tanah datar county | popular in tanah datar county |
| 5 | 2 solok senior high school | solok city | unpopular in solok city |
results of limited experiment
the scoring used three categories on the 0-2 polytomous scale. the data were analyzed with the quest program. the result was that two test items did not fit the model, namely test item number 23 and test item number 24.
in both items, not all testees were able to attain category-2, and very few attained category-3. according to ctt, the reliability in the form of cronbach alpha, namely 0.65, remained the same after both items were eliminated from the analysis. meanwhile, according to irt, the estimated reliability based on the testees’ (case/person) analysis in the form of the person separation index is 0.82. table 3 shows the average score for the increasing item difficulty level, starting from the easiest to the hardest one. the gradation for the aspect of basic skills is chronological thinking skills, continuity and change identifying skills, and causal analyzing skills.
table 2. results of item estimation (i) and testee estimation (n) from the limited experiment
| no. | explanations | before the two items were eliminated (i=111): estimation for item | before (i=111): estimation for testees | after the two items were eliminated (i=109): estimation for item | after (i=109): estimation for testees |
|---|---|---|---|---|---|
| 1 | average and standard deviation scores | 0.00 ± 1.08 | -0.61 ± 0.86 | 0.00 ± 1.06 | -0.58 ± 0.85 |
| 2 | average and standard deviation scores that had been made appropriate | 0.00 ± 1.02 | -0.61 ± 0.78 | 0.00 ± 1.00 | -0.58 ± 0.77 |
| 3 | separation index | 0.89 | 0.82 | 0.89 | 0.82 |
| 4 | cronbach alpha scores | | 0.54 | | 0.54 |
| 5 | average and standard deviation scores of infit mnsq | 0.98 ± 0.10 | 0.99 ± 0.47 | 0.98 ± 0.10 | 0.99 ± 0.48 |
| 6 | average and standard deviation scores of outfit mnsq | 0.99 ± 0.15 | 1.00 ± 0.51 | 0.98 ± 0.13 | 1.00 ± 0.51 |
| 7 | average and standard deviation scores of infit t | -0.22 ± 1.06 | -0.24 ± 1.09 | -0.19 ± 1.06 | -0.24 ± 1.09 |
| 8 | average and standard deviation scores of outfit t | -0.17 ± 1.07 | -0.15 ± 1.05 | -0.14 ± 1.06 | -0.14 ± 1.05 |
| 9 | item or testees of 0 score | 0 | 0 | 0 | 0 |
| 10 | item or testees of perfect score | 0 | 0 | 0 | 0 |
the sub-aspects of historical research capabilities are, in order of increasing difficulty, historical significant meaning establishing skills, historical source/information and data recording skills, historical research planning skills, historical research results reporting skills, and historical sources analyzing and benefitting skills. the average item difficulty within the sub-aspect of historical sources analyzing and benefitting skills is the highest within this aspect; meanwhile, the average item difficulty within the sub-aspect of historical significant meaning establishing skills is the lowest. the item distribution, based on the difficulty values from the analysis with the quest program, shows that 5.40% of the items of the basic skills are quite difficult (from 1.0 to <1.5) and that no item of the basic skills is difficult (from 1.5 to 2.0). among the items of historical research planning skills, the researchers found that 5.40% of the items were quite difficult (from 1.0 to <1.5) or difficult (from 1.5 to 2.0). the researchers also found that 1.35% of the items were very difficult (≥2.0).
table 3. the scores of difficulty level in the aspects and the sub-aspects of historical thinking skills according to pcm based on the results of limited experiment
| no. | aspects and sub-aspects of historical thinking skills | difficulty | delta 1 | delta 2 |
|---|---|---|---|---|
| 1. | basic skills | -0.989 | -2.677 | 0.697 |
| a. | chronological thinking skills | -1.776 | -3.336 | -0.221 |
| b. | continuity and change identifying skills | -1.027 | -2.673 | 0.618 |
| c. | causal relationship analyzing skills | -0.348 | -2.190 | 1.492 |
| 2. | historical research capabilities | 0.508 | -0.685 | 1.703 |
| a. | significant meaning establishing skills | -0.450 | -1.993 | 1.093 |
| b. | historical data/information/source recording skills | 0.462 | -0.862 | 1.788 |
| c. | historical sources benefitting and analyzing skills | 0.917 | -0.405 | 2.238 |
| d. | historical research planning skills | 0.689 | -0.305 | 1.690 |
| e. | historical research results reporting skills | 0.726 | 0.112 | 1.340 |
table 4. the item distribution in the aspects of historical thinking skills based on the scores of difficulty level in the limited experiment
| range on the level of difficulty | basic skills: absolute frequency | basic skills: relative frequency | historical research capabilities: absolute frequency | historical research capabilities: relative frequency |
|---|---|---|---|---|
| < -2.0 | 4 | 10.81% | 0 | 0.00% |
| -2.0 to <-1.5 | 5 | 13.51% | 0 | 0.00% |
| -1.5 to <-1.0 | 6 | 16.21% | 4 | 5.40% |
| -1.0 to <-0.5 | 11 | 29.72% | 3 | 4.05% |
| -0.5 to <0.0 | 5 | 13.51% | 9 | 12.16% |
| 0.0 to <0.5 | 3 | 8.10% | 16 | 21.62% |
| 0.5 to <1.0 | 2 | 5.40% | 23 | 31.08% |
| 1.0 to <1.5 | 1 | 2.70% | 16 | 21.62% |
| 1.5 to <2.0 | 0 | 0.00% | 2 | 2.70% |
| ≥ 2.0 | 0 | 0.00% | 1 | 1.35% |
| total | 37 | 100% | 74 | 100% |
results of expanded experiment
the summary was compiled by using the quest program and the results of the summary are presented in table 5. table 5 shows that, overall, the test items fit the model, which is a prerequisite for analysis with the quest program.
table 5. results of item estimation (i) for the historical thinking skills and of testee estimation (n) according to the partial credit model (pcm) in the expanded experiment
| no. | explanations | estimation for item | estimation for testees (case/person) |
|---|---|---|---|
| 1 | average and standard deviation scores | 0.00 ± 0.96 | -0.58 ± 0.71 |
| 2 | average and standard deviation scores that had been made appropriate | 0.00 ± 0.93 | -0.58 ± 0.60 |
| 3 | separation index | 0.93 | 0.72 |
| 4 | cronbach alpha scores | | 0.41 |
| 5 | average and standard deviation scores of infit mnsq | 0.99 ± 0.05 | 0.99 ± 0.51 |
| 6 | average and standard deviation scores of outfit mnsq | 0.99 ± 0.10 | 0.99 ± 0.56 |
| 7 | average and standard deviation scores of infit t | -0.16 ± 1.05 | -0.25 ± 1.08 |
| 8 | average and standard deviation scores of outfit t | -0.14 ± 1.04 | -0.16 ± 1.05 |
| 9 | item or testees of 0 score | 0 | 0 |
| 10 | item or testees of perfect score | 0 | 0 |
according to ctt, the cronbach alpha index is 0.54.
on the other hand, according to irt (wright & masters, 1982, p. 106; keeves & masters, 1999, p. 276), the reliability estimated from the testee (case/person) analysis in the form of the person separation index is 0.72.
table 6. the scores on the level of item difficulty in the aspects and the sub-aspects of historical thinking skills within the expanded experiment
| no. | aspects and sub-aspects of historical thinking skills | difficulty | delta 1 | delta 2 |
|---|---|---|---|---|
| 1. | basic skills | -0.705 | -2.307 | 0.897 |
| a. | chronological thinking skills | -1.072 | -2.641 | 0.488 |
| b. | continuity and change identifying skills | -0.698 | -2.150 | 0.758 |
| c. | causal relationship analyzing skills | -0.420 | -2.261 | 1.419 |
| 2. | historical research capabilities | 0.369 | -0.650 | 1.390 |
| a. | significant meaning establishing skills | -0.13 | -1.363 | 1.102 |
| b. | historical data/information/source recording skills | 0.24 | -1.00 | 1.481 |
| c. | historical sources benefitting and analyzing skills | 0.461 | -0.552 | 1.475 |
| d. | historical research planning skills | 0.643 | 0.135 | 1.153 |
| e. | historical research results reporting skills | 0.933 | 0.178 | 1.691 |
based on table 6, most item analysis results in the expanded experiment are similar to those of the limited experiment. the average scores for the level of item difficulty from the basic skills to the historical research planning skills show an increasing gradation from the easiest ones to the hardest ones. the finding is similar to that of the limited experiment.
results of measurement for the expanded experiment
the results of measurement show that the raw scores range from 2 (the lowest) to 39 (the highest), with a maximum possible score of 50 (category-1 = 0, category-2 = 1 and category-3 = 2).
table 7. the absolute frequency and the converted relative scores of the historical thinking skills in the range between -2.00 and 2.00 with the class interval 0.5
| no. | class of interval for converted scores | absolute frequency | relative frequency | cumulative frequency |
|---|---|---|---|---|
| 1 | score 0 (uncalibrated) | 0 | 0.00 | 0.00 |
| 2 | <-2.00 | 46 | 1.72 | 1.72 |
| 3 | -2.00 to -1.50 | 244 | 9.12 | 10.84 |
| 4 | >-1.50 to -1.00 | 321 | 12.00 | 22.84 |
| 5 | >-1.00 to -0.50 | 738 | 27.60 | 50.44 |
| 6 | >-0.50 to 0.00 | 1166 | 43.62 | 94.06 |
| 7 | >0.00 to 0.50 | 122 | 4.56 | 98.62 |
| 8 | >0.50 to 1.00 | 28 | 1.04 | 99.66 |
| 9 | >1.00 | 8 | 0.29 | 100.00 |
| | total | 2673 | 100.00 | |
after calibration, the lowest converted score was -3.52 and the highest converted score was 0.09 within the range between -4.00 and +4.00. the calibrated scores were then grouped with the class interval 0.5. the results of the calibration show that 5.89% of the testees earned converted scores greater than 0.00. thereby, if the limit 0.00 is taken as the mid-score, then 94.11% of the testees are under the mid-score. as a result, most of the testees did not manage to earn 50% of the correct answers.
discussions
item characteristics in the activities of limited experiment
the results of the analysis of the limited experiment data, based on the partial credit model, show that there are items with delta-1 scores bigger than their delta-2 scores; overall, however, the items fit the model. the finding does not contradict the supporting theory: as proposed by wright & masters (1982, pp. 44-45), the pcm allows items whose delta-1 scores are bigger than their delta-2 scores. this implies that the ability needed to improve from category-2 to category-3 may be lower than that needed to improve from category-1 to category-2. the results of the analysis also showed that among the 111 items tested, 2 items did not fit the partial credit model (pcm), namely item number 23 and item number 24.
level of test item difficulty
among the sub-aspects of basic skills, the questions on causal analyzing skills were the most difficult.
these were followed, in order of decreasing difficulty, by: (a) change and continuity identifying skills; and (b) chronological thinking skills. causal analyzing skills demand in-depth comprehension from learners. analyzing a causal relationship requires learners to present historical data systematically so that the causal relationship in certain historical events can be easily comprehended. in this case, students should not only memorize the facts presented in textbooks and teachers’ lectures but should also present the causal relationship in certain historical events from many sources. at the same time, students are also required to classify the historical presentations from the historical sources they have. in other words, students do not only summarize the results of their observation but also present them in multiple forms of data presentation such as tables, flowcharts, historical maps and the like. in terms of historical research capabilities, the sub-aspect with the highest level of difficulty was historical research results reporting skills. the finding was to be expected given the lack of research reporting practice in the schools. this sub-aspect was followed, in order of decreasing difficulty, by historical research planning skills, historical sources benefitting/analyzing skills, historical sources/information/data recording skills, and historical significant meaning establishing skills. the learners had difficulties when they had to think about alternative actions for activities that were rarely conducted.
the characteristics of test items in the expanded experiment
all of the items used in the expanded experiment fit the model. the average scores for the level of item difficulty in the limited experiment, for the aspects of basic skills and of advanced skills (historical research capabilities), respectively, were -0.989 and 0.508. in the expanded experiment, the corresponding average scores were -0.705 and 0.369. the data thus show a similar pattern of responses, in terms of the level of difficulty, between the results of the limited experiment and those of the expanded experiment. the average scores for the sub-aspect difficulty level within the basic skills aspect in the limited experiment, starting from the most difficult, were -0.348 (sub-aspect c: analyzing causal relationship), -1.027 (sub-aspect b: identifying the continuity and change) and -1.776 (sub-aspect a: thinking chronologically). the average scores for the sub-aspect difficulty level within the basic skills aspect in the measurement stage, starting from the most difficult one, were -0.420 (sub-aspect c: analyzing the causal relationship), -0.698 (sub-aspect b: identifying the change and the continuity) and -1.072 (sub-aspect a: thinking chronologically). thereby, there was no difference in the pattern of the testees’ responses. similarly, the easiest sub-aspect was still the same, that is, thinking chronologically.
results of test in the expanded experiment
the results of the test in the expanded experiment show that the scores of historical thinking skills attained by the 2,673 testees were unsatisfactory; only 5.89% of the testees earned scores above the mid-point. three factors might have caused the finding. the first factor is that the historical thinking skills were not taught completely and in an integrated manner in each subject topic.
as a result, the opportunities for exercising the historical thinking skills became very small. the second factor is that, in the subject topics of historical learning, the historical thinking skills were not applied through strategies for finding concepts; instead, they were applied only to clarify facts obtained through memorization. the historical learning that relied on the memorization of facts and concepts made the students unable to perform historical thinking appropriately. the third factor is that the historical thinking skills might have been taught in accordance with the demand of internal competence and standard competence as formulated in curriculum 2013; however, the learning participants had not been accustomed to working on non-objective tests that enabled them to provide as many correct answers as possible.

conclusions and suggestions

conclusions

based on the results of the study and the discussions, the researchers draw several conclusions as follows. first, the assessment model that has been developed is a procedural model. second, the information attained from the assessment model of historical thinking skills consists of the formulation of a learning continuum for the historical thinking skills, the item characteristics in the form of item difficulty, the testees’ capability (theta, θ), and test items with empirical evidence of fit to the partial credit model (pcm) based on three-category polytomous data. third, the validity of the test instrument designed for the historical thinking skills was established through expert judgement, and the instrument was proven empirically to fit the partial credit model (pcm) based on three-category polytomous data.
fourth, the reliability of the test instrument for the historical thinking skills, in the form of the cronbach alpha index, was quite good, namely 0.64. fifth, the overall results of the assessment showed that the testees had not mastered the historical thinking skills that were tested. this was apparent from the fact that only 5.89% of the testees scored above the expected mid-point based on the three-category polytomous data according to the partial credit model (pcm). the reason was that the learning participants lacked practice in using the historical thinking skills to find concepts and in working on non-objective tests.

suggestions

based on the conclusions, the researchers formulate several suggestions as follows. first, the study only involved state senior high schools as the samples; therefore, the researchers suggest that future studies involve a larger sample so that the mastery of historical thinking skills at the related educational level can be described more widely. future studies might also be conducted in elementary schools or madrasah ibtidaiyah, junior high schools or madrasah tsanawiyah and even in universities. second, there should be further studies to find out the mastery of historical thinking skills through inter-site or inter-year comparisons with representative sample sizes. further studies might also be conducted in order to find out the relationship between the historical thinking skills and the teaching strategy in the historical learning process. third, the researchers suggest that teachers train their students through an appropriate learning process to develop their historical thinking skills. fourth, teachers are advised to train historical thinking skills in an integrated manner in every learning activity, in accordance with the characteristics of the subject topics.
thus, learning participants would become accustomed to finding facts, concepts and theories by utilizing historical thinking skills, as historians specifically and social science experts in general do. fifth, historical thinking skills in senior high schools should be measured periodically in order to find out the students’ mastery level of historical thinking skills in the related year. sixth, the teachers should utilize the mechanisms of assessment for learning, using the results of measuring the historical thinking skills in the related senior high schools, so that the results might be used for improving the quality of lesson plan design and even for providing remedial tests for the learning participants. seventh, the related parties should provide appreciation and a conducive atmosphere in order to encourage the teachers to administer open essay tests that stimulate the development of the learning participants’ historical thinking skills. eighth, teachers should make the learning participants aware of the importance of recognizing multiple test forms so that they would have wider insights and comprehend the problems posed by different kinds of test items.
research and evaluation in education journal
e-issn: 2460-6995
volume 1, number 1, june 2015 (1-12)
available online at: http://journal.uny.ac.id/index.php/reid

factors discouraging students from schooling: a case study at junior secondary school in laos

1) chanthaboun keoviphone; 2) udik budi wibowo
1) vientiane provincial education and sport department, laos; 2) yogyakarta state university, indonesia
1) ckeoviphone@gmail.com; 2) yube2u@yahoo.com

abstract

this study is aimed at exploring and describing the factors discouraging lao students from schooling at a secondary school in various contexts, namely the classroom and school, individual student, family, and community contexts. a descriptive qualitative approach was used, and the framework of the study was formulated around the aspects of school and classroom situation, principal’s management process, teachers’ teaching organization and performance, parents’ involvement and perception, and community’s involvement and perception. the data were collected through observation, document analysis, and interviews with 10 students, six teachers, one principal, and one vice principal at phonsyneua junior secondary school, and seven parents whose children were studying at this school. the finding shows that, among the factors involved in students’ schooling at secondary school, several factors discourage them. the teachers’ performance was not yet optimal and some students’ competency was inadequate. the students’ parents were not highly committed to and involved in their children’s schooling. the community had little trust in schooling since they perceived that schooling costs a lot of money. to address these discouraging factors, several actions should be taken into consideration.
the principal should ask all the teachers to communicate the vision and mission to the students, or the vision and mission should be published and socialized in the school. the observations on teachers’ instruction should be done by both the principal and senior teachers.

keywords: classroom and school context, individual student, family, community

introduction

laos, or the lao people’s democratic republic (lao p.d.r), is a small landlocked country, but it has now become a land-linked country for the transport of goods among asean and neighboring countries such as thailand, myanmar, vietnam, cambodia and china. the eradication of mass poverty is the current major goal of the lao government. basically, there are four fundamental sectors on which the national development goals concentrate: agriculture and forestry, education, health and road infrastructure. in addition, educational development is very important for lao, focusing on educational problems such as the general conditions of inequity in access to schooling, inefficiencies in students’ progress through school, and the low quality of instruction. moreover, lao students are still considered to lack creativity and original thought, so that the outcome of the teaching and learning process in lao primary and secondary schools tends to be negative. this might be due to the teachers’ behavior and teaching style. furthermore, the small size of lao schools remains a concern in small municipalities with limited resources, where it makes comparable support functions difficult to offer. this might be one reason for the poor outcome of the teaching and learning process. currently, the teaching and learning method still depends on the situation or setting of the school location, especially in the schools located in remote areas.
the teachers do not have a chance to receive training and gain new ideas, techniques and curricula; therefore, the teaching quality is poor. this also affects lao students’ quality and learning outcomes. the seventh lao people’s revolutionary party congress in 2001 and the lao people’s revolutionary party congress in 2006 also emphasized that human resource development depends on education reform leading to a better quality basic education, equivalent to that of other countries. the government has given priority to education, considering that it is a key element of human resource development policy. education is one of the top priority sectors contributing to poverty eradication. the quality of lao basic education is very low compared to that of the neighboring countries. many factors affect the quality of basic education, and one of the most important is teachers’ quality. because there are many unqualified teachers, lao education finds it difficult to reach the standard of other countries and to achieve the millennium development goals (mdg). the development of the curriculum and of teachers’ qualifications is the main means to advance educational development and also to reduce the dropout and repetition rates. the lao educational system goes through the following levels: kindergarten and pre-school (3 years); primary education (5 years); lower secondary education (4 years); upper secondary school (3 years); vocational and higher education (2-4 years); undergraduate level (4-6 years); master degree (2-5 years); and research and doctor of philosophy (phd) (up to 6 years), depending on the field of study. there are still some remaining issues in lao education; for example, about 11% of school-age children leave school due to conditions such as living in a small or isolated village, which makes it difficult for them to reach the school in town.
some of the students also have difficulties because of the distance from their homes to school. in addition, poverty keeps lao people away from education. this condition is worsened by local cultural traditions, such as the view that girls should be kept at home and get married at an early age. this view leads to low awareness of the importance of education in lao. other educational problems which emerge in lao are related to the high dropout rate in grades 1 and 2, the absence of pre-primary schools in the rural areas, indigenous children’s reluctance to speak the national language, the shortage of local teachers, qualified teachers, and male teachers, and also the shortage of teaching and learning materials. according to the current phonsyneua junior secondary school principal, mr. sithon, in 2009-2010 there were five students who were dropped out and ten students who left the school before completing lower secondary school. in 2010-2011, there were three students who were dropped out and six students who left the school before completing lower secondary school. in 2011-2012, there were two students who were dropped out and five students who left the school before completing lower secondary school. in 2012-2013, there were no students who either left school or were dropped out, but four male students could not pass the examination. they were one student each in grade 6 and grade 7, and two students in grade 8, so they had to repeat the same grade.

method

type of the research

a qualitative case study was used in this study since the research attempted to answer the questions of ‘why’ and ‘how’. the ‘why’ questions concern why the students drop out from school and why the students are not highly motivated to finish their secondary schooling after leaving primary school.
the ‘how’ questions concern how the teachers organize and perform the teaching and learning processes in the classrooms, and how the students’ parents and community get involved in children’s education at secondary school. in this qualitative research, the researchers drew on their own judgment to reach conclusions about the particular cases which they had observed. they also analyzed and described the documents collected by using their own judgment. hence, they typically observed, interviewed and analyzed the data obtained from the research field. using the case study for research purposes is one of the most challenging of all social science endeavors. the purpose of the case study method is to help researchers, whether experienced or budding social scientists, to deal with this challenge. the researchers’ goal was to design a good case study and to collect, present, and analyze data fairly. the case study as a research method (yin, 2009, pp. 2-164) is explained as follows:

plan
in planning the research, the researchers took the following steps: identifying research questions or other rationale for doing a case study; deciding to use the case study method; comparing it to other methods; and understanding its strengths and limitations.

design
in designing the research, the following stages were passed: defining the unit of analysis and the likely case(s) to be studied; developing theory, propositions, and issues underlying the anticipated study; identifying the case study design (single, multiple, holistic, embedded); and defining the procedures to maintain case study quality.

prepare
some important steps were taken in the preparation stage: honing skills as a case study investigator; practicing/training for the specific case study; developing the case study protocol; conducting a pilot case; and gaining approval for human subjects’ protection.
collect
in the collecting stage, the researchers took the following steps: following the case study protocol; using multiple sources of evidence; creating a case study database; and maintaining the chain of evidence.

analyze
in the analyzing stage, the following steps were conducted: relying on theoretical propositions and other strategies; considering any of five analytic techniques, using quantitative or qualitative data or both; exploring rival explanations; and displaying data apart from interpretations.

share
the following steps were conducted in the sharing stage: defining the audience; composing textual and visual materials; displaying enough evidence for the reader to reach his or her own conclusions; and reviewing and re-writing until it is done well.

place and time of the research

the research was conducted at phonsyneua junior secondary school in phonhong district, vientiane province, laos. the research was held from january to march 2014 to gather and analyze the data on the factors discouraging students from schooling, which made them drop out without finishing their secondary education.

subjects of the research

the subjects of this study were the factors discouraging students from schooling. data and information were provided by the school principal, vice principals, teachers, and students in a typical junior secondary school: phonsyneua junior secondary school in phonhong district, vientiane province, laos. the students were selected to be asked about their schooling plans, their teachers’ performance, and their reasons for dropping out from school, whereas the teachers were selected to be interviewed about their teaching performance, the students’ performance and competency, and the reasons for students’ dropout from school. furthermore, the principal and vice principals were asked to give information about what they had done in their management process to help students reach high achievement.
the students’ parents living in the villages surrounding the school were asked to give information relevant to their involvement in their children’s schooling and their perceptions about schooling. those who had a fairly high standard of living acted as the community.

data collection techniques

there were three methods used in this study: observation, interview, and document analysis. these three methods were formulated around the framework of school situation, students’ dropout rate, teachers’ instructions, students’ motivation, and parents’ and community’s involvement in the students’ schooling at secondary school. in the observation, the researchers observed how the school principal, vice principals, teachers, and students acted and how things at school looked, and they recorded the events. the interview was used to check the accuracy of the impressions the researchers had gained through the observation, and to find out what was on the key informants’ and sub-informants’ minds or how they felt about something. document analysis was the analysis of the written or visual contents of documents, such as newspapers, journals, reports, and letters, which were greatly valuable to the research. furthermore, information that might be impossible to obtain through direct observation can be gained through the analysis of available communication materials.

research finding criteria

the finding of this study was planned to be segmented into four categories, namely: classroom and school context, individual student context, family context, and community context. these contexts either affected or were affected by one another.

data credibility

to ensure the data credibility, several strategies were employed. the strategies are listed below:

triangulation
triangulation was achieved by employing various instruments: observation, interview, and document analysis.
informant checking
informant checking involved comparing an informant’s descriptions of something with another informant’s descriptions of the same thing.

recording
recording dealt with recording the researchers’ own thoughts as they went through their observations and interviews. responses that seemed unusual or incorrect could be noted and checked later against other remarks or observations.

mechanically recording the data
recording the data mechanically was achieved by using a tape recorder, photographs, and videotapes.

low-inference descriptors
low-inference descriptors involved recording precise, almost literal, and detailed descriptions of people and situations.

participatory modes of research
the informants would be involved in most phases of this study, from the beginning of the data collection until the checking of interpretations and conclusions.

data analysis techniques

figure 1. process of data analysis (adapted from mcmillan & schumacher, 2001, p. 463)

analyzing data in a qualitative study essentially involves organizing the information obtained by the researchers from various methods such as observation, interview, and document analysis into a coherent description of what the researchers had collected or discovered. in this regard, the technique of collecting was used during the analysis. after the researchers obtained the data, they identified the data segments and named a topic for each. then, the researchers grouped those topics into larger clusters to form the categories. as the researchers developed categories, they looked for the patterns or relationships among the categories to see how they either affected or were affected by other categories. in searching for patterns, the researchers tried to understand the complex links between the various aspects of informants’ situations, beliefs, and actions.
consequently, analysis is an ongoing process which may occur throughout a research study.

findings and discussions

phonsyneua junior secondary school is one of the state schools in laos which has been providing free secondary education tuition in every academic year for lao students. the prior name of this school was sisattana junior secondary school. it was located in ban saensa-arth, phonhor campus. this school was founded on 1 september 1975 and was built in an area surrounded by 6 villages and 899 houses, including 14 houses of lao-kang (middle land). there were 1027 families: 5121 people with 2382 women, and 14 families of lao-kang (middle land): 127 people with 62 women. the principal in that year was mr. visay keomahavong. there were 8 teachers who taught grade 6-8 students. in 1976-1977, when mr. bounyang chanthavong was the principal of the school, the school was damaged by a storm, so the students were moved to study at phonkeo primary school. in 1977-1978, the principal was mr. sibay visatheb, and after this year he went to vientiane capital city to continue his study there. from 1978 until now, the principal has been mr. sithon chanthabouly. in 1975-1982, there were students of phoupha campus who studied in silattana junior secondary school. in 1982-1983, this school was divided into two: the first school was still in ban phonkeo with the same name, and the second school was in hinherb district and was named konkeo junior secondary school, but it was run under silattana junior secondary school. thus, the education materials for both schools were not sufficient since they still used education materials from silattana junior secondary school. in 1985-1986, khonkeo junior secondary school became an independent school, and silattana junior secondary school was renamed phonsyneua junior secondary school, the name it has kept until now. on 22 july
1997, this school received a new building built by the inhabitants of the area with a budget of more than nine million dollars. there are two buildings in this school: the new two-floor building and the old one-floor building. phonsyneua junior secondary school is located next to the 13th northern road, and the school buildings are located about 50 meters from this road. its area is 910 square meters, and the school number is 005 (assigned in 2008). in 2009-2010, there were grades 6-9; grade 6 had 74 students with 35 women, grade 7 had 66 students with 36 women, grade 8 had 62 students with 29 women, and grade 9 had 59 students with 31 women, so the total was 261 students with 131 women. in 2010-2011, there were 275 students with 135 women; grade 6 had 89 students with 31 women, grade 7 had 69 students with 34 women, grade 8 had 60 students with 33 women, and grade 9 had 57 students with 27 women. in 2011-2012, there were 305 students with 140 women; grade 6 had 81 students with 31 women, grade 7 had 90 students with 41 women, grade 8 had 72 students with 33 women, and grade 9 had 62 students with 35 women. based on the observation in phonsyneua junior secondary school, there are some points to be noticed. the first point is about the school’s physical building and facilities. the school is located near the 13th northern road, about fifty meters from it. it borders hinherp district in the north, phonhor campus in the south, longleek campus of keo-oudom district in the east, and nasam campus of hinherp district in the west. the school consists of two buildings with fifteen rooms.
there are eleven rooms used for the teaching and learning process, a room that functions as a store as well as a library, and a room for the school principal and teachers, which is separated into two parts. there is a library with 10 computers donated by a korean charity. in each class, there are wooden desks for students with the following details: 14 desks for grade 6; 15 desks for grade 7; 12 desks for grade 8; and 11 desks for grade 9. in addition, there are a desk and a chair for the teacher, a blackboard on the wall in front of the classroom, and a piece of paper stuck near the blackboard listing the classroom sweeping groups. the second point is that when the observation was conducted, there was a notice board in the office informing that there would be a meeting on friday at three o’clock. there was another notice board in front of the office displaying some news and scholarship information. further, the third point is that there are no photos of students’ yearly academic achievements stuck on the notice boards. finally, the fourth point is about the teaching and learning process in the school. most teachers have lesson plans for their teaching, so that the teaching and learning process in the classrooms is conducted as follows: (1) the teachers explain the purpose of the lesson; (2) ask the students to note or write the important things from the lesson; (3) ask the students to make groups; (4) give them some questions; (5) let them answer the questions or do the task; (6) let them copy the answer if it is already correct; (7) deliver teaching and learning materials; (8) evaluate the students.
meanwhile, some teachers do not have lesson plans for their teaching, so that the teaching and learning process in the classrooms for some subjects is conducted as follows: (1) checking the students’ presence; (2) asking several questions about the lesson taught in the previous session; (3) explaining the new lesson; (4) asking the students to work in groups to do the task; (5) checking the answers with the whole class; (6) doing the comprehension check; (7) giving advice and homework; (8) saying good bye. muijs & reynolds (2005, pp. 107-108) have suggested that school climate will strongly influence classroom climate, and in order to be effective, the two need to be complementary. they have also revealed that most studies have identified classroom climate as an important concomitant of pupil achievement. therefore, it can be said that the school and classroom climates at phonsyneua junior secondary school can be regarded as influential factors supporting students’ achievements since its location was convenient and secure. bush & coleman (2000, p. 10) have supported that a strategic approach to management requires an explicit sense of direction and purpose, and defining a clear vision for the organization is an important stage in this process. consequently, it can be concluded that phonsyneua junior secondary school, which is one of the state schools providing lao students with secondary education, has a clear purpose since it has its own vision and mission. the vision and mission can be used to improve the teaching and learning processes. as a result, the students’ high achievements can be attained at the end of the academic year if the vision and mission are implemented well. davis & thomas (1989, pp.
22-23) have suggested that through daily interactions and modeling, the principal transmits his or her vision of a better school to teachers and other staff and influences them to act to achieve that vision. thus, it can be inferred that although phonsyneua junior secondary school had already formulated the vision and mission to help the ministry of education in lao achieve its long-term mission, it did not succeed in helping the ministry of education, nor was it able to provide the students with much benefit from the vision and mission, because the vision and mission were not well implemented. during the interviews, only two of five teachers knew the school’s vision and mission, while the others did not. seven out of eight students said that they did not know or understand the school’s vision and mission well. therefore, they did not know what they could do to help the school achieve its vision and to help the ministry of education accomplish its long-term mission. thus, the teachers should inform the students about the school’s vision and mission before they start teaching on the first day. roe & drake (1974, p. 227) have illustrated that evaluation is essential to the continuous improvement of the life quality of each individual within the school, including both pupils and teachers. thus, it can be seen that, in evaluating the teachers, the school principal at this school was on the right track of good principalship; on the other hand, he evaluated the teachers without sufficient authentic information because he rarely observed the teaching and learning processes conducted by the teachers in the classrooms. davis & thomas (1989, p. 18) have identified that effective principals in successful schools observe teachers in the classrooms and provide positive, constructive feedback aimed at solving problems and improving instruction.
hence, it can be said that the school principal needs to become more effective in supervising teachers and students. sallis (1993, p. 128) has revealed that staff development can be seen as an essential tool for building quality awareness and knowledge, and it can become a key strategic change agent for developing a quality culture. according to the aforementioned research finding, although all teachers at this school were regarded as qualified by both the school principal and the students, training and staff development were still conducted monthly to improve the teachers’ quality. in this regard, the teachers were first required to join the monthly meetings. after the monthly meetings had finished, they were required to join different group discussions depending on their subjects because there was a head for each subject. hoy & miskel (2008, p. 380) have argued that goal-directed behavior is elicited through communication; hence, the greater the clarity and understanding of the message, the more likely the actions of administrators, teachers, and students will proceed fruitfully. as mentioned in the findings, it can be inferred that there was a lack of guidance from teachers for the students at this school, which could be used to guide the students to do their best towards the intended purposes. the observation showed that although there were two notice boards used to provide information, they carried little information to guide the students towards high achievement in their educational lives. as a result, the students did not clearly know what to do to help the school achieve its purposes. in addition, if more guidance information were communicated well to the students, it would help them all prepare for their futures.
as a result, they would be highly committed to their studies. thus, teachers could assist in guiding the students to do their best towards the intended purposes. woolfolk (2007, p. 399) has suggested that students should be encouraged to improve their own personal ability, perform difficult tasks, persist, and be creative. consequently, it can be concluded that the school principal at this school pays attention to the improvement of the students’ academic achievement. although no photos of award ceremonies for students’ academic achievement, or of any other awards, were posted in the office or on the notice boards when the observation was conducted, it was known from the interviews that honorable mention certificates were given to the students for their achievements in an academic year. to do this, the school principal collaborates with the classroom teachers to encourage the students’ accomplishment: at the end of each academic year, the classroom teachers present honorable mention certificates to the top three students of each class in their own classrooms. borich (2000, p. 351) has suggested that establishing rules and procedures to reduce the occurrence of classroom discipline problems is among the most important classroom management activities, especially at the beginning of the academic year. thus, as mentioned in the findings, it can be inferred that the teachers’ first-day teaching at this school was not yet good because four out of five teachers just introduced themselves to the students, asked the students to write a short family background, and did similar activities on their first day, without establishing the rules and procedures of the teaching and learning processes in the classrooms for the whole academic year. they did not tell the students what they were going to learn, from when to when, and what they needed to do during each semester.
in addition, they did not introduce the school vision and mission to the students, nor did they set up the performance standards to help the school accomplish its vision. thus, the teachers should establish the rules and procedures of the teaching and learning processes in the classrooms for the whole academic year. orlich, harder, callahan, trevisan, & brown (2007, p. 64) have illustrated that the more systematic your instructional planning is, the greater the probability that you will succeed. planning instruction or a lesson means establishing priorities, goals, and objectives for the students. therefore, it can be concluded that although the teachers had the teachers’ book as guidance for their instructional activities, their preparation for the teaching and learning process in the classrooms is still regarded as useful for making that process more effective, since the findings reveal that most teachers taught with lesson plans, while some others used the teachers’ guidance without lesson plans. planning the lesson before teaching is very important because it lets the teachers know what they have to do during the teaching and learning process in each session. davis & thomas (1989, p. 123) have suggested that some factors contributing to active teaching and maintaining a brisk pace include starting class quickly and purposefully, and also ending lessons promptly. they also reveal that the teacher can use clear start and stop cues to help the lessons run according to the specific time targets. based on this reason and the aforementioned finding, it can be concluded that the teaching and learning processes conducted by a few teachers at this school were not regarded as good instructional activities although most of the teachers’ instruction is acceptable. the interviews with the students showed that a few teachers were not punctual: they came late to teach and left the class early, and a few of them did not come to teach their students regularly. thus, the teachers should come to teach punctually and regularly. orlich, harder, callahan, trevisan, & brown (2007, p. 18) have identified that a teacher can teach only if the learner has a desire to learn, known as motivation. hence, according to the findings, it can be seen that the students at this school had a desire to learn; in other words, they had motivation, as the interviews revealed that they planned to finish their secondary schooling and pursue higher study at university. they believed that by doing this, they would be able to become good citizens with a good reputation and standing in society. because the students were motivated, it can be concluded that the teachers at this school would find it easy to conduct the teaching and learning processes in the classrooms if they were willing to do their best in their teaching. lefrancois (2000, p. 421) has illustrated that competence motivation is manifested in the struggle to perform competently and in the feelings of confidence and worth that come with successful performance. consequently, from the interview results, it can be seen that some students at this school were not yet qualified enough to start their secondary education: the teachers commented that some students could not even read and write some words correctly, and these students seemed to have difficulty following the lesson taught in each session.
as a result, the teachers found it hard to teach them, and these students are at risk of dropping out of school. thus, the students should be qualified enough to start their secondary education. muijs & reynolds (2005, p. 86) have revealed that schools in regional areas are likely to suffer from students’ misbehavior. therefore, a disciplined, structured, and caring environment needs to be provided in order to help compensate for what students are missing at home. from the aforementioned finding, it can be said that it is not hard for the school to correct the students’ misbehavior or to help compensate for what the students are missing at home, because the interviews with both the school principal and the teachers showed that most of the students were obedient. they tried to listen to their teachers while the teachers were explaining the content of the new lesson. however, a very few students still misbehaved. to solve this problem, the school principal has to work together with the teachers to correct the students’ misbehavior by giving them knowledge of the school regulations and sound advice. woolfolk (2007, p. 76) has identified that parenting that is strict and directive, with clear rules and consequences combined with high levels of warmth and emotional support, is associated with higher academic achievement. hence, based on the research finding, the parents whose children were studying at this school were not strict and directive with clear rules because they rarely monitored their children’s schooling. in this regard, they thought that they did not have sufficient ability to explain to their children what was taught at school. in contrast, they still provided their children with warmth and emotional support, since during the interviews they commented that they tried to provide their children with emotional and financial support.
hence, the parents whose children were studying at this school should be strict and directive with clear rules and should also provide warmth and emotional and financial support for their children. ormrod (2003, p. 131) has illustrated that students’ school performance is correlated with socioeconomic status; students with higher socioeconomic status tend to have higher academic achievement, and students with lower socioeconomic status tend to be at greater risk of dropping out of school. consequently, it can be seen that the low socioeconomic status of the students’ parents tended to put the students’ performance at school at greater risk, because in the interviews the parents said that they needed to invest a lot of money in their children’s secondary education as well as higher education. they believed that even if their children finished their secondary education, the children would still find it hard to get a job. in addition, because of their economic priorities, they had to ask their children to quit their studies and find a job to help the family earn money. ormrod (2003, p. 505) has suggested that students whose parents are involved in school activities have better attendance records, higher achievement, and more positive attitudes toward school. thus, according to the above-mentioned research finding, it can be concluded that parents were not sufficiently involved in their children’s schooling at this school, since the finding shows that the students’ parents do not have any relationship with the school principal or the teachers of the school where their children were studying. therefore, they have little idea of what is going on with their children’s schooling, which leads to a lack of corrective action for their children’s schooling. mohrman, wohlstetter, & associates (1994, p.
84) have suggested that an effective school has positive relationships with the community it serves, and it is regarded as a member of that community. the sense of community helps reduce alienation between the school and the community and increase students’ achievement. there is some community involvement at this school because, in the interview, the school principal said that the community the school serves helped the school by providing some tables and by filling in some parts of the school building. however, this was only a little involvement in the children’s schooling. thus, the community should be more involved in this school in order to increase students’ achievement. hoy & miskel (2008, p. 191) have revealed that trust in schools is important because it facilitates cooperation, enhances openness, promotes group cohesiveness, and improves students’ achievement. it is known from the interview that the community believes that education is very important for their children. they can see that more highly educated people can find good jobs easily. however, it can also be concluded that the community lacks trust in secondary and higher education, since its members point out that they need to spend a lot of money on their children’s education.

conclusion and recommendation

conclusion

based on the research finding and discussion, the four contexts involved in students’ schooling have performed their tasks well in several ways, whereas some aspects of their performance still need improvement to motivate the students for schooling. therefore, the conclusion on the factors discouraging students from schooling at this school can be drawn as follows: first, some positive aspects of controlling are conducted at this school.
however, planning and controlling the teaching and learning processes are not good because of the following causes: (a) yearly evaluations of the teachers’ performance at this school lack authenticity and transparency, (b) monthly technical meetings for staff training and development at this school are not conducted on the most effective dates, (c) the information communicated to the students as guidance for their academic lives at this school is still limited, and (d) the honorable mention certificates are not given to the top three students on the most effective dates to motivate the students for schooling. second, although the teachers have performed their instructional activities well in some aspects, other aspects need to be improved because: (a) most teachers do not set the performance standards for the teaching and learning process on the first day of their teaching, (b) some teachers do not prepare lesson plans for the teaching and learning process, (c) a few teachers are neither well-prepared nor punctual for their teaching, and (d) a few of them do not come to teach their students regularly according to the arranged schedule. third, the students who are not highly motivated for their schooling are not qualified enough for their secondary education. fourth, although the students’ parents give their children emotional and financial support for schooling, there are still some tasks that they do not perform well: (a) they rarely monitor their children’s schooling, (b) because of poverty, some parents regard the family’s socioeconomic status as the first priority rather than their children’s schooling, and (c) they do not have a close relationship with the school where their children are studying.
last, even though the community gets involved in the students’ schooling by helping to improve the school’s campus and repair the tables and chairs, it has little responsibility for or involvement in the students’ schooling at secondary school. although the community perceives that schooling is very important for children, its members do not have much trust in it since they state that they spend a lot of money on it.

recommendations

according to the aforementioned conclusion, several recommendations are shared as follows: first, it is recommended that the implementation of planning and controlling the teaching and learning process at this school should be improved. to do this, several actions should be taken: (a) the vision and mission should be communicated to the students on the teachers’ first day of teaching, (b) formal observations of the teachers’ teaching management, organized into three phases (pre-observation conference, classroom observation, and post-observation conference), should be carried out by senior teachers, (c) yearly evaluations of the teachers should be done by the school principal and other evaluators more authentically and transparently by using the notes from the observations, and the evaluators should ask the teachers to sign their acceptance of the results before the results are sent to the provincial education department, (d) monthly technical meetings for staff training and development should be conducted on mondays so that all teachers can attend, (e) more information about academic guidance should be communicated to the students, and (f) the honorable mention certificates for the top three students of each class should be presented on the opening ceremony day. the second recommendation is that the teachers should make more contact with the students’ parents by sending them invitation letters to discuss their children’s schooling.
third, it is recommended that several aspects of teachers’ teaching management should be enhanced, and to achieve this, the teachers at this school should do the following tasks: (a) on the first day of the teaching and learning process, the teachers should provide a syllabus for the students and set up the rules and procedures for the teaching and learning processes, (b) all teachers should be both well-prepared and punctual for their teaching, and (c) all teachers should come to teach the students regularly. fourth, more attention should be paid to students’ low competency. to do this, the classroom teacher should invite the students with low competency to come to his/her office, talk with them directly and personally to find out their difficulties, and ask other teachers to help them during the teaching and learning processes in the classrooms. in addition, information about those low-competency students should be communicated to their parents so that they can help the school improve their children’s learning competency. fifth, the students’ parents should do more for their children’s learning: (a) they should monitor their children’s schooling regularly, and they should also go to school to meet the teachers and ask about their children’s schooling as often as possible, (b) they should regard their children’s schooling as the first priority. to change their perception, they should be invited to attend the opening ceremony so that they have a chance to listen to an explanation of how important education is for their children and learn how to help their children with schooling. as a result, they will do their best for it, and (c) they should have a close relationship with the school. thus, they should come to school more often to ask the teachers about their children’s schooling. sixth, the community should get more involved in students’ schooling.
they should be invited to join the opening ceremony every new academic year and be informed monthly by the principal about what is going on at school so that they can share more responsibility for the students’ schooling. last, the value of trust in students’ schooling should be taken into account. to do so, the teachers should strengthen the students’ discipline. if discipline is strengthened, the teachers and students will be highly involved in the teaching and learning processes. as a result, the quality of education will be improved, and trust can be built as well.

references

borich, g. d. (2000). effective teaching methods. london: prentice hall, inc.
bush, t., & coleman, m. (2000). leadership and strategic management in education. london: paul chapman publishing ltd.
davis, g. a., & thomas, m. a. (1989). effective schools and effective teachers. massachusetts: library of congress cataloging-in-publication data.
hoy, w. k., & miskel, c. g. (2008). educational administration: theory, research, and practice (8th ed.). new york, ny: mcgraw-hill international edition.
lefrancois, j. y. r. (2000). psychology for teaching. belmont: thomson.
mcmillan, j. h., & schumacher, s. (2001). research in education: a conceptual introduction. new york: longman.
mohrman, s. a., wohlstetter, p., & associates. (1994). school-based management. san francisco: jossey-bass publishers.
muijs, d., & reynolds, d. (2005). effective teaching: evidence and practice. london: sage publications ltd.
orlich, d. c., harder, r. j., callahan, r. c., trevisan, m. s., & brown, a. h. (2007). teaching strategies: a guide to effective instruction. new york, ny: houghton mifflin company.
ormrod, j. e. (2003). educational psychology: developing learners. new jersey: merrill prentice hall.
roe, w. h., & drake, t. l. (1980). the principalship. london: collier macmillan.
sallis, e. (1993). total quality management in education. london: kogan page.
woolfolk, a. (2007). educational psychology. new york, ny: pearson.
yin, r. k. (2009). case study research: design and methods (4th ed.). thousand oaks, ca: sage.

how to cite item: widjanarko, d., sofyan, h., & surjono, h. (2016). improving students’ mastery on automotive electrical system using automotive electrical multimedia. research and evaluation in education, 2(1), 71-78. doi:http://dx.doi.org/10.21831/reid.v2i1.8219

research and evaluation in education, e-issn: 2460-6995
volume 2, number 1, june 2016 (pages 71-78)
available online at: http://journal.uny.ac.id/index.php/reid

improving students’ mastery on automotive electrical system using automotive electrical multimedia

¹dwi widjanarko; ²herminarto sofyan; ³herman dwi surjono
¹semarang state university; ²,³yogyakarta state university
¹dwi2_otosmg@yahoo.com; ²hermin@uny.ac.id; ³hermansurjono@uny.ac.id

abstract

this research was conducted to study the improvement of students’ understanding of automotive electrical system by applying automotive electrical multimedia. the multimedia was developed and validated by automotive and multimedia experts before being applied in teaching and learning processes. the research design was a single-group pretest-posttest design with three stages of field testing. the results showed that in the preliminary field testing, the students’ understanding increased by 32.55%; in the main field testing, the understanding was up to 64.89%; and in the operational field testing, the understanding became 77.36%. this indicates that automotive multimedia could increase the students’ understanding of automotive electrical system significantly.

keywords: automotive electrical system, multimedia

introduction

the advanced development of automotive systems mainly occurs in electrical systems.
most of the systems in the automotive field are now controlled electrically. that is why the students of the automotive study program have to master automotive electrical systems. unfortunately, based on a survey carried out among automotive teacher-education students of semarang state university (ssu) and yogyakarta state university (ysu), automotive electrical system is one of the subjects that are difficult to study. the main difficulty in mastering automotive electrical systems is understanding the automotive electrical circuit and how the circuit works. research conducted by widjanarko and abdurrahman (2006) shows that students’ ability to explain and analyze the operation of automotive electrical systems was disappointing: the mastery level was lower than 50%, whereas ideally it should reach at least 70%. given that problem, a solution is urgently needed to improve the students’ mastery of automotive electrical systems. an instructional medium or multimedia has to be developed to facilitate the learning of automotive electrical systems. multimedia is the integration of more than one medium into some form of communication or experience which is delivered by a computer. most often, multimedia refers to the integration of media such as text, sound, graphics, animation, video, image, and spatial modeling into a computer system (reeves, 1998, p.22). multimedia is the use of several media to present information. the combination may include texts, graphics, animation, pictures, video, and sound (ivers & barron, 2002, p.2). computer-assisted education has been used frequently in modern educational systems because of its benefits, such as providing persistence in learning in general, providing a learner-centered learning process, taking learning outside the four walls of the classroom and making it independent of space and time, making frequent practice possible, and providing quick access to information (kayri, gencoglu, & kayri, 2012, p.59).
the use of the computer as a medium for instruction provides many capabilities that cannot be readily duplicated within the traditional lecture format. teachers could combine the internet, projector, presentation software, and web resources into the teaching-learning scenarios. the contents of teaching become vivid with delightful visual icons, graphs, or appealing explanations, which may generate a very different teaching situation in contrast to traditional teaching (lu & cheng, 2012, p.1013). multimedia can be defined as an integration of multiple media elements into one synergetic and symbiotic whole that results in more benefits for the end user than any one of the media elements can provide individually (reddi & mishra, 2003, p.4), and it enables learners to study anytime and anywhere (arkorful & abaidoo, 2014, p.403). ‘multimedia’ and ‘interactive multimedia’ are defined by four basic characteristics (garrand, 2006, pp.4-5), namely: (1) combination of many media (video, text, audio, and still pictures) into a single piece of work; (2) computer mediated, meaning that a computer is used to mediate or make possible the interaction between the users and the material or media being manipulated; (3) media-altering interactivity, or user interactivity in multimedia, which is best defined as ‘the ability of the user to alter media he or she comes in contact with’; and (4) linking, or allowing links or connections to be made between different media elements. educational multimedia features used for instructional purposes cover (1) screen design (visual elements: color, text, graphics, and animation); (2) learner control and navigation; (3) use of feedback; (4) student interactivity; and (5) video and audio elements (stemler, 1997, p.339).
there are really only five basic types of media objects we will generally use (simkins, cole, tavalin & means, 2002, pp.13-14), namely (1) images, which come in many forms including graphs, maps, photographs, and drawings; (2) text, which includes everything from image captions to paragraphs of information; (3) sound, such as voice recordings, music, and sound effects that can be used alone or to enhance another media element; (4) motion, which includes cartoon-type animation, video, and moving transitions between screens of media; and (5) interactivity, which means making buttons, hyperlinks, and the like. to produce good multimedia, some systematic steps must be carried out. frey and sutton (2010, p.491) explain several steps to develop multimedia: (1) define the instructional goals, objectives, and audience; (2) review and investigate the existing options; (3) determine format, budget, and timeline; (4) determine the content, activities, and assessment strategies; (5) develop evaluation strategies, criteria, and instruments to determine the effectiveness of the project; (6) develop the flowchart, site map, and/or storyboard; (7) develop a prototype; (8) perform a formative evaluation; (9) complete the design; and (10) perform a summative evaluation of product and process. a technological approach which includes multimedia applications has become increasingly important for students in schools. the use of multimedia makes instruction readily available, more affordable, accessible without limits, and easily comprehensible. the role of technology in every step of the instructional systems is the requirement of today’s world (eristi, 2008, p.832). multimedia can be used in instruction by utilizing computers in teaching and learning activities.
this learning is termed computer-based learning (cbl). nazimuddin (2014, p.185) states that computers in education are a diverse and rapidly expanding spectrum of computer technologies that assist the instructional process. the application of computers in instruction includes guided drill and practice exercises, computer visualization of complex objects, and computer-facilitated communication between the students and teachers. the basic features in which cbl environments outperform other forms of learning environments can be summarized as follows (sidhu, 2010, p.53): (1) the speed with which the computer can respond to an individual learner’s needs; (2) the way that the computer can offer, and respond to, a wide range of learner interaction; (3) the potential to represent information in a wide scope of formats from text to video; and (4) the opportunity to provide an unlimited choice of learning paths. concerning the difficulty of mastering automotive electrical systems, this research was conducted to develop a multimedia program that helps students learn all automotive electrical materials. this multimedia is intended to make automotive electrical systems easy to learn. this article discusses the improvement of students’ mastery of automotive electrical systems during preliminary field testing, main field testing, and operational field testing.

method

the automotive electrical multimedia was developed following the 10 steps of developing an educational product (borg & gall, 1989): (1) research and information collecting, (2) planning, (3) developing preliminary form of product, (4) preliminary field testing, (5) main product revision, (6) main field testing, (7) operational product revision, (8) operational field testing, (9) final product revision, and (10) dissemination and implementation. the multimedia was developed based on the needs assessment from research and information collecting, planning, and developing the preliminary form of the product.
before the product was field tested, it was validated by automotive electrical and multimedia experts. the research was conducted from march 20th to october 25th, 2012 at the automotive engineering education study program, yogyakarta state university and semarang state university, indonesia. the research subjects for the preliminary and main field testing came from 80 students who had passed the automotive electrical subject. in the preliminary field testing, six randomly determined students were involved. in the main field testing, 42 randomly determined students participated. for the operational field testing, 100 students who were enrolled in the automotive electrical subject were involved. the one-group pretest-posttest design was used in the experiment. a multiple choice test was utilized to measure the automotive electrical system mastery before and after the application of the multimedia. the t-test was used to analyze the difference of the average scores. descriptive statistics (%) were also used to describe the level of improvement.

findings and discussion

the automotive electrical multimedia developed in this research covers all the automotive electrical systems needed by students enrolled in the automotive electrical subject. the multimedia contains basic electricity, basic electronics, engine electrical system (covering starting system, charging system, and ignition system), and body electrical system (lighting system, wiper system, power window system, and power mirror system). each system presented covers fundamentals, components and their functions, circuits, operation of each component, system operation in animation form, exercises, and evaluations. the multimedia cd contains 31 folders, 128 power point files with a total of 2007 slides, 31 pdf files, 348 gif animation files, and 1120 wav sound files, occupying 618 mb.
the developed automotive electrical multimedia must meet validity criteria in terms of both its content and its multimedia performance, so the multimedia has to be validated. the content validation was carried out by four automotive electrical experts and was intended to make sure that the content is valid, correct, and appropriate to the automotive electrical competencies, so that the multimedia can be used in automotive electrical subjects. the content validation of the multimedia referred to the content validation instrument that had been prepared in this research. the summary of automotive electrical expert validation is listed in table 1. the validity determination of the automotive electrical multimedia refers to the classification of scores into levels having evaluative meaning (i.e. good-bad, high-low, valid-invalid, etc.) according to azwar (2003, p.157). based on the classifying calculation, the criteria were found as follows: the multimedia is invalid if mpb ≤ 3.79; less valid if 3.79 < mpb ≤ 4.23; valid if 4.23 < mpb ≤ 4.68; and very valid if 4.68 < mpb ≤ 5.12. the data in table 1 indicate that the multimedia has an mpb of 4.45, which lies within the valid criteria. therefore, the content of the multimedia is valid.

the multimedia was also validated based on its performance. this validation was carried out by two multimedia experts to ensure that all aspects required in a multimedia have been fulfilled, so that the multimedia could be used in the teaching and learning processes of automotive electrical subjects. the summary of multimedia expert validation is tabulated in table 2. based on the classifying calculation, the following criteria were found: the multimedia is invalid if mpb ≤ 3.53; less valid if 3.53 < mpb ≤ 4.05; valid if 4.05 < mpb ≤ 4.58; and very valid if 4.58 < mpb ≤ 5.10. the data from table 2 indicate that the multimedia has an mpb of 4.32, which lies within the valid criteria. therefore, the performance of the multimedia is valid.
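the classification rule above can be illustrated with a short sketch. it is only an illustration, not the authors' procedure: the function and variable names are hypothetical, the cut points are copied from the criteria in the text, and the validator scores are those listed in tables 1 and 2. the upper bounds (5.12 and 5.10) are not checked, which is a simplifying assumption. computed this way, the mean of the two scores in table 2 is 4.32.

```python
from statistics import mean

# Cut points copied from the criteria in the text (upper limits of each class).
CONTENT_CUTS = (3.79, 4.23, 4.68)      # automotive electrical expert validation
PERFORMANCE_CUTS = (3.53, 4.05, 4.58)  # multimedia expert validation
LABELS = ("invalid", "less valid", "valid", "very valid")


def classify(mpb, cuts):
    """Return the validity label for a mean score (mpb).

    Assumption: any score above the last cut point is "very valid";
    the upper bound given in the text is not enforced here.
    """
    for cut, label in zip(cuts, LABELS):
        if mpb <= cut:
            return label
    return LABELS[-1]


# Validator scores as listed in tables 1 and 2.
content_scores = [4.69, 4.78, 4.18, 4.16]
performance_scores = [4.53, 4.11]

content_mpb = round(mean(content_scores), 2)
performance_mpb = round(mean(performance_scores), 2)

print("content:", content_mpb, classify(content_mpb, CONTENT_CUTS))
print("performance:", performance_mpb, classify(performance_mpb, PERFORMANCE_CUTS))
```

both means fall in the valid class under their respective criteria.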
the automotive electrical multimedia was applied in teaching and learning processes to improve students' competency. the data on mastery improvement were collected through an experiment using the one-group pretest-posttest design. a pretest and a posttest were administered in each field test. in the preliminary field testing, the pretest was administered before the multimedia teaching and learning process and the posttest was administered after it. the summary of the pretest and posttest data in the preliminary field testing is tabulated in table 3.

table 1. automotive electrical expert validation data
        val. 1   val. 2   val. 3   val. 4   mean
mean    4.69     4.78     4.18     4.16     4.45

table 2. multimedia expert validation data
        score val. 1   score val. 2   mean
mean    4.53           4.11           4.32

table 3. pretest and posttest data in preliminary field testing
average pretest   average posttest   improvement (points)   improvement (%)
58.34             76.14              17.80                  32.55

based on the data in table 3, it can be seen that the students' mastery of automotive electrical systems increased after the application of the multimedia. the pretest average score is 58.34 and the posttest average score is 76.14, so the average score improved by 17.80 points, an increase of 32.55%. the t-test analysis shows that the calculated t is 1.979 and the tabulated t is 3.169 (at the 1% significance level) and 2.228 (at the 5% significance level). this means that the calculated t value is less than the tabulated t value, so there is no significant difference between the pretest average score and the posttest average score. although there is an improvement in the mastery of automotive electrical systems, the improvement is not yet significant.
thus, the automotive electrical multimedia was not yet effective in increasing students' mastery of automotive electrical systems. based on the multimedia evaluation conducted during the preliminary field testing, the multimedia was revised to accommodate students' suggestions and make it easier to operate. after the revision of the multimedia, the main field testing was conducted. a pretest and a posttest were also administered in the main field testing in order to see the effectiveness of the multimedia. the summary of the pretest and posttest data in the main field testing is tabulated in table 4. the mastery of the automotive electrical system increased after the application of the multimedia in the main field testing, with an average score improvement of 24.38 points, or 64.89%. the t-test analysis shows that the calculated t is 7.614 and the tabulated t is 2.637 (at the 1% significance level) and 1.989 (at the 5% significance level). this means that the calculated t value is higher than the tabulated t value, so there is a significant difference between the pretest average score and the posttest average score. therefore, students' mastery of the automotive electrical system also improved significantly in the main field testing. it seems that the automotive electrical multimedia is effective in increasing students' mastery of automotive electrical systems. based on the multimedia evaluation conducted during the main field testing, the multimedia was again revised to accommodate students' suggestions. the last pretest and posttest were administered in the operational field testing to see the effectiveness of the multimedia at that stage. the summary of the pretest and posttest data in the operational field testing is tabulated in table 5.
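the decision rule applied to the preliminary and main field tests above can be sketched as follows. this is a minimal illustration under stated assumptions: the names are hypothetical, and the t values are copied from the text rather than recomputed, since the individual student scores are not reported.

```python
# Calculated and tabulated t values copied from the text.
# Nothing is recomputed: the raw pretest and posttest scores are not reported.
FIELD_TESTS = {
    "preliminary": {"t_calc": 1.979, "t_table": {"5%": 2.228, "1%": 3.169}},
    "main": {"t_calc": 7.614, "t_table": {"5%": 1.989, "1%": 2.637}},
}


def is_significant(t_calc, t_table):
    """The difference is significant when |calculated t| exceeds the tabulated t."""
    return abs(t_calc) > t_table


# Apply the rule at each significance level for each field test.
decisions = {
    name: {
        level: is_significant(data["t_calc"], critical)
        for level, critical in data["t_table"].items()
    }
    for name, data in FIELD_TESTS.items()
}

for name, result in decisions.items():
    print(name, result)
```

with these values, the preliminary test is not significant at either level, while the main test is significant at both.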
the students' mastery of automotive electrical systems increased after the application of the multimedia in the operational field testing, with an average score improvement of 26.50 points, or 77.36%. the t-test analysis shows that the calculated t is 12.482 and the tabulated t is 2.576 (at the 1% significance level) and 1.960 (at the 5% significance level). this means that the calculated t value is higher than the tabulated t value, so there is a significant difference between the pretest average score and the posttest average score. thus, there is also a significant improvement in the mastery of automotive electrical systems in the operational field testing. it seems that the automotive electrical multimedia is effective in increasing students' mastery of automotive electrical systems, and the learning objectives could be achieved.

after the application of the automotive electrical multimedia in the field testing, there is a significant difference in the students' mastery of automotive electrical systems before and after they use the multimedia. this means that the multimedia could effectively increase students' mastery of automotive electrical systems.

table 4. pretest and posttest data in main field testing
average pretest   average posttest   improvement (points)   improvement (%)
46.30             70.68              24.38                  64.89

table 5. pretest and posttest data in operational field testing
average pretest   average posttest   improvement (points)   improvement (%)
43.78             70.28              26.50                  77.36

the improvement may result from students becoming more active in learning, practicing, and evaluating themselves, as facilitated by the multimedia. afolabi, abidoye, and afolabi (2012, p.6) state that effective utilization of appropriate instructional media is highly essential to improve teaching and learning of social studies in secondary schools. improvement and better academic achievement can also be guaranteed through the use of instructional media.
in addition, if the standard of education is to be raised, instructional media should be used for the teaching and learning of social studies and other school subjects. the use of computers in learning environments contributes positively to student learning (deniz & cakir, 2006, p.2), so students' academic achievement can rise. the combination of computer-assisted instruction and collaborative work improves learning without a significant effect on attitude (ragasa, 2008, p.8). the improvement of achievement after the application of computer-based learning can be caused by an increase in learning motivation. reeves (1998, p.2) states that computers as tutors have positive effects on learning as measured by standardized achievement tests, are more motivating for students, and are accepted by more teachers than other technologies. students could reach learning objectives in a shorter time compared with conventional learning. the developed multimedia was interesting and able to help students master automotive electrical systems. this is based on students' responses in the main field testing, which stated that the developed multimedia could improve their motivation. in the operational field testing, the multimedia also increased their motivation to learn. research conducted by rosa and preethi (2012, p.9) shows that students who are taught through a multimedia instructional package can perform better than those who are taught through a common method of teaching. multimedia enables learning through exploration, discovery, and experience.
with multimedia matched to students' learning needs, the process of learning can become more goal-oriented, more participatory, flexible in time and space, unaffected by distances, and tailored to individual learning styles. multimedia enables learning to become fun and friendly, without fear of inadequacies or failure. in their research, they also found that there was a significant difference in achievement between the experimental group taught with multimedia and the control group taught without multimedia. the conclusion is that students learning with the multimedia package got better results than those learning without it. leow and neo (2014, p.99) also find that interactive multimedia learning brings significant improvement in students' achievement and in students' activity and motivation in the learning processes. some advantages of computer-based learning are that the students can choose their own way and speed, the program can be stopped at any time, the program can be repeated as often as the user wishes, the computer is not judgmental, the students can learn from their mistakes without embarrassment, it saves the teacher's time, the students are more active, and weak students are favoured (schittek, mattheos, lyon, & attström, 2001, p.99). in addition, computer-aided learning (cal) can provide innovative and interactive ways of presenting material and therefore should be used as an adjunct to conventional teaching or as a means of self-instruction. cal can elicit positive responses from students and consequently motivate students to learn.
a cal program that is at least as effective as other methods of learning has several potential added-value advantages (depending on how the program is designed and the students' ease of access to the cal modules): students can learn at their own pace, cal lessons can be reviewed several times, and computer-based modules can be used at convenient times when the student is free of distractions, alert, and ready to learn (rosenberg, grad, & matear, 2003, p.531). computer-aided instruction (cai) is an application of computers in delivering instruction. it is an integration of software and hardware. computer-based learning is especially effective for training people to use computer applications because the program can be integrated with the applications so that students can practice using the application as they learn (alsultan, lim, matjafri, & abdullah, 2006, p.29). based on the aforementioned explanation, it can be concluded that utilizing computer technology in education brings many advantages, especially for students. the learning objectives can be achieved by making students active in the learning processes. the learning process can be carried out both inside the classroom and individually outside it, at any time and place. thus, the computer has now become a basic need in education to facilitate teaching and learning processes for teachers and students.

conclusion

the validation conducted by automotive electrical and multimedia experts shows that the developed automotive electrical multimedia is valid, so it fulfills the requirements of an educational multimedia. the application of the multimedia in teaching and learning processes has significantly improved students' mastery of automotive electrical systems.
the improvement reached 32.55% in the preliminary field testing, 64.89% in the main field testing, and 77.36% in the operational field testing. this indicates that the automotive multimedia could increase the students' understanding of automotive electrical systems significantly. therefore, based on the research findings, the developed automotive electrical multimedia can be used as an instructional multimedia for teachers and as a learning tool for students to learn automotive electrical systems easily.

references

afolabi, a.k., abidoye, j.a., & afolabi, a.f. (2012). effect of instructional media on the academic achievement of students in social studies in junior secondary schools. pnla quarterly, 77, 1-7.
alsultan, s., lim, h.s., matjafri, m.z., & abdullah, k. (2006). development of a computer aided instruction (cai) package in remote sensing education. international archives of the photogrammetry, remote sensing and spatial information science, 34, 29-33.
arkorful, v. & abaidoo, n. (2014). the role of e-learning, the advantages and disadvantages of its adoption in higher education. international journal of education and research, 2(2), 397-410.
azwar, s. (2003). tes prestasi: fungsi dan pengembangan pengukuran prestasi belajar (2nd ed.) [achievement test: function and development of learning achievement measurement (2nd ed.)]. yogyakarta: pustaka pelajar.
borg, w.r. & gall, m.d. (1989). educational research: an introduction (5th ed.). new york, ny: pearson education.
deniz, h. & cakir, h. (2006). design principles for computer-assisted instruction in histology education: an exploratory study. journal of science education and technology. doi: 10.1007/s10956-006-9031-5, 1-10.
eristi, s.d. (2008). the effectiveness of interactive instruction cd designed through the pre-school students. journal of theoretical and applied information technology, 5, 832-839.
frey, b.a. & sutton, j.m. (2010). a model for developing multimedia learning projects.
merlot journal of online learning and teaching, 6, 491-507.
garrand, t. (2006). writing for multimedia and the web: a practical guide to content development for interactive media. amsterdam: elsevier.
ivers, k.s. & barron, a.e. (2002). multimedia projects in education: designing, producing, and assessing (2nd ed.). connecticut, ct: teacher ideas press, a division of greenwood.
kayri, i., gençoglu, m.t., & kayri, m. (2012). the computer assisted education and its effects on the academic success of students in the lighting technique and indoor installation project course. international journal of advances in engineering & technology, 2, 51-61.
leow, f.t. & neo, m. (2014). interactive multimedia learning: innovating classroom education in a malaysian university. tojet: the turkish online journal of educational technology, 13(2), 99-110.
lu, c.h. & cheng, s.f. (2012). applying computer-based technology to instruction for the effectiveness of teaching and learning. arpn journal of science and technology, 2(10), 1013-1017.
nazimuddin, s.k. (2014). computer assisted instruction (cai): a new approach in the field of education. international journal of scientific engineering and research (ijser), 3(7), 185-188.
ragasa, c.y. (2008). a comparison of computer-assisted instruction and the traditional method of teaching basic statistics. journal of statistics education, 6(1), 1-10.
reddi, u.v. & mishra, s. (2003). educational multimedia: a handbook for teacher-developers. new delhi: the commonwealth of learning, commonwealth educational media centre for asia.
reeves, t.c. (1998). the impact of media and technology in schools. a research report prepared for the bertelsmann foundation. university of georgia.
rosa, m.c. & preethi, c. (2012). effectiveness of multimedia instructional package for teaching marketing management among higher secondary school students.
education india journal: a quarterly refereed journal of dialogues on education, 1, 1-12.
rosenberg, h., grad, h.a., & matear, d.w. (2003). the effectiveness of computer-aided, self-instructional programs in dental education: a systematic review of the literature. journal of dental education, 67, 524-532.
schittek, m., mattheos, n., lyon, h.c., & attström, r. (2001). computer assisted learning: a review. european journal of dental education, 5, 93-100.
sidhu, m.s. (2010). technology-assisted problem solving for engineering education: interactive multimedia applications. new york, ny: engineering science reference.
simkins, m., cole, k., tavalin, f., & means, b. (2002). increasing student learning through multimedia projects. alexandria, va: association for supervision and curriculum development.
stemler, l.k. (1997). educational characteristics of multimedia: a literature review. journal of educational multimedia and hypermedia, 6, 339-359.
widjanarko, d. & abdurrahman. (2006). peningkatan kemampuan menganalisis kerja sistem kelistrikan mobil dengan tugas model uraian terbatas [development of analyzing ability of car electrical working system by employing restricted response task model]. jurnal pendidikan teknik mesin, 6, 59-64.
research and evaluation in education journal
e-issn: 2460-6995
volume 1, number 1, june 2015 (84-99)
available online at: http://journal.uny.ac.id/index.php/reid

strengthening vocational character for polytechnic education which has non-production-based curriculum

1) peni handayani; 2) satryo soemantri brodjonegoro
1) bandung state polytechnic, indonesia; 2) bandung institute of technology, indonesia
1) penihan@polban.ac.id; 2) satrio1@indo.net.id

abstract

the vocational character of polytechnic education has declined in the last ten years, especially in polytechnics with a non-production-based curriculum. this research aims to reveal the factors that contribute to the devocationalization of polytechnic education and to find an alternative solution for revocationalization that considers the current condition and the future demand. this study applied a qualitative approach supported by quantitative data and involved three polytechnics in bandung and malang, three industries in bandung, one industry in yogyakarta, and an expert representing the department of industrial and cooperation yogyakarta. interpretational inductive analysis was used to analyze the qualitative data. this study revealed that: (1) environmental factors are very influential in shaping the character and recognition of polytechnic education; (2) vocational character is gained most through apprenticeship in industry or the workplace; and (3) the gaining of vocational character needs to be managed by maintaining and strengthening cooperation between polytechnics and industries, together with a competent institution which can develop the education system.
keywords: vocational character, polytechnic education, professional, apprenticeship, recognition

introduction

polytechnic, in government regulation number 17 of 2010 on the implementation and management of education chapter 1 verse 18, is a higher education institution that conducts vocational education in a few specific fields. historically, vocational education developed several centuries ago in egypt in the form of apprenticeships lasting years, through which a person became an expert in a particular job (finch & crunkilton, 1999, p. 4). apprenticeship at that time was a form of non-formal education aimed at achieving the competencies needed to do a specific job. until now, the implementation of vocational education has often been related to the supply of skilled workers and the economic growth of the country. skilled workers are supplied through apprenticeship in various models. from some centuries ago until the late 18th century, apprenticeship was conducted as individual, non-formal training to achieve professional competencies. the increasing demand for skilled workers since world war i drove vocational education to move from non-formal education to formal education, as stated in the smith-hughes act (1917) in the united states of america. vocational education has continued to develop since then. throughout its history, vocational education in various countries has always been associated with the supply of skilled workers through various training models, with economic growth, and with the development of products, whether goods or services.

character, according to the dictionary, is the quality that makes something different from others, or a trait (bull, 2008, p. 68). vocational is everything associated with vocation. vocation is something associated with work or way of life (bull, 2008, p. 495).
thus, vocational character can be interpreted as the traits or characteristics associated with a job. vocational education is any education that provides experiences, visual stimuli, affective awareness, cognitive information, or psychomotor skills; and that enhances the vocational development process of exploring, establishing, and maintaining oneself in the world of work (thompson, 1973, p. 216). vocational education can be differentiated from academic education through the definition of competence in the same area, as described by barnett and shown in table 1. barnett differentiates operational competence for vocational education from academic competence for academic education, including cognitive competence, attitudes or personal competence, and values or behavior that reflect professional action.

vocational education has several characteristics: its implementation is often related to technology and economic development, and it supplies skilled workers who have specific competencies for doing a certain job. the curriculum is planned and developed according to existing products, and the learning model is often related to the job. ideally, these learning activities should be equipped with the industrial equipment that would be used in the workplace. the implementation of this vocational education model is costly. schools have to update their equipment whenever industrial equipment changes. the application of technology, especially information technology, in many sectors has changed the industrial equipment in use. this change requires workers with different qualifications than before. another impact is the emergence of new professions that had never been imagined before. these changes are very fast, so the competencies produced by education quickly become obsolete (cheng, 2005, p. 19).
these conditions encourage vocational education in each country to change its curriculum planning and strategic implementation so that it can maintain its existence and contribute to the economic development of the country. some research data show that vocational education has become a great developer of innovative products that encourage the economic growth of countries such as south korea, singapore and china (yoo jeung joy nam, 2009). the direction of vocational education is influenced by the role of this education in the economic development framework of a country. the contribution of this education to each country's economic development has taken different forms. history has shown that this education has been established and developed under social conditions. the primary objective of vocational education should be focused on individual development. the economic improvement that vocational education provides is viewed not only at the individual level but also at the level of the country. education effectiveness has been measured based on its contribution to economic performance. every vocational institution has its own way of contributing. in principle, vocational education can be characterized by two things: (1) education is focused on the development of operational or technical competences and on training associated with the work in a certain profession, in order to produce graduates who have 'professional experiences at the beginning level'; (2) the existence of vocational education is determined by its contribution to economic performance in different forms, methods and levels. based on these main characteristics, all levels of vocational education need a work environment. professional competence as a product of vocational education could be developed through apprenticeship (nicholls, 2001, p. 43).
therefore, apprenticeship for polytechnic students should be viewed as a part of the practical learning process, integrated into education to improve technical competence in the field. why does the devocationalization phenomenon appear in polytechnics? what factors influence the vocational character of polytechnic education? how should polytechnics develop their students' competences? what kind of competence needs to be developed as the basis of professional competence? how can apprenticeship be utilized by polytechnics to increase their contribution to economic performance?

table 1. barnett's two rival competences
no   term                  operational competence               academic competence
1    epistemology          know how                             know what
2    situations            defined pragmatically                defined by intellectual field
3    focus                 outcomes                             propositions
4    transferability       metaoperation                        metacognition
5    learning              experiential                         propositional
6    communication         strategic                            disciplinary
7    evaluation            economic                             truthfulness
8    value orientation     economic survival                    disciplinary strength
9    boundary conditions   organizational norms                 norms of intellectual field
10   critique              for better practical effectiveness   for better cognitive understanding
source: nicholls (2001, p. 28)

method

type of research

this research mainly used a qualitative approach to explore the devocationalization phenomenon of polytechnic education in indonesia in general. this research was also supported by quantitative data to reinforce the interpretation of the field data and to support the data, the model, and the conceptual model.

data validation

one of the weaknesses of the qualitative approach is bias that comes from the researchers as the main instrument in the research. this bias can be reduced by validating data through triangulation and by checking and rechecking data against its sources, either using the same tools on another source of data or using different tools to check the same source of data.
time and location of this research

this study was conducted from april until october 2012. the research involved three state polytechnics in bandung and malang which were considered to represent a model of polytechnic education in indonesia, three industries in bandung, one industry in yogyakarta, and an entrepreneurial expert from the ministry of industry and cooperative of yogyakarta special region, also known as daerah istimewa yogyakarta (diy), who was considered to be very concerned about vocational education development, particularly polytechnics.

technique and tools to collect data

document exploration, interviews, observation and questionnaires were used to explore and generate data. the tools used to collect data were a digital camera, a digital voice recorder, stationery, and a questionnaire. the samples were chosen by purposeful sampling.

sampling and respondent

this research involved three polytechnics, 12 lecturers, 20 students, 4 industries, and 8 polytechnic managements. purposeful sampling was used in this research.

research procedure

figure 1. research procedure

this research started with observation in four state polytechnics: bandung state polytechnic (polban), malang state polytechnic (polinema), and jakarta state polytechnic (pnj). the collected data, which were used to formulate the problem, were related to the main issue or phenomenon to be observed. the next step was the field study, in which the researchers determined the location or plan of the study, the sampling, and the informants. the polytechnics selected as samples were state polytechnic of bandung or politeknik negeri bandung (polban), state polytechnic of malang or politeknik negeri malang (polinema), and manufacture polytechnic of bandung or politeknik manufaktur bandung (polman).
the informants consisted of senior lecturers who had taught for more than 15 years and 6th semester students from four study programs: electrical, electronic, civil, and electromotif. intensive data collection was conducted at the three polytechnics. the collected data were analyzed, resulting in a conceptual model as an alternative solution to the main problem of the devocationalization of polytechnic education. this conceptual model was disseminated to the sample polytechnics using the delphi technique. this technique was selected so that, in the first step, opinions could be collected without the respondents discussing them with each other. these opinions were then analyzed and used as a reference to improve the model. the improved model was sent again to the same respondents and to other respondents from state polytechnic of jakarta, state polytechnic of padang, and state electronic polytechnics of surabaya.

[figure 1 flowchart steps: define problems through main question(s); determine location, sampling, and informants; explore and generate data; analyze qualitative data; design conceptual model; validate conceptual model; valid? (yes: conceptual model; no: loop back)]

analysis technique

the triangulation technique was employed to ensure the validity of the collected data. interpretational inductive analysis was used to analyze the qualitative data. the analysis was conducted through a cyclical process, which started from collecting data and continued through transcription and data reduction, interpretation, and the formulation of a conceptual model as an alternative solution to the main phenomenon, that is, the vocational character of polytechnic education.

findings and discussion

factors which influence polytechnic education characters

the first step of this research revealed information about the history of polytechnic development, which influenced the establishment of the characters of polytechnic education in indonesia. these characters are influenced by many factors.
these factors can mainly be grouped into two categories, namely internal and external factors. the internal factors are related to the history of establishing polytechnics in indonesia. the first six polytechnics were built during the 1980s through collaboration between the indonesian and swiss governments. it was an expansion of the polytechnic mechanic swiss pilot project, which had been successful. this development was funded by the world bank and organized by swiss-contact. these polytechnics were attached to universities or institutes which had successfully delivered academic education. prospective lecturers, who were fresh graduates from universities or institutes with minimal industrial experience, were trained at the polytechnic education and development center (pedc) bandung. laboratory and workshop equipment at pedc and the polytechnics was built to a very sophisticated standard, in accordance with the equipment used in industries. polytechnics produced skilled workers in accordance with the needs of the workforce. ten years later, polytechnics became independent from their host universities or institutes. pedc was then discontinued following the regulation that independent polytechnics should develop their own academic implementation. the success of polytechnic development was followed by the emergence of many new polytechnics in indonesia, especially those run by the private sector. the need for faculty members increased significantly. the recruitment of prospective lecturers was no longer followed by special training as before. the new lecturers were mainly fresh graduates from universities or institutes who had no industrial experience. the vocational character of polytechnics started to degrade. in other words, polytechnics started to de-vocationalize. research on and evaluation of the implementation of polytechnic education came too late, and even now the evaluation has not been done comprehensively.
problems of quality and of relevance between polytechnic competences and industry have started to emerge. this situation was worsened by external factors that had a direct impact on polytechnics. the external factors that have contributed to polytechnic education characters are economic, social, cultural, political, and technological, especially information technology. the application of technology in industrial equipment has changed work patterns and types of work, which now require workforce qualifications that are quite new and different from before. these changes are so fast that graduate competencies rapidly become obsolete. the same condition also occurs in vocational education in other regions of the world (cheng, 2005; unesco, 1992). each country has responded to this condition differently.

research and evaluation in education journal strengthening vocational character... 89 peni handayani & satryo soemantri brodjonegoro

strategy for coping with environmental change

some countries have chosen a strategy of equipping their students with generic competences in addition to specific technical competences. the same strategy has also been adopted by many polytechnics in indonesia. strong vocational education is still delivered in many countries which have manufacture-based industries. some countries in asia choose to strengthen their manufacturing industries by involving vocational education institutions like polytechnics in their production process. the countries which have chosen this strategy are germany, singapore, south korea, and taiwan (unesco, 1992). thus, students are prepared to be experts in a particular job. indonesia has other types of industries. it tends to have trading or service industries rather than manufacturing. this condition will continue for at least the next five years. this was revealed in the discussion of economic forum west java on saturday (8 june 2013) in bandung (kompas, monday, 10 june 2013, p. 17).
the discussion concluded that the indonesian government does not have strong policies in the areas of manufacturing industries, trade, and investment to face the asean economic community 2015. this condition calls for the planning and implementation of a polytechnic educational curriculum, including a learning model, that differs from the vocational curriculum used in germany, singapore, or south korea, which have a strong 'manufacturing industries culture'. the competencies needed in manufacturing industries are different from those needed in trading and/or service industries. most jobs in indonesian industries are in the service industry. service jobs in the engineering area are mostly maintenance and repair of industrial equipment or machines. the strengthening of vocational education character could be focused on preparing a workforce that has the ability to do these jobs and solve field work problems in this area. the educational and training curriculum needs to be planned strategically to achieve effective and efficient results.

polytechnic development strategy

technology applied in various kinds of industrial equipment has changed work patterns and created new professions which need new competence qualifications, different from the previous ones. fast change causes polytechnic graduate competences to become obsolete quickly (cheng, 2005, p. 19). these conditions encourage vocational education in each country to change its curriculum planning and strategic implementation in order to maintain its existence and contribute to the country's economic development. some research data show that vocational education has become a major developer of innovative products that encourage a country's economic growth, for example in south korea, singapore, and china (yoo jeung joy nam, 2009). curricula have been developed in alignment with the economic development framework.
some countries implement vocational education by equipping their students with generic competencies that are more flexible, while still maintaining the specific competencies that give vocational education its character. the two kinds of competencies play different roles at work and are needed in different working contexts. each plays its own role, especially during the transition from education to working life. generic competencies have a positive influence outside the field of work, while technical or specific competencies have a positive influence inside the field of work (heijke, meng & ris, 2003). in the coming years, polytechnics will face a situation in which the characteristics of work organizations change across the board under the influence of the increasing importance of knowledge. this has created three trends in demand: (1) the increasing emphasis placed on education and training as an important factor affecting economic growth (world bank, 2002); (2) changes in labor market processes, which have given rise to the concept of the transitional labor market to indicate how, in modern society, the demarcation lines between work, leisure time, education, and care have been blurred, leading to increased mobility and flexibility patterns and to an overall focus on employability; and (3) the internationalization and globalization of product markets and labor markets, and their impact on higher education (dickens & arlett, 2009; marginson & van der wende, 2006; van dame, 2001; jim allen & rolf van der velden (eds.), 2009, p. 3). these signals indicate that polytechnic development should be directed to meet these demands and trends.

the strategy of strengthening vocational character

the two main characters of vocational education mentioned above, namely
a focus on professional competence development and a contribution to economic performance, will be used as guidance in designing the vocational curriculum. curriculum is a key element in an educational process (finch & crunkilton, 1999, p. 3). therefore, the strengthening of polytechnic education character can start from strategic planning and curriculum development. field data showed that state polytechnics generally deliver their education through a package system. the curriculum provisions are set by ministry regulation number 232 of 2000 on the guidelines for structuring higher education curricula and assessing student learning outcomes, which are stated in semester credit units (sks). the learning load for the d3 program is 110-120 sks, and for the d4 program 144-160 sks. this load equals 30-40 hours/week. a non-production-based polytechnic curriculum has a theory component of about 50%-60% of the total hours. the remaining hours (40%-50%) are allocated to practical training. during that time, the polytechnic curriculum was developed by the polytechnics themselves. industry involvement was minimal. the content of the curriculum was still focused on strengthening knowledge. practical activities in the laboratory and workshop, which occupy 50% of total learning hours, were not directed at building attitudes and basic professional abilities, but only at achieving knowledge-level experience. it is time for polytechnic curriculum development to involve industries more than before. the form of engagement can be upgraded from participation to collaboration. concrete examples of such activities are defining a shared vision of collaboration between polytechnic and industry, identifying the competencies needed in industry, holding general lectures with keynote speakers from industry, and industry participation in the graduate competence examination team.
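as a worked illustration of the load figures quoted earlier in this section (30-40 hours/week, of which about 50%-60% is theory in a non-production-based curriculum), the following python sketch computes the implied weekly theory and practice hours. the function name and the use of integer percentages are assumptions made for illustration only; they are not part of the regulation or of the field data.

```python
def split_weekly_hours(total_hours, theory_percent):
    """Split a weekly learning load into (theory, practice) hours.

    Hypothetical helper: the inputs are taken from the ranges quoted
    in the text, not from any official calculation rule.
    """
    if not 0 <= theory_percent <= 100:
        raise ValueError("theory_percent must be between 0 and 100")
    theory = total_hours * theory_percent / 100
    practice = total_hours - theory
    return theory, practice


# Corner cases of the quoted ranges: 30-40 hours/week, 50%-60% theory.
for hours in (30, 40):
    for percent in (50, 60):
        theory, practice = split_weekly_hours(hours, percent)
        print(f"{hours} h/week at {percent}% theory: "
              f"{theory:g} h theory, {practice:g} h practice")
```

under these assumptions, practical training takes between 12 and 20 hours per week, which gives a sense of how much contact time is available for building professional abilities.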
this strategy is adopted to narrow the gap between polytechnics and industry and to improve their communication, which has been less than effective. the basic concept of polytechnic curriculum development proposed in this study is shown in figure 2. curriculum development has to be derived from an ideal vision built together with industries. although far-reaching and long-term, the polytechnic vision needs to be as realistic as possible so that it can easily be translated into program activities at the major level and operational activities in teaching and learning, which should be developed by faculty members. curriculum development needs to be planned strategically, based on the ideal vision (kaufman, herman & watters, 2009, p. 71). this strategy is needed to ensure that the polytechnic is going in the right direction (relevant). the second step is translating the strategic planning into program planning or program development at the tactical level. this step includes planning all the resources and facilities needed to achieve the vision.

figure 2. the basic concept of polytechnic curriculum development

program efficiency can be measured based on vision achievement (achievement of professional skills and polytechnic contributions to economic development). the third step is implementing, continuously improving, and determining what should be done to achieve the vision through the planning of learning and teaching, including selecting and developing materials, selecting appropriate methods, and assessing learning achievement. programs need to be evaluated for improvement. more detailed strategic planning and curriculum development are presented in figure 3.

figure 3.
strategic planning and polytechnic curriculum development

[figure 3 elements. levels: strategic, tactical, operational. stages: strategic planning; content and instructional development; implementation and continuous improvement. inputs: skills, knowledge, attitudes and ability (skkas) requirements; technology trends, community requirements, and opportunities (current and future); industries; curriculum requirements; course requirements; learner entry (skkas); existing/required resources; polytechnic and study program vision. steps: define vision, mission(s), and goal(s); needs assessment; needs analysis; design program/course curriculum; define objectives; determine assessment method; select methods; determine contents; instructional design and development; define resources; integrate planning, quality, and assessment; select and develop materials; select methodology; skills development; determine evaluation method; establishing and developing performance; completion; remedial; revise as required. outcomes: graduates are inducted into new situations at the workplace; act critically upon trained knowledge; are responsible for what has been decided; etc.]

after the vision of the polytechnic and study program is determined, polytechnics need to conduct a needs analysis, including identification of the skills, knowledge, attitudes, and abilities needed in industries, while considering the students' development. the result of this analysis will be used as a consideration in planning or developing the curriculum. the course structure is derived from the curriculum which has been determined. resources are designed in accordance with the evaluation of the existing condition and consideration of future requirements. the next step is designing and developing the instructional methods needed in the process of learning and teaching. student performance is built and developed through the learning and teaching process.
the content of the curriculum is designed in accordance with the needs of individuals, industry, and the polytechnic, including learning outcomes standards.

practical paradigm in curriculum planning

according to kaufman's theory, there are two curriculum planning perspectives: the inside-out planning perspective and the outside-in planning perspective. the difference between the two lies in how one views the world. conventionally, the curriculum is designed from the viewpoint of the organization (school) as the client. this usually carries assumptions that complicate major changes, such as altering a system's major mission or identifying opportunities and problems which currently do not exist. it is reactive planning. mission(s), goals, objectives, and activities are defined in reaction to changes. activity is very high in the implementation phase, but the contribution to change is not significant. a more detailed explanation of this difference is presented in figure 4.

figure 4. inside-out and outside-in planning perspectives (kaufman, 2002, p. 33)

outside-in planning views society as the primary client. planning in this way is as if one were looking into the organization, its results, and its efforts. it is proactive planning. mission(s), goals, and objectives are determined by what we want to accomplish in the future. all activities aim at achieving the targeted changes. in this perspective, schools actively contribute to change. based on this theory, the polytechnic curriculum should be planned from the proactive perspective in order to increase its contribution to future change.

tactical planning

curriculum tactical planning is needed as a bridge between strategic planning and operational activities. this stage requires a system analysis. it needs to identify the key aspects related to the learning and teaching process. there are at least four key aspects which need to be considered in curriculum planning, namely
context, curriculum design, skills development and assessment (dickens & arlett, 2009). the context aspect of learning and teaching in the engineering area includes setting standards and identifying learning outcomes in terms of knowledge and understanding, intellectual abilities, practical skills, and transferable skills, in order to address the needs for employability, entrepreneurship, and internationalization skills.

[figure 4 panels: (a) inside-out perspective (reactive): educational resources, methods, structure, and learning results; current societal/community consequences. (c) outside-in perspective (proactive): educational resources, methods, structure, and learning results; desired societal/community consequences.]

kaufman's theory of inside-out and outside-in planning needs to be considered in planning the curriculum. if a polytechnic wants to increase its contribution to economic development in the future, it needs to apply proactive curriculum planning by involving industries in defining the vision, mission(s), and learning outcomes needed by industry or the world of work. the big problem now is poor communication between polytechnics and industries. the flow of information between them has broken down. polytechnics will lose the information needed to develop their curriculum and students' competences and, in the worst case, will be isolated from industry. this condition means disaster for polytechnics. establishing and strengthening collaboration and cooperation between polytechnics and industries is unavoidable. this collaboration and cooperation should be conducted at the "b-to-b" (institution) level. the skills development aspect should consider the most effective way of providing opportunities for students to develop their professional competencies by embedding them within a module.
this approach not only helps students adapt to new material and conditions, but also helps them learn within a relevant context. the competencies to be developed are professional competencies, whose core components consist of knowledge or cognitive competence, functional competence, value or ethical competence, and personal or behavioural competence (nicholls, 2001, p. 124). knowledge competence is the ability to put appropriate work-related knowledge to effective use. all courses have to ensure that students develop the skills and knowledge which will enable them to think and act critically and effectively. functional competence is the ability to perform a range of work-based tasks effectively to produce specific outcomes. time management is the most important skill for performing work-based tasks effectively. value or ethical competence is the possession of appropriate personal and professional values and the ability to make judgments based on these in work-related situations. it can be enhanced by case studies which support the learning and teaching of ethics. personal or behavioural competence is the ability to adopt appropriate behaviours in work-related situations. personal competencies, including interpersonal and intrapersonal skills, are needed to support professional work. interpersonal skills are abilities to relate to other people, whether members of one's own team or of other teams, for example: motivation skills, leadership skills, negotiation skills, communication skills, relationship building, public speaking skills, self-development, etc. intrapersonal skills are abilities to exercise self-control and/or self-direction, e.g. time management, stress management, change management, transforming beliefs, creative thinking, self-learning, etc. these components of professional competence should be developed. the question is how and when they should be developed. the key word in professional competence is expertise.
the power and status of professional workers depend to a significant extent on their claim to unique forms of expertise, which are not shared with other occupational groups, and the value which is placed on that expertise (eraut, 1994, p. 14). the development of professional competencies at all levels needs a work environment. professional development is essentially a learning process. learning is defined as changes in knowledge, understanding, skills, and attitudes brought about by experience and reflection upon that experience, whether that experience is structured or not (nicholls, 2001, p. 54). based on these theories, vocational character could be formed through a cyclical process of learning. individuals, groups, and organizations should have learning experiences as described in figure 5.

figure 5. the vocationalization process in polytechnic education

assessment is a broad term defined as a process for obtaining information that is used for making decisions about students; curricula, programs, and schools; and educational policy (nitko & brookhart, 2011, p. 3). assessment needs to be an integral part of the development process and to focus on learning. assessment in this case is a process for obtaining information about how far professional skills have been achieved, students' performance has improved, and polytechnics' productivity has increased. professional skills consist of knowledge skills, thinking skills, personal skills, personal attributes, and practical skills. skill is the ability to use one's knowledge effectively and readily in execution or performance (eraut, 1994, p. 72). skills can be divided into two categories: hard skills and soft skills. hard skills are easier to describe and recognize because they mostly concern how to follow a procedure for certain work, for example using a computer to collect, analyze, and display information from data.
soft skills are harder to observe, except in their absence; examples include being nice, listening, taking the initiative, and problem solving. soft skills fall into three categories: (1) reasoning or logic skills; (2) interpersonal and communication skills; and (3) leadership, management, and entrepreneurial skills. these skills need to be assessed formatively and summatively. formative assessment of a student's achievement judges the quality of that achievement while the student is still in the process of learning. information gathered from formative assessment may be used to guide the next learning steps. summative assessment of a student's achievement may be used to judge the quality of that achievement after the instructional process is completed. this information may be used to evaluate students' achievements, curricula, and programs. the change in students' performance reflects the result of an education or training program. students' performance should be assessed at four levels: (1) student response (satisfaction), (2) student learning, (3) behavior changes, and (4) results. this approach refers to kirkpatrick's theory of evaluating training programs (kirkpatrick d. l. & kirkpatrick j. d., 2006, p. 21). the most important thing is getting a positive reaction, because the future of a program depends on it. if participants do not have a positive reaction, they probably will not be motivated to learn. a positive reaction may not ensure learning, but a negative reaction almost certainly reduces the possibility of its occurrence. learning can be defined as the extent to which participants change attitudes, improve knowledge, and/or increase skills as a result of attending the program.
students' performance has been used as a measure of the success of an education or training program.

[figure 5 elements: concrete work experiences; observation and reflection; forming new knowledge of work, understanding, specific skills, and attitudes; integrating new knowledge and experience into learning materials and laboratory activities; testing the implications of new knowledge, understanding, specific skills, and attitudes in new situations/problems; work environment; school environment; assessment and evaluation system.]

polytechnic contribution to the economy may be measured by its contribution in supplying semiskilled and skilled workers, the number of patents obtained by faculties or the institution, and the number of innovative products which have economic value and are recognized by the community(ies). assessment is also needed for curriculum and program. the effectiveness of the curriculum can be measured based on the achievement of objectives. this requires proficiency standards, which consist of content and performance standards. content standards describe the subject matter, facts, concepts, principles, and so on that students are expected to learn. performance standards describe what students can perform or do once the content standards are learned. evaluation is defined as a process of making a value judgement about the worth of a student's product or performance. evaluation may or may not be based on measurements or test results; it may also be based on counting things, using checklists, or using rating scales. evaluation of a student's product or performance is based on achievement indicators agreed on by both the polytechnic and industries. evaluations tend to summarize strengths and weaknesses and describe whether a properly implemented program or procedure has attained its stated goals and objectives.
assessment within professional development needs to reflect the specific process of learning. assessment should reflect what students can apply, analyze, and critically reflect on from what they have learnt. based on the aforementioned theory, the vocational character of polytechnic education can be strengthened through four approaches, which are described in figure 6: (1) lecturing approach; (2) laboratory practice; (3) apprenticeship in industry or in the workplace; and (4) evaluation system.

curriculum delivery to strengthen vocational character

the aforementioned strengthening of vocational character can be done in at least four ways. figure 6 shows the process of strengthening polytechnic education that can be carried out by polytechnics with a non-production-based curriculum. character strengthening in learning and teaching activities can be conducted through many methods, such as giving students more opportunity to learn actively, to be more active in discussion, and to study theory and science in practice independently. enquiry-based learning methods, for instance project-based learning, problem-based learning, or investigation studies, can be used to give students the opportunity to learn something deeply. teaching evaluation methods, e.g. tests or quizzes, can be selected to motivate students to be more active. vocational character can also be strengthened through practical activities in the laboratory. these activities can provide many learning outcomes, such as gaining practical skills, gaining experience in using fine instruments, designing a test program, linking theory and practice, collecting data, gaining analysis skills, making judgments, building teamwork ability, improving interpersonal ability, and so on. all practical activities need to be focused on forming professional competencies. professional status is a social recognition of expertise in doing the job (eraut, 1994, p. 100).
it is also associated with the evaluation system and the variables which are assessed.

figure 6. the approach to strengthening polytechnic education character

professional skills are the vocational character that has to be reinforced. students can reinforce their professional skills in the workplace in a certain profession/occupation. the assessment system can be utilized to improve vocational character. aspects to be assessed should include: (1) personality aspects, especially ethics, which consists of honesty (including academic honesty, i.e. no plagiarism or falsification), respectfulness, and a sense of responsibility; (2) gaining or developing skills, knowledge, attitudes and ability (skkas), including consistent implementation of k3 (occupational health and safety) and discipline in action; (3) developing students' creativity; (4) recognition of learning outcomes, delivered through a learning outcomes evaluation system. the consequences of implementing this teaching and learning concept are the need to change teaching and learning methods and the evaluation system. polytechnics also need to develop instruments and learning outcomes standards to improve their evaluation system. the third approach that can be used to strengthen vocational character is apprenticeship in industry or the workplace. students will gain real experience first-hand (fry, ketteridge & marshall, 2009, p. 14). good experiences (successes) and bad experiences can both be used as a learning source. apprenticeship should be focused on achieving a technical competence certificate, which can be used to access the world of work. technical competencies can be used as a foundation for students' future career development. the role of apprenticeship in polytechnic education is very important, because it is the only occasion for students to gain specific competencies which cannot be acquired on campus.
the competence certificate obtained from apprenticeship is expected to have added value for students which is recognized by the world of work. this certificate can be used as a diploma supplement (ds).

[figure 6 content: the approaches to strengthening the vocational character of polytechnic education.
learning and teaching activities: more interactive learning; more participation in discussion; integrating field problems into learning material as exercises; improving teaching methods, e.g. project-based learning, problem-based learning, workplace learning, work-based learning, etc.
laboratory and workshop activities: using instruments/measuring equipment finely; learning data collection methods; proving theory; linking theory and practice; learning data processing techniques; gaining analytical skills; improving data management skills; improving data presentation.
apprenticeship in industry/workplace: choosing a field of work of interest; choosing the level of competence achievement; taking the prequalification test of competence; planning the steps of achievement; taking the competence examination.
evaluation system improvement: identifying factors determining vocationalization; determining categories of performance; setting performance standards on educational assessments and criteria for evaluating the process.]

the fourth approach is improving the evaluation system. educational evaluation is one of the ways to determine the effectiveness of an educational program (worthen & sanders, 1981, p. 1; kirkpatrick d. l. & kirkpatrick j. d., 2006, p. 3). evaluating is defined as the process of making a value judgment about the worth of a student's performance (nitko & brookhart, 2011, p. 6). determining this value judgment needs standards as the quality assurance of educational and training outcomes.
the lack of learning outcomes standards for assessment leads to varying quality of learning outcomes, as is now happening in several study programs in polytechnics. therefore, polytechnics need to develop evaluation standards as a concrete action toward educational quality assurance. based on kolb's experiential learning cycle and the strengthening approaches described in figure 6, the process of strengthening vocational character in polytechnic education can be described as in figure 7. reflection and learning are key elements of any professional development program, and increasingly these particular elements are being assessed.

figure 7. the individual strengthening process of vocational characters

polytechnics should develop a monitoring system to reveal the progress of professional skills development. the critical points that should be monitored are the transition points between school and work and between one stage and the next. these points are crucial because advisors, faculty members, and the community make judgments and assessments about the individual, often requiring something new before transfer to the next stage is endorsed. based on these theories and the descriptions above, the hypothetical model of strengthening vocational character for polytechnic education can be described as in figure 8. this model has been socialized and received a positive response, and it has made polytechnic management optimistic about gaining their polytechnic education character in the future.
most respondents proposed that detailed guidelines are needed to implement this model, because it requires a big change in the habits of teaching and learning in polytechnics. a lack of financial support to provide all necessary resources, including materials development and training for the faculty members, needed to achieve this mission may be the major obstacle to implementing this model. based on the current data on faculty members and participants' suggestions, step-by-step training for faculty members and education personnel is needed to deliver the model.

[figure 7 elements: polytechnic environment; workplace/industry environment; forming new theories or knowledge in class or from the environment; integrating new knowledge and experience into learning materials and laboratory activities; observation and reflection; forming abstractions, concepts, generalizations, and skills; concrete experience; observation and reflection; applying knowledge and theory in the workplace; improved skills and knowledge; transition points marked between stages.]

figure 8.
the hypothetical model of strengthening vocational character of polytechnic education

[figure 8 elements.
input: considerations (vision; mission(s); goals setting; standard setting); planning (strategic; tactical; operational; implementation and evaluation).
process: learning in the school environment (get theory and new knowledge; integrate theories and practice; get knowledge experiences; enhance basic skills and attitudes); learning in the work environment (get theory and new practice knowledge; enhance specific skills and attitudes; reflection on professional experiences; create innovative ideas or solve problems).
evaluation: evaluation levels (reaction; learning; behavior; result); skills to be assessed (knowledge skills; thinking skills; personal/behavior skills; value/ethical skills); polytechnic contributions to be assessed (supply of skilled workers; innovative products; innovative social problem solving; applied research products); revise as required.]

conclusion

based on the research findings, the following conclusions can be drawn: (1) environmental factors are very influential in shaping the character and recognition of polytechnic education; (2) vocational character is gained most through apprenticeship in industry or the workplace, supported by the evaluation system; (3) gaining vocational character has to be done cyclically, and it needs to be managed by maintaining and strengthening cooperation between polytechnics, industries, and competent institutions which can develop the education system.

references

cheng, y. c. (2005). new paradigm for reengineering education. netherlands: springer.
dickens & arlett. (2009). key aspects of teaching and learning in engineering. in h. fry, s. ketteridge, & s. marshall (eds.), a handbook for teaching and learning in higher education (pp. 264-281). new york, ny: routledge.
eraut, m. (1994). developing professional knowledge and competence.
london: falmer press.
finch, c. r. & crunkilton, j. r. (1999). curriculum development in the vocational and technical education. planning, content and implementation. boston: allyn and bacon.
fry, h., ketteridge, s., & marshall, s. (2008). understanding student learning. in k. m. fry.h, a handbook for teaching and learning in higher education. enhancing academic practice (pp. 8-26). new york, ny: routledge.
heijke, h., meng, c., & ris, c. (2003). fitting to the job: the role of generic and vocational competencies in adjustment and performance. labour economics, 10, 215-229.
kauffman, r., herman, j., & watters, k. (2002). educational planning, strategic, tactical, and operational. lancaster: technomic publishing.
kirkpatrick, d. l. & kirkpatrick, j. d. (2006). evaluating training programs (3rd ed.). san francisco: berrett-koehler publishers.
nicholls, g. (2001). professional development in higher education. new dimensions and directions. new york, ny: kogan.
nitko, a. j. & brookhart, s. m. (2011). educational assessment of students. boston: pearson education inc.
thompson, j. f. (1973). foundation of vocational education. new jersey: prentice-hall inc.
unesco. (1992). new direction in the technical and vocational education. bangkok: unesco principal regional office for asia and the pacific.
worthen, b. r. & sanders, j. r. (1994). educational evaluation: theory and practice. worthington, ohio: charles a. jones.
yoo jeung joy nam. (2009). pre-employment skills development strategies in the oecd. washington dc: social protection and labo.
research and evaluation in education journal
e-issn: 2460-6995
volume 1, number 1, june 2015 (100-113)
available online at: http://journal.uny.ac.id/index.php/reid

modified robust z method for equating and detecting item parameter drift

1) rahmawati; 2) djemari mardapi
1) center of educational assessment, indonesia; 2) yogyakarta state university, indonesia
1) rahmapepuny2011@gmail.com; 2) djemarimardapi@gmail.com

abstract

this study is aimed at: (1) revising the criterion used in the robust z method for detecting item parameter drift (ipd), (2) identifying the strengths and weaknesses of the modified robust z method, and (3) investigating the effect of ipd on examinees' classification consistency using empirical data. this study used two types of data. the simulated data were the responses of 20,000 students to 40 dichotomous items, generated by manipulating six variables: (1) ability distribution, (2) difference in ability between groups, (3) type of drifting, (4) magnitude of drifting, (5) anchor test length, and (6) number of drifting items. the empirical data were the responses of 4,187,444 students to the un sd/mi 2011, which was administered in 41 test forms for each of indonesian language, mathematics, and science. the modified robust z method was used to detect ipd and the irt true score equating method was used to analyze the classification consistency. the results of this study show that: (1) the criterion of a 0.5 point raw score tcc difference leads to 100% consistency in passing classification, (2) the modified robust z method is accurate in detecting b and ab drifting when the anchor test length is at least 25%, and (3) ipd occurring in the empirical data affected the passing status of more than 2,000 students.
keywords: robust z method, item parameter drift, irt true score equating

introduction

the use of multiple test forms that are considered parallel has recently become widespread. multiple test forms are used for test security and to make it harder for examinees to copy from one another. another reason for designing parallel test forms is to minimize the chance of practicing the test. if a particular examinee can take the test twice or more, then reusing the same test form would expose the items frequently, and the examinee may recall and practice the items.

although the forms are designed to be parallel, it is very hard to make multiple test forms perfectly parallel. different items will have different levels of difficulty, even when they are written from the same item specification. differences in item difficulty can raise fairness issues. an easier test form advantages the examinees who took it, while examinees who took a more difficult form will get lower scores that are not caused by lower ability. thus, comparing scores between groups who took different test forms will lead to a biased result.

the non equivalent anchor test (neat) design is a way to design parallel test forms so that differences in difficulty level, as well as differences in the groups' ability, can be adjusted. the adjustment is determined by anchor items. an example of a national test that uses the neat design is the national exam (ne) for elementary schools (es) and madrasah ibtidaiyah/mi (islamic-based elementary schools), commonly called ne es/mi (un sd/mi in indonesian). ne es/mi items are constructed by provincial item writing teams. all provinces used the same test specification and item indicators.
each province then has its own test, which differs from one province to another. to maintain the function of the test as a national measurement tool, 25% of the items were removed and replaced by national anchor items. the national anchor items were placed in the same order and preserved exactly the same content, format, and even layout. no changes to the national anchor items were allowed. all provinces had to make sure that the anchor items were identical.

the anchor items have a very important role. the accuracy of the test form's difficulty level and the accuracy of the examinees' ability estimation depend on the quality of the anchor items. the score on the anchor items defines the difference in the groups' ability. a group which gets a higher score on the anchor items is considered to have better ability. based on the anchor items' properties, the difference in the test forms' level of difficulty can be determined and used for scoring adjustment (cook & eignor, 1991).

given their importance, the anchor items' parameters should satisfy the measurement invariance assumption. the assumption is that a parameter's value may shift only within the bounds of sampling error. instead of being stable, anchor items' parameters not uncommonly shift across subsamples, test administrations, or locations. these shifts are known as item parameter drift (ipd) and may cause bias in ability estimation.

keller and wells (2009, p. 6) investigated the impact of drifting anchor items (ipd) on the accuracy of examinees' ability estimation. the study found that the difference in the groups' ability determines the magnitude of the ipd's impact. even a single moderately drifting anchor item could bias the ability estimation.

the robust z method (huynh & meyer, 2010) is a method for detecting drifting items and for fitting the linking constants a and b which will be used in the scaling process. the robust z method applies a simple algorithm, yet still yields linking constants that are close to those of the stocking-lord method.
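as a rough illustration of the statistic behind the method, the sketch below computes a robust z value for each anchor item from the difference between its two parameter estimates, using the median and 0.74 times the interquartile range in place of the mean and standard deviation. the function names, the sample differences, and the 1.96 flagging threshold are assumptions made for illustration; the full procedure, including the fitting of the linking constants, is not reproduced here.

```python
from statistics import median, quantiles


def robust_z(differences):
    """Robust z for each value: (d - median) / (0.74 * IQR)."""
    q1, _, q3 = quantiles(differences, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; robust z is undefined")
    center = median(differences)
    return [(d - center) / (0.74 * iqr) for d in differences]


def flag_drifting(differences, threshold=1.96):
    """Indices whose |robust z| exceeds the threshold (an assumed value)."""
    return [i for i, z in enumerate(robust_z(differences)) if abs(z) > threshold]


# Hypothetical b-parameter differences (focal minus reference) for 10 anchors.
b_diff = [0.02, -0.05, 0.01, 0.04, -0.03, 0.00, 0.03, -0.02, 0.80, -0.01]
print(flag_drifting(b_diff))  # only the item at index 8 is flagged
```

because the median and interquartile range are barely moved by one extreme value, the single large difference stands out clearly in this made-up example.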
the weaknesses of the robust z method are its over-sensitivity and the absence of clear cut off criteria (arce & lau, 2011). the robust z method often flags non-drifting anchor items as drifting. the criteria used are based on the probability of occurrence in a hypothetical distribution; flagging an item as statistically significant ipd does not always mean that the impact of the drifting anchors is practically significant.

because of this criteria problem, a modification of the robust z method is necessary. the modification is aimed at detecting practically meaningful ipd. only anchor items which cause a significant practical impact will be excluded from the scaling process. the modification can inform the decision to either retain or refine the anchor items. an example of a practically meaningful impact is a change in an examinee's classification decision: passing to failing or failing to passing.

this study is aimed at: (1) revising the criterion employed in the robust z method so that the detection of item parameter drift (ipd) can be related to a practically meaningful criterion, (2) identifying the strengths and weaknesses of the modified robust z method in various conditions, and (3) investigating the effect of ipd on examinees' classification consistency in a real life situation by applying the modified robust z method to empirical data.

research method

type of research

the research is categorized as a descriptive study. the study described the strengths of the modified robust z method compared to the original version. the study also described the weaknesses of the modified robust z method and identified the test characteristics with the potential for 'practically meaningful' ipd. the impact of ipd on examinees' classification in a real life situation was also described.
the real life situation was illustrated by analyzing empirical data using the modified robust z method.

time and location

the research took place at yogyakarta state university, indonesia, the center of educational assessment, and a province that held an item writing workshop for constructing the ne es/mi in the academic year of 2013. the research was conducted in 11 months, from march 2013 until february 2014.

population and sample

the population of this study was all students who were enrolled as examinees of the ne es/mi in the academic year of 2011 and who took the main tests in all provinces of indonesia. the main tests are defined as the tests administered on the main schedule of the ne es/mi. students who took a repeated session or make-up session were excluded from the population. according to this definition, the total number of students in the population is 4,187,444.

sample selection in this study was based on the result of a cheating validation process. a school is considered a cheating school if at least one item was answered with the same incorrect response by at least 90% of the students in the school. when a school was identified as a cheating school, all of its students' responses were excluded from the database. this cheating validation process eliminated about 40% of the responses, and the numbers of responses remaining in the database were 2,509,646 for the bahasa indonesia test, 2,509,517 for the mathematics test, and 2,509,751 for the science test.

technical steps on modifying robust z method

in order to improve the criteria of the robust z method, the principle of the difference that matters (dtm), proposed by brennan (2008, p. 108) under the topic of 'population invariance', was used. under this principle, an item is considered a drifting item not only on the basis of statistical significance but also on the basis of the impact caused by the drift. how large an impact counts as significant is determined by the researchers.
the researchers set the practical impact which was considered meaningful. in this study, the practical impact used to determine whether a drifting item was meaningful or not was the change in classification consistency. if the detected drifting items changed the test score equating enough to cause any examinee to be classified differently, then the items were considered practically meaningful ipd. it is suggested to exclude practically meaningful ipd from the scaling process; otherwise the examinee classification decision may disadvantage both the examinee and the user.

the robust z method consists of several algorithms which, in the end, give the linking constants a and b. these constants were then used in the scaling process to transform the anchor and non-anchor items' parameters from the scale of a focal test form onto the scale of the reference test form. the transformed item parameters were used to plot the test characteristic curve (tcc). the point-to-point linking between the tcc of the focal test and the tcc of the transformed focal test became the conversion table for equating test scores. the equated test score was then used to decide whether an examinee passes or fails the test.

in order to evaluate the ipd impact in the modified robust z method, the formula of wyse and reckase (2011) was adapted. the formula was used to assess the difference between tcc total and tcc refinement. tcc total is the tcc of the transformed focal test when all anchor items are used in the scaling process. tcc refinement is the tcc of the transformed focal test when only non-drifting anchor items are used in the scaling process. if the difference between the two tccs is small, then the impact of ipd on classification consistency can be waived.
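the comparison of the two curves can be sketched as follows. the code assumes a two-parameter logistic model with scaling constant 1.7, rescales item parameters as a* = a/A and b* = A·b + B, and takes the largest absolute gap between the two tccs over a grid of ability values. the item parameters, the two pairs of linking constants, the ability grid, and the function names are made up for illustration; the 0.5 raw score cut off is the value reported in the abstract.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response with scaling constant 1.7."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))


def rescale(items, A, B):
    """Put (a, b) pairs on the reference scale: a* = a / A, b* = A * b + B."""
    return [(a / A, A * b + B) for a, b in items]


def tcc(theta, items):
    """Test characteristic curve: expected raw score at ability theta."""
    return sum(p_correct(theta, a, b) for a, b in items)


def max_tcc_difference(items, const_total, const_refined, grid):
    """Largest absolute gap between the two rescaled TCCs over the grid."""
    total = rescale(items, *const_total)
    refined = rescale(items, *const_refined)
    return max(abs(tcc(t, refined) - tcc(t, total)) for t in grid)


# Hypothetical focal-form item parameters (a, b) and linking constants (A, B).
items = [(1.2, -1.0), (0.8, -0.5), (1.0, 0.0), (1.5, 0.5), (0.9, 1.0)]
grid = [x / 10 for x in range(-40, 41)]          # theta from -4.0 to 4.0
diff = max_tcc_difference(items, (1.00, 0.00), (1.05, 0.10), grid)
print(round(diff, 3), "meaningful" if diff > 0.5 else "negligible")
```

when both scalings use the same linking constants the gap is exactly zero, which is a convenient sanity check before applying the cut off.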
conversely, when the difference between the two tccs is big, the ipd is practically meaningful and is suggested to be excluded from the scaling process. in this study, a cut off value of 0.5 'raw score' point was used as the maximum difference between the two tccs. this cut off ensured one hundred percent classification consistency. equations (1), (2), (3), and (4) are the formulas which were used in modifying the robust z method's criteria.

$$\max_{\theta}\left|\sum_{i=1}^{n} P_{i,cv}(\theta) - \sum_{i=1}^{n} P_{i,total}(\theta)\right| \qquad (1)$$

$$P_{i,cv}(\theta) = \frac{1}{1 + \exp\left[-1.7\, a^{*}_{y,cv}\left(\theta - b^{*}_{y,cv}\right)\right]} \qquad (2)$$

$$P_{i,total}(\theta) = \frac{1}{1 + \exp\left[-1.7\, a^{*}_{y,tot}\left(\theta - b^{*}_{y,tot}\right)\right]} \qquad (3)$$

$$a^{*}_{y,cv} = \frac{a_{y}}{A_{cv}} \quad \text{and} \quad b^{*}_{y,cv} = A_{cv}\, b_{y} + B_{cv} \qquad (4a)$$

$$a^{*}_{y,tot} = \frac{a_{y}}{A_{tot}} \quad \text{and} \quad b^{*}_{y,tot} = A_{tot}\, b_{y} + B_{tot} \qquad (4b)$$

equations (4a) and (4b) apply the linking constants a and b under two conditions: without refining ipd items ($A_{tot}$ and $B_{tot}$) and after refining ipd items ($A_{cv}$ and $B_{cv}$). both sets of linking constants were used to scale the parameters of both anchor and non-anchor items. the two sets of linking constants also lead to two tcc plots: the tcc without refinement ($\sum P_{i,total}$) and the tcc after refining ipd ($\sum P_{i,cv}$). the maximum absolute value of the difference between the two tccs was then compared to the dtm cut off value to decide whether the ipd was practically meaningful.

data, instrument, and data collection

the empirical data used in this research were collected through documentation. the ne es/mi 2011 data were copied from the center for educational assessment database; the data are therefore secondary data. the collected data were raw responses to the 41 test forms of the bahasa indonesia test, 41 test forms of the mathematics test, and 41 test forms of the science test. the answer key of each test form was also collected to complement the raw response data sets.

the instruments used in this research were analysis software.
five software packages were used in this study, namely wingen, bilog-mg, winstep, r program, and robust z modif. the software was used for generating response data, validating responses, estimating item parameters, detecting ipd, constructing conversion tables, and equating test scores.

data analysis

the analysis started by determining the item response theory (irt) model to be used. to find the most suitable model, curves of raw score against the proportion of students within each raw score group who responded correctly to particular items were manually plotted. figure 1 is an example of the anchor item curves for the mathematics test.

figure 1. curve of proportion of correct responses on mathematics test using 2.5 million responses of ne es/mi 2011 examinees

after the irt model was decided, the simulation study data were generated using the wingen (han, 2007) software. each generated dataset represented the responses of 20,000 examinees to 40 dichotomous items. there are six manipulated variables: (1) the percentage of anchor items relative to the total number of items (15%, 25%, and 40%); (2) the percentage of drifting items relative to the total number of anchor items (15%, 30%, and 45%); (3) the magnitude of drifting, of two kinds: a-parameter drifting (no drifting, moderate drifting of 0.3, and large drifting of 0.7) and b-parameter drifting (no drifting, moderate drifting of 0.5, and large drifting of 0.8); (4) the direction of ipd (symmetric in two directions, or one direction); (5) the shape of the ability distribution (normal and negatively skewed); and (6) the comparison of the ability distributions between groups (similar and different). in total, there are 188 conditions. each manipulated condition was replicated 50 times for both the reference and the focal groups, which resulted in the analysis of 18,800 datasets.
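a simulated dataset of this kind can be approximated with a few lines of python. the sketch below is not wingen; it draws abilities from a normal distribution, generates dichotomous responses under an assumed two-parameter logistic model, and adds a b-parameter drift to chosen anchor items for the focal group. the parameter ranges, the seed, the reduced sample size, and the helper names are assumptions made for illustration.

```python
import math
import random


def simulate_responses(abilities, items, rng):
    """Return a 0/1 response matrix (examinees x items) under a 2PL model."""
    data = []
    for theta in abilities:
        row = []
        for a, b in items:
            p = 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


def add_b_drift(items, drifting, magnitude):
    """Copy of items with `magnitude` added to b at the listed positions."""
    return [(a, b + magnitude) if i in drifting else (a, b)
            for i, (a, b) in enumerate(items)]


rng = random.Random(2015)
n_items, n_examinees = 40, 2000            # the study used 20,000 examinees
items = [(rng.uniform(0.5, 1.5), rng.uniform(-2.0, 2.0)) for _ in range(n_items)]
anchors = list(range(10))                  # 25% of the 40 items act as anchors
drifting = set(anchors[:3])                # 30% of the anchors drift
focal_items = add_b_drift(items, drifting, 0.5)   # moderate b-drift of 0.5

reference = simulate_responses(
    [rng.gauss(0, 1) for _ in range(n_examinees)], items, rng)
focal = simulate_responses(
    [rng.gauss(0, 1) for _ in range(n_examinees)], focal_items, rng)
print(len(reference), len(reference[0]))
```

one replication of one condition corresponds to one such pair of reference and focal datasets; the other manipulated variables would change the arguments rather than the functions.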
the percentage of manipulated drifting items detected as ipd is named the power rate, the percentage of items that were not manipulated to drift but were detected as ipd is named the type i error rate, and the percentage of tcc differences larger than the cut off value is named the dtm rate. the expected result of the simulation study is a combination of a high power rate and a low type i error rate.

the analysis of empirical data started with the calibration of the national anchor items using the national responses. the parameters estimated from the national responses were then used as references for calibrating the non-anchor items in each province. the method used to calibrate the provincial items is known as fixed item parameter calibration. the similarity of the mean and standard deviation between the non-anchor test and the anchor test was used to select the reference test form for the equating process. after the reference test form was selected, test score equating for each provincial main test form could be conducted. for each provincial main test form, there are two equating processes: one using all anchor items regardless of drifting, and one using only non-drifting anchor items.

based on the two equating processes, each examinee is classified twice. the classification consistency analysis categorizes examinees into four groups: (1) passing and keep passing, (2) passing then failing, (3) failing then passing, and (4) failing and keep failing. for each group, the proportion of examinees relative to the total number of examinees was calculated. classification consistency is the sum of the proportions of examinees in the 'passing and keep passing' and 'failing and keep failing' groups.

the analysis of empirical data also determined how often each anchor item was detected as ipd across the 41 test forms. this frequency was named the ipd rate.
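the four-way cross classification and the resulting consistency index can be sketched as below. the score lists and function names are hypothetical, the cut score of 4.0 is used only as a placeholder, and the order of the two scalings in each label (all anchors first, refined second) is an assumption.

```python
from collections import Counter


def classification_table(scores_all, scores_refined, cut):
    """Proportion of examinees in each of the four status groups."""
    def status(score):
        return "pass" if score >= cut else "fail"

    counts = Counter(f"{status(s1)}/{status(s2)}"
                     for s1, s2 in zip(scores_all, scores_refined))
    n = len(scores_all)
    groups = ("pass/pass", "pass/fail", "fail/pass", "fail/fail")
    return {g: counts[g] / n for g in groups}


def consistency(table):
    """Classification consistency: share of examinees whose status is unchanged."""
    return table["pass/pass"] + table["fail/fail"]


# Hypothetical equated scores for eight examinees under the two scalings.
all_anchors = [5.2, 4.1, 3.9, 6.0, 4.0, 2.5, 7.3, 4.2]
refined = [5.0, 3.9, 3.8, 6.1, 4.0, 2.6, 7.1, 4.3]
table = classification_table(all_anchors, refined, cut=4.0)
print(table)               # pass/pass 0.625, pass/fail 0.125, fail/pass 0.0, fail/fail 0.25
print(consistency(table))  # 0.875
```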
an anchor item that has a high ipd rate needs a detailed analysis of the source of drifting. the expected results from the empirical data are a high percentage of classification consistency and a low ipd rate.

[figure 1 plot: proportion of answering correctly, with one curve for each of items 8, 10, 17, 19, 24, 31, 35, 36, 37, and 40]

findings and discussion

results of simulation study

the power rate by type of ability distribution is presented in figure 2. the pattern of the power rate under the normal distribution is similar to the pattern under the skewed distribution. across different levels of drifting magnitude, the type of ability distribution does not produce different results. this indicates that the modified robust z method performs similarly under the two types of ability distribution.

figure 2. power rate graph of type of ability distribution across different levels of ipd magnitude

the modified robust z method is accurate when the ability of examinees in one group differs from the other group. figure 3 and figure 4 are graphs of the power rate and type i error rate of ipd detection for the interaction between the number of anchor items and the difference in ability between groups. figure 3 shows that the modified robust z method is accurate when the number of anchor items is 40% and the groups differ in ability. a power rate of 100% means that the modified robust z method detected the manipulated drifting items in all replications. a type i error rate close to 0% means that items are almost never incorrectly detected as ipd.

figure 3. power rate graph of interaction between type of distribution and ability differences among groups, across number of anchor items and type of ipd

figure 4.
type i error rate graph of interaction between type of distribution and ability differences among groups, across number of anchor items and type of ipd

the results presented in figure 5 show that using 40% anchor items can minimize the impact of ipd on classification consistency. the dtm rate for the 40% anchor item condition is close to 0%, not only for a-drift but also for b-drift, at both moderate and large drifting magnitudes. this suggests that designing multiple test forms with 40% anchor items guards against the impact of ipd that may arise. although the anchor test may contain ipd, at least its impact on classification consistency can be minimized.

[legends of figures 2 to 4: normal, skewed; same ability-normal, same ability-skewed, different ability-normal, different ability-skewed]

figure 5. dtm rate graph of interaction between type of distribution and ability differences among groups, across number of anchor items and type of ipd

table 1. power rate, type i error rate, and dtm rate based on anchor test length, number of drifting items, and ipd direction

anchor test length  number of drifting  power rate
15%                 10% (one way)       100.0
15%                 25% (symmetric)     100.0
15%                 25% (one way)       98.3
15%                 40% (one way)       17.1
25%                 10% (one way)       91.4
25%                 25% (one way)       99.4
25%                 40% (symmetric)     95.5
25%                 40% (one way)       38.6
40%                 10% (symmetric)     100.0
40%                 10% (one way)       100.0
40%                 25% (symmetric)     90.0
40%                 25% (one way)       97.1
40%                 40% (symmetric)     100.0
40%                 40% (one way)       9.5

the ipd detection rate across different proportions of drifting items shows the weakness of the modified robust z method, as presented in table 1. table 1 shows that when 40% of the anchor items drift in one direction, the power rate of the modified robust z method falls below 40% for every anchor test length, and below 20% for anchor test lengths of 15% and 40%. this finding indicates that the modified robust z method is not powerful in detecting ipd when the proportion of drifting items in the anchor test is large.
a large proportion of drifting items makes the anchor items spread evenly around the fitted regression line, hiding the fact that many items are drifting. overall, everything seems normal and there is no outlier in the distribution. the modified robust z method fails to identify which anchors are drifting and which are not.

table 1 shows that the modified robust z method is still accurate in detecting many drifting items as long as the direction of drifting is symmetric. a symmetric direction means that some items drift toward being more difficult, while others drift toward being less difficult. when the anchor test length is 40% and the number of drifting items is 40% of the anchor test length, the power rate for one-way drifting is 9.5%, while the power rate for symmetric drifting rises to 100%.

figure 6, figure 7, and figure 8 illustrate the power rate, type i error rate, and dtm rate when the ipd direction is one way and when it is symmetric (two opposite directions). the results show that the modified robust z method performs better when looking at the impact of ipd at the test level, not only at the item level. the practical impact on classification consistency is identified by the modified robust z method as an aggregate of items at the test level. even when the number of drifting items is large, if the items drift in opposite directions the effects cancel out and the practical impact can be waived.

figure 6. power rate graph of interaction anchor test length condition, number of ipd condition, ability distribution, and ipd direction.

[legends: figure 5 series same ability-normal, same ability-skewed, different ability-normal, different ability-skewed; figure 6 series labeled 15%25%, 25%40%, 40%10%, 40%25%, 40%40%]

the simulation study results show that the modified robust z method improves the performance of the original robust z method specifically at the test level. the original version cannot draw a conclusion about the impact caused by the detected drifting items as part of a test. the original robust z method only decides whether an item is drifting or not. the modified version adds information about the impact of all drifting items on the test score equating. this is similar to complementing the analysis of differential item functioning (dif) with differential test functioning (dtf) analysis. in the end, many dif items can be waived if the dtf analysis shows no difference.

figure 7. type i error rate graph of interaction anchor test length condition, number of ipd condition, ability distribution, and ipd direction.

figure 8. dtm rate graph of interaction anchor test length condition, number of ipd condition, ability distribution, and ipd direction.

results of empirical study

table 2, table 3, and table 4 present the anchor item parameter estimates for the bahasa indonesia test, mathematics test, and science test after they were calibrated using the national data. each table reports the parameters for one subject.

table 2. anchor items parameter for bahasa indonesia test

item code  location parameter  slope parameter
bin 18     -2.442              1.791
bin 20     -2.035              0.829
bin 21     -2.752              0.898
bin 22     -2.173              1.147
bin 23     -1.917              1.372
bin 25     -2.640              1.307
bin 27     -2.756              1.309
bin 31     -1.796              0.705
bin 32     -2.438              1.913
bin 35     -1.999              1.304
bin 36     -0.997              0.811
bin 37     -2.186              0.942
bin 40     -2.776              1.164

table 3. anchor items parameter for mathematics test

item code  location parameter  slope parameter
mat8       -0.722              1.261
mat10      -1.622              0.635
mat17      -1.328              1.350
mat19      -0.939              1.145
mat24      -0.888              1.227
mat31      -0.632              1.283
mat35      1.479               0.328
mat36      -0.451              0.856
mat37      -1.552              1.144
mat40      -7.492              0.093

table 4.
anchor items parameter for science test

item code  location parameter  slope parameter
ipa2       -2.869              0.974
ipa3       -2.611              1.102
ipa9       0.027               0.466
ipa10      -1.506              0.768
ipa18      -1.475              1.337
ipa23      -0.536              0.519
ipa27      1.204               0.257
ipa29      -2.225              0.843
ipa32      -0.889              0.951
ipa38      -2.348              1.132

[legends of figures 7 and 8: series labeled 15%25%, 25%40%, 40%10%, 40%25%, 40%40%]

the anchor test parameters were used to calibrate the non-anchor items (fixed item parameter calibration) and to select the best reference test form. ipd detection was carried out using the modified robust z method across the 41 test forms of each subject. table 5 presents the ipd rate for each anchor item.

table 5. ipd rate of each anchor item over 41 test forms

id      % ipd   id      % ipd   id      % ipd
bin 18  3       mat 8   18      ipa 2   45
bin 20  8       mat 10  90      ipa 3   15
bin 21  35      mat 17  10      ipa 9   93
bin 22  13      mat 19  13      ipa 10  85
bin 23  15      mat 24  8       ipa 18  0
bin 25  48      mat 31  28      ipa 23  5
bin 27  8       mat 35  53      ipa 27  46
bin 31  58      mat 36  20      ipa 29  18
bin 32  0       mat 37  15      ipa 32  3
bin 35  68      mat 40  100     ipa 38  18
bin 36  60
bin 37  20
bin 40  3
dtm     73      dtm     95      dtm     93

the results show that the bahasa indonesia test has one anchor item detected as ipd in more than 60% of the test forms (bin 35), the mathematics test has two anchor items detected in at least 85% of the test forms (mat 10 and mat 40), and the science test also has two anchor items detected in at least 85% of the test forms (ipa 9 and ipa 10). the simulation study showed that the modified robust z method detects ipd accurately. an ipd rate of 85% in the empirical data therefore means that the item is truly a drifting item.

the anchor items which were detected as drifting items were then taken into consideration in the scaling process. the impact of the drifting items determined whether they were practically meaningful or not. in the empirical data analysis, an examinee is considered to pass the test if the score in each subject is at least 4.00.
scoring was conducted twice: with refinement and without refinement. for each subject, each examinee therefore has two passing statuses. table 6, table 7, and table 8 present the proportions of examinees by status based on the two scoring processes: table 6 for the bahasa indonesia subject, table 7 for the mathematics subject, and table 8 for the science subject.

table 6 summarizes the analysis results for the bahasa indonesia test's passing status. for eleven of the 41 test forms used, ipd does not make the difference between the tccs bigger than the dtm cut off value. careful examination of these eleven test forms showed that when the difference is less than the dtm cut off value, the classification consistency is 100%. no examinee changes passing status between the two scaling conditions. this indicates that the cut off criterion of 0.5 raw score point guarantees 100% classification consistency.

table 7 shows that even a single drifting item with a large magnitude, such as mat 40, has a large impact on classification consistency. the dtm rate for the mathematics test is very close to 100%. the share of inconsistent classification at the national level is also very large, about 25.58%. this corresponds to about 621,600 students. this is a very large number and a significant result. these 621,600 students represent the student population in the eastern part of indonesia.

the smallest percentage of inconsistent classification presented in table 8 is 0.05%. this percentage seems small, but given indonesia's large population it is equal to 2050 students who were enrolled in the ne es/mi 2011. if those 2050 students are assumed to continue their study in junior high schools/madrasah tsanawiyah (islamic-based jhs), each with a capacity of 100 students, it means that about 20 jhs/mts will be filled with students who do not meet the required standard and who passed the test only because of measurement error.
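the headline figures can be checked with simple arithmetic. the sketch below multiplies the national pass/fail percentage for mathematics (25.58%) by the national number of students reported in table 7 (2,430,049), and divides the 2050 students quoted above by the assumed capacity of 100 students per school. the variable names are illustrative only.

```python
national_math_students = 2_430_049   # national row of table 7
pass_fail_share = 25.58 / 100        # national pass/fail percentage, mathematics

affected = national_math_students * pass_fail_share
print(round(affected, -2))           # roughly 621,600 students

misclassified = 2050                 # number of students quoted in the text
school_capacity = 100                # capacity per school assumed in the text
schools = misclassified // school_capacity
print(schools)                       # 20 schools
```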
table 6. percentage of classification consistency of passing status based on bahasa indonesia test

test form    number of students  pass/pass  pass/fail  fail/pass  fail/fail  dtm
bin_01_p01   61,195     99.74  0      0     0.26  no
bin_01_p02   72,992     99.62  0.07   0     0.31  yes
bin_02       398,178    99.51  0.09   0     0.40  yes
bin_03_p01   116,156    99.84  0      0     0.16  no
bin_03_p2/3  289,772    99.88  0.03   0     0.09  yes
bin_04_p01   43,573     99.95  0      0     0.05  no
bin_05_p01   448,289    99.61  0.08   0     0.31  yes
bin_06_p01   45,432     97.54  0      0     2.46  no
bin_07_p01   80,311     98.08  0      0.41  1.50  yes
bin_08_p01   69,929     99.79  0      0     0.21  yes
bin_09_p01   78,401     99.57  0.09   0     0.34  yes
bin_10_p01   16,133     98.76  0      0.23  1.01  yes
bin_11_p01   71,833     99.03  0      0     0.97  yes
bin_12_p01   98,296     99.44  0.14   0     0.42  yes
bin_13_p01   66,601     98.41  0.36   0     1.23  yes
bin_14_p01   25,581     99.19  0      0     0.81  no
bin_15_p01   55,073     99.33  0.16   0     0.51  yes
bin_16_p01   55,640     99.47  0.11   0     0.42  yes
bin_17_p01   5,058      45.06  50.04  0     4.90  yes
bin_18_p01   11,002     98.85  0      0     1.15  yes
bin_19_p01   19,011     98.81  0      0     1.19  no
bin_19_p02   12,944     99.79  0      0.06  0.15  yes
bin_20_p01   6,461      98.64  0      0     1.36  yes
bin_21_p01   5,692      97.86  1.35   0     0.79  yes
bin_22_p01   25,854     99.97  0      0     0.03  yes
bin_23_p01   45,069     98.35  0      0.32  1.33  yes
bin_24_p01   12,966     93.95  0      0     6.05  no
bin_25_p01   16,592     92.19  0      0     7.81  yes
bin_26_p01   15,232     99.65  0      0     0.35  yes
bin_28_p01   17,629     99.93  0.02   0     0.05  yes
bin_29_p01   10,941     99.12  0      0     0.88  no
bin_30_p01   127,834    99.21  0      0     0.79  no
bin_31_p01   21,547     99.94  0.01   0     0.05  yes
bin_32_p01   10,401     97.66  0      0     2.34  no
bin_33_p01   8,488      94.62  0      0     5.38  no
national     2,500,100  97.42  1.32   0.03  1.23

table 7.
percentage of classification consistensy of passing status based on mathematics test test form number of students pass/pass pass/fail fail/pass fail/fail dtm mat_01_p01 61,194 21.78 68.96 0.00 9.26 yes mat_01_p02 72,990 31.30 61.24 0.00 7.47 yes mat_02 398,119 85.47 9.41 0.00 5.12 yes mat_03_p01 116,119 92.60 0.00 1.62 5.78 yes mat_03_p02 137,498 94.63 0.00 0.00 5.37 no mat_03_p03 152,318 89.60 2.44 0.00 7.96 yes mat_04_p01 43,569 97.13 0.00 1.41 1.46 yes mat_05_p01 448,303 92.48 0.00 1.40 6.12 yes mat_06_p01 45,403 62.37 29.90 0.00 7.73 yes mat_07_p01 80,314 66.51 23.06 0.00 10.43 yes mat_09_p01 78,393 0.79 93.84 0.00 5.37 yes mat_10_p01 16,133 68.62 21.38 0.00 10.00 yes mat_11_p01 71,822 66.89 28.01 0.00 5.10 yes mat_12_p01 98,266 60.83 27.88 0.00 11.29 yes mat_13_p01 66,600 32.40 28.92 0.00 38.68 yes mat_14_p01 25,581 62.54 31.11 0.00 6.36 yes mat_15_p01 55,078 72.25 7.49 0.00 20.26 yes mat_16_p01 55,636 69.45 14.52 0.00 16.02 yes mat_17_p01 5,058 22.18 57.85 0.00 19.97 yes mat_18_p01 11,004 69.43 21.21 0.00 9.36 yes mat_19_p01 19,015 74.55 14.18 0.00 11.26 yes mat_19_p02 12,949 88.96 9.17 0.00 1.87 yes mat_19_p03 25,767 78.74 12.93 0.00 8.32 yes mat_20_p01 6,456 69.05 20.12 0.00 10.83 yes mat_21_p01 5,692 80.32 16.30 0.00 3.37 yes mat_22_p01 25,854 97.51 1.54 0.00 0.96 yes mat_23_p01 45,073 66.86 22.40 0.00 10.73 yes mat_24_p01 12,966 4.90 26.02 0.00 69.08 yes mat_25_p01 16,551 45.77 32.56 0.00 21.67 yes mat_26_p01 15,231 73.29 13.35 0.00 13.35 yes mat_27_p01 8,252 37.71 36.40 0.00 25.88 yes mat_28_p01 17,629 14.95 78.37 0.00 6.68 yes mat_29_p01 10,942 77.19 13.53 0.00 9.29 yes mat_30_p01 127,831 82.37 14.18 0.00 3.46 yes mat_31_p01 21,548 25.37 60.24 0.00 14.40 yes mat_32_p01 10,407 62.27 22.30 0.00 15.43 yes mat_33_p01 8,488 6.95 81.73 0.00 11.32 yes national 2,430,049 62.20 25.58 0.12 12.10 research and evaluation in education journal modified robust z method for equating... 111 rahmawati & djemari mardapi table 8 . 
percentage of classification consistensy of passing status based on science test test form number of students pass/pass pass/fail fail/pass fail/fail dtm ipa_01_p01 61,195 95.82 2.83 0.00 1.35 yes ipa_01_p02 72,988 96.07 2.63 0.00 1.30 yes ipa_02 398,196 92.70 5.98 0.23 1.09 yes ipa_03_p01 116,123 99.30 0.00 0.00 0.70 no ipa_03_p02 137,504 99.65 0.12 0.00 0.22 yes ipa_03_p03 152,321 99.46 0.00 0.00 0.54 yes ipa_04_p01 43,570 99.86 0.05 0.00 0.10 yes ipa_05_p01 448,309 98.70 0.00 0.32 0.97 yes ipa_06_p01 45,444 97.33 0.00 0.71 1.96 yes ipa_07_p01 80,309 96.16 0.00 1.13 2.71 yes ipa_08_p01 69,932 99.11 0.30 0.00 0.59 yes ipa_09_p01 78,421 98.48 0.00 0.77 0.75 yes ipa_10_p01 16,133 98.76 0.00 0.00 1.24 yes ipa_11_p01 71,848 99.13 0.00 0.00 0.87 yes ipa_12_p01 98,279 98.00 0.00 0.00 2.00 yes ipa_13_p01 66,598 93.94 0.00 0.00 6.06 no ipa_14_p01 25,580 98.32 0.00 0.00 1.68 yes ipa_15_p01 55,080 94.96 0.00 1.49 3.55 yes ipa_16_p01 55,639 96.74 0.00 0.00 3.26 yes ipa_17_p01 5,058 35.11 59.15 0.00 5.73 yes ipa_18_p01 11,003 96.95 0.79 0.00 2.26 yes ipa_19_p01 19,015 98.85 0.00 0.00 1.15 yes ipa_19_p02 12,948 99.86 0.05 0.00 0.08 yes ipa_19_p03 25,739 98.41 0.50 0.00 1.09 yes ipa_20_p01 6,457 98.44 0.53 0.00 1.04 yes ipa_21_p01 5,692 98.84 0.67 0.00 0.49 yes ipa_22_p01 25,855 99.82 0.00 0.00 0.18 yes ipa_23_p01 45,073 97.60 0.00 0.71 1.70 yes ipa_24_p01 12,966 93.38 1.61 0.00 5.01 yes ipa_25_p01 16,595 90.32 0.00 0.00 9.68 yes ipa_26_p01 15,232 99.38 0.00 0.00 0.62 yes ipa_27_p01 8,253 92.88 3.55 0.00 3.57 yes ipa_28_p01 17,628 98.79 0.96 0.00 0.25 yes ipa_30_p01 127,839 99.65 0.00 0.00 0.35 yes ipa_31_p01 21,547 94.10 5.43 0.00 0.48 yes ipa_32_p01 10,408 97.53 1.40 0.00 1.07 yes ipa_33_p01 8,494 94.28 0.00 0.00 5.72 no national 2,489,271 95.59 2.34 0.14 1.93 a deep attention must be put to answer the results of classification consistency. the table shows that inconsistent classification is mostly in categories of passing, while in fact, the status is failing. 
this means that thousands, even hundreds of thousands, of students were judged to have passed the test while their competencies were in fact still below the standard. this inconsistency matters because the ne score is then used as a selection tool for entering secondary schools. the learning process therefore cannot begin from the right starting point: these students need to repeat or remedy the primary school competencies they lack before continuing to a higher level of competency.

summary and suggestion

summary

the analysis showed that using an external criterion of a 0.5 raw-score point tcc difference enables the modified robust z method to provide information about the consequences of ipd for classification consistency. if the tcc difference is less than 0.5 raw-score point, the classification consistency is 100%. the modified robust z method performs better because it examines the impact of ipd at the test level, not only at the item level. it identifies the practical impact on classification consistency as the aggregate effect of the items at the test level. even when the number of drifting items is large, if the items drift in opposite directions the effects cancel out and the practical impact can be disregarded. applying the modified robust z method to empirical data shows that the impact of ipd was very significant for the ne es/mi 2011 examinees. at least 2,000 students were classified as passing although their competencies were not sufficient to pass the exam and continue to secondary education.

suggestion

multiple test forms are used more and more frequently, so test score equating has to be performed. the heterogeneity of ability across provinces in indonesia also creates potential for drifting items, which in this study had a large impact on classification consistency.
in light of these facts, the modified robust z method is suggested for both detecting drifting items and equating test scores, especially when the multiple-test-form design employs a set of anchor items and the scores are used for passing classification. the analysis shows that, to minimize the effect of ipd on classification consistency, an anchor test length of 40% is suggested. this proportion carries a considerable risk, both to the security of the anchor items, which may become overexposed, and in terms of reduced item variance across provinces. the rule of thumb for anchor test length is 20% (hambleton, swaminathan, & rogers, 1991). to better guard against drifting items while still controlling item exposure and maintaining variability across provinces, the 40% anchor test can be constructed using a matrix sampling design: the anchor test is split into several clusters, and each cluster shares overlapping items with the others.

this study also has limitations. the conditions simulated in this study are too few to represent the full variety of conditions in real-life situations. it is therefore suggested that this study be extended to broader conditions so that the strengths and weaknesses of the modified robust z method can be analyzed comprehensively. this study also only estimates the impact of drifting items on classification consistency; no analysis was performed of the performance of the modified robust z method in terms of ability estimation accuracy or scaling equation accuracy. thus, a study that employs a similar method but focuses on the consequences for ability estimation accuracy is strongly suggested. results on ability estimation accuracy or scaling constant accuracy would complement the results of this study.

references

arce, a. j., & lau, a. c. (2011). statistical properties of 3pl robust z: an investigation with real and simulated data sets. paper presented at the annual meeting of the national council on measurement in education, new orleans, louisiana.
brennan. (2008). a discussion of population invariance. applied psychological measurement, 32(1), pp. 102-114.
cook, l. l., & eignor, d. r. (1991). irt equating methods. educational measurement: issues and practice, 10, pp. 37-45.
hambleton, r. k., swaminathan, h., & rogers, h. j. (1991). fundamentals of item response theory. newbury park, ca: sage.
han, k. (2007). wingen: windows software that generates irt parameter and item responses. applied psychological measurement, 31, pp. 457–459.
huynh & meyer. (2010). use of robust z in detecting unstable items in item response theory models. practical assessment, research & evaluation, 15(2).
keller & wells. (2009). the effect of removing anchor items that exhibit differential item functioning on the scaling and classification of examinees. paper presented at the annual meeting of ncme, denver.
wyse & reckase. (2011). a graphical approach to evaluating equating using test characteristic curve. applied psychological measurement, 35(3), pp. 217-231.

research and evaluation in education e-issn: 2460-6995
research and evaluation in education journal
volume 1, number 2, december 2015 (pages 212-224)
available online at: http://journal.uny.ac.id/index.php/reid

implementation of digital learning using interactive multimedia in excretory system with virtual laboratory

1) heru setiawan; 2) wiwi isnaeni; 3) f. putut martin herry budijantoro; 4) aditya marianti
1),2),3),4) semarang state university, central java, indonesia
1) herusetiawan@student.unnes.ac.id; 2) wi2isna@yahoo.co.id; 3) pututmartin@yahoo.com; 4) tya.unnes@yahoo.co.id

abstract

this study aims to: (1) develop interactive multimedia with a virtual laboratory on the excretory system for senior high school students, (2) determine the eligibility of the digital multimedia on the excretory system, and (3) determine the effectiveness of digital learning using interactive multimedia in improving students' achievement and activities in learning the excretory system. this research was conducted at senior high school 1 jepon, blora, indonesia, for 2 months. the research approach was educational research and development, comprising: (1) research and information collecting, (2) planning, (3) developing the preliminary form of the intended product, (4) preliminary field testing and validation of the media by experts, (5) main product revision, (6) field testing, (7) operational product revision, (8) operational field testing, (9) final product revision, and (10) dissemination and implementation. the results are as follows: (1) a virtual lab on the excretory system is needed because of the unavailability of tools and limited materials, the obsolete school lab, expensive tools and lab materials, and the crowded schedule of laboratory use; (2) the digital multimedia on the excretory system with a virtual lab is highly eligible, meeting the outstanding criterion in expert validation and field testing; (3) digital learning using interactive multimedia improves students' achievement and activities (the number of students who meet the active criterion is 90.63%).
keywords: digital learning, interactive media, virtual laboratory, student's achievement, excretory system

introduction

today is a digital era. teachers should make use of technology, especially in teaching biology, a science subject that studies the natural surroundings immediately around students. however, physiological material and invisible characteristics are often confusing for students to learn. according to fischer (2008, p.67), there are several reasons why biological materials are considered difficult to learn, one of which concerns the instructional media used by teachers. according to sugandi (2008, p.7), instructional media are an important tool that helps teachers help students understand biology materials. with global changes in the development of knowledge and technology, particularly as they relate to the education system in schools, teachers are required to develop media that make it easier for students to learn biology concepts, but learning media today often receive little attention from teachers (virvou, katsionis, & manos, 2005, p.6). when students open a used high school biology textbook, they see pages full of confusing words, with pictures that are too small and need to be enlarged, all arranged in rigid layouts to maximize the available space. in addition, a thick book means many pages of small print to get through, so students quickly become bored and are not motivated to learn. on the other hand, today is also an era of globalization. globalization affects various aspects of life, and the field of education is no exception.
therefore, with the rapid advancement of information and communication technologies in education, learning is also moving toward digitalization (digital learning). digital learning is learning that utilizes digital media, electronic systems, or computers to support the ongoing learning process (lazarowitz, 2013, p. 219). according to adri (2007, pp.7-8), the characteristics of digital learning are: (1) the use of electronic technology; (2) taking advantage of computers (digital media and computer networks); and (3) the use of teaching materials that can be used independently (self-learning materials) and are stored on a computer, so they can be accessed by lecturers and students anytime and anywhere. in addition, the learning schedule, the curriculum, the results of learning progress, and matters related to the administration of education can be viewed at any time on the computer.

based on interviews with grade xi biology teachers at schools in indonesia, the excretory system material is considered difficult for students. the difficulty lies in the physiological processes, which happen inside the body and are therefore invisible. this is evident from the fact that some students have not yet reached the passing grade. a needs analysis using a questionnaire showed that one of the causes is the lack of variety in the media used. teachers used instructional media on the excretory system in the form of text and visuals, such as textbooks and modules, but students were less interested in these media and less active in learning, so they felt sleepy.
based on the results of the questionnaire analysis of the teachers and students of 1 jepon state senior high school (sman 1 jepon), all of the students want learning media on the excretory system that are attractive, practical, easy to understand, convenient for viewing, stimulating to learn from, easy to use, and not easily damaged, and that contain complete material and the principles of scientific learning. the interactive learning media used in learning are usually in the form of learning cds (compact discs), but it is rare to find a learning cd equipped with scientific-based content. the scientific approach in the teaching of biology should be designed and directed as much as possible toward involving students in constructing knowledge and science skills through science processes, which include observing, asking, reasoning, associating, and communicating. for this purpose, a science laboratory is the most appropriate vehicle. however, a school laboratory with limited facilities cannot be used effectively, owing to the lack of science laboratory infrastructure in educational institutions. this problem can restrict the development of higher-order thinking skills, and it is a gap at sma 1 jepon, where laboratory equipment and materials are limited. based on the interviews with biology teachers at the school, the school already has a biology lab, but it does not yet have adequate laboratory equipment and materials because of a lack of funds. these problems left students unable to explore and to carry out practical activities.
based on the theory and the problems described above, media as a source of student learning should not only be made more attractive but also be equipped with complete audiovisual-based material, while the lack of laboratory infrastructure can be addressed through a computer-aided virtual lab simulation. a virtual lab allows hands-on and minds-on activities. many researchers and practitioners believe that virtual reality (vr) technology has created new thinking in education. duffy and jonassen (1992) state that technology in education today should be based on the paradigm of constructivism. sung and ou (2002, p.180) report that vr has the ability to facilitate learning activities. under these conditions, an interactive cd including a virtual lab on the material needs to be developed as a learning medium for grade xi students. the aims of this study are to: (1) develop interactive multimedia with a virtual laboratory on the excretory system for senior high school students; (2) determine the eligibility of the digital multimedia on the excretory system; and (3) determine the effectiveness of digital learning using interactive multimedia in improving students' achievement and student activities in learning the excretory system.

method

research approach

this research employed the educational research and development (r&d) approach, a process used to develop and validate an educational product, which is developed, tested, and revised according to the results of field testing (borg and gall, 1983, p.20). the stages of r&d are: (1) research and information collecting, (2) planning, (3) developing the preliminary form of the product, (4) preliminary field testing, (5) main product revision, (6) main field testing, (7) operational product revision, (8) operational field testing, (9) final product revision, and (10) dissemination and implementation.
in this study, the r&d approach is used to produce materials on the excretory system based on the scientific approach.

time and place

this research was conducted at sma n 1 jepon from december 2014 to march 2015, in semester 2 of grade xi in the 2014/2015 academic year. the research was carried out only in the science classes. sma 1 jepon is located on blora-cepu street km. 9, blora, central java. the subjects of the field testing were 32 grade xi students, while the subjects of the operational field testing were 64 grade xi students of mia 2 and mia 3.

research procedures

research and information collecting

in this stage, data were collected to identify the types of instructional media that had been used at sma 1 jepon and the need for instructional media development. the preliminary study used a questionnaire, observation, interviews, and documentation. the subjects consisted of teachers and students. the teacher in this study was a biology teacher. the students were selected by random sampling from among those who had already received the excretory system material, namely 96 grade xii students in the 2014/2015 academic year.

planning

in this stage, the development of the instructional media was planned based on data from the needs analysis obtained in research and information collecting. planning consisted of making flowcharts and storyboards for the digital interactive multimedia on the excretory system and collecting source material, including data, photos, videos, and animations from books and the internet. the media assessment criteria were adapted from the national standards of the education department of indonesia, with modifications to several aspects, such as appropriateness of the content, appropriateness and eligibility of the presentation language, formulation of learning objectives and indicators of competency achievement, and preparation of the media.
an evaluation tool in the form of test instruments was developed to measure the extent of students' understanding of the material studied. the tests were administered as a posttest.

development of preliminary form of product

in this stage, a virtual lab was compiled. the virtual lab comprised a glucose test, a urine protein test, a physical and ph test, and a urine chloride test. the initial product of the interactive digital multimedia on the excretory system was made as a whole using adobe flash player ver. 5. the initial format of the product was developed in the following steps: (1) preparing the materials to be used to create the media; (2) preparing the media script; (3) editing the display of the main menu and its contents; (4) making the virtual lab design; (5) finishing the program; (6) determining and producing the product specifications; and (7) preparing a virtual lab guide.

preliminary field testing and main product revision

the validation phase consisted of validation of the interactive digital multimedia, in terms of appearance by media experts and in terms of material by a material expert. the media assessment covered software engineering aspects, consisting of usability, compatibility, reusability, and effectivity, and audio-visual communication aspects, whereas the material assessment covered the scope of the material, the depth of the material, the linkage between basic competencies and indicators, and aspects of language. the questionnaire on responses to the media was directed to the biology teacher and students of sma 1 jepon. preliminary field testing to validate the design was conducted by experts using a media validation questionnaire. the media experts were two lecturers from the department of biology, experts in ecophysiology, from a well-known university in indonesia.

field testing and operation product revision

field testing was conducted on 32 students of class xi mia 1.
the tests aimed to obtain information on the use of interactive digital multimedia in learning the excretory system. testing was done by giving the media to students and then collecting data through student response questionnaires, observation of activity, and learning results, which were used to determine the level of student mastery. the method used was a pre-experimental design: before the sample was treated, the students were given a pretest, and at the end they were given a posttest. the learning took place over a total of 4 meetings. operational field testing was conducted on 64 students. student activity was assessed through observation during the learning process by an observer using observation sheets. student achievement was assessed after the whole learning process was completed, using 30 multiple-choice questions. operational product revision was carried out based on the advice received after the wide-scale testing, and the media were then improved to obtain the final product.

method of collecting data

the following methods were employed to collect data: (1) data on the eligibility of the interactive digital multimedia against the criteria were collected by questionnaire; (2) student and teacher response data on the interactive digital multimedia on the excretory system were collected by questionnaire; (3) data on student achievement in using the interactive digital multimedia on the excretory system were taken from learning tasks, the virtual lab, and the final evaluation; and (4) data on student activity were taken from the observer's observations of student activity during teaching and learning.

data analysis methods

analysis of expert validation

the analysis of the expert judgment data employed a likert scale of 1-4 (wahono 2006, pp.45-46).
the data were then calculated with the following formula:

\[ np = \frac{r}{sm} \times 100\% \tag{1} \]

note:
np = the percentage value sought
r = score obtained
sm = maximum score

the categories of eligibility based on the percentage are: 81% ≤ np ≤ 100% (outstanding), 62% ≤ np < 81% (very good), 43% ≤ np < 62% (good), 33% ≤ np < 43% (fair), np < 33% (poor).

analysis of response data of students and teachers

the variable refers to a likert measurement scale of 1-4. the criteria for student and teacher responses to the interactive digital multimedia on the excretory system are: 85%-100% (outstanding), 70%-84% (very good), 60%-69% (good), 50%-59% (fair), <50% (poor). the evaluation results were analyzed to determine classical completeness, with the classical criterion that ≥ 80% of students reach the individual passing grade.

student activity

activity data were obtained from observation sheets and analyzed as descriptive percentages. the percentage obtained indicates the level of student activity. the classification criteria for student activity (arikunto 2006, pp. 60-61) are: 85%-100% (very active), 70%-84% (active), 60%-69% (quite active), 50%-59% (underactive), <50% (inactive).

findings and discussion

the development of interactive digital multimedia on the material of excretory system

the learning multimedia product in this study is in the form of a learning cd. the development of the interactive digital multimedia on the excretory system in this study is the process of making media containing excretory system material and a specially designed virtual lab, with the aid of electronic devices and macromedia flash, following the method of borg and gall (1983, p.20). in this study, the data were obtained from textbooks, electronic books, the internet, and the high school biology syllabus. the data were used as source materials in the design of the interactive digital multimedia.
the virtual lab work was built by designing the lab tools and materials to be as realistic as possible. the design lets students proceed to the next step only if they click the appropriate tool. when students click the practical tools and materials available on the virtual lab table on the right, the lab step runs automatically, so the students' task is only to observe the results of the virtual lab they perform. in the physical test, students observe the colour and the acidity (ph) using a universal indicator. in the glucose test, students observe the colour of the urine before the benedict test, after the benedict reagent is dropped in, after heating, and the final colour. at the finishing stage, the media components and the virtual lab experiments needed to be tested repeatedly and improved many times so that the media run smoothly from start to finish without errors. once compilation is complete, the media are ready to be transferred into application form. the interactive digital multimedia is 443 mb (megabytes) in size. the minimum hardware is a personal computer (pc) with these specifications: intel pentium dual core 2.0 ghz, 128 mb ram, 150 mb of free hard disk space, a 1024 x 768 svga monitor, and windows xp, windows vista, or windows 7. a guide module for the virtual lab and a virtual lab report format were prepared so that students do not have difficulties when operating the virtual lab in the interactive digital multimedia on the excretory system. this practical guide module is a kind of manual that provides additional guidance on how to operate the lab properly. the format of the student report is as follows: it should contain the heading, the purpose, the basic theory, the working procedure, observations, data analysis and discussion, conclusions, and references.
the questions should also be equipped with critical thinking (contextual teaching and learning).

eligibility of interactive digital multimedia on the material of excretory system

the eligibility of the interactive digital multimedia was tested by two kinds of experts, media experts and a material expert (preliminary field testing). the testing of its use was done by students and teachers in learning. the trial was conducted in two phases, namely a limited-scale trial (field testing) and a large-scale trial (operational field testing).

preliminary field testing

the eligibility of the interactive digital multimedia based on the validation of the excretory system material is presented in table 1.

table 1. eligibility of digital media of excretory system according to the material expert

aspects assessed                                               percentage
completeness of material                                       75%
linkage of core competency and base competency of curriculum   91.67%
accuracy of the material                                       100%
presentation of material                                       90%
communicative and interactive                                  75%
aspects of language                                            100%
percentage of eligibility of the material in classical         90.62%
eligibility criterion in terms of material                     outstanding

based on the eligibility assessment table, the interactive digital multimedia on the excretory system is eligible to be used as a learning medium, with a classical material eligibility percentage of 90.62%, which meets the outstanding criterion. the validation results for the media display are presented in table 2.

table 2. eligibility of digital interactive multimedia of excretion system according to media experts

aspects assessed                                  percentage
software, usability                               100%
audio and visual communication                    89.28%
other aspects: the design of a virtual lab        87.50%
percentage of media eligibility in classical      91.17%
eligibility criterion in terms of media           outstanding

table 2 shows that the interactive digital multimedia was judged on four main aspects in terms of media. the classical percentage obtained was 91.17%, which meets the outstanding criterion.
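as an illustration of how formula (1) and the eligibility categories stated in the data analysis methods lead to a label such as "outstanding", a minimal python sketch follows. the function names and the raw score of 29 out of 32 are hypothetical. the sketch also assumes that each category is defined by its lower bound and that 100% counts as outstanding, which the source does not state explicitly.

```python
# lower bounds of the eligibility categories, highest first
ELIGIBILITY_BANDS = ((81, "outstanding"), (62, "very good"), (43, "good"), (33, "fair"))


def percent_value(score_obtained, maximum_score):
    """formula (1): np = r / sm x 100%."""
    if maximum_score <= 0:
        raise ValueError("maximum score must be positive")
    return 100.0 * score_obtained / maximum_score


def eligibility_category(np_value, bands=ELIGIBILITY_BANDS, lowest="poor"):
    """return the label of the first band whose lower bound np_value reaches."""
    for lower_bound, label in bands:
        if np_value >= lower_bound:
            return label
    return lowest


# hypothetical expert score: 29 points out of a maximum of 32
np_value = percent_value(29, 32)
print(np_value, eligibility_category(np_value))

# classical percentages reported in table 1 and table 2
print(eligibility_category(90.62), eligibility_category(91.17))
```

the same lookup could serve the response and activity criteria by passing a different tuple of lower bounds and labels.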
main product revision (revised design / product early stage 1)

the interactive digital multimedia on the excretory system was revised based on the recommendations and advice given by the media experts and the material expert. based on the media experts' input, the product was revised in the opening page, the instructions for use, the display, the background, the navigation buttons, the spacing between main menu items, and the lab simulation submenu, so that students know the correct steps of the practical procedures. meanwhile, the revisions suggested by the material expert concerned the lab work manual, slow motion in the simulation, supporting images, legibility of text and graphics, flow charts for the lab simulation, and added animations to clarify mechanisms, such as the mechanism of urine formation, the excretion of sweat, bile excretion, and the formation of kidney stones, as well as animations of biuret and benedict test results for urine in diabetes mellitus. one display revised on the material expert's advice is shown before and after revision in figure 1 and figure 2.

figure 1. simulation of virtual lab before revision

figure 2. simulation of virtual lab after revision

field testing (testing the media in learning on a limited scale)

field testing was conducted to determine the effectiveness of the interactive digital learning media. in this research, the effectiveness of the interactive digital multimedia on the excretory system refers to its effect on student activity and student achievement. the field trial, a limited-scale test, was carried out on 32 students of class xi mia 1. the results of the limited-scale test show that most students rated the interactive digital multimedia as outstanding or very good.
the eligibility of the interactive digital media on the excretory system in the product testing (small scale) is presented in table 3.

table 3. results of student responses to interactive digital multimedia on a small scale trial (field testing)

response criteria   percentage
outstanding         84.38%
very good           12.50%
good                3.13%
fair                0.00%
poor                0.00%

the questionnaire consists of 12 aspects, such as readability, navigation buttons, the display, the clarity of the text, language, background sound, the virtual lab, the applicability of the virtual lab in teaching and learning, and student interest in the media. based on the analysis of the response items, 100% of students agreed that learning using interactive digital multimedia is fun and not boring and that the urine lab simulation helps students understand the lab procedures for the physical test and the test of urine content.

operation product revision

based on the student responses in the limited-scale trial, some revisions were needed. the revisions concerned the main menu, the audio settings (background sound), the arrangement of the material content in the form of points, and the chloride test.

operation field testing (trial wide scale)

the large-scale trial was used to determine the response of students to the instructional media after revision. the student responses in the large-scale trial are presented in table 4.

table 4. the response of students to the interactive digital multimedia of large scale trials

response criteria   percentage
outstanding         92.19%
very good           7.81%
good                3.13%
fair                0.00%
poor                0.00%

based on these results, it can be concluded that student responses improved compared with the limited-scale trial. according to the students' responses, the interactive digital multimedia on the excretory system facilitates the physical test and the test of urine content, students are eager to learn, and learning becomes fun and not boring.
final product revision

based on the wide-scale trial, the media still needed some improvements. revisions were made only to the urine colors, which were divided into three shades of yellow and three shades of brownish yellow; these were subsequently revised to green and blue colors that are more clearly visible and distinguishable.

results of teacher responses to interactive digital multimedia of excretion system

the teachers gave a very good response (95%) to the use of the interactive digital multimedia of the excretion system. the problem of students who do not like the music was overcome by adding buttons for controlling the music volume, so that it can be set according to user preference. a constraint in implementing the developed instructional media is that it takes longer and requires preparation beforehand. this problem can be overcome with proper preparation before the lesson and better class time management. the use of multimedia can help teachers deliver the material easily and effectively. this will support student-centered learning.

the effectiveness of interactive digital multimedia excretory system to improve student achievement and student activity

the results of the study show that learning using the interactive digital multimedia of the excretion system is effective for student achievement and student activity, as presented in table 5.

table 5. student's achievement in a limited scale trial (field testing)
classical completeness    graduate    not graduate
small scale test          84.37%      15.62%
large-scale test          87.5%       12.5%

based on the results obtained, we can conclude that learning using interactive digital multimedia is effective in improving student achievement. the effectiveness can be seen from the learning results of ≥ 75 in both the limited-scale and the large-scale trials. the learning result reported here is a combination of the assignment score, the virtual lab report, and the final evaluation.
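the classical completeness figures in table 5 are the share of students whose combined result reaches the passing grade of 75, which the study then compares with a success indicator of ≥ 80%. a minimal python sketch of that arithmetic follows. the function name and the score list are hypothetical (the study does not publish individual scores); the list is chosen only so that 27 of 32 results pass, which reproduces the reported limited-scale figure up to rounding.

```python
def classical_completeness(scores, passing_grade=75):
    """Return the percentage of scores at or above the passing grade."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for score in scores if score >= passing_grade)
    return 100.0 * passed / len(scores)


# Hypothetical class of 32 results: 27 at or above 75 and 5 below it.
# These are illustrative values, not data from the study.
scores = [80] * 27 + [70] * 5

pct = classical_completeness(scores)

# Success indicator used in the study: classical completeness of at least 80%.
meets_indicator = pct >= 80.0

print(f"classical completeness: {pct:.3f}%")
print(f"meets the 80% indicator: {meets_indicator}")
```

under these assumed scores the share is 84.375%, which sits between the 84.37% reported in table 5 and the value obtained by ordinary rounding.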
the classical mastery of students with a passing grade of ≥ 75 was 84.37% in the limited-scale trial and 87.5% in the wide-scale trial. these percentages show that student achievement in learning using the interactive digital multimedia meets the success indicator of ≥ 80%.

the trial results of interactive digital multimedia in excretion system for student activities

based on the trial results, the interactive digital multimedia of the excretion system was shown to increase student activity in learning the excretion system. the effectiveness can be seen from the data analysis of the limited-scale trial, which showed that classical student activity over 3 sessions fell within the very active and active criteria, as presented in table 6.

table 6. activity of students in learning using interactive digital multimedia on a limited scale trial
criteria         activity on small trial testing    activity on large trial testing
highly active    34.38%                             39.06%
active           50.00%                             53.12%
active enough    9.38%                              6.25%
less active      6.25%                              3.13%
inactive         0.00%                              0.00%

based on the student activity table, the share of students in the very active and active categories in the limited-scale trial is 84.38%, while 15.62% met the quite active and less active criteria. this meets the minimum criterion for classical student activity, which is ≥ 81%. the large-scale trial results over two meetings also show classical student activity of ≥ 81%: students in the active and very active categories made up 92.19%, while 9.38% were moderately active or less active.

one of the things that causes high student activity in learning is that the learning is done in groups.
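the activity criterion described above adds the two highest categories of table 6 and compares the sum with the 81% minimum. the sketch below restates that check in python using the percentages printed in table 6; the dictionary layout and the function names are illustrative assumptions, not part of the study's instruments.

```python
# Percentages as printed in table 6 (limited-scale and large-scale trials).
TABLE_6 = {
    "small": {
        "highly active": 34.38,
        "active": 50.00,
        "active enough": 9.38,
        "less active": 6.25,
        "inactive": 0.00,
    },
    "large": {
        "highly active": 39.06,
        "active": 53.12,
        "active enough": 6.25,
        "less active": 3.13,
        "inactive": 0.00,
    },
}

# Minimum classical activity stated in the text.
MIN_CLASSICAL_ACTIVITY = 81.0


def active_share(distribution):
    """Sum the 'highly active' and 'active' percentages, rounded to 2 decimals."""
    return round(distribution["highly active"] + distribution["active"], 2)


def meets_activity_criterion(distribution, minimum=MIN_CLASSICAL_ACTIVITY):
    """Check whether the combined active share reaches the minimum."""
    return active_share(distribution) >= minimum


for trial, distribution in TABLE_6.items():
    print(trial, active_share(distribution), meets_activity_criterion(distribution))
```

with the printed values, the limited-scale share is 84.38% and the large-scale share is 92.18% (the text quotes 92.19%, a rounding difference), so both trials exceed the 81% minimum.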
during the learning process, the students interact and work together to discuss the material presented in the learning media. group discussion also makes students more active and enthusiastic in learning. according to amri and ahmadi (2010, pp. 10-11), discussions help lessons develop and stimulate the spirit of questioning and personal interest. this is also supported by the interesting media, especially the virtual lab. according to some students, the simulated virtual lab is very helpful in understanding the lab procedures for the physical test and the content test of urine. an attractive and fun learning environment makes learning easier for students and thus contributes to their understanding of the material.

dani (2008, pp.10-12) revealed that interactive media is a teaching tool for both students and teachers that is quite effective in helping teachers deliver educational material, so that students absorb more than in the conventional way because: (1) students quickly absorb the information and knowledge of the material presented, (2) pictures, video, and animation in the media are more attractive than text, (4) it is interactive, and (5) it is oriented towards problem solving. according to the teachers' responses, the interactive digital multimedia makes it easier for learning to achieve its goal, because it contains complete coverage and interesting material. arsyad (2011, p.12) suggests that computers can accommodate students who are slow to accept the lesson, because the use of multimedia involves various organs of the body, from the ear (audio) and eyes (visual) to the hand (kinaesthetic). the involvement of these various organs makes the information easier to understand. according to istianda and darmanto (2009), students are only able to retain 20% of what they see and 30% of what they hear, but can remember 50% of what they both hear and see, and 80% of what they see, hear, and do at once.
moreover, the effectiveness of the interactive digital multimedia of the excretion system for student achievement is also due to the fact that the media can visualize material of the excretory system that is abstract and difficult to observe directly. in line with the statement of adri (2007, pp.22-23), multimedia has special functions such as animation, simulation, and visualization, so that students get more concrete information in place of abstract information and are thus able to develop the cognitive aspects. likewise, based on the advice given, the teacher would like other biology material to be made into learning cds like the interactive digital multimedia of the excretion system, so that students are interested in learning. learning with the interactive digital multimedia of the excretion system requires the presence of the teacher as a facilitator, because interaction between computers and humans has not been able to replace interaction between humans (ismail, 2006, pp. 23-24).

most students completed the learning using the interactive digital multimedia of the excretion system; even so, there are still some students who have not completed it. the factors that cause this are internal and external factors in the students. in the application of the multimedia, 84.37% of the students scored 75 or above the passing grade, and only 15.63% of the students still scored under the passing grade in the limited-scale trial; in the large-scale trial, 87.5% of the students completed the learning, meaning that there is an increase compared with the limited-scale trial. this shows that the multimedia application is effective in achieving good student achievement.
multimedia is able to enhance the learning process and facilitate communication between teachers and students, because self-learning multimedia creates an atmosphere in which students are able to organize themselves and have the intrinsic motivation to learn the material of the excretory system. the data analysis also shows that, on average, the students' lab reports were quite good in both the limited-scale and the wide-scale trials. some students were able to write a report well, covering the title, basic theory, tools and materials, work methods, data analysis and discussion, and conclusion. in addition, the average student was already able to answer the critical thinking questions.

according to mintz (1993, p. 13), one of the promising applications is the use of computer simulations to teach material that cannot be handled in a conventional laboratory. however, are computer simulations as effective as the conventional laboratory, or can they replace it? several studies over more than two decades have shown that both computer simulations and lab practicum in a conventional laboratory can effectively improve student achievement. the answer depends on the concept and the situation (cengis, 2010, p. 3). the concept refers to what is simulated in the virtual lab: if the simulated concept is a science concept that is so abstract that students find it very difficult to learn, or a process that occurs very slowly or is too dangerous to carry out, then the use of the virtual lab will be very effective. the situation refers to whether the virtual lab is appropriate in the context of its use, for example the condition of the school laboratories: a virtual lab used in schools with very limited facilities will be more effective than in schools that have full facilities (cengis, 2010, p. 4). in addition, the results of this study are supported by several research studies, e.g. russell et al.
(1999, p.335) and sanger and greenbowe (1997, p.821) showed that the proportion of appropriate statements increased and the proportion of misconceptions declined after the use of virtual lab animation. in addition, related studies show that computers with additional effects, such as animation, simulation, and sound, are positive in improving the quality of learning (douglass, 1990, p. 46; heerman, 1988, p.7). hutt (2006, p.40) found that the combination of simulation and laboratory work benefits lab time, so that the time needed can be reduced. this is in line with redish, jeffery, and steinberg (1997, p.4), who report that computers used in teaching as a learning tool increase student motivation and allow students to learn independently and to manage the time they need according to their own learning speed.

although the results of this study support the view that the use of virtual labs improves student achievement and provides positive results for students' learning activities in the biology subject, it is not claimed that the interactive digital multimedia of the excretion system, which is a simulation-based laboratory, is more effective than student activity in the real laboratory, because the two were not examined and compared in this study. however, it is claimed that the virtual lab was created to represent activity in the real laboratory where there are constraints such as reactions involving harmful chemicals, limited lab time, and limited equipment owned by the school lab; under such conditions, the virtual lab can be used as an alternative solution.
however, some studies have compared the two directly. karr et al (2004, pp.6-8) compared the achievement of students who were guided using a hands-on chemistry lab with that of students who were guided using the virtual chemistry lab (elab). they found that there was no significant difference in learning achievement or gain scores between them. they commented on their findings that students taught using the hands-on chemistry lab performed as well as those taught using the virtual chemistry lab (elab).

learning using the interactive digital multimedia of the excretion system certainly has its advantages and disadvantages. according to cengis (2010, pp.7-8), some advantages of the virtual lab include: (a) it allows convenient learning, as the tools and materials are simulated on the computer so that it is not too dangerous; (b) learners can learn and develop creativity to experiment easily; (c) the problem can be presented on the computer to generate student motivation; (d) students' need to learn will increase because the material is presented with animation, which is more interesting; (e) the virtual laboratory does not necessarily require a long time to prepare and carry out the activity, because it is presented on the computer; (f) it can encourage learners to be more effective and active in the learning process; (g) the calculation results of the experimental data are more valid and precise, so that it is easier to grasp the concepts presented.
meanwhile, the disadvantages of virtual laboratory assisted learning are: (a) its success relies on the independence of the student in following the learning process; (b) access to virtual laboratory activities depends on the number of computer facilities provided by the school; (c) learners may feel bored if they lack understanding of how to use the computer, which can lead to a passive response in conducting virtual experiments; (d) guidance from the teacher is needed before the experiment so that the virtual laboratory is carried out well; (e) it is less in accordance with the process approach, particularly with regard to students' skills, especially kinesthetic skills.

according to the teachers' responses, the use of the interactive digital multimedia of the excretion system helps the teacher in the learning process achieve the expected goal, because it contains complete coverage and interesting material. the material in the instructional media already meets the core competencies and basic competencies that must be achieved, is in accordance with the development of science and technology, is easy to understand, and is well presented in good language. the advantage of the learning medium is that it increases students' interest, because it can show particular physiological mechanisms of the excretion system in context, and the learning is not monotonous because it involves information technology and creates a variety of learning methods. learning with the medium also increases students' independence and their ability to use technology in learning. students deal directly with a computer, which provides a new experience for them. the interactive digital multimedia of the excretory system can integrate several components such as sound, text, animation, pictures, videos, and games. the multimedia integrates these components in order to optimize the role of the senses in receiving information and storing it in memory.
in harmony with this opinion, arsyad (2011, p.5) reveals that the use of multimedia involves various organs of the body, namely the ear (audio), eyes (visual), and hand (kinaesthetic). although the interactive digital multimedia of the excretion system has not been able to involve the students' hands (kinaesthetic) as fully, using the five senses, as laboratory experiments do, the kinesthetics meant here is the kinesthetics of operating computers. such involvement of the various organs makes the information easier to understand. wahyuni and kristianingrum (2008, p.8) also stated that interactive cds applied in learning can improve student achievement and the active role of students, because the students liked a class atmosphere that is fun and not boring.

the role of teachers in learning activities also contributes to the effectiveness of the interactive digital learning media of the excretion system. in the learning process, the teacher acts more as a facilitator and motivator who makes it easier for students to learn optimally, so that students truly become the center of learning. teachers assist students who want to ask questions because they are not familiar with the material contained in the instructional media, and students who do not understand how to operate the instructional media. the students' difficulty in operating the media occurs because they pay insufficient attention to the instructions for its use.

most students completed the learning using the interactive digital multimedia of the excretion system; even so, there are still some students who have not completed it. the factors that cause this are internal and external factors in the students. internal factors may be psychological factors in the students, such as motivation, attention, concentration, comprehension, and memory.
another cause is that students' thinking abilities differ, and not all students are accustomed to using computers as learning media. in the application of the multimedia, 84.37% of the students scored 75 or above the passing grade, and only 15.63%, amounting to 8 students, still scored under the passing grade in the limited-scale trial; in the large-scale trial, 87.5% of the students completed the learning, meaning that there is an increase compared with the limited-scale trial. this shows that the implementation of the multimedia is effective in helping students reach good achievement. multimedia is able to enhance the learning process and facilitate communication between teachers and students, because self-learning multimedia creates an atmosphere in which students are able to organize themselves and have the intrinsic motivation to learn the material of the excretory system. hasebrook and gremm (1999, p.5) suggest that multimedia can help students to enhance communication, motivation, and self-learning ability.

evaluation of learning using interactive digital multimedia

in general, learning using the interactive digital media of the excretory system can be implemented properly. however, in addition to the advantages already described, not every implementation went smoothly, and some problems were encountered. the problems that occurred in the field are a result of the weaknesses of a lesson. in order to avoid such constraints, the weaknesses of a lesson must be overcome. after conducting the research, there is one point to consider: learning with the interactive digital multimedia of the excretion system will run optimally when all the students can operate the media themselves.
conclusions and suggestions

conclusion

based on the research findings, the following conclusions can be drawn: (1) interactive multimedia of the excretion system needs to be developed because of several facts: the tools and materials available in the school lab are limited, or the school lab cannot be used; lab tools and materials are expensive; and scheduling laboratory use is time-intensive, whereas adequate facilities are available to support technology-based teaching; (2) the interactive digital multimedia of the excretion system that was developed meets the outstanding criteria for use as a learning medium for teaching the excretory system and received positive responses from the teacher and students; (3) the interactive digital multimedia is effective in improving student achievement and learning activity on the excretion system among grade xi high school students.

suggestions

based on the results obtained, a few suggestions can be given. schools are advised to further optimize the computer lab facilities they already own. before conducting the virtual laboratory, students must first master the working procedure well. teachers need to manage the time well when using the virtual laboratory experiments, because the activities require a relatively long time. for subsequent developers, the image composition and design of the product can be made more attractive in order to motivate students to learn concepts of biology.

references

adri, m. (2007). development strategy multimedia instructional design. journal invotek, i(vii), 1-9.
arikunto, s. (2006). fundamentals of educational evaluation. jakarta: pt bumi literacy.
arsyad, a. (2011). learning media. jakarta: raja grafindo persada.
borg, w.r., & gall, m.d. (1983). educational research: an introduction (4th ed.). new york, ny: longman.
cengis. (2010). the effect of the virtual laboratory on the students' achievement and attitude in chemistry. online international journal of educational sciences, 2(1), 37-53.
douglas, j.e. (1990). visualization of electron clouds in atoms and molecules. journal of chemical education, 67, 42-44.
duffy, t., & jonassen, r. (eds). (1992). constructivism and the technology of instruction: a conversation. hillsdale, nj: lawrence erlbaum.
fischer, r.g. (2008). the delphi method: a description, review, and criticism. journal of academic librarianship, 4(2), 64-70.
hasebrook, j., & gremm, m. (1999). multimedia for vocational guide: effects of individualized testing, videos and photography on acceptance and recall. journal of educational multimedia and hypermedia, 8(4), 37-400.
heerman, b. (1988). teaching and learning with computers. san francisco: jossey-bass.
hutt, p. (2006). virtual laboratories. prog. theor. phys. suppl, 164, 38-53.
ismail, a. (2006). education games (being smart and cheerful with educational games). yogyakarta: pilar media.
istianda, d., & darmanto. (2009). making multimedia for improving service learning. journal of open and distance education, 1(x), 11.
lazarowitz, r., & penso, s. (1992). high school students' difficulties in learning biology concepts. journal of biological education, 26, 215-224.
redish, f.e., jeffery, b.c., & steinberg, r.n. (1997). the effectiveness of active engagement microcomputer-based laboratories. maryland, md: department of physics, university of maryland college park.
russell, j.w., & kozma, r.b. (1997). use of simultaneous-synchronized macroscopic, microscopic, and symbolic representations to enhance the teaching and learning of chemical concept. journal of chemical education, 74(2), 330-334.
sanger, m.j., & greenbowe, t.j. (1997). students misconceptions in electrochemistry: current flow in electrolyte solutions and the salt bridge. journal of chemical education, 74(1), 819-823.
sugandi, a. (2008). the theory of learning. semarang: unnes press.
sung, w.t., & ou, s.c. (2002). learning computer graphics virtual reality using technologies based on constructivism: case study of the webdegrator system. j. interactive learning environment, 10(3), 177-197.
virvou, m., katsionis, g., & manos, k. (2005). combining software games with education: evaluation of its educational effectiveness. j. educational technology & society, 8(2), 54-65. available at http://www.ifets.info/ [accessed on february 4, 2015].
wahyuni, s., & kristianingrum, a. (2008). improving student's achievement chemistry and the active role of students through pbi models with interactive cd media. journal of chemical education innovation, 2(1), 199-208.

research and evaluation in education
e-issn: 2460-6995
research and evaluation in education, volume 1, number 2, december 2015 (pages 175-185)
available online at: http://journal.uny.ac.id/index.php/reid

the effectiveness of web-based interactive blended learning model in electrical engineering courses

1) hansi effendi; 2) soenarto; 3) herminarto sofyan
1) padang state university, indonesia; 2)3) yogyakarta state university, indonesia
1) hansieffendi@yahoo.com; 2) narto_elka@yahoo.com; 3) hermin@uny.ac.id

abstract

the study was to test the effectiveness of the web-based interactive blended learning (wbibl) model for subjects in the department of electrical engineering, padang state university. the design employed was a quasi-experimental design with one-group pretest-posttest, administered twice to a group of 30 students. the effectiveness of the wbibl model was tested by comparing the average pretest scores and the average posttest scores, both in the first trial and in the second trial. the average pretest and posttest scores in the first trial were 14.13 and 33.80 respectively. the increase in the average score was significant at alpha 0.05. the average pretest and posttest scores in the second trial were 18.67 and 47.03. the result was also significant at alpha 0.05.
the effectiveness of the wbibl model in the second trial was higher than that in the first trial. the result was not entirely satisfactory, which might be caused by several weaknesses in both trials, including the limited number of sessions, the use of only one subject, and the limited number of students as subjects. however, it could be concluded that the wbibl model might be implemented as an alternative to face-to-face instruction.

keywords: instructional model, blended learning, interactive, web-based

research and evaluation in education the effectiveness of web-based interactive blended learning model... 176 hansi effendi, soenarto, & herminarto sofyan

introduction

in law number 20 year 2003 regarding the national education system, article 3, it is stated that the function of national education is to develop the national capability and to shape national characteristics as well as national civilization in order to brighten national life. the objective is to develop students' potentials in order to be human beings who have faith and piety toward the lord the almighty, to have noble characteristics, to master science, to have capabilities, to be creative and independent, and to become democratic and responsible citizens.

in order to achieve the objective, the government has designed multiple policies and activities that are formulated into five-year strategic plannings. one of these plannings is the 2010-2014 strategic planning of the indonesian government, with the following missions: (1) to improve the availability of educational service; (2) to expand the accessibility of educational service; (3) to improve the quality and relevance of educational service; (4) to bring into reality the equality in attaining educational service; and (5) to ensure the assurance of attaining educational service. although the missions are designed well, several constraints are found in their implementation.
the weaknesses are as follows: the educational supporting facilities (in this case the physical facilities) are not sufficient; the number of teaching staff is not sufficient; the students' achievement is still low; the objective of the implemented learning activities is not clear; the curriculum systems are prone to changes; the educational opportunities are not distributed evenly; and the educational costs are relatively expensive (soekartawi, 2007, pp.4-19).

in line with the problems explained, with the effective implementation of the teacher and lecturer law regarding teacher certification, teachers and lecturers are motivated to continue their study. in law number 14 year 2005, it has been formulated that: 'teachers should have academic qualifications, competencies, educational certificates, physical and mental health and capability in order to materialize the objectives of national education' (article 8). the academic qualifications mentioned in article 8 will be attained by means of higher education in bachelor and diploma programs (article 9). on the other hand, the requirements regarding lecturers are formulated in law number 14 year 2005 article 45, and lecturers should have the following minimum requirements: (1) should be graduates of masters programs for the diploma or the bachelor degree; and (2) should be graduates of doctoral programs for the postgraduate degree (verse 2). in article 47 verse 1, it is added that the educator certificates for lecturers as mentioned in article 45 shall be given after the lecturers meet the following requirements: (1) having worked as educators in higher education for at least two years; (2) having an academic position at least as an expert assistant; (3) having passed the certification conducted by universities that run educational staff provision programs and that are appointed by the government.
due to those regulations, the teachers assigned in remote areas are motivated to continue their studies. however, state universities have limited capacity to absorb these teachers, while there are many private universities, but these are less qualified in comparison with the state ones. in addition, the tuition fee in private universities is expensive. as a result, these teachers decide to continue their study in state universities. unfortunately, at the same time, the number of lecturers who teach in state universities is limited, because most of these lecturers are also continuing their study, attending multiple training programs, and performing tri dharma (teaching, research, and community service) activities at the university level in order to improve their competencies. as a result, the ratio between students and lecturers becomes very high. this high ratio has become another problem in the implementation of education.

one of the ways of overcoming the problem is by utilizing information and communication technology (ict) in the teaching-learning process. the researchers expect that by utilizing ict in teaching, some of the teaching problems such as infrastructure costs, teaching staff quantity, educational access, and educational fees might be overcome. in other words, ict might be utilized to reduce infrastructure costs, increase the quantity of teaching staff, expand educational access, and reduce educational fees.
the use of information and communication technology in the ongoing teaching-learning process is appropriate for indonesia because the program is compatible with the characteristics of the republic of indonesia, namely: (1) the republic of indonesia is an archipelagic country with an unequal population distribution; (2) the ict-based teaching supporting infrastructure is quite sufficient (especially the telephone network); (3) the number of internet users and stations is quite large; (4) indonesian people are increasingly aware of investment in the educational domain; and (5) the universities still have low absorbing capacity (soekartawi, 2007, pp.9-10). with the existence of ict-based teaching programs, educational programs are expected to expand to remote areas, and community members will be motivated to continue their study.

in addition, the implementation of ict-based teaching is supported by the government. several legal aspects that support the implementation of ict-based teaching are as follows: national education system act number 20 year 2003; minister of national education decree number 107/u/2001 regarding the implementation of distance learning; circular of the directorate general of higher education number 3040/d/t/2005 regarding the explanation of remote class, dated september 8th, 2005; minister of national education explanation dated september 1st, 2005; and governing law regulations and policies. based on the aforementioned regulations, the government allows universities to implement information and technology-based distance learning because of educational cost effectiveness and efficiency; however, information and technology-based distance learning should be implemented based on the regulations and procedures that are already formulated.

the use of information and communication technology development in teaching has been brought into reality by the implementation of web-based learning/e-learning.
cheng (2005, p.34) states that there has been a change of paradigm in the learning process, and the new paradigm has new characteristics, namely: (1) life-long learning, (2) multiple sources of learning and teaching, and (3) globally and locally networked learning and teaching. on the other hand, the pedagogic environment has been globally connected by some aspects, such as: (1) self-learning programs and packages; (2) interactive multimedia material; (3) web-based learning; (4) outside experts; and (5) local and global exchange programs. similarly, the information and technology pedagogic environment for students and lecturers has been connected by the following aspects: (1) web-based learning; (2) interactive self-learning; (3) multimedia facilities and learning material; (4) interactive self-learning; and (5) video conferencing. the comparison of the new learning paradigm and the old learning paradigm is displayed in table 1.

table 1 shows that teachers or lecturers are no longer the single source; instead, they become facilitators, motivators, catalysts, or mediators for the students. the students are also demanded to be active, independent, full of initiative, and analytical in the learning process. therefore, the web-based learning that will be developed should be based on the students' demands (student-centered, web-based learning), should be mastered independently (self-learning), should be focused on the learning manners, should have sufficient sources in supporting the learning process (multiple sources), should not be limited by time and space, and should also be interactive.

table 1.
changes in the learning paradigm

| new cmi-triplization paradigm | traditional site-bounded paradigm |
|---|---|
| individualized learning: | reproduced learning: |
| student is the centre of education | student is the follower of teacher |
| individualized programs | standard programs |
| self-learning | absorbing knowledge |
| self-actualizing process | receiving process |
| focus on how to learn | focus on how to gain |
| self rewarding | external rewarding |
| localized and globalized learning: | institution-bounded learning: |
| multiple sources of learning | teacher-based learning |
| networked learning | separated learning |
| lifelong and everywhere | fixed period and within institution |
| unlimited opportunities | limited opportunities |
| world-class learning | site-bounded learning |
| local and international outlook | mainly institution-based experiences |

(source: cheng, 2005, p.29)

naidu (2006, pp.4-7) also states that developing web-based learning programs has several advantages, namely: (1) the learning program is very dynamic and might be displayed in multiple interesting, attractive and interactive forms; (2) the learning program might be operated at all times so that the students and the lecturers might obtain information regarding the necessary learning materials; (3) the learning program might be implemented individually, where each student might select the learning form or the learning model that is more relevant to his or her background; and (4) the learning program is comprehensive and provides multiple learning forms from multiple sources that enable the students to select the available learning format, learning method, and/or practice. with all of the available technology, web-based learning provides opportunities for designing authentic environments such as contextual learning and problem-based learning; as a result, the students will have ‘learning by doing’ experiences.
lehmann and chamberlin (2009, p.2) even believe that if web-based learning is well managed, it might perform better than traditional learning, provided that: the students are active, the learning materials are up to date, there is interaction between the students and the learning contents, there is interaction between the students and the lecturers, ideas are explored deeply, and all of the discussions are recorded. in addition, the use of web-based learning should make learning more effective because of learning concepts such as repetition, in which the students might repeat the learning materials as often as they need. the reason is that the learning materials are available 24 hours a day and seven days a week. the faculty of engineering, padang state university, is eligible to implement the missions of higher education that have been formulated in the higher-education strategic planning, which include improving education availability, expanding educational accessibility, improving education quality and relevance, realizing education equality, and ensuring certainty in attaining higher education services. however, the resources available for accomplishing these missions are limited. the use of e-learning is viewed as one of the alternatives that might be used to overcome the limitations in the available resources. naidu (2006, p.2) states that the use of ict might grow rapidly, especially in universities, because technology has been considered a way of improving learning access to information sources as well as a way of decreasing educational cost. however, the use of e-learning in the learning process in universities is not as easy as turning over one’s hand, because e-learning demands multiple prerequisites that might be hard to accomplish.
since the publication of the regulations on the implementation of remote learning, both at the level of the minister of education and at the level of the directorate general of higher education, the faculty of engineering, padang state university has expanded e-learning. specifically, it has prepared a sufficient web-based learning system. however, the use of the system has not been optimal. several problems have been identified as reasons why the existing e-learning has not been optimally used. one of the problems is that both the students and the lecturers have not been accustomed to attending and implementing e-learning (effendi, 2005, p.18). both the students and the lecturers might need motivation and good examples of the use of e-learning. with the assistance of the islamic development bank, padang state university has implemented training programs for lecturers. the university serves as the designer of the e-learning training program so that the lecturers will have sufficient competencies to use all of the available information and communication technologies as their teaching media. however, in reality, both at the department of electrical engineering and at other departments of padang state university, the lecturers have not developed much web-based learning. only a small group of lecturers develop and use e-learning. most of the e-learning merely turns face-to-face presentations into web presentations in the form of texts and/or documents. certainly, such a use is not incorrect, but the lecturers might gain more benefits by using all of the strengths of information and communication technology, since the technology is able to integrate multiple systems of symbols into the learning process.
in addition, the e-learning developed by the lecturers has not fully implemented the appropriate learning theories and learning methods in order to create effective and interesting learning situations. a package of e-learning materials that does not garner the benefits of ict definitely becomes less useful for students who differ from one another in several respects. for example, the different learning styles possessed by the students should also be given attention in developing the learning forms, including the e-learning ones. the students and lecturers, for years, have been accustomed to the face-to-face learning model. in this model, interaction, in terms of the students’ control over the learning process, is very limited. some people argue that the use of ict for the learning process has caused social isolation due to the decrease in interaction both between the students and the lecturers and among the students themselves in the classroom. in order to overcome the problem, the researchers would like to develop a blended learning model, one characteristic of which is interactivity. blended learning is basically able to overcome the problem of interaction within the learning process. lawhead and rosbottom (anggarwal, 2003, p.399) believe that the combination of two learning methods (in this case, face-to-face and online) will make the learning process more effective. from the students’ point of view, it has been apparent that the students have been accustomed to the student-centered learning. they have not been fully aware that the responsibility for the learning process in the new paradigm is in their own hands. one of the blended learning models that has been developed is the web-based interactive blended learning model (wbiblm). the model refers to the appropriate learning theory and learning method.
then, the term interactive refers to the fact that the learning will focus on the students (student-centered learning) and will be fully controlled by the students (student-controlled learning). the model also has complete learning components and considers the students’ needs and learning styles (effendi, 2015, pp.16-17). the model emphasizes interaction between the students and learning materials, between the students and lecturers, and among the students themselves. the interaction between the students and learning materials is designed by using the component display theory proposed by merrill. meanwhile, the interaction between the students and lecturers is designed by using the facilities for direct and indirect discussion provided by the moodle-based online learning system. as a result, the blended learning model is based on the combination of the constructivism paradigm, interactivity principles, and learning styles, which are the aspects given attention in the web-based interactive blended learning model. based on the background of the study, the problem in the study can be formulated as follows: how effective is the web-based interactive blended learning model (wbiblm) in improving learning achievement? therefore, the objective of the study is to test the effectiveness of wbiblm in improving the learning achievement of the students.

research method

type

the study was quasi-experimental with a one-group pre-test and post-test design. the design might be described as follows:

o1 x o2

figure 1. experimental design
note: o1 = pre-test; o2 = post-test; x = experiment

research setting

the study was conducted in the even semester (january-june) of 2014 academic year and the odd semester (july-december) of 2015 academic year.
the study was held at the department of electrical engineering, the faculty of engineering, padang state university.

subjects/targets

the subjects in the study were a learning group that consisted of 30 students who took the electrical machines subject at the electrical engineering department, the faculty of engineering, padang state university. they were selected randomly by implementing purposive random sampling technique.

research procedure

there were two experiments in the study, namely the first experiment and the second experiment. the first experiment was conducted in eight sessions on the topic of the transformator within the learning group. three out of the eight sessions (37.50%), namely sessions one, five, and eight, were conducted by means of face-to-face learning in the classroom, while five out of the eight sessions (62.50%) were conducted by using the web-based e-learning. the initial meeting was used for explaining every aspect related to wbiblm and for administering the pre-test. session five was used for a class discussion of the materials that the students had not mastered. session eight was used for administering the post-test. the selection of the one-group pre-test and post-test design was primarily based on the limited number of learning groups; there was only one learning group, which consisted of 30 students. the design had several weaknesses: (a) there was no guarantee that the model would be the sole factor that caused the differences between the results of the pre-test and post-test; and (b) the effects of history, maturation, testing, instrument, regression, selection and mortality could not be avoided (creswell, 2014, pp.245-246). however, the design also had several benefits. the primary benefit was that the pre-test provided an opportunity for the researchers to compare the achievement of the same subjects before and after the administration of the wbiblm.
data, instrument, and data gathering technique

the data gathered in the study were the scores of the pre-test and post-test in the electrical machines 1 subject for the topic of transformator. the data were gathered by means of a learning achievement test whose development began with designing the test guidelines. the test guidelines included five topics, namely: (a) working principles and construction of one-phased transformator; (b) vector diagram and circuit use of one-phased replacement; (c) transformator loading according to the load type and voltage regulation according to the loading type; (d) parallel performance and voltage regulation of the one-phased transformator load; and (e) working principles and construction of three-phased transformator and the connections. based on those test guidelines, the learning achievement test items were constructed. in order to guarantee that the test guidelines and test items had content validity, two lecturers of the electrical machines 1 subject matter and two other lecturers validated the test guidelines and test items. the expert validation was performed from february 8th to february 15th, 2014. in addition, the quality of the learning achievement test was measured by means of a content validity test that was performed in the first and second experiments. the content validity test was performed by employing point-biserial correlation. the test followed the opinion proposed by mardapi (2004, p.27), who states that if the items are scored dichotomously (1 for each correct answer and 0 for each incorrect answer), then the correlation technique to be used is point-biserial. the results of the first field experiment showed that 25 out of 54 test items were totally valid.
in addition, five test items (test items number 4, 9, 11, 20 and 30) could not be answered and four test items (test items number 5, 6, 8 and 24) could be answered correctly. the invalid test items, including the easiest ones and the hardest ones, were revised for the second field experiment. in addition, based on the expert judgement, the number of the test items was expanded to 60 items. in the second field experiment, all 60 test items were valid. the reliability estimation of the learning achievement test was conducted in two ways. first, the researchers measured the experts’ consistency or the experts’ agreement regarding the learning achievement test. the aspects of the learning achievement test on which the researchers asked for the expert judgment were the test substance, form and language. the experts’ agreement was measured by means of inter-rater reliability (mardapi, 2012, p.86). after the inter-rater consistency had been calculated, the researchers found that the inter-rater coefficient was 0.73. the coefficient showed that 73.00% of the raters agreed with the judged aspects of the learning achievement test. theoretically, a correlation coefficient above 0.70 is very good (streiner & norman, 2000; polgar & thomas, 2000). second, the researchers administered an internal reliability test in order to ensure that all of the items provided similar results. the result of the internal reliability test for the first field experiment was 0.744 (kr-20), while the result for the second field experiment was 0.954 (kr-20). as to the data gathering technique, the researchers employed learning achievement test results in the form of a pre-test and a post-test. the pre-test and post-test were administered before and after the administration of the web-based interactive blended learning model in order to measure the learning effectiveness achieved by using the developed model.
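the item analysis and reliability estimation described above can be illustrated with a minimal python sketch. it is not the authors' procedure or data: the response matrix `responses` is hypothetical, the function names `point_biserial` and `kr20` are chosen here for illustration, the item-total correlation is left uncorrected (the item is included in the total score), and the population variance of the total scores is assumed (some texts use the sample variance instead).

```python
import math
from statistics import mean, pstdev, pvariance

# Hypothetical data (not from the study): one row per student,
# one dichotomous score per item (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def total_scores(rows):
    """Total test score of each student."""
    return [sum(row) for row in rows]


def point_biserial(item_scores, totals):
    """Point-biserial correlation between one 0/1 item and the total scores.

    Returns None when the item has no variance (all correct or all wrong)
    or when the total scores have no variance.
    """
    passed = [t for x, t in zip(item_scores, totals) if x == 1]
    failed = [t for x, t in zip(item_scores, totals) if x == 0]
    sd_total = pstdev(totals)
    if not passed or not failed or sd_total == 0:
        return None
    p = len(passed) / len(item_scores)  # proportion answering correctly
    return (mean(passed) - mean(failed)) / sd_total * math.sqrt(p * (1 - p))


def kr20(rows):
    """Kuder-Richardson formula 20 for dichotomously scored items."""
    k = len(rows[0])  # number of items
    var_total = pvariance(total_scores(rows))
    if k < 2 or var_total == 0:
        raise ValueError("need at least two items and varying total scores")
    # Sum of p * q over items, where p is the proportion correct and q = 1 - p.
    sum_pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in rows)
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


totals = total_scores(responses)
item_correlations = [
    point_biserial([row[j] for row in responses], totals)
    for j in range(len(responses[0]))
]
reliability = kr20(responses)
```

in this sketch an item that every student answers correctly, or that no student answers correctly, has no variance, so `point_biserial` returns `None` for it. this is consistent with the report that the easiest and hardest items were among those revised after the first field experiment.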
data analysis technique

the following techniques were employed in the data analysis: (1) internal consistency analysis of reliability by means of kr-20 for the learning achievement test, whose item scores were dichotomous (1 or 0); and (2) a t-test for testing the effectiveness of the web-based interactive blended learning model by comparing the results of the pre-test and post-test.

findings and discussions

the first field experiment was conducted with a learning group that consisted of 48 students. after the second week, five students withdrew from the learning group. there were 30 students who fully participated in the classroom activity. the learning materials on transformator that became the example of development consisted of eight sessions: three sessions (37.50%), namely sessions one, five, and eight, were conducted in the face-to-face learning manner, while the remaining five sessions (62.50%) were conducted through e-learning activities. the initial meeting was used for explaining every aspect related to the e-learning and for administering the pre-test. in addition, the researchers also motivated the students to implement the web-based interactive blended learning model (wbiblm). session five was used for a class discussion of the learning materials that the students had not mastered, and it served as a formative action. the effectiveness test for the web-based interactive blended learning model was implemented by comparing the scores of the pre-test and the post-test. the scores of learning achievement in the first field experiment are shown in table 2. in the first field experiment, the researchers found that learning by means of the web-based interactive blended learning model could attain only 62.59% of the learning achievements.
in general, the increase from the pre-test results to the post-test results was 36.42%. although the difference between the pre-test scores and the post-test scores is not very large, the scores from both tests are significantly different.

table 2. the effectiveness of the wbiblm in the first experiment

| test period | minimum score (%) | maximum score (%) | mean score (%) | standard deviation | note |
|---|---|---|---|---|---|
| pre-test | 5 (9.26) | 20 (37.00) | 14.13 (26.17) | 3.52 | p < 0.05 |
| post-test | 19 (35.19) | 38 (70.37) | 33.80 (62.59) | 4.49 | |
| increase | 17 (25.93) | 18 (33.37) | 19.67 (36.42) | | |

the second field testing also served as the main field testing. the objective of the second field testing was to see how far the expectations of the model use had been met. the design of the second field testing was similar to that of the first field testing, namely the one-group pre-test and post-test design. the results of the second field testing are displayed in table 3. based on the results in table 3, the web-based interactive blended learning model achieves around 78.00% of the learning objectives. if the pre-test results are compared to the post-test results, in general, the increase is 47.28%. although the increase is not high, it is significant.

table 3. the effectiveness of the wbiblm in the second experiment

| test period | minimum score (%) | maximum score (%) | mean score (%) | standard deviation | note |
|---|---|---|---|---|---|
| pre-test | 10 (16.67) | 25 (42.67) | 18.67 (31.11) | 3.87 | p < 0.05 |
| post-test | 17 (28.33) | 59 (98.33) | 47.03 (78.39) | 12.87 | |
| increase | 14 (23.33) | 35 (58.00) | 28.37 (47.28) | | |

table 3 shows that 60% of the participants were able to achieve more than 78% of the learning objectives. the position of each participant might be viewed from the pre-test and post-test results that are displayed in figure 2, which shows that the participants who earned higher scores in the pre-test tended to earn higher scores in the post-test.
after these results were analyzed further, it was found that the correlation between the pre-test scores and the post-test scores is 0.86, which is significant.

figure 2. the position of each participant based on the pre-test and post-test results

the comparison of the effectiveness of the web-based interactive blended learning model, measured by comparing the general percentage of learning achievement between the first and second field experiments, is displayed in figure 3. the increase in learning achievement from the first field experiment to the second field experiment is more than 15%.

figure 3. the comparison on the effectiveness of the wbiblm between the first and second field experiments

the effectiveness of the wbiblm was 78.39%. the model was not significantly effective, but the study on the model had informed that the wbiblm might be used as an alternative for the face-to-face lectures. within the study, wbiblm was implemented in the following composition: 37.50% face-to-face learning and 62.50% online learning. the composition showed that the subject might save 62.50% of the educational resources such as classrooms, lecturers, and subject assisting tools. despite this effectiveness, the study on the model had limitations that could not be avoided. the limitations of the research and development of wbiblm are as follows. first, the researchers lost five respondents from the study, and this loss is one of the consequences of implementing a research and development study within a certain period of time.
the study was conducted for eight weeks and, as a result, the researchers were not able to avoid situations in which students withdrew as research subjects or did not completely play their role as subjects. the withdrawal might be caused by one of the regulations, which states that in the second week the students might change their study activities. in addition, for certain reasons, there were some students who did not completely attend the research activity. for example, the initial number of students in the electrical machines 1 subject in the second field testing was 43 people. after the second week, the number decreased to 35 people. however, there were 30 students who completely attended the activity. second, there was no control group, and this situation was the consequence of implementing the one-group pre-test and post-test design. the researchers did not have any other choice because there was only one learning group in the electrical machines 1 subject.

conclusions and suggestions

conclusions

the effectiveness of the web-based interactive blended learning model (wbiblm) was tested by comparing the pre-test and post-test scores in both the first and second field experiments. in the first field experiment, the average pre-test score was 14.13 and the average post-test score was 33.80. the increase in the average score was significant at alpha 0.05. in the second field experiment, the average pre-test score was 18.67 and the average post-test score was 47.03. the increase in the average score was significant as well at alpha 0.05. however, the effectiveness result in the second field testing was higher than that in the first field testing. the effectiveness of implementing the wbiblm achieved only 78.39%. such result is not significant; however, the study on the model at least has concluded that wbiblm might be used as an alternative for some of the face-to-face lectures.
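the pre-test and post-test comparison summarized in these conclusions can be sketched in python as a paired (dependent-samples) t-test. this is an illustration under stated assumptions rather than the authors' computation: the text only says that a t-test was used, so the paired form is assumed here because the same students took both tests; the scores below are hypothetical, the function name `paired_t_statistic` is chosen for illustration, and the critical value is taken from a standard two-tailed t table for alpha 0.05.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(pre, post):
    """Return (t, degrees of freedom) for paired pre-test/post-test scores."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equally long score lists with n >= 2")
    # Gain of each student: post-test score minus pre-test score.
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation of the gains
    if sd_diff == 0:
        raise ValueError("all gains are identical; t is undefined")
    t = mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for five students (not data from the study).
pre_scores = [10, 12, 14, 16, 18]
post_scores = [20, 25, 27, 30, 33]

t_value, df = paired_t_statistic(pre_scores, post_scores)

# Two-tailed critical value of t for alpha = 0.05 and df = 4 (standard t table).
T_CRITICAL_DF4 = 2.776
significant = abs(t_value) > T_CRITICAL_DF4
```

with 30 students, as in the field experiments, the degrees of freedom would be 29 and the corresponding two-tailed critical value at alpha 0.05 is about 2.045.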
suggestions

the development of wbiblm might be continued with a stronger research design. future researchers might use control groups, more subjects, a longer period and more comparisons between face-to-face learning and e-learning to produce a more effective and reliable wbiblm.

references

cheng, y.c. (2005). new paradigm for reengineering education: globalization, localization, and individualization. dordrecht, netherlands: springer.

creswell, j.w. (2014). research design: pendekatan kualitatif, kuantitatif, dan mixed [research design: qualitative, quantitative, and mixed approach]. (a. fawaid, trans.). yogyakarta: pustaka pelajar.

directorate general of higher education. (2005). surat edaran dirjen dikti nomor 3040/d/t2005 tentang penyelenggaraan kelas jarak jauh [circular of the directorate general of higher education number 3040/d/t.2005, about the implementation of distance learning]. jakarta.

directorate general of higher education. (2005). surat edaran direktur jenderal pendidikan tinggi nomor 3040/d/t2005, 8 september 2005, penjelasan tentang penyelenggaraan kelas jarak jauh [circular of the director general of higher education number 3040/d/t.2005, 8 september 2005, explanation on the implementation of distance learning]. jakarta.

effendi, h. (2015). model blended learning interaktif berbasis web mata kuliah mesin-mesin listrik di fakultas teknik universitas negeri padang [web-based interactive blended learning model in the subjects of electricity machines in the faculty of engineering of padang state university] (unpublished doctoral dissertation). universitas negeri yogyakarta, indonesia.

lehmann, k., & chamberlin, l. (2009). making the move to e-learning: putting your course online. new york, ny: rowman and littlefield education.

mardapi, d. (2004). penyusunan tes hasil belajar [learning outcomes test arrangement]. yogyakarta: pascasarjana uny.

mardapi, d. (2012). pengukuran, penilaian, dan evaluasi pendidikan [educational measurement, assessment, and evaluation]. yogyakarta: nuha medika.

merrill, m.d. (1981). component display theory. new jersey: educational technology publication englewood cliffs.

merrill, m.d., & twitchell, d.g. (1994). instructional display theory. new jersey: educational technology publication englewood cliffs.

minister of national education. (2001). keputusan mendiknas no. 107/4/2001 tentang penyelenggaraan program pendidikan tinggi jarak jauh [the judgment of the minister of national education number 107/4/2001 about the implementation of distance higher education program]. jakarta.

minister of national education. (2005). penjelasan mendiknas 1 september 2005 tentang penyelenggaraan kelas jarak jauh [explanation of the minister of national education 1 september 2005 about the implementation of distance course]. jakarta.

ministry of education and culture. (2013). rencana strategis kementerian pendidikan dan kebudayaan 2010-2014 [strategic plan of the ministry of education and culture, 2010-2014]. jakarta.

ministry of national education. (2001). keputusan menteri pendidikan nasional republik indonesia nomor 107/u/2001 tentang penyelenggaraan program pendidikan jarak jauh [the judgment of the minister of national education of indonesian republic number 107/u/2001 about the implementation of distance learning]. jakarta.

naidu, s. (2006). e-learning: a guide book of principles, procedures, and practice. new delhi: creative workshop.

polgar, s., & thomas, s.a. (2000). introduction to research in the health sciences. london: churchill livingstone/harcourt.

republic of indonesia. (2003). undang-undang r.i. nomor 20 tahun 2003 & peraturan pemerintah r.i. tahun 2010. sisdiknas dan penyelenggaraan pendidikan dan wajib belajar [act no 20 year 2003 & government regulation year 2010. national education system and the implementation of education and compulsory learning].

republic of indonesia. (2005). undang-undang republik indonesia no. 14 tahun 2005 tentang guru dan dosen [act no. 14 year 2005 about teachers and lecturers].

soekartawi. (2007). merancang dan menyelenggarakan e-learning [designing and implementing e-learning]. yogyakarta: ardana media dan rumah produksi informatika.

streiner, d.l., & norman, g.r. (2000). health measurement scales: a practical guide to their development and use. oxford: oxford university press.

research and evaluation in education e-issn: 2460-6995
research and evaluation in education volume 1, number 2, december 2015 (pages 199-211)
available online at: http://journal.uny.ac.id/index.php/reid

the use of malcolm baldridge method for formulating strategic planning in technological and vocational education

1) suharno; 2) sukamto; 3) sutarto
1) 11 maret university, indonesia; 2)3) yogyakarta state university, indonesia
1) myharno@yahoo.com; 2) sukamto@uny.ac.id; 3) sutarto@uny.ac.id

abstract

the article describes the results of a study evaluating the performance of technological and vocational education (tve) by the use of the malcolm baldrige method. the data on the performance were used in order to reveal its strengths and weaknesses. based on the performance strengths and weaknesses, a competitive strategy could be formulated to improve tve quality. first, performance measurements using the malcolm baldrige criteria were done in seven study programs in different universities. second, the results of the performance measurements were analyzed and described. with the data resulting from the performance measurements as the basis of the analysis, the strengths and weaknesses of tve performance might be found. third, a strategy was developed. based on the performance strengths and weaknesses, a performance improvement strategy might be formulated in order to raise the quality level within the educational process of tve.
the research results indicate that for the performance level in the seven universities under study, the achieved scores range from 526 to 711 points. these results show that the performance of tve study programs in indonesia is in the categories of education leader and producer of education leader. on the basis of those categories, each study program might formulate its own competitive strategy in order to improve tve performance so that the educational process might also improve in terms of its quality level.

keywords: malcolm baldrige, tve, competitive strategy, strategic planning

introduction

education is a system in which there are many processes that form sub-systems. these processes occur within an educational environment, which, in a wide sense, is something that becomes the domain of review within educational strategic problems. comprehensive educational planning will always be related to the educational process and the sub-systems within an educational system. the educational system referred to is the system that includes activities of resource planning, curriculum planning, learning method planning, and so on. planning is a basic and strategic managerial function that provides direction in implementing an activity in order to achieve educational objectives. educational planning is one of the key factors in achieving effective and efficient education and training activities, so that the planning might produce graduates that meet the society’s needs/demands (leslie, 2005, p.9). according to indrajit (2004, p.11), in its realization, the existence of educational planning in all educational degrees has been regarded as a complementary factor or an elaboration of the institutional chief policy.
in other words, most educational plans in educational institutions are merely a realization of a chief’s demand or expectation. often, the formulated objectives are not in accordance with the planning, so that they cannot be achieved appropriately. this is compounded by the low number of educational planning staff who are able to understand the planning process and mechanism comprehensively. in addition, the position of the planning department has not been made a determining factor for the position of an educational institution at both the micro and macro levels. as a result, the contribution of educational planning toward the attainment of the vision, mission and objectives of institutions in several degrees, including universities, has not been maximal. ridwan (2008, p.26) explains that, in essence, an institution is said to have well-qualified planning if it is optimally and consistently oriented toward the ability to provide well-qualified products/services for customers all the time. there are several criteria that might be used for measuring well-qualified educational planning, namely: the focus on problem-solving efforts, the investment in human resources, the efforts to treat complaints as feedback for self-improvement, the possession of a planning/policy strategy for attaining quality, the efforts to pursue improvement by involving as many stakeholders as possible, the possession of a clear evaluation strategy, the possession of long-term planning, and the view of quality as a part of culture. based on these explanations, it can be concluded that universities need corporate-like management. the basis of all managerial efforts is planning. a university is said to have well-qualified planning if it is based on educational planning oriented toward the stakeholders’ satisfaction.
In order to achieve this objective, all activities in a university should be based on well-prepared short-term, intermediate-term, and long-term planning. The planning approach that might meet the stakeholders' expectations is strategic planning, that is, long-term planning that decides the viability of an organization. Therefore, efforts should be made so that a university's strategic planning is formulated appropriately.

In reality, strategic planning has not been made the basis of higher education management in Indonesia. This is apparent from the few science and technology innovations generated by higher education institutions. Scientific publications of research reports in Indonesia, at both the national and the international level, are still few compared to those in developed countries. This condition is assumed to arise because activities have not been based on planning. Therefore, it is time for higher education to be managed like a corporation that implements activities based on a planned strategy which is executed accurately, precisely, and correctly.

Research and Evaluation in Education, Volume 1, Number 2, December 2015

Bryson (1999, pp.17-18) states that strategic planning is planning that departs from the organizational vision, mission, and values and that aims to meet the stakeholders' needs and expectations. Strategic planning that relates the vision, mission, values, and objectives to the analysis of the internal and external environment will help the organization find the most strategic direction, both in the present and in the future. Within the university quality assurance system, it is stated that each university should have strategic planning, because strategic planning is the main reference for the achievement of all quality assurance within the system.
To formulate strategic planning appropriately, the design process should be accompanied by environmental analysis, a tracer study, and a normative review. For the environmental analysis, the designers might apply the SWOT analysis method; for the tracer study, the designers might involve the related stakeholders (Department of National Education, 2010, pp.21-25).

Regarding strategy, Abraham (2006, p.7) describes strategy as how a company actually competes. The statement implies that strategy shows how an organization is actually able to perform its activities based on its potential. Allio (1998, p.8) defines strategy as the art of deploying resources toward market opportunities in a way that distinguishes a business from its competitors. This statement implies that strategy is the art of using every resource to seize a market opportunity by doing something different from what competitors do. Based on the two statements, each university is required to define a well-planned strategy intelligently in order to achieve its objectives.

Law Number 14 Year 2005 Article 14 regarding teachers and lecturers and Government Regulation Number 74 Year 2008 regarding teacher certification mention that a teacher education institution or TEI (Indonesian: Lembaga Pendidikan Tenaga Kependidikan, LPTK) is a part of a university that aims to prepare teachers professionally. To achieve this objective, the institutions should manage their abundant resources and play a great role in educating skillful and educated Indonesian human resources. Consequently, the institutions are required to have appropriate strategic planning in order to compete in the market.

The technological and vocational education (TVE) department is one of the sub-systems in national education, especially as a part of the teacher education institution (TEI).
As a part of TEI, TVE, whose vision, mission, and objectives require it to produce capable teachers as graduates, is required to deliver good-quality education. The development of science and technology, together with rapidly changing global demands, means that TVE as a part of TEI (TEI-TVE) must be able to adapt. TEI-TVE, which manages and produces teacher candidates in the TVE department, should pursue changes and improvements in order to increase educational quality. In other words, TEI-TVE should be able to equip its graduates with ability in both the teaching and the vocational domains at the same time. This demand affects planning in TEI-TVE, which should be designed on the basis of strategic planning.

Managing TEI-TVE effectively and efficiently is a necessity in an effort to generate good-quality products/services. Prosser (1950, p.231) states that the administration of vocational education will be efficient in proportion as it is elastic and fluid rather than rigid and standardized. The statement implies that the administration (management) of vocational education will be efficient if it is implemented in good balance, holding to the principles of being flexible, dynamic, and standardized.

TEI-TVE differs from other general educational institutions. The distinguishing characteristic is that its students are expected to master certain skills in addition to knowledge. To support this objective, TEI-TVE requires more complex facilities than general institutions. In addition, TEI-TVE experiences greater dynamics with its stakeholders than general education does; as a result, the principles of being flexible, dynamic, and standardized are inherent to TEI-TVE.
In order to generate well-qualified graduates who are relevant to market demand, the government has issued several regulations that govern the improvement of educational quality in Indonesia. In Law Number 20 Year 2003 regarding the National Education System, Article 51 verse 2 and Article 91 verse 1 mention that each university, both state and private, is required to ensure its educational quality. To ensure its quality of education, each university is required to have strategic planning. Government Regulation Number 8 Year 2008 regarding the strategic planning systematic design guidelines mentions that strategic planning includes environmental assessment and institutional performance achievement assessment. Law Number 25 Year 2004 regarding the design content and reference defines that TEI strategic planning is designed by referring to environmental analysis, performance assessment, and market/industrial demand. Law Number 12 Year 2012 Article 62 defines that a university has the autonomy to manage itself as the center of Tri Dharma (teaching, research, and community service) implementation in the university.

Based on the aforementioned regulations, TEI-TVE, within its educational management, should have strategic planning that is designed on the basis of environmental analysis and institutional performance assessment. Environmental analysis is carried out to scan the internal and external factors used for planning the strategy. Performance assessment, on the other hand, is carried out to measure how strong an institution is in implementing the strategy. The environmental analysis and performance assessment might be performed from the rectorate level down to the department/study program level, depending on the scope of the strategic planning.
An initial study was conducted on graduates of TEI-TVE from the Engineering Education Study Program and the Construction Education Study Program, both those who teach at vocational high schools and those who work in construction companies. It found that these graduates are quite good in terms of knowledge competence; however, in terms of skill and attitude competence, they should improve. Most of the principals state that TVE graduates generally have sufficient knowledge competence but are poor in skill and attitude competence. Yet, in order to work in industry or in vocational high schools as teachers, they should master skill and attitude competence well.

From the management aspect, the observation and interview data from the initial study of the strategic planning design teams in three TEI-TVEs show that the departments/study programs already have strategic planning; however, the existing strategic planning still faces several problems. The problems can be seen from the fact that the existing strategic planning does not reflect the conditions that might fulfil the institutions' vision. The performance of the study programs has not been measured either; as a result, the stakeholders do not obtain any document on the performance of each study program. In the interviews, all of the parties responsible for strategic planning (100%) stated that strategic planning designed through an appropriate process/method is highly necessary.

As a university unit that specifically produces teacher candidates for technological and vocational education, TEI-TVE needs to establish a system of performance measurement to realize its vision and mission as a part of the quality assurance system (Wikipedia, 2008, p.6).
A good system of performance measurement should be comprehensive and integrated across all units and activities. The performance indicators to be formed should be not only financial but also non-financial (Wheelen, 2003, p.24). In this regard, Malcolm Baldrige is one of the tools that might be used for measuring the performance of an educational institution. Therefore, this study presents the results of performance measurement of educational implementation in TEI-TVE based on the Malcolm Baldrige criteria, in order to improve the quality of TEI-TVE education.

According to Gasperz (2011, p.8), the performance of an institution, whether for-profit or non-profit, should be measured. Knowledge of the performance data is very useful for designing strategic planning in the institution concerned. Furthermore, according to Indrajit (2006, p.12), a university should implement strategic planning that has been designed on the basis of performance data, because it is the main reference in achieving university quality standards.

Performance Measurement

A measurement occurs when a certain measuring tool is used to determine the weight, height, or other characteristics of an object. In daily life, people often perform measurement; within a study, however, the measurement should meet certain requirements. Measurement in a study should consist of assigning numbers to an empirical event in accordance with certain rules (Cooper & William, 1996, p.6). Performance (job achievement) is the result that an individual achieves in performing the duties given to him/her, based on capability, experience, determination, and time (Swasto, 1996, p.4).
Several measurement tools might be used, namely the Balanced Scorecard, the ISO 9001 quality management system, and the Malcolm Baldrige National Quality Award or Malcolm Baldrige Award (Gasperz, 2001, p.15). Performance measurement in TEI-TVE is very important because, by measuring performance, the stakeholders of an organization might find important information regarding the educational implementation. The measured information might be used by the deans and staff of the institution as the main starting point for improving the quality and relevance of the graduates. If TEI-TVE is unable to measure its performance, the institution will have difficulties in performing its managerial duties (Indrajit, 2006, p.36).

Performance Measurement by Means of the Malcolm Baldrige Method

Performance analysis should be conducted as the basis of decision making on the priority scale of strategies. The Balanced Scorecard balances internal and external factors. The analysis of institutional performance might be used for balancing the existing resources against the selected strategy. One of the basic questions is whether the existing resources are ready to implement the selected strategy. Furthermore, an organization should consider whether its leaders are able to mobilize their staff to implement the strategy, and whether the existing budget is sufficient. There are further aspects that an organization should consider in relation to strategic planning. Gasperz (2011, p.95) describes the Malcolm Baldrige Criteria for Education (MBCfE), also known as the Baldrige Assessment, as a measured and systematic method that might accommodate such questions. Since 2009, the method has been adopted by more than 70 countries, including Indonesia, which adopts the system as the Indonesian Quality Award (IQA).
Research and development also implements the MBCfE method for assessing performance, with the following objectives: to increase awareness of quality, to identify the needs for high quality, to introduce multiple methods of educational quality measurement, and to share (publish) information regarding the success and the advantages of a quality strategy. The target to be achieved by employing MBCfE is to provide an alternative model that might be used as a reference for improving quality management continuously. The MBCfE is managed by the National Institute of Standards and Technology (NIST), which states that:

"The Malcolm Baldrige Education Criteria for Performance are the basis for organizational self-assessments, for making awards, and for giving feedback to applicants. In addition, the Education Criteria have four other important purposes: to help improve organizational performance practices and capabilities, to facilitate communication and sharing of best practices information among education organizations and among organizations of all types, to foster the development of partnerships involving schools, businesses, human service agencies, and other organizations via related criteria, and to serve as a working tool for understanding and improving organizational performance, and guiding planning and training" (NIST 2000, p.2).

Based on this definition, MBCfE is one of the tools that might be used for improving organizational performance completely and continuously by means of measurement, and for providing feedback on organizational performance in delivering good-quality products and services. Most management experts also regard MBCfE as a form of self-evaluation.
The strength of the Baldrige criteria is their ability to provide an overall and integrated assessment that might assist a visionary leader. Their advantage is the ability to improve study program performance, guide the planning design, provide information on best practices, and communicate those best practices to all institutional working departments/units. Given the strengths of MBCfE and the development of the educational quality system that has been well implemented in America, the wave of adoption of this quality system has encouraged quality monitoring systems in educational institutions to spread to countries around the world, such as New Zealand and Australia. In Indonesia, systematic self-evaluation based on a quality monitoring system in educational institutions is new and has not been implemented systematically, even though quality is the main requirement for facing a highly competitive era. Without good quality in each product and service generated, customers will quickly switch to other organizations in order to get better-quality goods and services.

MBCfE assessment is conducted on seven criteria/categories with a total score of 1000. The seven criteria are leadership (120 points), strategic planning (85 points), customer and stakeholder (85 points), information and analysis (90 points), human resources (85 points), process management (85 points), and results of activity (450 points). According to NIST, the maximum score stipulated for each category is based on how great the effect of that criterion is on the performance of an institution. The following sections briefly describe each of the criteria/categories as applied directly to the context of a study program in a university (NIST, 2000, p.12).
Leadership (120 points)

Strong leadership in a study program is necessary in an effort to integrate quality, including carrying out social responsibility. The leadership category tests and evaluates the commitment and involvement of the program leaders in creating and preserving quality values, such as the satisfaction of customers (university students).

Strategic planning (85 points)

The strategic planning category measures how a study program develops its plans and action objectives. It also measures how the study program selects its strategic action plans, both in implementation and in adapting to change when the situation changes, together with their improvement. The item explains how the study program implements its strategy and targets strategic results; in addition, it explains how the study program identifies its strategic challenges and objectives.

Focus on customer and stakeholder (85 points)

This category measures how a study program determines its customers' needs, expectations, and loyalty. It also measures how the study program establishes relationships with its customers and determines the main factors in pursuing the customers' objectives, satisfaction, and loyalty. TVE customers are university students, vocational high schools, and industries.

Information and analysis management (90 points)

The information and analysis management category measures how a study program selects, obtains, analyzes, manages, and develops its data, information, and knowledge assets. Moreover, the category measures how a study program reviews its performance. It describes how a study program measures, analyzes, arranges, reviews, and develops its performance as a producer of vocational teachers.
Human resources (85 points)

The human resources category, in this case covering the lecturers and staff, measures the ability of a study program to evaluate its employees' capability and capacity and to establish a conducive working environment for the sake of good performance. The category also measures how a study program mobilizes, manages, and develops its human resource potential in accordance with the study program's vision, mission, strategy, and action plan.

Process management (85 points)

The process management category measures how a study program designs, manages, and improves its working and educational systems in order to meet the customers' and stakeholders' expectations and to achieve the study program's success and continuity. In addition, the category measures how a study program overcomes problems in an emergency.

Education results (450 points)

The education results category measures the performance and improvement of a study program in the scope of education, together with other service results, customer satisfaction, financial performance, human resources results and the working system, operational performance, and the leaders' responsibility. The study program's performance level is also measured in comparison to competitors within the same domain.

Harry and Hertz (2011, p.4) mention that, in order to measure each category, the strategic planning team together with the policy makers should formulate indicators by referring to the following concept:

"A major consideration in performance improvement involves the selection and use of performance measures or indicators. The measures or indicators you select should best represent the factors that lead to improved student, operational, and financial performance. A comprehensive set of measures or indicators tied to student, stakeholder, and organizational performance requirements represents a clear basis for aligning all activities with your organization's goals."

Based on this concept, the assessment indicators involve all components in each category. The strategic planning team then elaborates the formulated indicators into a list of questions or statements to be presented to the respondents for measurement. The respondents who measure the study program performance are the lecturers, excluding the head and secretary of the study program. The measurement should be performed in this manner to avoid ambiguity and subjectivity in the assessment. A principle that the strategic planning team should hold on to is that the objective of performance assessment and evaluation is not to ruin a reputation, but to find out how much value all of the staff in a study program possess. The evaluation results become an official document of the study program that might be used for improving quality continuously. Therefore, the assessment should be done honestly and without apprehension. In this study, the indicators of the MBCfE assessment are developed on the basis of the criteria proposed by Gasperz (2011, p.218) and by NIST (2000, p.12).

Performance Measurement Steps by Means of MBCfE

According to NIST (2000, pp.13-29), in general, the steps for measuring the performance of a study program by the use of MBCfE are as follows:
(1) performing an initial survey of the study program performance;
(2) designing a list of MBCfE questions;
(3) distributing questionnaires to respondents;
(4) processing the data obtained and mapping the data onto the MBCfE sub-categories;
(5) assessing each category and sub-category of MBCfE;
(6) performing an overall assessment to find the final performance score; and
(7) holding discussions to improve the organizational performance.
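Steps (5) and (6) above amount to a weighted sum over the seven categories. The following is a minimal Python sketch of that arithmetic, not the authors' actual procedure: the category weights are the maximum points listed in the text, while the function names, the dictionary keys, and the example achievement values are hypothetical and chosen only for illustration.

```python
# Maximum points per MBCfE category, as listed in the text (they sum to 1000).
MAX_POINTS = {
    "leadership": 120,
    "strategic_planning": 85,
    "customer_and_stakeholder": 85,
    "information_and_analysis": 90,
    "human_resources": 85,
    "process_management": 85,
    "results": 450,
}


def category_points(achievement, max_points):
    """Convert a category achievement fraction (0..1) into points (step 5)."""
    if not 0.0 <= achievement <= 1.0:
        raise ValueError("achievement must be between 0 and 1")
    return achievement * max_points


def total_score(achievements):
    """Sum the category points over all seven categories (step 6)."""
    missing = set(MAX_POINTS) - set(achievements)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(
        category_points(achievements[name], MAX_POINTS[name])
        for name in MAX_POINTS
    )


# Hypothetical achievement fractions, for illustration only (not study data).
example = {name: 0.60 for name in MAX_POINTS}
print(round(total_score(example), 2))  # 600.0
```

Because the "results" category carries 450 of the 1000 points, a change in its achievement moves the total far more than the same change in any other category.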
The categories of assessment results are presented in Table 1.

Table 1. Description of performance measurement results by means of Malcolm Baldrige

| Score | Criteria | Note |
|---|---|---|
| 876-1000 | World leader | Excellent |
| 776-875 | Benchmark leader | Excellent |
| 676-775 | Industry leader | Excellent |
| 576-675 | Emerging industry leader | Average |
| 476-575 | Good performance | Average |
| 376-475 | Early improvement | Average |
| 276-375 | Early result | Poor |
| 0-275 | Early development | Poor |

Strategic Planning

Abraham (2006, p.9) states that strategic planning is a process, that is, a series of steps followed by a company collectively trying to agree on where it is going (i.e., vision) and the way it will get there (i.e., strategy). The theory implies that the objectives formulated by an organization should be achieved by means of systematic and commonly agreed methods. The agreement is necessary for the following reasons: there might be multiple methods for achieving the organizational objectives, and achieving them requires support from all of the existing resources. Therefore, an organization should agree on the methods that offer the greatest benefit when implemented.

Alison (2005, pp.1-2) describes strategic planning as a systematic process through which an organization agrees on priorities that are essential to its mission and are responsive to the environment; strategic planning guides the acquisition and allocation of resources to achieve these priorities. The definition implies that a strategy is a series of systematic processes. Thus, in order to formulate the strategy, there should be a method that can measure performance clearly. Good strategic planning should be accompanied by an order of priority for achieving the strategy.
In order to formulate the strategy and achieve it, there should be commitment from all of the human resources within the organization. In the opinion of Handoko (2003, p.42), strategic planning is a process of selecting the organizational objectives, determining the strategy, policy, and strategic programs necessary for those objectives, and defining the methods that ensure that the strategy and policy can be implemented. In brief, strategic planning is a process of long-term planning designed and implemented for deciding and attaining the organizational objectives.

Bryson (1999, p.24) explains the concept of strategic planning, namely that any strategic planning process will be useful only if it helps the decision makers to think and act strategically. Strategic planning is not an end in itself; instead, it is a set of concepts designed to assist leaders in making important decisions and taking important actions. If the strategic planning process hinders strategic thinking and action, it is the planning process that should be set aside, not the thinking and action.

Based on this conception, it can be concluded that strategic planning is a process that contains the steps for achieving certain organizational objectives appropriately by certain methods. Strategic planning is long-term planning that decides the viability of an organization. Because of its long-term operational implementation, strategic planning should be accompanied by short-term and intermediate-term planning. The interrelated key to success in strategic planning is to carry it out with a strategic paradigm based on strategic analysis.

Research Method

Approach

The study employed qualitative and quantitative approaches.
For the data gathering method, the researchers selected observation and questionnaire distribution. The observation was conducted on seven study programs in different TEI-TVEs. For the data gathering instruments, the researchers employed an observation sheet and a questionnaire. The questionnaire was designed on the basis of the official guidelines of the Malcolm Baldrige Criteria for Education. The validity and reliability of the questionnaire were measured by employing the product moment correlation and SPSS version 16.

The objects of the study were the seven study programs from three TEIs listed in Table 2. The respondents involved in the study were the lecturers and administrative staff in each study program. Table 2 presents the names of the study programs and the number of respondents.

Table 2. Object and number of respondents for the performance measurement

| No | Study program | TEI | Number of respondents |
|---|---|---|---|
| 1 | Construction Education | Surakarta State University | 22 |
| 2 | Engineering Education | Surakarta State University | 23 |
| 3 | Family Welfare Education | Sarjanawiyata Tamansiswa University Yogyakarta | 7 |
| 4 | Engineering Education | Sarjanawiyata Tamansiswa University Yogyakarta | 12 |
| 5 | Electronic Engineering Education | Cendana University East Nusa Tenggara | 8 |
| 6 | Engineering Education | Cendana University East Nusa Tenggara | 10 |
| 7 | Construction Education | Cendana University East Nusa Tenggara | 7 |
| | Total | | 89 |

Data Type

The data were divided into two categories, namely primary data and secondary data. Both data types took the form of quantitative and qualitative data. Primary data are the data obtained from observation and from the questionnaire distributed directly to the respondents. Secondary data elaborate the steps in obtaining the primary data, as written in the technical explanation of data gathering.
Secondary data were obtained from the library, from study program documents in the form of self-evaluation and strategic planning, from legal products related to education and strategic planning, and from internet sources relevant to the study. The secondary data were intended to provide a theoretical foundation that completes the explanation of the topic, so that the researchers' conclusions would carry scientific and rational weight. The steps for obtaining the data are explained in the following sections.

Pre-survey

The pre-survey was conducted in the Engineering Education Study Program and the Construction Engineering Study Program, Faculty of Teacher Training and Education, Surakarta State University, and in the Engineering Education Study Program and the Construction Study Program, Sarjanawiyata Tamansiswa University, Yogyakarta. The aim was to find how far the process had been implemented for students.

Observation

Observation is an activity performed to find all elements related to the implementation of strategic planning in the study program being observed. Through observation, the data necessary for the study were found.

Questionnaire Distribution

The questionnaire that had been declared valid and reliable was distributed to all respondents who influenced the implementation of strategic planning. The respondents for MBCfE were the non-structural lecturers in each related study program.

Data Gathering Instrument

The instrument, as the data gathering tool, was intended to measure the validity, practicality, and effectiveness of the model and of the instrument itself. To obtain good data, the instrument should be developed well. Within the study, the instrument used by the researchers was the MBCfE research instrument.
mbcfe research instrument in the form of questionnaire was used to attain data on the level of study program performance in technological and vocational education department. the research instrument was developed based on the manual issued by the national institute of standards and technology (nist). consecutively, the steps taken by the researchers in the study were performing initial survey toward the study program’s performance qualitatively, designing questionnaire based on the malcolm baldridge criteria, distributing questionnaire to the respondents, processing the data that had been attained in accordance with malcolm baldridge category, performing assessment toward each category and sub-category that had been made in the form of percentage based on the table of scoring guidelines malcolm baldridge, performing overall assessment to attain final score on the performance and performing analysis, and having discussion to provide useful recommendation to formulate competitive strategy. findings and discussions the performance measurement toward all of the objects was based on the malcolm baldridge criteria that covered seven categories. the results of performance measurement in the construction education study program, surakarta state university for the leadership category were shown in table 3. table 3. 
criteria of leadership performance in the construction education study program leadership criteria (120 point) performance % sub-criteria mean sub-criteria value total score 1 study program leadership (70 point) 1.01 chief of study program involvement level 63% 1.02 commitment and consistence 65% 1.03 communication effectiveness for quality improvement 64% 1.04 leadership capability development 70% 65% 70% 45.72 2 quality management and social responsibility (50 point) 1.05 study program leadership accountability 63% 1.06 communication toward staff and customer expectation 59% 1.07 support toward staff competence improvement 44% 1.08 staff mentoring and strengthening toward the importance of competence improvement 60% 1.09 society response toward activity implementation 64% 1.10 impact and advantage planning/ calculation on the results of activities toward society (customer) 50% 1.11 involvement in handling problems within society (social) in the domain of education 58% 1.12 evidences that support the social responsibility 38% 55% 50% 27.250 total 73.100 research and evaluation in education 209 volume 1, number 2, december 2015 based on the results in table 3, it is apparent that the involvement of a study program leader is 63% based on the average assess-ment performed by the respondents. then, based on the items under assessment, the performance of the study program leader is 65%, while the quality management is 54%. the results of the average performance is multiplied by the scores that are attained in the scoring guidelines baldridge award; as a result, the total point for the leadership performance is 72.83 (the maximum point is 120 points). based on the calculation as presented in table 3, the performance in terms of the other criteria is measured. table 4 shows the results of the recapitulation of the performance assessment for the overall criteria in the construction education study program, and table 5 for the engineering education study program. table 4. 
point recapitulation and position assessment on the construction education study program

no | criteria | point according to mbcfe | point from the assessment results | achievement (%)
1 | leadership | 120 | 73 | 60.69
2 | strategic planning | 85 | 51 | 59.51
3 | customer | 85 | 51 | 59.51
4 | information & analysis | 90 | 48 | 53.61
5 | human resources | 85 | 54 | 63.00
6 | process management | 85 | 47 | 54.93
7 | results | 450 | 258 | 57.44
total | | 1000 | 581 | 100.00

based on the results of the measurement (table 4), the score attained by the construction education study program is 581. the score implies that the performance of the construction education study program earns the predicate ‘average’ and belongs to the category of emerging industry leader. the emerging industry leader category implies that the construction education study program has entered the industrial area of the educational domain. in other words, the construction education study program has approached industrial, competition-based organization management. by improving its performance, with a focus on handling weaknesses and increasing strengths, the construction education study program has an opportunity to enter the category of industry leader. the industrial area implies that, overall, the construction education study program is managed on an industrial basis; for instance, the management, including the planning, values, and working cultures, is perceived to be like that of industry. the working program that has been designed leads to the achievement of competitive quality instead of regular program implementation. the construction education study program should start departing from conventional performance patterns toward modern performance patterns. in general, the performance of the construction education study program has been good; however, it should also be admitted that there is still a gap between the expectation and the reality.
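the arithmetic behind tables 3 and 4 can be sketched in a few lines of python. this is only an illustration under stated assumptions, not the authors' own procedure: the function names are hypothetical, the sub-criteria percentages are the rounded values printed in table 3, and 72.83 is the leadership total quoted in the text.

```python
from statistics import mean

def criterion_points(sub_criteria_percentages, max_points):
    """Average the sub-criteria percentages, then scale by the points allotted."""
    return mean(sub_criteria_percentages) / 100 * max_points

def achievement_percent(points, max_points):
    """Share of the maximum points that was actually attained, in percent."""
    return points / max_points * 100

# Quality management and social responsibility (items 1.05 to 1.12, 50 points).
quality_management = [63, 59, 44, 60, 64, 50, 58, 38]
print(round(criterion_points(quality_management, 50), 2))   # 27.25

# Leadership achievement from the total quoted in the text (72.83 of 120 points).
print(round(achievement_percent(72.83, 120), 2))            # 60.69
```

because the tables list rounded values, recomputing other rows from the printed figures may differ slightly from the printed results.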
the construction education study program attained 50% of 100% (maximum performance), and the figure was similar for almost all of the other criteria. therefore, improvement should be made in all categories. more specific recommendations might be made based on the data attained from the questionnaires, which provided much information both quantitatively and qualitatively. in order for the quantitative data to have a significant impact, a team of designers should interpret the data qualitatively.

research and evaluation in education the use of malcolm baldridge method for formulating... 210 suharno, sukamto & sutarto

table 5. recapitulation on the results of performance measurement on the engineering education study program, surakarta state university

no | category | point according to mbcfe | point from the assessment results | achievement (%)
1 | leadership | 120 | 93 | 77.50
2 | strategic planning | 85 | 63 | 74.00
3 | customer | 85 | 62 | 73.00
4 | information & analysis | 90 | 64 | 71.00
5 | human resources | 85 | 61 | 72.00
6 | process management | 85 | 60 | 70.60
7 | results | 450 | 309 | 67.00
total | | 1000 | 712 |

based on the results in table 5, it is apparent that the measurement score attained by the engineering education study program is 712 points. the score implies that the engineering education study program earns the ‘excellent’ title and belongs to the industry leader category. the industry leader category implies that the engineering education study program has entered the industrial area of the educational domain, although only on the surface. the industrial area implies that the overall performance in the engineering education study program is managed based on quality competence in aspects such as management, including the planning, values, and working cultures; the competence has been created under competitive performance. in general, the performance of the engineering education study program is good, although there is still a gap between the expectation and the reality.
the prominence of the engineering education study program is in the leadership category, whose percentage is 77.50%. the prominence in the leadership category is very good capital for organizational improvement. regarding the importance of the leadership category, malcolm baldridge has put the leadership category in the first place. the reason is that, basically, the objectives of an organization will be achieved if the leadership is implemented effectively (kaplan, 2012, p.5). however, in several categories, the researchers still found some gaps that were quite high; the gaps were found in the activity category. therefore, improvement in that category should become the first priority. more specific recommendations might be provided based on the data attained from the questionnaires, which provided much information both quantitatively and qualitatively. thereby, the questionnaires became data that might be regarded as an information source and a communication tool with other departments. based on the performance score in each study program, the policy-makers might analyze each criterion and employ the available criteria in order to improve the institutional performance. table 6 shows the recapitulation of performance results for the other study programs.

table 6. recapitulation of performance measurement with malcolm baldridge

no | category | tei
   |          | 3 | 4 | 5 | 6 | 7
1 | leadership | 82 | 72 | 85 | 73 | 91
2 | strategic planning | 56 | 41 | 55 | 34 | 70
3 | customer | 55 | 45 | 58 | 39 | 62
4 | information & analysis | 55 | 42 | 64 | 33 | 74
5 | human resources | 56 | 45 | 63 | 49 | 70
6 | process management | 51 | 49 | 61 | 41 | 67
7 | results | 264 | 241 | 294 | 304 | 358
total | | 619 | 535 | 680 | 573 | 793

conclusions

tei belongs to the category of an organization with infinite resources. the role of tei in indonesia is very strategic in supporting economic development and the improvement of the nation’s competitive edge.
therefore, tei should work hard in order to be able to measure its performance so that the effectiveness of the educational implementation might be found. based on the performance measurement by implementing the malcolm baldridge award, a study program might design a competitive plan in order to improve its educational quality.

references

abraham, s.c. (2006). strategic planning: a practical guide for competitive success. london: thompson south-western.
allio, r. (1998). the practical strategist: business and corporate strategy for the 1990. social science-elsevier, 21, 1041011.
bryson, j.m. (1999). perencanaan strategis bagi organisasi nirlaba [strategic planning for non-profitable organization]. yogyakarta: pustaka pelajar.
cooper, d.r. & william, e.c. (1996). metode penelitian bisnis [business research method]. jakarta: erlangga.
department of national education. (2010). sistem penjaminan mutu perguruan tinggi [higher education quality assurance system]. jakarta: dikti.
handoko, t.h. (2003). manajemen [management]. yogyakarta: badan penerbitan fakultas ekonomi.
harry, s., & hertz. (2011). criteria for performance. retrieved from http://www.nist.gov
indrajit, r.e., & djokopranoto, r. (2005). manajemen strategis perguruan tinggi [higher education strategic management]. jakarta: universitas atmajaya.
kaplan, r.s., & norton, d.p. (1996). translating strategy into action: the balanced scorecard. boston: harvard business school press.
leslie, r. (2005). the art of training and development: effective planning. jakarta: gramedia pustaka utama.
national institute of standards and technology. (2000). baldrige national quality program 2000 education criteria for performance. maryland: nist.
prosser, c.a. & quigley, t.h. (1950). vocational education in a democracy (revised edition). chicago: american technical society.
ridwan, a.s. (2008). manajemen mutu [quality management]. medan: universitas sumatera utara.
swasto, b. (1996).
pengembangan sumber daya manusia terhadap kinerja dan imbalan (1st ed.) [the development of human resource towards performance and wage]. malang: universitas brawijaya.
wheelen, t.l. & hunger, j.d. (1996). strategic management (5th ed.). boston: addison wesley publishing company.
wikipedia. (2008). vocational education and training. retrieved from http://en.wikipedia.org/wiki/vocational_education_and_training.

research and evaluation in education
e-issn: 2460-6995
volume 1, number 2, december 2015 (pages 146-157)
available online at: http://journal.uny.ac.id/index.php/reid

evaluation of the bridging course offered at a university to foreign students: batches of 2012 and 2013

1) beatriz eugenia orantes perez; 2) djemari mardapi
1) state university of chiapas, mexico; 2) yogyakarta state university, indonesia
1) be.orantes@gmail.com; 2) djemarimardapi@gmail.com

abstract

this evaluation was a case study of the bridging language program offered at a state university to its developing countries partnership (dcp) scholarship awardee students. it evaluated the batches of 2012 and 2013. the focus of this evaluation was the strengths and weaknesses of the bridging program. this was a summative evaluation that used the context, input, process, and product (cipp) method. it was a pragmatic parallel mixed-method design research. the data were analyzed using descriptive qualitative and quantitative techniques. fifteen students, six teachers, 10 tutors, and three administrators were involved in this study. in each component of the evaluation, strengths and weaknesses were found. the main strengths are: clear cultural objectives, experienced teachers, and tutorials that help students a lot.
the main weaknesses are: there are no clear criteria for the level to be reached, tutors lack training and time to prepare, the material and assessment need to be improved, and students cannot attain an upper intermediate level of the language.

keywords: evaluation, bridging course, cipp, dcp students

introduction

the main reason for a person to seek education abroad is to obtain a better or different education than the one offered in his/her country (lee, 2012). study abroad programs are not only based on second language acquisition. they also offer intercultural competence, global awareness, academic discipline, and professional skills, factors that are considered when determining the success of an international study program. schools that offer international programs are aware that international students need to pass through a process of adaptation. in this process, factors such as transition and cultural shock occur. that is why offering bridging courses before international students start their official courses has become very popular and effective in international schools. however, there are many factors to take into account in designing and providing a bridging course, such as students’ background, objectives, methodology, assessment, selected materials, resources, teachers, and language of instruction, as well as support inside and outside the class. whether the bridging course is effective or not, it will definitely impact the grades of international students in their subsequent studies at the hosting school. countries whose national language is not a worldwide spoken language like english or french now aim to offer scholarships that also include a bridging course to learn the language and culture of the country, as a part of the main plan of making their language more widely spoken.
more and more schools now opt to offer bridging courses (bc) to international students. these courses not only bring students up to the required language level, but also offer additional knowledge about the culture to support adaptation to the new environment. learning a language cannot be separated from learning the culture. it is important that the language program offered also provides opportunities to observe and understand the culture in a critical and positive way. language in this context should be seen as a prerogative which paves the way for cultural adjustment. in other words, when foreign students want to master the language, they have to be more receptive to aspects of the new environment in order to cope better with their studies and find it easier to adjust to society.

studying abroad is a fast-growing phenomenon. some of the main reasons are the desire to travel, political changes, economic need, and cultural interaction. traveling is part of the lives of many young people of university age, especially those from developed countries. nonetheless, many students from all over the world who travel are already in their 30s and 40s. it is implicit that studying abroad will lead to increased cultural capital and knowledge for the individual, improved international relations, and an extra dimension to the educational experience. nevertheless, studying and living in a foreign country implies a transition to which students need to adapt. adaptation normally goes hand in hand with cultural shock. international students face these factors while also dealing with their studies, so transition becomes one of the vital characteristics that international education providers need to take into account. moving to a new country requires a process of transition, and it leads the newcomer to adapt to the new country.
adaptation is then the main goal, since the more adapted an international student is, the easier it will be for him/her to focus only on his/her studies. transition is a natural part of life, and can be experienced when moving from school to school, from house to house, or from job to job. for international students, this transition means many things: a new country, new school, new living situation, new language, new culture, new food, new people, and many more. therefore, support at this stage is very important.

indonesia is a country that offers a scholarship called beasiswa kemitraan negara berkembang (knb) or developing countries partnership (dcp) scholarship to students from other developing countries. universitas negeri yogyakarta (uny), or yogyakarta state university (ysu), is one of the universities in indonesia which has hosted the dcp program since 2006. ysu is located in yogyakarta city, a special region of indonesia, where javanese is the first language for most of the local people. in yogyakarta, indonesian is the official language mainly used at school and in formal activities, but the society is ruled mainly by the local javanese language. the developing countries partnership program demands different periods of study: a bridging course where indonesian language and culture are studied for 8 months; a bachelor’s or master’s preparatory program for 4 months (matriculation can be included); and either a bachelor’s program of 4 years (8 semesters) or a master’s program of 24 months (4 semesters). the bridging course (bc) at ysu also involves indonesian students as tutors and companions for foreign students who have difficulties. at the end of each semester, an internal evaluation of teachers and tutors is conducted.
however, they have not had an external evaluation since they have just started offering the program. thus, the main objective of this case study research is to conduct an external evaluation of the bridging course offered at ysu to describe the strengths and weaknesses of the program. this evaluation is summative and covers the two batches of 2012 and 2013 of dcp awardee students at ysu. dcp students who arrive in indonesia for the first time have little or no knowledge of the language, and they have 8 months to learn and perfect their indonesian to a level that allows them to be competent inside their bachelor’s or master’s degree classes, and also inside the society. however, the course has not been evaluated yet; hence the importance of this evaluation.

the needs referred to in this study are the needs of the university regarding international students (requirements) and the students’ needs inside a classroom (expectations); however, students’ personal needs cannot truly be separated from their classroom needs. the job of a teacher while planning a lesson is to understand students’ needs and expectations about the learning process they are undergoing. moore (2009, p.337) explains that teachers need to take into account students’ needs and provide students with different assignments that fulfil those needs. thus, getting to know students’ background includes not only getting to know what their learning needs are, but also recognizing needs that might affect their learning process, for instance: lack of instruments to do homework, cultural shock, or problems at home.
this research also evaluates the university’s language requirements and the students’ needs; the objectives of the course regarding the language and culture and whether they are reached; the resources which are used; the training and experience of the teachers; the methodology and activities which are chosen; and the language level reached by students as well as their cultural awareness.

in 1993, the government of the republic of indonesia started offering postgraduate (master’s degree) scholarships to students from other countries. in 2009, it was recorded that more than 400 students from more than 40 countries had been awarded this scholarship. the program is now named beasiswa kemitraan negara berkembang (knb) or developing countries partnership (dcp) scholarship. the scholarship is offered by the government of the republic of indonesia to students from developing countries through this program (directorate general of higher education, 2014).

there are many approaches that can be employed to conduct an evaluation of a program. fitzpatrick, sanders and blaine (2010, p.114) argue that ‘the diversity of evaluation approaches has risen from the varied backgrounds, experiences, and worldviews of their authors, which have resulted in diverse philosophical orientations, and methodological and practical preferences.’ one of the most commonly used models is the cipp model proposed by stufflebeam in 1970. the acronym cipp stands for context, input, process, and product evaluation. cipp is an approach that matches the objectives of many educational evaluations. with cipp, stufflebeam has pointed out the use of multiple methods, both qualitative and quantitative, as far as they fulfil the needs of the evaluation (quoted in fitzpatrick, sanders, and blaine, 2010, p.176). the four types of evaluations are described by stufflebeam, madaus, and kellaghan (2002, p.
279) as follows: the context evaluation assesses the needs, problems, assets, and opportunities; the input evaluation assesses the strategy and the associated work plan and budget for carrying out the effort; the process evaluation checks the plan’s implementation plus documentation of the process, including changes in the plan as well as key omissions and/or poor execution of certain procedures; and the product evaluation assesses the extent to which the evaluation met the needs of all the rightful beneficiaries, gives feedback about achievements, and helps to take decisions about the program, whether it is worth using, continuing, repeating, and/or changing.

when the cipp model is used to evaluate a program that has already finished, the evaluation is said to be summative, and the main purpose is to sum up the program’s merit, worth, probity, and significance (fitzpatrick, sanders, & blaine, 2010, p. 175). when the evaluation is summative, according to stufflebeam, the four different evaluations have the following roles:
(1) context: it is useful for judging the goals already established and for helping the audience to assess the significance of the effort in meeting the beneficiaries’ needs. thus, it compares the goals and priorities with the assessed needs, problems, assets, and opportunities;
(2) input: it compares the program’s strategies, design, and budget to those of critical competitors and the targeted needs of beneficiaries;
(3) process: it provides a full description of the actual process and cost, plus a comparison of the designed and actual processes and costs;
(4) product: finally, it compares the outcomes and side effects to targeted needs and, as feasible, to results of competitive programs; it interprets the results against the effort’s assessed context, inputs, and processes.

research objectives

the primary purpose of this evaluation is to examine the strengths and weaknesses of the program.
it is necessary to evaluate the program’s relevance, management, and delivery, without forgetting to conduct an assessment of the impact of the dcp bridging program. the evaluation assesses the extent to which dcp participants improved their language abilities and acquired knowledge of indonesian culture, the instruments and methods used to deliver language instruction, as well as the assessment of instruments, and the barriers to accessing the program. it also examines whether the social and students’ needs are fulfilled during the program so that students are able to participate effectively in lessons at the bachelor’s or master’s degree level. thus, the research objectives are as follows:
(1) to know what the ysu requirements to enter a master’s degree are and what the students’ background is;
(2) to know what the participants’ needs are;
(3) to understand the learning goals/objectives of the course;
(4) to evaluate if the program was designed to fulfil the students’ and the university’s demands;
(5) to determine what the resources are and how good they are;
(6) to evaluate if the planning of the lesson is based on the objectives of the course and the needs of students;
(7) to know the teachers’ previous experience and training in teaching indonesian to foreign students;
(8) to evaluate the methodologies used;
(9) to evaluate if the students’ language level and cultural awareness reached after following two semesters of the bridging course at ysu are the optimal result;
(10) to evaluate if the needs and objectives are fulfilled at the end of the course; and
(11) to determine the strengths and weaknesses of the course.

research methods

this evaluation is a case study. to be able to evaluate this program, a summative evaluation was conducted by using the cipp method. this evaluation collected the data quantitatively and qualitatively.
this research was a pragmatic parallel mixed-method design research, since both quantitative and qualitative data were collected at the same time or within a short period of time (mertens, 2010, p.298). the analysis technique was descriptive for the qualitative data and descriptive statistics for the quantitative data. triangulation was used to compare the results of the analysis of the data from the questionnaires, interviews, outcomes from the course, and relevant bibliography.

the research was conducted at yogyakarta state university (ysu) for a period of one year. it started in june 2014 and finished in june 2015. the subjects of the evaluation were six students from the 2012 batch and nine students from the 2013 batch, six teachers, ten tutors, and three administrative staff. several lines of inquiry were used to evaluate the program: document review; literature review; administrative data analysis; and questionnaires and interviews with dcp students, teachers, and administrative staff.

the validity of the instruments was focused on the content and format of the instruments. in order to confirm that the instruments measured what they were supposed to measure, a pilot of the checklist, the questionnaires, and the interview paper was conducted. in addition, three experts were appointed to validate the credibility of the instruments. all of the instruments were improved.

the research data were divided into three main sectors: analyzing (from the researchers’ point of view), interviews and questionnaires (from the stakeholders’ point of view), and results (the outcomes at the end). first, since this was a mixed-method research, the data were divided and analyzed using quantitative and qualitative methods. second, the procedure of analyzing the data was also different according to the type of method used.
qualitative data

all data from the interviews were qualitative. a few questions from the questionnaires were quantitative. the questionnaires for administrative staff were an exception: only two administrators answered them, so the likert scale could not be used and a qualitative approach was employed. the qualitative questions are divided into two groups: descriptive questions for all participants and some descriptive questions for administrators. the criteria of the qualitative questions of the questionnaires for administrators were: option ‘a’ or ‘b’ was considered as a ‘strength’ of the program; option ‘c’ or ‘d’ was considered as a ‘weakness’ of the program. the data analysis strategy chosen for the descriptive qualitative data was the one presented by hesse-biber and leavy (2006, quoted by mertens, 2010, p. 424), which consists of three steps: (a) preparing data for analysis (organizing the data); (b) data exploration phase (reading, thinking, and making notes); (c) data reduction phase (selecting the relevant data and assigning a label).

quantitative data

the quantitative data were analyzed with different methods: likert, percentage, and mean. each method is explained as follows.

the percentage method

percentage was used as a method to analyze some of the quantitative data. there were two types of questions. one type of question gave the stakeholders the opportunity to choose only one option. the frequency of occurrence of each option in these questions was divided by the number of respondents in the group who answered it: two administrators, six teachers, ten tutors, and fifteen students. the formula is as follows:

$$\text{percentage} = \frac{\text{frequency}}{\text{number of stakeholders per group}}$$

the other type of question allowed multiple answers. stakeholders could choose or give as many answers as they felt necessary. the frequencies of these questions were summed, and then each frequency was divided by the total number of the questions.
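the two percentage calculations described above can be sketched as follows. this is a minimal illustration, not the authors' code: the function names and sample answers are hypothetical, the result is multiplied by 100 on the assumption that the values are reported as percentages, and for multiple-answer questions the divisor is read as the sum of all frequencies, which is one interpretation of the text.

```python
from collections import Counter

def single_choice_percentages(answers, group_size):
    """One option per respondent: frequency divided by the size of the group."""
    counts = Counter(answers)
    return {option: 100.0 * n / group_size for option, n in counts.items()}

def multiple_answer_percentages(answer_sets):
    """Several options per respondent: frequency divided by the sum of all frequencies."""
    counts = Counter(option for answers in answer_sets for option in answers)
    total = sum(counts.values())
    return {option: 100.0 * n / total for option, n in counts.items()}

# Hypothetical answers from a group of ten tutors to a single-choice question.
tutor_answers = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
print(single_choice_percentages(tutor_answers, group_size=10))

# Hypothetical multiple-answer question answered by three respondents.
print(multiple_answer_percentages([["x", "y"], ["x"], ["x", "z"]]))
```

passing the group size explicitly keeps the divisor equal to the number of stakeholders in the group, as in the first formula.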
the formula is as follows:

$$\text{percentage} = \frac{\text{frequency}}{\text{total number of the questions}}$$

modified likert scale

the modified likert scale was based on the categories of behaviours and interests proposed by mardapi (2008, p.123), where the categories of behaviours and interests are ranked by using the following formula:

table 1. categories of behaviours and interests

score x | categories
x ≥ m + 1 sd | very high
m ≤ x < m + 1 sd | high
m - 1 sd ≤ x < m | low
x < m - 1 sd | very low

for this case study, each question was analyzed individually, and the criteria used are shown in table 2.

table 2. category of the likert scale

criteria | category
x ≥ 3 | strong
2.5 ≤ x < 3 | strong
2 ≤ x < 2.5 | weak
x < 2 | strong weak

the criteria were made taking the following elements into account:
number of questions: 1
maximum: 4
minimum: 1
sd: 0.5
m: 2.5

the criteria for all the questions that were analyzed using the modified likert scale were the following: ‘a’ equals four points, ‘b’ equals three points, ‘c’ equals two points, and ‘d’ equals one point. then each answer given by the stakeholders was replaced by its value, and the mean was obtained in order to know to which category in tables 1 and 2 the question belongs.

mean

there were two types of mean used: a simple mean and a mean for ranking. the simple mean, used to get the average years of teachers’ and tutors’ teaching experience, was obtained by adding all the values and dividing the result by the number of participants. the second type was used for rankings. according to the statistical services center (2001, p.6), ‘the ordered categories can be reduced by accepting a degree of arbitrariness.
then, give scores to the categories and produce an average score.’ for this case study, the questions where rankings were asked followed these criteria: ‘1’ equals seven, ‘2’ equals six, ‘3’ equals five, ‘4’ equals four, ‘5’ equals three, ‘6’ equals two, and ‘7’ equals one. then, the frequency of each category in a question was multiplied by its score, the products were summed, and the sum was divided by the sum of all frequencies (n). the mean of each question was then taken into account. if the mean was μ < 4, the option of the question was considered as a ‘weakness’ of the program. if it was μ ≥ 4, it was considered as belonging to the category of the ‘strength’ of the program.

$$\mu = \frac{7 f_1 + 6 f_2 + 5 f_3 + 4 f_4 + 3 f_5 + 2 f_6 + 1 f_7}{n}$$

findings and discussions

context

for any foreign student, ysu requires an ‘upper intermediate’ level of language to enter the university. thus, an exam is applied to measure to which level a foreign student belongs. however, for dcp students, there is no language requirement to enter the bachelor’s or master’s programs, since the program only requires students to follow an indonesian language program, so there is no need to reach any level. therefore, there is no clear language level required for dcp students that is enforced by the administrators, and the teachers and tutors of the program have no clear target level for the students’ achievement.

the objectives of the program also fail to include specific academic language related to the students’ field of study. this cannot help students to meet their first need, ‘to learn the language to be competent in the class’. the syllabus did not tailor the vocabulary to different areas of study either, so it can be considered as failing to fulfil one need of students. brown (2007, p.78) explains that students need to communicate in a contextualized, appropriate and meaningful way.
being able to communicate fluently in their field of study will increase students’ confidence and grades. in addition, the syllabus is too general, leaving too much room for ‘flexibility’, which is a good characteristic for the experienced teachers, but not a good one for the inexperienced tutors. therefore, the language objectives and the syllabus can be considered as a weakness of the program.

the strength of the program lies in cultural awareness, where the program’s objectives are clear, with bringing awareness of yogyakarta as the main objective. moreover, most of the objectives are related to communication outside the school, interaction with the society, and the students’ daily life activities. all of the four language skills are covered, together with grammar.

the two main needs of students, ‘to learn the language to be competent in class’ and ‘to learn the language to communicate outside school’, are considered by the administrators, teachers, and tutors as important needs. however, the third main need of students, ‘to adapt the course to skills and needs’, was not considered by the administrators, teachers, and tutors as a very important need. in the class, students will show different abilities or skills. being able to recognize, plan, and bring activities that cover as many skills as possible will help all students to learn.

input

although the materials are considered as having good quality, the materials and extra materials in general are not enough; the books lack strength since they are not used widely by the teachers and tutors. tutors also claimed that the materials for them are insufficient. this can be considered as a ‘weakness’ of the program. however, the resources in the class are enough; this is considered as a ‘strength’ of the program.
lesson plans were not given to the administrators, although they are an important part of the assessment of students’ learning. moreover, teachers and tutors did not always plan the lessons based on the syllabus, since the syllabus is too general. teachers and some tutors declared that they planned their lessons based on what they felt students needed to learn. the lesson planning should be balanced between the students’ needs and the syllabus to fulfil the objectives of the program. thus, the way the lessons have been planned can be considered as a ‘strength’ of the program, but the fact that teachers and tutors do not hand in their lesson plans can be considered as a ‘weakness’ of the program.

teachers have sufficient experience in teaching in general and in teaching indonesian language to foreign students. therefore, training for three days is more than enough for them. this is a ‘strength’ of the program. however, tutors are less experienced and a three-day training is not enough for them. this can be considered as a ‘weakness’ of the program. before the students started the course, no information was given to the teachers or tutors about the students’ background and level of english. tutors were not given enough time or support to prepare their lessons. since they are inexperienced, this lack of information affects their performance. these last two points are considered as ‘weaknesses’ of the program.

process

all of the four skills and grammar are included in the program. this is considered as a ‘strength’ of the program. however, when the balance of activities is examined, it is noticeable that the activities are mainly focused on ‘speaking’, ‘reading’ and ‘listening’. ‘pronunciation’ and ‘grammar’ are not practiced as often. in addition, regarding the barriers faced by students, some mentioned that they could not pronounce indonesian words very well and that they lacked grammar knowledge. thus, these skills need to be balanced.
balancing the activities is necessary in any language program; otherwise, students will end up being great at some skills but lacking in others. grammar should not be neglected, since accuracy is necessary to be competent in a language. this imbalance can be considered as a ‘weakness’ of the program.

feedback is considered as a ‘weakness’ of the program, since it is not constantly given and it is one of the most important factors in students’ improvement. support from the tutors, on the other hand, is considered as a ‘strength’ of the program, since it is constantly given. yet, support from the teachers is not constantly given, so it can be considered as a ‘weakness’ of the program.

students, teachers and tutors are satisfied with the way the assessment was done. however, the assessment needs to be reinforced, since not all the participants know when and how the assessment should have been done. in addition, there was no control from the administrators, there was no clear database of the results, and there was no consistency in the way the grades were provided for the two semesters. students did not know the level they achieved. thus, the assessment can be considered as a ‘weakness’ of the program.

product

administrators agreed that the results of the dcp students were ‘as expected’ in general after the two semesters of the bridging course at ysu. more than half of the students, most of the teachers, and the tutors also agreed that the students were ‘prepared’ at the end of the course. administrators and teachers agreed in the questionnaire that the course brings students to a certain level but not to an upper intermediate level. students also expressed that they were prepared but that they faced problems.
the fact that students are prepared as expected can be considered as a ‘strength’ of the program, but the level of indonesian language required by the program needs to be clear among all stakeholders. ‘the tutorial classes’ are seen as the main characteristic that helps students to reach the required level of indonesian language. for this reason, the tutorial lessons need more care; for example, tutors need better training and more time to prepare the lessons. the tutorial, as part of the strategy to help students to reach a good level of language, can be considered as a ‘strength’ of the program. however, the way tutors conduct the lessons is still not satisfying to students. as explained by killen (2009, p.77), the way students learn is important because their learning experiences will directly influence their motivation and their future learning strategies. thus, the methodology used by tutors can be considered as a ‘weakness’ of the program, and this goes together with the training given.

the program’s objectives are ‘clear and realistic’ and ‘taking students’ needs into account’. the objectives are realistic in terms of helping students to adapt and bringing them to an intermediate level of communication. however, the program fails to take students to an upper intermediate level which, according to the administrators, is the level that any foreign student should have to be able to follow lessons in the ysu masters or bachelors programs. students themselves confirmed that their level of language was intermediate, and during their first semester they faced problems such as: ‘they were not able to understand the language’, ‘it was not like in the bridging course’, ‘they were not able to fulfil assignments completely in indonesian language’, ‘they were not able to speak fluently’, and other problems.
therefore, the objectives of the program regarding cultural awareness can be considered as a ‘strength’ of the program, but the level of language that is needed to follow a masters or bachelors degree is considered as a ‘weakness’ of the program.

students, teachers and tutors expressed that ‘students were aware of indonesian culture but mainly about yogyakarta’. the main reasons are that ‘the syllabus was designed to promote cultural awareness’, that ‘the material was designed and selected to promote cultural awareness’ and that, according to the students, ‘indonesian culture is similar to their culture’. being aware of indonesian culture, especially that of yogyakarta, is the main objective expressed by the administrators, teachers and tutors. thus, it can be considered as a ‘strength’ of the program.

regarding adaptation, administrators, teachers, and tutors agree that ‘most of them adapted very well and a few of them are just well’; more than half of the students agree that the bridging course helps them mainly to interact with people. in general, the students’ adaptation to society is well covered by the course. thus, it can be considered as a strength of the bridging course.

on the other hand, regarding whether the bridging course helps them to integrate into their masters or bachelors lessons, more than half of the students chose the option ‘somehow, but there are some things i did not learn in the classes’. the things that students wished to know before entering their masters or bachelors course are: ‘culture inside the class’, ‘javanese language’, and also ‘the learning style’. during the interviews and in the recommendations given to improve the course, students mentioned that academic language is one of the things they lacked during the bridging course.
for their adaptation to the lessons, there are still some needs that have to be covered by the course, so this can be considered as a ‘weakness’ of the program.

students expressed that they were ‘satisfied’ with the course, and ‘satisfied’ with the testing model. they also mentioned that the methodology was ‘most of the times adapted to their learning styles, but in a few times, they did not feel the teaching methodology was adequate for them’. one student mentioned in the interview that his/her needs regarding knowledge were not totally fulfilled. another student mentioned that he/she did not feel prepared to face the course, since he/she had poor academic language knowledge.

these results can be compared with the main need of students, ‘to learn the language to be competent in class’. first, it is important to remark that the administrators agreed that the ideal level to enter a masters or bachelors degree is the upper intermediate level. however, the course itself cannot lead a student to that level since the objectives are not clear. therefore, students struggle in the first semester of their masters or bachelors degree. the main need of students is not completely fulfilled, so this can be considered as a ‘weakness’ of the program.

the second need of students is ‘to learn the language to communicate outside the school’. this need is clearly fulfilled since most of the activities are ‘speaking activities’. students also declared that the bridging course helps them to ‘interact with people’. this can be considered as a ‘strength’ of the program.

the third main need of students is ‘to adapt the course to skills and needs’. as observed, the activities presented were varied and covered the four skills: ‘speaking’, ‘reading’, ‘writing’ and ‘listening’. however, ‘grammar’ and ‘pronunciation’ were not widely practiced, and some students, in the interviews and in their advice to improve the program, wrote that it is important to conduct more practice.
students evaluated their teachers in a positive way, agreeing that they appreciated their experience and support. however, feedback, support, punctuality, and teaching methods are recommended by some students as the features that need to be improved by the teachers and tutors. teachers’ performance can be considered as a ‘strength’ of the program. for tutors, students also considered their performance as positive, since it helps them to understand the language, and the tutors were friendly and helpful. however, the teaching methods and feedback were negative aspects that the tutors need to improve. tutors’ performance can be considered as a ‘weakness’ of the program since more than 50% of the students are not satisfied with the methods used by the tutors.

there is no report of any evaluation of the course during the years that the bridging course has been offered. this contradicts the advice expressed by garavalia, et al. (1999, p.15) that faculties which are interested in improving the quality of their syllabi should obtain feedback from a variety of sources including students, other faculty members and administrators, and personal reflection.

conclusion and suggestions

conclusion

in the context, the course has clear objectives about the awareness of culture that should be reached by dcp students, and it promotes communication skills through the practice of the four language skills. the stakeholders are aware of the two main needs of students: ‘to be competent in class’ and ‘to communicate outside the school’. on the other hand, the level of indonesian language that dcp students should reach is not clear, and it is not the same as that requested of non-dcp foreign students. moreover, the syllabus fails to include academic language specific to the field of study of the dcp scholarship participants.
from the evaluation of the input, it is found that the lesson planning is balanced between the students’ needs and the syllabus. the training provided to teachers and tutors is around 4 days; for teachers (who already have sufficient experience), the training is enough, but for tutors who lack experience, the training is not enough; in addition, training on how to handle students who do not speak english is not provided. there is a variety of materials and resources provided by the administrators. nevertheless, the materials are not enough for the teaching of the course. lesson plans are not handed in to the administrators by either teachers or tutors; in addition, tutors are not given enough time to prepare the lessons. another weakness is that teachers and tutors are not given specific information about dcp students’ background.

the results for the process show that support is given to students during their adaptation. this support is constantly given by tutors but not by teachers. on the other hand, teachers’ methods are suitable to students, but tutors’ methods are not. nevertheless, the bridging course supports students’ adaptation with a range of activities and materials related to cultural understanding. pronunciation and grammar are found not to be regularly practiced, and feedback is not constantly given. administrators, teachers, tutors and students do not know when and how the assessment should have been done. the monitoring offered to teachers and tutors is constant.

the last part of the evaluation is the product; the results show that the program helps students to adapt and brings them to an intermediate level of communication. the tutorial classes play a key role since they help students to improve their level of indonesian language. in addition, students are aware of indonesian culture, mainly the culture of yogyakarta.
in general, students evaluate their teachers in a positive way; the experience possessed by teachers is the main characteristic appreciated by students. tutors are also viewed positively by students since they help students to understand the language, and they are friendly and helpful. despite the good results, students do not reach an upper intermediate level of indonesian language. in addition, there are some adaptation problems inside the classroom that are not covered by the program, such as the lack of academic language, or problems with javanese language. the main need of students is not completely fulfilled, since they do not reach a level of the language that allows them to be competent in class. in addition, teachers’ feedback, punctuality, and teaching methods, as well as the tutors’ performance, do not completely satisfy students. control from the administrators regarding the results database is weak, since they do not have all the results from students and the way the grades are presented in the diplomas is not consistent. in the end, students are not able to know their level of language.

suggestions

administrators of the bc knb ysu

it is necessary to establish the language level goals and the passing criteria; it is also important to include academic language specific to the area of the students’ bachelors or masters degree, personalizing the vocabulary by creating opportunities such as reading, listening, writing or presenting. teachers and tutors need to be given more information about students’ background, such as their previous studies, skills (language, computer, and so on), difficulties while learning, life style, and personality. tutors should be given more training on lesson planning, improvisation, and methods. the training should include effective feedback and strategies to deal with students who do not speak english or the target language.
another important point is that the lesson plans should be requested from teachers a week in advance, as a way to help tutors to plan better and to give support if it is requested. evaluating the materials and then revising them together with teachers, tutors and, if possible, students is also recommended. in addition, the assessment system needs to be revised and controlled. if possible, a test should be administered by the administration to keep control of the students’ level. moreover, the way the diplomas present the results should be improved by providing a standardized criterion and more information about the level acquired. in addition, a syllabus has to be made, taking into account that dcp students have eight months to reach an upper intermediate level of the language, with more focus on language learning as well as on academic language personalized to the areas of study. it is also possible to follow up with students after their first semester in their master or bachelor programs, to know what problems they face and to address these in the following bridging courses.

teachers and tutors

the teachers and tutors are expected to know students’ expectations and limitations. it is also necessary to give more support and practice in pronunciation as well as in grammar. in addition, activities that encourage students to learn more about their target field in their master or bachelor degree should also be included. a key point that needs to be improved is feedback. feedback is important, and it needs to be given constantly and effectively, since students can learn from their mistakes and progress. for teachers, it is suggested that punctuality be treated as a characteristic that is important to students; more constant support is also required by the students.
for tutors, more preparation or reading about the methods that can be implemented in class is necessary. moreover, looking for support is also important.

future researchers

if someone is interested in conducting research on the dcp bridging course, the following elements can be evaluated independently: the methods offered in the tutorials, the materials used in the bridging course, the problems faced by dcp students in their first semester, the classroom culture a foreigner should know, and the academic language needed.

references

brown, h.d. (2007). teaching by principles: an interactive approach to language pedagogy (3rd ed.). white plains, new york: pearson education.

fitzpatrick, j.l., sanders, j.r. & worthen, b.r. (2010). program evaluation: alternative approaches and practical guidelines (4th ed.). new jersey: pearson education.

garavalia, l.s., hummel, j.h., wiley, l.p., & huitt, w.g. (1999). constructing the course syllabus: faculty and student perceptions of important syllabus components. journal on excellence in college teaching, 10(1), 5-21.

killen, r. (2009). effective teaching strategies: lessons from research and practice (5th ed.). south melbourne, vic: cengage learning.

directorate general of higher education. (2014). knb program [dcp program]. jakarta: directorate general of higher education. retrieved on july 23rd 2014, from: http://www.knb.dikti.go.id/index.php.

lee, m. (2012). history of study abroad: part 1 (1190 – 1900), go overseas. retrieved on november 27th 2014, from: www.gooverseas.com/go-abroadblog/history-study-abroad-part-1.

mardapi, d. (2008). teknik penyusunan instrumen tes dan non tes [test and nontest instrument arrangement technique]. yogyakarta: mitra cendekia.

mertens, d.m. (2010). research and evaluation in education and psychology: integrating diversity with quantitative, qualitative and mixed methods (3rd ed.). thousand oaks, ca: sage.
moore, k.d. (2009). effective instructional strategies: from theory to practice (2nd ed.). thousand oaks, ca: sage.

the statistical service center. (2001). approaches to the analysis of survey data. biometrics advisory and support service to dfid. reading, uk: the university of reading statistical services centre. retrieved on february 22nd 2015, from: http://www.reading.ac.uk/ssc/resources/docs/approaches_to_the_analysis_of_survey_data.pdf.

stufflebeam, d.l., madaus, g.f., & kellaghan, t. (2002). evaluation models: viewpoints in education and human services evaluation (2nd ed.). dordrecht: kluwer academic.

research and evaluation in education journal, e-issn: 2460-6995
volume 1, number 1, june 2015 (55-72)
available online at: http://journal.uny.ac.id/index.php/reid

estimation of ability and item parameters in mathematics testing by using the combination of 3plm/grm and mcm/gpcm scoring model

1) abadyo; 2) bastari
1) malang state university, indonesia; 2) ministry of education and culture, indonesia
1) aabadyo@gmail.com; 2) bastari@kemdikbud.go.id

abstract

the main purpose of the study was to investigate the superiority of scoring by utilizing the combination of mcm/gpcm model in comparison to 3plm/grm model within a mixed-item format of mathematics tests. to achieve the purpose, the impact of the two scoring models was investigated based on the test length, the sample size, and the m-c item proportion within the mixed-item format test, and the investigation was conducted on the aspects of: (1) estimation of ability and item parameters, (2) optimization of tif, (3) standard error rates, and (4) model fitness on the data. the investigation made use of simulated data that were generated based on a fixed effects factorial design 2 x 3 x 3 x 3 and 5 replications, resulting in 270 data sets.
the data were analyzed by means of fixed effect manova on the root mean square error (rmse) of the ability and on the rmse and root mean square deviation (rmsd) of the item parameters in order to identify the significant main effects at the level of α = .05; the interaction effects, on the other hand, were incorporated into the error term for statistical testing. the -2ll statistics were also used in order to evaluate the model fitness on the data set. the results of the study show that the combination of mcm/gpcm model provides more accurate estimation than that of 3plm/grm model. in addition, the test information given by the combination of mcm/gpcm model is three times higher than that of 3plm/grm model, although the test information cannot offer a solid conclusion about which sample size and m-c item proportion on each test length provide the optimal test information. finally, the differences of fit statistics between the two models of scoring favor the mcm/gpcm model over the 3plm/grm model.

keywords: estimation, ability, item parameter, mathematics test, 3plm/grm model, mcm/gpcm model

introduction

in the 1990s the national examination in indonesia was known as evaluasi belajar tahap akhir nasional (ebtanas) or, literally translated into english, final stage of national learning evaluation. the test items for mathematics in that period were mixed ones consisting of 35 multiple choices and 3 essays. since 1999, such mixed-item format has not been used in the national examination and, unfortunately, there has not been any proper explanation for this circumstance, whereas the use of multiple-choice (m-c) and constructed-response (c-r) test items was heavily implemented in the usa and the other countries (chon, lee & ansley, 2007, p.1).
studies regarding the mixed format of m-c and c-r test items based on the item response theory were popularly conducted in the early 1990s (e.g., wainer & thissen, 1993, pp. 103-112 and lukhele, thissen & wainer, 1994, pp. 234-250). these studies were then followed by other ones conducted by tang & eignor, 1997, pp. 113; kennedy & walstad, 1997, pp. 359-375; berger, 1998, pp. 248-258; ercikan et al., 1998, pp. 137-154; lau & wang, 1998, pp. 1-13; garner & engelhard, pp. 29-51; li, lissitz, & yang, 1999, pp. 1-34; bastari, 2000, pp. 1-78; kinsey, 2003, pp. 1-110; meng, 2007, pp. 1-344; chon, lee, & ansley, 2007, pp. 1-21; cao, 2008, pp. 1-163; jurich & goodman, 2009, pp. 3-25; hagge, 2010, pp. 1-284; and he, 2011, pp. 1-174.

the studies regarding the mixed format of m-c and c-r test items mentioned above in general make use of a dichotomous scoring scheme for the m-c test items and of a polytomous scoring scheme for the c-r test items. the dichotomous scoring scheme provides two possible results for each item response, namely ‘1’ for each correct answer and ‘0’ for each incorrect answer (bastari, 2000, p. 1; kinsey, 2003, p. 2; reynolds, livingston, & willson, 2009, p. 195). on the other hand, the c-r test items are used for gathering information regarding the incomplete knowledge that the test participants may possess, by demanding the test participants to provide a response to an item prompt (for example, open-ended answers, short answers and essays). the c-r test items are usually scored according to the degree of item completion or the degree of item correctness on a hierarchical scale of correctness (bastari, 2000, p. 1; reynolds, livingston, & willson, 2009, p. 223). numerically, the score for the polytomous item depends on the selected irt model. for example, if a model with k response categories is selected, then the answer will be scored as 1, 2, ..., k.
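the two scoring schemes described above can be illustrated with a minimal sketch. the function names, the example responses and the answer key are hypothetical and serve only as an illustration; coding a missing response as category 0 follows the multilog convention mentioned in the text, and nothing here is taken from the authors' own scoring procedure.

```python
def score_dichotomous(responses, key):
    """Score m-c items: 1 if the chosen option matches the key, 0 otherwise."""
    return [1 if r == k else 0 for r, k in zip(responses, key)]


def score_polytomous(levels, k):
    """Score c-r items on categories 1..k; a missing response (None) is coded 0."""
    scored = []
    for level in levels:
        if level is None:
            scored.append(0)          # missing data gets its own category 0
        elif 1 <= level <= k:
            scored.append(level)      # graded category 1, 2, ..., k
        else:
            raise ValueError(f"category {level} is outside 1..{k}")
    return scored


# Hypothetical data: four m-c items and three c-r items with k = 4 categories.
mc_responses = ["a", "c", "b", "d"]
mc_key = ["a", "b", "b", "c"]
print(score_dichotomous(mc_responses, mc_key))  # [1, 0, 1, 0]
print(score_polytomous([3, None, 1], k=4))      # [3, 0, 1]
```

in the dichotomous result the information about which distractor was chosen ('c' and 'd' in the example) is no longer visible, whereas the graded categories of the c-r items are kept.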
for each missing response, multilog provides a category with a score of ‘0’.

the dichotomous scoring scheme for the m-c test item format has several weaknesses, because collapsing the incorrect options, or the distractors, into a single category might cause a loss of information in the test scores. de ayala (1989, p. 790) states that dichotomization assumes that “the test participants act under the principles of knowledge-or-random.” as a result, partial knowledge regarding the test participants’ trait might be ignored when the test items are dichotomized, and the estimation of the test participants’ ability tends to be less accurate.

there has been empirical evidence on the selection of distractors in relation to the test participants’ characters/traits. the empirical evidence shows that certain distractors might be selected most of the time by test participants with different characters/traits (bock, 1972, p. 29; levine & drasgow, 1983, p. 675; sadler, 1998, pp. 289-290; thissen, 1976, p. 201; thissen & steinberg, 1984, p. 501; thissen, steinberg, & fitzpatrick, 1989, pp. 161-162; wainer, 1989, p. 192). the evidence supports the hypothesis that partial information might be attained from the distractors. if the selection of distractors were not related to the test participants’ characters/traits, then the probability that the test participants select the distractors would be distributed evenly over all of the available options at every character/trait level, based on the principle of equally-likely. irt has several models that might model the distractors within the multiple choice test items.
these models are usually named nominal models because, in an a priori manner, they do not assume an order among the response categories, although a relative order within the test participants’ characters/traits is assumed to exist. the two well-known nominal models for modelling the distractors are the bock nominal model (bock, 1972, pp. 29-51) and the thissen nominal model (thissen & steinberg, 1984, pp. 501-519, also known as the multiple choice model) (demars, 2008, p.3). in this study, the researchers would like to review the multiple choice model (mcm) further. mcm is an expansion of the bock nominal model, which is expanded by adding a latent category known as “don’t know” or dk (penfield & torre, 2008, p.6), and it is appropriate for explaining the response-accuracy model in complex cognitive tasks (hoskens & de boeck, 2001, p. 19).

glasersfeld (1982, p. 613) states that, according to piaget, the cognitive tasks involved in establishing knowledge are related to the outside world, and these tasks are named “cognitive adaptation.” one of the domains within the outside world is mathematics, which has been developed into wide networks of abstract hierarchical concepts. in addition, each individual develops mathematic knowledge within himself or herself through assimilation or accommodation. since the mind is limited, multiple strategies are used to reduce the mental content, including compression such as grouping and naming certain mathematic learning materials. however, the compression has certain impacts; for example, the fraction ¾, the division of 3 by 4, and the multiplication of ¼ by 3 will be compressed by an individual into a single object, namely ¾. this individual does not consider that the three objects are different. such matter might be resolved by means of the ‘think aloud’ method in order to attain the correct answer (someren, barnard, & sandberg, 1994, p. 142; gierl, wang, & zhou, 2008, p. 17).
the last matter will be difficult to resolve by using the m-c test items or, in general, by using selected-response test items; however, the mcm provides guessing proportion parameters in order to accommodate such unexpected aspects.

kinsey (2003, p. 3) states that there has been a new trend within recent assessment that has encouraged an increase in the practice of combining several test item types and scoring schemes within a testing format, and such trend is known as the mixed-format test item. the objective of the combination is to generate a more authentic ability measurement, because the variation of scoring scheme toward the test items might be dichotomous-polytomous or polytomous-dichotomous. the mixed-format test item for the achievement test often consists of m-c test items and multiple c-r test items (traub, 1993, p. 30; wainer & thissen, 1993, p. 103; ercikan, et al., 1998, p. 138; sykes & yen, 2000, p. 222; chon, lee, & ansley, 2007, p. 1). the competitive edge of the mixed-item test format, or the combination of two test item types into a single assessment, is that such method might improve both the reliability and the validity, or the information, of the assessment (lau & wang, 1998, p. 8).

the studies of mixed-format tests consisting of m-c test items and c-r test items conducted by several researchers have not made use of mcm in scoring the m-c test items. for instance, bastari (2000, pp. 1-78) implemented 3plm (a dichotomous scoring scheme) for the m-c test items and grm for the c-r test items within a mixed-format test. bastari (2000, p. 54) also recommended the use of the 3plm/gpcm combination for estimating the parameters in the mixed-format test consisting of m-c and c-r items. other researchers, such as kinsey (2003, p. 91) and chon, lee, & ansley (2007, p. 12), also provided recommendations similar to that of bastari.
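as a minimal sketch of three of the models named above, the commonly cited textbook forms of the 3plm, the mcm (thissen & steinberg, 1984) and the gpcm can be written as response probability functions. the function names, the scaling constant D (set to 1.0 here, while some software uses 1.7) and all parameter values are illustrative assumptions; the sketch is not the authors' implementation and does not reproduce the parameterization of any particular estimation program.

```python
import math


def p_3plm(theta, a, b, c, D=1.0):
    """3PLM: probability of a correct answer to a dichotomously scored m-c item.

    a = discrimination, b = difficulty, c = pseudo-guessing, D = scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_mcm(theta, a, c, d):
    """MCM: probabilities of choosing each observable option 1..m of an m-c item.

    a[0], c[0] belong to the latent "don't know" (DK) category and a[k], c[k]
    (k >= 1) to option k. d[k-1] is the proportion of DK examinees who guess
    option k, so the values in d should sum to 1.
    """
    z = [a_h * theta + c_h for a_h, c_h in zip(a, c)]
    z_max = max(z)                              # subtract max for numerical stability
    e = [math.exp(v - z_max) for v in z]
    denom = sum(e)
    return [(e[k] + d[k - 1] * e[0]) / denom for k in range(1, len(a))]


def p_gpcm(theta, a, steps):
    """GPCM: probabilities of score categories 0..m of a c-r item.

    steps holds the m step parameters b_1..b_m; a is the item discrimination.
    """
    cum = [0.0]                                 # cumulative sum for category 0 is 0
    for b_v in steps:
        cum.append(cum[-1] + a * (theta - b_v))
    c_max = max(cum)
    e = [math.exp(v - c_max) for v in cum]
    total = sum(e)
    return [v / total for v in e]


if __name__ == "__main__":
    # Hypothetical item parameters for one examinee with theta = 0.5.
    print(p_3plm(0.5, a=1.2, b=0.0, c=0.2))
    print(p_mcm(0.5, a=[-1.2, 0.9, 0.1, 0.0, 0.2],
                c=[0.0, 0.5, -0.2, -0.1, -0.2], d=[0.25] * 4))
    print(p_gpcm(0.5, a=1.0, steps=[-0.5, 0.3, 1.1]))
```

the 3plm returns a single probability because all distractors are collapsed into one incorrect category, whereas the mcm returns one probability per option, with the d values playing the role of the guessing proportions of the dk category.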
the two studies are important for assessment development that might generate a more authentic ability measurement and that might improve both the reliability and the validity, or the information, of the test items and the test formats. in order to achieve this objective, there should be an investigation of the performance of the combination of dichotomous and polytomous irt models in analyzing (especially, in estimating the parameters of) mathematic mixed-format test items.

a mathematic test demands the test participants to use mathematic protocols for analyzing problems in the actual world, for designing and determining the resolution strategies and for testing the appropriateness of the resolution. the test participants should show their understanding of the mathematic terminologies; in other words, the test participants need to use definitions, algorithms, theorems and other traits for solving a mathematic problem. the test participants are also expected to be able to analyze and interpret the given data (epas, 2008, p. 28). one of the objectives in conducting a mathematic test is to assess the test participants’ ability in transferring the qualitative reasoning and the problem-solving skills from one context to another. therefore, the mathematic test will continuously be challenged by new situations. the items within the mathematic test include four cognitive levels, namely knowledge and skills, direct application, concept understanding and conceptual integration understanding.

the cognitive development within mathematic reasoning and the ability to provide mathematic evidence are based on basic human aspects, namely perception, action and language, as well as the use of symbolization, which enable us to develop increasingly sophisticated and logical options into a sophisticated knowledge structure.
this development is based on what has been called the sensori-motor language of mathematics (tall et al., 2012, p. 1). given this account of cognitive development in mathematical reasoning, a mathematics test should be able to capture graded response patterns in order to assess mathematical cognitive ability. the polytomous irt models that fit graded response patterns are the graded response model (grm), the partial credit model (pcm), the generalized partial credit model (gpcm) and the multiple choice model (mcm).

conventional m-c test items are generally scored dichotomously by means of the 1plm, 2plm or 3plm; in mathematics, conventional m-c test items might also be scored polytomously by means of the mcm. the underlying idea is that each option is able to describe gradual partial knowledge, up to the option (the key) that describes perfect knowledge or ability. in addition, the mcm is derived from the nominal model. as a result, although the options do not strictly show gradual partial knowledge, the mcm is still able to perform well in the analysis of m-c items.

the study extends the estimation of ability and item parameters in mixed-format tests by considering the recommendations of previous researchers and by modifying the scoring scheme. the scoring modification concerns the m-c item format, which was previously scored by means of the 3plm and is now scored by means of the mcm. the change of the scoring scheme is still linear; in mathematical terms, the researchers would like to show that the 3plm is one of the mcm derivations. the study conducted by bastari (2000, pp. 1-78) made use of a 3plm/grm combination in order to link the items of a mixed-format test to a common scale.
on the other hand, this study makes use of an mcm/gpcm combination in order to estimate the ability and item parameters of a mixed-format mathematics test. due to the change of the scoring scheme, the main problem discussed in the study is 'how does the mcm/gpcm combination perform in comparison with the 3plm/grm combination in analyzing the mixed-format mathematics test?' in order to answer the main problem, the researchers studied the influence of the 3plm/grm combination and of the mcm/gpcm combination on: (1) the accuracy of the ability parameter estimates and of the item parameter estimates; (2) the optimization of the test information function (tif); (3) the derivation of the estimation standard errors; and (4) the compatibility of each model combination with the data (the model-data fit) for various proportions of m-c and essay items, test lengths and sample sizes. the scoring model combination, the proportion of m-c and essay items, the test length and the sample size are the factors manipulated in the study. finally, the performance results of the two model combinations are compared in order to find which combination is superior.

estimation of ability and item parameters... abadyo & bastari

method

the study was a simulation study implementing a 3 x 3 x 3 x 2 fixed-effect factorial design. the first factor consisted of three proportions of m-c and essay test items (75:25, 80:20, and 90:10). the second factor consisted of three test lengths (20, 40 and 60 items, chosen to reflect the sub-summative test, the national examination and the aptitude test, respectively). the third factor consisted of three simulation sample sizes (400, 1000, 3000).
the fourth and final factor consisted of two scoring-scheme combinations, namely the 3plm/grm combination and the mcm/gpcm combination.

the study was conducted in the educational research and evaluation laboratory, the computer laboratory and the graduate program library of yogyakarta state university. the study took almost one year, from august 2013 until june 2014. from august until december 2013, syntax for the parscale and multilog software was developed. for the syntax development, standard normally distributed θ ability data for 1000 test participants were generated by means of wingen2. based on these θ data, which were assumed to be the true ability (true theta), the researchers generated responses to 20, 40 and 60 test items for 54 x 5 = 270 data sets according to the data collection design, and the data sets were replicated by means of wingen2 as well. however, the response data could not be run in the parscale software. finally, in january 2014 the researchers decided to obtain the simulation data by using ms excel 2007, based on the response data of the 2003 junior high school national examination in mathematics in the province of yogyakarta special region.

the data were obtained by the researchers themselves in the following phases. first, the researchers tested the unidimensionality assumption by applying exploratory factor analysis (efa) to the mathematics test items of the 2003 junior high school national examination. at the beginning, the 40 test items of the national examination did not meet the unidimensionality assumption. after the data had been reduced repeatedly, the researchers found 33 items that met the assumption; these items are shown in the scree plot in figure 1.

figure 1. scree plot of efa results for 33 test items of the 2003 junior high school mathematics national examination

second, based on the responses to the unidimensional test items obtained through ms excel 2007, the researchers estimated the θ ability by running the parscale 4.1 software, and this θ ability was assumed to be the true theta. the normality of the θ distribution was tested with the minitab 16 software, and the results showed that the θ distribution was not normal. after editing the data, the researchers found many outliers that caused the distribution to be asymmetrical. these outliers are shown by the asterisks in the boxplot in figure 2.

figure 2. boxplot of the θ ability distribution of the test participants of the 2003 junior high school mathematics national examination

by removing several scores, including the extreme ones, and then repeating the normality test, the researchers eventually obtained a normal distribution with mean 0.1466 and standard deviation 0.8803 for the ability of 2323 test participants of that year. these findings are shown by the results of the anderson-darling normality test in figure 3.

figure 3. anderson-darling normality test

third, random samples of size 400, 1000 and 3000 were drawn from the normal θ ability distribution shown in figure 3 by random sampling with replacement, using the minitab 16 software. these samples contained the response data obtained through ms excel 2007 in the second phase. the data were the results of scoring through the 3plm/grm combination and the mcm/gpcm combination for the test lengths and the proportions of m-c and essay items mentioned previously.
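the factorial design and the third phase can be sketched as follows. this is a hypothetical python illustration, not the authors' minitab or ms excel procedure: the θ pool is a simulated stand-in that only borrows the reported mean and standard deviation, and all names are assumptions. it shows that the four factors give 3 x 3 x 3 x 2 = 54 cells and that drawing 3000 abilities from a pool of 2323 participants is only possible with replacement.

```python
import itertools
import random

# the four manipulated factors described in the method section
proportions = ["75:25", "80:20", "90:10"]   # m-c : essay items
test_lengths = [20, 40, 60]
sample_sizes = [400, 1000, 3000]
models = ["3plm/grm", "mcm/gpcm"]

# every cell of the 3 x 3 x 3 x 2 fixed-effect factorial design
design = list(itertools.product(proportions, test_lengths, sample_sizes, models))
print(len(design))  # number of design cells


def draw_sample(theta_pool, n, seed=0):
    """draw n abilities from the pool by random sampling with replacement."""
    rng = random.Random(seed)
    return rng.choices(theta_pool, k=n)


# hypothetical stand-in for the 2323 'true theta' values (not the real data)
pool_rng = random.Random(2003)
theta_pool = [pool_rng.gauss(0.1466, 0.8803) for _ in range(2323)]

samples = {n: draw_sample(theta_pool, n, seed=n) for n in sample_sizes}
for n, sample in samples.items():
    print(n, len(sample))
```

fixing the seed per sample size keeps each draw reproducible, which is convenient when the same abilities have to be scored under both model combinations.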
[figure residue for figures 2 and 3: boxplot of the ability distribution (vertical axis: ability distribution); normal probability plot of the ability of 2323 participants of the 2003 junior high school national examination (horizontal axis: ability; vertical axis: percent; mean 0.1466, stdev 0.8803, n 2323, ad 0.700, p-value 0.068)]

in addition, when preparing the responses for the mcm, the researchers checked the order of the m-c item options by means of the multilog software. the order was based on the relative frequency, or the probability, with which a high-ability individual chooses each option. these probabilities are described by the item characteristic curves (icc) of the test items. for example, test item number 12 had four options. the multilog software uses the code '0' for missing data and '1' for the 'don't know' (dk) response; therefore, the response codes for the answers 1, 2, 3, 4 were shifted to 2, 3, 4, 5. figure 4 shows the icc of test item number 12. curve 5 describes the probability that the test participants respond in the highest category (the answer key). the remaining curves, namely curves 4, 3 and 2, describe the probability that the test participants respond in the distractor categories, whose level of correctness is below that of the correct answer. at the high or above-average ability levels, the order of the probabilities (described by the order of the curves) is clear. if the test item had not been good, the order would be difficult to determine or could not be found at all.

figure 4. icc of test item number 12

the control variables within the simulation study were the scoring model combination, the proportion of m-c and essay test items, the simulation sample size, and the test length.
then, the response variables were the accuracy of the ability and item parameter estimates, the tif and the estimation standard errors (s.e (θ) and s.e (par)). the response data of the 270 data sets were saved with the prn extension. each data set was run in the parscale 4.1 software by using the syntax that had been developed previously. the outputs generated by parscale 4.1 for each design combination were the ability estimates (theta estimates), the item parameter estimates (slope estimates, location estimates and guessing parameter estimates) and the -2ll statistic. the parameter estimation accuracy was evaluated by using the root mean squared error (rmse) and the root mean squared difference (rmsd) criteria.

findings and discussions

the study was conducted to answer four research questions, namely how the 3plm/grm combination and the mcm/gpcm combination influence: (1) the accuracy of the ability parameter estimates and of the item parameter estimates; (2) the optimization of the test information function (tif); (3) the derivation of the estimation standard errors; and (4) the compatibility of each model combination with the data (the model-data fit) for various proportions of m-c and essay items, test lengths and sample sizes.

in order to answer the first question, a fixed-effect manova was run on the rmse (θ), the rmse (par), the rmsd (slope), the rmsd (location) and the rmsd (guessing) for the main effects of model, sample size, proportion and test length. the researchers only investigated the significance of the main effects, because the interactions were incorporated into the error term of the statistical tests, since each cell of the factor combinations contained only one datum (bastari, 2000, p. 31).
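the accuracy criterion named above can be illustrated with a short python sketch. the paper does not spell out its formulas, so the standard definition of the rmse is assumed here, and the rmsd is assumed to take the same form when estimated item parameters are compared with their reference values; the function name and the numbers are hypothetical.

```python
import math


def rmse(estimates, true_values):
    """root mean squared error between estimates and their reference values."""
    if len(estimates) != len(true_values) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared) / len(squared))


# hypothetical true thetas and their estimates for three test participants
true_theta = [-1.0, 0.0, 1.0]
theta_hat = [-0.8, 0.1, 1.3]
print(round(rmse(theta_hat, true_theta), 4))
```

a smaller value means that the estimates lie closer to the reference values, which is the sense in which the marginal means of rmse and rmsd are compared between the two model combinations.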
the effect sizes of the significant factors were evaluated by using the partial eta squared (η²) and cohen's criteria (1988), which state that η² values of 0.1, 0.25 and 0.4 indicate small, moderate and big influence, respectively. the manova in the study employed the significance level α = 0.05. the results of the manova show that the pillai's trace and the wilks' lambda statistics are significant except for the test length; these results are presented in table 1, which shows the p-values of the main effects with rmse (θ) and rmse (par) as the dependent variables. it is apparent in table 1 that all of the main effects, except the test length, have significant f values. for rmse (θ), the scoring model, the sample size and the m-c/c-r item proportion have η² values of 0.213, 0.480 and 0.196, respectively. these values imply that the sample size is the only factor that has big influence, while the scoring model and the m-c/c-r item proportion have moderate and small influence, respectively. for rmse (par), the η² values are 0.474, 0.730 and 0.268, respectively. therefore, the item proportion is the only factor that has moderate influence, while the sample size and the scoring model have big influence.

table 1. p-values from the results of manova for rmse

source                    df   rmse (θ)   rmse (par)
scoring model             1    0.001      0.000
sample size               2    0.000      0.000
m-c/c-r item proportion   2    0.007      0.001
test length               2    0.095      0.774

note: df = degrees of freedom. p-values below α = 0.05 (printed in bold in the source table) mean that the f values are significant.

in order to ease the interpretation of the manova results, the researchers performed a graphic analysis of plots that compare the marginal mean estimates of rmse (θ) and of rmse (par) under the 3plm/grm scoring model and the mcm/gpcm scoring model according to the sample size, the m-c/c-r item proportion and the test length. figures 5, 6 and 7 depict the results for rmse (θ).

figure 5. rmse (θ) marginal mean estimates according to the sample size

figure 6. rmse (θ) marginal mean estimates according to the m-c/c-r test item proportion

figure 7. rmse (θ) marginal mean estimates according to the test length

figures 5, 6 and 7 show that the rmse (θ) marginal mean estimates under the mcm/gpcm scoring model are smaller than those under the 3plm/grm scoring model. the finding implies that the mcm/gpcm combination estimates θ more accurately than the 3plm/grm combination. furthermore, the bigger the sample size and the m-c item proportion and the longer the test, the smaller the rmse (θ) marginal mean estimates. the finding implies that the bigger the sample size, the more m-c items in the mixed-format test and the longer the test, the more accurately the mixed-format test estimates θ. meanwhile, figures 8, 9 and 10 depict the results for rmse (par).

figure 8. rmse (par) marginal mean estimates according to the sample size

figure 9. rmse (par) marginal mean estimates according to the m-c/c-r test item proportion

figure 10. rmse (par) marginal mean estimates according to the test length

similar to figures 5, 6 and 7, figures 8, 9 and 10 show that the rmse (par) marginal mean estimates under the mcm/gpcm combination are smaller than those under the 3plm/grm combination. the finding implies that the mcm/gpcm combination estimates the item parameters more accurately than the 3plm/grm combination. however, for the m-c item proportion and the test length the pattern is reversed: the smaller the m-c item proportion and the shorter the test, the more accurate the results.

the results of the manova for rmsd show that the pillai's trace and the wilks' lambda statistics are significant, except for the test item proportion. these statistics are presented in table 2, which shows the p-values of the main effects with rmsd as the dependent variable for the slope, the location and the guessing parameters. in table 2, it is clear that for rmsd (slope) all of the main effects have significant f values. the η² values for the model, the sample size, the proportion and the test length are 0.368, 0.536, 0.167 and 0.224, respectively. the rmsd (location) is similar to the rmsd (slope), and the η² values for rmsd (location) are 0.208, 0.604, 0.147 and 0.382, respectively. finally, for rmsd (guessing) the proportion of m-c and essay items is the only factor whose f value is not significant, and the only factor that has big influence is the sample size, with η² = 0.604.

table 2. p-values from the results of manova for rmsd

source                    df   slope   location   guessing
scoring model             1    0.000   0.001      0.028
sample size               2    0.000   0.000      0.000
m-c/c-r item proportion   2    0.015   0.026      0.142
test length               2    0.003   0.000      0.000

note: df = degrees of freedom. p-values below α = 0.05 (printed in bold in the source table) mean that the f values are significant.

the graphic analysis of the plots compares the marginal mean estimates of rmsd (slope), rmsd (location) and rmsd (guessing) under the 3plm/grm scoring model and the mcm/gpcm scoring model according to the sample size, the proportion of m-c and essay items and the test length. the graphic analysis was conducted in order to ease the interpretation of the manova results. figures 11, 12 and 13 depict the results for rmsd (slope).

figure 11. rmsd (slope) marginal mean estimates according to the sample size

figure 12. rmsd (slope) marginal mean estimates according to the m-c/c-r item proportion

figure 13. rmsd (slope) marginal mean estimates according to the test length

the graphic analysis shows that the comparisons for rmsd (location) and rmsd (guessing) are similar to that for rmsd (slope). overall, the graphic analysis from rmse (θ) to rmsd (guessing) shows that both the rmse and the rmsd marginal mean estimates obtained by means of the mcm/gpcm combination are smaller than those of the 3plm/grm combination. the finding implies that the mcm/gpcm combination is more accurate in estimating both the θ ability parameter and the item parameters.

the second research question was related to the optimization of the test information function (tif).
in order to find the optimal values of the test information function for each test length across the various proportions of m-c and essay items and the various sample sizes, the researchers listed the optimal values in table 3. table 3 also contains the θ range in which the maximum of the tif is found. finally, the researchers compared the optimal tif values of the 3plm/grm method with those of the mcm/gpcm method for the test lengths 20, 40 and 60; the ratios are 0.364583, 0.358974 and 0.348485, respectively. these ratios show that the optimal tif values given by the 3plm/grm method are slightly more than one-third of those given by the mcm/gpcm method.

table 3. the comparison of optimal values of the total test information from the 3plm/grm model combination and the mcm/gpcm model combination

test length   sample size   m-c proportion   range (θ)      3plm/grm   mcm/gpcm
20            400           75%              0.4 to -0.3    13.5       30.5
20            400           80%              0.4 to -0.3    17.5       34.5
20            400           90%              0.4 to -0.3    15.5       48.0
20            1000          75%              0.4 to -0.2    16.0       32.0
20            1000          80%              0.4 to -0.2    16.0       36.0
20            1000          90%              0.4 to -0.2    12.0       38.0
20            3000          75%              0.4 to -0.2    13.5       32.5
20            3000          80%              0.4 to -0.2    14.5       33.0
20            3000          90%              0.4 to -0.2    13.5       36.0
40            400           75%              0.6 to -0.3    28.0       69.0
40            400           80%              0.6 to -0.3    27.5       76.0
40            400           90%              0.6 to -0.3    26.5       75.0
40            1000          75%              0.4 to -0.3    25.0       78.0
40            1000          80%              0.4 to -0.3    26.0       74.0
40            1000          90%              0.4 to -0.3    26.5       72.0
40            3000          75%              0.4 to -0.3    26.0       78.0
40            3000          80%              0.4 to -0.3    25.0       68.0
40            3000          90%              0.4 to -0.3    26.0       72.0
60            400           75%              0.4 to -0.3    34.0       110.0
60            400           80%              0.4 to -0.3    42.0       125.0
60            400           90%              0.4 to -0.3    46.0       124.0
60            1000          75%              0.4 to -0.2    38.0       116.0
60            1000          80%              0.4 to -0.2    36.0       105.0
60            1000          90%              0.4 to -0.2    46.0       132.0
60            3000          75%              0.4 to -0.3    39.5       118.0
60            3000          80%              0.4 to -0.3    38.0       116.0
60            3000          90%              0.4 to -0.3    38.0       119.5

these values are presented visually in the line graph of figure 14. the symbols used in the legend of figure 14 have the following meaning. the numbers in square brackets show the test length. the symbol before the square bracket denotes the model combination employed in the scoring scheme across the sample sizes and the proportions of m-c and essay items. for instance, mcm/gpcm[60] refers to the results of the 60-item test scored by means of the mcm/gpcm combination for the sample sizes 400, 1000 and 3000 and the m-c item proportions 75%, 80% and 90%. based on the ratios above and on the results depicted in figure 14, it is apparent that the optimal tif values obtained by means of the mcm/gpcm combination are almost three times (about 2.7 to 2.9 times) those of the 3plm/grm combination.

figure 14. the tif optimal values according to the test length and the sample size as well as the m-c and essay test item proportion

the third research problem was related to the derivation of the standard errors of the estimates. as with the first research problem, the researchers performed a fixed-effect manova on the rmse-s.e (θ) and the rmse-s.e (par). table 4 contains the p-values of the manova on the rmse-s.e for (θ) and (par).

table 4. p-values from the results of manova for rmse-s.e

source                    df   rmse-s.e (θ)   rmse-s.e (par)
scoring model             1    0.027          0.000
sample size               2    0.000          0.000
m-c/c-r item proportion   2    0.004          0.016
test length               2    0.558          0.715

note: df = degrees of freedom. p-values below α = 0.05 (printed in bold in the source table) mean that the f values are significant.

the results of the manova on the rmse-s.e are similar to those of the manova on the rmse for (θ) and (par), with significant pillai's trace and wilks' lambda values, except for the test length.
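returning to the tif ratios quoted for table 3, the following python sketch reproduces them from the tabulated optimal values. that each ratio is the largest 3plm/grm value divided by the largest mcm/gpcm value within a test length is an inference that happens to match the three reported figures; the paper does not state the rule explicitly, and the names used here are hypothetical.

```python
# optimal tif values from table 3, nine conditions per test length
tif_3plm_grm = {
    20: [13.5, 17.5, 15.5, 16.0, 16.0, 12.0, 13.5, 14.5, 13.5],
    40: [28.0, 27.5, 26.5, 25.0, 26.0, 26.5, 26.0, 25.0, 26.0],
    60: [34.0, 42.0, 46.0, 38.0, 36.0, 46.0, 39.5, 38.0, 38.0],
}
tif_mcm_gpcm = {
    20: [30.5, 34.5, 48.0, 32.0, 36.0, 38.0, 32.5, 33.0, 36.0],
    40: [69.0, 76.0, 75.0, 78.0, 74.0, 72.0, 78.0, 68.0, 72.0],
    60: [110.0, 125.0, 124.0, 116.0, 105.0, 132.0, 118.0, 116.0, 119.5],
}


def tif_ratio(length):
    """largest 3plm/grm optimum over largest mcm/gpcm optimum (assumed rule)."""
    return max(tif_3plm_grm[length]) / max(tif_mcm_gpcm[length])


for length in (20, 40, 60):
    ratio = tif_ratio(length)
    # the reciprocal says how many times larger the mcm/gpcm optimum is
    print(length, round(ratio, 6), round(1 / ratio, 2))
# prints:
# 20 0.364583 2.74
# 40 0.358974 2.79
# 60 0.348485 2.87
```

the reciprocals lie between about 2.7 and 2.9, which is the basis for describing the mcm/gpcm test information as roughly three times that of 3plm/grm.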
it is apparent in table 4 that all of the main effects, except the test length, have significant f values. meanwhile, for the rmse-s.e (θ) the scoring model, the sample size and the proportion of m-c and essay items have η² values of 0.102, 0.530 and 0.217, respectively. these values imply that the sample size is the only factor that has big influence, while the scoring model and the proportion of m-c and essay items have small influence. for the rmse-s.e (par) the η² values are 0.340, 0.517 and 0.164, respectively. therefore, the sample size is the only factor that has big influence, while the scoring model and the proportion of m-c and essay items have moderate and small influence, respectively.

the graphic analysis of the plots compares the marginal mean estimates of rmse-s.e (θ) and of rmse-s.e (par) under 3plm/grm and mcm/gpcm according to the sample size, the proportion of m-c and essay items and the test length. these values are depicted in figures 15 to 20.

figure 15. rmse-s.e (θ) marginal mean estimates according to the sample size

figure 16. rmse-s.e (θ) marginal mean estimates according to the m-c and essay test item proportion

figure 17. rmse-s.e (θ) marginal mean estimates according to the test length

the results of the graphic analysis for rmse-s.e (θ) are similar to those for rmse (θ): consistently, the bigger the sample size, the bigger the m-c item proportion and the longer the test, the more accurate the estimates. similarly, the graphic analysis of rmse-s.e (par) is consistent with that of rmse (par), namely the smaller the m-c item proportion and the shorter the test, the more accurate the estimates.
figure 18. rmse-s.e (par) marginal mean estimates according to the sample size

the fourth research problem was related to the compatibility of both combinations with the data (the model-data fit). in order to evaluate the fit of each model combination, the researchers made use of the minus 2 log likelihood (-2ll) statistic, which has a chi-square (χ²) distribution. large values of the -2ll statistic show that the model is less compatible with the data. in order to determine which model fits the data better, the researchers made use of the gap between the two -2ll statistics, which also has a chi-square (χ²) distribution. the values of the gap between the -2ll (3plm/grm) statistic and the -2ll (mcm/gpcm) statistic are presented in the column 'gap -2ll (χ²)' of table 5. all of the p-values in the column 'p-value' of table 5 are smaller than 0.05 and all of the -2ll values for the 3plm/grm model are bigger than those for the mcm/gpcm model; therefore, it can be concluded that the mcm/gpcm model fits the data better than the 3plm/grm model. thus, from the four types of data analysis employed for the four research problems, it is apparent that the mcm/gpcm combination performs better than the 3plm/grm combination in analyzing the mixed-format mathematics test.

figure 19. rmse-s.e (par) marginal mean estimates according to the m-c/c-r item proportion

figure 20. rmse-s.e (par) marginal mean estimates according to the test length

the scoring by the mcm/gpcm model uses multiple categories throughout; in this study both the mcm and the gpcm used four categories, that is, both models used polytomous scores. on the other hand, the scoring by the 3plm/grm model used mixed categories, namely two categories mixed with multiple categories (in this study, four categories).
as a result, the bigger the m-c item proportion in the 3plm/grm scoring model, the more items are scored with two categories, that is, dichotomously. the finding supports previous studies (wasis, 2009, p. 104; kinsey, 2003, p. 87; si, 2002, p. 77) which state that polytomous scoring generates better estimates of the test participants' ability than dichotomous scoring.

table 5. comparison between the 3plm/grm model and the mcm/gpcm model in terms of -2ll statistics

test length   sample size   proportion   3plm/grm -2ll   mcm/gpcm -2ll   gap -2ll (χ²)   df   p-value   better fit
20            400           75%          16026.2         11106.63        4919.571        19   0.0000    mcm/gpcm
20            400           80%          15848.42        10470.33        5378.092        19   0.0000    mcm/gpcm
20            400           90%          15241.45        9436.919        5804.526        19   0.0000    mcm/gpcm
20            1000          75%          40152.45        27716.3         12436.15        19   0.0000    mcm/gpcm
20            1000          80%          39998.02        26264.14        13733.88        19   0.0000    mcm/gpcm
20            1000          90%          38459.3         23457.42        15001.88        19   0.0000    mcm/gpcm
20            3000          75%          121392          83208.44        38183.53        19   0.0000    mcm/gpcm
20            3000          80%          118456.5        78960.23        39496.25        19   0.0000    mcm/gpcm
20            3000          90%          117017          70585.07        46431.94        19   0.0000    mcm/gpcm
40            400           75%          31563.14        21716.2         9846.939        39   0.0000    mcm/gpcm
40            400           80%          30434.34        20527.49        9906.847        39   0.0000    mcm/gpcm
40            400           90%          30492.42        18939.26        11553.16        39   0.0000    mcm/gpcm
40            1000          75%          79051.59        53770.54        25281.05        39   0.0000    mcm/gpcm
40            1000          80%          77270.35        51381.91        25888.45        39   0.0000    mcm/gpcm
40            1000          90%          76110.28        47044.5         29065.78        39   0.0000    mcm/gpcm
40            3000          75%          235818.2        161431.8        74386.37        39   0.0000    mcm/gpcm
40            3000          80%          234281.7        154481.4        79800.25        39   0.0000    mcm/gpcm
40            3000          90%          229789.3        141340.5        88448.79        39   0.0000    mcm/gpcm
60            400           75%          328589          202488.4        126100.6        59   0.0000    mcm/gpcm
60            400           80%          44666.06        30451.55        14214.51        59   0.0000    mcm/gpcm
60            400           90%          42868.17        27270.22        15597.95        59   0.0000    mcm/gpcm
60            1000          75%          77192.34        77192.34        33811.89        59   0.0000    mcm/gpcm
60            1000          80%          111052.9        74917.41        36135.53        59   0.0000    mcm/gpcm
60            1000          90%          107759.8        66914.78        40845.02        59   0.0000    mcm/gpcm
60            3000          75%          334728.6        232122.7        102605.9        59   0.0000    mcm/gpcm
60            3000          80%          336073.6        223638.5        112435.1        59   0.0000    mcm/gpcm
60            3000          90%          328589          202488.4        126100.6        59   0.0000    mcm/gpcm

in addition, the mcm/gpcm model provides higher test information than the 3plm/grm model. the ratios of the optimal test information values of the two scoring models for the 20-item, 40-item and 60-item tests are 0.364583, 0.358974 and 0.348485, respectively. in general, the test information obtained under the mcm/gpcm combination is almost three times that of 3plm/grm. the finding supports the research conducted by donoghue (1994, p. 300), susongko (2009, p. 124) and wasis (2009, p. 105).

finally, the answer to the main research problem is the summary of the first to the third research problems together with the results of the model-data fit test. the analysis of model fit by means of the -2ll statistic shows that the mcm/gpcm model fits better than the 3plm/grm model. the results support the findings of chon, lee, & ansley (2007, pp. 1-21). therefore, in general, the mcm/gpcm combination performs better than the 3plm/grm combination in analyzing the mixed-format test, especially in mathematics.

conclusion and suggestion

conclusions

based on the results of the study, the researchers would like to draw the following five conclusions. first, the scoring model combination has a significant effect at the level α = 0.05 on the accuracy of the test participants' θ ability estimates. the mcm/gpcm combination is more accurate than the 3plm/grm combination in estimating the θ ability.
the bigger the sample size, the bigger the m-c item proportion and the longer the test, the more accurate the θ ability estimates will be.

second, the mcm/gpcm scoring model combination gives more accurate item parameter estimates than the 3plm/grm combination, and the bigger the sample, the more accurate the estimates; however, the finding does not apply to the m-c item proportion and the test length. under both the rmse and the rmsd criteria, the estimates generated by both model combinations are more accurate when the m-c item proportion is smaller and the test is shorter. in addition, the factors that have big influence are the model combination and the sample size, while the m-c item proportion has moderate influence. on the other hand, the test length does not have a significant f value at the level α = 0.05.

third, in general, the mcm/gpcm combination provides test information almost three times that of the 3plm/grm combination. in addition, for all of the test lengths the position of the maximum test information value corresponds to the distribution of the marginal ability (θ) estimates. however, the researchers are unable to draw a 'solid' conclusion regarding the sample size and the m-c item proportion that provide the optimum test information value in each test length.

fourth, the standard errors of the θ ability estimates and of the item parameter estimates decrease under the mcm/gpcm combination in comparison with 3plm/grm. this finding implies that the mcm/gpcm scoring model is more accurate in estimating the θ ability and the item parameters than the 3plm/grm is.

fifth, the differences in the fit statistics between the two scoring models strengthen the superiority of the mcm/gpcm combination over the 3plm/grm combination at the level α = 0.05.
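the comparison behind the fifth conclusion can be sketched in python as follows. this is an illustration only: it takes the first row of table 5 (test length 20, sample size 400, proportion 75%), treats the gap between the two -2ll statistics as a chi-square statistic with the tabulated df, as the paper does, and compares it with a critical value obtained from the wilson-hilferty approximation rather than from exact chi-square tables. the function names are hypothetical.

```python
from statistics import NormalDist


def chi2_critical(df, alpha=0.05):
    """approximate upper-alpha chi-square critical value (wilson-hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * h ** 0.5) ** 3


def compare_fit(neg2ll_first, neg2ll_second, df, alpha=0.05):
    """return the -2ll gap and whether it exceeds the critical value."""
    gap = neg2ll_first - neg2ll_second
    return gap, gap > chi2_critical(df, alpha)


# first row of table 5: 3plm/grm versus mcm/gpcm
gap, second_fits_better = compare_fit(16026.2, 11106.63, df=19)
print(round(gap, 2), round(chi2_critical(19), 1), second_fits_better)
```

the gap of several thousand is far beyond a critical value of roughly 30, which is why the tabulated p-values are all printed as 0.0000.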
suggestions
test developers, especially those responsible for the national examination and the state university admission test, should consider using the mixed-item test format in order to obtain as much information as possible about the test participants’ ability. in relation to this matter, wide-scale scoring of essay test items should also be considered. future researchers who would like to follow up the study are recommended to: (a) develop the model composition, for example the 3plm/grm combination, the mcm/gpcm combination and the like; (b) vary the number of response categories, which in this study was the same for both models and fixed at four; the categories might be extended to five or made to differ between the combined models, because the researchers have not examined the effects of increasing or decreasing the number of categories, or of unequal numbers of response categories between the combined models; and (c) examine the robustness of the models when the unidimensionality assumption is violated, because the data for the irt model combinations in this study were assumed to be unidimensional.
references
bastari, b. (2000). linking multiple-choice and constructed-response items to a common proficiency scale (unpublished doctoral dissertation). university of massachusetts amherst, usa. umi microform 9960735.
berger, m. p. (1998). optimal design of tests with dichotomous and polytomous items. applied psychological measurement, 22(3), pp. 248-258.
bock, r. d. (1972). estimating item parameters and latent ability when responses are scored in two or more nominal categories. psychometrika, 37(1), pp. 29-51.
cao, y. (2008). mixed-format test equating: effects of test dimensionality and common item sets (unpublished doctoral dissertation). university of maryland, usa.
chon, k. h., lee, w. c., & ansley, t. n. (2007). assessing irt model-data fit for mixed format tests. casma research report, number 26.
de ayala, r. j. (1989). a comparison of the nominal response model and the three parameter logistic model in computerized adaptive testing. educational and psychological measurement, 23(3), pp. 789-805.
de mars, c. e. (2008, march). scoring multiple choice items: a comparison of irt and classical polytomous and dichotomous methods. paper presented at the annual meeting of the national council on measurement in education, new york.
donoghue, j. r. (1994). an empirical examination of the irt information of polytomously scored reading items under the generalized partial credit model. journal of educational measurement, 31(4), pp. 295-311.
ercikan, k. et al. (1998). calibration and scoring of tests with multiple-choice and constructed-response item types. journal of educational measurement, 35(2), pp. 137-154.
garner, m., & engelhard, jr., g. (1999). gender differences in performance on multiple-choice and constructed-response mathematics items. applied measurement in education, 12, pp. 29-51.
gierl, m. j., wang, c., & zhou, j. (2008). using the attribute hierarchy method to make diagnostic inferences about examinees' cognitive skills in algebra on the sat. journal of technology, learning, and assessment, 6(6).
glasersfeld, e. von. (1982). an interpretation of piaget’s constructivism. revue internationale de philosophie, 36, pp. 612-635.
hagge, s. l. (2010). the impact of equating method and format representation of common items on the adequacy of mixed-format test equating using nonequivalent groups (unpublished doctoral dissertation). university of iowa, usa.
he, y. (2011). evaluating equating properties for mixed-format tests (unpublished doctoral dissertation). university of iowa, usa.
hoskens, m., & de boeck, p. (2001).
multidimensional componential item response theory models for polytomous items. applied psychological measurement, 25, pp. 19-37.
jurich, d., & goodman, j. (2009, october). a comparison of irt parameter recovery in mixed format examinations using parscale and icl. poster session presented at the annual meeting of northeastern educational research association, james madison university.
kennedy, p., & walstad, w. b. (1997). combining multiple-choice and constructed response test scores: an economist’s view. applied measurement in education, 10, pp. 359-375.
kentucky department of education. (2008). educational planning and assessment system (epas) college readiness standards and program of studies standards alignment introduction [digital edition version]. retrieved from http://www.education.ky.gov/
kinsey, t. l. (2003). a comparison of irt and rasch procedures in a mixed-item format test (unpublished doctoral dissertation). university of north texas, usa. umi microform 3215773.
lau, c. a., & wang, t. (1998, april). comparing and combining dichotomous and polytomous items with sprt procedure in computerized classification testing. paper presented at the annual meeting of the american educational research association, san diego, ca.
levine, m. v., & drasgow, f. (1983). the relationship between incorrect option choice and estimated ability. educational and psychological measurement, 43, pp. 675-685.
li, y. h., lissitz, r. w., & yang, y. n. (1999). estimating irt equating coefficients for tests with polytomously and dichotomously scored items. paper presented at the annual meeting of the national council on measurement in education, montreal, canada.
lukhele, r., thissen, d., & wainer, h. (1994). on the relative value of multiple-choice, constructed-response, and examinee-selected items on two achievement tests. journal of educational measurement, 31, pp. 234-250.
meng, h. (2007).
a comparison study of irt calibration methods for mixed-format tests in vertical scaling (unpublished doctoral dissertation). university of iowa, usa.
reynolds, c. r., livingston, r. b., & willson, v. (2009). measurement and assessment in education (2nd ed.). new york: pearson education, inc.
sadler, p. m. (1998). psychometric models of examinee conceptions in science: reconciling qualitative studies and distractor-driven assessment instruments. journal of research in science teaching, 35(3), pp. 265-296.
si, c. b. (2002). ability estimation under different item parameterization and scoring models (unpublished doctoral dissertation). university of north texas, usa.
van someren, m. w., barnard, y. f., & sandberg, j. a. c. (1994). the think aloud method: a practical guide to modelling cognitive processes. london: academic press.
susongko, p. (2009). perbandingan keefektifan bentuk tes uraian dan testlet dengan penerapan ‘graded response model’ (grm) [the comparison between the effectiveness of explanatory test and testlet with the implementation of graded response model] (unpublished doctoral dissertation). yogyakarta state university, yogyakarta.
sykes, r. c., & yen, w. m. (2000). the scaling of mixed-item-format tests with the one-parameter and two-parameter partial credit. journal of educational measurement, 37, pp. 221-244.
tall, d. o. et al. (2012). cognitive development of proof. in icmi 19: proof and proving in mathematics education. springer. [digital edition version]. retrieved from http://homepages.warwick.ac.uk/staff/david.tall/pdfs
tang, k. l., & eignor, d. r. (1997). concurrent calibration of dichotomously and polytomously scored toefl items using irt models. toefl technical report 13. princeton, nj: educational testing service.
thissen, d. m. (1976). information in wrong responses to the raven progressive matrices.
journal of educational measurement, 13(3), pp. 201-214.
thissen, d., & steinberg, l. (1984). a response model for multiple choice items. psychometrika, 49, pp. 501-519.
thissen, d. m., steinberg, l., & fitzpatrick, a. r. (1989). multiple-choice models: the distractors are also part of the item. journal of educational measurement, 26(2), pp. 161-176.
traub, r. e. (1993). on the equivalence of the traits assessed by multiple-choice and constructed-response tests. in r. e. bennett, & w. c. ward (eds.). construction versus choice in cognitive measurement (pp. 29-44). hillsdale, nj: lawrence erlbaum associates.
wainer, h., & thissen, d. m. (1993). combining multiple-choice and constructed response test scores: toward a marxist theory of test construction. applied measurement in education, 6(2), pp. 103-118.
wainer, h. (1989). the future of item analysis. journal of educational measurement, 26(2), pp. 191-208.
wasis. (2009). penskoran model partial credit pada item multiple true-false bidang fisika [partial credit scoring model on multiple true-false items in physics field] (unpublished professor dissertation). yogyakarta state university, yogyakarta.

how to cite item: pada, a., kartowagiran, b., & subali, b. (2016). separation index and fit items of creative thinking skills assessment. research and evaluation in education, 2(1), 1-12.
doi: http://dx.doi.org/10.21831/reid.v2i1.8260
research and evaluation in education, e-issn: 2460-6995
research and evaluation in education, volume 2, number 1, june 2016 (pages 1-12)
available online at: http://journal.uny.ac.id/index.php/reid
a separation index and fit items of creative thinking skills assessment
1 andi ulfa tenri pada; 2 badrun kartowagiran; 3 bambang subali
1 syiah kuala university; 2,3 yogyakarta state university
1 andi_ulfa@unsyiah.ac.id; 2 badrun_kartowagiran@uny.ac.id; 3 b_subali@yahoo.co.id
abstract
this article discusses the evaluation results of the separation index and item fit of a creative thinking skills assessment that supports the conation aspect of prospective biology teachers in aceh. the assessment consists of 37 divergent-task items, which are applications of the human physiology course that support the conation aspect. the participants were selected from the biology education program, faculty of teacher training and education, syiah kuala university. the data were analyzed with the quest software to obtain the separation index and item fit. the results indicate that the creative thinking skills assessment instrument that supports the conation aspect of prospective biology teachers has a good separation index and that all the items fit pcm-1pl.
keywords: pcm-1pl, separation index, fit item, creative thinking skills, conation aspect
introduction
formal education in indonesia today generally offers few opportunities for the development of creativity. schools prioritize cognitive training only at the knowledge, memory and reasoning levels. this is evident in the teaching process in schools, where there is hardly any activity demanding creative thinking. thus, students are rarely stimulated to think, act, and behave creatively (supardi, 2012, p.6).
this statement is similar to subali (2011, p.139), who insists that the creativity of science process skills is less developed by school teachers. the majority of biology teachers suggest that they are more concentrated in multiple-choice tests that are clearly oriented on the development of convergent thinking patterns, and are less oriented to the divergent patterns as the basis for creativity development. the importance of creativity is stated in article 3 of national education system act no. 20 of 2003, on the national education goals with the expectation that education can develop students' potentials in order to become pious, noble, skilled, creative, and independent human beings. meanwhile, the goal of indonesian national education emphasizes the importance of creativity. however, it is very much in contrast with the achievement of indonesia in international creativity survey (the global creativity index) in 2011. indonesia currently ranks 81 of the 82 countries involved in the survey, far below the neighboring countries, singapore, that ranks 9, and malaysia that ranks 48 (florida, mellander, & stolarick, 2011, p.41). the results of the international surveys show that the performance of indonesian people is included in ‘low’ category. the research which is conducted by ramirez and ganaden (2008, pp.22-33) reveals that the poor performance is due to the weakness in high-level thinking skills. the learning process in higher education today seems less than effective to improve the ability of thinking creatively. despite this, creative thinking is the culmination of the cognitive dimension on the revision of bloom's taxonomy by anderson et al. (krathwohl, 2002, p.215) and also new bloom’s taxonomy (dettmer, 2006, p.73). dehaan (2009, pp.172–181) indicates that the abilities to think creatively help the students find the value of evidence-based reasoning, increase high order cognitive skills (hocs), and make them capable of solving problems. 
creative thinking is a process which is employed to yield ideas or brand new ideas (runco, 2004, p.658). new ideas can result from combination (elaboration) of old ideas or newly emerging ideas. it may occur by combining the ideas of others to stimulate the rise of brand new ideas. creativity is an ability to generate new ideas or new artifacts, which are surprising and valuable (boden, 2001, p.95). the research results also show that creativity is an essential element of a problem solving (mumford, mobley, uhlman, reiterpamon, & doares, 1991, pp. 91-122; runco, 2004, pp.658-659). thereby, it is normal if the creativity and intelligence are deemed the application of creativity among hocs as described in bloom’s taxonomy (crowe, dirks, & wenderoth, 2008, pp. 368-381). the creative thinking skill is one of the thinking skill dimensions that should be further developed and measured. in the opinion of baer (2012, pp.1102-1119), the study on the creativity is complex. nonetheless, it is not merely difficult to measure or perform (dehaan, 2009, pp.172-181). the measurement of creative thinking skill ability of the students can be conducted by creating an assessment with divergent approach (subali, 2011, pp.130-144). the divergent thinking process is a part of the creative capability. divergent thinking is an ability to construct or generate sets of possible responses, ideas, options or alternatives to a problem (isaksen, dorval, & treffinger, 1994, p.18). therefore, the divergent thinking can be defined as the ability to deliver wide range of solutions to the problems with the proper procedures and reasons. the characteristic of creativity is the uniqueness and originality that should be initiated with a search for various possible solutions. afterwards, a person should know whether the solution is different from other solutions, and whether the solution has never existed before. 
in order to find various alternative solutions, one requires divergent thinking skills (subali & suyata, 2013, p.4). creative thinking comprises the cognitive activities that can lead to creative productions deemed useful and new by groups or individuals (isaksen, dorval, & treffinger, 1994, p. 31). in this article, therefore, creative thinking skill means the ability to construct an idea into a unique pattern or structure. it gives priority to the originality of the idea formed in relation to the problems identified. besides the cognitive aspects, the creative thinking skill which develops in learners is inseparable from the conation aspect (lubart, 2004, p.10; poole & van de ven, 2004, p.41; jo, 2009, p.86). nowadays, the conation aspect is often ignored by most educators, not only at the level of primary and secondary education but also at the university level (reeves, 2006, p.297). while the cognitive aspect is related to the idea, the conative aspect is associated with the concept of intrinsic motivation and willingness. conation is the mental process that activates and/or guides behavior and actions (huitt & cain, 2005, pp.1-7). in the opinion of pepper (1970, p.337), conation is illustrated as ‘a drive-charged pattern of references positive or negative’. a variety of terms are used to represent aspects of conation, including the intention or tendency to behave (riyanti & prabowo, 1998, p.70; board of national education standard, 2010, p.28). conative performance consists of actions, willingness, or desire. conation is a state in which the mind has a purpose, and conative knowledge is to select or be willing to do an act in relation to a series of circumstances. it can be concluded that conation is a statement of desire which has a positive or negative direction.
according to darmawan (2013, pp.1-4), the students who already have a fairly good concept of understanding do not necessarily apply their knowledge in the real world. by the time the students learned about the circulatory and the respiratory system, they should have already recognized the health impacts of smoking on heart and lungs, yet they are still on it. on the other hand, the concept understanding in the mind of the learners may also generate constructive actions that can contribute to character development. they are benefited by gaining more awareness on the value contained in these materials by quitting smoking or reminding their peers to stop smoking. hence it is obvious that the cognitive factors getting involved in the creative process can be supported or inhibited by the willingness or conation factors. by taking into account the problem’s root, we need to think about the ways to overcome it. moreover, the implementation of a competency-based curriculum at universities is focused on training how to think and reason, developing creative activities, developing the abilities to solve problems, and communicating ideas. one effort that has been done is to develop assessment tools to measure creative thinking skills supporting the conation aspect of the students through a divergent pattern in the course of human physiology. in order to prove whether the assessment has been constructed optimally, it is necessary to evaluate the quality of the assessment tools. in order to obtain full information on the ability of creative thinking skills of students as the prospective biology teachers in the subject of human physiology, the information shall be collected at the end of learning. 
the expectation is that the results of the assessment not only serve as the implications of the measurement result but also improve the thinking abilities of the students in relation to the materials that they have learned, as well as provide information for classes and educators to improve the quality of the teaching and learning process. this description illustrates the importance of developing an assessment model to observe the attainment of thinking skills that support the conation aspect of prospective biology teachers in the subject of human physiology. therefore, the assessment model used should be able to support the attainment of the course objectives. the human physiology course was selected on the consideration that, with this course, the students’ conative responses can be obtained more easily because the cases discussed are contextual. when studying the human physiology course, students learn the normal function of organs; in these instruments, therefore, the stimulus provided takes the form of a disruption to the function of organs (disease) or the inverse of the normal body functions. it is expected that, through such abnormal-condition stimuli, the students can provide relevant solutions to the various cases presented. the students were asked to provide a variety of responses, in the form of a divergent production pattern, to numerous cases presented through the concepts of human physiology that are considered essential. these responses can then describe a certain tendency of behavior in accordance with the attitude of a person. in other words, they can describe a person's tendency to react to a stimulus in certain ways based on their understanding after studying human physiology.
the two aspects used in assessing the quality of assessment tools are validity and reliability (cohen, swerdlik, & sturman, 2013, p.98). in line with cohen, swerdlik, and sturman’s opinion, reynolds, livingston, and willson (2009, p.4) mention that the characteristics of tests include reliability and validity. test users should seriously consider how the test results are used. only tests which generate valid, reliable, and accurate evidence for the purposes they serve and for those for whom they are intended should be employed. therefore, before an assessment instrument is used, its validity and reliability need to be evaluated. according to reynolds, livingston, and willson (2009, p.4), the reliability of a test refers to the stability and consistency of the test scores, while validity refers to the accuracy of the interpretation of test scores. wright and stone (1999, pp.157-165) mention that reliability is a statement on the consistency and stability of the scores of an instrument, while validity is a statement of the conformity of the test and its components, and of the truth of the test results and their interpretation. based on these opinions, it can be argued that good tests are those that are reliable and valid (mardapi & kartowagiran, 2011, p.332). one technique that can be used to analyze the validity and reliability of test instruments is item response theory (irt). irt is a measurement method alternative to classical test theory (ctt) (gorin & embretson 2006, pp.394-411). ctt is the psychometric technique which is allowing the presumption of test results, for example the item difficulties and individual talent (alagumalai, hungi, & curtis, 2005, p.273). meanwhile, irt is a psychometric technique focusing on the individual response to specific test items, as influenced by the quality of the item.
irt is a probabilistic model which seeks to describe a person's response to an item (hambleton, swaminathan, & rogers, 1991, p.9). in its simplest form, irt states that the probability that a person ‘j’ with ability ‘θj’ answers an item ‘i’ with difficulty ‘b’ correctly is conditional on the person's ability and the item's difficulty. in other words, if a person has high ability in a specific field, he will probably answer the easy items correctly. in contrast, if a person has low ability and gets difficult items, he will probably answer the items wrongly. irt was developed as an alternative model by psychometric experts to overcome the weaknesses of ctt. the model has the following properties: (1) the characteristics of the item do not depend on the group of test participants to whom the item is administered, (2) the scores which state the ability of test participants do not depend on the test, (3) the model is expressed at the item level, not at the test level, (4) the model does not require parallel tests to calculate the reliability coefficient, and (5) the model provides a measure of precision for each ability score (hambleton, swaminathan, & rogers, 1991, p.5). there are two basic postulates of modern test theory (hambleton, swaminathan, & rogers, 1991, p.7): (1) the performance of the test participants on an item can be predicted (described) by using a set of factors called traits, latent traits, or abilities; (2) the relationship between the performance of test participants on a test item and the underlying traits can be described by a steadily increasing function, which is referred to as the item characteristic function, or item characteristic curve.
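the item characteristic curve named in postulate (2) can be illustrated with a minimal sketch. it assumes the simplest dichotomous rasch (1-pl) form, in which the probability of a correct response depends only on the difference between ability θ and item difficulty b; the function name `rasch_probability` is illustrative and is not part of the quest program.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under the dichotomous Rasch (1-PL) model.

    theta: ability of the person (logits)
    b:     difficulty of the item (logits)
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Probabilities for one item of difficulty b = 0 at increasing ability levels.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p = rasch_probability(theta, 0.0)
    print(f"theta = {theta:+.1f}  P(correct) = {p:.3f}")
```

the printed probabilities rise with θ and equal 0.5 when θ = b, which is the increasing behavior the postulate refers to.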
such a function explains that if the ability level increases, the probability that a test participant responds correctly to an item will also increase. there are several assumptions in the item response theory model of hambleton, swaminathan and rogers (1991, pp.9-12): (1) it is one-dimensional (unidimensional). this assumption is highly difficult to fulfill due to the factors affecting tests, such as cognitive, personality, and language factors. however, the most important point of this assumption is that one component is considered to be dominant in determining the abilities of the subject. according to hutten (hattie, 1985, p.146), unidimensionality can be investigated through the eigenvalues in the factor analysis. the percentage of the total variance explained by the first component is commonly regarded as a unidimensionality index. the higher the percentage of the total variance explained by the main component, the closer the test is to being unidimensional. reckase (1979, p.228) recommends that, for a good calibration, the first component should explain 20% or more of the total variance for the data to fulfill the unidimensionality assumption. (2) it is locally independent. this assumption means that the test participants’ response to an item is not related to their responses to other items within the test. the package program employed to perform item analysis in this study is quest. a central element of the quest program is the rasch model (rm). the program can use response data scored in a polytomous manner. the quest program is able to estimate the parameters, both for items and testees (case/person), using unconditional (ucon) or joint maximum likelihood (adams & khoo, 1996, p.89). in irt, the instrument is declared valid when an item behaves consistently (fits) with what is expected by the model. the term ‘valid’ in irt is used to assess the success of calibration in the effort to find out whether the data fit the model.
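the unidimensionality index described under assumption (1) can be sketched as follows. the sketch assumes that the factor analysis was run on standardized items, so that the total variance equals the number of items; this is an assumption made for illustration, and the function names are hypothetical. the inputs are the largest eigenvalue (7.743) and the number of items (37) reported in this study.

```python
def first_component_percent(largest_eigenvalue, total_variance):
    """Percentage of the total variance explained by the first component."""
    return 100.0 * largest_eigenvalue / total_variance


def meets_reckase_criterion(percent, threshold=20.0):
    """True when the first component explains at least `threshold` percent."""
    return percent >= threshold


# Assumption: 37 standardized items, so the total variance is 37.
share = first_component_percent(7.743, 37)
print(f"first component explains {share:.2f}% of the total variance")
print("unidimensional by the 20% rule:", meets_reckase_criterion(share))
```

with these inputs the share is about 20.93%, in line with the 20.926% reported in the findings (the small difference comes from the rounded eigenvalue) and just above the 20% threshold.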
an item is declared to fit the model when the calibration is ‘valid’, and when the testee (case/person) is also declared to fit the model, the measurement is ‘valid’ (wright & stone, 1999, pp.169-171). the item and person fit resulting from the analysis with the quest program is judged on infit mean square (infit mnsq) values from 0.7 to 1.3 (wright & masters, 1982, p.100; bond & fox, 2001, pp.177-178). the criterion for person fit in the analysis with the quest program is that the mean infit mean square (infit mnsq) of the persons is equal to 1. another criterion is that the expected mean value of infit t is equal to 0 with a variance equal to 1. whether an item fits the model is determined from the infit mnsq value or the infit t value of the item. the expected value of infit mnsq is equal to 1 with a variance equal to 0, and the expected value of infit t is equal to 0 with a variance equal to 1 (adams & khoo, 1996, p.93). in irt, the precision of a test is conceptualized as information, which depends on the level of the characteristic being measured. the estimation of the internal consistency reliability of a test is based on the person separation reliability. the logit scale estimate of each testee is used to calculate the reliability (bhakta, tennant, horton, lawton, & andrich, 2005, pp.1-13). the person separation reliability ($R_p$) can be calculated using the following formula:
$$R_p = \frac{SD_p^2 - MSE_p}{SD_p^2}$$
where $SD_p^2$ is the observed variance of the testee estimates and $MSE_p$ is the mean squared error of measurement. according to wright and masters (mappiasse, 2006, p.584), item separation reliability and person separation reliability can both be estimated using the rasch model. the interpretation of person separation reliability also encounters problems when an item fails to define a single variable, leading to
the person separation index is an estimation of how well each testee can be distinguished on the measured variables. it describes the placement repetition of a testee against other items, measuring the same construct (mappiasse 2006, p.585; curtis & boman, 2007, p.251). the higher the person separation index , the more consistent each item is used to measure the respective testee. according to wright and stone (1999, p.163), the value = 2 is equivalent to the value of 0.80. the following formula is presented to calculate the person separation index: such a concept provides an estimation of sample standard deviation in standard error units. this index is useful to compare the use of different scales in an entire different classroom situation (mappiasse, 2006, p.585). it is also applicable in the item separation reliability and item separation index. the consistency of a group of individuals in providing information on item difficulty forming the scale is reflected in the item separation index (curtis & boman, 2007, p.251). the higher the estimation of an item separation index, the more precise the whole items being analyzed according to the model used (subali, 2010, p.38). this article discusses the evidence of the validity and reliability of assessment instruments in creative thinking skills using the item response theory through the partial credit model (pcm). in this analysis, there are two main things observed: fit item for instrument validity testing and pearson and item separation index. the analysis results are later used to determine the quality of the test instrument. method in order to evaluate person separation and item fit of assessment tools, empirical data are required. the data from the test product were analyzed using the quest program. 
this program was used on the consideration that the logistic model chosen to estimate the item parameters and the ability parameters of participants was a development of the rasch model, or one-parameter logistic model (1-pl), and that the model for polytomous scoring was the partial credit model (pcm). in this research, the one-parameter logistic model was applied through the pcm of the quest program. the 1-pl irt model, or rasch model (rm), is a central element of the quest program, which uses the joint maximum likelihood procedure to estimate item and case parameters (adams & khoo, 1996, p.89). pcm is developed from rm: rm is used on dichotomous score data, whilst pcm is used on polytomous score data (more than two categories) (masters & wright, 1997, p.100). in this model, it is assumed that the item difficulty parameter is the only item characteristic affecting the responses of the test participants (nering & ostini, 2010, p.121). the trial subjects were 218 students at the initial trial and 270 students at the main trial. the criterion was that the students had attended the teaching process of the human physiology course. the test instruments were distributed to students at the end of the teaching process in two periods. the students were given two hours in each period to complete all test items. the test results were then employed as the data in this study. the instrument evaluated in this study was the assessment instrument of creative thinking skills that supports the conation aspect of prospective biology teachers through the divergent approach. the assessment instrument consisted of 37 items which were grouped into four components. these components were: (1) the alternative solution component, i.e. the ability to generate a number of solutions in response to an issue, consisting of 10 items; (2) the original solution components, i.e.
the ability to generate a number of relevant solutions that are unique or unusual, consisting of eight items; (3) the feasibility solution component, i.e. the ability to yield a number of effective solutions which are applicable for resolving the case given, consisting of 10 items; and (4) the variation solution component, i.e. the ability to produce a number of categories of solutions, consisting of nine items. the items in this instrument consisted of a variety of cases applying the human physiology course, which supported the conation aspect.

responses were scored on the four components of creative thinking skills, namely: (1) the alternative solution, or fluency in generating ideas, observed through the number of relevant solutions produced; (2) the original solution, i.e. the ability to generate relevant solutions that are unique or unusual, observed through the frequency of each response among the testees. the score was calculated from this response frequency: a response given by less than 10% of all testees was given a score of 4; by less than 25%, a score of 3; by less than 50%, a score of 2; and by more than 50%, a score of 1 (diakidoy & constantinou, 2001, p.405); (3) the feasibility solution, i.e. an effective solution to resolve the case given, observed through the number of appropriate responses; and (4) the variation solution, i.e. the ability to produce a variety of solution categories, observed through the number of relevant response categories of different types given by the testee.

findings and discussion

before analyzing the data with irt using pcm through the quest program, the researchers tested the model assumptions. the first assumption is unidimensionality.
it can be examined using factor analysis, by inspecting the eigenvalues of the inter-item covariance matrix (hambleton & rovinelli, 1986, pp.293-294). the second assumption is local independence. this assumption is considered satisfied once the unidimensionality of the participants' responses to the test has been demonstrated (mcdonald, 1981, p.101). in the preliminary field testing, the assumptions were tested at the data analysis stage using factor analysis, which gave a largest eigenvalue of 7.743, explaining 20.926% of the variance (> 20%). this means that the assessment instrument developed is one-dimensional, or unidimensional. with the unidimensionality assumption established, the local independence assumption is taken as established as well (embretson & reise, 2000, p.48). the irt analysis using pcm through the quest program is therefore feasible. the analysis of the measurement data, scored polytomously in five categories, gives the results presented in table 1.

table 1. summary of person and item/case estimation using pcm

criteria                    statistic                                  estimation result
person (case) estimation    fit statistics (mean infit mean square)    1.00
                            standard deviation of infit mnsq           0.21
                            separation reliability                     0.87
                            separation index                           2.58
                            zero score                                 0
                            perfect score                              0
item estimation             fit statistics (mean infit mean square)    1.00
                            standard deviation of infit mnsq           0.10
                            separation reliability                     0.72
                            separation index                           1.60
                            zero score                                 0
                            perfect score                              0
internal consistency (ctt)                                             0.87
*p < 0.05, item = 37 and cases = 218

when an item fits, in the sense that it behaves consistently with what the irt model expects, the instrument is declared valid (wright & stone, 1999, pp.169-171). the term 'valid' in irt is used to assess the success of the calibration, that is, to find out whether the data fit the model.
table 1 shows that all items fit the model, since they fulfil the fit statistics required by the quest program. an item is declared to fit the model if its infit mean square (infit mnsq) approaches 1 (adams & khoo, 1996, pp.24-25). the items analyzed have a mean infit mnsq of 1.00 with a standard deviation of 0.10, so all of them are declared to fit the model. in irt, the estimation of the internal consistency reliability of a test is based on the person separation reliability, where the estimate on a logit scale for each person is used to calculate reliability (bhakta, tennant, horton, lawton, & andrich, 2005, pp.1-13). in other words, the test reliability is based on the error of measurement, reported per person/case; in this case it reaches 0.87. this means that the assessment instrument developed has good reliability. in addition to the person separation reliability ($R_P$), the reliability of a test can also be seen through the person separation index ($G_P$), which is an estimate of how well the testees can be distinguished on the measured variable. whereas the person separation reliability ranges from 0 to 1, the person separation index is not bounded by that range. the index quantifies reliability in a simple and direct manner and has a clear interpretation. the person separation index (2.58) in table 1 is classified as good. this is in line with wright and stone (1999, p.163), who state that a value of $G_P = 2$ is equivalent to a value of $R_P$ of 0.80. the quest program output also reports reliability using the classical approach. in accordance with the reliability calculated using irt, the internal consistency value of 0.87 shows that the test developed qualifies as a good test.
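the correspondence between the separation index and the separation reliability quoted above can be checked numerically. the sketch below assumes the usual rasch relationship $R = G^2 / (1 + G^2)$, which is consistent with the equivalence of $G = 2$ and $R = 0.80$ cited from wright and stone (1999); the function names are illustrative and are not part of the quest program.

```python
import math


def reliability_from_separation(g: float) -> float:
    """Separation reliability R implied by a separation index G (assumed R = G^2 / (1 + G^2))."""
    return g * g / (1.0 + g * g)


def separation_from_reliability(r: float) -> float:
    """Separation index G implied by a separation reliability R, for 0 <= R < 1."""
    if not 0.0 <= r < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(r / (1.0 - r))


# separation indices as reported in table 1 (initial trial)
for label, g in [("person", 2.58), ("item", 1.60)]:
    r = reliability_from_separation(g)
    print(f"{label}: G = {g:.2f} -> R = {r:.2f}")
```

under this assumption, the separation indices in table 1 (2.58 for persons, 1.60 for items) reproduce the reported reliabilities of 0.87 and 0.72 after rounding.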
in the main field testing, the irt analysis using pcm through the quest program is again preceded by tests of the unidimensionality and local independence assumptions. the unidimensionality of the instrument is judged from the eigenvalue obtained for each factor in the factor analysis. in the main field testing, the largest eigenvalue prior to rotation is 14.200, explaining 38.378% of the variance (> 20%), which means that the measuring instrument developed is unidimensional. with unidimensionality established, local independence is taken as established as well. the irt analysis using pcm through the quest program can therefore be carried out. the analysis of the measurement data, scored polytomously in five categories, gives the results presented in table 2.

table 2. summary of the main testing estimation using pcm

criteria                    statistic                                  estimation result
person (case) estimation    fit statistics (mean infit mean square)    0.99
                            standard deviation of infit mnsq           0.32
                            separation reliability                     0.94
                            separation index                           3.95
                            zero score                                 0
                            perfect score                              0
item estimation             fit statistics (mean infit mean square)    1.00
                            standard deviation of infit mnsq           0.38
                            separation reliability                     0.80
                            separation index                           2.00
                            zero score                                 0
                            perfect score                              0
internal consistency (ctt)                                             0.94
*p < 0.05, item = 37 and cases = 270

the results of the creative thinking skill test supporting the conation aspect, analyzed with the pcm, are judged against an infit mean square (infit mnsq) range of 0.70 to 1.30 (wright & masters, 1982, p.100; bond & fox, 2001, p.230). the mean infit mnsq of 1.00 and the standard deviation of 0.38 indicate that the data fit the model.
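the fit criterion described above amounts to a range check on each item's infit mnsq. the following minimal sketch illustrates it; the item labels and values are hypothetical and are not taken from the quest output of this study, and the function name is an assumption made for illustration.

```python
def flag_misfit(infit_mnsq, lower=0.70, upper=1.30):
    """Return the labels of items whose infit MNSQ lies outside [lower, upper]."""
    return [item for item, value in infit_mnsq.items()
            if not lower <= value <= upper]


# hypothetical infit values, for illustration only
example = {"item01": 0.95, "item02": 1.12, "item03": 1.41, "item04": 0.66}
print(flag_misfit(example))  # items 03 and 04 fall outside the 0.70-1.30 range
```

an empty list would mean that every item falls inside the accepted range.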
all items of the assessment instrument of creative thinking skills supporting the conation aspect are therefore declared valid. the estimation of separation reliability is based on the error of measurement, reported per person/case. in the main field testing, the person separation reliability reaches 0.94, which means that the assessment instrument developed has good reliability. the separation reliability value also reports on the data quality. person separation is used to classify people. a low person separation value (< 2, person reliability < 0.8) with a relevant person sample indicates that the instrument may not be sensitive enough to distinguish test participants with high ability from those with low ability; more items may be needed. item separation reliability is used to verify the hierarchy of items. table 2 shows a good person separation index ($G_P$) of 3.95. the higher the person separation index, the more consistently the items measure the testee concerned (mappiasse, 2006, p.585; curtis & boman, 2007, p.251). a low item separation value indicates that the person sample is not large enough to confirm the item difficulty hierarchy of the instrument (linacre, 2015, p.656). the same reading applies to the item separation reliability and item separation index, here 0.80 and 2.00. the consistency with which a group of individuals provides information on the difficulty of the items forming the scale is reflected in the item separation index (curtis & boman, 2007, p.251). the higher the estimated index, the more precise the analysis of the whole set of items under the model used (subali, 2010, p.38). the quest program output for reliability using the classical approach is also presented here.
in line with the person separation reliability, the reliability calculated from the internal consistency value of 0.94 in the main field testing suggests that the test developed qualifies as a good test.

conclusion and recommendations

conclusion

based on the results and discussion, several conclusions can be drawn as follows. (1) all the items in the assessment instrument of creative thinking skills are declared to fit the model. (2) the estimation of person separation reliability shows a good reliability coefficient of 0.94. this coefficient can be used to calculate the person separation index, here 3.95. (3) all items in the developed assessment instrument qualify as creative thinking skill assessment items supporting the conation aspect of prospective biology teachers.

suggestions

based on the results, it is suggested that further research employ 2pl or 3pl data analysis for polytomous data. the findings in this article contribute evidence for the instrument's validity and reliability. through this article, readers can understand how validity and reliability are estimated using item response theory. with validity and reliability coefficients, the measurement results can be interpreted more precisely.

references

adams, r.j. & khoo, s. (1996). acer quest (2.1). camberwell, victoria: australian council for educational research. alagumalai, s., hungi, n., & curtis, d.d. (2005). applied rasch measurement: a book of exemplars: papers in honour of john p. keeves. dordrecht: springer. baer, m. (2012). putting creativity to work: the implementation of creative ideas in organizations. academy of management journal, 55(5), 1102-1119. bhakta, b., tennant, a., horton, m., lawton, g., & andrich, d. (2005).
using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education. bmc medical education, 5(9). doi: 10.1186/1472-6920-59. board of national education standard. (2010). panduan penulisan butir soal [handbook of writing questions item]. a material of technical guidance on school-based curriculum and 2010standardized questions]. jakarta: badan standar nasional pendidikan. boden, m.a. (2001). creativity and knowledge. in a. craft, b. jeffrey, & m. leibling (eds.), creativity in education. london: continuum. bond, t.g., & fox, c.m. (2001). applying the rasch model: fundamental measurement in the human sciences. mahwah, nj: lawrence erlbaum and associates. cohen, r.j., swerdlik, m.e., & sturman, e.d. (2013). psychological testing and assessment: an introduction to tests and measurement (6 th ed.). new york, ny: mcgraw-hill. crowe, a., dirks, c., & wenderoth, m.p. (2008). biology in bloom: implementing bloom taxonomy to enhance student learning in biology. journal of life science education, 7, 368-381. curtis, d.d & boman, p. (2007). x-ray your data with rasch. international education journal, 8(2), 249-259. darmawan, e. (2013). pengaruh pbl terhadap sikap dan hasil belajar. [the impacts of pbl on learning attitude and outcomes]. jurnal lentera sains, 3(2), 1-4. dehaan, r.l. (2009). teaching creativity and inventive problem solving in science. cbe—life sciences education, 8, 172– 181. dettmer, p. (2006). new blooms in established fields: four domains of learning and doing. roeper review, 28(2), 70-78. diakidoy, i.a. & constantinou, c.p. (2001) creativity in physics: response fluency and task specificity. creativity research journal, 13, 3-4, 401-410, doi: 10.1207/s15326934crj1334_17 embretson, s.e. & reise, s.p. (2000). item response theory for psychologists. mahwah, nj: lawrence erlbaum associates. florida, r., mellander, c., & stolarick, k. (2011). 
creativity and prosperity: the global creativity index. toronto: martin prosperity institute. gorin, j. & embretson, s.e. (2006). item difficulty modeling of paragraph comprehension items. applied psychological measurement, 30, 394-411. hambleton, r.k. & rovinelli, r.j. (1986). assessing the dimensionality of a set of test items. applied psychological measurement, 10(3), 287-302. hambleton, r.k., swaminathan, h., & rogers, h.j. (1991). fundamentals of item response theory. london: sage. hattie, j. (1985). methodology review: assessing unidimensionality of tests and items. applied psychological measurement, 9(2), 139-164. huitt, w. & cain, s. (2005). an overview of the conative domain. educational psychology interactive. valdosta, ga: valdosta state university. retrieved from http://www.edpsycinteractive.org/brilstar/chapters/conative.pdf isaksen, s.g., dorval, k.b., & treffinger, d.j. (1994). creative approaches to problem solving. dubuque, ia: kendall/hunt. retrieved on 22 january 2015 from https://books.google.co.id/books?id=dmgtboux3luc&pg=pa1&source=gbs_toc_r&cad=4#v=onepage&q&f=false jo, s.m. (2009). a study of korean students’ creativity in science using structural equation modeling (unpublished doctoral dissertation). university of arizona, usa. krathwohl, d.r. (2002). a revision of bloom’s taxonomy: an overview. theory into practice, 41(4), 212-264. linacre, j.m. (2015).
a user's guide to winsteps & ministep raschmodel computer programs (program manual 3.90.0. winsteps.com). united states of america: winstep software technologies. lubart, t. (2004). individual student differences and creativity for quality education. background paper prepared for the education for all global monitoring report 2005 the quality imperative, paris. mappiasse, s. (2006). developing and validating instruments for measuring democratic climate of the civic education classroom and student engagement in north sulawesi, indonesia. international education journal, 7(4), 580-597. mardapi, d. & kartowagiran, b. (2011). pengembangan instrumen pengukur hasil belajar nirbias dan terskala baku [developing unbiased and standardized instruments for student achievements in high schools]. jurnal penelitian dan evaluasi pendidikan, 15(2), 326-341. retrieved on 20 january 2015 from http://journal.uny.ac.id/index.php/jpe p/article/view/1100 masters, g.n & wright, b.d. (1997). the partial credit model. in w.j.v.d. linden & r.k. hambleton (eds.), handbook of modern item response theory (pp. 101-118). new york, ny: springer-verlag. mcdonald, r.p. (1981). the dimensionality of tests and items. british journal of mathematical and statistical psychology, 34(1), 100–117. doi: 10.1111/j.20448317.1981.tb00621.x mumford, m.d., mobley, m.i., uhlman, c.e., reiter-pamon, r., & doares, l. (1991). process analytic models of creative capacities. creativity research journal, 4(2), 91-122. doi: 10.1080/1040041910953 4380 nering, m.l., & ostini, r. (2010). handbook of polytomous item response theory models. new york, ny: routledge. pepper, s.c. (1970). the source of value. berkeley, ca: university of california press. poole, m.s., & van de ven, a.h. (2004). alternative approaches for studying organizational change. paper presented at the first organization studies summer workshop on theorizing process in organizational research, santorini, greece, 12&13 june, 2005. 
ramirez, r.p.b., & ganaden, m.s. (2008). creative activities and students’ higher order thinking skills. education quarterly, 66(1), 22-33. reckase, m.d. (1979). unifactor latent trait models applied to multifactor tests: results and implications. journal of educational statistics, 4(3), 207-223. reeves, t.c. (2006). how do you know they are learning?: the importance of alignment in higher education. international journal of learning technology, 2(4), 294-309. reynolds, c.r., livingston, r.b., & willson, v. (2009). measurement and assessment in education. upper saddle river, nj: pearson education. riyanti, d.b.p. & prabowo, h. (1998). seri diktat kuliah psikologi umum 2 [summary of general psychology lecturing vol. 2]. depok: universitas gunadarma. runco, m.a. (2004). creativity. annual review psychology, 55, 657-687. doi: 10.1146/annurev.psych.55.090902.141502 subali, b. (2010). bias item tes keterampilan proses sains pola divergen dan modifikasinya sebagai tes kreativitas [the bias of test item of divergent pattern science process skill and its modification as a creativity test]. jurnal penelitian dan evaluasi pendidikan, 14(2), 309-334. retrieved on 15 january 2015 from http://journal.uny.ac.id/index.php/jpep/article/view/1084 subali, b. (2011). pengukuran kreativitas keterampilan proses sains dalam konteks assessment for learning [creativity assessment of science process skills in the context of assessment for learning]. jurnal cakrawala pendidikan, 30(1), 130-144. retrieved on 7 january 2015 from http://journal.uny.ac.id/index.php/cp/article/view/4196 subali, b., & suyata, p. (2013). standardisasi penilaian berbasis sekolah [the standardization of school-based assessment]. jurnal penelitian dan evaluasi pendidikan, 17(1), 1-18.
retrieved from http://journal.uny.ac.id/index.php/jpep/article/view/1358 supardi, u.s. (2012). peran berpikir kreatif dalam proses pembelajaran matematika [the role of creative thinking in mathematics instructional process]. jurnal formatif, 2(3), 248-262. wright, b.d. & masters, g.n. (1982). rating scale analysis. chicago, il: mesa press. wright, b.d., & stone, m. (1999). measurement essentials (2nd ed.). wilmington, de: wide range.

how to cite item: faralina, a., kadri, a., & yap, a. (2016). validity and reliability of pre-class reading tasks for waves and optics (prt-wo). research and evaluation in education, 2(1), 42-52. doi: http://dx.doi.org/10.21831/reid.v2i1.8466

research and evaluation in education, e-issn: 2460-6995
volume 2, number 1, june 2016 (pages 42-52)
available online at: http://journal.uny.ac.id/index.php/reid

validity and reliability of pre-class reading tasks for waves and optics (prt-wo)

1 asrab ali nur faralina; 2 ayop shahrul kadri; 3 abdullah nurul syafiqah yap
1,2,3 universiti pendidikan sultan idris
1 aralina91@gmail.com; 2 shahrul.kadri@upsi.fsmt.edu.my; 3 syafiqah@upsi.fsmt.edu.my

abstract

this study aims to produce a valid and reliable instrument for preparing students prior to class in a university-level introductory waves and optics course. the instrument, called the pre-class reading task for waves and optics (prt-wo), was used to probe students’ knowledge acquired through targeted reading activities. in practice, prt-wo was given in a series before the actual face-to-face class as a reading assignment.
prt-wo was content validated through expert review, and the experts' agreement was analyzed using cohen’s kappa as a measure of inter-rater reliability. an item analysis was done to identify inappropriate items, and the instrument was further evaluated using the kuder-richardson 20 (kr20) reliability test. the findings reveal a kappa value of 0.66 and a kr20 value of 0.68, indicating that the developed instrument is valid and reliable.

keywords: validity, reliability, pre-reading

introduction

students come to class with little or no prior knowledge on the subject. most students do not read before entering the class even though they are aware of the importance of reading. according to sikorski et al. (2002, pp.312-320), the majority of students do not use their textbooks frequently. reading before entering the class is expected to help students understand the topics better. podolefsky and finkelstein (2006, p.341) find that students who read before entering the class perform better in the course. however, with the huge number of assignments and activities at university, students might not read before the class. stelzer, gladding, mestre, and brookes (2009, pp.184-190) agree that it is difficult to make undergraduate students read their textbooks. as reported by hoeft (2012, p.12), more than 50% of students in a first-year undergraduate class did not read the assigned materials before class. this indicates that if students hardly read the assigned materials, voluntary reading is much more difficult to achieve. therefore, it is crucial to find a way to engage students with reading before class.
according to the studies conducted by ryan (2006, pp.135-141) and philips (1995, pp.484-489), some instructors leave with dilemma in finding the effective methods to motivate students. the best way is by encouraging a strong reading habit in and off campus for reading index. in addition, most importantly, their status as a wise person will not be disputed by the community after graduation. various methods were proposed to prepare students before class, such as just in time teaching or known as jitt, quizzes, emails and presentations. these strategies may engage students to read before entering the class. heiner, banet, and wieman (2014, p.989) implemented jitt which consisted of regularly reading and online quiz parts for students to complete prior to class. moravec, williams, aquilar-roca and o’dowd (2010, pp.473-481) employed learning before lecture (lbl) assignment as an instrument to engage with and to promote reading to the students. they report a significant increase in students’ learning gain and level of satisfaction encouraging them to continue this method. ryan (2006, p.136) used three different strategies in motivating students to read their textbooks and be prepared before entering the class. the strategies were as follows: (1) the use of general global assignments, (2) the use of focused explicit homework assignment with less comment from instructor, and (3) the use of focused, explicit assignment with the ample instructor comments. they find that those strategies are the most effective strategy. there also other researchers who employed technologies to encourage students to read their textbooks. henderson and rosenthal (2006, p.46) used email in order to submit a reading question based on the assigned reading before enter class. as a result, this method increases student’s reading and has significantly higher outcomes compared to other universities on the assessment tool. 
howard (2004, pp.385-390) used a specially developed web technology in his study. it was similar to blackboard and had to be used at least two hours before the class began. students needed to answer two quiz questions for each section, and the responses were graded, with some points used in class discussion. the percentage of students who read the textbook before class increased to 98% with this technique. in this research, a pre-class reading task is developed to engage students in reading the assigned material before coming to the face-to-face class. in the next study, the effect of pre-class reading on students’ achievement will be determined. prior to that, a valid and reliable instrument that can measure pre-reading effectiveness is required. for that purpose, an instrument called the pre-class reading task for waves and optics (prt-wo) was developed. the instrument was specifically designed for the waves and optics course, which uses the university physics textbook (young & freedman, 2013). pre-reading effectiveness is defined as the score obtained by students in answering questions on the given targeted reading. this instrument is useful for institutions which offer a similar course and use the same reference book. this article describes the process of developing the valid and reliable prt-wo instrument.

method

this study was conducted at the department of physics, faculty of science and mathematics, universiti pendidikan sultan idris (upsi), malaysia. the prt-wo was designed to be short, with a clear link to the material to be covered in the immediately upcoming classes of the sft3023 vibration, waves and optics course. the course is compulsory for students who take physics or science as their major or minor in their undergraduate programs.
prt-wo was prepared in english since the medium of instruction for the course is english even though the mother tongue and national language of the students is bahasa malaysia. therefore, the language was not the barrier for the students to acquire knowledge from the main textbook written in english. there were eight chapters in the prtwo, consisting of: (1) periodic motion, (2) mechanical waves, (3) sound and hearing, (4) electromagnetic waves, (5) the nature and propagation of light, (6) geometric optic, (7) interference, and (8) diffraction. these chapters correspond to the eight chapters in the main reference book (young & freedman, 2013). each chapter in prt-wo was split into two parts, except for chapter 4. therefore, a total of 15 parts of the instrument were prepared. these 15 parts were released to students one by one along the whole semester of 14 weeks of instruction prior to the corresponding chapter being discussed in the faceto-face class. each release must be submitted at least a night before the class. this will give an ample time to the instructor to analyze students’ answers and discuss them in the class on the next day. each release of prt-wo parts was divided into three sections as follows: section 1 is targeted reading, section 2 is quiz, and section 3 is feedback review. section 1 contained instruction of important concept to be focused during reading. the concept will be asked in section 2. section 2 contained items regarding a specific chapter. these items required students to open the main textbook. in this way, students will need to read with guidance from section 2. after section 2 is done, section 3 will give feedback on how to get the answer to section 2. during the actual instruction, prt-wo was distributed using google form. figure 1 shows snapshots of google form for prt-wo part 1 of chapter 1 (see the ‘supplementary material’ at references for full prt-wo material). 
google form was used to distribute prt-wo because of advantages such as simple management and unlimited free form creation. it is free of charge and supports multiple question formats (kim, 2011). google docs spreadsheets and forms provide an easy and flexible way to collect responses (bonham, 2011, pp.22-23). although other simple tools were available, most of them were limited in their functions. some instructors think that google form is only for survey research purposes, but in fact it can also be used as a learning tool. besides, all the answers and responses are automatically collected in a spreadsheet, which makes it easier for the instructor to analyze large sets of data using charts and other spreadsheet functions. instructors can set a wide range of question types, including scale and grid questions that are generally not available in other web polling services (at least the free ones) such as survey monkey and poll everywhere. as the study involved specific students only, google form is very suitable because it can automatically record the email addresses of students who fill out the form and restrict access to a specific email domain. google form also has some disadvantages: it relies on cloud computing, which requires internet access, and the shared documents have no physical location, whereas many institutions may prefer data stored on site, under their own control, to data hosted by someone else at a remote location. these disadvantages were outweighed by the advantages, and google form suits the simple delivery of prt-wo to students.

a total of 113 items were crafted, covering all eight chapters.
the crafted items were given to the class instructor and underwent a content review to ensure that they were in accordance with the syllabus covered in the course. some modifications were made after getting feedback from the class instructor. the items were crafted to fit the first two levels of bloom’s taxonomy: remembering (level 1) and understanding (level 2) (krathwohl, 2002, pp.214, 218). since prt-wo was aimed at measuring reading effectiveness and at preparing students before class through targeted reading, the items should not be very difficult, in particular not at level 3 of the taxonomy. the aim was not to penalize students but to provide them with prior knowledge. examples of the developed items are shown in examples 1 and 2. example 1 was taken from chapter 2 part 2. once students receive an instruction to complete their reading assignment, they open the given google form link. the first section guides them on what they have to read in order to answer section 2. this item was rated as level 2 of the taxonomy, since the answer required students to formulate the concept of superposition in mathematical form. after the answer is submitted, the feedback of section 3 appears and hints at where the correct answers can be found, so that students can revise.

figure 1. example of prt-wo part 1 of chapter 1. (left: the targeted reading section; middle: quiz section; right: feedback review section)

example 1
section 1: targeted reading
you must understand that waves can interfere with each other, resulting in a new wave, which can be understood from the principle of superposition.
section 2: chapter 2 question 14
which of the following describes the principle of superposition of two waves, $y_1(x,t)$ and $y_2(x,t)$, to produce the resultant wave $y(x,t)$?
a. y(x,t) = y1(x,t) = y2(x,t)
b. y(x,t) = y2(x,t) y1(x,t)
c. y(x,t) = y1(x,t) y2(x,t)
d. y(x,t) = y1(x,t) + y2(x,t)
section 3: feedback review
pages 490-491.
principle of superposition in mechanical wave means that the amplitude of the resultant wave pattern created by the interference of travelling waves is the sum of the amplitudes of the travelling waves.

example 2 was taken from chapter 1 part 2. it required students to turn to a specific page containing the figure in question. therefore, students could not answer this question without referring to the main textbook. this item was rated as level 2 of the taxonomy.

example 2
section 1: targeted reading
you must understand that in real oscillating system, friction does exist and it causes the oscillation to be damped. you must be able to recognize the behaviours of the oscillation at various degrees of damping.
section 2: chapter 1 question 9
why will the heavy swinging bell (figure 14.25) eventually stop oscillating?
a. due to the damping forces
b. due to the mass of the bell
c. someone stops the bell
section 3: feedback review
page: 457. a swinging bell left to itself will eventually stop oscillating due to damping forces (air resistance and friction at the point of suspension), therefore, the energy is dissipated and the amplitude decreases and thus eventually stops.

each instrument needs to undergo validity and reliability testing. before the prt-wo was given to students, content validation of prt-wo was carried out. the purpose of validation is to see how far the relationship between variables can be ensured. according to merriam (2001), validity concerns the correspondence between the findings and reality. there are several methods that can be used to determine validity, such as triangulation, expert examination, expert review, statement of experience, and bias or hope. content validity by expert review was employed in this study. the prt-wo was content validated by two subject-matter experts from upsi.
the validation process required the experts to evaluate whether each item was conceptually correct and appropriate for measuring reading effectiveness. each expert rated whether or not they agreed with each item in prt-wo. the agreement data received were used to calculate cohen's kappa. cohen's kappa (cohen, 1960, pp.37-46) is a measure of agreement that adjusts the observed proportional agreement to take account of the amount of agreement which would be expected by chance. after undergoing the internal content validity and cohen's kappa test, the prt-wo went through the modification process again, with the items inspected one by one. in this process, errors in each item were minimized through, for example, grammatical correction, content revision, and checks of language appropriateness.

a pilot test was administered to determine the kr20 reliability value. it involved a group of 90 undergraduate students who registered for the vibration, waves, and optics (sft3023) course in semester 2 of the 2014/2015 session. the group consisted of two subgroups: 46 students in group a and 44 students in group b. the prt-wo was tested on group a only. however, only 24 students from group a answered all items in the prt-wo. the students needed to log in to the email account provided by upsi (siswa-mail) every time before answering the prt-wo. access to the google form was permitted only for users logged in via siswa-mail. the study was carried out over a semester of 14 weeks of instruction.

table 1. list of assessment components for group a

coursework                          percentage
prt-wo assignment                   3
lab report and other assignments    12
mid-term test                       15
mini project                        15
group presentation                  15
final examination                   40
total                               100

the 3% mark allocation was given as an incentive to the students.
the percentage was chosen to be moderate within the course marks, so that it neither seemed to force students nor could simply be ignored by them. this is important to make sure that students give their best contribution to the pilot test.

table 2. distribution and division of prt-wo

chapter | part | sub-chapter | release week-expired week
1 | part 1 | 1.1 describing oscillation; 1.2 simple harmonic motion; 1.3 energy in simple harmonic motion | w1-w2
1 | part 2 | 1.5 the simple pendulum; 1.7 damped oscillations; 1.8 forced oscillations and resonance | w1-w2
2 | part 1 | 2.1 types of mechanical waves; 2.2 periodic wave; 2.3 mathematical description of a wave; 2.4 speed of a transverse wave | w2-w3
2 | part 2 | 2.5 energy in s motion; 2.6 wave interference, boundary condition and superposition; 2.7 standing waves on a string; 2.8 normal modes of a string | w2-w3
3 | part 1 | 3.1 sound waves; 3.2 speed of sound waves; 3.3 sound intensity; 3.4 standing sound waves and normal modes | w4-w5
3 | part 2 | 3.5 resonance of sound; 3.6 interference of waves; 3.7 beats; 3.8 doppler effect | w5-w5
4 | part 1 | 4.1 maxwell's equation and electromagnetic waves; 4.2 plane electromagnetic waves and the speed of light; 4.3 sinusoidal electromagnetic waves | w6-w6
5 | part 1 | 5.1 the nature of light; 5.2 reflection and refraction; 5.3 total internal reflection | w7-w8
5 | part 2 | 5.4 dispersion; 5.5 polarisation; 5.6 scattering of light; 5.7 huygens's principle | w7-w8
6 | part 1 | 6.1 reflection and refraction at a plane surface; 6.2 reflection at a spherical surface; 6.3 refraction at a spherical surface; 6.4 thin lenses | w8-w9
6 | part 2 | 6.5 cameras; 6.6 the eye; 6.7 the magnifier; 6.8 microscopes and telescopes | w9-w9
7 | part 1 | 7.1 interference and coherent sources; 7.2 two-source interference of light; 7.3 intensity in interference patterns | w9-w10
7 | part 2 | 7.4 interference in thin films; 7.5 the michelson interferometer | w10-w11
8 | part 1 | 8.1 fresnel and fraunhofer diffraction; 8.2 diffraction from a single slit; 8.3 intensity in the single-slit pattern | w12-w12
8 | part 2 | 8.4 multiple slits; 8.5 the diffraction grating; 8.6 x-ray diffraction | w12-w13

at the beginning of the semester, in week 1, all the students were briefed on the methods and steps that they needed to know. every week, they were informed that prt-wo was available via siswa-mail, and a social media tool (such as whatsapp) was used as a reminder. this type of communication technique aligns with 21st century learning and literacy skills. if they faced technical problems, they could directly contact the prt-wo administrator through a whatsapp group. table 2 shows the chapters and sub-chapters covered by each part, together with the release and due weeks of the prt-wo. there were 15 parts of prt-wo, which were distributed almost periodically throughout the semester.

once the students had answered all parts of the prt-wo, the reliability analysis was carried out. before that, an item analysis was done to eliminate inappropriate items by determining the difficulty index of each item. only 24 undergraduate students answered all items in the prt-wo pilot test. the responses from this group of students were analyzed for kr20.

reliability refers to the consistency and stability of a study or test as a measuring instrument. it indicates whether the measure gives the same answer when it is used to measure the same concept in a similar population or group of respondents. there are several types of reliability, including internal consistency, test-retest, equivalent and stability, equivalent consistency and scorer. internal consistency was chosen for this instrument reliability test. it measures the extent to which the items in the test are consistent with each other and with the test overall. internal consistency reliability consists of three types: split-half reliability, kuder-richardson reliability, and cronbach's alpha reliability.
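as a minimal sketch of how the kuder-richardson 20 coefficient named above can be computed, the python function below assumes the usual textbook form kr20 = (k/(k-1)) * (1 - sum(p_i*q_i)/var), where k is the number of items, p_i is the proportion of students answering item i correctly, q_i = 1 - p_i, and var is the variance of the students' total scores. the function name, the choice of the population variance, and the small response matrix are illustrative assumptions; they are not taken from the prt-wo data.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson 20 for dichotomous (0/1) item scores.

    responses: one list per student, each holding a 0/1 score per item.
    Assumption: the population variance of total scores is used; some
    texts use the sample variance instead, which changes the value slightly.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    # p_i: proportion of students who answered item i correctly
    p = [sum(student[i] for student in responses) / n_students
         for i in range(n_items)]
    # sum of p_i * q_i over all items, with q_i = 1 - p_i
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)

    # variance of each student's total score
    totals = [sum(student) for student in responses]
    var_total = pvariance(totals)
    if var_total == 0:
        raise ValueError("all students have the same total score")

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# hypothetical responses: 4 students x 3 items (not PRT-WO data)
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(example), 2))  # 0.75
```

in this toy matrix the item proportions are 0.75, 0.50 and 0.25, the total-score variance is 1.25, and the coefficient works out to 0.75. with real data the same function would be applied to the response matrix of the students who completed every item.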
kuder-richardson 20 (kr20) reliability was used for prt-wo since the items were scored in two categories: correct or wrong. besides, sabri (2013, pp.1-14) describes kuder-richardson 20 as a formula based on item difficulty and used it to analyze the internal consistency of section a of a string instrument comprehensive test. the process used to produce prt-wo is summarized in figure 2.

figure 2. methodology framework

findings and discussion

instrument validity and reliability are important aspects of a social science study, especially for producing a sound instrument. validity, or legitimacy, refers to the extent to which an instrument measures what it is intended to measure, that is, the extent to which it meets its purpose (anastasi & urbina, 1997). a pilot study is a small study carried out with the aim of improving the validity and reliability of the instrument (fraenkel, wallen & hyun, 1993).

cohen's kappa index analysis was used to determine the degree of agreement between the two evaluators (experts) on the items in the prt-wo. stevens (1958, pp.177-196) states that agreement between evaluators is important for establishing high reliability for every unit used to describe a theme. the level of agreement was interpreted using the values recommended by landis and koch (1977, pp.159-174). the cohen's kappa value for prt-wo was 0.66, which was rated as a good level of instrument reliability (landis & koch, 1977, pp.159-174) and supported the reliability of the prt-wo.
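the kappa calculation described above can be sketched as follows, assuming the standard two-rater form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's marginal proportions. the function names and the ten ratings are hypothetical and are not the experts' actual ratings; the labels in the second function are the bands published by landis and koch (1977), in which 0.61-0.80 is called substantial agreement, the level described as good in the text.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty item list")
    n = len(rater_a)

    # observed agreement: share of items given the same rating by both raters
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # chance agreement from the marginal counts of each rating category
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (p_o - p_e) / (1 - p_e)


def landis_koch_label(kappa):
    """Agreement label from the Landis and Koch (1977) bands."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# hypothetical ratings for 10 items: 1 = agree with the item, 0 = disagree
expert_1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
expert_2 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 2), landis_koch_label(kappa))  # 0.58 moderate
print(landis_koch_label(0.66))                    # substantial
```

in the toy data the raters agree on 8 of 10 items (p_o = 0.80), chance agreement is 0.52, and kappa is about 0.58, which shows how a high raw agreement can shrink once chance is accounted for.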
the item analysis for each question in prt-wo was carried out by determining the difficulty level (p) as suggested by macintosh and morrison (1969). the item analysis was done before running the kr20 in order to eliminate items that were too difficult. the difficulty level is defined as follows:

p = n1/n

where p is the difficulty level, n1 is the number of correct responses, and n is the total number of students taking the test. the results are tabulated in table 3. items with p<0.39 were rejected, so 13 out of 113 items were removed. these items were considered very difficult. a low p indicated either that the question itself was difficult to understand in terms of language or that the answer was difficult to find in the main text. as the purpose of the instrument is to assess students' acquired knowledge through reading, very difficult items were removed so that students would not be demotivated from continuing their reading assignment.

example 3 (c3q4) was among the rejected items and was taken from chapter 3 part 1. this item was rated as level 2 of the taxonomy. the difficulty level of the item was p=0.00, meaning that nobody answered it correctly. even though the answer was implicitly written in the reference, students might have been confused by the other options.

example 3
section 1: targeted reading
sound travels at different speed in gas, fluid and solid and also in different temperatures. therefore, you have to understand how the sound travels in each of the medium.
section 2: chapter 3 question 4
the restoring term in the expression (page 514) for the wave speed of the sound waves
a. exhibits the difficulties in the fluid compression.
b. exhibits the massiveness of a bulk fluid.
c. has actually no relation to the sound wave.
d. follows newton second law.
section 3: feedback review
page: 514. to check you can derive the speed of sound waves in a fluid in a pipe. for your information, human speech works on the same principle.
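the screening rule described above (p = n1/n, with items below 0.39 rejected) can be sketched in a few lines of python. the function names and the four-student response matrix are hypothetical; the five p values in the second part are copied from table 3 purely to illustrate the cutoff.

```python
def difficulty_levels(responses):
    """Difficulty level p = n1/n for every item.

    responses: one list per student, each holding a 0/1 score per item.
    """
    n = len(responses)  # number of students taking the test
    n_items = len(responses[0])
    return [sum(student[i] for student in responses) / n
            for i in range(n_items)]


def screen_items(labels, p_values, cutoff=0.39):
    """Split item labels into kept and rejected using the rule p < cutoff."""
    kept = [label for label, p in zip(labels, p_values) if p >= cutoff]
    rejected = [label for label, p in zip(labels, p_values) if p < cutoff]
    return kept, rejected


# hypothetical responses: 4 students x 2 items
print(difficulty_levels([[1, 0], [1, 0], [1, 1], [0, 0]]))  # [0.75, 0.25]

# five difficulty levels as reported in table 3
labels = ["c1q4", "c1q5", "c1q6", "c3q4", "c3q19"]
p_values = [0.46, 0.63, 0.29, 0.00, 0.38]
kept, rejected = screen_items(labels, p_values)
print(kept)      # ['c1q4', 'c1q5']
print(rejected)  # ['c1q6', 'c3q4', 'c3q19']
```

keeping the cutoff as a parameter makes it easy to see how sensitive the retained item set is to the chosen threshold.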
another example of a rejected item is shown in example 4 (c1q?), taken from chapter 1 part 1. this item was initially rated as level 1 of the taxonomy, with p=0.29. on review, the item should have been rated as level 3, since it required students to interpret and analyse the equation of energy for simple harmonic motion.

table 3. item analysis of prt-wo with difficulty level (p)

items   difficulty level   items   difficulty level
c1q1    0.92               c4q6    0.67
c1q2    0.88               c4q7    1.00
c1q3    0.96               c4q8    1.00
c1q4    0.46               c5q1    0.08
c1q5    0.63               c5q2    0.92
c1q6    0.29               c5q3    0.96
c1q7    0.96               c5q4    0.96
c1q8    1.00               c5q5    1.00
c1q9    1.00               c5q6    0.88
c1q10   0.67               c5q7    0.96
c1q11   0.46               c5q8    0.96
c1q12   0.46               c5q9    0.96
c2q1    0.92               c5q10   0.96
c2q2    0.42               c5q11   0.29
c2q3    1.00               c5q12   0.92
c2q4    0.71               c5q13   0.29
c2q5    1.00               c5q14   0.25
c2q6    0.79               c5q15   0.42
c2q7    0.71               c6q1    1.00
c2q8    0.96               c6q2    0.92
c2q9    1.00               c6q3    1.00
c2q10   0.92               c6q4    0.83
c2q11   0.96               c6q5    0.92
c2q12   0.67               c6q6    0.92
c2q13   0.54               c6q7    0.96
c2q14   0.96               c6q8    1.00
c2q15   0.67               c6q9    1.00
c2q16   0.79               c6q10   0.96
c2q17   1.00               c6q11   0.92
c2q18   0.96               c6q12   0.96
c3q1    0.96               c6q13   1.00
c3q2    0.96               c6q14   0.96
c3q3    0.71               c6q15   1.00
c3q4    0.00               c6q16   1.00
c3q5    0.79               c7q1    0.96
c3q6    0.88               c7q2    1.00
c3q7    1.00               c7q3    0.96
c3q8    0.96               c7q4    0.83
c3q9                       c7q5    0.88
c3q10   0.75               c7q6    0.58
c3q11   0.96               c7q7    0.08
c3q12   1.00               c7q8    0.75
c3q13   0.88               c7q9    0.96
c3q14   0.75               c7q10   0.42
c3q15   0.92               c8q1    0.75
c3q16   1.00               c8q2    0.58
c3q17   0.96               c8q3    0.96
c3q18   0.96               c8q4    0.75
c3q19   0.38               c8q5    0.75
c3q20   0.50               c8q6    0.46
c3q21   0.46               c8q7    0.58
c3q22   0.04               c8q8    0.79
c4q1    1.00               c8q9    0.63
c4q2    0.50               c8q10   0.96
c4q3    0.33               c8q11   0.21
c4q4    1.00               c8q12   0.33
c4q5    1.00

example 4
section 1: targeted reading
you must understand the type of energy involved in shm and how it changes at different locations along the motion, yet the total energy is kept constant.
section 2: chapter 1 question 6
the total mechanical energy e of shm is also directly related to the ________ of oscillation.
i. amplitude
ii. angular frequency
iii. velocity
a. i only
b. i and ii
c. ii and iii
d. i, ii, and iii
section 3: feedback review
page: 446. refer to equation 14.20 and (14.9) which relates the total mechanical energy to other physical quantities of oscillation.

after the process of elimination in the item analysis, the remaining 100 items were analyzed for kr20. the value of the kr20 was found to be rtest = 0.68. values close to 0.80 are common for tests given to heterogeneous classes, whereas values as low as 0.50 are common for tests given to homogeneous groups of participants. a test with reliability higher than 0.7 is considered reliable for group measurement, and this is a widely accepted criterion (ding & beichner, 2009, pp.020103-2). the obtained value of 0.68 is close to this criterion and approximately in line with the results reported by ding and beichner (2009, pp.020103-2).

conclusion and recommendation

conclusion

the process of establishing the validity and reliability of prt-wo was discussed. the prt-wo was crafted and given to the class instructor for the first step of content validation. the instrument was revised according to the instructor's comments. then, prt-wo was content validated by two experts in the field. at the same time, the experts rated their agreement on each item. these agreement data were used to obtain a cohen's kappa value of 0.66. prt-wo was again revised according to the experts' review. the revised prt-wo was then distributed to the students in a series of parts throughout the semester for a pilot test. at the end of the semester, the prt-wo responses were analyzed using item analysis and kr20. the prt-wo was reduced from 113 items to 100 items because of unsuitable difficulty indices. finally, kr20 was computed on the remaining items, giving rtest = 0.68.
prt-wo was examined through a thorough process to measure pre-class reading effectiveness in terms of score, and it is available for further related studies in the teaching and learning of waves and optics. the validated and reliable prt-wo is available via the link provided in the references.

recommendation

any institution that uses the same syllabus and reference book may use prt-wo to assess students' pre-class reading effectiveness before they enter class. the students are expected to be prepared and actively engaged during the class discussion. at the time this paper was written, the 14th edition of the university physics book had already been published. since this study used the 13th edition, slight modifications, such as to figure and page numbering, can be made to adapt the instrument to the new edition.

acknowledgements

the researchers would like to acknowledge the ministry of higher education for providing the mybrain15 scholarship to support the first author in this study.

references

anastasi, a., & urbina, s. (1997). psychological testing (7th ed.). new jersey, nj: prentice hall.
bonham, s. (2011). whole class laboratories with google docs. the physics teacher, 49(1), 22-23.
cohen, j. (1960). a coefficient of agreement for nominal scales. educational and psychological measurement, 20, 37-46.
ding, l., & beichner, r. (2009). approaches to data analysis of multiple-choice questions. physical review special topics-physics education research, 5(2), 020103.
fraenkel, j.r., wallen, n.e., & hyun, h.h. (1993). how to design and evaluate research in education (vol. 7). new york, ny: mcgraw-hill.
heiner, c.e., banet, a.i., & wieman, c. (2014). preparing students for class: how to get 80% of students reading the textbook before class. american journal of physics, 82(10), 989-996.
henderson, c., & rosenthal, a. (2006). reading questions. journal of college science teaching, 35(7), 46.
hoeft, m.e. (2012). why university students don't read: what professors can do to increase compliance. international journal for the scholarship of teaching and learning, 6(2), 12.
howard, j.r. (2004). just-in-time teaching in sociology or how i convinced my students to actually read the assignment. teaching sociology, 32(4), 385-390.
kim, d. (2011). using google forms for student engagement and learning. educause quarterly, 34(1).
krathwohl, d.r. (2002). a revision of bloom's taxonomy: an overview. theory into practice, 41(4), 212-218.
landis, j.r., & koch, g.g. (1977). the measurement of observer agreement for categorical data. biometrics, 33(1), 159-174.
macintosh, h.g., & morrison, r.b. (1969). objective testing. london: university of london press.
merriam, s. (2001). qualitative research and case studies in education: revised and expanded from case study research in education. san francisco, ca: jossey bass.
moravec, m., williams, a., aguilar-roca, n., & o'dowd, d.k. (2010). learn before lecture: a strategy that improves learning outcomes in a large introductory biology class. cbe-life sciences education, 9(4), 473-481.
philips, g. (1995). using open book tests to encourage textbook reading in college. journal of reading, 38(6), 484.
podolefsky, n., & finkelstein, n. (2006). the perceived value of college physics textbooks: students and instructors may not see eye to eye. the physics teacher, 44(6), 338-342.
ryan, t.e. (2006). motivating novice students to read their textbooks. journal of instructional psychology, 33(2), 135-141.
sabri, s. (2013). item analysis of student comprehensive test for research in teaching beginner string ensemble using model based teaching among music students in public universities. international journal of education and research, 1(12), 1-14.
sikorski, j., rich, k., saville, b., buskist, w., drogan, o., davis, s.f., ... & geller, e.s. (2002). faculty forum. teaching of psychology, 29(4), 312-320.
stelzer, t., gladding, g., mestre, j.p., & brookes, d.t. (2009). comparing the efficacy of multimedia modules with traditional textbooks for learning introductory physics content. american journal of physics, 77(2), 184-190.
stevens, s.s. (1958). problems and methods of psychophysics. psychological bulletin, 55, 177-196.
young, h.d., & freedman, r.a. (2013). university physics with modern physics (13th ed.). san francisco, ca: pearson higher ed.

supplementary: prt-wo. retrieved from https://goo.gl/2mqftj.

research and evaluation in education
e-issn: 2460-6995
volume 1, number 2, december 2015 (pages 158-174)
available online at: http://journal.uny.ac.id/index.php/reid

the effectiveness of microcontroller instructional system through simulation program method by using trainer kit

1) edidas; 2) jalius jama
1)2) padang state university, indonesia
1) edidasunp@yahoo.com; 2) jaliusjama@yahoo.com

abstract

the study tested the learning effectiveness of a program simulation method using the trainer kit in microcontroller system courses. the study was conducted with the students who took the microcontroller system courses in the academic year of 2013/2014. the students who took the course were divided into three groups, which served as: (1) the experimental group, (2) the control group, and (3) the test instrument group. the awareness of thinking (y1) and the learning outcome (y2) served as the dependent variables; the motivation (x1) and the creativity (x2) served as the independent variables. the data were analyzed using the analysis of variance (anova) and the multivariate analysis of variance (manova) in order to see the differences in the learning outcomes and the level of students' thinking.
the results of the study showed that there were significant differences in the level of metacognition and learning outcome competence between the group which conducted program simulation learning by employing the trainer kit and the group not employing it.

keywords: effective learning, learning simulation, quasi-experiment, metacognition level, learning outcome competence level

introduction

the microcontroller system course in the electronic engineering study program, faculty of engineering, padang state university is one of the learning programs that belong to the group of technological and vocational education. technological and vocational education is offered in order to be selected only by talented people (prosser theorem). unlike general education, which equips the students with knowledge and the ability to pursue higher education, technological and vocational education equips the students with specific skills and competencies for entering employment.
according to presidential regulation number 8 year 2012 regarding the framework of indonesian national qualification (ind.: kerangka kualifikasi nasional indonesia (kkni)), the learning achievement of diploma iii graduates is at level 5. according to the national qualification, level 5 is described as follows:
(1) being able to perform wide-scope tasks, to select the appropriate method from both standardized and unstandardized ones by analyzing the available data, and to show performance of measured quality and quantity;
(2) being able to master the theoretical concepts of a certain scientific domain in general and to formulate procedural problem-solving efforts;
(3) being able to manage a workgroup and to compose a comprehensive written report; and
(4) being able to be responsible for his or her own job and to be in charge of the achievement of the workgroup.

with reference to the afore-mentioned description, students who undertake a diploma iii program should really master the concepts of science both theoretically and practically. in order to meet such an objective, they should exercise their thinking and working capacity. their desire to keep improving their thinking and working capacity will lead to creativity and innovation. the degree of qualification according to kkni can be seen in figure 1.

figure 1. the degree of qualification according to kkni (source: directorate general of higher education, 2011, p.15)

vocational education is education that aims to supply reliable and professional workers. the students who undertake vocational education programs are expected to become experts who are professional, responsible, and quick to find, understand and respond to the changes occurring in their environment.
within the learning process, they are introduced to multiple new problems and are trained in finding solutions in order to develop their capabilities, to find their own problem-solving alternatives and to make decisions quickly. if there is a problem in their job, they should be able to find multiple alternatives for solving it. in relation to this process, through the mastery of skills and expertise in the microcontroller system course, the students are expected to be able to meet the objectives of the study program.

producing a graduate who is capable of working independently requires a complex process. during the learning process, students are taught to master not only the vocational skills but also the entrepreneurial ability and the thinking awareness (metacognition). thereby, the workers who are capable of working independently will be the graduates who are able to combine the following three capabilities at the same time: (1) being able to implement the skill competency for performing their tasks; (2) being able to implement the entrepreneurial skill competency for starting their business; and (3) being able to use their ideas for developing their business, for finding opportunities and for overcoming the existing obstacles.

in the microcontroller system course, there are many competencies that the students should master; in general, the competencies might be divided into two parts: (1) microcontroller system hardware-manufacturing capability (microcontroller electronic combinations); and (2) microcontroller system software-manufacturing capability (application programs). the ability to manufacture the hardware should be mastered before the ability to manufacture the software.
conversely, in order to manufacture the hardware of the microcontroller system, the students should master microcontroller programming first.

the condition of the students before the research was conducted is summarized by the learning achievement data in table 1. within the learning process for a vocational competency, the learning achievements are categorized into go (competent) and not go (not competent). table 1 shows that, from the two learning groups with a total of 35 members, only 12 people or 34% of the group members had mastered the competency. on the other hand, 23 people or 66% of the group members had not mastered the competency. this finding was the underlying reason for conducting the study, which sought to find and analyze the variables that might influence such learning achievements. there might be various variables, such as aspects of the teaching staff, facilities, teaching methods, internal environments and external environments.

table 1. the results of students' competence achievement

score interval | quality level | number of students, section 86186 | number of students, section 86187 | competence category | number of university students in category | percentage
81-100 | a | 1 | 0 | competent | 12 | 34%
66-80 | b | 6 | 5 | competent | |
56-65 | c | 4 | 5 | incompetent | 23 | 66%
41-55 | d | 1 | 0 | incompetent | |
0-40 | e | 5 | 8 | incompetent | |
number of university students | | 17 | 18 | | 35 | 100%

several problems that occur within the course are as follows:
(1) the students' low motivation in microcontroller system component manufacturing and programming;
(2) the students' low creativity in microcontroller system component manufacturing and programming;
(3) the students' low thinking awareness and understanding in microcontroller system component manufacturing and programming;
(4) the low capability of identifying component damage and program errors in the microcontroller system;
(5) the absence of effective learning methods for the microcontroller system; and
(6) the students' low capability of cooperation, interaction and tolerance when working as a group.

based on this explanation, the study was conducted with the following objectives: (1) to explain the absence of interaction between the independent variables in the two groups, namely the group that implements and the group that does not implement the microcontroller mcs51 trainer kit; and (2) to explain that the level of metacognition and the level of competency achieved by the students under the program simulation learning method are better for the group that implements the microcontroller mcs51 than for the group that does not.

simulation learning method

multiple methods might be used in order to master the skills in the microcontroller system, and one of such methods is simulation.
a simulation learning method is a learning process that employs simulator media as the learning tools. the simulator media might be in the form of software or hardware. simulator media in the form of software display the behavior of the simulated phenomena on the computer screen; simulator media in the form of hardware employ the actual tools, but the target of the hardware might be scaled down or replaced by a similar one. in relation to simulation learning, nesbit in joyce, weill and calhoun (2009, p.443) states that: 'simulation might stimulate the learning about: (1) competition; (2) cooperation; (3) empathy; (4) social system; (5) concept; (6) skills; (7) effectiveness; (8) penalty; (9) role of opportunity/chance; and (10) opportunity to perform critical thinking.'

learning by means of the simulation method might also enrich the knowledge, the skills and the attitudes of the students. by paying attention to and performing the simulation, the students might increase their understanding of, their skills in and their attitudes toward the phenomena of a program. the more program variations the students simulate, the more in-depth understanding they will gain and the more skillful they will be in performing their job and in displaying their habits.

microcontroller mcs51 trainer kit simulator

in simulation learning, the trainer kit serves as the learning media. learning media are the components of the instructional delivery system that might be used in supporting the learning process. the development of learning media should be based on the view that the learning process will run smoothly, effectively and enjoyably if it is supported by learning media that draw the students' interest and attention.

motivation

the program simulation learning by means of a trainer kit is expected to encourage the students' learning motivation.
motivation is defined as a power that encourages a person to do and to direct his or her activities. motivation might come from the inside or from the outside such as from the surrounding neighborhood. motivation that comes from the inside is defined as intrinsic motivation, while motivation that comes from the outside is defined as extrinsic motivation. the indicators of motivation from the intrinsic factors are as follows: (1) the existence of desire and expectation toward success; (2) the existence of needs and encouragement toward learning; (3) the existence of expectation and dream toward future; and (4) the existence of achievements toward the learning process. on the other hand, the indicators of motivation from the extrinsic factors are as follows: (1) the existence of appreciation toward the learning process; (2) the existence of interesting activities within the learning process; and (3) the existence of conducive learning environment (sardiman, 2012, p.83). handoko (1992, p.59) states that „in order to find the power of students‟ motivation people may pay attention to the following indicators: (1) the strength of willingness to do something; (2) the duration devoted to the willingness to learn; (3) the willingness to leave other duties/tasks; and (4) the dilligence in working on the assignments.‟ another theory related to motivation is the expectancy theory of motivation. the expectancy theory of motivation defines that the intensity of tendency to perform something under certain way depends on the intensity of expectation. this theory emphasizes more on the outcomes than on the needs. research and evaluation in education the effectiveness of microcontroller instructional system... 
according to mcclelland (1961), whose theory concerns the human needs that are vital within an organization or a company, motivation rests on three needs: (1) the need for achievement, namely the drive to meet the job standards that have been set and to strive for job achievements; (2) the need for power, namely the motivation to rule, which makes people behave in a normal and wise way while they actually want to gain control or to be acknowledged in their community; and (3) the need for affiliation, namely the desire to be friendlier and closer to colleagues in order to cooperate and to meet the intended objectives.

creativity

creativity is a unique mental process that is carried out solely in order to generate something new, different and original (hurlock, 1978, p.3). this unique mental process takes the form of creating new ideas that differ from the existing ones. a student who may be considered highly creative is one who has many new ideas. the new ideas arise when he or she sees a new object, and they may take the form of simplifications that ease the use, the acquisition or even the production of the object. piirto (2011, p.1) states that "creativity is simply defined here, as 'to make something new,' as a prerequisite to innovation." some people think that creativity exists only in art, technique, souvenirs, film and the like, which can be manipulated or created. in fact, creativity is not limited to certain domains; it may cover all existing domains. explicitly, one might state that every activity that creates something new is a form of creativity.
the 21st century skills, one of which is the creative and innovative skill, have eight indicators as follows (piirto, 2011, p.1): (1) being able to use a wide range of idea creation techniques such as brainstorming; (2) being able to create new and worthwhile ideas, meaning that a creative person always thinks about new aspects when solving problems; (3) being able to elaborate, refine, analyze and evaluate their own ideas in order to improve and maximize creative efforts, meaning that new ideas need to be elaborated in order to see their strengths and weaknesses and that efforts should be made to eliminate any weaknesses found; (4) being able to develop, implement and communicate new ideas to others effectively, meaning that a creative person always communicates each new idea to other people so that it can be implemented well and so that others can contribute further ideas that strengthen the implementation; (5) being able to be open and responsive to new and diverse perspectives and to incorporate group input and feedback into the work, meaning that a creative person is always open and responsive toward new ideas; (6) being able to demonstrate originality and inventiveness in work and to understand the real world limits to adopting new ideas, meaning that a creative person should be able to create something new, unique and original that differs from what already exists; (7) being able to view failure as an opportunity to learn and to understand that creativity and innovation is a long-term, cyclical process of small successes and frequent mistakes, meaning that creative people always see failures as opportunities to learn; and (8) being able to act on creative ideas to make a tangible and useful contribution to the field in which the innovation will occur, meaning that a creative person will always make a concrete contribution within the domain in which the innovation will be implemented.

metacognition

metacognition is the thinking awareness that a person has in understanding his or her job and the objectives of doing it. zohar and dori (2012, p.58) state that "metacognition usually is subdivided into two distinct components, including knowledge of cognition and regulation of cognition." in general, then, metacognition is divided into two components: metacognitive knowledge and metacognitive regulation. thinking awareness is the awareness that a person has of what he or she knows and what he or she is doing; in other words, metacognition is a person's awareness of his or her own thinking process. flavell (1979, p.1) defines metacognition as having four components: (1) metacognitive knowledge; (2) metacognitive experiences; (3) metacognitive tasks and goals; and (4) metacognitive strategies or actions.

metacognitive knowledge

metacognitive knowledge is related to declarative knowledge, procedural knowledge and conditional knowledge. flavell provides several examples of metacognitive knowledge: someone believes that he or she learns better by listening than by reading, or views his or her friends as more socially aware than himself or herself. one's beliefs about oneself as a learner may facilitate or inhibit one's performance in the learning process.

metacognitive experiences

metacognitive experiences are related to planning skills, predicting skills, monitoring skills and evaluating skills. information, memories or past experiences may be recalled as a resource in the problem solving process. metacognitive experiences also include affective responses toward tasks.

metacognitive tasks and goals

metacognitive tasks and goals are the objectives or the results expected from the cognitive efforts.
the metacognitive tasks and goals include understanding, committing facts to memory, producing something such as a written document or an answer to a mathematics problem, or simply improving one's knowledge of certain objects. achieving these objectives draws heavily on both metacognitive knowledge and metacognitive experiences.

metacognitive strategies or actions

metacognitive strategies are designed to monitor cognitive progress. they are processes applied in order to control the cognitive activity itself and to ensure that the cognitive objectives (for example, drawing a problem solving flowchart, writing program syntax or understanding a programming algorithm) have been met. a person with good metacognitive skills and awareness uses these processes to monitor the learning process itself, to plan and monitor the cognitive activities that are in progress and to compare the cognitive results with internal or external standards. an example of using metacognitive strategies is reflecting at the end of every learning session with the aim of improving the knowledge content or of monitoring and evaluating the new knowledge.

competencies of learning achievements

the term competency is derived from the latin word "competere", which means suitability, or suitability for certain jobs. according to spencer and spencer (1993, p.9), a competency is an underlying characteristic of an individual that is causally related to criterion-referenced effective and/or superior performance in a job or situation. the underlying characteristic implies that competency is an integral part of one's personality that has been embedded for a long time and that can predict behavior in a variety of tasks and job situations.
causally related implies that the competency causes or predicts behavior and performance. criterion-referenced implies that the competency actually predicts who will do a job well or poorly, as measured by specific criteria or standards. competencies, therefore, are a set of characteristics that drive an individual's motives and that indicate how an individual acts, thinks or generalizes across situations appropriately over the long term. the decree of the minister of national education of the republic of indonesia number 045/u/2002 regarding the core curriculum of higher education states in article 1 that competencies are a set of intelligent and fully responsible actions that an individual has as a prerequisite to be considered by the community as being able to perform tasks in certain job domains. a set of intelligent actions implies that an individual who has competencies is able to act with sufficient knowledge and science gained from his or her learning achievement. an individual who acts intelligently will not take foolish actions that might damage the tools or endanger himself or herself. by acting intelligently, an individual is able to predict the results and the risks of the job that he or she is doing. in formal education, competencies are the minimum capability qualification that describes the mastery of the knowledge, skills and attitudes that have been studied. competencies are a set of tasks that must be carried out in order to perform a job well and correctly. knowledge mastery is an effort to learn and understand the scientific knowledge that underlies the related jobs.
the scientific concepts are the knowledge of the theories and the laws related to the jobs that should be done, including knowledge about work and environmental safety. the fundamental scientific knowledge regarding a job serves as the foundation of the competencies. for example, in order to be considered competent in the domain of microcontrollers, a person should understand electrical and electronic engineering. the more fundamental aspects of competency in the microcontroller domain include atomic theory, ohm's law, kirchhoff's laws, electrostatics and the other laws of physics that are related to electric current and electrical resistance. the last component of competencies that should be mastered by an individual is the affective component, namely attitude. the affective component is the most important component because a good and appropriate attitude will generate job results that are useful for humankind. the attitude of a competent individual heavily determines whether the results of his or her work will improve the quality of human life without causing any harm to the environment.

other studies relevant to the present study have been conducted by many researchers, and some of them are discussed in the following paragraphs. saemah and phillips (2006) studied 374 second-year students in five colleges of the national university of malaysia (universiti kebangsaan malaysia). the study examined the relationship between metacognitive capability, attitude and learning achievement. the results show that the level of metacognition has a direct and positive effect on learning achievement (β = 0.358; t = 2.851; p < 0.05). the results therefore suggest that learning achievement might be improved by improving the students' metacognition.
on the other hand, in order to improve metacognitive awareness, students might set learning objectives and improve their self-efficacy. however, the component of learning motivation that is oriented toward achievement has a negative effect on the improvement of students' thinking awareness. in addition, this component of learning motivation also has a negative effect on learning achievement. thereby, it might be concluded that achievement targets in the learning process might have an adverse effect on learning achievement. then, eskrootchi and oskrochi (2010) studied 72 students from the northwest middle school in kansas city, kansas. the results of the study show that the students who learned by means of the simulation method had better understanding than the others. the finding strengthens the researchers' belief that the simulation method has a large effect on learning achievement. furthermore, liu (2010) studied 45 university students from a public university in taiwan. the prototype of computer-assisted learning (cal)-based simulation was named simulation-assisted learning statistics (sals). the result of his study shows that the learning process that implemented sals was effective in reducing the university students' mistakes in statistics. from the studies that have been discussed, the researchers concluded that learning by means of the sals simulation method might decrease university students' mistakes in statistics. these results will serve as a point of comparison for the present study. in addition, del populo pablo et al. (2012) performed an evaluation study of the is-lm simulation program with a group of macroeconomics students at the university of sevilla, spain, during the academic year of 2009/2010.
is-lm is a term in economics that stands for investment-saving/liquidity preference-money supply. is-lm is fundamental material in the teaching of short-term macroeconomics, and the teaching is basically conducted by using graphs. the is-lm simulation program was found to help overcome the difficulties in understanding the curves because the university students are able to visualize the changes in the curves when the parameter values are changed. the analysis of variance (anova) of the scores of all of the university students, together with several complementary statistical tests, shows the differences between the university students who used the simulator and those who did not. the study reports that the average scores of the university students who used the simulation program are significantly higher than those of the students who did not. the f-value of the anova equals 1.073. the kruskal-wallis chi-squared statistic equals 2.51 with a p-value equal to 0.285. alias (2012) studied 40 respondents with the background of expert, instructor and learner who actively engaged in online and distance education. four experts were selected, two from england and two from malaysia. the instructors and the learners were also selected from people who had been active in distance learning via the internet. the main objective of the study was to describe the design, the development and the formative assessment of a learning process delivered via the internet.
the study was to answer the following questions: (1) what strategies should be implemented in order to encourage the motivation of e-learning users?; (2) how should web technology be implemented in order to encourage the students' motivation?; and (3) what is the e-learners' response toward the effectiveness, the practicality and the assessment of the learning console? the results of the study show that the learning console has the potential to help the students reflect on their learning process (m = 4.0; sd = 0.47), to encourage the students to improve their initiative (m = 4.33; sd = 0.67) and their self-efficacy (m = 4.0; sd = 0.47), to keep the students motivated in the learning process (m = 3.9; sd = 0.74) and to provide a sense of achievement (m = 3.8; sd = 0.79). the university students also judged that the learning console provided opportunities for interaction between the students and the instructors (m = 3.9; sd = 0.74). on the other hand, the students judged that the learning console had less potential for providing feedback (m = 3.3; sd = 1.25). hung et al. (2012) conducted another study of 117 fifth grade students of an elementary school located in the southern part of taiwan. the students were divided into two groups, namely an experimental group consisting of 60 students (35 male students and 25 female students) and a control group consisting of 57 students (31 male students and 26 female students). the experimental group took part in the study by performing new project-based learning in the form of digital storytelling, and the control group took part by performing conventional project-based learning, including the project assignments and the presentation of results in groups together with the teachers' feedback and evaluation.
the results of the study show that the experimental group, namely the group that performed the new project-based learning in the form of digital storytelling, effectively improved their learning motivation (f = 20.38; p < 0.001). digital storytelling is a learning approach that employs a computer program application as the working guideline during the learning process. the students follow the learning sequence by clicking the buttons according to the directions on the computer screen. the strength of the digital storytelling approach is that the working guideline is displayed not only as text or sketches but also in the original form of the item that is to be clicked.

hypothesis

based on the elaboration, several hypotheses are formulated as follows: (1) the study will explain the absence of interaction between the independent variables in both groups, namely the group that implements the microcontroller mcs51 trainer kit simulation program and the group that does not; and (2) the study will explain the level of metacognition and the level of competency in the results of the students who implement the microcontroller mcs51 trainer kit simulation program; the assumption is that the students who employ the trainer kit have better learning achievements.

research method

the study was a quasi-experiment with a small and homogeneous sample because it was conducted in the classes of the microcontroller system course. the approach used was quantitative, in which the data were described quantitatively.
the data were in the form of numbers and were analyzed in order to describe how effective microcontroller program learning by means of the trainer kit simulation was in comparison with learning without the simulation program. the research design selected was the nonequivalent control group design, in which the experimental group and the control group were not selected randomly but were based on the existing classes of the microcontroller system course, and these groups appeared to have similar competence and thinking awareness at the beginning of the study. in order to find classes with similar conditions, the researchers administered a preliminary test to all of the existing classes and assessed the results. based on the results of the assessment, the researchers decided on the two classes that had nearly the same competence and thinking awareness. the study also took into account the moderator variables that might influence the results. based on the analysis of the preliminary test results for all classes, the researchers identified the two classes to be studied, namely class a (section 42951) and class b. the site of the study was the department of electronic engineering, faculty of engineering, padang state university, within the microcontroller system course. the study ran from january to june 2014. the population of the study was the students of the diploma 3 electronic engineering study program, department of electronic engineering, who took the microcontroller system course from january to june 2014. the population numbered 63 students in four classes. for the sample selection, the researchers implemented nonprobability sampling, in which the population members did not have an equal opportunity to be selected as the sample because the university students had already been grouped into classes.
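the class-matching step described above can be sketched in a few lines. this is only an illustration under stated assumptions: the source does not say which matching criterion was used, so the sketch assumes the two intact classes with the closest mean preliminary-test scores are chosen, it uses a single score per student (the study matched on both competence and thinking awareness), and the function name `closest_pair` and all scores are hypothetical rather than the study's data.

```python
from itertools import combinations
from statistics import mean


def closest_pair(pretest_by_class):
    """Return the two class labels whose mean pretest scores differ least.

    Assumption: 'similar capability' is read as the smallest absolute
    difference between class means; the source does not state the rule.
    """
    means = {label: mean(scores) for label, scores in pretest_by_class.items()}
    return min(
        combinations(sorted(means), 2),
        key=lambda pair: abs(means[pair[0]] - means[pair[1]]),
    )


# Hypothetical preliminary-test scores for four intact classes
# (illustrative numbers only, not taken from the study).
pretest = {
    "a": [62, 70, 68, 74],
    "b": [64, 69, 71, 72],
    "c": [50, 55, 58, 61],
    "d": [80, 82, 79, 85],
}

pair = closest_pair(pretest)
print(pair)  # the pair of classes to use as experimental and control groups
```

because intact classes are compared rather than randomly assigned individuals, a match on pretest means reduces, but does not remove, initial differences between the groups.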
the opportunity to be selected as the sample fell to the existing classes, and these classes were selected based on the results of the preliminary test that was administered to all of them. based on the results of the preliminary test, the researchers found two classes with similar capability. the sample was divided into two groups: the experimental group and the control group. the independent variables were the learning motivation (x1) and the university students' creativity (x2). the dependent variables were the students' metacognition (y1) and the students' learning achievements (y2). the control variables were the variables that had a tendency to influence the dependent variables; therefore, they were held constant during the study so that nothing else influenced the relationship between the independent variables and the dependent variables. the moderating variables were the variables that arose together with the independent variables and that influenced the dependent variables. the moderating variables were calculated because their influence on the two dependent variables was significant. the operational definition for each of the variables was as follows. (1) the learning motivation (variable x1) is the encouragement to perform an activity that arises within an individual. in order to gather the data regarding motivation, the researchers formulated the following indicators: (a) willingness to take action; (b) time provided for learning; (c) expectation toward appreciation (valence, expectancy and instrumentality); (d) the motives of gaining achievement, cooperation and power; and (e) diligence in accomplishing the assignments. (2) the creativity (variable x2) is a situation in which an individual will always try something new.
in order to gather the data regarding creativity, the researchers formulated the following indicators: (a) using multiple ways of generating ideas; (b) creating new and significant ideas; (c) being able to elaborate (explain), improve, analyze and evaluate his or her own ideas in order to increase and maximize the creative efforts; (d) developing, implementing and communicating new ideas to other people effectively; (e) being open and responsive toward new and diverse perspectives and incorporating group advice and feedback into the job; (f) showing skill and originality in the job and understanding the limitations of the real world in adopting new ideas; (g) viewing failure as an opportunity to learn, since success is a long-term process that starts from small achievements with many failures; and (h) acting on creative ideas in order to make an actual and useful contribution to the domain in which the innovation might take place. (3) the students' metacognition (variable y1) is the level of thinking awareness that the students have during the learning process. in order to gather the data regarding students' metacognition, the researchers formulated the following indicators: (a) having metacognitive knowledge; (b) having metacognitive experiences; (c) having metacognitive tasks and objectives; and (d) having metacognitive strategies and actions. (4) the competencies of students' learning achievements (variable y2) are the assessment of the level of proficiency that a student has attained after attending the learning process. in order to gather the data regarding the competency of students' learning achievements, the researchers formulated the following indicators: (a) having metacognitive capability; (b) having psychomotor capability; and (c) having affective capability.
in addition to the independent variables and the dependent variables, there were also the control variables, namely the other variables that might have a strong tendency to influence the dependent variables. the influence of the control variables was kept constant in both the experimental group and the control group. the control variables were as follows: lecturer or teacher preparedness, learning space comfort, learning period availability and learning media. the control variables were included in the calculation of the data analysis because all of them were held constant for the two groups. the control variables whose condition was held constant are shown in the table of control variable analysis given in the appendix. in the table, there are 10 control variables: (1) lecturer preparedness; (2) technician preparedness; (3) space availability; (4) tool availability; (5) material availability; (6) teaching set readiness; (7) learning time sufficiency; (8) environmental support; (9) environmental inhibition; and (10) other support/inhibition. the performance variable of the microcontroller mcs51 media trainer that was selected for the treatment in the experimental group was validated first. the validation of trainer performance was conducted by gathering information through documentation and a questionnaire. three indicators were validated, namely: (1) trainer design compatibility; (2) trainer implementation practicality; and (3) trainer reliability. in order to gather the data regarding compatibility, practicality and reliability, an instrument was prepared that measured the compatibility, the practicality and the reliability of the microcontroller trainer kit.
the instrument for measuring the compatibility, practicality and reliability of the mcs51 microcontroller trainer kit was given to five respondents, consisting of lecturers and technicians who were used to running the mcs51 microcontroller trainer. the results of the measurement were written into an observation table and then analyzed in order to find their validity and reliability. for dichotomous data, the analysis of the mcs51 microcontroller trainer kit reliability performance used the agreement technique by means of cohen's kappa coefficient (wood, 2007):

$$k = \frac{p_o - p_e}{1 - p_e} \qquad \text{(formula 3.1)}$$

note: k = inter-rater agreement coefficient; p_o = proportion of observed agreement between the two raters; p_e = proportion of agreement expected by chance.

the kappa coefficients were summarized into an assessment of trainer kit compatibility as a microcontroller system learning medium. with the assistance of spss statistical software, the researchers calculated the inter-rater reliability coefficients. because the indicator of trainer design compatibility was assessed by five raters, there are 10 coefficients k, one for each pair of raters, as listed in table 2.

table 2. summary of cohen's kappa coefficients for the assessment of trainer kit compatibility as the microcontroller system learning media

| no | cohen's kappa coefficient | rater pair | note |
|----|---------------------------|------------|------|
| 1 | k1 | r1 * r2 | agreement between rater 1 and rater 2 |
| 2 | k2 | r1 * r3 | agreement between rater 1 and rater 3 |
| 3 | k3 | r1 * r4 | agreement between rater 1 and rater 4 |
| 4 | k4 | r1 * r5 | agreement between rater 1 and rater 5 |
| 5 | k5 | r2 * r3 | agreement between rater 2 and rater 3 |
| 6 | k6 | r2 * r4 | agreement between rater 2 and rater 4 |
| 7 | k7 | r2 * r5 | agreement between rater 2 and rater 5 |
| 8 | k8 | r3 * r4 | agreement between rater 3 and rater 4 |
| 9 | k9 | r3 * r5 | agreement between rater 3 and rater 5 |
| 10 | k10 | r4 * r5 | agreement between rater 4 and rater 5 |

then, the inter-rater reliability was calculated as the average agreement:

$$\bar{k} = \frac{1}{10} \sum_{i=1}^{10} k_i \qquad \text{(formula 3.2)}$$

according to landis and koch in altman (1991, p.404), the meaning of k is as follows:

< 0.20 = very bad
0.21 - 0.40 = bad
0.41 - 0.60 = moderate
0.61 - 0.80 = good
0.81 - 1.00 = very good

since the data on practicality and reliability are ordinal, in order to find the reliability of the practicality of the trainer kit media use, the researchers used the intraclass correlation (icc) with the following formula: ... (formula 3.3)

note: icc = intraclass correlation coefficient; var(α) = the variation of an aspect's assessment across the raters; k = number of raters; var(β) = a rater's assessment average; n = number of items; var(e) = a rater's average error.

with the assistance of spss statistical software, the researchers calculated the agreement coefficient for each pair of raters. since there are 5 raters, there are 10 coefficients k (inter-rater reliability coefficients), and the summary is given in table 3.

table 3. summary of cohen's kappa coefficients for the trainer kit compatibility assessment as the microcontroller system learning media

| no | cohen's kappa coefficient | rater pair | k score |
|----|---------------------------|------------|---------|
| 1 | k1 | r1 * r2 | 0.588 |
| 2 | k2 | r1 * r3 | 1.000 |
| 3 | k3 | r1 * r4 | 1.000 |
| 4 | k4 | r1 * r5 | 0.300 |
| 5 | k5 | r2 * r3 | 0.588 |
| 6 | k6 | r2 * r4 | 0.588 |
| 7 | k7 | r2 * r5 | 0.588 |
| 8 | k8 | r3 * r4 | 1.000 |
| 9 | k9 | r3 * r5 | 0.300 |
| 10 | k10 | r4 * r5 | 0.300 |
| | k total | | 6.252 |
| | k average | | 0.625 |

to obtain the reliability level assessed by the five raters, the researchers computed the average agreement among all of the raters by using formula 3.2:

$$\bar{k} = \frac{6.252}{10} = 0.625$$

according to the inter-rater agreement criteria for cohen's kappa coefficient, k = 0.625 falls in the "good" category. thereby, the design of the mcs51 microcontroller system trainer kit is compatible for implementation in the microcontroller system laboratory practice. in order to find the reliability of the practicality of the trainer kit media use from the ordinal data, the researchers used the intraclass correlation (icc) by implementing formula 3.3. by using spss to analyze the assessment reliability of the media trainer kit practicality indicator in the microcontroller system laboratory practice, the researchers found the intraclass reliability coefficient displayed in table 4.

table 4. intraclass correlation reliability coefficient regarding the media trainer kit use practicality

| measure | intraclass correlation | 95% ci lower bound | 95% ci upper bound | f test value (true value 0) | df1 | df2 | sig. |
|---------|------------------------|--------------------|--------------------|-----------------------------|-----|-----|------|
| single measures | .665 | .317 | .932 | 10.921 | 5 | 20 | .000 |
| average measures | .908 | .699 | .986 | 10.921 | 5 | 20 | .000 |

based on the average measures value of the intraclass correlation, the coefficient for the indicator of media trainer kit use practicality is 0.908, so the practicality of the media trainer kit is in the "very good" category.
therefore, the mcs51 microcontroller system trainer media kit is practical for operation in the microcontroller system laboratory practicum. then, the researchers performed the reliability test toward the mcs51 microcontroller system trainer kit reliability during the laboratory practicum of microcontroller system. based on the analysis of the media trainer kit reliability, with the assistance of spss, the coefficient of interclass correlation single measures is equal to 0.563 and the results is displayed in table 5. the coefficient research and evaluation in education the effectiveness of microcontroller instructional system... 170 edidas & jalius jama value states that the trainer kit has moderate reliability for the implementation in the laboratory practicum of microcontroller system. in sum, statistically the mcs51 microcontroller media trainer kit can be implemented in the training process in the experimental group. table 5. the reliability coefficient of interclass correlation regarding the trainer reliability intraclass correlation coefficient intraclass correlationa 95% confidence interval f test with true value 0 lower bound upper bound value df1 df2 sig single measures .563b .268 .848 7.429 8 32 .000 average measures .865c .647 .965 7.429 8 32 .000 findings and discussions the first hypothesis testing was to test whether or not there was any interaction among several independent variables both in the experimental group and control group. the first hypothesis testing was useful for investigating whether there was any interaction between the first and the second independent variable on the dependent variable. therefore, the hypothesis is as follows: h0 : there is no interaction between the students‟ learning motivation (x1) and the students‟ creativity (x2). h1 : there is interaction between the students‟ learning motivation (x1) and the students‟ creativity (x2). table 6. 
results of multivariate test toward the experimental group

experimental group: multivariate tests (c)
effect                   statistic            value     f             hypothesis df   error df   sig.
intercept                pillai's trace       .999      3.868e3 (a)   2.000           5.000      .000
                         wilks' lambda        .001      3.868e3 (a)   2.000           5.000      .000
                         hotelling's trace    1.547e3   3.868e3 (a)   2.000           5.000      .000
                         roy's largest root   1.547e3   3.868e3 (a)   2.000           5.000      .000
motivation               pillai's trace       1.099     1.830         8.000           12.000     .166
                         wilks' lambda        .078      3.214 (a)     8.000           10.000     .044
                         hotelling's trace    9.487     4.743         8.000           8.000      .021
                         roy's largest root   9.242     13.862 (b)    4.000           6.000      .003
creativity               pillai's trace       1.390     3.417         8.000           12.000     .027
                         wilks' lambda        .049      4.391 (a)     8.000           10.000     .016
                         hotelling's trace    10.425    5.212         8.000           8.000      .016
                         roy's largest root   9.482     14.223 (b)    4.000           6.000      .003
motivation * creativity  pillai's trace       .165      .493 (a)      2.000           5.000      .638
                         wilks' lambda        .835      .493 (a)      2.000           5.000      .638
                         hotelling's trace    .197      .493 (a)      2.000           5.000      .638
                         roy's largest root   .197      .493 (a)      2.000           5.000      .638
a. exact statistic
b. the statistic is an upper bound on f that yields a lower bound on the significance level.
c. design: intercept + motivation + creativity + motivation * creativity

based on the output of the multivariate significance tests (pillai’s trace, wilks’ lambda, hotelling’s trace, and roy’s largest root), as displayed in table 6, the researchers found that the significance values in the motivation*creativity row are 0.638; 0.638; 0.638 and 0.638 respectively. all of these significance values are above 0.05, so h0 cannot be rejected. thereby, in the experimental group there is no significant interaction between motivation and creativity. the finding shows that in the experimental group the change of metacognition and competence is related to the motivation and the creativity as separate factors rather than to their interaction. next, the first hypothesis testing is administered to the control group.
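the reading of the motivation*creativity row of table 6 can be sketched as below. the sketch applies the conventional decision rule at a significance level of 0.05 (an assumption about the level intended) to the four significance values, and checks that the four statistics agree with one another when the effect has a single nonzero eigenvalue. the variable and function names are illustrative only.

```python
ALPHA = 0.05  # assumed significance level

# motivation*creativity row of table 6: statistic -> (value, sig.)
INTERACTION = {
    "pillai's trace": (0.165, 0.638),
    "wilks' lambda": (0.835, 0.638),
    "hotelling's trace": (0.197, 0.638),
    "roy's largest root": (0.197, 0.638),
}


def decision(p_value, alpha=ALPHA):
    """conventional rule: reject h0 only when the significance value is below alpha."""
    return "reject h0" if p_value < alpha else "fail to reject h0"


def single_root_statistics(eigenvalue):
    """pillai, wilks and hotelling statistics implied by one nonzero eigenvalue."""
    return {
        "pillai's trace": eigenvalue / (1 + eigenvalue),
        "wilks' lambda": 1 / (1 + eigenvalue),
        "hotelling's trace": eigenvalue,
    }


if __name__ == "__main__":
    for name, (value, p) in INTERACTION.items():
        print(f"{name:20s} value={value:.3f} sig.={p:.3f} -> {decision(p)}")
    # roy's largest root is taken as the single eigenvalue of the effect
    implied = single_root_statistics(INTERACTION["roy's largest root"][0])
    for name, value in implied.items():
        print(f"implied {name}: {value:.3f}")
```

with one nonzero eigenvalue the four tests are equivalent, which is consistent with the identical f and significance values reported across the four rows.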
based on the output of the multivariate significance tests (pillai’s trace, wilks’ lambda, hotelling’s trace, and roy’s largest root), as displayed in table 7, the researchers found that the significance values in the motivation*creativity row are 0.525; 0.598; 0.689 and 0.290 respectively. all of the significance values are above 0.05 and these values imply that h0 is not rejected. thereby, in the control group there is likewise no significant interaction between motivation and creativity.

table 7. results of multivariate test in the control group

control group: multivariate tests (c)
effect                   statistic            value     f             hypothesis df   error df   sig.
intercept                pillai's trace       .994      3.095e2 (a)   2.000           4.000      .000
                         wilks' lambda        .006      3.095e2 (a)   2.000           4.000      .000
                         hotelling's trace    154.742   3.095e2 (a)   2.000           4.000      .000
                         roy's largest root   154.742   3.095e2 (a)   2.000           4.000      .000
motivation               pillai's trace       .899      1.361         6.000           10.000     .317
                         wilks' lambda        .282      1.175 (a)     6.000           8.000      .404
                         hotelling's trace    1.898     .949          6.000           6.000      .525
                         roy's largest root   1.457     2.428 (b)     3.000           5.000      .181
creativity               pillai's trace       .657      .815          6.000           10.000     .582
                         wilks' lambda        .375      .844 (a)      6.000           8.000      .570
                         hotelling's trace    1.583     .791          6.000           6.000      .608
                         roy's largest root   1.527     2.545 (b)     3.000           5.000      .170
motivation * creativity  pillai's trace       .508      .851          4.000           10.000     .525
                         wilks' lambda        .538      .727 (a)      4.000           8.000      .598
                         hotelling's trace    .773      .580          4.000           6.000      .689
                         roy's largest root   .640      1.599 (b)     2.000           5.000      .290
a. exact statistic
b. the statistic is an upper bound on f that yields a lower bound on the significance level.
c. design: intercept + motivation + creativity + motivation * creativity

the testing of the second hypothesis was conducted in two stages. the first stage of the testing was to find whether there had been a significant difference between the experimental group and the control group in terms of metacognition level and learning outcome competence. based on the results displayed in table 8, the significance value of the group variable is 0.029.
the significance value of the group variable is smaller than 0.05; therefore, the finding implies that there are differences in terms of the metacognitive capability (y1) and the learning outcome competence (y2) between the experimental group and the control group.

table 8. the significance of the differences in terms of the metacognition and the learning achievements between the control group and the experimental group

tests of between-subjects effects (dependent variable: value_y)
source               type iii sum of squares   df   mean square   f         sig.
corrected model      2.188 (a)                 3    .729          3.407     .023
intercept            792.616                   1    792.616       3.703e3   .000
group                1.070                     1    1.070         4.998     .029
variable_y           1.091                     1    1.091         5.094     .028
group * variable_y   .040                      1    .040          .185      .669
error                12.416                    58   .214
total                809.926                   62
corrected total      14.603                    61
a. r squared = .150 (adjusted r squared = .106)

the second stage of the testing was to test which group had the bigger improvement in terms of metacognition and learning outcome competence. the results of the testing in the second stage are displayed in table 9. the mean of the metacognition variable for the control group is 3.604, while the mean of the metacognition variable for the experimental group is 3.816. the two figures show that the metacognition level of the experimental group is better than that of the control group. in other words, the program simulation learning by means of the trainer kit may increase the students’ metacognition. furthermore, the mean of learning outcome competence for the control group is 3.288 and the mean of learning outcome competence for the experimental group is 3.601. the two figures show that the learning outcome competence level of the experimental group is higher than that of the control group.
therefore, the results support the conclusion that the program simulation learning by means of the trainer kit may increase the students’ learning outcome competence.

table 9. descriptive statistics of the y variables from the control group and the experimental group

descriptive statistics (dependent variable: value_y)
group                variable_y      mean       std. deviation   n
control group        metacognition   3.603913   .4863139         15
                     competence      3.287933   .4832891         15
                     total           3.445923   .5027442         30
experimental group   metacognition   3.816175   .4981483         16
                     competence      3.601381   .3752768         16
                     total           3.708778   .4473527         32
total                metacognition   3.713468   .4960546         31
                     competence      3.449713   .4525148         31
                     total           3.581590   .4892865         62

conclusion and suggestions

conclusion

neither in the class that has implemented nor in the class that has not implemented the simulation program learning method by means of the mcs51 microcontroller trainer kit is there a significant interaction between motivation and creativity. this is marked by the significance values of ‘motivation*creativity’, which in both groups are above 0.05. according to pillai’s trace, wilks’ lambda, hotelling’s trace and roy’s largest root, the significance values for the group that has implemented the mcs51 microcontroller trainer kit are 0.638; 0.638; 0.638 and 0.638, while the significance values for the group that has not implemented the mcs51 microcontroller trainer kit are 0.525; 0.598; 0.689 and 0.290 (table 7). the simulation program learning method by implementing the mcs51 microcontroller trainer kit has shown better results for the experimental group than for the control group.
the evidence is found in the descriptive statistics table (table 9), in which the means of the metacognition and the learning outcome competence for the group that has implemented the mcs51 microcontroller trainer kit are higher than those of the group that has not implemented it. the mean of the metacognition for the group that has implemented the trainer kit is 3.816175, while the mean for the group that has not implemented the trainer kit is 3.603913. likewise, the mean of the learning outcome competence for the group that has implemented the trainer kit is 3.601381 and for the group that has not implemented the trainer kit is 3.287933.

suggestions

the results of the research show that the use of a better, practical and reliable trainer kit in performing the laboratory practice might improve the students’ creativity and, in turn, might also improve the students’ thinking awareness (metacognition) as well as the students’ learning outcome competence. on the other hand, the effect of motivation and creativity individually has only been apparent on the improvement of the thinking awareness. thereby, future researchers should find more ways of improving the learning process in order to improve the students’ motivation and creativity, since the improvement of the students’ motivation and creativity will lead to the improvement of the learning outcome competence. for the teachers of microcontroller systems, it is suggested that they implement the trainer kit as the learning media because the trainer kit might improve the thinking awareness (metacognition) through the motivation and the creativity that appear during the laboratory practicum.

research and evaluation in education, e-issn: 2460-6995
volume 1, number 2, december 2015 (pages 129-145)
available online at: http://journal.uny.ac.id/index.php/reid

developing a model of competency certification test for vocational high school students

1) pardjono; 2) sugiyono; 3) †aris budiyono
1)2) yogyakarta state university, indonesia; 3) semarang state university, indonesia
1) pardjono@uny.ac.id; 2) sugiyono_1953@yahoo.com; 3) aries_budiy@yahoo.com
† deceased 15 may 2015

abstract

this study aims to develop, produce, and investigate the appropriateness of a model of competency and expertise certification tests for vocational high school (vhs) students of the mechanical engineering expertise competency. to attain the objectives, the researchers conducted a research and development study consisting of 10 steps. the research product was validated by experts, vhs teachers, and lecturers at mechanical engineering education through focus group discussion (fgd), and the field tryout was conducted at warga surakarta vhs and bhineka karya simo vhs, boyolali, central java.
the results of the study are: (1) the study produces a model of competency and expertise certification tests based on the school production unit (cect_spu) for vhs students of the mechanical engineering expertise competency; (2) the cect_spu model satisfies the criteria for a good model with a mean score of 3.557; (3) the mean scores of the model implementation in the tryouts are 3.670 in the individual tryout and 3.730 in the small-group tryout; (4) the cect_spu model satisfies the criteria for an effective model with a mean score of 3.730; (5) the cect_spu model satisfies the criteria for an efficient model with a mean score of 3.780; (6) the cect_spu model satisfies the criteria for a practical model with a mean score of 3.700.

keywords: cect model, vhs students, mechanical engineering, spu

introduction

the implementation of vocational education, including sekolah menengah kejuruan (smk), or vocational high school (vhs), is now entering a crucial phase in which the readiness of vocational education graduates to enter the world’s labour force is at stake at the regional and global levels, both within the context of the china-asean free trade agreement (c-afta) and the asean free labour agreement (afla). in addition, the graduates also have to face the demands of the use of technology based on new findings for efficient production, which requires the availability of renewable competencies in accordance with the demands of the 21st century workforce competencies. coombs in gunawan (2006, p. 4) explains that vocational education is of good quality when the students who have undergone such education can be accepted in the world of work in the field of their expertise.
based on the aforementioned statement, it can be put forward that the vhs as a producer of graduates should be able to equip every individual student with the ability, skill, and expertise that are relevant to the demands and needs of the workforce. thus, vocational education cannot be separated from the existing workforce development. the development of a marketable workforce should be carried out by vocational education based on the needs of the market (demand driven) through an increase in the competence of graduates. statistical data of february 2011 from badan pusat statistik (bps), the bureau of statistics center (bureau of statistics center, 2011, p. 39), show that the labour force in the industrial sector has reached 13.71 million people (12.32%) of the total workforce of 111.28 million people. it hints that the needs of the job market in the industrial sector are still quite large. this condition provides an opportunity for vhs, especially those in the expertise area of technology and engineering with the mechanical engineering competence, to take a role in the fulfillment of labor in indonesia. in 2008, between 95% and 100% of public and private vhs students in central java graduated, while those absorbed into employment that matches their program ranged from 30% up to 50%, and the waiting period to get the first job was 1-6 months on average; the rest continued into college or took up other, unrecorded activities. the vhs graduates of the technology and engineering area in mechanical engineering who are required by the industry are operators of manual machine tools, operators of computer numerically controlled (cnc) machines, electric welders, argon welders, and metal casters, and, in addition, the required soft skills are perseverance, commitment, discipline, and the ability to work together (team work) (central java bureau of regional research and development, 2008, p. 21).
a breakthrough has been made in order to prepare the indonesian workforce to enter the world’s labour market at the local, national, regional, and global levels. it is the formulation of a policy for competency-based human resource (hr) development (department of national education, 2004b, p.1). the hr policy is embodied in the national framework through the establishment of the institute of profession certification-national board of professions certification (ipc-nbpc), or lembaga sertifikasi profesi-badan nasional sertifikasi profesi (lsp-bnsp). nbpc is a non-structural independent institute, directly responsible to the president, which was established based on the government regulation number 23 year 2004, state gazette of the republic of indonesia number 78 year 2004. the substance of the hr development policy within the framework of the second standardization system is the nationally agreed nbpc competence standard, which becomes the reference for training institutions, testing agencies, and other agencies that are tied to the development of hr. law number 20 year 2003 on national education system states that the particular purpose of vocational high schools is preparing students to enter the workforce. to be accepted for work, a person should be competent, and the competence is formally and legally proven by, for instance, a certificate of competency as a ‘skill passport’, which contains the competency-based skills that are owned by the holder as established through competency test and certification.
a certificate of competency is issued in recognition of someone’s competence to perform particular work through a process of competency test and certification carried out by accredited educational or certification agencies, as clearly expressed in article 61, paragraph 3 of law number 20 year 2003, which states that: ‘certificate of competence is given by organizers of education and training institutions to learners and citizens in recognition of competence to perform specific jobs after graduation from competency tests conducted by an accredited educational or unit certification agencies’. a study conducted by samsudi, et al. (2009, pp. 41-42) on the model/approach to competency test indicates that a total of 74.28% of respondents affirm that they still use the project work approach with internal and external verification; at the same time, 36.52% of respondents also propose the alternative of using the approaches/models applied by the ipc-nbpc. in general, the cost of implementing the competency test is lower when the organizing school collaborates with business and industry (b&i), while the other models are more expensive. furthermore, the three models of cect differ not only in their characteristics but also in their advantages and disadvantages, which are reviewed in terms of the completion of the cect model in vhs, the assessors of the test, the infrastructure in the competency assessment center (cac), the testing material, and the scoring system; the existing cect model therefore needs to be developed. the development is aimed at perfecting one cect model that is currently in use and can be implemented by all vhss, namely the final task project (ftp) model, by improving on its disadvantages.
therefore, the developed cect of vhs students is expected to be able to (1) measure the competence of students according to the standards established by the workforce, in order to avoid mismatched and underqualified graduates, (2) contain the skills needed in the future (future skills), (3) be implemented by vhs in the competence of mechanical engineering in general, and (4) be accessed by all vhs students.

based on the previous description, the research problems are outlined as follows: (1) how can a model of competence and certification for vhs students on mechanical engineering expertise competence be developed? (2) what is the effective, efficient, and practical model of competence and certification for vhs students on mechanical engineering expertise competence? (3) how appropriate is the developed model of competence and certification for vhs students on mechanical engineering competence? in addition, the objectives of this study are: (1) developing a model of competence and certification for vhs students on mechanical engineering expertise competence based on the existing models; (2) producing an effective, efficient, and practical model of competence and certification for vhs students on mechanical engineering expertise competence; and (3) finding out the feasibility of the developed model of competence and certification for vhs students on mechanical engineering competence.

research method

type of research

the target of this study was discovering an effective, efficient, and practical model of competence and certification for vhs students on mechanical engineering expertise competence. therefore, the research method used in this study was the research and development method (borg and gall, 1989, p.782).
subject of research

the subjects of this research were the students of the mechanical engineering program of vocational high schools (vhs), or sekolah menengah kejuruan (smk), in central java, indonesia, namely: smkn 2 wonogiri, smkn 2 purwokerto, smkn 1 adiwerna tegal, smkn 1 semarang, smkn 2 surakarta, smk st. mikael surakarta, smk warga surakarta, smk ganeshatama boyolali, and smk bk2 simo boyolali.

procedure of development

the development procedures conducted in this study included: (1) preliminary studies and collection of information (research and information collection); (2) planning; (3) initial product development; (4) initial field trials (preliminary field testing); (5) main product revision; (6) main product field trials (main field testing); (7) operational product revision; (8) operational product field trials (operational field testing); (9) final product revision; (10) dissemination and implementation.

technique of data analysis

the techniques of data collection used were observation, interview, and documentation. the instrument used to perform the initial study was the interview guideline, while questionnaires were employed as the instruments for the validation of the model design, the initial product trials, the main product field trials, and the operational product field trials. quantitative and qualitative descriptive analyses were employed in analyzing the data. at the preliminary study stage, the descriptive data were analyzed interactively (interactive model of analysis) with reference to five components: data reduction; triangulation; data display; verification; and conclusion drawing, which were carried out simultaneously and interacted with one another starting from the process of data collection. the collected data were then analyzed descriptive-qualitatively.
the analysis was performed on the research instruments, the models’ validity, guides, modules, completion, effectivity, efficiency, and practicality. the analysis of the assessment instrument of the model (validity and reliability of the research instrument) was performed by experts. the validity limit value is 0.8 (guilford, 1936, p.279) and the limit value of the coefficient of reliability is 0.7 (nunnaly, 1981, p.245); the instruments used met both limits and were therefore considered valid and reliable. the effectivity, efficiency, and practicality of the model in the expanded trials were determined through criteria which referred to the classification of scores into levels of evaluative meaning according to azwar (2003, p.157). the score classification refers to five intervals: $x > m + 1.5s$; $m + 0.5s < x \le m + 1.5s$; $m - 0.5s < x \le m + 0.5s$; $m - 1.5s < x \le m - 0.5s$; and $x \le m - 1.5s$, with m = average rating, and s = standard deviation (azwar, 2003, p.163).

findings and discussion

preliminary development result

the urgency and characteristics of cect

the cect model was developed based on the data found in the field through the preliminary study. a preliminary study was needed to formulate the models of cect which already exist. before examining the existing models, a survey was held to identify the urgency of carrying out cect for vhs, especially in the mechanical engineering expertise competence, and the characteristics of the existing cect expected by vhs. nine vhss, or 100%, opined that the urgency of holding cect was as a learning evaluation of a productive program, as a requirement of graduation, and as a measurement of the competence achieved by students. meanwhile, regarding the statement that cect for vhs students on mechanical engineering expertise competence is implemented to meet the requirements of working in the world of business and industry (b&i), five vhss (55.56%) agree, two vhss (22.22%) strongly agree, and two vhss (22.22%) disagree.
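the score classification used for effectivity, efficiency, and practicality can be sketched as a five-band rule around the mean m and the standard deviation s. the band labels, and the values m = 2.5 and s = 0.5 (the midpoint and one sixth of the range of a hypothetical 1-4 rating scale), are assumptions made for illustration and are not taken from the study.

```python
def classify(x, m, s):
    """place score x into one of five bands defined by mean m and standard deviation s.

    the band labels are illustrative assumptions; the study reports only the intervals.
    """
    if x > m + 1.5 * s:
        return "very high"
    if x > m + 0.5 * s:
        return "high"
    if x > m - 0.5 * s:
        return "moderate"
    if x > m - 1.5 * s:
        return "low"
    return "very low"


if __name__ == "__main__":
    # hypothetical scale parameters: 1-4 rating scale, m = 2.5, s = 0.5
    m, s = 2.5, 0.5
    for score in (3.557, 3.0, 2.5, 2.0, 1.5):
        print(f"score {score:.3f} -> {classify(score, m, s)}")
```

under these assumed parameters, a mean score such as the 3.557 reported for the model would fall in the top band; with other values of m and s the band boundaries shift accordingly.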
the characteristics of cect for vhs students on mechanical engineering expertise competence are described as follows: (1) it is conducted by the b&i, the professions association, or the institute of profession certification of metal machinery indonesia (ipc-mmi), involving the disputing parties of vhs; (2) the involvement of b&i in the vhs is required in the study plan, mainly in determining the standard of competency and the curriculum of learning programs, in the productive implementation, and in the cect for vhs students; (3) the graduation standards, testing material, and assessment criteria of cect set by the school, the national education standards board, and the graduate users (b&i) are needed; (4) the place for the competency test, or competency assessment center (cac), is at school or in b&i; (5) the assessor of competency and expertise certification tests (cect) is a productive-program teacher who is certified as an assessor, or an assessor from the b&i; (6) the requirements for students to take a cect are not determined by the b&i; (7) a cect is said to be successful if the students pass and get a certificate; (8) a certificate of competency serves not only as a condition for graduation but also as a condition for work acceptance in b&i.

existing model description

in relation to the components of the management of cect, which include planning, organizing, implementing, evaluating and reporting, there are three models of cect implementation found in the nine vhss, especially in the expertise competence of mechanical engineering. the first model (01) is a model of cect which follows the models released by the ministry of education and culture. the regulations used are the standard operational procedure for national examination (sop_ne) published by the national education standardization board, and the technical directive of expertise competency test (td_ect) published continuously by the fostering directorate of vocational high schools (fdvhs).
this model is implemented by seven of the nine vhss surveyed (78%), which were state vocational high school (svhs), or sekolah menengah kejuruan negeri (smkn), 2 wonogiri, smkn 2 purwokerto, smkn 1 adiwerna tegal, smkn 1 semarang, smkn 2 surakarta, smk ganeshatama boyolali, and smk bk2 simo boyolali. the second model (02) is a model of cect that follows the models created by the ministry of education and culture, coupled with the implementation of a school exam (test of competence at the level of school). this model uses the standard of vhs products in the school production unit (spu), starting from the type of work and the materials to the quality of workmanship. a school exam that uses the existing spu product standards allows students to do various works according to the work made by the spu, while institutionally the spu replaces the role of b&i because the spu manufactures products as b&i does. the vhs that implements this model is smk warga surakarta. the third model (03) is the model of cect that follows the model developed by the institute of profession certification (ipc). the involvement of b&i in this case is through the production unit (pu) of st. mikael academy of industrial mechanical engineering, surakarta, central java (akademi teknik mesin industri st. mikael surakarta), covering: (a) testing material, assessment criteria, and graduation standards as required by the pu, (b) materials appropriate to the pu requirements, (c) assessors from among the pu trainers, and (d) results/work pieces used by the pu. the production unit as a representation of the b&i plays a very important role. the vhs that implements this model is smk st. mikael surakarta. smk st. mikael is also appointed by the national board for professions certification (nbpc) and serves the implementation of cect for other vhss that require it.

result of the development

design of spu-based cect model

the design of the cect model developed was based on the existing model, the conceptual model, and the research framework.
the three bases of the development can be outlined as follows.

conceptual model. the work based learning (wbl) approach is a series of learning as a whole through competency-based education and training, comprising: planning (curriculum) in the form of a work based curriculum, also known as competency based curriculum; work-based learning (competency based training); and evaluation in the form of a competence-based assessment using the work (work based assessment). wbl is a class of college/school programs in which the institution works together with a work organization to create new learning opportunities and experience in the place of work (boud, 2001, p.6). the engagement of b&i is required from the very start of the planning step to the implementation of the competence test. given that the school is the supplier of the human resources needed by b&i, the graduates need to fulfill the necessary competencies. this is needed in order to avoid mismatched and underqualified graduates. regarding the particular implementation of the existing cect in vhs, cect should be implemented properly so that the goal of cect can be achieved. in achieving its goal, cect models which are effective, efficient, and practical are required. the management of cect includes planning, organizing, implementing, evaluating, and reporting, as described by some scholars such as george terry and luther gulick (in handoko, 1999, p. 9) about the management functions.
the components of each stage of the management are as follows: (1) planning, which comprises the engagement of the user party, the testing material, the assessment criteria and graduation standards, the infrastructure, and the requirements for participants (students) who take the cect; (2) organizing, which is the cooperation mechanism with industry in the competency and expertise certification test (cect); (3) execution, which consists of the place of the competence test, the assessors, and the duration of the cect; (4) evaluation, which consists of the competency certificate and the evaluation of the cect program; and (5) reporting, in the form of the graduation assessment and a report on the cect program. as a concept, vocational education based on wbl always involves b&i, from planning and learning through evaluation to the test of competence. planning in vhs involves b&i through the synchronization of competences and curriculum. next, the learning process involves b&i in the form of field-work practice, industrial-work practice, dual system education, internships, and more. at the end of the educational process, a competence test is carried out. the competence test involves b&i, the profession association, and the institute of profession certification (ipc). finally, the management of cect is a joint effort involving b&i, the profession association, and the ipc, using the principles of management explained before.

existing model. the research data were collected by questionnaire from nine vhss in central java with mechanical engineering expertise competence and cover the cect management components of planning, organizing, implementing, evaluating, and reporting. the data show that cect management is implemented in three models: the first model (01), the second model (02), and the third model (03).
the first model (01) is the most widely used model, in which the whole series of competency and certification test management, from planning, organizing, and implementing to reporting and evaluation, follows the rules issued by the ministry of education and culture. the regulations used are the national examination (ne) procedure (sop_ne), published by the national education standardization bureau, and the technical directive of expertise competency test (td_ect), published by the fdvhs. the vocational practice exam at school level has students work on a practical task, as in the practice component of the end-of-semester exam. the second model (02) is a model in which the whole series of competency and certification test management, from planning, organizing, and implementing to reporting and evaluation, is organized by the ministry of education and culture, in this case the nbpc/fdvhs, which published sop_ne and td_ect, but the school exam is also planned to take advantage of the school production unit (spu) that already exists in the vhs. an spu that runs well replaces b&i in planning, especially for the competence test that the school conducts outside the national exam. the school exam requires that: (a) the test material accords with the needs of the spu, (b) the materials used match the spu requirements, (c) the assessors are spu trainers, and (d) the results/work pieces are used by the spu. the distinctive feature of this model is the vocational school practice exam, which is carried out by empowering the spu that already exists at the school. a job from the production unit serves as the material for the school exam, so that the product made during the test is used as an spu product.
the third model (03) is the cect model that follows the model developed by the institute of profession certification (ipc); it is a cect model for schools that have a teaching factory as the representation of b&i. b&i is represented here by the production unit (pu) of st. mikael academy of industrial mechanical engineering, surakarta, which is involved from the planning stage to the graduation decision (competent or incompetent). the features of this model are: (a) the testing material, assessment criteria, and graduation standards are as required by the pu, (b) the materials used match pu requirements, (c) the assessors are pu instructors, and (d) the results/work pieces are used by the pu. the pu here is a factory located in an institution (st. mikael academy of industrial mechanical engineering, surakarta, central java). the testing material, assessment criteria, and graduation standards are issued by the national board of professions certification (nbpc). a vhs that implements this cect model holds the cect at the industry/pu or uses its own facilities and infrastructure, to standards set by the nbpc. based on the three existing models, it can be inferred that one of the weaknesses of the competence test lies in the involvement of b&i. the involvement of b&i in competence and skill certification should cover: (1) determining the criteria and graduation standards, (2) creating the test questions/material, (3) verifying the facilities and infrastructure (tools and machinery) in the cac, and (4) providing the assessors.

framework of cect model development

the differences among the three models are analyzed in terms of: the system, the implementation process, the resulting competencies, the participants included, the financing, and the recognition by b&i.
meanwhile, the advantages and drawbacks of the models are analyzed in terms of: the implementation of the cect model in vhs, the assessors, the cac infrastructure, the testing material, and the scoring system. next, to formulate the development model, the existing models are grouped into the superior model and the models that can be implemented by all schools (see figure 1). the two groups have the following parameters: (1) the superior model: the test material is based on b&i competences; the standards/criteria are determined by b&i or the profession association/ipc; the assessors come from b&i or are productive teachers certified as assessors; the resulting competence accords with the standards/competencies defined by b&i; and b&i recognition is shown by the issuing of a certificate of competency; and (2) the models that can be implemented: the facilities and infrastructure of the school are used, or those of other schools in cooperation; all students can take the cect; the implementation follows the schedule set by the ministry of education and culture or the technical directive of expertise competency test issued by the fdvhs; the financing is affordable; and the assessors come from b&i and the productive teachers of the vhs. the superior cect model is an effective model because it measures the students' competence against the b&i standard and leads to a certificate of competence issued by b&i. the models that can be implemented are efficient and practical in terms of implementation time, cost, and ease of implementing the cect.
[figure 1 content, boxes in reading order.]

superior model: (1) the testing material is based on b&i competences; (2) the assessment standards/criteria are determined by b&i or the profession association; (3) the assessors come from b&i or are productive teachers certified as assessors; (4) the resulting competence accords with the standard competence set by b&i; (5) b&i recognition is shown by issuing the certificate of competence.

model that can be implemented: (1) it uses the cac infrastructure available in the school or in collaboration with other schools; (2) all students can take the cect; (3) the organizing process follows the schedule set by the ministry of education and culture or the competence test directive from the fdvhs; (4) the financing is reasonable; (5) the assessors come from b&i and the productive teachers of the vhs.

production unit (pu) as the representation of b&i, because: (1) vhss commonly have difficulty collaborating with b&i on lesson planning, learning activities, and learning evaluation (the competency test), and this needs to be overcome; (2) vhss commonly already have a pu, as instructed by the fdvhs; (3) the school competence test in existing model 02 is already implemented properly; (4) the pu develops into a teaching factory in a school, as indicated in the third model.

superior model analysis: (a) the principles of vocational education hold that it will be effective if (1) the environment in which students are trained represents the real situation they will face in the future, (2) the training tasks are done in the same way and with the same tools and machines as in the workplace, and (3) the training builds the way of thinking and working needed in the workplace (proser, 1925); (b) learning assessment in vhs should be competence-based (competence-based assessment) and is done through the appropriate expertise competence assessment.

result: cect model for vhs students of mechanical engineering expertise competency based on the pu, combining the superior model and the model that can be implemented.

figure 1. finding flow of cect_spu model

the formulation of model development draft

the draft of the cect model (design) for vhs students of mechanical engineering expertise competence based on the spu, hereinafter referred to as ‘cect_spu’, was designed based on the conceptual model, the existing model, and the direction of development contained in the development framework. the draft was further analyzed using two sets of principles, those of vocational education and those of competency-based assessment, to find the advantages and disadvantages of the existing models. as a result, the superior and applicable model could be identified. the draft model is described as follows: (1) a vocational high school (vhs) collaborates with its school production unit (spu) to implement spu-based cect with regard to the regulations of the ministry of education and culture, including the sop_ne and standards from the nbpc, the td_ect from the fdvhs, and the financing sources of the cect; (2) the vhs and the spu jointly plan the cect; (3) the vhs, in collaboration with the spu, organizes the cect; (4) the vhs, in collaboration with the spu, carries out the cect; (5) the vhs evaluates the cect; and (6) the vhs reports on the implementation of the cect. for graduates of a cect that uses a school production unit (spu) to be recognized by b&i, the spu must be verified by b&i in terms of its facilities and infrastructure (cac), including primary equipment, supporting equipment, and room/space, and in terms of assessors certified by the profession association. the sequential process of the cect is shown in the chart in figure 2.

figure 2.
the cect_spu model draft

[figure 2 content: the ministry of education and culture, through bsnp, sets up the sop_ne and standards; the fdvhs publishes the td_ect and gives a financial subsidy; the vocational high school (vhs), along with the school production unit (spu), carries out planning, organizing, implementation, evaluation, and reporting; students receive material and training enrichment through a module; the outputs are the competence assessment, the expertise certificate, the evaluation on organizing cect, and the report on organizing cect.]

initial product development

to discuss the draft of the cect_spu model collaboratively, a focus group discussion (fgd) was held with 26 vhs principals in central java through the vhs principals' working forum. the initial product design of ‘cect_spu’ was developed component by component, based on theory and on the input and suggestions given during the fgd. the development of each component of cect_spu is explained as follows: (a) the spu-based cect model for vhs students of mechanical engineering expertise competence is able to: (1) measure the students’ competence against the standard, (2) cover the skills needed, (3) be implemented in vhss of mechanical engineering expertise competence in general, and (4) be accessed by all vhs students. planning of the spu-based cect for vhs students of mechanical engineering expertise competency covers several aspects, namely personnel (man), financing (money), material, methods, market, equipment (machine), and reporting (time).
planning the cect involves two institutions, the vhs and the spu, and takes as input the regulations of the ministry of education and culture, the sop_ne issued by the nbpc, the financing, and the td_ect issued by the fdvhs; (b) organizing is the working mechanism between the cect organizers in the vhs and the spu; it takes up the results of planning and prepares the implementation of the spu-based cect, involving the various parties named in the committee. its components include the cect committee, the test material, the assessment criteria, the graduation standards, the infrastructure standards (cac), the requirements for participants, and the cect schedule. coordination is also undertaken to prepare the site, the practice materials, and the equipment to be used; (c) implementation is the stage at which the entire plan that has been drawn up is carried out. the activity in this stage is the students' competence test, which requires the test place, the test questions, the practice materials, the tools (machines used), the assessors, the allotted time, and the assessment guidelines. the implementation process of the spu-based cect can be observed in the flowchart presented in figure 3; (d) evaluation covers two aspects: first, evaluation of the attainment of competencies, which determines whether each test participant is ‘competent’ or ‘incompetent’ using the predetermined assessment standard and the graduation guidelines made by the assessors; and second, evaluation of the holding of the cect, covering its planning, organizing, and implementation, such as the adequacy of materials, time, and assessors. this evaluation is expected to provide feedback for the cect implementation in the following year.
the evaluation of the organized test is made by the organizing committee; and (e) reporting is the last activity in organizing the spu-based cect for vhs students of mechanical engineering expertise competence. reporting concerns two things. the first is reporting on the students’ achievement of competencies, in the form of a certificate that accords with the results of the competence evaluation against the established graduation standard and is given to the test participants (students). certificates are issued by the spu and acknowledged by the head of the vhs. the second is reporting on the implementation of the cect, which provides feedback for the planning, organizing, and implementation stages of the cect. this report is given to the related parties (the school, the pu, and the education service). validation (assessment) of the early product (the draft cect_spu model) was performed by experts in mechanical engineering education, mechanical engineering learning, and education management, by cect implementers in schools, and by b&i. the aspects assessed by the validators were: (a) the basic model development; (b) the school production unit-based (spu-based) cect model for vhs students of mechanical engineering expertise competence; (c) the spu-based cect planning component; (d) the spu-based cect planning procedure; (e) the spu-based cect organizing component; (f) the spu-based cect implementation components; (g) the spu-based cect implementation procedure; (h) the evaluation component; and (i) the reporting component. the assessment data for the model draft have a standard deviation of 0.414 and a mean of 3.557 on an evaluation scale of 1 to 4. to classify the assessment, the scores were grouped into criteria: the model is not good if m ≤ 2.936; less good if 2.936 < m ≤ 3.350; good if 3.350 < m ≤ 3.764; and very good if 3.764 < m ≤ 4.178.
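the category boundaries quoted for the expert validation are consistent with cutting the scale at the mean minus and plus 0.5 and 1.5 standard deviations (3.557 ± 0.5 × 0.414 and 3.557 ± 1.5 × 0.414). the text does not state this rule explicitly, so the python sketch below treats it as an assumption; the function names are illustrative only.

```python
def category_bounds(mean, sd):
    """Cut points at mean - 1.5 sd, mean - 0.5 sd, mean + 0.5 sd, mean + 1.5 sd.

    Assumption: this rule is inferred from the reported boundaries;
    it is not stated in the text.
    """
    return [round(mean + k * sd, 3) for k in (-1.5, -0.5, 0.5, 1.5)]


def classify(m, mean, sd):
    """Map a mean score m to the four categories used in the text."""
    not_good_max, less_good_max, good_max, _very_good_max = category_bounds(mean, sd)
    if m <= not_good_max:
        return "not good"
    if m <= less_good_max:
        return "less good"
    if m <= good_max:
        return "good"
    # The text caps this band at mean + 1.5 sd, which here lies above
    # the top of the 1-to-4 scale, so every remaining score lands here.
    return "very good"


# Reported values for the expert validation of the model draft.
bounds = category_bounds(3.557, 0.414)
print(bounds)
print(classify(3.557, 3.557, 0.414))
```

with the reported mean and standard deviation, the computed cut points reproduce the four boundaries given in the text, and the expert mean of 3.557 falls in the ‘good’ band.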
with reference to the classification criteria, the draft model of cect_spu belongs to the ‘good’ category.

the assessment of model guidance and the strengthening material and training modules of cect_spu

the cect_spu guidance model was assessed by 15 principals and productive teachers of vhss with mechanical engineering expertise competence. the assessment data have a standard deviation of 0.480 and a mean of 3.370. based on the set of criteria, the cect_spu guidance model belongs to the ‘good’ category. the cect_spu material reinforcement and exercise module was assessed by 25 professors of mechanical engineering education and productive vhs teachers of mechanical engineering expertise competence. the assessment data have a standard deviation of 0.478 and a mean of 3.710. to classify the assessment of the material reinforcement and exercise module, the scores were grouped into several criteria. based on the criteria, the cect_spu material reinforcement and exercise module belongs to the ‘good’ category.

testing result of cect_spu model

in this study, five tests were associated with formulating the cect_spu model: first, the validity of cect_spu with respect to the substance of cect management; second, the validity of cect_spu with respect to its implementation; third, the effectiveness of cect_spu; fourth, the efficiency of cect_spu; and fifth, the practicality of cect_spu. the results of the five tests are presented in table 1.

table 1. testing results of cect_spu

no. | data analysis | result | limit value | conclusion
1 | product validation (cect_spu) by experts | m = 3.557 | 3.350 | the model is good
2 | evaluation of organizing model | m = 3.670 (p, individual trial); m = 3.730 (k, small group trial) | 3.422; 3.422 | the model can be implemented
3 | evaluation of effectivity | m = 3.730 | 3.506 | the model is effective
4 | evaluation of efficiency | m = 3.780 | 3.569 | the model is efficient
5 | evaluation of practicality | m = 3.700 | 3.468 | the model is practical

the test results indicate that, according to the experts, the cect_spu model is good. seen from the cect_spu components of planning, organizing, implementing, evaluating, and reporting, each component is reciprocally linked to the others, and the indicators within each component are linked both to one another and to the indicators in the other components. the cect for vhs students of mechanical engineering expertise competency based on the spu covers several aspects, including personnel, funding, material, method, market, machines, and time. in planning, the cect involves two institutions, the vocational school and the spu, with regard to input in the form of the regulations of the ministry of education and culture. in the field trials, cect_spu was found to be implementable in the sample vhss. in the subsequent extended test, the cect_spu model was found to be effective, efficient, and practical. thus, the cect_spu model is a good model that can be implemented effectively, efficiently, and practically.

product revisions

revision of cect_spu model

the final cect_spu model went through four revisions, one at each stage of the method used to develop the model. the first revision was of the initial product and resulted from the fgd on the planning model and the expert judgement in the internal test.
the second revision was of the main product and was made after the initial field test (individual test). the third revision was of the operational product and was made after the main field test (small group test). the fourth revision was of the final product and was made after the operational field test (extended test).

the guidelines of revision and cect_spu module

the guidelines and modules of cect_spu were revised and refined based on input from the evaluators of the guide and from the trials. based on the revisions made, the final model was produced as shown in figure 3.

figure 3. final cect_spu model

[figure 3 content. the spu-based cect model for vhs students of mechanical engineering expertise competence is able to: (1) measure the students’ competence according to the standard, (2) cover the skills needed, (3) be implemented in vhss of mechanical engineering expertise competence in general, and (4) be accessed by all vhs students. inputs: the ministry of education and culture, through bsnp, sets up the sop_ne and standards; the fdvhs publishes the td_ect and gives a financial subsidy. organizer: the vocational high school (vhs) along with the school production unit (spu). planning: construct the committee (man) and the cect assessors; set the cect method; set the graduate market; construct the material for the cect (test questions, assessment guideline, graduation criteria, and testing material); set the cect equipment; plan the finances (money); plan the reporting system (minute). organizing: coordinate the planning results (assessors, cect place, testing resources, testing material and equipment, practice material, tools). implementation: implement the cect based on (a) the testing material, equipment, material, and time allotment that have been set, (b) the assessors that have been set, and (c) the assessment guideline. competence assessment: competent or incompetent; students receive material and training enrichment through a module. evaluation: expertise certificate and evaluation on organizing cect. reporting: report on organizing cect.]

the analysis of final product

the technical specification, characteristics, and the advantages of the product

product specifications. the product’s name is the model of competence test and certification for vhs students of mechanical engineering expertise competency. the content of the competency and expertise certification test model based on the school production unit (cect_spu) consists of the cect management components for vhs students, namely planning, organizing, implementing, evaluating, and reporting. the model is used to manage vhs students’ competence and certification test, especially in mechanical engineering competence. the model tools include (1) the guidelines for implementing cect_spu and (2) the module for enrichment material and training.

product characteristics. the cect_spu model was developed based on the concept of work based learning (wbl) in vocational education, using vocational education principles (proser, 1925) and the principles of competency based assessment, so that each component developed in cect_spu always involves b&i, represented here by the spu.
the cect_spu model works with the spu as the representation of b&i, so it can overcome the difficulties vhss face in collaborating with b&i, especially in implementing the cect. it increases the relevance of graduates because the graduation standards, test material, and assessment criteria are set by the vhs together with the spu, so the students are tested on the competencies currently required by b&i. the cac may use the existing infrastructure of the vhs concerned or, in collaboration, that of other similar vhss. the assessors who examine the students come from industry or are productive teachers who have been certified as assessors by industry through professional associations. all students can take the cect as a national exam requirement. a certificate of competence is given after a student is declared competent. those declared incompetent are given the opportunity to repeat the test or take a remedial test.

product advantages. the cect model involves b&i, in this case the spu, from planning, organizing, and implementing through evaluating to reporting in the form of a certificate of competence. the cect_spu model is able to describe vhs students’ skill competencies that are relevant to the standard competencies needed by business/industry (b&i). the inputs required to implement the cect, such as the test questions, assessment criteria, graduation standards, and infrastructure standards, are those required by b&i, so the aspects tested meet the requirements of b&i. the model can be implemented by all vhss that have an spu or that collaborate with the spu of another school. the assessors in the cect_spu model are recognized by b&i. the test can be taken by all students who are eligible to take the national exam.
the cect_spu model gives participants who have not passed the test the opportunity to take a remedial test and provides them with the module for enrichment material and training.

cect_spu model

the skill passport records the competency-based skills of its holder. it is obtained through competence tests and certification. a certificate of competency is issued in recognition of someone who is competent to perform a particular kind of work, following a process of competence testing and certification carried out by an accredited educational institution or certification agency. the regulation is set out in article 61, paragraph 3 of law number 20 year 2003, which states: ‘the certificate of competence is given by education providers and training institutions to learners and members of the community in recognition of competence to perform a specific work after they have passed a competency test conducted by an accredited education unit or certification agency’. the graduation standards imposed on the cect by the school are created by the nbpc (under the ministry of national education), whereas the world of business/industry (b&i) has the national working competency standard of indonesia developed by the department of labor. therefore, the two standards need to be aligned to avoid a ‘mismatch’ between the expertise competencies produced by the education world and the skill competencies required by the workforce.
the distinction between the standards also relates to: (a) the type of job, which is real in industry, while the school provides only simulated practice, (b) the quality of work results, which in industry is judged as accepted or rejected, while in school it is expressed as a score (0-100), and (c) the risk of failure, which is financial in the real industrial world, while in school there is still much tolerance for redoing the work (sidi, 2000, p.3). students are declared to have passed, or to be competent, in the competence and skill certification if they meet the graduation criterion that has been set. the graduation criterion is the minimum requirement for passing (department of education and culture, 2014). the graduation criterion is set at a score of 7.0. several cect models have been set up and developed to date, including (mone, 2004c, p.5): (a) certification carried out by the school along with the b&i that is its partner institution; (b) competency certification organized by a particular industry that has nationwide recognition, for example, in the field of machinery, st. mikael academy of industrial mechanical engineering, surakarta; (c) certification carried out by the ipc. because the model developed focuses on management, and based on the validation of the cect indicators and components in the survey, the main components of cect are planning, organizing, implementing, evaluating, and reporting. this accords with luther gulick (in handoko, 1999, p.9), who regards management as a branch of science because it is a field of knowledge that systematically seeks to understand why and how people work collaboratively to achieve goals and to make the system work better and benefit humans.
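the graduation rule stated in this passage (a participant is competent when the score meets the criterion of 7.0) and the remedial path described for the cect_spu model can be sketched in python. the score scale that goes with the 7.0 threshold is not stated here, so the sketch simply assumes that scores and threshold share one scale; the class, function, and field names are hypothetical.

```python
from dataclasses import dataclass

# Graduation criterion quoted in the text; the scale is an assumption
# (scores are taken to be on the same scale as the threshold).
GRADUATION_CRITERION = 7.0


@dataclass
class Decision:
    competent: bool
    next_step: str


def decide(score: float, criterion: float = GRADUATION_CRITERION) -> Decision:
    """Apply the minimum-requirement rule: a score at or above the criterion passes."""
    if score >= criterion:
        return Decision(True, "issue certificate of competence")
    # Participants who do not pass get enrichment material and a remedial test.
    return Decision(False, "enrichment module, then remedial test")


print(decide(7.0))
print(decide(6.5))
```

because the criterion is described as a minimum requirement, the comparison is inclusive: a score exactly at the criterion counts as competent.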
the cect management proposed is the process of planning, organizing, conducting, and controlling the cect in all its aspects, so that the goals of the cect, namely to test and certify students' competence, can be achieved effectively and efficiently. the development concept of the cect model was based on three things: the conceptual model, the existing model, and the development direction.

conceptual model

the conceptual model used as the reference is built on the concept of work-based learning (wbl). wbl treats learning as a whole through competency-based education and training, including: (1) planning (a work based curriculum), known as a competency-based curriculum, (2) competency based training, and (3) competency evaluation using work based assessment. the implementation of wbl in vhs cannot be separated from the role of b&i, as explained by proser (1925): (1) vocational education will be efficient if the environment where the students are trained is a replica of the environment where they will later work, (2) effective vocational education can only be given when the training tasks are carried out in the same manner and with the same tools and machines as at work, and (3) vocational education will be effective if it trains someone in the habits of thinking and working required in the work.

existing model

three existing models were found in the implementation of cect for vhs students of mechanical engineering expertise competence. they are named the first model (01), the second model (02), and the third model (03). after examining the advantages and disadvantages of each model, it can be deduced that models which are effective and efficient can be implemented or practiced.
by using the analysis based on the principles of vocational education (proser, 1925) and competency-based assessment (directorate of vhs, 2013), it can be concluded that there are two patterns of cect model: the efficient and effective model, and the model that can be implemented.

the development direction of cect

the model development was directed at producing a cect model that is effective, efficient, and practical. this is compatible with the purpose of the cect and is supported by the conceptual model of work-based learning and by the existing models, which were analyzed based on the principles of vocational education and work-based assessment. the development focused on the components and indicators of cect management for vhs students of mechanical engineering expertise competence, including planning, organizing, implementing, evaluating, and reporting.

conclusions and suggestions

conclusions

based on the development study using the r&d approach and the study of the products described above, it can be concluded that: (1) the development of a model of competence and expertise certification test (cect) for students of mechanical engineering expertise competence is based on three things: (a) the concept of work based learning (wbl), drawing on vocational education principles (proser, 1925) and the principles of competency-based assessment.
each component developed in the model always involves b&i, which in this case is represented by the school production unit (spu); (b) the existing cect models that were found, namely the cect model implemented based on the sop_ne, published by the national standardization education agency, and the technical guidelines for ect, published by fdvhs; the cect model based on the sop_ne and the technical guidelines for ect together with an spu-based test in each school; and the cect model developed by the institute of profession certification (ipc); (c) the development direction of the cect model, which is to be excellent (effective and efficient) and implementable (practical) by vhs, in particular for mechanical engineering competency; (2) the resulting model of competence test and expertise certification for vhs students of mechanical engineering competency based on the school production unit (cect_spu) consists of management components, which include planning, organizing, implementing, evaluating, and reporting. it is used as a model to manage competency testing and expertise certification for vhs students, especially those of mechanical engineering expertise competence. the model is equipped with a manual on cect_spu implementation and modules for material enrichment and training; (3) the cect_spu model meets the eligibility requirements to be implemented based on the following assessments: (a) the validation results indicate that the cect_spu model is a good model. the expert judgment yielded m = 3.557; under the criterion that a model is good if 3.350 < m ≤ 3.764, the cect_spu model belongs to the ‘good’ category. the cect_spu components, which consist of planning, organizing, implementing, evaluating, and reporting, are correlated with (reciprocally linked to) one another.
the management system of cect_spu includes several important aspects, namely personnel (man), funding (money), material, methods, market, machine, and time; (b) the assessment results indicate that the cect_spu model is a model that can be implemented. this is in accordance with the results of the model implementation assessment in the trial stage. the average value is 3.670 in the individual trial and 3.730 in the small group trial; using the criteria that the model can be implemented if 3.422 < m ≤ 3.907 for the individual trial and 3.954 < m ≤ 3.499 for the small group trial, the cect_spu model can be implemented; (c) the assessment results indicate that the cect_spu model is effective. the assessment yielded m = 3.730, which is above 3.506, so in the operational field test (the expanded test) the cect_spu model belongs to the ‘effective’ category. the indicators of the model's effectiveness are: cooperation between vhs and the business/industry world (b&i), which in this case is represented by the spu; the competencies required at the spu match those needed in b&i; any vhs which has an spu can apply the cect_spu model; the test materials depend on the work provided in the spu; the cect_spu can measure vhs student competencies; and there are material enrichment, training, and remedial tests for students who are not yet competent or who failed; (d) the assessment results indicate that the cect_spu model is efficient. the assessment yielded m = 3.780, which is above 3.569, so in the operational field test (the extended test) the cect_spu model belongs to the ‘efficient’ category. the cect_spu model is efficient because time is saved in the management of cect, since the spu is located in the school environment.
it can be implemented in accordance with the school's agenda. the costs are lower because the school production unit (spu) already exists in the respective schools, so students can take the test at the spu in their own school environment, and no great effort is required to establish a partnership with b&i because b&i is already represented by the spu. the assessment results also show that the cect_spu model is practical. the assessment yielded mpr = 3.700, which is above 3.468, so in the operational field test (the extended test) the cect_spu model belongs to the ‘practical’ category. the indicators of the practicality of the cect_spu model are that it can be implemented by vhs in their own school, it does not require a complicated bureaucracy because the spu is owned by the school, it is easy to communicate with b&i, which is represented by the spu, and the assessors can be drawn from the spu.

suggestions

the research results and products are intended to be utilized as follows: (1) cect_spu is expected to be utilized by vhs, particularly those with mechanical engineering expertise competency, as an alternative model for overcoming the difficulties vhs have had so far in collaborating with b&i. this is very important because vocational education must continue to collaborate with b&i so that its graduates are neither mismatched nor underqualified relative to b&i requirements; (2) cect_spu is expected to be utilized by the directorate of construction of the vhs as a reference for continuing to develop appropriate cect models in line with conditions in the field and the dynamics of the industry. this model is an early model of cect development, in accordance with the development policy of the directorate of construction of vhs, i.e. developing school production units into school teaching factories.

references

azwar, s. (2003).
tes prestasi: fungsi dan pengembangan pengukuran prestasi belajar [performance test: function and development of learning performance measurement]. yogyakarta: pustaka belajar.
bureau of statistics center. (2011). data strategis bps [strategic data of the bureau of statistics center]. jakarta.
bureau of regional research and development. (2008). laporan penelitian tentang keterkaitan pendidikan dan penyedia lapangan kerja di jawa tengah [research report on the interrelatedness of education and workforce provider in central java]. retrieved from http://gerbangtani.com/litbang/hasil_penelitian/2-pendidikandanlapkerja.pdf.
borg, w.r. & gall, m.d. (1989). educational research: an introduction (4th ed.). new york, ny: longman.
department of education and culture. (1997). ketrampilan menjelang 2020 untuk era global [approaching 2020 skill for global era]. jakarta.
department of national education. (2004b). direktori lembaga sertifikasi profesi dan tempat uji kompetensi [directory of institute of profession certification and competency test center]. jakarta: direktorat jendral pendidikan dasar dan menengah.
guilford, j.p. (1950). fundamental statistics in psychology and education (2nd ed.). new york: mcgraw-hill.
gunawan, r. (2006). relevansi kompetensi lulusan smk dengan tuntutan dunia kerja [the relevance of vhs graduates with workforce demand]. retrieved from http://file.upi.edu/direktori/fptk/jur._pend._teknik_mesin/195105011980021-ricky_gunawan/makalah_semnas_ptk_2006.pdf.
handoko, t.h. (1999). manajemen [management] (2nd ed.). yogyakarta: bpfe.
nunnally, j.c. (1981). psychometric theory. new delhi: tata mcgraw-hill.
samsudi, et al. (2009).
uji kompetensi siswa smk dalam rangka ujian nasional [vhs students competency test in relation to national examination]. kajian smk, direktorat pembinaan smk depdiknas.

research and evaluation in education issn 2460-6995
research and evaluation in education, 2(2), 2016, 108-121
available online at: http://journal.uny.ac.id/index.php/reid
research article

the measurement model of historical awareness

* 1 aisiah; 2 suhartono; 3 sumarno
1 faculty of social sciences, universitas negeri padang, jln. prof. dr. hamka, air tawar, padang, 25131, sumatera barat, indonesia
2 faculty of cultural sciences, universitas gadjah mada, jl. nusantara 1, bulaksumur, caturtunggal, depok, sleman, 55281, yogyakarta, indonesia
3 graduate school of universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, caturtunggal, depok, sleman, 55281, yogyakarta, indonesia

abstract

the study aimed to develop a measurement model of historical awareness through a research and development model adopted from the plomp model. historical awareness was measured through four components, namely: knowledge of historical events, understanding of historical research method, meaning of historical events, and usefulness of history.
the development procedure included a preliminary investigation in the form of a literature study on the constructs of historical awareness. in the design stage, the researcher designed a conceptual model and a hypothetical measurement model of historical awareness. then, the researcher performed test construction, namely assembling the test instrument for measuring historical awareness. finally, the researcher administered the test, evaluated it, and revised it. the test in the study refers to the empirical testing of the instrument, while the evaluation refers to the effort to identify the obstacles that participants encountered during the empirical testing so that the instrument could be revised. the empirical testing involved history teacher candidates at universitas negeri yogyakarta and universitas negeri padang. the data were gathered using a measurement instrument in the form of an associative multiple-choice test. for the construct analysis, the researcher performed confirmatory factor analysis using the lisrel 8.80 program. the results of the analysis show that χ² = 121.98, p-value = 0.11, and rmsea = 0.043. in other words, the measurement model of historical awareness that was developed is supported by the empirical data.

keywords: historical awareness, measurement model, knowledge of historical event, historical research method, meaning of historical event, usefulness of history

how to cite item: aisiah, a., suhartono, s., & sumarno, s. (2016). the measurement model of historical awareness. research and evaluation in education, 2(2), 108-121. doi:http://dx.doi.org/10.21831/reid.v2i2.8399

*corresponding author. e-mail: aisiah.unp@gmail.com
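the fit statistics reported in the abstract can be read against rule-of-thumb cutoffs. the sketch below is illustrative only: the cutoffs (a chi-square p-value above 0.05 and an rmsea at or below 0.08) are commonly used conventions assumed here, not criteria stated in this article, and the names FitReport and model_fits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitReport:
    """Fit statistics of a confirmatory factor analysis (hypothetical container)."""
    chi_square: float
    p_value: float
    rmsea: float


# Assumed rule-of-thumb cutoffs; the article does not state which criteria it applied.
P_VALUE_MIN = 0.05
RMSEA_MAX = 0.08


def model_fits(report: FitReport,
               p_min: float = P_VALUE_MIN,
               rmsea_max: float = RMSEA_MAX) -> bool:
    """Return True when the chi-square test is non-significant and RMSEA is small."""
    return report.p_value > p_min and report.rmsea <= rmsea_max


# Values as reported in the abstract.
reported = FitReport(chi_square=121.98, p_value=0.11, rmsea=0.043)
print(model_fits(reported))
```

under these assumed cutoffs the reported values count as an acceptable fit, which is consistent with the abstract's statement that the model is supported by the empirical data; stricter conventions (for example rmsea at or below 0.05) can be tried by passing a different rmsea_max.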
introduction

this article reviews the measurement model of historical awareness. the intended measurement model is an effort to model the measurement of a latent variable (historical awareness) through components or indicators presented in the form of a path diagram. the concept of historical awareness is defined as a condition or a reasoning process in which people recall the meaning and the usefulness of history. the constructs or components of historical awareness include four aspects, namely: knowledge of historical events (pengetahuan peristiwa sejarah, pps), understanding of historical research method (pemahaman metode penelitian sejarah, pmps), meaning of historical events (pemaknaan peristiwa sejarah, mps), and usefulness of history (kegunaan sejarah, gs). the four constructs are derived from the ideas of indonesian historians such as soedjatmoko, ruslan abdulgani, and sartono kartodirdjo. therefore, the ideas and a theoretical review regarding historical awareness are elaborated as follows.

historical awareness is possessed only by human beings and, therefore, it has frequently been said that human beings are historical creatures. heller (1982, p.3) asserts that it is human beings throughout the world who can tell their history because human beings are those who understand the concept of ‘once upon a time.’ human beings review and analyze their life history and that of their nations. nations that do not understand their history are like individuals who have lost their memories (suffering from amnesia/senile dementia), so that they must search for their identity in the darkness (kartodirdjo, 1993, p.50; hariyono, 1995, p.1). in other words, a nation that does not have historical awareness is a nation that has lost its identity. therefore, historical awareness should always exist in every citizen as the generation of a nation.
every citizen should develop historical awareness in his or her nation and state life (rosenlund, 2011, p.1). the effort to develop historical awareness among the generation of a nation might be pursued by means of history education (history teaching). kartodirdjo (1993, p.51) asserts that the history subject has a socio-cultural function of encouraging historical awareness. historical awareness is a key concept of great importance in history didactics (thorp, 2014, p.iv; korber, 2015, p.1). people who study history will be able to compare the differences among periods, cultures, and social systems (ata, 2009, p.8). this ability is a manifestation of an individual’s historical awareness.

the teaching of history has recently not been successful in developing the historical awareness of the young generation. this condition needs serious attention. at university, for example, some students have not understood or comprehended the important meanings of their nation’s history (ministry of education and culture, 2012, p.54). the emphasis on factual knowledge is certainly ‘dry’ and does not cover much of university students’ understanding of the exemplary values that are studied or researched in their final assignment. according to mardapi (2007, p.5), teaching quality might be viewed from the assessment results. both aspects are related to one another and there should be continuous improvement efforts. the assessment of the historical awareness of history teacher candidates seems to escape most lecturers’ attention. historical awareness reflects the internalization (wineburg, 2006, p.48) of life values and the nationality reflected in the historical events taught to history teacher candidates at university. in order to measure the historical awareness of history teacher candidates, there should be a valid and reliable measurement instrument.
the measurement of the level of historical awareness among history teacher candidates at university is urgent, given that these candidates will become history teachers who should cultivate historical awareness among students at middle school. therefore, this study is intended to develop a measurement model of historical awareness.

constructs of historical awareness

lukacs (1968, p.15) defines historical awareness simply as the remembered past; meanwhile, paska (2010, p.7) defines historical awareness in more depth as how people view the past. historical awareness is a fundamental ability to recall and imagine past events (kolbl & straub, 2001, p.8). according to lukacs (1968, pp.9-10), recalling the past involves cognition and recognition that are closely related to reasoning processes or activities. for ankersmit (1987, p.354), historical awareness as a reasoning process is marked by an awareness that the depiction of the past is an intellectual discourse tied to particular factual accuracy.

the constructs of historical awareness are formulated from the ideas and thoughts of several indonesian historians regarding the concept. their thoughts and ideas are presented in table 1. according to abdulgani (ministry of education and culture, 2012, p.43), the definition of historical awareness is as follows: historical awareness is a mental attitude ... that has been the strength to take active participation in history dynamics. historical awareness includes: first, the knowledge of historical facts and their causal relationship (the cause and effect among the historical facts); second, the loading of our mind with the logics, namely the existence of certain laws in history; and third, the improvement of our conscience by wisdom and intelligence in order to reflect from the past experiences.
abdulgani views historical awareness in relation to the knowledge, meaning, and usefulness of history. according to soedjatmoko (ministry of education and culture, 2012, p.43), the concept of historical awareness is related to mental attitude, but with more emphasis on the way an individual positions himself or herself before social truth and reality within the perspective of present, past, and future. according to lapian (ministry of education and culture, 2012, p.42), historical awareness is defined as historical clarification, namely a study of elementary matters such as who, what, when, where, and why, of the impression of history, and of the function of history in education. leirissa (ministry of education and culture, 2012, p.41) tends to simplify lapian’s idea; in his opinion, historical awareness serves as an understanding of the essence of historical study. ayatroehadi (ministry of education and culture, 2012, p.42) views historical awareness in detail as historical insights and ideas, including historical knowledge, theoretical foundations and research methodology, and oral and written historical reviews.

table 1. the historians’ ideas and thoughts regarding the concept of historical awareness

historian | definition of historical awareness
ruslan abdulgani | historical awareness has been a mental attitude (strength) that covers the knowledge of historical facts and their causality, the historical logic and the improvement of conscience by wisdom and intelligence for reflecting the past.
sartono kartodirdjo | historical awareness will be improved by possessing historical knowledge, historical mindedness and by being able to imagine the situation of past history, cultural atmosphere, sentiment, idea, mentality, life style etc.
soedjatmoko | historical awareness has been a mental attitude and a manner to put oneself in front of the truth and the social reality in the perspective of present, past, and future.
adrian bernard lapian | historical awareness is not independent of clarification, namely a historical study that entails the elementary aspects such as who, what, when, where and why, the historical impression, and function in the education and the controversial aspects.
r. z. leirissa | historical awareness is an understanding of the essence of historical study.
ayatroehadi | historical awareness includes insight regarding history, the ideas within the historical insight, the theoretical and methodological foundation of historical study and the oral/written review regarding history.
construct conclusion (for all historians): 1. knowledge of historical events; 2. understanding of historical research method; 3. meaning of historical events; 4. usefulness of history
source: kutoyo in ministry of education and culture (2012)

the ideas of these historians might be summarized in the definition that historical awareness is a condition and reasoning process in which an individual recalls the meaning of history and its usefulness. the meaning of history refers to history as past events and to history as a science with its methodology of historical research. furthermore, another important aspect of historical awareness is understanding the significance of historical events, in the form of their values and impacts, and the usefulness of history in life. the various experiences of past events that have been studied give a certain meaning according to a certain interpretation. eventually, the meaning will direct how history is used in life. the constructs of historical awareness are derived from the concept of historical awareness according to the historians’ thoughts presented in table 1.
the operationalization of the concept of historical awareness is adopted from the reasoning of greenberg (1991, p.7), who states that historical awareness as a conceptual system comprises interactive elements which allow comprehension of temporal/historical experience and individual placement in time/history. in other words, historical awareness defined as a conceptual system consists of several aspects that function together in forming historical awareness. therefore, in this research, the conceptual system of historical awareness is designed with four components, and these components are the hypothetical constructs in developing the measurement model of historical awareness. the four components as the hypothetical constructs of the measurement model include: (1) knowledge of historical events; (2) understanding of historical research method; (3) meaning of historical events; and (4) usefulness of history (see figure 1). figure 1 shows the conceptual system of historical awareness, consisting of four constructs (components) which become the basis for forming historical awareness.

the emergence of historical awareness starts from the knowledge of historical facts and the interrelatedness among historical facts (kartodirdjo, 1986, p.9; latief, 2006, p.49). knowledge of historical events is the preliminary requirement for establishing historical awareness and, conversely, historical awareness is very important to and influences the production of historical knowledge (gleencross, 2010, pp.13). according to budhisantoso (ministry of education and culture, 2012, p.22), the development of historical awareness should be conducted by expanding historical knowledge and historical comprehension of the cultural values of a nation. historical knowledge cannot be separated from the investigation process or the implementation of a research method. historical knowledge is proved by the robustness of historical research findings (kreuzer, 2010, p.383).
the results of historical studies might strengthen historical awareness, and historical awareness might even be the foundation for distinguishing facts from myths (pompa, 1990, p.217) in the process of historical reconstruction.

figure 1. the constructs of historical awareness (historical awareness linked to four constructs: knowledge of historical events, understanding of historical research method, meaning of historical events, and usefulness of history)

historical awareness helps an individual trace the meaning reflected in historical events. the meaning of historical events lies in the significance that people give to them (denison, 2011, p.47). the significance of a historical event is shown by the values held by individuals in the past that it reflects (tucker, 2009, p.14) and by the impacts of the event. the ability to explore the significance of historical events and the values they contain reflects the depth of historical awareness. historical awareness at this level does not come solely from knowledge of the facts of historical events; instead, it comes from a deep understanding of the significant meaning of historical events. by recalling the past, an individual might act better in the future (pownal, 2007, p.26). the expectation is that an individual will not repeat the same mistake that was made or experienced in the past, for the sake of the future.

knowledge of historical events

knowledge of historical events is the knowledge about what (event) has occurred in the past of human history, or the knowledge about historical facts and processes (topolski, 1976, pp.305-411). the essence of the knowledge of historical events is the explanation of historical events together with the overall facts, including ‘what,’ ‘who,’ ‘when,’ ‘where’, and ‘how’ (kartodirdjo, 1992, p.252; grant, 2003, p.60).
the knowledge of historical events does not lie in what might inform the future; instead, it lies in what might inform the past (elliott, 2003, p.24). the knowledge of historical events might be measured through what has been recalled regarding the facts that have been learned (grant, 2003, p.89). intellectual curiosity about matters of the past is one of the reasons why people learn and study history (tosh, 1984, p.21). historical knowledge is one of the elements of historical understanding (grant, 2003, p.58). historical understanding is viewed from the limits of substantive knowledge and the procedural aspects of the history discipline (husband, kitson & pendry, 2003, p.58). historical understanding includes an understanding of causality (kitson, husband & steward, 2011, p.74). students at university use and produce information through texts and develop their skills in interpreting historical knowledge and thoughts (paska, 2010, p.3). university students might position themselves as both consumers and producers of historical knowledge as they look more deeply into historical studies.

understanding historical research method

a historical research method refers to the use of a sequence of scientific procedures to verify historical evidence or sources (tosh, 2002, p.104). these procedures include topic selection, source criticism (internal and external), analysis and interpretation, and presentation in the form of a composition (kuntowijoyo, 2013, p.64). the topic selected should be in accordance with the interest of the researcher. after the topic has been selected, the sources are collected (heuristics). historical sources are historical materials that might shed light on the story of human life/life inheritance and the results of human activities, both physical and non-physical (suhartono, 2010, p.29). historical sources consist of primary and secondary sources.
after the historical sources (documents) have been found, two aspects should be investigated, namely the authenticity and the credibility of the sources (gottschalk, 1956, p.27). this process is called source criticism. source criticism is a process of verifying or testing the accuracy of historical sources in terms of source appropriateness (sjamsuddin, 2012, p.102). source criticism consists of internal and external criticism. external criticism refers to the effort to prove the authenticity of a source by investigating its physical form, that is, testing the external aspects of historical sources (suhartono, 2010, p.6). through external criticism, a researcher might establish the originality of the historical sources for the events under investigation. internal criticism, on the other hand, refers to the effort to investigate the credibility of the sources: whether they are trustworthy, manipulated, biased, deceiving, and the like, in order to understand the content of historical sources (suhartono, 2010, pp.36-37). testing the credibility of historical sources is a second verification (second investigation) for proving whether the sources are trustworthy (kuntowijoyo, 2013, p.78). verification refers to justification, proving, validation, and confirmation for obtaining trustworthy information. source verification is conducted by asking several logical questions regarding a historical event and by comparing the account of the event with other related data, so that the researcher has objective and reliable data for the process of interpretation. interpretation covers two elements: an analysis and a synthesis.
an analysis refers to elaboration, while a synthesis refers to unification (kuntowijoyo, 2013, pp.78-79). after the data have been found, they are analyzed so that historical facts can be revealed. people might hold different opinions in the analysis and the synthesis. interpretation is frequently regarded as the source of subjectivity. this statement might be correct or incorrect. it might be correct because without a historian’s interpretation the data cannot convey any information. on the contrary, it might be incorrect if the historian is not honest about the data and the information that he or she has obtained. subjectivity is admitted but should be avoided. last but not least, historiography (composition) is the final stage in historical research.

meaning of historical events

the meaning of historical events for people, objects, and events depends on the implementation of values in certain perspectives (barash, 2003, p.27). meaning does not appear as a part of fact (cohen, 1961, p.44). historical meaning is shown by historical significance. training the capacity to build the meaning of a historical event is a matter that goes beyond simple knowledge-based content (russel & pellegrino, 2008, p.3). people might define or find the important meaning or significance of historical events by understanding the complexity of the events. wineburg (2006, p.37) asserts that history that is taught well will give people an enormous capacity for understanding the meaning that enables them to shape the world. abdullah (ministry of education and culture, 2012, p.10) likewise asserts that if historical reconstruction is conducted through selection, then the establishment of historical awareness will be very selective. the events that belong to historical awareness are processed by the value system that ultimately will be the basis of a historical view. history provides not only meaning but also wisdom.
history does not only memorize past events but also seeks to understand their meaning. questions regarding historical meaning always arise and are always asked by human beings (kartodirdjo, 1986, p.5). history might be said to have historical meaning if it can lead human beings to the discovery of future aspects. human conscience becomes the basis of self-awareness from the life experiences in which historical meaning is reached (barash, 2003, p.109). everyone may protect himself or herself by understanding what he or she has done before and the significance of his or her action (cohen, 1961, p.252). past experiences become useful guidance for facing the future.

usefulness of history

history has multiple uses. people would not learn history if it had no usefulness (kuntowijoyo, 2013, p.15). the usefulness of history might be viewed from theoretical and practical aspects. the theoretical usefulness of history is related to the tendency to learn about past events for the sake of the intellectual-academic needs (scientific importance) of history (latief, 2006, p.70). history offers the best materials for intellectual exercise. by learning and investigating history, an individual will gain wide understanding and knowledge. all historical knowledge is based on the practical needs of human beings (mazabow, 2003, p.227). the practical usefulness of historical learning might be viewed from the educational, instructional, inspirational, and recreational aspects. history is useful for education and for providing lessons. by learning history, people might find many educational examples in the form of moral actions and attitudes that should be followed or avoided. history is also useful as learning material (latief, 2006, pp.70-74; sjamsuddin, 2012, pp.126-216; kuntowijoyo, 2013, pp.15-28; tosh, 1984, p.7).
by reading and reviewing history books, people will find the meaning, the significance, and the useful lessons of historical events. history might even be a source of inspiration (tosh, 1984, p.7). by reading various historical studies (autobiographies and biographies), individuals might find inspiration for where they want to go (kuntowijoyo, 2013, p.23). the inspiration might take the form of ideas, concepts, spirit, motivation, and sacrifice that make people aware of the obstacles and hindrances that they encounter in life (latief, 2006, p.72; sjamsuddin, 2012, p.216). people might look to the past to find solutions for current problems (gottschalk, 1956, p.172; tosh, 1984, p.15). by learning history, people will be creative in encountering the challenges of the century.

history provides opportunities to learn from past experiences. history is the record of human experiences, and human beings might obtain advantages in multiple domains of science by learning from past experiences (gottschalk, 1956, p.30). history also provides enjoyment that gives an aesthetic sense, which opens heart and feelings (kuntowijoyo, 2013, p.25). historical work in the form of biography might become enjoyable reading material that lets people relive nostalgic moments of past experiences (latief, 2006, p.74; tosh, 1984, p.9). by visiting historical sites, people might sense the beauty of the life conditions in the past.

past experiences become a mirror for viewing the future and a compass for advancement. history is a set of experiences that become the basis for projecting the future and for predicting upcoming events (tosh, 1984, pp.1-4; greenberg, 1991, p.38). past experiences that are relevant to present experiences will be the basis for formulating actions toward the future (tosh, 1984, p.21).
historical knowledge is useful as an aid for interpreting the future (sjamsuddin, 2012, p.139), for equipping human beings with the discovery of recent awareness, and for serving as the basis for projecting recent abstraction (latief, 2006, p.45). past experiences also become the basis for anticipating every possibility that might occur in the future.

measurement model of historical awareness

measurement is the assigning of numbers to individuals in systematic ways as a means of representing properties of the individuals or of an object (allen & yen, 1979, p.2; mardapi, 2012, p.5). the condition of an individual under measurement in the domain of education is usually related to learning results. in historical learning, the measurement of learning results might be directed to the measurement of historical awareness because it is one of the results of historical learning. the measurement of historical awareness should be conducted with a specific model, since many aspects have to be learned and contribute to establishing historical awareness, as described in the theoretical constructs regarding historical awareness.

a measurement model shows the relationship between the observed variables (indicators/responses) and the latent variable that they represent (schumacker & lomax, 2010, p.114; khine, 2013, pp.5-7). ghozali (2008, p.127) states that a measurement model describes how well the indicators can be used as measures of latent variables (latent constructs) such as knowledge, behaviors, and attitudes. hendryadi (2014, p.63) asserts that a measurement model is an effort to model the measurement of latent variables through dimensions or indicators. in more detail, kusnendi (2008, p.98) states that in a measurement model, as a form of operationalization of variables or research constructs, the constructs become measurable indicators that are formulated into a path diagram. khine (2013, p.6) states that, in a wider sense, a measurement model determines how a theory will be operationalized as latent and observed variables.

in this study, the design of the measurement model was developed from theoretical knowledge and empirical studies, and the relationship pattern between observed and latent variables was then hypothesized. next, the hypothetical model was tested statistically by means of empirical data. theory plays an important role in the construction of the measurement model (khine, 2013, p.41). the measurement model of historical awareness was developed based on the results of the theoretical review of historical awareness. four components of historical awareness were found for the model (see figure 1). the results of the theoretical review show that historical awareness is established by four aspects: knowledge of historical events, understanding of the historical research method, the meaning of historical events, and the usefulness of history. the indicators of each aspect are presented in table 2. the theoretical constructs of the components and indicators of historical awareness become the basis of the hypothetical measurement model, as presented in figure 2.

notes:
pps: pengetahuan peristiwa sejarah (knowledge of historical events)
pmps: pemahaman metode penelitian sejarah (understanding of historical research method)
mps: makna peristiwa sejarah (meaning of historical events)
gs: kegunaan sejarah (usefulness of history)

table 2. components and indicators of historical awareness

components | indicator
a. knowledge of historical events | understanding the facts of historical events, including: what (event), who (figure), when (period), where (place) and why (cause)
b. understanding of historical research method | identifying the procedures of conducting historical research, including: heuristics, criticism, verification, interpretation and historiography
c. meaning of historical events | finding the positive impact, the negative impact and the positive values of historical events
d. usefulness of history | identifying the usefulness of history theoretically and practically (instructional, educational, inspirational, recreational and predictive usefulness)

source: lukacs, 1968; kartodirdjo, 1986, 1992, 1993; ministry of education and culture, 2012; sjamsuddin, 2012

figure 2. the hypothetical measurement model of historical awareness

method

this study implemented the research and development method adapted from the model proposed by plomp (1982, p.5). the procedure for developing the measurement model of historical awareness included preliminary investigation (literature study), design (formulating the construct and the hypothetical diagram of the measurement model of historical awareness), construction (making the measurement instrument) and testing (empirical testing of the instrument). data gathering was conducted through empirical testing. the developed instrument of historical awareness was an associative multiple-choice test designed for history-teacher candidates as the research subjects in the empirical testing. the subjects were history-teacher candidates at two universities: universitas negeri yogyakarta and universitas negeri padang. the sample was established by using the stratified random sampling technique involving the first, third, and fifth semester students in the academic year of 2014/2015. the total number of subjects in the empirical testing was 190 history-teacher candidates.

table 3.
details of research subjects

university | semester | subjects
universitas negeri yogyakarta | i | 45
universitas negeri yogyakarta | v | 27
universitas negeri yogyakarta | subtotal | 72
universitas negeri padang | i | 28
universitas negeri padang | iii | 48
universitas negeri padang | v | 42
universitas negeri padang | subtotal | 118
total | | 190

source: field data

the data were analyzed through confirmatory factor analysis by using the lisrel 8.80 software. this analysis was used to estimate the validity, the reliability, and the fit of the model formulated from the results of the theoretical review. khine (2013, p.6) asserts that confirmatory factor analysis is frequently implemented in measurement model testing. the validity and reliability of the measurement model are shown by the validity and reliability of the measurement instrument that has been tested. the construct validity was measured from the loading factor values resulting from the factor analysis. an observed variable might be considered valid for the construct of the measurement (the latent variable) if its loading factor value is above 0.3 (schumacker & lomax, 2010, p.185). the construct reliability was then calculated from the squared sum of the standardized loading factors of the indicators and the sum of their error variances. the formula for the construct reliability coefficient proposed by wijanto (2008, p.175) is, in its simple form, as follows:

$$\mathrm{CR} = \frac{\left(\sum_{i} \lambda_i\right)^2}{\left(\sum_{i} \lambda_i\right)^2 + \sum_{i} e_i}$$

where $\lambda_i$ is the standardized loading factor of indicator $i$ and $e_i$ is its error variance.

findings and discussions

the focus of the development of the measurement model of historical awareness is to test the validity and reliability of the designed instrument constructs and to test the goodness of fit of the measurement model. the objective of the development is to obtain empirical evidence regarding the factors and indicators that exist in the measurement model of historical awareness. the hypothesized constructs of the measurement instrument consist of four factors and 19 indicators. the total number of test items is 90. the result of the confirmatory factor analysis shows that, out of the 90 items under analysis, 28 items are invalid and insignificant (loading factor < 0.3 and t-value < 1.96).
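the screening rule described above can be sketched in a few lines of python. this is only an illustration under stated assumptions: the item names and estimates are hypothetical (they are not taken from the study’s lisrel output), and the sketch assumes that an item is retained only when it passes both cutoffs.

```python
from dataclasses import dataclass

LOADING_CUTOFF = 0.3  # minimum standardized loading factor cited in the text
T_CUTOFF = 1.96       # critical t-value cited in the text


@dataclass
class ItemEstimate:
    name: str
    loading: float  # standardized loading factor
    t_value: float


def is_valid(item: ItemEstimate) -> bool:
    """Assumed rule: keep an item only if it exceeds both cutoffs."""
    return item.loading > LOADING_CUTOFF and item.t_value > T_CUTOFF


def screen(items):
    """Split item names into (kept, dropped) according to is_valid."""
    kept = [item.name for item in items if is_valid(item)]
    dropped = [item.name for item in items if not is_valid(item)]
    return kept, dropped


# hypothetical estimates for illustration only
example = [
    ItemEstimate("item_a", 0.45, 3.10),
    ItemEstimate("item_b", 0.22, 1.40),
    ItemEstimate("item_c", 0.35, 1.80),
]
kept, dropped = screen(example)
print("kept:", kept)
print("dropped:", dropped)
```

in practice the loadings and t-values would be read from the estimation output rather than typed in by hand.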
the invalid test items were eliminated, and the model was then reanalyzed with modifications in accordance with the suggestions provided by the lisrel software. a model with goodness of fit was obtained from this reanalysis. the distribution of the changes in the items of the historical awareness test after the reanalysis is presented in table 4.

the testing of goodness of fit of the measurement model of historical awareness prioritized the commonly used criteria: the chi-square (χ²) value (the smaller the χ² value, the better the result), the probability value (p-value) ≥ 0.05, and rmsea ≤ 0.08 (wijanto, 2008, pp.54-62; hair, black, babin, & anderson, 2009, p.22). the results of the second order confirmatory factor analysis (2nd order cfa) show that the measurement model of historical awareness is supported by the empirical data based on the p-value (≥ 0.05) and rmsea (≤ 0.08) criteria. these findings indicate that the measurement model of historical awareness shows goodness of fit and that the hypothetical model is accepted. thereby, the measurement model of historical awareness resulting from the theoretical review is supported by empirical data. the measurement model of historical awareness resulting from the empirical testing is presented in figure 3.

an important aspect of the measurement model that deserves attention is the validity and reliability of the measurement instrument constructs. the validity and reliability of the historical awareness measurement model are shown by the loading factor value of each indicator of the four latent constructs of historical awareness. the results of the 2nd order cfa as displayed in the lisrel output (fig. 3) show that the standardized loading factor (slf) values of the indicators of the latent variables in the measurement model of historical awareness have met the requirements: the slf is higher than 0.3, with the t-value higher than 1.96 at the 95% confidence level.

table 4. test item distribution of historical awareness instrument

aspects of the instrument | test items in preliminary instrument | eliminated test items | test items in final instrument
knowledge of historical events | 1 to 20 | 1, 3, 6, 11 | 1 to 16
understanding of historical research method | 1 to 25 | 1, 2, 3, 4, 5, 14, 14, 15, 16, 18, 19 | 1 to 14
meaning of historical events | 1 to 20 | 2, 3, 9, 14, 15, 19 | 1 to 14
usefulness of history | 1 to 25 | 3, 8, 10, 16, 19, 20, 21, 25 | 1 to 18
total | 90 | 28 | 62

source: results of data analysis by using lisrel 8.80 software

figure 3. historical awareness measurement model resulted from the empirical testing (standardized)

table 5 shows the loading factor value of each indicator that has passed the empirical testing process. overall, the loading factor values of all indicators range from 0.36 to 0.72. the lowest loading factor value is shown by the ‘kritik’ (criticism) indicator of the pmps latent variable (0.36), while the highest loading factor value is found in the ‘gedu’ (educational usefulness) indicator of the gs latent variable (0.72). the t-values of all indicators range from 2.99 to 5.54 (> 1.96). therefore, it can be stated that the measurement instrument of historical awareness in the form of associative multiple-choice test items has good construct validity and is valid for measuring historical awareness.

the reliability of the measurement model of historical awareness is shown by the coefficient of composite reliability (cr). the composite reliability is known as multidimensional reliability because the measured constructs are multidimensional and based on the confirmatory factor analysis.
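as a rough illustration of how a composite reliability coefficient can be computed, the sketch below assumes the common form cr = (sum of standardized loadings)² / ((sum of standardized loadings)² + sum of error variances). when no error variances are supplied, it further assumes that each error variance equals 1 minus the squared standardized loading, which is an assumption of this sketch and not a statement about how lisrel produced the study’s estimates. the loadings used are the six gs indicator loadings reported in table 5; the result therefore need not reproduce the reported coefficients exactly.

```python
def composite_reliability(loadings, error_variances=None):
    """Composite reliability from standardized loadings.

    If error_variances is omitted, each one is assumed to be
    1 - loading**2 (an assumption of this sketch).
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    if error_variances is None:
        error_variances = [1.0 - loading ** 2 for loading in loadings]
    if len(error_variances) != len(loadings):
        raise ValueError("loadings and error variances must have equal length")
    squared_sum = sum(loadings) ** 2
    return squared_sum / (squared_sum + sum(error_variances))


# standardized loadings of the gs indicators as reported in table 5
gs_loadings = [0.64, 0.47, 0.72, 0.49, 0.53, 0.66]
cr_gs = composite_reliability(gs_loadings)
print(f"cr (gs, assumed error variances): {cr_gs:.2f}")
```

passing the error variances explicitly, when they are available from the estimation output, avoids the simplifying assumption.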
the coefficient of composite reliability expresses the proportion to which the indicators explain the measured constructs (margono, 2013, p.19). a study by widhiarso & mardapi (2010, p.17) shows that the coefficient of composite reliability has high accuracy in the multidimensional model. the estimated coefficients of composite reliability for the constructs of the historical awareness measurement model in table 5 range from 0.6 to 0.8. these coefficients are acceptable as long as the validity indicators of the model constructs are good (hair et al., 2009, p.688).

table 5. the validity and reliability of the instrument construct (2nd order cfa)

factor | indicator | slf* | t-value | construct validity decision | cr | construct reliability decision
pps | siapa | 0.59 | ** | good | 0.7 | good
pps | kapan | 0.43 | 3.48 | good | |
pps | dimana | 0.69 | 5.02 | good | |
pps | mengapa | 0.53 | 4.10 | good | |
pmps | heuristik | 0.63 | ** | good | 0.7 | good
pmps | kritik | 0.36 | 2.99 | good | |
pmps | eksplanasi | 0.59 | 4.68 | good | |
mps | dampos | 0.54 | ** | good | 0.6 | acceptable
mps | damneg | 0.41 | 3.16 | good | |
mps | nilpos | 0.53 | 3.86 | good | |
gs | gteo | 0.64 | ** | good | 0.8 | good
gs | ginst | 0.47 | 3.89 | good | |
gs | gedu | 0.72 | 5.54 | good | |
gs | ginspi | 0.49 | 4.05 | good | |
gs | grek | 0.53 | 4.29 | good | |
gs | gpred | 0.66 | 5.16 | good | |

source: results of data analyzed with lisrel 8.80 software
*slf: standardized loading factor
**: defined by default by lisrel, t-value was not estimated
cr: composite reliability

table 6. results of overall goodness of fit model of historical awareness instrument

gof | targeted score | attained score | note
χ² statistics | expected to be small | 121.98 | expected to be small
χ² probability (p-value) | ≥ 0.05 | 0.101 | good
rmsea | ≤ 0.08 | 0.045 | good
gfi | ≥ 0.90 | 0.85 | good
agfi | ≥ 0.90 | 0.84 | quite good
cfi | ≥ 0.90 | 0.96 | good
nfi | ≥ 0.90 | 0.89 | good

source: results of data analyzed with lisrel 8.80 software
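the cutoffs in table 6 can also be applied mechanically, as in the minimal sketch below. the attained values are copied from table 6; the dictionary layout and function name are hypothetical, and the sketch assumes the conventional direction (≥ 0.90) for gfi, agfi, cfi and nfi. a strict threshold check of this kind is harsher than verbal labels such as "quite good", which treat values close to a cutoff as acceptable.

```python
import operator

# cutoff per index: (comparison, threshold); the >= 0.90 direction for
# gfi, agfi, cfi and nfi is the conventional one and is assumed here
CRITERIA = {
    "p_value": (operator.ge, 0.05),
    "rmsea": (operator.le, 0.08),
    "gfi": (operator.ge, 0.90),
    "agfi": (operator.ge, 0.90),
    "cfi": (operator.ge, 0.90),
    "nfi": (operator.ge, 0.90),
}

# attained scores as reported in table 6
attained = {
    "p_value": 0.101,
    "rmsea": 0.045,
    "gfi": 0.85,
    "agfi": 0.84,
    "cfi": 0.96,
    "nfi": 0.89,
}


def check_fit(scores, criteria=CRITERIA):
    """Return {index: True/False} for every index that has a score."""
    return {
        name: compare(scores[name], cutoff)
        for name, (compare, cutoff) in criteria.items()
        if name in scores
    }


result = check_fit(attained)
for name, passed in result.items():
    print(f"{name}: {'meets' if passed else 'misses'} the strict cutoff")
```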
the coefficient of cr for the dimensions of pps and pmps is equal to 0.7, the coefficient of cr for the dimension of mps is equal to 0.6, and the coefficient of cr for the dimension of gs is equal to 0.8. therefore, it can be stated that the measurement instrument of historical awareness might provide reliable or trustworthy results. the results of the overall goodness of fit of the model show that the measurement model of historical awareness developed from theory is supported by empirical data. the results of the overall goodness of fit of the model are presented in table 6.

conclusions

the results of the study show that the developed measurement model of historical awareness is valid, reliable, and fits the empirical data. the constructs of the measurement model of historical awareness consist of four dimensions, i.e. the knowledge of historical events, the understanding of the historical research method, the meaning of historical events, and the usefulness of history. the validity of the measurement model is shown by the validity of the test instrument constructs: the loading factor values of all indicators in the measurement model of historical awareness range from 0.36 to 0.72. the reliability of the measurement model of historical awareness is shown by the coefficient of composite reliability (cr), which ranges from 0.6 to 0.8. the empirical testing of the fit of the model shows that the model fits, with the χ² value of 121.98, the p-value of 0.11 and rmsea of 0.043. thereby, it can be concluded that the developed measurement model of historical awareness is supported by empirical data.

references

allen, m. j., & yen, w. m. (1979). introduction to measurement theory. monterey, ca: brooks/cole.
ankersmit, f. r. (1987). refleksi tentang sejarah: pendapat-pendapat modern tentang filsafat sejarah [reflection on history: modern opinions on historical philosophy]. (d. hartoko, trans.). jakarta: gramedia.
ata, b. (2009).
the turkish prospective history teachers’ understanding of analogy in history education. international journal of historical learning, teaching and research, 8(1), 6-18.
barash, j. a. (2003). martin heidegger and the problem of historical meaning. new york, ny: fordham university press.
cohen, m. r. (1961). the meaning of human history. chicago, il: the open court.
denison, b. j. (2011). history, time, meaning, and memory: ideas for the sociology of religion. leiden: koninklijke brill nv.
elliott, j. (2003). the limits of historical knowledge. european review, 11(1), 21-25. doi:10.1017/s1062798703000036
ghozali, i. (2008). structural equation modeling: teori, konsep, dan aplikasi. semarang: badan penerbit universitas diponegoro.
gleencross, a. (2010, october). historical awareness in international relation theory: a hidden disciplinary dialogue. paper presented in millennium conference at universitas berden.
gottschalk, l. (1956). understanding history: a primer of historical method. new york, ny: alfred a. knopf.
grant, s. g. (2003). history lessons: teaching, learning, and testing in us high school classrooms. mahwah, nj: lawrence erlbaum associates.
greenberg, d. (1991). metahistory of everyday: historical awareness in lived existence (set in late eighteenth century britain) (unpublished master’s thesis). university of british columbia, canada.
hair, j. f., black, w. c., babin, b. j., & anderson, r. e. (2009). multivariate data analysis (7th ed.). upper saddle river, nj: prentice hall.
hariyono. (1995). mempelajari sejarah secara efektif. jakarta: pustaka jaya.
heller, a. (1982). a theory of history. london: routledge & kegan paul.
hendryadi. (2014). structural equation modeling dengan lisrel 8.80: pedoman untuk pemula. yogyakarta: kaukaba.
husbands, c., kitson, a., & pendry, a. (2003). understanding history teaching. philadelphia, pa: open university press.
ministry of education and culture. (2012).
pemikiran tentang pembinaan kesadaran sejarah.
kartodirdjo. (1986). ungkapan-ungkapan filsafat sejarah barat dan timur: penjelasan berdasarkan kesadaran sejarah. jakarta: gramedia.
kartodirdjo. (1992). pendekatan ilmu sosial dalam metodologi sejarah. jakarta: gramedia pustaka utama.
kartodirdjo. (1993). pembangunan bangsa. yogyakarta: aditya media.
khine (ed.). (2013). contemporary approaches to research in learning innovations: application of structural equation modeling in educational research and practice. rotterdam: sense.
kitson, a., husbands, c., & steward, s. (2011). teaching and learning history 11-18: understanding the past. new york, ny: open university press mcgraw-hill education.
kolbl, c., & straub, j. (2001). historical consciousness in youth. theoretical and exemplary empirical analyses. forum qualitative sozialforschung / forum: qualitative social research, 2(3). retrieved from http://www.qualitative-research.net/index.php/fqs/article/view/904.
korber, a. (2015). historical awareness, historical competencies – and beyond? some conceptual development within german history didactics. 56 s, urn: urn:nbn:de:0111-pedocs108118.
kreuzer, m. (2010). historical knowledge and quantitative analysis: the case of the origins of proportional representation. american political science review, 104(2), 369-392.
kuntowijoyo. (2013). pengantar ilmu sejarah. yogyakarta: tiara wacana.
kusnendi. (2008). model-model persamaan struktural. bandung: alfabeta.
latief, j. a. (2006). manusia, filsafat dan sejarah. jakarta: bumi aksara.
lukacs, j. (1968). historical awareness: remembered past. new york, ny: harper & row.
mardapi, d. (2007). teknik penyusunan instrumen tes dan nontes. yogyakarta: mitra cendikia.
mardapi, d. (2012). pengukuran, penilaian dan evaluasi pendidikan. yogyakarta: nuha medika.
margono, d. (2013). aplikasi analisis faktor konfirmatori untuk menentukan reliabilitas multidimensi. statistika, 13(1), 17-24.
mazabow, g. (2003).
the development of historical awareness in the teaching of history in south africa school (unpublished doctoral dissertation). university of south africa, pretoria.
paska, l. m. (2010). does film affect learning engagement?: historical inquiry and the document-based question in middle school social studies classroom (unpublished doctoral dissertation). albany university, new york, ny.
plomp, t., & van de wolde, j. (1982). the general model for systematical problem solving. in t. plomp (ed.), design of educational and training (in dutch). utrecht: lemma. faculty of educational science and technology, university of twente, enschede, the netherlands.
pompa, l. (1990). human nature and historical knowledge: hume, hegel and vico. cambridge: cambridge university press.
rosenlund, d. (2011). subject construction, assessment and alignment in history education. paper presented at the 39th congress of npfp/nera in malmo university, malmo, sweden.
russel, w. b., & pellegrino, a. (2008). constructing meaning from historical content: a research study. journal of social studies research, 32(2), 3-15.
schumacker, r. e., & lomax, r. g. (2010). a beginner’s guide to structural equation modeling (3rd ed.). new york, ny: routledge.
sjamsuddin, h. (2012). metodologi sejarah. yogyakarta: ombak.
suhartono. (2010). teori dan metodologi sejarah. yogyakarta: graha ilmu.
thorp, r. (2014). historical awareness, historical media, and history education. historiska medier. isbn 978-91-7601077-8 [electronic version].
topolski, j. (1976). methodology of history. (o. wojtasiewicz, trans.). boston, ma: d. reidel.
tosh, j. (1984). the pursuit of history: aims, methods, and new directions in the study of modern history (1st ed.). new york, ny: longman.
tosh, j. (2002). the pursuit of history: aims, methods, and new directions in the study of modern history (3rd ed.).
london: longman pearson education.
tucker, a. (ed.). (2009). a companion to the philosophy of history and historiography. chichester, west sussex: blackwell.
widhiarso, w., & mardapi, d. (2010). komparasi ketepatan estimasi koefisien reliabilitas teori skor murni klasik. jurnal penelitian dan evaluasi pendidikan, 14(1). retrieved from http://journal.uny.ac.id/index.php/jpep/article/view/1973.
wineburg, s. (2006). berpikir historis: memetakan masa depan mengajarkan masa lalu. (m. maris, trans.). philadelphia, pa: temple university.
wijanto, s. h. (2008). structural equation modeling dengan lisrel 8.8: konsep & tutorial. yogyakarta: graha ilmu.

how to cite item: lastariwati, b., pardjono, p., & sukamto, s. (2016). students’ entrepreneurial behavior in the application of ‘ekrenfatiha’ productive entrepreneurial teaching model at culinary programs vocational schools. research and evaluation in education, 2(1), 53-70. doi: http://dx.doi.org/10.21831/reid.v2i1.6528

research and evaluation in education
e-issn: 2460-6995
volume 2, number 1, june 2016 (pages 53-70)
available online at: http://journal.uny.ac.id/index.php/reid

students’ entrepreneurial behavior in the application of ‘ekrenfatiha’ productive entrepreneurial teaching model at culinary programs of vocational schools

1 badraningsih lastariwati; 2 pardjono; 3 sukamto
1,2,3 yogyakarta state university
1 badraningsih@yahoo.co.id; 2 jpardjono@yahoo.com; 3 otmakus2010@yahoo.co.id

abstract

the purpose of the study is to observe students’ entrepreneurial behaviors in the implementation of entrepreneurship processes in production subjects. the study applied a production entrepreneurial teaching model to vocational high school students, with ‘ekrenfatiha catering’ as their project.
the entrepreneurial process was integrated into the catering production subject and consisted of the following aspects: exploration, business plans, facilitation, action and output. the study employed the research and development approach, which referred to the plomp development model. the data were analyzed by using descriptive statistics. the research subjects were students of 1 sewon state vocational high school. the results of the test on the model implementation show the students’ mastery of entrepreneurial behaviors with the characteristics of responsibility, innovation, honesty, independence, creativity, leadership, diligence, discipline, cooperation, risk-taking and good communication. they also show that there is a concrete improvement during the continuous process in every observed entrepreneurial behavior, and in general the students’ entrepreneurial behaviors could be classified as good.

keywords: entrepreneurial behavior, productive entrepreneurial learning

introduction

one of the educational challenges in indonesia is the improvement of both vocational quantity and quality in order to meet local and national demands and to be able to compete globally. in addition, there is an expectation that education in indonesia should produce creative human resources to develop the creative economy and should design an effective vocational education for all vocational high schools and other vocational institutions. the reason is that vocational high school students are very close to employment, and one of the available forms of employment is entrepreneurship. as a result, entrepreneurship might become one of their career options (european commission enterprise and industry (ecei), 2009, p.35).
becoming entrepreneurs might decrease unemployment, which has been one of the urgent problems in indonesia. furthermore, it is also useful in terms of increasing the foreign exchange earnings and the prosperity of the country. it might also be useful for decreasing the level of poverty (macke & markey, 2003, p.1). this shows that entrepreneurship has great potential for the economy and for development. through the development of the creative economy, the government expects that entrepreneurship culture will be a part of indonesian working ethics in order to generate reliable, tough and independent entrepreneurs. the expectation is very important because entrepreneurial activities belong not only to the micro-economy setting but also to the macro-economy (centre of curriculum of the ministry of national education, 2010).

vocational education provides not only skills but also other relevant knowledge needed to earn a proper living. the function of vocational high schools in preparing the necessary labor covers two dimensions. the first is the quantitative dimension, related to the educational function of vocational schools to supply educated and skillful labor according to the demands of employment. the second is the qualitative dimension, which is to generate educated, trained and skillful labor that can be the motor of regional economic movement (directorate general of secondary education, 2011, p.73). in relation to these statements, vocational education is identical with learning to work. vocational education tries to improve one’s technique and position in his or her environment through technology mastery and, at the same time, to meet the needs of employment. as a result, vocational education is often regarded as making a strong contribution to the national economy.
the development of the creative economy in 2010-2014 was based on individual creativity, skills and talents in order to pursue creative economic ideas that will have great impacts on indonesian prosperity (directorate general of secondary education, 2011, p.54). education in indonesia is oriented toward habituation, empowerment and formation to bring about noble, honest and superior characteristics together with other life skills. the paradigm treats, facilitates and encourages students to be independent subjects with responsibility, creativity, innovation, support and entrepreneurship (ministry of education and culture, 2012, p.6; ministry of education and culture, 2012, p.2).

entrepreneurship in indonesia is still relatively behind other countries although the country has entered the age of knowledge and information. based on a calculation provided by the ciputra foundation, the number of entrepreneurs in indonesia is 400,000 people (around 0.18% of the total population) (ciputra, 2009, november 3). it is below the world’s standard figure. according to ciputra (2009, november 3) and moerdiyanto (2013, p.7), a country will be developed if the number of entrepreneurs is more than 2% of the total population. in order to improve the situation, the government should make serious efforts, and one of them is trying to achieve the demographic dividend. the demographic dividend is predicted to be achieved in 2020-2035 (ministry of education and culture, 2012, p.12). in order to make the best of it, from 2010 to 2035 indonesia should invest heavily in human resources development by establishing a universal secondary education, and entrepreneurship should be one of its components (ministry of education and culture, 2012, p.1). as mandated in the act of the republic of indonesia (2003) number 20 year 2003 about the national education system, the national long term development plan of the ministry of national education year 2010-2014 (ministry of national education, 2010, pp.1-2) emphasizes that the efforts to pursue the quality of human resources include science, technology, and the economic competitive edge. therefore, through entrepreneurial learning, students will be able to help achieve the targeted number of entrepreneurs in indonesia.

entrepreneurship-based education is an education that implements the principles and methodologies of value internalization in the students. it might be pursued through a curriculum integrated with the development occurring in the students’ neighborhood and through the use of teaching models and strategies that are relevant to the teaching objectives (winarno, 2008, p.124). badan nasional sertifikasi profesi (bnsp), or the national professional certification agency, states that entrepreneurship education might generate entrepreneurial behaviors and leadership characteristics which are closely related to the ways of managing a business, so that students will be able to start their own business independently (bnsp, 2006).

the objective of entrepreneurship programs in vocational high schools is basically to internalize entrepreneurial values through entrepreneurial habituation. according to the european commission enterprise and industry (ecei) (2009), the entrepreneurial paradigm and skills might be promoted through learning by doing (experiencing entrepreneurship in real practice). therefore, entrepreneurship is expected to be the life attitude and national character of indonesian people (ciputra, 2009, november 3). furthermore, entrepreneurial education is one of the best ways of supporting economic growth and creating more employment.
the idea is in accordance with the results of the latest research saying that 78% of the graduates from entrepreneurial education are able to work right upon graduation (directorate general for enterprise & industry of european commission, 2012, p.4).

nowadays, entrepreneurial learning in vocational high schools is one of the supporting theoretical training and education subjects. entrepreneurship currently takes up only around 1.93% of the total periods over six semesters. this is insufficient to create independence and an entrepreneurial spirit in the graduates. therefore, the design of entrepreneurial learning in vocational high schools should be reviewed on the following aspects: curriculum, learning strategy, learning method, media and teaching media (sarbiran, 2002). in order to improve the effectiveness of entrepreneurial encouragement for the students, a model should be developed for production entrepreneurial learning in the culinary vocational high schools.

thus, this research is focused on entrepreneurial education in culinary vocational high schools. the integration of entrepreneurial stages and culinary production training in education subjects is based on the nested approach with project-based learning. as a result, the students will be able to implement the entrepreneurial skills in real practice within their respective domain and to study several production skills at the same time. thereby, the teachers will be able to strengthen the entrepreneurial values, attitudes, and behaviors of the students. the objective of the research, then, is to develop entrepreneurial behaviors by implementing entrepreneurship processes in the production subjects. this research observes the implementation of entrepreneurship encouragement through a teaching model called ‘ekrenfatiha’ for the productive entrepreneurship subjects.
entrepreneurship is a value that should be actualized into behaviors that will be the source of energy, motor, objective, strategy, effort, process and business (sanusi, 1994). according to prawiro in the centre of curriculum, the ministry of national education (2010, p.16), entrepreneurship is a necessary value for starting and developing a business. entrepreneurship is a process of creating something new (being creative) and different (being innovative) which will provide benefits for the competitive edge. a similar understanding is given by kuratko and hodgetts (1989, pp.5-6) and also hisrich and peters (2002, p.42). they argue that entrepreneurship is a process of innovation and creation. entrepreneurship is a process of implementing creativity and innovation in solving problems and finding solutions for improving business opportunities (zimmerer, 1996, p.20). entrepreneurship refers to the values that form one’s character and behaviors in order to stay creative, empowered, noble and humble so that he or she will be able to improve his or her income within his or her business activities (centre of curriculum of ministry of national education, 2010, p.15). based on these definitions, it can be concluded that entrepreneurship is a process of implementing the values that form one’s character and behaviors and that foster creativity and innovation in solving problems and in finding opportunities.

kuratko and hodgetts (1989, p.6) explain that people who perform entrepreneurship are called entrepreneurs. entrepreneurs are motivators as well as creators (kao, 1991, p.191). entrepreneurs are also individuals with a strong, creative, and innovative edge, who master in-depth business knowledge and who act with the objective of creating a new business unit (suryana, 2003, p.10). meredith, in pusposutardjo (1999), provides the characteristics of an entrepreneur.
she writes that the characteristics of an entrepreneur are: (a) having self-confidence; (b) having an orientation toward task and output; (c) being willing to take risks; (d) having leadership; (e) having a future orientation; and (f) having originality. from the behavioral perspective, psychiatrists regard an entrepreneur as an achievement-oriented individual who has been stimulated to find new challenges and output. meanwhile, vesper (in winardi, 2004), in a very positive tone toward the market economy, regards entrepreneurs as the pillar of industry, and the motor and pioneer who constructively destroy the ‘status quo’.

as stated by muhadi and saptono (2005, p.15), several factors strongly influence students’ entrepreneurship spirit, such as the parents’ employment background, the family’s culture, and the training and education process at school. furthermore, the entrepreneurship spirit consists of two factors, namely personal values and orientation. personal values are the traits that consist of internal locus of control, creativity, independence and planning. orientation consists of achievement pursuit and moderate real risk-taking ability (noer & wirjodirjo, 2007, p.237). internal locus of control is the core of the entrepreneurship spirit (purnomo, 1999). the entrepreneurship spirit covers the following aspects: an independent attitude and paradigm, willingness to take risks, responsibility, always creating and improving the value of resources, openness to feedback, always looking for better chances, not being satisfied easily, always innovating and improvising for future improvement, and having good moral responsibility (suryana, 2003, p.10). the entrepreneurship spirit encourages someone to be interested in finding and managing a business professionally. the interest is accompanied by accurate planning and calculation.
kashmir (2007, p.17) states that entrepreneurship tries to find benefit and create profitable business opportunities. suffering loss is not a big problem since entrepreneurs are aware that loss is a natural consequence. they even believe that the bigger the risk, the bigger the profit they will get. in their opinion, they would not suffer any loss as long as they work with full encouragement and calculation. the mastery of entrepreneurship spirit is expected to combine motivation, vision, optimism, communication and courage in utilizing the business opportunity (suryana, 2003, p.13).

according to kuehl and lambing (1999, p.11), entrepreneurship is a creative business that establishes values from nothing which will be useful for everyone. furthermore, they state that successful entrepreneurship has four main elements, namely: (1) ability (related to skills and iq) in reading opportunities, innovations, management and sale; (2) encouragement (related to eq and mentality) in coping with fear, controlling risk and leaving the ‘comfort zone’; (3) determination (related to self-motivation) in the form of persistence, firmness and mind power; and (4) creativity that ‘generates’ inspirations for the ideas. ideas are used for finding opportunities based on the intuition he or she gets (related to experience).

on the other hand, the directorate general of educational quality improvement and education personnel (2010, pp.9-12) suggests that there are two types of entrepreneurship dimensions or characteristics, namely: (a) the fundamental quality, covering mindset, heartset and physical strength; and (b) the instrumental quality, which covers interdisciplinary mastery. therefore, it can be concluded that the fundamental quality of entrepreneurship, which covers mindset, heartset and physical strength, is essential in an entrepreneurial process.
the reason is that in an entrepreneurial process, an entrepreneur needs a mindset for establishing creative and innovative ideas. he or she also needs the mentality for building integrated, smart, dynamic, harmonious and flexible teamwork so that he or she can be self-confident and competitive on the basis of high solidarity. all of these aspects should be supported by good physical strength. such an entrepreneurial process involves more than the division of management positions.

according to hisrich, peters, and dean (2005, p.39), an entrepreneurial process has the following four different stages: (a) opportunity identification and evaluation; (b) business plan development; (c) decisions on necessary resources; and (d) output management. similarly, a review of the entrepreneurial process has also been provided by ciputra. in ciputra university, an entrepreneurial process has the following five stages: discovery, concept development, resources, action and harvesting. the five stages are implemented according to the national content standards for entrepreneurship education. these five stages, together with individual characters and entrepreneurial behaviors, become an entity in the entrepreneurship skills (consortium for entrepreneurship education, 2004, appendices). according to the consortium for entrepreneurship education (2004), an entrepreneurial process should go through the following stages: discovery (8 aspects), concept development (8 aspects), resourcing (7 aspects), actualization (11 aspects) and harvesting (4 aspects). in order to give a better understanding of the stages of an entrepreneurial process in the productive entrepreneurship learning model for the vocational high schools, the following terms are used: exploration, business plan, facilitation, action and output.

exploration

exploration is the stage of performing a creative and innovative mindset. a creative mindset is demanded for describing the future business situation and operation.
it will also help in providing vision that might not be made by an exploration toward a trend in the present time. the exploration stage covers the following aspects: exploration toward expectation and inspiration, idea establishment, and creative and innovative idea development.

business plan

a business plan is a stage of designing the creation, distribution, and transformation of the value proposition to the selected customer segments through several activities facilitated by the resources, with the objective of generating profits (osterwalder & pigneur, 2010). a business plan covers every single aspect that an entrepreneur would like to perform with his or her business and how these aspects might be implemented. the process of composing the aspects that will be involved in realizing the business ideas demands an understanding of what, why, who, how, where, when and how much the entrepreneurial efforts might be made. such a process urges the entrepreneurs to take decisions and to have a wider view of the ideas and of how to turn these ideas into a business. this stage also helps them to consider the areas that need more review (ehmke & akridge, 2005, p.1). the business plan stage covers the following aspects: decision on market targets, product type, product superiority, risk and opportunity, marketing strategy, capital source, and promotion strategy.

facilitation

the facilitation stage covers natural resources management, human resources management, space management, capital management, raw materials management, decision on production process, and decision on cost demands. according to soegoto (2010, p.199), human resources management is a series of organizational activities directed toward recruiting, developing and maintaining the existing employees in order to achieve the company’s objectives.

action

action means transforming business ideas into reality.
in the implementation, the role of an entrepreneur as the leader of a company strongly determines the business success. this stage consists of employee motivation, supervision, direction and coordination.

output

this stage is also known as evaluation because all of the output will be evaluated by companies. basically, the evaluation is focused on the comparison between the planning and the implementation. if there is a divergence, it would be evaluated. the output stage covers evaluation and reflection.

entrepreneurship is intertwined with vocational education as a whole, and entrepreneurial attitudes are maintained through the overall vocational education system (ecei, 2009, p.22). apart from the existing vocational domains, ecei (2009, p.7) states that the most effective way of teaching entrepreneurship is to have the students participate in practical projects and other learning activities that emphasize the ‘learning-by-doing’ principle. in this way, they will have actual experience. problem-based and experience-oriented education has a very important role in developing the entrepreneurial mindset and ability.

the efforts to introduce entrepreneurship as an explicit objective in a curriculum will become a clear signal that entrepreneurship is important for every student. in addition, an explicit place in the curriculum will make it easier for teachers to spend teaching periods on the subject. where entrepreneurship is not explicitly included in the curriculum, teachers who would like to join the students in entrepreneurial activities often have to prepare these activities outside the periods. the learning type should refer to the curriculum available to all students and should not depend on individual expectation, or on the initiatives of the teachers and the school.
several experts recommend that entrepreneurship be introduced as a compulsory item in the curriculum so that all students would be able to learn it (ecei, 2009, p.23). entrepreneurship education is not intended to be limited to general business or economics study because it is aimed at increasing creativity, innovation, and entrepreneurial efforts. in some cases, it is integrated into the compulsory curriculum; in others, it is an optional subject or an extracurricular activity provided by the schools (ecei, 2009, p.23). higher education should increase students’ awareness of entrepreneurship. the entrepreneurial mindset and skills might be well promoted through the principle of ‘learning-by-doing’ and practical entrepreneurial experiences (projects and practical activities) (commission of the european communities, 2006, p.4).

the learning model implemented in the study was project-based learning. project-based learning is an approach demanding that students construct a ‘bridge’ connecting the learning materials of multiple subjects. by doing so, they will be able to view the knowledge holistically. project-based learning is an in-depth investigation of a topic in the real world that is worth the students’ attention and efforts. project-based learning is an approach that pays attention to comprehension. the students explore, interpret and synthesize information in a meaningful manner. global school net (2000) reported the results of a study by the autodesk foundation regarding the characteristics of project-based learning.
it showed that project-based learning was an approach possessing the following characteristics: students made decisions on the framework; they proposed some problems or challenges; they designed a process to find the solution to the proposed problems or challenges; they were collaboratively responsible for accessing and managing information to solve the problems; the evaluation process was implemented continuously; the students periodically reflected on the activities they had performed; the final product of the learning activities would be evaluated qualitatively; and the learning situation was very tolerant of mistakes and changes.

the project-based learning approach was developed based on the philosophy of constructivist learning. constructivism develops a learning atmosphere that urges the students to arrange their own knowledge (bell, 1995, p.28). project-based learning provides freedom for the students to plan the learning activities, to implement a project collaboratively and to generate a product that might be presented to other people.

entrepreneurial behaviors

entrepreneurial characters, attitudes, spirit and values might appear in the form of entrepreneurial behaviors (suryana, 2003, p.6). behavior is a function of direct interaction between an individual and his or her orientation toward objectives. thus, behaviors are motivated by the desire to achieve certain objectives (winardi, 2004, p.32). according to bird and schjoedt (2009), behaviors are actions. therefore, entrepreneurial behaviors describe behaviors as the activities of individuals (businessmen). on the other hand, entrepreneurial behaviors are reflected in attitudes, interpersonal relationships, arrangement capabilities, marketing and finance (hawkins & turla, 1993, p.388).
lumpkin, cogliser, and schneider (2009, p.50) and also wiklund and shepherd (2003, p.1310) regard entrepreneurial behaviors as individual behaviors instead of company behaviors. entrepreneurial behaviors are proximal results of the business actor’s cognition and emotion. they are also an individual-centric proximal cause of business outputs. knowledge regarding entrepreneurial behaviors is important for educators, students, and media and creative workers because these behaviors are usually the results of creation and innovation (bird & schjoedt, 2009, p.352). entrepreneurial behaviors can also be defined as a study of the human behaviors involved in identifying and exploiting opportunities by creating and developing new enterprises (bird & schjoedt, 2009, p.353; carsrud, brannback, & brandt, 2009), and also in exploring and creating temporary opportunities in the emerging organizational process (gartner, carter & reynolds, 2010, p.99). entrepreneurial behaviors are also acknowledged as supporting social changes and facilitating innovation within an established organization (kuratko, ireland, covin, & hornsby, 2005, p.700).

from the explanation, it can be concluded that entrepreneurial behavior is a function of direct interaction between an individual and his or her environment. an individual’s behaviors are reflected in his or her attitudes when he or she would like to achieve certain objectives.

table 1.
the attitudes and the descriptions of entrepreneurship education value

| entrepreneurial behaviors | description |
|---|---|
| independent | not easily dependent on other people in accomplishing certain duties |
| creative | behaving and doing something in order to generate different ways or results from the ones that have been possessed |
| risk-taking | able to do challenging jobs and courageous to take risks |
| orientation to action | turning ideas into real actions |
| leading | open to criticisms and suggestions, sociable, cooperative and directing other people |
| honest | being a trustworthy person in terms of words, actions and jobs |
| disciplined | obedient toward multiple regulations and requirements |
| hardworking | performing actual efforts to accomplish tasks and overcome multiple drawbacks |
| cooperative | able to establish relationships with the other people in performing actions and duties |
| innovative | creative in solving problems and finding opportunities to improve and enrich life |
| responsible | able and willing to perform the given obligations and duties |
| persistent | not easily surrender in accomplishing an objective with multiple alternatives |
| communicative | happy in talking, socializing and cooperate with other people |

source: centre of curriculum of ministry of national education (2010, pp.10-11)

figure 1.
‘ekrenfatiha’ production entrepreneurship learning model for the culinary vocational high school (lastariwati, 2013)

[figure: flow diagram running from input (culinary production subject and entrepreneurship subject, with integrated indicators, objectives, topics and sub topics) through process (teacher as facilitator; student projects; culinary production training; preparation, implementation and evaluation across the eks-ren-fa-ti-ha stages) to output (students with entrepreneurship attitude and behavior).]

the development of the entrepreneurial production learning model for the culinary vocational high school is intended to form entrepreneurial behaviors, and it demands a learning condition that enables students to explore, comprehend and implement the entrepreneurial values and an independent attitude in a working situation. the situation serves as integrative media between hard skills and entrepreneurial skills. the mastery of entrepreneurial behaviors should be accompanied by feedback and support. positive habituation will help to form positive habits and behavior. the description of the ‘ekrenfatiha’ production entrepreneurship teaching model for the culinary vocational high school can be seen in figure 1.
method

the study was conducted in 1 sewon state vocational high school and the subjects were the students of grade xi of the culinary study program who were taking the continental cuisine class. a research and development (r & d) model was employed in the study. the main element of the model was the implementation of an integrated approach to entrepreneurship and production learning under project-based learning. the plomp model (1997) was employed for the development of the entrepreneurial learning model in the culinary vocational high school. this general problem-solving model for the domain of education consists of the following phases: preliminary investigation; design; realization or construction; test, evaluation, and revision; and implementation.

the instruments used for data collection consisted of two different parts. the first was the observation instrument for students’ entrepreneurial behaviors, with which trained enumerators observed and recorded the students’ entrepreneurial behaviors during the production entrepreneurship learning process conducted with the ‘ekrenfatiha’ model. the second was peer assessment, which represented the assessment of the students’ entrepreneurial behaviors by their colleagues during the learning process. both instruments were administered in every meeting in order to measure the dynamics between the students’ entrepreneurial attitudes and the entrepreneurial behaviors they performed during the learning process with the ‘ekrenfatiha’ model.

findings and discussions

table 2 presents the fluctuating changes in the students’ entrepreneurial behaviors in each meeting. the students had different projects in the basic competence examination (the examination was matched to the basic competence as the target of mastery). as a result, fluctuating changes occur, depending on the load of the projects that the students should finish. the ‘creative’ attitude is the most frequent trait.
the ‘independent’ attitude (11%) is the second, followed by ‘responsible’, ‘honest’, ‘leading’ and ‘hardworking’ (8%). in the fourth place are ‘innovative’, ‘disciplined’, and ‘cooperative’ (7%). in the fifth place are the ‘action-oriented’ and ‘persistent’ attitudes (5%). the ‘communicative’ and ‘risk-taking’ attitudes (4%) are in the sixth place. the last are ‘evaluative’ and ‘reflective’ (3%). the students’ courage to take risks and their ‘communicative’ attitude are low. in interviews, the students admitted that they were shy and afraid of making mistakes; the reason they gave was that they were not confident and were afraid of selling their products.

table 2. the frequency of entrepreneurial behaviors occurring in 1 sewon state vocational high school

| behaviors | mean |
|---|---|
| creative | 140.2 |
| innovative | 75.6 |
| independent | 120.6 |
| responsible | 89 |
| honest | 95.2 |
| leading | 96.8 |
| persistent | 57.2 |
| disciplined | 83.6 |
| cooperative | 76.6 |
| action-oriented | 58.8 |
| hardworking | 87.8 |
| communicative | 48.4 |
| risk-taking | 46.8 |
| evaluative | 35 |
| reflective | 28.8 |

source: lastariwati (2013)

table 3 clearly shows that the highest achievement is found in the fourth meeting. in the fourth meeting, the students were able to run the project and to meet the target given by the teacher, especially in terms of product sale. the ‘creative’ attitude is dominant, with 701 occurrences. most occurrences of this attitude are found in the action stage (35.38%). the ‘innovative’ attitude occurs 378 times, mostly in the action and exploration stages. on the other hand, the ‘independent’ attitude occurs mostly in the action (47.60%) and facilitation (26.70%) stages.
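the shares quoted above can be re-derived from the published figures. the sketch below is not part of the original study; it is a minimal check written under two assumptions: that each percentage in the ranking is a behavior's share of the summed table 2 means, and that the means were taken over five meetings (the text reports five rounds of assessment). the function names `behavior_shares` and `stage_share` are hypothetical helpers, and only the ‘creative’ and ‘independent’ rows of table 3 are copied in.

```python
# Mean occurrences per meeting, copied from table 2.
TABLE2_MEANS = {
    "creative": 140.2, "innovative": 75.6, "independent": 120.6,
    "responsible": 89.0, "honest": 95.2, "leading": 96.8,
    "persistent": 57.2, "disciplined": 83.6, "cooperative": 76.6,
    "action-oriented": 58.8, "hardworking": 87.8, "communicative": 48.4,
    "risk-taking": 46.8, "evaluative": 35.0, "reflective": 28.8,
}

# Assumption: the means were computed over five meetings.
N_MEETINGS = 5

# Occurrences per entrepreneurial stage, copied from table 3
# (E = exploration, B = business plan, F = facilitation, A = action, O = output).
STAGE_COUNTS = {
    "creative": {"E": 112, "B": 151, "F": 140, "A": 248, "O": 50},
    "independent": {"E": 66, "B": 35, "F": 161, "A": 287, "O": 54},
}


def behavior_shares(means):
    """Return each behavior's share (in percent) of the summed means."""
    total = sum(means.values())
    return {name: 100.0 * value / total for name, value in means.items()}


def stage_share(counts, stage):
    """Return the share (in percent) of one stage within a behavior's total count."""
    return 100.0 * counts[stage] / sum(counts.values())


if __name__ == "__main__":
    shares = behavior_shares(TABLE2_MEANS)
    # Ranking of behaviors by share, highest first.
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name:16s} {share:5.1f}%")

    # Total occurrences implied by the table 2 mean versus the table 3 row sum.
    print(round(TABLE2_MEANS["creative"] * N_MEETINGS),
          sum(STAGE_COUNTS["creative"].values()))

    # Stage shares quoted in the text for 'creative' and 'independent'.
    print(round(stage_share(STAGE_COUNTS["creative"], "A"), 2))
    print(round(stage_share(STAGE_COUNTS["independent"], "A"), 2))
    print(round(stage_share(STAGE_COUNTS["independent"], "F"), 2))
```

under these assumptions the rounded shares agree with the ranking given in the text, and the table 3 rows for ‘creative’ and ‘independent’ sum to five times the corresponding table 2 means.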
a similar situation is also found in the following attitudes, namely: ‘responsible’ (445 times), ‘honest’ (476 times), ‘leading’ (484 times), ‘persistent’ (286 times), ‘disciplined’ (418 times), ‘cooperative’ (383 times), ‘action-oriented’ (294 times), ‘hardworking’ (439 times), ‘communicative’ (242 times) and ‘risk-taking’ (234 times). the ‘evaluative’ and ‘reflective’ attitudes occur frequently in the action and output stages (tables 3 and 4). the changes of entrepreneurial behaviors in the action and output stages tended to be stable; the only significant improvement was found in the fifth meeting. the students’ ‘reflective’ attitude occurs more frequently than the ‘evaluative’ attitude.

the students’ mastery of entrepreneurial behaviors was also examined through the peer assessment during the learning process. peer assessment was an objective assessment conducted by the students’ colleagues on the mastery of entrepreneurial behaviors in a group. it was conducted at the end of the session. the peer assessment of the entrepreneurial behaviors was performed five times in the continental culinary class. changes of entrepreneurial behaviors occur in each meeting. table 5 shows the data on the students’ mastery of entrepreneurial behaviors in the implementation of the production entrepreneurship learning model. the changes of students’ entrepreneurial behaviors occur due to the different competencies that should be learned in each meeting (the project was prepared in accordance with the basic competence as the target of mastery). the changes depend on the load of the project that the students should run. the development of students’ entrepreneurial behaviors in 1 sewon state vocational high school is positive (table 6). the students’ colleagues in the project were involved in order to evaluate the mastery of entrepreneurial behaviors that they possessed.

table 3.
the frequency of entrepreneurial behaviors occurring per entrepreneurial stage in 1 sewon state vocational high school

| stage | creative | innovative | independent | responsible | honest | leading | persistent | disciplined | cooperative |
|---|---|---|---|---|---|---|---|---|---|
| e | 112 | 76 | 66 | 46 | 53 | 46 | 12 | 38 | 32 |
| b | 151 | 43 | 35 | 43 | 39 | 47 | 29 | 42 | 18 |
| f | 140 | 47 | 161 | 92 | 92 | 128 | 77 | 65 | 127 |
| a | 248 | 173 | 287 | 216 | 239 | 210 | 137 | 231 | 169 |
| o | 50 | 39 | 54 | 48 | 53 | 53 | 31 | 42 | 37 |

note: e = exploration; b = business plan; f = facilitation; a = action; o = output
source: lastariwati (2013)

table 4. the frequency of entrepreneurial behaviors occurring per entrepreneurial stage in 1 sewon state vocational high school

| stage | action-oriented | hardworking | communicative | risk-taking | evaluative | reflective |
|---|---|---|---|---|---|---|
| e | 18 | 34 | 13 | 26 | 9 | 11 |
| b | 19 | 17 | 7 | 35 | 21 | 11 |
| f | 85 | 97 | 56 | 38 | 28 | 13 |
| a | 149 | 244 | 130 | 111 | 58 | 62 |
| o | 23 | 47 | 36 | 24 | 59 | 47 |

note: e = exploration; b = business plan; f = facilitation; a = action; o = output

the mastery of entrepreneurial behaviors based on the peer assessment in the ‘exploration’ stage can be seen in figure 2. in this stage, the following behaviors were found: responsible, honest, independent, innovative and creative. among these behaviors, the ‘responsible’ attitude is the one the students master the most, with a mean score of 3.04, while the ‘creative’ attitude is the least mastered, with a mean score of 2.69. based on the five treatments observed, the development of mastery in the ‘exploration’ stage fluctuates. figure 2 makes clear that entrepreneurial behaviors occur variously in each stage. in the ‘exploration’ stage, the ‘responsible’ attitude shows the highest mastery, followed by the ‘honest’, ‘independent’, ‘innovative’, and ‘creative’ attitudes. from the peer assessment (figure 3), it is found that the students’ ‘leading’ attitude is a dominant aspect in the ‘business plan’ stage.
the ‘leading’ attitude is a process of self-directing to give instructions to or influence other people in a working group. during the treatment, the development showed a positive trend in each stage.

table 5. the assessment on the students’ entrepreneurial behaviors in continental culinary class

| entrepreneurial stage | behaviors | mean | st. dev | behaviors | mean | st. dev |
|---|---|---|---|---|---|---|
| exploration | creative | 2.69 | 0.11 | honest | 2.82 | 0.11 |
| | innovative | 2.77 | 0.09 | independent | 2.79 | 0.07 |
| | responsible | 3.04 | 0.05 | | | |
| business plan | leading | 2.79 | 0.05 | disciplined | 2.93 | 0.08 |
| facilitation | persistent | 2.84 | 0.08 | cooperative | 3.03 | 0.07 |
| | honest | 2.87 | 0.05 | responsible | 3.00 | 0.12 |
| action | action oriented | 2.71 | 0.06 | cooperative | 2.98 | 0.04 |
| | discipline | 2.88 | 0.09 | innovative | 2.83 | 0.1 |
| | responsible | 2.91 | 0.03 | honest | 2.94 | 0.09 |
| | communicative | 3.01 | 0.12 | hardworking | 2.72 | 0.09 |
| | risk-taking | 2.79 | 0.09 | independent | 2.84 | 0.08 |
| output | evaluative | 1.49 | 0.16 | reflective | 1.44 | 0.24 |

source: lastariwati (2013)

figure 2. the occurrence of entrepreneurial behaviors in the ‘exploration’ stage (peer assessment) (lastariwati, 2013)

figure 4 shows the profile of changes in the students’ entrepreneurial behaviors. although there is fluctuation in the ‘facilitation’ stage, the students’ overall entrepreneurial behaviors show a positive trend in their development. in the ‘action’ stage, the entrepreneurial behaviors that the students mastered were: ‘action-oriented’ (2.71); ‘disciplined’ (2.88); ‘cooperative’ (2.98); ‘innovative’ (2.83); ‘honest’ (2.97); ‘hard-working’ (2.72); ‘responsible’ (2.91); ‘communicative’ (3.01); ‘risk-taking’ (2.79); and ‘independent’ (2.84). the changes in the entrepreneurial behaviors are varied and fluctuating.

figure 3. the occurrence of entrepreneurial behaviors in the ‘business plan’ stage (peer assessment) (lastariwati, 2013)

figure 4. the occurrence of entrepreneurial behaviors in the ‘facilitation’ stage (peer assessment) (lastariwati, 2013)

figure 5.
the occurrence of entrepreneurial behaviors in the ‘action’ stage (peer assessment) (lastariwati, 2013)

in the ‘result’ stage, it is found that the ‘evaluative’ and ‘reflective’ attitudes get the lowest scores (figure 6). however, after the examination on the basic competencies, the students’ learning targets and projects met the requirements. the findings in the ‘output’ stage were strengthened by the final assessment from the teachers. in general, the mastery of entrepreneurial behaviors of the students of 1 sewon state vocational high school can be classified into several categories as presented in table 6.

the students’ mastery of entrepreneurial behaviors was also observed based on the self-evaluation sheet regarding the mastery of entrepreneurial behaviors in daily activities. subjectively, the students provided their opinion regarding the existence of the entrepreneurial behaviors and how they were applied in daily activities. on five occasions, they provided their opinion about the occurrence of evenly distributed entrepreneurial behaviors (figure 7). the ‘communicative’ attitude is dominant, with a mean score of 5.16±1.00 (table 7). in contrast to the results of the evaluation, they argued that they had applied the ‘evaluative’ and ‘reflective’ attitudes in their daily life.

figure 6. the occurrence of entrepreneurial behaviors in the ‘result’ stage (peer assessment) (lastariwati, 2013)

table 6. the classification of entrepreneurial behaviors mastery in 1 sewon state vocational high school

| no | entrepreneurial behavior classifications | f | % |
|---|---|---|---|
| 1 | poor | 2 | 1.3 |
| 2 | moderate | 22 | 13.8 |
| 3 | good | 117 | 73.1 |
| 4 | very good | 19 | 11.9 |
| | total | 160 | 100.0 |

source: lastariwati (2013)

table 7.
the evaluation of the entrepreneurial behaviors of the students of 1 sewon state vocational high school entrepreneurial behaviors mean entrepreneurial behaviors mean creative 4.24 disciplined 4.53 innovative 3.60 cooperative 4.89 independent 4.58 action-oriented 4.15 responsible 4.44 hardworking 4.87 honest 4.87 communicative 5.16 leading 3.70 risk-taking 4.59 persistent 4.10 reflective 3.95 evaluative 4.22 entrepreneurial ability 4.36 entrepreneurial concept drafting 4.11 mean score of entrepreneurial behaviors 4.37 source: lastariwati (2013) research and evaluation in education 66 − volume 2, number 1, june 2016 the entrepreneurial behaviors that have been reflected in the students of 1 sewon state vocational high school are ‘communicative,’ ‘cooperative’ and ‘honest’. the students realized that performing ‘honest’, ‘cooperative’ and ‘communicative’ attitudes was important. the statement does not imply that other behaviors are not important; instead, all of the entrepreneurial behaviors overall complete each other in order to form a better person. the mastery of entrepreneurial behaviors should be accompanied by feedback and support. a positive habituation would form a positive attitude. in the implementation of the extended field test (the examination of basic competence), various cases of occurences from observation and peer assessment are found. specifically, in ‘exploration’ stage, the following entrepreneurial behaviors are found: ‘creative’, ‘innovative’, ‘independent’, ‘responsible’ and ‘honest’. meanwhile, the entrepreneurial behavior found in the ‘business plan’ stage is ‘leading’. then, in the ‘facilitation’ stage, the following behaviors are found: ‘persistent’, ‘honest’, ‘responsible’, ‘disciplined’ and ‘cooperative’. next, in the ‘action’ stage, the following behaviors are found: ‘actionoriented’, ‘disciplined’, ‘cooperative’, ‘innovative’, ‘honest’, ‘hard-working’, ‘responsible’, ‘communicative’, ‘risk-taking’ and ‘independent’. 
finally, the ‘output’ stage is related to the ‘evaluative’ and ‘reflective’ attitudes. regarding the mastery of entrepreneurial behavior during the examination of basic competence, the ‘responsible’ attitude ranks first. in other words, most of the students show a ‘responsible’ attitude during the examination. being responsible refers to being aware of the need to accomplish the given tasks or duties correctly and punctually. the students believe that hard work is beneficial not only for themselves but also for the other parties involved. meanwhile, the ‘honest’ attitude is in second place. being honest during the examination is an important element that should be included in the learning process, because all of the students should possess honesty within themselves. furthermore, the ‘cooperative’ attitude ranks third. being cooperative refers to the ability to establish relationships with other people in carrying out certain actions and duties. being cooperative might also be a motivation for carrying out the given duties well and appropriately. in addition, the ‘disciplined’ attitude refers to any attitude that shows orderliness and obedience to regulations and requirements. in the examination of basic competence, the students did the projects that had been decided on and agreed with the teachers.

figure 7. the evaluation of the entrepreneurial behaviors of the students of 1 sewon state vocational high school (lastariwati, 2013)

in relation to the other behaviors, the ‘independent’ attitude refers to not being easily dependent on other people in accomplishing the given tasks. then, the ‘innovative’ attitude refers to the ability to employ creativity in order to solve problems and to find opportunities for improving and enriching life.
being creative describes the behavior of generating something new with an added value. as a result, this new product might be acknowledged by its users as an output of the students’ creation. in addition, communal creativity improves in each stage of the entrepreneurial process. meanwhile, the ‘reflective’ and ‘evaluative’ attitudes are the last behaviors in the list. after the examination of basic competence was completed, the students’ learning projects and targets were achieved. the ‘output’ stage was supported by the final assessment from the teachers. the changes in entrepreneurial behaviors also deserve attention in each stage of the entrepreneurial process. the mastery of the students’ entrepreneurial behaviors is supported by the students’ self-assessment, which reflects the entrepreneurial behaviors applied by the students in their daily life. the behaviors most represented in 1 sewon state vocational high school are ‘communicative’, ‘cooperative’ and ‘honest’.

conclusions and suggestions

conclusions

the conclusions of the expanded field test of the productive entrepreneurial learning model implemented in the culinary vocational high school are as follows. the expanded field test was performed in 1 sewon state vocational high school. the students’ entrepreneurial behaviors were observed during the implementation of the entrepreneurial process stages of the ‘ekrenfatiha’ model, integrated into culinary production learning in the continental cuisine class. the stages of the entrepreneurial process observed were: ‘exploration’, ‘business plan’, ‘facilitation’, ‘action’ and ‘output’.
in the classical (whole-class) implementation of the preliminary investigation, the entrepreneurial behaviors, ordered from the highest to the lowest frequency of occurrence, are: ‘responsible’, ‘innovative’, ‘honest’, ‘independent’, ‘creative’, ‘leading’, ‘persistent’, ‘disciplined’, ‘cooperative’, ‘risk-taking’, ‘independent’ and ‘communicative’. each of the entrepreneurial behaviors actually improved with continuous repetition, and in general, the students’ entrepreneurial behaviors can be classified as good.

suggestions

based on the research, it is recommended that: (1) the ‘ekrenfatiha’ production entrepreneurship learning model be implemented in all production subjects within the study programs of culinary and tourism vocational high schools, as well as in the fashion study program and the beauty study program. each of the projects might be customized according to the class conditions and the competences expected to be achieved. (2) the model be supported by all of the school members, so that its implementation is more effective. thereby, the integration of entrepreneurship culture at schools might grow well.

references

act of the republic of indonesia, no 20 year 2003 about national education system (2003).
bell, b.f. (1995). children’s science, constructivism and learning in science. victoria: deakin university press.
bird, b. & schjoedt, l. (2009). entrepreneurial behavior: its nature, scope, recent research, and agenda for future research. in a.l. carsrud & m. brännback (eds.), understanding the entrepreneurial mind: international studies in entrepreneurship, vol. 24 (pp. 327-358). heidelberg, ny: springer.
bnsp. (2006). standar isi untuk satuan pendidikan dasar dan menengah: standar kompetensi dan kompetensi dasar smk/mak [content standard for primary and secondary educational units: standard and basic competences]. jakarta: badan nasional sertifikasi profesi (bnsp).
carsrud, a., brännback, m.j.e, & brandt, k. (2009). motivations: the entrepreneurial mind and behavior. in a.l. carsrud & m. brännback (eds.), understanding the entrepreneurial mind: opening the black box (pp. 141-166). heidelberg, ny: springer.
centre of curriculum of the ministry of national education. (2010). bahan pelatihan penguatan metodologi pembelajaran berdasarkan nilai-nilai budaya untuk membentuk daya saing dan karakter bangsa: pengembangan pendidikan kewirausahaan [training materials on strengthening learning methodology based on cultural values to form the nation’s competence and character: entrepreneurial education development]. jakarta: pusat kurikulum badan penelitian dan pengembangan kementerian pendidikan nasional.
ciputra (2009, november 3). kewirausahaan harus menjadi karakter, kurikulum kewirausahaan diterapkan di sekolah tahun 2010 [entrepreneurship must be a character, entrepreneurship curriculum will be applicable at schools in 2010]. harian kompas. retrieved from http://cetak.kompas.com/read/xml/2009/11/03/04323917/kewirausahaan.harus..menjadi.karakter
commission of the european communities. (2006). implementing the community lisbon programme: fostering entrepreneurial mindsets through education and learning. brussels: communication from the commission to the council, the european parliament, the european economic & social committee, & the committee of the regions.
consortium for entrepreneurship education. (2004). national content standards for entrepreneurship education. retrieved on 12 november 2013 from http://www.entreed.org/standards_toolkit/index.htm.
directorate general for enterprise & industry of european commission. (2012). effects and impact of entrepreneurship programmes in higher education. brussels: european commission.
directorate general of educational quality improvement and education personnel. (2010). pembelajaran berbasis paikem (ctl, pembelajaran terpadu, dan pembelajaran tematik). materi pelatihan penguatan-penguatan pengawas sekolah [training materials on strengthening school supervisor]. jakarta: kementerian pendidikan nasional.
directorate general of secondary education. (2011). rencana strategis direktorat jenderal pendidikan menengah kementerian pendidikan dan kebudayaan 2010-2014 [strategic plans of the directorate general of secondary education of the ministry of education and culture 2010-2014]. jakarta: direktorat jenderal pendidikan menengah, kementerian pendidikan dan kebudayaan.
ehmke, c., & akridge, j. (2005). the elements of a business plan: first steps for new entrepreneurs. purdue extension ec-735. west lafayette, in: aicc-purdue university.
european commission enterprise and industry (ecei). (2009). best procedure project: entrepreneurship in vocational education and training (final report of the expert group). brussels: enterprise & industry cg, european commission.
gartner, w.b., carter, n.m., & reynolds, p.d. (2010). entrepreneurial behavior: firm organizing processes. in z.j. acs & d.b. audretsch (eds.), handbook of entrepreneurship research: an interdisciplinary survey and introduction (vol. 5, part 2). new york, ny: springer.
global school net. (2000). introduction to networked project-based learning. retrieved on 10 july 2011 from http://www.gsn.org/web/pbl/whatis.htm
hawkins, k.l., & turla, p.a. (1993). ujilah tingkah kecerdasan anda sebagai seorang wiraswatawan [test your entrepreneurial attitudes]. solo: dabara.
hisrich, r.d. & peters, m.p. (2002). entrepreneurship (5th ed.). boston, ma: mcgraw-hill/irwin.
hisrich, r.d., peters, m.p., & dean, a.s. (2005). entrepreneurship (6th ed.). new york, ny: mcgraw-hill/irwin.
kao, j.j. (1991). the entrepreneurial organization. new jersey, nj: prentice hall.
kashmir. (2007). kewirausahaan. jakarta: raja grafindo perkasa.
kuehl, c., & lambing, p. (1999). small business planning & management (3rd ed.). oak brook, il: the dryden press.
kuratko, d.f. & hodgetts, r.m. (1989). entrepreneurship: a contemporary approach. san diego, ca: harcourt college.
kuratko, d.f., ireland, r.d., covin, j.g., & hornsby, j.s. (2005). a model of middle-level managers’ entrepreneurial behavior. entrepreneurship theory and practice, 29(6), 699-716.
lastariwati, b. (2013). model pembelajaran kewirausahaan produktif ekrenfatiha untuk sekolah menengah kejuruan program studi pariwisata bidang keahlian tata boga [learning model of ekrenfatiha productive entrepreneurship for tourism program study in the expertise program of culinary in vocational high school] (doctoral dissertation). universitas negeri yogyakarta, yogyakarta.
lumpkin, g.t., cogliser, c., & schneider, d. (2009). understanding and measuring autonomy: an entrepreneurial orientation perspective. entrepreneurship theory and practice, 33(1), 47-69.
macke, d. & markley, d. (2003). readiness for entrepreneurship: tools for energizing entrepreneurship. center for rural entrepreneurship, no. 1. retrieved from http://www.tvaed.com/pdf/readiness_entreneurship.pdf
ministry of education and culture. (2012). pendidikan menengah universal (wajib belajar 12 tahun) [universal secondary education (12-year compulsory education)]. presented by the general director of secondary education in a national discussion forum of the national conference of education and culture in 2012 in sawangan, depok, indonesia.
ministry of national education. (2010). rencana pembangunan jangka menengah nasional kementerian pendidikan nasional 2010-2014 [national mid-term development planning of the ministry of national education in 2010-2014]. jakarta: kementerian pendidikan nasional.
moerdiyanto. (2013). peranan inkubator bisnis dalam pengembangan usaha mikro, kecil, dan menengah di indonesia [the role of business incubator in developing micro, small, and medium enterprise in indonesia]. unpublished speech text of inauguration. presented at the professor inauguration in universitas negeri yogyakarta, yogyakarta.
muhadi, fx., & saptono, l. (2005). jiwa kewirausahaan siswa smk: suatu survei pada tiga smk negeri dan tujuh smk swasta di diy [vocational high school (vhs) students’ entrepreneurial spirit: a survey on three state vhss and seven public vhss in yogyakarta]. jurnal widya dharma, 16(1), 15-28.
noer, b.a, & wirjodirjo, b. (2007). pola asuh orang tua yang membentuk jiwa wirausaha anak: sebuah studi pada mahasiswa teknik industri its surabaya [parental education which forms children’s entrepreneurial spirit: a study on the students of industrial engineering of its surabaya]. jurnal ekonomi dan manajemen, 8(2), 236-251.
osterwalder, a. & pigneur, y. (2010). business model generation: a handbook for visionaries, game changers, and challengers. hoboken, nj: john wiley & sons.
plomp, t. (1997). educational design: introduction. in t. plomp. educational and training system design: introduction. utrecht (the netherlands): lemmafaculty of educational science and technology, university of twente, netherland.
purnomo. (1999). modul kewirausahaan [entrepreneurship module]. jakarta: universitas terbuka.
pusposutardjo, s. (1999). pengembangan budaya kewirausahaan melalui matakuliah keahlian [developing entrepreneurial culture through expertise lecturing]. paper presented at the semiloka wawasan entrepreneurship. ikip yogyakarta, yogyakarta.
sanusi, a. (1994). menelaah potensi perguruan tinggi untuk membina program kewirausahaan dan mengantar kehadiran pewirausaha muda [analyzing higher education institutions potential to construct entrepreneurship program and bring out the presence of young entrepreneurs]. paper presented at seminar kewirausahaan, inkubator bisnis bandung, stmb-kadin jabar.
sarbiran. (2002). optimalisasi dan implementasi peran pendidikan kejuruan dalam era desentralisasi pendidikan [optimalization and implementation of the role of vocation in educational decentralization era]. speech presented at dies natalis xxxviii uny.
soegoto, e.s. (2010). entrepreneurship: menjadi pebisnis ulung [entrepreneurship: becoming skilled entrepreneurs] (revised ed.). jakarta: gramedia.
suryana. (2003). kewirausahaan: pedoman praktis, kiat, dan proses menuju sukses [entrepreneurship: a practical orientation, trick, and process to success]. jakarta: salemba empat.
wiklund, j. & shepherd, d. (2003). knowledge-based resources, entrepreneurial orientation, and the performance of small and medium-sized businesses. strategic management journal, 24, 1307-1314.
winardi, j. (2004). entrepreneur dan entrepreneurship [entrepreneur and entrepreneurship]. jakarta: prenada media.
winarno, a. (2008). pengembangan model pembelajaran internalisasi nilai-nilai kewirausahaan pada sekolah menengah kejuruan di kota malang [developing entrepreneurial values internalization learning model to vocational high schools in malang municipality]. jurnal ekonomi bisnis, 14(2), 124-131.
zimmerer, w.t. (1996). entrepreneurship and the new venture formation. new jersey, nj: prentice hall.
how to cite item: samritin, s., & suryanto, s. (2016). developing an assessment instrument of junior high school students’ higher order thinking skills in mathematics. research and evaluation in education, 2(1), 92-107. doi:http://dx.doi.org/10.21831/reid.v2i1.8268

research and evaluation in education, e-issn: 2460-6995
volume 2, number 1, june 2016 (pages 92-107)
available online at: http://journal.uny.ac.id/index.php/reid

developing an assessment instrument of junior high school students’ higher order thinking skills in mathematics

1 samritin; 2 suryanto
1 muhammadiyah university of buton; 2 yogyakarta state university
1 samritin55@yahoo.co.id; 2 suryauny@yahoo.com

abstract

this study is a research and development study. it aims to produce an instrument for assessing junior high school (jhs) students’ higher order thinking skills (hots) in mathematics. its procedure consists of nine steps: (1) constructing the test specification; (2) writing test items; (3) analyzing test items; (4) conducting the first tryout; (5) analyzing the results of the first tryout; (6) revising the test; (7) assembling the test; (8) conducting the second tryout; and (9) analyzing the results of the second tryout. the instrument content validity was obtained through a focus group discussion (fgd) forum and the delphi technique. the construct validity was established through the tryout data analysis. the instrument tryout was conducted twice, involving 264 participants in the first tryout and 821 participants in the second tryout. the results of the study indicate that the instrument for assessing jhs students’ hots in mathematics has met the validity and reliability criteria. from the results of the content validity analysis, it can be concluded that the instrument is valid, which is supported by item validity indices above 0.79.
from the results of the construct validity analysis, it can be concluded that the instrument is valid, as indicated by the value of χ² = 67.69 with p-value = 0.10 and root mean square error of approximation (rmsea) = 0.03, supported by a goodness of fit index (gfi) of 0.97, a normed fit index (nfi) of 0.95, and an adjusted goodness of fit index (agfi) of 0.95. the instrument reliability is 0.88. the developed instrument for assessing hots in mathematics consists of 12 items, each of which is an essay item. the test items have difficulty indices in the range of 0.30 ≤ p_i ≤ 0.70.

keywords: assessment instrument, higher order thinking, junior high school, mathematics

introduction

the development of thinking skills is an important aspect of education. byrnes (2008, p.42) states that in vygotsky's view, thinking skills develop from the lowest level to a higher level. therefore, schools are expected to facilitate the development of students' thinking skills from the lower level to the higher level. higher level thinking skills or higher order thinking skills (hots) can be defined as a cognitive process that involves analysis, synthesis, and evaluation (stanley & moore, 2010, p.10). a student who develops his or her hots will have analytical acuity, the ability to synthesize, and good evaluation capabilities.

hots in mathematics can be defined as the ability to perform mathematical processes, complex tasks, or math problems involving connection, problem solving, and mathematical reasoning. connection is the ability to see and create linkages among mathematical ideas, between mathematics and other subjects, and between mathematics and everyday life (kaur & lam, 2012, p.2). further, de lange (1999, p.15), atkin (2003, p.15), and shafer and foster (1997, p.1) classify connection as a second-level math skill.
connection abilities consist of: (1) the ability to make or explain mathematical relationships between concepts, between concepts of mathematics and the real world, or between mathematics and other disciplines; and (2) the ability to integrate information and choose different procedures or strategies in solving problems, or to offer more than one approach to solving a problem.

solving a problem means finding a way out of a difficulty, a way round an obstacle, attaining an aim which is not immediately attainable (polya, 1981, p.ix). to solve a problem means to find such an action (polya, 1981, p.117). a problem is a situation in which an individual or group is called upon to perform a task for which there is no readily accessible algorithm which determines completely the method of solution (lester, 1980, p.287). accordingly, a task is a problem when there is no readily accessible algorithm to reach the solution. in solving a problem, there may be more than one correct answer, and more than one strategy to reach it. the strategy or way to solve a problem can vary, but each way produces a correct solution.

mathematical reasoning involves gathering evidence, making conjectures, establishing generalizations, building arguments, and drawing logical conclusions (peressini & webb, 1999, p.156). formal mathematical reasoning includes reasoning or proof, that is, a logical conclusion based on assumptions and definitions. mathematical reasoning often begins with exploration, proceeds to making conjectures, and comes to a conclusion (nctm, 2000, p.342). reasoning is an important aspect of mathematics (nctm, 2009, p.402). mathematical reasoning is essentially about the development, justification, and use of mathematical generalization (russel, 1999, p.1). creating generalizations also enables problem solving, as generalizations help learners to see the underlying structure of the problem and the bigger class of problems or ideas that it instantiates.
therefore, mathematics teaching and assessment need to consider the development of students’ mathematical reasoning. complex problem solving and mathematical reasoning are classified as third-level math skills (the highest level skills) by atkin (2003, p.15). the achievement of this level (problem solving and mathematical reasoning) is seen from the students’ ability to mathematize, analyze, justify, communicate, interpret, develop their own models and strategies, and make arguments and generalizations.

the description above shows that the development of hots in mathematics can be facilitated through stimuli such as math problems that require students to analyze, reason, interpret, present ideas, and find and apply mathematical concepts. giving a variety of new problems will lead students to explore and synthesize concepts logically as creative steps that can lead them to find the right solution. of course, the problem or the question must be appropriate to the developmental level of the students.

the reform movement in mathematics education puts the emphasis on teaching for understanding and on the learning and assessment of hots (thomas, okten, & buis, 2002, p.1). this opinion emphasizes changes in mathematics teaching practices in order to facilitate the development of hots. it also emphasizes that the students’ hots should be considered in the assessment. the results of the assessment will have an impact on the implementation of the teaching and learning process. brookhart (2010, pp. 9, 12) states that assessing hots increases students' motivation and achievement. this statement shows the importance of assessing hots.

the results of a preliminary study conducted in 18 junior high schools (jhss) in the province of south east sulawesi in january-february 2012 found that hots in math has not been assessed.
this is evident from the skills demanded by the test items used in the schools: the preliminary study found no test items that require students’ hots in the jhs math tests. these findings indicate that the assessment instruments used in the classroom could not assess the students’ higher level skills, so these skills are not known. this indicates that the test results have not provided sufficient or maximum information about the students’ skills. the implication is that the improvement of the teaching process based on the test results is also not optimal.

this description indicates the importance of improving the quality of assessment systems. a good assessment system can provide good information to improve the teaching process. an assessment system is good if it is carried out in accordance with appropriate procedures/mechanisms, one of which is the use of an appropriate instrument. a test, as an instrument used to obtain information about the development of students’ competence, should be of good quality and be developed in accordance with the procedures of instrument development.

the preliminary study indicates that assessment in the classroom is still not well planned. the assessment instruments used in schools have not been well designed. the tests used to assess students’ learning outcomes are made without regard to test preparation: they are compiled without a test blueprint. in classroom assessment, the emphasis on the results of thinking dominates over the students’ thinking process. the tests used take the form of multiple-choice items more than other forms. in the field, it was found that 33% of the schools in the preliminary study used multiple-choice test items only, 16.67% of the schools used essay test items only, and 66.67% of the schools used essay items, which made up at most 20% of the test.
the multiple-choice test is the most powerful tool to measure students’ mastery of the subject matter. this form can also be used to measure students’ competencies at a higher level. however, the use of multiple-choice test items emphasizes only the results, while the students' thinking processes cannot be known. in addition, it is not known whether a student’s response is a result of thinking or of guessing. the use of multiple-choice tests also means that students do not get used to describing their answers or giving arguments in solving a problem. for these reasons, variation in the test type is needed.

in addition to variation in test types, the quality of the test items should be considered in the assessment. to determine the quality of a test, test item analysis is required. the analysis of the test items is one important aspect of the implementation of the assessment. the results of the test item analysis provide information about the quality of the tests used. if the test items do not have good parameters, they cannot provide good information as desired. the preliminary study also found that all of the schools studied had daily test documents containing an analysis of the data on students’ mastery learning, but none of the schools had documents on test item analysis. this means that these schools have evidence of the students’ development, but this evidence is not supported by assessment tools of known quality, that is, with known test item parameters.

these facts do not show the real condition of assessment throughout indonesian schools, but they do show that there are many problems associated with the implementation of assessment in the classroom which need to be resolved. the development of assessment instruments of students’ hots in mathematics becomes an important issue to discuss.
this is because tests that are able to assess students’ hots in mathematics, have evidence of validity, are reliable, and have good item parameters have not yet been developed. to resolve these problems, systematic development research is required. the result of this development study is an instrument to assess students’ hots in mathematics. it will have implications for improving the quality of teaching in the classroom. the hots assessment results can be used to plan the subsequent teaching and learning, which can facilitate the development of students' competence to a high level. the results of this assessment will also encourage students to study harder in order to be able to solve new challenging problems. the habit of solving new challenging problems can improve students’ hots.

based on the background mentioned earlier, the research problem can be formulated as ‘what is the result of the development of assessment instruments of junior high school students’ hots in mathematics like?’ in line with the formulated problem, the purpose of this research is to produce an assessment instrument of junior high school students’ hots in mathematics. the results of this study are expected to: (1) add insight into the theory of hots and the development of assessment instruments of hots in mathematics, (2) increase the number of standard instruments that can be used by teachers to assess students' skills in junior high school mathematics, and (3) be a reference for researchers conducting similar studies or extending the research.

method

type of research

this was a development study, which was aimed at producing an instrument for assessing junior high school students’ hots in mathematics.
procedure of development

the development procedure used in this study referred to the instrument development procedure proposed by mardapi (2008, p.88), consisting of nine steps: (1) developing test specifications, (2) writing test items, (3) reviewing the test items, (4) doing the test tryout, (5) analyzing the test, (6) improving the test, (7) assembling the test, (8) carrying out the test, and (9) interpreting the test results. in this study, the development was divided into two stages: the design phase and the test tryout phase. the design phase included the activities of the first to the third step, and the test tryout phase included the activities of the fourth to the ninth step. in the tryout phase, the fourth step was called the first tryout and was followed by analysis, while the eighth step was called the second tryout and was followed by analysis.

the design phase

at this stage, the activities carried out were (1) developing a test specification, (2) writing test items, and (3) examining and repairing the test items.

developing test specification

the test specification contained a description of the overall characteristics of the test. this step included the activities of (a) specifying the purpose of the test, (b) designing the test blueprint, and (c) selecting the test form. the test specification served as a practical manual for test developers to plan the content of the subjects tested, the aspects of behavior to be measured, the test form, and the test length. the test blueprint was presented in the form of a matrix containing the material tested, the measurable aspects of behavior, and the cognitive levels to be measured. the cognitive aspect to be measured in this study was determined based on the operational definition of hots in mathematics. the cognitive aspect consisted of (1) connection (l2) and (2) problem solving and mathematical reasoning (l3).
the aspects of content or material were determined based on a study of the mathematics that supported the achievement of the competency standards (cs) and basic competences (bc) in the content standards of mathematics education for grade eight students of junior high school. the material tested, the cognitive behavior, and the cognitive aspect to be measured are outlined in table 1. the test developed in this study was an essay test.

test items writing

at least one test item was written for each indicator, tailored to the cognitive aspect measured. the writing of the test items also considered their compliance with the hot test criteria, namely using new materials (novelty), as brookhart (2010, p.25) writes, requiring complex thought, and using simple sentences that clearly target the question and are written in good, grammatical indonesian. each item was accompanied by an answer key and scoring guidelines or rubrics. the result at this stage was called the initial draft or draft-1.

reviewing and improving test items

the test items that had been written (draft-1) were reviewed by experts through a focus group discussion (fgd). the experts were four mathematics education experts and four experts in the field of measurement. this activity was intended to obtain content validity and was conducted on july 26, 2012 at the graduate school of yogyakarta state university, indonesia. the review covered the test blue print, answer key, and scoring guidelines. in general, the review of the test items addressed test content, test item construction, and language aspects. the judgement and input from the experts, in both oral and written forms, were subsequently analyzed. based on the results of the review, draft-2 was obtained. draft-2 was reviewed by experts using the delphi technique. the review involved five mathematics education experts.
at this stage, a valid test was obtained, and it was called draft-3. draft-3 was then assessed quantitatively by six experts. the results of this quantitative assessment were analyzed and a validity index was obtained. after being assessed by the experts, the test items were assembled into a test package, in which the items were arranged from easy to difficult. this arrangement was intended to reduce the anxiety of the tryout participants. the assembled test was subsequently tried out to obtain its empirical characteristics.

table 1. blue print

materials | bc | indicator | level
operations on algebra | 1.1 | solving problems using operations on algebra | l2
relations and functions | 1.3 | solving problems associated with relations and functions | l2, l3
function value | 1.4 | solving problems related to the value of the function | l2, l3
straight line equation | 1.6 | solving problems related to the gradient, equations, and graphs of straight lines | l2
linear equation systems with two variables (lestv) | 2.2 | assessing mathematical models of the problems associated with lestv | l3
 | | making a mathematical model of the problems associated with lestv | l3
 | 2.3 | interpreting lestv into real-world situations | l3
 | | solving problems related to lestv | l3
 | | assessing the correctness of the solution of lestv | l3

tryouts phase

design, subjects, and tryout schedule

the tryout of the product consisted of two phases: the first tryout and the second tryout. these activities were carried out in the province of southeast sulawesi. the subjects in this study were class viii students of junior high school. they were selected based on the needs of the development. they were students of international-standard pioneering schools or rintisan sekolah bertaraf internasional (rsbi)/ex-rsbi and of national-standard schools or sekolah standar nasional (ssn).
the number of schools used was adjusted to the number of tryout subjects required to take the tests. crocker and algina (1986, p.322) state that the acceptable number of tryout participants is 200. moreover, muraki and bock (1998, p.35) explain that a good minimum for restricted testing activities is 250 people, and that for general purposes a minimum of 500 people is required.

the tryout was conducted twice and was preceded by a readability test, which involved 10 year eight students of junior high school and a mathematics teacher of junior high school (jhs) 6 raha. the first tryout was conducted in the regency of baubau, southeast sulawesi. this activity involved 264 year eight students of state jhs 1 and state jhs 2 baubau, and was conducted on november 23 to 25, 2013. the first tryout results were analyzed and used as the basis for revising the instrument. the test that had been revised based on the first tryout data analysis was then tested on a larger scale. this was the second tryout, which was held on december 2 to 10, 2013. this activity involved 821 eighth grade students from four schools, namely state jhs 2 baubau and state jhs 1, state jhs 2, and state jhs 3 raha. the analysis of the data from the second tryout was intended to see whether the test satisfied the specified criteria. the results showed that it did. therefore, the test was not revised and became the final draft.

data type, instruments and data collection techniques

the data in this study were quantitative, in the form of scores given to students’ responses to the test being tried out. the instrument used to obtain the data was developed through the process of expert judgement and was revised based on the experts’ suggestions.
data analysis techniques

the qualitative data obtained from the experts were analyzed to answer the question of ‘whether the product was valid or not’. the instrument validation results based on the experts’ judgement were analyzed qualitatively, and the instrument was revised where necessary. the quantitative data obtained from the experts’ assessment of the test items were analyzed using aiken’s validity index formula (aiken, 1985, p.132), and the results were used as quantitative evidence of content validity. aiken (1985, p.134) sets the lowest acceptable value of the validity index depending on the number of experts and the number of criteria used. the lowest value for six experts and five criteria is 0.79.

the quantitative data from the test tryouts were used to answer the question of ‘whether the test satisfied the criteria of construct validity, reliability, and item parameters’. the validity of the test based on empirical data was analyzed using confirmatory factor analysis (mueller, 1996, p.112). the instrument was considered valid if the model fit the data. the decision that the model fit the data was based on: (1) a p-value of chi-square (χ2) > 0.05, and (2) a root mean square error of approximation (rmsea) < 0.5 (schumacker & lomax, 2004, p.82). the minimum instrument reliability coefficient used was 0.7 (nunnally, 1981, p.245; urbina, 2004, p.137). the instrument tryout results were scored by two assessors. inter-rater consistency was calculated using cohen’s kappa formula, and instrument reliability based on the verified data was calculated using cronbach’s alpha formula.

the item parameters of the instrument according to ctt are the difficulty and discrimination indices. however, the discrimination index of a criterion-referenced test does not affect the quality of the test, so the only item parameter considered in this study was the difficulty index.
the criterion used for the item difficulty index was 0.3 to 0.7 (allen & yen, 1979, p.121). the estimation of the test item difficulty (pi) used the formula by nitko and brookhart (2007, p.324), as follows:

$$p_i = \frac{\text{average score of item} - \text{minimum score of item}}{\text{maximum score of item} - \text{minimum score of item}}$$

findings and discussion

the indicators derived from the selected basic competencies were entered in the instrument blue print, which served as the reference for writing the instrument items. the items written make up draft-1 of the instrument. draft-1 consists of 18 items, each in the form of an essay test.

instrument validation results

the initial draft of the instrument, draft-1, was handed over to the experts for review. the review was conducted through a focus group discussion (fgd) forum. it involved eight experts: four experts in mathematics education and four experts in educational measurement. in this activity, the experts reviewed the draft instrument, including the scoring rubric and the blue print. in general, the experts’ suggestions were: (1) a few items must be replaced or revised because the problems do not meet the criteria of a hot test in mathematics; and (2) a few items need to be revised to make them better suited to the conditions of the students. based on the experts’ suggestions, the instrument was revised. the results of this revision were then discussed again by the experts through the delphi technique. the discussion was conducted several times with each expert to obtain a valid test. the discussion at this stage resulted in draft-3 of the instrument. draft-3 consisted of 14 items that had been declared valid by the experts. draft-3 was produced through the delphi technique and was assessed by six experts.
the scores given were based on the experts’ view of the relevance of each item to its indicator. there were five criteria used in the assessment: the score of 1 if the item was not relevant, the score of 2 if the item was not relevant, the score of 3 if the item was less relevant but could be used, the score of 4 if the item was relevant, and the score of 5 if the item was very relevant. based on the results of the quantitative judgement, the validity index of each item was calculated using the formula of aiken (1985, p.132). to obtain aiken’s validity index, the number of assessors choosing the i-th criterion was first counted for each item, and then aiken’s v index was calculated. the results of the calculation are presented in table 2.

table 2. the number of assessors on each criterion and aiken’s v index for each item (columns 1 to 5 are the criteria)

item | 1 | 2 | 3 | 4 | 5 | sum of experts | v
1 | - | - | - | 3 | 3 | 6 | 0.875
2a | - | - | - | 4 | 2 | 6 | 0.833
2b | - | - | - | 4 | 2 | 6 | 0.833
3a | - | - | - | 4 | 2 | 6 | 0.833
3b | - | - | - | 2 | 4 | 6 | 0.917
4 | - | - | - | 2 | 4 | 6 | 0.917
5 | - | - | - | 3 | 3 | 6 | 0.875
6 | - | - | - | 4 | 2 | 6 | 0.833
7 | - | - | - | 3 | 3 | 6 | 0.875
8 | - | - | - | 4 | 2 | 6 | 0.833
9 | - | - | - | 2 | 4 | 6 | 0.917
10 | - | - | - | 1 | 5 | 6 | 0.958
11 | - | - | - | 2 | 4 | 6 | 0.917
12 | - | - | - | 2 | 4 | 6 | 0.917

table 2 shows that the validity index of each item is above the specified minimum criterion of 0.79 (aiken, 1985, p.134). this criterion was established by aiken for six experts and a five-point scale. based on the index of every item, it was concluded that the instrument was valid. therefore, it was determined that the instrument was ready to be tried out.

product tryout results

the first tryout results

the results of the first tryout were scored using a scoring rubric. each item had its own scoring rubric. in developing the scoring rubric, fairness was considered so that students were not disadvantaged in the scoring. the scoring rubric used was of the analytic form. before being used, the scoring rubrics were reviewed by the experts through focus group discussions and the delphi technique.
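the v indices reported in table 2 can be reproduced with a short calculation. the sketch below is not the authors’ procedure; it assumes the usual form of aiken’s v, that is, the sum over experts of (rating minus lowest rating) divided by n(c - 1), with n experts and c rating categories. the function name `aiken_v` and the dictionary layout are illustrative, and the placement of the counts under criteria 4 and 5 is inferred from the reported v values.

```python
def aiken_v(rating_counts, lowest=1, highest=5):
    """Aiken's V for one item.

    rating_counts maps a rating (1..5) to the number of experts who gave it.
    V = sum(count * (rating - lowest)) / (n * (highest - lowest)).
    """
    n = sum(rating_counts.values())  # number of experts
    s = sum(count * (rating - lowest) for rating, count in rating_counts.items())
    return s / (n * (highest - lowest))


# Counts for a few items of table 2 (assumed to sit under criteria 4 and 5).
table2_counts = {
    "1": {4: 3, 5: 3},
    "2a": {4: 4, 5: 2},
    "3b": {4: 2, 5: 4},
    "10": {4: 1, 5: 5},
}

MINIMUM_V = 0.79  # minimum for six experts and five criteria, as cited in the text

for item, counts in table2_counts.items():
    v = aiken_v(counts)
    print(item, round(v, 3), "valid" if v > MINIMUM_V else "below minimum")
```

for item 1, for example, the calculation is (3 × 3 + 3 × 4) / (6 × 4) = 21/24 = 0.875, which matches the value in table 2.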
the scoring was performed by two assessors. the use of two assessors was intended to avoid the effects of misinterpretation of the students’ answers, fatigue, and other factors. the students’ answers were scored one item at a time. this resulted in two sets of scores. therefore, to produce a single set of scores, the two assessors verified the data. verification was performed on the scores that differed between the assessors, who reviewed the answer sheets together. based on the results of the verification, accurate score data were obtained. the verified scores were then used to analyze the item parameter, reliability, and validity based on empirical data.

item parameter on the first tryout

the difficulty and discrimination indices are two item attributes in ctt analysis. however, in a criterion-referenced test, the discrimination index is not considered in the selection of items. thus, in this study, the only parameter analyzed was the item difficulty index (pi). the results of the analysis of the item difficulty of the instrument are presented in table 3.

table 3. test item difficulty indices on the first tryout

no | item | difficulty (pi)
1 | 1 | 0.34
2 | 2a | 0.55
3 | 2b | 0.59
4 | 3a | 0.07*
5 | 3b | 0.06*
6 | 4 | 0.32
7 | 5 | 0.31
8 | 6 | 0.30
9 | 7 | 0.45
10 | 8 | 0.40
11 | 9 | 0.43
12 | 10 | 0.35
13 | 11 | 0.33
14 | 12 | 0.35

table 3 shows that two items are too difficult for the students. this is seen from their difficulty indices, each of which is less than 0.3: item 3a has an index of 0.07 and item 3b has an index of 0.06. the other test items have difficulty indices in the range of 0.30 to 0.7.
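the difficulty index and the 0.3 to 0.7 criterion can be illustrated with a minimal sketch. it assumes that the index is the average item score minus the minimum item score, divided by the maximum minus the minimum item score, as in the formula cited from nitko and brookhart. the function names and the five example scores are hypothetical; the indices in `table3_indices` are copied from table 3.

```python
def difficulty_index(scores, min_score, max_score):
    """p_i = (average item score - minimum score) / (maximum score - minimum score)."""
    average = sum(scores) / len(scores)
    return (average - min_score) / (max_score - min_score)


def items_outside_range(indices, low=0.3, high=0.7):
    """Return the items whose difficulty index falls outside [low, high]."""
    return [item for item, p in indices.items() if not (low <= p <= high)]


# Hypothetical scores of five students on one item scored from 0 to 3.
example_scores = [0, 1, 3, 2, 1]
print(round(difficulty_index(example_scores, 0, 3), 2))

# Difficulty indices of the first tryout, taken from table 3.
table3_indices = {
    "1": 0.34, "2a": 0.55, "2b": 0.59, "3a": 0.07, "3b": 0.06,
    "4": 0.32, "5": 0.31, "6": 0.30, "7": 0.45, "8": 0.40,
    "9": 0.43, "10": 0.35, "11": 0.33, "12": 0.35,
}
print(items_outside_range(table3_indices))
```

with the values of table 3, only items 3a and 3b fall outside the accepted range, which agrees with the two items marked in the table.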
a closer examination of the items with difficulty indices of less than 0.3 showed that item 3a was answered by only 51 or 19.3% of the tryout participants. for this item, 48 participants obtained a score of 1, three participants obtained a score of 2, and none of the participants achieved the maximum score of 3. item 3b was answered by only 46 or 17.4% of the tryout participants. for this item, 42 participants obtained a score of 1, four participants obtained a score of 2, and none of the participants achieved the maximum score of 3. therefore, items 3a and 3b were removed from the test package and were not included in the subsequent analysis.

instrument reliability in the first tryout

the reliability of an instrument is related to measurement error. scoring conducted by more than one rater on the same instrument gives high reliability if the consistency of scoring is high. this means that the inter-rater measurement error reflects the magnitude of the inconsistency between the scores given by the two scorers. the reliability of the instrument in this study was examined through cohen’s inter-rater kappa coefficient (measure of agreement kappa) and cronbach’s alpha. technically, the reliability coefficients were estimated with the help of spss. the results of the inter-rater agreement calculations are presented in table 4. table 4 shows that the consistency of measurement by the two scorers is very high. this is seen from the lowest kappa coefficient, which is 0.950 for item 6. the reliability of the instrument in the first tryout, estimated from the verified data, was α = 0.87. this coefficient is higher than the required minimum reliability coefficient of 0.7 for a good instrument (nunnally, 1981, p.245; urbina, 2004, p.137). thus, based on the results of the first tryout, it is concluded that the instrument is reliable.
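the study estimated both coefficients with spss. as an illustration only, the two statistics can be sketched in plain python under their textbook definitions: cohen’s kappa as (observed agreement - chance agreement) / (1 - chance agreement), and cronbach’s alpha as k/(k - 1) × (1 - sum of item variances / variance of total scores). the function names and the small data sets below are hypothetical and are not the tryout data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same answers on one item."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from the marginal score frequencies.
    expected = sum(count_a[score] * count_b[score] for score in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(score_matrix):
    """Cronbach's alpha; rows are students, columns are items."""
    k = len(score_matrix[0])
    item_variances = [sample_variance([row[j] for row in score_matrix]) for j in range(k)]
    total_variance = sample_variance([sum(row) for row in score_matrix])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores given by two raters to eight answers on one item.
rater_a = [0, 1, 2, 3, 1, 0, 2, 3]
rater_b = [0, 1, 2, 3, 1, 0, 2, 2]
print(round(cohen_kappa(rater_a, rater_b), 3))

# Hypothetical verified scores of four students on three items.
scores = [[1, 2, 1], [2, 3, 2], [3, 3, 3], [0, 1, 1]]
print(round(cronbach_alpha(scores), 3))
```

a kappa close to 1 means that the two raters almost always assign the same score, which is the situation reported in table 4.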
table 4. kappa coefficients of each item based on the results of the first tryout

no. | item | kappa coefficient
1 | 1 | 0.983
2 | 2a | 0.976
3 | 2b | 0.971
4 | 4 | 0.978
5 | 5 | 0.961
6 | 6 | 0.950
7 | 7 | 0.961
8 | 8 | 0.973
9 | 9 | 0.988
10 | 10 | 0.963
11 | 11 | 0.961
12 | 12 | 0.967

analysis of instrument validity based on the data of the first tryout results

the construct validity of the instrument was analyzed using factor analysis. the factor analysis used in this study was confirmatory factor analysis (cfa), conducted using lisrel. the data used in the cfa were the verified data. cfa was conducted after the result of the normality assumption testing was obtained. the normality testing results indicate that the data have univariate non-normality, shown by the p-value of skewness and kurtosis of each variable (instrument item). all of the test items have a p-value = 0.00 < 0.05. the results of the multivariate normality tests also showed a p-value = 0.00 for skewness and kurtosis. this indicates that the data have multivariate non-normality. to perform factor analysis in lisrel on data that do not meet the normality requirement, additional data in the form of an asymptotic covariance matrix (acm) are required in order to obtain unbiased estimation results (schumacker & lomax, 2004, p.34).

the validity analysis based on empirical data from the first tryout was conducted without items 3a and 3b because these items had been removed based on the analysis of the item parameter according to ctt. the results of the validity analysis showed χ2 = 65.51 with a p-value = 0.14 and rmsea = 0.04, which indicates that the model fits the data. that the model fits the data means that the instrument is valid based on the empirical data of the first tryout. figure 1 and figure 2 show the fulfillment of these criteria.
the results of the analysis are shown in figure 2, which also shows that the items of the instrument have a significant relationship with hots. figure 1 shows that the lowest loading factor of the instrument items on the first tryout is 0.46 and the highest is 0.7. figure 2 also shows that the loading factor of each item is significant at α = 0.05. this is indicated by the minimum item t-value of 5.72, which is more than the critical value of t at α = 0.05, namely 1.96. so, the correlation of the instrument items to hots is significant.

figure 1. loading factor of the instrument items based on the first tryout data

results of the second tryout

analysis of the instrument item parameter on the second tryout

the results of the analysis of the item difficulty parameter based on data from the second tryout are presented in table 5.

table 5. item difficulty indices based on the second tryout

no | item | difficulty (pi)
1 | 1 | 0.35
2 | 2a | 0.59
3 | 2b | 0.61
4 | 3 | 0.32
5 | 4 | 0.33
6 | 5 | 0.36
7 | 6 | 0.49
8 | 7 | 0.41
9 | 8 | 0.48
10 | 9 | 0.37
11 | 10 | 0.33
12 | 11 | 0.39

table 5 shows that the difficulty parameters of all test items are in the range of 0.30 ≤ pi ≤ 0.7, which means that all items have a good parameter. the easiest instrument items have difficulty indices of 0.59 and 0.61. those items are items 2a and 2b. both of these items were formulated from the indicators derived from basic competency (bc) 1.3, that is, to understand relations and functions. in teaching and learning processes, this basic competency is developed through learning the subject of relations and functions. this means that both items were formulated to measure the hots of junior high school students on the material of relations and functions.

the most difficult instrument items developed in this study have difficulty indices of 0.32 and 0.33. the item that has the difficulty index of 0.32 is item 3.
the items that have the difficulty index of 0.33 are items 4 and 10. these items are formulated from three indicators derived from two different basic competencies. item 3 is formulated from the indicators derived from bc 1.3, i.e. to understand relations and functions, and measures jhs students’ hots on relations and functions. item 4 is formulated from the indicators outlined in bc 1.4, i.e. to determine the value of a function. to achieve this competence, the students learned the material on relations and functions. item 4 is thus formulated to measure jhs students’ hots in relations and functions. item 10 is formulated from indicators derived from bc 2.3, i.e. to solve mathematical models of the problems associated with the system of linear equations with two variables. to reach bc 2.3, the students have to learn systems of linear equations with two variables. item 10 is used to measure jhs students’ hots in systems of linear equations with two variables.

figure 2. results of t-value estimation based on the first tryout data

the instrument developed in this study consists of four items on the connection level (l2) and eight items on the problem solving and mathematical reasoning level (l3). the items on l2 are items 1, 3, 4, and 6. item 1 is formulated from indicators derived from bc 1.1, item 3 from bc 1.3, item 4 from bc 1.4, and item 6 from bc 1.6. the items on l3 are items 2a, 2b, 5, 7, 8, 9, 10, and 11. the items on l3 are formulated from the indicators derived from bc 1.3 (items 2a and 2b), bc 1.4 (item 5), bc 2.2 (items 7 and 11), and bc 2.3 (items 8, 9, and 10).

instrument reliability on the second tryout

the reliability of an instrument is related to measurement error.
scoring done by more than one scorer on the same instrument gives high reliability if the consistency of scoring is high. the inter-rater consistency in this study was calculated using cohen’s kappa measure of agreement, and the instrument reliability of the verified data was calculated using cronbach’s alpha formula. the inter-rater consistency calculation results of the second tryout are presented in table 6.

table 6. kappa coefficients of each item based on the results of the second tryout

no. | item | kappa coefficient
1 | 1 | 0.971
2 | 2a | 0.962
3 | 2b | 0.992
4 | 3 | 0.973
5 | 4 | 0.980
6 | 5 | 0.983
7 | 6 | 0.951
8 | 7 | 0.982
9 | 8 | 0.965
10 | 9 | 0.959
11 | 10 | 0.949
12 | 11 | 0.993

table 6 shows that the consistency of measurement by the two scorers is very high. this is evident from the lowest kappa coefficient of 0.949, on item 10. the reliability of the instrument based on the verified data of the second tryout, calculated using cronbach’s alpha formula, was 0.88. this is higher than the specified minimum reliability coefficient of 0.7 (nunnally, 1981, p.245; urbina, 2004, p.137), so it was concluded that the instrument is reliable.

figure 3. instrument item loading factor based on the second tryout data

instrument validity based on the results of the second tryout

the validity analysis based on the empirical data of the second tryout was conducted using factor analysis. the factor analysis used in this study was cfa, conducted using lisrel. the data used in the cfa were the verified data. cfa was conducted after the result of the normality assumption testing was obtained. the result of the normality assumption testing indicates that the data have univariate non-normality, as shown by the p-value of skewness and kurtosis of each variable. all of the test items have a p-value = 0.00 < 0.05.
the result of the multivariate normality testing also showed a p-value = 0.00 for skewness and kurtosis. this indicates that the data have multivariate non-normality. therefore, it is concluded that the data are not normally distributed. the confirmatory factor analysis of the second tryout data used additional data in the form of the asymptotic covariance matrix (acm), as described in the data analysis of the first tryout results. the results of the validity analysis are presented in figure 3 and figure 4. figure 3 shows that the confirmatory factor analysis gives χ2 = 67.69 with a p-value = 0.10 and rmsea = 0.03, which indicates that the model fits the data. the fit of the model to the data is supported by gfi = 0.97, agfi = 0.95, and nfi = 0.95, each of which is ≥ 0.95 (schumacker & lomax, 2004, p.82). that the model fits the data means that the instrument is valid based on empirical data.

figures 3 and 4 show that the items of the instrument have a significant relationship with hots in mathematics. in the figures, it can be seen that the lowest loading factor of the instrument items in the second tryout is 0.49 and the highest is 0.72. based on the t-values given in figure 4, it can be concluded that the loading factor of each item of the instrument is significant at the α = 0.05 level. the t-value of each item is at least 13.94, which is more than the critical value of t at α = 0.05, namely 1.96.

junior high school students’ hots in mathematics

the test scores from the tryouts are the source of information about the hots in mathematics of the junior high school students who took the tests. the information about the students’ hots was obtained through the interpretation of scores. the score interpretation resulted in a value, which can be presented in the form of numbers or words. the score interpretation results or assessment results can be used for various purposes, such as to improve the quality of learning and to report learning outcomes.

figure 4. estimation results of t-values based on the second tryout data

the reporting of students’ learning outcomes in each school subject uses a composite score obtained from several tests. the values obtained from different tests sometimes have different score ranges or scales, so these values cannot be composited from the raw scores. therefore, the scores from each source must first be transformed into a common score range or scale. these scores may then be composited into the final value in accordance with the desired rules. the results of assessing the hots of junior high school students in mathematics are one of the important sources for the composite score reported. when schools use grades of 0-10 or 0-100 in the final report, the scores of the test participants must be transformed into values of 0-10 or 0-100. this can be done with a linear transformation: the score obtained is divided by the ideal score, and the result is multiplied by 10 to obtain a value in the range 0-10 or by 100 to obtain a value in the range 0-100. in the range 0-10, the highest value obtained by the test participants is 9.39 and the lowest is 0.00. in the range 0-100, the highest value obtained by the test participants is 93.93 and the lowest is 0.00.

the assessment results can also be presented in the form of a predicate from very low to very high. a value in the form of a predicate can be produced by categorizing the scores. the hots test in this study has a maximum score of 33 and a minimum score of 0. thus the range of scores is 33, and the ideal average is 16.5. the ideal range is divided into six units of standard deviation, resulting in 5.5 as the ideal standard deviation.
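the score transformation and the ideal statistics described above, together with the four-category scheme that the study builds on them (cut-offs at the ideal average and at 1.5 ideal standard deviations on either side of it), can be sketched as follows. this is an illustration under those stated rules only; the function names are hypothetical.

```python
def to_scale(raw_score, ideal_score, scale=100):
    """Linear transformation: raw score divided by the ideal score, times the scale."""
    return raw_score / ideal_score * scale


def ideal_mean_sd(min_score, max_score):
    """Ideal average is the midpoint; ideal standard deviation is the range over six."""
    return (max_score + min_score) / 2, (max_score - min_score) / 6


def categorize(score, ideal_mean, ideal_sd):
    """Four categories with cut-offs at the ideal mean and at mean +/- 1.5 sd."""
    if score > ideal_mean + 1.5 * ideal_sd:
        return "very high"
    if score > ideal_mean:
        return "high"
    if score > ideal_mean - 1.5 * ideal_sd:
        return "low"
    return "very low"


mean, sd = ideal_mean_sd(0, 33)             # maximum score 33, minimum score 0
print(mean, sd)                             # ideal average and ideal standard deviation
print(mean - 1.5 * sd, mean + 1.5 * sd)     # lower and upper cut-offs
print(round(to_scale(31, 33, scale=10), 2)) # a raw score of 31 on the 0-10 scale
print(categorize(13.42, mean, sd))          # category of a score of 13.42
```

with a maximum score of 33, the ideal average is 16.5 and the ideal standard deviation is 5.5, so the cut-offs fall at 8.25 and 24.75.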
based on the ideal average ($\bar{x}_i$) and the ideal standard deviation ($s_i$), the categorization of junior high school students’ hots in mathematics is: (1) $\bar{x}_i + 1.5 s_i < x$ or $24.75 < x$ (very high category); (2) $\bar{x}_i < x \le \bar{x}_i + 1.5 s_i$ or $16.5 < x \le 24.75$ (high category); (3) $\bar{x}_i - 1.5 s_i < x \le \bar{x}_i$ or $8.25 < x \le 16.5$ (low category); and (4) $x \le \bar{x}_i - 1.5 s_i$ or $x \le 8.25$ (very low category) (adapted from azwar, 2009, p.108). based on this categorization, junior high school students’ hots in mathematics at the second tryout is as follows: (1) 9.74% of the participants have a very high skill; (2) 25.94% have a high skill; (3) 29% have a low skill; and (4) 35.32% have a very low skill.

the results of assessing junior high school students’ hots in mathematics show that the participants with low and very low skills dominate, at 64.32%. this percentage is the share of test takers whose scores are no more than the ideal average score. the test participants with high and very high skills make up only 35.68%. this percentage is the share of participants whose scores are above the ideal average score. this description shows that the hots in mathematics of the junior high school students who were involved in the second tryout tends to be low.

in relation to this, the scores obtained by the participants were examined further. the examination shows that the average score of the students is 13.42. this score is 3.08 lower than the ideal average score. the score distribution is not normal, as seen from the skewness value of 0.32 > 0. the skew of the score distribution is shown in figure 5. urbina (2004, p.60) argues that skewness > 0 occurs if most scores are at the low level. this means that the test participants are dominated by those whose scores are low.

figure 5. score distribution curve in the second tryout

the examination of the participants’ test scores shows a highest score of 31 with a frequency (f) = 1 or 0.12%. the lowest score obtained by participants is 0 with a frequency (f) = 1 or 0.12%. the most frequent score is 7, with a frequency of 51 or 6.21%. this means that more test takers scored 7 than any other score. the next most frequent score is 6 with a frequency (f) = 50 or 6.09%, followed by the score of 3 with a frequency (f) = 47 or 5.72%.

discussion

the most suitable form of instrument for assessing students’ hots in mathematics is the essay type test, because the students’ thinking processes can be determined from the description in the given answer. an essay type test requires them to demonstrate their knowledge in accordance with the problem posed. in principle, other forms of tests, such as multiple choice tests, can also be used to assess students’ hots in mathematics, but the thinking process cannot then be determined. a correct answer chosen in a multiple-choice test cannot reveal whether it is the result of thinking or guessing.

the instrument items for assessing junior high school students’ hots in mathematics developed in this study have difficulty indices in the range of 0.3 to 0.7. the difficulty indices of the items are thus in a good category. this is because the instrument was developed through a systematic and well-executed process. the instrument development process, which started from the preparation of test specifications and proceeded to the writing of test items, was carried out by considering various aspects that can affect the students’ ability to answer the questions.
those aspects are the suitability of the indicators and test items to the curriculum and the students’ developmental level, language aspects, and cultural aspects. another factor that placed the test item parameters in a good category is the arrangement of the test items in the test package. arranging the items from the simplest to the most difficult is intended to reduce the anxiety of students in answering the questions. low anxiety during the test allows students to answer according to their ability. answers that reflect the students’ true abilities affect the item difficulty parameter because the test items are designed in accordance with the level of development and knowledge of the students. in the writing of the test items, the level of students’ knowledge was judged from the content of the curriculum in use.

the instrument developed in this study is in the valid category as a result of a systematic and well-executed process of instrument development. the validity evidence consists of content validity and construct validity evidence. the content validity evidence of this instrument is obtained from the experts’ judgement, and the construct validity evidence comes from the analytical results of fitting the model to empirical data. two factors affect the validity of the instrument: the writing of the test items, which is considered representative of the curriculum content and fulfills the criteria of a hot test in mathematics, and the competence of the experts or instrument reviewers. the experts involved in the review and assessment of the instrument items are competent experts in mathematics education and measurement, so the results of the assessment can be justified.
the results of the experts' judgement show that, from the content aspect, the instrument is valid, which is visible from aiken's validity index, the lowest value of which is 0.83. the experts have reviewed and assessed the instrument, so that the resulting instrument really measures what it is supposed to measure, as indicated by the fit of the model to the empirical data.

the essay test form must be supported by scoring guidelines called the scoring rubric. the scoring rubric used in this study was designed as well as possible and validated together with the items of the instrument developed, so that the difference in the scores given by the two scorers was very small; this is also supported by the lowest per-item kappa coefficient of 0.949. this shows that the inter-rater reliability coefficient is in a very good category. the differing scores were verified by the two scorers very carefully so as to produce accurate score data with the lowest error of measurement, which is visible from the reliability coefficient of α = 0.88. the coefficient is in a good category and has met the criteria. the description shows that the instrument developed in this study has met the reliability criteria viewed from both the inter-rater reliability aspect and the normative aspect.

junior high school students' hots in mathematics tends to be low. this is due to the students' unfamiliarity with working on problems that require hots. the students are accustomed to working on problems that require low-level skills, so that only some students are capable of achieving the maximum score when solving hots problems. although the junior high school students' hots in mathematics is mostly low, some students of state jhs 1 baubau said that they were happy doing these tests because the problems in the tests were very challenging and aroused their curiosity.
the students said they would agree to tests or final exams including questions that demand high level thinking. these students have benefited from this research. meanwhile, in the discussions with the mathematics teachers in the tryout, it was found that the teachers agree to use a test that requires hots in tests or final exams. however, there are difficulties in constructing test items with strict criteria, particularly the novelty criterion for items. they argue that it would be very good if examples of such test items were available. the teachers also claimed that many students, especially those with middle and low ability, will have difficulty providing the correct answers.

conclusions and recommendations

conclusions

based on the analysis of the findings, it can be concluded that: (1) the instrument for assessing students' higher order thinking skill in mathematics which is developed in this study consists of 12 items, each of which is an essay type test item; (2) the test items developed in this study have difficulty indices ranging from 0.3 to 0.7, which means they meet the criteria of a good item parameter; (3) the instrument developed in this study has a reliability coefficient of 0.88, which means that it meets the criteria; (4) the instrument for assessing jhs students' hots in mathematics developed in this study is valid: the evidence of content validity is indicated by item validity indices above 0.79, and the evidence of construct validity based on empirical data is indicated by the values of χ² = 67.69, p-value = 0.10, rmsea = 0.03, gfi = 0.97, nfi = 0.95 and agfi = 0.95.
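the three kinds of coefficients reported in the conclusions above can be illustrated with a minimal python sketch. it assumes the classical definitions: the difficulty index of an essay item as the mean score divided by the maximum score, cronbach's alpha for the reliability coefficient, and aiken's v for the item validity index. the function names, the 0-4 score range, the 1-4 rating scale, and all data below are hypothetical and are not taken from the study, whose own computations may have been carried out differently.

```python
from statistics import mean, pvariance


def difficulty_index(item_scores, max_score):
    """Classical difficulty index of a polytomous item: mean score / maximum score."""
    return mean(item_scores) / max_score


def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a persons-by-items matrix (list of rows)."""
    k = len(score_matrix[0])  # number of items
    item_variances = [pvariance([row[j] for row in score_matrix]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in score_matrix])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def aiken_v(ratings, lowest, highest):
    """Aiken's V for one item: sum of (rating - lowest) over n * (highest - lowest)."""
    n = len(ratings)
    return sum(r - lowest for r in ratings) / (n * (highest - lowest))


# Hypothetical data: 5 students x 3 essay items, each scored 0-4.
scores = [
    [4, 3, 4],
    [2, 2, 3],
    [1, 0, 1],
    [3, 3, 2],
    [0, 1, 1],
]

# Hypothetical expert ratings of one item on a 1-4 relevance scale.
expert_ratings = [4, 4, 3]

if __name__ == "__main__":
    for j in range(len(scores[0])):
        p = difficulty_index([row[j] for row in scores], max_score=4)
        # 0.3-0.7 is the range the article describes as a good category.
        print(f"item {j + 1}: difficulty index = {p:.2f}, in 0.3-0.7: {0.3 <= p <= 0.7}")
    print(f"cronbach's alpha = {cronbach_alpha(scores):.2f}")
    print(f"aiken's v = {aiken_v(expert_ratings, 1, 4):.2f}")
```

under these assumptions, each coefficient is computed directly from the raw scores or ratings, so the criteria quoted in the conclusions (difficulty indices between 0.3 and 0.7, item validity indices above 0.79) reduce to simple threshold checks on the returned values.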
recommendations

based on the results of the analysis of the findings, it is recommended that: (1) such an instrument be developed using standard procedures in order to produce a good hots assessment instrument; (2) teachers assess students' thinking ability up to the higher levels; (3) mathematics teachers be trained to create hots assessment instruments; (4) the department of education conduct training for teachers in the development of hots assessment instruments; and (5) other researchers conduct further research in order to increase the number of instrument items for assessing students' hots and to cover all levels.

references

aiken, l.r. (1985). three coefficients for analyzing the reliability and validity of ratings. educational and psychological measurement, 45, 131-142.
allen, m.j. & yen, w.m. (1979). introduction to measurement theory. monterey, ca: brooks/cole.
atkin, j.m. (2003). assessment in support of instruction and learning. workshop report. washington, dc: the national academies.
azwar, s. (2009). penyusunan skala psikologi (12th ed.) [composing psychological scale]. yogyakarta: pustaka pelajar.
brookhart, s.m. (2010). how to assess higher order thinking skills in your classroom. alexandria, va: ascd.
byrnes, j.p. (2008). cognitive development and learning in instructional contexts. boston, ma: pearson education.
crocker, l. & algina, j. (1986). introduction to classical and modern test theory. new york, ny: cbs college.
de lange, j. (1999). framework for classroom assessment in mathematics. retrieved on january 12, 2012 from http://www.fi.uu.nl/catch/products/framework/de_lange_frameworkfinal.pdf
kaur, b. & lam, t.t. (eds.). (2012). reasoning, communication and connections in mathematics. singapore: world scientific.
lester, f.k. (1980). research on mathematical problem solving. in shumway, r.j. (ed.). research in mathematics education, pp.
286-323. reston, va: nctm.
mardapi, d. (2008). teknik penyusunan instrumen tes dan nontes [test and non-test instruments composing techniques]. yogyakarta: mitra cendekia.
mueller, r.o. (1996). basic analysis of structural equation modeling. new york, ny: springer-verlag new york.
muraki, e. & bock, r.d. (1998). parscale: irt item analysis and test scoring for rating scale data. chicago, il: scientific software international.
national council of teachers of mathematics (nctm). (2000). principles and standards for school mathematics. reston, va: the national council of teachers of mathematics.
national council of teachers of mathematics (nctm). (2009). guiding principles for mathematics curriculum and assessment. reston, va: the national council of teachers of mathematics. retrieved on 15 january 2013 from http://standards.nctm.org/document/chapter2/content.aspx?id=23273
nitko, a.j. & brookhart, s.m. (2007). educational assessment of students. boston, ma: pearson prentice hall.
nunnally, j.c. (1981). psychometric theory. new delhi: mcgraw hill.
peressini, d. & webb, n. (1999). analyzing mathematical reasoning in students' response across multiple performance assessment tasks. in stiff, l.v. & curcio, f.r. (eds.). developing mathematical reasoning in grades k-12, pp. 156–174. reston, va: nctm.
polya, g. (1981). mathematical discovery: on understanding, learning, and teaching problem solving. new york, ny: john wiley & sons.
russell, s.j. (1999). mathematical reasoning in the elementary grades. in stiff, l.v. & curcio, f.r. (eds.). developing mathematical reasoning in grades k-12, pp. 1–12. reston, va: nctm.
schumacker, r.e. & lomax, r.g. (2004). a beginner's guide to structural equation modeling (2nd ed.). mahwah, nj: lawrence erlbaum associates.
shafer, m.c. & foster, s. (1997). the changing face of assessment. principled practice in mathematics & science education, 1(2), 1-12.
stanley, t. & moore, b. (2010). critical thinking and formative assessment.
larchmont, ny: eye on education.
thomas, d.a., okten, g., & buis, p. (2002). on-line assessment of higher-order thinking: a java-based extension to closed-form testing. icots6, 1-4. retrieved on june 6, 2013 from https://www.stat.auckland.ac.nz/~iase/publications/1/6d4_thom.pdf
urbina, s. (2004). essentials of psychological testing. hoboken, nj: john wiley & sons.

research and evaluation in education
e-issn: 2460-6995
volume 1, number 2, december 2015 (pages 114-128)
available online at: http://journal.uny.ac.id/index.php/reid

the effectiveness of english teaching program in senior high school: a case study

1) alfred irambona; 2) kumaidi
1) burundi national university, burundi; 2) yogyakarta state university, indonesia
1) irambonaalfred@yahoo.fr; 2) kuma_426@yahoo.com

abstract

this article evaluates the effectiveness of the english teaching program for the eleventh graders of a senior high school. this study was a summative case study using a mixed-method approach. descriptive statistics were used to analyze the quantitative and qualitative data, followed by a descriptive analysis following the context, input, process, and product (cipp) model. the informants were 43 students of sekolah menengah atas negeri 3 yogyakarta (sman 3 yogyakarta, yogyakarta state senior high school 3) and two english teachers of the school. the findings reveal that the program objectives, classroom condition, and students' needs and barriers are in the effective category. in the input component, it is found that both teachers are qualified and experienced. the teaching training is not sufficient; students' textbooks and course designs are in the effective category.
the process component shows that teaching materials, teaching methods, teaching activities and assessments are in the effective category. the product component shows that english marks and students' needs and barriers are in the effective category. however, the teaching materials are in the 'not effective' category.

keywords: program effectiveness, program objectives, students' needs and barriers, teaching methods, teaching materials

introduction

different countries have different policies. each country has its own visions for its people to fulfill. to achieve them, countries plan how to reach their goals, and the most effective way to do so is education. nelson mandela states, ‘education is the powerful weapon which you can use to change the world’; from this, one can understand the degree to which education plays a great role in changing a country. his idea concurs with that of moore (2009, p.9), who suggests that changes in society are often reflected in more demands being placed on our educational system. a country whose people are educated is fortunate. there is no country which can develop without education. each country needs to make educational changes according to its needs. through education, people may gain knowledge and good behavior, which are the best factors leading to a bright life and future. with their knowledge, patriotism, innovations and inventions arise. that is why many governments are doing their best to invest in education, trying to bring the level of their national education up to the standard of the countries with the best education. they continually try to create new educational programs in order to refresh the educational institutions in their respective countries according to the world's trends.
the minister of education and culture of indonesia, nuh (2013), announces that ‘the future of this nation depends on the new curriculum.’ according to this statement, it is obvious that without changing education, there will be no progress. for that reason, indonesian government officials, especially those in the educational sphere, sat together to elaborate the currently implemented curriculum, namely curriculum 2013, together with its corresponding programs, textbooks, and other teaching/learning materials, and to consider how that curriculum would be implemented. a curriculum is a vehicle for the government's needs in all fields where there is something to change or promote, for instance, languages, arts, and sciences. in line with this, fterniati and spinthourakis (2006, p.42) state that:

the curriculum makes significant efforts towards promoting instructional change in the way language arts and all subjects are taught. the successful implementation of the new national curriculum needs to be based on appropriately designed materials, continuing inservice instruction of educators as well as informed and supportive school subject advisors to serve as methodology facilitators. together they create a frame of authentic and more effective praxis which can lead to students who will become engaged, literate and critical citizens in the twenty-first century.

however, those materials have to be evaluated before, during, and after the implementation, to verify whether they match the country's intended goals or not. if the country's objectives are reached, they can be strengthened. if they are not attained, a decision can be taken on whether to maintain the curriculum or to change it. this explains why many countries have seen frequent curriculum changes: the results did not correspond to what was expected, or the needs simply changed over time.
at this stage, indonesia, the country where this study is carried out, is a good example to clarify the case. time changes and things change accordingly. since the year of independence, 1945, the indonesian curriculum has been changed many times, as follows: 1947, 1952, 1964, 1968, 1975, 1984, 1994, 2004, 2006, and finally 2013. it has been changed ten times because the government's needs changed over time. the need may be not only to catch up with the surrounding countries, but also with the whole world. now, since july 2013, the government has been trying out the new curriculum 2013 to verify whether it suits indonesia's needs. to do so, the ministry of education chose some pilot schools, including sekolah menengah atas negeri 3 yogyakarta (sman 3 yogyakarta, yogyakarta state senior high school 3), where the new curriculum was tried out. the results of this tryout will be the cornerstone for the ministry of education and culture in taking decisions about the curriculum 2013. since then, many studies and evaluations have been carried out to analyze the effectiveness or the impact of the implementation of the new curriculum.

evaluation is a crucial activity which has to be done in any domain where people want to know the progress of their program. in the educational sphere, evaluation is especially important because it depicts the effectiveness, strengths, weaknesses, or failures of a given program. in other words, it shows the extent to which students have reached the desired standards. thus, it is understood that an effective evaluation will be based on the set of objectives which students are intended to meet. these objectives or goals are considered key guides of program evaluations in the educational domain since they clearly mention the targets and expectations of education.
moreover, they outline materials for grades and what topics are to be taught in schools, how topics are sequenced and presented to students, what levels of understanding are to be expected, as well as what skills students will have to develop, and when. in a few words, a curriculum can be taken as a bridge leading to the wise island of knowledge, where skills are improved.

the indonesian government is working hard to put its students on the level of other countries in english language teaching, in an updated way that can prepare students to face global development in each domain, such as economy, culture, science, politics, and technology. indonesia is now increasingly concerned to produce competitive citizens who will be able to respond positively to a new environment, new world, and new era. it is trying to produce people who can adapt, change, and learn new skills at different points in their lives and who will contribute to the society which they wish to develop in the future. to be successful, the government has opted to use the communicative approach across the educational levels since 1994, referring to the current trends of english language teaching and changes in the world.

problem identifications

sman 3 yogyakarta is said to be one of the best schools in indonesia. its students are said to be among the best students in indonesia. before using the new program, indonesian education employed the one established in 2006. the researchers felt that teachers were not yet ready for the new curriculum because the teachers did not get sufficient training time. moreover, the books and other teaching facilities such as video tapes, compact discs, teaching materials, and teaching aids are not yet ready, and some of them are also not sufficient. the insufficient facilities can be another barrier to the satisfaction of students' needs. in addition to that, the researchers wondered whether the actual english program matches the diverse students' needs or not.
the question of the teaching environment also emerges since it is among the influencing factors in education.

problem formulations

before conducting this case study, the evaluators were very curious to know whether the new curriculum 2013 is going to be effective or not. based on this curiosity, the problems questioned in this study are outlined as follows: (1) how is the context of the english teaching program at sman 3 yogyakarta in grade eleven?; (2) how is the input of the english teaching program at sman 3 yogyakarta in grade eleven?; (3) how does the teaching process of the english teaching at sman 3 yogyakarta respond to the needs and barriers of the eleventh graders?; (4) how effective is the english teaching program at sman 3 yogyakarta in grade eleven?

research objectives

the research objectives can be formulated as follows: (1) finding out if the context falls in line with the needs and barriers of the eleventh graders of sman 3 yogyakarta; (2) finding out whether the time allotted to teaching-training related to the curriculum 2013 was enough or not; (3) finding out if the instructional materials correspond to the teaching of english in the new program; (4) discovering the strategies used to implement the english teaching program at sman 3 yogyakarta, eleventh grade; and (5) finding out the effectiveness of the english teachers in their teaching activities at the eleventh grade of sman 3 yogyakarta.

understanding what evaluation is

evaluation is defined as a ‘systematic attempt to gather information in order to make judgements or decisions’ (lynch, 1996, p.2). according to yarbrough, shulha, hopson, et al. (2010, p.xxiv), ‘evaluation is a systematic investigation of the worth or merit of an object’.
if we analyze the preceding definition, it can be concluded that evaluation is a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards. in the book evaluation of educational program, fernandez (1984, p.1) says that evaluation is defined as ‘the process of determining to what extent the educational objectives are being realized’. according to sanders and sullins (2006, p.1), the term evaluation is understood as ‘the process of systematically determining the quality of a program and how it can be improved’. according to stufflebeam and shinkfield (1985, p.159), evaluation is the process of delineating, obtaining, and providing descriptive and judgmental information about the worth and merit of some objectives, goals, design, implementation, and impact in order to guide decision making, serve needs for accountability and promote understanding of the involved phenomena. this definition sees evaluation as a well detailed and organized process in which the evaluator collects all needed information about the program under study and analyzes it before judging its worth. another point of view is the one proposed by kaufman and thomas (1980, p.4), who define evaluation as ‘a process used to assess the quality of what is going on. evaluation may determine what is working, what to change, and what to keep.’ the different definitions given clearly have something in common: all of them converge on judging the value or worthiness of something and taking decisions about it. this shared understanding leads to the notion of program evaluation.

program evaluation

program evaluation is an activity used in any domain where programs are involved. in any sphere where a given program is in use or being implemented, there is a great need to evaluate it in order to know its worth.
in any case, policy-makers or stakeholders are curious to know whether their programs in use are effective or not, or simply whether they are bringing the desired results. if a program works, it will be recognized and developed. however, if it does not work, a change will be called for. it is good to mention that a program can be thought of as a group of related activities that is intended to achieve one or several related objectives (hawthorn & mcdavid, 2006, p.15). to meet the different expectations of a given country, state and federal policymakers have to evaluate their different programs to verify whether they are moving in the desired way as planned. likewise, the present study focuses on the evaluation of the english teaching program used in senior high schools to see whether the new english teaching program is effective or not. according to rossi, lipsey, and freeman (2004, p.16), ‘program evaluation is the use of social research methods to systematically investigate the effectiveness of social programs in ways that are adapted to their political and organizational environments and are designed to inform social action to improve social conditions’. from the quotation, it is understood that program evaluation is the use of different social research methods to check the effectiveness of social programs and to find out whether the programs have been implemented as planned by the society, organization, or government. in other words, we can conclude that a program evaluation comprises all the systematic steps taken to determine the worth and merit of a given program (the whole program or its different components): whether it is working as planned, what hindrances it is facing, and what progress it is making.
effectiveness

after giving a varied set of definitions of program evaluation, the next question is when a given program is said to be effective, something which can be known only through evaluation. evaluating the effectiveness of a program is not a simple task. a program is effective when it has achieved its intended objectives/goals. regarding the definition of effectiveness, the cambridge advanced learner's dictionary (2002) gives an understandable definition. it states that ‘something is effective when it is successful or achieving the results that you want.’ the oxford dictionary embraces the same idea that ‘effectiveness is the degree to which something is successful in producing a desired result; success’. in the same way, it can be understood that effectiveness is the degree to which objectives are achieved and the extent to which targeted problems are solved. objectives are needed to identify the expected outcomes, to suggest directions, and to determine the means of evaluation. analyzing the different definitions given, one can conclude that the effectiveness of a teaching program can be deduced from its objectives. in educational institutions, a teaching program will be claimed as being effective if the stated objectives have met the needs of the students. program effectiveness is a determination of worth made after deliberating about preset planning goals and judgments made during program evaluation processes. if credible objectives are attained, the program can be judged effective. otherwise, the program will be claimed as being ineffective. it is fair to say that the effectiveness of a teaching program is based on students' achievement. no one can dare to say that a given teaching program is effective when students fail or just barely pass with low marks. thus, can we say that the effectiveness of a program will just happen by itself? the answer to that question is no.
if we make a deep analysis of the meaning of effectiveness, we will discover that it is the product of hard work resulting from the combination of many factors. in this study, the researchers did not forget the role of teachers, who play a central and vital role in teaching activities since they are the ones who make the teaching activities happen; that is, they are the ones who help the students achieve the stated goals. moreover, english exposure, the needs of the students, teaching methods, and teaching/learning materials were all considered in discussing the program effectiveness.

effective teaching

effective teaching was, is and will be the main dream that any good teacher wishes to reach. according to the idea center journal (2002, p.1), teaching is effective if it has an impact on students, for instance, if the students have made progress in their learning activities. the journal clarifies that its main indicators of effectiveness are derived by answering one question: do students make progress in achieving objectives selected by the instructor? thus, it is understood that if the chosen objectives are attained, teaching effectiveness can be claimed. in the same way of thinking, effective teaching can be defined as teaching that successfully achieves the learning by pupils intended by the teacher (chris, 2009, p.7).

teaching methods

any teaching activity without teaching methods is meaningless. there are many teachers who are intelligent and full of knowledge, but whose teaching outcomes are not satisfying. they fail because they are not equipped with good teaching methods, or they simply do not know how to merge content and delivery process. this means that teaching methods are the first key of teaching success.
a good teacher is one who is not only equipped with a range of teaching methods, but also knows how to combine them according to the material, situation, and learners. thus, teachers must be trained so that their teaching activities bring desirable outcomes.

cipp model

in this research, the cipp model developed by stufflebeam (1970), as mentioned by fernandez (1984, p.7), was used. the acronym cipp stands for four types of evaluation functions, namely context, input, process, and product. this model was chosen because of its suitability. in the evaluators' view, it is more effective than the others because it covers all stages of the program, whether at the beginning, in the middle, or at the end of the program. zhang et al. (2011, p.1) say that:

... the cipp evaluation model is designed to systematically guide both evaluators and stakeholders in posing relevant questions and conducting assessments at the beginning of a project (context and input evaluation), while it is in progress (input and process evaluation), and at its end (product evaluation).

in line with that, stufflebeam and shinkfield (1985, p.162) explain that cipp provides information either before or during a project and allows an evaluation of each of its components (context, input, process, or product) or of a combination of its components, depending on the needs of the audiences. fundamentally, the use of the cipp model is intended to promote growth and to help the responsible leadership and staff of an institution to obtain and use feedback systematically so as to excel in meeting important needs, or at least, to do the best they can with the available resources (stufflebeam & shinkfield, 1985, p.166).
in short, this model is the one most used in program evaluations because it has many advantages, since it gives a general picture of what to focus on in every single step. achieving the effectiveness of a program is not an easy task. it is a combination of many factors such as the different textbooks used in the teaching activities, syllabi, the different methods used while teaching, and other teaching facilities. in order to know the effectiveness of the current english teaching program, the cipp evaluation model by stufflebeam was chosen. this model of evaluation consists of four stages, namely context, input, process, and product.

the context stage deals with the different needs of the beneficiaries, who are the students in this case. the goals or program objectives to be attained are also considered in this part. problems or barriers to the attainment of the needs are focused on here. the main point is checking the availability of teaching materials and verifying whether they will help to respond to the students' needs.

the input stage focuses on how teachers plan their lessons. it is also the time for checking whether the selected teaching materials are going to serve the different needs of the students. it includes the different teaching books which are chosen by teachers and all additional teaching materials. in this section, teachers' training is also taken into consideration as one of the strategies used to reach the students' needs.

the process stage is about how teachers put their planned lessons into practice, for example by following the schedule. moreover, this section deals with the different strategies and methodologies that teachers use to reach the needs of their recipients, the learners. the research verifies whether the different strategies or teaching methodologies used really help to answer the students' needs or not.
the product stage is the time of collecting all findings from the different components of the model being used. it is followed by their analysis so that it can be reviewed whether the objectives of the program are met or not, and whether the needs of the students are fulfilled or not. thus, based on the different stages of the cipp model, its summary is the following.

research questions

the research questions which guided this study were elaborated following the cipp model. those questions were:

context: (a) what were the english program objectives set in curriculum 2013?; (b) to what degree did the english program objectives set in curriculum 2013 cover the students' needs at the eleventh grade of sman 3 yogyakarta?; (c) to what degree did the english program objectives set in curriculum 2013 cover students' problems at the eleventh grade of sman 3 yogyakarta?; (d) how conducive was the classroom environment of sman 3 yogyakarta in helping students to meet their needs and overcome their language hindrances?

input: (a) what were the qualifications and experience of the english teachers at the eleventh grade of sman 3 yogyakarta?; (b) what were the perceptions of the english teachers at the eleventh grade of sman 3 yogyakarta about teaching training related to curriculum 2013?; (c) how was the availability of teaching/learning materials at the eleventh grade of sman 3 yogyakarta?; (d) how did english teachers at the eleventh grade of sman 3 yogyakarta design their lessons?
process: (a) how did the available teaching/learning materials help the students to meet their needs and overcome their problems at the eleventh grade of sman 3 yogyakarta?; (b) how effective were the english teaching methods in helping the students of the eleventh grade of sman 3 yogyakarta to meet their needs and overcome their problems while learning english?; (c) in connection with the students’ needs and problems, how were teaching activities practiced in classes?; (d) what were the perceptions of english teachers of the eleventh grade of sman 3 yogyakarta regarding the assessments?
product: (a) how were students’ needs and problems dealt with by the end of the first semester at sman 3 yogyakarta?; (b) what were the perceptions of the students and their teachers of the eleventh grade of sman 3 yogyakarta regarding the students’ english marks at the end of the first semester?; (c) what were the perceptions of the students of the eleventh grade of sman 3 yogyakarta regarding the available teaching/learning materials?; (d) what were the perceptions of the students of the eleventh grade of sman 3 yogyakarta regarding the implemented english teaching program?
research methods
this study is a summative evaluation in the form of a case study which employed a mixed-method approach, also called a pragmatic approach to research. this research was conducted at sekolah menengah atas negeri 3 yogyakarta (sman 3 yogyakarta), i.e. 3 yogyakarta state senior high school. the school under study is one of the best schools in indonesia. it has also been one of the international standard pilot-project schools in indonesia since the academic year of 2006-2007. the researchers made pre-observations twice in july 2014. the collection of data started in april 2015 and took one month. the analysis was done and conclusions were drawn in may 2015. in this study, the informants were 43 students studying in the eleventh grade of sman 3 yogyakarta.
they were chosen using a purposive sampling technique from a total of eight classes of grade eleven because the rest of the classes had different english teaching programs. in addition to the students, their two english teachers were also among the respondents of this research. in this study, data were collected using a variety of instruments depending upon the kind of data to be collected. they include observation, questionnaires, interviews, and documentation. as this evaluation is a qualitative study with quantitative support, the instruments are divided into two parts: a qualitative part and a quantitative one. to get quantitative data, questionnaires were handed to teachers and students. the qualitative data were collected through interviews administered to teachers. the documentation technique was used to get both qualitative and quantitative data.
research and evaluation in education 121 volume 1, number 2, december 2015
in analyzing the data, the three key stages of analyzing qualitative data stated by miles and huberman (1994, pp. 10-12) were followed. those stages are data reduction, data display, and conclusion drawing. first of all, data were reduced to make them simpler to analyze. after that, the researchers summarized the reduced data and sorted them, discarding irrelevant data before drawing conclusions, so that the data would be easily understood. at the end, to make general conclusions, the findings were descriptively analyzed to check whether the program has responded to the students’ needs and whether the course has effectively reached the program’s objectives. with regard to data from surveys, i.e., some quantitative data from the students’ survey, descriptive statistics were used, based on the modified likert scale method proposed by mardapi (2008, p.123) as presented in table 1.
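the categorization rule referred to here compares each score x with the mean m and the standard deviation sd of the scores. the following is a minimal python sketch of such a rule, not the authors’ procedure. only the two upper bands (x ≥ m + 1.sd, then m ≤ x) are legible in table 1, and the label ‘agree’ for the second band is taken from the conclusions; the two lower bands and their labels, the function name `categorize`, and the sample values m = 15 and sd = 3 are assumptions made for illustration.

```python
def categorize(x, m, sd):
    """map a score x to a likert-style category using the mean m and standard deviation sd."""
    if x >= m + sd:             # band legible in table 1
        return "strongly agree/always"
    if x >= m:                  # lower bound legible in table 1; label from the conclusions
        return "agree"
    if x >= m - sd:             # ASSUMPTION: this band is not legible in the source
        return "disagree"
    return "strongly disagree"  # ASSUMPTION: this band is not legible in the source


# hypothetical mean and standard deviation, not data from the study
m, sd = 15.0, 3.0
for score in (19.0, 16.6, 13.0, 10.0):
    print(score, categorize(score, m, sd))
```

with these hypothetical values, a score of 16.6 lies between m and m + sd and is therefore labeled ‘agree’.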
after analyzing the data quantitatively, the results were interpreted qualitatively in order to make them easier to understand. some of them were compared to the teachers’ responses before drawing conclusions.
table 1. students’ perceptions of teaching materials, classroom conditions, and teaching methods
score x                categories
x ≥ m + 1.sd           strongly agree/always
m ≤ x [...] 50.
evaluation criteria of the results
in this section, each question was analyzed independently and percentages were used to measure the effectiveness. to that end, each item was regarded as effective if the sum of the options ‘a’ and ‘b’ was bigger than 50%, that is, Σ (a + b) > 50%.
findings and discussions
the findings are presented and discussed referring to the four components of the cipp model. in the context component, it is found that teachers know the needs and problems of their students. this is an important point to appreciate because teachers are regarded as parents: they take over the role of the students’ parents at school. thus, just as good parents know when their children are hungry and what they like or dislike, the teachers of sman 3 yogyakarta are found to be in the category of good parents. they know their students’ needs and problems. in other words, they care about their students. second, it is found that the english program objectives encompass most of the students’ needs and problems. this is justified by the fact that: (a) the objectives cover up to 66.67% of the main students’ needs; (b) they cover up to 75% of the main students’ problems. based on those findings, it can be concluded that the english teaching program objectives are well elaborated as they include most of the students’ needs and problems. this is one of the characteristics a good teaching program should have. a good teaching program is one whose objectives are shaped based on the recipients’ capacity, needs, and problems.
in short, we can conclude that the program objectives effectively covered the students’ barriers. the classroom environment is found to be conducive to teaching and learning at the school under study. it means that the classroom conditions fulfill the conditions of a good setting for studying. in other words, the classroom conditions are favorable in helping students to follow their learning well, by helping them to meet their needs and find solutions to the problems they face while studying. this coincides with wei (2011, p.90), whose findings about the relationship between students’ perceptions of classroom environment and their motivation in learning english indicate that most of the students perceived their classroom as having affiliation and that they were extrinsically motivated. considering that, teachers have a big responsibility to create stimulating classroom settings for their students so that the goals of the teaching activity can be achieved. moreover, suleman, aslam, and hussain (2014, p.71) found that a favorable classroom environment has a significant positive effect on the academic achievement scores of secondary school students. it can be seen that the environment plays a big role in teaching. classroom settings have to be taken into consideration by teachers while planning their lessons. otherwise, it will be hard to achieve the teaching goals.
the input component shows that both teachers are not only qualified in english teaching but also experienced, with a mean of 13 years of teaching english. the first one has ten years of experience in teaching english and holds a master’s degree in english teaching. the other one holds a bachelor’s degree in english teaching and has been teaching for 16 years. the combination of the two characteristics is very important in the teaching domain.
those characteristics are further factors which can be the basis of students’ success in english. at secondary level, the percentage of students scoring advanced and proficient increases with the number of years of teaching experience (dial, 2008, p. 107). secondly, it is found that both english teachers of sman 3 yogyakarta followed the teaching training related to curriculum 2013, and both were dissatisfied with it: they agree that the training in the allotted time was insufficient. thus, they were asking for extra teaching training. generally, in any domain and in the teaching domain in particular, training is one of the most important factors to take into consideration if a person wants to succeed. with training, one gains knowledge and confidence and becomes familiar with the material to be used. thus, the stakeholders in education should have thought twice about teaching training before they implemented the english teaching program. if not, it can go up in smoke. third, it is also discovered that both teachers have followed some ongoing training seminars, such as collaborative seminars with other schools or internal seminars related to curriculum 2013. it is an excellent strategy for strengthening their skills and knowledge. what happened to the sman 3 yogyakarta teachers also happened in pakistan, where the findings of a study about teachers’ training showed that the teachers claimed that the training program should be longer (long-term training) because the training was proven to be a useful forum for making them effective english teachers (wati, 2011). in the same vein, farooq and aslam (2011, p.30) agree with wati, saying that training builds working ability in any sort of employee, even non-professional and new employees; it raises the abilities of professionals to a higher stage from where they currently stand.
however, those teaching trainings have to be well organized if the government wants the training to be successful. if not, the outcome of the teaching training will be like what happened in pakistan, where many training programs for english subject teachers were organized at different levels so that teachers would be equipped with the latest knowledge and skills needed to make them good english teachers. however, the results of the training did not match the desired objectives. from this condition, it is necessary for the government not only to organize training but also to give it more time and devotion in order for it to be successful. moreover, it is found that not all students have their personal textbooks. 32 students, or 74.42% of the total, each have their personal english book. it means that there are 11 students, or 25.58% of the students, who have no personal english textbook. it implies that it is not easy for those who do not have their own books to revise, to prepare in advance for the next lesson, or simply to do homework that has to be found in the book. each student should have their own english textbook so that they can use it whenever they need it without losing time borrowing the book from their friends, who probably also need to use their books. apart from the students’ books, it is found that there are various teaching facilities such as other english books, teachers’ books, syllabus, computer lab, recorder, internet connection, board markers, white boards, overhead projectors, as well as audio-visual facilities such as radios, cd player, dvd player, and tv set. however, the teachers expressed that the teaching materials were insufficient. teaching materials are taken as the light in the darkness.
they greatly support the teaching activities as they are what teachers refer to in preparing and in carrying out teaching activities. they are taken as the guidelines of teaching activities. without them, the teaching activities will not get any support or will otherwise be misdirected. in the same component, it is found that teachers at sman 3 yogyakarta design the lessons by themselves. they admitted that they sometimes collaborate with their fellows, ask some experts in english teaching, or just read some articles in order to have strong and successful lesson plans. another finding is that the elaboration of lesson plans is based on the program objectives and the needs and problems of the students. lesson plans are one of the vital factors in teaching because they trace paths which lead to the connection between what is taught and what should be taught. in other words, lesson plans give a connection between the objectives of the program and the content of the lesson being taught. this paragraph is concluded with a note of agreement with the saying ‘failing to prepare is preparing to fail’.
as for the process component, the findings show that the teaching/learning materials are in an ‘effective’ category. it means that they help students to meet their needs and overcome their barriers while studying the english language. however, it is found that some teaching/learning materials are overused while others are less used. through videos and projectors, which are used in up to 86.05% of the teaching activities, it can be concluded that most of the students’ needs and barriers should be resolved, as teachers admitted that they showed a variety of videos. for example, most of the needs and problems related to different language skills may be easily resolved through movies.
they can practice listening, writing a summary of what they saw in the movies, and reading; they can gain lots of vocabulary items and expressions; and they can even speak. thus, through videos or what was projected, students’ needs and barriers can be resolved. it is also found that students produce other things such as novels, a lot of posters, and autobiography booklets, and have made a lot of handicrafts based on their interest-based learning. with such activities, students produce something they like and are interested in. it means that students at the school under study explore all the potential they have in various domains so that they can meet their needs. by practicing, they face their problems and solve them. discussing the findings concerning the teaching methods, it is found that the teaching methods are effective from the point of view of both teachers and students. it means that the teaching methods are leading students to the fulfillment of their needs and the resolution of their problems. this may result from the use of various teaching activities by the teachers. it is found that communicative and scientific approaches, project-based learning, and interest-based learning are used at that school. as they are experienced teachers, they know when a given teaching method should be applied and when it needs to be changed. for example, the way of mixing methods from the old curriculum (curriculum 2006) and the new one (curriculum 2013) is something to appreciate because it shows that they know which methods are effective and which ones are ineffective. the english teachers of sman 3 yogyakarta look for many ways to help their students to overcome their problems and meet their needs. it is reflected in the fact that they organize discussion groups, excursions to english-speaking places, and student exchange programs with native english-speaking students inside and outside the country.
in all those activities, the students get many occasions to improve their english. if all teachers helped their students in such a way, all indonesian students would be able to improve their english skills very fast. the government should support teachers, and even schools, that show such a glorious initiative of devotion. as far as assessments are concerned, it is found that assessments are helping teachers to find solutions to their students’ problems and needs. through assessments, teachers know where their students are strong and where they are weak. they get a database of information and, likewise, they know where to put emphasis. thus, they know what to prepare for the next classes. it is also found that feedback is given in classes. with feedback, students also know about their weaknesses and strengths. in such a way, students know what to focus on while studying.
regarding the last component of the cipp model, it is found that at the end of the semester, most of the students’ needs are fulfilled. 31 students, or 72.09% of the students, express that their needs were met. it means that at the end of the first semester, the students’ needs are in an ‘effective’ category. the data obtained from both teachers also show that most of the students’ needs are fulfilled. the comparison of the two sides shows a convergent point of view. from the findings, based on the effectiveness criteria, it can be concluded that the english teaching program is effectively responding to the students’ needs. in the same way, most of the students’ barriers are solved. it means that the teaching program is in an effective category as the majority of the students, that is 74.42%, accepted that most of their barriers are resolved. the information obtained from both teachers shows that most of the students’ problems are solved.
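the percentages reported for the product component can be recomputed from the counts, and the 50% effectiveness criterion can then be applied to them. the following is a minimal python sketch, assuming that the reported counts (31 of 43 students for needs, and 32 of 43 for barriers, the count given in the conclusions) are the students who chose options ‘a’ or ‘b’. the helper names `percent` and `is_effective` are hypothetical and not part of the study.

```python
def percent(count, total):
    """share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


def is_effective(pct, threshold=50.0):
    """the study's criterion: an item is effective when the share is bigger than 50%."""
    return pct > threshold


TOTAL_STUDENTS = 43
needs_met = percent(31, TOTAL_STUDENTS)          # students who said their needs were met
barriers_resolved = percent(32, TOTAL_STUDENTS)  # students who said their barriers were resolved

print(needs_met, is_effective(needs_met))
print(barriers_resolved, is_effective(barriers_resolved))
```

both shares come out at the reported 72.09% and 74.42% and exceed the threshold, which is consistent with the ‘effective’ category stated in the text.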
from the findings about the two points, it is clear that teachers are focusing on one side, i.e. on students’ barriers. it is better to strike a balance between the two sides so that both students’ barriers and needs are dealt with successfully. this may also result from the fact that teachers are not balancing students’ inferred needs and expressed ones. according to arends and kilcher (2010, p.91), determining and responding to students’ needs is one of the most complex and difficult tasks faced by teachers. from this fact, it can be deduced that teachers should focus on areas where many students’ needs are stressed and try to weigh students’ inferred needs against the ones already stated in the curriculum. regarding students’ and teachers’ perceptions about students’ grades, it is found that most of the students, that is 33 students or 76.74%, said that they are satisfied with their grade at the end of the semester. it is also found that all students passed the subject with a class mean of 83. from the findings, it is clear that teachers did their best in helping their students reach and pass the national minimum passing score, which is 75. one of the ultimate goals of education is to have all students reach the national standard. thus, the comparison of the findings can lead to the conclusion that the english teaching program was effective in that part of the teaching/learning activities. regarding the students’ and teachers’ perceptions of the teaching/learning materials, the materials are judged insufficient, as the majority of the students said that the teaching/learning materials are not enough. teachers also confirmed that the available teaching/learning materials are insufficient. based on the findings, it is concluded that the teaching materials are not effectively helping students to overcome their problems and meet their needs.
it is understandable because, since the input component, the teachers had complained that the teaching materials are enough only for the basic purposes. thus, the teaching materials are found not effective. the government should provide sufficient and appropriate teaching/learning materials before the implementation of a new program to increase the success in teaching. last but not least, the english teaching program is perceived as effective by the majority of students, that is 60.47%. it means that teachers taught and evaluated the program objectives. moreover, it is reflected in the fact that all students pass the english course with good grades.
conclusions and recommendations
based on the findings from the context component, it can be concluded that the english program objectives set in curriculum 2013, presented in table 4, are found to be effective as they cover most of the students’ needs and barriers. they are found to be rich as they encompass all the language components. on the one hand, it is found that the main students’ needs are covered up to 66.67%. on the other hand, it is discovered that students’ barriers are covered up to 75%. concerning the classroom conditions, it is found that the school under study was built in a noisy environment. however, the mean score of the analysis was 16.6, which falls in the ‘agree’ category. it simply means that the classroom condition is conducive to meeting students’ needs and overcoming their barriers. from the input component, the data analysis and discussions show that both sman 3 yogyakarta english teachers are qualified and experienced in english teaching. the mean of experience is 13 years in teaching english. however, the teaching training is found to be insufficient. teachers mentioned that they did not get a lot from that teaching training because the time allotted to it was not enough.
they claimed the need for extra training related to the 2013 curriculum. as far as the students’ books are concerned, the findings show that not all students have their personal english book. it is found that 74.42% of the students, or 32 students out of a total of 43, have a personal students’ book. regarding the course designs, teachers of the eleventh grade of sman 3 yogyakarta admitted that they created their own course designs. they also said that they collaborated with other english teachers in sharing what they know and/or consulted some references in case there was something they were not sure of. in elaborating what to teach, it is found that the objectives of the program and students’ needs and barriers are at the core of the course designs.
in the process component, it is found that the available teaching/learning materials are capable of helping students to fulfill their needs and overcome their problems at sman 3 yogyakarta. however, some teaching/learning materials are often used and others are less used. the most used are videos and projectors, while the less used are magazines and board games. the teaching methods not only help students to meet their needs but also to solve their problems. to that end, a variety of teaching methods are in regular use, such as the communicative approach, the scientific approach, and project-based learning. teachers are not relying only on the teaching methods already set in curriculum 2013; they are also using their hidden curriculum and collaborating with other english teachers. moreover, it is also concluded that teachers really know students’ needs and barriers. a variety of teaching activities are practiced in classes. however, some discrepancies are noticed between the way those teaching activities are practiced in classes and the way the students expect them to be practiced.
taking students’ needs for instance, it is found through some teaching activities that some students’ needs are practiced more and others less than expected by the students. the same is true for the students’ barriers. in short, teachers are not totally consistent in focusing on the areas where students have more problems or needs. regarding the perceptions of english teachers of the eleventh grade of sman 3 yogyakarta concerning the assessments, it is concluded that assessments are very helpful. teachers are choosing tests based on students’ needs and barriers. feedback is also given to show students where they have made mistakes and where they are correct, and to encourage them, without forgetting to compliment those who perform better. assessments are found to be effective as they help teachers to resolve students’ problems and help students to meet their needs.
the data analysis and discussions about the product component lead to the following conclusions. at the end of the first semester, the majority of students’ needs are met. this is the case for 31 students out of 43, or 72.09% of the total students. it means that the students’ needs are in an ‘effective’ category. in the same way, the handling of the students’ barriers is found to be effective as the majority of the students’ barriers are resolved. 32 students, or 74.42% of the students, are in the ‘effective’ category as their barriers are solved. concerning students’ marks, teachers admitted that all students pass the english course with very good marks. the average mark of the english class is 83. 33 students, or 76.74% of the students, are in the ‘effective’ category as they mentioned that they are satisfied with their english marks. concerning the teaching/learning materials, they are found to be insufficient.
they cannot totally help in finding solutions to students’ problems and in helping students to meet their needs. since the beginning, teachers mentioned that the teaching/learning materials are not enough. the data analysis and discussions made from the perceptions of students and teachers regarding the english teaching program lead to the conclusion that the english teaching program is good. both teachers said that the program is effective and the majority of the students (60.47%) said that they are satisfied with the english teaching program. therefore, from the data gained, it can be concluded that the english program objectives, classroom condition, process component, teaching methods, and the product component are effective. however, the teaching/learning materials used are not effective enough to help students achieve the best results.
recommendations
to the government: (a) the government should organize more training sessions for english teachers regarding curriculum 2013 so that teachers can be familiar with the materials to be taught; (b) the government should provide more english students’ books so that each student gets their own english textbook; (c) the government should provide new teaching/learning materials related to the implementation of curriculum 2013.
to english teachers at sman 3 yogyakarta: teachers should be more consistent in their teaching activities by balancing students’ needs and barriers with the program objectives so that all of the students’ needs and barriers can be covered.
for further researchers: (a) because this case study evaluates only the first semester, further researchers should make an evaluation of the whole academic year or the whole program; (b) further researchers should also conduct follow-ups and periodical evaluations to keep the english program updated.
bibliography
arends, r.i. & kilcher, a. (2010). teaching for student learning: becoming an accomplished teacher. new york, ny: routledge.
cambridge advanced learner’s dictionary. (2000). cambridge advanced learner’s dictionary (10th ed.). london: cambridge press.
chris, k. (2009). effective teaching in schools: theory and practice (3rd ed.). cheltenham: thornes ltd.
farooq, m. & aslam, k.m. (2011). impact of training and feedback on employee performance. far east journal of psychology and business, 5(1), 23-33.
fernandez, h.j.x. (1984). evaluation of educational programs. jakarta: national educational planning, evaluation and curriculum development.
fterniati, a. & spinthourakis, j.a. (2006). national curriculum reform and new elementary school language arts textbooks in greece. international journal of learning, 13(4), 1-11. retrieved from http://www.elemedu.upatras.gr/english/images/afterniati/nationalcurriculumreformandnewelementaryschoollanguageartstextbooksingreece.pdf
idea center. (2002). some thoughts on selecting idea objectives. retrieved from https://txwes.edu/media/twu/contentassets/images/faculty-and-staff/ideasurvery-information/somethoughts-on-selectingidea-objectives.pdf
kaufman, r. & thomas, s. (1980). evaluation without fear. new york, ny: new viewpoints.
lynch, b.k. (1996). language program evaluation: theory and practice. cambridge: cambridge university press.
mardapi, d. (2008).
teknik penyusunan instrument tes dan non tes [test and nontest instrument arrangement technique]. yogyakarta: mitracendekia press.
mcdavid, j.c. & hawthorn, l.r.l. (2006). program evaluation and performance measurement: an introduction to practice. thousand oaks, ca: sage publications, inc.
miles, m.b. & huberman, a.m. (1994). qualitative data analysis: an expanded sourcebook (2nd ed.). thousand oaks, ca: sage publications.
moore, k.d. (2009). effective instructional strategies: from theory to practice (2nd ed.). thousand oaks, california: sage publications.
nuh, m. (2013). the future of this nation depends on the new curriculum. the jakarta post. retrieved from http://www.thejakartapost.com/news/2013/02/19/future-indonesia-depends-new-curriculum-minister.html
sanders, j.r. & sullins, c.d. (2006). evaluating school programs: an educator’s guide (3rd ed.). thousand oaks, california: corwin press, sage publications company.
stufflebeam, d.l. & shinkfield, a.j. (1985). systematic evaluation. boston: kluwer-nijhoff publishing.
suleman, q., aslam, h.d. & hussain, i. (2014). effects of classroom physical environment on the academic achievement scores of secondary school students in kohat division, pakistan. international journal of learning & development, 4(1), 71-82.
wati, h. (2011). the effectiveness of indonesian english teachers training programs in improving confidence and motivation. international journal of instruction, 4(1), 79-104.
wei, l.s. (2011). relationship between students’ perceptions of classroom environment and their motivation in learning english language. international journal of humanities and social science, 1(21), 240-250.
zhang, g., zeller, n., griffith, r., metcalf, d., williams, j., shea, c. & misulis, k. (2011). using the context, input, process, and product evaluation model (cipp) as a comprehensive framework to guide the planning, implementation, and assessment of service-learning programs.
journal of higher education outreach and engagement, 15(4), 57-83.
research and evaluation in education issn 2460-6995
research and evaluation in education, 2(2), 2016, 206-219
available online at: http://journal.uny.ac.id/index.php/reid
research article
academic performance and moral competence: a match made in heaven?
* 1 umaru mustapha zubairu; 2 chetubo kuta dauda; 3 olalekan busra sakariyau; 4 isa imam paiko
1,2,3,4 department of entrepreneurship and business studies of federal university of technology minna, p.m.b 65, gidan-kwanu, minna bida road, niger state
abstract
this study aims to empirically assess the relationship between accounting students' academic performances and moral competencies by focusing on final-year accounting students enrolled at the international islamic university malaysia (iium). the students' moral competencies were measured using a scenario-based instrument developed through a collaboration with islamic accounting scholars, called the muslim accountant moral competency test (mamoc), whilst students' academic performances were measured using their cumulative grade point averages (cgpas). contrary to the expected positive relationship between these two variables, the study found a negative, and insignificant, relationship. the implication of this result is that iium's accounting department needs to conduct a comprehensive review of the ethical content of its courses and adopt a more effective strategy for integrating islamic values into the curriculum.
additionally, institutionalizing a measure of students' moral competencies would enable the department to objectively determine how well it is doing in developing the moral competencies of its students.

keywords: moral competence, academic performance, higher education, islamic perspective

how to cite item: zubairu, u., dauda, c., sakariyau, o., & paiko, i. (2016). academic performance and moral competence: a match made in heaven? research and evaluation in education, 2(2), 206-219. doi:http://dx.doi.org/10.21831/reid.v2i2.8956

*corresponding author. e-mail: uzubairu@gmail.com

introduction
the rash of financial scandals over the last two decades (enron, worldcom, arthur andersen, parmalat, global financial crisis, etc.) has highlighted the steady moral decline amongst business people generally, and accountants specifically (berg, 2015; hemraj, 2015; markham, 2015). this is because these scandals could not have occurred without the complicity of those responsible for preparing and safeguarding critical financial information. to address this serious moral malaise amongst accountants, universities have been tasked with integrating ethics education into the accounting curriculum. it is expected that this will lead to the development of morally competent future accountants. in 2012, the malaysian government joined this crusade by releasing an educational blueprint which had as one of its main objectives the development of morally competent professionals by universities (malaysia education blueprint 2013-2025, 2012). muslims make up 61.3% of malaysia's total population (department of statistics, malaysia, 2010). for this reason, the development of future muslim accountants imbued with islamic values is crucial in addressing the moral malaise amongst the country's accountants.
islamic universities in the country thus play a central role in achieving the government's morality mandate. these universities claim to be able to produce morally competent muslim accountants by integrating islamic values in every accounting course offered. if these claims are true, then it is logical to expect that students who do well in these courses will possess a high level of moral competence in an accounting context. in other words, there ought to be a positive correlation between muslim accounting students' academic performances, as determined by the cumulative grade point average (cgpa), and their moral competencies.

this study sought to empirically assess the existence of such a positive relationship between accounting students’ cgpas and their moral competencies by focusing on final-year accounting students enrolled at the international islamic university malaysia (iium). their moral competencies were measured using a scenario-based instrument developed in collaboration with islamic accounting scholars, called the muslim accountant moral competency test (mamoc).

the rest of the paper proceeds as follows. firstly, the research paradigm adopted in this paper is described. this is followed by a review of the scholarship on the relationship between academic performance and moral competence. this paper's conceptual framework is then presented, followed by the research methodology adopted. the findings of the paper along with a discussion of their implications follow, and finally, the paper ends with a conclusion.

paradigm
for muslims, islam represents a complete way of life. what this implies is that every aspect of a muslim's life, including the conducting of research, is guided by the principles of islam as embodied in its two primary sources of guidance, the noble quran and the sunnah (teachings, deeds, sayings, and silent permissions of the noble prophet muhammad [peace be upon him, pbuh]).
the sunnah is found in narrations by the companions of the prophet muhammad (pbuh) called ahadith (plural of hadith). the implication of adopting this paradigm in conducting research is that all concepts identified in a study are defined in accordance with islam's primary sources of guidance, rather than adopting conventional definitions of these concepts. additionally, the expected relationships between these concepts are also derived from the quran and sunnah. this study embraces this paradigm, and this is particularly appropriate as the focus of the study is on the moral competencies of muslim accounting students, and this concept can only be fully understood by referring to the source of all muslim morality, the quran and sunnah. in order to shed more light on some verses of the quran and ahadith, commentaries by renowned islamic scholars are also relied upon.

the review of relevant scholarship that explored the relationship between moral competence and academic performance revealed four streams of research based on the level of education of the students as follows: stream one deals with elementary school students (schaps, solomon & watson, 1985; benninga, berkowitz, kuehn, & smith, 2003; snyder, flay, vuchinich, acock, washburn, beets, & li, 2009; hood, 2011); stream two deals with middle school students (hightower, 2001; flay et al., 2012; elias, white, & stepney, 2014); stream three deals with high school students (wynne & walberg, 1985; mollman, 2004; kariuki & williams, 2006; lombardo, 2008; griffin, 2011); and stream four deals with university students (luttamaguzi, 2012; olowookere, alao, odukoya, adekeye, & ade’agbude, 2015). these streams are discussed in turn in the paragraphs that follow.

the studies reviewed did not make explicit reference to ‘moral competence’; rather, they examined the relationship between academic performance and ‘character’.
their conceptualization of character included traits like honesty, compassion, justice, perseverance and trustworthiness (mollman, 2004; lombardo, 2008); all these traits are what constitute a person's moral competence. for this reason, the term ‘moral competence’ is used in place of ‘character’ in this review, so as to facilitate the flow of the paper and avoid confusing readers over mere semantics.

stream one: elementary schools
all four studies reviewed in this section took place in the united states of america, and all found a positive correlation between students' moral competencies and academic performance; the difference was in terms of the strength of the correlation. schaps et al. (1985) determined students' moral competencies through interviews and group-based task sessions, whilst teachers' observations were used to determine their academic performances. a small positive correlation was found between the two variables. benninga et al. (2003) also found a small positive correlation between the two variables in their study. however, they employed different proxies for student moral competencies and academic performances: character education implementation levels were used as a proxy for moral competency, whilst stanford achievement test, 9th edition (sat-9) scores were used as a proxy for academic performance. snyder et al. (2009) used the same proxy for moral competence as benninga et al. (2003), but used the hawaii content and performance standard scores to measure academic performance, instead of the sat-9. they found a moderate positive correlation between student moral competencies and academic performance. character education implementation levels were also used by hood (2011) as a proxy for moral competence, whilst the new jersey school report card was used to ascertain academic performance.
she found a strong positive correlation between student moral competencies and academic performances.

stream two: middle school
elias et al. (2014) provide theoretical arguments as to the complementary relationship between moral competency and academic performance amongst middle school students. they urged all middle schools to establish character education programs so as to leverage this complementary relationship and produce students with excellent characters and exemplary academic prowess. hightower (2001) and flay et al. (2012), both studies conducted in the usa, provide the evidence that enabled elias et al. (2014) to posit that middle school students' moral competencies are positively related to their academic performances. hightower (2001) developed a questionnaire to measure middle school students' moral competencies and used their sat-9 scores as a proxy for their academic performances. she found a strong and positive relationship between these two variables. the study of flay et al. (2012) took place in chicago, and they used the value-added illinois state achievement test scores as a proxy for the academic performances of students in the study, whilst character education implementation levels served as a proxy for student moral competencies. after six years of the program, students from schools that implemented the program had 15% better scores than students in the schools without the program.

stream three: high school students
wynne and walberg (1985), mollman (2004), and lombardo (2008) present normative arguments about the positive relationships between high school students’ moral competency and academic performance. they conclude that all high schools should have dual goals of developing the moral competencies and academic performances of students.
the trend of us-based empirical studies continues with the study by kariuki and williams (2006). high school students at a military school were surveyed; their moral competencies were measured using a modification of a ‘what do you really believe?’ survey, whilst their cumulative grade point averages (cgpas) served as a measure of their academic performances. the results reveal a significant correlation between moral competence and academic performance. in south carolina, griffin (2011) surveyed high school students, using their participation in a character development class as a proxy for their improved moral competencies. the students' grades in an english class before and after the character development class served as a measure of their academic performance. in total, 62% of the students improved on their pre-test scores, whilst 42% improved by a whole letter grade.

stream four: university students
a study in uganda by luttamaguzi (2012) provides a different context from the us studies reviewed. the study focuses on education majors, and their moral competencies were measured using an author-developed questionnaire. students' cgpas served as measures of their academic performances. the results reveal a positive and significant relationship between the two variables. kern and bowling (2015) take us back to the us, with a survey of law students whose moral competencies were measured using a values in action character strengths inventory. like luttamaguzi (2012), kern and bowling (2015) used cgpa as a measure of students' academic performances, and the results also show a positive relationship between moral competence and academic performance. olowookere et al. (2015) provide a nigerian perspective on the relationship between students' moral competencies and academic performances by surveying students enrolled in a leadership development program.
the authors developed a questionnaire called the ‘character development questionnaire’ to measure the students' moral competencies, whilst cgpa was used to measure academic performance. like the other studies reviewed, their study also found a positive relationship between these two variables.

observations from the review
the first observation is that all the studies reviewed either advocated for a positive relationship between student moral competence and academic performance, or provided empirical evidence of this positive relationship. this was despite the fact that different proxies were used to assess moral competence and academic performance. this body of evidence gives great credibility to the intuitive expectation expressed earlier in this paper that there should be a positive relationship between the moral competencies and academic performances of final-year accounting students enrolled at iium.

the second observation is that the vast majority of the studies reviewed were conducted in the usa. only two studies (luttamaguzi, 2012; olowookere et al., 2015) were conducted outside the us, and both were in africa. this current study examines the relationship between moral competence and academic performance in a malaysian context, thus giving an asian perspective which seems to be absent from the current scholarship in this area.

the third observation is that the discussion of moral competence has been from either a secular or christian perspective. this study provides an islamic perspective which offers a different viewpoint, and thus contributes additional knowledge to the existing scholarship.

this study therefore had two main concepts: moral competence and academic performance. the conceptualizations of these concepts in this study, as well as the expected relationship between them from an islamic perspective, are presented below.
moral competence
in this study, a morally competent muslim accountant was defined as one who has the ability to make moral decisions in line with the commands of allah in the noble quran, and in accordance with the sunnah of the noble prophet muhammad (pbuh), in discharging his or her duties as an accountant. in islam, this concept of moral competence is made up of two separate but interdependent parts: (1) knowing the right thing to do (moral action), and (2) doing the right thing for allah’s sake alone (moral intention). in the sight of almighty allah, a moral action is only acceptable if the moral intention is solely for his pleasure. the blessed prophet muhammad (pbuh) explains this very important point in the famous hadith narrated by umar bin al-khattab:

the messenger of allah (pbuh) said, ‘the deeds are considered by the intentions, and a person will get the reward according to his intention. so whoever emigrated for allah and his messenger, his emigration will be for allah and his messenger; and whoever emigrated for worldly benefits or for a woman to marry, his emigration would be for what he emigrated for’ (al-nawawi, book 1, hadith 1).

this is a crucial concept which this study took into consideration when assessing the moral competencies of final-year accounting students enrolled at iium. figure 1 illustrates the above-mentioned conceptualization of moral competence.

academic performance
as mentioned in the paradigm section of this paper, luttamaguzi (2012), kern and bowling (2015), and olowookere et al. (2015) all explore the relationship between students' moral competencies and academic performance in a university context. these three studies all utilized the students' cgpas as a measure of their academic performances. following their example, this study also used cgpa to capture students' academic performances.
islamic position on the relationship between academic performance and moral competence
islam is in agreement with the empirical evidence provided in the paradigm section of this paper that a positive relationship should exist between academic performance and moral competence. this is particularly true in iium's case where all accounting courses are supposed to be integrated with islamic moral values. the following verse of the noble quran and hadith shed light on this issue:

and among people and moving creatures and grazing livestock are various colors similarly. only those fear allah, from among his servants, who have knowledge. indeed, allah is exalted in might and forgiving (qs. fatir: 28).

the prophet muhammad (pbuh) said: ‘verily, god loves if any of you does a job, he does it with perfection’ (al-bayhaqi).

applying the verse above in the context of this study, the cgpa represents a measure of the students' knowledge about islamic principles in an accounting context; those with the highest cgpa possess the most knowledge, and thus they ought to fear allah the most, with ‘fear of allah’ representing moral competence. in other words, the higher the cgpa, the higher the moral competence.

[figure 1: conceptualization of moral competence. moral competence comprises moral intention (actions done solely for allah's pleasure) and moral actions (in line with the qur’an and sunnah).]

similarly, applying the hadith, moral competence represents one's level of obedience to god's commands. in the hadith, god commands humans to strive for perfection in all that we do. for the accounting students, those with the highest cgpa have striven the most to achieve perfection, thus displaying the highest level of moral competence. on both readings, moral competence and academic performance are expected to be positively related.
method
this section presents the study's research design and methodology. firstly, an overview of iium is provided, with particular emphasis on the accounting department to which the students who participated in the study belong. secondly, the steps followed to develop a profile of a morally competent muslim accountant are presented. thirdly, the steps followed to develop the instrument to measure the moral competencies of muslim accounting students are presented. finally, the actual procedure adopted for measuring the moral competencies of the specified students is discussed.

an overview of the international islamic university malaysia (iium)
iium was established on 23rd may, 1983 based on the philosophy that all fields of knowledge should lead toward the recognition of, and submission to, the fact that almighty allah is the only one worthy of worship and is the absolute creator and master of the universe. iium has a four-pronged mission of integration, islamization, internationalization and comprehensive excellence (international islamic university malaysia, 2014). the university offers bachelor's, master's and doctoral degrees in its 13 faculties, called ‘kulliyyahs’. of particular interest to this study was iium’s undergraduate accounting program, which is accredited by the malaysian institute of accountants (mia). mia accreditation of an accounting program is very important as only students who graduate from such programs can legally call themselves ‘accountants’ in malaysia (malaysian institute of accountants, 2012). iium's accounting program claims to integrate islamic principles into contemporary accounting knowledge.
in order to graduate, students are required to complete a minimum of 134 hours of a combination of university-required, kulliyyah-required and departmental courses, including practical training. three of the university-required courses have islamic ethical content: (a) islamic worldview, (b) islam, knowledge and civilization, and (c) ethics and fiqh for everyday life; four of the kulliyyah-required courses also have islamic ethical content. in addition, a recent islamization initiative has mandated that ethical and islamic content be integrated in all courses (international islamic university malaysia, 2014). at the time of this study, iium's department of accounting had 28 academic staff and 552 undergraduate students enrolled.

developing the profile of the morally competent muslim accountant
this section describes the process adopted in order to develop the profile of a morally competent muslim accountant. the profile was developed from the perspective of the muslim accounting graduate, and was divided into two components: (1) finding the ‘right’ job, and (2) following an islamic ‘code of conduct’. a discussion of these two components is presented below.

finding the ‘right’ job
nu'man b. bashir (allah be pleased with him) reported: i heard allah's messenger (pbuh) as having said this (and nu'man pointed towards his ears with his fingers): what is lawful is evident and what is unlawful is evident, and in between them are the things doubtful which many people do not know. so he who guards against doubtful things keeps his religion and honor blameless, and he who indulges in doubtful things indulges in fact in unlawful things, just as a shepherd who pastures his animals round a preserve will soon pasture them in it. beware, every king has a preserve, and the things god has declared unlawful are his preserves.
beware, in the body there is a piece of flesh; if it is sound, the whole body is sound and if it is corrupt the whole body is corrupt, and hearken it is the heart (al-hajjaj, 1599a, book 22, hadith 133).

for a muslim accounting graduate, the first challenge that faces him or her is finding the ‘right’ job. the ‘right’ job is one where all activities are in line with the quran and sunnah. as the hadith cited teaches us, the permissible jobs are clear and the prohibited jobs are clear, and the morally competent muslim accountant must be able to make this distinction. the permissible activities are numerous, and thus the muslim accountant has many options.

say, ‘my lord has only forbidden immoralities, what is apparent of them and what is concealed, and sin, and oppression without right, and that you associate with allah that for which he has not sent down authority, and that you say about allah that which you do not know’ (q.s. araf: 33).

however, there are certain kinds of jobs a muslim accountant has to avoid because the activities that he or she would engage in are incompatible with the commands of almighty allah.

‘…and cooperate in righteousness and piety, but do not cooperate in sin and aggression. and fear allah; indeed, allah is severe in penalty’ (q.s. maeda: 2).

some of the most commonly known haram activities include (1) dealing in interest, (2) gambling, and (3) dealing with intoxicants. after securing a job at an allah-approved organization, the next concern for the morally competent muslim accountant is to fulfil his or her duties in accordance with the commands of almighty allah. this ‘code of conduct’ represents the second component of the profile.

following an islamic code of conduct
as mentioned earlier in this paper, every act of a morally competent muslim must be done with the objective of earning the pleasure of the most gracious allah.
a muslim accountant must thus keep this critical objective in mind whilst discharging his or her duties as an accountant. the objective of this component of the profile was to develop a comprehensive islamic code of conduct that includes all the qualities that a morally competent muslim accountant must display to please his or her creator. in order to develop this code of conduct for muslim accountants, the study adopted a two-pronged approach.

the first prong was to adopt the code of conduct for muslim accountants developed by the accounting and auditing organization for islamic financial institutions (aaoifi) as a foundation for this component of the profile. aaoifi’s code of conduct for muslim accountants was first published in 1991 and is derived from the noble quran and sunnah; this made it an excellent starting point. aaoifi’s code of conduct contains the five ethical principles described below:

trustworthiness. the muslim accountant should be straightforward and honest whilst discharging his or her duties, and must never present untruthful information.
objectivity. the muslim accountant should be fair, impartial and free from any conflict of interest.
professional competence and diligence. the muslim accountant must possess the requisite skills to successfully discharge his or her duties.
confidentiality. the muslim accountant must never divulge information obtained about an organization during the course of discharging his or her duties without permission unless he or she is legally or professionally obliged to do so.
professional conduct and technical standards. the muslim accountant must observe the rules of professional conduct and obey the accounting and auditing standards of shariah-compliant organizations.

the second prong was to interview and consult extensively with five islamic scholars well versed in the quran and sunnah, particularly in the areas of ‘islamic accounting’ as well as ‘fiqh muamalat’ (laws of islamic business transactions).
these consultations established the content validity of aaoifi’s code of conduct. in addition to the five qualities listed by aaoifi's code of conduct, the scholars suggested three more qualities to be added under the umbrella of ‘faith-driven’ conduct, which are unique to the muslim accountant. these qualities are (1) avoiding interest, (2) avoiding gambling, and (3) avoiding physical contact with the opposite sex (ghairu mahram).

in total, the developed profile of the ideal muslim accountant consisted of nine key qualities: (1) identifying the right job; (2) trustworthiness; (3) objectivity; (4) professional competence and diligence; (5) confidentiality; (6) professional conduct and technical standards; (7) avoiding interest; (8) avoiding gambling; (9) avoiding physical contact with the opposite sex. after the development of the two-component profile of a morally competent muslim accountant was completed, the next step was to develop an instrument capable of effectively measuring the nine qualities contained in the profile.

developing the muslim accountant moral competency test (mamoc)
a collaborative effort by the researchers and the five islamic accounting and fiqh muamalat scholars resulted in the development of nine interrelated ethical scenarios, one to measure each of the nine qualities of a morally competent muslim accountant highlighted above. the instrument developed was called ‘the muslim accountant moral competency test’ or ‘mamoc’. mamoc has a title and three main sections. the instrument was titled ‘understanding the career aspirations and work-related decisions of future accountants’. in order to minimize social desirability bias amongst the respondents, the study's objective was disguised by giving the instrument this neutral heading without any obvious moral overtones.
the first section was a demographic section with ten items (age, religion, gender, nationality, the question ‘how often do you pray daily?’, the question ‘how important is religion in your family?’, year of study, type of secondary school attended, the question ‘how often do you visit your place of worship?’ and cgpa).

the second section was titled ‘choosing your dream job’. here, the respondents were given a choice of five job offers from companies in different industries. each job offer had a company description, a job description, and an annual salary. this section sought to determine if the muslim accounting students knew what the right job was from an islamic perspective. to test this important aspect of the students' moral competencies, all the jobs offered were unacceptable from an islamic perspective; it was expected that the morally competent student would recognize this, and consequently reject all job offers on the basis of their unacceptability islamically.

the third section was titled ‘living your dream job’, and contained eight ethical scenarios, each testing one of the remaining eight qualities of a morally competent accountant mentioned earlier [(1) trustworthiness, (2) objectivity, (3) professional competence and diligence, (4) confidentiality, (5) professional conduct and technical standards, (6) avoiding interest, (7) avoiding gambling, (8) avoiding physical contact with the opposite sex]. the protagonist in the scenarios was a friend of the respondent, and the respondent was required to resolve the ethical dilemmas by advising his or her friend on what to do. the scenarios were structured in this way with the hope that respondents would be more honest in their answers if they were placed in an advisory capacity, rather than as the main actors in the scenarios. to conclude the instrument, the students were asked whether they would remain with the company after all the experiences contained in the previous scenarios.
the ethical scenarios contained in the instrument were then resolved by the scholars based on evidence from the quran and sunnah. their solution served as the model answer to each scenario, and also served as a scoring guide for determining the moral competencies of the students surveyed.

a pilot study was carried out using the newly developed instrument to assess whether respondents would understand the instructions, terminology and content of the questionnaire. additionally, the pilot study enabled the researchers to ascertain the reliability of the scoring system developed by the islamic scholars. first-year muslim students from the economics and management faculty at iium were used to conduct the pilot study. these students were enrolled in four different sections of a financial accounting fundamentals class, and were selected because they closely resembled the students selected for the actual study, final-year muslim accounting students. a questionnaire was distributed to 100 students, and they were asked to carefully go through the questionnaire and ask any questions they might have as to its content. all the students stated that they clearly understood how to complete the questionnaires. the students were then told to take the questionnaires home, complete them, and bring them to the next class session. they were also told to write down how long it took them to complete the questionnaire. of the 100 students, 33 returned completed questionnaires. an analysis of the completed questionnaires revealed that the students did indeed understand how to complete it. they provided well-thought-out and clear resolutions to the various scenarios, and followed the stated instructions very well.
the fact that first-year students could understand the instructions, content and terminology of mamoc so well gave the researchers confidence that the actual respondents of the study, final-year accounting students at iium, would understand just as well.

savulescu, crisp, fulford, and hope (1999) explain that any instrument used to measure moral competence must be capable of being reliably applied by different raters. they also suggested that ‘naïve’ raters should be utilized (naïve raters are those not involved in the development of the instrument). following the advice of savulescu et al. (1999), after the pilot study was completed, the inter-rater reliability of the scoring system was assessed using the completed questionnaires from the pilot study; inter-rater reliability is defined as ‘the degree to which different judges or raters agree in their assessment decisions’ (phelan & wren, 2006). one of the researchers and one naïve rater employed the model answers to assess the moral competencies of the students who participated in the pilot study. halgren (2012) stated that intra-class correlation (icc) is the most commonly used statistical procedure to determine inter-rater reliability for studies that have two or more raters and continuous variables. spss was used to calculate the instrument’s inter-rater reliability using icc. higher icc values indicate greater inter-rater reliability, with an icc estimate of 1 indicating perfect agreement and 0 indicating only random agreement. negative icc estimates indicate systematic disagreement between the raters (halgren, 2012). after the completed pilot study questionnaires had been rated by one of the researchers and the naïve rater, an intra-class correlation coefficient of 0.943 showed that the two raters were in almost perfect agreement when assessing the moral competencies of the pilot study participants.
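the icc calculation described above can be sketched in a few lines of python. this is an illustration only: the study used spss and does not report which icc form was selected, so the sketch assumes icc(2,1) (two-way random effects, absolute agreement, single rater), and the function name and the pilot scores are hypothetical.

```python
# illustrative sketch only: the study used spss, and the icc form is not reported.
# assumption: icc(2,1), i.e. two-way random effects, absolute agreement, single rater.

def icc_2_1(ratings):
    """ratings: list of rows, one row per questionnaire, one column per rater."""
    n = len(ratings)       # number of rated questionnaires
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # mean squares from a two-way anova without replication
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# hypothetical mc scores (0-10) given by a researcher and a naive rater
pilot_scores = [[6, 6], [3, 4], [8, 8], [5, 5], [2, 2], [7, 6]]
print(round(icc_2_1(pilot_scores), 3))
```

values close to 1 from such a calculation would correspond to the near-perfect agreement reported for the two raters; other icc forms (for example, consistency rather than absolute agreement) would give somewhat different values on the same data.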
the icc of 0.943 supported the reliability of the model answers for the assessment of students’ moral competencies. the pilot study also revealed that it took the students an average of thirty minutes to complete the questionnaire; this time was then allotted in the actual study. finally, the pilot study revealed that the best approach for conducting the survey would be to have the students complete it during class time, as opposed to letting them take it home and bring it back at the next class session. conducting the survey in class meant that the completed questionnaires were collected immediately, thus ensuring a much higher response rate in the actual study.

as specified in an earlier section of this paper, moral competence (mc) from an islamic perspective is a product of two components, namely: moral action in line with the quran and sunnah (ma), and moral intention to please almighty allah alone (mi). participating students were asked to resolve each scenario by stating the action they would advise their friend to take (ma), and providing a reason for that advice (mi). if a student’s ma corresponded with the model ma, a score of 1 was given; if it did not, a score of 0 was given. the same rule applied to mi (1 for the correct reason, and 0 for an incorrect reason). for each scenario, a student’s mc = ma * mi. thus, for a student to score on any scenario, both ma and mi had to correspond with the model answers; otherwise he or she scored 0 for that scenario. scores for each scenario were added to provide an overall mc score for each student; mc scores could range from a minimum of ‘0’ to a maximum of ‘10’.

the students' academic performances were determined by their response to the cgpa item in the demographic information section of mamoc.
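the scoring rule can be sketched in python as a small illustration. the function name mc_score and the sample responses are hypothetical; only the rule itself (mc = ma * mi per scenario, summed over the ten scenarios) comes from the text.

```python
def mc_score(responses):
    """Sum of MA * MI over scenarios; each response is an (ma, mi) pair of 0/1 flags."""
    total = 0
    for ma, mi in responses:
        if ma not in (0, 1) or mi not in (0, 1):
            raise ValueError("ma and mi must each be 0 or 1")
        total += ma * mi  # a scenario counts only when both match the model answer
    return total


# Hypothetical student: (ma, mi) for each of the ten scenarios.
student = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0),
           (1, 1), (1, 1), (0, 1), (1, 0), (1, 1)]
print(mc_score(student))  # 5
```

because the two components are multiplied rather than added, a correct action given for the wrong reason (or the reverse) earns nothing for that scenario.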
students had five cgpa options to choose from: <2.00, 2.00-2.50, 2.51-3.00, 3.01-3.50, and >3.50.

measuring the moral competencies of muslim accounting students of iium

following the approach adopted in the pilot study, all sections of a compulsory final-year accounting course were surveyed using mamoc. a total of 72 final-year students completed the questionnaire. final-year students were selected for this study because they had completed most of the courses offered at iium, and thus were expected to have been imbued with all the islamic values that had supposedly been integrated into the courses.

the relationship between the students' academic performances and their moral competencies was then determined using spearman's rank order correlation. this non-parametric technique was used instead of its parametric alternative, pearson's product-moment correlation, because the assumption of linearity was violated.

findings and discussion

this section of the paper presents the research findings and their implications. the descriptive statistics of the study's respondents are presented first. the relationship between the academic performances of iium final-year accounting students and their moral competencies, as determined using spearman's rank order correlation, is then presented. a discussion of the implications of the results concludes the section.

descriptive statistics

tables 1, 2 and 3 present descriptive statistics of the iium accounting students surveyed in this study. table 1 reveals that the majority of accounting students surveyed were female, representing 86.1% of the sample; table 2 shows the distribution of the students' cgpas, with the majority of students having a cgpa between 3.01 and 3.50, representing 61.1% of the sample. table 3 shows that the average age of the students was 23.47 years, and that their mean moral competence score was 5.39 out of a maximum of 10.

table 1. gender

                    frequency   percent
valid   male           10         13.9
        female         62         86.1
        total          72        100.0

table 2. cgpa

                         frequency   percent
valid   2 to 2.5             2          2.8
        2.51-3              16         22.2
        3.01-3.5            44         61.1
        more than 3.5       10         13.9
        total               72        100.0

table 3. age and moral competence scores

                       n    min.   max.   mean    std. deviation
age                    72    22     25    23.47        .804
mcscore                72     2     10     5.39       2.268
valid n (listwise)     72

determining the relationship between iium final-year accounting students' academic performances and their moral competencies

table 4 presents the correlation between students' cgpas and moral competence scores. the results are startling, as they go against the grain of all previous studies reviewed in this paper. rather than the expected positive correlation between the two variables, there was a negative correlation of -0.036. however, the correlation was not statistically significant (p = .766). cohen (1988), as cited by pallant (2001), suggested the following interpretation of the strength of a correlation: .10 to .29 (small); .30 to .49 (medium); .50 to 1.0 (large). the correlation coefficient of -0.036 does not reach the threshold of a small correlation between the variables. the implication is that there is almost no relationship at all between the students' cgpas and their moral competencies.

table 5 sheds more light on the surprising negative correlation between students' cgpas and moral competence scores by providing the correlation of cgpa with each ethical scenario contained in mamoc. there is a small but significant positive correlation between students' cgpas and the fourth scenario, which measured the students' professional competence. there is also a small but significant negative correlation between students' cgpas and the sixth scenario, which measured students' understanding that gambling is forbidden in islam.

discussion

the basic premise of this study was that there would be a positive relationship between the academic performances and moral competencies of iium's final-year accounting students.
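a minimal python sketch of the correlation analysis reported in this section: spearman's rank order correlation computed as the pearson correlation of average ranks (so that tied cgpa bands are handled), followed by the cohen (1988) strength bands quoted in the text. the function names, the label "below small" and the sample data are hypothetical; the study's own figures are those reported in tables 4 and 5.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(sxx * syy)


def cohen_strength(r):
    """Strength bands quoted in the text: .10-.29 small, .30-.49 medium, .50-1.0 large."""
    size = abs(r)
    if size < 0.10:
        return "below small"  # hypothetical label for values under the 'small' threshold
    if size < 0.30:
        return "small"
    if size < 0.50:
        return "medium"
    return "large"


# Hypothetical data: cgpa band coded 1-5 and mc score out of 10.
cgpa_band = [2, 3, 4, 4, 5, 3, 4, 4]
mc = [6, 5, 7, 3, 4, 8, 5, 6]
rho = spearman_rho(cgpa_band, mc)
print(round(rho, 3), cohen_strength(rho))
print(cohen_strength(-0.036))  # below small
```

ranking before correlating is what lets the five ordered cgpa bands be used directly, without assuming a linear relationship between cgpa and mc score.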
the premise of a positive relationship rested on the supposition that islamic values were integrated into all the accounting courses at the university. however, the results revealed an unexpected negative, but insignificant, relationship between academic performance and moral competence. further analysis revealed a small, positive and significant correlation between students' academic performances and their understanding of professional competence.

the implication of these results is that iium's accounting department needs to conduct a comprehensive review of the ethical coverage in the current curriculum. focused strategies to integrate the nine attributes of a morally competent accountant identified in mamoc must be developed and implemented by the accounting staff. it is also recommended that moral competence be given equal importance to academic achievement through the institutionalization of a measure of moral competence, which could be called the moral grade point average (mgpa), to go hand in hand with the traditional cgpa.

regarding this second recommendation, that a measure of moral competence be institutionalized, the malaysian ministry of higher education has reached the same conclusion. in 2015, the ministry unveiled plans to institute a more holistic measure of university student performance titled ‘the integrated cumulative grade point average’ (icgpa) (ann, 2015; khor, 2015; tay, 2015). the icgpa, which is the result of six years of research and consultations, will be tested in selected faculties at five public universities: universiti kebangsaan malaysia (ukm), universiti teknologi mara (uitm), universiti malaysia terengganu, universiti malaysia kelantan and universiti malaysia pahang.
the icgpa is intended to provide a more comprehensive measure of student performance by addressing nine specific skill sets: (1) knowledge and understanding, (2) practical skills, (3) social skills and responsibilities, (4) professional skills, ethics and values, (5) communication skills, leadership, and teamwork, (6) problem-solving skills and scientific thinking, (7) information management and lifelong learning, (8) entrepreneurship and management, and (9) unity and patriotism (ann, 2015; khor, 2015; tay, 2015). the fourth skill set encompasses the important issue of moral competence.

table 4. correlation between cgpa and moral competence scores

spearman's rho                              cgpa     mcscore
cgpa       correlation coefficient         1.000      -.036
           sig. (2-tailed)                   .         .766
           n                                 72         72
mcscore    correlation coefficient         -.036      1.000
           sig. (2-tailed)                  .766        .
           n                                 72         72

table 5. correlation between cgpa and individual scenarios

scenario   correlation coefficient
1             .014
2             .079
3            -.127
4             .234*
5             .131
6            -.290*
7            -.003
8            -.063
9            -.169
10           -.166
*correlation is significant at the .05 level (2-tailed)

conclusion

in islam, true salvation lies in having a strong faith in allah, which is evidenced by sincerely obeying him in all things. this is particularly important for muslim accountants, who play a crucial role as guardians and providers of financial information on which so many stakeholders rely. iium is charged with the important task of producing morally competent muslim accountants who will restore the damaged credibility of accountants in malaysia. to achieve this goal, iium's accounting department has integrated islamic values into all its accounting courses. for this reason, it was expected that students who performed well in these courses would also possess a high level of moral competence.
however, this study found a negative but insignificant relationship between the academic performances and moral competencies of iium's final-year accounting students. the implication of this result is that there is a need to revisit the ethical content of iium's accounting curriculum, as well as the process of integrating islamic values into the accounting courses. additionally, institutionalizing the measurement of students' moral competencies would enable iium to determine objectively how well it is meeting its mandate of producing morally competent muslim accountants. ‘it is most hateful to allah that you should say that which you do not do’ (q.s. saff: 3).

references

accounting and auditing organization for islamic financial institutions. (2015). accounting, auditing, and governance standards for islamic financial institutions. retrieved from http://aaoifi.com/standard/accounting-standards/?lang=en

al-bayhaqi, a. b. (1994). al-sunan al-kubra. beirut: dar al-kutub al-ilmiyyah.

al-hajjaj, m. b. (2007). sahih muslim. riyadh: dar-us-salam publications.

al-nawawi, i. (1999). riyad-us-saliheen. riyadh: dar-us-salam publications.

ann, h. w. (2015). icgpa too little, too late? retrieved from http://www.bfm.my/current-affairs-icgpa.html

benninga, j. s., berkowitz, m. w., kuehn, p., & smith, k. (2003). the relationship of character education implementation and academic achievement in elementary schools. journal of research in character education, 1(1), 19-32.

berg, m. (2015). lending blind: shadow banking and federal reserve governance in the global financial crisis. annandale-on-hudson, ny: levy economics institute.

department of statistics, malaysia. (2010). population distribution and basic demographic characteristics. retrieved from https://web.archive.org/web/20150301154300/http://www.statistics.gov.my/portal/download_population/files/census2010/taburan_penduduk_dan_ciri-ciri_asas_demografi.pdf

elias, m. j., white, g., & stepney, c. (2014). surmounting the challenges of improving academic performance: closing the achievement gap through social-emotional and character development. journal of urban learning, teaching, and research, 10, 14-24.

flay, b. r., acock, a., vuchinich, s., lewis, k., bavarian, n., schure, m., ... & ji, p. (2012). social-emotional and character development to improve student behaviour and academic achievement: results from two school-based randomized trials. in 2nd international conference on the future of education, florence, italy.

griffin, k. p. (2011). the effect of character education on the academic achievement of high school students. the divergent learning journal, 25.

halgren, k. a. (2012). computing inter-rater reliability for observational data: an overview and tutorial. tutor quant methods psychol., 8(1), 23-34.

hemraj, m. (2015). us statutory regulation. in m. hemraj (ed.), credit rating agencies: self-regulation, statutory regulation and case law regulation in the united states and european union (pp. 93-149). switzerland: springer international.

hightower, m. a. b. (2001). the relationship between middle school students' level of character development and their behaviour, academic achievement, and attendance. legacy etds. paper 775. retrieved from http://digitalcommons.georgiasouthern.edu/etd_legacy/775

hood, k. l. (2011). character education and parental involvement: impact on academic achievement (master’s thesis). rowan university, glassboro, nj.

international islamic university malaysia. (2014). mission and vision. retrieved from http://www.iium.edu.my/medicine/about-us/mission-vision

kariuki, p., & williams, l. (2006). the relationship between character traits and academic performance of afjrotc high school students. retrieved from http://files.eric.ed.gov/fulltext/ed494959.pdf

kern, m. l., & bowling, d. s. (2015). character strengths and academic performance in law students. journal of research in personality, 55, 25-29.

khor, a. (2015). towards an integrated grading system. retrieved from http://www.thestar.com.my/news/education/2015/08/16/towards-an-integrated-grading-system/

lombardo, t. (2008). ethical character development and personal and academic excellence. the wisdom page. retrieved from www.wisdompage.com/lombardo—ethicalcharacterdevelopment2011.pdf

luttamaguzi, j. b. (2012). influence of moral aptitude on academic performance of the undergraduate students in school of education makerere university. kampala: makerere university institutional repository. retrieved from http://dspace.mak.ac.ug/handle/10570/3810

malaysia education blueprint 2013-2025. (2012). develop values-driven malaysians. retrieved from http://www.moe.gov.my/userfiles/file/ppp/preliminary-blueprint-eng.pdf

malaysian institute of accountants. (2012). nip philosophy. retrieved from http://www.iim.org.my/en/falsafah-pin

markham, j. w. (2015). a financial history of modern us corporate scandals: from enron to reform. new york, ny: routledge.

mollman, s. (2004). the effects of character education on positive self-esteem and academic achievement (doctoral dissertation). university of wisconsin-stout, menomonie, wi.

olowookere, e. i., alao, a. a., odukoya, j. a., adekeye, o. a., & ade’agbude, g. (2015). time management practices, character development and academic performance among university undergraduates: covenant university experience. creative education, 6(01), 79.

pallant, j. (2001). spss survival manual. berkshire: mcgraw-hill education.

phelan, c., & wren, j. (2006). exploring reliability in academic assessment. retrieved from https://www.uni.edu/chfasoa/reliabilityandvalidity.htm

savulescu, j., crisp, r., fulford, k. w., & hope, t. (1999). evaluating ethics competence in medical education. journal of medical ethics, 25(5), 367-374.

schaps, e., solomon, d., & watson, m. (1985). a program that combines character development and academic achievement. educational leadership, 43(4), 32-35.

snyder, f., flay, b., vuchinich, s., acock, a., washburn, i., beets, m., & li, k. k. (2009). impact of a social-emotional and character development program on school-level indicators of academic achievement, absenteeism, and disciplinary outcomes: a matched-pair, cluster-randomized, controlled trial. journal of research on educational effectiveness, 3(1), 26-55.

tay, e. (2015). malaysia to roll out icgpa programme. retrieved from http://highered.easyuni.com/2015/08/malaysia-to-roll-ut-icgpa-programme/

wynne, e. a., & walberg, h. j. (1985). the complementary goals of character development and academic excellence. educational leadership, 43(4), 15-18.

how to cite item: nurrahmah, n., zamroni, z., & sumarno, s. (2016).
an etnographic study of elementary education in the rural area of dompu county, the province of west nusa tenggara. research and evaluation in education, 2(1), 79-91. doi:http://dx.doi.org/10.21831/reid.v2i1.8270

research and evaluation in education e-issn: 2460-6995
research and evaluation in education volume 2, number 1, june 2016 (pages 79-91)
available online at: http://journal.uny.ac.id/index.php/reid

an etnographic study of elementary education in the rural area of dompu county, the province of west nusa tenggara

1 nurrahmah; 2 zamroni; 3 sumarno
1 mataram state islamic institute; 2,3 yogyakarta state university
1 rahmah03@yahoo.com; 2 zamroni1947@gmail.com; 3 sumarno_unj@yahoo.co.uk

abstract

the study aims to describe: (1) public service for elementary education in rural areas; (2) the meaning of education and the implementation of elementary education among the people of rural areas; and (3) the life and meaning of poverty for people of rural areas. the study was ethnographic research. the subjects were the providers and users of educational service. the research concludes that: (1) educational service in rural areas has not been coordinated and integrated, either vertically or horizontally, so the service elements have not performed optimally in providing educational service; (2) the implementation of education is influenced by the surrounding environments (policy, community, and nature), so that the conditions and problems emerging at the lower level reflect those at the upper level.
people in rural areas regard education as a symbol of profession and of their children’s self-actualization, through which the children will show respect to their parents, will not destroy nature, and will have noble character and be smart persons for their own sake in the future; (3) physically, people in rural areas might be described as a community that lacks facilities, including transportation, highway systems, water, electricity, and a market for trading the harvest. mentally, on the other hand, people in rural areas might be described as a community that is fond of receiving aid, enjoys finished products, is lazy and is dependent on nature. people in rural areas define their poverty based on physical indicators (the possession of luxury goods, rice fields, livestock, income, and housing) and non-physical indicators (dependency on nature and absence of education).

keywords: etnography, elementary education, rural areas

introduction

after being an independent state for more than 60 years, indonesia still encounters insufficient educational service for elementary education and high education. sudarminta (2000, pp.9-13) mentions three educational problems encountered by the state in the third millennium: low educational quality, insufficient learning systems in schools, and the widespread moral crisis in society. slamet (2008, p.1) states that the educational drawbacks of the republic of indonesia, in comparison with other asean countries, relate to the following aspects: (1) the amount of educational participation, (2) the period of 9-year compulsory education, (3) educational quality, (4) the average learning period (in general only 7 years of education), and (5) the poor efforts to improve the education system.
as a result, educational discrepancy has become very wide. the world bank (2012) reports that institutional weakness is a potential obstacle to educational improvement in indonesia, especially in elementary education. in relation to this weakness, the world bank also provides recommendations for improving education in indonesia. these recommendations have encouraged the birth of decentralization and regional autonomy in almost all sectors, including education. from a political perspective, many regions would like to become independent. however, from the perspective of natural resources and facilities, the dream of becoming independent areas seems impossible to realize.

dompu county, a region in the province of west nusa tenggara (ntb), has suffered from similar problems since it implemented the autonomy system in 1999. compared with the other counties in ntb, dompu county is classified as one of the poorest autonomous areas. a quote from the study performed by smeru research institute (2005) strengthens this description of the county:

using the 2000 administrative arrangement as a reference, the fgd participants ranked mataram as the highest, since the people in this capital city of ntb are considered as the wealthiest compared to the people in other districts. dompu is placed at the lowest rank because the area of this district is very large and the population density is low. the condition of the people in dompu was considered to be lacking in all aspect, including health condition, education level, income and infrastructure. bima, located in the eastern part of sumbawa island, was ranked second because this region is the gateway to and from east nusa tenggara, as well as the center for development in the eastern part of ntb. many people from bima have also become successful emigrants to other island.
the bottom line is that the results of the research place dompu county in the lowest rank in comparison with other regions such as mataram and bima. in terms of health, education, income and infrastructure, dompu county still needs improvement. tukiran (1993, p.15) states that, in order to determine whether a region is poor or not, the government might refer to indicators that reflect poverty, namely the region’s potentials, facilities, settlements and residents’ condition. furthermore, sumodiningrat (2000, p.56) argues that a poor society is generally marked by its inability or powerlessness to meet fundamental needs such as food, nutrition, clothing, housing, education and health.

compared with the regional gross domestic product (rgdp) of other, well-structured counties, the rgdp of dompu county might be classified as very small, at only rp 1,984 billion, while the percentage of poor people in the county is 19.90%. based on the data in nusa tenggara barat dalam angka (central bureau of statistics of nusa tenggara barat province, 2011), the illiteracy rate among dompu county residents aged 10 years and above is 11.33%. of the five counties on sumbawa island, dompu county is in second place after bima county in terms of the number of illiterate people.

many programs have been initiated and implemented by the regional government in both the educational and the economic sectors. in the economic sector, the regional government has initiated a program called pijar (sapi (cow), jagung (corn) and rumput laut (seaweed)) for improving the cultivation of cattle, corn and seaweed, with the expectation of decreasing the number of poor people by about 10% each year.
in the education sector, the regional government has implemented free education at the elementary level and even at higher levels. learning facilities such as school buildings have been constructed, and educational aids have been provided as well. in practice, however, the government has not attained satisfying results. according to the educational advocacy team formed by the government of dompu county, two matters mainly inhibit the implementation of good educational service, i.e. the low level of students’ participation and the low level of teachers’ attendance, especially in the remote areas. although education has been made free of charge, children keep leaving school during their school years. the two main reasons why children leave school are to help their parents earn a living for the family (especially children whose parents are fishermen) and to earn their own pocket money (especially under peer influence). in relation to this problem, the ilo (international labour office, 2003, p.22) states:

the education and preparation for working life of the current generation of children are of key importance to the drive to reduce and eradicate extreme poverty. children from families living on poverty incomes often start work at the age when their better-off counterparts are beginning to read.

education for children is thus the key factor in overcoming poverty. however, children from families with a lower economic background have started to work at the age when they should be starting to read. sachs, quoted by suharko (2007, p.4), states that a strategy for overcoming poverty is to break the ‘hereditary poverty chain’ before eliminating the poverty problems. although the poor intend to solve their problems, they will not be able to do so with their own resources. the main point is that the poor should be assisted in solving their poverty problems.
furthermore, surakhmat (2009, pp.17-18) argues that, from the perspective of education, the answer to poverty elimination lies in efforts to improve the society’s spirit, power and ability to help themselves. in this regard, education has a very important role in overcoming the negative effects caused by the cycle of drawbacks. zamroni (2010, p.185) likewise holds that the role of education in poverty elimination is decisive: the countries that have succeeded in eliminating poverty are the ones that have been able to provide education for all of their citizens.

based on the existing conditions and the opinions reviewed, it is clear that poverty and education are interrelated. poverty makes a person or a group of people unable to access sufficient education. conversely, education cannot be accessed by a person or a group of people who are poor naturally, culturally or structurally. at the research site, although educational service is provided free of charge and the number of school buildings is increasing, children keep leaving school because of the multiple forms of poverty from which they suffer.

the questions, then, are as follows. what is the point of constructing more and more school buildings year after year if the students keep leaving school? how can the educational process be implemented if the teachers do not support it? why does dompu county, which is a tourism destination, a producer of seaweed and a mining area, suffer from educational drawbacks? last but not least, why does the increasing amount of educational aid not improve the students’ awareness of and participation in the educational process? in general, the underlying question is: why is it so hard to break the poverty chain?
experts from the world bank offer a new paradigm: increasingly multidimensional poverty should not be viewed only from its outer layer. narayan, chambers, shah, and petesch (2000, p.1), social development specialists from the poverty-handling department of the world bank, state that the poor are the ones who understand their own condition. the results of a study by thuy (2012, p.196) support this statement and imply that different community members define and interpret poverty differently, based on the conditions of their own poverty.

therefore, this ethnographic study was performed in order to understand the meaning of education and poverty from the perspective of a society caught in the cycle of poverty. the reason is that the educational drawback is very complex and is related to social, cultural and economic problems. as a result, a research method is needed that allows an in-depth review of the relationships among these interrelated problems, based on the facts in the society in which education is implemented. matters related to perspective and daily life are easier to understand by means of an ethnographic-qualitative approach, because the approach deals closely with beliefs, paradigms, ways of life and behaviors, and these cannot be investigated quantitatively.

the objective of the study is to describe elementary education service and its implementation in rural areas, and to describe the meaning of life, education and poverty for the community in rural areas.

method

two ethnographic approaches were implemented, namely realistic ethnography and institutional ethnography. the study lasted for four months, from september 2012 until december 2012.
it was performed in two sub-districts classified as poor and remote areas in dompu county, the province of west nusa tenggara: hu’u and pekat. observations were conducted at 22 elementary schools and junior high schools or equivalent educational institutions. the subjects were educational service providers and educational service users. the data were collected by means of participatory observation, interviews and documentation. data validity was pursued through several means: investigation of deviant evidence, triangulation, participatory methods, comparison, quasi-statistics, and the use of detailed, complete and varied data. the data were analyzed sequentially according to the stages of an ethnographic study: domain analysis, taxonomy analysis, componential analysis and cultural-theme analysis.

findings and discussion

although education and poverty may each be both a cause and an effect of the other, the research was conducted on the assumption that education is the key factor in overcoming poverty. on this assumption, if education improves, poverty decreases, and vice versa.

elementary educational service in rural areas

the educational service in the study is categorized into three aspects: teachers, curriculum, and facilities. this classification was based on the matters directly involved in education.

quantitatively, the number of teachers in rural areas is not lacking, as is apparent from the ratio of teachers to students: the ratio in rural areas is smaller than the national one. the problem lies in teacher distribution and quality. teachers in rural and remote areas are predominantly part-time, so the schools have only two or three full-time teachers.
most of the part-time teachers are recruited directly by the schools. consequently, paying the part-time teachers becomes the schools’ responsibility, within their financial means. moreover, although several part-time teachers are entitled to incentives from the regional government, the amount of the incentive is relatively low. the attendance level of the teachers in rural areas may be classified as very low, and this is reflected in the inappropriate learning process. a teacher may be absent for a day or even for a whole week. there are various reasons for the low attendance level: the teachers do not come from the local area or do not live around the school area; the learning facilities in the remote areas do not support the teachers’ settlement; and the educational offices in the county function poorly. other reasons are the weak supervision by the educational supervisors and the low quality of leadership shown by the principals. another important matter is the curriculum that the schools implement in the educational process. during the research period, the schools that were the subjects of the study implemented the school-based curriculum (kurikulum tingkat satuan pendidikan or ktsp), and there was a training for the vice principals to strengthen the curriculum implementation. the head of the curriculum department in the regional office of youth and sport states that one of the problems in implementing the curriculum is the schools’ and the teachers’ capability to elaborate the curriculum into smaller components such as the basic competence, the competence standard, and the passing grade.
the results of the study conducted in the two sub-districts show that the local subjects could be made flexible in accordance with the community’s demands and problems in each sub-district. for example, in the sub-district of hu’u, which is a tourism destination, the local subject is reproductive health. in the sub-district of pekat, which is dominated by the sasak tribe and has a high number of premature marriages, the local subject should address the problems of premature marriage. however, the development of local subjects is inhibited by the limited dissemination of information by the regional office of youth and sport, the long communication channels among the sub-districts, and the lack of information regarding what should be included in these local subjects. facilities are also important in the educational process. the schools in distant areas may be considered as having been established through community initiative. this is apparent from the existence of schools in each village. in general, the ratio between the number of schools and the number of students is acceptable. similarly, the ratio between the number of classes and the number of students is generally acceptable as well. unfortunately, the buildings, especially the classrooms, in these schools are built only from bamboo and palm leaves. in addition, these classrooms in most cases have to be split into two rooms due to the lack of space. another aspect that deserves attention is that almost all of the schools in the rural areas lack the following facilities: toilets, clean water, libraries and electricity. furthermore, the principal’s office is usually combined with the teachers’ office, the library and the visual aids storage. because of this, the students are not free to read the books or use the visual aids.
the regional office of youth and sport states that the schools are built through community initiative so that they suit the community and its environment, especially for the farmers whose fields are separated from one another. however, the construction proceeds in phases. moreover, the regional government has limited ability to fund a teachers’ office if the office must be constructed separately from the schools and the principal’s office.

the implementation of elementary education in rural areas

the implementation of the educational process is discussed in terms of the students, the learning, and the environments that affect the educational process itself: policy, community and nature. the students in rural and remote areas can be described physically and mentally. physically, these students, especially those in rural areas, have reddish hair and dark complexion due to sun exposure. in addition, they tend to wear assorted shirts or dresses and sandals to school. mentally, these students have a low attendance level in school for several reasons. first, the teachers’ attendance level in their schools is low. second, they have to go to work with their parents. third, they have to take care of their relatives. fourth, they have festival days to celebrate. fifth, they have to deal with unfriendly weather. for those reasons, the teachers in rural areas have an extra task, namely to find these students and bring them back to school, because they rarely attend classes. in terms of achievement, these students, especially those in the rural areas, generally perform poorly.
although the data found in the regional office of youth and sport show that their national examination results are ideal, in practice, based on the results of the study, these students are very weak, especially in fundamental abilities such as reading, writing, calculating, and understanding general knowledge about the republic of indonesia. these students also did not understand their timetable. there seems to be an association between family habits and the students’ low achievement. research conducted outside school hours found that the parents do not supervise or guide the children in their learning. similarly, the principals and teachers stated that although the schools have provided a timetable, it will not work without parental supervision. furthermore, most of the schools in rural and remote areas conducted the teaching process in their own way. most teaching was not implemented according to the lesson plan because the teachers grouped students from different classes together. this was done in order to cope with the teachers’ absence, the lack of classrooms, and the students’ low attendance. when students were grouped into one class, teaching was conducted in one of the following ways: either the blackboard was divided into two or three parts, or the students were given general assignments. the other aspect that supports the implementation of the teaching process is the environment. here, environment refers to the community, the policy and nature. the environment may have both positive and negative effects on the educational process. the community is the environment in which the students spend most of their time. parents and society are responsible for supporting the students’ educational process by, for example, supervising the educational process implemented in the society. the supervision also covers the students’ activities in society.
however, in practice, the students’ activities, especially learning activities, are not well supervised within the community, as is the case in the sub-districts of hu’u and pekat, which are dominated by the bima and dompu tribes. in these cases, students are found to engage in activities contrary to their learning process and, at the same time, parents do not set good examples for their children. for instance, the parents invite these children to join musical events from evening until dawn, and the children are also asked to drink liquor. furthermore, these children do not respect their peers who are praying in the mosque during the musical event. in contrast, in several communities of the remote areas, especially in the transmigration area dominated by the sasak tribe (lombok), the learning process in the society has been implemented well. children in these areas have religious activities that are directed toward educational processes. unfortunately, this tribe has a tradition named merari kocet, or premature marriage. the tradition has become one of the drawbacks for the teaching process in the sasak tribe. another case that occurs in most remote and rural societies is the use of educational aid funds for inappropriate purposes. furthermore, these societies used the card for the educational aid, namely the family hope program (program keluarga harapan or pkh), as collateral for getting loans. the government’s policy in the domain of education is aimed at improving education itself. the improvement of education is pursued through the following efforts: free education, scholarship grants, incentive grants for the teachers, and the construction of learning facilities. however, such policies in some cases do not meet expectations. the process of recruiting the principals, teachers and educational supervisors does not meet the demands and regulations.
the process does not take into account other aspects that might affect further educational processes. in some cases, principals recruit temporary teachers although this recruitment is outside their jurisdiction; as a result, certain schools have an overabundance of temporary teachers and, therefore, the incentives or grants given to these temporary teachers may be inappropriate. recruiting principals and educational supervisors without considering the demands might change the teacher composition, especially for the permanent or full-time teachers, specifically for those who come from the same educational background. the unclear division of jurisdiction between the regional office of education and the sub-district office of education is another aspect that deserves attention. the sub-district officers of education think that they do not have clear jurisdiction. this is apparent from the poorly functioning sub-district office of education. the principals, teachers, and educational supervisors coordinate directly with the regional office of education without first going through the sub-district office of education. eventually, the learning process in the sub-district office of education has been left behind compared to that of the regional office of education. from the aspect of environment, geographical condition also affects the educational processes, especially in the second research site, the sub-district of pekat. pekat sub-district consists of separated mountainous areas. it lacks transportation facilities and has an unreliable highway system. it also has high rainfall. as a result, the students have a low attendance level and timetables change because of the weather conditions.
the meaning of education for the poor

the meaning of education for poor people in the southern tip of indonesia is to improve the quality of the human resources who will cultivate natural resources. according to these people, human resources are the source of all natural resources. the reason is that education will form the characteristics of the human resources who will cultivate natural resources, and education will eventually change their traditions or behaviors. therefore, by means of education, human characteristics might be improved. similarly, one of the principals who were subjects of the study gave the following statement:

….di sini ada tambang yang masuk…saya berikan arahan…pandangan pada siswa itu… ini tambang ini anak-anak untuk kamu semua, bukan untuk orang lain. kalau kamu memang tidak sekolah, tidak mungkin kamu masuk di tambang itu….harus orang yang berpendidikan…harus orang yang berijazah… (we have a mine in this area. i tell my students that the mine belongs to them. so, if they do not go to school then they will not be able to work in the mine because only educated people are able to work in the mine.)

this statement is associated with the potential natural resources that the sub-district of hu’u has and with the cultivation of those natural resources by outsiders because the local human resources are not yet capable. furthermore, the principal also stated:

….tidak ada lagi yang tidak sadar sekolah sekarang, harus sekolah semua…kalau memang tidak sekolah kan rugi…nanti menyesal…tidak ada duluan…pasti belakangan…pasti menyesal nggomi doho watisi hanumu…tio kaimu maa batu polisi…maa batu hanu…musti punya ijazah…batu ketua rt saja harus punya ijazah sekarang…siapa tau nggomi doho ake pede…ndadi presiden kombi…ndadi menteri kombi pede. (now, all of the people in this subdistrict have been aware of going to school. they have also been aware that if they do not go to school then they will feel sorry for themselves. and feeling sorry always comes late.
even being a chief of community should have educational backgrounds. we are never able to predict the future. if you have good educational background, there might be a possibility that you become a minister or even a president in the future.)

the principal explained what education would mean for his students in the future. he emphasized that all of the students should go to school or otherwise they would be left behind. furthermore, he stated that even if they wanted to become a chief of community or a policeman they should have a good educational background. with a good educational background, they might also become a minister or even a president. furthermore, the principal of public elementary school of 13 hu’u viewed that the education or knowledge that his students have might change their destiny in the future. this has been proven before; although his students were only able to speak english and play ski, they had been successful in their life. he provided the statement as follows:

punya ilmu…orang anak-anak yang sudah punya ilmu itu sudah kaya raya..hasil didikan saya…maksud saya..dia sudah bisa berbahasa inggris, dia mendapat uang dari turis-turis manca negara…ada yang sudah kaya betul itu…baru usia belasan tahun.. (my students who have had enough knowledge now are successful and rich. i mean that they have been able to speak english and they earn money from guiding foreign tourists. even one of my students has become very rich although he is still very young.)

the principal elaborated further that his students who had been able to speak english and play ski had been taken in as foster children by foreign tourists. they had even been given some funds for starting their business. these foreign tourists disliked the children who had poor educational background.
then, the chief of the school committee in the public elementary school of 08 hu’u from the sub-district of hu’u stated that the students should have education not in order to become civil servants but in order to become aware of themselves and to show respect to their parents. then, what type of education should the students have? at the least, the education that the students receive should teach them to preserve nature, for example by not operating equipment that might endanger sea life. in the discussions with the students of a public junior high school, it was found that these students have high expectations of the education that they pursue. these students said that they want to go to school because they want to be smart so that they would be able to get a decent job. by getting a job, they might save money so that they would be rich. although they do not have teachers, they might learn from any book by themselves. from these discussions, two matters in their paradigm are found, namely the meaning of education and their future expectations of education. for these students, going to school means becoming smart in order to have an occupation so that they might become rich. for the students in elementary school, going to school means becoming teachers, midwives, doctors, policemen and even farmers. a similar statement was also provided by the students from an elementary school in a remote area. they said that they go to school in order to be teachers, midwives and doctors although they are poor and they have limited access to electricity in their daily life. on the other hand, one of the students’ parents who works as a seaweed farmer stated that going to school would mean nothing if they could not earn their living. he gave his statement as follows:

sakola kapa’i ademu labo daa ngaha katantu bata pa wa’uni made. (going to school without being able to earn the living would be in vain.)
this parent chooses to take his children to work with him, although they have to leave their school, because they would be dead if they could not earn their living. however, a different statement was provided by a mother, who stated that by going to school, her child becomes slightly smarter and can finally obtain an occupation. she gave her statement as follows:

de watini ne’e jaa kau loa ana sato’i…ne’e jaa karawi sato’i…ndede pa. (i want my child to be smarter so that he might afford an occupation. that is the reason why i send him to school.)

furthermore, the principal of 02 hu’u public junior high school explained the meaning of education in the following detail:

ya kalau menurut saya itu…pada titik terakhir itu bagaimana anak itu berketerampilan yang bagus…kemudian itu juga berdasarkan agama itu bagaimana anak itu berakhlak yang mulia itu…menjadi orang yang baik…yang pintar dan untuk masa yang akan datang buat mereka itu..kalau menurut saya…makanya di sekolah kita ini sering dilakukan imtaq setiap hari jum’at misalnya apa yang kita lakukan itu baca yasinan…kemudian ada istilahnya lagi senandung al-qur’an diberikan pada anak-anak itu….sehingga nanti anak itu ada perubahan sedikit demi sedikit…karena kita mengingat bahwa akhir-akhir ini banyak anak-anak yang tawuran…kemudian hal-hal yang lain yang kita lakukan seperti tadi…anak-anak diajak untuk melakukan kegiatan ekstrakurikuler itu tadi untuk seperti lintas alam…sehingga mereka itu tidak ingat lagi hal-hal yang tidak hanu. (in my opinion, at the very last point, educational process is about to have good skills. from the perspective of religion, educational process is about to have noble characters, to be good people and to be smart people for the sake of their own future.
therefore, in our school, every friday we have imtaq (a program of spiritual guidance), yasinan (reading surah yasin together) and al-qur’an recital. the students start to improve slightly and this is very good regarding the case of students’ fight that has happened recently. then, we also invite our students to be involved in extracurricular activities, such as, camping so that they will forget improper and unnecessary things.)

the principal’s explanation regarding the meaning of education is related to the recent involvement of children in the sub-district of hu’u in problems of drug abuse and free sex. the presence of the tourism site and the mine has a negative effect on the students’ behavior and the company they keep.

the description of life in rural areas

physically, people living in rural areas might be described as those who lack life-supporting facilities such as water sources, markets (where they buy daily needs and sell crops and cattle), and communication channels. in remote areas, especially rural and mountainous ones, those facilities are far below the required standards: the highway is insufficient, the water source is located far from the settlement, the electricity does not work properly, and the market is located far away from the settlement as well. as a result, these people, especially the farmers, are unable to sell their crops and cattle, and even when they are able to sell them, the sale price will not exceed the production cost. this situation also causes high additional costs in obtaining seed and fertilizer. most of the people who live in remote and rural areas work as farmers and fishermen. when they face problems such as harvest failure and unfriendly weather, these people choose to sell or mortgage their rice fields.
in the first research site, namely the sub-district of hu’u, the farmers sold their rice fields to new owners; however, when the harvest period came, they would collect the remainder of the crops in the rice fields that used to belong to them. based on the afore-mentioned explanation, the reason why the life of the people has not changed lies more in internal factors. no matter how great the blessings they have received and the income they have earned, the life of these people will not change if they do not use both for productive purposes. in addition, the case becomes worse if the people are lazy when they should go to work, whereas they could change their life if they worked. brand, as quoted by thut and adams (2005, p.520), states that the poverty of a country might be caused by: (1) low-productivity land, (2) the lack of capital, (3) low educational level, and (4) weak leadership in initiating economic development. furthermore, he explains that the elements of an underdeveloped economy might be linked by causal relationships that create a vicious cycle, and the vicious cycle depicts how underdeveloped countries sustain such links (thut & adams, 2005, pp.520-521). one such vicious cycle, as observed by economists, is presented in figure 1.

figure 1. the vicious cycle of poverty
[figure elements: market imperfection (underdevelopment, backwardness), lack of capital, low investment, low productivity, low saving, low real income]

thus, a group of elements such as market imperfection, low-quality labor and the like might cause low productivity and eventually the lack of capital. the lack of capital inhibits economic development and, at a higher level, inhibits social and educational development. based on figure 1, low income causes low saving. in practice, although most people in the research sites belong to the high-income category, this does not imply that their savings or economic life are improving.
they tend to spend their income on unproductive matters instead of productive ones such as investing or building up business capital. for instance, seaweed farmers and wood sellers earn rp 2,000,000.00 a week, but they tend to waste their money by hiring bands and buying liquor. koentjaraningrat (1987, p.4) explains that the cultural value that most indonesians should have is being ‘future-oriented’. such a cultural value will encourage people to consider and plan their future very carefully; therefore, they should be wise with their income. a widespread trait of being economical is necessary for a nation to accumulate as much capital as possible. furthermore, farmers in the second research site, the sub-district of pekat, have a habit of selling their rice fields to pay for marriage and for going to work abroad because they do not have sufficient income. in the case of working abroad, they will use the income they have earned to repurchase their rice fields. the regional policy of turning dompu county into a centre of corn plantation does not help the farmers to improve their quality of life. although the regional government has encouraged the program, the farmers are still inhibited by the high price and the scarcity of seeds and fertilizers. the case becomes worse when the farmers do not find any market for selling their crops after the harvest period, whereas the crops and other agricultural products are highly perishable. as a result, these farmers do not have any option other than selling the crops and the agricultural products at a low price. such problems contribute to the poverty faced by the people and become their reasons for selling their rice fields or working abroad.

the meaning of poverty for people in the poor areas

people in the poor areas have different perspectives in interpreting poverty. the perspectives depend on their view and position in observing poverty.
when the students were asked to mention what someone has if he or she is regarded as a rich person, they mentioned one by one the household appliances that represent the status of being rich. the household appliances that they mentioned can be found in one of the quotes from the interview with the students of junior high schools. according to these students, the people who are categorized as being rich are the ones who have items such as a refrigerator, a television set, a washing machine, a rice cooker (some of the local people call it ‘kuskes’), and a parabolic antenna. on another occasion, when these students were asked about the meaning of rich people, most of them answered that rich people are the ones who are able to afford schooling. however, other students disagreed with them, and this group of students proposed that the people who are able to afford schooling would eventually find an occupation and finally would be rich. the question was intentionally asked in order to find the definition of being poor from another point of view. in addition, using such a question made it easy for the students to express their views. based on the discussions, it is concluded that poor people are those whose situation is the opposite of the definition that these students have mentioned. when a group of parents was asked about the criteria of being rich and poor, the following responses were provided:

dou ma miskin re dou ma darere (poor people are the ones who have nothing.)

kade’e mbere langi… (wait for the heavy rain falling from the sky.)

kade’e mbere langi…ngena hanupa di ru’u oma (wait for the heavy rain falling from the sky, only for the rice fields.)
edepa di pandana…tiwara tolo… (just wait for the heavy rain falling from the sky, because we do not have any rice field.)

di doropa…rahanu edepa…kanggihi sakali samba’a…nawara walisi ura nawara diweha… tisi wara ura indo wara di weha (living only in the hills and not having any rice field. if we have heavy rain, it means we might have something to harvest. if we do not, it means we might not have anything to harvest.)

from the discussion, it is found that the group of parents views poor people as the ones who only wait for the heavy rain to fall from the sky upon their rice field, and the ones who do not have fertile rice fields. according to them, poor people have a harvest period only once a year and the harvest depends on the rain. if they have heavy rain, then they might have their harvest, and vice versa. on the other hand, another group of parents views the rich people as those who have rice fields, cows and horses. cows and horses are beneficial livestock for local people because they can use the animals’ strength for cultivating the land, sell them, and use them for transporting goods. these parents’ view implies dependency on the condition of nature. the view is in accordance with the results of research by kutanegara, mustar, and purwatiningsih (2007, p.170), which conclude that the condition of villages in hilly areas and the poor transportation cause the people to be dependent on the existing situation. such resignation to existing conditions is mostly found among javanese people, including those in the special region of yogyakarta and in central java. in relation to this statement, the results of the study show that people living outside java island also have a similar trait. narayan, chambers, shah, and petesch (2000, p.22) also state that although the experience of poverty might be different for different groups of people, there might be particular similarities that the different groups share.
in addition, when a principal was interviewed, he implied that the indicators of being poor are viewed from the aspects of income, land ownership, and cattle ownership (especially cows). in relation to this statement, a mother from sorinomo, in the sub-district of pekat, views poverty by comparing her condition to that of the people around her who are granted the pkh aid. she said that the people who are granted the pkh aid should not be classified as poor people because they are in a better condition than she is; they have better houses, more cattle and more fields. going back to the definition of being poor, the following question was asked to another principal from one of the public elementary schools in the easternmost tip of indonesia: ‘might being lazy and uneducated be regarded as being poor?’ the principal answered, ‘it means total degradation because these people are lazy and uneducated; as a result, they finally sell every single item of their belongings whereas they used to be rich.’ such phenomena occur frequently among local people. the local people sell their rice fields to new owners and, when the harvest period comes, these people collect the remainders from those rice fields. based on the definitions of being poor that have been elaborated, it is found that the respondents or informants regard poverty by viewing the surrounding aspects and their experience; in addition, they also compare the past and recent situations as well as the existing and external conditions within the areas where they live.

conclusions and suggestions

the educational institutions in dompu county have not been well integrated and well coordinated, either vertically or horizontally, in terms of providing educational service.
as a result, the elements or indicators of educational service such as teachers, curriculum, and learning facilities have not been optimal in providing educational service for the rural areas. the teaching process does not meet expectations, especially in the rural and remote areas. the implementation of education at the lower level reflects that of the higher level, so when the higher level of education faces certain problems, the lower level will also have problems. the reason is that the implementation of education is affected by the surrounding environment, namely policy, community, and nature. the level of students’ participation and achievement in the educational process is affected by the cultural capital they inherit, the experiences they have, the community they come from, and the culture they learn during the teaching process in the school. in other words, the students’ participation and achievement reflect the cultural process that they experience in the school and in the community. if the culture in the teaching process and in the community changes, then the teaching process and the output that the students get will also change. conversely, if the teaching process changes, then the culture that the students and the community have will also change. the process of cultural change might affect both the community and the teaching process itself negatively or positively. such a process of cultural change, reflected in the students and community members, is also reflected in their view of education. based on the experience and natural conditions that surround these people, the community regards education as follows. first, they regard education as a means of achieving their dreams through having a profession, knowledge of certain occupations, and knowledge for exploiting the natural resources.
second, they regard education as an effort to encourage their children to be respectful toward their parents and nature. education is regarded as a process of mastering certain skills and of acquiring a religious basis for pursuing noble character. the students should be smart not only for their own sake, but also for other people’s sake. however, there is a different meaning of education: education would mean nothing if it does not enable one to earn a living. this meaning is given by a seaweed farmer who chooses to bring his children to the coast rather than to school. physically, the people in the research sites can be described as lacking sufficient facilities such as transportation, highways, water, electricity, and markets; the market is very important because there these people, especially the farmers, can purchase their daily needs and sell their crops or cattle. mentally, the people in the research sites can be described as suffering from poverty; they are fond of receiving aid, enjoying final products, and depending on nature. based on their situation and the natural conditions that surround them, the people in the rural areas define poverty based on physical and non-physical indicators. from the physical indicators, poverty is classified by referring to the presence (possession) or absence of luxurious items, land, cattle, income, and houses. from the non-physical indicators, poverty is classified by referring to the dependency on nature, namely by ‘expecting the heavy rain to fall from the sky.’ if the rain falls, then they will have something to harvest, and vice versa. another opinion implies that not having a sufficient educational background might be regarded as being poor because it leads people to sell their rice fields, so that they finally become poor whereas they used to be rich.
from the research findings, in order to improve the educational service, the following steps are suggested. the provider of educational service should delegate authority from the level of regional government to that of sub-district government, with a clear jurisdiction, in order to shorten the chain of administration so that the service and the implementation of education might operate well. then, the curriculum applied in the rural and remote areas should be adapted to the situation of local people based on the geographical conditions, the potential natural resources, and the problems that inhibit the implementation of education. future researchers of educational ethnography are advised to perform a similar study with a larger sample so that the results can be applied more widely.

references

central bureau of statistics of nusa tenggara barat province. (2011). nusa tenggara barat dalam angka [nusa tenggara barat (ntb) in figures]. mataram: central bureau of statistics of nusa tenggara barat province in cooperation with regional development planning board of ntb province.
international labour office. (2003). working out of poverty. geneva: international labour office.
koentjaraningrat. (1987). kebudayaan dan mentalitas pembangunan [culture and development mentality]. jakarta: pt. gramedia.
kutanegara, m., mustar, e., & purwatiningsih, s. (2007). mendorong program kemiskinan dan raskin berbasis lokal [encouraging local-based poverty and rice for poor society programs]. populasi, 18, 167-185.
narayan, d., chambers, r., shah, m.k., & petesch, p. (2000). poverty is powerlessness and voicelessness. finance & development, 37(4), 18.
slamet. (2008). desentralisasi pendidikan di indonesia [educational decentralization in indonesia]. jakarta: departemen pendidikan nasional.
smeru research institute. (2005).
developing a poverty map in indonesia: a tool for better targeting in poverty reduction and social protection programs. retrieved from www.smeru.or.id/.../povertymapping4/
sudarminta, j. (2000). tantangan dan permasalahan pendidikan di indonesia memasuki milenium ketiga [educational challenge and problems in indonesia entering the third millennium]. in a. atmadi & y. setiyaningsih (eds.), transformasi pendidikan memasuki milenium ketiga [educational transformation entering the third millennium] (pp.315). yogyakarta: kanisius.
suharko. (2007). the roles of ngos in rural poverty reduction: the case of indonesia and india. nagoya: nagoya university.
surakhmat, w. (2009). pendidikan nasional: strategi dan tragedi [national education: strategy and tragedy]. jakarta: pt kompas media nusantara.
thut, i.n. & adams, d. (2005). pola-pola pendidikan dalam masyarakat kontemporer [educational patterns in contemporary society]. (spa teamwork, trans.). new york, ny: mcgraw-hill book.
thuy, t.n. (2012). poverty reduction strategies in an ethnic minority community: multiple definitions of poverty among khmer villagers in the mekong delta, vietnam. asian social science, 8(6), 196-208. doi:http://dx.doi.org/10.5539/ass.v8n6p196
tukiran. (1993). penentuan desa miskin [determining poor villages]. populasi, 14(1), 13-23.
world bank. (2012). protecting poor and vulnerable household in indonesia. jakarta: world bank jakarta office.
zamroni. (2010). pendidikan dan kemiskinan [education and poverty]. in tukiran, a.j. pitoyo, & p.m. kutanegara (eds.), akses penduduk miskin terhadap kebutuhan dasar [access of poor inhabitants towards primary needs] (pp.185-221). yogyakarta: pusat studi kependudukan dan kebijakan, universitas gadjah mada.
research and evaluation in education issn 2460-6995
research and evaluation in education, 2(2), 2016, 165-180
available online at: http://journal.uny.ac.id/index.php/reid

research article

determining standard of academic potential based on the indonesian scholastic aptitude test (tbs) benchmark

* 1 idwin irma krisna; 2 djemari mardapi; 3 saifuddin azwar
1 center of educational assessment, jl. gunung sahari raya block b no.4, gn. sahari selatan, kemayoran, jakarta pusat municipality, 10610, dki jakarta, indonesia
2 graduate school of universitas negeri yogyakarta, jl. colombo no. 1, karangmalang, caturtunggal, depok, sleman, 55281, yogyakarta, indonesia
3 faculty of psychology of universitas gadjah mada, jl. sosio humaniora, bulaksumur, caturtunggal, depok, sleman, 55281, yogyakarta, indonesia

abstract

the aim of this article was to classify the indonesian scholastic aptitude test or tes bakat skolastik (tbs) results for each subtest and describe scholastic aptitudes in each subtest. the subjects of this study were 36,125 prospective students who took the selection test at several universities. data analysis began by estimating testees’ ability using item response theory, followed by a benchmarking process using the scale anchoring method implemented with asp.net web server technology. the results of this research are four benchmarks (based on cutoff scores) on each subtest, the characteristics which differentiate potential at each benchmark, and the measurement error at each benchmark. the items netted give a clear description of scholastic aptitude potential and indicate uniqueness, so that they could distinguish the difference in potential between a lower bench and a higher bench.
at a higher bench, a higher level of reasoning power is required in analyzing and processing the needed information so that the individual concerned could solve the problem correctly. the items netted at a lower bench in the three subtests tend to be few, so that the error of measurement at such a bench still tends to be higher than that at a higher bench.

keywords: indonesian scholastic aptitude test (tbs), benchmark, scholastic aptitude

how to cite item: krisna, i., mardapi, d., & azwar, s. (2016). determining standard of academic potential based on the indonesian scholastic aptitude test (tbs) benchmark. research and evaluation in education, 2(2), 165-180. doi:http://dx.doi.org/10.21831/reid.v2i2.8465

*corresponding author. e-mail: idwinirma@gmail.com

introduction

selection for the entry of new students into a university has always been an important issue in several countries. it is related to the criteria used in accepting new students who would study at the university. the high ratio of student candidates to the limited student capacity compels universities to be selective in choosing the candidates who would become their new students. besides, effectiveness in the selection of new students is also an important matter in the higher education system because the quality of student candidates has an effect on the internal efficiency and quality of the educational program offered (harman, 1994, p. 313). effectiveness would be attained when the selection system is accurate in prediction, which in turn has an effect on efficiency in the economic aspect. the selection for the entry of new students into universities in indonesia generally uses an achievement test as a reference for decision-making.
an achievement test is designed to measure the result of a learning or training program conducted in a controlled condition (anastasi, 1988, p. 411). only in 2009 did the test of academic potential start to be used as a complement to the achievement test. though 2009 was the year when the test of potential started to be used nationally, several universities had started using it earlier. the test of potential as part of the university entry test is also used by developed countries, such as the united states of america, which uses a test of potential called the scholastic aptitude test (sat). sweden has developed the swedish scholastic aptitude test since 1977 (wedman, 1994, p. 5). also known as swesat, it is designed as a selection test that is fair and in line with the future success of student candidates if they are accepted as new students at the university. one of the institutions in indonesia developing the test of potential is the centre of educational assessment, office of research and development, ministry of education and culture of the republic of indonesia. the development has been conducted since 1990, and since 2000 the test of academic potential has been named tes bakat skolastik (‘scholastic aptitude test’). the construction of the test of potential or tes bakat skolastik (tbs) is based on an understanding of intelligence. some research indicates that the test of potential is related to the intelligence test. the results of research by frey and detterman (2003) show that the correlation between sat scores and those of several iq tests ranges from 0.53 to 0.83, which gives strong evidence that sat could also serve as an intelligence test. intelligence and aptitude are cognitive abilities possessed by every individual (cohen & swerdlik, 2002, pp. 257, 301).
intelligence refers to the intellectual ability which generally functions in various fields of achievement, while aptitude is a more specific ability used in certain fields of achievement only (berk, 2000, pp. 316-319). aptitude serves to predict one’s future success in areas which require special ability. the test of potential measures learners’ reasoning ability more than their memory. the reasoning process is a more specific part of the thinking process; in reasoning, one more frequently uses the principles of logic (galotti, 2004, pp.391-392). reason is used to draw a conclusion based on the information obtained. in reasoning, each individual has his or her own ways. psychologists continuously explore general principles related to human experience that are not restricted to only one type of reasoning. conclusion-drawing models which are related to one’s logic and thinking process were developed by johnson-laird (solso, 2001, pp.428-429). some findings related to one’s way of reasoning indicate the use of premises in the form of phrases or in the form of illustrations. the reasoning abilities which are measured in tbs consist of verbal reasoning and mathematics applying reasoning concepts. this is in line with the research by olatoye and aderogba (2011), in which numerical and verbal reasoning could together explain 38.8% of the variance in an aptitude test, and the coefficient of correlation between verbal and numerical reasoning was up to 0.713. numerical ability is in the same domain as verbal ability and general aptitude. since 2001, tbs has been used by several state universities as a part of selection. tbs is also used in the selection of new employees at private agencies and several ministries. tbs consists of three subtests, namely, verbal, quantitative, and reasoning subtests.
the three subtests measure the same ability, namely, reasoning, presented in the form of verbal logic, mathematical logic, and reasoning ability in evaluating the correctness of a conclusion. the three subtests indicate a sufficiently significant correlation, and the highest correlation is between the quantitative and reasoning subtests (azwar, 2008, p. 12). the development of the tbs items is done in several stages, starting with the stage of writing the item grid through to the stage of item storage in the item bank. the process of data analysis determines which test items are fit to enter the item bank. the item analysis uses the item response theory (irt) model. the estimation of the testees’ ability (called latent trait) in irt is based on their responses to test items. the irt model specifically describes the relation between ability and item characteristics on one side and the testee’s response to the test items on the other. irt has various models, depending on the number of parameters used to describe the test items. measurement in psychology and education usually concerns the same dimension with different test items and also with different groups. tbs uses different test packages to measure different groups of people, but the dimension measured is the same. the scores obtained from the test are expected to be usable for comparing one group with another. for that purpose, the processing of test results uses the 1pl irt model, or the rasch model. the rasch model is called the one-parameter logistic model because it contains only one parameter related to the test item, namely, the level of difficulty. therefore, this model is known as a simple model in irt (embretson & reise, 2000, p.67). even if there is another factor having an effect on the results, when we measure something that is certain such as the right solution to a test item, only one of the attributes of the two factors is needed.
with empirical data as the basis, it would be difficult to separate the concepts of level of difficulty and level of human ability. rasch contributed to this matter by providing the rasch model formula using the concepts of mathematical statistics. the dependent variable is the probability that person s successfully answers test item i, shown as $P(X_{is} = 1)$. the logistic function for the rasch model is as follows:

$$P(X_{is} = 1 \mid \theta_s, \beta_i) = \frac{\exp(\theta_s - \beta_i)}{1 + \exp(\theta_s - \beta_i)} \tag{1}$$

in which $\theta_s$ is the person’s ability, $\beta_i$ is the level of item difficulty, and $\exp(\theta_s - \beta_i)$ is the natural antilog of the difference between the person’s ability and the level of item difficulty. the level of item difficulty is the point on the ability scale at which the probability of a correct answer is 0.5. the higher the level of item difficulty, the higher the ability a person needs to answer correctly with a probability of 0.5. consistent movement to the right along the icc (item characteristic curve) indicates increasingly more difficult items and increasingly higher levels of ability. up to now, tbs has been improved continually and used as part of decision making. all this time, the test results announced have been in the form of scores only, without accompanying interpretations of the scores. the formulation of test result interpretation depends on definitions of the content, level, and cutoff score, which specifically describe the ability descriptor that would be used as reference for policy makers (ferrara, svetina, skucha, & davidson, 2011, p.5). deciding the cutoff score is part of determining benchmarks in the test results. setting a bench could be done by deciding a score to be used as reference. in setting a cutoff score, consistency is required among educational policy makers and psychometric experts (bejar, 2008, p.4).
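as a minimal sketch of equation (1), the python function below evaluates the rasch probability for a given ability and item difficulty. the function name and the example values are hypothetical; they only illustrate that the probability is 0.5 when ability and difficulty coincide and that it falls as the item becomes harder.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """probability of a correct answer under the rasch (1pl) model, equation (1)."""
    # algebraically equal to exp(theta - beta) / (1 + exp(theta - beta))
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


if __name__ == "__main__":
    # ability equal to difficulty: the probability is 0.5
    print(rasch_probability(1.0, 1.0))
    # a harder item lowers the probability for the same person
    print(rasch_probability(1.0, 2.0))
```

the reciprocal form is used only because it avoids computing the exponential twice; it is the same function as equation (1).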
at the level of higher education, benchmarks could be used as tools in preparing students for the next teaching and learning process and their chances in pursuing a career. the benchmark of sat based on combined scores is 1550 (wyatt, kobrin, wiley, camara, & proestler, 2011, p.13). participants attaining the benchmark (of 1550) have the advantages of, among others, a greater possibility of enrolling at a higher educational institution with a length of study of 4 years (rather than 2 years), a greater possibility of persisting into the second and third academic years, and a higher fygpa (first-year grade point average). one of the important components in benchmarking is a set of performance standards based on testees’ responses to questions (resnick, nolan & resnick, 1995, p.454). a description of someone’s cognitive process could be obtained by using his or her response to a given test item as a basis. the process of developing the test and the procedure of establishing performance standards are, in nature, prospective (based on the descriptor level to orient the test development), progressive (based on the content and performance standard articulated at each level), and predictive (using the descriptor level of performance and standards based on theory and empirical evidence). with the setting of the two systems, decision makers could make accurate decisions based on existing information by using the right measurement and evaluation methods. the method employed to define the level of someone’s ability and the cutoff score related to the level is called standard setting. standard setting is an integral part of the development of a test instrument (cizek & bunch, 2007, p.247). standard setting is a method which is related to the coherence between the educational policy and the evaluation system in a country.
the activity of determining the standard or cutoff score could be done by using several methods. most of the existing standard setting methods could be categorized as continuum models. continuum models are divided into models focusing on the test, or test-centered models, and models focusing on testees, or examinee-centered models (jaeger, 1989, p.492). the minimum completeness criterion is determined not only through government policies, but also by the participants based on tests and on measuring instruments (mardapi, hadi, & retnawati, 2015, p.39). the test-centered model is based on experts’ judgment concerning the test used. the experts judge the ability needed for each test item to estimate someone’s ability according to the standards that have been set. the examinee-centered model is based on experts’ judgment in grouping people according to level of ability by using several external criteria outside the test scores. the standard setting method is developed to overcome problems in setting performance standards. one of the methods that could be used in setting the standards is the scale anchoring method. the scale anchoring method is a method of interpreting measurement results (in the form of scores) which describes the ability and competence possessed by the learners at several different values on a scale. the making of the description is related to the behavior scale or item mapping. item mapping starts with the concept of content referencing, which was introduced by bock, et al. (kelly, 2002, p.377). content referencing describes irt and recommends a procedure in which items are placed on a scale of response probability to describe students’ ability and comprehension.
item mapping is widely used in interpreting large-scale assessments such as the young adult literacy survey, the national adult literacy survey (nals), the national assessment of educational progress (naep), and timss (trends in international mathematics and science study). the continuation of content referencing and item mapping is called the scale anchoring method (beaton & allen, 1992, pp.195-198). in this method, several points on a scale are chosen and then the test items fitting those points are identified. a test item is declared to fit a point (or anchor point) when most of the learners at the anchor point could do the item concerned but those at the point under it could not. the distribution of the group of test items that could be answered by most learners at each different point is then studied to obtain a description of ability at each point. additional criteria would be used when there are only a few suitable test items so that the description of someone’s ability would be enriched. scale anchoring provides normative information on the knowledge that someone masters based on the construct being measured (beaton & allen, 1992, p.192). the basic idea of scale anchoring is to know the ability that someone possesses at a certain point on a scale based on the response to a given item while paying attention to the responses at an adjacent point. the description of someone’s ability at the determined points might well be unreliable, so one should be careful in determining the points. the points chosen are called anchor points or anchor levels. according to forsyth (1991), percentiles could be used to determine the points.
the procedure could be widely applied to various scales with the purpose of grouping or characterizing someone’s ability, even with a test which is noncognitive in nature, provided that the scale is at least ordinal (beaton & allen, 1992, p.191). thus, the method could also be used with the test of potential. the four values established as international benchmarks and used as anchor points are the 25th, 50th, 75th, and 90th percentiles (martin, mullis, beaton, gonzales, smith & kelly, 1997; mullis, martin, beaton, gonzales, smith & kelly, 1998). with the percentile values as the basis, the corresponding learners’ scores are determined. since learners’ test scores may not be exactly the same as these scores, ranges of plus and minus five score points are used. each range contains students’ scores that are homogeneous and sufficiently concentrated at the anchor point, while adjacent levels are sufficiently far from each other to enable recognition of interlevel distinctions. the percentage of correct answers to each item is calculated, and the criteria for including items in a benchmark category are determined. a response probability of 50% would place an item at an anchor point where the students answering correctly and those answering incorrectly are equal in number. a response probability of 80% would yield items that could be answered correctly by 80% of the students, but such items might be considered easy. to overcome this, an item is interpreted as being mastered when the response probability is 65%. in scale anchoring, the anchor items at each level should distinguish adjacent anchor points. for that purpose, criteria are needed to identify the items to be chosen by considering performance at more than one anchor point.
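the choice of anchor points and score ranges described above can be sketched as follows. this is only an illustration: the function names are hypothetical, and the nearest-rank percentile rule is an assumption, since the text does not say which percentile definition was applied.

```python
def anchor_points(scores, percentiles=(25, 50, 75, 90)):
    """return {percentile: score} using the nearest-rank rule (an assumption)."""
    ordered = sorted(scores)
    n = len(ordered)
    points = {}
    for p in percentiles:
        # integer ceiling of p * n / 100 gives the 1-based nearest rank
        rank = max(1, (p * n + 99) // 100)
        points[p] = ordered[rank - 1]
    return points


def examinees_near(scores, anchor, half_width=5):
    """indices of examinees whose score lies within anchor plus or minus half_width."""
    return [i for i, s in enumerate(scores) if abs(s - anchor) <= half_width]


if __name__ == "__main__":
    demo_scores = list(range(1, 101))      # hypothetical scaled scores
    points = anchor_points(demo_scores)
    print(points)
    print(len(examinees_near(demo_scores, points[50])))
```

only the examinees returned by `examinees_near` for a given anchor point would then be used to compute the percentage of correct answers for each item at that point.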
an additional criterion concerns the percentage of correct answers among students at a certain anchor point and among those at the anchor point right under it. the criterion determined is that the response probability at the lower anchor point is less than 50%, meaning that the students there answering the item concerned incorrectly would outnumber those answering it correctly. anchor items are items that reflect learners’ conceptual knowledge and comprehension at different scale points expressed with a high value of probability. the description at each benchmark level should imply that the students who reach that point have a high probability of understanding and doing the item concerned. the definition of each level should be carefully considered so that it could distinguish the levels of individuals. ideally, when the evaluation program has a clear definition of a level and intends to use the level, it is established early in the process of test development (perie, 2008, p.16). it would help the test planner, the test user, and the policy maker in reporting test levels that could distinguish the ability and knowledge that someone has. measurement results would be more meaningful when the form of the report is easily understood by various circles, able to give policy makers accurate information, and able to minimize errors in interpretation. interpretation of the potential at each level is greatly needed by puspendik (centre of educational assessment) in reporting test results to stakeholders. a report of test results which includes a description of someone’s potential would be more meaningful when it also gives information about errors in measurement. accurate calculations of measurement errors could be done by using the irt approach (geisinger & mccormick, 2010, p.40).
irt not only gives estimates of test item and testee parameters but also considers the precision of each estimated parameter. the use of information as a term in this context was first raised by fisher in 1922 to indicate precision of estimation (keeves & alagumalai, 1999, p.35). the function of what is termed information in this case is to describe to what extent the chosen model (1pl, 2pl, or 3pl) is able to give information concerning trait-level estimation along the latent-trait scale. thus, the effectiveness of test or test item measurement at each ability level can be measured. mathematically, the information function of an item (if) fulfills the following equation.

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,Q_i(\theta)} \tag{2}$$

$I_i(\theta)$ is the information of item i at $\theta$, $P_i'(\theta)$ is the derivative of $P_i(\theta)$ with respect to $\theta$, $P_i(\theta)$ is the response function of the item, and $Q_i(\theta) = 1 - P_i(\theta)$ (hambleton, swaminathan, & rogers, 1991). equation (2) is simpler to calculate by using equation (3).

$$I_i(\theta) = \frac{2.89\,a_i^2\,(1 - c_i)}{\left[c_i + e^{1.7 a_i(\theta - b_i)}\right]\left[1 + e^{-1.7 a_i(\theta - b_i)}\right]^2} \tag{3}$$

in which $a_i$, $b_i$, and $c_i$ are the discrimination, difficulty, and pseudo-guessing parameters of item i. when using the 1pl irt model, a = 1 and c = 0. based on formula (3), information is higher when the value of b is close to $\theta$. the information function of the test is the sum of the information functions of the items and mathematically fulfills equation (4).

$$I(\theta) = \sum_{i=1}^{n} I_i(\theta) \tag{4}$$

the amount of information given by the test at $\theta$ is inversely related to the standard error of measurement (sem), which expresses the precision of ability estimation. the relation between sem and test information is expressed in equation (5).

$$SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}} \tag{5}$$

based on equation (5), the standard error (of measurement) and the test information are inversely related: the greater the information, the smaller the se. the magnitude of measurement error depends on the number of test items in the test and the quality of the test.
this article discusses the classification of each subtest of tbs by using the scale anchoring method, describes the scholastic aptitude of each subtest of tbs according to test item grouping, and estimates the measurement error at each bench. the description of the scholastic aptitude potential and the standard error at each benchmark level could be adopted as references by the test developers so that the development of tbs items becomes more effective and efficient.

method

the research was descriptive in design, so that it could describe and interpret an object in accordance with the reality in existence. by means of descriptive research, a description of someone’s potential in line with the determined benchmark level could be obtained. the data used in the research were obtained from the center of educational assessment, institute of research and development, ministry of education and culture. the data originated from the results of the selection test for new students’ entry into state universities in indonesia. the subjects analyzed numbered 36,125. the data were dichotomous in form, with any right answer given the score of 1 and any wrong answer given the score of 0. in addition to those raw data, qualitative data were also compiled in the form of an interpretation of the analysis results by means of a focus group discussion (fgd). the test instrument, the scholastic aptitude test, consisted of 12 different test packages. each package consisted of 30 verbal subtest items, 20 quantitative subtest items, and 31 reasoning subtest items. the verbal subtest measured the verbal logic ability, namely, the ability to solve problems that are verbal in nature and contain language elements.
there were four verbal abilities measured, namely, synonymy, antonymy, analogy, and reading comprehension. the quantitative subtest measured reasoning or numerical logic ability, namely, the ability to solve problems related to numbers by using basic mathematical concepts. the quantitative subtest consisted of sub-subtests of number sequences, arithmetic and algebra, and geometry. the reasoning subtest measured individual logic abilities, including the ability to evaluate the truth of a conclusion and the ability to use logic to construct a conclusion. the reasoning subtest was divided into three sub-subtests, namely, those of logical, diagrammatic, and analytical reasoning. the benchmarking process was initiated with an estimation of participants’ ability by means of the irt approach using the winsteps program. participants’ ability was expressed on a logit scale ranging from -4 to 4. the values were further converted to a scale with a mean of 300 and a standard deviation of 50. the benchmark setting was conducted by using the scale anchoring method with asp.net web server technology. the benchmark setting was executed in four stages. the first stage was the setting of the cutoff score at each bench. the cutoff scores used in this research were percentile 25 (bench 1), percentile 50 (bench 2), percentile 75 (bench 3), and percentile 90 (bench 4). the second stage was the grouping of test items according to cutoff scores. after the cutoff scores were set, the data of test participants’ responses to test items were obtained. with those data as a basis, the proportion of correct answers to each test item was determined. the third stage was deciding which test items entered the benches according to the criteria that had been set. a test item would belong to a certain bench when most participants answered the item correctly at that bench and most answered it wrongly at the bench under it.
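the score conversion and the third stage can be sketched as below. this is an illustration under stated assumptions, not the authors’ asp.net implementation: the linear conversion 300 + 50 × logit is one plausible reading of the conversion described, the thresholds of 65% and 50% are taken from the criteria given in the introduction, and accepting an item at the lowest bench on the 65% criterion alone (because there is no bench beneath it) is an assumption. all function names are hypothetical.

```python
def to_scale(theta, mean=300.0, sd=50.0):
    """convert a logit ability to the reporting scale (assumed linear form)."""
    return mean + sd * theta


def proportion_correct(responses):
    """proportion of 1s in a list of dichotomous (0/1) responses."""
    return sum(responses) / len(responses)


def anchor_bench(p_correct, mastery=0.65, lower_max=0.50):
    """return the 1-based bench at which an item anchors, or None.

    p_correct holds the proportion correct at benches 1..k, lowest first.
    an item anchors at a bench when at least `mastery` of the examinees
    there answer correctly and fewer than `lower_max` do so one bench below.
    """
    for k, p in enumerate(p_correct):
        below_ok = (k == 0) or (p_correct[k - 1] < lower_max)
        if p >= mastery and below_ok:
            return k + 1
    return None


if __name__ == "__main__":
    # hypothetical proportions correct at benches 1 to 4 for three items
    print(anchor_bench([0.70, 0.80, 0.90, 0.95]))  # anchors at bench 1
    print(anchor_bench([0.30, 0.45, 0.72, 0.90]))  # anchors at bench 3
    print(anchor_bench([0.40, 0.55, 0.70, 0.85]))  # does not anchor
```

the third example fails to anchor because, at the first bench where the 65% criterion is met, more than half of the examinees one bench below also answer correctly, so the item does not separate the two benches.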
the fourth stage was deciding the descriptor of potential at each bench. the descriptor setting was done by holding an fgd attended by resource persons competent in tbs development. the analysis stage after the benchmarking process was that of calculating the standard error of measurement. it was calculated based on the test information formula. the test information was obtained based on the group of test items at each level resulting from the benchmarking process above. the test information was determined at the value of θ. the θ value was obtained based on the score at the bench point (or cutoff score).

findings and discussion

the irt analysis with the winsteps program had the purpose of making a conversion table that would be used as a basis for determining testees’ ability. the irt analysis was also done to discard test items that did not fit the 1pl model. the analysis was done on several packages, and each package had a conversion table differing from that of any other package. after testees’ ability was determined, the next step was making a file of each person’s responses and ability. the file was used in determining the classification of tbs items by using a c# program. the classification of tbs items had the purpose of mapping the test items according to the levels discussed in previous sections. the results of the analysis by the program classifying the tbs items can be seen in figure 1. in figure 1, the benches are in the description column. at each bench, the value of ability is presented according to the percentile. the grouping of test items at each bench is presented in three colors: green indicating the test items meeting the bench criteria, yellow indicating those meeting the almost-anchor criteria, and red indicating those meeting the criteria of being too difficult to anchor. figure 1 presents the results of the analysis on the quantitative subtest of package 13.
of the 30 test items in the quantitative subtest of package 13, 12 meet the bench criteria, eight meet the almost-anchor criteria, and five meet the criteria of being too difficult to anchor. the classification of the results of each subtest of tbs is presented in table 1.
research and evaluation in education 172 − reid, 2(2), december 2016

figure 1. results of analysis on the program classifying tbs items

table 1. classification of tbs items at each bench

subtest       bench  cutoff score  anchor  almost anchor  too difficult to anchor
verbal        1      320           60      15             0
verbal        2      340           6       5              31
verbal        3      360           14      14             45
verbal        4      380           6       14             40
quantitative  1      280           5       3              0
quantitative  2      310           6       7              3
quantitative  3      340           11      12             23
quantitative  4      370           12      17             25
reasoning     1      340           52      11             0
reasoning     2      370           19      9              24
reasoning     3      400           18      16             18
reasoning     4      430           10      8              25

the results of the analysis show that only a few test items met the anchor criteria at each bench. the verbal test items that could be retained at benches 1, 2, 3, and 4 are 13.6%, 1.4%, 3.2%, and 1.4% of the 440 test items analyzed. however, at each bench, test items that could be used in making the description of potential were netted. the quantitative subtest items that the anchor criteria of benches 1, 2, 3, and 4 could net are 1.8%, 2.2%, 3.9%, and 4.3% of the 280 items analyzed. the reasoning subtest items that the anchor criteria of benches 1, 2, 3, and 4 could net are 17.6%, 6.4%, 6.1%, and 3.4% of the 296 items analyzed. the description of the testees' potential at each sub-subtest would be made more in-depth by using other criteria. however, the items meeting the anchor criteria remain the main references.
research and evaluation in education determining standard of academic potential... 173 idwin irma krisna, djemari mardapi & saifuddin azwar

table 2. description of potential in the verbal subtest

bench 1 (270)
synonymy: the individual is able to determine a word in daily general use that is the equivalent of a word in a set of word choices that are also in daily general use and tend to have no similarity in meaning with each other.
antonymy: the individual is able to identify a word in daily general use that is the antonym of a word in a set of word choices which include one of the equivalents of the identified word above.
analogy: the individual is able to identify analogies related respectively to subject/concept-place, object-product, and concept-example.
reading: the individual is able to know, mention, or re-explain something in a discourse presented.

bench 2 (290)
synonymy: the individual is able to determine a word not in sufficient daily general use which is the equivalent of a word in a set of word choices that are in daily general use and some of which have a similarity in meaning.
antonymy: the individual is able to identify a word in daily general use which is the antonym of a word in a set of word choices that have significantly different meanings, are in general use, and tend to include no equivalent of the identified word above.
analogy: the individual is able to identify analogies related respectively to concept-function and otherwise, and concept-ownership and otherwise.
reading: the individual is able to comprehend, interpret, and express with different words/sentences something in a discourse presented.

bench 3 (320)
synonymy: the individual is able to determine a word that is rarely used in daily life which is the equivalent of a word in a set of word choices that are in daily general use with most of them having a similarity in meaning.
antonymy: the individual is able to identify a word rarely used in daily life which is the antonym of a word in a set of word choices that have significantly different meanings, are in general use, and tend to include no equivalent of the identified word above.
analogy: the individual is able to identify analogies that are respectively categorical in nature and related to two concepts that are mutually complementary.
reading: the individual is able to apply and analyze something in a discourse presented.

bench 4 (340)
synonymy: the individual is able to determine a word that is rarely used in daily life which is the equivalent of a word in a set of word choices that are rarely used in daily life with most or all of them having a similarity in meaning.
antonymy: the individual is able to identify a word that is rarely used in daily life which is the antonym of a word in a set of word choices that have a similarity in meaning, tend to be rarely used, and include no equivalent of the identified word above.
analogy: the individual is able to identify analogies that are respectively synonymy and antonymy in nature.
reading: the individual is able to make a synthesis and an evaluation of something in a discourse presented.

the setting of benches used the scores around the mean of the results of analysis on all the packages above. the classification of tbs items per subtest based on four international benchmark percentile values (kelly, 2002, p.378) resulted in four benches and a group of test items at each bench. based on the analysis, the items netted at benches 3 and 4 turned out to be greater in number than those retained at benches 1 and 2. this could be explained by the fact that the test packages analyzed were those used for selection needs. in constructing test items into a test instrument, one should pay attention to the purpose of giving the test and be able to anticipate the distribution of the testees' ability. a test whose purpose is to be a selection instrument should be able to net individuals with high levels of ability.

table 3. description of potential in the quantitative subtest

bench 1 (240)
number sequence: the individual is able to determine the pattern of a number sequence which is a combination of numerical operations.
arithmetic & algebra:
a. the individual is able to calculate a numerical operation (addition/subtraction) on whole numbers and exponent numbers.
b. the individual is able to solve a linear equation of two variables which is presented in a story form and uses whole numbers.
c. the individual is able to solve a statistics problem in a diagrammatic form.
geometry: the individual is able to solve a problem in both picture and story form involving simple two- or three-dimensional geometrical shapes.

bench 2 (280)
number sequence: the individual is able to solve a number sequence with 1 or 2 jumps each time and by using combined numerical operations, such as the addition of an exponentiated number each time.
arithmetic & algebra:
a. the individual is able to solve a numerical computation using a combination of numerical operations (addition, subtraction, multiplication, or division) on whole numbers.
b. the individual is able to solve a computation related to a story involving a number set.
c. the individual is able to solve an algebraic computation with one variable.
d. the individual is able to solve a problem with statistics (the mean score) presented in graph or diagram form.
geometry:
a. the individual is able to solve a problem in plane geometry involving a combination of two two-dimensional shapes.
b. the individual is able to determine the volume of a solid shape (cubic or rectangular) having different units.

bench 3 (320)
number sequence: the individual is able to solve a number sequence with 1 or 2 jumps each time by using a combination of numerical operations so that a pattern of another number sequence is formed in the number sequence.
arithmetic & algebra:
a. the individual is able to calculate a computation in number or story form with a numerical operation (addition/subtraction/division/multiplication) on whole or rational numbers.
b. the individual is able to solve a problem in algebra consisting of three unknown variables.
c. the individual is able to solve a problem with statistics (the mean score) presented in story form.
d. the individual is able to solve a problem in a story involving arithmetic by making an equation consisting of 2 variables and using rational numbers.
geometry:
a. the individual is able to calculate the magnitude of an angle in a geometrical shape.
b. the individual is able to analyze information in a picture and use the information in problem solving.
c. the individual is able to calculate the area/volume of an object whose case fits daily life.

bench 4 (360)
number sequence: the individual is able to solve a number sequence with 1 or 2 jumps each time which is formed from a multi-level pattern.
arithmetic & algebra:
a. the individual is able to calculate a numerical computation by means of a combination of numerical operations and a combination of number types (whole, exponentiated, or rational).
b. the individual is able to determine the relation involving two or three variables and using rational numbers.
c. the individual is able to analyze and process information of a problem in statistics presented in story or graph form.
d. the individual is able to use logical and mathematical reasoning to solve a problem in arithmetic.
geometry:
a. the individual is able to use logical and mathematical reasoning to solve a problem in two-dimensional geometry in combined-shapes or story form.
b. the individual is able to use logical and mathematical reasoning to solve a problem in three-dimensional geometry in combined-shapes or story form.

test items chosen for selection needs should be able to estimate testees whose ability fits the cutoff scores, with a probability of 0.5 of answering the items correctly (hambleton & swaminathan, 1985, p.229). thus, in constructing tbs items, the proportion of items with moderate and high levels of difficulty is greater than that of easy items. one of the considerations is that the test takers are prospective students at several state universities in indonesia. an effect of this condition is that the construction of the description of potential at bench 1 becomes less than ideal.
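the 0.5 criterion can be read off the 1-pl model used in the irt analysis. as a sketch in standard rasch notation (the symbols \theta for a testee's ability and b_i for the difficulty of item i are notational assumptions, not taken from the text above):

```latex
P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}},
\qquad
P_i(\theta) = 0.5 \iff \theta = b_i .
```

so an item whose difficulty equals a cutoff ability is answered correctly with probability 0.5 by a testee exactly at that cutoff, and with a higher probability by a testee above it.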
based on the classification of test results in the previous section, the next step was constructing the description of potential at each bench. at this stage, the researchers were helped by 10 people who were competent in their fields. before the descriptions were made, a preliminary fgd was held between the researchers and several resource persons. the description of the potential in the verbal, quantitative, and reasoning subtests can be seen in table 2, table 3, and table 4.
with the description of potential at each bench of the verbal, quantitative, and reasoning subtests as the basis, it was found that each bench had unique features. the unique features of the verbal subtest were, among others, (1) differences in the degree of generality in the use of words in daily life and in the combination of categories of answer choices, (2) the pattern of relation occurring in items on analogy, and (3) cognitive activity ranging from memorization through to evaluation in items on reading comprehension. hayes (1989) breaks down cognitive activity into several stages: identifying the problem, representing the problem, planning the solution, executing the plan, evaluating the plan, and also evaluating the solution. according to the toefl ibt descriptors for the reading test, low and high levels differ in understanding sentences expressed explicitly or implicitly, factually or abstractly, and in the complexity of a concept (gomez, noah, schedl, wright & yolkut, 2007, pp.424-437). the higher the bench of the tbs item on discourse, the greater the need for the evaluation stage of cognitive activity in reaching a solution.

table 4. description of potential in the reasoning subtest

bench 1 (280)
logical: the individual is able to determine a conclusion based on two premises containing an argument that is, in nature, general/universal/a common postulate.
diagrammatic: the individual is able to determine the function of part relation between objects differing in type/form/function/characteristic.
analytical: the individual is able to use the information needed for problem solving.

bench 2 (310)
logical: the individual is able to determine a conclusion based on two premises containing an argument which is assumptive (supposition/assumption) in nature and does not apply universally/generally/commonly.
diagrammatic: the individual is able to determine the function of part relation between objects that are the same in classification but different in function/form/characteristic.
analytical: the individual is able to analyze and determine the information needed for problem solving.

bench 3 (340)
logical: the individual is able to determine a conclusion based on two premises containing an argument which is assumptive in nature and does not apply generally on all answer alternatives using the two premises.
diagrammatic: the individual is able to determine the function of part relation between living creatures or between abstract concepts (like constructs of profession/status/condition/characteristic).
analytical: the individual is able to analyze and process the information needed for problem solving.

bench 4 (370)
logical: the individual is able to determine a conclusion based on two premises containing an argument which is hypothetical in nature.
diagrammatic: the individual is able to determine the function of part relation between objects, living creatures, and concepts simultaneously.
analytical: the individual is able to analyze and process the information needed for problem solving with various possible solutions.

examples of items per bench of the verbal subtest are presented as follows.
example 1. item of bench 1 in the verbal subtest on reading.
oceans have enchanted the human race for thousands of years, perhaps since people stood on the shore wondering where waves came from and what lay beyond the far horizon. but at those times the sea was also something feared. there reigned storm gods, horrible creatures, and catastrophes.
only after centuries did human beings dare to sail far out to sea until the land was out of sight. the sea still enchants us though many of its secrets have been revealed. we fly across it without hesitation. various cargo ships traverse it, transporting food, fuels, raw materials, and factory products. modern fishing ships catch fish and process them on board. but in various places there are still many traditional fishermen using nets from sailing ships or sailboats. for scientists studying the sea, the last 30 years have yielded interesting and abundant new information. as if in a detective story, clues are gradually collected: from rocks at the sea bottom and fossils on land, from modern volcanoes and traces of magnetism in ancient rocks. and from all those emerges a picture of a gigantic geological force from the past, still changing the sea bottom even now. imagine a landscape with mountains greater than the himalayas, plains surpassing africa and asia in vastness, and trenches that could swallow mountains. that landscape exists at the ocean bottom, made by an awesome force that has been tearing the earth's rocky crust, and then shaking it and turning it inside out repeatedly for millions of years. the idea that continents shift is nothing new. it was first expressed 130 years ago. but at that time it was considered outrageous and ridiculous and the idea was ignored. with the passing of time, there was increasingly more proof until the invention of the echo sounder and the equipment to grip and open the curtain in the 1960s.
martin bramwell “ocean”
what made the condition at the bottom of the ocean?
a. the movement of the earth
b. a gigantic geological force*
c. the power of a sea-bottom creature
d. huge animals of the sea bottom
e. waves left by large boats

example 2. item of bench 2 on antonymy in the verbal subtest
reduction
a. profit
b. dividend
c. demand
d. addition*
e. advantage

example 1 is a test item on discourse of bench 1.
according to that example, it is expected that someone could mention a fact, definition, or concept found in the discourse without having to do any analysis. the fact to be mentioned according to the discourse is the process forming the condition at the ocean bottom. example 2 is a test item on antonymy of bench 2. the word reduction is a word in common use in education. it has the sense of descent or decrease. the answer choices given are also common in nature and tend to have different meanings.
the description of the quantitative subtest is differentiated into those of number sequence, arithmetic and algebra, and geometry. each bench also has its own specific characteristics. the characteristics are, among others: (1) the complexity of the pattern forming a number sequence, where the higher the bench, the more complex the pattern occurring in the series; (2) the mathematical operation, number type, number of variables in equations, and item material in the sub-subtest of arithmetic and algebra; and (3) the geometrical shapes in pictorial and narrative items. however, the cognitive process at bench 3 and that at bench 4 are almost the same in complexity, so that no consistent increase occurs. the results of research by ferrara et al. (2011) also indicate that the description of the cognitive and language process in items on mathematics does not consistently rise at levels 3, 4, and 5. a descriptor that cannot describe the ability that should be mastered at each level causes a lack of clarity about what ability someone should possess at each level. several examples of the quantitative subtest items at each bench are as follows.
example 3.
item of bench 1 on arithmetic and algebra in the quantitative subtest
[figure: pie chart with sectors s1 (160°), sma, smp (60°), and s2]
the diagram describes the level of final education of every head of family in an rt (rukun tetangga or 'neighborhood community') named rt. 03. if the number of families in rt. 03 is 72, how many of them are those whose final education was smp (sekolah menengah pertama or 'junior high school')?
a. 10
b. 12*
c. 15
d. 18
e. 32

example 4. item of bench 2 on geometry in the quantitative subtest
[figure: drawing of a house labeled 1.5 cm wide and 4.8 cm high]
the actual height of a house is 7 m. if, in a drawing of the house, its height is 4.8 cm and its width is 1.5 cm, its actual width is ….
a. 2.188 m*
b. 2.240 m
c. 21.88 m
d. 22.40 m
e. 224 m

in example 3, an item of arithmetic and algebra at bench 1, the potential measured is in solving a statistics item in the form of a pie chart. the item is an easy one because the individual directly uses the information obtained from the diagram (the smp sector of 60° out of 360° corresponds to 60/360 × 72 = 12 families) without having to set up a mathematical equation. the item in example 4 is a geometry item of bench 2. it is expected that, in dealing with the item, an individual could solve a problem in geometry concerning an already modified two-dimensional drawing of an object; here the scale of the drawing gives an actual width of 1.5 × 7/4.8 ≈ 2.188 m.
based on the description of potential in reasoning previously discussed, the specific characteristics distinguishing the benches from each other are: (1) the nature of the premise in the sub-subtest on logic and the drawing of a conclusion based on the two premises given; (2) the characteristic of the subject-concept relation in the diagram item; and (3) the cognitive process in the sub-subtest on analyticality, from using through to processing information in order to obtain a solution. the following are examples of items of certain benches in the reasoning subtest.

example 5. item of bench 1 on logic
all motorcycle riders must wear a yellow helmet. all female motorcycle riders wear gloves.
a. a number of motorcycle riders do not wear a yellow helmet though they wear gloves.
b. all motorcycle riders do not wear gloves.
c. a number of motorcycle riders wear neither a helmet nor gloves.
d. there are motorcycle riders who wear a yellow helmet but they do not wear gloves.*
e. a number of motorcycle riders do not wear a helmet and do not wear gloves.

example 6. item of bench 2 on analyticality
the results of a geological survey in several regions in africa indicate that there are several volcanoes that are still active, with a division as follows. a volcano which is highly active has an activeness scale above 7, one which is moderately active has an activeness scale ranging from 4 to 7, and one which has a low level of activity has an activeness scale of less than 4. it is found that mount h has an activeness scale of 5, mount k has an activeness scale 4 points higher than that of mount h, mount a has an activeness scale 3 points below that of mount k, while mount w has an activeness scale 5 points above that of mount s, which has an activeness scale 3 points below that of mount a. so the right statement is … .
a. mounts h and w are volcanoes with moderate activeness.
b. mounts k and a are volcanoes which are highly active.
c. mount h is higher in level of activeness than mount w.
d. mounts a and h are volcanoes with moderate activeness.*
e. mount k is lower in level of activeness than mount w.

examples 5 and 6 are respectively items on logic and analyticality. the premise given in example 5 is general and factual in nature and describes the obligation of motorcycle riders. with that condition, it could be easier for an individual to draw a conclusion. the item on analyticality in example 6 measures an individual's ability to analyze and determine the information to be used.
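the chain of relations in example 6 can be checked with a short python sketch. the variable and function names are chosen here for illustration only; the numbers and thresholds are those stated in the item.

```python
# activeness scales derived from the relations stated in example 6
h = 5          # mount h has a scale of 5
k = h + 4      # mount k is 4 points above mount h
a = k - 3      # mount a is 3 points below mount k
s = a - 3      # mount s is 3 points below mount a
w = s + 5      # mount w is 5 points above mount s


def activeness_level(scale):
    """classify a volcano by the thresholds given in the item."""
    if scale > 7:
        return "high"
    if scale >= 4:
        return "moderate"
    return "low"


scales = {"h": h, "k": k, "a": a, "s": s, "w": w}
levels = {name: activeness_level(value) for name, value in scales.items()}
print(scales)
print(levels)
```

mounts a and h both fall in the moderate range, which agrees with the keyed answer d, while mounts k and w are both above 7.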
the individual could determine the activeness of each volcano based on the criteria available and make an analysis to decide which volcanoes fit the criteria best.
the error of measurement at the university level was also determined based on the value of the information function of the test. the results of the analysis on the error of measurement at the benches can be seen in table 5. the mean of the error of measurement at each bench in the verbal subtest is 0.22, that in the quantitative subtest is 0.31, and that in the reasoning subtest is 0.21. the error of measurement of the quantitative subtest at bench 1 is still fairly high; this is caused by the small number of items retained at the bench. the case is different for the reasoning subtest; there the error of measurement at bench 1 is the smallest compared to that at any of the other benches. according to the classification results, the items retained at that bench are fairly great in number.

table 5. error of measurement

bench    verbal  quantitative  reasoning
bench 1  0.25    0.47          0.19
bench 2  0.22    0.33          0.22
bench 3  0.20    0.24          0.20
bench 4  0.22    0.18          0.25

with the analysis on the error of measurement at each bench as the basis, it is found that the error of measurement at a lower bench tends to be higher, except at bench 1 of the subtest on reasoning. one factor causing this is that at a lower bench only a few items are retained, so that the test information given is also small. a small value of the test information causes the error of measurement to become greater. the proportion of items with low difficulty levels is smaller than that of items with high difficulty levels because the test package is used as a selection instrument. another factor is a mismatch between testees' ability and the difficulty level of the items retained, which also makes the resulting test information small.
conclusion and suggestion
conclusion
based on the results of the analysis and discussion, the following conclusions could be drawn. (1) the classification of tbs items with the scale anchoring method is able to group the tbs items into four benches. the items netted give a clear description of the scholastic aptitude potential and are able to distinguish the difference in cognitive process between a lower bench and a higher bench. (2) the description of the potential at each bench shows unique features, so that it could distinguish the difference in potential between a lower bench and a higher bench. at a higher bench, a higher level of reasoning power is required in analyzing and processing the needed information so that the individual concerned could solve the problem with the right solution. (3) the items netted at a lower bench in the three subtests tend to be few, so that the error of measurement at such a bench still tends to be higher than that at a higher bench.
suggestion
based on the objective, significance, and conclusion of the research, it is suggested that other researchers who are interested in the benchmarking process conduct related research on another class subject. this suggestion is offered with the consideration that each class subject possesses its own uniqueness.
references
anastasi, a. (1988). psychological testing (6th ed.). new york, ny: macmillan.
azwar, s. (2008). kualitas tes potensi akademik versi 07a [the quality of the academic potential test version 07a]. jurnal penelitian dan evaluasi pendidikan, 12(2), 231-250. retrieved from http://journal.uny.ac.id/index.php/jpep/article/view/1429/1217
beaton, a.e. & allen, n.l. (1992). interpreting scales through scale anchoring. journal of educational statistics, summer, 17(2), 191-204.
bejar, i.i. (2008). standard setting: what is it? why is it important? r&d connections, 7.
berk, l. (2000). child development (5th ed.). massachusetts, ma: allyn and bacon.
cizek, g.j. & bunch, m.b. (2007). standard setting: a guide to establishing and evaluating performance standards on test. thousand oaks, ca: sage.
cohen, r.j. & swerdlik, m.e. (2002). psychological testing and assessment: an introduction to test and measurement (5th ed.). boston, ma: mcgraw-hill.
embretson, s. & reise, s.p. (2000). item response theory for psychologists. mahwah, nj: lawrence erlbaum.
ferrara, s., svetina, d., skucha, s. & davidson, a.h. (2011). test development with performance standards and achievement growth in mind. educational measurement: issues and practice, 30(4), 3-15.
forsyth, r.a. (1991). do naep scales yield valid criterion-referenced interpretations? educational measurement: issues and practice, 10(3), 3-9, 16.
frey, m.c. & detterman, d.k. (2003). scholastic assessment or g? the relationship between the scholastic assessment test and general cognitive ability. case western reserve, oh: department of psychology.
galotti, k.m. (2004). cognitive psychology in and out of the laboratory (3rd ed., pp.391-392). belmont, ca: wadsworth.
geisinger, k.f. & mccormick, c.m. (2010). adopting cut scores: poststandard-setting panel considerations for decision makers. educational measurement: issues and practice, spring, 1, 38-44.
gomez, p.g., noah, a., schedl, m., wright, c., & yolkut, a. (2007). proficiency descriptors based on a scale-anchoring study of the new toefl ibt reading test. language testing, 24, 417-444.
hambleton, r.k., swaminathan, h. & rogers, h.j. (1991). fundamentals of item response theory. newbury park, ca: sage.
hambleton, r.k. & swaminathan, h. (1985). item response theory: principles and applications. boston, ma: kluwer-nijhoff.
harman, g. (1994). student selection and admission to higher education: policies and practices in the asian region. higher education, 27(3), 313-339.
hayes, j.r. (1989). the complete problem solver (2nd ed.). hillsdale, nj: erlbaum.
jaeger, r.m. (1989). certification of student competence. in r.l. linn (ed.), educational measurement (3rd ed., pp. 485-514). new york, ny: american council on education/macmillan.
keeves, j.p. & alagumalai, s. (1999). new approaches to measurement. in g.f. masters & j.p. keeves (eds.), advances in measurement in educational research and assessment (pp.23-42). new york, ny: pergamon.
kelly, d.l. (2002). application of the scale anchoring method to interpret the timss achievement scales. in d.f. robitaille & a.e. beaton (eds.), secondary analysis of the timss data. new york, ny: kluwer academic publishers.
mardapi, d., hadi, s., & retnawati, h. (2015). menentukan kriteria ketuntasan minimal berbasis peserta didik. jurnal penelitian dan evaluasi pendidikan, 19(1), 38-45. doi: http://dx.doi.org/10.21831/pep.v19i1.4553
martin, m.o., mullis, i.v.s., beaton, a.e., gonzalez, e.j., smith, t.a., & kelly, d.l. (1997). science achievement in the primary school years: iea's third international mathematics and science study (timss). chestnut hill, ma: boston college.
mullis, i.v.s., martin, m.o., beaton, a.e., gonzalez, e.j., kelly, d.l., & smith, t.a. (1998). mathematics and science achievement in the final year of secondary school: iea's third international mathematics and science study (timss). chestnut hill, ma: boston college.
olatoye, r.a. & aderogba, a.a. (2011). performance of senior secondary school science students in aptitude test: the role of student verbal and numerical abilities. journal of emerging trends in educational research and policy studies (jeteraps), 2(6), 431-435.
perie, m. (2008). a guide to understanding and developing performance-level descriptors. educational measurement: issues and practice, winter, 27(4), 15-29.
resnick, l.b., nolan, k.j., & resnick, d.p. (1995). benchmarking education standards. educational evaluation and policy analysis, 17(4), 438-461.
solso, r. (2001). cognitive psychology (6th ed., pp.428-429). boston, ma: allyn and bacon.
wedman, i. (1994). the swedish scholastic aptitude test: development, use, and research. educational measurement: issues and practice, winter, 13, 5-11.
wyatt, j., kobrin, j., wiley, a., camara, w.j., & proestler, n. (2011). sat benchmarks: development of the college readiness and its relationship to secondary and postsecondary school performance. college board: research report, 5, 5-30.