Rasch Measurement in Language Research: Creating the Foreign Language Classroom Anxiety Inventory Research Reports Rasch Measurement in Language Research: Creating the Foreign Language Classroom Anxiety Inventory Miranda J. Walker*a, Panayiotis Panayidesa [a] Lyceum of Polemidia, Limassol, Cyprus. Abstract The purpose of this study was to construct a new scale for measuring foreign language classroom anxiety (FLCA). It begun with the creation of an extended item pool generated by qualitative methods. Subsequent Rasch and semantic analyses led to the final 18-item Foreign Language Classroom Anxiety Inventory (FLCAI). In comparison with the Foreign Language Classroom Anxiety Scale (FLCAS), the FLCAI demonstrated more convincing evidence of unidimensionality and the optimal 5-point Likert scale functioned better. The FLCAI, while 55% the length of the FLCAS, thus more practical for classroom practitioners to administer and analyse, maintains its psychometric properties and covers a wider range on the construct continuum thus improving the degree of validity of the instrument. Finally, test anxiety was shown to be a component of FLCA. Keywords: foreign language classroom anxiety, Rasch measurement, unidimensionality, reliability Europe's Journal of Psychology, 2014, Vol. 10(4), 613–636, doi:10.5964/ejop.v10i4.782 Received: 2014-04-03. Accepted: 2014-06-13. Published (VoR): 2014-11-28. Handling Editor: Maciej Karwowski, Academy of Special Education, Warsaw, Poland *Corresponding author at: Anthemidos 12, 4007, Limassol, Cyprus. E-mail: mirandajanewalker@gmail.com This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Foreign language anxiety (FLA) is a key issue to be addressed by language teachers as it can cause students to become less receptive to language input (Krashen, 1981), and thus slow down the language learning process. Furthermore, it can negatively affect student motivation (Liu, 2012; Liu & Huang, 2011) as well as achievement (Horwitz, 1986; MacIntyre & Gardner, 1991; Mahmood & Iqbal, 2010; Yan & Horwitz, 2008). Debate, initiated by Sparks and Ganschow (1991), continues over correlation and causation, in other words whether anxiety causes poor performance or poor performance causes anxiety (Horwitz, 2000; MacIntyre & Gregersen, 2012). Either way, as anxiety can have debilitating effects on foreign language learning, identifying students with high levels of foreign language anxiety is important (Horwitz, Horwitz, & Cope, 1986). By identifying such students, the teacher can take steps to help them cope with this. If the teacher is unaware that their students are suffering from FLA, they may perceive students’ behaviour as lack of motivation, abilities and / or poor attitude. In fact Aida (1994) suggests that students learn more effectively when teachers take necessary measures to help them overcome their FLA. Europe's Journal of Psychology ejop.psychopen.eu | 1841-0413 http://creativecommons.org/licenses/by/3.0/ http://creativecommons.org/licenses/by/3.0 http://ejop.psychopen.eu/ http://ejop.psychopen.eu/ http://www.psychopen.eu/ Literature Review Anxiety, according to Spielberger (1983), is the subjective feeling of tension, apprehension, nervousness, and worry associated with an arousal of the automatic nervous system. (Horwitz, Horwitz, & Cope, 1986, p. 125). He defines state anxiety as an ‘unpleasant emotional state or condition’ and trait anxiety as a ‘relatively stable indi- vidual difference in anxiety-proneness as a personality trait’ (p. 1). Horwitz et al. (1986) suggest that clinically ‘the subjective feelings, psycho-physiological symptoms, and behavioural responses of the anxious foreign language learner are essentially the same as for any specific anxiety’ (p. 29). Notwithstanding, Horwitz et al. (1986), as pi- oneers in the field, have played a significant role in facilitating an understanding of FLA by providing the following definition: FLA is ‘a distinct complex of self-perceptions, beliefs, feelings and behaviors related to classroom lan- guage learning which arise from the uniqueness of the language learning process’ (p. 128). Horwitz et al. (1986) further suggest that FLA is related to communication anxiety, fear of negative evaluation and test anxiety. Horwitz and Young (1991) do however inform us that in the literature there are two approaches to language anxiety. In one it is viewed as ‘a manifestation of other more general types of anxiety’ (p. 1). In the other it is considered as ‘a distinctive form of anxiety expressed in the response to language learning’ (ibid). Aida (1994) suggests that students with a fear of negative evaluation might become passive in the classroom and that, in extreme cases, the students may even consider skipping lessons in order to avoid anxiety situations, and thus they are left behind. Gender comparisons of FLA have been varied. Park and French (2013) report significantly higher anxiety levels in females than males whereas Matsuda and Gobel (2004) found no significant differences. Along with their world renowned definition of FLA, Horwitz et al. (1986) designed the Foreign Language Classroom Anxiety Scale (FLCAS). This 33 item, five category Likert scale was designed for use with university students almost 30 years ago but remains a popular instrument. Chan and Wu (2004) note that ‘due to the scale’s success on construct validation and reliability, the FLCAS has been widely adopted by many researchers to explore learners’ foreign language anxiety’ (p. 292). The majority of these studies has also been with university students. Park and French (2013) suggest that as this anxiety scale has been widely used around the world, psychometric evidence has been established. They state that the internal consistency of the FLCAS was high in many cited studies. However, they add that the latent factor structures differed across studies citing, among others, Aida (1994), Horwitz (1986) and Tóth (2008). Most recently, Panayides and Walker (2013) showed through Rasch measurement, that the scale is unidimensional, and that test anxiety is indeed a component of FLCA. Nonetheless, they brought into question the extremely high reliability (internal consistency) suggesting possible flaws in the scale. Teachers and students are direct stakeholders in language teaching and learning. Both must work in unison in order to achieve maximum results. Indeed research has shown that anxiety can ‘be changed and shaped through teacher intervention in learning’ (Robinson, 2002, p. 8). This reiterates the need for teachers to assess their students’ degree of anxiety. As teachers’ perceptions of their students’ language anxiety are not always congruent with that of the students (Levine, 2003), self-report instruments measuring student anxiety are vital tools for the educator. Over recent years the literature has begun to embrace a more dynamic, multifaceted relationship between anxiety, motivation, self-efficacy and other language learning variables such as learner characteristics and teaching styles (Liu, 2012; Liu & Huang, 2011). Such advances in research, as well as the fact that time and settings are not constant, infer a need to re-evaluate and perhaps refine even widely-accepted instruments such as the FLCAS. Besides, even ‘the most accepted working hypotheses themselves may need revising’ (Spielmann & Radnofsky, 2001, p. 261). Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 614 http://www.psychopen.eu/ Validity of Scales ‘Validity always refers to the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores’ (Messick, 1993, p. 13). This is because ‘test responses are a function not only of the items, tasks, or stimulus conditions but of the persons responding and the context of measurement’ (Messick, 1993, p. 15). For this reason, as Yun and Ulrich (2002) stress, the appropriateness of an instrument should be established through validity investigations prior to its use in new situations or new attributes of population. Furthermore, Luyt (2012) advocates that such a process determines whether a test ‘requires revision or a new instrument might better be developed’ (p. 297). Panayides and Walker (2013) conducted a study of the psychometric properties of the FLCAS on a Cypriot senior high school population (16-18 years old). They verified that test anxiety was a component of FLA, which had previously been disputed (Aida, 1994; Cheng, Horwitz, & Schallert, 1999; Matsuda & Gobel, 2004). Their analysis also revealed flaws in the original FLCAS. They found the reliability of the FLCAS to be very high, in accordance with most primary studies on the FLCAS. They suggested that such high reliability is undesirable in psychometric scales since it can lower their degree of validity. Panayides and Walker’s (2013) investigation revealed two reasons for such a high reliability. First, the items covered a rather narrow range on the construct continuum. Such construct underrepresentation threatens validity (Messick, 1993) and jeopardizes the precision of person estimates (Panayides & Walker, 2013). Second, the scale includes many parallel items. The researchers believe that the use of parallel items should be avoided as: • they may give a false sense of a high degree of reliability • the validity of the instrument is compromised • for basic research, very high reliabilities are not necessary (Nunnally, 1978) • while Ray (1988) supports the use of parallel items in order to investigate the consistency in an individual’s response pattern, such practice can be successfully avoided by using Item Response Theory (IRT) models and person fit statistics. One approach which excels in collecting a wide range of validity evidence of an instrument is Rasch measurement. Rasch Measurement The most important difference perhaps, and at the same time the strength, of the Rasch models over other IRT models is its philosophy. Rasch measurement is a mathematical framework of ideal measurement, against which test and scale developers can assess their data (Bond & Fox, 2001, 2007). Any departure from the models’ re- quirements constitutes parting from useful measurement. Other IRT models are statistical models aimed at incor- porating all characteristics observed in the data without any regard as to whether they contribute to the measurement process. Panayides, Robinson, and Tymms (2010) argue that the difference is between measurement and mod- elling. If the aim is to describe the data at hand, trying to model all of their characteristics, then IRT models are preferable. On the other hand if the aim is to construct good measures then the scale or test items should be constrained to the principles of measurement, and this can be achieved only through the use of Rasch measurement. Assessing Unidimensionality, the Rasch Approach Scales such as the FLCAS, 'where single scores are used to position individuals on a latent trait continuum should be unidimensional' (Panayides & Walker, 2013, p. 496). Thus, before such scales are used, their unidimensionality must be established as an important component in the investigation of their degree of validity. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 615 http://www.psychopen.eu/ Factor analysis (FA) is the most widely used method for assessing the dimensionality of data. However, FA has a major drawback. As with all other statistical analyses, it operates on interval-level data when in fact, the scale scores are ordinal by nature. This makes results of such analysis disputable. Rasch measurement transforms ordinal scores into an interval-level logit scale thus producing more reliable results (Wright & Masters, 1982). Schumacker and Linacre (1996) state that: Factor analysis is confused by ordinal variables and highly correlated factors. Rasch analysis excels at constructing linearity out of ordinality and at aiding the identification of the core construct inside a fog of collinearity. (p. 470) Following the Rasch calibrations, principal components analysis (PCA) of the standardised residuals can be per- formed. This method has been shown to be more effective at identifying multidimensionality than factor analysis of the raw data (Linacre, 1998). With regard to the critical value of the eigenvalue, above which the factor extracted can be considered a different dimension, Linacre (2005) argues that any value smaller than 2 indicates the strength of less than two items thus the implied dimension has little strength in the data. He suggests that, in a test with a reasonable length, if a secondary dimension has eigenvalue less than 3, the test is most probably unidimensional. This Study The FLCAS is irrefutably a well-established instrument which has been used productively for so long that it is in- evitable that the new scale will be met by some with scepticism and that comparisons may be made between the two instruments. This is welcomed as it will help establish the validity of the scale in other similar settings. This study follows on from Panayides and Walker (2013) who suggested further research be carried out so as to refine the FLCAS. Rasch measurement was used in this study to construct a new foreign language classroom anxiety scale that is less lengthy, covers the construct of FLA more adequately and maintains the reliability, while at the same time enhances the validity of the instrument. The following recommendations made by Panayides and Walker (2013) were adhered to. First, five items were removed from the original FLCAS due to poor fit to the Rasch model, and a careful semantic analysis of the re- maining 28 items was conducted in order to remove parallel or repetitive items. Second, new items were added so as to achieve a wider coverage of the construct and to improve item targeting. Finally, the category labels were changed in order to facilitate possible collapsing should the need arise. The research questions guiding this study were: 1. Is the new 5-point rating scale psychometrically optimal? 2. Does the new scale provide reliable person measures? 3. Do the scale items define a single construct? 4. How does the new scale compare with the original FLCAS? Despite the fact that Panayides and Walker (2013) demonstrated test anxiety to be a component of FLCA through Rasch analyses, previous inconsistencies in research findings (Aida, 1994; Horwitz et al., 1986; MacIntyre & Gardner, 1989) as well as the significance of test anxiety in the literature, indicated a need to explore this again in the new scale. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 616 http://www.psychopen.eu/ Method The Creation of the New Scale In keeping with AERA, APA, & NCME (1999) the following were documented: (1) the procedures used to develop, review, try out and select items from the item pool; (2) the model used for evaluating psychometric properties of items and data used for item selection (3) the IRT used in the test development, the item response model and evidence of model fit. Gehlbach and Brinkworth (2011) suggest that ‘a literature review and focus group-interview data can be synthesized into a comprehensive list to facilitate the development of items’ (p. 380). This study sought teacher and student input and in doing so supports Pasquale’s (2011) stance that ‘to ignore such beliefs handicaps the language teaching and learning process at nearly every step of the way’ (p. 97). To this end, two focus groups were held in January 2013 with experienced language teachers. The creation of the new scale began with an oversized pool of items. These included 22 items from the original 33 items of the FLCAS (Horwitz et al., 1986) and 17 newly created items. The original 33 items were reduced to 22 through a multi-step process. First, the five items found by Panayides and Walker (2013) to be misfitting the Rasch model in a similar sample of Cypriot high school students were removed. They found the scale to be unidi- mensional once these had been removed. Second a further six items were removed including four parallel items and two which were not considered relevant to classroom anxiety. The two deemed unrelated for the classroom setting were: Item 14, I would not be nervous speaking the foreign language with native speakers and Item 32, I would probably feel comfortable around native speakers of the foreign language. Following this, additional items were generated with the intention of improving coverage of the construct. In ac- cordance with the substantive approach (Messick, 1993), items were included in the pool on the basis of judged relevance. The researchers based their decisions on an in-depth study of the literature, an informal discussion with EFL students, two focus group discussions with language teachers and three personal interviews with recently retired teachers of senior high school EFL. EFL students from ten mixed ability classes were asked informally by their teacher, one of the researchers in this study, to state what makes them anxious in the EFL classroom. They were told that their answers would help her address their anxieties better and also facilitate research. All ideas raised were taken down in writing by the re- searchers. In January 2013 two focus group discussions were held. Rodriguez, Schwartz, Lahman, & Geist (2011) contend that focus groups are ‘a powerful qualitative research method which, especially when designed to be culturally responsive, facilitate collection of rich and authentic data’ (p. 400). One of the groups comprised of 12 teachers and the other of 37 teachers. The participants, who had between one and 33 years teaching experience, were asked to share their beliefs concerning what makes students anxious. They were asked to note down their ideas without discussion a few minutes prior to the main group discussions. These notes were collected to prevent changes being made during the discussion. Ideas raised during the ensuing discussion were noted by one of the two researchers, while the other coordinated the discussion. Finally three recently retired teachers of senior high school EFL, with an average of 37 years teaching experience, one of whom a trained psychologist, were consulted. They were asked by email to write down what they believed Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 617 http://www.psychopen.eu/ causes EFL students classroom anxiety. Following this, personal interviews were held between them and one of the researchers. The purpose of these was to expand on ideas raised in the written comments. Details of the data collected from the student and teacher discussions and interviews can be found in Appendix A, in frequency order. The new items stemming from qualitative data collected for the extended item pool were written directly in Greek as the data from which they were generated was Greek. This is also the language of the participants. Items designed as a result of the literature review were written in English and translated into Greek. The Greek version of the original FLCAS, translated by Panayides and Walker (2013) was also used. The resulting 39 item scale was administered, in April 2013, to 212 students in three senior high schools in Limassol, Cyprus, all of whom had been studying English for a minimum of seven years. In compliance with Messick’s (1993) guidelines, the item responses were obtained and analyzed. The final scale items were subsequently selected from the pool. The selection process involved, in line with AERA et al. (1999), looking into two different sets of criteria (a) semantic and (b) statistical – measurement. The addressing of semantic issues included the removal of parallel items and those items which had been shown to contain ambiguities. An example of ambiguity can be found in Item 15 of the original FLCAS (Horwitz et al., 1986) ‘I get upset when I don't understand what the teacher is correcting’. Some stated that their level of English was higher than that of the lesson and so they very rarely made mistakes and wondered whether they should be answering hypothetically. It is customary for children in Cyprus to take private English tuition, from as young as six years of age, which progresses at a faster rate than school lessons do. Such confusion could lead to unreliable results and thus it was considered prudent to remove this item. For the examination of statistical – measurement criteria, the Rasch Rating Scale Model (Andrich, 1978) was used which is appropriate for analyses of Likert scales. The investigation involved analysing the estimated measures of the items in order to ensure a wide coverage of the construct, the point measure correlations and the item fit statistics. Five items with outfit of greater than 2.0 and with both infit and outfit greater than 1.5 were removed. Finally 22 items were kept. These items included 10 from the original FLCAS (Horwitz et al., 1986) and 12 newly designed items. Eight items concerned test anxiety, five items communication anxiety, four fear of neg- ative evaluation and five general anxiety items. Interestingly this break down reflects the emphasis given to each aspect of FLA in the group discussions, informal discussions with students and interviews held within the setting during the development stage of the extended item pool. The 22-item scale was administered to 285 high school students (16-18 years old) in October 2013. The Likert Scale — Great emphasis and consideration was placed on the number and labels of the Likert scale to be used. In their validation study of the FLCAS, Panayides and Walker (2013) documented concerns regarding the Likert scale for the population under study. The main problem encountered was that the probability curves of the middle three categories, namely ‘disagree’, ‘neither agree nor disagree’ and ‘agree’, failed to peak (be the most probable choice) for sufficiently large ranges on the construct continuum. They could not however be collapsed for semantic reasons. Since this is a valid concern it has been addressed in designing the modified scale by changing the labels. The importance of semantics is that should any subsequent problems in the Likert categories be diagnosed and collapsing of categories become necessary, it will be possible to do so. As such participants are asked to indicate ‘How often do the following statements apply to you?’ by selecting between never, rarely, sometimes, frequently and always. This maintains a starting point of five categories. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 618 http://www.psychopen.eu/ Assessing Unidimensionality — Many of the methods offered by the Rasch models were used for the assessment of the scale dimensionality. First, PCA of the standardised residuals was used as suggested by Linacre (1998). Second, the point measure correlations were observed for the possibility of unacceptable values (negative or close to zero) and third, the well-known mean square statistics, infit and outfit. Various values are suggested in the literature for cut-off values including the most popular for scales are 1.4 (Wright, Linacre, Gustafson, & Martin- Lof, 1994), 1.5 (Bond & Fox, 2001, 2007) and 1.6 (Curtis, 2004). There is a simple explanation as to why higher values are considered adequate as cut-off values for scales. For high stakes multiple-choice tests the items are highly controlled, carefully constructed and piloted and the examinees respond in a highly controlled environment. Questionnaires are usually less carefully constructed and there is less control over how respondents behave. Observational instruments usually have even less control (or even no control) of how respondents behave. Therefore, less control → more off-dimensional behaviour → worse fit expected. (Linacre, personal communication, March 7, 2007) The primary purpose of conducting a test or administering a scale is to measure the ability or position on the latent trait continuum of people. One needs measures that are good enough for the purpose. Rough measures are useful for the purpose of assessing personality traits and thus the fit criteria can be much more relaxed. Con- sequently 1.5 was chosen as the cut-off value for this study. In deciding whether an item should be removed from the scale or replaced, ‘one should use the suggested cut-off scores as a guide, and then rely on his/her profes- sional judgement and intuition to reach the best possible decision’ (Panayides, 2009, p. 134). Next the items were divided into two groups, test anxiety items and the rest. Person measures were obtained from the two groups of items separately. The two sets of person measures were compared using the correlation coefficient and by performing t-tests for differences in the measures from the two different calibrations as suggested by Smith (2002). A 95% confidence interval for the t-values would be approximately between -2 and 2. Values outside this range indicate significant differences between the person measures. Hence, any percentage of t- values outside this range of less than 5% indicates that the two item groups give statistically equivalent person measures and the two groups of items can be considered as measuring the same construct. The next step was to once again divide the items into two groups, the items from the original FLCAS and newly created items. The same method of t-tests for differences between the two sets of person measures (one from each item group) was used. In both cases where the t-test method was employed, the item estimates used were the ones obtained from the final calibration of the 18-item scale (shown in Table 2 in the results section). Finally, for the investigation of invariance, persons were divided by gender and item estimates were obtained separately for the two groups. Again the two sets of item estimates were compared using a scatter plot and the calculation of the correlation coefficient. Statistically equivalent item measures were obtained which supports the property of invariance. In other words the construct measured by the scale has the same meaning across the two groups of people. Reliability — Three reliability indices were used in assessing the reliability of the scale: Person Reliability, which shows how well the instrument can distinguish persons; Person Separation (Gp), which takes values from zero to infinity and indicates the spread of person measures in standard error units and Strata, given by the formula [(4Gp+1)/3]. Strata determines the number of statistically distinct levels (separated by at least three errors of measurement) of person abilities that the items have distinguished. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 619 http://www.psychopen.eu/ Item Reliability was also used as it is useful in showing how well the items are discriminated by the sample of re- spondents. Wright and Masters (1982) emphasise that good item separation is a necessary condition for effective measurement. Rasch Diagnostics for the Optimal Number of Likert Scale Categories — Optimizing a rating scale is ‘fine- tuning’ to try to squeeze the last ounce of performance out of a test (Linacre, 1997, para. 1). Linacre (2002) and Bond and Fox (2001, 2007) describe the Rasch measurement diagnostics for evaluating the effectiveness of the number of categories of a Likert scale. These diagnostics facilitate the investigation of the extent to which respond- ents can clearly identify the ordered nature of the rating scale response options, and whether they can accurately distinguish the difference between each category. Categories with very low frequencies (Linacre suggests lower than 10) do not provide enough observations for estimating stable threshold values. Such categories should be removed or collapsed with adjacent categories, provided that the semantics permit such collapsing. The average of the measures of all persons in the sample who choose successive categories should increase monotonically. This indicates that those with higher position on the construct continuum endorse higher categories. Likewise, the thresholds or step calibrations should also increase monotonically, otherwise they are considered disordered. In addition, the range of each category (the distance between successive thresholds) should not be too wide in order to avoid large gaps in the variable (not more than 5.0 logits), or too narrow to show distinction between categories (not lower than 1.0 for the case of a 5-point Likert scale). Step disordering and very narrow distances between thresholds ‘can indicate that a category represents too narrow a segment of the latent variable or correspond to a concept that is poorly defined in the minds of the respondents’ (Linacre, 2002, p. 98). Finally, the outfit statistic provides another useful tool in assessing the effectiveness of the categories. Values of outfit greater than 2.0 indicate more misinformation than information, that is, the category introduces noise into the measurement process. Estimation Method — WINSTEPS (Linacre, 2005) was used for the analysis of the data. The estimation method used with this software is Joint Maximum Likelihood Estimation (JMLE) in preference to Conditional Maximum Likelihood Estimation or Marginal Maximum Likelihood Estimation. Linacre (2005) explains that JMLE is preferred ‘because of its flexibility with missing data. It also does not assume a person distribution’ (p. 11). He also clarifies that any estimation bias is not a real concern as, except in rare cases with short tests or small samples. Results The First Calibration of the 22-Item Scale The first calibration (22 items and 285 persons) showed very satisfactory reliability indices: Person Reliability = .93, Person Separation = 3.70 and Strata = 5.27. However, this calibration also revealed that some items had almost identical statistics. When constructing new scales a good range of item difficulty estimates is needed (that is, a good spread of the items on the construct continuum) so as to attain a high degree of reliability. Empirically, when two or more items are functioning almost exactly the same, then any one of these is as good as the other. When this happens, one can look at these items qualitatively, asking the question: are any of the items more valuable than the rest? And then retain the item or items that are deemed more valuable and remove the others. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 620 http://www.psychopen.eu/ Table 1 shows four groups of statistically very similar items. Group A has two items, Group B three, Group C four and Group D two. Four items were removed: one from each of Groups B and D, Items 4 and 1 respectively and two from Group C, Items 12 and 16. Table 1 Groups of Statistically Similar Items Wording Pt-measure correlation Item difficulty Item number Item group In tests, I worry that I won’t understand the vocabulary in the text.2A .74.57-0 Essays make me nervous.11 .74.58-0 I start to panic when I have to speak without preparation in the language class.8B .72.07-0 I always feel that the other students speak the foreign language better than I do.19 .73.10-0 I keep thinking that the other students are better at languages than I am.4 .74.11-0 I am afraid that my language teacher is ready to correct every mistake I make.16C .70.270 When we have an oral dialogue, I worry that I might not be able to understand what the other person is saying. 17 .72.260 It embarrasses me to volunteer answers in my language class.13 .70.220 I worry that I might not understand the instructions in a listening test.12 .68.200 I tremble when I know that I am going to be called on in language class.1D .63.550 I get anxious when the test has a listening component.7 .67.570 Note. Newly generated items are in bold. Semantic Justifications for the Removal of Items — ‘I tremble when I am going to be called on in language class’ (Item 1) was removed because this concept is an extension of ‘it embarrasses me to volunteer answers in my language class’ (Item 13). Furthermore the adjective ‘tremble’ was considered by the researchers to be so specific that even very anxious students may not choose it. Items 4 and 19 were found not only to be statistically almost identical, but their meaning was also parallel. Being nervous about listening tests in general (Item 7) could well include worrying about not understanding the instructions in listening tests (Item 12). It was therefore considered that the latter could be removed. The fear of having every mistake corrected by the teacher (Item 16) caused some undesired ambiguity. Some participants asked if they should answer hypothetically since their level of lin- guistic competence was higher than the difficulty level of the class. Others wished to respond theoretically as they had never experienced this. Calibrations of the 18-Item Scale The remaining 18 items were used for the second calibration. Initially, a person fit analysis revealed a few misfitting students. Seven of those (approximately 2.5% of the sample) were considered badly misfitting with infit and/or outfit values greater than 2.5. The responses of these students were considered to be distorting the measurement process and were removed. The remaining dataset, 18 items and 278 students, was used for the final calibration. Table 2 shows the item statistics, in difficulty order, from the most difficult to endorse to the easiest. The item dif- ficulties range from -1.41 to 1.40. This covers a range of 2.81 logits on the construct continuum. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 621 http://www.psychopen.eu/ Table 2 Item Statistics in Measure Order ROutfitInfitErrorDifficultyItem .6018 .171.091.100.401 .6821 .780.950.090.001 .687 .990.021.090.660 .6714 .091.291.090.560 .7422 .750.740.090.470 .7517 .770.840.080.340 .7213 .031.980.080.250 .755 .860.790.080.100 .738 .950.960.080.01-0 .7519 .890.930.080.05-0 .729 .071.081.080.13-0 .7620 .900.910.080.25-0 .773 .860.820.080.40-0 .7410 .111.061.080.45-0 .762 .950.870.080.55-0 .7711 .910.950.080.60-0 .716 .361.311.080.93-0 .7015 .511.501.080.41-1 Note. R (final column) shows the point measure correlations. Reliability — The Person Reliability was found to be .93, the Person Separation 3.57 and Strata 5.09. Also Item Reliability was .98. Despite the removal of four items from the scale, the reliability indices remained essentially unaffected, thus maintaining the high degree of reliability of the scale. Dimensionality of the Scale — Various analyses were performed for assessing whether the scale is unidimen- sional. First the point measure correlations (shown in Table 2) of all items were positive and highly significant, ranging from .60 to .77. PCA of standardised residuals — Table 3 shows the results of the PCA of the standardised residuals. Table 3 Standardised Residual Variance in Eigenvalue Units ModelledEmpirical Variance component %%Eigenvalue Raw variance explained by measures .362.961.229 Raw variance explained by persons .049.748.023 Raw variance explained by items .313.213.26 Raw unexplained variance .737.138.018 Unexplained variance in 1st factor .64.22 Total raw variance in observations .0100.0100.247 Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 622 http://www.psychopen.eu/ To assess the strength of the measurement dimension one looks at the variance explained by the measures. In this case it is 61.9% of the total variance in the data. The first factor had an eigenvalue of 2.2, the strength of less than three items. Also the first factor explains just 4.6% of the total variation and only 12.1% of the unexplained variance. Furthermore, the ratio of the variance explained by the measures to the variance explained by the 1st factor is 13.3:1. The evidence collected from the PCA of the standardized residuals suggests that no second di- mension is present in the data meaning that the scale is unidimensional. Item Fit — With the exception of Item 15, ‘I worry about my grade in English’, all other items were well below the infit and outfit cut-off values. Item 15 with item difficulty estimate of -1.41 had marginal values (infit = 1.50 and outfit = 1.51). Further investigation showed that this item was the easiest in the scale and its marginal fit was caused by very few unexpected high scores by person with low positions on the construct continuum. For example, persons with entries 101 and 146 had estimates of -2.22 and -3.06 and scored 5 and 4 on the item respectively. Both persons were positioned much lower than the item on the construct continuum and their probabilities of scoring 5 and 4 respectively were below 2%. Just these two unexpected responses raised the infit and outfit from 1.45 and 1.46 to 1.50 and 1.51 respectively. As this item was considered important semantically, since it was the only item asking about students’ overall assessment, and was the easiest in the scale thus widening the construct coverage, it was kept in the scale. Property of Invariance — The dataset was divided into two subgroups. One contained the 18 items and 87 male students and the other 18 items and 194 female students. Four students did not state their gender. The correlation between the two sets of item difficulty estimates, yielded from the two separate calibrations, was .967. Furthermore, Figure 1 shows the scatter plot of these item estimates together with a 95% confidence interval. Figure 1. Scatter plot of item estimates from two separate calibrations (Male - Female). Only one of the 18 items estimates fell slightly outside the confidence interval. This, together with the very high correlation, are a strong indication that invariance holds, meaning the construct indeed has the same meaning among the two groups of students. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 623 http://www.psychopen.eu/ Is Test Anxiety a Component of FLCA? — The aim of creating a scale is to obtain estimates of persons’ level of anxiety or their positions on the FLCA continuum. In order to investigate whether test anxiety is a component of FLCA the items of the scale were divided into two groups. The first group contained the seven test anxiety items (Items 2, 6, 7, 10, 14, 18 and 20) and the second the remaining 11 items (Items 3, 5, 8, 9, 11, 13, 15, 17, 19, 21 and 22). Person measures were estimated for the two groups of items separately. The correlation of the two sets of measures was highly significant at .876. Perhaps more importantly, significant differences between measures were found only in 12 cases (four t-values below -2.0 and eight t-values above 2.0) and this represents 4.3% of the sample (12 out of 278 students). Since this percentage is below 5% one can infer that the person estimates from the two separate item groups are statistically equivalent and that the two item groups indeed measure the same construct. Figure 2 shows the distribution of the t-values which approximates very well a standard normal (Mean = 0 and standard deviation = 1) with a mean of 0.06 and standard deviation 1.053. Figure 2. Distribution of t-values (1). FLCAS Items vs New Items — Finally the items were once again divided into two groups. The first group contained the seven FLCAS items (6, 8, 13, 18, 19, 21, 22) and the second the 11 new items (2, 3, 5, 7, 9, 10, 11, 14, 15, 17, 20). Person measures were estimated for these groups of items separately. The correlation of the two measures was highly significant at .869. Significant differences between person measures were found in 11 cases (four t- values below -2.0 and seven t-values above 2.0) which represents 4.0% of the sample. This method also gave statistically equivalent person measures supporting again the unidimensional structure of the scale. Figure 3 shows the distribution of the t-values with a mean of 0.008 and a standard deviation of 1.026. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 624 http://www.psychopen.eu/ Figure 3. Distribution of t-values (2). Category Functioning — Table 4 shows the Rasch diagnostics for the investigation of the functioning of the Likert scale categories. Table 4 Rasch Diagnostics for the Category Functioning ThresholdsOutfitInfitAverage measureObserved countLabelCategory None1608Never1 .920.880.62-2 1297Rarely2 .73-1.930.031.29-1 1000Sometimes3 .61-0.031.990.35-0 539Frequently4 .670.071.091.430 288Always5 .661.161.141.641 All categories had high observed frequencies, the average person measure corresponding to the categories monotonically increases, infit and outfit values are very close to the expected value of 1.00 and the thresholds also monotonically increase. Finally, the ranges between successive thresholds (1.12, 1.28 and 0.99) are large enough to show distinction between categories. The satisfactory ranges between successive thresholds can be seen in Figure 4 which shows the category probabilities. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 625 http://www.psychopen.eu/ Figure 4. Category probabilities. Each probability curve (corresponding to each of the five categories) peaks for a significant range along the con- tinuum and this shows that each category is the most probable for the corresponding range. Since these basic criteria suggested by Linacre (2002) are met, the 5-point Likert scale can be considered optimal. Item Targeting — Figure 5 shows the person item-map. The items are well targeted for persons of just below the persons' mean FLCA to about one and two thirds standard deviations above the mean (from -1.41 to 1.40 logits) covering a range of 2.81 logits. The person measures range from -4.03 to 6.03. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 626 http://www.psychopen.eu/ Figure 5. Person-item map (Each '#' represents two persons). Comparisons Between the FLCAI and the FLCAS Table 5 shows comparisons between the main psychometric features of the new scale presented in this study, the Foreign Language Classroom Anxiety Inventory (FLCAI), and the equivalent FLCAS features from the 2013 study by Panayides and Walker (2013). Both instruments were administered to similar-sized samples from the same population (Cypriot high school students of age 16-18) and data were analysed through Rasch measurement thus facilitating comparisons. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 627 http://www.psychopen.eu/ Table 5 Comparisons between the FLCAI and FLCAS FLCAIFLCASDescription 18 items33 itemsScale length 285304Sample size High school studentsHigh school studentsSample composition 16 – 1816 – 18Age of participants Reliability Person Reliability .93.93 Person Separation .573.643 PCA of standardised residuals Variance by measures .9%61.6%51 First factor extracted Eigenvalue .22.52 % of total variance .6%4.4%4 Ratio (Measures:1st factor) .3:113.9:111 2.81 logits1.44 logitsConstruct coverage OptimalMarginally optimalCategory functioning The FLCAI (18 items) is shorter than the FLCAS (33 items) by 45%. Nevertheless, Person Reliability is the same and Person Separation is negligibly lower. The strength of the measurement dimension is arguably better in the FLCAI as the variance explained by the measures is 61.9% of the total variance as opposed to 51.6% in the FLCAS. Furthermore, the eigenvalue of the first factor extracted by PCA of the standardised residuals is 2.2 in the FLCAI and slightly higher at 2.5 in the FLCAS. Also, the ratio of variance explained by the measures to the variance explained by the first factor is higher for the FLCAI (13.3:1) than for the FLCAS (11.9:1). The above evidence shows that core construct measured by the FLCAI has more strength in the data than that measured by the FLCAS, thus there is more convincing evidence of unidimensionality in the new scale. Two further points are noteworthy in the comparisons made in Table 5. First there is a wider construct coverage for the 18 items of FLCAI (2.81 logits) than for the 33 items of FLCAS (1.44 logits). Second, the five categories of the Likert scale in the FLCAI function better than those in the FLCAS. Conclusions The purpose of this study was to design an appropriate instrument for measuring foreign language classroom anxiety in high schools. Prerequisites were it being psychometrically successful, that is, having a high degree of validity and reliability and of course being an appropriate instrument for the intended population. The FLCAS (Horwitz et al., 1986) has been efficaciously used for almost 3 decades however validity is not time and location independent. A study of its psychometric properties in senior high schools in Cyprus (Panayides & Walker, 2013) drew attention to the need for modifications. Such revisions included an examination of its length, which has been suggested to have inflated reliability (Panayides & Walker, 2013). Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 628 http://www.psychopen.eu/ The new scale was created in a multi-step fashion. It began with an examination of the original scale. This was followed by the creation of an enlarged item pool which was generated by an extensive examination of the literature, informal discussions with language teachers and students, three interviews with highly experienced English language teachers, as well as focus group discussions with language teachers. Concerns raised in the literature, such as those of Sparks and Patton (2013) that the FLCAS ‘is likely to be measuring individual differences in students’ language learning skills and / or self-perceptions about their language skills rather than anxiety unique to L2 learning’ (p. 870), were reflected upon in the selection of items. The enlarged item pool (39 items) was administered to 212 16 – 18 year old students in early 2013. This was then shortened again through a multi-step process which included Rasch analyses and thorough semantic examination. This resulted in a 22 item scale which was admin- istered to 285 16 – 18 year old students in October 2013. Finally fine tuning was performed in order to create the most appropriate, efficient and functional scale whilst maintaining reliability. This included careful reanalysis of both the statistic and the semantic properties of the re- maining items. Four further items were deemed unnecessary for statistical and semantic reasons. The resulting 18-item form of the scale was used for the final analyses. Rasch measurement was judged to be the most appropriate method of yielding answers to the following research questions: Is the New 5-Point Rating Scale Psychometrically Optimal? The effectiveness of the scale was investigated through the Rasch diagnostics suggested by Linacre (2002) and Bond and Fox (2001, 2007). All categories had high observed frequencies and thus an adequate number for stable estimates. Both the average person measure corresponding to each category and the threshold estimates monotonically increased and the ranges between successive thresholds were satisfactory rendering each category the most probable choice in an adequate range. Finally the outfit values of the categories were all close to the expected value of 1.00. All the evidence collected supports the hypothesis that categories function satisfactorily and thus five can be considered the optimal number of categories for the FLCAI. Does the New Scale Provide Reliable Person Measures? All the indices calculated reveal a high degree of reliability. The Person Reliability was .93 indicating that the scale can distinguish individuals very well. The Person Separation was 3.57 and Strata 5.09 confirming the good sep- aration of persons along the FLCA continuum. Finally, the Item Reliability was .98 indicating that the items are well discriminated by the sample of respondents and this is a necessary condition for effective measurement (Wright & Masters, 1982). Do the Scale Items Define a Single Construct? A variety of evidence was collected to support the unidimensional structure of the scale. First, all point measure correlations were positive and highly significant (.60 to .77). Second, PCA of the standardised residuals revealed a strong measurement dimension explaining approximately 62% of the variation in the data. Furthermore, there was no significant second dimension present in the data (its eigenvalue was only 2.2 explaining only 4.6% of the total variation). Also the measurement dimension was 13.3 times stronger than that second dimension. Third, all items fit the Rasch model well. Only one item (Item 15) was marginal but it was not removed from the scale for two reasons; its misfit was caused by very few students responding unexpectedly to it and it was the easiest item in the scale, widening the coverage of the FLCA construct. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 629 http://www.psychopen.eu/ Fourth, two separate calibrations were performed on the 18 items from two distinct groups, male and female stu- dents. The item difficulties obtained from the two groups were statistically equivalent, showing that the property of invariance holds for this scale. That is, the construct has the same meaning across the two different groups of people. Fifth, in agreement with Panayides and Walker (2013), the researchers showed that test anxiety is indeed a component of FLCA by separating the items into two groups, one containing the seven test anxiety items and the other the 11 remaining items. Person estimates were obtained from the two groups and these estimates were shown to be statistically equivalent, strengthening the belief that this scale indeed measures one construct and is unidimensional. Finally, it was shown that the seven FLCAS items included in FLCAI measure the same construct as the 11 newly created items. Since the FLCAS is a well-established scale which has been validated many times, this result not only strengthens the hypothesis that FLCAI is unidimensional, but also that the dimension measured by the inventory is indeed FLCA. How Does the New Scale Compare With the Original FLCAS? Panayides and Walker’s study (2013) facilitated direct comparisons between the FLCAS and FLCAI as data in both studies were analysed through the use of Rasch measurement with a time lapse of just one year. Panayides and Walker (2013) suggested that the very high reliability of the FLCAS was caused by the length of the scale, the use of parallel items and the narrow construct coverage. This study confirmed this since the new scale has just 55% the length of the original scale, containing 18 items rather than the 33, without losing any of its psycho- metric properties. In fact, the FLCAI is psychometrically superior to the FLCAS for the following reasons. First its degree of reliability is not lowered by the downsizing of the scale. Thus, a new shorter scale has been created without any loss in the degree of reliability from the original. The shortening of the instrument was beneficial as teachers and 'most researchers are constrained by another real-world factor: survey length' (Matthews, Kath, & Barnes-Farrell, 2010, p. 76). Second, it covers a much wider range of FLCA (2.81 logits as opposed to 1.44 logits). Third the categories function better and fourth, the unidimensional structure of the scale is more convincing with a much higher strength in the main dimension measured by it. Despite achieving a wider spread of item difficulties and wider construct coverage than that in the FLCAS, the items of the FLCAI are well targeted for students with higher levels of FLCA. The mean item difficulty is, as always in Rasch analyses, zero whereas the mean person measure is well below at -1.40 logits. They suggest that the targeting of the items of the FLCAS can be explained by the fact that senior high school students in Cyprus have been studying English since early primary school, and they are therefore familiar with the language. Consequently their anxiety levels are well below that of high school or university beginner level students of English, who have been the participant population in many other studies in the literature. This study aligns itself with Panayides and Walker’s (2013) stance, and suggests that should the FLCAI be used in a different setting, with less experienced students of English, or another foreign language, the mean item difficulty would probably be much closer to the mean person measure. Limitations The results reported here are very convincing and the researchers believe that the FLCAI will also prove worthy and appropriate for other EFL and foreign language learner populations where the students have been studying Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 630 http://www.psychopen.eu/ English for more than 8-10 years. Notwithstanding, the validity of the scale cannot be taken for granted for any other population, just as it could not for the FLCAS. The FLCAI is a highly appropriate instrument for measuring FLCA among the Cypriot high school population. Finally, even though every effort was made to provide an accurate equivalent for these items, slight semantic differences cannot be ruled out between the English and Greek versions of the FLCAS and the FLCAI. Concluding Remarks For the creation of the FLCAI Rasch analyses were complemented by semantic analyses. This was deemed ne- cessary as this inventory is intended not only for use in further educational research but also, and perhaps more importantly, it is intended to be used by language teachers at the beginning of each academic year to assess their students’ anxiety level. It was therefore of paramount importance to design an instrument that is as time economic as possible during both administration and analysis, without compromising validity, reliability or usefulness. It is also recommended that when teachers administer this instrument in the future they not only consider their students’ total anxiety score but also what they find most anxiety provoking. This will lead to better student support. The FLCAI was shown here to have a high degree of validity and reliability. However, further studies of the validity and appropriateness of the scale in diverse settings such as among university students and different countries are encouraged. Given the thoroughness of the methodology used, and the strengths of Rasch measurement, it is believed that the FLCAI will prove to be a valuable instrument maintaining its psychometric properties in other settings. Funding The authors have no funding to report. Competing Interests The authors have declared that no competing interests exist. Acknowledgments The authors have no support to report. References Aida, Y. (1994). Examination of Horwitz, Horwitz, and Cope’s construct of foreign language anxiety: The case of students of Japanese. The Modern Language Journal, 78(2), 155-168. doi:10.1111/j.1540-4781.1994.tb02026.x American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME] (1999). Standards for educational and psychological testing. Washington, DC: AERA. Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573. doi:10.1007/BF02293814 Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the social sciences. Mahwah, NJ: Lawrence Erlbaum. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 631 http://dx.doi.org/10.1111/j.1540-4781.1994.tb02026.x http://dx.doi.org/10.1007/BF02293814 http://www.psychopen.eu/ Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the social sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum. Chan, D., & Wu, G. (2004). A study of foreign language anxiety of EFL elementary school students in Taipei County. Journal of National Taipei Teachers College, 17(2), 287-320. Cheng, Y.-s., Horwitz, E. K., & Schallert, D. L. (1999). Language anxiety: Differentiating writing and speaking components. Language Learning, 49(3), 417-446. doi:10.1111/0023-8333.00095 Curtis, D. D. (2004). Person misfit in attitude surveys: Influences, impacts and implications. International Education Journal, 5(2), 125-144. Gehlbach, H., & Brinkworth, M. E. (2011). Measure twice, cut down error: A process for enhancing the validity of survey scales. Review of General Psychology, 15(4), 380-387. doi:10.1037/a0025704 Horwitz, E. K. (1986). Preliminary evidence for the reliability and validity of a foreign language anxiety scale. TESOL Quarterly, 20(3), 559-562. doi:10.2307/3586302 Horwitz, E. K. (2000). It ain’t over ’til it’s over: On foreign language anxiety, first language deficits, and the confounding of variables. The Modern Language Journal, 84, 256-259. doi:10.1111/0026-7902.00067 Horwitz, E. K., Horwitz, M. B., & Cope, J. (1986). Foreign language classroom anxiety. The Modern Language Journal, 70(2), 125-132. doi:10.1111/j.1540-4781.1986.tb05256.x Horwitz, E. K., & Young, D. J. (1991). Language anxiety: From theory and research to classroom implications. Englewood Cliffs, NJ: Prentice Hall. Krashen, S. D. (1981). Second language acquisition and second language learning. Retrieved from http://www.sdkrashen.com/content/books/sl_acquisition_and_learning.pdf Levine, G. S. (2003). Student and instructor beliefs and attitudes about target language use, first language use, and anxiety: Report of a questionnaire study. The Modern Language Journal, 87(3), 343-364. doi:10.1111/1540-4781.00194 Linacre, J. M. (1997). Guidelines for rating scales and Andrich thresholds (MESA Research Note #2). Retrieved from http://www.rasch.org.rn2.htm Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2(3), 266-283. Linacre, J. M. (2002). Understanding Rasch measurement: Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106. Linacre, J. M. (2005). WINSTEPS Rasch measurement computer program (Version 3.65) [Computer software]. Chicago, IL: Winsteps.com. Liu, H.-j. (2012). Understanding EFL undergraduate anxiety in relation to motivation, autonomy, and language proficiency. Electronic Journal of Foreign Language Teaching, 9(1), 123-139. Liu, M., & Huang, W. (2011). An exploration of foreign language anxiety and English learning motivation. Education Research International, 2011, Article 493167. doi:10.1155/2011/493167 Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 632 http://dx.doi.org/10.1111/0023-8333.00095 http://dx.doi.org/10.1037/a0025704 http://dx.doi.org/10.2307/3586302 http://dx.doi.org/10.1111/0026-7902.00067 http://dx.doi.org/10.1111/j.1540-4781.1986.tb05256.x http://www.sdkrashen.com/content/books/sl_acquisition_and_learning.pdf http://dx.doi.org/10.1111/1540-4781.00194 http://www.rasch.org.rn2.htm http://dx.doi.org/10.1155/2011/493167 http://www.psychopen.eu/ Luyt, R. (2012). A framework for mixing methods in quantitative measurement development, validation, and revision: A case study. Journal of Mixed Methods Research, 6(4), 294-316. doi:10.1177/1558689811427912 MacIntyre, P. D., & Gardner, R. C. (1989). Anxiety and second-language learning: Toward a theoretical clarification. Language Learning, 39(2), 251-275. doi:10.1111/j.1467-1770.1989.tb00423.x MacIntyre, P. D., & Gardner, R. C. (1991). Investigating language class anxiety using the focused essay technique. The Modern Language Journal, 75(3), 296-304. doi:10.1111/j.1540-4781.1991.tb05358.x MacIntyre, P. D., & Gregersen, T. (2012). Affect: The role of language anxiety and other emotions in language learning. In S. Mercer, S. Ryan, & M. Williams (Eds.), Psychology for language learning: Insights from research, theory and practice (pp. 103-118). New York, NY: Palgrave Macmillan. Mahmood, A., & Iqbal, S. (2010). Difference of student anxiety level towards English as a foreign language subject and their academic achievement. International Journal of Academic Research, 2(6, Pt. 1), 199-203. Matthews, R. A., Kath, L. M., & Barnes-Farrell, J. L. (2010). A short, valid, predictive measure of work–family conflict: Item selection and scale validation. Journal of Occupational Health Psychology, 15(1), 75-90. doi:10.1037/a0017443 Matsuda, S., & Gobel, P. (2004). Anxiety and predictors of performance in the foreign language classroom. System, 32(1), 21-36. doi:10.1016/j.system.2003.08.002 Messick, S. (1993). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-104), Phoenix, AZ: American Council on Education and The Oryx Press. Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill. Panayides, P. (2009). Exploring the reasons for aberrant response patterns in classroom maths tests (Doctoral Dissertation, Durham University, Durham, United Kingdom). Retrieved from http://etheses.dur.ac.uk/2042/ Panayides, P., Robinson, C., & Tymms, P. (2010). The assessment revolution that has passed England by: Rasch measurement. British Educational Research Journal, 36(4), 611-626. doi:10.1080/01411920903018182 Panayides, P., & Walker, M. J. (2013). Evaluating the psychometric properties of the Foreign Language Classroom Anxiety Scale for Cypriot senior high school EFL students: The Rasch measurement approach. Europe's Journal of Psychology, 9(3), 493-516. doi:10.5964/ejop.v9i3.611 Park, G., & French, B. (2013). Gender differences in the Foreign Language Classroom Anxiety Scale. System, 41(2), 462-471. doi:10.1016/j.system.2013.04.001 Pasquale, M. (2011). Folk beliefs about second language learning and teaching. AILA Review, 24, 88-99. doi:10.1075/aila.24.07pas Ray, J. J. (1988). Semantic overlap between scale items may be a good thing: Reply to Smedslund. Scandinavian Journal of Psychology, 29, 145-147. doi:10.1111/j.1467-9450.1988.tb00784.x Robinson, P. (Ed.). (2002). Individual differences and instructed language learning. Amsterdam, The Netherlands: John Benjamins. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 633 http://dx.doi.org/10.1177/1558689811427912 http://dx.doi.org/10.1111/j.1467-1770.1989.tb00423.x http://dx.doi.org/10.1111/j.1540-4781.1991.tb05358.x http://dx.doi.org/10.1037/a0017443 http://dx.doi.org/10.1016/j.system.2003.08.002 http://etheses.dur.ac.uk/2042/ http://dx.doi.org/10.1080/01411920903018182 http://dx.doi.org/10.5964/ejop.v9i3.611 http://dx.doi.org/10.1016/j.system.2013.04.001 http://dx.doi.org/10.1075/aila.24.07pas http://dx.doi.org/10.1111/j.1467-9450.1988.tb00784.x http://www.psychopen.eu/ Rodriguez, K. L., Schwartz, J. L., Lahman, M. K. E., & Geist, M. R. (2011). Culturally responsive focus groups: Reframing the research experience to focus on participants. International Journal of Qualitative Methods, 10(4), 400-417. Schumacker, R. E., & Linacre, J. M. (1996). Factor analysis and Rasch analysis. Rasch Measurement Transactions, 9(4), 470. Retrieved from http://www.rasch.org/rmt/rmt94k.htm Smith, E. V., Jr. (2002). Detecting and evaluating the impact of multidimensionality using item fit statistics and principal components analysis of residuals. Journal of Applied Measurement, 3(2), 205-231. Sparks, R. L., & Ganschow, L. (1991). Foreign language learning differences: Affective or native language aptitude differences? The Modern Language Journal, 75(1), 3-16. doi:10.1111/j.1540-4781.1991.tb01076.x Sparks, R. L., & Patton, J. (2013). Relationship of L1 skills and L2 aptitude to L2 anxiety on the Foreign Language Classroom Anxiety Scale. Language Learning, 63(4), 870-895. doi:10.1111/lang.12025 Spielberger, C. D. (1983). Manual for the stait-trait anxiety inventory (STAI-Form Y). Palo Alto, CA: Consulting Psychologists Press. Spielmann, G., & Radnofsky, M. L. (2001). Learning language under tension: New directions from a qualitative study. The Modern Language Journal, 85(2), 259-278. doi:10.1111/0026-7902.00108 Tóth, Z. (2008). A foreign language anxiety scale for Hungarian learners of English. Working Papers in Language Pedagogy, 2, 55-77. Wright, B. D., Linacre, J. M., Gustafson, J.-E., & Martin-Lof, P. (1994). Reasonable mean square fit values. Rasch Measurement Transactions, 8(3), 370. Retrieved from http://www.rasch.org/rmt/rmt83b.htm Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press. Yan, J. X., & Horwitz, E. K. (2008). Learners’ perceptions of how anxiety interacts with personal and instructional factors to influence their achievement in English: A qualitative analysis of EFL learners in China. Language Learning, 58(1), 151-183. doi:10.1111/j.1467-9922.2007.00437.x Yun, J., & Ulrich, D. A. (2002). Estimating measurement validity: A tutorial. Adapted Physical Activity Quarterly, 19, 32-47. Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 634 http://www.rasch.org/rmt/rmt94k.htm http://dx.doi.org/10.1111/j.1540-4781.1991.tb01076.x http://dx.doi.org/10.1111/lang.12025 http://dx.doi.org/10.1111/0026-7902.00108 http://www.rasch.org/rmt/rmt83b.htm http://dx.doi.org/10.1111/j.1467-9922.2007.00437.x http://www.psychopen.eu/ Appendix Appendix A The data collected from the discussions with students and teachers, and the three interviews was subjected manually to theme analysis. The complete list of ideas and their frequencies can be seen below. These had a direct impact on the designing of items for inclusion in the extended item pool. The items in bold feature in the final version of the scale. 19Tests 10Oral participation 7Grades 5Being mocked if they make serious mistakes 5Peer pressure / other students' reactions 4Unclear instructions for listening tests 4Essays / creative writing 4Fear of making mistakes / reproduce incorrectly 3Unclear instructions for written tests 3Oral test 3Homework – unfamiliar tasks, too much 2Worry that they are not understood correctly 2Teacher's reaction if they make a mistake 2Teacher criticism 2Talking about mistakes 2Unknown vocabulary 2Unfamiliar tasks in tests 1Being interrupted whilst giving an answer 1Threats from the teacher related to their grades 1Working individually 1Lack of self-confidence / sense of trust 1Difficult concepts in the curriculum 1Not knowing what is required 1Listening tests 1Amount of studying required for the lesson 1When the teacher is unfriendly 1When the teacher does not show understanding 1Teacher not using exercises which could relax them in the lesson 1Not being given encouragement 1Fear of expressing their opinions 1Teacher centred teaching 1Technology 1Sharing ideas 1Their pronunciation 1Fear that they haven't understood the teacher correctly 1Not being well-prepared 1Too much grammatical theory 1Written exercises in the classroom 1Double period 1Lack of knowledge about the teacher Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Walker & Panayides 635 http://www.psychopen.eu/ Appendix B: The Foreign Language Classroom Anxiety Inventory The numbering on this inventory is that used for the 22-item scale. This is so as to avoid any potential confusion in references made to items throughout the study. New items are in bold. Instructions: Below you will find a list of statements related to the feelings a student may experience during foreign language lessons. Please respond with √ or X in the box which best reflects how often these statement apply to you. A lw ay s O ft en S om et im es R ar el y N ev er 2. In tests, I worry that I won’t understand the vocabulary in the texts. 3. I get nervous when the tasks are unfamiliar to me. 5. I worry that my English teacher might ask me something that I won’t understand. 6. I am usually at ease during tests in my language class. 7. I get anxious when the test has a listening component. 8. I start to panic when I have to speak without preparation in language class. 9. I am afraid I may mispronounce a word in front of the class. 10. Thoughts of doing poorly interfere with my concentration in tests. 11. Essays make me nervous. 13. It embarrasses me to volunteer answers in my language class. 14. During important tests I am so tense that it upsets my stomach. 15. I worry about my grade. 17. When we have an oral dialogue, I worry that I might not be able to understand what the other person is saying. 18. The more I study for a language test, the more confused I get. 19. I always feel that the other students speak the foreign language better than I do. 20. During tests I find myself worrying about the consequences of failing. 21. Language class moves so quickly I worry about getting left behind. 22. I get nervous when I don't understand every word the language teacher says. About the Authors Miranda Jane Walker holds a BA in Hispanic Studies and Modern Greek (King’s College, University of London) a BA in English Language and Literature (University of Cyprus) and an MA in Education Leadership and Management (Open University, UK). She is currently an EdD candidate at the Open University, UK. She teaches Spanish in Secondary Education in Limassol, Cyprus. Her research interests include teacher and student motivation and anxiety in the foreign language classroom as well as educational leadership and management. Panayiotis Panayides holds a BSc in Statistics with Mathematics (Queen Mary College, University of London), an MSc in Educational Testing (Middlesex University, UK) and a PhD in Educational Measurement (University of Durham, UK). He is an assistant headmaster and head of the Mathematics department in Secondary Education in Limassol, Cyprus. His research interests include educational and psychological measurement as well as research into Mathematics education. PsychOpen is a publishing service by Leibniz Institute for Psychology Information (ZPID), Trier, Germany. www.zpid.de/en Europe's Journal of Psychology 2014, Vol. 10(4), 613–636 doi:10.5964/ejop.v10i4.782 Rasch Measurement in Language Research: The FLCAI 636 http://www.psychopen.eu/ http://www.zpid.de/en Rasch Measurement in Language Research: The FLCAI (Introduction) Literature Review Validity of Scales Rasch Measurement Assessing Unidimensionality, the Rasch Approach This Study Method The Creation of the New Scale Results The First Calibration of the 22-Item Scale Calibrations of the 18-Item Scale Comparisons Between the FLCAI and the FLCAS Conclusions Is the New 5-Point Rating Scale Psychometrically Optimal? Does the New Scale Provide Reliable Person Measures? Do the Scale Items Define a Single Construct? How Does the New Scale Compare With the Original FLCAS? Limitations Concluding Remarks (Additional Information) Funding Competing Interests Acknowledgments References Appendix Appendix A Appendix B: The Foreign Language Classroom Anxiety Inventory About the Authors