Mathematical Problems of Computer Science  45, 35--43, 2016. 

 
Analysis of Experiments of a New Approach for Test Quality 
Evaluation  

Mariam E. Haroutunian, Varazdat K. Avetisyan 
 

Institute for Informatics and Automation Problems of NAS RA 
e-mail: armar@ipia.sci.am,  avetvarazdat@gmail.com 

 
Abstract 
 

In the previous paper [1] we suggested a new model of test quality 
evaluation based on information measures such as Shannon entropy and 
average mutual information. To establish the practical bounds of these 
measures and the required number of examinees, some experiments were 
conducted. In this paper the analysis of these experiments is provided. 

Keywords: Test quality, Shannon entropy, Average mutual 
information, Classical test theory, Item response theory. 

 
1. Introduction  
 
When constructing tests, test developers are mainly concerned with the quality of test items and 
with how examinees respond to them. Test theories and related models provide a frame of 
reference for doing test design work or solving other practical problems. A good test model 
might specify the precise relationships among test items and ability scores, so that careful design 
work can be done to produce the desired test score distribution and errors of an acceptable 
size. A good test theory or model can also handle measurement errors by helping to 
understand the role that measurement errors play in estimating an examinee's ability and 
correlations between variables and true scores or ability scores. There are two currently popular 
statistical frameworks to address test data analysis and test quality evaluation: Classical Test 
Theory (CTT) [2] and Item Response Theory (IRT) [3]. CTT is a theory about test scores that 
introduces three concepts - test score, true score and error score. In the CTT, the notion of ability 
is expressed by the true score, which is defined as "the expected value of observed performance 
on the test of interest." An examinee's ability is defined only in terms of a particular test. When 
the test is "hard," the examinee will appear to have low ability; when the test is "easy," the 
examinee will appear to have higher ability. CTT was the dominant statistical approach for 
testing data until Lord and Novick (1968) placed it in context with several other statistical 
theories of mental test scores, notably IRT. IRT is a model-based statistical theory of measurement 
in which the performance of an examinee on a test item can be predicted (or explained) by a set 
of factors called traits, latent traits, or abilities; and the relationship between the examinees' item 
performance and the set of traits underlying item performance can be described by a 
monotonically increasing function called an item characteristic function or item characteristic 
curve (ICC). Each of these approaches has its advantages and disadvantages [4]. For example, in 
CTT item parameters depend on the examinee sample from which they are obtained, whereas 
in IRT these parameters are independent of the examinee group. On the other hand, CTT 
requires smaller examinee samples for analysis, and its methods are simpler than those 
of IRT. Besides the existing CTT and IRT models, we have developed a new approach [1] based 
on information measures such as Shannon entropy and average mutual information. 

The main idea of the new approach is the following. Suppose that the test consists of $N$ 
items. Each item can be considered as a binary random variable (RV) $X_1, X_2, \ldots, X_N$ 
that takes the value 1 with probability $p_i$ (correct answer) and 0 with probability $1 - p_i$ (incorrect answer): 

$$
X_i = \begin{cases} 1 & \text{with probability } p_i, \\ 0 & \text{with probability } 1 - p_i, \end{cases} \qquad i = \overline{1, N}.
$$

 
We consider the Shannon entropy of RV $X_i$ 

$$
H(X_i) = -\sum_{x_i} p(x_i) \log p(x_i)
$$

 
and the average mutual information of two items: 

 
$$
I(X_i \wedge X_j) = \sum_{x_i, x_j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} = H(X_i) - H(X_i \mid X_j) = H(X_j) - H(X_j \mid X_i).
$$
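As an illustration, both measures can be estimated from a table of scored responses. The following is a minimal Python sketch, assuming base-2 logarithms (so that the entropy of a binary item is at most 1) and relative frequencies as probability estimates. The function names and the toy response data are hypothetical and are not taken from the system described in [5].

```python
from collections import Counter
from math import log2


def entropy(column):
    """Empirical Shannon entropy (in bits) of one item's 0/1 scores."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


def mutual_information(col_i, col_j):
    """Empirical average mutual information I(X_i ^ X_j) in bits."""
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))
    marg_i = Counter(col_i)
    marg_j = Counter(col_j)
    total = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        total += p_ab * log2(p_ab / ((marg_i[a] / n) * (marg_j[b] / n)))
    return total


# Hypothetical scores of 8 examinees on three items (1 = correct).
x1 = [1, 1, 1, 1, 0, 0, 0, 0]
x2 = [1, 1, 1, 1, 0, 0, 0, 0]   # repeats x1
x3 = [1, 1, 0, 0, 1, 1, 0, 0]   # independent of x1 in this sample

print(entropy(x1))                  # entropy of an item answered correctly by half
print(mutual_information(x1, x2))   # equals H(x1) for identical items
print(mutual_information(x1, x3))   # zero for independent items
```

In this toy data the item pair (x1, x2) reaches the largest possible mutual information, while the pair (x1, x3) has none.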
 

Our test quality evaluation model consists of the following methods: 
Method 1. If the value of $H(X_i)$ is close to 0, the test item is bad: it is either 
very easy or very difficult. If the value of $H(X_i)$ is close to 1, the test item is good. 

Method 2. If the value of $I(X_i \wedge X_j)$ is close to 0, the test items $X_i$ and $X_j$ are 
independent. If the value is close to $\min\{H(X_i), H(X_j)\}$, items $X_i$ and $X_j$ repeat each 
other. 

Method 3. If the value of the conditional entropy $H(X_j \mid X_i)$ is close to $H(X_j)$, then $X_i$ 
and $X_j$ are independent. 
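The reference values used in Methods 2 and 3 follow from standard properties of entropy and mutual information. The relations below are textbook identities rather than results of this paper:

```latex
0 \le I(X_i \wedge X_j) \le \min\{H(X_i), H(X_j)\},
\qquad
H(X_j \mid X_i) = H(X_j) - I(X_i \wedge X_j).
```

The lower bound is attained exactly when $X_i$ and $X_j$ are independent, in which case $H(X_j \mid X_i) = H(X_j)$. The upper bound is attained when the item with the smaller entropy is completely determined by the other item.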

 
However, several questions remain open. 
1. How precisely does our model evaluate the quality of test items, and how comparable is it 
to the CTT and IRT estimation methods? 
2. What are the permissible limits of $H(X_i)$ and $I(X_i \wedge X_j)$? 
3. What number of examinees is sufficient for a precise evaluation? 
The answers to these questions can be found experimentally. 

2. Description of Experiments 
 

The results of the school final exams in Armenian Language and Literature held in 2008 were 
selected for testing. The results were provided in encrypted form by the Center for assessment 
and testing. Four test results were chosen for analysis. Each test consists of 80 items, and the 
number of schoolchildren who participated in the examination is 2000. The 
first 50 items $X_i$ are named $A_1, A_2, \ldots, A_{50}$ and the last 30 items 
are named $B_1, B_2, \ldots, B_{30}$. The analysis used the test quality evaluation system that we developed [5]. 

For each item of the four tests, the values of $H(X_i)$, the CTT difficulty index [2] and the IRT $b$ parameter [3] 
were calculated, the comparability of these parameters was examined, and the 
permissible limits of $H(X_i)$ were defined. Difficulty is defined in both CTT and IRT. 

In CTT the difficulty index $P$ of an item that is scored as right (1 point) or 
wrong (0 points), such as a multiple-choice or true/false item, is the proportion of examinees 
who answer the item correctly. It ranges from 0 to 1. Item difficulty for a polytomous item (an item scored in more 
than two ordinal categories) is simply the item mean or average item score. It ranges between 
the minimum and the maximum possible item scores. 

In IRT the difficulty index b (IRT b parameter) is on the same metric as the 
proficiencies or traits. This metric is arbitrary, but often it is anchored so that the proficiency 
distribution in a designated group has a mean of 0 and standard deviation of 1. The item 
difficulty identifies the proficiency at which about 50% of the examinees are expected to 
answer the item correctly.  
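To illustrate the role of $b$, the sketch below evaluates a two-parameter logistic item characteristic curve. The 2PL form and the discrimination parameter `a` are assumptions made for illustration only, since the text does not state which IRT model was fitted. The function name `icc_2pl` is hypothetical.

```python
from math import exp


def icc_2pl(theta, b, a=1.0):
    """Probability of a correct answer under a 2PL logistic model.

    theta: examinee proficiency, b: item difficulty,
    a: item discrimination (hypothetical default of 1.0).
    """
    return 1.0 / (1.0 + exp(-a * (theta - b)))


# When proficiency equals the item difficulty, the success probability is 0.5.
print(icc_2pl(theta=0.7, b=0.7))
# A more proficient examinee is more likely to answer the same item correctly.
print(icc_2pl(theta=1.7, b=0.7) > icc_2pl(theta=-0.3, b=0.7))
```

Under this assumed model, a larger $b$ shifts the curve to the right, so a higher proficiency is needed to reach a 50% chance of success.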

To observe how the values of $H(X_i)$ and $I(X_i \wedge X_j)$ depend on the examinee sample 
size and to determine a sufficient number of examinees, five experiments were carried out for 
each test. In each experiment the same test was analyzed on a random sample of the results of 500, 300, 
200, 100 or 50 participants. 
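The subsampling procedure can be sketched as follows. This is a minimal sketch on synthetic data, assuming that each sample is an independent random draw without replacement. The original exam results are not reproduced here, and all names (`item_entropy`, `entropies`, `max_entropy_difference`, the simulated `responses`) are hypothetical.

```python
import random
from math import log2


def item_entropy(scores):
    """Binary Shannon entropy (bits) of one item's 0/1 scores."""
    p = sum(scores) / len(scores)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def entropies(responses):
    """Entropy of every item; responses is a list of per-examinee score lists."""
    n_items = len(responses[0])
    return [item_entropy([row[k] for row in responses]) for k in range(n_items)]


def max_entropy_difference(responses, size_a, size_b, rng):
    """Largest per-item gap in H(X_i) between two random subsamples."""
    h_a = entropies(rng.sample(responses, size_a))
    h_b = entropies(rng.sample(responses, size_b))
    return max(abs(x - y) for x, y in zip(h_a, h_b))


# Synthetic stand-in for the exam data: 2000 examinees, 80 items,
# each item answered correctly with its own fixed probability.
rng = random.Random(0)
p_correct = [rng.uniform(0.3, 0.95) for _ in range(80)]
responses = [[int(rng.random() < p) for p in p_correct] for _ in range(2000)]

# Compare a sample of 500 with each smaller sample size.
for size in (300, 200, 100, 50):
    print(size, round(max_entropy_difference(responses, 500, size, rng), 3))
```

The printed gaps depend on the random seed and on the simulated data, so they illustrate the procedure only and are not comparable with the values measured on the real exam results.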

For each test, $I(X_i \wedge X_j)$ and the CTT correlation coefficient $R(X_i, X_j)$ between items $X_i$ and $X_j$ 
[2] were calculated, their comparability was examined, and the permissible limits of 
$I(X_i \wedge X_j)$ were defined. The correlation coefficient ranges from -1 to +1. Its value should 
not exceed 0.3. If the coefficient is close to +1, the test items repeat each 
other and one of them should be removed from the test. A negative correlation is taken to mean 
that the test items are independent. 
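For items scored 0 or 1, a correlation coefficient of this kind can be computed as sketched below. The sketch assumes that $R$ is the Pearson product-moment correlation (the phi coefficient for binary data), which is an assumption about the method in [2]. The function name and the toy data are hypothetical.

```python
from math import sqrt


def pearson_binary(col_i, col_j):
    """Pearson correlation of two 0/1 score columns (the phi coefficient)."""
    n = len(col_i)
    mean_i = sum(col_i) / n
    mean_j = sum(col_j) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(col_i, col_j)) / n
    var_i = sum((a - mean_i) ** 2 for a in col_i) / n
    var_j = sum((b - mean_j) ** 2 for b in col_j) / n
    return cov / sqrt(var_i * var_j)


# Hypothetical scores of 8 examinees (1 = correct).
x1 = [1, 1, 1, 1, 0, 0, 0, 0]
x2 = [1, 1, 1, 0, 0, 0, 0, 1]   # mostly agrees with x1
x3 = [0, 0, 0, 0, 1, 1, 1, 1]   # always disagrees with x1

print(pearson_binary(x1, x2))   # positive correlation
print(pearson_binary(x1, x3))   # perfectly anti-correlated items
```

For two 0/1 items, exact independence corresponds to $R = 0$. A value near -1, as for `x3` here, indicates that one item is almost fully determined by the other.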

 
3. The Analysis of Results.  

 
Based on the results of the first experiment, a comparison of the values of $H(X_i)$, the CTT difficulty index $P$ and the IRT $b$ 
parameter for Test 1 is shown in Figure 1. 

According to CTT, test items with difficulty values between 0.3 and 0.74 are 
good items (neither easy nor very difficult; 34 items), and the analyzed data show that the 
$H(X_i)$ values of Test 1 for these items are between 0.82 and 1.0. For easy test items, with difficulty 
values between 0.75 and 0.9 (29 items), the $H(X_i)$ values are between 0.48 and 0.81. For very easy 
test items, with difficulty values between 0.9 and 1.0 (14 items), the $H(X_i)$ values are between 0.12 and 
0.47. Approximately the same results were obtained for Tests 2, 3 and 4. 
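These ranges agree with the closed form of the entropy of a binary item. Assuming base-2 logarithms and identifying the CTT difficulty index $P$ with the probability $p_i$ of a correct answer, the boundary values can be checked directly (rounded to three decimals):

```latex
H(P) = -P \log_2 P - (1 - P) \log_2 (1 - P),
\qquad
H(0.5) = 1, \quad
H(0.74) \approx 0.827, \quad
H(0.75) \approx 0.811, \quad
H(0.9) \approx 0.469 .
```

Since $H(P)$ decreases as $P$ moves from 0.5 toward 1, items with $P$ between 0.75 and 0.9 cannot have an entropy above about 0.811, which is consistent with the observed upper value of 0.81.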

As we can see, when the values of $H(X_i)$ are large (close to 1), the IRT $b$ parameter also takes large values. 
 


 
Fig. 1. $H(X_i)$, CTT difficulty parameters $P$ and IRT $b$ parameters of Test 1. 
[Plot: series $H(X_i)$, IRT $b$ and Difficulty $P$; vertical axis from -5 to 3; horizontal axis lists the test items (A11, A21, B19, ..., B28).]



To analyze how the $H(X_i)$ values depend on the examinee sample size, we plotted 
$H(X_i)$ for each test based on the results of the five experiments. The plots are shown in 
Figure 2. 
 

Fig. 2. $H(X_i)$ plots for five experiments of Test 1. 
 
The maximum differences of the $H(X_i)$ values are presented in Table 1. 

 
Table 1. Maximum difference of $H(X_i)$ values between sample sizes. 

Sample size   500     300     200     100     50
500           -       0.089   0.07    0.135   0.2
300           0.089   -       0.13    0.15    0.19
200           0.07    0.13    -       0.16    0.23
100           0.135   0.15    0.16    -       0.3
50            0.2     0.19    0.23    0.3     -

 
As the examinee sample size decreases to 100, the differences of the 
$H(X_i)$ values remain small: the maximum difference does not exceed 0.16 (Table 1). When the sample size is 
decreased below 100, the differences grow, and for a sample of 50 the maximum 
difference reaches 0.3. 

In the same way, the mutual information $I(X_i \wedge X_j)$ was calculated for each test 
and the dependence of the $I(X_i \wedge X_j)$ values on the examinee sample size was observed. The 
plots are shown in Figure 3. 

 

 
Fig. 3. $I(X_i \wedge X_j)$ for five experiments of Test 1, item A1. 
[Plot: series A1 (500), A1 (300), A1 (200), A1 (100) and A1 (50); vertical axis from 0 to 0.18; horizontal axis lists the test items.]

The average mutual information and the correlation between test items have also been analyzed. 
The plots based on the data of some items are presented in Figure 4 and Figure 5. 

Fig. 4. Correlation $R(X_i, X_j)$ and average mutual information $I(X_i \wedge X_j)$ 
between item A1 and the other test items. 
[Plot: series $R(X, Y)$ and $I(X \wedge Y) \times 10$; vertical axis from -0.1 to 0.8; horizontal axis lists items a4 to b29.]



 
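Fig. 5. Correlation $R(X_i, X_j)$ and average mutual information $I(X_i \wedge X_j)$ 
between item A11 and the other test items. 
[Plot: series $R(X, Y)$ and $I(X \wedge Y) \times 10$; vertical axis from -0.1 to 0.25; horizontal axis lists items a1 to b29.]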

 
Table 2. 

A11       A5       A8       A15      A26      A31      B18      B19       B20      B23
R(X,Y)    -0.0083  -0.0186  -0.0581  -0.0195  -0.0385  -0.0005  -0.0151   -0.0219  -0.0177
I(X∧Y)    0.0011   0.0013   0.0049   0.0001   0.00001  0.0001   0.000004  0.0004   0.00006
 

For item A11, the negative correlations with other items and the corresponding average mutual information values 
are presented in Table 2. Comparing the plots of the correlations and of the average mutual information 
shows that the results are comparable. For example, the correlation matrix 
and the plots in Figure 5 show that test item A11 has a negative correlation with some other 
test items, and Table 2 lists the mutual information values for these items. When the correlation 
values are negative, the average mutual information values are very small, and such items should be 
removed from the test. Therefore the smallest permissible value of the average mutual information should be 
0.005. 

 
4. Conclusion  
 
The analysis of the data in this research has established the following. 
 

1. The suggested methods correctly determine the quality of test items, and the 
results are comparable with the CTT and IRT estimation methods. 

2. Simpler mathematical analysis is needed compared to IRT. 
3. $I(X_i \wedge X_j)$ describes the dependence of test items, which does not have an equivalent 
in IRT. 
4. For good test items the $H(X_i)$ values should be between 0.8 and 1.0, for fairly good (easy) items 
the values are between 0.45 and 0.8, and for bad test items the $H(X_i)$ values are 
between 0 and 0.45. 


5. For $I(X_i \wedge X_j)$, values not smaller than 0.005 and not greater than 0.05 are preferable: 
$0.005 \le I(X_i \wedge X_j) \le 0.05$. 

6. Smaller sample sizes are required in comparison with CTT and IRT. The sample size 
should be more than 100. In CTT the sample size is between 200 and 500, and in IRT 
it depends on the IRT model, but samples over 500 are needed. 
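The limits in items 4 and 5 can be applied mechanically, as in the minimal sketch below. The function names and label strings are hypothetical, and the treatment of the boundary values (0.8 counted as good, 0.45 as fairly good) is an assumption, since the text does not say to which range a boundary value belongs.

```python
def classify_item(h):
    """Label an item by its entropy H(X_i), using the ranges from item 4."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("entropy of a binary item must lie in [0, 1]")
    if h >= 0.8:
        return "good"
    if h >= 0.45:
        return "fairly good (easy)"
    return "bad"


def classify_pair(mi):
    """Label a pair of items by I(X_i ^ X_j), using the range from item 5."""
    if mi < 0.005:
        return "below preferred range"
    if mi <= 0.05:
        return "preferred"
    return "above preferred range"


print(classify_item(0.92))    # good
print(classify_item(0.60))    # fairly good (easy)
print(classify_pair(0.0011))  # below preferred range
print(classify_pair(0.02))    # preferred
```

The value 0.0011 is the mutual information of the pair A11 and A5 from Table 2, which falls below the lower limit of 0.005.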

 
References 

[1] M. Haroutunian and V. Avetisyan, “New approach for test quality evaluation based on 
Shannon Information measures”, Transactions of IPIA of NAS RA, Mathematical 
Problems of Computer Science, vol. 44, pp. 7-21, 2015. 

[2] M. B. Chelishkova, Theory and practice of pedagogical tests constructing, Moscow: Logos, 
2002. 

[3] C. DeMars, Item Response Theory, Oxford University Press, 1st edition, 2010. 

[4] R. K. Hambleton and R. W. Jones, “Comparison of classical test theory and item response 
theory and their applications to test development”, Educational Measurement: Issues 
and Practice, vol. 12, no. 3, pp. 38-47, 1993. 

[5] M. Haroutunian and V. Avetisyan, “Development of the test quality evaluation 
system”, Proceedings of the International Conference on Computer Science and 
Information Technologies (CSIT 2015), Yerevan, Armenia, September 28-October 2, 
pp. 372-375, 2015. 

 
Submitted 10.10.2015, accepted 20.01.2016 
 
 
Analysis of the Experiments of a New Approach to Test Quality Evaluation 

M. Haroutunian, V. Avetisyan 

Summary 

In the previous paper [1] the authors proposed a new model of test 
quality evaluation based on Shannon entropy and average 
mutual information. Experiments were carried out to determine the boundary 
values of these quantities and the sufficient number of test participants. 
The paper presents the analysis of these experiments, from which it follows 
that the calculations are simpler than in IRT and that the analysis of test 
results requires fewer participants than in CTT and IRT. 



Analysis of Experiments of a New Approach for Test Quality Evaluation 

M. Haroutunian, V. Avetisyan 

Abstract 

In the previous paper [1] the authors proposed a new model of test quality evaluation 
based on Shannon entropy and average mutual information. Experiments were conducted 
to establish the practical limits of these quantities and the required number of examinees. 
This paper presents the analysis of these experiments, from which it follows that the 
calculations are simpler than in IRT and that the analysis of test results requires fewer 
examinees than in CTT and IRT. 
 
