Microsoft Word - 376-2527-1-LE-rev


ACTA IMEKO 
ISSN: 2221‐870X 
November 2016, Volume 5, Number 3, 16‐23 

 
ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 16 

Proficiency tests with uncertainty information: Analysis using 
the maximum likelihood method 

Katsuhiro Shirono, Masanori Shiro, Hideyuki Tanaka, Kensei Ehara 

National Metrology Institute of Japan, AIST Central 5, 1‐1‐1 Higashi, Tsukuba, Ibaraki, Japan  

 
Section: RESEARCH PAPER  

Keywords: proficiency test; En score; uncertainty; ISO 13528; maximum likelihood method 

Citation: Katsuhiro Shirono, Masanori Shiro, Hideyuki Tanaka, Kensei Ehara, Proficiency tests with uncertainty information: Analysis using the maximum 
likelihood method, Acta IMEKO, vol. 5, no. 3, article 4, November 2016, identifier: IMEKO‐ACTA‐05 (2016)‐03‐04 

Editor: Franco Pavese, Italy 

Received March 31, 2016; In final form March 31, 2016; Published November 2016 

Copyright: © 2016 IMEKO. This is an open‐access article distributed under the terms of the Creative Commons Attribution 3.0 License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: This work was supported by a Grant‐in‐Aid for Scientific Research (No. 26870899) from the Japan Society for the Promotion of Science (JSPS), Japan. 

Corresponding author: Katsuhiro Shirono, e‐mail: k.shirono@aist.go.jp 

 
1. INTRODUCTION 

The proficiency test (PT) using an interlaboratory 
comparison is an effective tool to assure the quality of the 
measurements of calibration and testing laboratories. 
Participation in a PT is usually required for a laboratory to be 
accredited under ISO/IEC 17025:2005 [1]. 

For performance evaluation in a PT with information on 
uncertainty, a comparison with the result of a reference 
laboratory is implemented using the En score as described in 
ISO 13528:2015 [2], which is referred to as the En number in 
ISO 13528:2005 [3]. ISO 13528 is the standard for the statistical 
methods used in PTs. Given that the measurement value of 
Laboratory k and its expanded uncertainty are respectively xk 
and Uk while those of the reference laboratory are respectively 
Xref and Uref, the En score for Laboratory k is defined as 
follows:  

    2ref2refn UUXxE kkk  ,  (1) 

 
where superscript (k) means Laboratory k. When |En(k)| ≤ 1 
and > 1, the performances of Laboratory k are evaluated as 
“satisfactory” and “unsatisfactory”, respectively.  

Because it is difficult to designate the reference laboratory in 
some testing fields, performance evaluation in a PT with 
uncertainty information where an appropriate reference 
laboratory does not exist has not yet been described in ISO 
13528:2015. It must be noted that when there is no reference 
laboratory, outliers can seriously impair the quality of the PT. 
On the other hand, since a PT is conducted to check the 
proficiency of laboratories, the possible existence of 
laboratories with inadequate proficiency should be taken into 
consideration. Thus, a robust analysis method is required.  

Although no suggestion can be found in ISO 13528:2015, 
several analysis procedures have already been proposed for key 
comparison tests, which are a type of PT for national metrology 
institutes that are basically implemented without a specific 
reference laboratory. A guideline on statistical methods for key 

ABSTRACT 
In this study, we report the application of the maximum likelihood method to the analysis of proficiency test data when uncertainty 
information is given and a reference laboratory does not exist. There are two causes that could impair the quality of analysis using the 
maximum  likelihood method: the existence of an unknown random effect, and outliers. The conditions under which performance 
evaluations can be appropriately conducted are discussed in this study. To avoid serious impacts from these two causes, the maximum 
permissible  standard  uncertainty  of  an  unknown  random  effect  and  the  minimum  permissible  standard  uncertainty  of  the  values 
reported  by  a  participating  laboratory  are  quantified.  Through  simulations,  the  maximum  and  the  minimum  permissible  standard 
uncertainties are found to be 0.3 and 0.5 times the  intermediate magnitude of the standard uncertainty that the participants are 
expected to report. We believe that our proposed procedure based on these criteria is sufficiently simple to be employed in actual 
proficiency tests.  


ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 17 

comparisons was presented by Cox [4] in 2002, in which a 
robust analysis referred to as Procedure B is proposed for a 
case where the results are inconsistent. Moreover, analysis with 
the largest consistent subset (LCS), also proposed by Cox [5], 
has been widely employed in such analysis. The LCS is the 
subset with the largest data size among the subsets whose 
consistencies are confirmed through a 2 test. Various other 
methods have also been proposed so far [6]−[10]. It is worth 
noting that although the statistical models employed in these 
proposals differ from each other, they are useful in their 
respective situations.  

We also developed an analysis method that is robust to 
outliers [11], [12], which is referred to as the robust method in 
this paper. This method comprises two steps: (i) the detection 
of an unknown random effect, and (ii) performance evaluation 
using the local maximum likelihood (LML) method. The robust 
method is explained in the appendix, and a brief summary of it 
is given in Subsection 2.1. The advantage of this method is that 
it lessens the risk of performances being evaluated based on 
inappropriate data influenced by an unknown random effect, 
and allows performance evaluations to be conducted with clear 
statistical meaning.  

In the present study, the robust method is reduced to a 
simpler method, which is referred to as the global maximum 
likelihood (GML) method. The robust method, in which LML 
estimators are employed, may give the impression of being 
difficult to implement because of its computational complexity. 
The approach using the GML method is explained in 
Subsection 2.2, and is, we believe, as simple as Algorithm A in 
ISO 13528:2015 Appendix C. If the conditions are clearly 
given, this approach can be employed in an actual PT.  

Corresponding to the two steps in the robust method, 
conditioning is conducted by two parameters: (i) the maximum 
permissible standard uncertainty for a random effect, and (ii) 
the minimum permissible standard uncertainty for the reported 
values. The first parameter is quantified by examining the 
magnitude of a random effect that will significantly affect the 
quality of the PT. The second parameter is considered because, 
when an outlier has an extremely small uncertainty, the GML 
method can offer different results from the LML method. 
These parameters are quantified relative to the intermediate 
magnitude of the standard uncertainty that the participants are 
expected to report. 

This paper is organized as follows: Section 2 provides the 
basic theory of the robust method and the GML method. The 
qualification of the parameters and the proposal of a practical 
procedure based on the qualified parameters are then given in 
Section 3. In Section 4, the procedure is applied to the data of 
an actual PT. A brief conclusion is presented in Section 5, and 
information on the robust method is provided in the Appendix. 

2. ROBUST AND GLOBAL MAXIMUM LIKELIHOOD (GML) 
METHODS 

2.1. Robust method 

Suppose that n laboratories participate in a PT. It is assumed 
that Laboratory i reports xi and ui as the reported value and its 
standard uncertainty for i = 1, 2, …, n. qi is defined as the 
square of the standard uncertainty ui.  

As mentioned earlier, the two steps in the robust method are 
(i) the detection of an unknown random effect, and (ii) 
performance evaluation using the LML method. It is possible 
that inhomogeneity or instability of the PT items, or vagueness 

of the measurand, may cause a large unknown random effect 
and seriously impair the quality of the PT. The first step is 
therefore necessary. The second step is important in order to 
evaluate performances in a manner that is robust to outliers. 

In the first step, the marginal likelihoods are computed. The 
marginal likelihood is, simply put, the likelihood of the model. 
The marginal likelihoods of the models in which an unknown 
random effect is not considered or is commonly considered for 
all n laboratories are defined as 0 and n, respectively. When 
0 < n, the model with no random effect is more likely, and 
the data are regarded to be inappropriate for the performance 
evaluation. In the robust method, the marginal likelihoods of 
some other models with a random effect are also computed. 
Moreover, an estimation of the measurand is given robustly to 
outliers through Bayesian analysis.  

Two examples are shown in Figure 1(a) and (b). In Figure 
1(a), the data are largely dispersed. In this case, 0 < 7, and an 
unknown random effect is detected. The data are therefore 
inappropriate for use in the performance evaluation. On the 

 
(a) 

 
(b) 

Figure 1. Simulated PT data: (a) (x1, x2, x3, x4, x5, x6, x7) and (u1, u2, u3, u4, u5, 
u6, u7) are given as (a) (1, 2, 3, 4, 5, 6, 7) and (0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2), 
and (b) (1, 2, 4, 4, 4, 6, 6.4) and (1, 1, 1, 1, 1, 1, 0.04). The error bars show 
the expanded uncertainties with a coverage factor of 2.  


ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 18 

other hand, most of the data seem consistent in Figure 1(b). In 
this case, an unknown random effect is not detected. The 
estimation of the measurand is given as 4.0, which is 
comparable to the values of x3 to x5.  

In the performance evaluation, the statistical model xi ~ N(, 
qi + i) (i = 1, 2, …, n) is considered, where  is assumed to be 
the true value of the measurand in the PT, and i is an 
additional variance. Several local maximums for the likelihood 
of the model can be observed. To analyse the data robustly, the 
combination of parameters maximizing the likelihood in which 
the value of  is closest to the estimate of the measurand is 
focused on. Defining the LML estimators as the parameters in 
the combination, the symbols LML and 1LML to nLML are 
introduced to express the LML estimators  and 1 to n.  

The performance evaluation for Laboratory k is then 
conducted to check the validity of the statistical model xk ~ 
N(, qk) and xi ~ N(, qi + iLML) for all i ≠ k. The specific 
form of the statistic is given in the appendix as the extended En 
score. When |En(k)| ≤ 1 and > 1, the performances of 
Laboratory k are evaluated respectively as “satisfactory” and 
“unsatisfactory” as described in connection with (1). 

The extended En scores obtained using the robust method 
for the data in Figure 1(b) are given in Table 1. The extended 
En scores of x3 to x5, whose values are close to the estimation 
of the measurand, are 0.0, and their performances are evaluated 
as “satisfactory”. On the other hand, the extended En score of 
the apparent outlier of x7 is 2.4, and the performance is 
evaluated as “unsatisfactory”. Thus, it is found that the robust 
method is not significantly influenced by outliers. 

2.2. Global maximum likelihood (GML) method 

Although the robust method is flexibly applicable, the 
computation may be complicated. Therefore, it is meaningful to 
provide a simpler method. Analysis using the GML estimator is 
proposed here. In this section, n, xi, ui, and qi are defined as in 
Subsection 2.1. 

The statistical model xi ~ N(, qi + i) (i = 1, 2, …, n) is 
considered in this method as in the robust method. This model 
is actually identical to Model III proposed by Willink [8]. In 
Willink’s paper, the GML estimator of i, iGML, is given as 
iGML = max{|xi − GML|2 − qi, 0}, where GML is the GML 
estimator of . Moreover, Willink showed that GML can be 
given through the minimization of the following quantity Q():  

     
 









 


n

i i

i
i

x
Q

1

2

log



 ,  (2) 

where i=max{|xi -|
2, qi}. 

We apply the same concept as in the robust method to the 

performance evaluation. From an analogy to En(k) in the robust 
method given in (A5) in the appendix, the following extended 
En score is computed for the performance evaluation of 
Laboratory k with iGML = iGML + qi : 

 
1

GML2

GML
1

GML

12

1




































n

ki ik

n

ki ii
n

ki ikk
n

u

xx
E




,  (3) 

where  kin  denotes the summation for i = 1, 2, …, k − 1, k 
+ 1, …, n. This statistic is analogic to the concept of the 
“exclusive statistics” [13], [14]. When |En(k)| ≤ 1 and > 1, the 
performances of Laboratory k are evaluated as “satisfactory” 
and “unsatisfactory”, respectively. This analysis method is 
referred to as the GML method in this study. 

Since no check of an unknown random effect is conducted 
in this method, the GML method is not appropriate when an 
unknown random effect has a serious impact on the PT data. 
Therefore, the GML method should not be applied to the data 
shown in Figure 1(a) without any quantitative consideration of 
the unknown random effect.  

Moreover, GML and LML may differ from each other when 
an extremely small uncertainty is associated with an outlier. 
When the GML method is applied to the data in Figure 1(b), 
the extended En scores are reported as shown in Table 1. The 
computed extended En score of Laboratory 7 is 0.8, and the 
performance is evaluated as “satisfactory”. This unnatural result 
occurs due to the extremely small uncertainty of x7. 

However, we believe that the GML method is sufficiently 
simple to be employed in an actual PT analysis if certain 
conditions are satisfied. Specifically, the method is applicable 
when it is confirmed that the uncertainty of an unknown 
random effect is negligibly small, and no extremely small 
standard uncertainties are reported by the participating 
laboratories. In Section 3, the conditions are discussed in 
quantitative terms and a practical procedure is proposed. 

3. CONDITIONING FOR APPLICATION OF THE GLOBAL 
MAXIMUM LIKELIHOOD APPROACH 

3.1. Concept of the maximum and the minimum permissible 
uncertainties 

In this section, we focus on the two factors that might cause 
a problem in the performance evaluation proposed in 
Subsection 2.2: (i) an unknown random effect, and (ii) outliers. 
The first factor can be neglected when the random effect is 
much smaller than the uncertainty of the reported values, while 
the second factor can be neglected when no extremely small 
uncertainty is reported.  

Once it is clarified how large an uncertainty will be reported 
by the participants in a PT, criteria to avoid the effects of these 
factors can be examined. The standard uncertainty that is 
expected to be averagely reported by the participants in a PT is 
referred to as the expected standard uncertainty in this study, 
and expressed as uexp. It is assumed that the expected standard 
uncertainty can be roughly determined based on technical 
knowledge in advance of the PT.  

The maximum permissible standard uncertainty of an 
unknown random effect and the minimum permissible standard 
uncertainty of the reported values are introduced to avoid the 
influences of the random effect and outliers, respectively. These 
are discussed in Subsections 3.2 and 3.3 based on the 

Table  1.  Extended  En  scores  computed  using  the  robust  method  and  the 
GML method. 

Laboratory (k)  En
(k)
 in the robust method  En

(k)
 in the GML method 

1  − 1.4  − 2.7

2  − 1.0  − 2.2

3  0.0  − 1.2

4  0.0  − 1.2

5  0.0  − 1.2

6  0.9  − 0.2

7  2.3  0.8


ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 19 

assumption that the expected standard uncertainty uexp is 
appropriately given. In Subsection 3.4, a practical procedure is 
proposed based on these discussions.  

3.2. Influence of the random effect 

Only inhomogeneity and instability are considered as sources 
of an unknown random effect in this study. These random 
effect sources tend to have a serious impact on the analysis of a 
PT. It is assumed that it has been confirmed that the definition 
of the measurand, including the choice of the measurement 
methods, will not have a serious influence.  

Defining rnd as the standard deviation of the random effect, 
the case in which rnd is estimated to be 0.3 × uexp is considered. 
The criterion that the standard deviation of the inhomogeneity 
must not exceed 0.3 × P, where P is the standard deviation 
for the proficiency assessment, is given in ISO 13528:2015. 
Regarding the instability, almost the same criterion is used. This 
criterion is applied by extension to a case with uncertainty 
information in this study. Since P could be interpreted as the 
average value of the dispersion of the reported values, it seems 
appropriate to replace P with uexp. 

To characterize these criteria quantitatively, letting −1(.) be 
the inverse of the cumulative standard normal probability 
function, it is considered that xi is derived from N(, qi + rnd2) 
and the data are given as follows: 

  13.01 1
2




  niΦ
V

x i ,  (4) 

qi = uexp = 1 for i = 1, 2, …, n, 

where zi = −1(i /(n+1)), and V = 
n
i 1 (zi –

n
i 1 zi/n)2/(n −1). 

Using these data, rnd2 is unbiasedly estimated to be 0.32. It 
should be noted that this sequence of xi does not necessarily 
mean that xi is truly derived from N(0, 1 + 0.32), and other 
statistical models including xi ~ N(, qi + i) (i = 1, 2, …, n) 
employed in the robust method might yield the same sequence 
of xi. In Figure 2, the dispersion of data when n = 30 is shown. 

Consequently, the unknown random effect is undetected 
when the data are given by (4). The marginal likelihoods of the 
models in which an unknown random effect is considered for 
no and all laboratories, 0 and n, are computed for the cases 
with n = 2, 5, 10, 20, 100, and 200. Figure 3(a) shows 0 and n, 
implying that 0 is always larger than n. Thus, it is concluded 
based on the discussion in Subsection 2.1 that no unknown 
random effect is detected. 

Since the result could change if different data were provided, 
the property is checked with more dispersive data. The 
following data are considered:  

  15.01 1
2




  niΦ
V

x i ,  (5) 

qi = uexp = 1 for i = 1, 2, …, n. 
Figure 3(b) shows 0 and n, and 0 > n for all n as well. 

This means that the unknown random effect is undetectably 
small even when rnd is roughly estimated to be 0.5 × uexp.  

Therefore, it is concluded that 0.3 × uexp could be a strong 
candidate for the maximum permissible standard uncertainty. In 
this discussion, rnd is unbiasedly estimated using the PT data. 
In ISO 13528:2015, on the other hand, inhomogeneity and 
instability are basically evaluated independently from the PT 
data. However, since the random effect is undetectably small 

even with more dispersive data, it can be said to be conservative 
to set the maximum permissible standard uncertainty as 0.3 × 
uexp, irrespective of the method used for the estimation of rnd. 

3.3. Sensitivity to outliers 

We recommend that the GML method be employed only 
when the number of the laboratories with |En| ≤ 1 is 10 or 
more. When the number of participants is small, the 
discrimination of outliers from the other values is technically 
difficult. Such discrimination is, however, necessary in the case 
of the GML method. In the following discussion, the GML 
method is characterized through simulated data with an outlier 
and 10 other data. Cases with a smaller data size are not taken 
into consideration.  

We believe that 0.5 × uexp is appropriate as the minimum 
permissible standard uncertainty mentioned in subsection 3.1. It 

 
Figure 2. Simulated PT data given by (4) when n = 30. The error bars show 
the expanded uncertainties with a coverage factor of 2. 

 
Figure  3.  Computed  logarithmic  marginal  likelihoods  per  number  of 
laboratories  for  the  data  when  n  =  2  to  200  using  the  data  given  in  the 
equations of (a) (4) and (b) (5). 


ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 20 

is considered that the uncertainty should be larger than 1/√10 
× uexp ≈ 0.32 × uexp, because the weighted mean of the 10 data 
with the standard uncertainty of uexp is given as 1/√10 × uexp. 
When ui < 1/√10 × uexp, the value xi can have a strong impact 
on the determination of the GML estimator. Thus, the 
minimum uncertainty of 0.5 × uexp seems a possible choice. The 
robustness of the GML method with this value is consequently 
examined in this subsection.  

In discussing the property of the GML method in 
quantitative terms, it is assumed that uexp = 1, and x1 to x10 are 
considered to be given as follows:  

  111 1 iΦVx i  , (6) 
qi = uexp2 = 1, where V is defined in Subsection 3.2. In addition 
to these data, x11 and q11 are basically determined differently so 
as to be characterized as outliers. See Figure 4 as an example 
with x11 = 2.0 and q11 = 0.52. Examples with q11 = 0.32 and 0.52 
are described in this study; the analysis with q11 = 0.32 is 
conducted for comparison. If the simulation with the smaller 
minimum permissible standard uncertainty of 0.3 × uexp 
provides a result that is robust to the outliers, it can be said that 
setting the minimum permissible standard uncertainty as 0.5 × 
uexp is an adequately conservative choice. 

Consequently, the criterion of a minimum permissible 
standard uncertainty of 0.5 × uexp seems appropriate. When 
LML = GML, the identical extended En scores are given in both 
the LML and the GML methods for all of the laboratories. 
Figure 5(a) and (b) show a comparison between LML and GML 
for q11 = 0.32 and 0.52, respectively. LML and GML are in 
perfect agreement with each other under the conditions of both 
q11 = 0.32 and q11 = 0.52. Even in the case of q11 = 0.3, the 
results are not significantly contaminated by the existence of 
outliers. Thus, the validity of the criterion of 0.5 × uexp is 
confirmed. 

It should be noted that the participating laboratories must 
agree in advance to a PT in which standard uncertainty less 
than the minimum permissible standard uncertainty is not 
usually reported, and will not be reported in the PT. This is 
because the measurements for a PT must be conducted under 
the usual experimental conditions. It is not recommended that a 
laboratory that usually reports an uncertainty of less than 0.5 × 
uexp reports an uncertainty of 0.5 × uexp or more only for the 
purpose of this PT.  

3.4. Proposed procedure using the GML method 

Based on the discussions in Subsections 3.2 and 3.3, we 
suggest the following four conditions for application of the 
GML method: (i) no unknown random effect other than 
inhomogeneity and/or instability of the PT items exists, (ii) the 
random effect caused by the inhomogeneity and/or instability 
has a standard deviation of less than 0.3 × uexp, (iii) the 
minimum permissible standard uncertainty from the 
laboratories is 0.5 × uexp, and (iv) 10 or more laboratories report 
data for which the performance is evaluated as “satisfactory”.  

The following procedure entailing 11 steps is proposed to 
realize the above conditions: 
1. It is confirmed that there is no or practically negligible 

unknown uncertainty in the definition of the measurand, 
including the choice of the measurement method.  

2. The PT provider determines the expected uncertainty, uexp, 
from the existing technical knowledge.  

3. The PT provider checks that the standard deviation of the 
random effect caused by the inhomogeneity and/or 
instability of the PT items is less than 0.3 × uexp. 

4. All of the participants agree that a standard uncertainty of 
less than 0.5 × uexp is not usually reported, and will also not 
be reported in the PT. 

5. The PT is implemented and the data of xi and ui (i = 1, 2, 
…, n) are obtained. 

6. It is confirmed that a representative value (e.g., the median) 
of the reported standard uncertainties is adequately close to 
the expected uncertainty. 

7.    






















 

 


n

i i

i
i

xxx

x

ni 1

2

 ...,,

old logminarg
1




 . 

 
Figure 4. Simulated PT data given by (7) when x11 = 2.0 and q11 = 0.5
2
. The 

error bars show the expanded uncertainties with a coverage factor of 2. 

 
Figure 5. Computed GML and LML estimators with q11 = 0.3

2
 and 0.5

2
 as a

function of x11. x1 to x10 yielded by (7). 


ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 21 

8. 
 
 




















 














 


n

i ii

n

i iii

xq

xqx

1

2old

1

2old

GML

,max/1

,max/




 . 

9. If |GML − old| > [ ni 1 {1/max(qi, (xi − GML)2)}]−1/2, 
where  is a small value like 10–3, old = GML and go to 
Step 8. 

10. For k = 1, 2, …, n, En(k) is computed through  

 
 

1
GML2

GML
1

GML

12

1




































n

ki ik

n

ki ii
n

ki ikk
n

u

xx
E




, 

where iGML = max(qi, (xi − GML)2). 
11. It is confirmed that 10 or more laboratories have a 

performance of |En(k)| ≤ 1.  
For Steps 1, 3, 4, 6, and 10, if the conditions are not satisfied, 
the robust method should, in principle, be implemented instead 
of the GML method. However, in Step 4, if a laboratory 
requests permission to report a smaller standard uncertainty 
than the minimum permissible uncertainty, such a request may 
be accepted only when the laboratory has sufficient technical 
evidence.  

In Step 7, the initial value for the estimation of GML is 
determined as xi with which Q(xi) in (2) is the smallest among 
all Q(xi). Although it is not mathematically assured that the 
GML estimator can be obtained through this estimation, we 
have not found a case in which the GML estimator is not given 
by this determination of the initial value.  

The chief advantage of the GML method is that the 
algorithm is sufficiently simple to be employed in an actual PT; 
or in other words, to be incorporated into ISO 13528:2015. In 
ISO 13528:2015, several algorithms for a PT without 
uncertainty information are described. It can be said that the 
above algorithm is as simple as those methods.  

4. EXAMPLE OF APPLICATION: MEASUREMENT OF 
CONCENTRATION OF COPPER IN WATER 

The GML method is applied to the data from a PT that was 
conducted by the Japan Society for Analytical Chemistry (JSAC) 
from 2014 to 2015 [15]. In this test, the concentration of 
copper in water was measured and reported in the unit of mg/L. 
It should be noted that GML analysis was not applied in the 
actual analysis, and the performances were evaluated through a 
comparison with the reference laboratory. These data are cited 
merely as a set of numerical examples. The following 
explanation is given based on the procedure described in 
Subsection 3.4. 

Step 1: JSAC has implemented PTs for the measurement of 
the concentration of metals in water since 2007. Thus, 
information on the uncertainty caused by the measurement 
method has been shared by the participants, and the uncertainty 
has been incorporated into the reported uncertainty. Hence, it 
cannot be an unknown component of the uncertainty. 

Step 2: The expected standard uncertainty uexp was yielded as 
0.0034 mg/L from the past PT data implemented from 2013 to 
2014 [16]. The median of the reported relative standard 
uncertainties in the PT in 2013 to 2014 was 1.7 %. uexp was 
therefore given as 1.7 % of the set concentration of copper in 

the PT in 2014. The set concentration was 0.200 mg/L, and uexp 
was given as 0.0034 mg/L.  

Step 3: The maximum permissible standard uncertainty of 
the random effect was given as 0.0010 mg/L, which is 0.3 times 
the expected uncertainty. Inhomogeneity of the PT items was 
checked in this test. The evaluated standard uncertainty 
between units was calculated as 0.0003 mg/L through analysis 
of variance. Since this value is much smaller than 0.0010 mg/L, 
the effect of the inhomogeneity on the PT results was negligible. 
In actuality, when the robust method is applied, the random 
effect cannot be detected. 

Step 4: The minimum permissible standard uncertainty of 
the reported values was given as 0.0017 mg/L, which is 0.5 
times the expected uncertainty. Since information on the 
minimum permissible standard uncertainty was not given in the 
actual PT, there was a laboratory that reported a standard 
uncertainty of less than 0.0017 mg/L. The data of that 
laboratory have been removed, because the data are treated 
only as a numerical example in the present study. Of course, it 
is not recommended in an actual PT that data be removed after 
the PT without reasonable grounds.  

Step 5: The 22 reported values are shown together with their 
respective standard uncertainties in Table 2 and Figure 6. The 
reported values ranged from 0.1908 mg/L to 0.2417 mg/L. 
Laboratories 6 and 20 reported extremely large standard 
uncertainties. Unlike the case of extremely small uncertainties, 
these large uncertainties are not considered to impair the quality 
of the PT. 

Step 6: The median of the standard uncertainties in Table 2 
is 0.0036 mg/L, which is close to the expected standard 
uncertainty, 0.0034 mg/L. Since the median is larger, it can be 

Table  2.  Actual  data  reported  in  a  PT  of  the  concentration  of  copper  in 
water conducted by the Japan Society for Analytical Chemistry from 2014 to 
2015, and the evaluated values of Q(xi) and En(i) from the data. One set of 
data in which the standard uncertainty exceeded the minimum permissible 
standard uncertainty was removed only for the purpose of this study.  

Laboratory (i)
Reported 
value (xi) 

Standard 
uncertainty 

(ui) 
Q(xi)  En

(i)
 

1 0.1908 0.0088  ‐162.13  ‐0.9 

2 0.196 0.0094  ‐179.68  ‐0.5 

3 0.1967 0.0033  ‐182.31  ‐1.4 

4 0.1987 0.0024  ‐189.10  ‐1.4 

5 0.1991 0.0050  ‐190.38  ‐0.7 

6 0.202 0.11  ‐199.65  0.0 

7 0.2025 0.0022  ‐201.26  ‐0.8 

8 0.2025 0.0029  ‐201.26  ‐0.6 

9 0.2043 0.0023  ‐205.46  ‐0.4 

10 0.2056 0.0036  ‐207.32  0.0 

11 0.2059 0.0033  ‐207.43  0.0 

12 0.206 0.011  ‐207.42  0.0 

13 0.2061 0.0044  ‐207.39  0.0 

14 0.2070 0.0018  ‐206.13  0.3 

15 0.2071 0.0027  ‐205.89  0.2 

16 0.2071 0.0023  ‐205.89 0.3 

17 0.2072 0.0018  ‐205.64 0.4 

18 0.2116 0.0067  ‐185.02 0.4 

19 0.2129 0.0036  ‐180.16 1.0 

20 0.214 0.039  ‐176.38 0.1 

21 0.216 0.014  ‐169.83 0.4 

22 0.2417 0.0036  ‐127.96 4.9 


ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 22 

said that a conservative evaluation was conducted to determine 
the maximum permissible standard uncertainty for checking the 
uncertainty of the random effect. On the other hand, there is a 
possibility such that the minimum permissible standard 
uncertainty of the reported value might be too small. However, 
as mentioned in Subsection 3.3, when the minimum permissible 
standard uncertainty is 0.3 times the expected standard 
uncertainty, the GML method works well in most cases. Since 
0.0036 (mg/L) / 0.0034 (mg/L) is smaller than 0.5 / 0.3 = 1.67, 
this slight difference does not seem to have a significant impact 
on the performance evaluation. 

 Steps 7 to 9: The values of Q(xi) are shown in Table 2, and 
the minimum is given when i = 11. The value of x11 is therefore 
employed as the initial value in the repetitive computation to 
determine GML. The repetitive computation in Steps 8 and 9 
gives the value of GML as 0.2059 mg/L. In this case, LML = 
GML, and the performances evaluated using the GML method 
are identical to those using the robust method. 

Steps 10 and 11: The extended En scores are computed, and 
these are shown in Table 2 together with the reported values. 
The number of laboratories with a performance of |En(k)| ≤ 1 
is 19, which is more than 10. There are three laboratories whose 
magnitudes of En scores are larger than 1.0. It is found from 
Figure 6 that the values that do not contain GML in the range 
of their expanded uncertainties are evaluated as “unsatisfactory”. 
These results seem natural. We have presented a detailed 
discussion on the validity of the extended En scores in another 
paper [11].  

For comparison, other methods were applied to this 
example. The robust method shows performance evaluation 
results identical to those obtained using the GML method. 
Analysis using the largest consistent subset [4] obtains En scores 
with magnitudes exceeding 1.0 only for Laboratories 3, 4, 19, 
and 22. This difference does not seem to be essential. We have 
provided further discussion through a comparison with other 
methods in another paper [12]. 

5. CONCLUSION 

In this study, the application of maximum likelihood to the 
analysis of proficiency test data is discussed for cases where 
uncertainty information is given and a reference laboratory does 
not exist. To prevent serious effects from an unknown random 
effect and a few outliers, the following four conditions are 
suggested: (i) no unknown random effect other than 
inhomogeneity and/or instability of the PT items exists, (ii) any 
random effect caused by inhomogeneity and/or instability has a 
standard deviation of less than 0.3 × uexp, (iii) the minimum 
permissible standard uncertainty from the laboratories is larger 
than 0.5 × uexp, and (iv) the performances of 10 or more 
laboratories are consequently “satisfactory”, where uexp is the 
standard uncertainty that the participants are expected to report 
averagely. Based on these suggestions, a practical procedure is 
proposed. Moreover, the analysis method is characterized 
through an actual example. We believe that the analysis method 
proposed in this study can provide natural results with a 
computationally simple algorithm. 

ACKNOWLEDGEMENT 

This work was supported by a Grant-in-Aid for Scientific 
Research (No. 26870899) from the Japan Society for the 
Promotion of Science (JSPS). 

APPENDIX: METHOD PROPOSED IN OUR PREVIOUS 
STUDIES 

The method proposed in our previous studies [11], [12], 
which is referred to as the robust method in the main 
manuscript, is explained. It should be noted that the meanings 
of some symbols are different from those in the main 
manuscript.  

In this method, the model selection is implemented through 
a comparison of the marginal likelihood. The statistical model 
with the following parameters is considered: 
1. the number of data to which the common random effect is 

given, m (m = 0, 2, 3, …, n); 
2. the identification vector for correspondence of the 

laboratory and the data, vK = (K(1), K(2), …, K(n))T (K(1) < 
K(2) < … < K(m), K(m+1) < K(m+2) < … < K(n)); and  

3. the parameters for the priors , m+1, m+2, …, and n. (1 ≤ 
 < +∞, 1 ≤ i (i = 1, 2, …, n)); 

where n is the number of participating laboratories. 
Suppose that Laboratory K(i) reports the measurement value 

xi and its standard uncertainty ui (i = 1, 2, …, n). Let qi = ui2 for 
simplicity of the description. xi is assumed to be derived from 
the normal distribution with the same mean of . On the other 
hand, the variances of the distribution for the reported values 
of Laboratories K(i) (i = 1, 2, …, m) are assumed to be qi + c, 
where c is the variance caused by an unknown random effect. 
The variances for the reported values of Laboratories K(i) (i = 
m + 1, m + 2, …, n) are assumed to be qi + i, where i is the 
additional variance caused by the unskillfulness of these 
laboratories. Thus, the model distributions of xi are given as 
follows:  

     
     .,,1for   ,N~

,,,1for   ,N~ c
nKmKiqx

mKKiqx

iii

ii











 (A1) 

Defining 

 
Figure  6.  Actual  PT  data  shown  in  Table  2.  The  error  bars  show  the 
expanded  uncertainties  with  a  coverage  factor  of  2.  The  dotted  line

indicates the GML estimator, GML = 0.2059 mg/L. The data denoted by the
empty  circles  are  the  values  whose  performances  are  found  to  be
“unsatisfactory” in the GML method. 


ACTA IMEKO | www.imeko.org  November 2016 | Volume 5 | Number 3 | 23 

iii

m

i i

q
q




 

















  ,1

1

1 c
c , (A2) 

the priors of , c, and i (i = m + 1, …, n), p(), p(c), and 
p(i), are given as follows:  
   

 

   .   

,  

,-  1
1

1

1
c

-
cc

iiii

m

i i

qp

qp

p

i 



































  (A3) 

The priors of c and i, p(c) and p(i), are given accordingly. 
The hyperparameters of m, vK,  and i, are optimized to 
maximize the following modified marginal likelihood:  

      



W

c
1

cc ,,,, θvxθ dddppmlΛ
m

mi
iiK  , (A4) 

where x = (x1, …, xn)T,  = (m+1, …, n)T, and W= {, c, | −
∞ <  < +∞, 0 < c < +∞, 0 < i < +∞}. l(, c, |x, m, 
vK) is the likelihood of , c, and  given x, m, and vK. The 
marginal likelihoods with m = 0 and m = n are referred to as 0 
and n, respectively, and employed in the main manuscript. It 
should be noted that vK is not a parameter when m = 0 and m = 
n.  

The point is that if m ≥ 2 is chosen as the optimized 
parameter, the performance evaluation should not be 
implemented, because m ≥ 2 means c > 0. c is the variance of 
a random effect. The effect must be corrected before the 
performance evaluation.  

Only when m = 0 is chosen, the performance evaluation is 
given. For the performance evaluation, let the posterior mean 
of  with the optimized model be rob. Several combinations of 
(, 1, ..., n) locally maximizing the likelihood of the statistical 
model xi ~ N(, qi + i) (i = 1, 2, …, n) can exist. However, the 
LML estimators of  and i, LML and iLML, are specifically 
defined as the values of  and i included in the combinations 
of (, 1, ..., n) whose value of  is the closest to rob among 
those several combinations.  

Defining iLML = qi + iLML, the extended En score for 
Laboratory k is proposed as follows:  

 
1

LML2

LML
1

LML

12

1




































n

ki ik

n

ki ii
n

ki ikk
n

u

xx
E




, (A5) 

to check the validity of the following statistical model:  

 
   .,...,1,1,...,1  ,N~

,,N~
LML nkkix

qx

ii

kk




 (A6) 

REFERENCES 

[1] International Organization for Standardization 
(ISO)/International Electrotechnical Commission (IEC), 
“ISO/IEC 17025:2005: General requirements for the 
competence of testing and calibration laboratories”, ISO, 
Geneva, 2005.  

[2] ISO/IEC, “ISO/IEC 13528:2015: Statistical methods for use in 
proficiency testing by interlaboratory comparisons”, ISO, 
Geneva, 2015. 

[3] ISO/IEC, “ISO/IEC 13528:2005: Statistical methods for use in 
proficiency testing by interlaboratory comparisons”, ISO, 
Geneva, 2005. 

[4] M. G. Cox, Metrologia 39 (2002) pp. 589–595. 
[5] M. G. Cox, Metrologia 44 (2007) pp. 187–200.  
[6] R. C. Paule and J. Mandel, J. Res. Natl. Inst. Stand. Technol. 87 

(1982) pp. 377–385. 
[7] B. Toman, A. Possolo, Accred. Qual. Assur. Vol. 14 (2009), pp. 

553–563. 
[8] R. Willink, “Advanced mathematical and computational tools in 

metrology and testing X”, World Scientific, Singapore, 2015, 
ISBN: 978-981-4678-61-2, pp. 78–89. 

[9] S. K. Shirono, H. Tanaka, K. Ehara, Metrologia 47 (2010) pp. 
444–452. 

[10] R. N. Kacker, A. Forbes, R. Kessel, K.-D. Sommer, Metrologia 
45 (2008) pp. 512–23. 

[11] K. Shirono, H. Tanaka, M. Shiro, K. Ehara, Measurement 83 
(2016) pp. 135-143.  

[12] K. Shirono, H. Tanaka, M. Shiro, K. Ehara, Measurement 83 
(2016) pp. 144-152. 

[13] A. G. Steele, B. M. Wood, R. J. Douglas, Metrologia 38 (2001) 
pp. 483–8. 

[14] M. J. T. Milton, M G Cox, Metrologia 40 (2003) pp. L1-L2. 
[15] The Japan Society for Analytical Chemistry (JSAC), JSAC/PTP-

43: Report on the proficiency test in accordance with ISO/IEC 
17043, JSAC, Tokyo, 2015. 

[16] JSAC, JSAC/PTP-40: Report on the proficiency test in 
accordance with ISO/IEC 17043, JSAC, Tokyo, 2014.