Upsala J Med Sci 84: 37-46, 1979

Evaluation of a Computer Programme for Interpretation of 12-lead Electrocardiograms

Johan Landelius and Lars Nordgren

From the Department of Clinical Physiology, University Hospital, Uppsala, Sweden

ABSTRACT

A 12-lead electrocardiogram (ECG) interpretation programme (Cardionics, Brussels) was tested for computer diagnosis of ECG. The computer performance was evaluated on an ECG population from hospital patients by a method giving results which will allow the clinician to judge the usefulness of the computer diagnosis in clinical practice. Diagnoses made by experienced ECG readers were in essential agreement with the computer diagnoses in 83.5 % of 493 ECGs. Clinically significant disagreements due to differences in criteria occurred in 6.0 % of the tracings, whereas such disagreements due to programme errors were found in 10.5 %.

INTRODUCTION

Several computer programmes for interpretation of electrocardiograms (ECG) are available today. Evaluation of such programmes is time-consuming and in some aspects difficult, especially concerning their performance in the clinical situation. Firstly, it is essential to define the composition of the tested population of ECGs in terms of the prevalence of certain abnormalities. The ideal would be to have access to an archive of ECGs accepted generally by investigators and including ECGs with abnormalities due to diseases of a known type and at a known stage, as well as ECGs from normal subjects. The performance of a computer programme could then be determined in relation to a particular ECG abnormality, to a certain disease, and to a normal as well as a mixed ECG population.

Secondly, it is important to analyse disagreements between the computer diagnosis and the correct diagnosis, that is, the diagnosis made by manual interpretation, and to classify them into those due to differences in criteria and those due to programme errors.
The latter will consist either of mismeasurements or of deficient programme logic. As pointed out by Bailey et al. (1), criteria differences do not indicate any technical deficiency in the programme.

Thirdly, it is also necessary to consider the possibility of human reader errors and failures in the recording quality in studies of this kind.

MATERIAL AND METHODS

A material of 510 unselected ECGs was obtained at random from patients sent to our department from different wards and outpatient units of the University Hospital in Uppsala. The tracings were collected from consecutive cases examined in one of four routine ECG stations between June and September, 1974. The tracings represent a hospital population and many of the patients had cardiopulmonary diseases. Each recording consisted of the standard 12-lead system (I, II, III, aVR, aVL, aVF, V1-V6) and the orthogonal vectorcardiographic leads X, Y and Z as described by Frank (2).

A Cardionics cart was used for the recordings, which were made both on a strip chart and on high fidelity FM analogue tape. The tape recordings were sent to Cardionics, Brussels, where they were digitized and computerized by the Mount Sinai Hospital programme (1973), whereafter the computer printout of each ECG diagnosis as well as a D-A-converted printout of the same ECG tracing were returned to our department.

Six ECGs had to be excluded: two of them were lost for unknown reasons and four were lost in the technical process. Of the 504 ECGs remaining for further analysis, 10 were excluded because of faulty lead connections, as discussed later, and one was discarded because it could not be classified. This left 493 ECGs, obtained from 493 patients (281 male and 212 female). The age range of the men was 19 to 85 years (mean 58; S.D. 14) and of the women 17 to 89 years (mean 54; S.D. 17).

Each ECG was interpreted in detail by the authors, neither of whom was aware of the computer interpretation or of the interpretation of the other reader at the time of the first reading. The two manual interpretations of each ECG were then combined by the two readers to fit a set of ECG criteria common to the department and in agreement with international rules (3). This combined "departmental" interpretation is referred to in the following as the "reader diagnosis". The individual reader error appeared to be small but was not further analysed. The vectorcardiographic leads were not considered by the readers in their manual interpretation but were used by the computer programme, mainly to determine loop areas and vector angles.

The reader diagnosis was then compared with the diagnosis produced by the computer. The comparison was made essentially along the lines proposed by Bailey et al. (1). Their definitions for agreement and disagreement were modified as follows.

1. Reader-computer agreement. "Agreement" was defined as an identical interpretation by the reader and the computer (group A). No further analysis was performed.

2. Reader-computer disagreement. "Disagreement" was divided into (a) minor disagreement probably of no clinical importance (group B); (b) major disagreement probably of clinical importance (group C); and (c) major disagreement definitely of clinical importance (group D).

In each of groups B, C and D, the disagreements were classified into those due to criteria differences and those due to programme error. The complete set of criteria used by the computer programme was available.

Most of the abnormal ECGs in this study contained multiple abnormalities. In groups C and D, only those diagnostic statements which carried clinical significance were noted and submitted to statistical analysis.

As mentioned above, 10 tracings had faulty lead connections.
This was done intentionally by the clinical technologists in order to test the computer. In seven of these, two precordial leads had been exchanged and as a result the computer gave a false statement of anterior myocardial infarction. In the remaining three, extremity leads had been exchanged. Two of these were correctly identified as "lead wire inversion" but in the third case the computer report was: "unusual electrical axis, compatible with left ventricular hypertrophy". Thus the programme seemed to have difficulty in cases of faulty lead connections. This emphasizes the need for correct recording of the ECG, and also calls for supplementation of the programme logic.

STATISTICS

The following formulas were used, as proposed by Rautaharju et al. (4), where a = true positives, b = false positives, c = false negatives and d = true negatives:

Sensitivity (SE) = 100 a/(a+c)
Specificity (SP) = 100 d/(b+d)
Accuracy of positive tests (AP) = a/(a+b)
Accuracy of negative tests (AN) = d/(c+d)
Error ratio = (b+c)/a
Over-all diagnostic accuracy (DA) = \sum_{I=1}^{N} P(I) \, SE(I)

where
N = number of diagnostic test categories (I)
P(I) = fraction of statements in category (I)
SE(I) = fraction of correctly diagnosed statements in category (I)

None of these indices gives a fully representative and relevant picture of disagreement and failure. The basic limitation is that all events are classified as of equal importance, the more advanced concepts of statistical decision theory being disregarded. However, a more thorough discussion of statistical principles is beyond the aim of this study.

RESULTS

Out of 493 ECG tracings, the reader diagnosis in 234 ECGs comprised one or more clinically significant abnormalities (47.5 % abnormal and 52.5 % normal). The exact number of ECGs in each kind of reader statement is given in the Figures. The results are expressed in bar graphs.
The first bar in each figure shows the sensitivity of the computer diagnosis in relation to the reader diagnosis, and the third bar shows the specificity. The total number of diagnoses for each graph was 493, except for the graphs of myocardial infarction and arrhythmia, where the totals were 506 and 529 respectively, due to the occurrence of more than one statement for a given ECG.

In the final statistical analysis, groups A and B were both regarded as agreements, and groups C and D as disagreements.

Fig. 1. Left ventricular hypertrophy (LVH). In each bar graph the number (n=) written above the first bar represents the total number of times the readers made the diagnostic statement listed. The number (n=) over the middle bar represents the total number of times the programme made the diagnostic statement in question. The number (n=) over the third bar represents the total number of tracings in which the readers did not make the diagnostic statement under consideration. Each of the three bars may be subdivided into three segments. The bottom segment (no lines) indicates the per cent of cases in which the readers were in agreement with the computer programme. The middle segment (oblique lines) indicates the per cent of cases in which the readers and the programme disagreed due to criteria differences. The top segment (horizontal lines) indicates the percentage of cases in which disagreements between readers and computer programme resulted from programme errors (in this figure, none).

Fig. 2. Myocardial infarction. Explanation, see Fig. 1.

Left ventricular hypertrophy (LVH). There were 51 reader diagnoses of LVH.
The criteria used by the computer programme were very similar, although not identical, to those of the readers. There were no disagreements due to programme errors. The sensitivity was 92 % (47/51), and the specificity 99.5 % (440/442).

Right ventricular hypertrophy (RVH). A reader diagnosis of RVH was made in only five cases. Only two of these were correctly identified by the computer but the disagreements were due to criteria differences, the programme preferring statements such as "right intraventricular conduction delay". The specificity of the computer diagnoses was 99.5 % (486/488).

Myocardial infarction. The computer programme utilized several degrees of severity in its diagnostic language, based on the amplitude and width of Q waves, the location of Q, combinations of Q, R-wave progression and combinations of Q waves or Q equivalents and ST-T changes in different leads. In the reader diagnosis a two-graded scale of severity was used. This was based on much the same criteria but certain differences did exist. Programme errors, though rather infrequent, occurred more often for myocardial infarction than for hypertrophy. In one case the computer erroneously reported diaphragmatic myocardial infarction, based on a T wave obscured in both shape and amplitude by atrial flutter waves. In another case the computer did not measure S-T segment elevation correctly because S-T displacements were measured by a J point with a fixed time relation to the preceding QRS complex. The sensitivity of the computer diagnosis was 83 % (59/71) and its specificity 96.3 % (419/435). Programme errors were more common than criteria differences regarding diaphragmatic infarctions, while the reverse was true for anterior or antero-lateral infarctions.

Arrhythmia. There were no difficulties in diagnosing arrhythmia in the manual interpretations in any of the ECGs.
The computer reported "undetermined rhythm" whenever a tracing failed to satisfy the computer's logic or criteria for a specific rhythm diagnosis. This happened in several cases, mostly cases of atrial fibrillation, but these disagreements were usually classified as group B, as we felt the additional diagnoses in these cases to be clinically more significant and as the computer made correct statements of the latter diagnoses. However, in five cases of a false computer diagnosis of "undetermined rhythm", the report was classified as group C or D on the ground of mismeasurement. To summarize the over-all performance of the computer programme regarding arrhythmia, there were no disagreements due to criteria differences. The sensitivity and specificity of the computer diagnosis, determined by a two-group classification procedure, were 82 % (71/87) and 97.5 % (431/442), respectively. The over-all diagnostic accuracy, determined by a multi-group classification procedure, was 95.1 %.

Atrial fibrillation. Out of 41 tracings of atrial fibrillation, the computer programme correctly identified 31 (76 %). The specificity was 99.8 % (451/452). In eight cases the computer stated "sinus rhythm", mostly in combination with "sinus arrhythmia", "sinus arrest" or "supraventricular ectopic beats" and in two cases it stated "undetermined rhythm". The computer falsely reported atrial fibrillation in one case where the tracing showed a normal sinus rhythm suddenly changing into ectopic atrial rhythm.

Table 1. Frequency of different rhythm diagnoses made by reader

Sinus rhythm                      442
Ectopic atrial rhythm               4
AV nodal rhythm                     2
Atrial flutter                      4
Atrial fibrillation                41
Supraventricular ectopic beats     12
Ventricular ectopic beats          24
Total                             529

Other supraventricular arrhythmias. The computer programme falsely reported junctional rhythm or junctional tachycardia in eight cases of normal sinus rhythm, where the P waves were not detected. This gave a specificity of 98.4 % (483/491). Two actual cases of AV nodal rhythm were correctly stated as such. Concerning atrial flutter, the computer mismeasured two out of four cases but did not give any false positive answers.

Ventricular arrhythmia. There were 24 reader diagnoses of ventricular arrhythmia, i.e. premature ventricular contractions (PVC). The sensitivity of the computer was 88 % (21/24), the disagreements all being due to programme errors. The specificity was 99.8 % (468/469). In one case the computer measured an artifact and stated PVC. There were no ECGs with series of PVCs or more complex ventricular arrhythmia.

First degree AV block. The reader and the computer used the same criteria for the diagnosis of first degree AV block regarding the time interval. Nevertheless, three computer diagnoses were judged as false positives on the basis of criteria differences, because the computer stated "first degree AV block" in combination with "undetermined rhythm". The logical procedure would have been to suppress all AV block statements in the presence of the statement "undetermined rhythm". In one case the computer did not detect normal P waves. The sensitivity of the computer diagnosis was 97 % (37/38), and the specificity 98.9 % (450/455).

Second degree AV block. The computer did not recognize the two cases of atrial flutter. There were no other tracings containing second or third degree AV block.

Left bundle branch block (LBBB). There were 11 reader diagnoses of LBBB. The computer made one false positive and three false negative statements in this respect, giving a sensitivity of 73 % (8/11) and a specificity of 99.8 % (481/482). Two false negative reports were due to programme errors.

Fig. 3. Arrhythmia. Explanation, see Fig. 1.

Fig. 4. Primary S-T segment and T wave changes. Explanation, see Fig. 1.

Fig. 5. Over-all performance (tracings): programme errors 52/493, criteria differences 30/493, agreements 411/493. Explanation, see Fig. 1.

Fig. 6. Over-all performance (diagnostic statements). Explanation, see Fig. 1.

Right bundle branch block (RBBB). Out of ten cases of RBBB the computer misinterpreted one as RVH, due to criteria differences, giving a sensitivity of 90 % (9/10). No false positive reports were made.

Intraventricular conduction delay. Due to mismeasurement, the computer reported preexcitation in one tracing where the correct diagnosis should have been intraventricular conduction delay. In 11 tracings out of 12 the reader and computer diagnoses agreed, the sensitivity of the computer diagnosis being 92 % and its specificity 99.6 % (479/481).

Primary S-T segment and T wave changes. The sensitivity of the computer diagnosis was 83.4 % (161/193) and its specificity 98.0 % (294/300). Most disagreements were due to programme errors, where the computer seemed to be inaccurate in detecting the shape of the S-T segment and also abnormal S-T depressions or elevations. The reason for this might have been that judgements concerning the S-T segment are based on deviations from a J point fixed in temporal relation to the preceding QRS complex.
The computer also had difficulties regarding the shape of the T wave, especially concerning coupling between this shape and slight changes in amplitude.

Miscellaneous. There were 11 cases with left anterior hemiblock, for all of which complete agreement was found, and there were no false positive reports. This was also the result for 24 cases of marked left axis deviation, two cases of right axis deviation, two cases of counter-clockwise rotation and three cases of clockwise rotation. There were five pacemaker ECGs. In two of these there was considered to be disagreement, on criterional grounds, as the computer did not report the presence of spontaneous beats, which in one case revealed extreme T-wave inversion in precordial leads.

Over-all performance. In 411 cases (83.5 %) the readers and the computer were in agreement, while disagreements were noted in the remaining 82 cases. Detailed analysis revealed that in 6.0 % (30/493) these disagreements were based upon the use of different diagnostic criteria. In the remaining 52 cases (10.5 %) they resulted from programme errors such as mismeasurement, pattern recognition failures or deficient programme logic. Regarding the over-all efficiency of the computer diagnosis, as expressed by a two-group classification procedure, the sensitivity was 83.5 % (355/425) and the specificity 82.6 % (214/259). The error ratio was 0.3, the accuracy of positive tests (AP) 0.9 and the accuracy of negative tests (AN) 0.8. When expressed by a multi-group classification procedure, the over-all diagnostic accuracy (DA) was 83.5 %.

DISCUSSION

There are limitations to the use of the present ECG population for evaluating a computer programme. Some diagnoses were very uncommon and some were not represented at all. However, one aim was to test the computer against a representative sample of our own day-by-day routine clinical population of ECGs.
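
As a cross-check, the two-group indices defined under STATISTICS can be recomputed from the over-all counts given in RESULTS. The following is a minimal Python sketch and not part of the original study. The names Counts and indices are hypothetical, and the false-positive and false-negative counts (b = 259 - 214, c = 425 - 355) are derived here from the reported fractions 355/425 and 214/259, on the assumption that these fractions are a/(a+c) and d/(b+d) respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Two-group classification counts (hypothetical helper type)."""
    a: int  # true positives
    b: int  # false positives
    c: int  # false negatives
    d: int  # true negatives


def indices(n: Counts) -> dict:
    """Indices as written under STATISTICS; SE and SP are in per cent."""
    return {
        "SE": 100 * n.a / (n.a + n.c),         # sensitivity
        "SP": 100 * n.d / (n.b + n.d),         # specificity
        "AP": n.a / (n.a + n.b),               # accuracy of positive tests
        "AN": n.d / (n.c + n.d),               # accuracy of negative tests
        "error_ratio": (n.b + n.c) / n.a,      # (b+c)/a
    }


# Over-all diagnostic statements: 355 of 425 abnormal statements and
# 214 of 259 normal statements agreed (b and c are derived, see lead-in).
overall = Counts(a=355, b=259 - 214, c=425 - 355, d=214)

if __name__ == "__main__":
    for name, value in indices(overall).items():
        print(f"{name}: {value:.2f}")
```

Rounded to the precision used in the text, these values correspond to a sensitivity of 83.5 %, a specificity of 82.6 %, AP 0.9, AN 0.8 and an error ratio of 0.3.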

The method of statistical analysis of the result also has its limitations, mainly in that no differential weight is given to different ECG diagnoses. Some differentiation in this respect was made in our investigation, however, since the ECGs were initially classified according to their clinical significance in a broad sense. With the present method of analysis, unnecessary discussion about diagnostic criteria is avoided. Discrepancies regarding criteria can of course influence the decision whether to use the services of a computer programme or not. On the other hand, such difficulties can be overcome in cooperation with the computer programme constructor. More serious are disagreements due to programme errors, where the user has to find out firstly, whether the programme is adequate for the kind of subject population in question and secondly, whether the rate of programme errors can be accepted.

Some authors have not only used sensitivity and specificity, as defined above, as measures of diagnostic efficiency, but have also used what is often called mean performance (MP) and association index (AI), where MP = (SE+SP)/2 and AI = SE+SP-100. Rautaharju et al. (4) have claimed that all of these concepts are often misunderstood in the sense that they have not been validated against the composition of the test ECG population used. We feel that MP and AI are not of much value in determining the efficiency of the present computer programme as used on the present population.

The sensitivity of the computer diagnosis was the same whether two-group or multi-group classification procedures were used, and the values were fairly satisfactory. Likewise, the specificity was acceptable and the values were within the range commonly reported in other similar studies of computer programme performance. The error ratio (0.3) was not satisfactory, however, as it implied that one-third of the diagnostic statements made by the computer were false.
The accuracy of positive tests was acceptable but the accuracy of negative tests was disappointing and reflected an inability to detect arrhythmias and ST-T aberrations, in particular. We consider that the results of this study adequately describe the performance of the tested computer programme as applied to the type of population of ECGs used here.

ACKNOWLEDGEMENT

The authors are indebted to Mr. Wauquez, Cardionics, Brussels, for his kind help in providing us with the equipment and computer services necessary for this study.

REFERENCES

1. Bailey, J.J., Itscoitz, S.B., Hirschfield Jr, J.W., Grauer, L.E. & Horton, M.R.: A method for evaluating computer programs for electrocardiographic interpretation. I. Application to the experimental IBM program of 1971. Circulation 50:73-79, 1974.
2. Frank, E.: An accurate, clinically practical system for spatial vectorcardiography. Circulation 13:737-749, 1956.
3. Goldman, M.J.: Principles of electrocardiography. Lange Medical Publications, Los Altos, California, 1973.
4. Rautaharju, P.M., Blackburn, H.W. & Warren, J.W.: The concepts of sensitivity, specificity and accuracy in evaluation of electrocardiographic, vectorcardiographic and polarocardiographic criteria. J Electrocardiology 9:275-281, 1976.

Received in final form November 23, 1978

Address for reprints:
Dr Johan Landelius
Dept of Clinical Physiology
University Hospital
S-750 14 Uppsala
Sweden