Count data analysis



An example of count data analysis

Gilbert Berdine MD a, Shengping Yang PhD b


Correspondence to Gilbert Berdine MD   

Email: Gilbert.Berdine@ttuhsc.edu


Author Affiliation

a a pulmonary physician in the Department of Internal Medicine, Texas Tech University Health Sciences Center (TTUHSC), Lubbock, TX

b a biostatistician in the Department of Pathology at TTUHSC, Lubbock, TX

 
SWRCCC 2015;3(11):55-59

doi:10.12746/swrccc2015.0310.149

 

Previously, we introduced Poisson and negative binomial regression for modeling count data. Here we use an example to demonstrate how to perform such analyses with SAS software.


A new oral antibiotic drug, Gorilacillin, was developed and has had excellent effects in several clinical trials. Gorilacillin has two main side effects: rash and elevated liver function tests (ELFT). To evaluate whether different patient groups have different risks of these side effects, a Gorilacillin side effect study was conducted. A total of 5,275 participants were recruited from five participating countries. All patients were followed for up to one month, and the number of patients who developed rash or ELFT was recorded. The goal of the study was to investigate whether the risk of developing side effects differed significantly among patients in different age groups.


Data from the study were collected and stored in an Excel table (see below for a partial view of the table).


Country          Age        # Patients   # Rash   # ELFT
Great Britain    (0-4)      65           1        0
Great Britain    (5-9)      18           0        0
Great Britain    (10-14)    229          1        1
Great Britain    (15-19)    59           1        1
Great Britain    (20-24)    65           0        0
Great Britain    (25-29)    49           2        0
…

A Poisson or negative binomial regression is typically used to analyze such data because the outcome is a count of side effects and the probability of developing a side effect is low (rare events). Because rash and ELFT are the two main side effects of Gorilacillin, an event is declared if a patient develops either rash or ELFT. Note that the numbers of patients differ among age groups and countries; it therefore makes sense to model the side effect rate per patient as a function of age group and country, which adjusts for the differences in the number of patients.
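As a minimal sketch of the rate calculation, the Python snippet below computes crude side effect rates per patient for the six Great Britain rows shown in the partial table. It assumes, for illustration only, that no patient developed both rash and ELFT, so that the event count is the sum of the two columns; the variable and function names are ours and are not part of the study.

```python
# Rows from the partial table above: (age group, patients, rash, ELFT).
rows = [
    ("0-4", 65, 1, 0),
    ("5-9", 18, 0, 0),
    ("10-14", 229, 1, 1),
    ("15-19", 59, 1, 1),
    ("20-24", 65, 0, 0),
    ("25-29", 49, 2, 0),
]


def crude_rate(patients, rash, elft):
    # Assumption: no patient had both side effects, so events = rash + ELFT.
    return (rash + elft) / patients


rates = {age: crude_rate(n, rash, elft) for age, n, rash, elft in rows}
for age, rate in rates.items():
    print(f"Great Britain {age}: {rate:.4f} events per patient")
```

Groups with few patients produce very coarse rates (a single event in 18 patients would already be 5.6%), which is one reason to model the counts with the group size taken into account rather than comparing raw rates.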


A scatter plot can usually help visualize potential relationships. As we can see from Figure 1, there does not seem to be a strong relationship between side effect rate and age group. The side effect rates are also quite similar among countries.


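To apply a Poisson regression, we have

$$\log\left(\frac{\mu_{ij}}{n_{ij}}\right) = \beta_0 + \alpha_i + \gamma_j,$$

where $\mu_{ij}$ is the expected number of events and $n_{ij}$ is the number of patients in age group $i$ and country $j$, $\alpha_i$ is the effect of age group $i$, and $\gamma_j$ is the effect of country $j$.

To adjust for the numbers of patients in different age groups/countries, we use the rate of side effects (the expected number of events divided by the number of patients) as the outcome variable. Equivalently, the above equation can also be written as

$$\log(\mu_{ij}) = \log(n_{ij}) + \beta_0 + \alpha_i + \gamma_j,$$

where the additional term on the right-hand side, $\log(n_{ij})$, is called an offset.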
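The equivalence between modeling the rate and modeling the count with an offset can be checked numerically. The sketch below is only an illustration: it plugs in the Poisson estimates reported later in this article for the intercept, the age 0-4 group, and Great Britain, together with the 65 patients in that group, and computes the expected number of events both ways. The variable names are ours.

```python
import math

# Poisson estimates copied from the parameter table reported later in this article.
intercept = -3.8135
age_0_4 = -0.5275
great_britain = -0.1291
n = 65  # patients in the Great Britain, age 0-4 group

eta = intercept + age_0_4 + great_britain  # linear predictor on the log scale

# Form 1: log(mu / n) = eta, so mu = n * exp(eta).
mu_rate_form = n * math.exp(eta)

# Form 2: log(mu) = log(n) + eta, with log(n) entering as an offset.
mu_offset_form = math.exp(math.log(n) + eta)

print(f"fitted rate per patient: {math.exp(eta):.5f}")
print(f"expected events: {mu_rate_form:.3f} (rate form), {mu_offset_form:.3f} (offset form)")
```

Both forms give the same expected count because the offset has a fixed coefficient of 1; it is not estimated from the data.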
Corresponding to the above two models, there are two equivalent SAS statements:


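proc genmod data=data;
    class country age;
    /* outcome written as the rate: incident events out of n patients */
    model incident/n = age country / dist=poisson link=log;
    /* rate estimates by age group, with confidence limits and pairwise differences */
    lsmeans age / ilink diff cl;
run;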

 
Or equivalently,


proc genmod data=data;
    class country age;
    /* logn is a data set variable holding log(n), used as the offset */
    model incident = age country / offset=logn dist=poisson link=log;
    lsmeans age / ilink diff cl;
run;

 
Note that, in the second statement, logn is a variable equal to log(n), where n is the number of patients in a specific group; it must be present in the data set. The lsmeans statement can be used to obtain the side effect rate estimates for the 10 age groups, averaged over countries. The ilink option specifies that the inverse link function be used to calculate the rate estimates, and the cl option produces the confidence intervals. In addition, the diff option provides all pairwise comparisons of side effect rates among age groups.
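To make the roles of the offset and the log link concrete, the following sketch fits the simplest possible version of the model, an intercept with an offset and no covariates, by Newton-Raphson in plain Python. It is not what PROC GENMOD does internally in full detail and it is not the model fitted in this article; it uses only the six Great Britain rows of the partial table and assumes that no patient had both side effects. For this intercept-only case the maximum likelihood rate is known to equal total events divided by total patients, which the code confirms.

```python
import math

# Great Britain rows from the partial table; events = rash + ELFT
# (assumes no patient had both side effects).
patients = [65, 18, 229, 59, 65, 49]
events = [1, 0, 2, 2, 0, 2]


def fit_intercept_only(y, n, iterations=25):
    """Newton-Raphson for log(mu_i) = log(n_i) + b0 under a Poisson likelihood."""
    b0 = math.log(0.01)  # arbitrary starting value
    for _ in range(iterations):
        mu = [ni * math.exp(b0) for ni in n]
        score = sum(yi - mi for yi, mi in zip(y, mu))  # first derivative of log-likelihood
        information = sum(mu)  # negative second derivative
        b0 += score / information
    return b0


b0_hat = fit_intercept_only(events, patients)
fitted_rate = math.exp(b0_hat)  # inverse of the log link
pooled_rate = sum(events) / sum(patients)
print(f"fitted rate {fitted_rate:.5f}, pooled rate {pooled_rate:.5f}")
```

Applying the exponential to the fitted intercept is the same back-transformation that the ilink option requests for the least squares means.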

Analysis Of Maximum Likelihood Parameter Estimates

Parameter                   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept                   1    -3.8135    0.3599           -4.5189    -3.1081           112.28            <.0001
Age         (0-4)           1    -0.5275    0.4444           -1.3984     0.3435           1.41              0.2352
Age         (10-14)         1    -0.8328    0.6796           -2.1647     0.4992           1.50              0.2204
Age         (15-19)         1     0.0258    0.3575           -0.6749     0.7266           0.01              0.9424
Age         (20-24)         1    -0.3447    0.4277           -1.1830     0.4936           0.65              0.4203
Age         (25-29)         1     0.6658    0.4733           -0.2618     1.5934           1.98              0.1595
Age         (30-34)         1    -0.4745    0.5735           -1.5986     0.6495           0.68              0.4080
Age         (35-39)         1    -0.3067    0.6396           -1.5603     0.9469           0.23              0.6316
Age         (40-44)         1    -0.4021    0.5016           -1.3853     0.5810           0.64              0.4227
Age         (45-49)         1    -0.9098    0.5660           -2.0192     0.1995           2.58              0.1080
Age         (5-9)           0     0.0000    0.0000            0.0000     0.0000           .                 .
country     Great Britain   1    -0.1291    0.4409           -0.9931     0.7350           0.09              0.7697
country     India           1    -0.2636    0.3141           -0.8792     0.3520           0.70              0.4013
country     Japan           1    -0.1973    0.3572           -0.8974     0.5027           0.31              0.5806
country     Turkey          1    -0.1805    0.4629           -1.0877     0.7268           0.15              0.6967
country     United States   0     0.0000    0.0000            0.0000     0.0000           .                 .
Scale                       0     1.0000    0.0000            1.0000     1.0000

The SAS output table above shows that no age group differed significantly in Gorilacillin side effect rate from the reference group (age 5-9), and that no country differed significantly from the reference country (United States).
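The Wald columns of the parameter table can be reproduced from the estimate and its standard error alone. The sketch below does this for the age (0-4) row, using the usual normal approximation; the function name and the rounded critical value 1.959964 are ours, and small differences from the printed output are expected because the inputs are already rounded to four decimals.

```python
import math


def wald_summary(estimate, std_error, z_crit=1.959964):
    """Wald chi-square, two-sided p-value, and 95% limits for one coefficient."""
    z = estimate / std_error
    chi_square = z * z
    # Two-sided normal p-value; identical to the chi-square (1 df) upper tail.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    return chi_square, p_value, lower, upper


# Age (0-4) row of the Poisson table: estimate -0.5275, standard error 0.4444.
chi_square, p_value, lower, upper = wald_summary(-0.5275, 0.4444)
print(f"chi-square={chi_square:.2f}, p={p_value:.4f}, limits=({lower:.4f}, {upper:.4f})")
```

Because the confidence limits include 0 on the log scale, the rate ratio relative to the reference group is compatible with 1.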


Age Least Squares Means

Age       Estimate   Standard Error   z Value   Pr > |z|   Alpha   Lower     Upper     Mean       Standard Error of Mean   Lower Mean   Upper Mean
(0-4)     -4.4951    0.3556           -12.64    <.0001     0.05    -5.1921   -3.7981   0.01116    0.003970                 0.005560     0.02241
(5-9)     -3.9676    0.2778           -14.28    <.0001     0.05    -4.5121   -3.4231   0.01892    0.005256                 0.01098      0.03261
(10-14)   -4.8004    0.6035           -7.95     <.0001     0.05    -5.9832   -3.6175   0.008227   0.004965                 0.002521     0.02685
(15-19)   -3.9418    0.2602           -15.15    <.0001     0.05    -4.4517   -3.4319   0.01941    0.005051                 0.01166      0.03233
(20-24)   -4.3123    0.3369           -12.80    <.0001     0.05    -4.9726   -3.6520   0.01340    0.004516                 0.006925     0.02594
(25-29)   -3.3018    0.3809           -8.67     <.0001     0.05    -4.0483   -2.5553   0.03682    0.014020                 0.01745      0.07767
(30-34)   -4.4421    0.5144           -8.64     <.0001     0.05    -5.4503   -3.4340   0.01177    0.006055                 0.004295     0.03226
(35-39)   -4.2743    0.5877           -7.27     <.0001     0.05    -5.4261   -3.1224   0.01392    0.008182                 0.004400     0.04405
(40-44)   -4.3698    0.4182           -10.45    <.0001     0.05    -5.1894   -3.5501   0.01265    0.005292                 0.005575     0.02872
(45-49)   -4.8775    0.5061           -9.64     <.0001     0.05    -5.8693   -3.8856   0.007616   0.003854                 0.002825     0.02054


From the lsmeans estimates, we see that the estimated side effect rate for patients 0-4 years old was 1.1% (the Mean column in the table above) with a 95% confidence interval of (0.6%, 2.2%) (the Lower Mean and Upper Mean columns), and for patients 5-9 years old it was 1.9% with a 95% confidence interval of (1.1%, 3.3%), etc.
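The Mean, Lower Mean, and Upper Mean columns are the log-scale columns passed through the inverse link, which for the log link is the exponential function. The sketch below reproduces the age (0-4) row; the standard error of the mean is obtained with the delta method, which matches the printed value to the displayed precision. The function name is ours.

```python
import math


def inverse_link(estimate, std_error, lower, upper):
    """Back-transform a log-scale least squares mean to the rate scale."""
    mean = math.exp(estimate)
    # Delta method: SE(exp(x)) is approximately exp(x) * SE(x).
    se_mean = mean * std_error
    return mean, se_mean, math.exp(lower), math.exp(upper)


# Age (0-4) row of the least squares means table.
mean, se_mean, lower_mean, upper_mean = inverse_link(-4.4951, 0.3556, -5.1921, -3.7981)
print(f"rate={mean:.5f}, SE={se_mean:.6f}, limits=({lower_mean:.6f}, {upper_mean:.5f})")
```

Note that the interval is symmetric on the log scale but not on the rate scale, which is why the upper limit lies farther from the estimate than the lower limit.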


The diff option provides all pairwise comparisons, should such comparisons be of interest (the table below shows part of the comparisons).


Differences of Age Least Squares Means

Age       _Age      Estimate   Standard Error   z Value   Pr > |z|   Alpha   Lower     Upper
(0-4)     (10-14)    0.3053    0.7116            0.43     0.6679     0.05    -1.0894    1.7000
(0-4)     (15-19)   -0.5533    0.4237           -1.31     0.1916     0.05    -1.3837    0.2771
(0-4)     (20-24)   -0.1828    0.4870           -0.38     0.7074     0.05    -1.1372    0.7716
(0-4)     (25-29)   -1.1933    0.5264           -2.27     0.0234     0.05    -2.2250   -0.1616
(0-4)     (30-34)   -0.05294   0.6021           -0.09     0.9299     0.05    -1.2330    1.1271
(0-4)     (35-39)   -0.2208    0.6694           -0.33     0.7415     0.05    -1.5329    1.0912
(0-4)     (40-44)   -0.1253    0.5363           -0.23     0.8152     0.05    -1.1764    0.9257
(0-4)     (45-49)    0.3824    0.6129            0.62     0.5327     0.05    -0.8189    1.5837
(0-4)     (5-9)     -0.5275    0.4444           -1.19     0.2352     0.05    -1.3984    0.3435
(10-14)   (15-19)   -0.8586    0.6681           -1.29     0.1988     0.05    -2.1681    0.4509
…
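Differences of least squares means are on the log scale, so exponentiating them gives rate ratios. The sketch below converts the (0-4) versus (25-29) row, the only row shown with an unadjusted p value below 0.05, into a rate ratio with its confidence interval. As an aside that is our own illustration and not part of the article's analysis, it also computes a Bonferroni threshold for all pairwise comparisons among 10 age groups; Bonferroni is one common, conservative adjustment among several.

```python
import math

# (0-4) vs (25-29) row of the differences table (log scale).
diff, lower, upper, p_value = -1.1933, -2.2250, -0.1616, 0.0234

rate_ratio = math.exp(diff)
ratio_ci = (math.exp(lower), math.exp(upper))

# Number of pairwise comparisons among 10 age groups, and a Bonferroni
# threshold for an overall alpha of 0.05 (our illustrative choice).
n_comparisons = math.comb(10, 2)
bonferroni_alpha = 0.05 / n_comparisons
survives_bonferroni = p_value < bonferroni_alpha

print(f"rate ratio {rate_ratio:.3f}, 95% CI ({ratio_ci[0]:.3f}, {ratio_ci[1]:.3f})")
print(f"{n_comparisons} comparisons, threshold {bonferroni_alpha:.5f}, "
      f"significant after adjustment: {survives_bonferroni}")
```

With this many comparisons, a few unadjusted p values below 0.05 are expected by chance alone, which is in line with the article's point about apparent outliers among rare events.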


Recall that we previously explained that a negative binomial regression model might be more appropriate if the data are overdispersed. An easy way to test for overdispersion is to fit a negative binomial regression with the scale=0 and noscale options in the model statement. These options test whether overdispersion of the form μ + kμ^2 exists (that is, whether the variance exceeds the Poisson variance μ) by testing whether the dispersion parameter k equals 0.


proc genmod data=data;
    class country age;
    /* scale=0 with noscale: test whether the dispersion parameter equals 0 */
    model incident/n = age country / dist=nb link=log scale=0 noscale;
run;

 
Lagrange Multiplier Statistics

Parameter    Chi-Square   Pr > ChiSq
Dispersion   8309.4881    <.0001*

* One-sided p-value


The p value of the overdispersion test above is less than 0.0001, which indicates overdispersion; thus it is appropriate to use a negative binomial regression.
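To see what the variance form μ + kμ^2 implies, the short sketch below compares the Poisson variance with the negative binomial variance at a few expected counts. The dispersion value is the estimate reported in the negative binomial table of this article; the expected counts are illustrative values chosen by us, not values from the study.

```python
def poisson_variance(mu):
    """Variance implied by a Poisson model: equal to the mean."""
    return mu


def negative_binomial_variance(mu, k):
    """Variance implied by the negative binomial model: mu + k * mu^2."""
    return mu + k * mu ** 2


# Dispersion estimate reported in the negative binomial table of this article.
k_hat = 1.0492

for mu in (0.5, 1.0, 5.0):  # illustrative expected counts, chosen by us
    print(f"mu={mu}: Poisson variance {poisson_variance(mu):.3f}, "
          f"negative binomial variance {negative_binomial_variance(mu, k_hat):.3f}")
```

When k equals 0 the two variances coincide, which is exactly the hypothesis examined by the test above; the extra variance grows with the square of the expected count, so it matters most for groups with larger expected numbers of events.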

 
Analysis Of Maximum Likelihood Parameter Estimates

Parameter                   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept                   1    -3.6458    0.3249           -4.2826    -3.0091           125.94            <.0001
Age         (0-4)           1    -0.5332    0.4053           -1.3276     0.2612           1.73              0.1883
Age         (10-14)         1    -0.6927    0.5926           -1.8542     0.4688           1.37              0.2424
Age         (15-19)         1    -0.1166    0.3365           -0.7761     0.5429           0.12              0.7289
Age         (20-24)         1    -0.4504    0.3971           -1.2287     0.3279           1.29              0.2567
Age         (25-29)         1     0.6895    0.4348           -0.1627     1.5417           2.51              0.1128
Age         (30-34)         1    -0.4385    0.5165           -1.4509     0.5738           0.72              0.3959
Age         (35-39)         1     0.4003    0.4448           -0.4715     1.2721           0.81              0.3681
Age         (40-44)         1    -0.3026    0.4460           -1.1768     0.5715           0.46              0.4974
Age         (45-49)         1    -0.3803    0.4186           -1.2009     0.4402           0.83              0.3636
Age         (5-9)           0     0.0000    0.0000            0.0000     0.0000           .                 .
country     Great Britain   1    -0.1721    0.4080           -0.9717     0.6275           0.18              0.6731
country     India           1    -0.2972    0.2858           -0.8573     0.2629           1.08              0.2983
country     Japan           1     0.0361    0.3020           -0.5558     0.6280           0.01              0.9048
country     Turkey          1    -0.1599    0.4131           -0.9695     0.6498           0.15              0.6987
country     United States   0     0.0000    0.0000            0.0000     0.0000           .                 .
Dispersion                  1     1.0492    0.0141            1.0219     1.0773

The result from the negative binomial regression (table above) is similar to that from the Poisson regression. We did not detect any difference in side effect rate between the reference and the other age groups. Looking at the raw data in the scatter plot (Figure 1), one might think that Americans aged 25-29 were at increased risk for adverse effects of the drug, but the statistical analysis shows the result to be within the 95% confidence limits expected from purely random variation relative to the reference group. The American 25-29 data point appears, at first glance, to be an outlier reflecting some non-random effect, but it is in fact consistent with purely random variation around the other data points.

The statistical analysis is consistent with the reality of the situation. Gorilacillin does not exist, and the data were simulated by sampling rare events from an online game whose developers assure us that the events are, indeed, random. The game has generated all sorts of “theories” about how to elicit these rare events more often, but the statistical analysis shows these “theories” to be no more substantial than the apparent excess risk among Americans aged 25-29. This example illustrates how rare events can seem to generate “outliers” that are merely the result of small samples and low event rates.
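The way purely random rare events produce apparent outliers can be illustrated with a small simulation. The sketch below is not the procedure used to generate the study data; it is a hypothetical setup of our own in which 50 groups (10 age groups by 5 countries) of 105 patients each all share the same true event rate of 1.5%. The group size, true rate, and seed are assumptions chosen for illustration.

```python
import random


def simulate_group_rates(n_groups=50, group_size=105, true_rate=0.015, seed=1):
    """Simulate event counts for groups that all share the same true rate."""
    rng = random.Random(seed)
    counts = [
        sum(rng.random() < true_rate for _ in range(group_size))
        for _ in range(n_groups)
    ]
    rates = [count / group_size for count in counts]
    overall = sum(counts) / (n_groups * group_size)
    return overall, max(rates), min(rates)


overall_rate, highest_rate, lowest_rate = simulate_group_rates()
print(f"overall rate {overall_rate:.4f}")
print(f"highest group rate {highest_rate:.4f}, lowest group rate {lowest_rate:.4f}")
```

Even though every simulated group has the same underlying risk, the highest observed group rate is typically several times the overall rate, which is the pattern that tempts readers to see an outlier.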

Rare events are often hard to observe, and it might take quite some time before a single event is observed. If feasible, an alternative strategy for studying the association between a rare event and potential risk factors is to collect data retrospectively: for example, identify the patients who had the event, match them with patients who did not, and then collect the necessary data and perform the analysis.




Published electronically:  7/15/2015

Conflict of Interest Disclosures: none

 