Papers in Physics, vol. 7, art. 070018 (2015) Received: 2 November 2015, Accepted: 27 November 2015 Edited by: R. Dickman Reviewed by: M. Hutter, Australian National University, Canberra, Australia. Licence: Creative Commons Attribution 3.0 DOI: http://dx.doi.org/10.4279/PIP.070018 www.papersinphysics.org ISSN 1852-4249 Bayesian regression of piecewise homogeneous Poisson processes Diego J. R. Sevilla1∗ In this paper, a Bayesian method for piecewise regression is adapted to handle counting processes data distributed as Poisson. A numerical code in Mathematica is developed and tested analyzing simulated data. The resulting method is valuable for detecting breaking points in the count rate of time series for Poisson processes. I. Introduction Bayesian statistics have revolutionized data anal- ysis [1]. Techniques like the Generalized Lomb- Scargle Periodogram [2] allow us to obtain oscilla- tion frequencies of time series with unprecedented accuracy. The Gregory and Loredo method [3] goes further allowing us to find and characterize periodic signals of any period and shape. To detect non-periodical variations, the Ex- act Bayesian Regression of Piecewise Constant Functions by Marcus Hutter (hereafter Hutter’s method) [4] is valuable. It permits to estimate the most probable partition of a data set in seg- ments of constant signals, determining the num- ber of segments and their borders, and in-segments means and variances. Hutter’s method works with two continuous distributions: Normal, and Cauchy-Lorentz. The latter —the canonical ex- ample of a pathological distribution with unde- fined moments—, is also suitable to analyze data with other symmetric probability distributions, es- pecially with heavy tails. In the case of counting processes, especially for ∗E-mail: dsevilla@fceia.unr.edu.ar 1 Departamento de F́ısica y Qúımica, Escuela de For- mación Básica. Facultad de Ciencias Exactas, Ingenieŕıa y Agrimensura. Universidad Nacional de Rosario, Av. Pellegrini 250, S2000BTP Rosario, Argentina. low rates, when data consist in non-negative small integers, methods specially designed to discrete probability distributions are necessary. Some re- gression methods, specially for non-homogeneous Poisson processes [5], were developed. In this paper, Hutter’s method is adapted for an- alyzing data distributed as Poisson. The results are summarized in a code in Mathematica [6]. It can be used to analyze data of several physical processes which follow the Poisson distribution (e.g., detec- tion of photons in X-ray Astronomy, particles in nuclear disintegration, etc.), if sudden changes in detection rates are suspected. II. Method Hutter’s method is summarized in Table 1 of Ref. [4] in a pseudo C code which is divided in two blocks. The first one calculates moments Akij with k = 0, 1, 2 of the PDF of the statistical models for segments of data Dij := {ni+1, . . . ,nj}. The sec- ond one performs the regression from moments Akij. The code developed in this work is divided in three blocks. As the members of the Poisson distributions fam- ily are identified by one parameter -the mean rate r of the Poisson process-, the PDF of the models 070018-1 Papers in Physics, vol. 7, art. 070018 (2015) / D. J. R. Sevilla for a segment Dij is [1] P(r|Dij,I) = P(r|I) P(Dij|r) P(Dij) , (1) where P(r|I) is the prior of parameter r, P(Dij|r) is the likelihood of segment Dij for a given r, P(Dij) is the global likelihood of the family, and I represents a prior information. Usually, the prior information consists of global quantities calculated from D := D0N , i.e., from all the data set. For Poisson processes, only one quantity is necessary: the mean rate r̂. Considering the conjugate prior of the Poisson distribution [7], the prior results P(r|r̂) = rr̂−1 e−r Γ(r̂) . (2) For a Poisson process with rate r, the likelihood of a segment Dij is P(Dij|r) = j∏ t=i+1 rnt e−r nt! . (3) So, the moments of the posterior can be ex- pressed in an analytical form Akij = Γ(k + r̂ + ∑j t=i+1 nt) ∏j t=i+1 1 nt! Γ(r̂) (j − i + 1)k+r̂+ ∑j t=i+1 nt . (4) Code block 1 calculates Akij. It needs as input the time series to be analyzed (list data). The output are functions A0[i,j], A1[i,j] and A2[i,j] and integer n, which is the length of data. Code block 1: Mathematica code to calculate Akij. n=Length[data]; r=Mean[data]; Do[ Do[ d=j-i; m=r+Sum[data[[t]],{t,i+1,j}]; A0[i,j]=(m -1)!/( Gamma[r]*(d+1)^m)* Product [1/ data[[t]]!,{t,i+1,j}]; A1[i,j]=m*A0[i,j]/(d+1); A2[i,j]=(m+1)*A1[i,j]/(d+1); ,{j,i+1,n}]; ,{i,0,n}]; As the second block of Hutter’s code only needs the moments Akij as inputs, it could work properly with no changes. It computes the evidence, the probability for k segments and its MAP estimation k̂, the probability of boundaries locations and the MAP locations of the k̂ boundaries, the first and second in-segment moments, and an interesting re- gression curve that smooths the final result. Nevertheless, for our specific problem, once the segments boundaries are obtained, we can estimate their means and variances straightforwardly, so we only use a part of Hutter’s second block, which is shown in code block 2. The logical of the al- gorithm is explained in Ref. [4]. Code block 2 needs as inputs A0[i,j], A1[i,j], A2[i,j] and n, all calculated in code block 1, and integer kmax, which is the maximum number of segments to be considered. The outputs are the evidence (e), the probability for k segments (c[k]), its MAP (khat), the probability of boundaries locations (B[i]), and their MAP (that[p]). Code block 2: Mathematica code to calculate breaking points. Do[ L[0,i]= KroneckerDelta[i,0]; R[0,i]= KroneckerDelta[i,n]; ,{i,0,n}]; Do[ Do[ L[k+1,i]=Sum[L[k,h]*A0[h,i],{h,k,i-1}]; R[k+1,i]=Sum[R[k,h]*A0[i,h],{h,i+1,n-k}]; ,{i,0,n}] ,{k,0,kmax -1}]; e=1/ kmax*Sum[L[k,n]/ Binomial[n-1,k-1],{k,1,kmax }]; Do[ c[k]=L[k,n]/( Binomial[n-1,k-1]* kmax*e) ,{k,1,kmax }]; khat =1; Do[If[c[khat]reg[[BP[k]]],1,-1]; BP1[k]=BP[k]+s*ethat[k]; BP2[k]=BP[k]-s*ethat[k]; ,{k,1,NBP -1}]; re1=Flatten[Table[Table[(nn[k]-Sqrt[nn[k]])/mm[k] ,{BP1[k]-BP1[k-1]}] ,{k,1,NBP }]]; re2=Flatten[Table[Table[(nn[k]+Sqrt[nn[k]])/mm[k] ,{BP2[k]-BP2[k-1]}] ,{k,1,NBP }]]; III. Applications and Discussion Figure 1 (top) shows, in blue dots, data simulated using Mathematica. Data consist of 150 Poisson distributed elements, the first 50 with rate 1.5, the second 50 with rate 0.5, and the last 50 with rate 1.0. Applying the first 2 blocks of code on data, 0 50 100 150 0 1 2 3 4 5 0 1 Element Number C o u n ts �� B o u n d a r y P r o b . Figure 1: Top: Simulated data (blue dots) and boundaries location probability (red line). Bottom: Regression curve and its error estimation (black dashed line and gray zone), and the rate curve used in simulation (blue line). we can see that the probability of having 2 break- ing points is very high. Figure 1 (top) also shows, in red line, the probability for the boundaries loca- tions. Applying code block 3, we obtain the regres- sion [Fig. 1 (bottom), black dashed curve] and its error estimation [Fig. 1 (bottom), gray zone]. The continuous blue line in Fig. 1 (bottom) indicates the rates used in simulation. The regression in the example above fits very well with the rate curve used in simulation. But sometimes regressions result qualitatively different to the rate curve, showing more or less breaking points, even for data simulated in the same con- ditions. This effect is due to chance. To show this issue, 2000 simulations with the same condi- tions were performed. In 992 of them, two break- ing points were found. In the others, there were found zero (44), one (208), three (419), four (155), five (73), six (41), and seven or more (68) breaking points. For cases in which two breaking points were found, statistics of the most likely boundaries loca- 070018-3 Papers in Physics, vol. 7, art. 070018 (2015) / D. J. R. Sevilla Figure 2: Top: Histogram of the boundaries loca- tions. Bottom: Histogram of the in-segment mean rates. Both figures were calculated for a set of 2000 simulations similar to that shown in Fig. 1. tions and in-segment mean rates were calculated. Figure 2 shows histograms of those statistics. Figure 2 (top) shows histograms for boundaries locations. It is clear that the bigger the step, the smaller the uncertainty on its location. Figure 2 (bottom) shows histograms of in-segment rates. It is clear that the greater the rate, the smaller its relative statistical error. Figure 3 shows data and boundaries location probabilities for a simulation similar to the pre- vious ones, but now with rates 3.0, 1.0 and 2.0. Comparing Fig. 1 (top) and Fig. 3 (top), we can see that in the latter one the boundaries locations are found more accurately. Again, 2000 simulations with the same condi- tions were performed. In 1159 of them, two break- ing points were found, while in the others, there were found zero (1), one (33), three (499), four (167), five (75), six (32), and seven or more (34) breaking points. Figure 4 shows histograms of the statistics of the most likely boundaries locations and in-segment rates for the simulations with two 0 50 100 150 0 1 2 3 4 5 6 7 8 0 1 Element Number C o u n ts �� B o u n d a r y P r o b . Figure 3: Top: Simulated data (blue dots) and boundaries location probability (red line). Bottom: Regression curve and its error estimation (black dashed line and gray zone), and the rate curve used in simulation (blue line). breaking points found. Comparing with Fig. 2, we can see that the histograms are now narrower. These results confirm what was stated above. It is important to note that the probability for the real curve to be completely inside the region defined by the error estimations of the regression is significantly less than one. It is easy to see why: if the errors were independent and equal to the standard error, the probability of satisfying n er- ror conditions simultaneously would be 0.68n. But even the actual probability could be lower, since it is clear that the errors must be dependent. Nev- ertheless, the error estimations presented here are useful to get an idea of the accuracy of the regres- sion. Finally, the capability to detect a breaking point with this code was tested for different count rates. To do this, simulated data sets of a single step in the count rate were used. Data sets consist in 100 Pois- son distributed elements, the first 50 for a rate r1 and the last 50 for a rate r2. 1000 simulations were 070018-4 Papers in Physics, vol. 7, art. 070018 (2015) / D. J. R. Sevilla Figure 4: Top: Histogram of the boundaries loca- tions. Bottom: Histogram of the in-segment rates. Both figures were calculated for a set of 2000 sim- ulations similar to that shown in Fig. 3. Table 1: Statistics of successful detections for a sin- gle step. r1\r2 3.0 2.0 1.6 1.2 0.8 0.4 0.86 0.74 0.70 0.63 0.46 0.8 0.77 0.70 0.60 0.36 1.2 0.69 0.61 0.30 1.6 0.65 0.28 2.0 0.60 done for each pair (r1,r2). Statistics of success- ful detections are presented in Table 1. A success- ful detection is considered when only one breaking point between elements 40 and 60 is detected. Table 1 shows that the smaller is the mean rate difference and the smaller are the mean rates, the more difficult is the detection of the step. This result is expected because in a Poisson distribution the variance is equal to the mean. IV. Conclusions In this work, a code for Bayesian regression of piece- wise constant functions was adapted to handle data from Poisson processes. For this purpose, equations for calculating the moments of the posteriors of seg- ments of data were found through Bayes theorem, considering the conjugate prior of the Poisson dis- tribution as prior. These results, as well as part of Hutter’s method, were used to calculate the most probable number of segments and their boundaries. Procedures for calculating in-segments mean rates and the uncertainties of mean rates and boundaries locations are also provided. The resulting method is summarized in a code in Mathematica. The code was applied to simulated data. Firstly, two examples with tree segments were analyzed. The code performed well in both cases considering the dispersion of data, and the results improved in the case of higher mean rates and mean rates dif- ferences. This occurs because of the statistical dis- persion of Poisson distributed data, which is greater than the mean rate if the mean rate is lower than one. Finally, simulations of data of a single step were analyzed for different rates, and statistics of the regressions with only one breaking point are pre- sented in a table. This table shows the effect of the rates and rate differences in the regression ac- curacy, and, together with the errors estimations provided by the code, can serve as an indicator of the reliability of the method. Supplementary material including the source code for the algorithms can be found at the journal website [8]. Acknowledgements - This work was partially supported by the National University of Rosario. [1] P C Gregory, Bayesian logical data analysis for the physical sciences, Cambridge University Press, Cambridge, UK (2004). [2] G L Bretthorst, Lecture notes in statistics, Springer, Berlin (1988). 070018-5 Papers in Physics, vol. 7, art. 070018 (2015) / D. J. R. Sevilla [3] P C Gregory, T J Loredo, A new method for the detection of a periodic signal of unknown shape and period, Astrophys. J. 398, 146 (1992). [4] M Hutter, Exact Bayesian regression of piece- wise constant functions, Bayesian Analysis 2, 635 (2007). [5] J F Lawless, Regression methods for Poisson process data, J. Am. Stat. Assoc. 82, 399 (1987). [6] Wolfram Research Inc., Mathematica version 9.0, Wolfram Research, Inc., Champaign, Illi- nois (2012). [7] A Gelman, J B Carlin, H S Stern, D B Rubin, Bayesian data analysis, Taylor & Francis, UK (2014). [8] Mathematica codes and examples by the author can be found at http://www.papersinphysics.org. 070018-6