Improve your forecasts of events: use the gamma-Poisson model
Forecasters often strive to predict the rate at which events occur.1 However, traditional models used by forecasters, such as Laplace’s rule (based on the beta-binomial model), have limitations. These models are sensitive to the choice of time scale (e.g., whether time is measured in years or months) and do not accommodate multiple events occurring within a single time period.
In 2022, Jaime Sevilla and Ege Erdil published a blog post that proposed a more robust alternative: the gamma-Poisson model. This model is time- or scale-invariant, meaning that the choice of scale does not affect the results. In this post, we will delve deeper into the gamma-Poisson model, exploring its assumptions and how to make full use of the posterior predictions, and finally recommending an alternative to Sevilla and Erdil’s suggested prior.
There are three main recommendations in this post.
Consider the assumptions behind your model, particularly that the rate of events is constant and that the times between events are independent.
Use the full posterior, including both the uncertainty in the event rate and the inherent randomness in when events occur, when making forecasts.
Consider the Gamma(1/3, 0) prior when no prior information is available (recommended by Kerman (2011) as a “neutral prior”), rather than the Gamma(1, 0) recommended by Sevilla and Erdil.
The gamma-Poisson model
The gamma-Poisson model is probably the simplest possible model for events occurring in continuous time. The gamma refers to the distribution for the rate of events, and the Poisson to the distribution for how events occur conditional on the rate of events. It requires only three assumptions:
Our prior belief for the rate of events can be represented as a gamma distribution.
The rate does not change over the period of time we are analysing.
The events follow a homogeneous Poisson process, meaning that events are independent and the chance of future events is not affected by past events.
The gamma distribution has two parameters: the shape (α, the Greek letter alpha) and the rate (β, the Greek letter beta). We write this distribution as Gamma(α, β). Both parameters must be greater than zero to ensure a proper distribution.2 When used as a prior distribution, the parameters have intuitive interpretations: α represents the number of observed events, and β denotes the length of the observation period. As long as we are consistent in our analysis, the choice of units for β is irrelevant, ensuring that our model is time- or scale-invariant.
One convenient property of the gamma-Poisson model is that the posterior distribution for the rate of events will also be a gamma distribution.3 This allows us to easily write down our posterior beliefs. If our prior distribution is Gamma(α, β) and we observe x events over T time periods, our posterior becomes Gamma(α+x, β+T). A useful forecasting quantity is the probability that there are no events in some future period of length t, which is:

P(no events in time t) = ((β + T) / (t + β + T))^(α + x)
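To make the update concrete, here is a minimal sketch in Python using scipy, with made-up numbers (4 events observed over 10 years) and the Gamma(1/3, 0) prior recommended later in this post. Note that scipy parameterises the gamma distribution by shape and scale, so we pass 1/(β+T) as the scale (see footnote 4).

```python
from scipy import stats

# Made-up example: Gamma(1/3, 0) prior, then x = 4 events over T = 10 years
alpha, beta = 1 / 3, 0.0
x, T = 4, 10.0

# Conjugate update: the posterior for the rate is Gamma(alpha + x, beta + T)
shape, rate = alpha + x, beta + T

# scipy uses a shape/scale parameterisation, so the scale is 1 / (beta + T)
posterior = stats.gamma(a=shape, scale=1 / rate)
print(posterior.mean())  # (alpha + x) / (beta + T) ≈ 0.433 events per year

# Probability of no events in the next t = 2 years, from the formula above
t = 2.0
print((rate / (t + rate)) ** shape)  # ≈ 0.454
```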
Using the posterior distribution for forecasting
When forecasting from the gamma-Poisson model, there are two reasons for our uncertainty. First, we are unsure of the underlying rate of events, represented by our posterior gamma distribution. Second, the process has some inherent randomness over when the events occur; that is, even if we knew the underlying rate, we still would not know how many events will occur in any period.
Often, a point estimate is taken for the first of these, which understates our uncertainty. This section explains how to take both into account in several circumstances of interest.
The rate of events
The gamma-Poisson model provides us with a posterior distribution for the rate of events: Gamma(α+x, β+T).4 The mean of this distribution, (α+x) / (β+T), represents our updated belief about the average rate of events occurring in a given time period.
If we want to forecast the number of events in the next t time periods, we also need to take into account the natural stochasticity in the process. Consider that, even if we knew the rate of events exactly, we would still have some uncertainty over how many events will occur. To take into account both our uncertainty over the rate of events and this stochasticity, we should use our posterior predictive distribution. Under the gamma-Poisson model, this is NegativeBinomial(α+x, (β+T) / (t+β+T)).5 The mean of this distribution, t(α+x) / (β+T), represents our expected number of events in the next t time periods.
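Continuing the made-up example from earlier, here is a sketch of forecasting event counts with the posterior predictive. Conveniently, scipy’s nbinom uses the same (n, p) convention as written above, but see footnote 5 for the alternative parameterisation.

```python
from scipy import stats

shape, rate = 1 / 3 + 4, 0.0 + 10.0  # posterior from the earlier example
t = 2.0  # forecast the number of events in the next two years

# Posterior predictive: NegativeBinomial(alpha + x, (beta + T) / (t + beta + T))
predictive = stats.nbinom(n=shape, p=rate / (t + rate))

print(predictive.mean())         # t * (alpha + x) / (beta + T) ≈ 0.867 events
print(predictive.interval(0.9))  # central 90% interval on the number of events
```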
Time between events
The mean time between events is one over the rate of events (e.g., if the rate is two per year, the mean time between events is half a year). Our posterior here is InverseGamma(α + x, β + T). The mean of this distribution is only defined if α + x > 1, which (if you follow my recommendations for setting a prior) occurs once you’ve observed at least one event. Intuitively, if we haven’t seen any events, then we have some belief that the event never occurs, and hence an infinite time between events. When the mean is defined, it is (β + T) / (α + x - 1).
Again, we get a posterior predictive distribution, this time for the time until the next event: Lomax(α + x, β + T), which has the same mean as the inverse gamma above.
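Both distributions are available in scipy (as invgamma and lomax, each taking β + T as the scale). A sketch with the same made-up posterior as before:

```python
from scipy import stats

shape, rate = 1 / 3 + 4, 0.0 + 10.0  # posterior from the earlier example

# Posterior for the *mean* time between events: InverseGamma(alpha + x, beta + T)
mean_gap = stats.invgamma(a=shape, scale=rate)
print(mean_gap.mean())  # (beta + T) / (alpha + x - 1) = 3.0 years

# Posterior predictive for the time until the next event: Lomax(alpha + x, beta + T)
next_gap = stats.lomax(c=shape, scale=rate)
print(next_gap.mean())           # same mean as the inverse gamma, 3.0 years
print(next_gap.ppf([0.1, 0.9]))  # central 80% predictive interval for the wait
```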
Probability of no events
Finally, we often want to know the probability that there are no events in a period of length t. This can be derived from either the negative binomial or Lomax distributions above, in either case giving:

P(no events in time t) = ((β + T) / (t + β + T))^(α + x)
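As a quick consistency check on the made-up example, both routes give the same answer in scipy:

```python
from scipy import stats

shape, rate = 1 / 3 + 4, 0.0 + 10.0
t = 2.0

# Route 1: probability of a zero count under the negative binomial predictive
print(stats.nbinom(n=shape, p=rate / (t + rate)).pmf(0))  # ≈ 0.454

# Route 2: probability the Lomax waiting time exceeds t
print(stats.lomax(c=shape, scale=rate).sf(t))  # ≈ 0.454, matching the formula
```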
Choosing the prior
The choice of our prior, specifically the values of α and β, can be quite influential if we have not observed many events (certainly fewer than 5, and possibly up to around 10). When we have relevant information (e.g., a suitable reference class), we should choose these parameters to reflect that information. However, in cases where no applicable information is available, we may want a “reference” or “objective” prior that is broadly applicable.
Acceptable choices
Any suitable reference prior should have 0 < α ≤ 1 and β = 0 to satisfy the following three principles.
Ensure that our posterior distribution is always proper, requiring α > 0 and β ≥ 0.6
Avoid choices under which our inferences change if we change scales. As soon as we choose β > 0, the choice of scale matters, which is exactly what we want to avoid. Therefore, we should choose β = 0 and decide on α.
If we have not observed any events, we should consider the single most probable outcome (the posterior mode) to be that the rate of events is 0, requiring α ≤ 1.
Specific choices
Several recommendations have been made for choosing α. Note that if x is larger than about 5 or 10, the recommendations will yield similar results, so the choice is not critical. If you have fewer events, I would recommend trying both α = 1 and α = 1/3 to check how sensitive your results are to this assumption for the specific quantities you care about (a sketch of such a check appears at the end of this section).
Sevilla and Erdil recommended α = 1 because it closely resembles Laplace’s rule and provides the best point estimate of the time between events (in expectation). However, it tends to overestimate the rate of events significantly.7 This prior means that the expected rate of events is always noticeably higher than the observed rate. Furthermore, there is quite a large posterior probability that the true rate is higher than the observed rate (see the figure below).
Kerman (2011) recommends α = 1/3 because it implies that the true rate is equally likely to be greater or less than x / T for all values of x and T, as long as x ≥ 1 (at least one event observed). This is because the median of a gamma distribution with parameters a and b is well approximated by (a - 1/3) / b. Intuitively, this seems reasonable: if we have seen x events in T time periods, we should think it is just as likely that the mean rate is less than x/T as greater than it.
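If you want to convince yourself of the median approximation, here is a quick check in scipy with arbitrary made-up parameter values:

```python
from scipy import stats

# Compare the exact gamma median with Kerman's (a - 1/3) / b approximation
b = 2.0
for a in (1.0, 2.0, 5.0, 20.0):
    exact = stats.gamma(a=a, scale=1 / b).median()
    approx = (a - 1 / 3) / b
    print(a, round(exact, 4), round(approx, 4))  # close even for small shapes
```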
Another popular choice is to make α very small, say 10⁻⁶. This makes the prior fairly flat, and approximates the “scale-invariant” prior that Sevilla and Erdil want to use but avoid because it creates an improper posterior. Furthermore, it minimises the mean squared error in estimating the rate of events. However, this prior places far too much probability mass on extremely small rates of events before we observe one.
Overall: α ≈ 0 and α = 1 are the best choices for estimating the rate of events and the time between events respectively. However, neither performs that well when we have not seen any events (especially α ≈ 0), and each performs poorly when we care about the quantity it was not chosen for. Choosing α = 1/3 provides a reasonable trade-off between the two, and has the additional desirable property that, whenever we have observed at least one event, we think the rate of events we’ve observed (x/T) is as likely to be too high as too low.
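As a sketch of the sensitivity check suggested earlier, with made-up data (2 events over 5 years), here is how the two priors compare on two quantities of interest:

```python
# Compare posterior quantities under the alpha = 1 and alpha = 1/3 priors
x, T = 2, 5.0  # made-up data: 2 events observed over 5 years
t = 1.0        # forecast horizon of one year

for alpha in (1.0, 1 / 3):
    shape, rate = alpha + x, T  # posterior Gamma(alpha + x, T), with beta = 0
    mean_rate = shape / rate
    p_no_events = (rate / (t + rate)) ** shape
    print(f"alpha = {alpha:.2f}: mean rate = {mean_rate:.3f}, "
          f"P(no events in next year) = {p_no_events:.3f}")
```

With this few events the two priors disagree noticeably; as x grows, their outputs converge.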
Conclusion
Sevilla and Erdil correctly pointed out that using Laplace’s rule for a continuous observation (such as time) leads to inconsistencies. Here, we’ve laid out some details of the assumptions and use of this model. I’d strongly recommend making use of the full posterior distribution for your forecasting, and considering a Gamma(1/3, 0) prior.
Bonus: Kerman (2011) argues, for essentially the same reasons given here, that we should use a Beta(1/3, 1/3) rather than a Beta(1, 1) prior for probabilities.
The rate of an event is the average (mean) number of events that occur per unit time: for example, the number of births per year or pandemics per decade.
A proper distribution fulfils the requirements of a probability distribution: the probabilities are never negative, and the total probability across all outcomes is 1.
This is because the gamma and Poisson distributions are conjugate distributions.
Be wary that an alternative parameterisation of the gamma distribution is sometimes used, which has 1/(β+T) as the second parameter. Check the documentation for any software package to make sure you get the correct mean.
Be wary that an alternative parameterisation of the negative binomial distribution is sometimes used, which has t / (t + β + T) as the second parameter. Check the documentation for any software package to make sure you get the correct mean.
A proper posterior requires α + x > 0 and β + T > 0. As long as we have observations for a non-zero amount of time, then T > 0 and hence β = 0 is valid. However, we cannot guarantee that we observe an event (we might have x = 0) which requires the strict inequality α > 0.
The difference here is intuitively confusing, but can be explained by considering what happens when events are rare. In that case, the expectation of the time between events is greatly affected by how much weight you place on very large times between events being possible, and, due to the lack of data, that belief is largely driven by your prior. Therefore, for an accurate mean (across all possible values of the time between events) you prefer to underestimate the time between events, or equivalently overestimate the rate of events. This problem only occurs when considering expectations.