# Introduction to Pathogen.jl

Pathogen.jl is a package that provides utilities for the simulation and inference of pathogen phylodynamics, built in the Julia language. Specifically, Pathogen.jl presents an extension to the individual level infectious diesase transmission models (ILMs) of Deardon et al. (2010), to simultaneously model infectious disease transmission and evolution. Pathogen genomic sequences are used in conjunction with the covariate and disease state information of individuals to infer disease transmission pathways, external disease pressure, and infectivity parameters.

At it’s core, an ILM describes the probability an individual will become exposed to a disease in a specific time interval based on their covariate information and the disease state of the rest of the population. In a discrete time ILM, the probability an individual $i$ will become exposed at time $t$ is expressed as (Deardon et al. 2010):

$P(i,t) = \begin{cases} 1 - \exp \left \{ \left[-\Omega_{S}(i)\sum_{j \in I_{(t)}} \Omega_{T}(j)\kappa(i,j)\right] + \epsilon(i,t) \right \} \text{ if } i \in S_{(t)}\\ 0 \text{ otherwise} \end{cases}$,

where:

• $I_{(t)}$ is the set of infectious individuals at time $t$
• $S_{(t)}$ is the set of susceptible individuals at time $t$
• $\Omega_{S}(i)$ represents risk factors associated with susceptible individual $i$
• $\Omega_{T}(i)$ represents risk factors associated with infectious individual $j$
• $\kappa(i,j)$ is an infection kernel representing risk factors involving both a susceptible individual $i$, and an infectious individual $j$
• $\epsilon(i,t)$ represent external risk factors associated with susceptible individual $i$

## Pathogen transmission pathways

In an ILM, a susceptible individual’s exposure risk is the cumulation of various risk factors. The specific source of an infection in this framework, however, is irrelevant. In the phylodynamic ILM extension, the source of an infection is of relevance to the both the pathways of pathogen transmission and of pathogen evolution. As such, it is required that seperate exposure probabilities are described for each susceptible-infectious individual combination. Furthermore, as an individual may only be infected from a single source, we must determine which of the competiting sources (if any) are responsible for exposure to the pathogen. To do this we are required to model in continuous time, resulting in the following exposure probability:

$P(i, j, t) = \begin{cases} \Omega_{S}(i)\Omega_{T}(j)\kappa(i,j) \exp \left\{ -\Omega_{S}(i)\Omega_{T}(j)\kappa(i,j)\delta(t)\right\} \text{ if } i \in S_{(t)}, j \in I_{(t)} \\ 0 \text{ otherwise} \end{cases}$,

where $\delta(t)$ is the time from the previous event. Which is to say that an individual’s time to exposure is exponentially distributed. An individual’s risk of exposure from external factors can also be defined separately as:

$P(i, t) = \begin{cases} \epsilon(i,t) \exp \left \{- \epsilon(i,t) \delta(t) \right \} \text{ if } i \in S_{(t)} \\ 0 \text{ otherwise} \end{cases}$.

An individual’s transition from an exposed to an infectious state is assumed to be exponentially distributed with probability:

$P(i, t) = \begin{cases} \rho(i) \exp \left \{- \rho(i) \delta(t) \right \} \text{ if } i \in E_{(t)} \\ 0 \text{ otherwise} \end{cases}$.

where,

• $E_{(t)}$ is the set of exposed individuals at time $t$,
• $\rho(i)$ is the infectivity rate (inverse of the mean latent period) associated with an exposed individual $i$.

Similarly, an individual’s transition from an infectious state to a removed state is assumed to be exponentially distributed with probability:

$P(i, t) = \begin{cases} \gamma(i) \exp \left \{- \gamma(i) \delta(t) \right \} \text{ if } i \in I_{(t)} \\ 0 \text{ otherwise} \end{cases}$.

where,

• $\gamma(i)$ is the removal rate (inverse of the mean infectious period) associated with an infectious individual $i$.

## Pathogen phylogenetics

Pathogen.jl models mutations in the genome of a pathogen as it is transmitted through a population. In the simplest case, this may be done purely through a transition-transversion rate matrix,

$\mathbf{Q} = \begin{Bmatrix} -\mu_{A} & \mu_{AT} & \mu_{AC} & \mu_{AG} \\ \mu_{TA} & -\mu_{T} & \mu_{TC} & \mu_{TG} \\ \mu_{CA} & \mu_{CT} & -\mu_{C} & \mu_{CG} \\ \mu_{GA} & \mu_{GT} & \mu_{GC} & -\mu_{G} \end{Bmatrix},$

where each $\mu$ is the rate of transition or tranversion between two specified nucleotides. These rates must be defined in the same time units as the rates associated with the pathogen transmission process. There are several transition-transversion models of various complexity that may be utilized. Additionally, site heterogeneity could be incorporated into the model as well as other types of mutations.

## Simulation framework

As individuals in a simulated population are in interaction with one another, the occurence of one event impacts the occurence of others. Every change in disease state membership (i.e. membership in sets $S$, $E$, $I$, and $R$) impacts event probabilities in some manner.

An exposure event to an individual $i$, changes their (previously zero) probability of becoming infectious to $\rho(i)$ and zeros their probability of becoming exposed from any of the potential sources (since they have already been exposed). Once an individual has been exposed there are also probabilities associated with the mutation of the pathogen that has been transmitted to them, which remains non-zero until individual $i$ is removed from the population.

An infection event to an individual $i$, changes their (previously zero) probability of being removed to $\gamma(i)$ and also zeros their probability of becoming infectious (since they are already infected). While individual $i$ is infectious, every remaining susceptible individual in the population is at some risk of exposure to the pathogen that individual $i$ hosts.

In an SEIR model, a removal event to individual $i$ zeros their probability of removal, and also zeros the probability that they may expose any remaining susceptible individuals.

A mutation event may only change that nucleotide’s mutation rate (i.e. under a neutral theory of pathogen evolution), however it is also possible that a pathogen sequence could be associated with disease transmissability, latency, or recovery, creating potential for selection pressures.

The most straightforward and computationally efficient method of generating these interdependent events involves use of the minimum of independent exponential random variables. As the occurence of each event is described by an exponential distribution, we can generate the minimum of these distributions through summing their rate parameters. That is, the time of the next event in our system can be assumed to be exponentially distributed, with rate

$\lambda = \sum_{i \in S_{(t)}} \left[ \epsilon(i,t) + \sum_{j \in I_{(t)}} \Omega_{S}(i) \Omega_{T}(j) \kappa(i,j) \right] + \sum_{k \in E_{(t)}} \rho(k) + \sum_{j \in I_{(t)}} \gamma(j) + \sum_{l \in E_{(t)} \cup I_{(t)}} \chi_{l,t}$,

where,

• $\chi_{l, t}$ is the overall mutation rate of the $l^{th}$ individual at time $t$, which can be further broken down into specific mutations (e.g. transition-transversion rates for each nucleotide in a genomic sequence).

Identification of the event that occurs at this time can be found through a discrete distribution, where the proportion of rate $\lambda$ that can be attributed to any event, is its probability in being selected. After an event time has been generated, and a specific event identified, the necessary rate updates can be made. In Pathogen.jl this information is conveniently organized and modified in a rate array where each column represents an individual (or an “external source”), and each row an event type.

The transmission of disease is observed under a specified surveillance model. The surveillance model specifies a detection lag for infection and removal events. The structure of this surveillance model is flexible, and an exponential detection lag model is included with Pathogen.jl. In the exponential detection lag model, observation delay is modelled by an exponential distribution, with a sole parameter, $\nu$, the detection rate.

## Inference framework

Pathogen.jl utilizes Markov chain Monte Carlo (MCMC) to sample from the joint posterior distribution. As the posterior density can not be calculated fully until our observations have been augmented with actual recovery, infection, and exposure times, as well as exposure source, our MCMC algorithm must incorporate data augmentation. Specifically, data augmentation and joint posterior distribution sampling is performed by Gibbs-within-Metropolis-like algorithm, as described below:

1. A parameter vector, $\mathbf{\theta}^{t*}$ is proposed from a multivariate Gaussian distribution with mean $\mathbf{\theta}^{t-1}$, and covariance matrix $\Sigma$.
2. Observed data $D_{obs}$, are augmented by $D_{aug}^{t*}$, which are generated from a conditional distribution $\text{P}(D_{aug}^{t*} \lvert \mathbf{\theta}^{t*}, D_{obs})$.
3. The joint posterior density is calculated as the product of the likelihood and the prior, $\pi(\mathbf{\theta}^{t*} \lvert D_{aug}^{*}) = L(\mathbf{\theta}^{t*} \lvert D_{aug}^{t*}) \times p(\mathbf{\theta}^{t*})$, and $\mathbf{\theta}^{t*}$ is accepted as $\mathbf{\theta}^{t}$ with probability $\min(1, \frac{\pi(\mathbf{\theta}^{t*} \lvert D_{aug}^{t*})}{\pi(\mathbf{\theta}^{t-1} \lvert D_{aug}^{t-1})})$.
4. If $\mathbf{\theta}^{t*}$ is rejected, $\mathbf{\theta}^{t} = \mathbf{\theta}^{t-1}$, and $D_{aug}^{t}$ is generated from $\text{P}(D_{aug}^{t} \lvert \mathbf{\theta}^{t}, D_{obs})$.

In other words, model parameters are first proposed using a Metropolis-Hastings scheme. Data augmentation follows, using a Gibbs sampling scheme. The parameter proposals are then evaluated with a Metropolis-Hastings acceptance scheme. And finally, in the case of rejection, an additional Gibbs sampling step is utilized for re-augmenting the data.

The transition kernel’s covariance matrix may also be tuned using the adaptive scaling method of Roberts and Rosenthal (2001).

## References

Deardon, R., Brooks, S. P., Grenfell, B. T., Keeling, M. J., Tildesley, M. J., Savill, N. J., … & Woolhouse, M. E. (2010). Inference for individual-level models of infectious diseases in large populations. Statistica Sinica20(1), 239.

Roberts, G. O., & Rosenthal, J. S. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical science16(4), 351-367.