Skip to contents

Compute posterior probabilities of P. vivax recurrent states relapse, reinfection and recrudescence using genetic data.

Please note, the progress bar does not necessarily increment uniformly (see details below); it may seem stuck when the code is still running.

Usage

compute_posterior(
  y,
  fs,
  prior = NULL,
  MOIs = NULL,
  return.RG = FALSE,
  return.logp = FALSE,
  progress.bar = TRUE
)

Arguments

y

Observed data in the form of a list of lists. The outer list is a list of episodes in increasing chronological order. The inner list is a list of named markers per episode. Episode names can be specified, but they are not used. Markers must be named. Each episode must list the same markers. If not all markers are typed per episode, data on untyped markers can be encoded as missing (see below). For each marker, one must specify an allelic vector: a set of distinct alleles detected at that marker. NAs encode missing per-marker data, i.e., when no alleles are observed for a given marker. NA entries in allelic vectors that contain both NA and non-NA entries are ignored. Allele names are arbitrary, but must correspond with frequency names (see examples below). The same names can be used for alleles belonging to different markers. As such, frequencies must be specified per named allele per named marker.

fs

List of per-marker allele frequency vectors. Names of the list must match with the marker names in y. Within lists (i.e., for each marker), frequencies must be specified per allele name.

prior

Matrix of prior probabilities of the recurrence states for each recurrent episode. Each row corresponds to an episode in increasing chronological order. The column names must be C, L, and I for recrudescence, relapse and reinfection respectively. Row names can be specified but they are not used. If prior is NULL (default), per-episode recurrent states are equally likely.

MOIs

Multiplicity of infection for each episode. If MOIs are not provided, the most parsimonious MOIs compatible with the data will be used; see determine_MOIs.

return.RG

Boolean for whether to return the relationship graphs, defaults to FALSE. If return.logp is set to TRUE, then return.RG is overridden to be TRUE, as log-probabilities are returned for each relationship graph.

return.logp

Boolean for whether to return the log-likelihood for each relationship graph, defaults to FALSE. When setting return.logp to TRUE, return.RG should also be set to TRUE. Setting return.logp to FALSE allows for permutation symmetries to be exploited to save computational time, see enumerate_alleles. Setting this to TRUE will result in longer runtimes, especially when multiplicities of infection are large. Note that this argument does not affect the output of the posterior probabilities.

progress.bar

Boolean for printing progress bars. When TRUE (default), the progress bar is printed to the screen. Please note that the progress bar does not necessarily increment uniformly (see details below); it may seem stuck when the code is still running.

Value

List containing:

marg

Matrix of marginal posterior probabilities of recurrent states for each recurrence, one row per recurrence with "C" for recrudescence, "L" for relapse, and "I" for reinfection. marg is a simple summary of joint (see next): each marginal probability of a recurrent state is a sum over a subset of joint probabilities of recurrent state sequences. For example, the marginal probability of "C" at the first of two recurrences is a sum over the joint probabilities of "CC", "CL", and "CI".

joint

Vector of joint posterior probabilities of each recurrent state sequence, where within a sequence "C" denotes recrudescence, "L" denotes relapse, and "I" denotes reinfection.

RGs

List of relationship graphs with their log-likelihoods stored. Only returned if return.RG is TRUE. See enumerate_RGs.

Details

compute_posterior() computes posterior probabilities proportional to the likelihood multiplied by the prior. The likelihood sums over

  • ways to phase allelic data onto haploid genotypes

  • graphs of relationships between haploid genotypes

  • ways to partition alleles into clusters of identity-by-descent

compute_posterior() expects each per-episode, per-marker allelic vector to be a set of distinct alleles. Allele repeats at markers with observed data, and NA repeats at markers with missing data, are removed in a data pre-processing step. NAs in allelic vectors that also contain non-NA values are removed in a data pre-processing step.

We enumerate all possible relationship graphs between haploid genotypes, where pairs of genotypes can either be clones, siblings, or strangers. The likelihood of a sequence of recurrence states can be determined from the likelihood of all relationship graphs compatible with said sequence. More details on the enumeration and likelihood calculation of relationship graphs can be found in enumerate_RGs and RG_inference respectively. For each relationship graph, the model sums over all possible identity-by-descent partitions. Because some relationship graphs are compatible with more identity-by-descent partitions than others, the log p(Y|RG) progress bar does not necessarily increment uniformly.

Notable model assumptions and limitations:

  • Perfect detection of alleles (no genotyping error)

  • No within-host de novo mutations

  • Parasites are outbred

  • All siblings are regular siblings

  • Relationship graphs compatible with a given sequence of recurrent states are equally likely a priori

  • We do not recommend running `compute_posterior() when the total genotype count (sum of per-episode multiplicities of infection) exceeds eight, because there are too many relationship graphs.

  • Presently, Pv3Rs only supports prevalence data (categorical data that signal the detection of alleles), not quantitative data (data that signal the proportional abundance of the alleles detected).

Examples

# ===========================================================================
# Example where alleles are named numerically
# ===========================================================================
# Data
y <- list(enroll = list(m1 = c('3','2'), m2 = c('1','2')),
          recur1 = list(m1 = c('1','4'), m2 = c('1')),
          recur2 = list(m1 = c('1'), m2 = NA))

# Allele frequencies
fs <- list(m1 = c('1' = 0.78, '2' = 0.14, '3' = 0.07, '4' = 0.01),
           m2 = c('1' = 0.27, '2' = 0.73))

# Compute posterior probabilities using default prior
compute_posterior(y, fs, progress.bar = FALSE)
#> Number of valid relationship graphs (RGs) is 250
#> Computing log p(Y|RG) for 250 RGs
#> Finding log-likelihood of each vector of recurrence states
#> 
#> $marg
#>                C         L         I
#> recur1 0.0000000 0.1949556 0.8050444
#> recur2 0.2938829 0.2476598 0.4584573
#> 
#> $joint
#>         CC         LC         IC         CL         LL         IL         CI 
#> 0.00000000 0.05539481 0.23848807 0.00000000 0.05314490 0.19451493 0.00000000 
#>         LI         II 
#> 0.08641590 0.37204139 
#> 


# ===========================================================================
# Example where alleles are named arbitrarily and probabilities are plotted
# ===========================================================================
# Data
y <- list(episode0 = list(marker1 = c("Tinky Winky", "Dipsy"),
                          marker2 = c("Tinky Winky", "Laa-Laa", "Po")),
          episode1 = list(marker1 = "Tinky Winky",
                          marker2 = "Laa-Laa"))

# Allele frequencies
fs <- list(marker1 = c("Tinky Winky" = 0.4, "Dipsy" = 0.6),
           marker2 = c("Tinky Winky" = 0.1, "Laa-Laa" = 0.1, "Po" = 0.8))

# Compute posterior probabilities using default prior
posterior_probs <- compute_posterior(y, fs, progress.bar = FALSE)
#> Number of valid relationship graphs (RGs) is 30
#> Computing log p(Y|RG) for 30 RGs
#> Finding log-likelihood of each vector of recurrence states
#> 

# Plot posterior probabilities on the simplex
plot_simplex(c("Recrudescence", "Relapse", "Reinfection"), 0.5) # Simplex
#> NULL
xy <- project2D(posterior_probs$marg[1,]) # Project probabilities
points(x = xy["x"], y = xy["y"], pch = 20) # Plot projected probabilities



#============================================================================
# Demonstrating the return of the prior when all data are missing
#============================================================================
# Data
y_missing <- list(enroll = list(m1 = NA),
                  recur1 = list(m1 = NA),
                  recur2 = list(m1 = NA))

# Return of the prior
suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1))))
#> Warning: Markers m1 has data on one episode only
#> Warning: Episodes enroll & recur1 & recur2 have no data
#> $marg
#>                C         L         I
#> recur1 0.3333333 0.3333333 0.3333333
#> recur2 0.3333333 0.3333333 0.3333333
#> 
#> $joint
#>        CC        LC        IC        CL        LL        IL        CI        LI 
#> 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 
#>        II 
#> 0.1111111 
#> 

# Return of the prior re-weighted to the exclusion of recrudescence
suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1)),
                 MOIs = c(1,2,3)))
#> Warning: Markers m1 has data on one episode only
#> Warning: Episodes enroll & recur1 & recur2 have no data
#> $marg
#>        C   L   I
#> recur1 0 0.5 0.5
#> recur2 0 0.5 0.5
#> 
#> $joint
#>   CC   LC   IC   CL   LL   IL   CI   LI   II 
#> 0.00 0.00 0.00 0.00 0.25 0.25 0.00 0.25 0.25 
#> 

# (Recrudescing parasites are clones of previous blood-stage parasites. The
# Pv3R model assumes no within-host de-novo mutations and perfect allele
# detection. As such, recrudescence is incompatible with an MOI increase on
# the preceding infection.)


# ===========================================================================
# Demonstrating the cosmetic-only nature of episode names
# ===========================================================================
# Data
y <- list(enroll = list(m1 = NA),
          recur2 = list(m1 = NA),
          recur1 = list(m1 = NA))

# Use a non-uniform prior for the purpose of illustration
prior <- matrix(c(0.2,0.2,0.6,0.7,0.1,0.2), byrow = TRUE, nrow = 2,
                dimnames = list(c("recur1_prior", "recur2_prior"),
                                c("C", "L", "I")))

# Print posterior and prior, noting that "recur1_prior" is returned for
# "recur2", and "recur2_prior" is returned for "recur1"
suppressMessages(compute_posterior(y, fs = list(m1 = c(a = 1)), prior))$marg
#> Warning: Data and prior episode names disagree
#> Warning: Markers m1 has data on one episode only
#> Warning: Episodes enroll & recur2 & recur1 have no data
#>          C   L   I
#> recur2 0.2 0.2 0.6
#> recur1 0.7 0.1 0.2
prior
#>                C   L   I
#> recur1_prior 0.2 0.2 0.6
#> recur2_prior 0.7 0.1 0.2


#============================================================================
# Demonstrating the informative nature of non-recurrent data
#============================================================================
# Data and allele frequencies
y_het <- list(list(m1 = c('1', '2')), list(m1 = NA))
y_hom <- list(list(m1 = '1'), list(m1 = NA))
fs = list(m1 = c('1' = 0.5, '2' = 0.5))

# The prior is not returned despite there being no recurrent data.
suppressMessages(compute_posterior(y = y_het, fs))$marg
#> Warning: Markers m1 has data on one episode only
#> Warning: 
#>              C         L         I
#> [1,] 0.3292683 0.3414634 0.3292683
suppressMessages(compute_posterior(y = y_hom, fs, MOIs = c(2,1)))$marg
#> Warning: Markers m1 has data on one episode only
#> Warning: 
#>              C         L         I
#> [1,] 0.3358209 0.3283582 0.3358209



#============================================================================
# Demonstrating the effect of increasingly large relationship graphs: the
# marginal probability of the first recurrence changes slightly, albeit at a
# decreasing rate, as the number of additional recurrences (all without data)
# increases. The change is greatest when the observed allele is rare.
#============================================================================
# Data for different recurrence counts where only the 1st recurrence has data
ys <- list(scenario1 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A")),
           scenario2 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A"),
                            recur2 = list(m1 = NA)),
           scenario3 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A"),
                            recur2 = list(m1 = NA),
                            recur3 = list(m1 = NA)),
           scenario4 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A"),
                            recur2 = list(m1 = NA),
                            recur3 = list(m1 = NA),
                            recur4 = list(m1 = NA)))

# Allele frequencies: smaller f_A leads to larger change
f_A <- 0.1; fs <- list(m1 = c("A" = f_A, "Other" = 1-f_A))

# Compute posterior probabilities and extract marginal probabilities
results <- lapply(ys, function(y) compute_posterior(y, fs, progress.bar = FALSE)$marg)
#> Number of valid relationship graphs (RGs) is 3
#> Computing log p(Y|RG) for 3 RGs
#> Finding log-likelihood of each vector of recurrence states
#> 
#> Warning: Episode recur2 has no data
#> Number of valid relationship graphs (RGs) is 12
#> Computing log p(Y|RG) for 12 RGs
#> Finding log-likelihood of each vector of recurrence states
#> 
#> Warning: Episodes recur2 & recur3 have no data
#> Number of valid relationship graphs (RGs) is 60
#> Computing log p(Y|RG) for 60 RGs
#> Finding log-likelihood of each vector of recurrence states
#> 
#> Warning: Episodes recur2 & recur3 & recur4 have no data
#> Number of valid relationship graphs (RGs) is 358
#> Computing log p(Y|RG) for 358 RGs
#> Finding log-likelihood of each vector of recurrence states
#> 

# Extract results for the first recurrence
results_recur1 <- sapply(results, function(result) result[1,])
results_recur1 # Results are different for different scenarios
#>    scenario1  scenario2  scenario3  scenario4
#> C 0.60606061 0.61538462 0.62294515 0.62925894
#> L 0.33333333 0.32307692 0.31476034 0.30781517
#> I 0.06060606 0.06153846 0.06229451 0.06292589

# Visualise the change in the marginal probability of the first recurrence
plot_simplex(c("Recrudescence", "Relapse", "Reinfection")) # Plot simplex
#> NULL
xy <- apply(results_recur1, 2, project2D) # Project probabilities
points(x = xy["x", ], y = xy["y", ], pch = "-", col = 1:4) # Plot projections
legend("left", col = 1:4, pch = "-", pt.cex = 2, bty = "n", legend = 1:4,
title = "Recurrence \n count") # legend