
Compute posterior probabilities of P. vivax recurrent states
Source:R/compute_posterior.R
compute_posterior.Rd
Compute posterior probabilities of P. vivax recurrent states relapse, reinfection and recrudescence using genetic data.
Please note, the progress bar does not necessarily increment uniformly (see details below); it may seem stuck when the code is still running.
Usage
compute_posterior(
y,
fs,
prior = NULL,
MOIs = NULL,
return.RG = FALSE,
return.logp = FALSE,
progress.bar = TRUE
)
Arguments
- y
Observed data in the form of a list of lists. The outer list is a list of episodes in increasing chronological order. The inner list is a list of named markers per episode. Episode names can be specified, but they are not used. Markers must be named. Each episode must list the same markers. If not all markers are typed per episode, data on untyped markers can be encoded as missing (see below). For each marker, one must specify an allelic vector: a set of distinct alleles detected at that marker.
NA
s encode missing per-marker data, i.e., when no alleles are observed for a given marker.NA
entries in allelic vectors that contain bothNA
and non-NA
entries are ignored. Allele names are arbitrary, but must correspond with frequency names (see examples below). The same names can be used for alleles belonging to different markers. As such, frequencies must be specified per named allele per named marker.- fs
List of per-marker allele frequency vectors. Names of the list must match with the marker names in
y
. Within lists (i.e., for each marker), frequencies must be specified per allele name.- prior
Matrix of prior probabilities of the recurrence states for each recurrent episode. Each row corresponds to an episode in increasing chronological order. The column names must be C, L, and I for recrudescence, relapse and reinfection respectively. Row names can be specified but they are not used. If
prior
is NULL (default), per-episode recurrent states are equally likely.- MOIs
Multiplicity of infection for each episode. If MOIs are not provided, the most parsimonious MOIs compatible with the data will be used; see
determine_MOIs
.- return.RG
Boolean for whether to return the relationship graphs, defaults to
FALSE
. Ifreturn.logp
is set toTRUE
, thenreturn.RG
is overridden to beTRUE
, as log-probabilities are returned for each relationship graph.- return.logp
Boolean for whether to return the log-likelihood for each relationship graph, defaults to
FALSE
. When settingreturn.logp
toTRUE
,return.RG
should also be set toTRUE
. Settingreturn.logp
toFALSE
allows for permutation symmetries to be exploited to save computational time, seeenumerate_alleles
. Setting this toTRUE
will result in longer runtimes, especially when multiplicities of infection are large. Note that this argument does not affect the output of the posterior probabilities.- progress.bar
Boolean for printing progress bars. When
TRUE
(default), the progress bar is printed to the screen. Please note that the progress bar does not necessarily increment uniformly (see details below); it may seem stuck when the code is still running.
Value
List containing:
marg
Matrix of marginal posterior probabilities of recurrent states for each recurrence, one row per recurrence with "C" for recrudescence, "L" for relapse, and "I" for reinfection.
marg
is a simple summary ofjoint
(see next): each marginal probability of a recurrent state is a sum over a subset of joint probabilities of recurrent state sequences. For example, the marginal probability of "C" at the first of two recurrences is a sum over the joint probabilities of "CC", "CL", and "CI".joint
Vector of joint posterior probabilities of each recurrent state sequence, where within a sequence "C" denotes recrudescence, "L" denotes relapse, and "I" denotes reinfection.
RGs
List of relationship graphs with their log-likelihoods stored. Only returned if
return.RG
isTRUE
. Seeenumerate_RGs
.
Details
compute_posterior()
computes posterior probabilities proportional to the likelihood multiplied by the prior. The likelihood
sums over
ways to phase allelic data onto haploid genotypes
graphs of relationships between haploid genotypes
ways to partition alleles into clusters of identity-by-descent
compute_posterior()
expects each per-episode, per-marker allelic
vector to be a set of distinct alleles. Allele repeats at markers with
observed data, and NA
repeats at markers with missing data, are
removed in a data pre-processing step. NA
s in allelic vectors that
also contain non-NA
values are removed in a data pre-processing step.
We enumerate all possible relationship graphs between haploid genotypes,
where pairs of genotypes can either be clones, siblings, or strangers. The
likelihood of a sequence of recurrence states can be determined from the
likelihood of all relationship graphs compatible with said sequence. More
details on the enumeration and likelihood calculation of relationship graphs
can be found in enumerate_RGs
and RG_inference
respectively. For each relationship graph, the model sums over all possible
identity-by-descent partitions. Because some relationship graphs are
compatible with more identity-by-descent partitions than others, the log
p(Y|RG) progress bar does not necessarily increment uniformly.
Notable model assumptions and limitations:
Perfect detection of alleles (no genotyping error)
No within-host de novo mutations
Parasites are outbred
All siblings are regular siblings
Relationship graphs compatible with a given sequence of recurrent states are equally likely a priori
We do not recommend running `compute_posterior() when the total genotype count (sum of per-episode multiplicities of infection) exceeds eight, because there are too many relationship graphs.
Presently,
Pv3Rs
only supports prevalence data (categorical data that signal the detection of alleles), not quantitative data (data that signal the proportional abundance of the alleles detected).
Examples
# ===========================================================================
# Example where alleles are named numerically
# ===========================================================================
# Data
y <- list(enroll = list(m1 = c('3','2'), m2 = c('1','2')),
recur1 = list(m1 = c('1','4'), m2 = c('1')),
recur2 = list(m1 = c('1'), m2 = NA))
# Allele frequencies
fs <- list(m1 = c('1' = 0.78, '2' = 0.14, '3' = 0.07, '4' = 0.01),
m2 = c('1' = 0.27, '2' = 0.73))
# Compute posterior probabilities using default prior
compute_posterior(y, fs, progress.bar = FALSE)
#> Number of valid relationship graphs (RGs) is 250
#> Computing log p(Y|RG) for 250 RGs
#> Finding log-likelihood of each vector of recurrence states
#>
#> $marg
#> C L I
#> recur1 0.0000000 0.1949556 0.8050444
#> recur2 0.2938829 0.2476598 0.4584573
#>
#> $joint
#> CC LC IC CL LL IL CI
#> 0.00000000 0.05539481 0.23848807 0.00000000 0.05314490 0.19451493 0.00000000
#> LI II
#> 0.08641590 0.37204139
#>
# ===========================================================================
# Example where alleles are named arbitrarily and probabilities are plotted
# ===========================================================================
# Data
y <- list(episode0 = list(marker1 = c("Tinky Winky", "Dipsy"),
marker2 = c("Tinky Winky", "Laa-Laa", "Po")),
episode1 = list(marker1 = "Tinky Winky",
marker2 = "Laa-Laa"))
# Allele frequencies
fs <- list(marker1 = c("Tinky Winky" = 0.4, "Dipsy" = 0.6),
marker2 = c("Tinky Winky" = 0.1, "Laa-Laa" = 0.1, "Po" = 0.8))
# Compute posterior probabilities using default prior
posterior_probs <- compute_posterior(y, fs, progress.bar = FALSE)
#> Number of valid relationship graphs (RGs) is 30
#> Computing log p(Y|RG) for 30 RGs
#> Finding log-likelihood of each vector of recurrence states
#>
# Plot posterior probabilities on the simplex
plot_simplex(c("Recrudescence", "Relapse", "Reinfection"), 0.5) # Simplex
#> NULL
xy <- project2D(posterior_probs$marg[1,]) # Project probabilities
points(x = xy["x"], y = xy["y"], pch = 20) # Plot projected probabilities
#============================================================================
# Demonstrating the return of the prior when all data are missing
#============================================================================
# Data
y_missing <- list(enroll = list(m1 = NA),
recur1 = list(m1 = NA),
recur2 = list(m1 = NA))
# Return of the prior
suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1))))
#> Warning: Markers m1 has data on one episode only
#> Warning: Episodes enroll & recur1 & recur2 have no data
#> $marg
#> C L I
#> recur1 0.3333333 0.3333333 0.3333333
#> recur2 0.3333333 0.3333333 0.3333333
#>
#> $joint
#> CC LC IC CL LL IL CI LI
#> 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111 0.1111111
#> II
#> 0.1111111
#>
# Return of the prior re-weighted to the exclusion of recrudescence
suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1)),
MOIs = c(1,2,3)))
#> Warning: Markers m1 has data on one episode only
#> Warning: Episodes enroll & recur1 & recur2 have no data
#> $marg
#> C L I
#> recur1 0 0.5 0.5
#> recur2 0 0.5 0.5
#>
#> $joint
#> CC LC IC CL LL IL CI LI II
#> 0.00 0.00 0.00 0.00 0.25 0.25 0.00 0.25 0.25
#>
# (Recrudescing parasites are clones of previous blood-stage parasites. The
# Pv3R model assumes no within-host de-novo mutations and perfect allele
# detection. As such, recrudescence is incompatible with an MOI increase on
# the preceding infection.)
# ===========================================================================
# Demonstrating the cosmetic-only nature of episode names
# ===========================================================================
# Data
y <- list(enroll = list(m1 = NA),
recur2 = list(m1 = NA),
recur1 = list(m1 = NA))
# Use a non-uniform prior for the purpose of illustration
prior <- matrix(c(0.2,0.2,0.6,0.7,0.1,0.2), byrow = TRUE, nrow = 2,
dimnames = list(c("recur1_prior", "recur2_prior"),
c("C", "L", "I")))
# Print posterior and prior, noting that "recur1_prior" is returned for
# "recur2", and "recur2_prior" is returned for "recur1"
suppressMessages(compute_posterior(y, fs = list(m1 = c(a = 1)), prior))$marg
#> Warning: Data and prior episode names disagree
#> Warning: Markers m1 has data on one episode only
#> Warning: Episodes enroll & recur2 & recur1 have no data
#> C L I
#> recur2 0.2 0.2 0.6
#> recur1 0.7 0.1 0.2
prior
#> C L I
#> recur1_prior 0.2 0.2 0.6
#> recur2_prior 0.7 0.1 0.2
#============================================================================
# Demonstrating the informative nature of non-recurrent data
#============================================================================
# Data and allele frequencies
y_het <- list(list(m1 = c('1', '2')), list(m1 = NA))
y_hom <- list(list(m1 = '1'), list(m1 = NA))
fs = list(m1 = c('1' = 0.5, '2' = 0.5))
# The prior is not returned despite there being no recurrent data.
suppressMessages(compute_posterior(y = y_het, fs))$marg
#> Warning: Markers m1 has data on one episode only
#> Warning:
#> C L I
#> [1,] 0.3292683 0.3414634 0.3292683
suppressMessages(compute_posterior(y = y_hom, fs, MOIs = c(2,1)))$marg
#> Warning: Markers m1 has data on one episode only
#> Warning:
#> C L I
#> [1,] 0.3358209 0.3283582 0.3358209
#============================================================================
# Demonstrating the effect of increasingly large relationship graphs: the
# marginal probability of the first recurrence changes slightly, albeit at a
# decreasing rate, as the number of additional recurrences (all without data)
# increases. The change is greatest when the observed allele is rare.
#============================================================================
# Data for different recurrence counts where only the 1st recurrence has data
ys <- list(scenario1 = list(enroll = list(m1 = "A"),
recur1 = list(m1 = "A")),
scenario2 = list(enroll = list(m1 = "A"),
recur1 = list(m1 = "A"),
recur2 = list(m1 = NA)),
scenario3 = list(enroll = list(m1 = "A"),
recur1 = list(m1 = "A"),
recur2 = list(m1 = NA),
recur3 = list(m1 = NA)),
scenario4 = list(enroll = list(m1 = "A"),
recur1 = list(m1 = "A"),
recur2 = list(m1 = NA),
recur3 = list(m1 = NA),
recur4 = list(m1 = NA)))
# Allele frequencies: smaller f_A leads to larger change
f_A <- 0.1; fs <- list(m1 = c("A" = f_A, "Other" = 1-f_A))
# Compute posterior probabilities and extract marginal probabilities
results <- lapply(ys, function(y) compute_posterior(y, fs, progress.bar = FALSE)$marg)
#> Number of valid relationship graphs (RGs) is 3
#> Computing log p(Y|RG) for 3 RGs
#> Finding log-likelihood of each vector of recurrence states
#>
#> Warning: Episode recur2 has no data
#> Number of valid relationship graphs (RGs) is 12
#> Computing log p(Y|RG) for 12 RGs
#> Finding log-likelihood of each vector of recurrence states
#>
#> Warning: Episodes recur2 & recur3 have no data
#> Number of valid relationship graphs (RGs) is 60
#> Computing log p(Y|RG) for 60 RGs
#> Finding log-likelihood of each vector of recurrence states
#>
#> Warning: Episodes recur2 & recur3 & recur4 have no data
#> Number of valid relationship graphs (RGs) is 358
#> Computing log p(Y|RG) for 358 RGs
#> Finding log-likelihood of each vector of recurrence states
#>
# Extract results for the first recurrence
results_recur1 <- sapply(results, function(result) result[1,])
results_recur1 # Results are different for different scenarios
#> scenario1 scenario2 scenario3 scenario4
#> C 0.60606061 0.61538462 0.62294515 0.62925894
#> L 0.33333333 0.32307692 0.31476034 0.30781517
#> I 0.06060606 0.06153846 0.06229451 0.06292589
# Visualise the change in the marginal probability of the first recurrence
plot_simplex(c("Recrudescence", "Relapse", "Reinfection")) # Plot simplex
#> NULL
xy <- apply(results_recur1, 2, project2D) # Project probabilities
points(x = xy["x", ], y = xy["y", ], pch = "-", col = 1:4) # Plot projections
legend("left", col = 1:4, pch = "-", pt.cex = 2, bty = "n", legend = 1:4,
title = "Recurrence \n count") # legend