A Julia package for inferring the strength of selection from cancer sequencing data. Package simultaneously estimates the number of subclones in the population and their relative fitnesses. This is done by generating synthetic data by simulating different population dynamics (see CancerSeqSim.jl) and fitting this to data using Approximate Bayesian Computation (see ApproxBayes.jl).
The package enables analysis as described in Quantification of subclonal selection in cancer from bulk sequencing data. See this paper for the technical background.
To download the package, once you're in a Julia session type the following command:
Then once you are in a julia session the package can be loaded with
Below is an example of how to run an analysis, for some more examples with more details go here.
Running an analysis requires variant allele frequencies (VAFs) as measured in deep sequencing of cancer samples. Generating the synthetic data assumes that the cancer is diploid, therefore any mutations falling in non-diploid regions should be removed. This does unfortunately mean that in highly aneuploid tumours, there will not be enough mutations to perform an analysis. We would recommend a minimum of 100 mutations.
The main function to perform an analysis is
fitABCmodels. This takes as its first argument either a vector of Floats or a string pointing to a text file with a vector of floats, which will be read in automatically. The second argument is the name of the sample you wish to analyse which will be used to write the data and plots to a file. There are then a number of keyword arguments set to reasonable defaults. More details of these arguments and their defaults can be found by typing
?fitABCmodels in a Julia session.
There is some example data generated from the simulation found in the examples directory. For example the following command will run the inference on the
oneclone.txt data set with 400 posterior samples and 5*10^5 iterations given sequencing depth of sample is 300X:
using Random Random.seed!(123) out = fitABCmodels("example/oneclone.txt", "oneclone", read_depth = 300, resultsdirectory = "example", Nmaxinf = 10^6, maxiterations = 5*10^5, save = true, nparticles = 400);
Running this is quite computationally expensive, so running on a cluster would be recommended in most cases.
Also included are a number of functions to summarize the output and plot the posterior. The output of the fitABCmodels function prints a summary of the posterior distribution as well as some details on the ABC convergence. For example for the above we can see a print out of the model fitting results:
out Total number of simulations: 5.31e+05 Cumulative number of simulations = [400, 1501, 3347, 6755, 11325, 17583, 25144, 43005, 94935, 225243, 531167] Acceptance ratio: 7.53e-04 Tolerance schedule = [5701.38, 3576.56, 2353.92, 1721.4, 1348.54, 1156.23, 1021.12, 877.64, 721.45, 575.07, 467.33] Model probabilities: Model 1 (0 subclones): 0.003 Model 2 (1 subclones): 0.819 Model 3 (2 subclones): 0.178 Parameters: Model 1 (0 subclones) Median (95% intervals): Parameter 1 - μ/β: 21.81 (20.75,24.80) Parameter 2 - Clonal Mutations: 352.81 (345.11,382.23) Parameter 3 - Cellularity: 0.68 (0.66,0.69) Model 2 (1 subclones) Median (95% intervals): Parameter 1 - μ/β: 17.81 (14.30,23.13) Parameter 2 - Clonal Mutations: 308.29 (232.27,362.17) Parameter 3 - Fitness: 0.86 (0.40,2.27) Parameter 4 - Time (tumour doublings): 8.15 (5.06,12.50) Parameter 5 - Cellularity: 0.70 (0.68,0.74) Parameter 6 - Subclone Frequency: 0.62 (0.51,0.70) Parameter 7 - Subclone Mutations: 222.00 (151.71,295.00) Model 3 (2 subclones) Median (95% intervals): Parameter 1 - μ/β: 14.88 (11.83,18.84) Parameter 2 - Clonal Mutations: 304.75 (252.96,366.35) Parameter 3 - Fitness - Subclone 1: 1.38 (0.41,23.07) Parameter 4 - Time (tumour doublings) - Subclone 1: 9.88 (4.10,13.98) Parameter 5 - Fitness - Subclone 2: 0.84 (0.14,3.05) Parameter 6 - Time (tumour doublings) - Subclone 2: 8.71 (2.54,14.05) Parameter 7 - Cellularity: 0.70 (0.67,0.73) Parameter 8 - Subclone 1 Frequency: 0.63 (0.53,0.72) Parameter 9 - Subclone 2 Frequency: 0.09 (0.05,0.51) Parameter 10 - Subclone 1 Mutations: 216.00 (82.09,280.98) Parameter 11 - Subclone 2 Mutations: 192.00 (72.94,301.17)
We can also plot the posterior distributions.
Plot the posterior model probabilities.
Plot the histogram for model with 1 subclone (second argument is the number of subclones).
Plot the posterior parameter distribution for model with 1 subclone.
Note the ground truth of the parameters in this case are. And as can be seen we do a reasonably good job of recovering the parameters. All ground truth parameters are within the 95% credible intervals.
Finally we can also save all plots and text files with posterior distributions to a directory, unless specified the default will be to create a file a directory called
output in the current working directory. It is also possible to automatically save the output by adding a
save=true keywords to
saveresults(out; resultsdirectory = "example") saveallplots(out, resultsdirectory = "example")
5 months ago