Tutorial
Tutorial.Rmd
library(ASAP)
Introduction
ASAP is an R package that leverages PCA and NNLS (Non-Negative Least Squares) to assess the ancestral composition of admixed individuals with high accuracy and reliability, the approach further described in here. ASAP assigns the genetic ancestry of list of target samples or populations given a list of putative sources.
ASAP estimations are based on a PCA where the targets and source groups are available, along with those we suggest to use additional groups to better define the PC space. Once that the PC space is defined, a set of NNLS is then applied on the PC coordinates, effectively summarizing the genetic ancestry.
For its most basic usage ASAP needs:
- A PCA matrix (a dataframe with N PCs)
- A list of target and source groups (or samples)
Basic usage example
Right below you can find a very basic way to run ASAP, if you want more details, please keep on reading.
$ pca = read_eigen(pca_input = 'data/TOY.pca.evec')
$ ASAP_result = ASAP(pca_input = pca, sources = c('GST1','GST2'), admixed=c('70GST1.30GST2'))
ASAP package has two functions to read the PCA matrix, read_eigen()
and read_flash()
.
* read_eigen()
will read a PCA that has been created with smartpca from the EIGENSOFT software.
* read_flash()
instead, will read a PCA that has been created with flashpca software.
The goal of both functions is to set the PCA file so that it looks as follows:
POP | IND | PCN |
---|---|---|
Source1 | IND_1 | 0.01 |
Source1 | IND_2 | 0.02 |
Source2 | IND_1 | -0.03 |
Admix1 | IND_1 | 0.08 |
If neither read_eigen()
nor read_flash()
is for you, you might want to simply use read.table()
, and set the file so that it has the aforementioned look.
For the sake of the example, let's say you obtained a PCA from the software EIGENSOFT, using smartpca.
$ pca = read_eigen(pca_input = 'data/TOY.pca.evec')
The second step is to set up an 'AS_file', a file with the list of the Admixed groups (A) and the Source groups (S). If you want to use ASAP sample-wise rather than group-wise, simply adjust the PCA file so that in the 'POP' column is identical to the 'IND' column, and set the AS_file with the samples list, rather than the group list.
The AS_file is a two-columns file with the population list on the first column, and the 'A/S' information on the second column. The 'A/S' information stands for Admixed (A) or Source (S), for each population/group we will indicate whether ASAP should consider it as a Source (S) or as an admixed target (A), the file looks like this:
POP A/S | |
---|---|
Source1 | S |
Source2 | S |
Admix1 | A |
To read the AS_file, a simple read_table(file, header=T)
will be sufficient.
$ AS_file = read.table('data/Example_AS', header=TRUE)
With the PCA and AS_file loaded, we are finally ready to run ASAP as follows:
$ ASAP_result = asap(pca_input = pca, as_file = AS_file)
You can avoid relying on the AS_file if you wish, and rather use a vector of the target and source groups directly in asap() function, as follows:
$ ASAP_result = asap(pca_input = pca, sources = c('GST1','GST2'), admixed=c('70GST1.30GST2'))
Finally, if you want to save ASAP results on a table-like format, you can use write_asap()
, this way:
$ ASAP_result <- asap(pca_input = pca, sources = c('GST1','GST2'), admixed=c('70GST1.30GST2'))
$ write_asap(ASAP_input = ASAP_result, output_name = 'my_dir/my_asap_results.txt')
Plotting
The function plot_asap()
can help you plot asap()
results. Based on your preferences, you can either rely on basic R or on ggplot2 to plot your results. Let's say we want to plot ASAP results using ggplot2:
$ ASAP_result <- asap(pca_input = pca, sources = c('GST1','GST2'), admixed=c('70GST1.30GST2'))
$ plot_asap(ASAP_result, 'ASAP_plot', type_ggplot = 'YES')
Alternatively, to plot with basic R, just run the command without type_ggplot
option, like so:
$ ASAP_result <- asap(pca_input = pca, sources = c('GST1','GST2'), admixed=c('70GST1.30GST2'))
$ plot_asap(ASAP_result, 'ASAP_plot')
You will find a barplot in your working directory.
Euclidean Distances between Sources
The function pcs_distances
allows you to estimate and visualize the Euclidean Distances (ED) between pairs of sources. To run the function you need to provide:
-
pca_input
takes a pca matrix -
output_name
string with path an output file name -
sources_file
two-columns table with the list of source pairs to test:
S1 S1 | |
---|---|
GST1 | GST2 |
GST2 | GST3 |
GST3 | GST4 |
-
return_plot = 'YES'/NULL
allows you to decide whether to plot the distances ('YES') or not (NULL)
$ pca = read_eigen(pca_input = 'data/TOY.pca.evec')
$ Source_D = read.table('data/Sources_Distances', header =T)
$ pcs_distances(pca_input = pca, output_name = 'my/res/dir/output_file', sources_file = Source_D, return_plot = 'YES')
Resampling
To produce Standard Errors (SE), you can rely on resampling techniques. Let's say we produce 20 PCAs to estimate the SE, each of them on a dataset with a different SNPs subset: we thus end up with 20 PCAs, all with slighly different values of PC coordinates.
The function read.resampling()
reads all 20 PCAs and perform ASAP on each of them. It will return a list containing the results for each resampling (20 ASAP results). Given that read.resampling()
will read and perform ASAP on each resampled set, the function uses several arguments:
I)path_tofiles
: path to the directory where all the 20 PCAs are stored
II)file_pattern
, a string containing the name shared by all 20 PCAs, the function will use the string as a pattern to find all files with said string. If this parameter seems confusing, consider that we modelled the function considering that in a resampling setting one could name the PCAs as follows: 'PCA_Resampled_1', 'PCA_Resampled_2',...,'PCA_Resampled_20'. A file_patters as 'PCA_Resampled_' will find all files from 'PCA_Resampled_1' to 'PCA_Resampled_20'.
III)as_file
, this is the AS_file we used in asap()
. It is a two-columns file with the population list on the first column, and the 'A/S' information on the second column. The 'A/S' information stands for Admixed (A) or Source (S), for each population/group we will indicate whether ASAP should consider it as a Source (S) or as an admixed target (A), the file looks like this:
POP A/S | |
---|---|
Source1 | S |
Source2 | S |
Admix1 | A |
-
eigentype
, this is an optional argument. If present, PCA will be read through read_eigen() function, if absent PCA will be loaded viaread_flash()
.
All in all, the function can be ran as such:
$ AS_file = read.table('data/Example_AS', header=TRUE)
$ ASAP_resampling = read.resampling(path_tofiles = 'data/', file_pattern = 'TOY_Jack', as_file = AS_file, eigentype)
Once you ran ASAP on all 20 of your resampled PCAs, to estimate the SE you can use the se.resampling()
function. To do so, you will need: * your main ASAP results, obtained with asap()
on the main PCA (the PCA with all the available SNPs), * the ASAP results obtained from the 20 resampled PCAs (with the function read.resampling()
you can see above).
In the example below, you can see that we first loaded the AS_file through read.table()
, the ASAP main results are obtained with asap()
, the ASAP resampling results are obtained with read.resampling()
. Importantly, to ran se.resampling()
, you will also need the number of resampled analyses you ran, for this tutorial we ran 20 PCAs, therefore the parameter 'chromovec' is set as 20 (integer).
$ main_pca = read_eigen(pca_input = 'data/TOY.pca.evec')
$ AS_file = read.table('data/Example_AS', header=TRUE)
$ ASAP_main_result <- asap(pca_input = main_pca, as_file = AS_file)
$ ASAP_resampling = read.resampling(path_tofiles = 'data/', file_pattern = 'TOY_Jack', as_file = AS_file, eigentype)
$ chromovec = 20
$ ASAP_SE = se.resampling(ASAP_main_result, ASAP_resampling, chromovec)