SARS-CoV-2 Tutorial#
This tutorial shows the basics of how to interact with V-pipe.
Requirements#
The tutorial assumes that you have installed V-pipe using the quick install installation documentation, and that the workflow is setup with the following structure:
vp-analysis
├── V-pipe
├── mambaforge
└── work
vp-analysisis the main directory where you have installed V-pipeV-pipeis the directory with V-pipe’s own codemambaforgehas dependencies to start using V-pipe (bioconda, conda-forge, mamba, snakemake)workis the directory where you have performed the test analysis
Setting up the work directory#
We will create a fresh work directory for this tutorial. A V-pipe work directory typically contains the following files and directories:
config.yaml: the configuration file. For example to tell V-pipe where to find the samples and the reference genome. All configuration options are described in the configuration schemasamples.tsv: a tab-separated file listing the samples to be processed. The first two columns are mandatory and represent the hierarchical levels of the samples. The third and fourth column are optional and contain the read length and protocol name.vpipe: a wrapper script to start the workflowsamples/: the directory containing the raw data of the samples
And after running the workflow:
results/: the results of the workflow.snakemake: the directory containing the snakemake working files
For your convenience, you can set up a boilerplate working directory with the script init.sh. This will copy a config.yaml and the vpipe wrapper script to get started:
cd vp-analysis
mkdir -p work_sarscov2
cd work_sarscov2
../V-pipe/init_project.sh
Preparing the dataset#
In this section, we will prepare the dataset. We will download the reads from the SRA and prepare the directory structure as described in the configuration documentation.
The dataset has been generated by doi:10.1093/nsr/nwaa036, and is avaible from the sequence read archive (SRA): SRR10903401 and SRR10903402.
Below you can find a bash script that downloads the reads from the SRA and prepares the directory structure. Run these commands in the work_sarscov2 directory.
download_reads () {
# SRR ID
SRR_ID=$1
# path to the directory containing the reads on the ENA FTP
FTP_PATH=$2
# creating the raw_data directory
mkdir -p samples/$SRR_ID/20200102/raw_data
# downloading the reads from ENA
# note that we rename the _1 suffix to _R1 and _2 to _R2
curl \
-o samples/$SRR_ID/20200102/raw_data/"${SRR_ID}"_R1.fastq.gz \
ftp://ftp.sra.ebi.ac.uk/"$FTP_PATH"/"$SRR_ID"/"$SRR_ID"_1.fastq.gz
curl \
-o samples/$SRR_ID/20200102/raw_data/"${SRR_ID}"_R2.fastq.gz \
ftp://ftp.sra.ebi.ac.uk/"$FTP_PATH"/"$SRR_ID"/"$SRR_ID"_2.fastq.gz
}
# executing the function for the two SRR IDs
download_reads SRR10903401 vol1/fastq/SRR109/001
download_reads SRR10903402 vol1/fastq/SRR109/002
The downloaded files should have the following structure:
📁samples
├───📁SRR10903401
│ └───📁20200102
│ └───📁raw_data
│ ├───🧬SRR10903401_R1.fastq
│ └───🧬SRR10903401_R2.fastq
└───📁SRR10903402
└───📁20200102
└───📁raw_data
├───🧬SRR10903402_R1.fastq
└───🧬SRR10903402_R2.fastq
Configuration#
In the work_sarscov2 directory you can find the file config.yaml. Open it in your editor and copy-paste the contents of the yaml file below in there. Our data is from the SARS-CoV-2 virus. Therefore, we can use the reference genome and annotation files provided by V-pipe, that we can specify at general.virus_base_config. In addition to that, we will set the input.read_length to 150 and output.snv to true to enable the SNV calling:
general:
virus_base_config: 'sars-cov-2'
input:
samples_file: samples.tsv
read_length: 150
output:
trim_primers: false
snv: true
local: false
global: false
visualization: false
diversity: false
QA: false
upload: false
dehumanized_raw_reads: false
You can also get it from the V-pipe repository with:
cp ../V-pipe/docs/example_sarscov2_data/config.yaml .
Running V-pipe#
After having prepared the configuration, we can continue with a dry run to see what gets executed and to create the samples.tsv file:
./vpipe --dryrun --cores 2
As this is your first run of V-pipe, it will automatically generate the sample collection table (samples.tsv). Check samples.tsv in your editor. It is always a good idea check the content of the samples.tsv file, as it is used to collect the samples for the analysis. Of course, you can also provide samples.tsv yourself, before running the pipeline. If you did not use the expected directory structure, this file might end up empty or some entries might be missing. If so, you can safely delete it and re-run with option --dry-run to regenerate it. More information on the samples.tsv file can be found in the documentation.
Finally, we can run the V-pipe analysis. The first run will take a while because it will install all necessary software dependencies with conda:
./vpipe -p --cores 2
# -p and --cores (and all other options) are passed to snakemake. -p is for printing shell cmds.
# takes a while to run, needs to install packages
Note
Note that vpipe is a wrapper for snakemake. All options that are passed to vpipe are options to snakemake. More information about snakemake options can be found in the snakemake documentation.
Output#
The output of the SNV calling is aggregated in a standard VCF file, located in results/{hierarchy}/variants/SNVs/snvs.vcf, you can open it with your favorite VCF tools for visualisation or downstream processing.
It is also available in a tabular format in results/{hierarchy}/variants/SNVs/snvs.csv.
Expected output#
The small dataset that we used in this tutorial section has been analyzed by doi:10.1093/nsr/nwaa036. The results of the original analysis (using bwa, samtools mpileup, and bcftools) are displayed in Table 2 in the article:
Using either the VCF or CSV files, compare with the results given out by V-pipe (with bwa and ShoRAH).
For positions 19164 and 24323 of SRR10903401 and position 11563 of SRR10903402, we expect to see similar results in V-pipe.
For the remaining positions (1821, 26314 and 26590 of SRR10903401), we expect that ShoRAH will consider the variants of poor quality and reject them because there is very little support ( <= than 5 reads supporting the alt).