SARS-CoV-2 Tutorial

SARS-CoV-2 Tutorial#

This tutorial shows the basics of how to interact with V-pipe.

Requirements#

The tutorial assumes that you have installed V-pipe using the quick install installation documentation, and that the workflow is setup with the following structure:

vp-analysis
├── V-pipe
├── mambaforge
└── work

vp-analysis is the main directory where you have installed V-pipe
V-pipe is the directory with V-pipe’s own code
mambaforge has dependencies to start using V-pipe (bioconda, conda-forge, mamba, snakemake)
work is the directory where you have performed the test analysis

Setting up the work directory#

We will create a fresh work directory for this tutorial. A V-pipe work directory typically contains the following files and directories:

config.yaml: the configuration file. For example to tell V-pipe where to find the samples and the reference genome. All configuration options are described in the configuration schema
samples.tsv: a tab-separated file listing the samples to be processed. The first two columns are mandatory and represent the hierarchical levels of the samples. The third and fourth column are optional and contain the read length and protocol name.
vpipe: a wrapper script to start the workflow
samples/: the directory containing the raw data of the samples

And after running the workflow:

results/: the results of the workflow
.snakemake: the directory containing the snakemake working files

For your convenience, you can set up a boilerplate working directory with the script init.sh. This will copy a config.yaml and the vpipe wrapper script to get started:

cd vp-analysis

mkdir -p work_sarscov2
cd work_sarscov2
../V-pipe/init_project.sh

Preparing the dataset#

In this section, we will prepare the dataset. We will download the reads from the SRA and prepare the directory structure as described in the configuration documentation.

The dataset has been generated by doi:10.1093/nsr/nwaa036, and is avaible from the sequence read archive (SRA): SRR10903401 and SRR10903402.

Below you can find a bash script that downloads the reads from the SRA and prepares the directory structure. Run these commands in the work_sarscov2 directory.

download_reads () {
  # SRR ID
  SRR_ID=$1
  # path to the directory containing the reads on the ENA FTP
  FTP_PATH=$2

  # creating the raw_data directory
  mkdir -p samples/$SRR_ID/20200102/raw_data

  # downloading the reads from ENA
  # note that we rename the _1 suffix to _R1 and _2 to _R2
  curl \
  -o samples/$SRR_ID/20200102/raw_data/"${SRR_ID}"_R1.fastq.gz \
  ftp://ftp.sra.ebi.ac.uk/"$FTP_PATH"/"$SRR_ID"/"$SRR_ID"_1.fastq.gz

  curl \
  -o samples/$SRR_ID/20200102/raw_data/"${SRR_ID}"_R2.fastq.gz \
  ftp://ftp.sra.ebi.ac.uk/"$FTP_PATH"/"$SRR_ID"/"$SRR_ID"_2.fastq.gz
}

# executing the function for the two SRR IDs
download_reads SRR10903401 vol1/fastq/SRR109/001
download_reads SRR10903402 vol1/fastq/SRR109/002

The downloaded files should have the following structure:

📁samples
├───📁SRR10903401
│   └───📁20200102
│       └───📁raw_data
│           ├───🧬SRR10903401_R1.fastq
│           └───🧬SRR10903401_R2.fastq
└───📁SRR10903402
    └───📁20200102
        └───📁raw_data
            ├───🧬SRR10903402_R1.fastq
            └───🧬SRR10903402_R2.fastq

Configuration#

In the work_sarscov2 directory you can find the file config.yaml. Open it in your editor and copy-paste the contents of the yaml file below in there. Our data is from the SARS-CoV-2 virus. Therefore, we can use the reference genome and annotation files provided by V-pipe, that we can specify at general.virus_base_config. In addition to that, we will set the input.read_length to 150 and output.snv to true to enable the SNV calling:

general:
    virus_base_config: 'sars-cov-2'

input:
    samples_file: samples.tsv
    read_length: 150

output:
    trim_primers: false
    snv: true
    local: false
    global: false
    visualization: false
    diversity: false
    QA: false
    upload: false
    dehumanized_raw_reads: false

You can also get it from the V-pipe repository with:

cp ../V-pipe/docs/example_sarscov2_data/config.yaml .

Running V-pipe#

After having prepared the configuration, we can continue with a dry run to see what gets executed and to create the samples.tsv file:

./vpipe --dryrun --cores 2

As this is your first run of V-pipe, it will automatically generate the sample collection table (samples.tsv). Check samples.tsv in your editor. It is always a good idea check the content of the samples.tsv file, as it is used to collect the samples for the analysis. Of course, you can also provide samples.tsv yourself, before running the pipeline. If you did not use the expected directory structure, this file might end up empty or some entries might be missing. If so, you can safely delete it and re-run with option --dry-run to regenerate it. More information on the samples.tsv file can be found in the documentation.

Finally, we can run the V-pipe analysis. The first run will take a while because it will install all necessary software dependencies with conda:

./vpipe -p --cores 2
# -p and --cores (and all other options) are passed to snakemake. -p is for printing shell cmds. 
# takes a while to run, needs to install packages

Note

Note that vpipe is a wrapper for snakemake. All options that are passed to vpipe are options to snakemake. More information about snakemake options can be found in the snakemake documentation.

Output#

The output of the SNV calling is aggregated in a standard VCF file, located in results/{hierarchy}/variants/SNVs/snvs.vcf, you can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in results/{hierarchy}/variants/SNVs/snvs.csv.

Expected output#

The small dataset that we used in this tutorial section has been analyzed by doi:10.1093/nsr/nwaa036. The results of the original analysis (using bwa, samtools mpileup, and bcftools) are displayed in Table 2 in the article:

Using either the VCF or CSV files, compare with the results given out by V-pipe (with bwa and ShoRAH).

For positions 19164 and 24323 of SRR10903401 and position 11563 of SRR10903402, we expect to see similar results in V-pipe.
For the remaining positions (1821, 26314 and 26590 of SRR10903401), we expect that ShoRAH will consider the variants of poor quality and reject them because there is very little support ( <= than 5 reads supporting the alt).