Viral Recon
nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports both Illumina and Nanopore sequencing data. For Illumina short-reads the pipeline is able to analyse metagenomics data typically obtained from shotgun sequencing (e.g. directly from clinical samples) and enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based). For Nanopore data the pipeline only supports amplicon-based analysis obtained from primer sets created and maintained by the ARTIC Network.*
*Pulled from [https://nf-co.re/viralrecon](https://nf-co.re/viralrecon)
Note
The module runs Nextflow on the backend and therefore uses Docker within Docker.
Parameters
- Fastq Dir: Dir
Directory of basecalled FASTQ files
- Fast5 Dir: Dir
Directory of FAST5 files from which the FASTQ files were basecalled
Returns
- Consensus: ./viralrecon/medaka|nanopolish
Consensus FASTA files are produced for either assembly process (medaka or nanopolish)
- MultiQC Report: ./viralrecon/multiqc/multiqc_report.html
HTML file that summarizes your run
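A minimal sketch for collecting these outputs after a run. It assumes the paths above are relative to the pipeline output directory and that consensus files end in `.fa` or `.fasta`; the function name and the file-extension rule are illustrative and are not part of the module.

```python
from pathlib import Path


def find_viralrecon_outputs(outdir, caller="medaka"):
    """Return (consensus_files, multiqc_report) found under an output directory.

    Assumptions: the layout is <outdir>/viralrecon/<caller>/... and
    <outdir>/viralrecon/multiqc/multiqc_report.html, and consensus files
    use a .fa or .fasta extension.
    """
    root = Path(outdir) / "viralrecon"
    caller_dir = root / caller  # "medaka" or "nanopolish"
    consensus = []
    if caller_dir.is_dir():
        consensus = sorted(
            p for p in caller_dir.rglob("*") if p.suffix in {".fa", ".fasta"}
        )
    report = root / "multiqc" / "multiqc_report.html"
    return consensus, (report if report.is_file() else None)
```

The report is returned as `None` when it is missing, so a caller can tell an unfinished run from a finished one.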
Ensure you’ve loaded a run with a fastq and fast5 directory specified.
Select one of the included primer schemes from the drop-down list. For this example, the data is ncov-related, so we will choose "Default Genome fasta for SARS-nCoV-2".
Select one of the basecaller options, medaka or nanopolish.
Select the Play button to start the pipeline.
About MultiQC
This report was generated using MultiQC, version 1.11
You can see a YouTube video describing how to use MultiQC reports here: https://youtu.be/qPbIlO_KWN0
For more information about MultiQC, including other videos and extensive documentation, please visit http://multiqc.info
You can report bugs, suggest improvements and find the source code for MultiQC on GitHub: https://github.com/ewels/MultiQC
MultiQC is published in Bioinformatics:
MultiQC: Summarize analysis results for multiple tools and samples in a single report
Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
Bioinformatics (2016)
doi: 10.1093/bioinformatics/btw354
PMID: 27312411
A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.
This report has been generated by the nf-core/viralrecon analysis pipeline. For information about how to interpret these results, please see the documentation.
Report generated on 2022-06-28, 20:17 based on data in: /opt/data/work/3d/4888d88f879efecfe230d579e6e085
Variant calling metrics
generated by the nf-core/viralrecon pipeline
Sample | # Mapped reads | Coverage median | % Coverage > 1x | % Coverage > 10x | # SNPs | # INDELs | # Missense variants | # Ns per 100kb consensus | Pangolin lineage | Nextclade clade |
---|---|---|---|---|---|---|---|---|---|---|
single_barcode | 20266 | 173.00 | 100.00 | 100.00 | 5 | NA | 3 | 407.99 | A.1 | 19B |
Variant calling metrics: Columns
Group | Column | Description | ID | Scale |
---|---|---|---|---|
Variant calling metrics | # Mapped reads | Total number of mapped reads relative to the viral genome | # Mapped reads | None |
Variant calling metrics | Coverage median | Median coverage calculated by mosdepth | Coverage median | None |
Variant calling metrics | % Coverage > 1x | Coverage > 1x calculated by mosdepth | % Coverage > 1x | None |
Variant calling metrics | % Coverage > 10x | Coverage > 10x calculated by mosdepth | % Coverage > 10x | None |
Variant calling metrics | # SNPs | Total number of SNPs called by artic minion that pass quality filters | # SNPs | None |
Variant calling metrics | # INDELs | Total number of INDELs called by artic minion that pass quality filters | # INDELs | None |
Variant calling metrics | # Missense variants | Total number of missense mutations identified by variant annotation with SnpEff | # Missense variants | None |
Variant calling metrics | # Ns per 100kb consensus | Number of N bases per 100kb in consensus sequence generated by artic minion | # Ns per 100kb consensus | None |
Variant calling metrics | Pangolin lineage | Pangolin lineage inferred from the consensus sequence generated by artic minion | Pangolin lineage | None |
Variant calling metrics | Nextclade clade | Nextclade clade inferred from the consensus sequence generated by artic minion | Nextclade clade | None |
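As a rough illustration of the "# Ns per 100kb consensus" column, the sketch below counts N bases and scales the count to 100 kb. The function name and the toy sequence are hypothetical. The report does not state the consensus length or the N count for this sample, so the 29,903-base length (that of the MN908947.3 reference) and the 122 Ns are assumptions chosen only because they happen to reproduce the reported 407.99.

```python
def ns_per_100kb(sequence):
    """Number of N bases per 100 kb of sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    n_count = sequence.upper().count("N")
    return n_count * 100_000 / len(sequence)


# Hypothetical consensus: 29,903 bases containing 122 Ns (both assumed).
toy_consensus = "N" * 122 + "A" * (29_903 - 122)
print(round(ns_per_100kb(toy_consensus), 2))  # 407.99
```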
Pangolin
Pangolin uses variant calls to assign SARS-CoV-2 genome sequences to global lineages.
Run table
Statistics gathered from the input pangolin files. This table shows some of the metrics parsed by Pangolin. Longer help text for certain columns is shown below:
- Conflict
  - In the pangoLEARN decision tree model, a given sequence gets assigned to the most likely category based on known diversity. If a sequence can fit into more than one category, the conflict score will be greater than 0 and reflect the number of categories the sequence could fit into. If the conflict score is 0, this means that within the current decision tree there is only one category that the sequence could be assigned to.
- Ambiguity score
  - This score is a function of the quantity of missing data in a sequence. It represents the proportion of relevant sites in a sequence which were imputed to the reference values. A score of 1 indicates that no sites were imputed, while a score of 0 indicates that more sites were imputed than were not imputed. This score only includes sites which are used by the decision tree to classify a sequence.
- Scorpio conflict
  - The conflict score is the proportion of defining variants which have the reference allele in the sequence. Ambiguous/other non-ref/alt bases at each of the variant positions contribute only to the denominators of these scores.
- Note
  - If there are any conflicts from the decision tree, this field will output the alternative assignments. If the sequence failed QC, this field will describe why. If the sequence met the SNP thresholds for scorpio to call a constellation, it will describe the exact SNP counts of Alt, Ref and Amb (alternative, reference and ambiguous) alleles for that call.
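The help text above can be read as a small set of rules. The sketch below encodes only the two boundary cases it describes (a conflict of 0 and an ambiguity score of 1); the function name and the wording of the messages are illustrative and do not come from Pangolin.

```python
def describe_pangolin_scores(conflict, ambiguity_score):
    """Translate the conflict and ambiguity scores into plain-language notes."""
    notes = []
    if conflict == 0:
        notes.append("only one category fits within the current decision tree")
    else:
        notes.append("the sequence could fit more than one category")
    if ambiguity_score == 1:
        notes.append("no relevant sites were imputed")
    else:
        notes.append("some relevant sites were imputed to reference values")
    return notes


# Values shown for single_barcode in the run table of this report.
for note in describe_pangolin_scores(conflict=0.0, ambiguity_score=1.0):
    print(note)
```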
Sample Name | Lineage | Conflict | Ambiguity | S call | S support | S conflict | Version | Pangolin version | PangoLEARN version | Pango version | QC Status | Note |
---|---|---|---|---|---|---|---|---|---|---|---|---|
single_barcode | A.1 | 0.0 | 1.0 | | | | PLEARN-v1.2.123 | 3.1.20 | 2022-01-20 | v1.2.123 | Pass | |
Pangolin Run details: Columns
Group | Column | Description | ID | Scale |
---|---|---|---|---|
Pangolin | Lineage | The most likely lineage assigned to a given sequence based on the inference engine used and the SARS-CoV-2 diversity designated. | lineage | None |
Pangolin | Conflict | Conflict between categories in decision tree | conflict | None |
Pangolin | Ambiguity | Quantity of missing data in a sequence | ambiguity_score | None |
Pangolin | S call | Scorpio: If a query is assigned a constellation by scorpio this call is output in this column | scorpio_call | None |
Pangolin | S support | Scorpio: The proportion of defining variants which have the alternative allele in the sequence. | scorpio_support | None |
Pangolin | S conflict | Scorpio: The proportion of defining variants which have the reference allele in the sequence. | scorpio_conflict | None |
Pangolin | Version | A version number that represents both the pango-designation number and the inference engine used to assign the lineage | version | None |
Pangolin | Pangolin version | The version of pangolin software running. | pangolin_version | None |
Pangolin | PangoLEARN version | The dated version of the pangoLEARN model installed. | pangoLEARN_version | None |
Pangolin | Pango version | The version of pango-designation lineages that this assignment is based on. | pango_version | None |
Pangolin | QC Status | Indicates whether the sequence passed the QC thresholds for minimum length and maximum N content. | qc_status | None |
Pangolin | Note | Additional information from Pangolin | note | None |
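The S support and S conflict columns are both proportions over the defining variants of a constellation. A minimal sketch of that arithmetic follows, assuming the denominator is the sum of alternative, reference and ambiguous/other calls as the help text describes; the function and argument names are hypothetical and this is not scorpio's own code.

```python
def scorpio_scores(alt, ref, ambiguous_or_other=0):
    """Return (support, conflict) as proportions of the defining variants.

    Ambiguous or other bases count only towards the denominator.
    """
    total = alt + ref + ambiguous_or_other
    if total == 0:
        raise ValueError("no defining variants were counted")
    return alt / total, ref / total


# Made-up counts: 8 alternative, 1 reference and 1 ambiguous call.
support, conflict = scorpio_scores(alt=8, ref=1, ambiguous_or_other=1)
print(support, conflict)  # 0.8 0.1
```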
pycoQC
pycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data.
Statistics
Sample Name | N50 - Pass (bp) | N50 - All (bp) | Median read qual - Pass | Median read qual - All | Active Channels - Pass | Active Channels - All | Run duration (h) |
---|---|---|---|---|---|---|---|
pycoqc | 507 | 507 | 12.3 | 12.3 | 503 | 503 | 48.0 |
Pycoqc Stats Table: Columns
Group | Column | Description | ID | Scale |
---|---|---|---|---|
pycoQC | N50 - Pass (bp) | N50 - passing reads (base pairs) | passed_n50 | n50 |
pycoQC | N50 - All (bp) | N50 - all reads (base pairs) | all_n50 | n50 |
pycoQC | Median read qual - Pass | Median PHRED quality score - passing reads | passed_median_phred_score | phred |
pycoQC | Median read qual - All | Median PHRED quality score - all reads | all_median_phred_score | phred |
pycoQC | Active Channels - Pass | Number of active channels - passing reads | passed_channels | channels |
pycoQC | Active Channels - All | Number of active channels - all reads | all_channels | channels |
pycoQC | Run duration (h) | Run duration (hours) | all_run_duration | None |
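For readers unfamiliar with the N50 statistic, here is a short sketch of the usual definition: the read length at which reads of that length or longer contain at least half of all sequenced bases. This is a generic illustration, not pycoQC's implementation, and the read lengths are made up.

```python
def n50(read_lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    lengths = sorted(read_lengths, reverse=True)
    if not lengths:
        raise ValueError("no read lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


print(n50([2, 3, 4, 5, 6]))  # 5: the reads of length 6 and 5 hold 11 of 20 bases
```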
Read / Base counts
Number of sequenced reads / bases passing and failing QC thresholds.
Read length
Distribution of read length for all / passed reads.
Quality scores
Distribution of quality scores for all / passed reads.
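A PHRED quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below applies that standard conversion to the median read quality reported above; treating a median read quality as a single per-base error rate is a simplification.

```python
def phred_to_error_probability(q):
    """Convert a PHRED quality score to the corresponding error probability."""
    return 10 ** (-q / 10)


# Median read quality reported for this run: 12.3
print(f"{phred_to_error_probability(12.3):.3f}")
```

For Q = 12.3 this gives an error probability of roughly 0.059, or about one error in 17 bases.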
Samtools
Samtools is a suite of programs for interacting with high-throughput sequencing data.
Samtools Flagstat
This module parses the output from samtools flagstat. All numbers in millions.
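To show what parsing flagstat output involves, the sketch below reads the default text format, in which each line has the form `<QC-passed> + <QC-failed> <description>`. It is a simplified stand-in for the MultiQC module rather than its actual code, and the counts in the example text are made up.

```python
import re

# Each flagstat line looks like: "1000 + 0 mapped (100.00% : N/A)"
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Return a list of (qc_passed, qc_failed, description) tuples."""
    rows = []
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match:
            passed, failed, description = match.groups()
            rows.append((int(passed), int(failed), description))
    return rows


# Illustrative input with made-up counts.
example = """1000 + 0 in total (QC-passed reads + QC-failed reads)
1000 + 0 mapped (100.00% : N/A)"""

for passed, failed, description in parse_flagstat(example):
    # Express counts in millions, as the report plot does.
    print(f"{description}: {passed / 1e6} M passed, {failed / 1e6} M failed")
```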
mosdepth
mosdepth performs fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing
Coverage distribution
Distribution of the number of locations in the reference genome with a given depth of coverage
For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position, while the breadth of coverage is the fraction of the reference sequence to which reads have been mapped with at least a given depth of coverage (Sims et al. 2014).
Defining coverage breadth in terms of coverage depth is useful, because sequencing experiments typically require a specific minimum depth of coverage over the region of interest (Sims et al. 2014), so the extent of the reference sequence that is amenable to analysis is constrained to lie within regions that have sufficient depth. With inadequate sequencing breadth, it can be difficult to distinguish the absence of a biological feature (such as a gene) from a lack of data (Green 2007).
For increasing coverage depths (1×, 2×, …, N×), coverage breadth is calculated as the percentage of the reference sequence that is covered by at least that number of reads, and coverage breadth (y-axis) is then plotted against coverage depth (x-axis). This plot shows the relationship between sequencing depth and breadth for each read dataset, which can be used to gauge, for example, the likely effect of a minimum depth filter on the fraction of a genome available for analysis.
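The breadth calculation described here can be sketched in a few lines, assuming a list of per-base depths is already available (mosdepth produces its own summary files; this is only an illustration with made-up depths).

```python
def coverage_breadth(depths, max_depth):
    """Percentage of positions covered by at least d reads, for d = 1..max_depth."""
    if not depths:
        raise ValueError("no depths given")
    total = len(depths)
    return {
        d: 100.0 * sum(1 for depth in depths if depth >= d) / total
        for d in range(1, max_depth + 1)
    }


# Four reference positions with depths 0, 1, 2 and 3.
print(coverage_breadth([0, 1, 2, 3], max_depth=3))  # {1: 75.0, 2: 50.0, 3: 25.0}
```

Breadth can only stay level or fall as the depth threshold rises, which is why the plotted curve never increases.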
Coverage plot
Number of locations in the reference genome with a given depth of coverage
For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position (Sims et al. 2014).
Bases of a reference sequence (y-axis) are grouped by their depth of coverage (0×, 1×, …, N×) (x-axis). This plot shows the frequency of coverage depths relative to the reference sequence for each read dataset, which provides an indirect measure of the level and variation of coverage depth in the corresponding sequenced sample.
If reads are randomly distributed across the reference sequence, this plot should resemble a Poisson distribution (Lander & Waterman 1988), with a peak indicating approximate depth of coverage, and more uniform coverage depth being reflected in a narrower spread. The optimal level of coverage depth depends on the aims of the experiment, though it should at minimum be sufficiently high to adequately address the biological question; greater uniformity of coverage is generally desirable, because it increases breadth of coverage for a given depth of coverage, allowing equivalent results to be achieved at a lower sequencing depth (Sampson et al. 2011; Sims et al. 2014). However, it is difficult to achieve uniform coverage depth in practice, due to biases introduced during sample preparation (van Dijk et al. 2014), sequencing (Ross et al. 2013) and read mapping (Sims et al. 2014).
This plot may include a small peak for regions of the reference sequence with zero depth of coverage. Such regions may be absent from the given sample (due to a deletion or structural rearrangement), present in the sample but not successfully sequenced (due to bias in sequencing or preparation), or sequenced but not successfully mapped to the reference (due to the choice of mapping algorithm, the presence of repeat sequences, or mismatches caused by variants or sequencing errors). Related factors cause most datasets to contain some unmapped reads (Sims et al. 2014).
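As an illustration of the comparison with a Poisson distribution, the sketch below builds a depth histogram and the counts expected if reads were placed at random with a given mean depth. Both helper names are hypothetical and the depths are made up.

```python
import math
from collections import Counter


def depth_histogram(depths):
    """Number of reference positions observed at each depth of coverage."""
    return dict(sorted(Counter(depths).items()))


def poisson_expected_counts(mean_depth, n_positions, max_depth):
    """Expected number of positions at depths 0..max_depth under a Poisson model."""
    return [
        n_positions * math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
        for k in range(max_depth + 1)
    ]


observed = depth_histogram([0, 2, 2, 3])
expected = poisson_expected_counts(mean_depth=1.75, n_positions=4, max_depth=3)
print(observed)
print([round(value, 2) for value in expected])
```

A large excess of observed positions at depth 0 over the Poisson expectation would point to the dropout regions discussed above.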
Average coverage per contig
Average coverage per contig or chromosome
Bcftools
Bcftools contains utilities for variant calling and manipulating VCFs and BCFs.
Variant Substitution Types
Variant Quality
Indel Distribution
Variant depths
Read depth support distribution for called variants
SnpEff
SnpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).
Variants by Genomic Region
The stacked bar plot shows locations of detected variants in the genome and the number of variants for each location.
The upstream and downstream interval size to detect these genomic regions is 5000bp by default.
Variant Effects by Impact
The stacked bar plot shows the putative impact of detected variants and the number of variants for each impact.
There are four levels of impacts predicted by SnpEff:
- High: High impact (e.g. a stop codon)
- Moderate: Moderate impact (e.g. an amino acid substitution of the same type)
- Low: Low impact (i.e. a silent mutation)
- Modifier: No impact
Variants by Effect Types
The stacked bar plot shows the effect of variants at protein level and the number of variants for each effect type.
This plot shows the effect of variants with respect to the mRNA.
Variants by Functional Class
The stacked bar plot shows the effect of variants and the number of variants for each effect type.
This plot shows the effect of variants on the translation of the mRNA as protein. There are three possible cases:
- Silent: The amino acid does not change.
- Missense: The amino acid is different.
- Nonsense: The variant generates a stop codon.
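The three cases can be illustrated by translating the reference and alternative codons with the standard genetic code. This is a simplified sketch, not SnpEff's logic: it handles only single-codon substitutions, and it labels a change from a stop codon to an amino acid as missense.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def functional_class(ref_codon, alt_codon):
    """Classify a codon substitution as silent, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if alt_aa == ref_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


print(functional_class("GAT", "GAC"))  # silent: both codons encode Asp
print(functional_class("GAT", "GGT"))  # missense: Asp becomes Gly
print(functional_class("TAT", "TAA"))  # nonsense: Tyr becomes a stop codon
```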
Variant Qualities
The line plot shows the number of variants as a function of the variant quality score.
The quality score corresponds to the QUAL column of the VCF file. This score is set by the variant caller.
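To make the link to the VCF file concrete, the sketch below pulls the QUAL value (the sixth tab-separated column) from each record and tallies variants per score. The VCF records are made up for illustration and do not come from this run.

```python
from collections import Counter


def variant_qualities(vcf_text):
    """Return the QUAL value of every record, skipping headers and missing values."""
    qualities = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qual = fields[5]  # columns: CHROM, POS, ID, REF, ALT, QUAL, ...
        if qual != ".":
            qualities.append(float(qual))
    return qualities


# Illustrative records only.
toy_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MN908947.3\t100\t.\tC\tT\t50\tPASS\t.",
    "MN908947.3\t200\t.\tG\tA\t50\tPASS\t.",
    "MN908947.3\t300\t.\tA\tG\t.\tPASS\t.",
])

print(Counter(variant_qualities(toy_vcf)))  # Counter({50.0: 2})
```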
nf-core/viralrecon Software Versions
Software versions are collected at run time from the software output.
Process Name | Software | Version |
---|---|---|
ARTIC_GUPPYPLEX | artic | 1.2.1 |
ARTIC_MINION | artic | 1.2.1 |
ASCIIGENOME | asciigenome | 1.16.0 |
ASCIIGENOME | bedtools | 2.30.0 |
BCFTOOLS_QUERY | bcftools | 1.14 |
BCFTOOLS_STATS | bcftools | 1.14 |
COLLAPSE_PRIMERS | python | 3.9.5 |
CUSTOM_DUMPSOFTWAREVERSIONS | python | 3.9.5 |
CUSTOM_DUMPSOFTWAREVERSIONS | yaml | 5.4.1 |
CUSTOM_GETCHROMSIZES | custom | 1.14 |
GUNZIP_GFF | gunzip | 1.10 |
MAKE_VARIANTS_LONG_TABLE | python | 3.9.9 |
MOSDEPTH_AMPLICON | mosdepth | 0.3.3 |
MOSDEPTH_GENOME | mosdepth | 0.3.3 |
NANOPLOT | nanoplot | 1.39.0 |
NEXTCLADE_RUN | nextclade | 1.10.2 |
PANGOLIN | pangolin | 3.1.20 |
PLOT_MOSDEPTH_REGIONS_AMPLICON | r-base | 4.0.3 |
PLOT_MOSDEPTH_REGIONS_GENOME | r-base | 4.0.3 |
PYCOQC | pycoqc | 2.5.2 |
QUAST | quast | 5.0.2 |
SAMTOOLS_FLAGSTAT | samtools | 1.14 |
SAMTOOLS_IDXSTATS | samtools | 1.14 |
SAMTOOLS_INDEX | samtools | 1.14 |
SAMTOOLS_STATS | samtools | 1.14 |
SAMTOOLS_VIEW | samtools | 1.14 |
SNPEFF_ANN | snpeff | 5.0e |
SNPEFF_BUILD | snpeff | 5.0e |
SNPSIFT_EXTRACTFIELDS | snpsift | 4.3 |
TABIX_BGZIP | tabix | 1.12 |
TABIX_TABIX | tabix | 1.12 |
VCFLIB_VCFUNIQ | vcflib | 1.0.2 |
Workflow | Nextflow | 22.04.0 |
Workflow | nf-core/viralrecon | 2.4.1 |
nf-core/viralrecon Workflow Summary
This information is collected when the pipeline is started.
Core Nextflow options
- revision: 2.4.1
- runName: agitated_watson
- containerEngine: docker
- launchDir: /opt/data
- workDir: /opt/data/work
- projectDir: /root/.nextflow/assets/nf-core/viralrecon
- userName: root
- profile: docker
- configFiles: /root/.nextflow/assets/nf-core/viralrecon/nextflow.config
Input/output options
- platform: nanopore
- protocol: metagenomic
- outdir: /opt/outdir
Reference genome options
- genome: MN908947.3
- fasta: https://github.com/artic-network/artic-ncov2019/raw/master/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta
- gff: https://github.com/nf-core/test-datasets/raw/viralrecon/genome/MN908947.3/GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz
- primer_bed: https://github.com/artic-network/artic-ncov2019/raw/master/primer_schemes/nCoV-2019/V3/nCoV-2019.primer.bed
- primer_set_version: 3
Nanopore options
- fastq_dir: /opt/fastq/fastq_pass
- fast5_dir: /opt/fastq/fast5_pass
- sequencing_summary: /opt/sequencing_summary/sequencing_summary_FAN44250_77d58da2.txt
- artic_minion_caller: medaka
- artic_scheme: nCoV-2019
- artic_minion_medaka_model: r941_min_high_g360
Nanopore/Illumina options
- nextclade_dataset: /opt/clade
- nextclade_dataset_name: sars-cov-2
- nextclade_dataset_reference: MN908947
- nextclade_dataset_tag: 2022-01-18T12:00:00Z
Illumina QC, read trimming and filtering options
- skip_kraken2: true
Max job request options
- max_cpus: 3
- max_memory: 8GB
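The summary above maps naturally onto a Nextflow command line: core options take a single dash (`-r`, `-profile`) and pipeline parameters take a double dash. The sketch below assembles such a command from a subset of the listed values without running it. The exact command the module issues is not shown in this document, so treat the helper and the chosen subset of parameters as assumptions.

```python
import shlex


def build_viralrecon_command(params, revision="2.4.1", profile="docker"):
    """Assemble a `nextflow run` argument list from a parameter mapping."""
    command = [
        "nextflow", "run", "nf-core/viralrecon",
        "-r", revision,
        "-profile", profile,
    ]
    for name, value in params.items():
        if value is True:
            command.append(f"--{name}")  # boolean flags take no value
        else:
            command.extend([f"--{name}", str(value)])
    return command


# Values copied from the workflow summary; only a subset is shown.
params = {
    "platform": "nanopore",
    "genome": "MN908947.3",
    "primer_set_version": 3,
    "fastq_dir": "/opt/fastq/fastq_pass",
    "fast5_dir": "/opt/fastq/fast5_pass",
    "artic_minion_caller": "medaka",
    "artic_minion_medaka_model": "r941_min_high_g360",
    "skip_kraken2": True,
    "max_cpus": 3,
    "max_memory": "8GB",
    "outdir": "/opt/outdir",
}

print(shlex.join(build_viralrecon_command(params)))
```

Building the command as a list keeps each argument separate, so paths containing spaces stay intact if the list is later passed to `subprocess.run`.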
MultiQC v1.11 - Written by Phil Ewels, available on GitHub.
This report uses HighCharts, jQuery, jQuery UI, Bootstrap, FileSaver.js and clipboard.js.