Viral Recon

nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports both Illumina and Nanopore sequencing data. For Illumina short-reads the pipeline is able to analyse metagenomics data typically obtained from shotgun sequencing (e.g. directly from clinical samples) and enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based). For Nanopore data the pipeline only supports amplicon-based analysis obtained from primer sets created and maintained by the ARTIC Network.*

*Pulled from [https://nf-co.re/viralrecon](https://nf-co.re/viralrecon)

Note

The module runs Nextflow on the backend and therefore uses Docker within Docker.

Parameters

Fastq Dir (Dir): Directory of basecalled FASTQ files
Fast5 Dir (Dir): Directory of Fast5 files from which the basecalled FASTQ files were generated

Returns

Consensus: ./viralrecon/medaka|nanopolish

Consensus FASTA files are made for both assembly processes

MultiQC Report: ./viralrecon/multiqc/multiqc_report.html

HTML file that summarizes your run

Ensure you have loaded a run with a fastq and fast5 directory specified.

Select one of the included primer schemes from the drop-down list. For this example, the data is SARS-CoV-2-related, so we will choose the Default Genome fasta for SARS-CoV-2.
Select one of the basecaller options, medaka or nanopolish (reported in the workflow summary as artic_minion_caller).
Select the Play button to start the pipeline.

MultiQC Report

If you use plots from MultiQC in a publication or presentation, please cite:

MultiQC: Summarize analysis results for multiple tools and samples in a single report
Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
Bioinformatics (2016)
doi: 10.1093/bioinformatics/btw354
PMID: 27312411

About MultiQC

This report was generated using MultiQC, version 1.11

You can see a YouTube video describing how to use MultiQC reports here: https://youtu.be/qPbIlO_KWN0

For more information about MultiQC, including other videos and extensive documentation, please visit http://multiqc.info

You can report bugs, suggest improvements and find the source code for MultiQC on GitHub: https://github.com/ewels/MultiQC

A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

This report has been generated by the nf-core/viralrecon analysis pipeline. For information about how to interpret these results, please see the documentation.

Report generated on 2022-06-28, 20:17 based on data in: /opt/data/work/3d/4888d88f879efecfe230d579e6e085

Variant calling metrics

generated by the nf-core/viralrecon pipeline

Showing 1/1 rows and 10/10 columns.

Sample	# Mapped reads	Coverage median	% Coverage > 1x	% Coverage > 10x	# SNPs	# INDELs	# Missense variants	# Ns per 100kb consensus	Pangolin lineage	Nextclade clade
single_barcode	20266	173.00	100.00	100.00	5	NA	3	407.99	A.1	19B
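
The metrics table above is tab-delimited, so it can be read programmatically. The following is a minimal sketch, assuming the table has been saved as text exactly as shown; the helper names and the 90% cutoff are illustrative assumptions, not part of the pipeline.

```python
import csv
import io

# The metrics table shown above, as tab-delimited text.
METRICS_TSV = (
    "Sample\t# Mapped reads\tCoverage median\t% Coverage > 1x\t% Coverage > 10x\t"
    "# SNPs\t# INDELs\t# Missense variants\t# Ns per 100kb consensus\t"
    "Pangolin lineage\tNextclade clade\n"
    "single_barcode\t20266\t173.00\t100.00\t100.00\t5\tNA\t3\t407.99\tA.1\t19B\n"
)


def load_metrics(text):
    """Parse the table into a list of dicts, turning 'NA' into None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {key: (None if value == "NA" else value) for key, value in row.items()}
        for row in reader
    ]


def low_coverage_samples(rows, min_pct_10x=90.0):
    """Return samples whose '% Coverage > 10x' is below an assumed cutoff."""
    return [
        row["Sample"]
        for row in rows
        if float(row["% Coverage > 10x"]) < min_pct_10x
    ]


rows = load_metrics(METRICS_TSV)
print(rows[0]["Pangolin lineage"], low_coverage_samples(rows))
```

A suitable cutoff depends on the downstream analysis; the value here is only a placeholder.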

Group	Column	Description	ID	Scale
Variant calling metrics	# Mapped reads	Total number of mapped reads relative to the viral genome	`# Mapped reads`	None
Variant calling metrics	Coverage median	Median coverage calculated by mosdepth	`Coverage median`	None
Variant calling metrics	% Coverage > 1x	Coverage > 1x calculated by mosdepth	`% Coverage > 1x`	None
Variant calling metrics	% Coverage > 10x	Coverage > 10x calculated by mosdepth	`% Coverage > 10x`	None
Variant calling metrics	# SNPs	Total number of SNPs called by artic minion that pass quality filters	`# SNPs`	None
Variant calling metrics	# INDELs	Total number of INDELs called by artic minion that pass quality filters	`# INDELs`	None
Variant calling metrics	# Missense variants	Total number of missense mutations identified by variant annotation with SnpEff	`# Missense variants`	None
Variant calling metrics	# Ns per 100kb consensus	Number of N bases per 100kb in consensus sequence generated by artic minion	`# Ns per 100kb consensus`	None
Variant calling metrics	Pangolin lineage	Pangolin lineage inferred from the consensus sequence generated by artic minion	`Pangolin lineage`	None
Variant calling metrics	Nextclade clade	Nextclade clade inferred from the consensus sequence generated by artic minion	`Nextclade clade`	None
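
The "# Ns per 100kb consensus" column normalizes the count of N bases by sequence length. A minimal sketch of that arithmetic follows; the function name and the example sequence are hypothetical, and this is not the code the pipeline itself runs.

```python
def ns_per_100kb(sequence):
    """Number of N bases per 100,000 bases of the given sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n_count = sequence.upper().count("N")
    return 100_000 * n_count / len(sequence)


# Hypothetical 100-base sequence containing 20 N bases.
example = "ACGT" * 20 + "N" * 20
print(ns_per_100kb(example))
```

Because the value is a rate, consensus sequences of different lengths can be compared directly.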

Pangolin

Pangolin uses variant calls to assign SARS-CoV-2 genome sequences to global lineages.

Run table

This table shows some of the metrics parsed from the input Pangolin files. Longer help text for certain columns is shown below:

Conflict
- In the pangoLEARN decision tree model, a given sequence gets assigned to the most likely category based on known diversity. If a sequence can fit into more than one category, the conflict score will be greater than 0 and reflect the number of categories the sequence could fit into. If the conflict score is 0, this means that within the current decision tree there is only one category that the sequence could be assigned to.
Ambiguity score
- This score is a function of the quantity of missing data in a sequence. It represents the proportion of relevant sites in a sequence which were imputed to the reference values. A score of 1 indicates that no sites were imputed, while a score of 0 indicates that more sites were imputed than were not imputed. This score only includes sites which are used by the decision tree to classify a sequence.
Scorpio conflict
- The conflict score is the proportion of defining variants which have the reference allele in the sequence. Ambiguous/other non-ref/alt bases at each of the variant positions contribute only to the denominators of these scores.
Note
- If there are any conflicts from the decision tree, this field will output the alternative assignments. If the sequence failed QC, this field will describe why. If the sequence met the SNP thresholds for scorpio to call a constellation, it will describe the exact SNP counts of Alt, Ref and Amb (alternative, reference and ambiguous) alleles for that call.
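
The scorpio support and conflict scores described above are both proportions over the defining variant positions. The sketch below illustrates those definitions only; the function name and argument names are assumptions and this is not scorpio's implementation.

```python
def scorpio_scores(alt, ref, ambiguous=0, other=0):
    """Return (support, conflict) for one constellation.

    alt, ref: counts of defining positions carrying the alternative or
    reference allele. ambiguous, other: counts of positions with ambiguous
    or other non-ref/alt bases, which contribute only to the denominator.
    """
    total = alt + ref + ambiguous + other
    if total == 0:
        raise ValueError("no defining variant positions")
    support = alt / total
    conflict = ref / total
    return support, conflict


# Hypothetical counts: 8 alt, 1 ref and 1 ambiguous out of 10 positions.
print(scorpio_scores(alt=8, ref=1, ambiguous=1))
```

Support and conflict sum to less than 1 whenever ambiguous or other bases are present.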

Showing 1/1 rows and 8/12 columns.

Sample Name	Lineage	Conflict	Ambiguity	S call	S support	S conflict	Version	Pangolin version	PangoLEARN version	Pango version	QC Status	Note
single_barcode	A.1	0.0	1.0				PLEARN-v1.2.123	3.1.20	2022-01-20	v1.2.123	Pass

Group	Column	Description	ID	Scale
Pangolin	Lineage	The most likely lineage assigned to a given sequence based on the inference engine used and the SARS-CoV-2 diversity designated.	`lineage`	None
Pangolin	Conflict	Conflict between categories in decision tree	`conflict`	None
Pangolin	Ambiguity	Quantity of missing data in a sequence	`ambiguity_score`	None
Pangolin	S call	Scorpio: If a query is assigned a constellation by scorpio this call is output in this column	`scorpio_call`	None
Pangolin	S support	Scorpio: The proportion of defining variants which have the alternative allele in the sequence.	`scorpio_support`	None
Pangolin	S conflict	Scorpio: The proportion of defining variants which have the reference allele in the sequence.	`scorpio_conflict`	None
Pangolin	Version	A version number that represents both the pango-designation number and the inference engine used to assign the lineage	`version`	None
Pangolin	Pangolin version	The version of pangolin software running.	`pangolin_version`	None
Pangolin	PangoLEARN version	The dated version of the pangoLEARN model installed.	`pangoLEARN_version`	None
Pangolin	Pango version	The version of pango-designation lineages that this assignment is based on.	`pango_version`	None
Pangolin	QC Status	Indicates whether the sequence passed the QC thresholds for minimum length and maximum N content.	`qc_status`	None
Pangolin	Note	Additional information from Pangolin	`note`	None

pycoQC

pycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data.

Statistics

Showing 1/1 rows and 7/7 columns.

Sample Name	N50 - Pass (bp)	N50 - All (bp)	Median read qual - Pass	Median read qual - All	Active Channels - Pass	Active Channels - All	Run duration (h)
pycoqc	507	507	12.3	12.3	503	503	48.0

Group	Column	Description	ID	Scale
pycoQC	N50 - Pass (bp)	N50 - passing reads (base pairs)	`passed_n50`	n50
pycoQC	N50 - All (bp)	N50 - all reads (base pairs)	`all_n50`	n50
pycoQC	Median read qual - Pass	Median PHRED quality score - passing reads	`passed_median_phred_score`	phred
pycoQC	Median read qual - All	Median PHRED quality score - all reads	`all_median_phred_score`	phred
pycoQC	Active Channels - Pass	Number of active channels - passing reads	`passed_channels`	channels
pycoQC	Active Channels - All	Number of active channels - all reads	`all_channels`	channels
pycoQC	Run duration (h)	Run duration (hours)	`all_run_duration`	None
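
For readers unfamiliar with the N50 columns, the sketch below shows the usual definition: the read length at which reads of that length or longer contain at least half of all sequenced bases. The function name and the example lengths are hypothetical, and pycoQC's own calculation may differ in detail.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths in base pairs.
print(n50([100, 200, 300, 400, 500]))
```

Unlike the mean, N50 is weighted by bases, so a few long reads raise it more than many short reads lower it.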

Read / Base counts

Number of sequenced reads / bases passing and failing QC thresholds.

Read length

Distribution of read length for all / passed reads.

Quality scores

Distribution of quality scores for all / passed reads.

Samtools

Samtools is a suite of programs for interacting with high-throughput sequencing data.

Samtools Flagstat

This module parses the output from samtools flagstat. All numbers in millions.

mosdepth

mosdepth performs fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing

Coverage distribution

Distribution of the number of locations in the reference genome with a given depth of coverage

For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position, while the breadth of coverage is the fraction of the reference sequence to which reads have been mapped with at least a given depth of coverage (Sims et al. 2014).

Defining coverage breadth in terms of coverage depth is useful, because sequencing experiments typically require a specific minimum depth of coverage over the region of interest (Sims et al. 2014), so the extent of the reference sequence that is amenable to analysis is constrained to lie within regions that have sufficient depth. With inadequate sequencing breadth, it can be difficult to distinguish the absence of a biological feature (such as a gene) from a lack of data (Green 2007).

For increasing coverage depths (1×, 2×, …, N×), coverage breadth is calculated as the percentage of the reference sequence that is covered by at least that number of reads, and coverage breadth (y-axis) is then plotted against coverage depth (x-axis). This plot shows the relationship between sequencing depth and breadth for each read dataset, which can be used to gauge, for example, the likely effect of a minimum depth filter on the fraction of a genome available for analysis.
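
The breadth calculation described above can be sketched in a few lines. This is an illustration under the assumption that per-base depths are already available as a list of integers; the function name and example values are hypothetical, and this is not how mosdepth itself is implemented.

```python
def coverage_breadth(depths, thresholds):
    """Percentage of positions covered by at least each threshold depth."""
    if not depths:
        raise ValueError("no depth values given")
    n_positions = len(depths)
    return {
        threshold: 100.0 * sum(1 for d in depths if d >= threshold) / n_positions
        for threshold in thresholds
    }


# Hypothetical per-base depths for a 10-base reference.
depths = [0, 5, 12, 30, 30, 8, 15, 0, 22, 11]
print(coverage_breadth(depths, thresholds=[1, 10, 20]))
```

Breadth can only stay the same or fall as the threshold rises, which is why the plotted curve never increases from left to right.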

Coverage plot

Number of locations in the reference genome with a given depth of coverage

For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position (Sims et al. 2014).

Bases of a reference sequence (y-axis) are grouped by their depth of coverage (0×, 1×, …, N×) (x-axis). This plot shows the frequency of coverage depths relative to the reference sequence for each read dataset, which provides an indirect measure of the level and variation of coverage depth in the corresponding sequenced sample.

If reads are randomly distributed across the reference sequence, this plot should resemble a Poisson distribution (Lander & Waterman 1988), with a peak indicating approximate depth of coverage, and more uniform coverage depth being reflected in a narrower spread. The optimal level of coverage depth depends on the aims of the experiment, though it should at minimum be sufficiently high to adequately address the biological question; greater uniformity of coverage is generally desirable, because it increases breadth of coverage for a given depth of coverage, allowing equivalent results to be achieved at a lower sequencing depth (Sampson et al. 2011; Sims et al. 2014). However, it is difficult to achieve uniform coverage depth in practice, due to biases introduced during sample preparation (van Dijk et al. 2014), sequencing (Ross et al. 2013) and read mapping (Sims et al. 2014).

This plot may include a small peak for regions of the reference sequence with zero depth of coverage. Such regions may be absent from the given sample (due to a deletion or structural rearrangement), present in the sample but not successfully sequenced (due to bias in sequencing or preparation), or sequenced but not successfully mapped to the reference (due to the choice of mapping algorithm, the presence of repeat sequences, or mismatches caused by variants or sequencing errors). Related factors cause most datasets to contain some unmapped reads (Sims et al. 2014).
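
To make the comparison with a Poisson distribution concrete, the sketch below tabulates how many positions have each depth and computes the counts expected if depth were Poisson-distributed with the same mean. It assumes per-base depths are available as a list of integers; the function names and example values are hypothetical.

```python
import math
from collections import Counter


def depth_histogram(depths):
    """Number of reference positions observed at each depth."""
    return dict(sorted(Counter(depths).items()))


def poisson_pmf(k, mean):
    """Probability of observing depth k under a Poisson model with this mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_counts(depths):
    """Expected number of positions at each depth under the Poisson model."""
    mean = sum(depths) / len(depths)
    return {
        k: len(depths) * poisson_pmf(k, mean)
        for k in range(max(depths) + 1)
    }


# Hypothetical per-base depths.
depths = [2, 3, 3, 0, 2, 3]
print(depth_histogram(depths))
print(expected_counts(depths))
```

A much wider spread in the observed histogram than in the expected counts points to uneven coverage, which is common for amplicon data.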

Average coverage per contig

Average coverage per contig or chromosome

Bcftools

Bcftools contains utilities for variant calling and manipulating VCFs and BCFs.

Variant Substitution Types

Variant Quality

Indel Distribution

Variant depths

Read depth support distribution for called variants

SnpEff

SnpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).

Variants by Genomic Region

The stacked bar plot shows locations of detected variants in the genome and the number of variants for each location.

The upstream and downstream interval size used to detect these genomic regions is 5000 bp by default.

Variant Effects by Impact

The stacked bar plot shows the putative impact of detected variants and the number of variants for each impact.

There are four levels of impact predicted by SnpEff:

High: High impact (such as a stop codon)
Moderate: Moderate impact (such as an amino acid substitution of the same type)
Low: Low impact (i.e. a silent mutation)
Modifier: No impact

Variants by Effect Types

The stacked bar plot shows the effect of variants at protein level and the number of variants for each effect type.

This plot shows the effect of variants with respect to the mRNA.

Variants by Functional Class

The stacked bar plot shows the effect of variants and the number of variants for each effect type.

This plot shows the effect of variants on the translation of the mRNA as protein. There are three possible cases:

Silent: The amino acid does not change.
Missense: The amino acid is different.
Nonsense: The variant generates a stop codon.
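
The three cases above can be illustrated by translating the reference and altered codons with the standard genetic code. This is a simplified sketch, not SnpEff's logic: the names are hypothetical, and it handles only single-codon substitutions and the three cases listed.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def functional_class(ref_codon, alt_codon):
    """Classify a codon change as Silent, Missense or Nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if alt_aa == ref_aa:
        return "Silent"      # the amino acid does not change
    if alt_aa == "*":
        return "Nonsense"    # the variant generates a stop codon
    return "Missense"        # the amino acid is different


print(functional_class("GAT", "GAC"))
print(functional_class("GAT", "GGT"))
print(functional_class("TGG", "TGA"))
```

Other effects that an annotation tool can report, such as the loss of a stop codon or a frameshift, fall outside this sketch.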

Variant Qualities

The line plot shows the number of variants as a function of the variant quality score.

The quality score corresponds to the QUAL column of the VCF file. This score is set by the variant caller.
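
As an illustration of where these values come from, the sketch below reads the QUAL column (the sixth tab-separated field) from VCF text and counts variants per quality bin. The VCF records, the function name and the bin width are all made up for the example.

```python
from collections import Counter

# Hypothetical VCF content; positions, alleles and QUAL values are invented.
EXAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
MN908947.3\t100\t.\tC\tT\t45.2\tPASS\t.
MN908947.3\t200\t.\tA\tG\t48.9\tPASS\t.
MN908947.3\t300\t.\tG\tT\t12.0\tPASS\t.
MN908947.3\t400\t.\tC\tA\t.\tPASS\t.
"""


def qual_histogram(vcf_text, bin_width=10):
    """Count variants per QUAL bin, keyed by the lower edge of each bin."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the header
        qual = line.split("\t")[5]
        if qual == ".":
            continue  # QUAL is missing for this record
        counts[int(float(qual) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


print(qual_histogram(EXAMPLE_VCF))
```

Records with a missing QUAL (".") are skipped rather than counted as zero.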

nf-core/viralrecon Software Versions

Software versions are collected at run time from the software output.

Process Name	Software	Version
`ARTIC_GUPPYPLEX`	`artic`	`1.2.1`
`ARTIC_MINION`	`artic`	`1.2.1`
`ASCIIGENOME`	`asciigenome`	`1.16.0`
	`bedtools`	`2.30.0`
`BCFTOOLS_QUERY`	`bcftools`	`1.14`
`BCFTOOLS_STATS`	`bcftools`	`1.14`
`COLLAPSE_PRIMERS`	`python`	`3.9.5`
`CUSTOM_DUMPSOFTWAREVERSIONS`	`python`	`3.9.5`
	`yaml`	`5.4.1`
`CUSTOM_GETCHROMSIZES`	`custom`	`1.14`
`GUNZIP_GFF`	`gunzip`	`1.10`
`MAKE_VARIANTS_LONG_TABLE`	`python`	`3.9.9`
`MOSDEPTH_AMPLICON`	`mosdepth`	`0.3.3`
`MOSDEPTH_GENOME`	`mosdepth`	`0.3.3`
`NANOPLOT`	`nanoplot`	`1.39.0`
`NEXTCLADE_RUN`	`nextclade`	`1.10.2`
`PANGOLIN`	`pangolin`	`3.1.20`
`PLOT_MOSDEPTH_REGIONS_AMPLICON`	`r-base`	`4.0.3`
`PLOT_MOSDEPTH_REGIONS_GENOME`	`r-base`	`4.0.3`
`PYCOQC`	`pycoqc`	`2.5.2`
`QUAST`	`quast`	`5.0.2`
`SAMTOOLS_FLAGSTAT`	`samtools`	`1.14`
`SAMTOOLS_IDXSTATS`	`samtools`	`1.14`
`SAMTOOLS_INDEX`	`samtools`	`1.14`
`SAMTOOLS_STATS`	`samtools`	`1.14`
`SAMTOOLS_VIEW`	`samtools`	`1.14`
`SNPEFF_ANN`	`snpeff`	`5.0e`
`SNPEFF_BUILD`	`snpeff`	`5.0e`
`SNPSIFT_EXTRACTFIELDS`	`snpsift`	`4.3`
`TABIX_BGZIP`	`tabix`	`1.12`
`TABIX_TABIX`	`tabix`	`1.12`
`VCFLIB_VCFUNIQ`	`vcflib`	`1.0.2`
`Workflow`	`Nextflow`	`22.04.0`
	`nf-core/viralrecon`	`2.4.1`

nf-core/viralrecon Workflow Summary

This information is collected when the pipeline is started.

Core Nextflow options

revision: 2.4.1
runName: agitated_watson
containerEngine: docker
launchDir: /opt/data
workDir: /opt/data/work
projectDir: /root/.nextflow/assets/nf-core/viralrecon
userName: root
profile: docker
configFiles: /root/.nextflow/assets/nf-core/viralrecon/nextflow.config

Input/output options

platform: nanopore
protocol: metagenomic
outdir: /opt/outdir

Reference genome options

genome: MN908947.3
fasta: https://github.com/artic-network/artic-ncov2019/raw/master/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta
gff: https://github.com/nf-core/test-datasets/raw/viralrecon/genome/MN908947.3/GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz
primer_bed: https://github.com/artic-network/artic-ncov2019/raw/master/primer_schemes/nCoV-2019/V3/nCoV-2019.primer.bed
primer_set_version: 3

Nanopore options

fastq_dir: /opt/fastq/fastq_pass
fast5_dir: /opt/fastq/fast5_pass
sequencing_summary: /opt/sequencing_summary/sequencing_summary_FAN44250_77d58da2.txt
artic_minion_caller: medaka
artic_scheme: nCoV-2019
artic_minion_medaka_model: r941_min_high_g360

Nanopore/Illumina options

nextclade_dataset: /opt/clade
nextclade_dataset_name: sars-cov-2
nextclade_dataset_reference: MN908947
nextclade_dataset_tag: 2022-01-18T12:00:00Z

Illumina QC, read trimming and filtering options

skip_kraken2: true

Max job request options

max_cpus: 3
max_memory: 8GB
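
For orientation, the options in this summary map onto a Nextflow command line: Nextflow's own options take a single dash (`-r`, `-profile`) and pipeline parameters take a double dash. The sketch below assembles such a command from a subset of the values listed above without running it. The exact command the module issues is not shown in this document, so treat this as an assumption about its general shape.

```python
import shlex

# A subset of the values from the workflow summary above.
REVISION = "2.4.1"
PROFILE = "docker"
PARAMS = {
    "platform": "nanopore",
    "genome": "MN908947.3",
    "primer_set_version": "3",
    "fastq_dir": "/opt/fastq/fastq_pass",
    "fast5_dir": "/opt/fastq/fast5_pass",
    "sequencing_summary": "/opt/sequencing_summary/sequencing_summary_FAN44250_77d58da2.txt",
    "artic_minion_caller": "medaka",
    "artic_minion_medaka_model": "r941_min_high_g360",
    "outdir": "/opt/outdir",
}


def build_command(revision, profile, params):
    """Build the argument list for a 'nextflow run' call (not executed here)."""
    command = ["nextflow", "run", "nf-core/viralrecon", "-r", revision, "-profile", profile]
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


command = build_command(REVISION, PROFILE, PARAMS)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces.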
