Viral Recon

nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports both Illumina and Nanopore sequencing data. For Illumina short-reads the pipeline is able to analyse metagenomics data typically obtained from shotgun sequencing (e.g. directly from clinical samples) and enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based). For Nanopore data the pipeline only supports amplicon-based analysis obtained from primer sets created and maintained by the ARTIC Network.*

*Pulled from [https://nf-co.re/viralrecon](https://nf-co.re/viralrecon)

Note

The module runs Nextflow on the backend and therefore uses Docker within Docker.

Parameters

Fastq Dir (Dir)

Directory of basecalled Fastq files

Fast5 Dir (Dir)

Directory of Fast5 files from which the basecalled Fastq files were generated
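
Before starting the module it can help to confirm that both directories actually contain the expected files. The following is a minimal sketch only: the function names (`count_files`, `check_run_dirs`) and the accepted file extensions are assumptions made for illustration and are not part of the module.

```python
from pathlib import Path

# File extensions assumed for illustration; adjust them to match your data.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
FAST5_SUFFIXES = (".fast5",)


def count_files(directory, suffixes):
    """Count files under `directory` (recursively) whose names end with one of `suffixes`."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob("*") if p.is_file() and p.name.endswith(suffixes))


def check_run_dirs(fastq_dir, fast5_dir):
    """Return a list of problems; an empty list means both inputs look usable."""
    problems = []
    if count_files(fastq_dir, FASTQ_SUFFIXES) == 0:
        problems.append(f"no Fastq files found in {fastq_dir}")
    if count_files(fast5_dir, FAST5_SUFFIXES) == 0:
        problems.append(f"no Fast5 files found in {fast5_dir}")
    return problems
```

Calling `check_run_dirs("/path/to/fastq_pass", "/path/to/fast5_pass")` (hypothetical paths) returns an empty list when both directories contain at least one matching file.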

Returns

Consensus: ./viralrecon/medaka|nanopolish
  • Consensus FASTA files are produced by both assembly processes

MultiQC Report: ./viralrecon/multiqc/multiqc_report.html
  • HTML file that summarizes information about your run
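
Once a run has finished, the outputs listed above can be located programmatically. This is a sketch under stated assumptions: the function name `find_outputs` is hypothetical, and the FASTA file extensions it looks for are a guess, since this page does not state the exact consensus file names.

```python
from pathlib import Path

# Extensions assumed for consensus FASTA files; the real names may differ.
FASTA_SUFFIXES = (".fa", ".fasta", ".fa.gz", ".fasta.gz")


def find_outputs(outdir, caller="medaka"):
    """Locate consensus FASTA files and the MultiQC report under `outdir`.

    `caller` is either "medaka" or "nanopolish", matching the directory
    layout described above.
    """
    root = Path(outdir) / "viralrecon"
    consensus_dir = root / caller
    consensus = []
    if consensus_dir.is_dir():
        consensus = sorted(
            p for p in consensus_dir.rglob("*")
            if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
        )
    report = root / "multiqc" / "multiqc_report.html"
    return consensus, (report if report.is_file() else None)
```

The function returns an empty list and `None` when nothing is found, so a caller can tell a missing report apart from a missing consensus.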


  1. Ensure you’ve loaded a run with a fastq and fast5 directory specified.

_images/viralrecon1.png

  2. Select one of the included primer schemes from the drop-down list. In this example the data is nCoV-related, so we choose Default Genome fasta for SARS-nCoV-2.

  3. Select one of the basecaller options: medaka or nanopolish.

  4. Select the Play button to start the pipeline.

MultiQC Report

        If you use plots from MultiQC in a publication or presentation, please cite:

        MultiQC: Summarize analysis results for multiple tools and samples in a single report
        Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
        Bioinformatics (2016)
        doi: 10.1093/bioinformatics/btw354
        PMID: 27312411

        About MultiQC

        This report was generated using MultiQC, version 1.11

        You can see a YouTube video describing how to use MultiQC reports here: https://youtu.be/qPbIlO_KWN0

        For more information about MultiQC, including other videos and extensive documentation, please visit http://multiqc.info

        You can report bugs, suggest improvements and find the source code for MultiQC on GitHub: https://github.com/ewels/MultiQC

        A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

        This report has been generated by the nf-core/viralrecon analysis pipeline. For information about how to interpret these results, please see the documentation.

        Report generated on 2022-06-28, 20:17 based on data in: /opt/data/work/3d/4888d88f879efecfe230d579e6e085


        Variant calling metrics

        generated by the nf-core/viralrecon pipeline

        Showing 1/1 rows and 10/10 columns.

        Sample: single_barcode
        # Mapped reads: 20266
        Coverage median: 173.00
        % Coverage > 1x: 100.00
        % Coverage > 10x: 100.00
        # SNPs: 5
        # INDELs: NA
        # Missense variants: 3
        # Ns per 100kb consensus: 407.99
        Pangolin lineage: A.1
        Nextclade clade: 19B
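
The "# Ns per 100kb consensus" metric normalises the number of ambiguous N bases by the length of the consensus sequence. A minimal sketch of that calculation, assuming the consensus is available as a plain string (the function name `ns_per_100kb` is hypothetical and this is not the pipeline's own code):

```python
def ns_per_100kb(sequence):
    """Number of N bases per 100,000 bases of the given sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n_count = sequence.upper().count("N")
    return n_count * 100_000 / len(sequence)
```

As an illustration, a hypothetical 29,903-base consensus containing 122 N bases gives a value that rounds to 407.99, which is consistent with the figure in the table. The actual N count and consensus length for this run are not shown in the report.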

        Pangolin

        Pangolin uses variant calls to assign SARS-CoV-2 genome sequences to global lineages.

        Run table

        Statistics gathered from the input pangolin files. Hover over the column headers for descriptions and click Help for more in-depth documentation.

        This table shows some of the metrics parsed by Pangolin. Hover over the column headers to see a description of the contents. Longer help text for certain columns is shown below:

        • Conflict
          • In the pangoLEARN decision tree model, a given sequence gets assigned to the most likely category based on known diversity. If a sequence can fit into more than one category, the conflict score will be greater than 0 and reflect the number of categories the sequence could fit into. If the conflict score is 0, this means that within the current decision tree there is only one category that the sequence could be assigned to.
        • Ambiguity score
          • This score is a function of the quantity of missing data in a sequence. It represents the proportion of relevant sites in a sequence which were imputed to the reference values. A score of 1 indicates that no sites were imputed, while a score of 0 indicates that more sites were imputed than were not imputed. This score only includes sites which are used by the decision tree to classify a sequence.
        • Scorpio conflict
          • The conflict score is the proportion of defining variants which have the reference allele in the sequence. Ambiguous/other non-ref/alt bases at each of the variant positions contribute only to the denominators of these scores.
        • Note
          • If there are any conflicts from the decision tree, this field will output the alternative assignments. If the sequence failed QC, this field will describe why. If the sequence met the SNP thresholds for scorpio to call a constellation, it will describe the exact SNP counts of Alt, Ref and Amb (alternative, reference and ambiguous) alleles for that call.
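
The Scorpio conflict description above can be made concrete with a small calculation. This is a sketch, not Scorpio's implementation: the function name and argument names are hypothetical, and treating support as the alternative-allele proportion over the same denominator is an assumption based on the wording "these scores".

```python
def scorpio_scores(alt_count, ref_count, ambig_count=0, other_count=0):
    """Compute support and conflict as proportions of all defining variant sites.

    Ambiguous and other bases are counted in the denominator only.
    """
    total = alt_count + ref_count + ambig_count + other_count
    if total == 0:
        raise ValueError("at least one defining variant site is required")
    return {
        "support": alt_count / total,   # sites carrying the alternative allele
        "conflict": ref_count / total,  # sites carrying the reference allele
    }
```

For example, with 8 alternative, 1 reference and 1 ambiguous site, the ambiguous site lowers both proportions without counting toward either numerator.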
        Showing 1/1 rows and 8/12 columns.

        Sample Name: single_barcode
        Lineage: A.1
        Conflict: 0.0
        Ambiguity: 1.0
        S call: (empty)
        S support: (empty)
        S conflict: (empty)
        QC Status: Pass
        Note: (empty)

        pycoQC

        pycoQC computes metrics and generates interactive QC plots for Oxford Nanopore technologies sequencing data

        Statistics

        Showing 1/1 rows and 7/7 columns.

        Sample Name: pycoqc
        N50 - Pass (bp): 507
        N50 - All (bp): 507
        Median read qual - Pass: 12.3
        Median read qual - All: 12.3
        Active Channels - Pass: 503
        Active Channels - All: 503
        Run duration (h): 48.0
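
The N50 columns summarise read length. A minimal sketch using a common definition of N50 (the length L such that reads of length L or longer contain at least half of all sequenced bases); whether pycoQC uses exactly this definition is an assumption, and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths."""
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise ValueError("no read lengths given")
```

For amplicon data such as this run, the N50 is expected to sit close to the amplicon size because most reads span a single amplicon.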

        Read / Base counts

        Number of sequenced reads / bases passing and failing QC thresholds.

        Read length

        Distribution of read length for all / passed reads.

        Quality scores

        Distribution of quality scores for all / passed reads.

        Samtools

        Samtools is a suite of programs for interacting with high-throughput sequencing data.

        Samtools Flagstat

        This module parses the output from samtools flagstat. All numbers in millions.

        mosdepth

        mosdepth performs fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing

        Coverage distribution

        Distribution of the number of locations in the reference genome with a given depth of coverage

        For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position, while the breadth of coverage is the fraction of the reference sequence to which reads have been mapped with at least a given depth of coverage (Sims et al. 2014).

        Defining coverage breadth in terms of coverage depth is useful, because sequencing experiments typically require a specific minimum depth of coverage over the region of interest (Sims et al. 2014), so the extent of the reference sequence that is amenable to analysis is constrained to lie within regions that have sufficient depth. With inadequate sequencing breadth, it can be difficult to distinguish the absence of a biological feature (such as a gene) from a lack of data (Green 2007).

        For increasing coverage depths (1×, 2×, …, N×), coverage breadth is calculated as the percentage of the reference sequence that is covered by at least that number of reads, and is then plotted (y-axis) against coverage depth (x-axis). This plot shows the relationship between sequencing depth and breadth for each read dataset, which can be used to gauge, for example, the likely effect of a minimum depth filter on the fraction of a genome available for analysis.
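
The breadth calculation described above can be sketched in a few lines. This is an illustration, not mosdepth's implementation: it assumes the per-base depths are already available as a list of integers, and the function name is hypothetical.

```python
def coverage_breadth(depths, thresholds):
    """Percentage of reference positions covered at or above each depth threshold.

    `depths` holds one depth value per reference position.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    n_positions = len(depths)
    return {
        t: 100.0 * sum(1 for d in depths if d >= t) / n_positions
        for t in thresholds
    }
```

Calling it with thresholds 1 and 10 corresponds to the "% Coverage > 1x" and "% Coverage > 10x" style of summary, assuming those columns use an at-least comparison.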

        Coverage plot

        Number of locations in the reference genome with a given depth of coverage

        For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position (Sims et al. 2014).

        Bases of a reference sequence (y-axis) are grouped by their depth of coverage (0×, 1×, …, N×) (x-axis). This plot shows the frequency of coverage depths relative to the reference sequence for each read dataset, which provides an indirect measure of the level and variation of coverage depth in the corresponding sequenced sample.

        If reads are randomly distributed across the reference sequence, this plot should resemble a Poisson distribution (Lander & Waterman 1988), with a peak indicating approximate depth of coverage, and more uniform coverage depth being reflected in a narrower spread. The optimal level of coverage depth depends on the aims of the experiment, though it should at minimum be sufficiently high to adequately address the biological question; greater uniformity of coverage is generally desirable, because it increases breadth of coverage for a given depth of coverage, allowing equivalent results to be achieved at a lower sequencing depth (Sampson et al. 2011; Sims et al. 2014). However, it is difficult to achieve uniform coverage depth in practice, due to biases introduced during sample preparation (van Dijk et al. 2014), sequencing (Ross et al. 2013) and read mapping (Sims et al. 2014).

        This plot may include a small peak for regions of the reference sequence with zero depth of coverage. Such regions may be absent from the given sample (due to a deletion or structural rearrangement), present in the sample but not successfully sequenced (due to bias in sequencing or preparation), or sequenced but not successfully mapped to the reference (due to the choice of mapping algorithm, the presence of repeat sequences, or mismatches caused by variants or sequencing errors). Related factors cause most datasets to contain some unmapped reads (Sims et al. 2014).
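
The grouping behind this plot amounts to counting how many reference positions share each depth value. A minimal sketch, again assuming per-base depths are available as a list (the function name is hypothetical):

```python
from collections import Counter


def depth_histogram(depths):
    """Map each observed depth to the number of reference positions with that depth."""
    counts = Counter(depths)
    # Sort by depth so the result reads like the x-axis of the plot.
    return dict(sorted(counts.items()))
```

A non-zero count under key `0` corresponds to the small peak at zero depth discussed above.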

        Average coverage per contig

        Average coverage per contig or chromosome

        Bcftools

        Bcftools contains utilities for variant calling and manipulating VCFs and BCFs.

        Variant Substitution Types

        Variant Quality

        Indel Distribution

        Variant depths

        Read depth support distribution for called variants

        SnpEff

        SnpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).

        Variants by Genomic Region

        The stacked bar plot shows locations of detected variants in the genome and the number of variants for each location.

        The upstream and downstream interval size to detect these genomic regions is 5000bp by default.

        Variant Effects by Impact

        The stacked bar plot shows the putative impact of detected variants and the number of variants for each impact.

        There are four levels of impacts predicted by SnpEff:

        • High: High impact (like stop codon)
        • Moderate: Middle impact (like same type of amino acid substitution)
        • Low: Low impact (i.e. silent mutation)
        • Modifier: No impact

        Variants by Effect Types

        The stacked bar plot shows the effect of variants at protein level and the number of variants for each effect type.

        This plot shows the effect of variants with respect to the mRNA.

        Variants by Functional Class

        The stacked bar plot shows the effect of variants and the number of variants for each effect type.

        This plot shows the effect of variants on the translation of the mRNA as protein. There are three possible cases:

        • Silent: The amino acid does not change.
        • Missense: The amino acid is different.
        • Nonsense: The variant generates a stop codon.
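
The three cases above can be expressed as a small classifier. This is a simplified sketch that applies the definitions literally to a single amino-acid substitution; it is not SnpEff's implementation, the function name is hypothetical, and `*` is assumed to denote a stop codon.

```python
def functional_class(ref_aa, alt_aa):
    """Classify a substitution from the reference and alternative amino acids.

    Both arguments are one-letter amino-acid codes, with '*' meaning stop.
    """
    if alt_aa == ref_aa:
        return "Silent"    # the amino acid does not change
    if alt_aa == "*":
        return "Nonsense"  # the variant generates a stop codon
    return "Missense"      # the amino acid is different
```

For instance, `functional_class("D", "G")` is classified as missense, while `functional_class("Q", "*")` is classified as nonsense.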

        Variant Qualities

        The line plot shows the number of variants as a function of the variant quality score.

        The quality score corresponds to the QUAL column of the VCF file. This score is set by the variant caller.
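
To show where these values come from, the sketch below reads the QUAL field (the sixth tab-separated column of a VCF data line) and tallies variants per integer quality score. It is an illustration only: the function names are hypothetical and real VCF files are usually better handled with a dedicated parser.

```python
from collections import Counter


def qual_values(vcf_lines):
    """Yield QUAL as a float for each data line, skipping headers and missing values."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line, meta-information or column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == ".":
            continue  # malformed line or QUAL not set by the caller
        yield float(fields[5])


def qual_histogram(vcf_lines):
    """Count variants per quality score, truncated to an integer."""
    return Counter(int(q) for q in qual_values(vcf_lines))
```

With an open file handle, `qual_histogram(handle)` returns the counts that a plot of this kind would draw.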

        nf-core/viralrecon Software Versions

        Software versions are collected at run time from the software output.

        Process Name                    Software            Version
        ARTIC_GUPPYPLEX                 artic               1.2.1
        ARTIC_MINION                    artic               1.2.1
        ASCIIGENOME                     asciigenome         1.16.0
                                        bedtools            2.30.0
        BCFTOOLS_QUERY                  bcftools            1.14
        BCFTOOLS_STATS                  bcftools            1.14
        COLLAPSE_PRIMERS                python              3.9.5
        CUSTOM_DUMPSOFTWAREVERSIONS     python              3.9.5
                                        yaml                5.4.1
        CUSTOM_GETCHROMSIZES            custom              1.14
        GUNZIP_GFF                      gunzip              1.10
        MAKE_VARIANTS_LONG_TABLE        python              3.9.9
        MOSDEPTH_AMPLICON               mosdepth            0.3.3
        MOSDEPTH_GENOME                 mosdepth            0.3.3
        NANOPLOT                        nanoplot            1.39.0
        NEXTCLADE_RUN                   nextclade           1.10.2
        PANGOLIN                        pangolin            3.1.20
        PLOT_MOSDEPTH_REGIONS_AMPLICON  r-base              4.0.3
        PLOT_MOSDEPTH_REGIONS_GENOME    r-base              4.0.3
        PYCOQC                          pycoqc              2.5.2
        QUAST                           quast               5.0.2
        SAMTOOLS_FLAGSTAT               samtools            1.14
        SAMTOOLS_IDXSTATS               samtools            1.14
        SAMTOOLS_INDEX                  samtools            1.14
        SAMTOOLS_STATS                  samtools            1.14
        SAMTOOLS_VIEW                   samtools            1.14
        SNPEFF_ANN                      snpeff              5.0e
        SNPEFF_BUILD                    snpeff              5.0e
        SNPSIFT_EXTRACTFIELDS           snpsift             4.3
        TABIX_BGZIP                     tabix               1.12
        TABIX_TABIX                     tabix               1.12
        VCFLIB_VCFUNIQ                  vcflib              1.0.2
        Workflow                        Nextflow            22.04.0
                                        nf-core/viralrecon  2.4.1

        nf-core/viralrecon Workflow Summary

        This information is collected when the pipeline is started.

        Core Nextflow options

        revision: 2.4.1
        runName: agitated_watson
        containerEngine: docker
        launchDir: /opt/data
        workDir: /opt/data/work
        projectDir: /root/.nextflow/assets/nf-core/viralrecon
        userName: root
        profile: docker
        configFiles: /root/.nextflow/assets/nf-core/viralrecon/nextflow.config

        Input/output options

        platform: nanopore
        protocol: metagenomic
        outdir: /opt/outdir

        Reference genome options

        genome: MN908947.3
        fasta: https://github.com/artic-network/artic-ncov2019/raw/master/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta
        gff: https://github.com/nf-core/test-datasets/raw/viralrecon/genome/MN908947.3/GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz
        primer_bed: https://github.com/artic-network/artic-ncov2019/raw/master/primer_schemes/nCoV-2019/V3/nCoV-2019.primer.bed
        primer_set_version: 3

        Nanopore options

        fastq_dir: /opt/fastq/fastq_pass
        fast5_dir: /opt/fastq/fast5_pass
        sequencing_summary: /opt/sequencing_summary/sequencing_summary_FAN44250_77d58da2.txt
        artic_minion_caller: medaka
        artic_scheme: nCoV-2019
        artic_minion_medaka_model: r941_min_high_g360

        Nanopore/Illumina options

        nextclade_dataset: /opt/clade
        nextclade_dataset_name: sars-cov-2
        nextclade_dataset_reference: MN908947
        nextclade_dataset_tag: 2022-01-18T12:00:00Z

        Illumina QC, read trimming and filtering options

        skip_kraken2: true

        Max job request options

        max_cpus: 3
        max_memory: 8GB
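
The workflow summary lists enough options to reconstruct an approximate Nextflow command line. The sketch below only assembles an argument list from the option names shown in the summary and does not run anything. The exact command issued by the module is not shown in this document, so the function name, the choice of which options to pass, and the defaults are all assumptions.

```python
import shlex


def build_viralrecon_command(fastq_dir, fast5_dir, sequencing_summary, outdir,
                             caller="medaka", medaka_model="r941_min_high_g360",
                             revision="2.4.1"):
    """Assemble a `nextflow run` argument list mirroring the summary above."""
    if caller not in ("medaka", "nanopolish"):
        raise ValueError("caller must be 'medaka' or 'nanopolish'")
    command = [
        "nextflow", "run", "nf-core/viralrecon",
        "-r", revision,
        "-profile", "docker",
        "--platform", "nanopore",
        "--genome", "MN908947.3",
        "--primer_set_version", "3",
        "--fastq_dir", fastq_dir,
        "--fast5_dir", fast5_dir,
        "--sequencing_summary", sequencing_summary,
        "--artic_minion_caller", caller,
        "--outdir", outdir,
    ]
    if caller == "medaka":
        # The medaka model only applies when medaka is the selected caller.
        command += ["--artic_minion_medaka_model", medaka_model]
    return command


if __name__ == "__main__":
    # Paths taken from the summary above; print the command instead of running it.
    cmd = build_viralrecon_command(
        "/opt/fastq/fastq_pass",
        "/opt/fastq/fast5_pass",
        "/opt/sequencing_summary/sequencing_summary_FAN44250_77d58da2.txt",
        "/opt/outdir",
    )
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.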