SNV Analysis Report

Overview

Reads Source: List of SRA experiments
Total Samples: 7 samples
Results Directory: ../influenzaA/
Reference Host Genome: 2-alignment/host/genomes/Homo_sapiens_GRCh38/genome.fa

Reference Pathogen Genomes:

Genome file	Protein file	Gene file
data/influenzaA/genome.fa	data/influenzaA/protein.fa	data/influenzaA/genes.gbk

Input Reads:

ID	Type	File 1	File 2
DRR051417	paired	data/fastq/DRR051417_1.fastq	data/fastq/DRR051417_2.fastq
DRR051423	paired	data/fastq/DRR051423_1.fastq	data/fastq/DRR051423_2.fastq
DRR051428	paired	data/fastq/DRR051428_1.fastq	data/fastq/DRR051428_2.fastq
DRR051431	paired	data/fastq/DRR051431_1.fastq	data/fastq/DRR051431_2.fastq
DRR051442	single	data/fastq/DRR051442.fastq
DRR051444	single	data/fastq/DRR051444.fastq
DRR051450	paired	data/fastq/DRR051450_1.fastq	data/fastq/DRR051450_2.fastq

Read Quality

Quality Check

The quality check was done using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). This tool analyzes the quality of all reads in fastq files and creates reports that help identify quality issues in high-throughput sequencing datasets. All the results were stored in 1-quality/fastqc.

Read Cropping

Read cropping was done using Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic). This tool preprocesses high-throughput sequencing data from next-generation sequencing platforms. It specializes in quality control and trimming of raw sequence reads, removing artifacts, adapters, and low-quality bases. When SNVGuru identifies that a read has a quality decay greater than 1.0, it crops the reads down to 100 base pairs. The cropped fastq files were stored in 1-quality/fastq.

Host Alignment

The reads were aligned against a host reference genome in order to remove reads belonging to the host instead of the pathogen, which could alter the results of the analysis. This alignment was done using BWA (https://bio-bwa.sourceforge.net/). This tool is a widely used read aligner made for short DNA reads. Since it is not splice-aware, it is not suggested for RNA-seq data.. The initial alignments were stored in SAM format at 2-alignment/host/sam.

After doing this, the reads that did not align against the host reference genome were extracted using samtools (http://www.htslib.org/). First, it runs samtools view -F 256 on the SAM files, so that every sequence that aligned is ignored and the rest is saved in BAM files at 2-alignment/host/bam. Then, it runs samtools bam2fq on the resulting BAM files to transform them into fastq files. These filtered fastq files were stored at 2-alignment/host/fastq. The number of reads are the following:

Sample	Reads Before Filter	Reads After Filter
DRR051417	640081	129882
DRR051423	209048	150275
DRR051428	415068	76937
DRR051431	327386	76062
DRR051442	253441	46147
DRR051444	374342	115856
DRR051450	424300	283018

Pathogen Alignment

The reads were aligned against the provided reference pathogen genomes using BWA (https://bio-bwa.sourceforge.net/). This tool is a widely used read aligner made for short DNA reads. Since it is not splice-aware, it is not suggested for RNA-seq data.. The initial alignments were stored in SAM format at 2-alignment/pathogen/sam. Then, using samtools (http://www.htslib.org/), the alignments were sorted and transformed into a BAM file running samtools sort, and finally, the MD and NM tags were added running samtools calmd. These resulting BAM files were stored at 2-alignment/pathogen/bam, where the .sorted.bam files are the result of samtools sort, and the .bam files are the final BAM files resulting from samtools calmd.

Alignment Quality

The alignments against the pathogen reference genome were analyzed using Qualimap 2 (http://qualimap.conesalab.org/). This tool inspects SAM/BAM files, analyzes the features of the mapped reads and generates a report of the aligned data. This helps detect issues in the sequencing and/or mapping of the data. The results were stored at 3-qualimap.

After the analysis is done, SNVGuru removes the samples that produced a general error rate greater than 3.0%. The error rates were the following:

Reference pathogen	ID	Error rate (%)
influenzaA	DRR051417	2.67
influenzaA	DRR051423	2.81
influenzaA	DRR051428	2.7
influenzaA	DRR051431	2.67
influenzaA	DRR051442	2.55
influenzaA	DRR051444	3.37
influenzaA	DRR051450	2.9

SNV Calling

The SNV calling step was performed using REDItools2 (https://github.com/BioinfoUNIBA/REDItools2) and JACUSA2 (https://github.com/dieterich-lab/JACUSA2).

REDItools2 is a toolkit designed for the analysis of RNA editing events in high-throughput sequencing data, identifying, quantifying, and characterizing RNA editing sites from RNA-seq data. It generates TXT files with the SNV data, which were transformed into VCF files, and these VCF files were also modified for using them as SnpEff inputs. These files were stored at 4-snvCalling/reditools. The files used for SnpEff are named as SAMPLE.reditools.presnpeff.vcf.

JACUSA2 is a framework for single nucleotide variant and reverse transcriptase induced arrest event detection in next-generation sequencing data. It generates VCF files with the SNV data, which were then preprocessed for using them as SnpEff inputs. These files were stored at 4-snvCalling/jacusa. The original output files are named as SAMPLE.jacusa.vcf. while the files used for SnpEff are named as SAMPLE.jacusa.presnpeff.vcf. There are also some files named as SAMPLE.jacusa.vcf.filtered and SAMPLE.jacusa.vcf.filtered.idx that are byproducts of the execution of the program.

Gene and Functional Effect Identification

For identifying the gene and functional effect of each SNV, the VCF files from the previous step were processed with SnpEff (http://pcingola.github.io/SnpEff/). It is a genetic variant annotation and functional effect prediction toolbox, particularly made for single nucleotide polymorphisms and small insertions/deletions. It categorizes variants based on their impact on genes, classifying them into different functional consequences such as synonymous, nonsynonymous, frameshift, and more. The output files of this tool were stored at 5-snpeff.

Allele-Specific Strand Odds Ratio Calculation

The computation of AS strand odds ratio (AS_SOR) was done executing BCFtools' (https://samtools.github.io/bcftools/) mpileup on each resulting BAM file from the alignment using the argument -a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP in order to get the allelic depth of the forward and reverse strands for both the reference and the aligned sequences. The output files are found at 4-snvCalling/depths/REFERENCE_NAME/SAMPLE_NAME.mpileup.vcf for each pathogen reference genome and sample pair.

Each output file's last column is named as the path of the respective BAM file. This column has a string that, when split by the colon (:) character, results in six fields. The fourth one is the allelic depth for the forward strand (ADF), and the fifth one is the allelic depth for the reverse strand (ARF). Both fields have two comma-separated values, where the first one corresponds to the reference allele and the second one corresponds to the alternate allele. This leaves us with four values: forward reference depth (FRD), reverse reference depth (RRD), forward alternate depth (FAD) and reverse alternate depth (RAD). The formula for calculating the AS_SOR, according to GATK (https://gatk.broadinstitute.org/hc/en-us/articles/4414586726683-AS-StrandOddsRatio) is as follows: $$AS\_SOR = {ln(\frac{FAD * RRD}{FRD * RAD}) + ln(\frac{min(FRD, RRD)}{max(FRD, RRD)}) - ln(\frac{min(FAD, RAD)}{max(FAD, RAD)})}$$ If a mutation has an AS_SOR > 4.0, then it is filtered out of the resulting files and graphs.

Results

Common Identified SNVs

This step merges the identified SNVs from JACUSA2 and REDItools2 by position and mutation (nucleotide change). If any combination of position and mutation is not found in either of the outputs, it is discarded. Furthermore, these SNVs are filtered by the following values:

Minimum base quality: 35
Minimum read quality: 25
Minimum SNV coverage: 20
Minimum main read support: 4
Minimum SNV frequency: 0.0

If there is a position that has multiple mutations, these are split into a row per mutation per position.

These files were stored at 6-visualization/csv/globalCommon.csv for the global results among all samples, and 6-visualization/SAMPLE_NAME/csv/runCommon.csv for the results of each sample. There is also a file for the global results and for each sample of the results by JACUSA2 (6-visualization/REFERENCE_NAME/csv/globalJacusa.csv and 6-visualization/REFERENCE_NAME/SAMPLE_NAME/csv/jacusa.csv) and REDItools2 (6-visualization/csv/globalReditools.csv and 6-visualization/SAMPLE_NAME/csv/reditools.csv). Here is a sample from the global results file.

CHROM	Position	Alt	Reference	Type	AAVar	GeneName	GeneID	RefReads	AltReads	TotalReads	Frequency	A	C	G	T	JacRefReads	JacAltReads	JacTotalReads	JacFrequency	JacA	JacC	JacG	JacT	Sample
NC_007366.1	123	T	G	missense_variant	p.Gly32Trp	HA	FLUAVH3N2_s4p1	583	5	588	0.8503000000000001	0	0	583	5	693	5	698	0.7163	0	0	693	5	DRR051417
NC_007366.1	144	T	G	stop_gained	p.Gly39*	HA	FLUAVH3N2_s4p1	614	4	618	0.6472	0	0	614	4	773	4	777	0.5147999999999999	0	0	773	4	DRR051417
NC_007366.1	175	G	A	missense_variant	p.Gln49Arg	HA	FLUAVH3N2_s4p1	0	585	585	100.0	0	0	585	0	0	684	684	100.0	0	0	684	0	DRR051417
NC_007366.1	185	T	C	synonymous_variant	p.Val52Val	HA	FLUAVH3N2_s4p1	0	552	552	100.0	0	0	0	552	0	682	682	100.0	0	0	0	682	DRR051417
NC_007366.1	200	G	A	synonymous_variant	p.Glu57Glu	HA	FLUAVH3N2_s4p1	1	515	516	99.8062	1	0	515	0	1	618	619	99.8384	1	0	618	0	DRR051417
NC_007366.1	211	A	G	missense_variant	p.Ser61Asn	HA	FLUAVH3N2_s4p1	0	527	527	100.0	527	0	0	0	0	643	643	100.0	643	0	0	0	DRR051417
NC_007366.1	220	T	C	missense_variant	p.Thr64Ile	HA	FLUAVH3N2_s4p1	0	566	566	100.0	0	0	0	566	0	664	664	100.0	0	0	0	664	DRR051417
NC_007366.1	226	A	G	missense_variant	p.Gly66Glu	HA	FLUAVH3N2_s4p1	0	558	558	100.0	558	0	0	0	0	656	656	100.0	656	0	0	0	DRR051417
NC_007366.1	249	G	A	missense_variant	p.Ile74Val	HA	FLUAVH3N2_s4p1	459	4	463	0.8639000000000001	459	0	4	0	544	4	548	0.7299	544	0	4	0	DRR051417
NC_007366.1	314	T	C	synonymous_variant	p.Phe95Phe	HA	FLUAVH3N2_s4p1	0	338	338	100.0	0	0	0	338	0	412	412	100.0	0	0	0	412	DRR051417
NC_007366.1	347	A	C	synonymous_variant	p.Arg106Arg	HA	FLUAVH3N2_s4p1	0	389	389	100.0	389	0	0	0	0	448	448	100.0	448	0	0	0	DRR051417
NC_007366.1	459	G	A	missense_variant	p.Thr144Ala	HA	FLUAVH3N2_s4p1	0	633	633	100.0	0	0	633	0	1	688	689	99.8549	1	0	688	0	DRR051417
NC_007366.1	476	C	T	synonymous_variant	p.Asn149Asn	HA	FLUAVH3N2_s4p1	0	635	635	100.0	0	635	0	0	0	724	724	100.0	0	724	0	0	DRR051417
NC_007366.1	485	T	C	synonymous_variant	p.Ser152Ser	HA	FLUAVH3N2_s4p1	0	574	574	100.0	0	0	0	574	0	692	692	100.0	0	0	0	692	DRR051417
NC_007366.1	501	G	A	missense_variant	p.Arg158Gly	HA	FLUAVH3N2_s4p1	0	492	492	100.0	0	0	492	0	1	592	593	99.8314	1	0	592	0	DRR051417
NC_007366.1	511	G	A	missense_variant	p.Asn161Ser	HA	FLUAVH3N2_s4p1	0	522	522	100.0	0	0	522	0	0	568	568	100.0	0	0	568	0	DRR051417
NC_007366.1	512	T	C	synonymous_variant	p.Asn161Asn	HA	FLUAVH3N2_s4p1	0	500	500	100.0	0	0	0	500	0	548	548	100.0	0	0	0	548	DRR051417
NC_007366.1	530	A	G	synonymous_variant	p.Leu167Leu	HA	FLUAVH3N2_s4p1	0	479	479	100.0	479	0	0	0	0	550	550	100.0	550	0	0	0	DRR051417
NC_007366.1	547	C	T	missense_variant	p.Leu173Ser	HA	FLUAVH3N2_s4p1	0	437	437	100.0	0	437	0	0	0	494	494	100.0	0	494	0	0	DRR051417
NC_007366.1	606	T	C	synonymous_variant	p.Leu193Leu	HA	FLUAVH3N2_s4p1	0	470	470	100.0	0	0	0	470	1	546	547	99.8172	0	1	0	546	DRR051417
NC_007366.1	644	G	T	missense_variant	p.Asn205Lys	HA	FLUAVH3N2_s4p1	0	529	529	100.0	0	0	529	0	0	638	638	100.0	0	0	638	0	DRR051417
NC_007366.1	654	T	A	missense_variant	p.Ser209Cys	HA	FLUAVH3N2_s4p1	0	505	505	100.0	0	0	0	505	0	551	551	100.0	0	0	0	551	DRR051417
NC_007366.1	655	T	G	missense_variant	p.Ser209Ile	HA	FLUAVH3N2_s4p1	0	499	499	100.0	0	0	0	499	0	540	540	100.0	0	0	0	540	DRR051417
NC_007366.1	659	G	A	synonymous_variant	p.Leu210Leu	HA	FLUAVH3N2_s4p1	0	549	549	100.0	0	0	549	0	2	612	614	99.6743	2	0	612	0	DRR051417
NC_007366.1	669	T	G	missense_variant	p.Ala214Ser	HA	FLUAVH3N2_s4p1	1	538	539	99.8145	0	0	1	538	2	638	640	99.6875	0	0	2	638	DRR051417
NC_007366.1	689	A	C	synonymous_variant	p.Val220Val	HA	FLUAVH3N2_s4p1	0	513	513	100.0	513	0	0	0	0	591	591	100.0	591	0	0	0	DRR051417
NC_007366.1	711	G	A	missense_variant	p.Thr228Ser	HA	FLUAVH3N2_s4p1	0	485	486	100.0	0	0	485	1	0	559	560	100.0	0	0	559	1	DRR051417
NC_007366.1	711	G	A	missense_variant	p.Thr228Ser	HA	FLUAVH3N2_s4p1	0	485	486	100.0	0	0	485	1	0	559	560	100.0	0	0	559	1	DRR051417
NC_007366.1	711	G	A	missense_variant	p.Thr228Ala	HA	FLUAVH3N2_s4p1	0	485	486	100.0	0	0	485	1	0	559	560	100.0	0	0	559	1	DRR051417
NC_007366.1	711	G	A	missense_variant	p.Thr228Ala	HA	FLUAVH3N2_s4p1	0	485	486	100.0	0	0	485	1	0	559	560	100.0	0	0	559	1	DRR051417
NC_007366.1	713	T	C	synonymous_variant	p.Thr228Thr	HA	FLUAVH3N2_s4p1	0	498	498	100.0	0	0	0	498	0	583	583	100.0	0	0	0	583	DRR051417
NC_007366.1	724	A	G	missense_variant	p.Ser232Asn	HA	FLUAVH3N2_s4p1	0	448	448	100.0	448	0	0	0	0	526	526	100.0	526	0	0	0	DRR051417
NC_007366.1	743	A	G	synonymous_variant	p.Arg238Arg	HA	FLUAVH3N2_s4p1	0	435	435	100.0	435	0	0	0	0	536	536	100.0	536	0	0	0	DRR051417
NC_007366.1	750	A	G	missense_variant	p.Asp241Asn	HA	FLUAVH3N2_s4p1	0	445	445	100.0	445	0	0	0	0	513	513	100.0	513	0	0	0	DRR051417
NC_007366.1	753	A	G	missense_variant	p.Val242Ile	HA	FLUAVH3N2_s4p1	0	433	433	100.0	433	0	0	0	0	491	491	100.0	491	0	0	0	DRR051417
NC_007366.1	840	A	C	synonymous_variant	p.Arg271Arg	HA	FLUAVH3N2_s4p1	1	360	362	99.723	360	1	1	0	1	419	421	99.7619	419	1	1	0	DRR051417
NC_007366.1	840	A	C	synonymous_variant	p.Arg271Arg	HA	FLUAVH3N2_s4p1	1	360	362	99.723	360	1	1	0	1	419	421	99.7619	419	1	1	0	DRR051417
NC_007366.1	840	A	C	missense_variant	p.Arg271Gly	HA	FLUAVH3N2_s4p1	1	360	362	99.723	360	1	1	0	1	419	421	99.7619	419	1	1	0	DRR051417
NC_007366.1	840	A	C	missense_variant	p.Arg271Gly	HA	FLUAVH3N2_s4p1	1	360	362	99.723	360	1	1	0	1	419	421	99.7619	419	1	1	0	DRR051417
NC_007366.1	911	G	T	missense_variant	p.Asn294Lys	HA	FLUAVH3N2_s4p1	0	374	374	100.0	0	0	374	0	0	456	456	100.0	0	0	456	0	DRR051417
NC_007366.1	959	C	T	synonymous_variant	p.Phe310Phe	HA	FLUAVH3N2_s4p1	0	321	321	100.0	0	321	0	0	0	404	404	100.0	0	404	0	0	DRR051417
NC_007366.1	983	C	T	stop_gained	p.Tyr318*	HA	FLUAVH3N2_s4p1	0	340	341	100.0	1	340	0	0	0	402	403	100.0	1	402	0	0	DRR051417
NC_007366.1	983	C	T	stop_gained	p.Tyr318*	HA	FLUAVH3N2_s4p1	0	340	341	100.0	1	340	0	0	0	402	403	100.0	1	402	0	0	DRR051417
NC_007366.1	983	C	T	synonymous_variant	p.Tyr318Tyr	HA	FLUAVH3N2_s4p1	0	340	341	100.0	1	340	0	0	0	402	403	100.0	1	402	0	0	DRR051417
NC_007366.1	983	C	T	synonymous_variant	p.Tyr318Tyr	HA	FLUAVH3N2_s4p1	0	340	341	100.0	1	340	0	0	0	402	403	100.0	1	402	0	0	DRR051417
NC_007366.1	1034	A	G	synonymous_variant	p.Gly335Gly	HA	FLUAVH3N2_s4p1	0	341	341	100.0	341	0	0	0	0	392	392	100.0	392	0	0	0	DRR051417
NC_007366.1	1082	A	C	synonymous_variant	p.Ile351Ile	HA	FLUAVH3N2_s4p1	0	351	351	100.0	351	0	0	0	0	415	415	100.0	415	0	0	0	DRR051417
NC_007366.1	1112	T	A	synonymous_variant	p.Gly361Gly	HA	FLUAVH3N2_s4p1	313	4	317	1.2618	313	0	0	4	382	4	387	1.0363	382	0	1	4	DRR051417
NC_007366.1	1112	T	A	synonymous_variant	p.Gly361Gly	HA	FLUAVH3N2_s4p1	313	4	317	1.2618	313	0	0	4	382	4	387	1.0363	382	0	1	4	DRR051417
NC_007366.1	1116	A	G	missense_variant	p.Val363Ile	HA	FLUAVH3N2_s4p1	0	329	329	100.0	329	0	0	0	0	411	411	100.0	411	0	0	0	DRR051417
NC_007366.1	1118	G	A	synonymous_variant	p.Val363Val	HA	FLUAVH3N2_s4p1	0	346	346	100.0	0	0	346	0	1	413	414	99.7585	1	0	413	0	DRR051417
NC_007366.1	1121	T	C	synonymous_variant	p.Asp364Asp	HA	FLUAVH3N2_s4p1	0	382	382	100.0	0	0	0	382	0	408	408	100.0	0	0	0	408	DRR051417
NC_007366.1	1159	G	C	missense_variant	p.Thr377Arg	HA	FLUAVH3N2_s4p1	0	534	534	100.0	0	0	534	0	0	642	642	100.0	0	0	642	0	DRR051417
NC_007366.1	1200	G	A	missense_variant	p.Asn391Asp	HA	FLUAVH3N2_s4p1	0	769	769	100.0	0	0	769	0	0	890	890	100.0	0	0	890	0	DRR051417
NC_007366.1	1202	T	C	synonymous_variant	p.Asn391Asn	HA	FLUAVH3N2_s4p1	0	699	699	100.0	0	0	0	699	0	838	838	100.0	0	0	0	838	DRR051417
NC_007366.1	1224	C	A	synonymous_variant	p.Arg399Arg	HA	FLUAVH3N2_s4p1	0	666	666	100.0	0	666	0	0	0	849	849	100.0	0	849	0	0	DRR051417
NC_007366.1	1226	A	G	missense_variant	p.Arg399Ser	HA	FLUAVH3N2_s4p1	0	651	653	100.0	651	0	0	2	0	840	842	100.0	840	0	0	2	DRR051417
NC_007366.1	1226	A	G	missense_variant	p.Arg399Ser	HA	FLUAVH3N2_s4p1	0	651	653	100.0	651	0	0	2	0	840	842	100.0	840	0	0	2	DRR051417
NC_007366.1	1226	A	G	synonymous_variant	p.Arg399Arg	HA	FLUAVH3N2_s4p1	0	651	653	100.0	651	0	0	2	0	840	842	100.0	840	0	0	2	DRR051417
NC_007366.1	1226	A	G	synonymous_variant	p.Arg399Arg	HA	FLUAVH3N2_s4p1	0	651	653	100.0	651	0	0	2	0	840	842	100.0	840	0	0	2	DRR051417
NC_007366.1	1241	C	A	synonymous_variant	p.Thr404Thr	HA	FLUAVH3N2_s4p1	0	705	705	100.0	0	705	0	0	0	846	846	100.0	0	846	0	0	DRR051417
NC_007366.1	1302	A	C	missense_variant	p.Leu425Ile	HA	FLUAVH3N2_s4p1	0	609	610	100.0	609	0	1	0	1	753	755	99.8674	753	1	1	0	DRR051417
NC_007366.1	1302	A	C	missense_variant	p.Leu425Ile	HA	FLUAVH3N2_s4p1	0	609	610	100.0	609	0	1	0	1	753	755	99.8674	753	1	1	0	DRR051417
NC_007366.1	1302	A	C	missense_variant	p.Leu425Val	HA	FLUAVH3N2_s4p1	0	609	610	100.0	609	0	1	0	1	753	755	99.8674	753	1	1	0	DRR051417
NC_007366.1	1302	A	C	missense_variant	p.Leu425Val	HA	FLUAVH3N2_s4p1	0	609	610	100.0	609	0	1	0	1	753	755	99.8674	753	1	1	0	DRR051417
NC_007366.1	1304	T	C	synonymous_variant	p.Leu425Leu	HA	FLUAVH3N2_s4p1	0	593	593	100.0	0	0	0	593	0	745	745	100.0	0	0	0	745	DRR051417
NC_007366.1	1325	A	T	synonymous_variant	p.Thr432Thr	HA	FLUAVH3N2_s4p1	2	540	542	99.631	540	0	0	2	2	731	733	99.7271	731	0	0	2	DRR051417
NC_007366.1	1364	T	G	synonymous_variant	p.Val445Val	HA	FLUAVH3N2_s4p1	0	479	479	100.0	0	0	0	479	0	571	571	100.0	0	0	0	571	DRR051417
NC_007366.1	1426	A	G	missense_variant	p.Arg466Lys	HA	FLUAVH3N2_s4p1	0	313	313	100.0	313	0	0	0	0	392	392	100.0	392	0	0	0	DRR051417
NC_007366.1	1514	A	G	synonymous_variant	p.Gly495Gly	HA	FLUAVH3N2_s4p1	0	331	331	100.0	331	0	0	0	0	397	397	100.0	397	0	0	0	DRR051417
NC_007366.1	1541	C	T	synonymous_variant	p.His504His	HA	FLUAVH3N2_s4p1	0	269	269	100.0	0	269	0	0	0	373	373	100.0	0	373	0	0	DRR051417
NC_007366.1	1553	G	A	synonymous_variant	p.Arg508Arg	HA	FLUAVH3N2_s4p1	0	322	323	100.0	0	0	322	1	0	385	386	100.0	0	0	385	1	DRR051417
NC_007366.1	1553	G	A	synonymous_variant	p.Arg508Arg	HA	FLUAVH3N2_s4p1	0	322	323	100.0	0	0	322	1	0	385	386	100.0	0	0	385	1	DRR051417
NC_007366.1	1553	G	A	missense_variant	p.Arg508Ser	HA	FLUAVH3N2_s4p1	0	322	323	100.0	0	0	322	1	0	385	386	100.0	0	0	385	1	DRR051417
NC_007366.1	1553	G	A	missense_variant	p.Arg508Ser	HA	FLUAVH3N2_s4p1	0	322	323	100.0	0	0	322	1	0	385	386	100.0	0	0	385	1	DRR051417
NC_007366.1	1586	G	A	synonymous_variant	p.Lys519Lys	HA	FLUAVH3N2_s4p1	0	364	366	100.0	0	0	364	2	0	424	427	100.0	0	0	424	3	DRR051417
NC_007366.1	1586	G	A	synonymous_variant	p.Lys519Lys	HA	FLUAVH3N2_s4p1	0	364	366	100.0	0	0	364	2	0	424	427	100.0	0	0	424	3	DRR051417
NC_007366.1	1586	G	A	missense_variant	p.Lys519Asn	HA	FLUAVH3N2_s4p1	0	364	366	100.0	0	0	364	2	0	424	427	100.0	0	0	424	3	DRR051417
NC_007366.1	1586	G	A	missense_variant	p.Lys519Asn	HA	FLUAVH3N2_s4p1	0	364	366	100.0	0	0	364	2	0	424	427	100.0	0	0	424	3	DRR051417
NC_007366.1	1589	A	T	synonymous_variant	p.Gly520Gly	HA	FLUAVH3N2_s4p1	0	353	353	100.0	353	0	0	0	0	413	413	100.0	413	0	0	0	DRR051417
NC_007366.1	1596	C	T	synonymous_variant	p.Leu523Leu	HA	FLUAVH3N2_s4p1	0	340	340	100.0	0	340	0	0	0	431	431	100.0	0	431	0	0	DRR051417
NC_007366.1	1607	G	A	synonymous_variant	p.Gly526Gly	HA	FLUAVH3N2_s4p1	0	335	335	100.0	0	0	335	0	0	419	419	100.0	0	0	419	0	DRR051417
NC_007367.1	151	A	C	synonymous_variant	p.Leu42Leu	M1	FLUAVH3N2_s7p2	833	4	837	0.4779	4	833	0	0	1008	6	1015	0.5917	6	1008	0	1	DRR051417
NC_007367.1	151	A	C	synonymous_variant	p.Leu42Leu	M1	FLUAVH3N2_s7p2	833	4	837	0.4779	4	833	0	0	1008	6	1015	0.5917	6	1008	0	1	DRR051417
NC_007367.1	186	A	C	missense_variant	p.Pro54His	M1	FLUAVH3N2_s7p2	808	5	813	0.615	5	808	0	0	987	7	994	0.7041999999999999	7	987	0	0	DRR051417
NC_007367.1	281	T	G	missense_variant	p.Gly86Trp	M1	FLUAVH3N2_s7p2	638	5	643	0.7776	0	0	638	5	797	5	802	0.6234	0	0	797	5	DRR051417
NC_007367.1	292	C	T	synonymous_variant	p.Asp89Asp	M1	FLUAVH3N2_s7p2	0	528	528	100.0	0	528	0	0	0	626	626	100.0	0	626	0	0	DRR051417
NC_007367.1	466	G	A	synonymous_variant	p.Val147Val	M1	FLUAVH3N2_s7p2	0	538	538	100.0	0	0	538	0	0	666	666	100.0	0	0	666	0	DRR051417
NC_007367.1	481	G	A	synonymous_variant	p.Glu152Glu	M1	FLUAVH3N2_s7p2	1	588	589	99.8302	1	0	588	0	1	672	673	99.8514	1	0	672	0	DRR051417
NC_007367.1	517	G	A	synonymous_variant	p.Gln164Gln	M1	FLUAVH3N2_s7p2	0	528	529	100.0	0	0	528	1	0	620	621	100.0	0	0	620	1	DRR051417
NC_007367.1	517	G	A	synonymous_variant	p.Gln164Gln	M1	FLUAVH3N2_s7p2	0	528	529	100.0	0	0	528	1	0	620	621	100.0	0	0	620	1	DRR051417
NC_007367.1	517	G	A	missense_variant	p.Gln164His	M1	FLUAVH3N2_s7p2	0	528	529	100.0	0	0	528	1	0	620	621	100.0	0	0	620	1	DRR051417
NC_007367.1	517	G	A	missense_variant	p.Gln164His	M1	FLUAVH3N2_s7p2	0	528	529	100.0	0	0	528	1	0	620	621	100.0	0	0	620	1	DRR051417
NC_007367.1	598	G	A	synonymous_variant	p.Gln191Gln	M1	FLUAVH3N2_s7p2	1	459	460	99.7826	1	0	459	0	1	604	605	99.8347	1	0	604	0	DRR051417
NC_007367.1	637	G	A	synonymous_variant	p.Glu204Glu	M1	FLUAVH3N2_s7p2	0	498	498	100.0	0	0	498	0	0	590	590	100.0	0	0	590	0	DRR051417
NC_007367.1	658	G	A	synonymous_variant	p.Gln211Gln	M1	FLUAVH3N2_s7p2	0	485	485	100.0	0	0	485	0	0	596	596	100.0	0	0	596	0	DRR051417
NC_007367.1	680	A	G	missense_variant	p.Val219Ile	M1	FLUAVH3N2_s7p2	0	520	520	100.0	520	0	0	0	0	607	607	100.0	607	0	0	0	DRR051417
NC_007367.1	697	T	C	synonymous_variant	p.Ser224Ser	M1	FLUAVH3N2_s7p2	0	525	525	100.0	0	0	0	525	0	625	625	100.0	0	0	0	625	DRR051417
NC_007367.1	785	T	C	synonymous_variant	p.Asp24Asp	M2	FLUAVH3N2_s7p1	601	8	610	1.3136	1	601	0	8	728	10	739	1.355	1	728	0	10	DRR051417
NC_007367.1	785	T	C	synonymous_variant	p.Asp24Asp	M2	FLUAVH3N2_s7p1	601	8	610	1.3136	1	601	0	8	728	10	739	1.355	1	728	0	10	DRR051417

Mutation Count Bar Plot

Each sample has its own mutation count bar plot, as well as each reference pathogen genome has its own global mutation count bar plot. For each possible mutation, it counts how many times the mutation happened among all reads, regardless of the position, and plots it as a bar. It also displays the mean coverage and median coverage, where the coverage is the number of reads with alternate allele. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationCountBarPlot.png for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/common.mutationCountBarPlot.png for each sample.

Mutation Count Box Plot

Each sample has its own mutation count box plot, as well as each reference pathogen genome has its own global mutation count box plot. For each possible mutation, it plots the box plot regardless of the position. It is possible to have many outliers, hence the graph can look like a dot plot with a small box in the bottom side. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationCountBoxPlot.png for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/common.mutationCountBoxPlot.png for each sample.

Mutation Count Stacked Bar Plot Per Gene

Each sample has its own mutation count stacked bar plot per gene, as well as each reference pathogen genome has its own global mutation count stacked bar plot per gene. For each gene, it plots a stacked bar, split by each possible mutation, where the length of each bar section is given by the number of reads that mutation has in that gene. It is possible that there are multiple graphs for each sample or reference pathogen. This is because it plots at most 100 genes per file, each file representing a group number. The genes are displayed ordered as they appear in the genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/geneBarPlot for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/geneBarPlot for each sample. There is also a file named _groups.txt where it says the group number of each gene.

Mutation Count Box Plot Per Gene

Each sample has its own mutation count box plot per gene, as well as each reference pathogen genome has its own global mutation count box plot per gene. For each gene, it plots a box plot based on the number of mutated reads in that gene across all the different positions. Some box plots might have many outliers, so they can look like dot plots with a small box in the left side. It is possible that there are multiple graphs for each sample or reference pathogen. This is because it plots at most 100 genes per file, each file representing a group number. The genes are displayed ordered as they appear in the genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/geneBoxPlot for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/geneBoxPlot for each sample. There is also a file named _groups.txt where it says the group number of each gene.

Mutation Count Per Sample Bar Plot

Each reference pathogen genome has its own global mutation count per sample bar plot. For each possible mutation, it counts how many times the mutation happened among all reads in each sample, regardless of the position, and plots it as a bar. Each sample has its own bar. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationsPerRunCountBarPlot.png for each reference genome graph.

Mutation Count Per Sample Box Plot

Each reference pathogen genome has its own global mutation count per sample bar plot. For each sample, it plots a box plot based on the number of mutated reads in that sample across all the different positions. Some box plots might have many outliers, so they can look like dot plots with a small box in the left side. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationsPerRunCountBoxPlot.png for each reference genome graph.

Frequency Per Mutation Strip Plot

Each reference pathogen genome has its own frequency per mnutation strip plot. It plots a strip plot, where each dot is the frequency of an SNV, and it appears in the strip of their respective mutation. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.frequencyPerMutation.png for each reference genome graph.

Frequency Per Gene Strip Plot

Each reference pathogen genome has its own frequency per gene strip plot. It plots a strip plot, where each dot is the frequency of an SNV, and it appears in the strip of their respective gene. It is possible that there are multiple graphs for each sample or reference pathogen. This is because it plots at most 100 genes per file, each file representing a group number. The genes are displayed ordered as they appear in the genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.frequencyPerMutation.png for each reference genome graph. There is also a file named _groups.txt where it says the group number of each gene.

Frequency Per Sample Strip Plot

Each reference pathogen genome has its own frequency per sample strip plot. It plots a strip plot, where each dot is the frequency of an SNV, and it appears in the strip of their respective sample. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.frequencyPerRun.png for each reference genome graph.

Distribution Histograms Plot

Each reference pathogen genome has its own distribution histograms plots. It plots six graphs in one file per chromosome/segment, where each graph represents a mutation (top half) and its reverse complement (bottom half). Each graph has a histogram displaying the number of SNVs of that mutation found each 100 nucleotides with respect to the reference genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/CHROMOSOME_histogram.png for each chromosome/segment in each reference genome.

Regression Plot

Each reference pathogen genome has its own regression plots. It displays a dot plot of mutated (or alternate) reads vs the total reads per position. If it finds a suitable linear function, then it is also plotted. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/CHROMOSOME.regression.png for each chromosome/segment in each reference genome.

Presence Per Run Position Plot

Each reference pathogen genome has its own presence per run position plots as a circos graph and a heatmap.

The circos graph is a circular graph that displays a different position each arc, up to 150 positions, while each concentric strip at each radius level displays a different run, creating cells. The circos graph supports up to 20 samples. If a cell is colored, then it means that the corresponding position was mutated in that sample. If the graph contains multiple genes, the genes range will be displayed in the innermost circle strip. Because there is a limit on the positions that each circos graph can display, it is possible that there are multiple graphs for each sample or reference pathogen. Genes that have less than 150 mutated positions will be grouped with other genes that also have less than that number of mutated positions. If a gene has more than 150 mutated positions, it will be split in different files. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/circos/CHROMOSOME_GROUP.png for each group in each chromosome/segment per reference genome, or 6-visualization/REFERENCE_NAME/graphs/circos/GENE_GROUP.png for each group in each gene per reference genome.

The heatmap is a tabular graph where each column represents a sample, and each row represents a position in the reference chromosome/segment. Each heatmap can have up to 300 positions. If a cell is colored, then it means that the corresponding position was mutated in that sample. If the graph contains multiple genes, the genes range will be displayed to the right of the heatmap. Because there is a limit on the positions that each heatmap can display, it is possible that there are multiple graphs for each reference pathogen. Genes that have less than 300 mutated positions will be grouped with other genes that also have less than that number of mutated positions. If a gene has more than 300 mutated positions, it will be split in different files. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/heatmap/CHROMOSOME_GROUP.png for each group in each chromosome/segment per reference genome, or 6-visualization/REFERENCE_NAME/graphs/heatmap/GENE_GROUP.png for each group in each gene per reference genome.

Frequency Per Mutation Position Plot

Each sample has its own frequency per mutation position plots as a circos graph and a heatmap.

The circos graph is a circular graph that displays a different position each arc, up to 150 positions, while each concentric strip at each radius level displays a different mutation, creating cells. If a cell is colored, then it means that the corresponding position had that type of mutation, while the intensity of the color represents the frequency of that position-mutation. If the graph contains multiple genes, the genes range will be displayed in the innermost circle strip. Because there is a limit on the positions that each circos graph can display, it is possible that there are multiple graphs for each sample. Genes that have less than 150 mutated positions will be grouped with other genes that also have less than that number of mutated positions. If a gene has more than 150 mutated positions, it will be split in different files. These graphs were stored at 6-visualization/REFERENCE_NAME/SAMPLE_NAME/circos/CHROMOSOME_GROUP.png for each group in each chromosome/segment per sample, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/circos/GENE_GROUP.png for each group in each gene per sample.

The heatmap is a tabular graph where each column represents a sample, and each row represents a position in the reference chromosome/segment. Each heatmap can have up to 300 positions. If a cell is colored, then it means that the corresponding position had that type of mutation, while the intensity of the color represents the frequency of that position-mutation. If the graph contains multiple genes, the genes range will be displayed to the right of the heatmap. Because there is a limit on the positions that each heatmap can display, it is possible that there are multiple graphs for each sample or reference pathogen. Genes that have less than 300 mutated positions will be grouped with other genes that also have less than that number of mutated positions. If a gene has more than 300 mutated positions, it will be split in different files. These graphs were stored at 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/heatmap/CHROMOSOME_GROUP.png for each group in each chromosome/segment per sample, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/heatmap/GENE_GROUP.png for each group in each gene per sample.