SNV Analysis Report

Overview

Reads Source: List of BioProject accessions
Total Samples: 9 samples
Results Directory: ../mtuberculosis/
Reference Host Genome: 2-alignment/host/genomes/Homo_sapiens_GRCh38/genome.fa

Reference Pathogen Genomes:

Genome file	Protein file	Gene file
data/mycobacteriumTuberculosis/genome.fa	data/mycobacteriumTuberculosis/protein.fa	data/mycobacteriumTuberculosis/genes.gbk

Input Reads:

ID	Type	File 1	File 2
SRR25792492	paired	data/fastq/SRR25792492_1.fastq	data/fastq/SRR25792492_2.fastq
SRR25792493	paired	data/fastq/SRR25792493_1.fastq	data/fastq/SRR25792493_2.fastq
SRR25792494	paired	data/fastq/SRR25792494_1.fastq	data/fastq/SRR25792494_2.fastq
SRR25792495	paired	data/fastq/SRR25792495_1.fastq	data/fastq/SRR25792495_2.fastq
SRR25787973	paired	data/fastq/SRR25787973_1.fastq	data/fastq/SRR25787973_2.fastq
SRR25787974	paired	data/fastq/SRR25787974_1.fastq	data/fastq/SRR25787974_2.fastq
SRR25787975	paired	data/fastq/SRR25787975_1.fastq	data/fastq/SRR25787975_2.fastq
SRR25787976	paired	data/fastq/SRR25787976_1.fastq	data/fastq/SRR25787976_2.fastq
SRR25787977	paired	data/fastq/SRR25787977_1.fastq	data/fastq/SRR25787977_2.fastq

Read Quality

Quality Check

The quality check was done using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). This tool analyzes the quality of all reads in fastq files and creates reports that help identify quality issues in high-throughput sequencing datasets. All the results were stored in 1-quality/fastqc.

Read Cropping

Read cropping was done using Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic). This tool preprocesses high-throughput sequencing data from next-generation sequencing platforms. It specializes in quality control and trimming of raw sequence reads, removing artifacts, adapters, and low-quality bases. When SNVGuru identifies that a read has a quality decay greater than 1.0, it crops the reads down to 100 base pairs. The cropped fastq files were stored in 1-quality/fastq.

Host Alignment

The reads were aligned against a host reference genome to remove reads that belong to the host rather than the pathogen, since these could alter the results of the analysis. This alignment was done using STAR (https://github.com/alexdobin/STAR). This tool is a widely used RNA-seq read aligner for short and long reads, particularly well-suited for mapping reads to genomes with complex structures, such as those with many introns and alternative splicing events. The initial alignments were stored in SAM format at 2-alignment/host/sam.

After this, the reads that did not align against the host reference genome were extracted using samtools (http://www.htslib.org/). First, SNVGuru runs samtools view -F 256 on the SAM files, so that every sequence that aligned is ignored and the rest are saved as BAM files at 2-alignment/host/bam. Then, it runs samtools bam2fq on the resulting BAM files to convert them into fastq files. These filtered fastq files were stored at 2-alignment/host/fastq. The read counts before and after filtering are as follows:

Sample	Reads Before Filter	Reads After Filter
SRR25787973	14912526	14782714
SRR25787974	11072537	11017445
SRR25787975	14019682	14006708
SRR25787976	14982631	14894984
SRR25787977	12627739	12573458
SRR25792492	15396871	15323335
SRR25792493	13904011	13637937
SRR25792494	13689795	13523872
SRR25792495	14404817	14397615
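
As a quick check on the table above, the share of reads removed by the host filter can be computed per sample. The following is a minimal sketch that hard-codes three rows from the table; the function name removed_percent is illustrative and is not part of SNVGuru.

```python
# Read counts copied from the table above: (before filter, after filter).
read_counts = {
    "SRR25787973": (14912526, 14782714),
    "SRR25792493": (13904011, 13637937),
    "SRR25792495": (14404817, 14397615),
}


def removed_percent(before: int, after: int) -> float:
    """Percentage of reads dropped by the host filter."""
    return 100.0 * (before - after) / before


for sample, (before, after) in read_counts.items():
    print(f"{sample}: {removed_percent(before, after):.2f}% of reads removed")
```

For the three rows shown, the filter removes less than 2% of the reads in each sample.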

Pathogen Alignment

The reads were aligned against the provided reference pathogen genomes using HISAT2 (http://daehwankimlab.github.io/hisat2/). This tool is a widely used RNA-seq read aligner for short reads, particularly well-suited for eukaryotic transcriptomes with complex splicing patterns. The initial alignments were stored in SAM format at 2-alignment/pathogen/sam. Then, using samtools (http://www.htslib.org/), the alignments were sorted and converted into BAM files with samtools sort, and finally the MD and NM tags were added with samtools calmd. The resulting BAM files were stored at 2-alignment/pathogen/bam, where the .sorted.bam files are the output of samtools sort, and the .bam files are the final BAM files produced by samtools calmd.

Alignment Quality

The alignments against the pathogen reference genome were analyzed using Qualimap 2 (http://qualimap.conesalab.org/). This tool inspects SAM/BAM files, analyzes the features of the mapped reads and generates a report of the aligned data. This helps detect issues in the sequencing and/or mapping of the data. The results were stored at 3-qualimap.

After the analysis is done, SNVGuru removes the samples that produced a general error rate greater than 3.0%. The error rates were the following:

Reference pathogen	ID	Error rate (%)
mycobacteriumTuberculosis	SRR25792492	0.0
mycobacteriumTuberculosis	SRR25792493	0.0
mycobacteriumTuberculosis	SRR25792494	0.0
mycobacteriumTuberculosis	SRR25792495	0.03
mycobacteriumTuberculosis	SRR25787973	0.0
mycobacteriumTuberculosis	SRR25787974	0.0
mycobacteriumTuberculosis	SRR25787975	0.01
mycobacteriumTuberculosis	SRR25787976	0.0
mycobacteriumTuberculosis	SRR25787977	0.0
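
The error-rate filter described above amounts to a simple threshold test. The sketch below illustrates it with three rows from the table; the names MAX_ERROR_RATE and passing_samples are hypothetical, and the actual SNVGuru implementation may differ.

```python
# Error rates (%) copied from the table above.
error_rates = {
    "SRR25792492": 0.0,
    "SRR25792495": 0.03,
    "SRR25787975": 0.01,
}

MAX_ERROR_RATE = 3.0  # threshold stated in the report, in percent


def passing_samples(rates: dict, limit: float = MAX_ERROR_RATE) -> list:
    """Keep samples whose error rate is not greater than the limit."""
    return [sample for sample, rate in rates.items() if rate <= limit]


print(passing_samples(error_rates))
```

All samples in this report are far below the threshold, so none would be removed by this rule.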

SNV Calling

The SNV calling step was performed using REDItools2 (https://github.com/BioinfoUNIBA/REDItools2) and JACUSA2 (https://github.com/dieterich-lab/JACUSA2).

REDItools2 is a toolkit designed for the analysis of RNA editing events in high-throughput sequencing data, identifying, quantifying, and characterizing RNA editing sites from RNA-seq data. It generates TXT files with the SNV data, which were transformed into VCF files, and these VCF files were also modified for using them as SnpEff inputs. These files were stored at 4-snvCalling/reditools. The files used for SnpEff are named as SAMPLE.reditools.presnpeff.vcf.

JACUSA2 is a framework for single nucleotide variant and reverse transcriptase induced arrest event detection in next-generation sequencing data. It generates VCF files with the SNV data, which were then preprocessed for use as SnpEff inputs. These files were stored at 4-snvCalling/jacusa. The original output files are named SAMPLE.jacusa.vcf, while the files used for SnpEff are named SAMPLE.jacusa.presnpeff.vcf. There are also files named SAMPLE.jacusa.vcf.filtered and SAMPLE.jacusa.vcf.filtered.idx, which are byproducts of running the program.

Gene and Functional Effect Identification

For identifying the gene and functional effect of each SNV, the VCF files from the previous step were processed with SnpEff (http://pcingola.github.io/SnpEff/). It is a genetic variant annotation and functional effect prediction toolbox, particularly made for single nucleotide polymorphisms and small insertions/deletions. It categorizes variants based on their impact on genes, classifying them into different functional consequences such as synonymous, nonsynonymous, frameshift, and more. The output files of this tool were stored at 5-snpeff.

Allele-Specific Strand Odds Ratio Calculation

The computation of AS strand odds ratio (AS_SOR) was done executing BCFtools' (https://samtools.github.io/bcftools/) mpileup on each resulting BAM file from the alignment using the argument -a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP in order to get the allelic depth of the forward and reverse strands for both the reference and the aligned sequences. The output files are found at 4-snvCalling/depths/REFERENCE_NAME/SAMPLE_NAME.mpileup.vcf for each pathogen reference genome and sample pair.

The last column of each output file is named after the path of the respective BAM file. This column holds a string that, when split by the colon (:) character, yields six fields. The fourth one is the allelic depth for the forward strand (ADF), and the fifth one is the allelic depth for the reverse strand (ADR). Both fields have two comma-separated values, where the first one corresponds to the reference allele and the second one corresponds to the alternate allele. This leaves us with four values: forward reference depth (FRD), reverse reference depth (RRD), forward alternate depth (FAD) and reverse alternate depth (RAD). The formula for calculating the AS_SOR, according to GATK (https://gatk.broadinstitute.org/hc/en-us/articles/4414586726683-AS-StrandOddsRatio), is as follows: $$AS\_SOR = \ln\left(\frac{FAD \cdot RRD}{FRD \cdot RAD}\right) + \ln\left(\frac{\min(FRD, RRD)}{\max(FRD, RRD)}\right) - \ln\left(\frac{\min(FAD, RAD)}{\max(FAD, RAD)}\right)$$ If a mutation has an AS_SOR > 4.0, it is filtered out of the resulting files and graphs.
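
The parsing and the formula above can be sketched as follows. This is only an illustration of the formula as written in this report: the function names are hypothetical, the example sample column is invented, and returning None when a depth is zero is an assumption, since the report does not say how undefined logarithms are handled.

```python
import math
from typing import Optional


def parse_strand_depths(sample_field: str) -> tuple:
    """Return (FRD, RRD, FAD, RAD) from a colon-separated sample column.

    Assumes the layout described above: six fields, with the forward-strand
    depths fourth and the reverse-strand depths fifth, each holding
    "reference,alternate" values.
    """
    fields = sample_field.split(":")
    frd, fad = (int(v) for v in fields[3].split(",")[:2])
    rrd, rad = (int(v) for v in fields[4].split(",")[:2])
    return frd, rrd, fad, rad


def as_sor(frd: int, rrd: int, fad: int, rad: int) -> Optional[float]:
    """Evaluate the AS_SOR formula as written in this report.

    Returns None when a zero depth makes a logarithm undefined (assumption).
    """
    if min(frd, rrd, fad, rad) <= 0:
        return None
    return (
        math.log((fad * rrd) / (frd * rad))
        + math.log(min(frd, rrd) / max(frd, rrd))
        - math.log(min(fad, rad) / max(fad, rad))
    )


# Hypothetical sample column (values invented for illustration).
depths = parse_strand_depths("0,30,255:40:0:10,12:10,8:20,20")
score = as_sor(*depths)
print(depths, score, score is not None and score > 4.0)
```

With these invented depths the score is well below 4.0, so the mutation would be kept.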

Results

Common Identified SNVs

This step merges the SNVs identified by JACUSA2 and REDItools2 by position and mutation (nucleotide change). A combination of position and mutation is kept only if it appears in both outputs; if it is missing from either one, it is discarded. Furthermore, these SNVs are filtered by the following values:

Minimum base quality: 35
Minimum read quality: 25
Minimum SNV coverage: 20
Minimum main read support: 4
Minimum SNV frequency: 0.0

If there is a position that has multiple mutations, these are split into a row per mutation per position.
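
The merge described above is essentially an intersection keyed by position and mutation, followed by threshold filters. The sketch below shows that idea only. The record layout and key names are hypothetical, "main read support" is assumed to mean the number of alternate reads, and applying the filters to both callers is also an assumption. The base quality, read quality and frequency filters are omitted.

```python
MIN_COVERAGE = 20   # "Minimum SNV coverage" from the list above
MIN_ALT_READS = 4   # assumed meaning of "Minimum main read support"


def passes(record: dict) -> bool:
    """Check the coverage and alternate-read thresholds for one record."""
    return (
        record["total_reads"] >= MIN_COVERAGE
        and record["alt_reads"] >= MIN_ALT_READS
    )


def common_snvs(jacusa: dict, reditools: dict) -> dict:
    """Keep (chrom, position, ref, alt) keys reported by both callers."""
    shared = jacusa.keys() & reditools.keys()
    return {
        key: {"jacusa": jacusa[key], "reditools": reditools[key]}
        for key in sorted(shared)
        if passes(jacusa[key]) and passes(reditools[key])
    }


# First entry uses counts from the sample table; the second is invented
# to show an SNV reported by only one caller.
jacusa_calls = {
    ("AL123456.3", 4013, "T", "C"): {"alt_reads": 153, "total_reads": 153},
    ("AL123456.3", 9999, "A", "G"): {"alt_reads": 30, "total_reads": 60},
}
reditools_calls = {
    ("AL123456.3", 4013, "T", "C"): {"alt_reads": 138, "total_reads": 138},
}

print(list(common_snvs(jacusa_calls, reditools_calls)))
```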

These files were stored at 6-visualization/csv/globalCommon.csv for the global results among all samples, and 6-visualization/SAMPLE_NAME/csv/runCommon.csv for the results of each sample. There is also a file for the global results and for each sample of the results by JACUSA2 (6-visualization/REFERENCE_NAME/csv/globalJacusa.csv and 6-visualization/REFERENCE_NAME/SAMPLE_NAME/csv/jacusa.csv) and REDItools2 (6-visualization/csv/globalReditools.csv and 6-visualization/SAMPLE_NAME/csv/reditools.csv). Here is a sample from the global results file.

CHROM	Position	Alt	Reference	Type	AAVar	GeneName	GeneID	RefReads	AltReads	TotalReads	Frequency	A	C	G	T	JacRefReads	JacAltReads	JacTotalReads	JacFrequency	JacA	JacC	JacG	JacT	Sample
AL123456.3	1977	G	A	upstream_gene_variant	nan	dnaN	Rv0002	0	66	66	100.0	0	0	66	0	0	94	94	100.0	0	0	94	0	SRR25792492
AL123456.3	4013	C	T	missense_variant	p.Ile245Thr	recF	Rv0003	0	138	138	100.0	0	138	0	0	0	153	153	100.0	0	153	0	0	SRR25792492
AL123456.3	5563	T	G	missense_variant	p.Lys108Asn	gyrB	Rv0005	1132	4	1136	0.3520999999999999	0	0	1132	4	1486	4	1490	0.2684999999999999	0	0	1486	4	SRR25792492
AL123456.3	5617	T	G	synonymous_variant	p.Ser126Ser	gyrB	Rv0005	1205	7	1212	0.5776	0	0	1205	7	1436	7	1443	0.4851	0	0	1436	7	SRR25792492
AL123456.3	6134	T	G	stop_gained	p.Glu299*	gyrB	Rv0005	1158	6	1164	0.5155	0	0	1158	6	1344	6	1350	0.4444	0	0	1344	6	SRR25792492
AL123456.3	6178	T	C	synonymous_variant	p.Gly313Gly	gyrB	Rv0005	1229	4	1233	0.3243999999999999	0	1229	0	4	1479	5	1484	0.3369	0	1479	0	5	SRR25792492
AL123456.3	6220	T	G	synonymous_variant	p.Val327Val	gyrB	Rv0005	971	4	975	0.4103	0	0	971	4	1185	4	1189	0.3364	0	0	1185	4	SRR25792492
AL123456.3	6349	T	G	missense_variant	p.Gln370His	gyrB	Rv0005	1468	4	1472	0.2717	0	0	1468	4	1743	4	1747	0.2289999999999999	0	0	1743	4	SRR25792492
AL123456.3	6558	C	G	missense_variant	p.Gly440Ala	gyrB	Rv0005	956	4	960	0.4166999999999999	0	4	956	0	1092	4	1097	0.365	0	4	1092	1	SRR25792492
AL123456.3	6626	C	G	missense_variant	p.Ala463Pro	gyrB	Rv0005	806	4	810	0.4937999999999999	0	4	806	0	1016	4	1023	0.3922	3	4	1016	0	SRR25792492
AL123456.3	7362	C	G	missense_variant	p.Glu21Gln	gyrA	Rv0006	0	778	778	100.0	0	778	0	0	0	876	876	100.0	0	876	0	0	SRR25792492
AL123456.3	7585	C	G	missense_variant	p.Ser95Thr	gyrA	Rv0006	0	525	525	100.0	0	525	0	0	0	636	636	100.0	0	636	0	0	SRR25792492
AL123456.3	8559	T	G	stop_gained	p.Gly420*	gyrA	Rv0006	783	4	787	0.5083	0	0	783	4	912	4	916	0.4367	0	0	912	4	SRR25792492
AL123456.3	8936	T	G	missense_variant	p.Gln545His	gyrA	Rv0006	901	4	905	0.442	0	0	901	4	1139	4	1143	0.35	0	0	1139	4	SRR25792492
AL123456.3	9089	T	G	synonymous_variant	p.Val596Val	gyrA	Rv0006	843	4	847	0.4723	0	0	843	4	994	5	999	0.5005	0	0	994	5	SRR25792492
AL123456.3	9304	A	G	missense_variant	p.Gly668Asp	gyrA	Rv0006	0	681	681	100.0	681	0	0	0	0	803	803	100.0	803	0	0	0	SRR25792492
AL123456.3	9628	T	C	missense_variant	p.Ala776Val	gyrA	Rv0006	763	4	767	0.5215	0	763	0	4	1168	4	1172	0.3413	0	1168	0	4	SRR25792492
AL123456.3	9921	T	C	missense_variant	p.Ala3Val	Rv0007	Rv0007	811	4	815	0.4908	0	811	0	4	1007	4	1011	0.3956	0	1007	0	4	SRR25792492
AL123456.3	11820	G	C	upstream_gene_variant	nan	ppiA	Rv0009	0	26	26	100.0	0	0	26	0	0	32	32	100.0	0	0	32	0	SRR25792492
AL123456.3	11879	G	A	missense_variant	p.Ser145Pro	Rv0008c	Rv0008c	0	30	30	100.0	0	0	30	0	0	34	34	100.0	0	0	34	0	SRR25792492
AL123456.3	14785	C	T	missense_variant	p.Cys233Arg	Rv0012	Rv0012	0	146	147	100.0	1	146	0	0	0	180	181	100.0	1	180	0	0	SRR25792492
AL123456.3	14785	C	T	missense_variant	p.Cys233Ser	Rv0012	Rv0012	0	146	147	100.0	1	146	0	0	0	180	181	100.0	1	180	0	0	SRR25792492
AL123456.3	14861	T	G	missense_variant	p.Gly258Val	Rv0012	Rv0012	0	194	194	100.0	0	0	0	194	1	224	225	99.5556	0	0	1	224	SRR25792492
AL123456.3	15117	G	C	missense_variant	p.Ile68Met	trpG	Rv0013	0	258	258	100.0	0	0	258	0	0	302	302	100.0	0	0	302	0	SRR25792492
AL123456.3	16119	A	C	missense_variant	p.Arg451Leu	pknB	Rv0014c	0	216	216	100.0	216	0	0	0	0	246	246	100.0	246	0	0	0	SRR25792492
AL123456.3	18394	A	C	missense_variant	p.Glu123Asp	pknA	Rv0015c	697	4	701	0.5706	4	697	0	0	821	4	826	0.4848	4	821	0	1	SRR25792492
AL123456.3	19514	A	C	stop_gained	p.Glu241*	pbpA	Rv0016c	253	4	257	1.5564	4	253	0	0	333	4	337	1.1868999999999998	4	333	0	0	SRR25792492
AL123456.3	21795	A	G	missense_variant	p.Pro463Ser	pstP	Rv0018c	0	225	225	100.0	225	0	0	0	0	259	259	100.0	259	0	0	0	SRR25792492
AL123456.3	21906	A	C	missense_variant	p.Ala426Ser	pstP	Rv0018c	376	4	380	1.0526	4	376	0	0	461	4	465	0.8602000000000001	4	461	0	0	SRR25792492
AL123456.3	22613	A	G	missense_variant	p.Ser190Leu	pstP	Rv0018c	365	4	369	1.084	4	0	365	0	464	4	468	0.8547000000000001	4	0	464	0	SRR25792492
AL123456.3	23750	A	C	upstream_gene_variant	nan	pknA	Rv0015c	441	4	445	0.8989	4	441	0	0	590	4	594	0.6734	4	590	0	0	SRR25792492
AL123456.3	24159	C	T	missense_variant	p.Tyr429Cys	fhaA	Rv0020c	799	4	803	0.4981	0	4	0	799	971	4	975	0.4103	0	4	0	971	SRR25792492
AL123456.3	24532	T	C	missense_variant	p.Gly305Ser	fhaA	Rv0020c	0	804	804	100.0	0	0	0	804	0	893	893	100.0	0	0	0	893	SRR25792492
AL123456.3	24716	G	A	synonymous_variant	p.Gly243Gly	fhaA	Rv0020c	58	81	139	58.2734	58	0	81	0	101	249	350	71.1429	101	0	249	0	SRR25792492
AL123456.3	24721	C	G	missense_variant	p.Arg242Gly	fhaA	Rv0020c	249	5	254	1.9685	0	5	249	0	443	9	453	1.9912	1	9	443	0	SRR25792492
AL123456.3	24885	A	C	missense_variant	p.Arg187Leu	fhaA	Rv0020c	1101	4	1105	0.362	4	1101	0	0	1455	4	1459	0.2742	4	1455	0	0	SRR25792492
AL123456.3	25210	A	C	stop_gained	p.Glu79*	fhaA	Rv0020c	1116	4	1120	0.3571	4	1116	0	0	1311	5	1316	0.3798999999999999	5	1311	0	0	SRR25792492
AL123456.3	25298	A	C	missense_variant	p.Gln49His	fhaA	Rv0020c	947	5	952	0.5252	5	947	0	0	1196	5	1201	0.4163	5	1196	0	0	SRR25792492
AL123456.3	25447	G	T	upstream_gene_variant	nan	rodA	Rv0017c	0	246	246	100.0	0	0	246	0	0	292	292	100.0	0	0	292	0	SRR25792492
AL123456.3	25610	C	G	upstream_gene_variant	nan	rodA	Rv0017c	0	54	54	100.0	0	54	0	0	0	57	57	100.0	0	57	0	0	SRR25792492
AL123456.3	34044	C	T	upstream_gene_variant	nan	bioF2	Rv0032	0	23	23	100.0	0	23	0	0	0	29	29	100.0	0	29	0	0	SRR25792492
AL123456.3	41378	T	G	missense_variant	p.Leu25Phe	Rv0038	Rv0038	302	4	306	1.3072	0	0	302	4	344	4	348	1.1494	0	0	344	4	SRR25792492
AL123456.3	41516	T	G	missense_variant	p.Trp71Cys	Rv0038	Rv0038	430	4	434	0.9217	0	0	430	4	530	4	534	0.7491	0	0	530	4	SRR25792492
AL123456.3	42281	A	C	missense_variant	p.Cys24Phe	Rv0039c	Rv0039c	0	126	126	100.0	126	0	0	0	0	147	147	100.0	147	0	0	0	SRR25792492
AL123456.3	42967	C	G	synonymous_variant	p.Pro133Pro	mtc28	Rv0040c	0	332	332	100.0	0	332	0	0	0	367	367	100.0	0	367	0	0	SRR25792492
AL123456.3	43732	A	G	synonymous_variant	p.Ser57Ser	leuS	Rv0041	0	121	121	100.0	121	0	0	0	0	141	141	100.0	141	0	0	0	SRR25792492
AL123456.3	44768	G	A	missense_variant	p.Arg403Gly	leuS	Rv0041	0	51	51	100.0	0	0	51	0	0	56	56	100.0	0	0	56	0	SRR25792492
AL123456.3	49360	T	C	missense_variant	p.Val194Ile	Rv0045c	Rv0045c	0	94	94	100.0	0	0	0	94	0	115	115	100.0	0	0	0	115	SRR25792492
AL123456.3	49966	A	C	upstream_gene_variant	nan	Rv0042c	Rv0042c	1480	7	1487	0.4707	7	1480	0	0	1648	7	1655	0.423	7	1648	0	0	SRR25792492
AL123456.3	50114	A	C	synonymous_variant	p.Val337Val	ino1	Rv0046c	1609	4	1613	0.248	4	1609	0	0	1832	4	1836	0.2178999999999999	4	1832	0	0	SRR25792492
AL123456.3	50270	A	C	missense_variant	p.Trp285Cys	ino1	Rv0046c	1814	5	1820	0.2749	5	1814	1	0	2310	5	2316	0.216	5	2310	1	0	SRR25792492
AL123456.3	50311	A	C	missense_variant	p.Gly272Cys	ino1	Rv0046c	1769	4	1773	0.2256	4	1769	0	0	2176	5	2181	0.2293	5	2176	0	0	SRR25792492
AL123456.3	50557	C	T	missense_variant	p.Arg190Gly	ino1	Rv0046c	0	1645	1645	100.0	0	1645	0	0	0	1753	1753	100.0	0	1753	0	0	SRR25792492
AL123456.3	51026	A	G	synonymous_variant	p.Gly33Gly	ino1	Rv0046c	1588	4	1592	0.2513	4	0	1588	0	1796	4	1800	0.2222	4	0	1796	0	SRR25792492
AL123456.3	51142	A	C	upstream_gene_variant	nan	Rv0042c	Rv0042c	1018	5	1024	0.4888	5	1018	0	1	1221	6	1229	0.489	6	1221	0	2	SRR25792492
AL123456.3	51171	A	G	upstream_gene_variant	nan	Rv0042c	Rv0042c	712	5	717	0.6974	5	0	712	0	1018	5	1023	0.4888	5	0	1018	0	SRR25792492
AL123456.3	51551	A	C	synonymous_variant	p.Ala59Ala	Rv0047c	Rv0047c	1873	9	1883	0.4781999999999999	9	1873	0	1	2051	10	2062	0.4852	10	2051	0	1	SRR25792492
AL123456.3	51580	A	C	missense_variant	p.Gly50Trp	Rv0047c	Rv0047c	2082	5	2087	0.2396	5	2082	0	0	2265	5	2270	0.2203	5	2265	0	0	SRR25792492
AL123456.3	51694	A	C	missense_variant	p.Glu12Lys	Rv0047c	Rv0047c	1384	5	1390	0.36	5	1384	0	1	1686	5	1692	0.2957	5	1686	0	1	SRR25792492
AL123456.3	51694	A	C	stop_gained	p.Glu12*	Rv0047c	Rv0047c	1384	5	1390	0.36	5	1384	0	1	1686	5	1692	0.2957	5	1686	0	1	SRR25792492
AL123456.3	51949	G	A	missense_variant	p.Val250Ala	Rv0048c	Rv0048c	0	50	50	100.0	0	0	50	0	0	58	58	100.0	0	0	58	0	SRR25792492
AL123456.3	54394	G	A	synonymous_variant	p.Ala244Ala	ponA1	Rv0050	0	800	800	100.0	0	0	800	0	0	848	848	100.0	0	0	848	0	SRR25792492
AL123456.3	55553	T	C	missense_variant	p.Pro631Ser	ponA1	Rv0050	0	49	49	100.0	0	0	0	49	13	100	113	88.4956	0	13	0	100	SRR25792492
AL123456.3	59563	T	G	missense_variant	p.Arg52Leu	rplI	Rv0056	351	5	356	1.4045	0	0	351	5	418	5	423	1.1820000000000002	0	0	418	5	SRR25792492
AL123456.3	59807	T	G	synonymous_variant	p.Ser133Ser	rplI	Rv0056	287	4	291	1.3746	0	0	287	4	337	4	341	1.173	0	0	337	4	SRR25792492
AL123456.3	62049	G	A	missense_variant	p.Arg552Trp	dnaB	Rv0058	0	247	249	100.0	0	0	247	2	0	300	302	100.0	0	0	300	2	SRR25792492
AL123456.3	62049	G	A	missense_variant	p.Arg552Gly	dnaB	Rv0058	0	247	249	100.0	0	0	247	2	0	300	302	100.0	0	0	300	2	SRR25792492
AL123456.3	63146	T	G	upstream_gene_variant	nan	Rv0059	Rv0059	0	239	239	100.0	0	0	0	239	0	255	255	100.0	0	0	0	255	SRR25792492
AL123456.3	65150	T	C	missense_variant	p.Trp67Cys	Rv0061c	Rv0061c	0	471	472	100.0	0	0	1	471	0	561	562	100.0	0	0	1	561	SRR25792492
AL123456.3	65150	T	C	stop_gained	p.Trp67*	Rv0061c	Rv0061c	0	471	472	100.0	0	0	1	471	0	561	562	100.0	0	0	1	561	SRR25792492
AL123456.3	65246	T	C	synonymous_variant	p.Gln35Gln	Rv0061c	Rv0061c	0	373	373	100.0	0	0	0	373	1	477	478	99.7908	0	1	0	477	SRR25792492
AL123456.3	68336	A	G	missense_variant	p.Val472Ile	Rv0063	Rv0063	0	51	51	100.0	51	0	0	0	0	51	51	100.0	51	0	0	0	SRR25792492
AL123456.3	69989	A	G	missense_variant	p.Gly457Asp	Rv0064	Rv0064	0	470	470	100.0	470	0	0	0	0	509	509	100.0	509	0	0	0	SRR25792492
AL123456.3	70267	T	G	missense_variant	p.Val550Phe	Rv0064	Rv0064	0	423	423	100.0	0	0	0	423	0	546	546	100.0	0	0	0	546	SRR25792492
AL123456.3	70816	G	A	missense_variant	p.Asn733Asp	Rv0064	Rv0064	0	433	433	100.0	0	0	433	0	0	480	480	100.0	0	0	480	0	SRR25792492
AL123456.3	71336	C	G	missense_variant	p.Arg906Pro	Rv0064	Rv0064	0	262	262	100.0	0	262	0	0	0	289	289	100.0	0	289	0	0	SRR25792492
AL123456.3	71874	T	G	missense_variant	p.Lys18Asn	vapC1	Rv0065	1499	4	1503	0.2661	0	0	1499	4	1796	4	1801	0.2222	1	0	1796	4	SRR25792492
AL123456.3	71914	C	T	missense_variant	p.Ser32Pro	vapC1	Rv0065	2	1829	1831	99.8908	0	1829	0	2	2	2082	2084	99.904	0	2082	0	2	SRR25792492
AL123456.3	72003	T	G	missense_variant	p.Gln61His	vapC1	Rv0065	2274	8	2282	0.3506	0	0	2274	8	2779	8	2787	0.287	0	0	2779	8	SRR25792492
AL123456.3	72055	T	C	missense_variant	p.His79Tyr	vapC1	Rv0065	2210	4	2214	0.1807	0	2210	0	4	2665	4	2669	0.1498999999999999	0	2665	0	4	SRR25792492
AL123456.3	75940	C	G	missense_variant	p.Val214Leu	Rv0068	Rv0068	0	57	57	100.0	0	57	0	0	0	65	65	100.0	0	65	0	0	SRR25792492
AL123456.3	78636	A	G	synonymous_variant	p.Gly87Gly	glyA2	Rv0070c	0	46	46	100.0	46	0	0	0	0	50	50	100.0	50	0	0	0	SRR25792492
AL123456.3	87468	T	C	missense_variant	p.Glu112Lys	Rv0078A	Rv0078A	0	254	254	100.0	0	0	0	254	0	310	310	100.0	0	0	0	310	SRR25792492
AL123456.3	87652	A	G	synonymous_variant	p.Asn50Asn	Rv0078A	Rv0078A	618	4	622	0.6431	4	0	618	0	774	4	778	0.5141	4	0	774	0	SRR25792492
AL123456.3	88269	T	G	synonymous_variant	p.Pro22Pro	Rv0079	Rv0079	4041	4	4046	0.0989	1	0	4041	4	4839	4	4844	0.0826	1	0	4839	4	SRR25792492
AL123456.3	88287	T	C	synonymous_variant	p.Ser28Ser	Rv0079	Rv0079	4337	4	4341	0.0921	0	4337	0	4	5084	4	5088	0.0786	0	5084	0	4	SRR25792492
AL123456.3	88288	T	G	missense_variant	p.Gly29Cys	Rv0079	Rv0079	4138	4	4142	0.0965999999999999	0	0	4138	4	4630	4	4634	0.0863	0	0	4630	4	SRR25792492
AL123456.3	88290	T	C	synonymous_variant	p.Gly29Gly	Rv0079	Rv0079	4254	7	4261	0.1643	0	4254	0	7	4678	8	4686	0.1707	0	4678	0	8	SRR25792492
AL123456.3	88291	T	G	missense_variant	p.Gly30Cys	Rv0079	Rv0079	4100	7	4107	0.1704	0	0	4100	7	4470	7	4477	0.1564	0	0	4470	7	SRR25792492
AL123456.3	88314	T	C	synonymous_variant	p.Ala37Ala	Rv0079	Rv0079	3836	4	3840	0.1042	0	3836	0	4	5648	5	5653	0.0884	0	5648	0	5	SRR25792492
AL123456.3	88327	T	C	missense_variant	p.Arg42Cys	Rv0079	Rv0079	4299	4	4303	0.093	0	4299	0	4	5977	5	5983	0.0836	0	5977	1	5	SRR25792492
AL123456.3	88328	T	G	missense_variant	p.Arg42Leu	Rv0079	Rv0079	4177	6	4184	0.1434	0	1	4177	6	5746	7	5755	0.1217	0	2	5746	7	SRR25792492
AL123456.3	88328	T	G	missense_variant	p.Arg42Pro	Rv0079	Rv0079	4177	6	4184	0.1434	0	1	4177	6	5746	7	5755	0.1217	0	2	5746	7	SRR25792492
AL123456.3	88333	T	G	missense_variant	p.Val44Leu	Rv0079	Rv0079	4879	14	4894	0.2860999999999999	1	0	4879	14	5856	18	5875	0.3064	1	0	5856	18	SRR25792492
AL123456.3	88333	T	G	missense_variant	p.Val44Met	Rv0079	Rv0079	4879	14	4894	0.2860999999999999	1	0	4879	14	5856	18	5875	0.3064	1	0	5856	18	SRR25792492
AL123456.3	88335	T	G	synonymous_variant	p.Val44Val	Rv0079	Rv0079	4730	18	4749	0.3791	0	1	4730	18	5355	19	5376	0.3536	0	2	5355	19	SRR25792492
AL123456.3	88336	T	G	missense_variant	p.Gly45Cys	Rv0079	Rv0079	4837	5	4842	0.1033	0	0	4837	5	5432	6	5438	0.1103	0	0	5432	6	SRR25792492
AL123456.3	88339	T	C	missense_variant	p.Arg46Cys	Rv0079	Rv0079	4806	4	4810	0.0832	0	4806	0	4	5507	4	5511	0.0726	0	5507	0	4	SRR25792492
AL123456.3	88340	T	G	missense_variant	p.Arg46Leu	Rv0079	Rv0079	4895	4	4899	0.0816	0	0	4895	4	5480	5	5486	0.0912	1	0	5480	5	SRR25792492
AL123456.3	88344	T	G	synonymous_variant	p.Val47Val	Rv0079	Rv0079	4716	4	4720	0.0847	0	0	4716	4	5455	4	5459	0.0733	0	0	5455	4	SRR25792492
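
In the sample rows above, the Frequency column appears to equal the alternate reads as a percentage of reference plus alternate reads, rather than of TotalReads (which can also include reads with other bases). This relationship is inferred from the rows shown, not from the SNVGuru source, so the sketch below is only a consistency check; the function name is illustrative.

```python
def frequency_percent(ref_reads: int, alt_reads: int) -> float:
    """Alternate reads as a percentage of reference + alternate reads."""
    return round(100.0 * alt_reads / (ref_reads + alt_reads), 4)


# (RefReads, AltReads, reported Frequency) taken from rows of the table above.
rows = [
    (1132, 4, 0.3521),  # position 5563 (printed as 0.3520999999999999)
    (1814, 5, 0.2749),  # position 50270, where TotalReads is 1820
    (0, 146, 100.0),    # position 14785, where TotalReads is 147
]

for ref_reads, alt_reads, reported in rows:
    print(frequency_percent(ref_reads, alt_reads), reported)
```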

Mutation Count Bar Plot

Each sample has its own mutation count bar plot, and each reference pathogen genome has its own global mutation count bar plot. For each possible mutation, it counts how many times the mutation occurred among all reads, regardless of position, and plots the count as a bar. It also displays the mean and median coverage, where coverage is the number of reads carrying the alternate allele. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationCountBarPlot.png for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/common.mutationCountBarPlot.png for each sample.
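
The counting behind this plot can be sketched as follows, assuming that "how many times the mutation happened among all reads" means the sum of alternate reads per nucleotide change. The rows are the first four rows of the sample table in the Common Identified SNVs section; the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, median

# (Reference, Alt, AltReads) from the first four rows of the sample table.
rows = [
    ("A", "G", 66),
    ("T", "C", 138),
    ("G", "T", 4),
    ("G", "T", 7),
]


def mutation_read_counts(records) -> Counter:
    """Sum alternate reads per mutation, ignoring the position."""
    counts = Counter()
    for ref, alt, alt_reads in records:
        counts[f"{ref}>{alt}"] += alt_reads
    return counts


counts = mutation_read_counts(rows)
coverage = [alt_reads for _, _, alt_reads in rows]
print(dict(counts), mean(coverage), median(coverage))
```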

Mutation Count Box Plot

Each sample has its own mutation count box plot, and each reference pathogen genome has its own global mutation count box plot. For each possible mutation, it draws a box plot regardless of position. There can be many outliers, so the graph may look like a dot plot with a small box at the bottom. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationCountBoxPlot.png for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/common.mutationCountBoxPlot.png for each sample.

Mutation Count Stacked Bar Plot Per Gene

Each sample has its own mutation count stacked bar plot per gene, and each reference pathogen genome has its own global mutation count stacked bar plot per gene. For each gene, it plots a stacked bar, split by each possible mutation, where the length of each bar section is given by the number of reads that mutation has in that gene. There may be multiple graphs for each sample or reference pathogen, because at most 100 genes are plotted per file, and each file represents a group number. The genes are displayed in the order in which they appear in the genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/geneBarPlot for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/geneBarPlot for each sample. There is also a file named _groups.txt that lists the group number of each gene.
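
The grouping of genes into files of at most 100 can be sketched as a simple chunking step. The function name, the gene names and the choice to number groups from 1 are all assumptions made for illustration; the actual numbering used in _groups.txt may differ.

```python
GENES_PER_FILE = 100  # limit stated in the report


def group_genes(genes_in_genome_order: list, size: int = GENES_PER_FILE) -> dict:
    """Map each gene to a group number, keeping genome order."""
    return {
        gene: index // size + 1
        for index, gene in enumerate(genes_in_genome_order)
    }


# Hypothetical gene names, already sorted by genome position.
genes = [f"gene_{i}" for i in range(250)]
groups = group_genes(genes)
print(groups["gene_0"], groups["gene_100"], groups["gene_249"])
```

With 250 genes this yields three groups, the last one holding the remaining 50 genes.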

Mutation Count Box Plot Per Gene

Each sample has its own mutation count box plot per gene, and each reference pathogen genome has its own global mutation count box plot per gene. For each gene, it draws a box plot based on the number of mutated reads in that gene across all positions. Some box plots may have many outliers, so they can look like dot plots with a small box on the left side. There may be multiple graphs for each sample or reference pathogen, because at most 100 genes are plotted per file, and each file represents a group number. The genes are displayed in the order in which they appear in the genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/geneBoxPlot for each reference genome graph, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/geneBoxPlot for each sample. There is also a file named _groups.txt that lists the group number of each gene.

Mutation Count Per Sample Bar Plot

Each reference pathogen genome has its own global mutation count per sample bar plot. For each possible mutation, it counts how many times the mutation happened among all reads in each sample, regardless of the position, and plots it as a bar. Each sample has its own bar. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationsPerRunCountBarPlot.png for each reference genome graph.

Mutation Count Per Sample Box Plot

Each reference pathogen genome has its own global mutation count per sample box plot. For each sample, it draws a box plot based on the number of mutated reads in that sample across all positions. Some box plots may have many outliers, so they can look like dot plots with a small box on the left side. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.mutationsPerRunCountBoxPlot.png for each reference genome graph.

Frequency Per Mutation Strip Plot

Each reference pathogen genome has its own frequency per mutation strip plot. It draws a strip plot where each dot is the frequency of an SNV, placed in the strip of its respective mutation. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.frequencyPerMutation.png for each reference genome graph.

Frequency Per Gene Strip Plot

Each reference pathogen genome has its own frequency per gene strip plot. It plots a strip plot, where each dot is the frequency of an SNV, and it appears in the strip of their respective gene. It is possible that there are multiple graphs for each sample or reference pathogen. This is because it plots at most 100 genes per file, each file representing a group number. The genes are displayed ordered as they appear in the genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.frequencyPerMutation.png for each reference genome graph. There is also a file named _groups.txt where it says the group number of each gene.

Frequency Per Sample Strip Plot

Each reference pathogen genome has its own frequency per sample strip plot. It plots a strip plot, where each dot is the frequency of an SNV, and it appears in the strip of their respective sample. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/common.frequencyPerRun.png for each reference genome graph.

Distribution Histograms Plot

Each reference pathogen genome has its own distribution histogram plots. For each chromosome/segment, it plots six graphs in one file, where each graph represents a mutation (top half) and its reverse complement (bottom half). Each graph is a histogram of the number of SNVs of that mutation found in every 100-nucleotide window of the reference genome. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/CHROMOSOME_histogram.png for each chromosome/segment in each reference genome.
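
The six graphs follow from pairing each of the 12 single-base substitutions with its reverse complement. The sketch below shows that pairing and a 100-nucleotide binning of positions. The function names are hypothetical, and aligning the windows to start at position 1 is an assumption.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BIN_SIZE = 100  # window width stated in the report


def reverse_complement_mutation(mutation: str) -> str:
    """'A>G' -> 'T>C': complement both the reference and alternate base."""
    ref, alt = mutation.split(">")
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"


def mutation_pairs() -> list:
    """Group the 12 single-base substitutions into 6 complementary pairs."""
    pairs, seen = [], set()
    for ref in "ACGT":
        for alt in "ACGT":
            mutation = f"{ref}>{alt}"
            if ref == alt or mutation in seen:
                continue
            partner = reverse_complement_mutation(mutation)
            seen.update({mutation, partner})
            pairs.append((mutation, partner))
    return pairs


def bin_counts(positions: list, bin_size: int = BIN_SIZE) -> Counter:
    """Count SNVs per window, keyed by the window's 1-based start position."""
    return Counter((pos - 1) // bin_size * bin_size + 1 for pos in positions)


print(mutation_pairs())
print(bin_counts([1977, 4013, 4050]))
```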

Regression Plot

Each reference pathogen genome has its own regression plots. It displays a dot plot of mutated (or alternate) reads vs the total reads per position. If it finds a suitable linear function, then it is also plotted. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/CHROMOSOME.regression.png for each chromosome/segment in each reference genome.
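
The report does not state how the linear function is fitted or what makes it "suitable". As one possible illustration, the sketch below uses ordinary least squares on (total reads, alternate reads) pairs and returns None when no line can be fitted. The function name and the example values are hypothetical.

```python
from typing import Optional


def fit_line(total_reads: list, alt_reads: list) -> Optional[tuple]:
    """Least-squares fit of alt_reads = slope * total_reads + intercept."""
    n = len(total_reads)
    if n < 2 or n != len(alt_reads):
        return None
    mean_x = sum(total_reads) / n
    mean_y = sum(alt_reads) / n
    sxx = sum((x - mean_x) ** 2 for x in total_reads)
    if sxx == 0:
        return None  # all x values identical, so the slope is undefined
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(total_reads, alt_reads)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented example: alternate reads are 10% of total reads at every position.
print(fit_line([10, 20, 30], [1, 2, 3]))
```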

Presence Per Run Position Plot

Each reference pathogen genome has its own presence per run position plots as a circos graph and a heatmap.

The circos graph is a circular graph that displays a different position in each arc, up to 150 positions, while each concentric strip at each radius level displays a different run, creating cells. The circos graph supports up to 20 samples. If a cell is colored, the corresponding position was mutated in that sample. If the graph contains multiple genes, the gene ranges are displayed in the innermost circular strip. Because there is a limit on the number of positions that each circos graph can display, there may be multiple graphs for each sample or reference pathogen. Genes that have fewer than 150 mutated positions are grouped with other genes that also have fewer than that number of mutated positions. If a gene has more than 150 mutated positions, it is split across different files. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/circos/CHROMOSOME_GROUP.png for each group in each chromosome/segment per reference genome, or 6-visualization/REFERENCE_NAME/graphs/circos/GENE_GROUP.png for each group in each gene per reference genome.

The heatmap is a tabular graph where each column represents a sample, and each row represents a position in the reference chromosome/segment. Each heatmap can have up to 300 positions. If a cell is colored, the corresponding position was mutated in that sample. If the graph contains multiple genes, the gene ranges are displayed to the right of the heatmap. Because there is a limit on the number of positions that each heatmap can display, there may be multiple graphs for each reference pathogen. Genes that have fewer than 300 mutated positions are grouped with other genes that also have fewer than that number of mutated positions. If a gene has more than 300 mutated positions, it is split across different files. These graphs were stored at 6-visualization/REFERENCE_NAME/graphs/heatmap/CHROMOSOME_GROUP.png for each group in each chromosome/segment per reference genome, or 6-visualization/REFERENCE_NAME/graphs/heatmap/GENE_GROUP.png for each group in each gene per reference genome.
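
The data behind the presence heatmap can be thought of as a position-by-sample matrix of booleans, split into pages of at most 300 positions. The sketch below shows only that idea and ignores the gene-based grouping described above. The function name, sample names and positions are hypothetical.

```python
MAX_POSITIONS = 300  # heatmap limit stated in the report


def presence_pages(calls, samples: list, page_size: int = MAX_POSITIONS) -> list:
    """Build one {position: [mutated in sample?]} mapping per page.

    calls is an iterable of (sample, position) pairs for mutated positions.
    """
    mutated = {}
    for sample, position in calls:
        mutated.setdefault(position, set()).add(sample)
    positions = sorted(mutated)
    pages = []
    for start in range(0, len(positions), page_size):
        page = {
            pos: [sample in mutated[pos] for sample in samples]
            for pos in positions[start:start + page_size]
        }
        pages.append(page)
    return pages


# Invented calls for two hypothetical samples.
calls = [("S1", 10), ("S2", 10), ("S1", 20)]
print(presence_pages(calls, ["S1", "S2"]))
```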

Frequency Per Mutation Position Plot

Each sample has its own frequency per mutation position plots as a circos graph and a heatmap.

The circos graph is a circular graph that displays a different position in each arc, up to 150 positions, while each concentric strip at each radius level displays a different mutation, creating cells. If a cell is colored, the corresponding position had that type of mutation, and the intensity of the color represents the frequency of that position-mutation pair. If the graph contains multiple genes, the gene ranges are displayed in the innermost circular strip. Because there is a limit on the number of positions that each circos graph can display, there may be multiple graphs for each sample. Genes that have fewer than 150 mutated positions are grouped with other genes that also have fewer than that number of mutated positions. If a gene has more than 150 mutated positions, it is split across different files. These graphs were stored at 6-visualization/REFERENCE_NAME/SAMPLE_NAME/circos/CHROMOSOME_GROUP.png for each group in each chromosome/segment per sample, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/circos/GENE_GROUP.png for each group in each gene per sample.

The heatmap is a tabular graph where each column represents a mutation, and each row represents a position in the reference chromosome/segment. Each heatmap can have up to 300 positions. If a cell is colored, the corresponding position had that type of mutation, and the intensity of the color represents the frequency of that position-mutation pair. If the graph contains multiple genes, the gene ranges are displayed to the right of the heatmap. Because there is a limit on the number of positions that each heatmap can display, there may be multiple graphs for each sample. Genes that have fewer than 300 mutated positions are grouped with other genes that also have fewer than that number of mutated positions. If a gene has more than 300 mutated positions, it is split across different files. These graphs were stored at 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/heatmap/CHROMOSOME_GROUP.png for each group in each chromosome/segment per sample, or 6-visualization/REFERENCE_NAME/SAMPLE_NAME/graphs/heatmap/GENE_GROUP.png for each group in each gene per sample.