alignment

This module handles every alignment step: running the aligners, extracting the unaligned sequences, and sorting the resulting alignments.

alignment.extractUnaligned(sras)

Extracts the unaligned sequences from the alignment against the host and discards the runs where all sequences aligned.

It filters the SAM files in the 2-alignment/host/sam directory with samtools view using the filter 256, and writes the results as BAM files to the 2-alignment/host/bam directory. Then, using samtools bam2fq, it converts the BAM files into FASTQ files containing the final unaligned sequences.

For paired-end runs, the resulting FASTQ contains both the 3’ and 5’ reads in a single file, so it is split into two files.
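
The splitting step can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name split_interleaved_fastq is hypothetical, and it assumes the FASTQ is strictly interleaved (first mate, second mate, and so on) with four lines per record.

```python
from itertools import islice


def split_interleaved_fastq(interleaved, out_first, out_second):
    """Split an interleaved FASTQ into two files, one per mate.

    Assumption: records alternate first mate, second mate, and every
    record spans exactly four lines. Returns the number of read pairs.
    """
    with open(interleaved) as src, \
            open(out_first, "w") as first, \
            open(out_second, "w") as second:
        targets = (first, second)
        index = 0
        while True:
            # Read the next four-line FASTQ record.
            record = list(islice(src, 4))
            if not record:
                break
            if len(record) != 4:
                raise ValueError("truncated FASTQ record")
            # Even-numbered records go to the first file, odd to the second.
            targets[index % 2].writelines(record)
            index += 1
    if index % 2:
        raise ValueError("odd number of records; mates are not paired")
    return index // 2
```

If the input is not strictly interleaved, the reads would have to be grouped by name first; that case is not covered here.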

If STAR was used, the unaligned FASTQ files it generated are appended to the split FASTQ files.

The resulting FASTQ files are saved in the 2-alignment/host/fastq directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID

Returns:
list: List of filtered tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID
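
To make the shape of sras concrete, here is a small sketch that builds such a list and checks each entry against the description above. The helper check_run, the Run alias, and the file names are hypothetical and only serve as illustration; the element order (paths, run type, run ID) follows the order in which the fields are listed.

```python
from typing import List, Tuple

# One entry of sras: (paths, run type, run ID).
Run = Tuple[List[str], str, str]


def check_run(run: Run) -> Run:
    """Validate one entry of sras against the documented shape."""
    paths, run_type, run_id = run
    expected = {"single": 1, "paired": 2}
    if run_type not in expected:
        raise ValueError(f"{run_id}: unknown run type {run_type!r}")
    if len(paths) != expected[run_type]:
        raise ValueError(
            f"{run_id}: {run_type} run needs {expected[run_type]} file(s), "
            f"got {len(paths)}"
        )
    return run


# Hypothetical file names and run IDs.
sras = [
    (["SRR0000001.fastq"], "single", "SRR0000001"),
    (["SRR0000002_1.fastq", "SRR0000002_2.fastq"], "paired", "SRR0000002"),
]
checked = [check_run(run) for run in sras]
```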

alignment.runBWA(sras, host)

Runs BWA aligner.

It creates the index for each reference genome, then aligns each run against each genome. Each run-genome pair produces a SAM file, which is saved in the 2-alignment/<host/pathogen>/sam directory. It accepts the arguments listed in the bwa.config file in the config directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID

host (bool): Switch that tells if the alignment must be done against the host or the viral genomes.
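
The "each run against each genome" pattern shared by the aligner functions can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: it assumes that host=True selects the host genome and host=False the viral genomes, and the helpers sam_dir and plan_alignments, as well as the output file naming, are hypothetical.

```python
from itertools import product
from pathlib import Path


def sam_dir(host: bool) -> Path:
    # Assumption: host=True means the host genome, host=False the viral ones.
    return Path("2-alignment") / ("host" if host else "pathogen") / "sam"


def plan_alignments(sras, genomes, host):
    """Return one (run ID, genome, output SAM path) entry per run-genome pair."""
    out_dir = sam_dir(host)
    plan = []
    for (paths, run_type, run_id), genome in product(sras, genomes):
        # Hypothetical naming scheme; the real file names are not documented here.
        sam_path = out_dir / f"{run_id}_{Path(genome).stem}.sam"
        plan.append((run_id, genome, sam_path))
    return plan
```

With two runs and two genomes, plan_alignments returns four entries, one SAM path per pair.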

alignment.runGMAP(sras, host)

Runs GMAP aligner.

It creates the database for each reference genome, then aligns each run against each genome. Each run-genome pair produces a SAM file, which is saved in the 2-alignment/<host/pathogen>/sam directory. It accepts the arguments listed in the gmap.config file in the config directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID

host (bool): Switch that tells if the alignment must be done against the host or the viral genomes.

alignment.runHisat2(sras, host)

Runs Hisat2 aligner.

It creates the database for each reference genome, then aligns each run against each genome. Each run-genome pair produces a SAM file, which is saved in the 2-alignment/<host/pathogen>/sam directory. It accepts the arguments listed in the hisat2.config file in the config directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID

host (bool): Switch that tells if the alignment must be done against the host or the viral genomes.

alignment.runMagicBlast(sras, host)

Runs Magic-BLAST aligner.

It creates the database for each reference genome, then aligns each run against each genome. Each run-genome pair produces a SAM file, which is saved in the 2-alignment/<host/pathogen>/sam directory. It accepts the arguments listed in the magicblast.config file in the config directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID

host (bool): Switch that tells if the alignment must be done against the host or the viral genomes.

alignment.runMinimap2(sras, host)

Runs Minimap2 aligner.

It creates the database for each reference genome, then aligns each run against each genome. Each run-genome pair produces a SAM file, which is saved in the 2-alignment/<host/pathogen>/sam directory. It accepts the arguments listed in the minimap2.config file in the config directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID

host (bool): Switch that tells if the alignment must be done against the host or the viral genomes.

alignment.runSTAR(sras, host)

Runs STAR aligner.

It creates the database for each reference genome, then aligns each run against each genome. Each run-genome pair produces two outputs: a SAM file with the sequences that may have a relevant alignment, saved in the 2-alignment/<host/pathogen>/sam directory, and one or two FASTQ files (depending on the run type) with the unaligned sequences, saved in the 2-alignment/<host/pathogen>/star directory with .mate1 and .mate2 extensions. It accepts the arguments listed in the star.config file in the config directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID

host (bool): Switch that tells if the alignment must be done against the host or the viral genomes.

alignment.sortAlignments(sras)

Sorts the SAM files of each alignment against the viral genomes.

It sorts these SAM files and converts them into BAM files with samtools sort, then uses samtools calmd to add the extra tags (MD and NM) needed by later steps. The resulting files are saved in the 2-alignment/pathogen/bam directory.

Args:
sras (list): List of tuples with the following data:
  • A list of the paths for the input run (one file if single-end, two if paired-end)

  • Run type. “single” if single-end, “paired” if paired-end

  • Run ID
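
The sort-then-calmd sequence described for sortAlignments can be sketched as a pair of samtools command lines. This is only a sketch: the helper sort_commands, the intermediate ".sorted.bam" name, and the reference FASTA argument are assumptions, and the function builds the commands without executing anything.

```python
from pathlib import Path


def sort_commands(sam_path, reference_fasta, bam_dir="2-alignment/pathogen/bam"):
    """Build the two samtools invocations for one SAM file (nothing is run)."""
    sam_path = Path(sam_path)
    # Hypothetical intermediate and final file names.
    sorted_bam = Path(bam_dir) / f"{sam_path.stem}.sorted.bam"
    final_bam = Path(bam_dir) / f"{sam_path.stem}.bam"
    # samtools sort writes the sorted alignments to the file given with -o.
    sort_cmd = ["samtools", "sort", "-o", sorted_bam.as_posix(), sam_path.as_posix()]
    # samtools calmd writes to standard output; -b requests BAM output.
    calmd_cmd = ["samtools", "calmd", "-b", sorted_bam.as_posix(),
                 Path(reference_fasta).as_posix()]
    return sort_cmd, calmd_cmd, final_bam
```

To run them, sort_cmd could be passed to subprocess.run as is, while calmd_cmd would need its standard output redirected to final_bam.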