SPAN Peak Analyzer

+----------------------------------+
|SPAN Semi-supervised Peak Analyzer|
+----------------------------|/----+
           ,        ,
      __.-'|'-.__.-'|'-.__
    ='=====|========|====='=
    ~_^~-^~~_~^-^~-~~^_~^~^~^


SPAN is a tool for analyzing ChIP-seq and ATAC-seq data, including ultra-low-input and single-cell data.

Features

  • Part of an integrated peak calling solution
  • Works with both conventional and ultra-low-input ChIP-seq data
  • Works with both narrow and wide modifications
  • Works with both single-end and paired-end libraries
  • Fragment size prediction for single-end libraries
  • Capable of processing tracks with different signal-to-noise ratios
  • Supports an optional control track
  • Supports replicates at the model level
  • False Discovery Rate correction
  • Experimental: differential peak calling

Installation

SPAN Peak Analyzer (build 0.11.0.4882), released on May 17, 2019

Download              Description
span-0.11.0.4882.jar  Multi-platform JAR package

Requirements:

  1. 4 GB RAM minimum
  2. Download and install Java 8.
  3. Download the <build>.chrom.sizes chromosome sizes file for the organism you want to analyze from the UCSC website.
    Here is the file used in our study.

Galaxy

SPAN is available as a tool in the official ToolShed for Galaxy. You can ask your Galaxy administrator to install it.

Usage of SPAN

java -Xmx4G -jar span-0.11.0.4882.jar [-h] [--version] analyze

Use the Java -Xmx option to set the maximum heap size available to SPAN. The examples use 4 gigabytes.

Example of regular peak calling:
java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -p Results.peak

Example of supervised peak calling:
java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -l Labels.bed -p Results.peak

Example of model fitting:
java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes
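
When SPAN is launched from a script, the command lines shown in the examples can be assembled programmatically. The following is a minimal Python sketch and not part of SPAN: the helper name build_span_command and its parameters are hypothetical, and it uses only the options that appear in the examples.

```python
from typing import List, Optional


def build_span_command(
    treatment: List[str],
    chrom_sizes: str,
    control: Optional[List[str]] = None,
    peaks: Optional[str] = None,
    labels: Optional[str] = None,
    jar: str = "span.jar",
    memory: str = "4G",
) -> List[str]:
    """Assemble an argument list for `span.jar analyze` (hypothetical helper)."""
    cmd = ["java", f"-Xmx{memory}", "-jar", jar, "analyze"]
    # Replicates are passed as a single comma-separated value.
    cmd += ["-t", ",".join(treatment)]
    if control:
        cmd += ["-c", ",".join(control)]
    cmd += ["--cs", chrom_sizes]
    if labels:
        cmd += ["-l", labels]
    if peaks:
        # Without -p only the model fitting step is performed.
        cmd += ["-p", peaks]
    return cmd


if __name__ == "__main__":
    cmd = build_span_command(
        ["ChIP.bam"], "Chrom.sizes", control=["Control.bam"], peaks="Results.peak"
    )
    print(" ".join(cmd))
    # To actually run SPAN: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.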

Peak calling

To analyze a single (possibly replicated) biological condition, use the analyze command.

-b, --bin BIN_SIZE
Peak analysis is performed on read coverage tiled into consecutive bins of configurable size. The default value is 200 bp, approximately the length of one nucleosome.
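
To illustrate what tiling coverage into bins means, here is a minimal Python sketch. It is not SPAN's implementation: the function name bin_coverage is hypothetical, and it assumes each read is represented by a single 0-based position on a chromosome.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bin_coverage(
    reads: Iterable[Tuple[str, int]], bin_size: int = 200
) -> Dict[Tuple[str, int], int]:
    """Count reads per (chromosome, bin index).

    Bin i covers positions [i * bin_size, (i + 1) * bin_size).
    """
    counts: Counter = Counter()
    for chrom, pos in reads:
        counts[(chrom, pos // bin_size)] += 1
    return dict(counts)


if __name__ == "__main__":
    reads = [("chr1", 0), ("chr1", 199), ("chr1", 200), ("chr2", 450)]
    print(bin_coverage(reads))
```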

-t, --treatment TREATMENT
Required. ChIP-seq treatment file. Supported formats: BAM, BED, or BED.gz. Multiple files should be separated by commas, e.g. -t A,B,C, and are processed as replicates at the model level.

-c, --control CONTROL
Control file. Multiple files should be separated by commas. Provide either a single control file or a separate control file for each treatment file.
Follow the instructions for -t, --treatment TREATMENT.
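
The pairing rule for treatment and control files can be checked before launching a long run. The sketch below is a hypothetical pre-flight check in Python, not SPAN code; the function name pair_controls is an assumption, and it only encodes the rule stated here (no control, one shared control, or one control per treatment file).

```python
from typing import List, Optional, Tuple


def split_files(arg: str) -> List[str]:
    """Split a comma-separated option value into file names."""
    return [part.strip() for part in arg.split(",") if part.strip()]


def pair_controls(
    treatment_arg: str, control_arg: Optional[str] = None
) -> List[Tuple[str, Optional[str]]]:
    """Pair every treatment file with its control file (or None)."""
    treatments = split_files(treatment_arg)
    if not treatments:
        raise ValueError("at least one treatment file is required")
    if control_arg is None:
        return [(t, None) for t in treatments]
    controls = split_files(control_arg)
    if len(controls) == 1:
        # A single control is shared by all replicates.
        return [(t, controls[0]) for t in treatments]
    if len(controls) == len(treatments):
        return list(zip(treatments, controls))
    raise ValueError(
        f"expected 1 or {len(treatments)} control files, got {len(controls)}"
    )


if __name__ == "__main__":
    print(pair_controls("A.bam,B.bam", "Input.bam"))
```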

-cs, --chrom.sizes CHROMOSOMES_SIZES
Required. Chromosome sizes file for genome build used in TREATMENT and CONTROL files.
Can be downloaded at http://hgdownload.cse.ucsc.edu/goldenPath//...

--fragment FRAGMENT
Fragment size. If provided, reads are shifted appropriately. If not provided, the shift is estimated from the data.
The --fragment 0 argument is required when processing ATAC-seq data.
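
This document does not spell out how reads are shifted. The Python sketch below assumes a common convention: the 5' end of each read is moved by half the fragment size towards the centre of the fragment, in the direction given by the strand. The function name shift_read is hypothetical, and SPAN's actual shifting may differ.

```python
def shift_read(five_prime: int, strand: str, fragment: int) -> int:
    """Return the position of a read after shifting by half the fragment size.

    Assumption: reads on the "+" strand move right and reads on the "-"
    strand move left, so both ends point at the fragment centre.
    """
    half = fragment // 2
    if strand == "+":
        return five_prime + half
    if strand == "-":
        # Clamp at 0 so a shifted read never leaves the chromosome start.
        return max(0, five_prime - half)
    raise ValueError(f"unknown strand: {strand!r}")


if __name__ == "__main__":
    print(shift_read(1000, "+", 150), shift_read(1000, "-", 150))
    # With a fragment size of 0 the read position is left unchanged.
    print(shift_read(1000, "+", 0))
```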

-k, --keep-dup
Keep duplicates. By default, SPAN filters out redundant reads aligned at the same genomic position.
The --keep-dup argument is required when processing single-cell ATAC-seq data.
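
As an illustration of duplicate filtering, the following Python sketch keeps the first read seen at each position. It is an assumption that "the same genomic position" means the same chromosome, strand, and 5' position; the function name filter_duplicates is hypothetical.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str, int]  # (chromosome, strand, 5' position)


def filter_duplicates(reads: Iterable[Read], keep_dup: bool = False) -> List[Read]:
    """Drop reads that repeat an already seen (chromosome, strand, position)."""
    if keep_dup:
        return list(reads)
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


if __name__ == "__main__":
    reads = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "-", 100)]
    print(filter_duplicates(reads))
    print(filter_duplicates(reads, keep_dup=True))
```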

-m, --model MODEL
Path to the SPAN model file. If not provided, the model name is formed from the input names and other arguments.

-p, --peaks PEAKS
Resulting peaks file in ENCODE broadPeak (BED 6+3) format. If omitted, only the model fitting step is performed.

-f, --fdr FDR
FDR cutoff used to call significant regions. The default value is 1.0E-6.
SPAN reports p- and q-values for the null hypothesis that a given bin is not enriched with a histone modification. Peaks are formed from the bins that are enriched (in the FDR sense) for the analyzed biological condition: bins are selected by thresholding the q-value at the FDR cutoff, and spatially close peaks are merged into broad ones according to the GAP option. Thresholding q-values in this way is equivalent to controlling the FDR.
q-values are calculated from p-values using the Benjamini-Hochberg procedure.
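
The Benjamini-Hochberg procedure itself is standard and can be sketched in a few lines of Python. This is an illustration rather than SPAN's code: the function names are hypothetical, and whether the cutoff comparison is strict or inclusive is an assumption.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg q-values in the same order as the p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # never decrease as p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def enriched_bins(pvalues: Sequence[float], fdr: float = 1e-6) -> List[int]:
    """Indices of bins whose q-value passes the FDR cutoff (assumed inclusive)."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q <= fdr]


if __name__ == "__main__":
    pvalues = [1e-9, 0.5, 1e-8]  # made-up per-bin p-values
    print(benjamini_hochberg(pvalues))
    print(enriched_bins(pvalues))
```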

-g, --gap GAP
Gap size used to merge spatially close peaks. Useful for wide histone modifications. The default value is 5, i.e. peaks separated by a distance of 5 * BIN or less are merged.
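
Gap-based merging can be sketched as follows in Python. This is a minimal illustration, assuming enriched bins are given as bin indices and that the gap is counted as the number of non-enriched bins between two neighbours; the function names merge_bins and bins_to_bp are hypothetical.

```python
from typing import Iterable, List, Tuple


def merge_bins(enriched: Iterable[int], gap: int = 5) -> List[Tuple[int, int]]:
    """Merge enriched bin indices into half-open [start_bin, end_bin) peaks.

    Two neighbours are merged when at most `gap` bins lie between them.
    """
    peaks: List[List[int]] = []
    for b in sorted(enriched):
        if peaks and b - peaks[-1][1] <= gap:
            peaks[-1][1] = b + 1  # extend the current peak
        else:
            peaks.append([b, b + 1])  # start a new peak
    return [(start, end) for start, end in peaks]


def bins_to_bp(peaks: List[Tuple[int, int]], bin_size: int = 200) -> List[Tuple[int, int]]:
    """Convert bin-index intervals into base-pair coordinates."""
    return [(start * bin_size, end * bin_size) for start, end in peaks]


if __name__ == "__main__":
    peaks = merge_bins([10, 11, 16, 23, 40])
    print(peaks)
    print(bins_to_bp(peaks))
```

With the default gap of 5, bins 11 and 16 are joined because only four bins separate them, while bin 23 starts a new peak because six bins separate it from bin 16.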

--labels LABELS
Labels BED file. Used in semi-supervised peak calling.

-d, --debug
Print all the debug information, used for troubleshooting.

-q, --quiet
Turn off output.

-w, --workdir PATH
Path to the working directory (stores coverage and model caches).

--threads THREADS
Configures parallelism level.
SPAN utilizes both multithreading and specialized processor extensions such as SSE2 and AVX. Parallel computations are performed using viktor, an open-source library for parallel matrix computations in the Kotlin programming language.

Supervised peak calling

When the LABELS parameter is given, the labels are used to optimize the peak caller parameters so that the called peaks agree with the markup.

Model fitting

SPAN workflow consists of several steps:

  1. Convert raw reads to tags using the user-supplied FRAGMENT parameter or the maximum cross-correlation estimate.
  2. Compute coverage for the whole genome tiled into bins of BIN base pairs.
  3. Fit a 3-state hidden Markov model that assigns each bin to one of three states: ZERO (no coverage), LOW (non-specific binding), or HIGH (specific binding).
  4. Compute the posterior HIGH state probability of each bin.
  5. Save the trained model in the .span binary format.
  6. Compute peaks using the trained model and the FDR and GAP parameters.
  7. If LABELS are provided, compute optimal parameters that conform with them.
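
Steps 3 and 4 can be illustrated with a small forward-backward computation. The Python sketch below is not SPAN's model: the parameters are fixed by hand instead of being fitted, and Poisson emissions for the LOW and HIGH states are an assumption made for simplicity (this document does not describe SPAN's emission distributions). All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence

ZERO, LOW, HIGH = 0, 1, 2


def emission(state: int, count: int, rates: Sequence[float]) -> float:
    """Probability of observing `count` reads in a bin for a given state."""
    if state == ZERO:
        # The ZERO state emits only empty bins.
        return 1.0 if count == 0 else 0.0
    lam = rates[state]
    # Poisson pmf computed in log space to avoid overflow for large counts.
    return math.exp(-lam + count * math.log(lam) - math.lgamma(count + 1))


def posteriors(
    coverage: Sequence[int],
    start: Sequence[float],
    trans: Sequence[Sequence[float]],
    rates: Sequence[float],
) -> List[List[float]]:
    """Scaled forward-backward pass; returns P(state | all bins) for every bin."""
    n, k = len(coverage), 3
    emit = [[emission(s, c, rates) for s in range(k)] for c in coverage]

    # Forward pass, normalising each step so the values do not underflow.
    alpha = [[0.0] * k for _ in range(n)]
    scale = [0.0] * n
    for t in range(n):
        for j in range(k):
            if t == 0:
                prior = start[j]
            else:
                prior = sum(alpha[t - 1][i] * trans[i][j] for i in range(k))
            alpha[t][j] = prior * emit[t][j]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for i in range(k):
            total = sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(k))
            beta[t][i] = total / scale[t + 1]

    return [[alpha[t][s] * beta[t][s] for s in range(k)] for t in range(n)]


if __name__ == "__main__":
    coverage = [0, 0, 1, 2, 30, 40, 35, 1, 0]  # made-up per-bin read counts
    start = [1 / 3, 1 / 3, 1 / 3]
    trans = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
    rates = [0.0, 1.0, 30.0]  # the ZERO entry is unused
    for count, post in zip(coverage, posteriors(coverage, start, trans, rates)):
        print(count, round(post[HIGH], 3))
```

In a full workflow the start, transition, and emission parameters would be estimated from the data (for example with the Baum-Welch algorithm) before the posteriors are computed.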

Model fitting mode produces a trained model file in binary format as output, which can be:

  1. visualized directly in JBR Genome Browser
  2. used in the integrated peak calling pipeline

Output files

  • If the PEAKS file is given, it will contain predicted and FDR-controlled peaks in the ENCODE broadPeak format, i.e. BED 6+3:
    <chromosome> <peak start> <peak end> <peak name> <score> . <coverage / foldchange> <-log p-value> <-log Q-value>
    The same format is used by the MACS2 peak caller.
    • chromosome name
    • start position of peak
    • end position of peak
    • peak name
    • score of the peak, computed as log10(qvalue) * log(peak length). Useful for ranking peaks of wide histone modifications.
    • . (represents strand)
    • summary read coverage in the peak, averaged over replicates; fold change in differential mode.
    • -log10(pvalue) for the null hypothesis that the given peak is in the ZERO or LOW state.
    • -log10(qvalue), calculated from p-values using the Benjamini-Hochberg procedure. For a merged peak, the median value is reported.
  • In model fitting mode, SPAN produces a model file in binary format.
    NOTE: once a model has been trained, it is reused automatically in other modes.
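
The nine columns of the peaks file can be read with a few lines of Python. This is a minimal sketch for downstream scripting, assuming whitespace-separated columns in the order listed above; the class and function names are hypothetical and the sample line is made up.

```python
from dataclasses import dataclass


@dataclass
class BroadPeak:
    chrom: str
    start: int
    end: int
    name: str
    score: float
    strand: str
    signal: float       # coverage, or fold change in differential mode
    neg_log10_p: float
    neg_log10_q: float

    @property
    def length(self) -> int:
        return self.end - self.start

    @property
    def qvalue(self) -> float:
        # The file stores -log10(qvalue); invert it to get the q-value.
        return 10.0 ** (-self.neg_log10_q)


def parse_broadpeak_line(line: str) -> BroadPeak:
    """Parse one line of a BED 6+3 peaks file."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return BroadPeak(
        f[0], int(f[1]), int(f[2]), f[3], float(f[4]), f[5],
        float(f[6]), float(f[7]), float(f[8]),
    )


if __name__ == "__main__":
    peak = parse_broadpeak_line("chr1\t1000\t2200\tpeak_1\t87\t.\t35.5\t12.3\t8.0")
    print(peak.chrom, peak.length, peak.qvalue)
```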

Study Cases

As a benchmark, we applied the SPAN peak calling approach to public conventional ChIP-seq datasets as well as to a ULI ChIP-seq dataset.

CD14+ classical monocyte tracks available in the ENCODE database were a natural choice for a conventional ChIP-seq dataset.
We also used the data from Hocking et al. to evaluate SPAN.

Chen C et al. presented an ultra-low-input micrococcal nuclease-based native ChIP (ULI-NChIP) and sequencing method to generate genome-wide histone mark profiles with high resolution and reproducibility from as few as one thousand cells. We used these tracks to evaluate the semi-supervised approach under extreme conditions.

SPAN produced high-quality peak calls in all of these cases; see the report.
This suggests that SPAN Peak Analyzer can be used as a general purpose peak calling solution.

Error reporting

Report any errors or comments in the public SPAN issue tracker.

FAQ

Q: What is average running time?
A: SPAN is capable of processing a single ChIP-seq track in less than 1 hour on a moderate laptop (MacBook Pro 2015).

Q: Which operating systems are supported?
A: SPAN is developed in the modern Kotlin programming language and can be executed on any platform supported by Java.

Q: Is differential peak calling supported?
A: This is an experimental feature. For details, see: java -Xmx4G -jar span.jar compare -h

Q: Where is SPAN source code?
A: The source code is available on GitHub.

Q: Where did you get this lovely span picture?
A: From ascii.co.uk, it seems the original author goes by the name jgs.

Modified May 17, 2019