SPAN Peak Analyzer
+----------------------------------+ |SPAN Semi-supervised Peak Analyzer| +----------------------------|/----+ , , __.-'|'-.__.-'|'-.__ ='=====|========|====='= ~_^~-^~~_~^-^~-~~^_~^~^~^
SPAN is a semi-supervised multipurpose peak caller capable of processing a broad range of ChIP-seq, ATAC-seq, and single-cell ATAC-seq datasets that robustly handles multiple replicates and noise by leveraging limited manual annotation information.
- Usage of SPAN
- Output Files
- Study Cases
- Error reporting
- Part of an integrated peak calling solution
- Works with both conventional and ultra-low-input ChIP-seq data
- Works with both narrow and wide modifications
- Works with both single-end and paired-end libraries
- Fragment size prediction for single-end libraries
- Capable of processing tracks with different signal-to-noise ratios
- Supports optional control track
- Supports replicates on the model level
- False Discovery Rate correction
- Experimental: differential peak calling
- New Multistart during model fitting
|span-0.13.5244.jar||Multi-platform JAR package|
- 4 GB RAM minimum.
- Download and install Java 8.
- Download the
<build>.chrom.sizeschromosome sizes of the organism you want to analyze from the UCSC website.
Here is the file used in our study.
java -Xmx4G -jar span-0.13.5244.jar [-h] [--version] analyze
-Xmx memory settings to configure memory usage. 4 gigabytes are used in the examples.
Example of regular peak calling:
java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -p Results.peak
Example of supervised peak calling:
java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -l Labels.bed -p Results.peak
Example of model fitting:
java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes
To analyze a single (possibly replicated) biological condition use the
-b, --bin BIN_SIZE
Peak analysis is performed on read coverage tiled into consequent bins of configurable size. Default value is 200bp, approximately the length of one nucleosome.
-t, --treatment TREATMENT
Required. ChIP-seq treatment file. Supported formats: BAM, BED, BED.gz or bigWig file. If multiple files are provided, they are treated as replicates. Multiple files should be separated by commas:
-t A,B,C . Multiple files are processed as replicates on the model level.
-c, --control CONTROL
Control file. Multiple files should be separated by commas. A single control file or a separate file per each treatment file is required.
Follow the instructions for
-t, --treatment TREATMENT.
-cs, --chrom.sizes CHROMOSOMES_SIZES
Required. Chromosome sizes file for the genome build used in
Can be downloaded at
Fragment size. If provided, reads are shifted appropriately. If not provided, the shift is estimated from the data.
--fragment 0 argument is necessary for ATAC-Seq data processing.
Keep duplicates. By default SPAN filters out redundant reads aligned at the same genomic position.
--keep-dup argument is necessary for single cell ATAC-Seq data processing.
-m, --model MODEL
This option is used to specify SPAN model path, if not provided, model name is composed of input names and other arguments.
-p, --peaks PEAKS
Resulting peaks file in ENCODE broadPeak* (BED 6+3) format. If omitted, only the model fitting step is performed.
-f, --fdr FDR
Minimum FDR cutoff to call significant regions, default value is 1.0E-6.
SPAN reports p- and q- values for the null hypothesis that a given bin is not enriched with a histone modification. Peaks are formed from a list of truly (in the FDR sense) enriched bins for the analyzed biological condition by thresholding the Q-value with a cutoff
FDR and merging spatially close peaks using
GAPoption to broad ones. This is equivalent to controlling FDR.
q-values are are calculated from p-values using the Benjamini-Hochberg procedure.
-g, --gap GAP
Gap size to merge spatially close peaks. Useful for wide histone modifications. Default value is 5, i.e. peaks separated by 5*
BIN distance or less are merged.
Labels BED file. Used in semi-supervised peak calling.
Print all the debug information, used for troubleshooting.
Turn off output.
-w, --workdir PATH
Path to the working directory (stores coverage and model caches).
Configures the parallelism level.
SPAN utilizes both multithreading and specialized processor extensions like SSE2, AVX, etc. Parallel computations were performed using an open-source library viktor for parallel matrix computations in Kotlin programming language.
Number of multistart runs using different model initializations. Use 0 to disable (default: 5)
Number of iterations for each multistart run (default: 5)
Convergence threshold for EM algorithm, use --debug option to see detailed info (default: 0.1)
LABELS parameter is given, it is used to optimize peak caller parameters for markup.
SPAN workflow consists of several steps:
- Convert raw reads to tags with user-supplied
FRAGMENTparameter or maximum cross-correlation estimate.
- Compute coverage for the whole genome tiled into bins of
- Fit 3-state hidden Markov model that classifies bins as
ZEROstates with no coverage,
LOWstates of non-specific binding, and
HIGHstates of specific binding.
- Compute posterior
HIGHstate probability for each bin.
- Trained model is saved into
- Peaks are computed using trained model and
LABELSare provided, optimal parameters are computed to conform with them.
Model fitting mode produces a trained model file in binary format as output, which can be:
OUTPUTfile is given, it will contain predicted and FDR-controlled peaks in the ENCODE broadPeak format, i.e. BED 6+3:
<chromosome> <peak start> <peak end> <peak name> <score> . <coverage / foldchange> <-log p-value> <-log Q-value>
Same format is used by MACS2 peak caller.
- chromosome name
- start position of the peak
- end position of the peak
- peak name
- score of the peak, computed as log10(qvalue) * log(peak length). Useful for peak ranking with wide histone modifications.
- . (represents the strand)
- total peak coverage averaged over replicates or fold-change of the coverage for differential analysis.
- -log10(pvalue) of the null-hypothesis that a given peak is in
- -log10(qvalue), calculated from p-values using the Benjamini-Hochberg procedure. Median value for a merged peak.
- In the case of
SPANmodel fitting, it produces the model file in binary format.
NOTE: after the model is trained once, it will be reused automatically in other modes.
As a benchmark we applied the SPAN peak calling approach to public conventional ChIP-seq datasets as well as to a ULI ChIP-seq dataset.
CD14+ classical monocytes tracks available from the ENCODE database were a natural choice for a conventional ChIP-seq dataset.
We also used the data from Hocking et al. to evaluate
Chen C et al. presented an ultra-low-input micrococcal nuclease-based native ChIP (ULI-NChIP) and sequencing method to generate genome-wide histone mark profiles with high resolution and reproducibility from as few as one thousand cells. We used these tracks to test the semi-supervised approach in extreme conditions.
SPAN produced high quality peak calling in all of these cases, see report.
This suggests that SPAN Peak Analyzer can be used as a general purpose peak calling solution.
Report any errors or comments in the public SPAN issue tracker.
Q: What is the average running time?
SPAN is capable of processing a single ChIP-Seq track in less than 1 hour on an average laptop (MacBook Pro 2015).
Q: Which operating systems are supported?
SPAN is developed in modern Kotlin programming language and can be executed on any platform supported by
Q: Is differential peak calling supported?
A: This is an experimental feature, for details see:
java -Xmx4G -jar span.jar compare -h
Q: Where is
SPAN source code?
A: Source code is available at GitHub
Q: Where did you get this lovely span picture?
A: From ascii.co.uk, it seems the original author goes by the name
Modified Wed Aug 12 17:19:12 2020 UTC