BiSpark is a highly parallelized aligner for bisulfite-treated reads that uses a distributed environment to significantly improve alignment performance and scalability. It is built on the Apache Spark distributed framework and scales efficiently.


The analysis workflow within BiSpark consists of four processing phases: (1) distributing the reads as key-value pairs, (2) transforming the reads into ‘three-letter’ reads and mapping them to the transformed reference genome, (3) aggregating the mapping results and filtering ambiguous reads, and (4) profiling the methylation information for each read. The figure depicts the case in which the input library is non-directional.
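As a rough illustration of phases (1) and (2), the sketch below shows how a read could be turned into key-value pairs holding its 'three-letter' forms. It is not BiSpark's actual code: the function names, the `C2T`/`G2A` labels, and the tuple layout are assumptions made for this example, and it assumes that a non-directional library requires both conversions of every read.

```python
def c_to_t(seq: str) -> str:
    """Replace every C with T, giving a three-letter (A, G, T) read."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """Replace every G with A, giving a three-letter (A, C, T) read."""
    return seq.upper().replace("G", "A")


def three_letter_variants(read_id: str, seq: str):
    """Return key-value pairs keyed by read ID (hypothetical layout).

    Keying by read ID lets the mapping results of both conversions
    be grouped per read in a later aggregation step.
    """
    return [
        (read_id, ("C2T", c_to_t(seq))),
        (read_id, ("G2A", g_to_a(seq))),
    ]


if __name__ == "__main__":
    for key, value in three_letter_variants("read_1", "ACGTCG"):
        print(key, value)
```

In a Spark job, a function like `three_letter_variants` would typically be applied to each read with a `flatMap`-style operation so that the converted reads are spread across the cluster.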



BiSpark is implemented on the Apache Spark framework and the HDFS file system, and it uses Bowtie2 for alignment. All three must therefore be installed before running BiSpark, and Bowtie2 must be callable on all slave nodes.
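A quick way to confirm that the required programs are callable on a node is to look them up on the `PATH`. The sketch below is a convenience check, not part of BiSpark; the helper name `missing_tools` and the default list of commands (`bowtie2`, `spark-submit`, `hdfs`) are assumptions about a typical installation. It only inspects the machine it runs on, so it would need to be run on each node.

```python
import shutil


def missing_tools(tools=("bowtie2", "spark-submit", "hdfs")):
    """Return the commands from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not callable on this node:", ", ".join(missing))
    else:
        print("All required commands are callable on this node.")
```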



Recommended (optional) pre-processing for quality control

To improve mappability and alignment accuracy, sanitizing poor-quality reads before the main BiSpark phase is highly recommended. The following is the recommended procedure for read quality control.


