GBSA

Documentation

Introduction
System Requirements
Installation (GUI)
Preparation of the Sequencing Data
Analysis (GUI)
Command Line

1. Introduction

Genome Bisulfite Sequencing Analyser (GBSA) is free, open-source software for analyzing whole-genome bisulfite sequencing data. In essence, GBSA allows an investigator to explore not only known loci but all genomic regions, where methylation studies could lead to the discovery of new regulatory mechanisms.

2. System Requirements

GBSA can be run on a standard computer with at least 4 GB of RAM and free hard disk space of at least 3 times the size of the input file(s) for swapping. The GBSA GUI runs under any 32-bit or 64-bit version of Microsoft Windows Vista and above. The script release is a cross-platform application that runs on any operating system that has a Python 2.7 (and above) interpreter.
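
The disk space rule above can be checked before starting a run. The following is only a minimal sketch and is not part of GBSA: the function name has_enough_disk_space is hypothetical, and it assumes a Python 3.3+ interpreter because it relies on shutil.disk_usage.

```python
import os
import shutil


def has_enough_disk_space(input_paths, work_dir, factor=3):
    """Return True if work_dir has at least `factor` times the total input size free.

    `factor=3` mirrors the "3 times the size of input file(s)" requirement.
    """
    total_input = sum(os.path.getsize(path) for path in input_paths)
    free_bytes = shutil.disk_usage(work_dir).free
    return free_bytes >= factor * total_input
```

For example, has_enough_disk_space(["reads.txt"], ".") tests the current folder against a single alignment file.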

3. Installation (GUI)

After downloading the installation file, double-click “GBSA_v1_Setup.exe” and follow these steps to install GBSA version 1.0.

1 ) When the above window appears… Click on “Next”

2 ) IMPORTANT – The user MUST click on the “I agree with the GNU GPL terms and conditions” check box to continue on the Software License screen, then click “Next”. The Installer is not allowed to do this without user authorization.

3 ) Choose an installation folder and click “Next” to continue.

4 ) Click “Yes” to create the installation folder.

5 ) Click “Start” to install GBSA version 1.0.

6 ) Please wait until installation is finished.

7 ) Click “Next”.

8 ) Click “Exit” to finish the installation.

4. Preparation of the Sequencing Data

GBSA processes aligned BS-sequencing reads. As input, the program accepts BSP files from BSmap or RRBSmap, as well as output from BS Seeker.

5. Analysis (GUI)

5.1. Analysis Project and Parameters

For first-time use, the user has to download and install a genome annotation database, available on our website.

Installing the Genome Annotation

1 ) Click on the File menu.

2 ) Click on “Install Genome Annotation”, choose the genome annotation database file and wait…

GBSA must be restarted after the genome annotation database has been installed.

4 ) Click on Browse in the Input file(s) panel.

5 ) Select the input file, then click Open. To select multiple files, select the first file, then hold down the Ctrl key while selecting each of the other files.

6 ) Set analysis parameters and then 7 ) Setting analysis parameters

8) Set your analysis parameters and 9) click on “Start analysis using selected method(s)”.

The status of the analysis is shown in the lower-left corner.

5.2. QC’s Screenshots

1) Report from reads alignment.

2) Reports from gene-centric approach

3) Reports from gene independent approach

5.3. Results

1) Results from Gene-centric approach

2) Results from Gene independent approach

5.4. Data Visualisation

Via UCSC (using bedgraph outputs):

Via GBSA:

Localisation of intergenic methylated domains in the neighborhood of a gene can be listed when users click on this button:

6. Command line

The GBSA script release is a cross-platform application that runs on any operating system that has a Python 2.7 (and above) interpreter. The program produces the same result files as the GUI version, except for the graphical reports. Before running it, download the GBSA script and install all required libraries: pylab, numpy and interval.

A detailed installation tutorial is available here (thanks to the NGS surfer’s Wiki community).

You can use GBSA by typing this command:

>python GBSA_<version>.py <parameters>

using the following analysis parameters:

Usage:

 New analysis:
 python GBSA_2.0.py -i <input file> -f <format> -q <protocol> ... [settings]
 Re-analysis:
 python GBSA_2.0.py -I <Preprocessed analysis file> [settings]

Options:
 --version show program's version number and exit
 -h, --help show this help message and exit

New analysis general settings:
 -i FILE, --input=FILE Reads Alignment file (from BS Seeker or BSmap/RRBSmap)
 -f STR, --format=STR Input File format, possible values: BSseeker, BSmap [default: BSmap]
 -q STR, --protocol=STR Bisulfite sequencing protocol (only for BSseeker), possible values: lister, cokus [default: lister]
 -r STR, --rm_dupliclate=STR  Remove duplicated reads Y/N [default: N]
 -n STR, --methylation_event=STR Methylation events, possible values: x, xy, xz, xyz for CpG only, CpG and CHG, CpG and CHH, CpG, CHG and CHH respectively [default: x]
 -d INT, --depth=INT The minimum read coverage for a given genomic region. If the minimum is not reached, the region will be ignored. [default: 3]
 -t STR, --tempdir=STR Folder path to store temporary files. [default: default]
 -S STR, --save_preproc=STR Save the GBSA preprocessed file for re-analysis [Y/N]. [default: Y] 

Re-analysis (from a GBSA file):
 -I FILE, --INPUT=FILE GBSA Pre-processed file 

General settings:
 -o STR, --output=STR The project's path+name (e.g. /home/marcel/bsanalysis); it will be used to name all output files. Spaces and special characters are prohibited. [default: BS_analysis]
 -a STR, --annotation=STR Genome annotation file in BSA format. Available on our website.
 -b STR, --genome_seq=STR Genome sequence file in BSQ format. Available on our website.
 -m STR, --method=STR Choose your analysis method: Method 1 (Gene based) calculates methylation levels relative to the gene annotation; Method 2 (Domain) detects methylated domains before annotation; 3 runs both methods. [default: 3]
 -v STR, --verbose=STR GBSA provides comments on the operations as they occur, as well as a progress bar. This function may not be compatible with some terminals. [0|1]. [default: 1]

Gene centric approach settings:
 -c INT, --minc=INT The minimum number of CG dinucleotides that a region (promoter, gene body) should have to compute a score. If the minimum is not reached, the region will be ignored. This variable is used by the Gene based method. [default: 3]

Gene independent approach settings:
 -p INT, --prom_size=INT Expected promoter length in nucleotides. This variable is used by the Domain method to annotate domains against the genome annotation. [default: 2000]
 -s INT, --domain_score=INT Average methylation level [0,1] of sequenced CG within the domain. [default: 0.5]
 -w INT, --window_size=INT Sliding window size used by method 2 to find methylated domains. [default: 100]
 -g INT, --min_CG_domain=INT Minimum number of sequenced CG that a domain should contain. [default: 5]
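
Several options above accept only a fixed set of values. As an illustration only (it is not part of GBSA, and the names ALLOWED and check_settings are hypothetical), a wrapper script could check a settings dictionary against the values listed in the help text before launching an analysis:

```python
# Allowed values, copied from the option descriptions in the help text.
ALLOWED = {
    "format": {"BSseeker", "BSmap"},
    "protocol": {"lister", "cokus"},
    "methylation_event": {"x", "xy", "xz", "xyz"},
    "method": {"1", "2", "3"},
}


def check_settings(settings):
    """Return a list of problems found in a dict of option name -> string value."""
    problems = []
    for name, allowed in ALLOWED.items():
        value = settings.get(name)
        if value is not None and value not in allowed:
            problems.append("%s: %r is not one of %s" % (name, value, sorted(allowed)))
    # --domain_score is documented as a level in the interval [0, 1].
    score = settings.get("domain_score")
    if score is not None and not 0.0 <= float(score) <= 1.0:
        problems.append("domain_score: %r is outside [0, 1]" % score)
    return problems
```

An empty list means that no problem was found among the checked options; options not listed in ALLOWED are not checked.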

Note: the BSA and BSQ files used by the -a (--annotation) and -b (--genome_seq) options are available on our website.

◊ Users must unzip the downloaded files before use. ◊

Example:

>python GBSA_1-01.py -i '/home/touati/Downloads/1K_SW403_R038_hg19.txt' -a '/home/touati/programs/GBSA/annotation_files/hg19.refFlat.bsa' -b '/home/touati/program/GBSA/annotation_files/hg19.bsq'  -o '/home/touati/Desktop/test/SW403' -m 3 -d 3 -f BSseeker -s 0.5 -w 100 -g 5
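
When many samples have to be analysed, the command can be assembled from a script. The sketch below is an assumption-laden illustration, not a GBSA feature: build_gbsa_command is a hypothetical helper, it covers only a few of the options documented above, and it assumes that the interpreter is reachable as "python".

```python
def build_gbsa_command(script, input_file, annotation, genome_seq, output,
                       fmt="BSmap", method=3, depth=3):
    """Build the argument list for one new GBSA analysis.

    Only flags documented in the help text are used: -i, -f, -a, -b, -o, -m, -d.
    """
    return [
        "python", script,
        "-i", input_file,
        "-f", fmt,
        "-a", annotation,
        "-b", genome_seq,
        "-o", output,
        "-m", str(method),
        "-d", str(depth),
    ]
```

The returned list can be passed to subprocess.call from the standard library. Passing the arguments as a list avoids quoting problems with paths.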

StdOut:

Output result files:

log file
<name>.bed is a standard bedgraph file recording the β scores of all CpGs (or CHG/CHH).

Output sample

track type=bedGraph name=CHR19test_results_CpG description="CpG Methylation level" visibility=full autoScale=off viewLimits=-1:1 color=0,61,245 maxHeightPixels=11:50:128

chr1	10468	10470	1
chr1	10470	10472	1
chr1	16242	16244	0.33
chr1	51934	51936	1
chr1	55298	55300	0
chr1	55328	55330	1
chr1	56297	56299	1
chr1	58446	58448	0.5
chr1	62155	62157	1
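
The bedgraph output can also be read outside a genome browser. A minimal sketch, assuming the four-column layout shown in the sample above (chromosome, start, end, score); read_bedgraph is a hypothetical helper and not part of GBSA:

```python
def read_bedgraph(lines):
    """Parse bedgraph lines into (chrom, start, end, value) tuples.

    Track and browser definition lines, comments and blank lines are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        records.append((chrom, int(start), int(end), float(value)))
    return records
```

The function accepts any iterable of lines, for example an open file object.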

<name>cov.bed is a standard bedgraph file recording the depth of sequencing coverage of all CpGs.

Output sample

track type=bedGraph name=CHR19test_coverage description="Depth of coverage" visibility=full yLineOnOff=on yLineMark="0.0"

chr1	10468	10470	5
chr1	10470	10472	4
chr1	16242	16244	3
chr1	51934	51936	3
chr1	55298	55300	3
chr1	55328	55330	5
chr1	56297	56299	4
chr1	58446	58448	6
chr1	62155	62157	3
chr1	62419	62421	3
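
The score and coverage tracks can be combined, for instance to keep only the CpGs that reach a stricter depth than the one used during the analysis. This is a sketch under the assumption that both files report the same coordinates, as in the samples above; load_track and well_covered_scores are hypothetical names.

```python
def load_track(lines):
    """Map (chrom, start, end) -> value for every data line of a bedgraph track."""
    track = {}
    for line in lines:
        fields = line.split()
        # Data lines have exactly four columns; skip track lines and anything else.
        if len(fields) != 4 or fields[0] == "track":
            continue
        chrom, start, end, value = fields
        track[(chrom, int(start), int(end))] = float(value)
    return track


def well_covered_scores(score_lines, coverage_lines, min_depth):
    """Keep the scores of positions whose coverage is at least min_depth."""
    scores = load_track(score_lines)
    depth = load_track(coverage_lines)
    return {pos: score for pos, score in scores.items()
            if depth.get(pos, 0) >= min_depth}
```

Positions that are missing from the coverage track are treated as having a depth of 0 and are dropped.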

<name>_gene_score.txt is a tab-delimited file in which the methylation levels of all genes are recorded.
The columns of the file are:

1-Refseq ID
2-Gene name
3-Promoter methylation score (-3kb)
4-Promoter methylation score (-2kb)
5-TSS region methylation score (± 1kb)
6-1stExon/Intron methylation score
7-Gene body methylation score
8-Percentage of CpG sequenced within the promoter (3kb)
9-Percentage of CpG sequenced within the promoter (2kb)
10-Percentage of CpG sequenced within the TSS Region (± 1kb)
11-Percentage of CpG sequenced within the 1stExon/Intron
12-Percentage of CpG sequenced within the gene body

Output sample

refseq_ID	Gene_name	Promoter3k_score	Promoter2k_score	TSS_region_score	1stExon_Intron_score	Gene_body_score	_%CG-seq_Promoter3k	_%CG-seq_Promoter2k	_%CG-seq_TSS_Region	_%CG-seq_1stExon_Intron	_%CG-seq_Gene_body
NR_028327.1	LOC100133331	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NR_033908	LOC100288069	NA	NA	NA	1	0.9583333333	NA	NA	NA	3.5294117647	3.7209302326
NR_024321	NCRNA00115	1	NA	NA	NA	NA	2.8846153846	NA	NA	NA	NA
NR_015368	LOC643837	0.0666666667	NA	NA	NA	0.9375960061	3.2258064516	NA	NA	NA	13.870246085
NR_027055	FAM41C	0.8146705147	0.8333333333	NA	NA	0.8609905073	39.1304347826	23.5294117647	NA	NA	10.7142857143
NR_026874	LOC100130417	0.92	0.9	1	NA	0.9333333333	4.7619047619	4.7619047619	7.0422535211	NA	9.2592592593
NM_152486	SAMD11	NA	NA	NA	NA	0.8107660455	NA	NA	NA	NA	2.4811218986
NM_015658	NOC2L	NA	NA	0.1666666667	NA	0.8730679157	NA	NA	3.4090909091	NA	13.436123348
NM_198317	KLHL17	0.48	0.1666666667	NA	NA	0.9523809524	4.4052863436	3.1578947368	NA	NA	1.8867924528
NM_001160184	PLEKHN1	0.9444444444	1	NA	NA	0.939047619	3.5928143713	4.2735042735	NA	NA	5.7636887608
NM_032129	PLEKHN1	0.9444444444	1	NA	NA	0.939047619	3.5928143713	4.2735042735	NA	NA	5.7636887608
NR_027693	C1orf170	0.7857142857	0.8805555556	0.8888888889	NA	0.9333333333	13.5714285714	12	6.3157894737	NA	1.3586956522
NM_021170	HES4	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NM_001142467	HES4	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NM_005101	ISG15	0.806957672	0.7613888889	0.525	NA	NA	18.9873417722	17.5438596491	3.8461538462	NA	NA
NM_198576	AGRN	NA	NA	NA	NA	0.9443502825	NA	NA	NA	NA	3.4767236299
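
The gene score file can be loaded with the Python standard library. A minimal sketch, assuming the header names shown in the sample above and that missing values are written as NA; read_gene_scores is a hypothetical helper:

```python
import csv


def read_gene_scores(handle):
    """Read a <name>_gene_score.txt file into a list of dictionaries.

    Identifier columns stay as strings, NA becomes None and every other
    value is converted to float.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        parsed = {}
        for key, value in row.items():
            if key in ("refseq_ID", "Gene_name"):
                parsed[key] = value
            elif value in (None, "", "NA"):
                parsed[key] = None
            else:
                parsed[key] = float(value)
        rows.append(parsed)
    return rows
```

The handle is expected to be a file opened in text mode.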

<name>_domain_scores.txt is a tab-delimited file in which all methylated domains are recorded.
The columns of the file are:

Domain ID
Domain chromosome
Domain start position
Domain end position
Domain methylation score
Number of CpG within the domain
Percentage of CpG sequenced within the domain
Distance from the nearest gene
Falls in an intergenic region? (0/1)
Falls in a promoter? (0/1)
Falls in a genic region? (0/1)
Falls in an intronic region? (0/1)
Falls in an exonic region? (0/1)
Nearest Refseq ID
Nearest gene name
Nearest gene chromosome
Nearest gene start position
Nearest gene end position
Nearest gene strand

Output sample

Document was last modified on Feb. 25 2012

Domain_ID	Domain_Chr	Domain_Start	Domain_End	Domain_score	Domain_CpG_number	Domain_%_CpG_sequenced	Distance_nearest_gene	intergenic	promoter	gene	intron	exon	Nearest_Refseq_ID	Nearest_Gene_Name	Nearest_Gene_Chr	Nearest_Gene_Start	Nearest_Gene_End	Nearest_Gene_Strand
198	chr1	2946593	2946672	1	5	100	8587	1	0	0	0	0	NM_080431	ACTRT2	chr1	2938045	2939467	+
199	chr1	2971344	2971616	0.8666666667	5	71.4285714286	12809	1	0	0	0	0	NR_024371	FLJ42875	chr1	2984289	2980635	-
200	chr1	2975614	2975825	0.8163265306	7	100	8570	1	0	0	0	0	NR_024371	FLJ42875	chr1	2984289	2980635	-
201	chr1	2995107	2995533	0.9004329004	11	84.6153846154	9579	0	0	1	1	0	NM_199454	PRDM16	chr1	2985741	3355185	+
202	chr1	2997075	2997372	0.9642857143	7	63.6363636364	11482	0	0	1	1	0	NM_199454	PRDM16	chr1	2985741	3355185	+
203	chr1	3001129	3001351	0.9761904762	6	66.6666666667	15499	0	0	1	1	0	NM_199454	PRDM16	chr1	2985741	3355185	+
204	chr1	3003299	3003478	0.891984127	10	111.111111111	17647	0	0	1	1	0	NM_199454	PRDM16	chr1	2985741	3355185	+
205	chr1	3017871	3018035	1	7	87.5	-26585	1	0	0	0	0	NR_036215	MIR4251	chr1	3044538	3044599	+
206	chr1	3028578	3028948	1	6	46.1538461538	-15775	1	0	0	0	0	NR_036215	MIR4251	chr1	3044538	3044599	+
207	chr1	3038817	3038947	1	5	100	-5656	1	0	0	0	0	NR_036215	MIR4251	chr1	3044538	3044599	+
208	chr1	3043633	3044081	0.9777777778	12	85.7142857143	-681	0	1	0	0	0	NR_036215	MIR4251	chr1	3044538	3044599	+
209	chr1	3044805	3045043	0.9333333333	5	62.5	386	1	0	0	0	0	NR_036215	MIR4251	chr1	3044538	3044599	+
198	chr1	2946593	2946672	1	5	100	8587	1	0	0	0	0	NM_080431	ACTRT2	chr1	2938045	2939467	+
199	chr1	2971344	2971616	0.8666666667	5	71.4285714286	12809	1	0	0	0	0	NR_024371	FLJ42875	chr1	2984289	2980635	-
200	chr1	2975614	2975825	0.8163265306	7	100	8570	1	0	0	0	0	NR_024371	FLJ42875	chr1	2984289	2980635	-
201	chr1	2995107	2995533	0.9004329004	11	84.6153846154	9579	0	0	1	1	0	NM_199454	PRDM16	chr1	2985741	3355185	+
162	chr1	2440017	2440214	1	10	100	17920	0	0	1	1	0	NM_018216	PANK4	chr1	2458035	2439974	-
163	chr1	2450109	2450206	0.9777777778	9	75	7878	0	0	1	1	0	NM_018216	PANK4	chr1	2458035	2439974	-
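
As an illustration of how the domain file can be used, the sketch below lists the methylated domains flagged as falling in a promoter. It assumes the header names shown in the sample above; promoter_domains and its min_score argument are hypothetical and not part of GBSA.

```python
import csv


def promoter_domains(handle, min_score=0.5):
    """Return (domain ID, nearest gene name, score) for domains flagged as promoter.

    Only domains with a methylation score of at least min_score are kept.
    """
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        score = float(row["Domain_score"])
        if row["promoter"] == "1" and score >= min_score:
            hits.append((row["Domain_ID"], row["Nearest_Gene_Name"], score))
    return hits
```

The same pattern works for the intergenic, gene, intron and exon flag columns.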

Genome-Wide Bisulfite Sequencing Analyser Software

GBSA – Genome Bisulfite Sequencing Analyser – 2011-2012
National University of Singapore | Cancer Science Institute of Singapore
