GBSA

Genome-Wide Bisulfite Sequencing Analyser Software


  1. Introduction
  2. System Requirements
  3. Installation (GUI)
  4. Preparation of the Sequencing Data
  5. Analysis (GUI)
    1. Analysis Project and Parameters
    2. Reports
    3. Results
    4. Data Visualisation
  6. Command Line

1. Introduction

Genome Bisulfite Sequencing Analyser (GBSA) is free, open-source software for analyzing whole-genome bisulfite sequencing data. GBSA lets an investigator explore not only known loci but all genomic regions in which methylation studies could lead to the discovery of new regulatory mechanisms.


2. System Requirements

GBSA runs on a standard computer with at least 4 GB of RAM and free hard disk space of at least 3 times the size of the input file(s) for swapping. The GBSA GUI runs under any 32-bit or 64-bit version of Microsoft Windows Vista and above. The script release is a cross-platform application that runs on any operating system with a Python 2.7 (and above) interpreter.
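The disk-space rule can be checked before starting a run. The following is a minimal sketch, assuming Python 3.3 or later (needed for shutil.disk_usage). The function names are illustrative and are not part of GBSA, and which disk GBSA actually swaps to is an assumption.

```python
import os
import shutil


def required_free_bytes(input_sizes, factor=3):
    """Return the free space needed: `factor` times the total input size."""
    return factor * sum(input_sizes)


def has_enough_space(input_paths, work_dir=".", factor=3):
    """Check whether the disk holding work_dir has room for swapping.

    The factor of 3 comes from the stated requirement. That GBSA swaps
    on the disk holding work_dir is an assumption made for this sketch.
    """
    sizes = [os.path.getsize(path) for path in input_paths]
    free = shutil.disk_usage(work_dir).free
    return free >= required_free_bytes(sizes, factor)


# Two hypothetical input files of 2 GB and 1 GB need 9 GB of free space.
print(required_free_bytes([2 * 10**9, 1 * 10**9]))
```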

3. Installation (GUI)

After downloading the installation file, double-click on the “GBSA_v1_Setup.exe” and follow these steps to install GBSA version 1.0.


1 ) When the installer window appears, click on “Next”.

 

2 ) IMPORTANT – The user MUST click on the “I agree with the GNU GPL terms and conditions” check box to continue on the Software License screen, then click “Next”. The Installer is not allowed to do this without user authorization.

3 ) Choose an installation folder and click “Next” to continue.

4 ) Click “Yes” to create the installation folder.

5 ) Click “Start” to install GBSA version 1.0.

6 ) Please wait until installation is finished.

7 ) Click “Next”.

8 ) Click “Exit” to finish the installation.

4. Preparation of the Sequencing Data

GBSA processes aligned BS-sequencing reads. As input, the program accepts BSP files from BSmap or RRBSmap, as well as output from BS Seeker.

5. Analysis (GUI)

5.1. Analysis Project and Parameters

Installing the Genome Annotation

1 ) Click on the File menu.
2 ) Click on Install Genome Annotation, choose the genome annotation database file, and wait for the installation to finish.

GBSA must be restarted after the genome annotation database has been installed.

4 ) Click on Browse in the Input file(s) panel.
5 ) Select the input file, then click Open. To select multiple files, click the first file, then hold down the Ctrl key while clicking each additional file.

6 ) to 8 ) Set your analysis parameters.
9 ) Click on “Start analysis using selected method(s)”.

The status of the analysis is shown in the lower-left corner of the window.

5.2. Reports (QC Screenshots)

1) Report from the read alignment

2) Reports from the gene-centric approach

3) Reports from the gene-independent approach

5.3. Results

1) Results from the gene-centric approach

2) Results from the gene-independent approach

5.4. Data Visualisation

 

Via UCSC (using bedgraph outputs):

Via GBSA:

The intergenic methylated domains in the neighborhood of a gene can be listed by clicking on this button:

6. Command Line

The GBSA script release is a cross-platform application that runs on any operating system with a Python 2.7 (and above) interpreter. The program produces the same result files as the GUI version, except for the graphical reports. Before running it, download the GBSA script and all required libraries: pylab, numpy and interval.

A detailed installation tutorial is available (thanks to the NGS surfer’s Wiki community).

You can then use GBSA by typing this command:

>python GBSA_<version>.py <parameters>

using the following analysis parameters:

Usage:

 New analysis:
 python GBSA_2.0.py -i <input file> -f <format> -q <protocol> ... [settings]
 Re-analysis:
 python GBSA_2.0.py -I <Preprocessed analysis file> [settings]
Options:
 --version show program's version number and exit
 -h, --help show this help message and exit
New analysis general settings:
 -i FILE, --input=FILE Reads Alignment file (from BS Seeker or BSmap/RRBSmap)
 -f STR, --format=STR Input File format, possible values: BSseeker, BSmap [default: BSmap]
 -q STR, --protocol=STR Bisulfite sequencing protocol (only for BSseeker), possible values: lister, cokus [default: lister]
 -r STR, --rm_dupliclate=STR  Remove duplicated reads Y/N [default: N]
 -n STR, --methylation_event=STR Methylation events, possible values: x, xy, xz, xyz for CpG only, CpG and CHG, CpG and CHH, CpG, CHG and CHH respectively [default: x]
 -d INT, --depth=INT The minimum read coverage for a given genomic region. If the minimum is not reached, the region will be ignored. [default: 3]
 -t STR, --tempdir=STR Folder path to store temporary files. [default: default]
 -S STR, --save_preproc=STR Save the GBSA preprocessed file for re-analysis [Y/N]. [default: Y] 

Re-analysis (from a GBSA file):
 -I FILE, --INPUT=FILE GBSA Pre-processed file 

General settings:
 -o STR, --output=STR The project's path+name (e.g. /home/marcel/bsanalysis); it will be used to name all output files. Spaces and special characters are prohibited. [default: BS_analysis]
 -a STR, --annotation=STR Genome annotation file in BSA format. Available on our website.
 -b STR, --genome_seq=STR Genome sequence file in BSQ format. Available on our website.
 -m STR, --method=STR Choose your analysis method: Method 1 (Gene based) calculates methylation levels relative to the gene annotation; Method 2 (Domain) detects methylated domains before the annotation; 3 runs both methods. [default: 3]
 -v STR, --verbose=STR GBSA provides comments on the operations as they occur, as well as a progress bar. This function may not be compatible with some terminals. [0|1] [default: 1]
Gene centric approach settings:
 -c INT, --minc=INT The minimum number of CG dinucleotides that a region (promoter, gene body) must contain for a score to be computed. If the minimum is not reached, the region will be ignored. This variable is used by the Gene based method. [default: 3]
Gene independent approach settings:
 -p INT, --prom_size=INT Expected promoter length in nucleotides. This variable is used by the Domain method to annotate domains against the genome annotation. [default: 2000]
 -s INT, --domain_score=INT Average methylation level [0,1] of sequenced CG within the domain. [default: 0.5]
 -w INT, --window_size=INT Sliding window size used by method 2 to find methylated domains. [default: 100]
 -g INT, --min_CG_domain=INT Minimum number of sequenced CG that a domain should contain. [default: 5]
Note: the BSA and BSQ files used by the -a (--annotation) and -b (--genome_seq) options are available on our website.

◊ Users must unzip the downloaded files before use. ◊

Example:

>python GBSA_1-01.py -i '/home/touati/Downloads/1K_SW403_R038_hg19.txt' -a '/home/touati/programs/GBSA/annotation_files/hg19.refFlat.bsa' -b '/home/touati/program/GBSA/annotation_files/hg19.bsq'  -o '/home/touati/Desktop/test/SW403' -m 3 -d 3 -f BSseeker -s 0.5 -w 100 -g 5
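When GBSA is driven from another Python script, the command can be assembled as an argument list rather than a single string, which avoids shell quoting problems with paths. The following is a minimal sketch using only options listed in the usage text above. The helper names, the file names in the example and the exact set of characters accepted for the project name are assumptions made for illustration.

```python
import re


def check_project_name(output):
    """Reject spaces and special characters, as the -o option requires.

    The allowed set (letters, digits, '_', '-', '.', ':' and path
    separators) is an assumption; GBSA may be stricter or looser.
    """
    if not re.match(r"^[A-Za-z0-9_./\\:-]+$", output):
        raise ValueError("invalid project path/name: %r" % output)
    return output


def build_command(script, input_file, annotation, genome_seq, output,
                  fmt="BSmap", method=3, depth=3, interpreter="python"):
    """Assemble the argument list for a new analysis."""
    return [
        interpreter, script,
        "-i", input_file,
        "-f", fmt,
        "-a", annotation,
        "-b", genome_seq,
        "-o", check_project_name(output),
        "-m", str(method),
        "-d", str(depth),
    ]


# Hypothetical file names, for illustration only.
cmd = build_command("GBSA_2.0.py", "reads.txt", "hg19.refFlat.bsa",
                    "hg19.bsq", "BS_analysis", fmt="BSseeker")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.check_call(cmd) to launch the run.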

StdOut:

 

Output result files:

 

  • log file
  • <name>.bed is a standard bedgraph file recording the β scores of all CpGs (or CHG/CHH).
Output sample

track type=bedGraph name=CHR19test_results_CpG description="CpG Methylation level" visibility=full autoScale=off viewLimits=-1:1 color=0,61,245 maxHeightPixels=11:50:128
chr1 10468 10470 1
chr1 10470 10472 1
chr1 16242 16244 0.33
chr1 51934 51936 1
chr1 55298 55300 0
chr1 55328 55330 1
chr1 56297 56299 1
chr1 58446 58448 0.5
chr1 62155 62157 1
  • <name>cov.bed is a standard bedgraph file recording the depth of sequencing coverage of all CpGs.
Output sample

track type=bedGraph name=CHR19test_coverage description="Depth of coverage" visibility=full yLineOnOff=on yLineMark="0.0"
chr1 10468 10470 5
chr1 10470 10472 4
chr1 16242 16244 3
chr1 51934 51936 3
chr1 55298 55300 3
chr1 55328 55330 5
chr1 56297 56299 4
chr1 58446 58448 6
chr1 62155 62157 3
chr1 62419 62421 3
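Because both bedgraph files index the same positions, the β scores and the coverage can be combined downstream. The following is a minimal sketch, assuming each data line carries four whitespace-separated fields (chromosome, start, end, value) as in the bedGraph format. The function names and the depth threshold are illustrative and are not part of GBSA.

```python
def read_bedgraph(lines):
    """Parse bedGraph lines into {(chrom, start, end): value}.

    Header lines ('track', 'browser', '#') and short lines are skipped.
    """
    values = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] in ("track", "browser"):
            continue
        if fields[0].startswith("#"):
            continue
        chrom, start, end, value = fields[:4]
        values[(chrom, int(start), int(end))] = float(value)
    return values


def well_covered(scores, coverage, min_depth):
    """Keep the scores of positions whose coverage is at least min_depth."""
    return {pos: score for pos, score in scores.items()
            if coverage.get(pos, 0) >= min_depth}


# Two positions taken from the samples above.
scores = read_bedgraph([
    "track type=bedGraph name=example",
    "chr1 10468 10470 1",
    "chr1 16242 16244 0.33",
])
coverage = read_bedgraph([
    "chr1 10468 10470 5",
    "chr1 16242 16244 3",
])
print(well_covered(scores, coverage, min_depth=4))
# {('chr1', 10468, 10470): 1.0}
```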
  • <name>_gene_score.txt is a tab-delimited file recording the methylation level of every gene.
    The columns of the file are:
  1. Refseq ID
  2. Gene name
  3. Promoter methylation score (-3kb)
  4. Promoter methylation score (-2kb)
  5. TSS region methylation score (± 1kb)
  6. 1stExon/Intron methylation score
  7. Gene body methylation score
  8. Percentage of CpG sequenced within the promoter (3kb)
  9. Percentage of CpG sequenced within the promoter (2kb)
  10. Percentage of CpG sequenced within the TSS region (± 1kb)
  11. Percentage of CpG sequenced within the 1stExon/Intron
  12. Percentage of CpG sequenced within the gene body
Output sample

refseq_ID Gene_name Promoter3k_score Promoter2k_score TSS_region_score 1stExon_Intron_score Gene_body_score _%CG-seq_Promoter3k _%CG-seq_Promoter2k _%CG-seq_TSS_Region _%CG-seq_1stExon_Intron _%CG-seq_Gene_body
NR_028327.1 LOC100133331 NA NA NA NA NA NA NA NA NA NA
NR_033908 LOC100288069 NA NA NA 1 0.9583333333 NA NA NA 3.5294117647 3.7209302326
NR_024321 NCRNA00115 1 NA NA NA NA 2.8846153846 NA NA NA NA
NR_015368 LOC643837 0.0666666667 NA NA NA 0.9375960061 3.2258064516 NA NA NA 13.870246085
NR_027055 FAM41C 0.8146705147 0.8333333333 NA NA 0.8609905073 39.1304347826 23.5294117647 NA NA 10.7142857143
NR_026874 LOC100130417 0.92 0.9 1 NA 0.9333333333 4.7619047619 4.7619047619 7.0422535211 NA 9.2592592593
NM_152486 SAMD11 NA NA NA NA 0.8107660455 NA NA NA NA 2.4811218986
NM_015658 NOC2L NA NA 0.1666666667 NA 0.8730679157 NA NA 3.4090909091 NA 13.436123348
NM_198317 KLHL17 0.48 0.1666666667 NA NA 0.9523809524 4.4052863436 3.1578947368 NA NA 1.8867924528
NM_001160184 PLEKHN1 0.9444444444 1 NA NA 0.939047619 3.5928143713 4.2735042735 NA NA 5.7636887608
NM_032129 PLEKHN1 0.9444444444 1 NA NA 0.939047619 3.5928143713 4.2735042735 NA NA 5.7636887608
NR_027693 C1orf170 0.7857142857 0.8805555556 0.8888888889 NA 0.9333333333 13.5714285714 12 6.3157894737 NA 1.3586956522
NM_021170 HES4 NA NA NA NA NA NA NA NA NA NA
NM_001142467 HES4 NA NA NA NA NA NA NA NA NA NA
NM_005101 ISG15 0.806957672 0.7613888889 0.525 NA NA 18.9873417722 17.5438596491 3.8461538462 NA NA
NM_198576 AGRN NA NA NA NA 0.9443502825 NA NA NA NA 3.4767236299
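The gene score file can be loaded with a few lines of Python. The following is a minimal sketch, assuming a tab-delimited file whose first line is the header shown above and in which missing values are written as NA. The function names and the 0.8 threshold are illustrative, and the sample is truncated to four columns for brevity.

```python
def parse_score(text):
    """Convert a score field to float, mapping 'NA' to None."""
    return None if text == "NA" else float(text)


def read_gene_scores(lines):
    """Yield one dict per gene from tab-delimited gene score lines."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        row = dict(zip(header, fields))
        # Every column after the two identifiers holds a number or NA.
        for name in header[2:]:
            row[name] = parse_score(row[name])
        yield row


# Rows taken from the sample above, truncated to four columns.
sample = [
    "refseq_ID\tGene_name\tPromoter3k_score\tPromoter2k_score",
    "NR_027055\tFAM41C\t0.8146705147\t0.8333333333",
    "NM_198317\tKLHL17\t0.48\t0.1666666667",
    "NM_021170\tHES4\tNA\tNA",
]
hyper = [row["Gene_name"] for row in read_gene_scores(sample)
         if row["Promoter2k_score"] is not None
         and row["Promoter2k_score"] >= 0.8]
print(hyper)
# ['FAM41C']
```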
  • <name>_domain_scores.txt is a tab-delimited file recording all methylated domains.
    The columns of the file are:
  1. Domain ID
  2. Domain chromosome
  3. Domain start position
  4. Domain end position
  5. Domain methylation score
  6. Number of CpG within the domain
  7. Percentage of CpG sequenced within the domain
  8. Distance from the nearest gene
  9. Falls in an intergenic region? (0/1)
  10. Falls in a promoter? (0/1)
  11. Falls in a genic region? (0/1)
  12. Falls in an intronic region? (0/1)
  13. Falls in an exonic region? (0/1)
  14. Nearest Refseq ID
  15. Nearest gene name
  16. Nearest gene chromosome
  17. Nearest gene start position
  18. Nearest gene end position
  19. Nearest gene strand
Output sample

Domain_ID Domain_Chr Domain_Start Domain_End Domain_score Domain_CpG_number Domain_%_CpG_sequenced Distance_nearest_gene intergenic promoter gene intron exon Nearest_Refseq_ID Nearest_Gene_Name Nearest_Gene_Chr Nearest_Gene_Start Nearest_Gene_End Nearest_Gene_Strand
198 chr1 2946593 2946672 1 5 100 8587 1 0 0 0 0 NM_080431 ACTRT2 chr1 2938045 2939467 +
199 chr1 2971344 2971616 0.8666666667 5 71.4285714286 12809 1 0 0 0 0 NR_024371 FLJ42875 chr1 2984289 2980635 -
200 chr1 2975614 2975825 0.8163265306 7 100 8570 1 0 0 0 0 NR_024371 FLJ42875 chr1 2984289 2980635 -
201 chr1 2995107 2995533 0.9004329004 11 84.6153846154 9579 0 0 1 1 0 NM_199454 PRDM16 chr1 2985741 3355185 +
202 chr1 2997075 2997372 0.9642857143 7 63.6363636364 11482 0 0 1 1 0 NM_199454 PRDM16 chr1 2985741 3355185 +
203 chr1 3001129 3001351 0.9761904762 6 66.6666666667 15499 0 0 1 1 0 NM_199454 PRDM16 chr1 2985741 3355185 +
204 chr1 3003299 3003478 0.891984127 10 111.111111111 17647 0 0 1 1 0 NM_199454 PRDM16 chr1 2985741 3355185 +
205 chr1 3017871 3018035 1 7 87.5 -26585 1 0 0 0 0 NR_036215 MIR4251 chr1 3044538 3044599 +
206 chr1 3028578 3028948 1 6 46.1538461538 -15775 1 0 0 0 0 NR_036215 MIR4251 chr1 3044538 3044599 +
207 chr1 3038817 3038947 1 5 100 -5656 1 0 0 0 0 NR_036215 MIR4251 chr1 3044538 3044599 +
208 chr1 3043633 3044081 0.9777777778 12 85.7142857143 -681 0 1 0 0 0 NR_036215 MIR4251 chr1 3044538 3044599 +
209 chr1 3044805 3045043 0.9333333333 5 62.5 386 1 0 0 0 0 NR_036215 MIR4251 chr1 3044538 3044599 +
162 chr1 2440017 2440214 1 10 100 17920 0 0 1 1 0 NM_018216 PANK4 chr1 2458035 2439974 -
163 chr1 2450109 2450206 0.9777777778 9 75 7878 0 0 1 1 0 NM_018216 PANK4 chr1 2458035 2439974 -
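The 0/1 location flags make it easy to select domains by genomic context. The following is a minimal sketch, assuming one domain per line with the 19 tab-delimited columns listed above. The function names and the score threshold are illustrative and are not part of GBSA.

```python
DOMAIN_COLUMNS = [
    "Domain_ID", "Domain_Chr", "Domain_Start", "Domain_End", "Domain_score",
    "Domain_CpG_number", "Domain_%_CpG_sequenced", "Distance_nearest_gene",
    "intergenic", "promoter", "gene", "intron", "exon",
    "Nearest_Refseq_ID", "Nearest_Gene_Name", "Nearest_Gene_Chr",
    "Nearest_Gene_Start", "Nearest_Gene_End", "Nearest_Gene_Strand",
]
FLAGS = ("intergenic", "promoter", "gene", "intron", "exon")


def read_domains(lines):
    """Yield one dict per domain, skipping the header and malformed lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "Domain_ID" or len(fields) != len(DOMAIN_COLUMNS):
            continue
        row = dict(zip(DOMAIN_COLUMNS, fields))
        row["Domain_score"] = float(row["Domain_score"])
        row["Distance_nearest_gene"] = int(row["Distance_nearest_gene"])
        for flag in FLAGS:
            row[flag] = row[flag] == "1"
        yield row


def promoter_domains(rows, min_score=0.5):
    """Return the domains flagged as promoter with at least min_score."""
    return [row for row in rows
            if row["promoter"] and row["Domain_score"] >= min_score]


# Domains 207 and 208 from the sample above, rebuilt as tab-delimited lines.
sample = [
    "\t".join("207 chr1 3038817 3038947 1 5 100 -5656 1 0 0 0 0 "
              "NR_036215 MIR4251 chr1 3044538 3044599 +".split()),
    "\t".join("208 chr1 3043633 3044081 0.9777777778 12 85.7142857143 -681 "
              "0 1 0 0 0 NR_036215 MIR4251 chr1 3044538 3044599 +".split()),
]
for row in promoter_domains(read_domains(sample)):
    print(row["Domain_ID"], row["Nearest_Gene_Name"])
# 208 MIR4251
```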

 


Document was last modified on Feb. 25 2012

GBSA – Genome Bisulfite Sequencing Analyser – 2011-2012
National University of Singapore | Cancer Science Institute of Singapore