- Introduction
- System Requirements
- Installation (GUI)
- Preparation of the Sequencing Data
- Analysis (GUI)
- Command Line
1. Introduction
Genome Bisulfite Sequencing Analyser (GBSA) is free, open-source software for analyzing whole-genome bisulfite sequencing data. In essence, GBSA allows an investigator to explore not only known loci but also all other genomic regions, for which methylation studies could lead to the discovery of new regulatory mechanisms.
2. System Requirements
GBSA can be run on a standard computer with at least 4 GB of RAM and free hard disk space of at least 3 times the size of the input file(s) for swapping. The GBSA GUI runs under any 32-bit or 64-bit version of Microsoft Windows Vista and above. The script release is a cross-platform application that runs on any operating system that has a Python 2.7 (and above) interpreter.
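The disk-space rule above can be checked before an analysis is launched. The following is a minimal sketch, assuming Python 3.3 or later (for `shutil.disk_usage`); the function name `has_enough_disk` and the `factor` parameter are illustrative and are not part of GBSA.

```python
import os
import shutil


def has_enough_disk(input_paths, work_dir, factor=3):
    """Return True if work_dir has at least `factor` times the total
    size of the input files available as free space.

    `factor=3` mirrors the "3 times the size of input file(s)" rule;
    the function itself is a hypothetical helper, not a GBSA API.
    """
    total_input = sum(os.path.getsize(path) for path in input_paths)
    free_bytes = shutil.disk_usage(work_dir).free
    return free_bytes >= factor * total_input
```

For example, `has_enough_disk(["reads.txt"], ".")` compares the size of a hypothetical `reads.txt` with the free space of the current folder.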
3. Installation (GUI)
After downloading the installation file, double-click on the “GBSA_v1_Setup.exe” and follow these steps to install GBSA version 1.0.
1 ) When the installer window appears, click on “Next”.
2 ) IMPORTANT – The user MUST click on the “I agree with the GNU GPL terms and conditions” check box to continue on the Software License screen, then click “Next”. The Installer is not allowed to do this without user authorization.
3 ) Choose an installation folder and click “Next” to continue.
4 ) Click “Yes” to create the installation folder.
5 ) Click “Start” to install GBSA version 1.0.
6 ) Please wait until installation is finished.
7 ) Click “Next”.
8 ) Click “Exit” to finish the installation.
4. Preparation of the Sequencing Data
GBSA processes aligned BS-sequencing reads. As input, the program accepts BSP files from BSmap or RRBSmap, as well as output from BS Seeker.
5. Analysis (GUI)
5.1. Analysis Project and Parameters
- For first use, the user has to download and install a genome annotation database, available on our website.
1 ) Click on the File menu.
2 ) Click on Install Genome Annotation, choose the genome annotation database file and wait…
4 ) Click on Browse in the Input file(s) panel.
5 ) Select the input file, then click Open. To select multiple files, select the first file, then hold down the Ctrl key while selecting each of the other files.
6 ) Set the analysis parameters.
7 ) Setting analysis parameters.
8 ) Set your analysis parameters.
9 ) Click on “Start analysis using selected method(s)”.
The status of the analysis can be viewed in the lower-left corner.
5.2. QC Screenshots
1) Report from read alignment
2) Reports from the gene-centric approach
3) Reports from the gene-independent approach
5.3. Results
1) Results from the gene-centric approach
2) Results from the gene-independent approach
5.4. Data Visualisation
Via UCSC (using bedgraph outputs):
Via GBSA:
The locations of intergenic methylated domains in the neighborhood of a gene are listed when the user clicks on this button:
6. Command line
The GBSA script release is a cross-platform application that runs on any operating system that has a Python 2.7 (and above) interpreter. The program produces the same result files as the GUI version, except for the graphical reports. Before running it, download the GBSA script and all required libraries: pylab, numpy and interval.
A detailed installation tutorial is available here (thanks to the NGS surfer’s Wiki community).
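Whether the three libraries are installed can be verified before the first run. The sketch below assumes Python 3.4 or later and assumes that the import names match the library names given above (`pylab`, `numpy`, `interval`); `missing_modules` is an illustrative helper, not part of GBSA.

```python
import importlib.util

# Import names are assumed to be identical to the library names listed above.
REQUIRED_MODULES = ("pylab", "numpy", "interval")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` returns an empty list when everything is installed, and otherwise the names that still need to be installed.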
You can use GBSA by typing this command:
>python GBSA_<version>.py <parameters>
using the following analysis parameters:
Usage:
New analysis: python GBSA_2.0.py -i <input file> -f <format> -q <protocol> ... [settings]
Re-analysis: python GBSA_2.0.py -I <Preprocessed analysis file> [settings]

Options:
  --version
      show program's version number and exit
  -h, --help
      show this help message and exit

New analysis general settings:
  -i FILE, --input=FILE
      Reads alignment file (from BS Seeker or BSmap/RRBSmap)
  -f STR, --format=STR
      Input file format, possible values: BSseeker, BSmap [default: BSmap]
  -q STR, --protocol=STR
      Bisulfite sequencing protocol (only for BSseeker), possible values: lister, cokus [default: lister]
  -r STR, --rm_dupliclate=STR
      Remove duplicated reads Y/N [default: N]
  -n STR, --methylation_event=STR
      Methylation events, possible values: x, xy, xz, xyz for CpG only, CpG and CHG, CpG and CHH, CpG, CHG and CHH respectively [default: x]
  -d INT, --depth=INT
      The minimum read coverage for a given genomic region. If the minimum is not reached, the region will be ignored. [default: 3]
  -t STR, --tempdir=STR
      Folder path to store temporary files. [default: default]
  -S STR, --save_preproc=STR
      Save the GBSA preprocessed file for re-analysis [Y/N]. [default: Y]

Re-analysis (from a GBSA file):
  -I FILE, --INPUT=FILE
      GBSA pre-processed file

General settings:
  -o STR, --output=STR
      The project's path+name (e.g. /home/marcel/bsanalysis); it will be used to name all output files. Spaces and special characters are prohibited. [default: BS_analysis]
  -a STR, --annotation=STR
      Genome annotation file in BSA format. Available on our website.
  -b STR, --genome_seq=STR
      Genome sequence file in BSQ format. Available on our website.
  -m STR, --method=STR
      Choose your analysis method: Method 1 (Gene based) calculates the methylation level relative to the gene annotation; Method 2 (Domain) detects methylated domains before the annotation; both methods (3). [default: 3]
  -v STR, --verbose=STR
      GBSA provides comments on the operations as they occur, as well as a progress bar. This function may not be compatible with some terminals. [0|1]. [default: 1]

Gene centric approach settings:
  -c INT, --minc=INT
      The minimum number of CG dinucleotides that a region (promoter, gene body) should have to compute a score. If the minimum is not reached, the region will be ignored. This variable is used by the Gene based method. [default: 3]

Gene independent approach settings:
  -p INT, --prom_size=INT
      Expected promoter length, in nucleotides. This variable is used by the Domain method to annotate domains against the genome annotation. [default: 2000]
  -s INT, --domain_score=INT
      Average methylation level [0,1] of sequenced CG within the domain. [default: 0.5]
  -w INT, --window_size=INT
      Sliding window size used by method 2 to find methylated domains. [default: 100]
  -g INT, --min_CG_domain=INT
      Minimum number of sequenced CG that a domain should contain. [default: 5]

Note: the BSA and BSQ files used by the -a (--annotation) and -b (--genome_seq) options are available on our website.
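To illustrate how the three gene independent settings (-w, -s and -g) can interact, here is a minimal sliding-window sketch. It is not GBSA's implementation: the way windows are anchored on each CG and the way overlapping windows are merged are assumptions made for this example, and `find_methylated_domains` is a hypothetical name. Only the default values (100, 0.5 and 5) are taken from the usage text.

```python
def find_methylated_domains(cpgs, window_size=100, domain_score=0.5, min_cg=5):
    """Return (start, end, mean_score, n_cg) tuples for methylated domains.

    cpgs: list of (position, methylation_level) tuples for one chromosome,
    sorted by position. A window starts at each CG and spans window_size
    bases; it qualifies when it holds at least min_cg CGs whose mean level
    is at least domain_score. Overlapping qualifying windows are merged
    (an assumption of this sketch).
    """
    merged = []  # [first_index, last_index] pairs into cpgs
    end = 0
    for start in range(len(cpgs)):
        end = max(end, start)
        # Extend the window over every CG closer than window_size to its first CG.
        while end + 1 < len(cpgs) and cpgs[end + 1][0] - cpgs[start][0] < window_size:
            end += 1
        window = cpgs[start:end + 1]
        if len(window) < min_cg:
            continue
        if sum(level for _, level in window) / len(window) < domain_score:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    domains = []
    for first, last in merged:
        levels = [level for _, level in cpgs[first:last + 1]]
        domains.append((cpgs[first][0], cpgs[last][0],
                        sum(levels) / len(levels), len(levels)))
    return domains
```

With five highly methylated CGs packed within 100 bases and two distant unmethylated CGs, the function reports a single domain covering the five CGs; raising `domain_score` above their mean level removes it.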
◊ The user has to unzip the downloaded files prior to use. ◊
Example:
>python GBSA_1-01.py -i '/home/touati/Downloads/1K_SW403_R038_hg19.txt' -a '/home/touati/programs/GBSA/annotation_files/hg19.refFlat.bsa' -b '/home/touati/program/GBSA/annotation_files/hg19.bsq' -o '/home/touati/Desktop/test/SW403' -m 3 -d 3 -f BSseeker -s 0.5 -w 100 -g 5
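When GBSA is driven from another script, the same command can be assembled as an argument list. The sketch below uses only options listed in the usage text above; `build_gbsa_command` and its parameter names are illustrative, and the whitespace check covers only the "no spaces" part of the rule on output names, since the set of prohibited special characters is not specified here.

```python
def build_gbsa_command(script, input_file, annotation, genome_seq, output,
                       fmt="BSmap", method=3, depth=3,
                       domain_score=0.5, window_size=100, min_cg_domain=5):
    """Assemble the argument list for a new GBSA analysis (hypothetical helper)."""
    if any(ch.isspace() for ch in output):
        raise ValueError("the output project name must not contain spaces")
    return [
        "python", script,
        "-i", input_file,
        "-a", annotation,
        "-b", genome_seq,
        "-o", output,
        "-m", str(method),
        "-d", str(depth),
        "-f", fmt,
        "-s", str(domain_score),
        "-w", str(window_size),
        "-g", str(min_cg_domain),
    ]
```

The returned list can be passed to `subprocess.call`, which avoids quoting problems with paths.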
StdOut:
Output Result Files:
- log file
- <name>.bed is a standard bedgraph file recording the β scores of all CpGs (or CHG/CHH).
track type=bedGraph name=CHR19test_results_CpG description="CpG Methylation level" visibility=full autoScale=off viewLimits=-1:1 color=0,61,245 maxHeightPixels=11:50:128
chr1 | 10468 | 10470 | 1 |
chr1 | 10470 | 10472 | 1 |
chr1 | 16242 | 16244 | 0.33 |
chr1 | 51934 | 51936 | 1 |
chr1 | 55298 | 55300 | 0 |
chr1 | 55328 | 55330 | 1 |
chr1 | 56297 | 56299 | 1 |
chr1 | 58446 | 58448 | 0.5 |
chr1 | 62155 | 62157 | 1 |
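A bedgraph file of this kind can be read with a few lines of Python. The sketch below assumes the standard whitespace-separated bedGraph layout (chromosome, start, end, value) with optional `track`, `browser` or `#` lines; `read_bedgraph` and `mean_value` are illustrative names.

```python
def read_bedgraph(lines):
    """Yield (chrom, start, end, value) tuples from bedGraph lines.

    Header lines (track, browser) and comments are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


def mean_value(records):
    """Return the mean of the value column, or None when there are no records."""
    values = [value for _, _, _, value in records]
    return sum(values) / len(values) if values else None
```

For instance, `mean_value(read_bedgraph(open(path)))` gives the average β score over all recorded positions of a file at a hypothetical `path`.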
- <name>cov.bed is a standard bedgraph file recording the sequencing depth (coverage) of all CpGs.
track type=bedGraph name=CHR19test_coverage description="Depth of coverage" visibility=full yLineOnOff=on yLineMark="0.0"
chr1 | 10468 | 10470 | 5 |
chr1 | 10470 | 10472 | 4 |
chr1 | 16242 | 16244 | 3 |
chr1 | 51934 | 51936 | 3 |
chr1 | 55298 | 55300 | 3 |
chr1 | 55328 | 55330 | 5 |
chr1 | 56297 | 56299 | 4 |
chr1 | 58446 | 58448 | 6 |
chr1 | 62155 | 62157 | 3 |
chr1 | 62419 | 62421 | 3 |
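The coverage file can be combined with the score file, for example to keep only positions sequenced more deeply than the threshold used during the analysis. The following sketch assumes both files have already been parsed into (chrom, start, end, value) tuples; `filter_by_depth` is an illustrative helper, not a GBSA function.

```python
def filter_by_depth(scores, coverage, min_depth=3):
    """Keep the score records whose interval has a coverage of at least min_depth.

    scores, coverage: iterables of (chrom, start, end, value) tuples.
    Intervals absent from the coverage records are treated as depth 0.
    """
    depth = {(chrom, start, end): value for chrom, start, end, value in coverage}
    return [record for record in scores if depth.get(record[:3], 0) >= min_depth]
```

Applied to the two excerpts shown above with `min_depth=4`, the positions starting at 10468, 10470, 55328, 56297 and 58446 are retained.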
- <name>_gene_score.txt is a tab-delimited file in which the methylation levels of all genes are recorded.
The columns of the file are:
1-RefSeq ID
2-Gene name
3-Promoter methylation score (-3kb)
4-Promoter methylation score (-2kb)
5-TSS region methylation score (± 1kb)
6-1stExon/Intron methylation score
7-Gene body methylation score
8-Percentage of CpG sequenced within the promoter (3kb)
9-Percentage of CpG sequenced within the promoter (2kb)
10-Percentage of CpG sequenced within the TSS Region (± 1kb)
11-Percentage of CpG sequenced within the 1stExon/Intron
12-Percentage of CpG sequenced within the gene body
refseq_ID | Gene_name | Promoter3k_score | Promoter2k_score | TSS_region_score | 1stExon_Intron_score | Gene_body_score | _%CG-seq_Promoter3k | _%CG-seq_Promoter2k | _%CG-seq_TSS_Region | _%CG-seq_1stExon_Intron | _%CG-seq_Gene_body |
NR_028327.1 | LOC100133331 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NR_033908 | LOC100288069 | NA | NA | NA | 1 | 0.9583333333 | NA | NA | NA | 3.5294117647 | 3.7209302326 |
NR_024321 | NCRNA00115 | 1 | NA | NA | NA | NA | 2.8846153846 | NA | NA | NA | NA |
NR_015368 | LOC643837 | 0.0666666667 | NA | NA | NA | 0.9375960061 | 3.2258064516 | NA | NA | NA | 13.870246085 |
NR_027055 | FAM41C | 0.8146705147 | 0.8333333333 | NA | NA | 0.8609905073 | 39.1304347826 | 23.5294117647 | NA | NA | 10.7142857143 |
NR_026874 | LOC100130417 | 0.92 | 0.9 | 1 | NA | 0.9333333333 | 4.7619047619 | 4.7619047619 | 7.0422535211 | NA | 9.2592592593 |
NM_152486 | SAMD11 | NA | NA | NA | NA | 0.8107660455 | NA | NA | NA | NA | 2.4811218986 |
NM_015658 | NOC2L | NA | NA | 0.1666666667 | NA | 0.8730679157 | NA | NA | 3.4090909091 | NA | 13.436123348 |
NM_198317 | KLHL17 | 0.48 | 0.1666666667 | NA | NA | 0.9523809524 | 4.4052863436 | 3.1578947368 | NA | NA | 1.8867924528 |
NM_001160184 | PLEKHN1 | 0.9444444444 | 1 | NA | NA | 0.939047619 | 3.5928143713 | 4.2735042735 | NA | NA | 5.7636887608 |
NM_032129 | PLEKHN1 | 0.9444444444 | 1 | NA | NA | 0.939047619 | 3.5928143713 | 4.2735042735 | NA | NA | 5.7636887608 |
NR_027693 | C1orf170 | 0.7857142857 | 0.8805555556 | 0.8888888889 | NA | 0.9333333333 | 13.5714285714 | 12 | 6.3157894737 | NA | 1.3586956522 |
NM_021170 | HES4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NM_001142467 | HES4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NM_005101 | ISG15 | 0.806957672 | 0.7613888889 | 0.525 | NA | NA | 18.9873417722 | 17.5438596491 | 3.8461538462 | NA | NA |
NM_198576 | AGRN | NA | NA | NA | NA | 0.9443502825 | NA | NA | NA | NA | 3.4767236299 |
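The gene score file can be loaded with Python's `csv` module. This sketch assumes that the first line of the file is the header row shown above and that missing values are written as `NA`; `read_gene_scores` and `genes_with_methylated_promoter` are illustrative names, and the 0.8 threshold is arbitrary.

```python
import csv

TEXT_COLUMNS = ("refseq_ID", "Gene_name")


def read_gene_scores(handle):
    """Parse a tab-delimited gene score file into a list of dicts.

    Numeric columns become floats and NA (or empty) values become None.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        parsed = {}
        for key, value in row.items():
            if key is None:  # ignore fields beyond the header columns
                continue
            if key in TEXT_COLUMNS:
                parsed[key] = value
            elif value in ("NA", "", None):
                parsed[key] = None
            else:
                parsed[key] = float(value)
        rows.append(parsed)
    return rows


def genes_with_methylated_promoter(rows, column="Promoter2k_score", threshold=0.8):
    """Return the gene names whose score in `column` is at least `threshold`."""
    return [row["Gene_name"] for row in rows
            if row[column] is not None and row[column] >= threshold]
```

The file should be opened with `open(path, newline="")` before it is handed to `read_gene_scores`, as recommended for the `csv` module.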
- <name>_domain_scores.txt is a tab-delimited file in which all methylated domains are recorded.
The columns of the file are:
- Domain ID
- Domain chromosome
- Domain start position
- Domain end position
- Domain methylation score
- Number of CpG within the domain
- Percentage of CpG sequenced within the domain
- Distance from the nearest gene
- Falls in an intergenic region? (0/1)
- Falls in a promoter? (0/1)
- Falls in a genic region? (0/1)
- Falls in an intronic region? (0/1)
- Falls in an exonic region? (0/1)
- Nearest Refseq ID
- Nearest gene name
- Nearest gene chromosome
- Nearest gene start position
- Nearest gene end position
- Nearest gene strand
Domain_ID | Domain_Chr | Domain_Start | Domain_End | Domain_score | Domain_CpG_number | Domain_%_CpG_sequenced | Distance_nearest_gene | intergenic | promoter | gene | intron | exon | Nearest_Refseq_ID | Nearest_Gene_Name | Nearest_Gene_Chr | Nearest_Gene_Start | Nearest_Gene_End | Nearest_Gene_Strand |
198 | chr1 | 2946593 | 2946672 | 1 | 5 | 100 | 8587 | 1 | 0 | 0 | 0 | 0 | NM_080431 | ACTRT2 | chr1 | 2938045 | 2939467 | + |
199 | chr1 | 2971344 | 2971616 | 0.8666666667 | 5 | 71.4285714286 | 12809 | 1 | 0 | 0 | 0 | 0 | NR_024371 | FLJ42875 | chr1 | 2984289 | 2980635 | - |
200 | chr1 | 2975614 | 2975825 | 0.8163265306 | 7 | 100 | 8570 | 1 | 0 | 0 | 0 | 0 | NR_024371 | FLJ42875 | chr1 | 2984289 | 2980635 | - |
201 | chr1 | 2995107 | 2995533 | 0.9004329004 | 11 | 84.6153846154 | 9579 | 0 | 0 | 1 | 1 | 0 | NM_199454 | PRDM16 | chr1 | 2985741 | 3355185 | + |
202 | chr1 | 2997075 | 2997372 | 0.9642857143 | 7 | 63.6363636364 | 11482 | 0 | 0 | 1 | 1 | 0 | NM_199454 | PRDM16 | chr1 | 2985741 | 3355185 | + |
203 | chr1 | 3001129 | 3001351 | 0.9761904762 | 6 | 66.6666666667 | 15499 | 0 | 0 | 1 | 1 | 0 | NM_199454 | PRDM16 | chr1 | 2985741 | 3355185 | + |
204 | chr1 | 3003299 | 3003478 | 0.891984127 | 10 | 111.111111111 | 17647 | 0 | 0 | 1 | 1 | 0 | NM_199454 | PRDM16 | chr1 | 2985741 | 3355185 | + |
205 | chr1 | 3017871 | 3018035 | 1 | 7 | 87.5 | -26585 | 1 | 0 | 0 | 0 | 0 | NR_036215 | MIR4251 | chr1 | 3044538 | 3044599 | + |
206 | chr1 | 3028578 | 3028948 | 1 | 6 | 46.1538461538 | -15775 | 1 | 0 | 0 | 0 | 0 | NR_036215 | MIR4251 | chr1 | 3044538 | 3044599 | + |
207 | chr1 | 3038817 | 3038947 | 1 | 5 | 100 | -5656 | 1 | 0 | 0 | 0 | 0 | NR_036215 | MIR4251 | chr1 | 3044538 | 3044599 | + |
208 | chr1 | 3043633 | 3044081 | 0.9777777778 | 12 | 85.7142857143 | -681 | 0 | 1 | 0 | 0 | 0 | NR_036215 | MIR4251 | chr1 | 3044538 | 3044599 | + |
209 | chr1 | 3044805 | 3045043 | 0.9333333333 | 5 | 62.5 | 386 | 1 | 0 | 0 | 0 | 0 | NR_036215 | MIR4251 | chr1 | 3044538 | 3044599 | + |
162 | chr1 | 2440017 | 2440214 | 1 | 10 | 100 | 17920 | 0 | 0 | 1 | 1 | 0 | NM_018216 | PANK4 | chr1 | 2458035 | 2439974 | - |
163 | chr1 | 2450109 | 2450206 | 0.9777777778 | 9 | 75 | 7878 | 0 | 0 | 1 | 1 | 0 | NM_018216 | PANK4 | chr1 | 2458035 | 2439974 | - |
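The domain score file lends itself to simple queries, such as listing the intergenic methylated domains in the neighborhood of a given gene. The sketch below assumes that the file starts with the header row shown above and that the distance column may be negative on one side of the gene, so its absolute value is compared; `read_domains` and `intergenic_domains_near` are illustrative names, and the 10 kb default is arbitrary.

```python
import csv


def read_domains(handle):
    """Read a tab-delimited domain score file into a list of dicts (all values as text)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def intergenic_domains_near(rows, gene_name, max_distance=10000):
    """Return (domain_id, start, end, score) for the intergenic domains whose
    nearest gene is gene_name and lies within max_distance bases."""
    hits = []
    for row in rows:
        if row["Nearest_Gene_Name"] != gene_name or row["intergenic"] != "1":
            continue
        if abs(int(row["Distance_nearest_gene"])) <= max_distance:
            hits.append((row["Domain_ID"], int(row["Domain_Start"]),
                         int(row["Domain_End"]), float(row["Domain_score"])))
    return hits
```

With the rows listed above and `gene_name="MIR4251"`, domains 207 and 209 are returned; domain 208 is skipped because it is flagged as a promoter rather than intergenic.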
Document was last modified on Feb. 25 2012