GBSA

Genome-Wide Bisulfite Sequencing Analyser Software

To build your own gene annotation tables, please download one of the following programs:

 

————————————————-

For genomes found in the UCSC Genome Browser, e.g. Human (hg), Rat (rn), Zebrafish (danRer)

[download: GBSA-genome-builder-UCSC.zip]

The UCSC version automatically builds a genome database for *ANY* completely annotated organism found in the UCSC database. This spans 35 vertebrates, so GBSA is available for a very wide range of organisms. The program requires only one argument: the name of the organism's genome build as used by UCSC (e.g. hg19, hg18, mm9 or mm8).

To run the program:

perl GBSA-genome-builder-UCSC.pl <organism-build>

Example:

perl GBSA-genome-builder-UCSC.pl hg19 #this builds the GBSA genome database for the hg19 build (human)
perl GBSA-genome-builder-UCSC.pl mm9 #this builds the GBSA genome database for the mm9 build (mouse)
perl GBSA-genome-builder-UCSC.pl danRer7 #this builds the GBSA genome database for the danRer7 build (zebrafish)
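
To build databases for several UCSC builds in one go, the command above can be wrapped in a small script. The following is a minimal sketch and not part of GBSA: the function names `builder_command` and `build_all` are hypothetical, and the build-name check (letters followed by digits, as in hg19 or danRer7) is an assumption based on the examples above.

```python
import re
import subprocess

SCRIPT = "GBSA-genome-builder-UCSC.pl"  # script name as given above


def builder_command(build):
    """Return the argument list for one build, e.g. 'hg19'.

    Assumption: build names are letters followed by digits (hg19, mm9, danRer7).
    """
    if not re.fullmatch(r"[A-Za-z]+\d+", build):
        raise ValueError(f"unexpected build name: {build!r}")
    return ["perl", SCRIPT, build]


def build_all(builds, dry_run=True):
    """Build each genome in turn; with dry_run=True only return the commands."""
    commands = [builder_command(b) for b in builds]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing build
    return commands


if __name__ == "__main__":
    # Dry run: print the commands that would be executed.
    for cmd in build_all(["hg19", "mm9", "danRer7"]):
        print(" ".join(cmd))
```

Calling `build_all([...], dry_run=False)` runs the builder once per build, in order.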

For genomes not found in the UCSC Genome Browser, e.g. Arabidopsis

[download: GBSA-genome-builder-MANUAL.zip]

Although the UCSC database is rich in resources, it does not contain every organism that is studied for methylation events. The Manual version is made for this purpose. The program takes three input files and a user-given output name as arguments, and it also expects a folder called "genome" holding the genome's FASTA files, one per chromosome (e.g. chr1.fa, chr2.fa):

  1. A GFF file containing the exon and CDS information for all protein coding genes of the organism
  2. A BED file containing the known CpG sites of the organism
  3. A BED file containing the repeat information of the organism (specifically, output from RepeatMasker [http://repeatmasker.org])
  4. A folder named genome, located in the directory from which the program is run, containing the FASTA files of the genome. Example below:
    • ./genome/chr1.fa, ./genome/chr2.fa, ./genome/chr3.fa
  5. Output_name: a user-chosen name for the output (e.g. TAIR10)

 

To run the program:

perl GBSA-genome-builder-MANUAL.pl <Gene_GFF_file> <CpG_BED_file> <Repeats_BED_file> <Output-name>

Example:

perl GBSA-genome-builder-MANUAL.pl Arabidopsis_genes.gff Arabidopsis_cpg.bed Arabidopsis_repeats.bed TAIR10
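
Because the Manual version depends on several inputs being in place, it can help to check them before starting a build. The sketch below is an illustration only and not part of GBSA: the function names `check_manual_inputs` and `manual_command` are hypothetical, and it assumes the FASTA files in the genome folder end in `.fa`, as in the examples above.

```python
from pathlib import Path


def check_manual_inputs(gff, cpg_bed, repeats_bed, workdir="."):
    """Return a list of problems; an empty list means the inputs look complete."""
    workdir = Path(workdir)
    problems = []
    inputs = [("gene GFF", gff), ("CpG BED", cpg_bed), ("repeats BED", repeats_bed)]
    for label, name in inputs:
        if not (workdir / name).is_file():
            problems.append(f"{label} file not found: {name}")
    genome = workdir / "genome"
    if not genome.is_dir():
        problems.append("no 'genome' folder in the working directory")
    elif not any(genome.glob("*.fa")):
        problems.append("'genome' folder contains no .fa files")
    return problems


def manual_command(gff, cpg_bed, repeats_bed, output_name):
    """Argument list in the order the Manual builder expects."""
    return ["perl", "GBSA-genome-builder-MANUAL.pl",
            gff, cpg_bed, repeats_bed, output_name]


if __name__ == "__main__":
    args = ("Arabidopsis_genes.gff", "Arabidopsis_cpg.bed", "Arabidopsis_repeats.bed")
    for problem in check_manual_inputs(*args):
        print("PROBLEM:", problem)
    print(" ".join(manual_command(*args, "TAIR10")))
```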

 

Examples of File Formats

1. Sample of Gene/Protein-coding GFF file:

<chr> <source> <feature> <start> <end> <score> <strand> <frame> <group>
chrMt protein_coding exon 273 734 0 -0 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; seqedit "false";
chrMt protein_coding CDS 276 734 0 -0 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; protein_id "ATMG00010.1";
chrMt protein_coding start_codon 732 734 0 -0 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1";
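
To illustrate how the nine columns and the `key "value";` pairs in the <group> column fit together, here is a minimal parsing sketch. It is not GBSA's own parser: the function name `parse_gff_line` is hypothetical, and it assumes columns are separated by whitespace as in the sample above. It accepts both straight and curly quotation marks around values.

```python
def parse_gff_line(line):
    """Split one GFF line into its nine columns and parse the <group> attributes."""
    # At most 8 splits gives 9 columns; the attribute column keeps its spaces.
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    chrom, source, feature, start, end, score, strand, frame, group = fields

    attributes = {}
    for item in group.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Remove surrounding straight or curly quotation marks.
        attributes[key] = value.strip().strip('"\u201c\u201d')

    return {
        "chr": chrom, "source": source, "feature": feature,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "frame": frame,
        "attributes": attributes,
    }


if __name__ == "__main__":
    sample = ('chrMt protein_coding exon 273 734 0 -0 0 gene_id "ATMG00010"; '
              'transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A";')
    record = parse_gff_line(sample)
    print(record["feature"], record["start"], record["end"],
          record["attributes"]["gene_id"])
```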

2. Sample of CpG BED file:

<chr> <start> <end> <length> <CpGcount> <GCcontent> <pctGC> <obsExp>
chr1 3713 3857 145 14 72 0.497 1.586
chr1 20960 21279 320 21 156 0.488 1.119
chr1 52465 52638 174 18 100 0.575 1.396
chr1 57458 57601 144 13 72 0.5 1.625
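
In the sample rows above, <length> equals end - start + 1 and <pctGC> equals <GCcontent> divided by <length>, rounded to three decimals (for the first row, 3857 - 3713 + 1 = 145 and 72 / 145 ≈ 0.497). The sketch below checks a CpG BED line against those two relationships. These relationships are inferred from the sample only, not from a GBSA specification, and the function names are hypothetical.

```python
def parse_cpg_line(line):
    """Parse one whitespace-separated CpG BED line into typed fields."""
    chrom, start, end, length, cpg_count, gc_content, pct_gc, obs_exp = line.split()
    return {
        "chr": chrom, "start": int(start), "end": int(end),
        "length": int(length), "cpg_count": int(cpg_count),
        "gc_content": int(gc_content),
        "pct_gc": float(pct_gc), "obs_exp": float(obs_exp),
    }


def check_cpg_record(rec, tol=0.001):
    """Return a list of inconsistencies (empty if the record looks consistent)."""
    problems = []
    # Assumption from the sample: the interval is inclusive at both ends.
    if rec["length"] != rec["end"] - rec["start"] + 1:
        problems.append("length does not match end - start + 1")
    # Assumption from the sample: pctGC is a fraction, GCcontent / length.
    if abs(rec["pct_gc"] - rec["gc_content"] / rec["length"]) > tol:
        problems.append("pctGC does not match GCcontent / length")
    return problems


if __name__ == "__main__":
    rows = [
        "chr1 3713 3857 145 14 72 0.497 1.586",
        "chr1 20960 21279 320 21 156 0.488 1.119",
    ]
    for row in rows:
        print(row.split()[:3], check_cpg_record(parse_cpg_line(row)))
```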

3. Sample of Repeats BED file:

<chr> <start> <end> <feature>
chr1 3202 3229 Low_complexity
chr1 3204 3246 Low_complexity
chr1 4291 4328 Simple_repeat
chr1 55676 56576 DNA/hAT
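
A quick way to sanity-check a repeats file before building is to count the entries per repeat class. The following is a minimal sketch under the assumption that each line has exactly the four whitespace-separated columns shown above. The function name `count_repeat_classes` is hypothetical and not part of GBSA.

```python
from collections import Counter


def count_repeat_classes(lines):
    """Count repeat entries per <feature> value in a four-column repeats BED."""
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 columns, got {len(fields)}")
        chrom, start, end, feature = fields
        if int(start) > int(end):
            raise ValueError(f"line {number}: start is greater than end")
        counts[feature] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "chr1 3202 3229 Low_complexity",
        "chr1 3204 3246 Low_complexity",
        "chr1 4291 4328 Simple_repeat",
        "chr1 55676 56576 DNA/hAT",
    ]
    for feature, n in count_repeat_classes(sample).most_common():
        print(feature, n)
```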