To build your own gene annotation tables, download the program that matches your genome:
- For genomes found in UCSC Genome Browser e.g. Human (hg), Rat (rn), Zebrafish (danRer)
- For genomes not found in UCSC Genome Browser e.g. Arabidopsis
————————————————-
For genomes found in UCSC Genome Browser e.g. Human (hg), Rat (rn), Zebrafish (danRer)
[download: GBSA-genome-builder-UCSC.zip]
The UCSC version automatically builds a genome database for *ANY* completely annotated organism in the UCSC database. This covers 35 vertebrates, so GBSA works for a wide range of organisms without manual preparation. The program takes a single argument: the UCSC build name of the organism (e.g. hg19, hg18, mm9 or mm8).
To run the program:
perl GBSA-genome-builder-UCSC.pl <organism-build>
Example:
perl GBSA-genome-builder-UCSC.pl hg19 #this builds the GBSA genome database for the hg19 build (human)
perl GBSA-genome-builder-UCSC.pl mm9 #this builds the GBSA genome database for the mm9 build (mouse)
perl GBSA-genome-builder-UCSC.pl danRer7 #this builds the GBSA genome database for the danRer7 build (zebrafish)
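When several builds are needed, the single-argument call above can be wrapped in a short loop. The following is only a sketch in Python: the helper names `builder_command` and `build_all` are hypothetical, the build-name pattern is a loose sanity check of our own rather than a rule from GBSA or UCSC, and the script is assumed to sit in the current directory.

```python
import re
import subprocess

SCRIPT = "GBSA-genome-builder-UCSC.pl"  # assumed to be in the current directory

# Loose sanity check only (letters followed by digits, e.g. hg19, danRer7).
# This pattern is an assumption, not something GBSA or UCSC enforces.
BUILD_PATTERN = re.compile(r"^[A-Za-z]+[0-9]+$")


def builder_command(build):
    """Return the argument list for one run of the UCSC builder."""
    if not BUILD_PATTERN.match(build):
        raise ValueError(f"unexpected build name: {build!r}")
    return ["perl", SCRIPT, build]


def build_all(builds, dry_run=True):
    """Run the builder once per build; with dry_run, only print the commands."""
    for build in builds:
        cmd = builder_command(build)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first build that fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    build_all(["hg19", "mm9", "danRer7"])
```

With `dry_run=True` the script only prints the three commands shown in the examples; pass `dry_run=False` to actually run them.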
For genomes not found in UCSC Genome Browser e.g. Arabidopsis
[download: GBSA-genome-builder-MANUAL.zip]
Although the UCSC database is rich in resources, it does not cover every organism that is studied for methylation events. The Manual version is made for these cases. The program needs three input files and a user-chosen output name as arguments, plus a folder called “genome” that holds the genome’s FASTA files (one per chromosome, e.g. chr1.fa, chr2.fa):
- A GFF file containing the exon and CDS information for all protein coding genes of the organism
- A BED file containing the known CpG sites of the organism
- A BED file containing repeats information of the organism (output specifically from RepeatMasker [http://repeatmasker.org])
- A folder named genome in the program execution path, containing the FASTA files of the genome. Example below:
- ./genome/chr1.fa, ./genome/chr2.fa, ./genome/chr3.fa
- An output name of your choice (Output_name)
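Because the manual builder depends on several inputs being in place, it can help to verify them before starting a run. The sketch below is not part of GBSA; `check_manual_inputs` is a hypothetical helper that only checks that the three files exist and that a `genome` folder with at least one `.fa` file is present in the working directory.

```python
from pathlib import Path


def check_manual_inputs(gff, cpg_bed, repeats_bed, workdir="."):
    """Return a list of problems found; an empty list means the inputs look ready."""
    workdir = Path(workdir)
    problems = []
    inputs = (("gene GFF", gff), ("CpG BED", cpg_bed), ("repeats BED", repeats_bed))
    for label, name in inputs:
        if not (workdir / name).is_file():
            problems.append(f"{label} file not found: {name}")
    genome = workdir / "genome"
    if not genome.is_dir():
        problems.append("no 'genome' folder in the execution path")
    elif not any(genome.glob("*.fa")):
        problems.append("'genome' folder contains no .fa files")
    return problems


if __name__ == "__main__":
    # File names taken from the Arabidopsis example; adjust to your own data.
    found = check_manual_inputs(
        "Arabidopsis_genes.gff", "Arabidopsis_cpg.bed", "Arabidopsis_repeats.bed"
    )
    for problem in found:
        print(problem)
```

The check looks only at file presence, not at file contents or chromosome naming.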
To run the program:
perl GBSA-genome-builder-MANUAL.pl <Gene_GFF_file> <CpG_BED_file> <Repeats_BED_file> <Output-name>
Example:
perl GBSA-genome-builder-MANUAL.pl Arabidopsis_genes.gff Arabidopsis_cpg.bed Arabidopsis_repeats.bed TAIR10
Examples of File Format
1. Sample of Gene/Protein-coding GFF file:
<chr> | <source> | <feature> | <start> | <end> | <score> | <strand> | <frame> | <group> |
chrMt | protein_coding | exon | 273 | 734 | 0 | -0 | 0 | gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; seqedit "false"; |
chrMt | protein_coding | CDS | 276 | 734 | 0 | -0 | 0 | gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; protein_id "ATMG00010.1"; |
chrMt | protein_coding | start_codon | 732 | 734 | 0 | -0 | 0 | gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; |
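To make the nine-column layout concrete, here is a minimal sketch that reads one line of such a file and expands the group column into key/value pairs. It assumes the actual file is tab-separated, as is usual for GFF, with the pipes above being display only. `parse_gff_line` is a hypothetical helper, not something shipped with GBSA. Score, strand and frame are kept as raw strings and copied unchanged from the sample.

```python
import re

# Matches `key "value";` pairs in the group column. Both straight and
# typographic quotes are accepted in case the file was copied from a web page.
ATTRIBUTE = re.compile(r'(\S+)\s+["“]([^"”]*)["”]\s*;')


def parse_gff_line(line):
    """Split one tab-separated GFF line into a dict, expanding the group column."""
    fields = line.rstrip("\n").split("\t", 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    chrom, source, feature, start, end, score, strand, frame, group = fields
    return {
        "chr": chrom,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,    # kept as text, not interpreted
        "strand": strand,  # kept as text, not interpreted
        "frame": frame,    # kept as text, not interpreted
        "attributes": dict(ATTRIBUTE.findall(group)),
    }


if __name__ == "__main__":
    # First sample row, with column values copied as shown in the table.
    sample = "\t".join([
        "chrMt", "protein_coding", "exon", "273", "734", "0", "-0", "0",
        'gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; '
        'gene_name "ORF153A"; transcript_name "ATMG00010.1"; seqedit "false";',
    ])
    record = parse_gff_line(sample)
    print(record["feature"], record["attributes"]["gene_id"])
```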
2. Sample of CpG BED file:
<chr> | <start> | <end> | <length> | <CpGcount> | <GCcontent> | <pctGC> | <obsExp> |
chr1 | 3713 | 3857 | 145 | 14 | 72 | 0.497 | 1.586 |
chr1 | 20960 | 21279 | 320 | 21 | 156 | 0.488 | 1.119 |
chr1 | 52465 | 52638 | 174 | 18 | 100 | 0.575 | 1.396 |
chr1 | 57458 | 57601 | 144 | 13 | 72 | 0.5 | 1.625 |
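In the sample rows, `<length>` equals `<end> - <start> + 1` and `<pctGC>` equals `<GCcontent>` divided by `<length>`, rounded to three decimals. A quick consistency check along those lines might look like the sketch below. These two relations are inferred from the four sample rows only, the file is assumed to be whitespace- or tab-separated, and `check_cpg_row` is a hypothetical helper that is not part of GBSA.

```python
def check_cpg_row(line, tolerance=0.001):
    """Return a list of inconsistencies found in one CpG BED row (empty if none)."""
    fields = line.split()
    if len(fields) != 8:
        return [f"expected 8 columns, got {len(fields)}"]
    start, end, length, cpg_count, gc_count = (int(x) for x in fields[1:6])
    pct_gc = float(fields[6])
    issues = []
    # Relation inferred from the sample rows: coordinates are inclusive.
    expected_length = end - start + 1
    if length != expected_length:
        issues.append(f"length {length} != end - start + 1 ({expected_length})")
    # Relation inferred from the sample rows: pctGC is a fraction, not a percentage.
    if length > 0 and abs(gc_count / length - pct_gc) > tolerance:
        issues.append(f"pctGC {pct_gc} does not match GCcontent/length")
    return issues


if __name__ == "__main__":
    for row in ("chr1\t3713\t3857\t145\t14\t72\t0.497\t1.586",
                "chr1\t57458\t57601\t144\t13\t72\t0.5\t1.625"):
        print(row.split()[0:3], check_cpg_row(row) or "ok")
```

The `<obsExp>` column is left unchecked because it cannot be recomputed from the other columns alone.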
3. Sample of Repeats BED file:
<chr> | <start> | <end> | <feature> |
chr1 | 3202 | 3229 | Low_complexity |
chr1 | 3204 | 3246 | Low_complexity |
chr1 | 4291 | 4328 | Simple_repeat |
chr1 | 55676 | 56576 | DNA/hAT |
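To see which repeat classes a file contains before building, the fourth column can be tallied. This sketch assumes a whitespace- or tab-separated file and follows the RepeatMasker `class/family` naming (e.g. `DNA/hAT`) by grouping on the part before the slash. `count_repeat_classes` is a hypothetical helper and not part of GBSA.

```python
from collections import Counter


def count_repeat_classes(lines):
    """Count rows per repeat class, using the text before '/' in column four."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete rows
        repeat_class = fields[3].split("/", 1)[0]
        counts[repeat_class] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "chr1\t3202\t3229\tLow_complexity",
        "chr1\t3204\t3246\tLow_complexity",
        "chr1\t4291\t4328\tSimple_repeat",
        "chr1\t55676\t56576\tDNA/hAT",
    ]
    for repeat_class, n in sorted(count_repeat_classes(sample).items()):
        print(repeat_class, n)
```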