Sequence Generation and Sequence Analysis Processing Pipeline

Figure 1: Schematic overview of the processing pipeline.

Step A: Sequence generation

Genomic DNA from human tissues was digested into heavily methylated and completely unmethylated sequence fragments. Unmethylated regions were isolated by digestion with the restriction enzyme McrBC, cloned into plasmid vectors, and sequenced. Methylated regions were isolated and cloned based on their resistance to a collection of methylation-sensitive enzymes (i.e., HpaII, HhaI, MaeII, BstUI, and AciI).

Pairs of sequence fragment ends (sequence tags) were derived by sequencing from the M13 forward and M13 reverse priming sites in the pZEro-1 vector. These sequences are stored in FASTA format. The description line of each record is a label that indicates the fragment library (McrBC or RE), the fragment ID, and whether the tag is the front or rear end of the fragment.
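As an illustration of how such labeled tags might be read, the sketch below parses FASTA text and groups the two ends of each fragment. The exact label layout is not given here, so the "library|fragment_id|end" form and the names TagLabel, parse_label, read_fasta and group_by_fragment are assumptions made only for this example.

```python
from typing import Dict, Iterator, NamedTuple, Tuple


class TagLabel(NamedTuple):
    library: str      # "McrBC" or "RE"
    fragment_id: str  # identifier shared by both ends of a fragment
    end: str          # "front" or "rear"


def parse_label(description: str) -> TagLabel:
    """Split a hypothetical 'library|fragment_id|end' label into its parts."""
    library, fragment_id, end = description.lstrip(">").strip().split("|")
    if library not in ("McrBC", "RE"):
        raise ValueError(f"unknown library: {library!r}")
    if end not in ("front", "rear"):
        raise ValueError(f"unknown fragment end: {end!r}")
    return TagLabel(library, fragment_id, end)


def read_fasta(text: str) -> Iterator[Tuple[TagLabel, str]]:
    """Yield (label, sequence) for every record in FASTA-formatted text."""
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = parse_label(line), []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def group_by_fragment(text: str) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Map (library, fragment_id) to a dict holding its front and rear tags."""
    fragments: Dict[Tuple[str, str], Dict[str, str]] = {}
    for label, sequence in read_fasta(text):
        key = (label.library, label.fragment_id)
        fragments.setdefault(key, {})[label.end] = sequence
    return fragments
```

Grouping by (library, fragment ID) keeps the front and rear tags of one clone together, which is the form the later pairing step needs.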

Step B: Align tags to genome

The first step in the software analysis pipeline is to find all possible local alignments of each sequence fragment end on the genome. The inputs to this program are the sequence fragment ends (in FASTA format) and a local copy of the human genome obtained from the Genome Browser at UCSC. Using BLAT, an alignment tool similar to BLAST, the program searches the genome and locates all exact or near-exact matches of each sequence fragment end.
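The alignment step can be sketched as a BLAT invocation followed by parsing of its default PSL output. This is a minimal sketch, not the pipeline's own code: the file paths passed to run_blat are placeholders, the function and class names are invented for the example, and only a subset of the 21 PSL columns is kept.

```python
import subprocess
from typing import List, NamedTuple


class PslHit(NamedTuple):
    matches: int
    mismatches: int
    rep_matches: int
    strand: str
    q_name: str    # name of the sequence tag (query)
    q_size: int
    q_start: int
    q_end: int
    t_name: str    # chromosome (target)
    t_start: int   # 0-based start on the target
    t_end: int     # end on the target (exclusive)


def run_blat(genome_path: str, tags_path: str, psl_path: str) -> None:
    """Run BLAT as 'blat <database> <query> <output.psl>' (paths are placeholders)."""
    subprocess.run(["blat", genome_path, tags_path, psl_path, "-noHead"], check=True)


def parse_psl(text: str) -> List[PslHit]:
    """Parse PSL rows, skipping any header or malformed lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Data rows have 21 tab-separated columns and begin with an integer.
        if len(fields) != 21 or not fields[0].isdigit():
            continue
        hits.append(PslHit(
            matches=int(fields[0]),
            mismatches=int(fields[1]),
            rep_matches=int(fields[2]),
            strand=fields[8],
            q_name=fields[9],
            q_size=int(fields[10]),
            q_start=int(fields[11]),
            q_end=int(fields[12]),
            t_name=fields[13],
            t_start=int(fields[15]),
            t_end=int(fields[16]),
        ))
    return hits
```

Each tag may produce several PslHit rows, one per candidate location on the genome.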

Next, the program pairs the putative alignments for the front and rear sequence tags and applies a series of constraints to find the best possible alignment of the sequence fragment to the genome. The sequence tags must be on the same chromosome and in the same orientation, and the distance between the ends cannot exceed the size selection used to construct the library. These constraints are usually sufficient to rule out all but one possible alignment to the genome. If multiple possible alignments meet the constraints, the program uses the BLAT output to calculate the percent identity of the front and rear alignments and chooses the front and rear pair with the highest overall percent identity.
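The pairing constraints can be illustrated with a short sketch. Several details are assumptions made for this example: percent identity is computed as matches / (matches + mismatches), "overall" identity is taken as the sum of the two ends' identities, and max_span stands in for the library's size-selection limit. The names Hit, percent_identity and best_pair are hypothetical.

```python
from itertools import product
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    chrom: str
    strand: str
    start: int       # 0-based start on the chromosome
    end: int         # end on the chromosome (exclusive)
    matches: int
    mismatches: int


def percent_identity(hit: Hit) -> float:
    """Simplified identity: matching bases over aligned bases (an assumption)."""
    aligned = hit.matches + hit.mismatches
    return 100.0 * hit.matches / aligned if aligned else 0.0


def best_pair(front_hits: List[Hit], rear_hits: List[Hit],
              max_span: int) -> Optional[Tuple[Hit, Hit]]:
    """Return the front/rear pair that satisfies the constraints and has the
    highest combined percent identity, or None if no pair qualifies."""
    candidates = []
    for front, rear in product(front_hits, rear_hits):
        if front.chrom != rear.chrom:        # same chromosome
            continue
        if front.strand != rear.strand:      # same orientation
            continue
        span = max(front.end, rear.end) - min(front.start, rear.start)
        if span > max_span:                  # within the size selection
            continue
        candidates.append((front, rear))
    if not candidates:
        return None
    return max(candidates,
               key=lambda pair: percent_identity(pair[0]) + percent_identity(pair[1]))
```

In most cases the candidate list has a single entry after filtering, so the identity comparison only acts as a tie-breaker.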

Step C: Fill in intervening sequence for tag pairs

Using the optimal alignment of the sequence fragment ends, the program extracts the intervening sequence for the library fragment from the genome assembly database and stores the clone sequences in FASTA format. These data are available for download; see the Results section.
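A minimal sketch of this extraction follows. It assumes the genome is held in memory as a dict from chromosome name to sequence and that coordinates are 0-based with an exclusive end; the real pipeline reads from a genome assembly database, and the function names here are hypothetical.

```python
from typing import Dict, Tuple


def fragment_span(front_start: int, front_end: int,
                  rear_start: int, rear_end: int) -> Tuple[int, int]:
    """Return the interval covering both aligned tags and everything between."""
    return min(front_start, rear_start), max(front_end, rear_end)


def extract_fragment(genome: Dict[str, str], chrom: str, start: int, end: int) -> str:
    """Slice the fragment sequence out of an in-memory genome (an assumption)."""
    return genome[chrom][start:end]


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one record as FASTA, wrapping the sequence at 'width' columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"
```

For example, tags aligned at 2-6 and 10-14 on a toy chromosome yield the span 2-14, and the slice between those coordinates becomes the clone sequence.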

Step D: Determine methylation status of fragment

The pipeline determines the methylation status of each library fragment by searching the sequence for particular restriction enzyme recognition sites.

The restriction enzymes used to create the RE library recognize the following tetranucleotides: CCGG, GCGC, ACGT, CGCG, GGCG, and CCGC. If a fragment from that library contains any of these tetranucleotides, the site was not cleaved by the methylation-sensitive enzymes, which implies that it was methylated in the original DNA. The software therefore marks the site as "methylated." If the fragment does not contain any of these tetranucleotides, the fragment is given a status of "unknown."
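The site search for the RE library can be sketched as follows. The tetranucleotide list is copied from the text above; the function names and the choice to report overlapping sites are assumptions of this example.

```python
import re
from typing import List, Tuple

# Recognition sites as listed in the text for the RE library enzymes.
RE_SITES = ("CCGG", "GCGC", "ACGT", "CGCG", "GGCG", "CCGC")


def find_re_sites(sequence: str) -> List[Tuple[int, str]]:
    """Return sorted (position, site) pairs, including overlapping sites."""
    seq = sequence.upper()
    found = []
    for site in RE_SITES:
        # A zero-width lookahead lets matches overlap (e.g. CGCG in CGCGCG).
        for match in re.finditer(f"(?={site})", seq):
            found.append((match.start(), site))
    return sorted(found)


def re_fragment_status(sequence: str) -> str:
    """'methylated' if any uncleaved site is present, otherwise 'unknown'."""
    return "methylated" if find_re_sites(sequence) else "unknown"
```

A fragment such as TTCCGGAA is reported as "methylated" because it still contains an intact CCGG site, while a fragment with no listed site is "unknown".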

Similarly, for the McrBC library, each McrBC recognition sequence within a fragment is marked as an "unmethylated" site, and fragments that do not contain the pattern are given a status of "unknown." The McrBC cleavage pattern is more complex than the simple tetranucleotides recognized by the other restriction enzymes. McrBC cleaves 5'-RCG-(X)40-500-RCG-3', where R is a purine (A or G) and (X)40-500 denotes 40 to 500 bp of any nucleotides between the half-sites. Cleavage is most efficient when the distance between the half-sites is ~55-103 bp and falls off rapidly beyond 500 bp. The software therefore considers only half-site pairs separated by less than 500 bp of intervening sequence. This phase of the analysis assigns methylated, unmethylated, or unknown labels to the corresponding regions of the human genome.
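The half-site search for the McrBC library can be sketched in the same way. This example treats R as A or G and accepts pairs of RCG half-sites separated by 40 to 500 intervening nucleotides; the inclusive bounds, the default parameters, and the function names are assumptions rather than the pipeline's documented behavior.

```python
import re
from typing import List, Tuple

# R (purine) followed by CG; the lookahead allows adjacent half-sites.
HALF_SITE = re.compile(r"(?=[AG]CG)")


def find_mcrbc_sites(sequence: str, min_gap: int = 40,
                     max_gap: int = 500) -> List[Tuple[int, int]]:
    """Return (start, end) spans of half-site pairs whose intervening
    sequence is between min_gap and max_gap nucleotides long."""
    seq = sequence.upper()
    starts = [match.start() for match in HALF_SITE.finditer(seq)]
    pairs = []
    for i, first in enumerate(starts):
        for second in starts[i + 1:]:
            gap = second - (first + 3)   # nucleotides between the half-sites
            if gap > max_gap:
                break                    # later half-sites are even farther away
            if gap >= min_gap:
                pairs.append((first, second + 3))
    return pairs


def mcrbc_fragment_status(sequence: str) -> str:
    """'unmethylated' if a recognition pattern is present, otherwise 'unknown'."""
    return "unmethylated" if find_mcrbc_sites(sequence) else "unknown"
```

Because the half-site positions are sorted, the inner loop can stop as soon as the gap exceeds the upper bound.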

Step E: Use the UCSC Genome Browser to visualize methylated and unmethylated domains

The pipeline outputs a GFF file containing the coordinates of each fragment on the genome, as well as its methylation status. A final postprocessing step splits this output into files that can be uploaded to the UCSC Genome Browser as custom tracks.
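The output step might look like the sketch below, which writes one nine-column GFF line per fragment and groups the lines into one custom track per methylation status. How the actual pipeline splits its files is not described here, so the split by status, the use of the feature column for the status, the source name, and the track names are all assumptions of this example.

```python
from typing import Dict, Iterable


def gff_line(chrom: str, start: int, end: int, status: str, fragment_id: str,
             source: str = "methylation_pipeline", strand: str = ".") -> str:
    """Format one fragment as a tab-separated GFF line.

    Input coordinates are assumed 0-based with an exclusive end;
    GFF uses 1-based inclusive coordinates, so only the start shifts by one.
    """
    fields = [chrom, source, status, str(start + 1), str(end),
              ".", strand, ".", fragment_id]
    return "\t".join(fields)


def build_tracks(records: Iterable[dict]) -> Dict[str, str]:
    """Group fragments by status and return the text of one track per status."""
    lines_by_status: Dict[str, list] = {}
    for rec in records:
        lines_by_status.setdefault(rec["status"], []).append(gff_line(**rec))
    return {
        status: f'track name="{status}" description="{status} fragments"\n'
                + "\n".join(lines) + "\n"
        for status, lines in lines_by_status.items()
    }
```

Each value returned by build_tracks can be written to its own file and submitted to the browser as a separate custom track.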

The Results section contains URLs that load the results of this analysis into the Human Genome Browser at the University of California, Santa Cruz. Through the web browser, users can interactively explore the complete human genome, with synchronized views of parallel data tracks such as ESTs, gene transcript annotations, and regions of similarity to other genomes (Figure 1E). The methylation landscape appears as an additional track in this browser.

Information on running the pipeline is available in the README file. Access to this file is restricted to project members and collaborators. To request access, please contact Dr. Haghighi.