| Preface | 7 |
---|
| Acknowledgments | 9 |
---|
| Contents | 10 |
---|
| Acronyms | 14 |
---|
| Introduction | 15 |
---|
| Some Basic Genetics | 15 |
| The Central Dogma | 17 |
| The Structure of a Gene | 19 |
| How Many Genes Do We Have? | 21 |
| Problems of Gene Definitions | 25 |
| The Gene Finding Problem | 26 |
| Comparative Gene Finding | 28 |
| History of Algorithm Development | 29 |
| To Build a Gene Finder | 32 |
| References | 35 |
| Single Species Gene Finding | 41 |
---|
| Hidden Markov Models (HMMs) | 41 |
| Markov Chains | 42 |
| Discrete-Time Markov Chains | 42 |
| Stationarity and Reversibility | 48 |
| Continuous-Time Markov Chains | 50 |
| Hidden Markov Models | 53 |
| Dynamic Programming | 56 |
| Silent Begin and End States | 58 |
| The Forward Algorithm | 59 |
| The Backward Algorithm | 59 |
| The Viterbi Algorithm | 61 |
| EasyGene: A Prokaryotic Gene Finder | 63 |
| Posterior Decoding | 65 |
| Statistical Significance of Predictions | 65 |
| Generalized Hidden Markov Models (GHMMs) | 66 |
| Preliminaries | 66 |
| The Forward and Backward Algorithms | 68 |
| The Forward Variables | 68 |
| The Backward Variables | 70 |
| The Viterbi Algorithm | 70 |
| Genscan: A GHMM-Based Gene Finder | 71 |
| Sequence Generation Algorithm | 74 |
| Reducing Computational Complexity | 74 |
| Exon Probabilities | 78 |
| Interpolated Markov Models (IMMs) | 81 |
| Preliminaries | 81 |
| Linear and Rational Interpolation | 82 |
| GLIMMER: A Microbial Gene Finder | 83 |
| Gene Prediction | 84 |
| Training the IMM | 85 |
| GlimmerM | 86 |
| Neural Networks | 86 |
| Biological Neurons | 87 |
| Artificial Neurons and the Perceptron | 88 |
| Multi-Layer Neural Networks | 90 |
| GRAIL: A Neural Network-Based Gene Finder | 91 |
| Decision Trees | 93 |
| Classification | 94 |
| Decision Tree Learning | 95 |
| MORGAN: A Decision Tree-Based Gene Finder | 99 |
| References | 100 |
| Sequence Alignment | 103 |
---|
| Pairwise Sequence Alignment | 103 |
| Dot Plot Matrix | 105 |
| Nucleotide Substitution Models | 106 |
| The Jukes-Cantor Model | 108 |
| The Kimura Model | 109 |
| The Felsenstein Model | 110 |
| The Tamura and Nei Model | 111 |
| General Time-Reversible (GTR) Model | 111 |
| Amino Acid Substitution Models | 112 |
| The PAM Matrix | 113 |
| The BLOSUM Matrix | 117 |
| The GONNET matrix | 120 |
| Gap Models | 120 |
| The Needleman-Wunsch Algorithm | 122 |
| Needleman-Wunsch Using Affine Gaps | 124 |
| The Smith-Waterman Algorithm | 126 |
| Pair Hidden Markov Models (PHMMs) | 128 |
| Preliminaries | 128 |
| The Forward, Backward, and Viterbi Algorithms | 130 |
| Database Similarity Searches | 132 |
| FASTA | 132 |
| BLAST | 134 |
| Gapped BLAST | 136 |
| PSI-BLAST | 136 |
| The Significance of Alignment Scores | 137 |
| Multiple Sequence Alignment | 138 |
| Scoring Schemes | 139 |
| Sum-of-Pairs (SP) | 141 |
| Weighted Sum-of-Pairs (WSP) | 141 |
| Minimum Entropy | 141 |
| Gap Costs | 142 |
| Phylogenetic Trees | 143 |
| The Neighbor-Joining Method | 143 |
| Fitch-Margoliash | 144 |
| Dynamic Programming | 145 |
| The MSA Package | 145 |
| Progressive Alignments | 147 |
| Iterative Methods | 150 |
| Hidden Markov Models | 153 |
| SAM-Sequence Alignment and Modeling | 153 |
| Genetic Algorithms | 155 |
| Simulated Annealing | 158 |
| Alignment Profiles | 161 |
| Standard Profiles | 161 |
| Profile HMMs | 163 |
| Scoring a New Sequence | 164 |
| References | 165 |
| Comparative Gene Finding | 171 |
---|
| Similarity-Based Gene Finding | 171 |
| GenomeScan: GHMM-Based Gene Finding Using Homology | 172 |
| Twinscan: GHMM-Based Gene Finding Using Informant Sequences | 174 |
| Heuristic Cross-Species Gene Finding | 176 |
| ROSETTA | 176 |
| Pair Hidden Markov Models (PHMMs) | 177 |
| DoubleScan: A PHMM-Based Comparative Gene Finder | 178 |
| The State Space | 178 |
| The Stepping Stone Algorithm | 180 |
| Generalized Pair Hidden Markov Models (GPHMMs) | 181 |
| Preliminaries | 181 |
| The Forward, Backward and Viterbi Algorithms | 182 |
| SLAM: A GPHMM-Based Comparative Gene Finder | 184 |
| The State Space | 184 |
| Reducing Computational Complexity | 186 |
| Gene Mapping | 188 |
| Projector: A Gene Mapping Tool | 188 |
| GeneMapper-Reference Based Annotation | 189 |
| Multiple Sequence Gene Finding | 190 |
| N-SCAN: A Multiple Informant-Based Gene Finder | 191 |
| References | 193 |
| Gene Structure Submodels | 195 |
---|
| The State Space | 195 |
| The Exon States | 196 |
| Splice Sites | 198 |
| Introns and Intergenic Regions | 199 |
| Untranslated Regions (UTRs) | 200 |
| Promoters and PolyA-signals | 201 |
| State Length Distributions | 202 |
| Geometric and Negative Binomial Lengths | 202 |
| Empirical Length Distributions | 205 |
| Acyclic Discrete Phase Type Distributions | 206 |
| Sequence Content Sensors | 210 |
| GC-Content Binning | 210 |
| Start Codon Recognition | 211 |
| Codon and Amino Acid Usage | 212 |
| K-Tuple Freq
|