Summary
As noted in the introduction, any mammalian gene may have 50 to 100, or more, binding sites for transcription factors scattered among its promoters and enhancers, and there are typically multiple sites bound by any single transcription factor. Genuine transcriptional regulatory elements, as noted above, tend to be clustered within conserved non-coding regions. Many transcription factors bind or act cooperatively, for example the Ets and AP1 families (Stacey et al., 1995), so their recognition motifs commonly occur side by side when they are functional. Regardless of which of the methods above is used, one can place an additional constraint on the analysis, and gain greater confidence in the predictions, by searching for clusters of predicted elements using programs such as Cluster-Buster (Frith et al., 2003). If the same clusters occur in genes with similar regulatory patterns, or across species, the analysis gains additional predictive power. When multiple genes are included, the order and location of sites become irrelevant; the output one seeks is the incidence of a particular site within a cluster, and its frequency when it is present. This constraint, together with those above, can help overcome the problem of transcription factor binding site degeneracy, and may bring us to a position in which it is possible to design machine learning approaches that distinguish classes of genes, and their likely transcriptional outputs, from genomic sequence information alone.
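The two quantities named above, the incidence of a site within clusters and its frequency when present, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the gene names, site positions, the 200 bp window and the three-site minimum are hypothetical, and the simple distance-based grouping stands in for a dedicated cluster-finding program rather than reproducing the method of any published tool.

```python
from collections import Counter

def find_clusters(sites, window=200, min_sites=3):
    """Group predicted sites into clusters: runs in which consecutive
    sites lie no more than `window` bp apart (hypothetical threshold).
    `sites` is a list of (position, factor_name) tuples."""
    clusters, current = [], []
    for pos, name in sorted(sites):
        # A gap larger than the window closes the current run.
        if current and pos - current[-1][0] > window:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append((pos, name))
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters

def site_statistics(genes, window=200, min_sites=3):
    """Return {factor: (incidence, mean_count_when_present)}.
    incidence: fraction of genes with at least one clustered site
    for the factor; mean count: average number of clustered sites
    for that factor among the genes that have any."""
    per_gene_counts = []
    for sites in genes.values():
        counts = Counter()
        for cluster in find_clusters(sites, window, min_sites):
            # Order and position are discarded; only identity is kept.
            counts.update(name for _, name in cluster)
        per_gene_counts.append(counts)
    factors = set().union(*per_gene_counts)
    stats = {}
    for factor in factors:
        present = [c[factor] for c in per_gene_counts if c[factor] > 0]
        stats[factor] = (len(present) / len(per_gene_counts),
                         sum(present) / len(present))
    return stats

# Hypothetical predicted sites for three co-regulated genes.
genes = {
    "geneA": [(100, "Ets"), (130, "AP1"), (180, "Ets"), (5000, "AP1")],
    "geneB": [(40, "AP1"), (90, "Ets"), (150, "SP1")],
    "geneC": [(10, "SP1"), (3000, "Ets")],
}
stats = site_statistics(genes)
for factor, (incidence, mean_count) in sorted(stats.items()):
    print(f"{factor}: incidence={incidence:.2f}, mean count={mean_count:.1f}")
```

In this toy input the isolated sites in geneA and geneC are ignored because they fall outside any cluster, which is the sense in which clustering acts as an additional constraint on degenerate motif matches.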