The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
In the last few years, it has become routine to use gene-order data to reconstruct phylogenies, both in terms of edge distances (parsimonious sequences of operations that transform one end point of the edge into the other) and in terms of genomes at internal nodes, on small, duplication-free genomes. Current gene-order methods break down, though, when the genomes contain more than a few hundred genes,...
Conserved intervals were recently introduced as a measure of similarity between genomes whose genes have been shuffled during evolution by genomic rearrangements. Phylogenetic reconstruction based on such similarity measures raises many biological, formal and algorithmic questions, in particular the labelling of internal nodes with putative ancestral gene orders, and the selection of a good tree topology...
Studying rearrangements from gene order data is a standard approach in evolutionary analysis. Gene order data are usually modeled as signed permutations. The computation of the minimal number of reversals between two signed permutations produced a lot of literature during the last decade. Algorithms designed were first approximative, then polynomial and were further improved to give a linear one....
Contact maps are a model to capture the core information in the structure of biological molecules, e.g., proteins. A contact map consists of an ordered set S of elements (representing a protein’s sequence of amino acids), and a set A of element pairs of S, called arcs (representing amino acids which are closely neighbored in the structure). Given two contact maps (S,A) and (S...
One of the most promising ways to determine evolutionary distance between two organisms is to compare the order of appearance of orthologous genes in their genomes. The resulting genome rearrangement problem calls for finding a shortest sequence of rearrangement operations that sorts one genome into the other. In this paper we provide a 1.5-approximation algorithm for the problem of sorting by transpositions...
We examine the problem of finding maximal-scoring sets of disjoint regions in a sequence of scores. The problem arises in DNA and protein segmentation, and in post-processing of sequence alignments. Our key result states a simple recursive relationship between maximal-scoring segment sets. The statement leads to an algorithm that finds such a k-set of segments in a sequence of length n in O(nk) time...
We present a program qhash, based on q-gram filtration and high-dimensional search, to find gapped local similarities between two sequences. Our approach differs from past q-gram-based approaches in two main aspects. Our filtration step uses algorithms for a sparse all-pairs problem, while past studies use suffix-tree-like structures and counters. Our program works in sequence-sequence mode, while...
We study the problem of extracting, from given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly...
Sequence alignment algorithms have a long standing tradition in bioinformatics. In this paper, we formulate an extension to existing local alignment algorithms: local alignments across multiple scoring functions. For this purpose, we use the Waterman-Eggert algorithm for suboptimal local alignments as template and introduce two new features therein: 1) an alignment of two strings over a set of score...
The subject of estimating the p-value of the log-likelihood ratio statistic for multinomial distribution has been studied extensively in the statistical literature. Nevertheless, bioinformatics laid new challenges before that research by often concentrating its interest on the “thin tail” of the distribution where classical statistical approximation typically fails. Hence, some of the more recent...
Bayesian networks are widely used for modelling gene networks. We investigate the problem of expanding a given Bayesian network by adding a hidden node – a node on which no experimental data are given. Finding a good expansion (a new hidden node and its neighborhood) can point to regions where the model is not rich enough, and help locate new, unknown variables that are important for understanding...
Genomic instabilities, amplifications, deletions and translocations are often observed in tumor cells. In the process of cancer pathogenesis cells acquire multiple genomic alterations, some of which drive the process by triggering overexpression of oncogenes and by silencing tumor suppressors and DNA repair genes. We present data analysis methods designed to study the overall transcriptional effects...
We consider the problem of finding candidates for regulatory elements of alternative splicing events from orthologous genes, using phylogenetic footprinting. The problem is formulated as follows: We are given orthologous sequences P1,...,Pa and N1,...,Nb from a + b different species, and a phylogenetic tree...
The aim of this work is to use a supervised learning approach to identify sets of motif-based sequence characteristics, combinations of which can give the most accurate annotation of new proteins. We assess several of InterPro Consortium member databases for their informativeness for the annotation of full-length protein sequences. Thus, our study addresses the problem of integrating biological information...
We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local protein aligners to use a collection of vector seeds [3] to reduce noise hits. We model picking a set of vector seeds as an integer programming problem, and give algorithms to choose such a set of seeds. A good set of vector seeds we have chosen allows four times fewer false positive...
This paper presents an efficient algorithm for aligning aquery amino-acid sequence to a protein 3D structure template. Solving this problem is one of the main steps of the methods of protein structure prediction by threading. We propose an integer programming model and solve it by branch-and-bound algorithm. The bounds are computed using a Lagrangian dual of the model which turns out to be much easier...
Protein-protein interfaces, which are regions of interaction between two protein molecules, contain information about patterns of interacting functional groups. Recognition of such patterns is useful both for prediction of binding partners and for the development of drugs that can interfere with the formation of the protein-protein complex. We present a novel method, Interface-to-Interface (I2I)-SiteEngine,...
The problem of identifying sequence domains is essential for understanding protein function. Most current methods for protein domain identification rely on prior knowledge of homologous domains and construction of high quality multiple sequence alignments. With rapid accumulation of enormous data from genome sequencing, it is important to be able to automatically determine domain regions from a set...
We give an algorithm that locally improves the fit between two proteins modeled as space-filling diagrams. The algorithm defines the fit in purely geometric terms and improves by applying a rigid motion to one of the two proteins. Our implementation of the algorithm takes between three and ten seconds and converges with high likelihood to the correct docked configuration, provided it starts at a position...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.