The rapid development of high-throughput sequencing technology provides unique opportunities for studies of transcription factor binding, while also bringing new computational challenges. Recently, a series of discriminative motif discovery (DMD) methods have been proposed and offer promising solutions for addressing these challenges. However, because of the huge computational cost, most of them have...
Genome-wide association studies have discovered many biologically important associations of genes with phenotypes. Typically, genome-wide association analyses formally test the association of each genetic feature (SNP, CNV, etc.) with the phenotype of interest and summarize the results with multiplicity-adjusted p-values. However, very small p-values only provide evidence against the null hypothesis...
Over the past few years, dysregulation of long intergenic non-coding RNAs (lincRNAs) has been shown to be linked with a wide variety of human diseases. However, only a few lincRNA-disease association inference tools are available, and most of them rely on very specific types of prior knowledge about the lincRNAs and the diseases. They fall short in generalized association predictions when...
A miniature inverted-repeat transposable element (MITE) is a type of class II non-autonomous transposable element that plays a crucial role in evolution. Development of bioinformatics tools that are capable of effectively identifying MITEs can enable genome-wide studies of MITE patterns in eukaryotes. Here, we present a fast, accurate and memory-efficient tool, MiteFinder, for...
Alignment of sequence reads is an important step of many bioinformatics workflows. While the alignment of short reads is well investigated, the alignment of long reads produced by third-generation sequencing technologies, such as Oxford Nanopore, is more challenging because they have high error rates (10–40%). Furthermore, due to their different algorithmic approaches, different tools produce varied...
Before genotyping microarrays can be used, calling algorithms must first be calibrated with a control set. Calling algorithms that evaluate hybridization intensity data on the basis of individual markers are better able to compensate for sequence specific variations. However, they require that the control set includes samples sufficient to exercise every marker in all of its allelic states. Minimizing...
In this paper, we present a graph search approach for identifying arbitrarily complex structural genomic variation. Our method leverages the ability of long reads (e.g. from Pacific Biosciences platforms) to span multiple breakpoints of complicated local rearrangements, allowing us to resolve small-scale complexities that may be overlooked by other tools. We applied our method to a subset of NA12878...
We study the problem of predicting human biogeographical ancestry using genomic data. While predicting continental-level ancestry from genomic information is relatively simple, distinguishing between individuals from closely associated subpopulations (e.g., from the same continent) remains difficult. In particular, we focus on the case where the analysis is constrained to using single nucleotide...
Genome-wide association studies have presented a promising way to understand the association between human genomes and complex traits. Many simple polymorphic loci have been shown to explain a significant fraction of phenotypic variability. However, challenges remain in explaining complex traits associated with multifactorial genetic loci, especially considering the confounding...
Differential gene expression analysis is one of the significant efforts in single cell RNA sequencing (scRNAseq) analysis to discover the specific changes in expression levels of individual cell types. Since scRNAseq exhibits multimodality, large amounts of zero counts, and sparsity, it is different from the traditional bulk RNA sequencing (RNAseq) data. The new challenges of scRNAseq data promote...
The genome-wide association study (GWAS), a primary approach for genetic studies, has been successfully applied to a variety of complex diseases, leading to the discovery of a substantial number of disease-associated loci. These discovered associations provide unprecedented opportunities for deepening our understanding of complex diseases, such as disease-associated risk variants, genes, and pathways. However,...
De novo assembly aims to reconstruct the genome of an unknown species. Many algorithms have been proposed for de novo assembly. Because of repetitive regions and sequencing errors, contigs usually contain a large number of misassemblies. Consequently, misassembly correction of contigs is a challenging and significant task, which has received considerable attention from researchers...
Database search is the main approach for identifying proteoforms using top-down tandem mass spectra. However, it is extremely slow to align a query spectrum against all protein sequences in a large database when the target proteoform that produced the spectrum contains post-translational modifications and/or mutations. As a result, efficient and sensitive protein sequence filtering algorithms are...
High-throughput next generation sequencing (NGS) technologies have created an opportunity for detecting copy number variations (CNVs) more accurately. However, efficient and precise detection of CNVs remains challenging due to high levels of noise and biases, data heterogeneity and the “big data” nature of NGS data. In this work, we introduce a novel preprocessing pipeline to improve the detection...
Structural variation is important in disease etiology and ecological adaptation. Prior work has focused on using either only short paired-end reads or a hybrid approach that combines long and short reads to detect structural variants. Few methods have focused solely on using long reads. Here, we aim to detect a specific type of structural variation, large inversions, using only raw PacBio long reads...
The invention of single-cell sequencing technology has opened a new way to delineate intra-tumor heterogeneity and trace the evolution of single cells at the molecular level. To meet the need for fast and convenient copy-number variation calling in single-cell sequencing data, a systematic protocol and a working pipeline are reported. The proposed pipeline consists of six...
Sequence overlap graphs, constructed based on suffix-prefix relationships between pairs of sequences, are an important data structure in computational biology. High throughput sequencers can read several million to a few billion DNA fragments in a single experiment, making the construction of overlap graphs for such datasets compute-intensive. In this paper, we present a Locality-Sensitive Hashing...
This work examines the validity of facial phenotypes as Autism Spectrum Disorders (ASD) biomarkers in boys with essential autism. A family-based association analysis framework is presented that uses previously identified facially-delineated (FD) clusters to examine the relationship between FD clusters and known ASD genes. The hypothesis is that there are certain genetic variants, single nucleotide polymorphisms...
Post-database searching is a key procedure for peptide spectrum matches (PSMs) in protein identification with mass spectrometry-based strategies. Although many machine learning-based approaches have been developed to improve the accuracy of peptide identification, further improvement remains challenging because of the poor quality of data samples. CRanker has shown its effectiveness and efficiency in terms...
Somatic copy number alterations (SCNAs) can be used to infer tumor subclonal populations in whole genome sequencing studies, where the read count ratios between tumor-normal paired samples usually serve as the proxy for inference. We found that, in a GC study, the GC contents and read count ratios on SCNA segments present a log-linear biased pattern. However, currently no subclonal inferring tools...