As many diseases are known to be related to microbes, interest in statistical methods for Microbiome-Wide Association Studies (MWAS) is also increasing. Accordingly, we systematically investigate the properties of statistical methods for MWAS and compare their performance using simulation data generated from Human Microbiome Project data. We first assessed the type I error rates of eight commonly...
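The type I error assessment described above can be illustrated with a small, self-contained sketch: simulate many datasets under the null hypothesis, apply a test, and count how often it rejects at the nominal level. This is a generic illustration using a simple permutation test on Gaussian data, not the MWAS-specific methods or Human Microbiome Project data from the abstract.

```python
import random

random.seed(0)

def permutation_test(x, y, n_perm=200):
    """Two-sample permutation test on the difference of means.

    Returns a p-value: the fraction of label permutations whose mean
    difference is at least as extreme as the observed one.
    """
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = x + y
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(sum(px) / len(px) - sum(py) / len(py)) >= observed:
            hits += 1
    return hits / n_perm

def type_i_error_rate(n_datasets=200, n=15, alpha=0.05):
    """Simulate datasets under the null (both groups drawn from the
    same distribution) and report how often the test rejects."""
    rejections = 0
    for _ in range(n_datasets):
        x = [random.gauss(0, 1) for _ in range(n)]
        y = [random.gauss(0, 1) for _ in range(n)]
        if permutation_test(x, y) < alpha:
            rejections += 1
    return rejections / n_datasets

rate = type_i_error_rate()
# For a well-calibrated test, the rejection rate under the null
# should be close to the nominal alpha (here 0.05).
```

The same scaffold works for any test statistic: swap `permutation_test` for the method under study and compare the empirical rejection rate to the nominal level.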
The Oxford Nanopore and PacBio SMRT sequencing technologies have revolutionized the Next-Generation Sequencing (NGS) environment by producing long reads that exceed 60 kbp, and have contributed to the completion of many biological projects. However, long reads are characterized by a high error rate, which increases the difficulty of biological problems such as genome assembly. Error correction of long reads...
Mate-pair sequencing is a technology for sequencing the two ends of long DNA fragments, and has been widely used in genome scaffolding. Although the cost of mate-pair sequencing is now affordable, its accuracy is limited by lower read quality and contamination. Third-generation sequencing can generate long reads for genome scaffolding; however, its error rates and cost are still too high...
Tuning bioinformatics pipelines and training software parameters require sequencing data with a known ground truth, which is difficult to obtain from real sequencing data. In particular, for applications that detect low-frequency variants (such as ctDNA sequencing), it is hard to tell whether a called variant is a true positive, or a false positive caused by errors from sequencing or other...
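A minimal sketch of how such ground-truth data can be produced: extract reads from a reference and inject substitution errors at known positions, so every difference in the output is a recorded, verifiable truth. The reference, read length and error rate below are arbitrary toy values, not the simulator from the abstract.

```python
import random

random.seed(1)
BASES = "ACGT"

def simulate_read(reference, start, length, error_rate):
    """Extract a read from `reference` and inject substitution errors
    at a known rate, returning the read plus the ground-truth list of
    (position_in_read, original_base, observed_base) errors."""
    read = list(reference[start:start + length])
    truth = []
    for i, base in enumerate(read):
        if random.random() < error_rate:
            observed = random.choice([b for b in BASES if b != base])
            truth.append((i, base, observed))
            read[i] = observed
    return "".join(read), truth

# Toy reference and a single simulated read with a 10% error rate.
reference = "".join(random.choice(BASES) for _ in range(1000))
read, truth = simulate_read(reference, start=100, length=150, error_rate=0.1)
# Every recorded error position really differs from the reference,
# and every other position matches it exactly.
```

Because the error list is known, any variant caller run on such reads can be scored exactly: calls at recorded positions are true positives, everything else is a false positive.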
One approach to correcting the errors produced by third-generation sequencing technologies is to exploit the high coverage of high-quality short reads generated by second-generation sequencing. This paper presents a new approach to error correction and de novo assembly for long reads. We present MiRCA, a hybrid approach based on sequence alignments that detects and corrects...
Cancer classification based on molecular-level investigation has gained the interest of researchers, as it provides a systematic, accurate and objective diagnosis for different cancer types. It has also been applied in a wide range of applications such as drug discovery and cancer prediction and diagnosis, which are very important issues for cancer treatment. Besides, it helps in understanding the function...
Next-generation sequencing (NGS) technologies have superseded the traditional Sanger sequencing approach in many experimental settings, given their tremendous yield and affordable cost. Nowadays it is possible to sequence any microbial organism or metagenomic sample within hours, and to obtain a whole human genome in weeks. Nonetheless, NGS technologies are error-prone. Correcting errors is a challenge...
While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read length and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, we make progress on this...
Intercellular heterogeneity serves as both a confounding factor in studying individual clones and an information source in characterizing heterogeneous tissues such as blood and tumor systems. Due to inevitable sequencing errors and other technical artifacts such as PCR errors, systematic efforts to characterize intercellular genomic heterogeneity must effectively distinguish genuine clonal sequences...
Bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap in small-sample settings. However, its performance can deteriorate in the high-dimensional settings prevalent in Genomic Signal Processing. We propose here a modification of Bolstered error estimation that is based on the principle of Naive Bayes. Rather than attempting to estimate a...
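For context, a minimal Monte Carlo sketch of plain bolstered resubstitution, the baseline the abstract builds on, not the proposed Naive Bayes modification: each training point is replaced by a Gaussian bolstering kernel, and the error estimate is the average kernel mass the classifier assigns to the wrong label. The 1-D data, threshold classifier and kernel width below are illustrative assumptions.

```python
import random

random.seed(2)

def bolstered_resubstitution(samples, labels, classify, sigma, n_mc=500):
    """Monte Carlo bolstered resubstitution error for a 1-D classifier.

    Each training point is replaced by a Gaussian 'bolstering' kernel of
    width sigma; the estimate averages, over points, the fraction of
    kernel draws that the classifier labels incorrectly."""
    total = 0.0
    for x, y in zip(samples, labels):
        wrong = sum(classify(random.gauss(x, sigma)) != y for _ in range(n_mc))
        total += wrong / n_mc
    return total / len(samples)

# Toy 1-D problem: class 0 near -1, class 1 near +1, threshold at 0.
samples = [-1.2, -0.9, -1.1, 1.0, 0.8, 1.3]
labels = [0, 0, 0, 1, 1, 1]
classify = lambda x: int(x > 0)

plain_resub = sum(classify(x) != y
                  for x, y in zip(samples, labels)) / len(samples)
bolstered = bolstered_resubstitution(samples, labels, classify, sigma=0.5)
# Plain resubstitution is 0.0 on this separable sample; the bolstered
# estimate is small but positive, reflecting points near the boundary.
```

The contrast shows why bolstering helps in small samples: plain resubstitution reports zero error for any separable training set, while the kernel mass leaking across the decision boundary restores a nonzero, less optimistic estimate.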
With the development of DNA microarray technology, scientists can now measure gene expression levels. However, such high-throughput microarray technologies produce long lists of genes from small sample sizes, with many noisy genes. The data need further analysis, and interpreting information on biological processes requires a lot of practice and is usually time-consuming. Most of the traditional...
A viral quasispecies is a set of related variants in a virus population (e.g. from an infected patient) that contain similar mutations due to the rapid, mutation-prone replication of viruses. The characterization of viral quasispecies in a highly divergent virus population is of great interest in biomedical research, in particular to identify virulent and drug-resistant mutations in...
Next-generation sequencing (NGS) technologies are laying the foundations for a new paradigm in genomics and transcriptomics. Nowadays it is possible to sequence any microbial organism or metagenomic sample within hours, and to obtain a whole human genome in less than a month. Sequencing prices are decreasing dramatically, opening the way to actual personalised medicine. NGS technologies, however, are error-prone,...
The problem of inferring family trees, or pedigree reconstruction, for a group of individuals has attracted a lot of attention recently. Various methods have been proposed to automate the process of pedigree reconstruction given the genotypes or haplotypes of a set of individuals. The state-of-the-art method IPED is able to reconstruct large pedigrees with reasonable accuracy. However, the algorithm...
Biomarker discovery and classification in medical applications both typically involve feature selection applied to a small-sample high-dimensional dataset. Recent work has proposed a framework to integrate a prior over an uncertainty class of parameterized feature-label distributions with training data to obtain optimal classifiers, MMSE classifier error estimates, and evaluate the MSE of error estimates...
The development of statistical pathway analysis methods has focused on testing individual main effects of genes in a pathway on disease. However, gene-gene interactions can also play an important role in complex disease etiology. We developed a pathway analysis method based on a protein-protein interaction network to account for gene-gene interactions in a pathway. We used simulations to evaluate...
We study the problem of base calling in next-generation DNA sequencing platforms that rely on reversible terminator chemistry. After reviewing a statistical model of the generated signal and the Viterbi algorithm for finding the maximum-likelihood solution to the base calling problem, we present a closed-form expression for the upper bound on the probability of base calling error. Simulation results...
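A log-space Viterbi implementation for a generic HMM can illustrate the maximum-likelihood decoding step mentioned above; the two-state alphabet and the emission and transition probabilities in the toy example are assumptions for illustration, not the paper's signal model for reversible terminator chemistry.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Maximum-likelihood state sequence for an HMM (log-space Viterbi).

    obs: sequence of observations; states: list of hidden states;
    log_init[s], log_trans[s][t], log_emit[s][o] are log-probabilities."""
    # delta[s] = best log-probability of any path ending in state s.
    delta = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for t in states:
            best_s = max(states, key=lambda s: prev[s] + log_trans[s][t])
            delta[t] = prev[best_s] + log_trans[best_s][t] + log_emit[t][o]
            ptr[t] = best_s
        back.append(ptr)
    # Trace the best path backwards through the stored pointers.
    state = max(states, key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy two-base model: observations 'a'/'c' favour states 'A'/'C'.
states = ["A", "C"]
log_init = {"A": math.log(0.5), "C": math.log(0.5)}
log_trans = {s: {t: math.log(0.5) for t in states} for s in states}
log_emit = {"A": {"a": math.log(0.9), "c": math.log(0.1)},
            "C": {"a": math.log(0.1), "c": math.log(0.9)}}
decoded = viterbi("aacc", states, log_init, log_trans, log_emit)
# -> ['A', 'A', 'C', 'C']: each symbol decoded to its most likely base.
```

In a real base caller the states would be the four bases per sequencing cycle and the emissions would come from the platform's fluorescence-intensity model; the dynamic program itself is unchanged.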
High-throughput genotyping technology has made genome-wide association studies possible. Single nucleotide polymorphism (SNP) data derived from array-based technology are usually flawed by missing data, although they generally have high call rates and good concordance rates across different genotype calling schemes. Missing SNPs can bias the results of association analyses, and hence loci with missing...
DNA sequencing technology has played an important role in the life sciences, especially Illumina's Solexa sequencer, which has been used for more and more genome projects. Solexa libraries are usually constructed with insert sizes of 200 bp, 500 bp, 2 kb, 5 kb and 10 kb in genome projects. It remains a problem how to find the optimal combination of different insert sizes and different depths of Solexa sequencing libraries...
Error estimation is a crucial part of any classification problem and it becomes problematic with small samples. In this paper, we analyze the performance of some widely used error estimation methods relative to the complexity of the feature-label distribution: resubstitution, 10-fold cross validation with repetition (CV10r), leave-one-out (LOO), bootstrap .632, and bolstered resubstitution. Our definition...
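Two of the estimators compared above, resubstitution and leave-one-out, can be sketched for a toy 1-D nearest-mean classifier; the classifier and the Gaussian sample below are illustrative assumptions, not the paper's experimental setup.

```python
import random

random.seed(3)

def fit_nearest_mean(xs, ys):
    """Fit a two-class nearest-mean classifier on 1-D data."""
    c0 = [x for x, y in zip(xs, ys) if y == 0]
    c1 = [x for x, y in zip(xs, ys) if y == 1]
    mu0, mu1 = sum(c0) / len(c0), sum(c1) / len(c1)
    return lambda x: int(abs(x - mu1) < abs(x - mu0))

def resubstitution_error(xs, ys):
    """Error of the classifier evaluated on its own training data."""
    clf = fit_nearest_mean(xs, ys)
    return sum(clf(x) != y for x, y in zip(xs, ys)) / len(xs)

def loo_error(xs, ys):
    """Leave-one-out: hold out each point, retrain, test on it."""
    errors = 0
    for i in range(len(xs)):
        clf = fit_nearest_mean(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors += clf(xs[i]) != ys[i]
    return errors / len(xs)

# Small sample: class 0 centred at -1, class 1 at +1.
xs = [random.gauss(-1, 1) for _ in range(10)] + \
     [random.gauss(1, 1) for _ in range(10)]
ys = [0] * 10 + [1] * 10
r, l = resubstitution_error(xs, ys), loo_error(xs, ys)
# Resubstitution is typically optimistically biased (low), while
# leave-one-out is nearly unbiased but has higher variance in small samples.
```

Repeating this over many drawn samples and plotting each estimate against the true error is the standard way to visualise the bias/variance trade-off the abstract analyses.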