As the number of data sources for public health surveillance continues to grow both in volume and variety, there is a need to develop data-driven machine learning tools that can automate discovery and aid decision makers in obtaining quantifiable insights on emerging disease spread phenomena. In this talk, we present an overview of scalable machine learning tools that we have been developing as part...
Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections are associated with unsafe injection practices, drug diversion, and other exposures to blood products. HCV outbreaks are difficult to detect and investigate because HCV infections can remain asymptomatic in >70% of infected persons for years, even decades. During the...
Scaffolding is an important stage of genome assembly that consists of orienting and ordering contigs based on read pairs. We present a scalable scaffolding algorithm that finds the most likely contig orientation using an integer linear program solved by a non-serial dynamic programming approach. We then formulate the problem of finding the most likely contig ordering as an optimization problem and propose a novel...
The Biomedical and Clinical (BC) research domain has evolved significantly in the last decade, quickly becoming a data-intensive field that requires sophisticated databases and data analysis tools. The constant growth of BC data has given rise to the notion of data-driven decision making. BC institutions typically use a wide range of modern diagnostic equipment that produces various types of biomedical...
A large volume of research has been done to uncover different characteristics of biological networks, such as large-scale organization, node centrality and network robustness. Nevertheless, the vast majority of research in this area assumes that biological networks have deterministic topologies. Biological interactions are, however, probabilistic events that may or may not appear at different cells...
Viruses can display high intra-patient genetic diversity. A single infected individual hosts billions of virus particles that can be summarized as a set of genetically different strains, called haplotypes, with their respective frequencies. The haplotype distribution, also known as a viral quasispecies, is a key determinant of virulence, pathogenesis, and treatment outcome.
In this study, we proposed an approach for constructing directed, regulatory gene set networks to reveal novel relationships among gene sets. Results from this study showed that regulatory gene set networks can provide complementary information to existing types of gene set networks and explain underlying mechanisms of a disease.
Parameterized probabilistic complex computational (P2C2) models are being increasingly used in computational systems biology for analyzing biological systems. A key challenge is to build mechanistic P2C2 models by combining prior knowledge and empirical data, given that certain system properties are unknown. These unknown components are incorporated into a model as parameters and determining their...
Record linkage or deduplication integrates records across multiple data sources. We propose sequential and parallel techniques for record linkage using complete linkage clustering. The key idea of these approaches is radix sorting and blocking on data attributes and producing a graph-based solution. These methods have been tested on real datasets as well as synthetic datasets. They identify records...
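The blocking and complete-linkage ideas named in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' method: it groups records with a dictionary rather than radix sorting, blocks on the first character of a single name field, uses edit distance as the similarity measure, and runs sequentially only. The function names (`edit_distance`, `complete_linkage`, `link_records`) and the threshold are hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def complete_linkage(items, threshold):
    """Agglomerative clustering: repeatedly merge the two closest clusters,
    where cluster distance is the LARGEST pairwise distance between members.
    Stops when the closest pair is farther apart than `threshold`."""
    clusters = [[s] for s in items]
    while len(clusters) > 1:
        dist, x, y = min(
            (max(edit_distance(a, b) for a in clusters[i] for b in clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if dist > threshold:
            break
        clusters[x].extend(clusters.pop(y))  # x < y, so index x stays valid
    return clusters


def link_records(names, threshold=2):
    """Block on the first character (assumes non-empty strings),
    then cluster each block independently."""
    blocks = defaultdict(list)
    for name in names:
        blocks[name[0].lower()].append(name)
    linked = []
    for key in sorted(blocks):
        linked.extend(complete_linkage(blocks[key], threshold))
    return linked


if __name__ == "__main__":
    print(link_records(["smith", "smyth", "smithe", "jones", "jonas", "brown"]))
```

Blocking keeps the quadratic pairwise comparison confined to small groups, and complete linkage ensures that every pair of records within a resulting cluster lies within the threshold of each other.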
Although a number of sequence database search tools and post-database search algorithms for filtering target PSMs have been developed, the discrepancy among the output PSMs is usually significant, leaving a few disputable PSMs. We employ an SVM-based learning model to search for the optimal weight for each target PSM and develop a new scoring system, C-Ranker, to rank all target PSMs. Compared with PeptideProphet...
DNA methylation is an important epigenetic mark relevant to normal development and disease genesis. A common approach to characterizing genome-wide DNA methylation is to use Next Generation Sequencing technology to sequence bisulfite-treated DNA. The short sequence reads are mapped to the reference genome to determine the methylation status of cytosines (Cs). However, despite intense effort, a much smaller proportion...
Although it is known that aligning short reads to reference genomes becomes harder if such genomes are embedded with complex repeat structures, there has been little effort to quantify this intuition. We investigated several measures of complexity, employed 10 popular short-read aligners to align a large number of diverse genomes, and found that unlike existing notions of complexity, a proposed notion...
The next generation sequencing technology has enabled the understanding of the whole genome of an organism at a greater coverage and reduced cost. In this study, we provide a systems biology approach to understand the functional relevance of the single nucleotide variants identified by the whole genome sequencing studies. This approach also includes a methodology for the identification of conserved...
The process of whole genome doubling (WGD) gives rise to two copies of each chromosome in a genome, containing the same genes in the same order. Through an attrition mechanism known as fractionation, one of each pair of duplicate genes is lost over evolutionary time, resulting in an interleaving pattern of deletions from duplicated regions [1]. This differentiates the WGD/fractionation model from...
The advance of high-throughput sequencing has made it one of the most important techniques for obtaining new transcriptomes in non-model organisms. In these studies, there is often a need to investigate the transcriptomes of two related organisms at the same time in order to find the similarities and differences between them. The traditional approach to address this problem is to perform de novo transcriptome...
Hepatitis C Virus (HCV) is the most common etiological cause of non-A/non-B blood-borne viral hepatitis and the leading cause for liver transplantation. The population of HCV-infected individuals in the US is estimated to be over 3 million. There are 7 major HCV genotypes with world-wide distribution, which are further grouped into numerous sub-genotypes. HCV genotype 1a is the most common genotype...