Noise is a prominent challenge found in many bioinformatics datasets and it refers to erroneous or missing data. The presence of noise in gene expression datasets has adverse effects on machine-learning techniques, such as supervised classification algorithms and feature selection techniques. Additionally, the identification of noise and its quantification are challenging tasks that require a proper...
Class imbalance is a significant challenge that practitioners in the field of bioinformatics face on a daily basis. It occurs when the number of instances of one class is much greater than the number of instances of the other class(es), and it has adverse effects on the performance of classification models built on this skewed data. Random Forest as a robust classifier has been...
Document classification or document categorization is one of the most studied areas in computer science due to its importance. The problem is to assign a document using its text to one or more classes or categories from a predefined set. We propose a new approach for fast text classification using randomized explicit semantic analysis (RS-ESA). It is based on a state of the art approach for word sense...
Sentiment analysis of tweets requires the ability to reliably and accurately identify the emotional polarity (positive or negative) of instances. This can be challenging, particularly when the data quality is questionable due to noise or imbalance. Ensemble learning algorithms have been shown to offer superior performance compared to non-ensemble techniques in many domains, but have not been thoroughly...
Learning from imbalanced data sets is a hot and challenging research topic with many real world applications. Many studies have been conducted on integrating sampling-based techniques and ensemble learning for imbalanced data sets. However, most existing sampling methods suffer from the problems of information loss, over-fitting, and additional bias. Moreover, there is no single model that can be...
Ensemble learning is a powerful tool that has shown promise when applied to bioinformatics datasets. In particular, the Random Forest classifier has been an effective and popular algorithm due to its relatively good classification performance and its ease of use. However, Random Forest does not account for class imbalance, which is known to decrease classification performance and increase...
In this work, we present results produced from a nonlinear QSAR model developed and implemented using evolutionary computation and Random Forest Regression to study the effectiveness of dimeric Aryl β-Diketo Acids on HIV-1 Integrase enzyme inhibition. Dimeric Aryl β-Diketo Acids have been proven to be effective inhibitors of the biological mechanism of protein transfer known as HIV-integrase. This...
In this paper, we propose a novel approach for reader-emotion categorization using word embedding learned from neural networks and an SVM classifier. The primary objective of such word embedding methods involves learning continuous distributed vector representations of words through neural networks. It can capture semantic context and syntactic cues, and subsequently be used to infer similarity measures...
To enjoy more social network services, users nowadays are usually involved in multiple online social networks simultaneously. The users shared between different networks are called anchor users, while the remaining unshared users are called non-anchor users. Connections between accounts of anchor users in different networks are defined as anchor links and networks partially aligned by anchor links...
Social media is becoming a critical avenue for businesses today to target new customers and create brand loyalty. In order to target users effectively, companies need to know basic information about their users. However, in many cases, user profiles are either incomplete or completely wrong, and one of the most critical pieces of private information is gender. In this paper we examine the case of...
The Harmonized Tariff Schedule for the classification of goods is a major determinant of customs duties and taxes. The basic HS Code is 6 digits long but can be extended according to the needs of individual countries, such as applying customs duties based on details of the product. Finding the correct, consistent, legally defensible HS Code is at the heart of Import Compliance. However, finding the best...
Sentiment classification of tweets is used for a variety of social sensing tasks and provides a means of discerning public opinion on a wide range of topics. A potential concern when performing sentiment classification is that the training data may contain class imbalance, which can negatively affect classification performance. A classifier trained on imbalanced data may be biased in favor of the...
This paper describes the iPReS project, which provides a web service-based framework for i18n-type (internationalized) access to scientific data products and product metadata contained within the NASA Jet Propulsion Laboratory Physical Oceanography Distributed Active Archive Center, otherwise known as PO.DAAC. PO.DAAC is an element of the EOSDIS, which freely provides science data to the global community...
For the functioning of American democracy, the Lobbying Disclosure Act (LDA), for the very first time, provides data to empirically research interest groups' behaviors and their influence on congressional policymaking. One of the main research challenges is to automatically find the topic(s), by short and sparse text classification, in a large corpus of unorganized, semi-structured, and poorly...
Galaxies in the universe are commonly classified by their morphology, or visual appearance. The morphology of a galaxy tells us about the history and physical make-up of the galaxy. With the fast pace at which digital galaxy images are captured and a slow and biased human pattern recognition process, finding an efficient way to automate the galaxy image classification process can help advance the...
Data mining and machine learning methods have been playing an important role in searching and retrieving multimedia information from all kinds of multimedia repositories. Although some of these methods have been proven to be useful, it is still an interesting and active research area to effectively and efficiently retrieve multimedia information under difficult scenarios, i.e., detecting rare events...
Tens of thousands of pictures are taken at different locations throughout the year. People often visit places and take pictures to remember their visits. We believe that the seasonal travel patterns of people to specific locations will create a correlation between a location and the season of the images taken in that location. For example, fewer people visit Bear Valley, California during the summer...
While the regular treatment for wrist stiffness is physical therapy or surgery, researchers are looking for an alternative, more efficient and automatic procedure by means of robotic applications. In this paper, we propose a low-cost system exploiting a haptic interface aided by a glove fitted with sensors at the wrist, allowing the identification of the wrist orientation. In this way, by using virtual reality,...