Learning from imbalanced datasets is a well-known problem in the data mining community. Many techniques have been proposed to alleviate the problems associated with class imbalance, including data sampling and boosting. While data sampling has received the bulk of the attention from the research community, our results show that boosting often results in better classification performance than even...
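As a rough illustration of what "data sampling" can mean in this setting, the sketch below applies random undersampling, which discards majority-class examples until every class is the same size. This is only one simple sampling technique, chosen here for illustration; the abstract does not say which sampling methods the authors evaluated, and the function name `random_undersample` and the toy data are assumptions.

```python
import random
from collections import Counter

def random_undersample(examples, labels, rng):
    """Illustrative random undersampling (hypothetical helper, not from the paper).

    Keeps a random subset of each class so that every class ends up
    with as many examples as the smallest class.
    """
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Sample without replacement within each class.
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return balanced

# Toy data: 2 minority examples (label 1) and 8 majority examples (label 0).
examples = list(range(10))
labels = [1, 1] + [0] * 8
balanced = random_undersample(examples, labels, random.Random(0))
counts = Counter(y for _, y in balanced)
```

After the call, `counts` holds the same number of examples for both labels. The price is that most majority-class examples are thrown away, which is one reason alternatives such as boosting are worth comparing.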
The problem of class imbalance in machine learning is quite real and cumbersome when it comes to building a useful and practical classification model. We present a unique insight into addressing class imbalance for classification problems that involve three or more categories, i.e. non-binary problems. This study differs from related works in the literature because most works focus on addressing class...
Boosting has been shown to improve the performance of classifiers in many situations, including when data is imbalanced. There are, however, two possible implementations of boosting, and it is unclear which should be used. Boosting by reweighting is typically used, but can only be applied to base learners which are designed to handle example weights. On the other hand, boosting by resampling can be...
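The difference between the two implementations can be sketched in a few lines. The example below assumes an AdaBoost-style weight update, which the abstract does not specify; the function names and toy data are hypothetical. In boosting by reweighting, the updated weights would be passed directly to a base learner that accepts example weights. In boosting by resampling, the weights are instead used as sampling probabilities to draw a new training set, so any base learner can be used.

```python
import math
import random

def adaboost_weight_update(weights, is_correct):
    """One AdaBoost-style round (assumed update rule, for illustration).

    `weights` must sum to 1. Returns the learner's vote `alpha` and the
    new normalized example weights.
    """
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    raw = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, is_correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

def resample_by_weight(examples, weights, rng):
    """Boosting by resampling: draw a same-size training set, with
    replacement, where each example's chance is proportional to its weight."""
    return rng.choices(examples, weights=weights, k=len(examples))

examples = ["a", "b", "c", "d"]
weights = [0.25] * 4
# Suppose the current base learner misclassified only example "d".
alpha, weights = adaboost_weight_update(weights, [True, True, True, False])
sample = resample_by_weight(examples, weights, random.Random(0))
```

After the update, the single misclassified example carries half of the total weight, so a resampled training set is expected to contain it about as often as the other three examples combined.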
It is difficult to learn good classifiers when training data is missing attribute values. Conventional techniques for dealing with such omissions, such as mean imputation, generally do not significantly improve the performance of the resulting classifier. We proposed imputation-helped classifiers, which use accurate imputation techniques, such as Bayesian multiple imputation (BMI), predictive mean...
This study investigates the impact of increasing levels of simulated class noise on software quality classification. Class noise was injected into seven software engineering measurement datasets, and the performance of three learners, random forests, C4.5, and Naive Bayes, was analyzed. The random forest classifier was utilized for this study because of its strong performance relative to well-known...
An improvement to public-resource e-science portals shows promise in solving a well-known dilemma: how to dynamically discover a provider PC that is ready to deliver computing power when the scientific community requires it.
Class imbalance tends to cause inferior performance in data mining learners. Evolutionary sampling is a technique which seeks to counter this problem by using genetic algorithms to evolve a reduced sample of a complete dataset to train a classification model. Evolutionary sampling works to remove noisy and duplicate instances so that the sampled training data will produce a superior classifier. We...
A practical problem in data mining and machine learning is the limited availability of data. For example, in a binary classification problem it is often the case that examples of one class are abundant, while examples of the other class are in short supply. Examples from one class, typically the positive class, can be limited due to the financial cost or time required to collect these examples. This...
A new rule-based classification model (RBCM) and rule-based model selection technique are presented. The RBCM utilizes rough set theory to significantly reduce the number of attributes, discretization to partition the domain of attribute values, and Boolean predicates to generate the decision rules that comprise the model. When the domain values of an attribute are continuous and relatively large, rough...
Collaborative filtering (CF) is one of the most successful approaches for recommendation. In this paper, we propose two hybrid CF algorithms, sequential mixture CF and joint mixture CF, each combining advice from multiple experts for effective recommendation. These proposed hybrid CF models work particularly well in the common situation when data are very sparse. By combining multiple experts to form...
The performance of classification models can be negatively impacted if the data on which they are trained contains very rare events. While recent research has investigated the issue of class imbalance, few if any studies address issues related to the handling of extreme imbalance (rare events), where the minority class can account for as little as 0.1% of the training data. This work investigates...
This paper discusses a comprehensive suite of experiments that analyze the performance of the random forest (RF) learner implemented in Weka. RF is a relatively new learner, and to the best of our knowledge, only preliminary experimentation on the construction of random forest classifiers in the context of imbalanced data has been reported in previous work. Therefore, the contribution of this study...
Missing values are commonly encountered in software measurement data, and k nearest neighbor imputation (kNNI) is one of the most popular imputation procedures used by researchers and practitioners in empirical software engineering. Imputation techniques are used to replace missing values with one or more alternatives. Traditionally, kNNI uses only complete cases as possible donors for imputation...
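The traditional complete-case variant described above can be sketched as follows. This is a minimal illustration, not the procedure evaluated in the paper: it assumes numeric attributes, Euclidean distance over the attributes observed in the incomplete row, the mean of the donors as the imputed value, and at least one complete case. The name `knn_impute` and the toy data are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows.

    Only complete cases (rows without None) serve as donors. Assumes at
    least one complete case exists.
    """
    donors = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def dist(donor):
            # Distance uses only attributes observed in the incomplete row.
            return math.sqrt(sum((row[i] - donor[i]) ** 2 for i in observed))

        nearest = sorted(donors, key=dist)[:k]
        imputed.append([
            v if v is not None else sum(d[i] for d in nearest) / len(nearest)
            for i, v in enumerate(row)
        ])
    return imputed

# Toy data: the last row is missing its second attribute.
data = [[1.0, 2.0], [2.0, 4.0], [10.0, 20.0], [1.5, None]]
filled = knn_impute(data, k=2)
```

Here the two nearest donors for the incomplete row are the first two rows, so the missing value becomes the mean of 2.0 and 4.0. Restricting donors to complete cases keeps the code simple, but it shrinks the donor pool as the amount of missing data grows.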
We present an unsupervised multiscale color image segmentation algorithm. The basic idea is to apply mean shift clustering to obtain an over-segmentation and then merge regions at multiple scales to minimize the minimum description length criterion. The performance on the Berkeley segmentation benchmark compares favorably with some existing approaches.
Ensuring that the desired software quality and reliability are met for a project is as important as delivering it within the scheduled budget and time. This is especially vital for high-assurance software systems where software failures can have severe consequences. To achieve the desired software quality, practitioners utilize software quality models to identify high-risk program modules: e.g., software...
Intrusion detection in wireless networks has become an indispensable component of any useful wireless network security system, and has recently gained attention in both the research and industry communities due to the widespread use of wireless local area networks (WLANs). This paper focuses on detecting intrusions or anomalous behaviors in WLANs with data clustering techniques. We first explore the security...