The performance of user profiling models depends on both the predictive accuracy and the cost of incorrect predictions. In this paper we study whether including contextual information leads to a decrease in the misclassification cost. Several experimental analyses were done by varying the cost ratio, the market granularity and the granularity of context. The experimental results show that context...
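As a rough illustration of how the cost of incorrect predictions can change with the cost ratio, the sketch below sums error costs for binary labels. It is not the paper's method: the function name, the toy labels, and the convention that the cost ratio is the false-negative cost divided by the false-positive cost are all assumptions made for this example.

```python
def total_misclassification_cost(y_true, y_pred, fp_cost=1.0, fn_cost=1.0):
    """Sum the cost of wrong predictions for binary labels (1 = positive)."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 0:
            cost += fp_cost  # false positive
        elif pred == 0 and truth == 1:
            cost += fn_cost  # false negative
    return cost


# Toy data (hypothetical): one false positive and one false negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]

# Vary the assumed cost ratio (fn_cost / fp_cost) while holding fp_cost at 1.
for ratio in (1, 5, 10):
    cost = total_misclassification_cost(y_true, y_pred, fp_cost=1.0, fn_cost=float(ratio))
    print(ratio, cost)
```

Two models with the same accuracy can therefore have different total costs once the two error types are priced differently.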
Error-reduction sampling (ERS) is a high performing (but computationally expensive) query selection strategy for active learning. Subset optimisation has been proposed to reduce computational expense by applying ERS to only a subset of examples from the pool. This paper compares techniques used to construct the subset, namely random sub-sampling and pre-filtering. We focus on pre-filtering which populates...
This paper presents a new keyword extraction algorithm for Chinese news Web pages using lexical chains and word co-occurrence combined with frequency features, cohesion features, and correlation features. A lexical chain is an external manifestation of the cohesion formed by semantically related words of a text, and represents the semantic content of a portion of the text. Word co-occurrence distribution...
For multi-view learning, existing methods usually exploit originally provided features for classifier training, which ignore the latent correlation between different views. In this paper, semantic features integrating information from multiple views are extracted for pattern representation. Canonical correlation analysis is used to learn the representation of semantic spaces where semantic features...
Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known...
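For orientation only, the computation in question can be sketched outside a DBMS. The example below is plain Python, not the SQL or UDF techniques the paper proposes, and the function name and sample points are hypothetical. It computes the Euclidean distance from every point to every reference point, which shows why the cost grows with the number of points, the number of references, and the dimensionality.

```python
import math


def euclidean_distances(points, references):
    """Return a matrix d where d[i][j] is the distance from points[i] to references[j]."""
    return [
        [
            math.sqrt(sum((p - r) ** 2 for p, r in zip(point, ref)))
            for ref in references
        ]
        for point in points
    ]


# Hypothetical sample data: two 2-D points and two reference points.
points = [(0.0, 0.0), (3.0, 4.0)]
references = [(0.0, 0.0), (3.0, 0.0)]
print(euclidean_distances(points, references))
```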
For many data mining applications, it is necessary to develop algorithms that use unlabeled data to improve the accuracy of supervised learning. Co-Training is a popular semi-supervised learning algorithm. It assumes that each example is represented by two or more redundantly sufficient sets of features (views) and that these views are independent given the class. However, these assumptions are not...
In many practical situations it is not feasible to collect labeled samples for all available classes in a domain. In supervised classification of remotely sensed images especially, it is impossible to collect ground truth information over large geographic regions for all thematic classes. As a result, analysts often collect labels for aggregate classes (e.g., Forest, Agriculture, Urban). In this paper...
Action rules describe possible transitions of objects from one state to another with respect to a distinguished attribute. Previous research on action rule discovery usually required the extraction of classification rules before constructing any action rule. This paper gives a new approach for generating association-type action rules. The notion of frequent action sets and Apriori-like strategy generating...
The reliability of an induced classifier can be affected by several factors including the data oriented factors and the algorithm oriented factors. In some cases, the reliability could also be affected by knowledge oriented factors. In this paper, we analyze three special cases to examine the reliability of the discovered knowledge. Our case study results show that (1) in the cases of mining from...
Research on streaming data classification has been mostly based on the assumption that data can be fully labelled. However, this is impractical. Firstly, it is impossible to complete the labelling before all data has arrived. Secondly, it is generally very expensive to obtain fully labelled data using manpower. Thirdly, user interests may change with time, so the labels issued earlier may be inconsistent...
Recently, the learning-to-rank technique has attracted much attention. However, the lack of labeled training data seriously limits its application in real-world tasks. In this paper, we propose to break this bottleneck by considering the cross-domain "transfer of rank learning" problem. Simultaneously, we propose a novel algorithm called TransRank, which can effectively utilize the labeled data...
Transductive learning is the learning setting that makes it possible to learn from "particular to particular" and to consider both labelled and unlabelled examples when making classification decisions. In this paper, we investigate the use of transductive learning in the context of hierarchical text categorization. To this end, we exploit a modified version of an inductive hierarchical learning framework...
Monitoring applications play an increasingly important role in many domains. They detect events in monitored systems and take actions such as invoking a program or notifying an administrator. Often administrators must then manually investigate events to figure out the source of a problem. Stream processing engines (SPEs) are general purpose data management systems for monitoring applications. They provide...
This paper addresses the problem of identifying the causal variables of a system anomaly. In complicated real-world systems, even experts often fail to specify causal factors, so they attempt to detect the anomaly with exploratory heuristics. Our goal is to offer further information that supports anomaly cause analysis using the incomplete empirical knowledge. The proposed technique discovers responsible...
Decision tree-based classification is a popular approach for pattern recognition and data mining. Most decision tree induction methods assume that the training data are present at one central location. Given the growth in distributed databases at geographically dispersed locations, methods for decision tree induction in distributed settings are gaining importance. This paper describes one distributed...
Automatic indexing of music by instruments and their types is a challenging problem, especially when multiple instruments are playing at the same time. We have built a database containing more than one million musical instrument sounds, each described by a large number of features including standard MPEG7 audio descriptors, features for speech recognition, and many new audio features developed by...
Although a huge amount of remote sensing data has been provided by Earth observation satellites, few techniques for data manipulation and information extraction in large data sets have been developed. In this context, the present paper presents a new system for spatial data mining, and two test cases applied to land use change in the Brazilian Amazon region. We present the operational environment...
A new feature description is used for human behaviour representation and recognition. The feature is based on Radon transforms of extracted silhouettes. Key postures are selected based on the Radon transform. Key postures are combined to construct an action template for each sequence. Linear discriminant analysis (LDA) is applied to the set of key postures to obtain low dimensional feature vectors...
If we can estimate the accuracy of our observations then we can estimate the true and false positive rates over a series of samples in high dimensional data mining problems. To date such issues have been largely neglected and previously no algorithm has been provided to facilitate the computations involved. In high dimensional data mining tasks, increasing sparsity leads to decreasing true positive...
Learning classifier systems (LCS) are machine learning systems designed to work for both multi-step and single-step decision tasks. The latter case presents an interesting, though not widely studied, challenge for such algorithms, especially when they are applied to real-world data mining problems. The present investigation departs from the popular approach of applying accuracy-based LCS to data mining...