Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally,...
Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known...
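As a rough illustration of computing Euclidean distance with a SQL query plus a user-defined function, the sketch below uses Python's built-in `sqlite3` module. It is not the technique proposed in the abstract above: the table layout (`points`, `refs`, one row per coordinate), the UDF name `udf_sqrt`, and the helper `pairwise_distances` are all hypothetical choices made for this example.

```python
import math
import sqlite3


def pairwise_distances(points, refs):
    """Return {(point_id, ref_id): Euclidean distance}, computed inside SQLite.

    `points` and `refs` map an integer id to a list of coordinates.
    The table and column names are hypothetical, chosen for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    # Register a scalar UDF: not every SQLite build ships a sqrt() function.
    conn.create_function("udf_sqrt", 1, math.sqrt)

    # Vertical layout: one row per (id, dimension, value).
    conn.execute("CREATE TABLE points (pid INTEGER, dim INTEGER, val REAL)")
    conn.execute("CREATE TABLE refs (rid INTEGER, dim INTEGER, val REAL)")
    conn.executemany(
        "INSERT INTO points VALUES (?, ?, ?)",
        [(pid, d, float(v)) for pid, vec in points.items() for d, v in enumerate(vec)],
    )
    conn.executemany(
        "INSERT INTO refs VALUES (?, ?, ?)",
        [(rid, d, float(v)) for rid, vec in refs.items() for d, v in enumerate(vec)],
    )

    # Join on the dimension index, sum squared differences per pair,
    # then take the square root through the UDF.
    rows = conn.execute(
        """
        SELECT p.pid, r.rid,
               udf_sqrt(SUM((p.val - r.val) * (p.val - r.val))) AS dist
        FROM points AS p
        JOIN refs AS r ON p.dim = r.dim
        GROUP BY p.pid, r.rid
        """
    ).fetchall()
    conn.close()
    return {(pid, rid): dist for pid, rid, dist in rows}


if __name__ == "__main__":
    distances = pairwise_distances({1: [0, 0], 2: [3, 4]}, {10: [0, 0]})
    for pair, dist in sorted(distances.items()):
        print(pair, dist)
```

The one-row-per-coordinate layout lets a single join and `GROUP BY` handle any number of dimensions, at the cost of more rows to scan. A one-column-per-dimension layout would trade that flexibility for fewer rows.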
We present a multiple-instance regression algorithm that models internal bag structure to identify the items most relevant to the bag labels. Multiple-instance regression (MIR) operates on a set of bags with real-valued labels, each containing a set of unlabeled items, in which the relevance of each item to its bag label is unknown. The goal is to predict the labels of new bags from their contents...
Clustering is an active research topic in data mining and different methods have been proposed in the literature. Most of these methods are based on the use of a distance measure defined either on numerical attributes or on categorical attributes. However, in fields such as road traffic and medicine, datasets are composed of numerical and categorical attributes. Recently, there have been several proposals...
Monitoring applications play an increasingly important role in many domains. They detect events in monitored systems and take actions such as invoking a program or notifying an administrator. Often administrators must then manually investigate events to figure out the source of a problem. Stream processing engines (SPEs) are general-purpose data management systems for monitoring applications. They provide...
This paper describes how distributed data mining models, such as collective learning, ensemble learning, and meta-learning models, can be implemented as WSRF mining services by exploiting the Grid infrastructure. Our goal is to design a general distributed architectural model that can be exploited for different distributed mining algorithms deployed as Grid services for the analysis of dispersed data...
Automatic indexing of music by instruments and their types is a challenging problem, especially when multiple instruments are playing at the same time. We have built a database containing more than one million musical instrument sounds, each described by a large number of features including standard MPEG-7 audio descriptors, features for speech recognition, and many new audio features developed by...
This demonstration concerns a system designed and implemented to automatically build multimodal aggregations of informative news items coming from the two domains of digital television and the Web. Though in recent times several technological solutions have addressed the problem of clustering online articles, little is available which is capable of integrating these two sources of information. The...
In this paper, a new algorithm for document clustering, called CStar, is presented. This algorithm improves on recently developed algorithms such as the generalized star (GStar) and ACONS algorithms, originally proposed to reduce some drawbacks of previous Star-like algorithms. The CStar algorithm uses the condensed star-shaped sub-graph concept defined by ACONS, but defines a new heuristic that...
Data clustering has been proven to be a promising data mining technique. Recently, there have been many attempts at clustering market-basket data. In this paper, we propose a parallelized hierarchical clustering approach on market-basket data (PH-Clustering), which is implemented using MPI. Based on the analysis of the major clustering steps, we adopt a partial local and partial global approach to...
This paper addresses the problem of detecting and tracking moving clusters in spatio-temporal data sets. Spatio-temporal data sets contain data elements that move in space over time. Traditional data clustering algorithms work well on static data sets that contain well-separated clusters. When traditional techniques are applied to spatio-temporal data, they break down when the moving data elements intersect...
In the problem of multi-view face clustering, the similarity between faces of different persons in a similar pose is usually greater than the similarity between multi-view faces of the same person. This may strongly affect the clustering result that is sent back to the user. To solve this problem, we should do pose clustering first and then, within each 'pose group', clustering...
In this paper, we present a new algorithm for semisupervised clustering. We assume that a small set of labeled samples is available and use it in a clustering algorithm to discover relevant patterns. We study how our algorithm performs against two other semisupervised algorithms when the data are multimodal. Then, we study the case where the user is able to produce few samples for some classes but not for each...
Constraint-based mining has been proven to be extremely useful. It has been applied not only to many pattern discovery settings (e.g., sequential pattern mining) but also, recently, to classification and clustering tasks. It appears as a key technology for an inductive database perspective on knowledge discovery in databases (KDD), and constraint-based mining is indeed an answer...
The theoretical relationship between association rules and machine learning techniques needs to be studied in more depth. This article studies the use of clustering as a model for association rule mining. The clustering model is exploited to bound and estimate association rule support and confidence. We first study the efficient computation of the clustering model with K-means; we show the sufficient...