Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally,...
Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known...
Sales prediction is an important problem for different companies involved in manufacturing, logistics, marketing, wholesaling and retailing. Food companies are more concerned with sales prediction of products having a short shelf-life and seasonal changes in demand. The demand may depend on many hidden contexts, not given explicitly in the form of predictive features. Even if some changes are known...
We introduce s-kNN, a nearest-neighbor-based spatial data mining algorithm. It belongs to the class of vector-geometry-based algorithms that reason on complex spatial objects instead of point measurements. In contrast to most methods in this class, it performs on-the-fly spatial computations that cannot be replaced by a pre-processing step without sacrificing efficiency. The key is a partial evaluation...
Clustering is an active research topic in data mining and different methods have been proposed in the literature. Most of these methods are based on the use of a distance measure defined either on numerical attributes or on categorical attributes. However, in fields such as road traffic and medicine, datasets are composed of numerical and categorical attributes. Recently, there have been several proposals...
Transductive learning is the learning setting that permits learning from "particular to particular" and considering both labelled and unlabelled examples when making classification decisions. In this paper, we investigate the use of transductive learning in the context of hierarchical text categorization. To this end, we exploit a modified version of an inductive hierarchical learning framework...
Monitoring applications play an increasingly important role in many domains. They detect events in monitored systems and take actions such as invoking a program or notifying an administrator. Often administrators must then manually investigate events to figure out the source of a problem. Stream processing engines (SPEs) are general-purpose data management systems for monitoring applications. They provide...
Classical attribute-value descriptions induce a multi-dimensional geometric space. One way of computing the distance between descriptions in such a space consists in evaluating a Euclidean distance between tuples of coordinates. This is the ground on which a large part of the Machine Learning literature has built its methods and techniques. However, the complexity of some domains requires the use...
Unsupervised machine learning algorithms are used to perform statistical analysis of several transport and dispersion model runs which simulate emissions from a fixed source under different atmospheric conditions. A clustering algorithm is used to automatically group the results of the transport and dispersion simulations according to their respective cloud characteristics. Each cluster of clouds...
High-dimensional data presents a significant challenge to a broad spectrum of pattern recognition and machine-learning applications. Dimensionality reduction (DR) methods serve to remove unwanted variance and make such problems tractable. Several nonlinear DR methods, such as the well known ISOMAP algorithm, rely on a neighborhood graph to compute geodesic distances between data points. These graphs...
Case-based reasoning (CBR) is a methodology for problem solving, which suggests a solution to a new problem based on previously solved problems and their associated solutions. A key issue in this methodology is whether we can always trust the solutions suggested by a case-based reasoning system. This paper studies the reliability of CBR systems at an overall level first. Factors affecting the reliability...
In contrast with most Internet topology measurement research, our concern here is not to obtain a map as complete and precise as possible of the whole Internet. Instead, we claim that each machine's view of this topology, which we call an ego-centered view, is an object worthy of study in itself. We design and implement an ego-centered measurement tool, and perform radar-like measurements consisting of...
In the problem of face clustering with multi-views, the similarity between faces of different persons with similar pose is usually greater than the similarity between multi-view faces of the same person. This may exert a tremendous impact on the clustering result that is sent back to the user. To solve this problem, we should do pose clustering first and then within each 'pose group', clustering...
Most research on Internet topology is based on active measurement methods. A major difficulty in using these tools is that one comes across many unresponsive routers. Different methods of dealing with these anonymous nodes to preserve the connectivity of the real graph have been suggested. One of the more practical approaches involves using a placeholder for each unknown, resulting in multiple copies...
Sensor networks play an important role in applications concerned with environmental monitoring, disaster management, and policy making. Effective and flexible techniques are needed to explore unusual environmental phenomena in sensor readings that are continuously streamed to applications. In this paper, we propose a framework that makes it possible to detect outlier sensors and to efficiently construct outlier...
We present and discuss several spatiotemporal kernels designed to mine real-life and simulated data in support of drought prediction. We implement and empirically validate these kernels for support vector machines. Issues related to the nature of geographic data such as autocorrelation and directionality are investigated.
We introduce a novel technique to detect anomalies in images. The notion of normalcy is given by a baseline of images, under the assumption that the majority of such images are normal. The key to our approach is a featureless probabilistic representation of images, based on the length of the codeword necessary to represent each image. These codeword lengths are then used for anomaly detection based...