Clustering is one of the fundamental data analysis techniques; it aims to find distinct groups of similar objects and to discover hidden structures in data. A recent approach, clustering ensembles, tries to derive an improved clustering solution from several previously generated candidate clustering solutions. Clustering ensembles have two steps: generating multiple candidate clustering...
The important task of correcting label noise is addressed infrequently in the literature. The difficulty of developing a robust label correction algorithm leads to this silence concerning label correction. To break the silence, we propose two algorithms to correct label noise. One, called Self-Training Correction (STC), uses self-training to relabel noisy instances. Another is a clustering-based method, which...
We analyze the work of urban trip planners and the relevance of the trips they recommend in response to user queries. We propose to improve the planner recommendations by learning from choices made by travelers who use the transportation network on a daily basis. We analyze a large collection of individual travelers' trips collected from the automated fare collection systems; we convert the trips into pair-wise...
The explosion of available positioning information associated with the inferred or user-declared semantics of the respective locations already contributes to what is called the big data era, posing new challenges to the mobility data management and mining research community. In this paper, motivated by a series of challenges set in [11], we present a unified framework for the management and the analysis...
We suggest a novel method of clustering and exploratory analysis of temporal event sequence data (also known as categorical time series) based on three-dimensional data grid models. A data set of temporal event sequences can be represented as a data set of three-dimensional points, where each point is defined by three variables: a sequence identifier, a time value and an event value. Instantiating data...
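The three-variable representation described in this abstract can be sketched in a few lines. This is only an illustration of flattening sequences into (sequence identifier, time value, event value) points; the function name `sequences_to_points` and the sample data are hypothetical, and the data grid models themselves are not shown.

```python
from typing import Dict, List, Tuple

# One point = (sequence identifier, time value, event value).
Point = Tuple[str, float, str]


def sequences_to_points(
    sequences: Dict[str, List[Tuple[float, str]]]
) -> List[Point]:
    """Flatten temporal event sequences into three-dimensional points.

    `sequences` maps a sequence identifier to its (time, event) pairs.
    The name and input layout are assumptions made for this sketch.
    """
    points: List[Point] = []
    for seq_id, events in sequences.items():
        for time_value, event_value in events:
            points.append((seq_id, time_value, event_value))
    return points


# Hypothetical sample data: two short event sequences.
example = {
    "s1": [(0.0, "login"), (1.5, "search")],
    "s2": [(0.2, "login")],
}
points = sequences_to_points(example)
```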
In recent years, the widespread adoption of GPS-enabled vehicles has brought new opportunities to Location Based Services. It benefits many related fields such as urban planning, city traffic modeling, personalized recommendations and driving suggestions. The service providers can understand their users better by modeling the mobility pattern and provide more personalized services by predicting the destination...
Anomaly detection in time series is one of the fundamental issues in data mining; it addresses various problems in different domains, such as intrusion detection in computer networks, irregularity detection in healthcare sensory data and fraud detection in insurance or securities. Although there has been extensive work on anomaly detection, the majority of the techniques look for individual objects that...
Frequent ups and downs are characteristic of the stock market. The conventional standard models, which assume that investors act rationally, have for years been unable to capture the irregularities in stock market patterns. As a result, behavioural finance is embraced to attempt to correct these model shortcomings by adding some factors to capture sentimental contagion which may be at play...
In recent years, mining high-utility itemsets (HUIs) has become a key topic in data mining. However, most of the developed algorithms make the unrealistic assumption that the unit profits of items remain unchanged over time. In real-life situations, the profit of an item or itemset varies as a function of cost prices, sales prices and sales strategies. In this paper, a novel framework for mining...
Advertising through web search engines is one of the modes of online advertising and is described as the Adwords problem. In Adwords, advertisers bid on keywords to display advertisements along with the corresponding search results. During the keyword auction, there is very high competition for the frequent keywords but little to no competition for the less frequent ones. In this paper, we have proposed an...
With an increased interest in machine-processable data, many datasets are now published in RDF (Resource Description Framework) format in the Linked Data Cloud. These data are distributed over independent resources which need to be centralized and explored for domain-specific applications. This paper proposes a new approach based on the interactive data exploration paradigm using Pattern Structures, an extension...
People are increasingly using social media, especially online communities, to discuss mental health issues and seek support. Understanding the topics, interaction, sentiment and clustering structures of these communities informs important aspects of mental health. It can potentially add knowledge about the underlying cognitive dynamics, mood swing patterns, shared interests, and interaction. There has...
We describe FactorBase, a new SQL-based framework that leverages a relational database management system to support multi-relational model discovery. A multi-relational statistical model provides an integrated analysis of the heterogeneous and interdependent data resources in the database. We adopt the BayesStore design philosophy: statistical models are stored and managed as first-class citizens...
Activity recognition and prediction in buildings can have multiple positive effects: improving elderly monitoring, detecting intrusions, maximizing energy savings and optimizing occupant comfort. In this paper we apply human activity recognition to data coming from a network of motion and door sensors distributed in a Smart Home environment. We use Hidden Markov Models (HMM) as the basis...
Multi-class learning is an important task in Data Science. One way to achieve good performance on this task is to use Error Correcting Output Codes (ECOC), a powerful ensemble learning method that transforms a multi-class problem into a series of binary classification problems and uses the resulting binary classifiers indirectly to solve the original multi-class problem. A crucial component of ECOC is the design of the...
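As a rough illustration of the general ECOC idea mentioned in this abstract, the sketch below assigns each class a binary codeword, derives the 0/1 targets for each binary problem, and decodes a vector of binary predictions to the class with the nearest codeword in Hamming distance. The code matrix, class names and function names are hypothetical choices for this example and are not taken from the paper; training the binary classifiers is omitted.

```python
from typing import Dict, List, Sequence

# Hypothetical code matrix: one 6-bit codeword per class.
# Any two codewords differ in 4 positions, so one flipped bit is corrected.
CODE_MATRIX: Dict[str, List[int]] = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [0, 0, 1, 1, 1, 0],
    "C": [1, 0, 0, 1, 0, 1],
}


def binary_targets(labels: Sequence[str], bit: int) -> List[int]:
    """Relabel multi-class labels as 0/1 targets for the `bit`-th binary problem."""
    return [CODE_MATRIX[label][bit] for label in labels]


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions in which two equal-length bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def decode(predicted_bits: Sequence[int]) -> str:
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(CODE_MATRIX, key=lambda c: hamming(CODE_MATRIX[c], predicted_bits))


# Targets for the first binary problem, and decoding with one wrong bit.
targets_bit0 = binary_targets(["A", "B", "C", "A"], 0)
noisy_prediction = [1, 1, 1, 0, 1, 0]  # codeword of "A" with bit 4 flipped
decoded = decode(noisy_prediction)
```

Because the codewords are well separated, a single mistaken binary classifier does not change the decoded class in this example.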
Electronic Medical Records (EMR) are increasingly used for risk prediction. EMR analysis is complicated by missing entries. There are two reasons: the “primary reason for admission” is included in EMR, but the co-morbidities (other chronic diseases) are left uncoded, and many zero values in the data are accurate, reflecting that a patient has not accessed medical facilities. A key challenge is to...
Forecasting time series data is an integral component of management, planning and decision making. Following the Big Data trend, large amounts of time series data are available from many heterogeneous data sources in more and more application domains. The highly dynamic and often fluctuating character of these domains in combination with the logistical problems of collecting such data from a variety...
In this work, we propose the Compression Rate Distance, a new distance measure for time series data. The main idea behind this distance is based on the Minimum Description Length (MDL) principle. The higher the compression rate between two time series, the closer they should be. Besides, we also propose a relaxed version of the new distance, called the Extended Compression Rate Distance. The Extended...
The rise of big data, which requires computationally demanding manipulation, has posed unprecedented challenges to the machine learning community. In this context, a variety of dimensionality reduction methods have been introduced to deal with the large-scale aspect of the data. However, their use at very large scales often becomes impractical due to memory and computation limitations. In...
Among other criteria, a pattern may be interesting if it is not redundant with other discovered patterns. A general approach to determining redundancy is to consider a probabilistic model for frequencies of patterns, based on those of patterns already mined, and compare observed frequencies to the model. Such probabilistic models include the independence model, partition models or more complex models...
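The comparison of observed frequencies with an independence model, as described in this abstract, can be illustrated for itemsets. The sketch below is an assumption-laden toy: it takes the pattern to be an itemset, estimates the expected frequency as the product of the single-item frequencies, and compares it with the observed frequency. The function names and the transactions are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Observed relative frequency of transactions containing every item."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def expected_support_independence(
    transactions: List[FrozenSet[str]], itemset: Iterable[str]
) -> float:
    """Expected frequency if items occurred independently of each other."""
    expected = 1.0
    for item in itemset:
        expected *= support(transactions, [item])
    return expected


# Hypothetical toy data: four transactions.
transactions = [frozenset(t) for t in (["a", "b"], ["a", "b"], ["a"], ["c"])]

observed = support(transactions, ["a", "b"])
expected = expected_support_independence(transactions, ["a", "b"])
ratio = observed / expected  # well above 1: more frequent than the model predicts
```

A ratio close to 1 would suggest that the itemset's frequency is already explained by its single items, which is one way of reading "redundant" under this simple model.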