The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Clustering is an unsupervised learning approach that explores data and seeks groups of similar objects. Many classical clustering models such as k-means and DBSCAN are based on heuristics algorithms and suffer from local optimal solutions and numerical instability. Recently convex clustering has received increasing attentions, which leverages the sparsity inducing norms and enjoys many attractive...
This paper studies the problem of extracting real world event information from social media streams. Although existing work focuses on event signals of bursty mentions extracted from a single-source of textual streams, these signals are likely to be noisy due to ambiguous occurrences of individual mentions. To extract accurate event signals, we propose a framework capable of "grounding"...
One of the most popular fuzzy clustering techniques is the fuzzy K-means algorithm (also known as fuzzy-c-means or FCM algorithm). In contrast to the K-means and K-median problem, the underlying fuzzy K-means problem has not been studied from a theoretical point of view. In particular, there are no algorithms with approximation guarantees similar to the famous K-means++ algorithm known for the fuzzy...
Clustering is a technique that is usually applied as a tool for exploratory data analysis. Because of the exploratory nature of this task, it would be beneficial if a clustering method generates interpretable results, and allows incorporating domain knowledge. This motivates us to develop a probabilistic discriminative model that learns a rectangular decision rule for each cluster, we call Discriminative...
We provide a new formulation for theproblem of role discovery in graphs. Our definition is structural:two vertices should be assigned to the same roleif the roles of their neighbors, when viewed as multi-sets, are similar enough. An attractive characteristic of our approachis that it is based on optimizing a well-defined objective function, and thus, contrary to previous approaches, the role-discovery...
Frequent sequence mining methods often make use of constraints to control which subsequences should be mined, e.g., length, gap, span, regular-expression, and hierarchy constraints. We show that many subsequence constraints—including and beyond those considered in the literature—can be unified in a single framework. In more detail, we propose a set of simple and intuitive "pattern expressions"...
Many real-world applications exhibit scenarios where distributions represented by training and test data are not similar, but related by a covariate shift, i.e., having equal class conditional distribution with unequal covariate distribution. Traditional data mining techniques suffer to learn a good predictive model in the presence of covariate shift. Recent studies have proposed approaches to address...
Mining and searching heterogeneous and large knowledge graphs is challenging under real-world resource constraints such as response time. This paper studies a framework that discover to facilitate knowledge graph search. 1) We introduce a class of summaries characterized by graph patterns. In contrast to conventional summaries defined by frequent subgraphs, the summaries are capable of adaptively...
Convolutive non-negative matrix factorization (CNMF) is a promising method for extracting features from sequential multivariate data. Conventional algorithms for CNMF require that the structure, or the number of bases for expressing the data, be specified in advance. We are concerned with the issue of how we can select the best structure of CNMF from given data. We first introduce a framework of probabilistic...
Point of interest (POI) recommendation, which provides personalized recommendation of places to mobile users, is an important task in location-based social networks (LBSNs). However, quite different from traditional interest-oriented merchandise recommendation, POI recommendation is more complex due to the timing effects: we need to examine whether the POI fits a user's availability. While there are...
An emerging challenge in the online classification of social media data streams is to keep the categories used for classification up-to-date. In this paper, we propose an innovative framework based on an Expert-Machine-Crowd (EMC) triad to help categorize items by continuously identifying novel concepts in heterogeneous data streams often riddled with outliers. We unify constrained clustering and...
How can we rank users in signed social networks? Relationships between nodes in a signed network are represented as positive (trust) or negative (distrust) edges. Many social networks have adopted signed networks to express trust between users. Consequently, ranking friends or enemies in signed networks has received much attention from the data mining community. The ranking problem, however, is challenging...
Feature generation is one of the challenging aspects of machine learning. We present ExploreKit, a framework for automated feature generation. ExploreKit generates a large set of candidate features by combining information in the original features, with the aim of maximizing predictive performance according to user-selected criteria. To overcome the exponential growth of the feature space, ExploreKit...
Graphs are widely used to represent many differentkinds of real world data such as social networks, protein-proteininteractions, and road networks. In many cases, each node in agraph is associated with a set of its attributes and it is criticalto not only consider the link structure of a graph but also usethe attribute information to achieve more meaningful results invarious graph mining tasks. Most...
On social media platforms, companies, organizations and individuals are using the function of sharing or retweeting information to promote their products, policies, and ideas. While a growing body of research has focused on identifying the promoters from millions of users, the promoters themselves are seeking to know what strategies can improve promotional effectiveness, which is rarely studied in...
Accounts are often shared by multiple users, eachof them having different item consumption and temporal habits. Identifying of the active user can lead to improvements ina variety of services by switching from account personalizedservices to user personalized services. To do so, we developa topic model extending the Latent Dirichlet Allocation usinga hidden variable representing the active user and...
Large datasets often have unreliable labels—such as those obtained from Amazon's Mechanical Turk or social media platforms—and classifiers trained on mislabeled datasets often exhibit poor performance. We present a simple, effective technique for accounting for label noise when training deep neural networks. We augment a standard deep network with a softmax layer that models the label noise statistics...
Detecting a small number of outliers from a set of data observations is always challenging. This problem is more difficult in the setting of multiple network samples, where computing the anomalous degree of a network sample is generally not sufficient. In fact, explaining why a given network is exceptional, expressed in the form of subnetwork, is also equally important. We develop a novel algorithm...
The service usage analysis, aiming at identifying customers' messaging behaviors based on encrypted App traffic flows, has become a challenging and emergent task for service providers. Prior literature usually starts from segmenting a traffic sequence into single-usage subsequences, and then classify the subsequences into different usage types. However, they could suffer from inaccurate traffic segmentations...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.