As an important preprocessing technology in patent knowledge utilization, patent classification should be accurate and efficient. Commonly used feature selection methods and classification algorithms, such as information gain (IG) and the k-nearest neighbors (k-NN) algorithm, perform well in text classification but have some drawbacks in patent classification. In this paper, we focus on patent classification...
Text Categorization (TC) is an important component in many information organization and information management tasks. In many TC applications, the case base grows at a fast rate, which makes case retrieval inefficient. Case-base maintenance learning via the GC (Generalization Capability) algorithm, which reduces the number of cases used by the KNN algorithm, can improve efficiency...
Based on complex network theory, we propose a clustering algorithm based on content similarity. First, the Chinese documents are represented by the vector-space model, and the content similarity between any two documents is computed as their cosine similarity. Each network node is then defined as a document, and each edge weight is defined as the similarity obtained by the cosine similarity...
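The network construction described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, the `threshold` parameter, and the function names are all assumptions made for the example (Chinese text would first need word segmentation).

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Represent a document as a term-frequency vector (vector-space model).

    Assumption: tokens are separated by whitespace.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_network(docs, threshold=0.0):
    """Build a weighted graph: nodes are document indices, and each edge
    weight is the cosine similarity of the two documents it connects.

    Assumption: edges with weight <= threshold are dropped.
    """
    vectors = [tf_vector(d) for d in docs]
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        weight = cosine(vectors[i], vectors[j])
        if weight > threshold:
            edges[(i, j)] = weight
    return edges


if __name__ == "__main__":
    docs = ["data mining text", "text mining data", "ant colony"]
    print(similarity_network(docs))
```

Any community-detection or graph-clustering step would then operate on the returned edge dictionary; the abstract is truncated before it describes that step, so it is not sketched here.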
In the fields of imbalanced learning and cost-sensitive learning, minimizing the classification error rate is not an appropriate approach because of class skew and cost distributions. Thus the area under the ROC curve (AUC) has been widely used to assess the performance of classifiers in such cases. The Maximum AUC Linear Classifier (MALC), aiming at maximizing AUC directly, is a nonparametric...
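To make the AUC criterion concrete, the sketch below computes AUC as the fraction of (positive, negative) pairs that a scoring function ranks correctly, with ties counted as one half (the Wilcoxon-Mann-Whitney statistic). It illustrates the evaluation measure only and is not MALC itself; the function name and the 0/1 label encoding are assumptions.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Assumption: labels are 1 for the positive class and 0 for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # One positive among nine negatives: always predicting "negative" is 90%
    # accurate, yet a constant score carries no ranking information.
    labels = [1] + [0] * 9
    constant_scores = [0.0] * 10
    print(auc(constant_scores, labels))
```

The pairwise loop is quadratic in the number of examples; a sort-based rank formulation gives the same value more efficiently.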
A multi-relational Bayesian classification algorithm with rough sets is proposed in this paper. A relational graph is used to dynamically choose the associative tables associated with the target table, a tuple ID propagation approach is used to directly solve the association rule mining problem across multiple database relations, and the concept of the core in rough set theory is introduced to simplify the associative...
A hybrid constrained semi-supervised clustering algorithm (HCC) is proposed, in which both labeled data and pairwise constraints are used when clustering a given dataset to obtain a better clustering result. This paper gives a theoretical derivation and experiments on UCI data sets, and the experiments show that the quality of clustering using two kinds of constraint information is better than using only one kind of...
Many existing clustering algorithms use a single prototype to represent a cluster. However, it is sometimes very difficult to find a suitable prototype for representing a cluster with an arbitrary shape. One possible solution is to employ multiple prototypes instead. In this paper, we propose a minimum spanning tree (MST) based multi-prototype clustering algorithm. It is a split-and-merge scheme. In the...
This paper proposes an improved FCM algorithm that addresses several problems of the Fuzzy C-Means algorithm, such as its sensitivity to initial conditions, which usually leads to local minima. The new algorithm can obtain global optimal solutions through a new, simple, and efficient rule for selecting the initial cluster centers and, furthermore, alternating optimization in terms of a novel separable criterion...
This paper proposes a new point symmetry-based ant clustering algorithm which can detect the number of clusters and the proper partitions of data sets that possess the property of symmetry. In the proposed algorithm, a revised ant clustering algorithm is presented which can reduce the running time of the standard ant clustering algorithm. Each ant represents a data object. It will decide its...
Literature-based discovery links two or more literature concepts that have heretofore not been linked (i.e., are disjoint), in order to produce novel, interesting, plausible, and intelligible knowledge. Cluster analysis is the core of literature-based discovery. This paper proposes an improved fuzzy c-means (FCM) algorithm based on the analysis of existing clustering analysis of literature-based...
Spatial data mining is the process of identifying or extracting efficient, novel, potentially useful, and ultimately understandable patterns from a spatial data set. Spatial clustering analysis is one of the most important research directions in spatial data mining. Clustering criteria implied in massive data can be discovered by spatial clustering analysis, which can be used to explore...
Clustering high-dimensional data is an important task. Subspace clustering has emerged as a possible solution to the challenges associated with high-dimensional clustering. A subspace cluster is a subset of points together with a subset of attributes, such that the cluster points aggregate strongly on some category of values in these attributes. This paper proposes a subspace clustering algorithm...
K-means clustering is sensitive to its starting points, and it is expensive to run on large-scale data, such as audio. Sampling is widely applied to find “better” starting points and thereby speed up convergence of the clustering. However, how to choose a reasonable sampling rate remains a problem. In this paper, we report our initial exploration of locating reasonable sampling rates...
Clustering is a hot research field in data mining. Many methods and algorithms have been designed for the different types of data sets on which data analysis operates. This paper presents a Local Agglomerative Characteristic (LAC) based algorithm for data clustering, which can handle clusters of different sizes, shapes, and densities, and can work well on different distributed and natural variant...
Climate factors govern the distribution of plant species, which is an indicator of the climate of the corresponding region. Spatial clustering methods are an important component of spatial data mining. We obtained distribution data for more than 100 Chinese genuine regional herb plants to serve as basic data for spatial analysis. A spatial clustering algorithm based on spatial contiguity relations in GIS was...
Case-based reasoning (CBR) is a very important task in data mining, but private information can easily be disclosed in CBR. This paper presents an encrypted case-based reasoning method based on random locally linear embedding (LLE). To ensure the security of the CBR, the LLE parameters, namely the number of nearest neighbors k and the embedding space dimension d, are selected randomly. Further we embed the...
The problem of similarity measure for time series has attracted considerable research interest. Most of the recently used algorithms utilize the Dynamic Time Warping (DTW) distance for measuring the similarity of time series, in various areas such as science, medicine, industry, and finance. DTW is a considerably more robust distance measure for time series, which allows similar shapes to match even...
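The warping behavior mentioned in this abstract can be illustrated with the classic dynamic-programming formulation of DTW. This is a minimal sketch of the textbook recurrence, assuming one-dimensional series and absolute difference as the local cost; it is not the specific variant studied in the paper, and the function name is chosen for the example.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D numeric sequences.

    cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    Assumption: the local cost is the absolute difference |a[i] - b[j]|.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # The second series repeats one sample; DTW still aligns the two
    # shapes at zero cost, which a point-by-point distance could not do.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The table has (n+1) by (m+1) entries, so time and memory grow with the product of the two series lengths; windowing constraints are a common way to reduce this.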
The k-dominant skyline query has been proposed as an important operator for multi-criteria decision making, data mining, and so on. It can reduce the large result sets of skyline queries in high-dimensional space. In this paper, a new concept is proposed: the k-dominant skyline cube, which consists of all the k-dominant skylines. Although existing algorithms can compute every k-dominant...
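For readers unfamiliar with k-dominance, the sketch below uses the commonly cited definition: a point p k-dominates q if there are k dimensions on which p is no worse than q, and p is strictly better on at least one of them. It is a naive quadratic illustration under the assumption that smaller values are better; it is not the cube-computation algorithm of the paper, and the function names are chosen for the example.

```python
def k_dominates(p, q, k):
    """True if p k-dominates q (assumption: smaller values are better).

    p must be no worse than q on at least k dimensions and strictly
    better on at least one dimension (which is then among those k).
    """
    no_worse = sum(1 for x, y in zip(p, q) if x <= y)
    strictly_better = sum(1 for x, y in zip(p, q) if x < y)
    return no_worse >= k and strictly_better >= 1


def k_dominant_skyline(points, k):
    """Points that are not k-dominated by any other point (naive O(n^2) scan)."""
    return [
        q for q in points
        if not any(k_dominates(p, q, k) for p in points if p is not q)
    ]


if __name__ == "__main__":
    points = [(1, 1, 3), (2, 2, 2), (3, 3, 1)]
    print(k_dominant_skyline(points, 3))  # ordinary skyline (k = d)
    print(k_dominant_skyline(points, 2))  # smaller result for k < d
```

With k equal to the number of dimensions this reduces to the ordinary skyline. For smaller k the dominance relation can be cyclic, so the k-dominant skyline may even be empty.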
Sequential pattern mining is an important and useful tool with broad applications, such as analyzing customer purchase behavior, recommending services to customers, and so on. It is challenging because an explosive number of subsequences needs to be examined, and both the memory and computational costs become extremely high when the sequence database grows huge. Many previous algorithms developed...
With the spread of Internet applications, more and more enterprises build their own Web sites and provide business information through Web pages. Web page classification can be used to assign enterprise Web pages to one or more predefined business categories. For the purpose of Internet-based enterprise administration in E-government systems, algorithms and applications related to web page classification...