The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
This paper focuses on document clustering algorithms that build hierarchical solutions. In this paper is evaluate the performance of different criterion functions for the problem of clustering documents.
This paper formulates, simulates and assess an improved data clustering algorithm for mining web documents with a view to preserving their conceptual similarities and eliminating the problem of speed while increasing accuracy. The improved data clustering algorithm was formulated using the concept of K-means algorithm. Real and artificial datasets were used to test the proposed and existing algorithm...
By utilizing the must-link or cannot-link pair wise constraints in data, semi-supervised clustering improves the performance of unsupervised clustering significantly. A number of semi-supervised clustering algorithms have been proposed to consider such pair wise constraints. However, most of them assign a hard label to each data item and produce little information about the cluster itself. In this...
We introduce a new method for discovering latent topics in sets of objects, such as documents. Our method, which we call PARIS (for Principal Atoms Recognition In Sets), aims to detect principal sets of elements, representing latent topics in the data, that tend to appear frequently together. These latent topics, which we refer to as `atoms', are used as the basis for clustering, classification, collaborative...
In the traditional K-means algorithm, the selection of cluster number and the initial cluster center brings huge affection on the quality of clustering. To reduce the dependence on the initial center and to locate the types of new data rapidly, an algorithm applicable for text data is proposed. In this algorithm, document density is considered as parameter. Documents are divided into blocks first...
Massive linear search results returned from traditional search engines bring much inconvenience to users when extract the information they need. Search result clustering is of critical need for grouping similar topics of documents. The existing algorithm has drawbacks in clustering labels screening, cluster quality assessment, overlapping clusters controlling. The improved clustering algorithm-CQIG,...
An algorithm, TBCClustering, is presented in the paper for clustering GML documents using maximal frequent induced subtree patterns. TBCClustering mines the maximal frequent induced subtrees by using the structural information of GML documents, it can get the best minimum support automatically, and then chooses a set of subtree patterns to form the optimistic clustering features. Finally it uses CLOPE...
Recently, a large amount of work has been done in XML data mining. However, most of the existing work focuses on the snapshot XML data, while XML data is dynamic in practical application. In order to mine knowledge hidden in the frozen structures (FS) which are not changed or very little changed during the historical changing process of an XML document, we present a method for clustering XML documents...
Based on the complex network theory, we proposed a clustering algorithm based on content similarity. Firstly, the Chinese documents are represented by the vector-space model, and the content similarity between any two documents is computed by the cosine similarity. Consequently, the network node is defined as a document, and the edge weight is defined as the similarity obtained by the cosine similarity...
This paper introduces a new description-centric algorithm for web document clustering based on the hybridization of the Global-Best Harmony Search with the K-means algorithm, Frequent Term Sets and Bayesian Information Criterion. The new algorithm defines the number of clusters automatically. The Global-Best Harmony Search provides a global strategy for a search in the solution space, based on the...
Document clustering as an unsupervised approach extensively used to navigate, filter, summarize and manage large collection of document repositories like the World Wide Web (WWW). Recently, focuses in this domain shifted from traditional vector based document similarity for clustering to suffix tree based document similarity, as it offers more semantic representation of the text present in the document...
Document clustering or unsupervised document classification is an automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. In the literature, many similarity functions such as dot product or cosine measures are proposed for the comparison operator. In these papers, we evaluate the effects of many similarity functions on k-mean...
Summary form only given. This paper focuses on structure of dialog graphical system for pattern recognition with help of distance function. In this paper is evaluate the performance of different criterion functions and algorithms for the problem of clustering large datasets.
As the number of available Web pages grows, it is become more difficult for users finding documents relevant to their interests. Clustering is the classification of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. It can enable users to find the relevant documents more easily and also...
In order to precisely procure the Chinese person information on the web, especially distinguish from the namesake, this paper propose a clustering algorithm based on latent semantic model. It establishes for every document a latent semantic model of sentence-word matrix based on central distance, central segment, document length, etc, by building the central word library of person attributes. It clusters...
One of the most effective ways to extract knowledge from large information resources is applying data mining methods. Since the amount of information on the Internet is exploding, using XML documents is common as they have many advantages. Knowledge extraction from XML documents is a way to provide more utilizable results. XCLS is one of the most efficient algorithms for XML documents clustering....
As the information available on the internet is growing explosively, this paper proposed a new method of web news summarization via sentence clustering algorithm. It adopted cluster algorithm to cluster all the sentences. Feature fusion will be used to extract summary sentences. Experimental result shows that the proposed summarization method can improve the performance of summary.
By researching all kinds of methods for document clustering, we put forward a new dynamic method based on genetic algorithm (GA). K-means is a greedy algorithm, which is sensitive to the choice of cluster center and very easily results in local optimization. Genetic algorithm is a global convergence algorithm, which can find the best cluster centers easily. Among the traditional document clustering...
Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of the existing TOC recognition methods, we have observed that book documents are multi-page documents with intrinsic local format consistency. Based on this finding we introduce an automatic TOC analysis method through clustering. This method first detects the decorative...
In many applications of document clustering, a document may include multiple topics and thus may relate to multiple categories at the same time. Most of the existing subspace clustering algorithms can only perform hard clustering on document collections. In this paper, a fuzzy algorithm named R-FPC is introduced for document clustering. The algorithm discovers soft partitions of a data set in the...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.