This work presents a binarization technique for map document images. It exploits an amalgam of global and local threshold approaches best suited for binarizing document images, such as maps, that have complex backgrounds and overlapping foreground objects. The proposed approach uses the Distance Transform (DT) and an adaptive threshold. Initially, a rough estimate of the map background is obtained using Distance...
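As a rough illustration of the local side of such a hybrid scheme, here is a minimal local-mean adaptive thresholding sketch in plain Python. The window size and offset `c` are illustrative parameters, not the paper's values, and the distance-transform background estimate is omitted:

```python
# Minimal sketch of local-mean adaptive thresholding on a grayscale
# image stored as a list of lists (values 0-255). A pixel is marked as
# foreground (1) when it is darker than its local neighbourhood mean
# minus a small constant c.
def adaptive_threshold(img, window=3, c=2):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    r = window // 2
    for y in range(h):
        for x in range(w):
            # mean of the local neighbourhood, clipped at the borders
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[j][i] for j in ys for i in xs]
            mean = sum(vals) / len(vals)
            out[y][x] = 1 if img[y][x] < mean - c else 0
    return out
```

Because the threshold is recomputed per pixel, a slowly varying background (as in scanned maps) does not swamp thin foreground strokes the way a single global threshold would.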
Open problems are defined differently in document image analysis than in the physical sciences, theoretical computer science, or mathematics. Instead of a formal definition, problems in DIA are stated in terms of automation of an application area (e.g., postal address reading) or a scientific subfield (e.g., image compression). The notion of a successful solution may be based on (1) the relative...
Projection methods have been used in the analysis of bitonal document images for different tasks such as page segmentation and skew correction for more than two decades. However, these algorithms are sensitive to the presence of border noise in document images. Border noise can appear along the page border due to scanning or photocopying. Over the years, several page segmentation algorithms have been...
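A projection profile of the kind these methods rely on can be sketched as follows. `line_bands` sums the black pixels per row of a bitonal image and reports contiguous runs of non-empty rows as text-line bands; border noise would add spurious mass at the ends of the profile, which is exactly the sensitivity described above:

```python
# Minimal sketch of a horizontal projection profile for a bitonal page
# (1 = black pixel): rows are summed, and contiguous runs of non-empty
# rows are reported as (start_row, end_row) text-line bands.
def line_bands(img):
    profile = [sum(row) for row in img]      # black pixels per row
    bands, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                        # band opens
        elif count == 0 and start is not None:
            bands.append((start, y - 1))     # band closes
            start = None
    if start is not None:
        bands.append((start, len(img) - 1))  # band runs to the last row
    return bands
```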
Web text mining is a growing research area in data mining. Interestingly, the existing Web text mining algorithms have concentrated on finding frequent patterns while discarding the less frequent ones that may contain outliers. In addition, the domain knowledge in one industry is partly different from that in others. Yet whichever domain they belong to, web texts are analyzed using the same dictionary. This...
In research on Chinese word segmentation, the BP algorithm model has many defects, such as slow convergence, a tendency to fall into local minima, and low speed and efficiency. In this paper, we propose a new particle swarm neural network algorithm (NPSO-BP) and apply it to Chinese word segmentation. The results show that the segmentation algorithm is clearly faster than...
Since traditional classification algorithms do not work well for short-text classification, we propose a search-based method employing the Naïve Bayes classification algorithm. This paper describes the whole process, including the classification algorithms, training, and evaluation. The results indicate that the classifier performs better than other methods.
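The Naïve Bayes core of such a classifier can be sketched as a standard multinomial model with Laplace smoothing. This is the textbook algorithm, not the paper's search-based variant, and tokenization is plain whitespace splitting:

```python
# Minimal sketch of a multinomial Naive Bayes text classifier with
# Laplace (add-one) smoothing over whitespace-tokenized short texts.
import math
from collections import Counter, defaultdict

def train(samples):                       # samples: [(text, label)]
    word_counts = defaultdict(Counter)    # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total)       # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            # add-one smoothing keeps unseen words from zeroing the score
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```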
Traditional clustering is a powerful technique for revealing the "hot" topics among documents. However, it struggles to discover new types of events that emerge gradually. In this paper, we propose a novel model for detecting new clusters in time-streaming documents. It consists of three parts: a cluster definition based on the Multi-Representation Index Tree (MI-Tree), the new-cluster detecting...
To address the current problem of serious academic plagiarism in the web environment, this article proposes a text copy-detection algorithm based on topic bags, which uses the ideas of semantic clustering and multi-instance learning. First, a paper is divided into a three-layer construction tree: a leaf node denotes a sentence; a branch node represents a topic bag, and...
Knowledge representation is a key area of research in artificial intelligence, dealing with the proper storage and retrieval of knowledge for various useful applications. This paper demonstrates that knowledge can be easily and efficiently represented in predicate logic. The algorithm in this paper splits Urdu text/sentences into phrases/constituents and then represents these in predicate...
Script recognition is a necessary step before OCR in multilingual systems. In this paper, a novel method is proposed for identifying Farsi and Latin scripts in bilingual documents using curvature scale space features. The proposed features are rotation and scale invariant and can be used to identify scripts in different fonts. We assume that the bilingual scripts may have Farsi and...
Identification of Chinese coding type is a major and challenging issue in Chinese web content audit and analysis. In this paper, we develop a novel algorithm based on the theory of Kolmogorov complexity to identify the coding type of the Chinese characters in a given text segment. An array of text compressors is used as a filter bank to evaluate the information distance between the text under examination and the training...
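A single-compressor version of this idea is the normalized compression distance (NCD), which approximates the uncomputable Kolmogorov-based information distance with a real compressor. The sketch below uses zlib from the standard library in place of the paper's array of compressors:

```python
# Minimal sketch of a compression-based information distance:
# the normalized compression distance (NCD), using zlib at maximum
# compression level as the stand-in for Kolmogorov complexity.
import zlib

def c(data: bytes) -> int:
    # compressed length approximates the Kolmogorov complexity K(data)
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = c(x), c(y), c(x + y)
    # near 0 when x and y share most of their information, near 1 when not
    return (cxy - min(cx, cy)) / max(cx, cy)
```

In use, a text segment would be compared against training samples of each candidate coding type and assigned to the type with the smallest distance.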
In text categorization, feature selection is an effective dimension-reduction method. To address the excessively high dimensionality of the original feature space, the abundance of irrelevant and redundant features, and the difficulty of selecting a threshold, we propose an improved LAM feature selection algorithm (ILAMFS). First, combining the golden-section method with the LAM algorithm based on the characteristics...
Text Categorization (TC) is an important component of many information organization and information management tasks. In many TC applications, the case base grows at a fast rate, which makes the case retrieval process inefficient. Case-base maintenance learning via the GC (Generalization Capability) algorithm, which reduces the number of cases fed into the KNN algorithm, can improve efficiency...
Skew detection and correction is an important step in automated content conversion systems, on which overall system performance depends. Although there are many working solutions at present, the search for an algorithm that achieves good error rates, runs fast, and handles different layout types is still open, so new solutions for skew detection are needed. The paper at hand...
Reliable and generic methods for skew detection are a necessity for any large-scale digitization project. As one of the first processing steps, skew detection and correction heavily influences all further document analysis modules, such as geometric and logical layout analysis. This paper introduces a generic, scale-independent algorithm capable of accurately detecting the global skew angle...
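One classic way to estimate a global skew angle, in the spirit of the projection-based detectors above, is to project the black pixels onto rows rotated by each candidate angle and pick the angle whose profile is sharpest, measured below by the sum of squared row counts. The search range and step are illustrative, not this paper's method:

```python
# Minimal sketch of projection-profile skew detection: for each
# candidate angle, black pixels are binned into rotated rows; the
# angle whose profile concentrates mass into the fewest rows (highest
# sum of squared counts) is taken as the skew estimate.
import math
from collections import Counter

def estimate_skew(pixels, angles):
    """pixels: iterable of (x, y) black-pixel coordinates; angles in degrees."""
    pixels = list(pixels)
    best_angle, best_energy = None, -1.0
    for a in angles:
        rad = math.radians(a)
        # project every pixel onto rows rotated by the candidate angle
        rows = Counter(round(y * math.cos(rad) - x * math.sin(rad))
                       for x, y in pixels)
        # a sharp profile means the candidate angle aligns with text lines
        energy = sum(c * c for c in rows.values())
        if energy > best_energy:
            best_angle, best_energy = a, energy
    return best_angle
```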
This paper presents a new method for automatic text-line extraction from Arabic historical handwritten documents that exhibit overlapping and multi-touching characters. Our approach is based on block-covering analysis using an unsupervised technique. The algorithm first performs a statistical block analysis that computes the optimal number of vertical strips into which the document is decomposed...
Through research on the K-means text clustering algorithm and a semantic-based vector space model, a semantic-based K-means text clustering model is proposed to address the high dimensionality and sparsity of text data sets. The model reduces the semantic loss of the text data and improves the quality of text clustering. Experiments show that semantic-based text clustering increases...
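As the baseline such a model improves on, plain (non-semantic) K-means over term-frequency vectors with cosine similarity can be sketched as follows. The semantic weighting is deliberately omitted, and seeding with the first k documents is a simplification for determinism:

```python
# Minimal sketch of K-means text clustering: documents become
# term-frequency Counters, assignment uses cosine similarity, and
# centroids are the mean of their members' vectors.
import math
from collections import Counter

def cos(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(docs, k, iters=10):
    vecs = [Counter(d.lower().split()) for d in docs]
    centroids = vecs[:k]                  # deterministic seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            best = max(range(k), key=lambda i: cos(v, centroids[i]))
            clusters[best].append(v)
        for i, members in enumerate(clusters):
            if members:                   # centroid = mean member vector
                merged = Counter()
                for m in members:
                    merged.update(m)
                centroids[i] = Counter({t: c / len(members)
                                        for t, c in merged.items()})
    return [max(range(k), key=lambda i: cos(v, centroids[i]))
            for v in vecs]
```

The sparsity problem the abstract mentions is visible here: Counters only store non-zero terms, but with no semantic weighting, two documents sharing no exact term have similarity zero even when they are topically related.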
General-purpose search engines take a very simple view of text documents: they treat them as bags of words. As a result, the semantics of the documents is lost after indexing. In this paper, we introduce a novel approach to improve the accuracy of Web retrieval. We utilize WordNet and the WordNet SenseRelate AllWords software as our main tools to preserve the semantics of the sentences of documents...
In recent years, the mining of text data has gradually become a new research topic, and within it the study of text clustering has attracted wide attention. This paper proposes an improved fuzzy text clustering method based on the fuzzy C-means clustering algorithm and the edit distance algorithm. We use feature evaluation to reduce the dimensionality of high-dimensional text...
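The edit distance component can be sketched with the standard dynamic-programming recurrence; the fuzzy C-means side of the method is omitted here:

```python
# Minimal sketch of the (Levenshtein) edit distance: the minimum number
# of insertions, deletions, and substitutions turning string a into b,
# computed with a rolling one-row DP table.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))          # distances from the empty prefix
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]
```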
Classification and clustering are frequently used methods in data mining technology. This paper introduces the idea of text clustering into the study of categorization algorithms. The authors also attempt to use a self-initiated learning pattern of text categorization to design a clustering-based text categorization algorithm, with the aim of reducing the dimensionality of the training set and raising...