Today's information is increasing rapidly, doubling every three years. Consequently, the search and recognition stages in computer applications will consume a growing portion of the total CPU time. The SSE 4.2 instruction set, first implemented in Intel's Core i7, provides string and text processing instructions (STTNI) that utilize SIMD operations for processing character data. Though originally...
In the preprocessing stage of text mining, a stop-word list is constructed to filter the segmentation results of text documents so that the dimensionality of the text feature space can be reduced. This paper summarizes the definition, extraction principles, and extraction methods of stop words, and constructs a customized Chinese-English stop-word list with the classical stop-word list based on the difference...
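The filtering step mentioned in the entry above can be sketched as follows. This is a minimal illustration, not the paper's method: the function name `filter_stop_words`, the sample tokens, and the tiny stop-word list are all hypothetical, and the input is assumed to be already segmented into tokens.

```python
def filter_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word list.

    Matching is case-insensitive; the original casing of kept tokens
    is preserved. Both arguments are illustrative assumptions.
    """
    stop = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop]


# Hypothetical, already-segmented input and a tiny stop-word list.
tokens = ["The", "portal", "is", "a", "search", "tool"]
stop_words = ["the", "is", "a"]

filtered = filter_stop_words(tokens, stop_words)
print(filtered)
```

Dropping stop words before building feature vectors removes one dimension per filtered word, which is the dimensionality reduction the abstract refers to.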
This article proposes a question classification approach that integrates multiple semantic features. It addresses two problems in Chinese question classification models: inaccurate extraction of semantic information and slow processing caused by an excessively high eigenvector dimension. With the help of HowNet and the support vector machine and syntactic and semantic information of question...
This paper proposes a systematic full-text search over documents that combines keywords with the structural similarity of the documents under consideration. The approach operates in two steps. The first step uses a set of designated keywords to acquire potentially desired documents by means of an open-source tool. The second step builds a suffix tree of frequently used vocabulary to retrieve the most similar...
This paper proposes a novel idea for text transformation based on mapping single letters from the standard alphabetical order into the same set of single letters reordered by their relative frequencies. This method can be used as a complementary algorithm to enhance statistical compression techniques. We have designed and implemented an algorithm called the ETAO transformation method. It has been...
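One possible reading of the letter remapping described in the entry above is sketched below. It is not the ETAO algorithm itself, whose details are not given here. It assumes that the i-th letter of the alphabet is replaced by the i-th letter of a frequency-ranked ordering computed from the input text; the function names and the sample string are hypothetical.

```python
from collections import Counter
import string


def frequency_order(text):
    """Return the 26 lowercase letters sorted by descending frequency in text.

    Ties (including letters that never occur) are broken alphabetically,
    so the result is always a permutation of the alphabet.
    """
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return "".join(sorted(string.ascii_lowercase, key=lambda c: (-counts[c], c)))


def make_tables(text):
    """Build forward and inverse translation tables for the remapping."""
    order = frequency_order(text)
    forward = str.maketrans(string.ascii_lowercase, order)
    inverse = str.maketrans(order, string.ascii_lowercase)
    return forward, inverse


# Hypothetical lowercase sample; non-letters pass through unchanged.
sample = "a letter mapping sketch"
forward, inverse = make_tables(sample)
encoded = sample.translate(forward)
decoded = encoded.translate(inverse)
print(encoded)
print(decoded == sample)
```

Because the mapping is a permutation of the alphabet, it is lossless and can be inverted before or after a separate compression stage.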
In this paper, we describe an approach to chunk parsing that uses fixed word combinations. It differs from previous research. We present a pattern extraction and matching method for Chinese sentences with fixed word combinations. We then tested the patterns and obtained a correct rate of more than 96%. From the result of our experiment, we can identify that the analysis of syntax has been improved...
As a class of unknown words in Chinese information processing, the letter-word phrases used in Chinese texts cannot be identified correctly by existing segmentation software. Here, an auto-tagging system for letter-word phrases based on rules and statistical data is presented. First, the system scans the sentences to get letter strings, and then takes every letter string as an anchor and scans...
Through research on methods for calculating feature-word weights in texts and semantic similarity between words, we propose a concept-weight-based method for calculating feature-word weights, addressing the semantic association among text features and the prevalent high-dimensionality problem in the text vector space model. This method reduces the semantic loss of the feature set and the dimension...
Digitizing printed documents has always been a challenge for the computing community. Digitization of text not only allows users to easily modify and reprint printed documents, but is also a present-day necessity given the word-search capabilities now available. Converting a printed document into a stream of characters using OCR (optical character recognition) techniques is a widely...
State-of-the-art document segmentation algorithms employ ad hoc solutions that use some document properties and iteratively segment the document image. These solutions need to be adapted frequently and sometimes fail to perform well for complex scripts. This calls for a generalized solution that achieves a one-shot segmentation that is globally optimal. This paper describes one such solution based on...
The dictionary mechanism is the basis of Chinese word segmentation, and its quality directly affects the speed and efficiency of segmentation. Existing dictionary mechanisms have shortcomings such as wasted space, low efficiency, and difficult maintenance; how to establish an effective mechanism is therefore an urgent problem for Chinese word segmentation. In this paper, the idea...
A language pair such as Hindi-English (Hin-Eng) differs in grammar, so finding correspondences in word alignment is often difficult. Because Hindi is morphologically rich, alignment with its English counterpart is uncertain, which introduces ambiguities into the annotation process. We present annotation guidelines for Hin-Eng word alignment through contrastive analysis of the two languages. We applied...
Text classification is an important field of research, and there are a number of approaches to classifying text documents. However, improving computational efficiency and recall remains an important challenge. In this paper, we propose a novel framework to segment Chinese words, generate word vectors, train on the corpus, and make predictions. Based on the text classification technology, we successfully...
In the English-Vietnamese machine translation (EVMT) project at Ho Chi Minh City University of Technology, several problems cause the system to malfunction. One of the most undesirable phenomena is the lexical gap, which occurs when an English word lacks a Vietnamese equivalent. There are several approaches to this obstacle. Some researchers prefer replacing lexical gap by its nearest...
The similarity between the semantic relations that exist between two word pairs is defined as their relational similarity. For example, the semantic relation "is a large" holds between the words in the word pairs (lion, cat) and (ostrich, bird), because a lion is a large cat and an ostrich is the largest living bird on earth. Consequently, the two word pairs, (lion, cat) and (ostrich, bird), are considered...
Because electronic medical records (EMRs) are semi-structured, creating EMR templates is critical for their general use. Word processors are widely used for recording patients' electronic information. However, the main weakness of these editors is that it is hard to extract medical data from a text document. They are also less flexible in presenting data in other forms. This paper provides...
When readers browse news on the web, various emotions may be evoked that in turn influence their minds and lives in different ways. We expect that emotional analysis and classification of text may provide good performance and significance to users surfing the Internet. Most previous research focuses only on bi-emotion classification, that is, Positive and Negative, e.g., identifying whether...
The prime objective of this research is the development of effective reading skills in machines. After reading a text and comprehending its meaning, the machine would program itself and carry out the instructions according to that program. Here we are exploring a new era of computer vision and related research. The current investigation presents an algorithm and software which detects, recognizes...
We present an attempt to extract words from handwritten text lines in Gujarati script. The highly cursive nature of most Indian scripts makes word extraction a critical step in Optical Character Recognition (OCR). This cursive nature also causes difficulty during character extraction and modifier extraction. Word extraction is considered as one of the...
Query difficulty prediction aims to identify, in advance, how reliably an information retrieval system will perform when faced with a particular user request. Predicting the difficulty level of a query is an interesting and important issue in Information Retrieval (IR) and is still an open research problem. To illustrate the importance of query difficulty prediction, we present an example. Information...