New word extraction is an important problem in Chinese information processing. This paper proposes a machine-learning-based method for new word extraction that combines context information, word construction rules, and statistical information. An experiment on two-character nouns shows that this method can improve the efficiency and accuracy...
The traditional approach to English text chunking identifies all phrases with a single model and the same types of features. This has two known limitations: the same types of features are not suitable for all phrases, and data sparseness may result. In this paper, a divide-and-conquer approach is proposed and applied in the identification...
Automatic source attribution refers to the ability for an autonomous process to determine the source of a previously unexamined piece of writing. Statistical methods for source attribution have been the subject of scholarly research for well over a century. The field, however, is still missing a definitive currency of established or agreed-upon classes of features, methods, techniques and nomenclature...
Knowledge of Chinese historical official titles is stored on paper or as electronic images, and its representations are based on natural language, which is more ambiguous than a formal one. It is therefore difficult to retrieve and disseminate. This paper reports a method to build a knowledge base that can provide a sharable and reusable knowledge resource about Chinese historical official titles. Consider...
Chinese information search engines always encounter difficulty in segmenting the Chinese words of an article. In this paper, a suffix-tree-based searching approach is proposed to avoid this difficulty. The suffix tree algorithms are studied and a set of optimal algorithms for index building is proposed. Based on the algorithms, a prototype of Chinese information...
Named entity recognition (NER) is a low-level semantic technology. Since it is simple and efficient, it has been widely applied in many systems such as machine translation, information retrieval, information extraction, question answering and summarization. The goal of named entity recognition is to classify names in text into particular categories, such as the names of people, places, and organizations...
Question answering has recently received more and more attention from researchers. It is widely regarded as the advanced stage of information retrieval. This paper presents a novel domain-independent question-answering system based on information retrieval over a large-scale collection of texts; an improved system similarity model is developed and applied in it, which improves the performance...
Most of the research in the last few decades has focused on automatic natural language processing (NLP) in English, European and East Asian languages. Unfortunately, South Asian languages, especially Urdu, have received less attention. In this paper we present a survey regarding classification of the Urdu language. The main goal of this survey is to briefly present the material available on Urdu NLP,...
In rule-based machine translation (RBMT), rule acquisition remains a bottleneck. This paper proposes a cascaded approach to optimize the rule base, which is automatically acquired from the bilingual corpus. Observing the higher risk of errors in the upper layer of the parse tree, we propose in this paper a method which advocates the optimization of rules by...
At present the most widely used technology of pinyin-Chinese character conversion combines statistics with linguistic rules. Although it basically solves such problems as long distance restriction and language recursion phenomenon, it relies on a great deal of computation because there are too many candidate paths. This paper tries to simplify the candidate paths by using quotient space granularity...
This paper proposes an integrated algorithm for English-Chinese word segmentation and alignment. In this algorithm, bilingual word segmentation and alignment work synchronously and interactively. Given sentence-aligned bitext, it can not only use bilingual word alignment information to guide the resolution of word segmentation ambiguities, but also prevent word segmentation errors from being transferred...
The Manchu character recognition method based on Manchu character units is an efficient method. In this method, the recognition accuracy rate of Manchu character units has a great influence on the final recognition result. As a new approach to this problem, a hybrid wavelet neural network scheme has been developed as a recognition method to replace the original mini-distance method. Both the learning samples...
In information retrieval, users hope to obtain more relevant information from the top-ranked documents. In this paper, a combination of ontology with a statistical method is presented to retrieve an initial document set and improve the precision of the top N ranked documents by re-ranking the document set. The experiment with the NTCIR-3 Chinese CLIR dataset shows that the proposed method improved the precision...
This paper presents a target language generation mechanism for data-oriented English-Chinese machine translation. This mechanism applies the theory of data-oriented parsing, traditionally used in language analysis, to target language generation as well. By linearizing the result of source language analysis (a parse tree), the final translation in the target language is generated. To prove...
For the task of automatically building a Chinese-English semantic lexicon for translation selection, this research presents a method that uses WordNet similarity measures to filter out misaligned Chinese-English word pairs. Six different proposed measures of similarity based on WordNet were experimentally compared and evaluated by using WordNet and the software package WordNet::Similarity. It was found...
This paper proposes a strategy for Chinese multi-document summarization based on clustering and sentence extraction. It adopts the term vector to represent the linguistic unit in Chinese documents, which obtains higher representation quality than the traditional word-based vector space model to a certain extent. As for clustering, we propose two heuristics to automatically detect the proper number of clusters:...
Liaison evaluation for spoken English is one of the key problems in computer-aided spoken language learning. Though many factors affect the performance of a spoken language evaluation algorithm, two factors account for most of the obstacles: the natural casualness of spoken language and the unstable performance of existing speech processing systems. In this...
The evaluation of pronunciation for spoken English is one of the key problems in computer-aided spoken language learning. While most researchers focus on improving speech recognition to build a reliable evaluation system, a model is still needed that fuses the reliabilities of existing speech processing systems and learner personalities into the evaluation system. In this paper,...
Text categorization is an important step in many applications, e.g. Web page classification, search engine indexing and information retrieval. When the number of documents available is huge, active learning can help reduce training time and cost. Moreover, active learning is able to filter out noisy samples for training and therefore may achieve better generalization capability....
As an important task in the field of natural language processing, word sense disambiguation (WSD) has been a research focus since 1950. The task of WSD is very difficult to solve, and most modern algorithms fail to reach an ideal level. The goal of WSD is to determine the sense of a polysemous word within a specific context, which involves two steps - determining all the senses for the polysemous...