N-grams are a building block in natural language processing and information retrieval. An N-gram is a sequence of contiguous words or other tokens in a text document. In this work, we study how N-grams can be computed efficiently using MapReduce for distributed data processing and a distributed database named HBase. This technique is applied to construct the training and testing processes...
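As a rough illustration of the idea in this abstract, N-gram counting can be phrased as a map step that emits one pair per N-gram and a reduce step that sums the pairs per key. The sketch below runs in a single Python process and is not the paper's Hadoop or HBase implementation; the function names `ngrams`, `map_phase`, and `reduce_phase` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter
from itertools import chain


def ngrams(tokens, n):
    """Return the contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def map_phase(document, n):
    """Emit (ngram, 1) pairs for one document, as a mapper would.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    return [(gram, 1) for gram in ngrams(document.split(), n)]


def reduce_phase(pairs):
    """Sum the counts for each n-gram key, as a reducer would."""
    counts = Counter()
    for gram, count in pairs:
        counts[gram] += count
    return counts


# Hypothetical two-document corpus, used only for illustration.
docs = ["the cat sat", "the cat ran"]
bigram_counts = reduce_phase(
    chain.from_iterable(map_phase(doc, 2) for doc in docs)
)
```

In a real distributed setting the mapper outputs would be shuffled by key across machines before reduction; here the `Counter` plays both roles.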
An important property of a corpus is its domain. Usually, the quality of an SMT system trained on an in-domain corpus increases when out-of-domain sentences are added to its training corpus. In this paper we show that out-of-domain corpora may also contain sentences that are suitable for improving the quality of an in-domain corpus. These sentences have words and phrases that occur in in-domain corpora, so their...
Target phrase selection, a crucial component of the state-of-the-art phrase-based statistical machine translation (PBSMT) model, plays a key role in generating accurate translation hypotheses. Inspired by context-rich word-sense disambiguation techniques, machine translation (MT) researchers have successfully integrated various types of source language context into the PBSMT model to improve target...
In Information Retrieval (IR), similarity scores between a query and a set of documents are calculated, and the relevant documents are ranked by these scores. IR systems often treat queries as short documents containing only a few words when calculating document similarity scores. In Computer Aided Assessment (CAA) of narrative answers, when model answers are available, the similarity...
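The query-as-short-document view described in this abstract can be sketched with cosine similarity over raw term counts. This is only one common choice of similarity measure and is not claimed to be the one used in the paper; the helper names `cosine_similarity` and `rank`, the lowercasing, and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Score each document against the query and sort by descending score.

    The query is handled exactly like a (short) document.
    """
    query_vec = Counter(query.lower().split())
    scored = [
        (cosine_similarity(query_vec, Counter(doc.lower().split())), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical query and documents, used only for illustration.
ranked = rank(
    "machine translation",
    ["statistical machine translation", "speech synthesis"],
)
```

Practical IR systems usually weight terms (for example with TF-IDF) rather than using raw counts; raw counts keep the sketch short.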
This paper proposes a novel approach for word similarity computation based on word sense vectors. The word sense vector is built using HIT-IR Tongyici Cilin (extended) for concept generalization and is further modified by the use of relative and absolute frequency filters. Experiments show that the approach not only overcomes the problem of similarity computation of unseen words but also yields a...
Translating ancient Chinese poems is valuable but difficult. Automatically choosing English rhymes when translating ancient Chinese poems would help translators. This paper extracts three important factors that influence English rhymes, presents a set of statistical models based on these factors, and then trains these models to obtain their parameters, which are finally used to recommend...
As one of the core technologies of minority language information processing, Uyghur speech synthesis has made great progress in recent years. In TTS (text-to-speech) systems, however, prosodic phrases are not predicted with high accuracy, which slows the improvement of the naturalness of synthesized speech. In this paper, Uyghur prosodic features were studied and the context features which...
Natural languages are typically replete with homographs, words which have more than one meaning. Consequently, machine understanding of natural language sentences sometimes suffers from certain ambiguities in getting the correct sense of a word in a given sentence. In this work we present a trainable model for word sense disambiguation (WSD) for resolving this ambiguity. The proposed model applies...