An earlier paper used triangulated word translations as seed in linear translation between medium European languages. The present work improves upon it by handling word ambiguity both in the main (i.e. source and target) languages and in the pivot by training a multi-prototype vector space model in the former, filtering triangles based on scores computed by a linear model trained with direct (non-triangulated)...
Transliteration, an essential part of transcription, converts text from one writing system to another. The need to translate data has grown as social media brings the world closer together. Machine transliteration has emerged as a part of information retrieval and machine translation projects to translate named entities that are not registered in the dictionary,...
Word sense disambiguation (WSD) has been studied as a specific task in computational linguistics for many years. Language and context features have been shown to be very helpful for the task. In this paper, we investigate the effectiveness of the graph-based ranking method on features drawn from limited language data for word sense disambiguation. Contrary to existing methods,...
With the growth of the Internet, or World Wide Web (WWW), and the emergence of social networking sites such as Friendster and Myspace, the information society started facing exhilarating challenges in language technology applications such as Machine Translation (MT) and Information Retrieval (IR). Nevertheless, there were researchers working in Machine Translation who deal with real-time information for...
We propose a language-independent approach to cleaning up word alignment errors that the unsupervised word alignment process introduces into an aligned parallel corpus. In such an aligned corpus, we evaluate the alignment patterns of one-to-many discontinuous words using statistical measures of collocation. Alignments of discontinuous words without strong collocation tendencies will be taken as errors...
Word alignment is an important and fundamental task for building a statistical machine translation (SMT) system. However, obtaining highly accurate word-level alignments in parallel corpora is still a challenge. In this paper, we propose a new method, based on a constraint approach, to improve the quality of word alignment. Our experiments show that using constraints for the parameter estimation...
This paper presents an outline of our work to develop a word sense disambiguation system in Malayalam. Word sense disambiguation (WSD) is a linguistically based mechanism for automatically determining the correct sense of a word in context. WSD is a long-standing problem in computational linguistics. A particular word may have different meanings in different contexts. For human beings, it is easy...
Many automatic word alignment techniques have so far been developed in Natural Language Processing (NLP). However, word alignment between English and Hindi has not progressed much, for two main reasons: the complex structure of the participating languages and the scarcity of Hindi-language resources. This paper provides a corpus-augmented method of word alignment in which these limitations have...
The bag of visual words model has seen immense success in addressing the problem of image classification. Algorithms using this model generate the codebook of visual words by vector quantizing the features (such as SIFT) of the images to be classified. However, a codebook so formed tends to get biased by the nature of the dataset. In this paper we propose an alternative method to create the codebook...
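The standard pipeline named in the abstract above (vector quantizing local features against a codebook) can be sketched as follows. This is a minimal illustration only, not the paper's alternative method: the codebook is assumed to be given already (in practice it is often built with k-means), the features are toy 2-D tuples rather than real SIFT descriptors, and the function names `nearest_word` and `bovw_histogram` are hypothetical.

```python
def nearest_word(codebook, feature):
    """Index of the codeword closest to `feature` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], feature)),
    )


def bovw_histogram(codebook, features):
    """Quantize each feature to its nearest codeword and return a normalized histogram."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_word(codebook, feature)] += 1
    total = sum(counts) or 1  # avoid division by zero for an image with no features
    return [c / total for c in counts]


# Toy data (assumed): two visual words and four local features from one image.
codebook = [(0.0, 0.0), (10.0, 10.0)]
features = [(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (9.0, 9.0)]
histogram = bovw_histogram(codebook, features)
```

The resulting histogram is the fixed-length image representation that a classifier would consume; because the codewords come from the dataset itself, the representation inherits whatever bias the dataset has.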
Pronunciation variation is a natural and inevitable phenomenon in an accented Mandarin speech recognition application. In this paper, we integrate knowledge-based and data-driven approaches for syllable-based pronunciation variation modeling to improve the performance of a Mandarin speech recognition system for speakers with a Southern accent. First, we generate the syllable-based pronunciation...
We continue studying a new context-free computationally simple stylometry-based text homogeneity test: the sliced conditional compression complexity (sCCC or simply CCC) of literary texts introduced and inspired by the incomputable Kolmogorov conditional complexity. Other stylometry tools can occasionally almost coincide for different authors. Our CCC-attributor is asymptotically strictly minimal...
To address the shortcomings of existing Word Sense Disambiguation (WSD) research methods, we have analyzed the computability and computational complexity of knowledge dictionaries with different structures, and chosen "The Grammatical knowledge-base of Contemporary Chinese" and "the Semantic Knowledge-base of Contemporary Chinese", which were written by the Institute of Computational Linguistics of Peking University,...
The speed of dictionary queries affects not only the speed of segmentation but also how widely the segmentation system can be used in large-scale computation. Based on the differing occurrence frequencies of words in text, a dictionary mechanism using a suboptimal search tree is designed so that the number of comparisons during segmentation is reduced and segmentation speed is improved. Finally,...
A major bottleneck for promoting use of computers and the Internet is that many languages lack the basic tools that would make it possible for people to access ICT in their own language. The paper describes the development of a set of such resources for the processing of Amharic, the working language of the Ethiopian government. The primary goal was to investigate techniques and methods that can...
This paper proposes a novel approach for word similarity computation based on word sense vectors. The word sense vector is built using HIT-IR Tongyici Cilin (extended) for concept generalization and is further modified by the use of relative and absolute frequency filters. Experiments show that the approach not only overcomes the problem of similarity computation of unseen words but also yields a...
Part-of-speech (POS) guessing of unknown words is an essential phase in the process of unknown word identification. This paper applies combined features (namely, both external and internal features) to POS guessing of Chinese unknown words under a conditional random field (CRF) model. To achieve high-precision POS guessing, this paper puts forward a method of integrating Chinese radical, as...
Unknown word recognition is a very important problem in natural language processing. It has a great influence on the performance of dictionary construction and word segmentation. This paper introduces two methods to improve Chinese unknown word recognition with Conditional Random Fields: rough labeling of the characters and N-best listing. The CRF with the two methods proposed...
This paper presents the concept of vicarious words and develops a new unsupervised Chinese word sense disambiguation method. This method, after statistical learning from the vicarious words, realizes unsupervised word sense disambiguation by calculating mutual information to measure the degree of collocation information between the ambiguous words and their context. In our experiment, we test ten...
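As a rough illustration of the mutual-information scoring mentioned in the abstract above, the sketch below computes pointwise mutual information (PMI) between words and context words and picks the sense with the highest summed score. It is not the paper's method: the mapping of each sense to a single stand-in word, the toy co-occurrence pairs, and the names `pmi_table` and `pick_sense` are all assumptions made for the example.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """PMI (base 2) for every observed (word, context_word) co-occurrence pair."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    # PMI(w, c) = log2( p(w, c) / (p(w) * p(c)) ), with probabilities as relative counts
    return {
        (w, c): math.log2(n * total / (word_counts[w] * context_counts[c]))
        for (w, c), n in pair_counts.items()
    }


def pick_sense(context, stand_ins, table):
    """Return the sense whose stand-in word has the highest summed PMI with the context."""
    def score(word):
        # Unseen pairs contribute 0.0, a simplifying assumption
        return sum(table.get((word, c), 0.0) for c in context)
    return max(stand_ins, key=lambda sense: score(stand_ins[sense]))


# Toy co-occurrence data (assumed), gathered for unambiguous stand-in words.
pairs = [("shore", "water"), ("shore", "water"), ("lender", "loan"), ("lender", "loan")]
table = pmi_table(pairs)
stand_ins = {"river_bank": "shore", "money_bank": "lender"}
chosen = pick_sense(["water"], stand_ins, table)
```

Because the statistics are collected from unlabeled text for the stand-in words, no sense-annotated corpus is needed, which is what makes an approach of this kind unsupervised.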
Automatic acquisition of ISA relations is a basic problem in knowledge acquisition from text. We present a method that acquires and verifies ISA relations from large Chinese free text. It initially discovers a set of sentences using Chinese lexico-syntactic patterns. Then we combine outside layer removal and inside layer gathering to acquire the concepts that constitute an ISA relation. Finally, ISA relations...