Text classification (TC) is one of the fundamental problems in text mining. Plenty of works exist on TC with interesting approaches and excellent results; however, most of these works follow a word-based approach for feature extraction. In this work, we are interested in an alternative (byte-based or character-based) approach known as compression-based TC (CTC). CTC has been used for some languages...
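The abstract is cut off before it describes the authors' setup, so the following is only a minimal sketch of the general compression-based idea, not their method. It assumes a standard "compress with each class corpus" scheme: a document is assigned to the class whose training text lets a general-purpose compressor (here zlib, chosen for illustration) encode the document in the fewest additional bytes. The function names and toy corpora are hypothetical.

```python
import zlib


def compressed_size(text):
    """Number of bytes zlib needs to encode the UTF-8 bytes of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document, training):
    """Return the label whose training text helps compress `document` most.

    `training` maps each class label to one concatenated training string.
    The score is the extra compressed bytes the document costs when it is
    appended to that class's text; the lowest score wins.
    """
    def extra_bytes(label):
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Toy byte-level example: no tokenization or word features are involved.
training = {
    "sports": "the team won the match after scoring a late goal in the football league " * 3,
    "finance": "the bank raised interest rates as the stock market and bond prices fell " * 3,
}
label = classify("the team scored a goal in the match", training)
print(label)
```

Because the compressor works on raw bytes, the same sketch applies unchanged to any language or script, which is the main appeal of the byte-based or character-based approach over word-based feature extraction.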
Chunking or shallow syntactic parsing is proving to be a task of interest to many natural language processing applications. The problem gets worse for the Arabic language because of its specific features that make it quite different and even more ambiguous than other natural languages when processed. In this paper, we present a method for chunking Arabic texts based on supervised learning. We use...
With the growth of the Internet and electronic commerce, there is more and more review data on the Internet. Quite a lot of Internet users refer to related comments of a product before they make a decision, which can teach them about the quality and reputation of the product and help them decide whether to buy it. A system that can automatically classify the polarity of a given text would be a great...
In this paper, we propose a framework for a spoken dialogue agent that is not dependent on any specific language; it takes some dialogues and sentences as training sets and uses them to acquire knowledge about the target language, then it uses this knowledge to generate several possible responses corresponding to the user input and finally it uses a simple score method to select the best one to show...
We survey evidence (orthographic, distributional, phonological, and psycholinguistic) in favor of a model of Arabic speech sounds based on the CV unit and extensive use of the silent sukuun vowel. We then construct a small-vocabulary multi-speaker CV HMM similar to the phonemic HMMs based on tied triphones that are widely used in speech recognizers for English and other European languages. Using experimental...
Word Sense Disambiguation (WSD) is a key task in natural language processing for both written and spoken communication. It is the task of selecting the appropriate sense of an ambiguous word in a given context. This paper aims to determine the correct sense of a given ambiguous word in the Hindi language. A modified Lesk approach is used, based on the concept of a dynamic context window. Dynamic context...
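The abstract is truncated before it defines the dynamic context window, so the sketch below rests on an assumption: the window around the ambiguous word is widened one token at a time until a single sense has a strictly higher overlap score. It uses the simplified Lesk scoring (overlap between context words and each sense's gloss). The glosses, stopword list, and function name are hypothetical, and English is used only to keep the example readable.

```python
STOPWORDS = {"the", "of", "a", "and", "or", "on", "he", "that"}


def lesk_dynamic(tokens, index, glosses, max_window=5):
    """Pick a sense for tokens[index]; return (sense, window size used).

    The window starts at one token on each side and grows while the two
    best senses are tied (assumed reading of "dynamic context window").
    """
    best, window = None, 0
    for window in range(1, max_window + 1):
        lo = max(0, index - window)
        neighbours = tokens[lo:index] + tokens[index + 1:index + 1 + window]
        context = set(neighbours) - STOPWORDS
        # Simplified Lesk score: shared words between context and gloss.
        scores = {sense: len(context & set(gloss.split()))
                  for sense, gloss in glosses.items()}
        ranked = sorted(scores.values(), reverse=True)
        best = max(scores, key=scores.get)
        if len(ranked) == 1 or ranked[0] > ranked[1]:
            break  # one sense is strictly ahead, stop widening
    return best, window


glosses = {
    "bank/finance": "institution that accepts money deposits and gives loans",
    "bank/river": "sloping land beside a river or other body of water",
}
tokens = "he sat on the bank of the river and watched the water".split()
sense, window = lesk_dynamic(tokens, tokens.index("bank"), glosses)
print(sense, window)
```

In this toy sentence the first two window sizes contain only stopwords, so the scores stay tied and the window keeps growing until a content word from one gloss appears.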
When we read a piece of writing, the meaning we derive from that text often includes information about the authors themselves. Clues to their identity, worldview, and even psychological states are encoded in features such as word choice and sentence structure. This work describes how writing style features can be used to analyze the authorship of extreme jihadist writing. Inspire magazine is an online,...
We introduce the REEL (RElation Extraction Learning) framework, an open source framework that facilitates the development and evaluation of relation extraction systems over text collections. To define a relation extraction system for a new relation and text collection, users only need to specify the parsers to load the collection, the relation and its constraints, and the learning and extraction techniques...
With the growth of the Internet, or World Wide Web (WWW), and the emergence of social networking sites such as Friendster and Myspace, the information society started facing exciting challenges in language technology applications such as Machine Translation (MT) and Information Retrieval (IR). Nevertheless, there were researchers working in Machine Translation that deal with real time information for...
An important property of a corpus is its domain. Usually, the quality of an SMT system trained on an in-domain corpus increases when out-of-domain sentences are added to its training corpus. In this paper we show that out-of-domain corpora may also contain sentences that are suitable for improving the quality of an in-domain corpus. These sentences have words and phrases that occur in in-domain corpora, so their...
We propose a new unsupervised method to identify Named Entities (NE) in resource-poor languages. The idea is to transfer the knowledge of NEs from a resource-rich language to a resource-poor one by using a bilingual parallel corpus of this language pair. After extracting all NE pair candidates and filtering these candidates (using lexical and contextual filters) to obtain a high precision seed...
This paper deals with adaptive rule-based machine translation from English to Telugu. The proposed approach is a rule-based methodology. A set of production rules, a training set of English and Telugu sentences, and an English-to-Telugu dictionary are developed for this purpose. In the process of machine translation, handling prepositions is the main issue. There are many different kinds of...
Along with the rapid improvement of information technology, educational data is growing quickly. Such data is massive and raw. Researchers develop educational standards to regularize such data. However, there are multiple standards, and education resources based on different standards have different structures, which makes them hard to share. Most of them have become Information Islands...
Vector Symbolic Architectures (VSA) are approaches to representing symbols and structured combinations of symbols as high-dimensional vectors. They have applications in machine learning and for understanding information processing in neurobiology. VSAs are typically described in an abstract mathematical form in terms of vectors and operations on vectors. In this work, we show that a machine learning...
State-of-the-art phrase-based machine translation (MT) systems usually demand large parallel corpora in the step of training. The quality and the quantity of the training data exert a direct influence on the performance of such translation systems. The lack of open-source bilingual corpora for a particular language pair results in lower translation scores reported for such a language pair. This is...
Applications in the World Wide Web aggregate vast amounts of information from different data sources. The aggregation process is often implemented with Extract, Transform and Load (ETL) processes. Usually, ETL processes require the information to be aggregated to be available in structured formats, e.g. XML or JSON. In many cases the information is provided in natural language text which makes the application...
Today, people are heavily accustomed to social networks. Because of this, it is very easy to spread spam content through them. Anyone's details can be accessed very easily through these sites, and no one is safe on social media. In this paper we propose an application which uses an integrated approach to spam classification on Twitter. The integrated approach...
Event extraction is a key step in many text-mining applications such as question-answering, information extraction and summarization systems. In this study we used a conditional random field (CRF) to extract causal events from PubMed articles related to geriatric care. Abstracts from the geriatric care domain were manually reviewed and categorized into 42 different subdomains. There are a total of 19,677...
Tibetan person name recognition is one of the most difficult tasks in the area of Tibetan information processing, and recognition quality directly affects the precision of Tibetan word segmentation and the performance of related application systems, which include Tibetan-Chinese machine translation, Tibetan information search, text categorization, etc. Based on the analysis of wording rules...
With the increase of ubiquitous data all over the internet, intelligent classroom systems that integrate traditional learning techniques with modern e-learning tools have become quite popular and necessary today. Although a substantial amount of work has been done in the field of e-learning, specifically in automation of objective question and answer evaluation, personalized learning, adaptive evaluation...