This research focuses on the implementation of Gramatika, a grammar checker designed for the Filipino language given its available resources and linguistic tools. The checker uses hybrid n-grams generated from n-grams of words, part-of-speech tags, and lemmas of grammatically correct texts. It covers a variety of error types, including those unique to Filipino: wrong word form, and incorrectly merged...
Authorship recognition from micro-blogs such as Twitter is a challenging task because text length is limited to 140 characters. However, identification of micro-blog authors is crucial in many cyber-crime investigations as well as in forensic applications. So far, traditional linguistic profiles such as Bag-Of-Words (BOW) and style-based markers have been investigated for identification of micro-blog...
The ultimate aim of this research is to develop a Rule Based Machine Translation (RBMT) system using sentence simplification. English follows an SVO sentence pattern, whereas Tamil follows SOV. Complex and longer sentences are not easy to parse and translate. A sentence simplifier is therefore included in the rule-based system to split a long sentence into multiple simple sentences. Machine translation...
The presence of sarcasm in text can hamper the performance of sentiment analysis. The challenge is to detect the existence of sarcasm in texts. This challenge is compounded when bilingual texts are considered, for example using Malay social media data. In this paper a feature extraction process is proposed to detect sarcasm using bilingual texts; more specifically public comments on economic related...
This paper presents a method to improve Thai-English word alignment in statistical machine translation (SMT) for interrogative sentences in a parallel corpus. We utilize Thai and English grammatical knowledge, i.e., tense, part of speech (POS), and question inversion patterns. The proposed method handles the differences between Thai and English interrogative sentences using sentence transformation, interrogative...
In this research paper, a rule-based chunker is developed and evaluated. For the development of the chunker, handcrafted linguistic rules were written, mainly for noun, adverb, verb, and adjective phrases and conjuncts. The Indian Languages Chunk Tagset is used for annotation. For evaluation, 500 Hindi sentences tagged by an HMM tagger were given as input to our chunker...
Spoken language understanding (SLU) is a core component of a spoken dialogue system. It involves intent prediction and slot filling and is also called semantic frame parsing. Recently, recurrent neural networks (RNNs) have obtained strong results on SLU due to their superior ability to preserve sequential information over time. Traditionally, the SLU component parses semantic frames for utterances considering...
Morphological analysis is an essential step in processing the Korean language, due to the highly agglutinative properties of the language. In this paper, we propose a novel approach for constructing a Korean morphological analyzer that can capture linguistic properties using graphemes as basic processing units. Since our model does not utilize prior linguistic knowledge, the model can be applied to other...
This study describes the construction of the TOCFL (Test Of Chinese as a Foreign Language) learner corpus, including the collection and grammatical error annotation of 2,837 essays written by Chinese language learners originating from a total of 46 different mother-tongue languages. We propose hierarchical tagging sets to manually annotate grammatical errors, resulting in 33,835 inappropriate usages...
This study examines the challenging issues in the semantic annotation of the characteristics of verbal information of Mandarin Chinese. It proposes a frame-based constructional approach that aligns with linguistic premises in Frame Semantics, Construction Grammar and Cognitive Grammar. Given that semantic processing has a lot to do with human cognitive capacities, semantic transfer and profile on...
Reading ability is one of the most important skills for language learners. A grade-level reading corpus can be targeted more precisely at improving learners' reading abilities. Based on the Corpus of Teaching Chinese as a Second Language (CTC), this paper presents a grade standard for the construction of a grade-level reading corpus. The corpus is tagged with linguistic information, and it can be used as a language...
Grammar teaching and learning have always been important and difficult parts of L2 Chinese. This paper demonstrates a method for automatically extracting and recommending Grammar Points to L2 Chinese teachers and learners. First, an L2 Chinese grammar syllabus is reconstructed based on a corpus of international Chinese teaching materials. Second, a regular expression-based learning algorithm is explored...
The number of customer reviews on websites has increased greatly in recent years. Detecting aspects in those reviews is becoming a challenging task because of their size and complexity. Hence, an automated mechanism is needed to detect product aspects from online consumer reviews. In this paper we model an unsupervised technique to detect product aspects. In general, the product aspect may be single word or...
The paper presents various Russian language corpora to discuss professional advantages and cultural benefits of linguistic corpora technology in comparison with the pre-computational and pre-corpora state-of-the-art in language research and Arts and Humanities. As the most faithful ‘mirror’ of political, intellectual and spiritual life of a nation during current state and in historical perspective,...
How to represent the structure of a sentence is a key issue in linguistics and NLP. Dependency Grammar (DG) has been widely used as it directly describes the relations between words in a sentence. However, it always follows a tree structure, which does not fit the argument-sharing phenomenon. On the other hand, the Semantic Role Labeling (SRL) annotation does not give a full structure for a...
This paper presents a novel version of ExATO, a term extractor originally designed to extract relevant terms from corpora in Portuguese. This new version handles not only corpora in Portuguese but also texts in English. This extension is likely to offer the same quality pattern already achieved for Portuguese. In this paper, we draw the analysis of results in parallel corpora...
A tweet is an authentic use of Natural Language where the user has to deliver the message in 140 characters or less. According to previous researchers, this restriction increases the possible ambiguity of a tweet, making it difficult for traditional Natural Language Processing (NLP) tools to analyze it. This research enhances the machine-learning-based Stanford CoreNLP Part-of-Speech (POS) tagger with...
Document Summarization is a technique for conveying the important information in a given document. It is one of the most important tasks in Natural Language Processing, as the summary produced is helpful for information retrieval systems, question answering systems, and the medical and news domains, among others. Most of the summarization works in Indian languages are of extractive nature and not much work is oriented...
Natural language processing (NLP) is one of the major fields in computer science. NLP is the ability of a system to process sentences in natural language. Part-of-speech tagging, pragmatic analysis, machine translation, discourse analysis, etc. are among the different fields in NLP. Malayalam is one of the important languages in the Dravidian family, where the difficult grammar structure will be the...
Opinion or sentiment analysis has emerged as a way to extract useful information from large amounts of unstructured text data, in the form of customer reviews on different products and their features or online SNS data. Customer reviews are helpful not only for potential customers but also for the manufacturers of the products, who can use them to improve their products and services. The reviews' conciseness takes...