The similarity between the semantic relations that exist between two word pairs is defined as their relational similarity. For example, the semantic relation "is a large" holds between the words in the word pairs (lion, cat) and (ostrich, bird), because a lion is a large cat, and the ostrich is the largest living bird on earth. Consequently, the two word pairs, (lion, cat) and (ostrich, bird), are considered...
Web news articles play an important role in the stock market. Sentiment classification of news articles can help investors make investment decisions more efficiently. In this paper, we implemented an approach to Chinese new-word detection using an N-gram model and applied the result to Chinese word segmentation and sentiment classification. Appraisal theory was introduced into sentiment analysis...
Language identification (LID) has long been regarded as a fascinating field of study. Studies on language identification have been carried out since the early 1970s, and a great deal of research has been done in this area since then. In this paper, a few of those papers are highlighted and reviewed, covering the history and the current state of research on various techniques that have been applied for...
This paper investigates lexical stress detection for Chinese learners of English, where a combined differential acoustic feature is developed to represent the lexical stress of polysyllabic words in continuous speech. The frame-averaged features and the intra-word contextual information can be input to the classifiers without normalization. The word-based stress detection method proposed in...
SMS spam filtering for Thai and English has not previously been studied or implemented. Two methods for filtering spam SMS messages written in Thai and English have been studied and implemented. The first method simply uses existing English spam message filtering and then extends it to support the Thai language. The second one applies text normalization, word segmentation...
A parallel corpus is a valuable resource for important applications of natural language processing such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge, part of which contains bilingual information in various kinds of web pages. It currently attracts many studies on building parallel corpora based on the...
The Hidden Markov Support Vector Machine (HMSVM) is a novel structural SVM model. Its efficiency has been demonstrated in label sequence learning tasks such as English text chunking. In this paper, we treat Chinese chunk recognition as a label sequence learning problem. After defining the Chinese chunk, we apply HMSVM to the Chinese chunking problem. The experimental results show that it achieves a better...
A corpus is a set of language materials stored in computers that can be searched, queried, and analyzed by computer for enterprise decision-makers. Automated text categorization has been extensively studied, and various techniques for document categorization have been proposed. But given the current scarcity of Chinese corpora, especially in the field of text categorization, the Chinese categorization corpus is...
Feature selection is an important preprocessing step in Chinese text categorization; it reduces the high dimensionality and, compared to feature extraction, keeps the reduced results comprehensible. A novel criterion for coarse feature filtering is proposed, which integrates the advantages of term frequency-inverse document frequency as an inner-class measure and chi-square as an inter-class measure, and a new feature...
Since the Urdu language has more isolated letters than Arabic and Farsi, research on Urdu handwritten words is desirable. This is a novel approach that uses compound features and a Support Vector Machine (SVM) for offline Urdu word recognition. Because of the cursive style of Urdu, classification using a holistic approach is adapted efficiently. Compound feature sets, which involve structural and...
This paper describes a two-way textual entailment (TE) recognition system that uses lexical and syntactic features. The hybrid TE system is based on a Support Vector Machine that uses twenty-three features for lexical similarity and the output tag from a rule-based syntactic two-way TE system as another feature. The important lexical features used in the present system are:...
Language identification is an important issue in today's multilingual world. In this paper, we analyze a Fuzzy-SVM technique for identifying romanized plaintexts of five Indian regional languages, namely Hindi, Bangla, Manipuri, Urdu, and Kashmiri. Distinguishing features/characteristics have been extracted from romanized plaintexts of each of these five languages and represented suitably through...
In this paper, we detail an approach to a very specific information extraction task, namely extracting biomarker information from biomedical literature. Starting with the abstract of a given publication, we first identify the evaluative sentence(s) among other sentences by recognizing words and phrases in the text belonging to semantic categories of interest to biomedical entities (i.e., semantic...
Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks. The categorization algorithm is the most critical factor in the performance of a text categorization system. Inductive learning classifiers are put forward. Very accurate text categorization results can be learned...
With the rapid growth of internet communication, many types of text are produced. They convey meanings that can contribute to text categorization. Emotion classification is also attracting more interest, but emotions in Thai text still cannot be classified correctly. Thus, this paper proposes a novel approach that takes advantage of bi-word occurrence to classify emotion...
Information distillation is the task that aims to extract relevant passages of text from massive volumes of textual and audio sources, given a query. In this paper, we investigate two perspectives that use shallow language processing for answering open-ended distillation queries, such as “List me facts about [event]”. The first approach is a summarization-based approach that uses the unsupervised...
Many text documents on the Web contain opinions or sentiments about an object, such as software reviews, product reviews, movie reviews, music reviews, and book reviews. Opinion mining, or sentiment classification, aims to extract the features on which reviewers express their opinions and to determine whether those opinions are positive or negative. In this paper, we propose an ontology-based...
The traditional text chunking approach identifies many phrases using only one model, with the same features used for all of them, so the features that are helpful for each phrase are ignored. In fact, different phrases have different helpful features. In this paper, the concept of a "sensitive feature" is proposed, and the sensitive features of eleven English types and seven Chinese types of...
Because of its various applications in data mining and information technology, automatic document classification is one of the important topics in computer science. Classification plays a vital role in many information management and retrieval tasks. Document classification, also known as document categorization, is the process of assigning one or more predefined category labels to a document. Classification...
We present results of an experiment dealing with combining outputs of five part-of-speech taggers via tagger voting in order to improve the overall accuracy of morphosyntactic tagging of Croatian texts using a subset of the Multext-East v3 tagset. The increase in accuracy over the best-performing single tagger is shown to exist, but not to be statistically significant. We discuss the performance of...
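The tagger voting mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say which voting scheme was used, so this assumes simple per-token plurality voting, with ties broken in favour of the tagger listed first. The function name `vote_tags` and the tag labels are hypothetical and are not taken from the paper or from the Multext-East tagset.

```python
from collections import Counter
from typing import List, Sequence


def vote_tags(tagger_outputs: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token tags from several taggers by plurality vote.

    tagger_outputs[i][j] is the tag that tagger i assigned to token j.
    Ties go to the tagger listed earliest (an assumption of this sketch).
    """
    if not tagger_outputs:
        return []
    n_tokens = len(tagger_outputs[0])
    if any(len(tags) != n_tokens for tags in tagger_outputs):
        raise ValueError("all taggers must tag the same number of tokens")

    voted = []
    for token_tags in zip(*tagger_outputs):
        counts = Counter(token_tags)
        best = max(counts.values())
        # The first tag, in tagger order, that reaches the top count wins.
        voted.append(next(t for t in token_tags if counts[t] == best))
    return voted


# Hypothetical outputs of five taggers for a three-token sentence.
outputs = [
    ["N", "V", "A"],
    ["N", "V", "N"],
    ["N", "N", "A"],
    ["V", "V", "N"],
    ["N", "V", "R"],
]
print(vote_tags(outputs))  # ['N', 'V', 'A']
```

Listing the taggers in order of their standalone accuracy makes the tie-break fall back to the best single tagger, which is one common design choice among several.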