This short empirical paper investigates how well topic modeling and database meta-data characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset of over 36 thousand PoC exploits, the empirical experiment obtains an accuracy rate of nearly 0.9. Text mining and topic modeling are a significant boost factor...
Indonesian text data from social media is one of the large text sources that are interesting to mine. Mining insights from large text data takes considerable effort and processing time. Moreover, Indonesian text data from social media contains natural language, including slang, that requires special treatment. We propose an incremental technique for a more efficient mining process of large text data with...
Due to the vast amount of data, searching and obtaining relevant information on the web is a challenging task. Although a broad range of classification techniques has been proposed to improve information retrieval methods, many difficulties remain because of the continuous increase in the amount of web content, as well as its diversity. In this paper, we propose a method that...
The huge amount of text documents has made the manual organization of text data a tedious task. Automatic text classification helps to handle the large number of documents by organising them automatically into predefined classes. The effectiveness and efficiency of automatic text classification largely depend on the way text documents are represented. A text document is usually viewed as a...
Game reviewing is one of the ways for game users and critics to comment on and discuss a game. Game developers and marketers can use game reviews as insights to assist in designing a better game by specifying quality requirements and providing better game marketing. Usability and problems are major concerns of users and game developers since these qualities affect users' satisfaction and opportunity...
Nowadays, most of the data on the Web is still in the form of unstructured text. Knowledge extraction from unstructured text is highly desirable but extremely challenging due to the inherent ambiguity of natural language. In this article, we present an architecture of an information extraction system based on the concept of Embedded Controlled Language that allows for extracting formal semantic knowledge...
Requirements elicitation is the activity of identifying facts that compose the system requirements. One of the steps of this activity is the identification of information sources, which is a time-consuming task. Text documents are typically an important and abundant information source. However, their analysis to gather useful information is also time consuming and hard to automate. Because of its...
In this work we propose an inclusive vector to keep the key words available in a natural language database. The inclusive vectors are generated by extracting the words given in the source and the cited items of records published in the ISI Thomson Citation Indexes. The proposed inclusive vector exhibits related words and the degree of their relationships. In this work we present the results...
The Web is a huge virtual space in which to express and share individual opinions, influencing any aspect of life, with implications for marketing and communication alike. Social Media are influencing consumers' preferences by shaping their attitudes and behaviors. Monitoring Social Media activities is a good way to measure customers' loyalty, keeping track of their sentiment...
Revealing an opinion hidden in a text document is a challenging task. The article presents a method based on the automatic extraction of expressions that are significant for specifying a document's attitude to a given topic. The significant expressions are composed from the significant words revealed in the documents. The significant words are selected by the c5 decision-tree generator based on the entropy...
One of the biggest problems in sensitive data wiping is determining whether a file is sensitive. Data wiping applications have improved a lot, but they cannot determine by themselves whether a file is sensitive. The method we propose tries to determine whether a file is sensitive by using a pre-defined set of rules initially specified by the user. These rules can update themselves over time, by “learning”...
In this paper we propose an automated method for generating domain-specific stop words to improve classification of natural language content. To test the generated stop words, we also implemented a Bayesian natural language classifier working on web pages, which is based on maximum a posteriori probability estimation of keyword distributions using a bag-of-words model. We investigated the distribution...
In the preprocessing stage of text mining, a stop-word list is constructed to filter the segmentation results of the text documents so that the dimensionality of the text feature space can be reduced at the outset. This paper summarizes the definition, extraction principles and methods of stop-words, and constructs a customized Chinese-English stop-word list with the classical stop-word list based on the difference...
Online Social Networks are so popular nowadays that they are a major component of an individual's social interaction. They are also emotionally-rich environments where close friends share their emotions, feelings and thoughts. In this paper, a new framework is proposed for characterizing emotional interactions in social networks, and then using these characteristics to distinguish friends from acquaintances...
This paper describes steps that have been taken to construct a development dataset for the task of Technology Structure Mining. We have defined the proposed task as the process of mapping a scientific corpus into a labeled digraph named a Technology Structure Graph as described in the paper. The generated graph expresses the domain semantics in terms of interdependencies between pairs of technologies...
We describe a system for the detection of mentions of protein-protein interactions in the biomedical scientific literature. The original system was developed as a part of the OntoGene project, which focuses on using advanced computational linguistic techniques for text mining applications in the biomedical domain. In this paper, we focus in particular on the participation in the BioCreative II.5 challenge,...
Discovering aliases in Thai articles is challenging. We apply a standard vector space model to explore and match aliases with formal names or with each other. We first construct a term-by-document matrix (TDM), which contains the frequency of each term occurring in the document collection, assuming that all terms exist in the typed named entity dictionary. Normalization techniques are used instead of standard...
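As a rough illustration of the term-by-document matrix idea in the abstract above, the following minimal sketch builds a TDM from raw term counts and compares term rows with cosine similarity. The toy documents, the romanized entity names, and the helpers `build_tdm`, `cosine`, and `best_match` are hypothetical and chosen only for illustration; the abstract does not say which normalization or similarity measure the authors use, so plain cosine similarity is an assumption here.

```python
import math
from collections import Counter


def build_tdm(documents, vocabulary):
    """Map each vocabulary term to its row of per-document term frequencies."""
    counts = [Counter(doc.split()) for doc in documents]
    return {term: [c[term] for c in counts] for term in vocabulary}


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(term, tdm):
    """Return the other term whose TDM row is most similar to the given term's row."""
    others = [t for t in tdm if t != term]
    return max(others, key=lambda t: cosine(tdm[term], tdm[t]))


# Hypothetical toy collection; the vocabulary plays the role of the
# named entity dictionary (an assumption for this sketch).
documents = [
    "bangkok also called bkk hosts summit",
    "flights to bkk resume as bangkok reopens",
    "chiangmai festival starts",
    "bangkok traffic worsens",
]
vocabulary = ["bkk", "bangkok", "chiangmai"]

tdm = build_tdm(documents, vocabulary)
print(best_match("bkk", tdm))
```

In this sketch, terms that tend to occur in the same documents end up with similar rows, which is one simple way an alias could be paired with a formal name.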
This paper describes a linguistic text mining tool for analyzing problem reports in aerospace engineering and safety organizations. The semantic trend analysis tool (STAT) helps analysts find and review recurrences, similarities and trends in problem reports. The tool is being used to analyze engineering discrepancy reports at NASA Johnson Space Center. The tool has been augmented with a statistical...
In multi-instance learning, the training set comprises labeled bags which are composed of unlabeled instances, and the task is to predict the labels of unseen bags. In this paper, a text mining problem, i.e. text representation, is investigated from a multi-instance view. In detail, each text is regarded as a bag while each of its sentences is regarded as an instance. A bag can be labeled by its class...
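The bag-and-instance view described in the abstract above can be sketched as a data structure. This is a minimal illustration only: the regex-based sentence splitter, the word-count instance features, the helper `text_to_bag`, and the example texts and labels are all assumptions, not the representation used in the paper.

```python
import re
from collections import Counter


def text_to_bag(text):
    """Turn one text into a bag: a list of instances, one per sentence.

    Each instance is a Counter of lowercase word counts (an assumed,
    deliberately simple feature representation).
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [Counter(s.lower().split()) for s in sentences]


# Hypothetical training set: the label belongs to the bag (the whole text),
# while the individual sentences carry no labels of their own.
training_set = [
    (text_to_bag("The match ended late. Fans cheered the team."), "sports"),
    (text_to_bag("Shares fell sharply. Investors sold bonds."), "finance"),
]

for bag, label in training_set:
    print(label, len(bag), "instances")
```

A multi-instance learner would then be trained on `(bag, label)` pairs like these and asked to predict the label of an unseen bag.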
The paper presents an ontology creation approach based on evolutionary sequences. A universal system for accumulating applied scientific knowledge is proposed. A method for compressing text content is also discussed.