Recognition of named entities (people, companies, locations, etc.) is an essential task of text analytics. We address a subproblem of this task, namely, named entity classification. We propose a novel approach that constructs an effective fine-grained named entity classifier. Its key highlights are semi-automatic training set construction from Wikipedia articles and additional feature selection....
Link spam techniques can enable some pages to achieve higher-than-deserved rankings in the results of a search engine. They negatively affect the quality of search results. Classification methods can detect link spam. For classification problems, features play an important role. This paper proposes to derive new features using genetic programming from existing link-based features and use the new features...
Tag recommendation is an integral part of any bookmarking application. With the growing popularity in Web 2.0 usage, recommending tags is of utmost importance in enabling a user to perform bookmarking easily. An issue that most recommendation systems do not consider is that users have a tendency to choose from tags that are suggested to them, which might bias the true popular rankings of tags. In...
Traffic classification has become a crucial domain of research due to the rise in applications that are either encrypted or tend to change port consecutively. The challenge of flow classification is to determine the applications involved without any information on the payload. In this paper, our goal is to achieve a robust and reliable flow classification using data mining techniques. We propose a...
The daily increase in malware exploiting the Internet has become a serious threat. Manual heuristic inspection for malware analysis is no longer considered effective or efficient given the high spreading rate of malware. Hence, automated behavior-based malware detection using machine learning techniques is considered a profound solution. The behavior of each malware on an emulated...
This paper presents a simulation-based empirical study of the performance profile of random subsample ensembles with a hybrid mix of base learner composition in high dimensional feature spaces. The performance of a hybrid random subsample ensemble that uses a combination of C4.5, k-nearest neighbor (kNN) and naïve Bayes base learners is assessed through statistical testing in comparison to those...
Worms are self-contained programs that spread over the Internet. Worms cause problems such as loss of information, information theft and denial-of-service attacks. The first part of the paper evaluates the detection of worms based on content classification by using all machine learning techniques available in the WEKA data mining tools. Four most accurate and quite fast classifiers are identified for...
Port-based or payload-based analysis is becoming difficult for accurate traffic identification when many applications use dynamic port numbers and encryption to avoid detection. In this paper we present an approach for online traffic classification relying on the observation of the first n packets of a flow. The packet size and inter-arrival times of the individual packets, rather than the statistic...
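The per-packet features mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes a flow is given as a list of `(timestamp, size)` pairs; the function name `first_n_features`, the zero-padding of short flows, and the default `n = 5` are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical packet representation: (arrival time in seconds, size in bytes).
Packet = Tuple[float, int]


def first_n_features(packets: Sequence[Packet], n: int = 5) -> List[float]:
    """Build a feature vector from the first n packets of one flow.

    Layout: n packet sizes followed by n - 1 inter-arrival times.
    Flows shorter than n packets are zero-padded (an assumption of this sketch).
    """
    # Order by arrival time and keep only the first n packets.
    head = sorted(packets, key=lambda p: p[0])[:n]
    sizes = [float(size) for _, size in head]
    # Inter-arrival time = difference between consecutive timestamps.
    gaps = [later[0] - earlier[0] for earlier, later in zip(head, head[1:])]
    sizes += [0.0] * (n - len(sizes))
    gaps += [0.0] * ((n - 1) - len(gaps))
    return sizes + gaps


if __name__ == "__main__":
    flow = [(0.0, 60), (0.5, 1500), (1.5, 40)]
    print(first_n_features(flow, n=3))
```

The resulting fixed-length vector could be passed to any standard classifier; which classifier the authors use is not visible in the truncated abstract.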
Obfuscated and encrypted protocols hinder traffic classification by classical techniques such as port analysis or deep packet inspection. Therefore, there is growing interest in classification algorithms based on statistical analysis of the length of the first packets of flows. Most classifiers proposed in the literature are based on machine learning techniques and consider each flow independently of...
In this paper we propose to apply an algorithm for finding and cleaning mislabeled training samples in an adversarial learning context, in which a malicious user tries to camouflage training patterns in order to limit the classification system performance. In particular, we describe how this algorithm can be effectively applied to the problem of identifying HTTP traffic flowing through port TCP...
As computer systems become increasingly complex, system anomalies have become major concerns in system management. In this paper, we present a comprehensive measurement study to quantify the predictability of different system anomalies. Online anomaly prediction allows the system to foresee impending anomalies so as to take proper actions to mitigate anomaly impact. Our anomaly prediction approach...
As the number of pages on the web is constantly increasing, there is a need to classify pages into categories to facilitate indexing or searching them. In the method proposed here, we use both textual and visual information to find a suitable representation of web page content. In this paper, several term weights based on TF or TF-IDF weighting are proposed. Modification is based on visual areas,...
Along with the rapid popularity of the Internet, crime information on the web is becoming increasingly rampant, and the majority of it is in the form of text. Because a lot of crime information in documents is described through events, event-based semantic technology can be used to study the patterns and trends of web-oriented crimes. In our research project on cyber crime mining, we construct...
Application-layer classification is needed in many monitoring applications. Classification based on machine learning offers an alternative to port-based or payload-based techniques. It is based on statistical features computed from network flows. Several works investigated the efficiency of machine learning techniques and found algorithms suitable for network classification. A classifier...
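For readers unfamiliar with the baseline weighting named in this abstract, here is a minimal sketch of plain TF-IDF in Python. It uses the common variant tf = count / document length and idf = ln(N / df); the paper's visual-area modification is cut off in the abstract and is deliberately not reproduced. The function name `tf_idf` and the token-list input format are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document.

    Variant used here (one of several in common use):
        tf(t, d) = count of t in d / number of tokens in d
        idf(t)   = ln(N / df(t)), with N documents and df(t) documents containing t
    """
    n_docs = len(docs)
    # Document frequency: count each term at most once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


if __name__ == "__main__":
    pages = [["spam", "link", "spam"], ["link", "page"]]
    for page_weights in tf_idf(pages):
        print(page_weights)
```

A term that occurs in every document gets weight zero under this variant, which is why smoothed idf formulas are often preferred in practice.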
Due to the complexity of topical opinion retrieval systems, standard measures, such as MAP or precision, do not fully succeed in assessing their performances. In this paper we introduce an evaluation framework based on artificially defined opinion classifiers. Using a Monte Carlo sampling, we perturb a relevance ranking by the outcomes of these classifiers and analyse how the opinion retrieval performance...
This work presents an unsupervised snippet-based sentiment classification method for Chinese unknown sentiment phrases, which is, in theory, also applicable to other languages. Unlike existing Semantic Orientation (SO) methods, our proposed method does not require any Reference Word Pairs (RWPs) for predicting the sentiments of phrases. The results of preliminary experiments show that our proposed...
Recently, DoS (Denial of Service) detection has become more and more important in web security. In this paper, we argue that DoS attacks can be treated as continuous data streams and can thus be detected by using stream data mining methods. More specifically, we propose a new Weighted Ensemble learning model to detect the DoS attacks. The Weighted Ensemble model first trains base classifiers using different...
This paper examines the performance of a new Hidden Markov Model (HMM) structure used as the core of an Internet traffic classifier and compares the results against other models present in the literature. Traffic modeling and classification find importance in many areas such as bandwidth management, traffic analysis, prediction and engineering, network planning, Quality of Service provisioning and...
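The abstract is cut off before it explains how the ensemble is trained or weighted, so the following Python sketch shows only the generic idea of weighted voting over base classifiers. The function name `weighted_vote`, the toy threshold classifiers, and the weights (imagined here as validation accuracies) are all hypothetical and should not be read as the paper's model.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# A base classifier maps a feature vector to a class label.
Classifier = Callable[[Sequence[float]], Hashable]


def weighted_vote(
    members: List[Tuple[Classifier, float]], x: Sequence[float]
) -> Hashable:
    """Return the label with the largest total weight among the members' votes.

    Raises ValueError if `members` is empty.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for classify, weight in members:
        # Each member adds its weight to the label it predicts.
        scores[classify(x)] += weight
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy members: x[0] stands for requests per second (an illustrative feature).
    ensemble = [
        (lambda x: "dos" if x[0] > 100 else "normal", 0.9),
        (lambda x: "normal", 0.4),
        (lambda x: "normal", 0.3),
    ]
    print(weighted_vote(ensemble, [500.0]))
    print(weighted_vote(ensemble, [10.0]))
```

In a streaming setting, the weights would typically be re-estimated as new labeled chunks arrive; that step is omitted here.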
A prospective buyer interested in a particular item may find out information about the item from various sources, including product reviews. With interactive information sharing facilitated by Web 2.0, a lot of product reviews are available on the web. For a popular item with a large number of reviews, a prospective buyer could use some help in selecting only reviews of interest, such as, only positive...
Internet services that have become easier to access have contributed to the drastic increase in the number of web pages. This phenomenon has created new difficulties for Internet users in retrieving the latest, relevant and excellent web information. This is due to the enormous contents of web information that have caused problems in the restructuring of web information. Thus, in order to ensure the...