K-modes is a typical categorical clustering algorithm. First, we improve the K-modes process: when categorical objects are allocated to clusters, the count of each attribute item in each cluster is updated, so that the new cluster modes can be computed after reading the whole dataset once. To make K-modes capable of handling large-scale categorical data, we then implement K-modes on Hadoop using...
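The single-pass mode update described in this abstract can be sketched as follows. This is a minimal single-machine illustration, not the authors' Hadoop implementation: the function names `hamming` and `kmodes_pass`, the use of simple matching (Hamming) dissimilarity, and the rule of keeping the old mode value for an empty cluster are all assumptions made for the example.

```python
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def kmodes_pass(records, modes):
    """Run one allocation pass and return (labels, new_modes).

    Per-cluster attribute counts are updated while each record is
    assigned, so the new modes are available after a single read of
    the data, without a second scan.
    """
    n_attrs = len(modes[0])
    # counts[c][j] maps each category of attribute j to its frequency in cluster c
    counts = [[Counter() for _ in range(n_attrs)] for _ in modes]
    labels = []
    for rec in records:
        # Assign the record to the nearest mode (first one wins on ties)
        c = min(range(len(modes)), key=lambda k: hamming(rec, modes[k]))
        labels.append(c)
        for j, value in enumerate(rec):
            counts[c][j][value] += 1
    new_modes = []
    for c, old in enumerate(modes):
        # Most frequent category per attribute; keep the old value if the cluster is empty
        new_modes.append(tuple(
            counts[c][j].most_common(1)[0][0] if counts[c][j] else old[j]
            for j in range(n_attrs)
        ))
    return labels, new_modes


records = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y"), ("b", "z")]
labels, new_modes = kmodes_pass(records, [("a", "y"), ("b", "y")])
```

In a MapReduce setting, the per-cluster counters would plausibly be what gets combined across workers, but how the paper partitions this work is not stated in the visible part of the abstract.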
Identifying the parameters of a model such that it best fits an observed set of data points is fundamental to the majority of problems in computer vision. This task is particularly demanding when portions of the data have been corrupted by gross outliers, that is, measurements that are not explained by the assumed distributions. In this paper we present a novel method that uses the Least Quantile of Squares...
Recent years have seen advances in Statistical Machine Translation (SMT) for breaking language barriers, and there is huge demand for cross-language dialog communication between people. In this paper we propose to leverage SMT to support cross-language dialog communication. Several techniques are applied to improve performance on a dialog domain, including rescoring, system combination,...
As modern simulations involve large inputs and outputs over the network, there is an increasing need to store, manage and analyze massive datasets efficiently. In this paper, we present ARLS (After action Reviewer for Large-Scale simulation data), a Hadoop-based output analysis tool for large-scale simulation datasets. ARLS clusters distributed storage using Hadoop and analyzes the large-scale...
This work presents a study on prediction of university enrollment using three computational intelligence (CI) techniques. The enrollment forecasting has been considered as a form of time series prediction using CI techniques that include an artificial neural network (ANN), a neuro-fuzzy inference system (ANFIS) and an aggregated fuzzy time series model. A novel form of ANN, namely, single multiplicative...
With the development of multicore clusters, the task scheduling problem in heterogeneous clusters has become a hot research topic. In cloud computing, the method used to solve this problem is virtualization, which makes the heterogeneous nodes isomorphic so that the MapReduce model can be used for task scheduling across isomorphic nodes. But the approach has some shortcomings: virtualization itself will cause the...
This paper presents a comprehensive review of Cyberinfrastructure (CI), an emerging collaborative research environment, including its representative applications in four science communities around the world. An in-depth analysis is also conducted to reveal the key functions and desired features that can be expected from modern CI systems.
The word “Cloud” has become more and more popular these days. One of its applications, cloud for manufacturing, has been proposed under the name Cloud Manufacturing. It is realized by setting up a public service cloud platform that shares manufacturing resources and knowledge. As the big data era arrives, the amount of both resources and knowledge in the platform may increase much more rapidly than ever before...
Large-scale data processing has emerged as an important issue for researchers. By reusing calculation results, the efficiency of large-scale data processing can be improved greatly. This paper proposes an efficient data reuse strategy based on the data warehouse tool Hive, which runs on the MapReduce framework. Since the intermediate calculation results have been stored in DFS by different jobs...
In opportunistic networks, nodes store and carry data and forward it when they encounter each other. Choosing an appropriate opportunity to forward data is pivotal for routing in this type of network. Since nodes keep a regular movement pattern in the scenario discussed in this paper, forecasting a node's track in the near future would be very helpful. Through...
Samtla (Search And Mining Tools with Linguistic Analysis) is an online integrated research environment designed in collaboration with historians and linguists to facilitate the study of digitised texts written in any language. It currently supports the research of two corpora: the Genizah collection held by the Taylor-Schechter Genizah Research Unit in Cambridge University, and a collection of Aramaic...
The network traffic generated by a computer, or a pair of computers, is often well modelled as a series of sessions. These are, roughly speaking, intervals of time during which a computer is engaging in the same, continued, activity. This article explores a variety of statistical approaches to re-discovering sessions from network flow data using timing alone. Solutions to this problem are essential...
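As a rough illustration of recovering sessions from timing alone, the sketch below splits a list of event timestamps wherever the gap between consecutive events exceeds a fixed threshold. This is a simple baseline assumed for the example, not necessarily one of the statistical approaches the article explores; the function name `split_sessions` and the parameter `max_gap` are hypothetical.

```python
def split_sessions(timestamps, max_gap):
    """Group event times into sessions separated by gaps longer than max_gap.

    timestamps: iterable of numeric event times (e.g. flow start times in seconds)
    max_gap:    largest gap, in the same units, allowed inside one session
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap:
            # Close enough to the previous event: same session
            sessions[-1].append(t)
        else:
            # First event, or a long silence: start a new session
            sessions.append([t])
    return sessions


sessions = split_sessions([0, 1, 2, 50, 51, 200], max_gap=10)
```

A fixed threshold is the crudest choice here; the right value depends on the traffic, which is exactly where a statistical model of inter-arrival times would come in.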
How can we effectively use costly statistical models in the defence of large computer networks? Statistical modelling and machine learning are potentially powerful ways to detect threats as they do not require a human level understanding of the attack. However, they are rarely applied in practice as the computational cost of deploying all but the most simple algorithms can become implausibly large...
This section of the volume contains the proceedings of the 3M4SE 2014 workshop, held on September 1-2, 2014, in Ulm, Germany, in conjunction with the 18th IEEE International EDOC Conference on Enterprise Computing, EDOC 2014.
We present the M.In.E.R.Va. project, an online tool to help students (attending both secondary schools and universities) remedy their deficiencies in mathematics. We also present an analysis of the answers given by the students, performed using the Rasch model, which allows us to investigate both the students' results and the validity of the items.
Record linkage is the task of identifying which records from one or more data sources refer to the same entity. Many record linkage methods were introduced and applied over the last decades. In general, the principle is to compare a range of available identifier fields in record pairs among different data sources, in order to make a linkage decision. The Fellegi-Sunter probabilistic record linkage...
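The field-comparison principle behind Fellegi-Sunter style linkage can be sketched with the standard log-likelihood-ratio match weight. In this sketch, `m[f]` is the probability that field `f` agrees for a true match and `u[f]` the probability that it agrees for a non-match; the field names and probability values are made up for illustration, and the function name `match_weight` is hypothetical.

```python
import math


def match_weight(agreements, m, u):
    """Sum per-field log2 likelihood ratios for one record pair.

    agreements: dict field -> True if the two records agree on that field
    m, u:       dicts field -> agreement probability among matches / non-matches
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


# Illustrative (made-up) probabilities for two identifier fields
m = {"surname": 0.9, "birth_year": 0.8}
u = {"surname": 0.1, "birth_year": 0.2}

w_agree = match_weight({"surname": True, "birth_year": True}, m, u)
w_disagree = match_weight({"surname": False, "birth_year": False}, m, u)
```

A linkage decision would then compare the total weight against upper and lower thresholds, with pairs falling in between typically sent for clerical review.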
We deal with data structures and algorithms suitable for displaying multiple binary raster images. Multiple images are treated as a single multiple-layer image. In this paper, we introduce three algorithms for operating on images represented by a hexadeci-grid as multiple-layer images and show some examples of the introduced algorithms.
In order to quickly determine the distribution of anomaly detection model based on small amounts of collected data, the moving relative entropy density deviation method (MREDD) is introduced to test the power series distributed random sequence. Through the moving averages of data analysis and comparison, the anomaly detection models can quickly be established. Experimental results show that this method...
Latent Dirichlet Allocation (LDA) has been widely applied to text mining. LDA is a probabilistic topic model that represents documents as probability distributions over topics. One challenging issue in applying LDA is selecting the optimal number of topics for the model. This paper presents a topic selection method which considers the density of each topic and computes the most unstable topic...
Cloud storage is now an important development trend in information technology. However, information security issues such as data confidentiality, integrity, and availability have become an important obstacle to its commercial application. In this paper, we revisit the two private PDP schemes. We show that the property of correctness cannot be achieved when active adversaries are involved in these auditing systems...