Traditional machine learning algorithms often require computations on centralized data, but modern datasets are collected and stored in a distributed way. In addition to the cost of moving data to centralized locations, increasing concerns about privacy and security warrant distributed approaches. We propose keybin, a distributed key-based binning clustering algorithm for high-dimensional spaces....
Big data technology refers to the rapid acquisition of valuable information from large amounts of data of various types. It can be divided into eight technologies: data acquisition, data access, infrastructure, data processing, statistical analysis, data mining, model prediction and results presentation. The paper presents an improved statistical analysis method based on big data technology. A statistical...
In this paper, we present a distributed data visualization framework for HPC environments based on the PBVR (Particle Based Volume Rendering) method. The PBVR method is a kind of point-based rendering approach in which the volumetric data to be visualized is represented as a set of small, opaque particles. This method has object-space and image-space variants, defined by the place (object or image-...
Satellites can provide remote sensing data for disaster monitoring, and various sensors are generating huge volumes of remote sensing data for disaster management. It is urgent to store and process the massive data acquired by satellites as fast as possible. A flexible and rapid service platform can realize integrated services spanning data acquisition, data production and product visualization. This article...
The goal of this work is to present a software package which is able to process binary climate data by spawning Map-Reduce tasks while introducing minimal computational overhead and without modifying existing application code. The package is formed by the combination of two tools: Pipistrello, a Java utility that allows users to execute Map-Reduce tasks over any kind of binary file, and Tina, a lightweight...
This paper introduces the development of a distributed air-defense engagement simulation model based on data distribution service (DDS). To design and develop effectively, system developers need a high-resolution engagement simulation including complex engineering-level models and operational scenario models. Increasing the resolution of the model increases its complexity, which requires...
The exponential growth of digital data sources has the potential to transform all aspects of society and our lives. However, to achieve this impact, the data has to be processed promptly to extract insights that can drive decision making. Further, traditional approaches that rely on moving data to remote data centers for processing are no longer feasible. Instead, new approaches that effectively leverage...
Compared with distributed graph computation, traditional single-node computation is ill-suited to processing large-scale graph data. The GAS (Gather, Apply and Scatter) model is a universal vertex-cut graph computation programming model based on edge-centric programs that supports graph algorithms, performing distributed graph computation after graph partitioning. In this paper, we introduce three...
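To make the Gather, Apply and Scatter phases concrete, here is a minimal single-machine sketch of a GAS-style vertex program, using PageRank as the example algorithm. The graph, damping factor, and iteration count are illustrative choices of ours, not taken from the paper or any specific engine:

```python
# Sketch of the GAS (Gather, Apply, Scatter) vertex-program abstraction,
# illustrated with PageRank on a tiny in-memory graph. In a real
# distributed engine the graph would be vertex-cut partitioned across
# machines; here all three phases run locally.

DAMPING = 0.85  # standard PageRank damping factor

def pagerank_gas(edges, num_iters=20):
    # Build adjacency: in-neighbors for Gather, out-degrees for contributions.
    vertices = {v for e in edges for v in e}
    in_nbrs = {v: [] for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for src, dst in edges:
        in_nbrs[dst].append(src)
        out_deg[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        new_rank = {}
        for v in vertices:
            # Gather: sum contributions from in-neighbors.
            total = sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            # Apply: update the vertex value from the gathered sum.
            new_rank[v] = (1 - DAMPING) + DAMPING * total
        # Scatter: a distributed engine would signal only the out-neighbors
        # of changed vertices; this sketch simply starts the next sweep.
        rank = new_rank
    return rank

ranks = pagerank_gas([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
print(sorted(ranks, key=ranks.get, reverse=True))
```

The value of the abstraction is that an engine can schedule, partition, and parallelize the three phases without the algorithm author writing any distribution logic.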
Due to its simplicity and scalability, MapReduce has become a de facto standard computing model for big data processing. Since the original MapReduce model was only appropriate for embarrassingly parallel batch processing, many follow-up studies have focused on improving the efficiency and performance of the model. Spark follows one of these recent trends by providing in-memory processing capability...
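The batch-processing model the abstract refers to can be sketched in a few lines: a map phase emitting key-value pairs, a shuffle grouping by key, and a reduce phase aggregating each group. This is an illustrative word-count in plain Python (function names are ours, not from Hadoop or Spark); Spark's improvement is essentially keeping such intermediate datasets cached in memory so iterative jobs avoid re-reading them from disk:

```python
# Word count in the classic MapReduce style:
#   map    -> emit (word, 1) for every word
#   shuffle-> group emitted pairs by key
#   reduce -> sum the counts for each key
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big compute", "data moves compute waits"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)
```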
High-performance distributed memory applications often load or receive data in a format that differs from what the application uses. One such difference arises from how the application distributes data for parallel processing. Data must be redistributed from how it was laid out by the producer to how the application needs the data to be laid out amongst its processes. In this paper, we present a large-scale...
Cloud computing allows users to harness substantial computing power for complex data processing, which in turn generates huge and complex data. However, the virtual resources requested by users are rarely utilized to their full capacity. To mitigate this, providers often over-commit resources to maximize profit, which can result in node overloading and consequent task eviction. This paper presents a novel...
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to...
MapReduce has emerged as a popular programming model in the field of data-intensive computing. This is due to its simple design, which provides ease of use for programmers, and its framework implementations such as Hadoop, which have been adopted by large business and technology companies. One significant issue in practical MapReduce applications is data skew: the imbalance in the amount of data...
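The skew problem is easy to demonstrate: with default hash partitioning, every record sharing a "hot" key lands on the same reducer, so one partition dwarfs the rest. The dataset and reducer count below are made-up numbers purely for illustration:

```python
# Illustration of data skew in a MapReduce shuffle. Hash partitioning
# routes all records with the same key to one reducer, so a single hot
# key leaves the partitions badly imbalanced.
from collections import Counter

records = ["hot"] * 90 + ["a", "b", "c", "d", "e"] * 2  # 90 of 100 records share one key
NUM_REDUCERS = 4

# Simulate the partitioner: reducer index = hash(key) mod NUM_REDUCERS.
load = Counter(hash(key) % NUM_REDUCERS for key in records)

ideal = len(records) / NUM_REDUCERS
print(f"max reducer load: {max(load.values())}, ideal: {ideal:.0f}")
```

The job finishes only when the most loaded reducer does, which is why skew-mitigation techniques (range partitioning, key splitting, sampling) are a recurring research topic.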
Data mining algorithms can tackle data either centrally or in a distributed fashion. Outsourcing data can solve the issues of processing, storing, and analyzing massive data. Since existing data is spread across various places, and to improve classification results, we propose the following solution for privacy-preserving data mining. However, a critical problem that precludes free sharing of information...
Scheduling is one of the most important issues in executing tasks in grid systems. A data grid mainly deals with sharing and managing large amounts of distributed data in executing data-intensive applications. It is primarily a solution to satisfy the requirements of data-intensive task processing. OptorSim is a useful open-source simulation tool for data grids. In this paper, a new two-step data-intensive...
Traditional data mining (DM) faces certain challenges, viz. scalability, high dimensionality, and distributed data, and it often requires a huge amount of computational resources in terms of space and time to extract the hidden patterns in the data. In addition, the data has to be available at one location. But in today's era, data are often inherently distributed across several databases. Hence, due to...
The concept of workflow is used for modeling many of the data-intensive scientific applications executed on data grids. A workflow is a series of interdependent tasks through which data is processed. Scheduling workflows in grids is the process of assigning tasks to appropriate resources with the aim of achieving goals such as reducing workflow completion time while considering...
Scalable processing of large-scale RDF graphs has become a critical issue with the explosion of semantic web technologies. Most existing distributed RDF querying and reasoning solutions are designed on the MapReduce paradigm. However, MapReduce needs further optimization, since several inherent limitations, such as the lack of efficient job scheduling and iterative computing mechanisms, affect...
At present, there are many mature encryption mechanisms and access control models for protecting data content in cloud environments. However, research on the privacy protection of data attributes in the cloud is still at an initial stage; it can be classified into two types: one is the privacy protection of data attributes during data transmission, including routing information, generation...
Big data storage and sharing are becoming a major demand of the community. To meet this demand, virtually unified data facilities built on geo-distributed data centers present the user with a single unified namespace. These unified data facilities, however, lack efficient storage and analysis of data. To address these shortcomings in such unified data facilities, we designed...