In this article, we report on a platform and architecture that make real-time analysis of big data possible, and on a structured IT infrastructure in which they are optimally combined. We developed a distributed architecture in which the data conversion and the abnormality determination are multi-blocked. Furthermore, by selecting a distributed storage DB, we succeeded in constructing an IT infrastructure capable...
There has been great interest in computing experiments that have proven useful on shared-nothing computers and commodity machines. They require multiple systems running in parallel and working closely together towards the same goal. It has frequently been observed that the distributed execution engine MapReduce handles the primary input-output workload for such...
The Atmospheric Radiation Measurement (ARM) Climate Research Facility (www.arm.gov) provides atmospheric observations from diverse climatic regimes around the world. Currently, ARM archives over 22 million user-accessible data files, primarily stored in NetCDF file format, with total data volumes close to one petabyte. In this paper, we will discuss how ARM is currently storing, distributing, cataloging...
RDF datasets have increased rapidly over the last few years. In order to process SPARQL queries on these large datasets, much effort has been spent on developing horizontally scalable techniques, which involve data partitioning and parallel query processing. While distribution may provide storage scalability, it may also incur high communication costs for processing queries. In this paper, we present...
In this paper, we will discuss how NASA's Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) is distributing large volumes of ‘structured’ data using Daily Surface Weather Data and a corresponding Climatological Summaries Dataset (Daymet) as an example.
This paper proposes the Topology-aware Virtual Machine Selection (TAVMS) algorithm to choose sets of communicating groups of virtual machines (VMs) to be migrated to other data centers, aiming at global energy savings. It considers the migration of groups of VMs as well as the data center network topology, selecting VM groups with network proximity in order to increase the potential number of pieces of equipment...
Distributed SQL query engines, like Hive, Spark, and Impala, have become the de facto database set-up for Decision Support Systems (DSS) with large database sizes. Unlike other distributed computing workloads such as graph processing and OLTP transactions, DSS queries are often CPU-bound rather than network I/O-bound [8]. In this paper, we identify apparent anomalies in query performance on a distributed Hive database...
Many high-performance computing (HPC) sites extend their clusters to support Hadoop MapReduce for a variety of applications. However, an HPC cluster differs from a Hadoop cluster in the configuration of its storage resources. In the Hadoop Distributed File System (HDFS), data resides on the compute nodes, while in an HPC cluster, data is stored on separate nodes dedicated to storage. Dedicated storage offloads...
The convergence of high-performance computing and big data, which has become known as the field of extreme big data, is problematic in that file creation in storage systems such as distributed file systems is not optimized. That is, the large workload leads to the simultaneous creation of many files by many processes when creating checkpoints. The need to improve the file creation processes prompted...
Privacy-preserving data mining has been studied widely on static data. Static algorithms, however, are not suitable for streaming data. This calls for new privacy-preserving algorithms that cope with the characteristics of data streams. Recently, effective anonymization algorithms have been studied on centralized data streams. In this paper we propose an approach for anonymizing distributed data streams...
With the advent of big data, data center applications are processing vast amounts of unstructured and semi-structured data, in parallel on large clusters, across hundreds to thousands of nodes. The highest performance for these batch big data workloads is achieved using expensive network equipment with large buffers, which accommodate bursts in network traffic and allocate bandwidth fairly even when...
Supercomputing has been widely adopted in theoretical physics, theoretical chemistry, climate modeling, biological simulation and medical research for high-performance and energy-efficient computing. Many scientific applications are I/O-sensitive, and users have to tolerate high latency when the supercomputing center's storage processes thousands of I/O requests. In this paper, IMFSSC, an in-memory...
Many critical e-commerce and financial services depend predominantly on geo-distributed data centers for scalability and availability. Recent market surveys show that data center failures are inevitable and cause huge financial losses. Fault-tolerant distributed data centers are typically designed by provisioning spare capacity to mask failure at a site. At the same time, data center operators are...
With an increase in the usage of data centers to power content distribution networks (CDN), minimizing the cost of deployment while handling fault-tolerance has become an important research issue. In this work, we demonstrate the importance of cost-aware capacity provisioning in fault-tolerant CDN data centers (that can tolerate failure at a single site). We propose an optimization model that exploits...
Concurrent Big Data applications often require high-performance storage, as well as ACID (Atomicity, Consistency, Isolation, Durability) transaction support. Although blobs (binary large objects) are an increasingly popular storage model for such applications, state-of-the-art blob storage systems offer no transaction semantics. This requires users to coordinate data access carefully in order to avoid...
Integrating the Hive, Impala and Spark SQL platforms enables rapid data retrieval using SQL queries in a big data environment. This paper designs an optimized platform selection scheme to greatly improve the response of data retrieval. It can automatically choose the best-performing platform to execute SQL commands. In addition, the distributed memory storage systems using Memcached...
In this paper, we demonstrate a coded computing framework, named Coded Distributed Computing (CDC), which optimally trades extra computation resources for communication bandwidth in a MapReduce-type distributed computing environment. We also empirically illustrate the practical impact of CDC by analyzing the performance of a distributed sorting algorithm, named CodedTeraSort, which was developed by...
In today's cloud computing world, users can easily modify and share data as a group. The main issues in cloud computing are data privacy, data integrity, and data access by unauthorized users. A TTP (Trusted Third Party) is used to store and share data in cloud computing. To verify the integrity of the data, users in the group need to compute signatures on all the blocks in the shared data. In shared data different...
As the data volumes to be processed in all domains (scientific, professional, social, etc.) increase rapidly, their management and storage raise more and more challenges. The emergence of highly scalable infrastructures has contributed to the evolution of storage management technologies. However, numerous problems have emerged such as consistency and availability of data, scalability of...
Thanks to their ability to return interesting objects in a database, skyline queries have received considerable attention from the database community over the last few years. Skyline analysis is a powerful tool in a wide spectrum of real applications including multi-criteria optimal decision making, preference answering and many applications where uncertain, imprecise and noisy data inherently...