Applications deployed in the Cloud usually come with dedicated performance and availability requirements. These can be met by replicating data across several sites and/or by partitioning data. Data replication allows read requests to be parallelized, and thus decreases data access latency, but induces significant overhead for the synchronization of updates. Partitioning, in contrast, is highly beneficial...
As hardware and software technologies have improved, our definition of a “manageable amount of data” has increased in its scope dramatically. The term “big data” can be applied to any of several different projects and technologies sharing the ultimate goal of supporting analysis on these large, heterogeneous, and evolving data sets. The term “data science” refers to the statistical, technical, and...
In this paper we present Digree, an experimental middleware system that can execute graph pattern matching queries over databases hosting voluminous graph datasets. First, we formally present the employed data model and the processes of re-writing a query into an equivalent set of subqueries and subsequently combining the partial results into the final result set. Our framework guarantees the correctness...
The Atmospheric Radiation Measurement (ARM) Climate Research Facility (www.arm.gov) provides atmospheric observations from diverse climatic regimes around the world. Currently, ARM archives over 22 million user-accessible data files, primarily stored in NetCDF file format, with total data volumes close to one petabyte. In this paper, we will discuss how ARM is currently storing, distributing, cataloging...
People-sensing data have been successfully utilized in various domains to support more livable places with on-demand transport systems, a green environment, a profitable economy and interactive governance; however, their potential in supporting the design of places has not been widely studied or explained. As an ongoing multidisciplinary project in Singapore, “Livable Places” mines valuable insights from...
With the rise of location-aware IoT devices, there is an increased desire to process and manage the stationary and moving trajectory data generated by these real-time sensors. There has been a corresponding evolution of distributed database and compute technology to handle the increasing data load. Here we describe challenges in managing this kind of data and survey the technologies that address those...
Apache Spark enables fast computations and greatly accelerates analytics applications by efficiently utilizing the main memory and caching data for later use. At its core, Apache Spark uses data structures called RDDs (Resilient Distributed Datasets) to give a unified view of the distributed data. However, the data represented in the RDDs remain unencrypted, which can result in leakage of confidential...
The use of functional brain imaging for research and diagnosis has benefitted greatly from recent advancements in neuroimaging technologies, as well as the explosive growth in the size and availability of fMRI data. While it has been shown in the literature that using multiple, large-scale fMRI datasets can improve reproducibility and lead to new discoveries, the computational and informatics systems...
Big data is currently a hot research topic, with four million hits on Google Scholar in October 2016. One reason for the popularity of big data research is the knowledge that can be extracted from analyzing these large data sets. However, data can contain sensitive information, and data must therefore be sufficiently protected as it is stored and processed. Furthermore, it might also be required to...
We have implemented an updated Hierarchical Triangular Mesh (HTM) as the basis for a unified data model and an indexing scheme for geoscience data to address the variety challenge of Big Earth Data. In the absence of variety, the volume challenge of Big Data is relatively easily addressable with parallel processing. The more important challenge in achieving optimal value with a Big Data solution for...
We have witnessed a dramatic increase in national cyberinfrastructure resources to support data-driven research. Orchestrating these resources to enable the creation of collaborative infrastructure capable of supporting data intensive activities is challenging. In this work we present RADII, a novel architecture and system that enables the provisioning and configuration of collaborative infrastructure...
In this paper we propose a two-stage algorithm for robust K-subspaces recovery. In the first stage, a large number of local candidate subspaces are generated by probabilistic farthest insertion, and then the initial near-optimal K-subspaces are solved by combinatorial selection with a randomized greedy method. In the second stage, the K-subspaces are further refined by assigning each data vector to...
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas. We provide an...
Globalization and cloud computing have allowed major strides forward in terms of communication possibilities, but they have also revealed how many different resource options and formats exist, access to which would dramatically increase the accuracy and reliability of choices made on the basis of computational output. As a result, there is increasing need for methods resolving levels of data translations...
The data warehouse system Hive has emerged as an important facility for supporting data computing and storage. In particular, RCFile is a tailor-made data placement structure implemented in Hive, which is designed for the data processing efficiency. In this paper, we propose several optimized schemes based on RCFile and introduce EStore, which is an optimized data placement structure that is able...
The performance of scalable analytic frameworks supporting data-intensive parallel applications often depends significantly on the time it takes to read input data. Therefore, existing frameworks like Spark and Flink try to achieve a high degree of data locality by scheduling tasks on nodes where the input data resides. However, the set of nodes running a job and its tasks is chosen by a cluster resource...
The age of cloud computing has introduced all the mechanisms needed to elastically scale distributed, cloud-enabled applications. At roughly the same time, NoSQL databases have been proclaimed as the scalable alternative to relational databases. Since then, NoSQL databases have become a core component of many large-scale distributed applications. This paper evaluates the scalability and elasticity features...
In this paper, we will discuss how NASA's Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) is distributing large volumes of ‘structured’ data using Daily Surface Weather Data and a corresponding Climatological Summaries Dataset (Daymet) as an example.
The adoption of Big Data technologies can potentially boost the scalability of data-driven biology and health workflows by orders of magnitude. Consider, for instance, that technologies in the Hadoop ecosystem have been successfully used in data-driven industry to scale processes to levels much larger than any biology- or health-driven work attempted thus far. In this work we demonstrate the...
Clustering large-scale data has become an important challenge, which has motivated several recent works. While the emphasis has been on the organization of massive data into disjoint groups, this work considers the identification of non-disjoint groups instead. In this setting, it is possible for a data object to belong simultaneously to several groups since many real-world applications...