The in-memory data processing framework Apache Spark has been stealing the limelight for low-latency interactive applications and for iterative and batch computations. Our early experience study [17] has shown that Apache Spark can be enhanced to leverage advanced features (e.g., RDMA) of high-performance networks (e.g., InfiniBand and RoCE) to improve the performance of the shuffle phase. With the fast evolving...
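To make the shuffle phase concrete, the sketch below shows hash partitioning, the step that routes each map-side record to a reduce partition; this is illustrative pure Python, not Spark's actual implementation, and transferring the resulting partitions between nodes is where RDMA-style network optimizations would apply.

```python
def hash_partition(records, num_partitions):
    """Group (key, value) records into reduce-side partitions by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # All records sharing a key land in the same partition,
        # so each reducer sees every value for its keys.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = hash_partition(records, 2)
```

In a distributed engine, each partition's buffered records are then serialized and sent over the network to the node running the corresponding reducer, which is why network transport dominates shuffle cost.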
Distributed communities of researchers rely increasingly on valuable, proprietary, or sensitive datasets. Given the growth of such data, especially in fields new to data-driven research like the social sciences and humanities, coupled with what are often strict and complex data-use agreements, many research communities now require methods that allow secure, scalable and cost-effective storage and...
We have witnessed a dramatic increase in national cyberinfrastructure resources to support data-driven research. Orchestrating these resources to enable the creation of collaborative infrastructure capable of supporting data-intensive activities is challenging. In this work we present RADII, a novel architecture and system that enables the provisioning and configuration of collaborative infrastructure...
Large-scale scientific applications are often expressed as workflows that help define the data dependencies between their different components. Several such workflows have huge storage and computation requirements, and so they need to be processed across multiple (cloud-federated) datacenters. It has been shown that efficient metadata handling plays a key role in the performance of computing systems. However,...
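A workflow of this kind can be sketched as a DAG of data dependencies; the task names below are invented for illustration, and the ordering step stands in for what a scheduler (or metadata service) must resolve before placing tasks across datacenters.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks whose
# outputs it consumes (its predecessors in the DAG).
workflow = {
    "preprocess": set(),
    "simulate": {"preprocess"},
    "analyze": {"simulate"},
    "visualize": {"analyze", "preprocess"},
}

# A valid execution order: every task appears after its dependencies.
order = list(TopologicalSorter(workflow).static_order())
```

Metadata handling enters exactly here: resolving which outputs exist, where they are stored, and which datacenter should run each task requires fast lookups over this dependency structure.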
In this paper, we pose and address some of the unique challenges in the analysis of scientific Big Data on supercomputing platforms. Our approach identifies, implements and scales numerical kernels that are critical to the instantiation of theory-inspired analytic workflows on modern computing architectures. We present the benefits of scalable kernels towards constructing algorithms such as principal...
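One example of a numerical kernel of the kind the abstract describes is power iteration, which finds the dominant eigenvector of a matrix and underlies principal component analysis; this is an illustrative pure-Python sketch, not the paper's implementation, and a scalable version would distribute the matrix-vector products.

```python
def power_iteration(matrix, steps=100):
    """Return the dominant eigenvector of a square matrix (up to sign)."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(steps):
        # Matrix-vector product: the kernel to parallelize at scale.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

A = [[2.0, 0.0], [0.0, 1.0]]
v = power_iteration(A)  # converges toward the eigenvector [1, 0]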
Scaling up scientific data analysis and machine learning algorithms for data-driven discovery is a grand challenge that we face today. Despite the growing need for analysis from science domains that are generating 'Big Data' from instruments and simulations, building high-performance analytical workflows of data-intensive algorithms has been daunting because: (i) the 'Big Data' hardware and software...
Natural Language Processing (NLP) constitutes a fundamental module for a plethora of domains where unstructured text is a predominant source. Despite the keen interest of both industry and the research community in developing NLP tools, current industrial solutions still suffer from two main drawbacks. First, the architectures underlying existing systems do not satisfy critical requirements of large-scale...
With no limit on time and location [1], the number of users attracted by massive open online courses (MOOCs) has increased rapidly, and many platforms have been built to provide a variety of courses. All of this has triggered explosive growth in data volume. As is well known, big data has emerged in many areas, and many techniques and methods have been proposed to deal with it. However, people still have no sense...
As social media has become increasingly popular in the modern world, people are using these platforms to express their opinions about products, businesses, and services. The need for categorizing these consumer reviews has become prominent. One effective solution is sentiment analysis (SA), which has been an active research topic. The goal of SA is to automatically extract and classify user opinions...
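To make the extract-and-classify goal concrete, here is a minimal lexicon-based sketch; the word lists are invented for illustration, and real SA systems use learned models rather than fixed lexicons.

```python
# Illustrative sentiment lexicons (not from any published SA system).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def classify_sentiment(review):
    """Label a review by counting positive vs. negative lexicon hits."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

classify_sentiment("great service love it")  # → "positive"
```

Even this toy version shows the two steps the abstract names: extracting opinion-bearing words and classifying their aggregate polarity.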
Analyzing and visualizing large datasets generated by real-time spatio-temporal activities (e.g., vehicle mobility or large crowd movement) is a very challenging task. Recursive delays at both the middleware and the front-end applications limit the usefulness of real-time analysis. In this paper, we present a framework, "Spatial-Crowd", that first handles spatio-temporal data acquisition and processing...
Workload characterization of Big Data applications has always been a challenging research problem. Big data applications often place high demands on multiple computing components in concert, such as storage, memory, network, and processors, and their performance characteristics evolve with the scale of the workload. To further complicate the problem, the increasing diversity of hardware technologies...
Intel Xeon Phi is a processor based on MIC architecture that contains a large number of compute cores with a high local memory bandwidth and 512-bit vector processing units. To achieve high performance on Xeon Phi, it is important for programmers to explore all the software features provided by the Intel compiler and libraries to fully utilize the new hardware resources. In this paper, we use the...
Data-driven science, accompanied by the explosion of petabytes of data, has created a need for dedicated analytics computing resources. Dedicated analytics clusters require large capital outlays due to their expensive hardware requirements. Additionally, if such resources are located far from the data they analyze, they also incur substantial data transfers, which have both cost and latency implications...
We present a study of scientific data analytics on heterogeneous architectures using the Legion runtime system. Legion is a new programming model and runtime system targeting distributed heterogeneous architectures. It introduces logical regions as a new abstraction for describing the structures and usages of program data. We describe how to leverage logical regions to express important properties...
Highly distributed applications dominate today's software industry, posing new challenges for novel software architectures capable of supporting real-time processing and analytics. The proposed framework, called REAXICS, is motivated by the fact that the demand for aggregating current and past big data streams requires new software methodologies, platforms, and services. The proposed framework is...
The Polystore architecture revisits the federated approach to accessing and querying standalone, independent databases in a uniform and optimized fashion, but this time in the context of heterogeneous data and specialized analyses. In light of this architectural philosophy, and of the major data architecture development efforts at the US Department of Veterans Affairs (VA),...
Polystores, i.e., data management systems that use multiple stores for different data models, are gaining popularity. We are developing a polystore-based system called AWESOME to support social data analytics. The AWESOME polystore can support relational, semistructured, graph and text data and houses a Spark computation engine to produce derived data during ingestion. ADIL, the data ingestion language...
Spatial-temporal computing refers to the modeling, management, and analysis of spatial and temporal information. Despite the recent advances in massive data manipulation, software system approaches that support the massive spatial-temporal data integration and analysis still face numerous challenges, including the lack of: (i) a high-level architectural framework for massive data integration and analysis;...
The data science skills shortage means that those who have the knowledge are under constant pressure to do more with less. While data science tools are improving at a staggering pace, the operational tools around them cannot keep up. Even researchers at Google state that the issue of automatic configuration and dependency management of services is still an "open, hard problem". This manifests...
The age of cloud computing has introduced all the mechanisms needed to elastically scale distributed, cloud-enabled applications. At roughly the same time, NoSQL databases have been proclaimed as the scalable alternative to relational databases. Since then, NoSQL databases are a core component of many large-scale distributed applications. This paper evaluates the scalability and elasticity features...
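The horizontal scalability that many NoSQL databases claim typically rests on consistent hashing, sketched below; this is a hedged illustration of the general technique, not the partitioning code of any particular store, and the node names are invented.

```python
import bisect
import hashlib

def _ring_hash(value):
    """Deterministic hash placing a string on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Map keys to nodes so that adding a node remaps only the keys
    on its ring segment, not the whole dataset."""

    def __init__(self, nodes):
        self._ring = sorted((_ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        hashes = [h for h, _ in self._ring]
        # First node clockwise from the key's position (wrap around).
        idx = bisect.bisect(hashes, _ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # same key always maps to the same node
```

This key-to-node stability under membership changes is what lets such systems scale elastically, which is precisely the property a scalability and elasticity evaluation would measure.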