This paper proposes a novel hybrid transactional memory scheme based on both abort prediction and an adaptive retry policy. First, the proposed scheme can predict not only conflicts between concurrently running transactions, but also capacity and other aborts, by collecting information about previously executed transactions. Second, the proposed scheme can provide an adaptive retry...
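The abstract above is truncated, but the general idea of abort-cause-aware retry can be sketched. The toy policy below is my own illustration, not the paper's actual scheme (all class and method names are hypothetical): it keeps retrying conflict aborts, which contention may resolve, while giving up immediately on capacity aborts, which a plain hardware retry cannot fix.

```python
# Hypothetical sketch of an adaptive retry policy for hybrid TM.
# Abort causes and retry budgets are illustrative, not from the paper.

CONFLICT, CAPACITY, OTHER = "conflict", "capacity", "other"

class AdaptiveRetryPolicy:
    def __init__(self, max_retries=8):
        self.max_retries = max_retries
        self.history = {}  # transaction id -> list of observed abort causes

    def record_abort(self, txn_id, cause):
        self.history.setdefault(txn_id, []).append(cause)

    def predict_cause(self, txn_id):
        # Predict the most frequent abort cause seen for this transaction.
        causes = self.history.get(txn_id)
        if not causes:
            return None
        return max(set(causes), key=causes.count)

    def retry_budget(self, txn_id):
        cause = self.predict_cause(txn_id)
        if cause == CAPACITY:
            return 0                 # capacity aborts will not succeed on retry
        if cause == CONFLICT:
            return self.max_retries  # contention may resolve; keep retrying
        return self.max_retries // 2  # unknown or mixed: a middle ground
```

A runtime would consult `retry_budget` after a hardware abort and fall back to a software path (e.g. a global lock) once the budget is exhausted.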
As the volume of data stored by big data and cloud services continues to grow, both academia and industry are seeking high-performance storage systems. With recent advances in write-optimized indexes (WOI), WOI-based file systems can now outperform conventional file systems by orders of magnitude on random writes, metadata updates, and small file creation. Based on the B-tree structure,...
Enhancing the performance of turbulent flow simulations is important, as the size of simulations grows with increasing Reynolds number. We discuss the performance of our in-house turbulent flow simulation solver, named DNS-TBL (Direct Numerical Simulation: Turbulent Boundary Layer), on Intel Xeon Phi™ manycore processors. With bootable Knights Landing processors, the DNS-TBL solver shows excellent...
Parallelization on a GPU (graphics processing unit) cluster is an effective approach to reducing the huge computation time of backprojection, which is the most accurate SAR (synthetic aperture radar) imaging algorithm for reconstructing images with no errors caused by the platform motion. To obtain accurate imagery in real-time, we developed a distributed parallel backprojection algorithm for stripmap...
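For readers unfamiliar with the algorithm, time-domain backprojection can be sketched as follows. This is a deliberately simplified, real-valued toy (no phase correction, no interpolation, and certainly not the paper's distributed stripmap GPU implementation); all names and parameters are illustrative.

```python
import math

def backproject(pulses, platform_y, pixels, range_res):
    """Toy time-domain backprojection.

    For every image pixel, accumulate the range-profile sample that each
    pulse recorded at that pixel's pixel-to-platform range. A real SAR BP
    would use complex samples and apply the phase correction
    exp(+j*4*pi*r/lambda); this sketch only sums magnitudes."""
    img = {}
    for px in pixels:
        acc = 0.0
        for profile, py in zip(pulses, platform_y):
            r = math.hypot(px[0], px[1] - py)   # pixel-to-platform range
            b = int(round(r / range_res))       # nearest range bin
            if 0 <= b < len(profile):
                acc += profile[b]
        img[px] = acc
    return img
```

The per-pixel loop is embarrassingly parallel, which is why backprojection maps well onto GPUs and GPU clusters: pixels (or image tiles) can be distributed across devices with pulses streamed to each.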
Social media networks as well as online graph analytics operate on large-scale graphs with millions of vertices, and even billions in some cases. Low-latency access is essential, but caching suffers from the mostly irregular access patterns of these application domains. Hence, distributed in-memory systems have been proposed that keep all data in memory at all times. However, the sheer amount of small data...
Large-scale graph processing is attracting more and more attention and has been widely applied in many application domains. FPGAs are a promising platform for implementing graph processing algorithms with high power efficiency and parallelism. In this paper, we propose OmniGraph, a scalable hardware accelerator for graph processing. OmniGraph can process graphs of different sizes adaptively and is adaptable...
In the US alone, data centers consumed around 200 TWh of electricity (roughly $20 billion) in 2016, and this amount doubles every five years. Data storage alone is estimated to be responsible for about 25% to 35% of data-center power consumption. Servers in data centers generally include multiple HDDs or SSDs, commonly arranged in a RAID level for better performance, reliability, and availability...
Due to the rapidly increasing use of big data, machines are under pressure to provide more computing power at higher energy efficiency while maintaining simpler and more scalable computing paradigms. Transactional Memory (TM) is one such technique: it can be used for synchronization instead of the conventional locks used in critical sections, since it has simpler paradigms, is scalable, and has better energy...
Today, artificial neural networks (ANNs) are widely used in a variety of applications, including speech recognition, face detection, disease diagnosis, etc. Among emerging ANN variants, Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture with complex computational logic. To achieve high accuracy, researchers often build large-scale LSTM networks, which are time-consuming...
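To see why LSTM inference is computationally heavy, note that a single cell step already evaluates four gates per state element. A scalar toy version of the standard LSTM cell equations (the textbook formulation, not this paper's accelerator design; the weight layout is my own convention for illustration) might look like:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h_prev, c_prev, W):
    """One step of a standard LSTM cell, scalar version.

    W maps gate name -> (w_x, w_h, b); the usual formulation replaces
    these scalars with matrices and the products with matrix multiplies."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g      # new cell state
    h = o * math.tanh(c)        # new hidden state
    return h, c
```

In a real network each of the four gates is a dense matrix-vector product over the concatenated input and hidden state, so cost grows quadratically with hidden size, which is what makes large-scale LSTMs a common acceleration target.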
Convolutional neural networks (CNNs) have been widely applied in various applications. However, the computation-intensive convolutional layers and memory-intensive fully connected layers pose many challenges to the implementation of CNNs on embedded platforms. To overcome these problems, this work proposes a power-efficient accelerator for CNNs, and different methods are applied to optimize the...
Platform as a Service (PaaS) clouds abstract large parts of the hardware/software stack from their tenant clients and provide it as a service. In this paper, we highlight the lack of scientific literature on the effects of Garbage Collection (GC) on Service Level Objective (SLO) satisfaction in clouds. To this end, we propose and implement CloudGC, a configurable PaaS application framework...
We discuss the feasibility of an in-house Schrödinger equation solver on the Intel Broadwell Xeon processor with a built-in FPGA, with a particular focus on the performance of large-scale sparse matrix-vector multiplication (SpMV) that is the core numerical operation of electronic structure simulations for multi-million atomic systems. The double-precision SpMV section in our solver is offloaded to...
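As context for the core kernel named above, SpMV for a matrix stored in the common CSR (compressed sparse row) format can be sketched as below. This is the textbook sequential kernel, not the authors' FPGA-offloaded double-precision implementation.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A in CSR form.

    values  - nonzero entries, row by row
    col_idx - column index of each nonzero
    row_ptr - row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]]"""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

The indirect access `x[col_idx[k]]` is what makes SpMV memory-bound and irregular, which is why it is a popular target for offload to accelerators with custom memory pipelines.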
A new page swap protocol is proposed for a user-level remote memory paging system to accelerate out-of-core processing with multi-threaded user programs and libraries written in OpenMP and pthreads. The original swap protocol has a bottleneck in efficient page swapping when swaps are requested by multiple threads in a user program, because all MPI communications to memory servers and page...
The interconnect is a crucial component of any HPC machine, and its performance is one of the contributing factors to the overall performance of an HPC system. The most popular interface for connecting a Network Interface Card (NIC) to a CPU is PCI Express (PCIe). With denser core counts in compute servers and increasingly mature fabric interconnect speeds, there is a need to maximize the packet data movement...
Among high-radix, low-diameter networks, the fat-tree topology is commonly used in HPC and datacenter systems. Resource and job management is critically important to mitigate application interference in order to achieve high system performance and utilization. Preliminary studies have shown the effect of job placement on the performance of parallel scientific applications. In this work we study interference...
The Dragonfly network is widely used in modern high-performance computing systems. On this network, however, interference caused by network sharing can lead to significant congestion and degraded performance. In this work, we present a comparative analysis of intra-application interference for applications with nearest-neighbor communication, considering various placement strategies. Our results...
Applications in computer network security, social media analysis, and other areas rely on analyzing a changing environment. The data is rich in relationships and lends itself to graph analysis. Traditional static graph analysis cannot keep pace with network security applications analyzing nearly one million events per second and social networks like Facebook collecting 500 thousand comments per second...
A high-performing distributed hash is critical for achieving performance in many applications and system software using extreme-scale systems. It is also a central part of many big-data frameworks, including Memcached, file systems, and job schedulers. However, there is a lack of high-performing distributed hash implementations. In this work, we propose, design, and implement SharP Hash, a high-performing,...
As the US Department of Energy (DOE) invests in exascale computing, scalable performance modeling of physics codes on CPUs remains a hard challenge in computational codesign due to advanced design features of processors such as the memory hierarchy, instruction pipelining, and speculative execution. Reuse distance is a powerful (but unscalable) characteristic that helps to predict cache hit-rates...
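As background, reuse distance (also called LRU stack distance) counts the number of distinct addresses touched between consecutive accesses to the same address. A naive stack-based sketch follows; its quadratic cost is exactly the kind of unscalability the abstract refers to, and production tools use tree structures to bring it down to O(N log M).

```python
def reuse_distances(trace):
    """Reuse distance of each access in a memory trace.

    Distance = number of distinct addresses touched since the previous
    access to the same address; infinity on a first access (cold miss)."""
    stack = []   # LRU stack: most recently used address at the end
    out = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            out.append(len(stack) - pos - 1)  # distinct addrs above it
            stack.pop(pos)
        else:
            out.append(float("inf"))
        stack.append(addr)                    # addr becomes most recent
    return out
```

The connection to cache modeling: an access hits in a fully associative LRU cache of size C exactly when its reuse distance is less than C, so the histogram of reuse distances directly predicts hit rates across all cache sizes at once.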
Branch prediction is crucial to improving the throughput of microprocessors. It reduces branching stalls in the pipeline, which helps to maintain the instruction execution flow. Among branch instructions, conditional branches are non-trivial in determining microprocessor performance and throughput. Modern microprocessors accurately predict branches using advanced branch prediction techniques....
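As a concrete baseline for the techniques mentioned above, the classic 2-bit saturating-counter predictor (a textbook scheme, not necessarily the one studied in this paper; table size and indexing are illustrative) can be sketched as:

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters, indexed by program counter.

    Counter states: 0-1 predict not-taken, 2-3 predict taken. Saturation
    gives hysteresis: a single mispredicted branch does not immediately
    flip a strongly established prediction."""
    def __init__(self, table_size=1024):
        self.table = [1] * table_size  # start weakly not-taken
        self.size = table_size

    def predict(self, pc):
        return self.table[pc % self.size] >= 2

    def update(self, pc, taken):
        i = pc % self.size
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

Modern predictors layer history correlation (e.g. two-level and TAGE-style schemes) on top of this idea, but the saturating counter remains the basic building block.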