Taking advantage of the computing capabilities offered by modern parallel and distributed architectures is fundamental to running large-scale simulation models based on the Parallel Discrete Event Simulation (PDES) paradigm. By relying on this computing organization, it is possible to effectively overcome both the power wall and the memory wall, which are the core factors limiting high-performance simulations...
The use of multicore clusters is one of the strategies used to achieve energy-efficient multicore architecture designs. Even though chips have multiple cores in these designs, cache constraints such as size, latency, concurrency, and scalability still apply. Multicore clusters must therefore implement alternative solutions to the shared cache access problem. Bigger or more frequently accessed caches...
Adaptive dynamic programming (ADP) is a prevalent way to solve the coupled Hamilton-Jacobi-Bellman (HJB) equations of the optimal consensus control for multi-agent systems (MAS). Neural networks (NNs) are normally used to approximate the value functions in ADP. However, NNs with manually designed features may influence the approximation ability. In this study, kernel-based methods which do not need...
Dense Wi-Fi and Bluetooth (BT) environments are becoming increasingly common, so the coexistence issue between Wi-Fi and BT is imperative to solve. In this paper, we propose BlueCoDE, a coordination scheme for multiple neighboring BT piconets, to make them collision-free and less harmful to Wi-Fi. BlueCoDE reuses BT's existing PHY and MAC design, thus making it practically feasible. We implement a prototype...
Fault-tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures, and consequently, higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers, the de-facto standard for large-scale systems. Although...
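The checkpoint/restart technique named above can be sketched in a few lines: periodically persist the computation state, and on startup resume from the most recent checkpoint rather than from scratch. This is a minimal, hypothetical Python sketch; the file name, checkpoint interval, and toy workload are illustrative assumptions, not taken from the paper.

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "sim_state.ckpt")

def run(steps, fail_at=None):
    """Toy iterative 'simulation' with checkpoint/restart."""
    # Restart path: resume from the last checkpoint, if one exists.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            step, total = pickle.load(f)
    else:
        step, total = 0, 0
    while step < steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        total += step          # the 'computation'
        step += 1
        if step % 10 == 0:     # periodic checkpoint every 10 steps
            with open(CKPT + ".tmp", "wb") as f:
                pickle.dump((step, total), f)
            os.replace(CKPT + ".tmp", CKPT)  # atomic publish
    return total
```

A failed run (e.g. `run(100, fail_at=57)`) loses only the work since the last checkpoint; calling `run(100)` again resumes from step 50 instead of step 0. The write-to-temp-then-rename pattern keeps the checkpoint file consistent even if the process dies mid-write.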
As hardware vendors provision more cores and faster storage devices, attaining fast data durability for concurrent file writes becomes a pressing demand on high-performance storage systems in clusters. We approach the challenge by proposing a system that uses a small amount of fast persistent memory for buffering concurrent file writes while preserving data durability. The main issue in designing a durable...
Keeping a high-precision time base in cloud clusters remains a significant challenge, even when using the Precision Time Protocol version 2 (PTPv2) specified in IEEE 1588. One of the main factors is that there are too many uncertainties in the network path from the master clock to the slave clock, which is likely to reside on a Kernel-based Virtual Machine (KVM). The Transparent Clock (TC) of PTPv2 may be...
While GPUs are becoming common in HPC systems, the CPU is still responsible for managing both GPU-side and CPU-side compute, communication, and synchronization operations. For instance, if a result from a GPU-side computation is to be transferred to a remote destination, then the CPU must synchronize on GPU compute completion before issuing a communication operation. Both CPU cycles and energy are consumed...
The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected separate entities. At best each executes independent tasks, but, more commonly, the CPU idles while waiting for results from the GPU. No data-sharing and communications are allowed during kernel execution. This model limits the number of applications that can harness the...
The contribution focuses on the technical aspects related to the focusing and interferometric processing of bistatic data acquired by companion satellite (CS) SAR missions. In particular, the processing aspects related to the large along-track baseline configuration are addressed, since the processing needs to properly consider a potentially high squint angle. The technical challenges encompass synchronization,...
Shared memory and message passing are traditional parallel programming models used on multiprocessor system-on-chip environments. These models are traditionally meant for static scenarios where all communicating entities and their intercommunication patterns are known a priori by the software engineer. System design following such programming models has become complex due to the dynamic behavior...
NVIDIA GPUDirect is a family of technologies aimed at optimizing data movement among GPUs (P2P) or between GPUs and third-party devices (RDMA). GPUDirect Async, introduced in CUDA 8.0, is a new addition which allows direct synchronization between GPUs and third-party devices. For example, Async allows an NVIDIA GPU to directly trigger and poll for completion of communication operations queued to an InfiniBand...
Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provide two methods for inter-thread communication: shared memory or global memory. However, a new warp shuffle instruction has been introduced...
It is well known that TLB performance impacts memory system performance, which is critical for overall system performance. Similar to multi-level caches, multi-level TLBs have become an important lever for boosting data access performance. Applications have increasingly large working sets. Servers targeting such applications have thus been built with ever larger main memory capacities, but...
Ontology Based Information Extraction (OBIE) is being adopted in various domains in order to improve a system's precision and recall. Although the use of multiple ontologies in different semantics-based Information Extraction systems helps to improve extraction accuracy, the performance of the system degrades significantly. This paper proposes an autonomous decentralized kernel cache architecture...
Classification of human behavior is a key step to developing closed-loop Deep Brain Stimulation (DBS) systems, which may decrease the power consumption and side effects of the existing systems. Recent studies have shown that the Local Field Potential (LFP) signals from both Subthalamic Nuclei (STN) of the brain can be used to recognize human behavior. Since the DBS leads implanted in each STN can...
As computer systems increase in size and complexity, bugs become ever subtler and more difficult to detect and diagnose. A bug could exist at different layers of computer systems (e.g., applications, shared libraries, file systems, device firmware), or could be caused by the incompatibility among layers. In many cases, bugs would require a very specific combination of events to be triggered and are...
Container-based virtualization is rapidly growing in popularity as a virtualization alternative for cloud deployments and applications, due to its ease of deployment coupled with high performance. Emerging byte-addressable non-volatile memory technologies, commonly called Storage Class Memory (SCM), promise both byte-addressability and persistence at near-DRAM speeds while operating on the main memory...
The preconditioned conjugate gradient method (PCG) is a popular method for solving linear systems at scale. PCG requires frequent blocking allreduce collective operations that can limit performance at scale. We investigate PCG variations designed to reduce communication costs by decreasing the number of allreduces and by overlapping communication with computation using a non-blocking allreduce. These...
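One classic way to decrease the number of allreduces per CG iteration, in the spirit of the variants described above, is a single-reduction recurrence (often attributed to Chronopoulos and Gear): the two dot products of each iteration are computed back to back, so a distributed implementation can fuse them into one (possibly non-blocking) allreduce and overlap it with the matrix-vector product. Below is a minimal serial NumPy sketch with the identity preconditioner; it illustrates the generic single-reduction variant, not necessarily the exact formulation studied in the paper.

```python
import numpy as np

def cg_single_reduction(A, b, tol=1e-10, max_iter=200):
    """CG rearranged so both per-iteration dot products are adjacent.

    In a parallel code, gamma_new and delta would be local partial sums
    combined by a single (non-blocking) allreduce, instead of the two
    blocking allreduces of textbook CG.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    u = r.copy()            # u = M^{-1} r, with M = I in this sketch
    w = A @ u
    gamma = r @ u
    delta = w @ u
    alpha = gamma / delta
    beta = 0.0
    p = np.zeros_like(b)
    s = np.zeros_like(b)
    for _ in range(max_iter):
        p = u + beta * p
        s = w + beta * s
        x = x + alpha * p
        r = r - alpha * s
        if np.linalg.norm(r) < tol:
            break
        u = r.copy()        # apply preconditioner
        w = A @ u           # matvec; could overlap the fused allreduce
        gamma_new = r @ u   # two dot products, back to back ...
        delta = w @ u       # ... one fused reduction in parallel code
        beta = gamma_new / gamma
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        gamma = gamma_new
    return x
```

On a small SPD system such as A = [[4, 1], [1, 3]], b = [1, 2], this converges to the same solution as textbook CG while exposing only one reduction point per iteration.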
PGAS models, with their lightweight synchronization and shared-memory abstraction, are seen as a good alternative to the Message Passing model for irregular communication patterns. OpenSHMEM is a library-based PGAS model. OpenSHMEM 1.3 introduced non-blocking data movement operations to provide better asynchronous progress and overlap. In this paper, we present our experiences in designing non-blocking...