The well-known memory wall problem has motivated extensive research in cache design. Last-level caches, whose misses can stall the processors for hundreds of cycles, have received particular attention. Strategies that adaptively modify the cache insertion, promotion, eviction and even placement policies have been proposed, with different techniques being better at reducing different kinds of misses. For example...
Wireless clients must associate with a specific Access Point (AP) to communicate over the Internet. Current association methods are based on the maximum Received Signal Strength Indicator (RSSI), meaning that a client associates with the strongest AP around it. This is a simple scheme that has performed well in purely distributed settings. Modern wireless networks, however, are increasingly being connected by...
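The maximum-RSSI rule described above can be sketched in a few lines. This is only an illustration of the baseline scheme, not the method the paper goes on to propose (the abstract is truncated); the function name, the AP names and the dBm readings are hypothetical.

```python
def pick_access_point(rssi_by_ap):
    """Return the name of the AP with the highest RSSI (in dBm).

    rssi_by_ap is a hypothetical mapping from AP name to measured RSSI.
    Returns None when no AP is visible.
    """
    if not rssi_by_ap:
        return None
    # RSSI values are negative dBm; the largest value is the strongest signal.
    return max(rssi_by_ap, key=rssi_by_ap.get)


# Hypothetical scan results.
scan = {"ap-lobby": -71, "ap-office": -48, "ap-lab": -63}
print(pick_access_point(scan))  # prints "ap-office"
```

Each client decides using only its own measurements, which is why the rule needs no coordination between APs.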
We develop a radix sort algorithm, GRS, suitable to sort multifield records on a graphics processing unit (GPU). We assume the ByField layout for records to be sorted. GRS is benchmarked against the radix sort algorithm, SDK, in NVIDIA's CUDA SDK 3.0 as well as the radix sort algorithm, SRTS, of Merrill and Grimshaw. Although SRTS is faster than both GRS and SDK when sorting numbers as well as records...
Continuing improvements in the scale of many-core platforms are accompanied by increased asymmetry in their memory architectures. Such NUMA architectures, however, require systems software that understands this asymmetry to attain high levels of performance, leading to significant work in optimizing operating systems like Linux and Windows to increase locality of access to memory nodes and to consider...
Network contention has a significantly adverse effect on the performance of parallel applications with increasing size of parallel machines. Machines of the petascale era are forcing application developers to map tasks intelligently to job partitions to achieve the best performance possible. This paper presents a framework for automated mapping of parallel applications with regular communication graphs...
CMOS scaling exacerbates hardware errors, making reliability a major concern for recent and future microarchitecture designs. Mechanisms to provide fault tolerance in architectures must meet several objectives, such as low performance degradation, power consumption and area overhead. Several approaches have already been proposed to provide fault tolerance for parallel codes. However, these proposals...
GPU hardware and software have been evolving rapidly. CUDA versions 1.1 and higher started supporting atomic operations on device memory, and CUDA versions 1.2 and higher started supporting atomic operations on shared memory. This paper focuses on parallelizing applications involving reductions on GPUs. Prior to the availability of support for locking, these applications could only be parallelized...
This paper presents a simple but powerful memory-aware scheduling mechanism that adaptively schedules tasks in a message-driven distributed-memory parallel program. The scheduler adapts its behavior whenever memory usage exceeds a threshold by scheduling tasks known to reduce memory usage. The usefulness of the scheduler and its low overhead are demonstrated in the context of an LU matrix factorization...
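The threshold rule in this abstract can be illustrated with a toy, single-process sketch. It is an assumption-laden illustration, not the paper's scheduler: the task tuples, the `mem_delta` field and the FIFO default policy are all hypothetical.

```python
def run_memory_aware(tasks, threshold):
    """Run tasks in FIFO order, except when memory usage exceeds threshold.

    tasks is a list of (name, mem_delta) pairs, where mem_delta is the
    hypothetical change in memory usage caused by running the task.
    Above the threshold, the first queued task that frees memory
    (mem_delta < 0) is run ahead of the others.
    Returns the order in which the task names were executed.
    """
    queue = list(tasks)
    usage = 0
    order = []
    while queue:
        pick = 0  # default: oldest task
        if usage > threshold:
            for i, (_, delta) in enumerate(queue):
                if delta < 0:
                    pick = i
                    break
        name, delta = queue.pop(pick)
        usage += delta
        order.append(name)
    return order


# Hypothetical workload: three allocations followed by three releases.
tasks = [("alloc_a", 4), ("alloc_b", 4), ("alloc_c", 4),
         ("free_a", -4), ("free_b", -4), ("free_c", -4)]
print(run_memory_aware(tasks, threshold=6))
```

With a threshold of 6, the sketch interleaves releases with allocations so that usage never exceeds 8, whereas plain FIFO order would peak at 12.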
Shared data centers and clouds are gaining popularity because of their ability to reduce costs by increasing the utilization of server farms. In a shared server environment, a careful assignment of workload streams (all work-requests from a customer may constitute a stream) to servers is necessary to ensure good “end user” performance. In this work, we investigate the assignment of streams to servers...
System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Due to interactions among the system hardware and software components, the system event logs for large cluster systems consist of streams of interleaved events, and only a small fraction of the events over a small time span are relevant to the diagnosis of...
With the growing costs of powering data centers, power management is gaining importance. Server consolidation in data centers, enabled by virtualization technologies, is becoming a popular option for organizations to reduce costs and improve manageability. While consolidation offers these benefits, it is important to ensure proper resource provisioning so that performance is not compromised. In addition...
The demand for larger and more powerful high-performance shared-memory servers has been growing over the last few years. To meet this need, AMD has recently launched the twelve-core Magny-Cours processors. They include a directory cache (Probe Filter) that increases the scalability of the coherence protocol applied by Opterons, based on the coherent HyperTransport interconnect (cHT). cHT limits up to 8 the...
In Wireless Sensor and Actor Networks (WSANs), effective Actor-Actor Communication (AAC) is an important requirement for timely responses to events reported by the sensors. However, due to the scattered nature of events, the mobility of actor nodes, and the low density of actor nodes, the network of actor nodes tends to get partitioned frequently. To provide effective AAC in such situations, the energy-constrained...
On-chip networks have rapidly emerged as the best interconnection choice for high-core-count chip multiprocessors (CMPs) because of their good scalability properties. Their fast evolution has been accelerated by the large inheritance from the off-chip network domain. Many of the mechanisms and techniques previously developed in that area have been directly applied to the on-chip domain due...
We consider estimating the location of a target moving in a 2D plane by combining distance measurements from multiple sensors. Given that available energy in sensors is at a premium, it is desirable to conserve energy by selecting fewer sensors to measure distance and communicate with the central tracker. We propose heuristics on the basis of which a handful of sensors may be selected...
The patterns of movement in Mobile Ad-Hoc networks are application specific, in the sense that networks use nodes which travel along different paths. When these nodes are used in experiments involving social patterns, such as wildlife tracking, algorithms which detect and use these patterns can be used to improve routing efficiency. The intent of this paper is to introduce a routing algorithm which...
General purpose computing using GPUs is becoming increasingly popular because of GPUs' extremely favorable performance/price ratio. Besides application development using CUDA, automatic code generation for GPUs is also receiving attention. Like standard processors, GPUs also have a memory hierarchy, which must be carefully optimized for in order to achieve efficient execution. Specifically, modern...
Graphics processing units provide large computational power at a very low price, which positions them as a ubiquitous accelerator. Efficient primitives that can expand the range of operations performed on the GPU are thus important. Discrete Range Searching (DRS) is one such primitive with direct applications to string processing, document and text retrieval systems, and least common ancestor queries...
Cloud computing is an elastic computing model whereby users can lease computing and storage resources on demand from a remote infrastructure. Cloud computing is gaining popularity due to its low cost, high reliability and wide availability. However, a serious impediment to its wider deployment is the relative lack of effective data management services. The relatively slow access to persistent data...
Parallel Sparse Matrix Vector Multiplication (PSpMV) is a compute-intensive kernel used in iterative solvers like Conjugate Gradient, GMRES and Lanczos. Numerous attempts at optimizing this function have been made that require fine tuning of many hardware and software parameters to achieve optimal performance. We attempt to offer a simple framework that involves (i) Employing a greedy algorithm to...