Exact values of spatial and temporal consumption are needed when judging the space and time complexity of an algorithm, yet few researchers have paid attention to whether their measurement methods were valid. In this paper, we discuss some key concepts involved in monitoring a process's spatial and temporal consumption, and then explain and distinguish those concepts. Further,...
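The abstract contrasts measured spatial and temporal consumption with asymptotic complexity. As a minimal sketch of the measurement concepts (not the paper's method), Python's standard library can capture both for a single call; note that `tracemalloc` tracks interpreter heap allocations only, not full process RSS:

```python
import time
import tracemalloc

def measure(func, *args):
    """Return (result, elapsed_seconds, peak_bytes) for one call.

    Temporal consumption: wall-clock time via a monotonic clock.
    Spatial consumption: peak Python heap usage via tracemalloc
    (interpreter allocations only, not the whole process).
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Example: an O(n)-time, O(n)-space workload.
result, elapsed, peak = measure(lambda n: list(range(n)), 100_000)
print(len(result), elapsed > 0, peak > 0)
```

Comparing `peak` across growing inputs is what lets a measured curve be checked against a claimed complexity class.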
We propose a novel framework that improves locality-aware parallel programming models by defining a hierarchical data locality model extension. We also propose a hierarchical thread partitioning algorithm that synthesizes hierarchical thread placement layouts aimed at minimizing the program's overall communication costs. We demonstrate the effectiveness of our approach using...
We present a many-core full system simulation platform and its OpenCL runtime system. The OpenCL runtime system includes an on-the-fly compiler and resource manager for the ARM-based many-core platform. Using this platform, we evaluate approaches to work-item scheduling and memory management in the OpenCL memory hierarchy. Our experimental results show that scheduling work-items on a many-core system...
Multi-core processors are gaining a foothold in the domain of embedded automotive systems. The AUTOSAR Release 4.1 establishes a common standard for the use of multi-core processors in automotive systems. While interfaces and functionalities are well defined in the specification, the actual implementation is left open to the software manufacturers. We exploit the room left open by the specification...
Kernel minimization has already been established as a practical approach to reducing the trusted computing base. Existing solutions have largely focused on whole-system profiling - generating a globally minimal kernel image shared by all applications. However, since different applications use only part of the kernel's code base, the minimized kernel still includes an unnecessarily large...
In this paper we discuss how an efficient online model checker and a small-footprint RTOS can be integrated. Alternative approaches are discussed, leading to the choice of a federated approach. An implemented prototype is described, and some analytical as well as experimental evaluations are presented.
Programming languages have long incorporated type safety, increasing their level of abstraction and thus aiding programmers. Type safety eliminates whole classes of security-sensitive bugs, replacing the tedious and error-prone search for such bugs in each application with verifying the correctness of the type system. Despite their benefits, these protections often end at the process boundary, that...
Unified tracing is the process of collecting trace logs across the boundary of kernel and user spaces; it has been used to understand the in-depth correspondence between low-level events and application program context when diagnosing system failures and performance problems. Crossing the boundary from kernel space to user space to collect trace events from both spaces imposes challenges compared...
The increasing use of runtime-compiled applications provides an opportunity for coarse-grained reconfigurable architecture (CGRA) accelerators to be used in a user-transparent way. The challenge is to provide efficient runtime translation for CGRAs. Despite the apparent difficulties stemming from the heterogeneous nature of CGRAs, this paper demonstrates that it is possible to speed up runtime-compiled...
Utilizing accelerators in heterogeneous systems is an established approach for designing peta-scale applications. Today, CUDA offers a rich programming interface for GPU accelerators but requires developers to incorporate several layers of parallelism on both CPU and GPU. From this increasing program complexity emerges the need for sophisticated performance tools. This work contributes by analyzing...
Manycore architectures enable massively parallel computing on accelerators. The GPU in particular is a mainstay of high performance computing and is employed by the world's top supercomputers. Programming such accelerators includes developing a control program, which the accelerator executes to schedule the invocation timing of the accelerator's kernel program...
Message Passing Interface (MPI) has been the de facto programming model for scientific parallel applications. However, data-driven applications with irregular communication patterns are harder to implement using MPI. The Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. PGAS languages like UPC are growing in popularity because of...
GPUs are becoming pervasive in scientific computing. Originally serving as peripheral accelerators, they are gradually turning into central computing nodes. However, most current directive-based approaches for parallelizing sequential legacy code, such as OpenACC and HMPP, simply off-load "hot" CPU code onto GPUs, entailing limitations such as unsupported external calls and coarse-grained...
Embedded systems require ever more flexibility. Several systems permit on-the-market software updates. However, these updates must be reliable; otherwise, the results can be catastrophic. Device drivers are frequently updated and are very vulnerable to this problem, requiring mechanisms able to capture errors arising from updates at runtime. This work proposes an approach for runtime errors...
To circumvent the limitations of the hardware scheduler on GPUs, we create an SM-centric transformation technique. This technique enables complete control of the mapping between tasks and streaming multi-processors (SMs), and enables controlling the number of active thread blocks on each SM. Results show that our approach achieves better speedups than previous approaches in kernel co-run cases.
Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels concurrently. On these GPUs, the thread block scheduler (TBS) currently uses the FIFO policy to schedule thread blocks of concurrent kernels. We show that the FIFO policy leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose use of the preemptive...
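Why FIFO "leaves performance to chance" can be seen with a toy simulator (a hypothetical sketch, not NVIDIA's thread block scheduler): blocks of whichever kernel launched first occupy the SMs, so a short kernel launched second waits behind a long one:

```python
from collections import deque

def fifo_schedule(kernels, num_sms):
    """Toy FIFO thread-block scheduler.

    `kernels` maps a kernel name to its number of thread blocks,
    in launch order. Blocks are dispatched to SMs strictly in that
    order, one wave of up to `num_sms` blocks at a time.
    Returns the dispatch order as (kernel, block_index) pairs.
    """
    queue = deque(
        (name, b) for name, blocks in kernels.items() for b in range(blocks)
    )
    order = []
    while queue:
        for _ in range(min(num_sms, len(queue))):
            order.append(queue.popleft())
    return order

# Kernel A (launched first, 6 blocks) vs. short kernel B (2 blocks), 2 SMs:
order = fifo_schedule({"A": 6, "B": 2}, num_sms=2)
first_b = next(i for i, (k, _) in enumerate(order) if k == "B")
print(first_b)  # → 6: B's first block waits behind all six of A's blocks
```

Under FIFO, B's turnaround time depends entirely on how much of A happened to be queued ahead of it, which is the fairness loss the abstract targets.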
Traditional means of gathering performance data are tracing, which is limited by the available storage, and profiling, which has limited accuracy. Performance modeling is often used to interpret the tracing data and generate performance predictions. We aim to complement the traditional data collection mechanisms with online performance modeling, a method that generates performance models while the...
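As a minimal illustration of the idea (not the paper's modeling method), an online model can be refit from streaming measurements and then queried instead of storing a full trace; here a simple least-squares fit of runtime against input size:

```python
def fit_linear(samples):
    """Least-squares fit t ~ a*n + b from (n, t) measurements.

    A toy stand-in for online performance modeling: as new
    measurements stream in, refit the model and use it to predict
    runtimes at input sizes that were never traced.
    """
    m = len(samples)
    sn = sum(n for n, _ in samples)
    st = sum(t for _, t in samples)
    snn = sum(n * n for n, _ in samples)
    snt = sum(n * t for n, t in samples)
    a = (m * snt - sn * st) / (m * snn - sn * sn)
    b = (st - a * sn) / m
    return a, b

# Synthetic measurements from a perfectly linear kernel: t = 2n + 5.
samples = [(n, 2 * n + 5) for n in (10, 20, 40, 80)]
a, b = fit_linear(samples)
predicted = a * 1000 + b  # extrapolate beyond the traced sizes
print(round(predicted))  # → 2005
```

The trade-off the abstract describes is visible here: the model needs only a handful of running sums (constant storage, unlike tracing) while still answering questions at arbitrary input sizes (unlike a fixed profile).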
We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to...
We present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run faster for a given kernel. Usage cases...
Spatial data mining techniques enable knowledge extraction from spatial databases. However, the high computational cost and the complexity of the algorithms are some of the main problems in this area. This work proposes a new algorithm, referred to as VDBSCAN+, which is derived from the VDBSCAN algorithm (Varied Density Based Spatial Clustering of Applications with Noise) and focuses on the use of parallelism...