In this paper, we present an efficient vector graphics rendering algorithm that is suitable for low-end devices. To deliver high-performance vector graphics on low-end devices, our algorithm must satisfy two requirements: i) providing a parallel rendering scheme, and ii) removing redundant computations. To do so, we propose BSP tree-based vector graphics rendering, which provides a good solution in such situation...
In this paper, we present stereoscopic rendering based on a mobile ray tracing GPU. Adapting an existing algorithm to new mobile GPUs specialized for ray tracing enables two high-performance techniques: reprojection and tile-based rendering. Experimental results show that our implementation can be a versatile solution for future virtual reality applications, as it achieves up to 1.64 times...
In this paper, we propose an efficient ray scheduling algorithm and a non-blocking cache architecture to hide main-memory access latency, targeting real-time ray tracing on mobile devices. We first analyze the impact of memory latency by examining the memory access patterns of a ray tracing system and present an energy-efficient data transmission method using a dedicated interface between the processor...
Using independent voltage (and frequency) domains for cores and caches allows us to achieve high energy efficiency since it enables operating the cores and caches at their own optimal voltages. However, it incurs a clock synchronization problem between the core and cache voltage domains. One of the conventional solutions is to add asynchronous FIFOs on the domain crossing boundary, but it degrades...
Thread or warp scheduling in GPGPUs has been shown to have a significant impact on overall performance. Recently proposed warp schedulers have been based on a greedy warp scheduler where some warps are prioritized over other warps. However, a single warp scheduling policy does not necessarily provide good performance across all types of workloads; in particular, we show that greedy warp schedulers...
Vector graphics is a key technology for drawing 2D graphics images on mobile devices. As screen resolutions increase and multi-touch interfaces become widely used in mobile devices, an efficient solution for vector graphics becomes more important. For future mobile environments, we should process vector graphics with high performance and low power. In this paper, we propose an efficient anti-aliasing...
A multi-core system has advantages in energy efficiency and performance over a single-core system. However, designing and developing a system with multiple processors is very difficult, and in particular, verifying a system with concurrency may be difficult. This makes parallel system design hard, so developers must spend substantial amounts of time on design and debugging...
In this paper, we present an efficient resolution-independent path rendering algorithm. To do so, we propose a tile binning and rendering algorithm for resolution-independent path rendering that is suitable for mobile devices. Experimental comparisons show that our scheme reduces not only most of the memory I/O overhead but also 50% of the computation overhead. As a result, most mobile phones can...
In this paper, we propose a compute-intensive path rendering scheme. Because legacy path rendering schemes are memory I/O bound, they are not suitable for high-resolution displays. To address this, we propose using a winding number generator, which generates per-pixel winding numbers in parallel. When we use the winding number generator, computing latency (cycles) for path rendering is reduced to...
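As a rough illustration of what a per-pixel winding number is, the sketch below applies the textbook crossing test to a closed polygon. It is not the generator described in the abstract: the function names `winding_number` and `coverage_mask` are hypothetical, and the path is assumed to be already flattened into straight edges. Each pixel is evaluated independently of the others, which is the property that makes this kind of computation easy to parallelize.

```python
def winding_number(px, py, polygon):
    """Return how many times the closed polygon winds around point (px, py).

    polygon is a list of (x, y) vertices; the last vertex connects back to
    the first. Counter-clockwise turns count +1, clockwise turns count -1.
    """
    wn = 0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Sign of the cross product tells which side of the edge the point is on.
        side = (x1 - x0) * (py - y0) - (px - x0) * (y1 - y0)
        if y0 <= py < y1 and side > 0:
            wn += 1  # upward edge with the point on its left
        elif y1 <= py < y0 and side < 0:
            wn -= 1  # downward edge with the point on its right
    return wn


def coverage_mask(width, height, polygon):
    """Evaluate the non-zero fill rule at every pixel center.

    Each pixel depends only on its own coordinates, so the loop body could
    be distributed across parallel workers without synchronization.
    """
    return [
        [winding_number(x + 0.5, y + 0.5, polygon) != 0 for x in range(width)]
        for y in range(height)
    ]


square = [(0, 0), (4, 0), (4, 4), (0, 4)]  # counter-clockwise
inside = winding_number(2, 2, square)
outside = winding_number(5, 2, square)
```

Under the non-zero fill rule a pixel is covered when its winding number differs from zero; the even-odd rule would test the parity of the number instead.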
In a coarse-grained reconfigurable array (CGRA) architecture, software pipelining is primarily used to improve performance by exploiting loop-level parallelism (LLP). In this technique, loop-carried memory dependence in user code prevents high parallelism and is difficult to detect. In this paper, we propose a simulation-based memory dependence checker, which is used in the verification...
In this paper, we focus on the impact of a memory bandwidth limitation by analyzing the bandwidth consumption for a ray tracing system and present an energy efficient data transmission method using a dedicated interface between the processor and ray tracing hardware engine. To achieve real-time ray tracing, we propose a full-stream architecture through the use of this dedicated interface. For an evaluation...
With significant growth in portable multimedia devices such as smartphones, application processors (APs) play a critical role in running various multimedia applications on these devices. Considering the power constraints of such devices, we often integrate reconfigurable processors (RPs) into APs. This is because RPs offer flexibility and good performance, thereby greatly improving the power efficiency...
As system complexity increases, simulation performance becomes one of the most important issues in virtual prototyping. Parallel simulation is an attractive technique for high-speed simulation that utilizes state-of-the-art multi-core processors on a host workstation, but the efficiency of parallel simulation is low because of synchronization and communication overhead and unbalanced workloads...
High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. Thousands of threads are assigned to each core in units of cooperative thread arrays (CTAs), or thread blocks, with each thread block consisting of multiple warps or wavefronts. The scheduling of the threads can have a significant impact on overall performance. In this work, we explore...
High performance for a GPGPU workload is obtained by maximizing parallelism and fully utilizing the available resources. However, this is not necessarily energy efficient, especially for memory-intensive GPGPU workloads. In this work, we propose Throttle CTA (cooperative thread array) Scheduling (TCS), where we leverage two types of throttling: throttling the number of active cores and throttling...
In this paper, we present an efficient curve rasterization method that effectively reduces duplicated computations in tile-based rendering. To do so, we propose the Tile Boundary Sharing (TBS) method, which shares boundary information between neighboring tiles. When we use TBS, computing cycles for tile-based vector graphics are reduced to 21∼34%.
Low-power processing of multimedia data is essential for recent mobile devices. In this paper, we present a coarse-grained reconfigurable processor for low-power audio processing. By utilizing a perfect instruction cache and a tightly coupled scratchpad memory, we can eliminate all the bandwidth consumption caused by external memory access while decoding compressed audio data. Acceleration of audio...
This paper presents an implementation of the seeded region growing (SRG) algorithm on a multi-core system that meets real-time constraints with high precision. The proposed implementation inherently provides dynamic load balancing and shows a speedup of 13.3 times with 16 cores.
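For readers unfamiliar with seeded region growing, the following is a minimal sequential sketch of the general idea, not the multi-core implementation described in the abstract. It is a simplified variant: each region grows breadth-first from its seed and accepts a 4-connected neighbor when the neighbor's intensity is within a fixed threshold of the seed's intensity. The function name `seeded_region_growing`, the threshold criterion, and the sample image are all assumptions made for illustration.

```python
from collections import deque


def seeded_region_growing(image, seeds, threshold):
    """Label pixels by growing regions outward from seed pixels.

    image: 2D list of intensities; seeds: list of (row, col) positions;
    threshold: maximum allowed absolute difference from the seed intensity.
    Returns a 2D list of labels (0 = unassigned, k = region of the k-th seed).
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    queue = deque()

    # Each seed starts its own region and remembers its reference intensity.
    for label, (r, c) in enumerate(seeds, start=1):
        labels[r][c] = label
        queue.append((r, c, image[r][c]))

    while queue:
        r, c, seed_value = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == 0:
                # Accept the neighbor only if it is similar enough to the seed.
                if abs(image[nr][nc] - seed_value) <= threshold:
                    labels[nr][nc] = labels[r][c]
                    queue.append((nr, nc, seed_value))
    return labels


image = [
    [10, 11, 50],
    [10, 12, 52],
    [11, 49, 51],
]
labels = seeded_region_growing(image, seeds=[(0, 0), (0, 2)], threshold=5)
# labels == [[1, 1, 2], [1, 1, 2], [1, 2, 2]]
```

The work queue of frontier pixels is the natural unit to distribute across cores in a parallel version, although how the paper's implementation balances load is not stated in the abstract.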
The Samsung reconfigurable processor (SRP) is developed to accelerate multimedia applications such as video decoding, audio decoding, and image processing. Owing to coarse-grained reconfigurable array (CGRA) acceleration via software (SW) pipelining and application-specific intrinsic instructions, SRP outperforms other digital signal processors (DSPs) in these application domains. In addition, recent...
Coarse-grained reconfigurable architecture (CGRA) achieves high performance by exploiting instruction-level parallelism through software pipelining. A large instruction memory is, however, a critical problem for CGRA, as it requires large silicon area and power consumption. Code compression is a promising technique to reduce the memory area, bandwidth requirements, and power consumption. We present an adaptive...