Variable block size motion estimation (VBSME) is a video coding technique that reduces video distortion, provides more accurate predictions, reduces video coding data, and increases the utilization of network bandwidth. This paper presents a high-performance VLSI architecture for VBSME which can be applied to the full search block matching algorithm. Our proposed architecture uses pipelined designs...
The ongoing move of hardware platforms to many-core processors challenges the traditional software design methodology. It is critical to develop new programming paradigms and efficient ways to port legacy applications. This paper analyzes a typical packet processing application, together with the cache hierarchy and behavior of the Raw architecture many-core processor. It presents an easy-to-implement run-time...
This paper presents the standard-cell synthesis and comparison of parallel hardware architectures for the Sum of Absolute Differences (SAD) datapath, focusing on different design points such as high performance (maximum throughput) and the tradeoff between high performance and low power dissipation (isoperformance target). Clock gating and different combinations of parallelism and pipeline architectural...
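For readers unfamiliar with the metric, the following is a minimal software sketch of a Sum of Absolute Differences computation between two equally sized pixel blocks. It only illustrates the arithmetic that such a datapath evaluates and is not the hardware architecture described in the abstract; the function name `sad` and the list-of-rows block representation are assumptions made for illustration.

```python
def sad(block_a, block_b):
    """Return the sum of absolute differences of two equally sized 2-D blocks.

    Blocks are given as lists of rows of integer pixel values
    (a hypothetical representation chosen for this sketch).
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        # Accumulate |a - b| for every pair of co-located pixels.
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
    return total


if __name__ == "__main__":
    current = [[1, 2], [3, 4]]
    candidate = [[4, 2], [1, 1]]
    print(sad(current, candidate))
```

In block-matching motion estimation, a lower SAD means the candidate block is a closer match to the current block. The per-pixel differences are independent of one another, which is what makes the computation a natural fit for parallel and pipelined hardware.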
This paper presents empirical results of the finite difference time domain algorithm implemented on NVIDIA's CUDA architecture in order to achieve significant decreases in electromagnetic field simulation time. The proposed strategy is demonstrated with respect to biomedical applications, where 10 different anatomically realistic body phantoms are simulated and the speed increases are quantified.
This paper presents the outer-round only pipelined architecture for an FPGA implementation of the AES-128 encryption processor. The proposed design uses Block RAM to store the S-box values and exploits two kinds of Block RAM. By combining the operations in a single round, we can reduce the critical delay. Therefore, our design can achieve a throughput of 34.7 Gbps at 271.15 MHz and 2389 CLB Slices...
In this paper, we present a JPEG2000 EBCOT tier-1 encoder based on a hybrid dual-core processor composed of a coarse-grained Dynamically Reconfigurable Processor (DRP) and an ARM core, targeting the next generation of cameras. The complete EBCOT tier-1 encoder is partitioned into two tasks, which are mapped onto the two cores according to the different potentials of the two processors. A Partial Parallel...
A multi-parallel architecture for MD5 (Message-Digest Algorithm 5) implemented on an FPGA (Field-Programmable Gate Array) is presented in this paper. To increase speed, a general architecture for a Host Computer and FPGAs is proposed. The MD5 implementation is presented. Besides the internal parallelization of MD5 modules, FPGAs can be easily duplicated and connected to an Ethernet LAN. The design was...
This paper presents a parallel and scalable solution for adaptive deblocking filtering in H.264/AVC. While traditionally in deblocking filtering, the edges in a macroblock are processed in a sequential order, this paper demonstrates how algorithm modifications can be used to enable processing multiple consecutive edges at the same time. The proposed method increases the throughput in proportion to...
This paper describes a Field Programmable Gate Array hardware based Deep Packet Inspection Engine that uses regular expression matchers to simultaneously categorize and look for malicious signatures in Ethernet packets. This was a submission to the 2010 MEMOCODE Design Contest. It is the fastest Xilinx FPGA based design with a throughput of 734 Mbit/sec and the 2nd fastest overall, out of all designs...
We present a novel architecture of a communication engine for non-coherent distributed shared memory systems. The shared memory is composed of a set of nodes exporting their memory. Remote memory access is possible by forwarding local load or store transactions to remote nodes. No software layers are involved in a remote access, on either the origin or the target side: a user level process can directly access...
Investigation of the cryptanalytic strength of RSA cryptography requires computing many GCDs of two long integers (e.g., of length 1024 bits). This paper presents a high throughput parallel algorithm to perform many GCD computations concurrently on a GPU based on the CUDA architecture. The experiments with an NVIDIA GeForce GTX285 GPU and a single core of a 3.0 GHz Intel Core2 Duo E6850 CPU show that...
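To illustrate why GCD computations matter for RSA, the sketch below uses the well-known binary (Stein) GCD, which relies only on shifts and subtractions. It is a sequential software illustration and not the GPU algorithm of the abstract above; the function name `binary_gcd` and the small example primes are assumptions chosen for readability.

```python
def binary_gcd(a, b):
    """Return gcd(a, b) for non-negative integers using shifts and subtraction."""
    if a < 0 or b < 0:
        raise ValueError("inputs must be non-negative")
    if a == 0:
        return b
    if b == 0:
        return a
    # Factor out the powers of two common to both operands.
    shift = 0
    while ((a | b) & 1) == 0:
        a >>= 1
        b >>= 1
        shift += 1
    # Make a odd; remaining factors of two in a cannot be common.
    while (a & 1) == 0:
        a >>= 1
    while b != 0:
        # Strip factors of two from b, which are not shared with odd a.
        while (b & 1) == 0:
            b >>= 1
        # Keep a <= b, then replace b by the (even) difference b - a.
        if a > b:
            a, b = b, a
        b -= a
    return a << shift


if __name__ == "__main__":
    # Two toy "RSA moduli" that share the prime factor 101 (illustrative values).
    n1 = 101 * 103
    n2 = 101 * 107
    print(binary_gcd(n1, n2))
```

If two RSA moduli share a prime factor, their GCD reveals that factor and both moduli can then be factored by a single division. Checking many pairs of moduli requires many independent GCD computations, which is the kind of workload that lends itself to concurrent execution.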
State-of-the-art hardware-based techniques achieve high performance and maximize the efficiency of packet classification applications. The predominant example of these, ternary content addressable memory (TCAM) based packet classification systems, can achieve much higher throughput than software-based techniques. However, they suffer from high power consumption due to the highly parallel architecture and...
Image compression is one of the major services in space flight missions and remote sensing systems. This paper presents a high-speed, high-performance image compressor for future spacecraft and micro-satellites. The proposed compression core is based on a modified CCSDS algorithm, and the processing speed is improved 40 times using a parallel architecture and pipelining. The experimental results...
In this paper, we propose a new parallel-pipeline approach to designing small-area, low-complexity convolutional encoders suitable for high data throughput communication applications. This approach applies to both the OTM (one to many) and the MTO (many to one) encoder schemes. Here, we discuss the problem of designing a low-cost parallel-pipeline encoder for the MTO case. The new architecture...
With the ever-increasing data throughputs required by communication applications, there is a real need for new effective architectures (small area and high speed) for circuit parts dedicated to error detecting/correcting coding (EDC/ECC). In this paper, we propose a new parallel-pipeline design scheme for convolutional encoders that meets these requirements. This approach applies to both the OTM (One...
In cryptographic applications, private key algorithms usually aim at high-throughput data communication, while public key algorithms require much lower throughput for private key exchange and authentication. To increase hardware utilization and reduce area overhead, this paper presents a flexible divider design in GF(2^m), which can be configured to operate in either SIMD or SISD mode. When applied...
Implementation of high-throughput VLSI chips for low-density parity-check codes has been considered very difficult, especially when the row or column weight of the code is high. In this paper, a projective-geometry (PG) LDPC code is implemented in VLSI employing the proposed soft bit flipping (SBF) algorithm. The SBF algorithm requires only simple interconnections, but its error correcting performance...
We develop a superscalar architecture to compute the fixed-point FFT (Fast Fourier Transform). Some high-speed, time-sensitive real-time applications demand better and more efficient FFT implementations and call for improved, novel architectures. This motivates the use of embedded custom hardware, for instance an FPGA, which lets us run operations in parallel and yields better performance. We take...
Recently proposed techniques for peak power management involve centralized decision-making and assume quick evaluation of the various power management states. These techniques do not prevent instantaneous power from exceeding the peak power budget, but instead trigger corrective action when the budget has been exceeded. Similarly, they are not suitable for many-core architectures (processors with...
Parallel architectures have become an increasingly popular way to achieve high performance with low power consumption. In order to leverage these benefits, applications are decomposed into multiple computational modules (tasks) that collectively operate and communicate in parallel. In this paper, we present a scalable and highly parametric streams-based communication architecture for inter-module...