The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Implementing elliptic curve point multiplication (ECPM) based on residue number system (RNS) can efficiently use FPGA resources. In this paper, we propose a modular reduction method, where a kind of RNS pair is selected to achieve fast reduction. Our reduction method mainly needs several parallel additions while the reduction unit of previous designs require two multiplications which are computed...
In this paper, an efficient mapping of the pipeline single-path delay feedback (SDF) fast Fourier transform (FFT) architecture to field-programmable gate arrays (FPGAs) is proposed. By considering the architectural features of the target FPGA, significantly better implementation results are obtained. This is illustrated by mapping an R22SDF 1024-point FFT core toward both Xilinx Virtex-4 and Virtex-6...
DSP blocks are integrated in most modern high-performance FPGA devices in order to improve the speed and efficiency of computation-intensive DSP designs. Based on the existing ones, this paper proposes a new DSP block which can improve the speed and area-efficiency by reducing the delay of cascade path and supporting multi-input addition. Both architecture and implementation are described. Virtual...
In this paper, we propose a high-throughput pipeline architecture of the stream cipher ZUC which has been included in the security portfolio of 3GPP LTE-Advanced. In the literature, the schema with the highest throughput only implements the working stage of ZUC. The schemas which implement ZUC completely can only achieve a much lower throughput, since a self-feedback loop in the critical path significantly...
The paper is devoted to design of the digital components for safety-related instrumentation and control systems using the modern CAD tools. Traditionally, the digital components are built with matrix parallelism that reduces fault tolerance of circuits and safety of systems in their checkability. Circuits with bitwise pipeline data processing have advantage in checkability, but are considered as less...
This paper describes the architecture and implementation, from both the standpoint of target applications as well as circuit design, of an FPGA DSP Block that can efficiently support both fixed and single precision (SP) floating-point (FP) arithmetic. Most contemporary FPGAs embed DSP blocks that provide simple multiply-add-based fixed-point arithmetic cores. Current FP arithmetic FPGA solutions make...
In this paper, we propose an efficient 256×256 bit modular multiplier based on Montgomery reduction algorithm. The 256 × 256 bit modular multiplier is required in elliptic curve and pairing based cryptographic protocols to achieve 128 bit security level. The in-built features of modern FPGA are efficiently utilized. Two time consuming components (1) 512-bit addition (2) 256 × 256 bit multiplier are...
Modular multiplication is one of the most important operations in the public key cryptographic algorithms. In order to design a high-performance modular multiplier, we present a novel hybrid Montgomery modular multiplier over GF(p) on FPGAs, which employs Karatsuba and Knuth multiplication algorithms in different levels to implement large integer multiplication. A 9-stage pipeline full-word multiplier...
In the literature, multiplexer has been proposed for the ASIC implementation of unrolled CORDIC (COordinate Rotation DIgital Computer) processor. In this paper, the efficacy of this approach is studied for the implementation on FPGA. For this study, both non pipelined and 2 level pipelined CORDIC with 8 stages and using two schemes — one using adders in all the stages and another using multiplexers...
In this paper, multiply-accumulator (MAC) is designed for application in simple 16-bit RISC processors to enhance the processor's capability by adding new instruction set. Creation of new instruction set is achieved by modifying the processor's architecture using Verilog Hardware Description Language (Verilog HDL). The new instruction set has simple structure, and can be fully compatible with the...
This paper presents a reconfigurable mechanism for the multiplier. The proposed mechanism is applied to generate a multiplier, whose data width, type and pipeline depth can be customized. The data width of each operand of these generated multipliers can be configured for 4i where i=1, 2, 3, 4, 5, 6, 7, 8. And the data type of operand can be unsigned or signed at will. The multiplier is composed of...
We present our parametric hardware architecture of the NIST approved Lucas probabilistic primality test. To our knowledge, our work is the first hardware architecture for the Lucas test. Our main contributions are a hardware architecture for calculating the Jacobi symbol based on the binary Jacobi algorithm, a pipelined modular add-shift module for calculating the Lucas sequences, methods for dependence...
Achievable accuracy for mixed precision iterative refinement depends on the precisions supported by computing platforms. Even though the arithmetic unit precision can be flexible for programmable logic computing architectures (e.g. FPGAs), previous work rarely discusses the performance benefits due to enabling flexible achievable accuracy. Hence, we propose an iterative refinement approach on FPGAs...
High performance implementation of 2D digital filters are highly desired in many applications for real-time processing. In this paper, a multiprocessor realization of a 2D denominator separable digital filter is implemented in Altera Stratix III FPGA. The implementation achieves a data throughput equivalent to one multiplication and two additions, plus one clock cycle. It has been found that the maximum...
FPGA technology constitutes an attractive platform for high-performance accelerators of parallel workloads in general-purpose computers. Matrix multiplication is a computationally intensive application that is highly parallelizable. Previous work has typically described custom floating-point components and reported on specific designs or implementations using these components for FPGA-based matrix...
In this paper we describe a double precision floating point sparse matrix-vector multiplier (SpMV) and its performance as implemented on a Convey HC-1 reconfigurable computer. The primary contributions of this work are a novel streaming reduction architecture for floating point accumulation, a novel on-chip cache optimized for streaming compressed sparse row (CSR) matrices, and end-to-end integration...
Many scientific or engineering applications perform reduction of sets of sequential data streams. If the core operator of the reduction is deeply pipelined, dependencies between the input data elements cause data hazards in the pipeline. To tackle this problem, we propose a multiple set variable length reduction design with low latency and high pipeline utilization in this paper. We prove the buffer...
To resolve the latency problem of implementing Montgomery modular multiplication algorithm using the linear systolic array, this paper proposes the improved Montgomery algorithm, and improves the systolic array by combining the long carry save adder (CSA) structure. This paper also proposes a series of methods to optimize the critical path and a non-waiting modular multiplication strategy which can...
This work proposes a cascaded and pipelined (CAP) reconfigurable architecture to achieve high throughput in executing numerical reduction operations commonly found in many scientific computations by (1) cascading multiple deeply-pipelined floating-point arithmetic cores to match the inherent computing structure underlying target operations, (2) interleaving multiple computing threads to eliminate...
Flexible Application Specific Instruction-set Processors (ASIPs) are starting to replace monolithic Application Specific Integrated Circuits (ASICs) in a wide variety of fields. However the construction of an ASIP is today associated with a substantial design effort. Novel Generator of Accelerators And Processors (NoGap) is a tool for ASIP design utilizing hardware multiplexed data paths. One of the...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.