NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture...
There is growing evidence that current architectures handle cache-unfriendly applications such as sparse math operations, data analytics, and graph algorithms poorly. This is due, in part, to the irregular memory access patterns demonstrated by these applications, and to how remote memory accesses are handled. This paper introduces a new, highly-scalable PGAS memory-centric system architecture...
Recently, many approaches have applied light-field image processing on smartphones and wearable devices. A Graphics Processing Unit (GPU) is commonly used to exploit parallelism in such image processing. However, the access pattern in the light-field application is sparser than in typical stencil applications and does not use all data in a cache line. Furthermore, the data requests to multiple locations...
In most computer programs and general-purpose computing environments, the precision of any calculation is limited by the word size of the computer. However, for some applications, such as cryptography, this precision is not sufficient. In these cases, it is necessary to use multiple-precision numbers. Operations on such numbers in most computer software are implemented by third party libraries that...
Hardware caches are widely employed in GPGPUs to achieve higher performance and energy efficiency. Incorporating hardware caches in GPGPUs, however, does not immediately guarantee enhanced performance and energy efficiency due to high cache contention and thrashing. To address the inefficiency of GPGPU caches, various adaptive techniques (e.g., warp limiting) have been proposed. However, relatively...
In this paper, a meaningful parallel implementation of spatial fuzzy c-means (SFCM) is presented. It has the advantage of being a powerful tool of classical fuzzy c-means. The main effort in this work is to significantly reduce both its complexity and its execution time. The technique is inspired by the technological progress of GPU hardware. The studies we have conducted...
In this paper, a highly flexible and energy-efficient reconfigurable symmetric cryptographic processor architecture is presented, which is based on a very long instruction word (VLIW) structure. By analyzing basic operations and storage characteristics of symmetric ciphers, an application-specific instruction-set system for symmetric ciphers is proposed. Eleven kinds of reconfigurable cryptographic...
Software Defined Networking (SDN) architecture enables centralized control of the forwarding behavior of individual network elements. While SDN brings many well-known benefits, such as manageability and adaptability, it also poses some challenges. Scalability becomes an issue in highly dynamic, large scale networks, where the forwarding rules of single elements must be updated at a high pace by a...
Presenting large formal instruction set models as executable functions makes them accessible to engineers and useful for less formal purposes such as simulation. However, it is more difficult to extract information about the behaviour of individual instructions for reasoning. We present a method which combines symbolic evaluation and symbolic execution techniques to provide a rule-based view of instruction...
The recently invented thick control flow (TCF) model packs together an unbounded number of fibers, thread-like computational entities, flowing through the same control path. This promises to simplify parallel programming by partially eliminating looping and artificial thread arithmetics. In this paper we outline an architecture for efficiently executing programs written for the TCF model. It features...
In combinatorial optimization problems, the neighborhood search (NS) is a fundamental component for local search based heuristics. It consists of selecting a solution from a high cardinality set of neighbor solutions, by means of operations called moves. To perform this search, NS algorithms usually adopt two main approaches: selecting the first or best improving move. The Multi Improvement (MI) strategy...
The Directed Acyclic Graph (DAG) is a standard model used to describe tasks that execute according to precedence constraints and that allows intra-task parallelism. This model is well suited to camera-based applications where multiple treatments must be executed in parallel according to the camera input, such as those found, for example, in self-driving cars or image recognition via convolutional neural...
This paper addresses the problem of balancing the on-chip packet latencies in a chip multi-processor (CMP), which is simultaneously executing multiple applications. Specifically, this paper presents a balanced application-to-core mapping algorithm that aims to minimize the maximum on-chip packet latency of all running applications. The paper starts by formulating the balanced mapping problem for CMPs...
Architecture simulators play an important role in exploring frontiers in the early stages of architecture design. However, the execution time of simulators increases with the number of cores. The sampling simulation technique that was originally proposed to simulate single-core processors is a promising approach to reduce simulation time. Two main hurdles for multi/many-core are preparing...
Many Integrated Core (MIC) architecture systems are becoming increasingly popular for HPC applications as they have the dual advantage of accelerating vector processing and offering a general-purpose programming model. One of the key challenges for energy-efficient execution on MIC architecture systems is to determine time- and energy-efficient configurations among a large system configuration space. Given...
The superb efficiency and noise resilience of human cognition come from its extensive, highly associative memory. For example, it is easy for humans to recognize occluded or incomplete text images based on their context. Associative inference in the neocortex system is a concurrent process. Serial implementation of this concurrent process not only hinders its performance, but also limits the quality...
The performance of a CUDA kernel often depends on the number of threads per thread-block (thread-block size), and the optimal configuration differs according to the graphics processing unit (GPU) hardware and the data size given to the kernel. In particular, in linear algebra libraries such as Basic Linear Algebra Subprograms (BLAS), most routines support a wide range of problem sizes and various...
As high performance computing (HPC) systems reach exascale proportions, the cost of simulation in time and resources increases. Tools for selecting representative parts of parallel applications to reduce simulation cost are widespread, e.g., BarrierPoint achieves this by analysing abstract characteristics such as basic blocks and reuse distances. However, architectures new to HPC will have a limited...
The paper presents the architecture of a PLC CPU consisting of multiple cores enabling parallel processing of control algorithms. Control programs consist of many program fragments that are suitable for parallel execution. The proposed architecture is constructed from independent logic and arithmetic units. They share common data memories of respective types. In order to enable tight coupling of processing...
High-performance automata-processing engines are traditionally evaluated using a limited set of regular expression rulesets. While regular expression rulesets are valid real-world examples of use cases for automata processing, they represent a small proportion of all use cases for automata-based computing. With the recent availability of architectures and software frameworks for automata processing,...