In recent years, the number of cores in digital signal processors and general-purpose processors has increased dramatically. Concurrently, significant research has been devoted to exploiting this high degree of parallelism. In particular, this research focuses on efficiently scheduling hardware/software systems onto multicore architectures. The scheduling process consists of statically choosing...
Cloud computing provides a cost-effective computing platform for big data workflows where moldable parallel computing models such as MapReduce are widely applied to meet stringent performance requirements. The granularity of task partitioning in each moldable job has a significant impact on workflow completion time and financial cost. We investigate the properties of moldable jobs and design a big-data...
Exascale computing is facing a gap between the ever-increasing demand for application performance and the underlying chip technology, which no longer delivers the expected exponential increases in CPU performance. The industry is now progressively moving towards dedicated accelerators to deliver high performance and better energy efficiency. However, the question of programmability still remains...
Compilers use static analyses to justify program optimizations. As every optimization must preserve the semantics of the original program, static analyses typically fall back on conservative approximations. Consequently, the set of states for which the optimization is invalid is overapproximated and potential optimization opportunities are missed. Instead of justifying the optimization statically,...
The left-preconditioned communication-avoiding conjugate gradient (LP-CA-CG) method is applied to the pressure Poisson equation in the multiphase CFD code JUPITER. The arithmetic intensity of the LP-CA-CG method is analyzed, and is dramatically improved by loop splitting for inner product operations and for three-term recurrence operations. Two LP-CA-CG solvers with block Jacobi preconditioning and...
Many embedded processors do not support floating-point arithmetic, but they generally provide SIMD support as a means to improve performance at near-zero cost overhead. Achieving good performance when targeting such processors requires the use of fixed-point arithmetic and efficient SIMDization. To reduce application time-to-market, automatic SIMDization and floating-point conversion methodologies...
This paper gives an overview of a technique for verifying cache coherence protocols described in the Promela language. The approach comprises the following steps. First, a model written for a certain configuration of the memory system is generalized to a model parameterized by the number of processors. Second, the parameterized model is abstracted from the exact number of processors. Finally,...
In the last decade, the scope of software optimizations has expanded to encompass energy consumption on top of the classical runtime minimization objective. In that context, several optimizations have been developed to improve software energy efficiency. However, these optimizations commonly rely on long profiling steps and are often implemented as unstable runtime systems, which limits their applicability...
While compilers offer a fair trade-off between productivity and executable performance in single-threaded execution, their optimizations remain fragile when addressing compute-intensive code for parallel architectures with deep memory hierarchies. Moreover, these optimizations operate as black boxes, impenetrable for the user, leaving them with no alternative to time-consuming and error-prone manual...
High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However, programming such architectures is extremely difficult due to the need to implement the application using multiple programming models and combine them in ad hoc ways. To optimize distributed applications both for modern hardware and for modern programmers...
Scheduling divisible loads in heterogeneous distributed systems is a well-known NP-hard problem. The problem is even more complex and challenging when its model has more than one objective. The difficulty lies in satisfying multiple objectives that may conflict with one another. This paper investigates a multi-objective scheduling problem for divisible loads in heterogeneous distributed systems. First,...
Split-execution computing leverages the capabilities of multiple computational models to solve problems, but splitting program execution across different computational models incurs costs associated with the translation between domains. We analyze the performance of a split-execution computing system developed from conventional and quantum processing units (QPUs) by using behavioral models that track...
Distributing a multiobjective optimization algorithm across multiple devices in parallel is a method for reducing the computation time of multiobjective evolutionary algorithms (MOEAs). As the number of processors increases, the gain from parallelization decreases. Therefore, the aim of the parallelization method is not only to decrease the overall algorithm execution time,...
The Artificial Bee Colony algorithm, inspired by the foraging behaviour of real honey bees, is one of the most popular swarm-intelligence-based optimization techniques. Like other population-based evolutionary computation approaches, the Artificial Bee Colony algorithm is intrinsically suitable for distributed architectures. However, determining which food source should be chosen to distribute between sub-colonies...
Power consumption is a key aspect of modern processor design. Optimizing the processor for power leads to direct savings in battery energy consumption in the case of mobile devices. At the same time, many mobile applications demand high computational performance. In large-scale computing, low-power compute devices help with thermal design and reduce the electricity bill. This paper presents...
This paper details the construction of an analytical performance model of HYDRA, a production nonlinear multigrid solver used by Rolls-Royce for computational fluid dynamics simulations. The model captures both the computational behaviour of HYDRA's key subroutines and the behaviour of its proprietary communication library, OPlus, with an absolute error consistently under 16% on up to 384 cores of...
Modern multicore hardware employs a variety of parallel execution units, including multiple CPU cores for executing multiple threads simultaneously, vector units such as the Intel SIMD units on the CPU cores, as well as GPU-like processing arrays. The availability of such an unprecedented level of parallelism on mainstream computers offers an enormous potential to enable a new generation of computation-intensive...
Simulations of statistical models have been used to validate theories of past events in the evolution of species. Studies concerning human evolution are important for understanding our history and biodiversity. However, these approaches use complex statistical models, leading to high computational cost. The present paper proposes optimization techniques for Hyper-threaded multicore architectures...
Algorithms are described for the resolution of shared vertices and higher-dimensional interfaces on domain-decomposed parallel meshes, and for ghost exchange between neighboring processors. Performance data is given for large (up to 64M tet and 32M hex element) meshes on up to 16k processors. Shared interface resolution for structured meshes is also described. Small modifications are required to enable...