Deterministic execution of a multithreaded application guarantees the same output as long as the application runs with the same input parameters. Determinism helps a programmer test and debug an application and provide fault tolerance in replica-based systems. Additionally, Transactional Memory (TM) greatly simplifies development of multithreaded applications where applications use transactions...
Data races are one of the most common problems in concurrent programs. Because the SystemC standard allows nondeterministic scheduling of processes, data races can arise. Hence, different executions of the same concurrent program may lead to unexpected results due to race conditions. We develop a hybrid dynamic data race detection algorithm for SystemC/TLM designs that adopts the well-studied dynamic race...
This paper presents a holistic comparison of different parallel SystemC simulation approaches at the register transfer level (RTL). The effect of RTL modeling styles and simulation strategies on performance is evaluated to show the potential and limitations of state-of-the-art parallel simulation techniques on shared memory machines. Experiments show that the simulation performance strongly depends...
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in the area of computer vision which can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums. The main contribution of this paper is to introduce the...
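The two-pass construction described in this abstract (column-wise prefix sums followed by row-wise prefix sums) can be illustrated with a minimal sequential sketch. It is not the paper's HMM or GPU algorithm; the function names `summed_area_table` and `region_sum` are hypothetical and chosen only for illustration.

```python
def summed_area_table(matrix):
    """Build a summed area table with two prefix-sum passes (sequential sketch)."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    sat = [list(row) for row in matrix]  # copy so the input is not modified

    # Pass 1: column-wise prefix sums (accumulate down each column).
    for r in range(1, rows):
        for c in range(cols):
            sat[r][c] += sat[r - 1][c]

    # Pass 2: row-wise prefix sums (accumulate along each row).
    for r in range(rows):
        for c in range(1, cols):
            sat[r][c] += sat[r][c - 1]

    return sat


def region_sum(sat, top, left, bottom, right):
    """Sum of the inclusive rectangle [top..bottom] x [left..right] via the SAT."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


if __name__ == "__main__":
    table = summed_area_table([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(table)
    print(region_sum(table, 1, 1, 2, 2))
```

Once the table is built, the sum over any axis-aligned rectangle takes at most four lookups, which is why the structure is common in computer vision.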
Detecting concurrency bugs, such as data races, atomicity violations and order violations, is a cumbersome task for programmers. This situation is further exacerbated by the increasing number of cores in a single machine and the prevalence of threaded programming models. Unfortunately, many existing software-based approaches usually incur high runtime overhead or accuracy loss, while most hardware-based...
Concurrent programs are known to be difficult to test and maintain. These programs often fail because of concurrency bugs caused by non-deterministic interleavings among shared memory accesses. Even when a concurrency bug is detected, it is still hard to isolate its root cause, because the complex thread interleavings or schedules are difficult to understand. In this paper, we propose...
Automation systems must primarily be deterministic and reliable, especially in safety-critical environments. With recent trends such as mass customization or Industry 4.0, there is an increasing need for automation systems to be dynamic. Changing parts of the software of today's automation systems, however, typically requires rebooting the controller, which makes software updates a complex and costly...
Power states in power-scalable systems are managed to maximize performance and reduce energy waste. Power-scalable processor capabilities (e.g., Intel Turbo Boost) embrace a "faster is better" approach to power management. While these technologies can vastly improve performance and energy efficiency, there is a growing body of evidence that "faster is not always better". For example,...
Graphics Processing Units (GPUs) have a huge number of cores to speed up graphical computations, and they are being used in a wide range of general-purpose applications that require high performance. In this paper, GPU computing is exploited to model the signal propagation and the interference in large RFID systems, which are a promising solution for achieving pervasive computing since they offer the...
This work addresses the generation of parallel on-chip heterogeneous systems starting from high-level code with explicit parallelism, based on a custom compiler and a high-level synthesis flow. Blending parallel software programming paradigms with high-level synthesis introduces a range of challenges at both the architectural level and the programming paradigm level, particularly involving the mismatches...
Barriers are synchronization operations widely used by compilers and programmers; they are flexible and convenient, but they have some drawbacks. Threads that arrive at a barrier ahead of other threads have to wait for the later threads, which wastes time. Our experiments show that up to 35% of the total execution time is wasted on synchronization. Inspired by this, we propose barrier speculation which...
Hilbert-Huang Transform (HHT) is a process of adaptive analysis applicable to non-linear and non-stationary data such as voice and biomedical signals. Empirical Mode Decomposition (EMD) is a key step in HHT and decomposes data into multiple Intrinsic Mode Functions (IMFs). Traditionally, EMD is computed on all data points in a serial manner, so its execution time grows at least linearly with the...
Work stealing is a popular and effective approach to implement load balancing in modern multi-/many-core systems, where each parallel thread has its local deque to maintain its own work-set of tasks and performs load balancing by stealing tasks from other deques. Unfortunately, the existing concurrent deques have two limitations. Firstly, these algorithms require memory fences in the owner's critical...
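The per-thread deque scheme described in this abstract can be sketched as follows. This is a deliberately simple lock-based illustration, not the fence-free algorithm the paper proposes; the class `WorkStealingDeque` and the helper `next_task` are hypothetical names. The owner pushes and pops at one end, while thieves steal from the opposite end.

```python
import threading
from collections import deque


class WorkStealingDeque:
    """Lock-based sketch of a per-worker task deque (assumed design, not the paper's)."""

    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()

    def push(self, task):
        # Owner adds new work at the bottom (right end).
        with self._lock:
            self._tasks.append(task)

    def pop(self):
        # Owner takes its most recently pushed task (LIFO); None if empty.
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):
        # A thief takes the oldest task from the top (left end); None if empty.
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


def next_task(own, others):
    """Return the next task for a worker: local work first, then try stealing."""
    task = own.pop()
    if task is not None:
        return task
    for victim in others:
        task = victim.steal()
        if task is not None:
            return task
    return None


if __name__ == "__main__":
    busy, idle = WorkStealingDeque(), WorkStealingDeque()
    for name in ("t1", "t2", "t3"):
        busy.push(name)
    print(next_task(idle, [busy]))  # idle worker steals the oldest task
    print(next_task(busy, [idle]))  # owner pops its newest task
```

Taking every operation under a lock keeps the sketch easy to verify but adds synchronization cost to the owner's common path, which is the kind of overhead that concurrent deque designs aim to reduce.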
Extracting parallelism from programs is becoming increasingly important as the number of processor cores increases. Parallelization usually involves splitting a sequential thread and scheduling the split code to run on multiple cores. For example, some previous Speculative Multi-Threading research used code block reordering to automatically parallelize a sequential thread on multi-core processors. Although the parallelized...
Mahalanobis distance algorithms have been widely used in machine learning and classification, and using GPUs to improve the performance of applications that rely on them has important practical significance, especially for applications with high real-time demands. However, due to the complexity of GPU hardware architectures, how to complete the algorithm optimization and achieve high...
The popular wave front parallelization has been proposed to encode H.264/AVC video employing macro-block level parallelism. This approach, however, fails to achieve optimal performance due to a significant overhead of barrier-based synchronization. All threads must wait for the slowest ones to complete encoding before starting the next processing wave. In this paper, we propose a dynamic scheduling...
Scalability of a multi-tier enterprise system is limited by the presence of software and hardware resource bottlenecks. These bottlenecks typically occur at larger numbers of users. It would help enterprise applications significantly if these bottlenecks were known a priori, during performance testing itself. This paper deals with predicting the performance of such systems and models an application...
This paper presents the application of the PCJ library to the parallelization of selected HPC applications implemented in the Java language. The library is motivated by the partitioned global address space (PGAS) model represented by Co-Array Fortran, Unified Parallel C, X10, and Titanium.
Parallelized applications running on many-core Network-on-Chip (NoC) processors may spend a large part of their execution time synchronizing threads mapped onto multiple NoC nodes if synchronization for NoC processors is not carefully designed. In this paper, we propose an instruction-based synchronization solution applied in a packet-switched many-core NoC processor with 2D mesh grid topology. Return...
In this paper, we present the design of a hardware temporal multi-threading architecture for a Java processor. The Java virtual machine (JVM) model is a stack machine where the process state is the snapshot of the Java stack. If the runtime stack is stored (or cached) in on-chip memory for performance reasons, the backup and restoration of the Java runtime stacks for context switching would be expensive...