Today, cache size and the number of levels in the cache hierarchy play an important role in improving computer performance. Using full-system simulation in gem5, the variation in memory bandwidth, system bus throughput, and L1 and L2 cache misses is measured by running the SPLASH-2 benchmarks on ARM and ALPHA processors. In this work we calculate cache misses, memory bandwidth and system bus throughput by running...
Although time-sharing the CPU has been an essential technique for virtualizing CPUs among threads and virtual machines, most commercial operating systems and hypervisors maintain relatively coarse-grained time slices to mitigate the cost of context switching. However, the proliferation of system virtualization poses a new challenge for coarse-grained time-sharing techniques, since operating systems...
The recent release of Altera's SDK for OpenCL has greatly eased the development of FPGA-based systems. Research has shown performance improvements brought by OpenCL on a single FPGA device. However, to meet the objectives of high performance computing, OpenCL needs to be evaluated using multiple FPGAs. This work proposes a scalable FPGA architecture for high performance computing. The design...
Heterogeneous systems consisting of multiple CPUs and GPUs are increasingly attractive as platforms for high performance computing. Such platforms are usually programmed using OpenCL, which provides program portability by allowing the same program to execute on different types of device. As such systems become more mainstream, they will move from application-dedicated devices to platforms that need...
Pipelining is an important technique in high-level synthesis, which overlaps the execution of successive loop iterations or threads to achieve high throughput for loop/function kernels. Since existing pipelining techniques typically enforce in-order thread execution, a variable-latency operation in one thread would block all subsequent threads, resulting in considerable performance degradation. In...
A number of simple performance measurements of disk speed, CPU, memory and network throughput were made on a dual ARM Cortex-A7 machine running Linux inside a Xen virtual machine that communicates with the outside through frontend and backend drivers. The average performance overhead of the Xen virtual machine is between 3 and 7 percent when the host is lightly loaded (running only the system software...
Traffic classification is one of the kernel applications in network management. Many Machine Learning (ML) traffic classification algorithms are based on decision trees. While most existing implementations of decision trees are hardware-based, a new trend in network applications is to use software-based solutions. The decision tree used for traffic classification is highly unbalanced, which makes it challenging...
Cellular providers are rapidly deploying multiple technologies such as cell biasing, carrier aggregation, and coordinated interference control/scheduling to improve capacity and coverage. In this paper, we explore a complementary transport-layer approach based on multipath TCP that can concurrently use multiple interfaces to boost the throughput of users with poor coverage and improve fairness. Multipath TCP...
Virtual switches, like Open vSwitch, have emerged as an important part of cloud networking architectures. They connect interfaces of virtual machines and establish the connection to the outer network via physical network interface cards. Today, all important cloud frameworks support Open vSwitch as the default virtual switch. However, general understanding about the performance implications of Open...
TCP Cubic is designed to better utilize high bandwidth-delay product paths in IP networks. It is currently the default TCP version in the Linux kernel. Our objective in this work is to better understand the performance of TCP Cubic in scenarios with a large number of competing long-lived TCP flows, as can be observed, e.g., in cloud environments. In such situations, Cubic connections tend to synchronize...
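The synchronization behavior studied above is driven by Cubic's window growth function, which regrows the congestion window toward the size it had before the last loss. A minimal sketch of that function follows, using the default constants from RFC 8312 and the Linux implementation (not values taken from this paper):

```python
# Sketch of the CUBIC window growth function (RFC 8312 defaults).
C = 0.4      # cubic scaling constant
BETA = 0.7   # fraction of the window kept after a loss event

def cubic_window(t, w_max):
    """Congestion window t seconds after the last loss event,
    where w_max is the window size just before that loss."""
    # k is the time needed to regrow the window back to w_max
    k = ((w_max * (1 - BETA)) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max
```

At t = 0 this yields BETA * w_max (the post-loss window), and at t = k it returns exactly to w_max; the concave-then-convex shape around k is what lets competing Cubic flows probe aggressively far from w_max but gently near it.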
Soft vector processors can augment and extend the capability of embedded hard vector processors in FPGA-based SoCs such as the Xilinx Zynq. We develop a compiler framework and an auto-tuning runtime that optimizes and executes data-parallel computation either on the scalar ARM processor, the embedded NEON engine or the Vectorblox MXP soft vector processor as appropriate. We consider computational...
For any network connection, the data throughput of the end device is related to its TCP (Transmission Control Protocol) buffer size, network latency and network bandwidth. In regions where open-market devices are popular, such as the European and Asian markets, devices ship with static buffer sizes that are independent of the operator's network conditions, resulting in either low throughput...
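The relationship between buffer size, latency and bandwidth in the abstract above is the bandwidth-delay product: a TCP buffer smaller than it caps throughput below the available bandwidth. A small sketch with illustrative numbers (not figures from the paper):

```python
def required_buffer_bytes(bandwidth_bps, rtt_seconds):
    """Minimum TCP send/receive buffer needed to keep the pipe full:
    the bandwidth-delay product, converted from bits to bytes."""
    return bandwidth_bps / 8 * rtt_seconds

# Example: a 100 Mbit/s path with 60 ms RTT needs roughly
# 100e6 / 8 * 0.060 = 750,000 bytes (~732 KiB) of buffer.
```

A static buffer below this value wastes bandwidth on high-RTT paths, while an oversized one inflates memory use and can add queuing delay, which is why a one-size-fits-all setting fits no operator's network.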
In this paper, a new implementation of a 3GPP LTE standards-compliant turbo decoder based on GPGPU is proposed. It uses the newest GPU, the Tesla K20c, which is based on the Kepler GK110 architecture. The new architecture has more powerful parallel computing capability, and we use it to fully exploit the parallelism in the turbo decoding algorithm in novel ways. Meanwhile, we use various memory hierarchies...
Device-to-Device (D2D) communications can efficiently support the growth in mobile data traffic by offloading part of the traffic from the cellular infrastructure. D2D communications are influenced by the propagation conditions between mobile devices that depend on the antenna heights, presence of obstacles, and mobility of devices. Analytical and simulation studies have shown that link-aware opportunistic...
A number of NUMA-aware synchronization algorithms have been proposed lately to address the scalability inefficiencies of existing locks. However, their presupposed local lock granularity, a physical processor, is often not the optimal configuration for various workloads. This paper further explores the design space by taking into consideration the physical affinity between the cores within a single...
This paper explores the possibility of efficiently using multicores in conjunction with multiple GPU accelerators under a parallel task programming paradigm. In particular, we address the challenge of extending a parallel_for template to allow its exploitation on heterogeneous systems. The extension is based on a two-stage pipeline engine which is responsible for partitioning and scheduling the chunks...
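The first stage of such a pipeline, splitting the iteration space into chunks that a scheduler can then hand to CPU cores or GPUs, can be sketched as follows. The function name and chunking policy are illustrative assumptions, not the paper's actual implementation:

```python
def partition_range(start, stop, chunk_size):
    """Split the iteration range [start, stop) of a parallel_for
    into contiguous chunks, as a partitioning stage would before
    a second stage schedules them across heterogeneous devices."""
    for lo in range(start, stop, chunk_size):
        yield (lo, min(lo + chunk_size, stop))
```

A fixed chunk size is the simplest policy; a real heterogeneous scheduler would typically size chunks adaptively, since the throughput gap between a CPU core and a GPU makes equal-sized chunks load-imbalanced.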
Commodity graphics processing units (GPUs) have rapidly evolved into high performance accelerators for data-parallel computing through a large array of processing cores and the CUDA programming model with a C-like interface. However, optimizing an application for maximum performance based on the GPU architecture is not a trivial task, owing to the tremendous change from conventional multi-core to the...
Achieving high I/O throughput on modern servers presents significant challenges. With increasing core counts, server memory architectures become less uniform, both in terms of latency as well as bandwidth. In particular, the bandwidth of the interconnect among NUMA nodes is limited compared to local memory bandwidth. Moreover, interconnect congestion and contention introduce additional latency on...
Current technology trends for efficient use of infrastructures dictate that storage converges with computation by placing storage devices, such as NVM-based cards and drives, in the servers themselves. With converged storage the role of the interconnect among servers becomes more important for achieving high I/O throughput. Given that Ethernet is emerging as the dominant technology for datacenters,...
Frequency table computation is a key step in decision tree learning algorithms. In this paper we present a novel implementation targeted at a dataflow architecture implemented on a field-programmable gate array (FPGA). Consistent with the dataflow model of computation, the kernel views the input dataset as synchronous streams of attribute and class values. The kernel was benchmarked using key functions from...
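The frequency table this kernel computes is a count of co-occurrences of attribute values and class labels, which decision tree learners use to score candidate splits. A minimal software sketch of the same stream view (example data is illustrative, not from the paper's benchmark):

```python
from collections import Counter

def frequency_table(attribute_values, class_values):
    """Count (attribute value, class label) co-occurrences, consuming
    the two columns as synchronized streams -- the same view the
    dataflow kernel takes of the input dataset."""
    return Counter(zip(attribute_values, class_values))

# Example: three rows of a weather dataset.
freq = frequency_table(['sunny', 'rain', 'sunny'], ['no', 'yes', 'no'])
# freq[('sunny', 'no')] counts how often 'sunny' coincides with 'no'.
```

Because each (value, label) pair is folded into a counter independently, the computation maps naturally onto a streaming dataflow pipeline: one pair in per cycle, no random access to the dataset.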