Search results for: Jianbin Fang

Items from 1 to 19 out of 19 results

chapter

Efficient and Portable ALS Matrix Factorization for Recommender Systems

Jing Chen, Jianbin Fang, Weifeng Liu, Tao Tang, more

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 409 - 418

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Alternating least squares (ALS) has been proved to be an effective solver of matrix factorization for recommender systems. To speedup factorizing performance, various parallel ALS solvers have been proposed to leverage modern multi-core CPUs and many-core GPUs/MICs. Existing implementations are limited in either speed or portability (constrained to certain platforms). In this paper, we present an...

chapter

An Energy-Efficient Implementation of LU Factorization on Heterogeneous Systems

Canqun Yang, Cheng Chen, Tao Tang, Xuhao Chen, more

2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS) > 971 - 979

2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)

Energy consumption is increasingly becoming a critical issue in HPC. There is a broad consensus that future exascale-computing will be strongly constrained by energy consumption. Heterogeneous systems usually feature higher energy efficiency than homogeneous ones since the former employ coprocessors that provide higher GFlops/Watt than CPUs. Thus, it is of great importance to better utilize the coprocessors...

chapter

Evaluating the Performance Impact of Multiple Streams on the MIC-Based Heterogeneous Platform

Zhaokui Li, Jianbin Fang, Tao Tang, Xuhao Chen, more

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1341 - 1350

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Using multiple streams can improve the overall system performance by mitigating the data transfer overhead on heterogeneous systems. Prior work focuses a lot on GPUs but little is known about the performance impact on (Intel Xeon) Phi. In this work, we apply multiple streams into six real-world applications on Phi. We then systematically evaluate the performance benefits of using multiple streams...

chapter

High Performance Parallel Graph Coloring on GPGPUs

Pingfan Li, Xuhao Chen, Zhe Quan, Jianbin Fang, more

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 845 - 854

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Graph coloring has been broadly used to discover concurrency in parallel computing, where vertices with the same color represent subtasks that can be processed simultaneously. To speedup graph coloring for large scaledatasets, parallel algorithms have been proposed to leverage the massive hardware resources on modern multicore CPUs or GPGPUs. Existing GPU implementations either have limited performance...

chapter

Evaluating Multi-core and Many-Core Architectures through Accelerating an Alternating Direction Implicit CFD Solver

Liang Deng, Jianbin Fang, Fang Wang, Hanli Bai

2016 15th International Symposium on Parallel and Distributed Computing (ISPDC) > 1 - 10

2016 15th International Symposium on Parallel and Distributed Computing (ISPDC)

In this paper, we accelerate a double-precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house computational fluid dynamics (CFD) software on the latest multi-core and many-core architectures (Intel Ivy Bridge CPU, Intel Xeon Phi 7110P coprocessor and NVIDIA Kepler K20c GPU). For the GPU platform, both the OpenACC-based and...

chapter

High Performance Computing of Fast Independent Component Analysis for Hyperspectral Image Dimensionality Reduction on MIC-Based Clusters

Minquan Fang, Yi Yu, Weimin Zhang, Heng Wu, more

2015 44th International Conference on Parallel Processing Workshops > 138 - 145

2015 44th International Conference on Parallel Processing Workshops (ICPPW)

Fast independent component analysis (Fast ICA) for hyper spectral image dimensionality reduction is computationally complex and time-consuming due to the high dimensionality of hyper spectral images. By analyzing the Fast ICA algorithm, we design parallel schemes for covariance matrix calculating, white processing and ICA iteration at three parallel levels: multicores, many integrated cores (MIC),...

chapter

Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels

Jianbin Fang, Henk Sips, Pekka Jaaskelainen, Ana Lucia Varbanescu

2014 43rd International Conference on Parallel Processing > 162 - 171

2014 43nd International Conference on Parallel Processing (ICPP)

Due to the diversity of processor architectures and application memory access patterns, the performance impact of using local memory in OpenCL kernels has become unpredictable. For example, enabling the use of local memory for an OpenCL kernel can be beneficial for the execution on a GPU, but can lead to performance losses when running on a CPU. To address this unpredictability, we propose an empirical...

chapter

Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer

Chuanfu Xu, Lilun Zhang, Xiaogang Deng, Jianbin Fang, more

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 725 - 734

2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS)

HOSTA is an in-house high-order CFD software that can simulate complex flows with complex geometries. Large scale high-order CFD simulations using HOSTA require massive HPC resources, thus motivating us to port it onto modern GPU accelerated supercomputers like Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we collaborate CPU and GPU for HOSTA instead of using a...

chapter

Quantifying the performance impacts of using local memory for many-core processors

Jianbin Fang, Henk Sips, Ana Lucia Varbanescu

2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS) > 1 - 10

2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS)

Due to the increasing complexity of multi/many-core architectures (with their mix of caches and scratch-pad memories) and applications (with different memory access patterns), the performance of many workloads becomes increasingly variable. In this work, we address one of the main causes for this performance variability: the efficiency of the memory system. Specifically, based on an empirical evaluation...

chapter

Sesame: A User-Transparent Optimizing Framework for Many-Core Processors

Jianbin Fang, Ana Lucia Varbanescu, Henk Sips

2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing > 70 - 73

2013 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

With the integration of more computational cores and deeper memory hierarchies on modern processors, the performance gap between naively parallel zed code and optimized code becomes much larger than ever before. Very often, bridging the gap involves architecture-specific optimizations. These optimizations are difficult to implement by application programmers, who typically focus on the basic functionality...

chapter

Performance Traps in OpenCL for CPUs

Jie Shen, Jianbin Fang, Henk Sips, Ana Lucia Varbanescu

2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing > 38 - 45

2013 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: "OpenCL is not performance portable!" or "Why using OpenCL for CPUs after all?!"...

chapter

ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels

Jianbin Fang, Ana Lucia Varbanescu, Jie Shen, Henk Sips

2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing > 375 - 383

2013 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

Recent parallel architectures are equipped with local memory, which simplifies hardware design at the cost of increased program complexity due to explicit management. To simplify this extra-burden that programmers have, we introduce an easy-to-use API, ELMO, that improves productivity while preserving high performance of local memory operations. Specifically, ELMO is a generic API that covers different...

chapter

Accelerating Cost Aggregation for Real-Time Stereo Matching

Jianbin Fang, Ana Lucia Varbanescu, Jie Shen, Henk Sips, more

2012 IEEE 18th International Conference on Parallel and Distributed Systems > 472 - 481

2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS)

Real-time stereo matching, which is important in many applications like self-driving cars and 3-D scene reconstruction, requires large computation capability and high memory bandwidth. The most time-consuming part of stereo-matching algorithms is the aggregation of information (i.e. costs) over local image regions. In this paper, we present a generic representation and suitable implementations for...

chapter

Performance Gaps between OpenMP and OpenCL for Multi-core CPUs

Jie Shen, Jianbin Fang, Henk Sips, Ana Lucia Varbanescu

2012 41st International Conference on Parallel Processing Workshops > 116 - 125

2012 41st International Conference on Parallel Processing Workshops (ICPPW)

OpenCL and OpenMP are the most commonly used programming models for multi-core processors. They are also fundamentally different in their approach to parallelization. In this paper, we focus on comparing the performance of OpenCL and OpenMP. We select three applications from the Rodinia benchmark suite (which provides equivalent OpenMP and OpenCL implementations), and carry out experiments with different...

chapter

A Comprehensive Performance Comparison of CUDA and OpenCL

Jianbin Fang, Ana Lucia Varbanescu, Henk Sips

2011 International Conference on Parallel Processing > 216 - 225

2011 International Conference on Parallel Processing (ICPP)

This paper presents a comprehensive performance comparison between CUDA and OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world ones. We make an extensive analysis of the performance gaps taking into account programming models, ptimization strategies, architectural details, and underlying compilers. Our results show that, for most applications, CUDA performs at...

chapter

An Auto-tuning Solution to Data Streams Clustering in OpenCL

Jianbin Fang, Ana Lucia Varbanescu, Henk Sips

2011 14th IEEE International Conference on Computational Science and Engineering > 587 - 594

2011 IEEE 14th International Conference on Computational Science and Engineering (CSE)

Due to its applicability to numerous types of data, including telephone records, web documents, and click streams, the data stream model has recently attracted attention. For analysis of such data, it is crucial to process the data in a single pass, or a small number of passes, using little memory. This paper provides an OpenCL implementation for data streams clustering, and then presents several...

chapter

Optimizing Adaptive Synchronization in Parallel Simulators for Large-scale Parallel Systems and Applications

Chuanfu Xu, Yonggang Che, Jianbin Fang, Zhenghua Wang

2010 10th IEEE International Conference on Computer and Information Technology > 131 - 138

2010 IEEE 10th International Conference on Computer and Information Technology (CIT)

This paper addresses the optimization of parallel simulators for large-scale parallel systems and applications. Such simulators are often based on parallel discrete event simulation with conservative or optimistic protocols to synchronize the simulating processes. The paper considers how available future information about events and application behaviors can be efficiently extracted and further exploited...

chapter

A Two-Phase Mapping Method for Parallel Applications Simulation

Jianbin Fang, Chuanfu Xu, Yonggang Che, Yufen Weng, more

2009 International Conference on Computational Intelligence and Software Engineering > 1 - 4

2009 International Conference on Computational Intelligence and Software Engineering

Efficient mapping of logical processes to physical processes is one of key technologies to accelerate parallel performance simulation. Aiming at minimizing the communications between SMP nodes and between host physical processes, this paper presents a novel method named TPsmp-LP³M. It automatically extracts communication pattern of logical processes from trace and then generates a two-phase mapping...

chapter

Towards efficient mapping in large-scale trace-driven parallel performance simulation

Chuanfu Xu, Yonggang Che, Jianbin Fang, Zhenghua Wang

2009 IEEE Youth Conference on Information, Computing and Telecommunication > 455 - 458

2009 IEEE Youth Conference on Information, Computing and Telecommunication (YC-ICT 2009)

In parallel performance simulation of parallel systems, a large amount of logic processes (LP) must be mapped to relatively small number of physical elements (PE). Previous researches have shown that different mapping schemes could result in significant variation in the whole parallel simulation cost. In this paper, we propose, implement, and evaluate a minimum communication-guided mapping (MiniCoM)...

Filter options

Publication type:
book

Publication date

Set your own date range

Keywords

OPENCL (9)
OPTIMIZATION (5)
PARALLEL PROCESSING (5)
COMPUTATIONAL MODELING (4)
GRAPHICS PROCESSING UNITS (4)
INSTRUCTION SETS (4)
BENCHMARK TESTING (3)
HARDWARE (3)
KERNEL (3)
LOCAL MEMORY (3)
PARALLEL PERFORMANCE SIMULATION (3)
ARRAYS (2)
BIGSIM SIMULATOR (2)
CUDA (2)
DATA MINING (2)
GPUS (2)
LOAD BALANCE (2)
LOAD MODELING (2)
MICROWAVE INTEGRATED CIRCUITS (2)
PARALLEL SYSTEMS (2)
PERFORMANCE (2)
PERFORMANCE COMPARISON (2)
PERFORMANCE EVALUATION (2)
PROGRAMMING (2)
RESOURCE ALLOCATION (2)
SCHEDULES (2)
ADAPTATION MODEL (1)
ADAPTIVE OPTIMISTIC PROTOCOLS (1)
ADAPTIVE SYNCHRONIZATION (1)
ALGORITHM DESIGN AND ANALYSIS (1)
ALTERNATING DIRECTION IMPLICIT (1)
ALTERNATING LEAST SQUARES (1)
ANALYTICAL MODELS (1)
API (1)
AUTO-TUNING (1)
BLOCKED MAPPING (1)
BRIDGES (1)
CFD (1)
CFD SOLVER (1)
CLUSTERING (1)
COLOR (1)
COMMUNICATION PATTERN (1)
COMPILERS (1)
COMPUTER ARCHITECTURE (1)
COMPUTERS (1)
CONCURRENT COMPUTING (1)
CONSERVATIVE PROTOCOLS (1)
COPROCESSORS (1)
COST AGGREGATION (1)
COVARIANCE MATRICES (1)
CPU (1)
CPU-GPU COLLABORATION (1)
DATA STREAMS (1)
DATA STRUCTURES (1)
DATA TRANSFER (1)
DISCRETE EVENT SIMULATION (1)
ENERGY EFFICIENCY (1)
EQUATIONS (1)
FAST INDEPENDENT COMPONENT ANALYSIS (1)
GPGPU (1)
GPU (1)
GPU PARALLELIZATION (1)
GRAPH COLORING (1)
HETEROGENEOUS SYSTEM (1)
HIGH PERFORMANCE COMPUTING (1)
HIGH-ORDER FINITE DIFFERENCE SCHEME (1)
HPL (1)
HYPERSPECTRAL IMAGE DIMENSIONALITY REDUCTION (1)
HYPERSPECTRAL IMAGING (1)
IMAGE COLOR ANALYSIS (1)
INDEXES (1)
INTER-NODE LATENCY (1)
ITERATIVE BEHAVIOR (1)
IVY BRIDGE (1)
JACOBI3D (1)
JACOBIAN MATRICES (1)
LARGE-SCALE PARALLEL SYSTEMS (1)
LARGE-SCALE TRACE-DRIVEN PARALLEL PERFORMANCE SIMULATION (1)
LOAD BALANCING OF LOWER TRIANGULAR MATRIX (1)
LOGIC PROCESSES (1)
LOGIC PROGRAMMING (1)
LOGICAL PROCESS MAPPING (1)
LU (1)
MANY INTEGRATED CORES (1)
MANY-CORE CPUS (1)
MANY-CORE PROCESSORS (1)
MAPPING (1)
MATHEMATICAL MODEL (1)
MATRIX FACTORIZATION (1)
MEMORY ACCESS PATTERN (1)
MEMORY MANAGEMENT (1)
MICRO-BENCHMARKING (1)
MINIMUM COMMUNICATION-GUIDED MAPPING SCHEME (1)
MULTI-CORE (1)
MULTIPLE STREAMS (1)
MULTIPROCESSING SYSTEMS (1)
OPENACC (1)
OPENMP (1)
OPTIMIZATION TECHNIQUES (1)
PARALLEL APPLICATIONS SIMULATION (1)
more

INFONA - science communication portal

Search results for: Jianbin Fang

Add recipient

Sending message cancelled

Are you sure you want to cancel sending this message?

Send message

Filter options

Publication date

Date range setting

Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.

Keywords

Reporting an error / abuse

Sending the report failed

Accessibility options