A histogram is a widely used graphical representation of the distribution of numerical data. Although sequential histogram computation is simple, it becomes impractical for large data volumes. Motivated by recent advances in high performance computing (HPC) and the rapid growth of general-purpose graphics processing units (GPGPUs), this paper explores parallel implementations of the histogram algorithm, targeting the NVIDIA architecture in particular. It presents experimental analyses of several parallel optimizations that exploit multi-core CPU computing (OpenMP) and many-core GPU computing (Compute Unified Device Architecture, CUDA). The results show that the GPU-optimized streaming histogram achieves a 7x speedup over the multi-threaded OpenMP implementation, with data transfer time included in the measured processing time. Multiple rounds of fine-tuning for each architectural platform are important for increasing parallelism.
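To make the multi-threaded CPU baseline concrete, the following is a minimal sketch of one common way to parallelize a histogram with OpenMP. It is not taken from the paper's implementation. It assumes 8-bit input values and 256 bins, and the function name `histogram256` is hypothetical. Each thread fills a private histogram and the private copies are then merged, which avoids contention on shared bin counters.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the paper): counts occurrences of each
// 8-bit value in `data`. Assumes exactly 256 bins, one per possible value.
std::vector<std::uint64_t> histogram256(const std::vector<std::uint8_t>& data) {
    constexpr std::size_t kBins = 256;
    std::vector<std::uint64_t> global(kBins, 0);
    const long long n = static_cast<long long>(data.size());

    #pragma omp parallel
    {
        // Private per-thread histogram: no synchronization while counting.
        std::vector<std::uint64_t> local(kBins, 0);

        // Loop iterations are divided among the threads of the team.
        #pragma omp for nowait
        for (long long i = 0; i < n; ++i) {
            ++local[data[i]];
        }

        // Merge step: one thread at a time adds its counts to the result.
        #pragma omp critical
        for (std::size_t b = 0; b < kBins; ++b) {
            global[b] += local[b];
        }
    }
    return global;
}
```

Compiled with OpenMP enabled (for example `-fopenmp` on GCC or Clang), the counting loop runs across threads. Without that flag the pragmas are ignored and the same code runs sequentially, which gives a convenient reference for checking correctness. A GPU version typically follows the same privatization idea, with per-block histograms updated through atomic operations.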