Graphics processing unit (GPU) architectures are becoming increasingly programmable, offering the potential for dramatic speedups over contemporary general-purpose central processing units (CPUs) for a variety of general-purpose applications. However, GPU architectures rely on multithreading, in which threads must share data and resources and are therefore exposed to memory concurrency issues. Data races and deadlocks are the most challenging concurrency and consistency issues because thread execution is non-deterministic. In this paper, we evaluate the performance of the CUDA memory fence as a solution to the data race problem, using an implementation of the Berlekamp-Massey algorithm as a case study. The results show that the CUDA memory fence improves the algorithm's speedup for small input sequences.