This paper presents challenges encountered while parallelizing an existing sequential algorithm. A breadth-first search implementation in CUDA C++ of quadratic time complexity is used. Even though BFS might seem like an easily parallelizable problem due to many independent iterations over graph vertices, there are other important aspects which need to be considered. Properties like granulation, communication and load balancing are thoroughly examined to find their exact impact in BFS. The implementation is profiled to find bottlenecks and problems which prevent its further optimization.