Next-generation large single-chip multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of on-chip cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). The goal is to lower memory access latency and energy by replicating only cache lines with high reuse in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. The approach relies on a low-overhead yet highly accurate in-hardware runtime cache-line-level classifier that permits replication only for cache lines with high reuse. Furthermore, the classifier captures the LLC pressure at the existing replica locations and adapts its replication decisions accordingly. On a set of parallel benchmarks, the proposed protocol reduces overall energy by 14.7%, 10.7%, 10.5%, and 16.7% and completion time by 2.5%, 6.5%, 4.5%, and 9.5%, respectively, compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA, and Static-NUCA LLC management schemes. An efficient classifier implementation is evaluated with an overhead of 5.44 KB, which translates to only 1.58% on top of the Static-NUCA baseline's cache-related per-core storage.
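
To make the replication decision concrete, the following is a minimal software sketch of the kind of reuse-based, pressure-adaptive classification described above. The class name, counter width, threshold values, and pressure metric are illustrative assumptions for exposition only; the actual mechanism is an in-hardware classifier with bounded per-core storage, not the unbounded table used here.

```cpp
// Illustrative sketch only: names, thresholds, and the pressure metric are
// assumptions, not the paper's exact hardware design.
#include <cstdint>
#include <unordered_map>

struct LineState {
    uint8_t reuse_counter = 0;  // saturating count of reuses by the requesting core
};

class ReuseClassifier {
public:
    explicit ReuseClassifier(uint8_t base_threshold = 3)
        : base_threshold_(base_threshold) {}

    // Record a reuse of a remotely homed cache line by the requesting core.
    void record_reuse(uint64_t line_addr) {
        uint8_t& c = lines_[line_addr].reuse_counter;
        if (c < UINT8_MAX) ++c;
    }

    // Decide whether to replicate the line into the requester's local LLC slice.
    // llc_pressure is a hypothetical normalized occupancy metric (0.0 = idle,
    // 1.0 = saturated) for the slice that would host the replica; higher
    // pressure raises the bar for creating a replica.
    bool should_replicate(uint64_t line_addr, double llc_pressure) const {
        auto it = lines_.find(line_addr);
        if (it == lines_.end()) return false;
        uint8_t threshold = base_threshold_
                          + static_cast<uint8_t>(llc_pressure * 4.0);
        return it->second.reuse_counter >= threshold;
    }

private:
    uint8_t base_threshold_;
    std::unordered_map<uint64_t, LineState> lines_;  // stands in for bounded classifier tables
};
```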