The Internet of Things (IoT) is a rapidly growing area, with 25 billion connected devices anticipated by 2020. As more devices join the IoT landscape, the ability to scale from small to large deployments is becoming paramount. In this paper, we investigate scaling an IoT system above the leaf level by using parallel computing within the gateway devices. The initial task identified for gateway parallel computing is aggregating and analyzing data from end devices. This approach provides a scalable architecture for IoT systems. Devices such as the NVIDIA Jetson TX1 and TK1 combine a multicore ARM processor with General-Purpose Graphics Processing Unit (GP-GPU) capability to enhance performance for data-parallel computations. We demonstrate the value of this type of hybrid architecture on an example IoT system test bed, achieving non-trivial speedup. This points clearly to the need for gateway devices to move to highly parallel architectures, rather than simply serving as small servers with multiple network adapters.