Today, centralized data centers, or clouds, have become the de facto platform for data-intensive computing in the commercial and, increasingly, scientific domains. This is because clouds such as Amazon AWS and Microsoft Azure offer large amounts of pay-as-you-go, co-located computation and storage well suited to typical processing tasks such as batch analytics. However, many Big Data applications rely on data that is geographically distributed and not co-located with the centralized computational resources provided by clouds. Examples of such applications include analysis of user-generated data such as blogs, video feeds from geographically separated cameras, monitoring and analysis of server and content distribution network (CDN) logs, and scientific data collected from distributed instruments and sensors. Such applications present a number of challenges for efficient data analytics on today's cloud platforms. First, in many applications the data is both large and widely distributed, so data upload may constitute a non-trivial portion of the execution time. Second, centralized cloud resources present a single point of failure, and network partitions between the data sources and the cloud can lead to service disruptions. Third, the cost to transport, store, and process the data may exceed the budget of a small-scale application designer or end-user.