Access to data plays a major role in designing and performing efficient data computation and analyses in a distributed environment. Existing approaches access data via a variety of methods and offer various benefits and drawbacks based on the use case. Our original use case was the computational analysis of environmental sequence data, or metagenomics. Unlike other workflows that often reduce the...
The efficiency and reliability of big data computing applications frequently depend on the ease with which they can manage and move large distributed data. For example, in X-ray science, both raw data and various derived data must be moved between experiment halls and archives, supercomputers, and user workstations for reconstruction, analysis, visualization, storage, and other purposes. Throughout,...
Tremendous developments in Information Technology (IT) have enabled us to store and process huge amounts of data at unprecedented rates. This phenomenon largely impacts business processes. The field of process discovery, originating from the area of process mining, is concerned with automatically discovering process models from event data related to the execution of business processes. In this paper,...
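To make concrete what discovering a model from event data involves, here is a minimal Python sketch; the event log and activity names are invented for illustration, and the directly-follows counts it computes are only the raw material that discovery algorithms (such as the Alpha miner) turn into an actual process model.

    from collections import Counter

    # Toy event log: each trace is the ordered list of activities recorded
    # for one process instance (e.g., one purchase order).
    event_log = [
        ["register", "check", "approve", "archive"],
        ["register", "check", "reject", "archive"],
        ["register", "approve", "archive"],
    ]

    # Directly-follows relation: how often activity a is immediately
    # followed by activity b across all traces. Discovery algorithms build
    # process models (e.g., Petri nets) on top of relations like this one.
    directly_follows = Counter(
        (a, b)
        for trace in event_log
        for a, b in zip(trace, trace[1:])
    )

    for (a, b), count in directly_follows.most_common():
        print(f"{a} -> {b}: {count}")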
Multi-tenant Software-as-a-Service (SaaS) applications are increasingly built on combinations of cloud storage technologies and providers in a so-called multi-cloud setup. One advantage is that such a setup helps satisfy the different -- sometimes even contrasting -- storage requirements of different customer organizations (tenants). In such a multi-cloud environment, the application data is distributed...
Data access is key to science driven by distributed high-throughput computing (DHTC), an essential technology for many major research projects such as High Energy Physics (HEP) experiments. However, achieving efficient data access becomes quite difficult when many independent storage sites are involved because users are burdened with learning the intricacies of accessing each system and keeping careful...
The increasing size of datasets is challenging for machine learning, and Big Data frameworks, such as Apache Spark, have shown promise for facilitating model building on distributed resources. Conformal prediction is a mathematical framework that makes it possible to assign valid confidence levels to object-specific predictions. This contrasts with current best practices, where the overall confidence level for predictions...
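As a rough illustration of what object-specific confidence levels mean, the following is a minimal split (inductive) conformal prediction sketch for regression; the data, model, and scikit-learn usage are assumptions made for the example and are not the Spark-based implementation the abstract refers to.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Toy regression data (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.random((500, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(500)

    # Split off a calibration set, as in split (inductive) conformal prediction.
    X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_train, y_train)

    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.abs(y_cal - model.predict(X_cal))

    # For significance level epsilon, take the finite-sample-corrected quantile
    # of the scores; the resulting per-object interval covers the true value
    # with probability >= 1 - epsilon (here 90%).
    epsilon = 0.1
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - epsilon)))
    q = np.sort(scores)[min(k, n) - 1]

    x_new = rng.random((1, 3))
    y_hat = model.predict(x_new)[0]
    print(f"90% prediction interval: [{y_hat - q:.3f}, {y_hat + q:.3f}]")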
Nowadays, large enterprises maintain huge amounts of data in multiple backend systems, including traditional database systems and the recently popular big data systems. Telecom providers, for example, keep key business data (e.g., billing information) in database systems, whereas the huge volume of log data resides on HDFS with Hive. How to provide insightful analytics on such data becomes a...
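One common way to query such a mix of backends is to federate them through a single engine. The PySpark sketch below illustrates the idea under assumed table names, columns, and a hypothetical JDBC URL; none of these details come from the paper.

    from pyspark.sql import SparkSession

    # Hypothetical setup: billing data lives in a relational database (via JDBC),
    # call-log data lives in a Hive table on HDFS. Names and URLs are invented.
    spark = (
        SparkSession.builder
        .appName("cross-system-analytics")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read the key business data from the RDBMS.
    billing = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/billing")
        .option("dbtable", "invoices")
        .option("user", "analyst")
        .option("password", "secret")
        .load()
    )

    # Read the large log data from Hive on HDFS.
    call_logs = spark.table("logs.call_records")

    # Join the two sources and aggregate, e.g., total call minutes per invoice.
    result = (
        call_logs.join(billing, on="customer_id")
        .groupBy("invoice_id")
        .agg({"duration_minutes": "sum"})
    )
    result.show()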
In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research we developed the Dresden Web Tables Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC) which contains 3.6 billion web pages and is 266TB in size. As the vast majority of HTML tables are used for layout purposes and only...
The Web of Data is an increasingly rich source of information, which makes it useful for Big Data analysis. However, there is no guarantee that this Web of Data will provide the consumer with truthful and valuable information. Most research has focused on Big Data's Volume, Velocity, and Variety dimensions. Unfortunately, Veracity and Value, often regarded as the fourth and fifth dimensions, have...
The advent of Big Data has brought many challenges and opportunities in distributed systems, which have only been amplified by the rate at which data grows. There is a need to rethink the software stack for supporting data-intensive computing and big data analytics. Over the past decade, data analytics applications have turned to finer-grained tasks that are shorter in duration and far greater in number...
Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With new emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further exacerbated. While traditionally the process of filling in missing...
As Twitter usage increases worldwide, it becomes an important part of the big data ecosystem. People across the globe tweeting and retweeting large numbers of tweets instantaneously results in exponential growth of information diffusion. This in turn can cause information bubbles. As data from Twitter and other microblogs are used in predictive analytics in many areas such as stock price...