2015 IEEE International Conference on Big Data (Big Data)

Items from 1 to 20 out of 23 results

chapter

Scalable preference queries for high-dimensional data using map-reduce

Gheorghi Guzun, Joel E. Tosado, Guadalupe Canahuate

2015 IEEE International Conference on Big Data (Big Data) > 2243 - 2252

Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on ranking functions to determine an overall score for each object across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents the pre-computation of the...

chapter

SciSpark: Applying in-memory distributed computing to weather event detection and tracking

Rahul Palamuttam, Renato Marroquin Mogrovejo, Chris Mattmann, Brian Wilson, more

2015 IEEE International Conference on Big Data (Big Data) > 2020 - 2026

In this paper we present SciSpark, a Big Data framework that extends Apache™ Spark for scaling scientific computations. The paper details the initial architecture and design of SciSpark. We demonstrate how SciSpark achieves parallel ingesting and partitioning of earth science satellite and model datasets. We also illustrate the usability and extensibility of SciSpark by implementing aspects of the...

chapter

High quality clustering of big data and solving empty-clustering problem with an evolutionary hybrid algorithm

Jeyhun Karimov, Murat Ozbayoglu

2015 IEEE International Conference on Big Data (Big Data) > 1473 - 1478

Achieving high quality clustering is one of the most well-known problems in data mining. k-means is by far the most commonly used clustering algorithm. It converges fairly quickly, but achieving a good solution is not guaranteed. The clustering quality is highly dependent on the selection of the initial centroids. Moreover, when the number of clusters increases, it starts to suffer from...

chapter

Octopus: A multi-job scheduler for Graphlab

Srikant Padala, Dinesh Kumar, Arun Raj, Janakiram Dharanipragada

2015 IEEE International Conference on Big Data (Big Data) > 293 - 298

Graphlab, a framework for large graph processing, currently does not support scheduling multiple jobs simultaneously. However, for efficient use of the cluster resources, it may be required to share the cluster among multiple jobs. The challenges in multi-job scheduling in the case of graph processing are different from other frameworks such as Hadoop. In Hadoop, it is possible to schedule...

chapter

Composable and efficient functional big data processing framework

Dongyao Wu, Sherif Sakr, Liming Zhu, Qinghua Lu

2015 IEEE International Conference on Big Data (Big Data) > 279 - 286

Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, the jobs in these frameworks are roughly defined and packaged as executable jars without any functionality being exposed or described. This means that deployed jobs are not natively composable and reusable for subsequent development. Besides,...

chapter

Early experience with optimizing I/O performance using high-performance SSDs for in-memory cluster computing

I. Stephen Choi, Weiqing Yang, Yang-Suk Kee

2015 IEEE International Conference on Big Data (Big Data) > 1073 - 1083

This paper describes our experience with storage optimization that utilizes cost-effective PCIe solid-state drives (SSDs) to improve the overall performance of a Spark framework. A key problem we address is the limited memory system performance. In particular, we adopt high-performance SSDs to alleviate the saturated DRAM bandwidth and its limited capacity. We utilize SSDs to store shuffle data and...

chapter

Large-scale learning with AdaGrad on Spark

Asmelash Teka Hadgu, Aastha Nigam, Ernesto Diaz-Aviles

2015 IEEE International Conference on Big Data (Big Data) > 2828 - 2830

Stochastic Gradient Descent (SGD) is a simple yet very efficient online learning algorithm for optimizing convex (and often non-convex) functions and one of the most popular stochastic optimization methods in machine learning today. One drawback of SGD is that it is sensitive to the learning rate hyper-parameter. The Adaptive Sub-gradient Descent, AdaGrad, dynamically incorporates knowledge of the...

chapter

Scientific computing meets big data technology: An astronomy use case

Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan Sparks, more

2015 IEEE International Conference on Big Data (Big Data) > 918 - 927

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark — a modern big data platform — to parallelize many-task applications...

chapter

Performance evaluation of enabling logistic regression for big data with R

Ruizhu Huang, Weijia Xu

2015 IEEE International Conference on Big Data (Big Data) > 2517 - 2524

R is a free, powerful, open source software package with extensive statistical computing and graphics capabilities. Due to its high-level expressiveness and multitude of domain-specific packages, R has become a popular tool for data analysis in many scientific fields. While there are a number of packages enabling running R in parallel using the Message Passing Interface across multiple...

chapter

Low latency analytics for streaming traffic data with Apache Spark

Altti Ilari Maarala, Mika Rautiainen, Miikka Salmi, Susanna Pirttikangas, more

2015 IEEE International Conference on Big Data (Big Data) > 2855 - 2858

Demand for new efficient methods for processing large-scale heterogeneous data in real time is growing. Currently, one key challenge in Big Data is performing low-latency analysis with real-time data. In vehicle traffic, continuous high-speed data streams generate large data volumes. Harnessing new technologies is required to benefit from all the potential this data holds. This work studies the...

chapter

Scaling NLP algorithms to meet high demand

Connor Stokes, Anoop Kumar, Frederick Choi, Ralph Weischedel

2015 IEEE International Conference on Big Data (Big Data) > 2839

The growth of digital information and the richness of data shared online make it increasingly valuable to be able to process large amounts of data at a very high throughput rate. At the same time, rising interest in natural language processing (NLP) has resulted in the development of a great number of algorithms designed to perform a variety of NLP tasks. There is a need for frameworks that enable...

chapter

ADMM based scalable machine learning on Spark

Sauptik Dhar, Congrui Yi, Naveen Ramakrishnan, Mohak Shah

2015 IEEE International Conference on Big Data (Big Data) > 1174 - 1182

Most machine learning algorithms involve solving a convex optimization problem. Traditional in-memory convex optimization solvers do not scale well with the increase in data. This paper identifies a generic convex problem for most machine learning algorithms and solves it using the Alternating Direction Method of Multipliers (ADMM). Finally, such an ADMM problem transforms to an iterative system of...

chapter

Performance characterization and acceleration of in-memory file systems for Hadoop and Spark applications on HPC clusters

Nusrat Sharmin Islam, Md. Wasi-ur-Rahman, Xiaoyi Lu, Dipti Shankar, more

2015 IEEE International Conference on Big Data (Big Data) > 243 - 252

For data-intensive computing, the low throughput of the existing disk-bound storage systems is a major bottleneck. The recent emergence of in-memory file systems with heterogeneous storage support mitigates this problem to a great extent. Parallel programming frameworks, e.g., Hadoop MapReduce and Spark, are increasingly being run on such high-performance file systems. However, no comprehensive study...

chapter

Big data provenance: Challenges, state of the art and opportunities

Jianwu Wang, Daniel Crawl, Shweta Purawat, Mai Nguyen, more

2015 IEEE International Conference on Big Data (Big Data) > 2509 - 2516

The ability to track provenance is a key feature of scientific workflows to support data lineage and reproducibility. The challenges that are introduced by the volume, variety and velocity of Big Data also pose related challenges for provenance and quality of Big Data, defined as veracity. The increasing size and variety of distributed Big Data provenance information bring new technical challenges and...

chapter

LiteMat: A scalable, cost-efficient inference encoding scheme for large RDF graphs

Olivier Cure, Hubert Naacke, Tendry Randriamalala, Bernd Amann

2015 IEEE International Conference on Big Data (Big Data) > 1823 - 1830

The number of linked data sources and the size of the linked open data graph keep growing every day. As a consequence, semantic RDF services are more and more confronted with various "big data" problems. Query processing in the presence of inferences is one of them. For instance, to complete the answer set of SPARQL queries, RDF database systems evaluate semantic RDFS relationships (subPropertyOf,...

chapter

Efficient large scale distributed matrix computation with spark

Rong Gu, Yun Tang, Zhaokang Wang, Shuai Wang, more

2015 IEEE International Conference on Big Data (Big Data) > 2327 - 2336

Matrix computation is the core of many massive data-intensive analytical applications such as mining social networks, recommendation systems and natural language processing. Due to the importance of matrix computation, it has been widely studied for many years. In the Big Data era, as the scale of the matrix grows, traditional single-node matrix computation systems can hardly cope with such large data...

chapter

Is Apache Spark scalable to seismic data analytics and computations?

Yuzhong Yan, Lei Huang, Liqi Yi

2015 IEEE International Conference on Big Data (Big Data) > 2036 - 2045

High Performance Computing (HPC) has been a dominant technology for seismic data processing in the petroleum industry. However, with increasing data sizes and varieties, traditional HPC focusing on computation meets new challenges. Researchers are looking for new computing platforms that balance performance and productivity and also offer big data analytics capability...

chapter

Online anomaly detection over Big Data streams

Laura Rettig, Mourad Khayati, Philippe Cudre-Mauroux, Michal Piorkowski

2015 IEEE International Conference on Big Data (Big Data) > 1113 - 1122

Data quality is a challenging problem in many real-world application domains. While a lot of attention has been given to detecting anomalies for data at rest, detecting anomalies for streaming applications still largely remains an open problem. For applications involving several data streams, the challenge of detecting anomalies has become harder over time, as data can dynamically evolve in subtle ways...

chapter

Spark deployment and performance evaluation on the MareNostrum supercomputer

Ruben Tous, Anastasios Gounaris, Carlos Tripiana, Jordi Torres, more

2015 IEEE International Conference on Big Data (Big Data) > 299 - 306

In this paper we present a framework to enable data-intensive Spark workloads on MareNostrum, a petascale supercomputer designed mainly for compute-intensive applications. As far as we know, this is the first attempt to investigate optimized deployment configurations of Spark on a petascale HPC setup. We detail the design of the framework and present some benchmark data to provide insights into the...

chapter

Evaluating cloud frameworks on genomic applications

Michele Bertoni, Stefano Ceri, Abdurrahman Kaitoua, Pietro Pinoli

2015 IEEE International Conference on Big Data (Big Data) > 193 - 202

We are developing a new, holistic data management system for genomics, which uses cloud-based computing for querying thousands of heterogeneous genomic datasets. In our project, it is essential to leverage a modern cloud computing framework, so as to encode our query expressions into high-level operations provided by the framework. After releasing our first implementation using Pig and Hadoop...
