As hardware and software technologies have improved, our definition of a “manageable amount of data” has expanded dramatically in scope. The term “big data” can be applied to any of several different projects and technologies sharing the ultimate goal of supporting analysis on these large, heterogeneous, and evolving data sets. The term “data science” refers to the statistical, technical, and...
Information related to geographic areas is being produced continuously. However, there is currently no established technique for handling large-scale spatial data. For this reason, we developed a spatial big data platform, ORANGE, based on Apache Hadoop. ORANGE loads vector and raster data into HDFS, and manages metadata and builds index data using Apache Hive. These improvements made...
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets...
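One common way to make such a description concise and unambiguous is a checksummed manifest. The sketch below is illustrative only, not the approach of the paper (which is truncated above); the file layout and field names are assumptions.

```python
# Hedged sketch, not the paper's approach: describe a multi-element
# dataset with a manifest of per-file checksums so its contents can be
# verified after exchange. Paths and field names are illustrative.
import hashlib
import json
from pathlib import Path

def build_manifest(root: str) -> str:
    """Return a JSON manifest listing every file under `root`."""
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            entries.append({
                "path": str(path.relative_to(root)),
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            })
    return json.dumps({"members": entries}, indent=2)

# Example (hypothetical layout): print(build_manifest("dataset/"))
```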
Topic Modelling (TM) has gained momentum within the humanities over the last few years as a way to analyse the topics represented in large volumes of full text. This paper reports an experiment applying TM to a large subset of the digitized archival holdings of the European Commission (EC). Currently, millions of scanned and OCRed files are available and hold the potential to significantly change the...
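The abstract does not specify the authors' TM pipeline, so the following is a minimal, hedged sketch of topic modelling with scikit-learn on a toy corpus; the documents and topic count are stand-ins for the OCRed archival files.

```python
# Minimal topic-modelling sketch (not the authors' pipeline): fit LDA on
# a small corpus and print the heaviest words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [  # hypothetical stand-ins for OCRed archival files
    "council regulation on agricultural market support",
    "directive on environmental protection and emissions",
    "report on agricultural subsidies and rural development",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)          # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]              # five heaviest terms
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```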
The Bentley Historical Library, funded by a generous grant from the Andrew W. Mellon Foundation, has developed a new Appraisal and Arrangement tab in the Archivematica digital preservation system as part of its “ArchivesSpace-Archivematica-DSpace Workflow Integration” project. This new functionality permits users to conduct large-scale appraisal of digital archives as part of a largely automated workflow...
The K-Means algorithm is one of the most popular methods for flat clustering, but its similarity calculations are time-consuming on big data, which lowers performance in practice. Previous studies proposed improvements for finding better initial centroids to facilitate effective assignment of the data points to suitable clusters with reduced time complexity. However, in vector space representation,...
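As a point of reference (not the method proposed in the paper), two standard cost-cutting measures are k-means++ seeding and mini-batch updates, both available off the shelf in scikit-learn:

```python
# Illustrative baseline, not the paper's method: k-means++ seeding picks
# spread-out initial centroids (fewer iterations), and mini-batch updates
# compute similarities per batch instead of per full pass.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))     # hypothetical large dataset

km = MiniBatchKMeans(
    n_clusters=8,
    init="k-means++",
    batch_size=1024,
    n_init=3,
    random_state=0,
)
labels = km.fit_predict(X)
print(km.inertia_, labels[:10])
```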
We have witnessed a dramatic increase in national cyberinfrastructure resources to support data-driven research. Orchestrating these resources to enable the creation of collaborative infrastructure capable of supporting data intensive activities is challenging. In this work we present RADII, a novel architecture and system that enables the provisioning and configuration of collaborative infrastructure...
Large-scale scientific applications are often expressed as workflows that help define data dependencies between their different components. Several such workflows have huge storage and computation requirements, and so they need to be processed in multiple (cloud-federated) datacenters. It has been shown that efficient metadata handling plays a key role in the performance of computing systems. However,...
Provenance data is a type of metadata that computer scientists argue can support trustworthy and reliable replication of scientific results. From its origins in scientific workflow systems and database theory, and with concurrent interest from the ecological informatics community, a standard data model (PROV) and extensions for DataONE (ProvONE) have led to initial implementations in several tools...
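For readers unfamiliar with the PROV data model mentioned above, here is a minimal sketch using the Python prov package; the ex: identifiers are hypothetical, and ProvONE's workflow-specific extensions are not shown.

```python
# Minimal PROV sketch with the `prov` package (pip install prov).
# All ex: names are invented for illustration.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

data_in = doc.entity("ex:raw-observations.csv")
script = doc.agent("ex:analysis-script-v1")
run = doc.activity("ex:analysis-run-42")
result = doc.entity("ex:derived-results.csv")

doc.used(run, data_in)                  # the run read the raw data
doc.wasAssociatedWith(run, script)      # the script carried out the run
doc.wasGeneratedBy(result, run)         # the result came from the run
doc.wasDerivedFrom(result, data_in)     # lineage link for replication

print(doc.get_provn())                  # human-readable PROV-N serialization
```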
Constellation's overarching goal is the federation of information from resources within an extreme-scale scientific collaboration to enable the scalable discovery of data and new knowledge pathways. The resource fabric comprises petascale supercomputers and storage systems, users, jobs, datasets and lifecycle artifacts. For an extreme-scale supercomputing center, normal operations can generate...
Computational workflows consist of a series of steps in which data is generated, manipulated, analysed and transformed. Researchers use tools and techniques to capture the provenance associated with the data to aid reproducibility. The metadata collected not only helps in reproducing the computation but also aids in comparing the original and reproduced computations. In this paper, we present an approach,...
Typing is a well-known concept for preparing data-processing services, for instance by matching the correct service to a MIME type. But many more metadata elements, such as availability and access conditions, provenance, processing preconditions, or integrity parameters, are useful to know in advance for data-preprocessing services. In order to expose such metadata independently from...
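A hedged illustration of the idea (the registry, fields, and services below are invented, not from the paper): service selection keyed not only on MIME type but on additional metadata known in advance.

```python
# Illustrative sketch: choose a processing service using metadata beyond
# the MIME type. ServiceProfile fields are hypothetical examples of the
# "availability and access conditions" the abstract mentions.
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    mime_type: str
    needs_authentication: bool   # access condition
    min_availability: float      # required fraction of uptime

REGISTRY = {
    "thumbnailer": ServiceProfile("image/tiff", False, 0.99),
    "ocr-engine":  ServiceProfile("image/tiff", True, 0.95),
}

def select_services(mime_type: str, authenticated: bool, availability: float):
    """Return names of services whose metadata preconditions are met."""
    return [
        name for name, p in REGISTRY.items()
        if p.mime_type == mime_type
        and (authenticated or not p.needs_authentication)
        and availability >= p.min_availability
    ]

print(select_services("image/tiff", authenticated=False, availability=0.99))
```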
In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data and metadata. Metadata is critical for scientific research, as it enables the discovery, analysis, reuse, and sharing of scientific data. Moreover, metadata produced by scientific experiments is heterogeneous and subject to frequent changes,...
As data grows exponentially within data centers, cluster deduplication storage systems face challenges in providing high throughput, a high deduplication ratio, and load balance. As the key technique, the data routing algorithm strongly affects the deduplication ratio, throughput, and load balance of cluster deduplication storage systems. In this paper, we propose SS-Dedup, a novel stateful data routing...
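To make the notion of stateful routing concrete, here is a generic sketch (not SS-Dedup's actual algorithm, which is truncated above): each superchunk goes to the node already storing the most of its chunk fingerprints, with load as a tie-breaker.

```python
# Generic stateful-routing sketch, not SS-Dedup itself. Nodes, chunk
# sizes, and the tie-break rule are invented for illustration.
import hashlib

NODES = ["node-a", "node-b", "node-c"]
stored = {n: set() for n in NODES}   # chunk fingerprints each node holds
load = {n: 0 for n in NODES}         # unique chunks stored per node

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

def route(superchunk: list[bytes]) -> str:
    """Pick a target node for a superchunk (a batch of chunks)."""
    fps = {fingerprint(c) for c in superchunk}
    # Stateful step: prefer the node with the largest fingerprint overlap
    # (better dedup ratio); break ties toward the least-loaded node.
    best = max(NODES, key=lambda n: (len(fps & stored[n]), -load[n]))
    new = fps - stored[best]         # only these chunks are stored anew
    stored[best] |= new
    load[best] += len(new)
    return best

print(route([b"a", b"b", b"c"]))     # empty state: all nodes tie, first wins
print(route([b"a", b"b", b"d"]))     # overlap pulls it to the same node
```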
To track security and compliance requirements and perform problem diagnosis, administrators of cloud computing systems need to monitor significant system changes occurring on the set of cloud instances under their supervision. Given the large number of instances (virtual machines, containers), possibly operating under multiple configurations, such changes are difficult to track. Standard solutions...
In today's technology industry, where machine learning has become essential, the effectiveness of algorithms ultimately depends on a robust data pipeline, and fast model prototyping and tuning require easy feature discovery and consumption. Careful management of ETL processes and the datasets they produce is key both to model development in the research stage and to model execution in the production environment...
There is a need for comprehensive solutions to address the challenges of spatio-temporal data quality assessment. Emphasis is often placed on the quality assessment of individual observations from sensors but not on the sensors themselves nor upon site metadata such as location and timestamps. The focus of this paper is on the development and evaluation of a representative and comprehensive, interpolation-based...
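As a hedged illustration of what an interpolation-based check can look like (the paper's actual method is truncated above), the sketch below estimates each sensor from its neighbours by inverse-distance weighting and flags large residuals; the coordinates, values, and threshold are invented.

```python
# Illustrative leave-one-out quality check, not the paper's method:
# estimate each station by inverse-distance weighting (IDW) from the
# others and flag observations that deviate strongly from the estimate.
import numpy as np

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([10.1, 10.3, 9.9, 25.0])     # last sensor looks anomalous

def idw_estimate(i: int, power: float = 2.0) -> float:
    """Estimate station i from all other stations (leave-one-out IDW)."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    mask = np.arange(len(values)) != i
    w = 1.0 / d[mask] ** power
    return float(np.sum(w * values[mask]) / np.sum(w))

for i, v in enumerate(values):
    est = idw_estimate(i)
    flag = "SUSPECT" if abs(v - est) > 3.0 else "ok"   # illustrative threshold
    print(f"sensor {i}: observed={v:.1f} estimated={est:.1f} {flag}")
```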
People develop personal information collections consisting of distributed web resources, both as reminders that resources exist and as a means of rapid access to those resources. Managing such collections is necessary to preserve their value. Unexpected changes within distributed collections can cause them to become outdated, requiring revisions to or removal of no-longer-appropriate resources and replacements...
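A minimal sketch of one way to detect such unexpected changes, assuming resources are reachable over HTTP and that byte-level differences are what matters; the URL and recorded digest are placeholders.

```python
# Minimal change-detection sketch: hash each resource's current content
# and compare it with the digest recorded when it was added.
import hashlib
import urllib.request

collection = {  # placeholder URL and digest
    "https://example.org/page": "previously-recorded-sha256-digest",
}

def current_digest(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

for url, recorded in collection.items():
    try:
        changed = current_digest(url) != recorded
    except OSError:
        changed = True                   # unreachable counts as a change
    if changed:
        print(f"review needed: {url}")   # revise, replace, or remove
```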
Although statistical and machine learning methods require the input data to be in a tabular format, in real-world applications data are often stored across several tables in a relational database. How to build a single mining table from a relational database is a critical pre-processing step of any classification method, because including the right attributes may dramatically boost the accuracy of...
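A small pandas sketch of the pre-processing step this abstract describes, flattening a one-to-many relation into a single mining table; the tables and columns are invented for illustration.

```python
# Illustrative flattening of a relational schema into one mining table:
# aggregate the one-to-many side, then join onto the target table so each
# row describes exactly one customer (the unit of classification).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [20.0, 35.0, 12.5],
})

features = (orders.groupby("customer_id")["amount"]
            .agg(order_count="count", amount_sum="sum")
            .reset_index())
mining_table = customers.merge(features, on="customer_id", how="left")
print(mining_table)   # one row per customer, ready for a classifier
```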
Assigning global unique persistent identifiers (GUPIs) to datasets has the goal of improving their accessibility and simplifying how they are referenced and reused. However, as repositories receive more and more complex data, attesting to the identity of datasets attached to persistent identifiers over time is becoming more challenging. This is due to the nature of scientific research data, which is generated...