As hardware and software technologies have improved, our definition of a “manageable amount of data” has expanded dramatically in scope. The term “big data” can be applied to any of several different projects and technologies sharing the ultimate goal of supporting analysis on these large, heterogeneous, and evolving data sets. The term “data science” refers to the statistical, technical, and...
Information related to geographic areas is being produced continuously. However, there is currently no established technique for handling large-scale spatial data. For this reason, we developed a spatial big data platform, ORANGE, based on Apache Hadoop. ORANGE loads vector and raster data into HDFS, and manages metadata and builds index data using Apache Hive. These improvements made...
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets...
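One common way to make such a description concise and unambiguous is a checksummed manifest. The sketch below is illustrative only, not the approach of the paper (which is truncated above); the file layout and field names are assumptions.

```python
# Hedged sketch, not the paper's approach: describe a multi-element
# dataset with a manifest of per-file checksums so its contents can be
# verified after exchange. Paths and field names are illustrative.
import hashlib
import json
from pathlib import Path

def build_manifest(root: str) -> str:
    """Return a JSON manifest listing every file under `root`."""
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            entries.append({
                "path": str(path.relative_to(root)),
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            })
    return json.dumps({"members": entries}, indent=2)

# Example (hypothetical layout): print(build_manifest("dataset/"))
```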
Topic Modelling (TM) has gained momentum within the humanities over the last few years as a way to analyse the topics represented in large volumes of full text. This paper reports an experiment applying TM to a large subset of the digitized archival holdings of the European Commission (EC). Currently, millions of scanned and OCRed files are available and hold the potential to significantly change the...
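The abstract does not specify the authors' TM pipeline, so the following is a minimal, hedged sketch of topic modelling with scikit-learn on a toy corpus; the documents and topic count are stand-ins for the OCRed archival files.

```python
# Minimal topic-modelling sketch (not the authors' pipeline): fit LDA on
# a small corpus and print the heaviest words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [  # hypothetical stand-ins for OCRed archival files
    "council regulation on agricultural market support",
    "directive on environmental protection and emissions",
    "report on agricultural subsidies and rural development",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)          # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]              # five heaviest terms
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```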
The Bentley Historical Library, funded by a generous grant from the Andrew W. Mellon Foundation, has developed a new Appraisal and Arrangement tab in the Archivematica digital preservation system as part of its “ArchivesSpace-Archivematica-DSpace Workflow Integration” project. This new functionality permits users to conduct large-scale appraisal of digital archives as part of a largely automated workflow...
The K-Means algorithm is one of the most popular methods for flat clustering, but its similarity calculations are time-consuming on big data, which lowers performance in practice. Previous studies proposed improvements for finding better initial centroids to facilitate effective assignment of the data points to suitable clusters with reduced time complexity. However, in vector space representation,...
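As a point of reference (not the method proposed in the paper), two standard cost-cutting measures are k-means++ seeding and mini-batch updates, both available off the shelf in scikit-learn:

```python
# Illustrative baseline, not the paper's method: k-means++ seeding picks
# spread-out initial centroids (fewer iterations), and mini-batch updates
# compute similarities per batch instead of per full pass.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))     # hypothetical large dataset

km = MiniBatchKMeans(
    n_clusters=8,
    init="k-means++",
    batch_size=1024,
    n_init=3,
    random_state=0,
)
labels = km.fit_predict(X)
print(km.inertia_, labels[:10])
```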
We have witnessed a dramatic increase in national cyberinfrastructure resources to support data-driven research. Orchestrating these resources to enable the creation of collaborative infrastructure capable of supporting data intensive activities is challenging. In this work we present RADII, a novel architecture and system that enables the provisioning and configuration of collaborative infrastructure...
Large-scale scientific applications are often expressed as workflows that help define data dependencies between their different components. Several such workflows have huge storage and computation requirements, and so they need to be processed in multiple (cloud-federated) datacenters. It has been shown that efficient metadata handling plays a key role in the performance of computing systems. However,...
Provenance data is a type of metadata that computer scientists argue can support trustworthy and reliable replication of scientific results. From its origins in scientific workflow systems and database theory, and with concurrent interest from the ecological informatics community, a standard data model (PROV) and extensions for DataONE (ProvONE) have led to initial implementations in several tools...
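For readers unfamiliar with the PROV data model mentioned above, here is a minimal sketch using the Python prov package; the ex: identifiers are hypothetical, and ProvONE's workflow-specific extensions are not shown.

```python
# Minimal PROV sketch with the `prov` package (pip install prov).
# All ex: names are invented for illustration.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

data_in = doc.entity("ex:raw-observations.csv")
script = doc.agent("ex:analysis-script-v1")
run = doc.activity("ex:analysis-run-42")
result = doc.entity("ex:derived-results.csv")

doc.used(run, data_in)                  # the run read the raw data
doc.wasAssociatedWith(run, script)      # the script carried out the run
doc.wasGeneratedBy(result, run)         # the result came from the run
doc.wasDerivedFrom(result, data_in)     # lineage link for replication

print(doc.get_provn())                  # human-readable PROV-N serialization
```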
Constellation's overarching goal is the federation of information from resources within an extreme-scale scientific collaboration to enable the scalable discovery of data and new knowledge pathways. The resource fabric comprises petascale supercomputers and storage systems, users, jobs, datasets and lifecycle artifacts. For an extreme-scale supercomputing center, normal operations can generate...
Computational workflows consist of a series of steps in which data is generated, manipulated, analysed and transformed. Researchers use tools and techniques to capture the provenance associated with the data to aid reproducibility. The metadata collected not only helps in reproducing the computation but also aids in comparing the original and reproduced computations. In this paper, we present an approach,...
Typing is a well-known concept for preparing data-processing services, for instance by matching the correct service to a MIME type. But many more metadata elements, such as availability and access conditions, provenance, processing preconditions, or integrity parameters, are useful to know in advance for data-preprocessing services. In order to expose such metadata independently from...
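A hedged illustration of the idea (the registry, fields, and services below are invented, not from the paper): service selection keyed not only on MIME type but on additional metadata known in advance.

```python
# Illustrative sketch: choose a processing service using metadata beyond
# the MIME type. ServiceProfile fields are hypothetical examples of the
# "availability and access conditions" the abstract mentions.
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    mime_type: str
    needs_authentication: bool   # access condition
    min_availability: float      # required fraction of uptime

REGISTRY = {
    "thumbnailer": ServiceProfile("image/tiff", False, 0.99),
    "ocr-engine":  ServiceProfile("image/tiff", True, 0.95),
}

def select_services(mime_type: str, authenticated: bool, availability: float):
    """Return names of services whose metadata preconditions are met."""
    return [
        name for name, p in REGISTRY.items()
        if p.mime_type == mime_type
        and (authenticated or not p.needs_authentication)
        and availability >= p.min_availability
    ]

print(select_services("image/tiff", authenticated=False, availability=0.99))
```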
In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data and metadata. Metadata is critical for scientific research, as it enables the discovery, analysis, reuse, and sharing of scientific data. Moreover, metadata produced by scientific experiments is heterogeneous and subject to frequent changes,...
As data grows exponentially within data centers, cluster deduplication storage systems face challenges in providing high throughput, a high deduplication ratio, and load balance. As the key technique, the data routing algorithm strongly affects the deduplication ratio, throughput, and load balance of cluster deduplication storage systems. In this paper, we propose SS-Dedup, a novel stateful data routing...
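To make the notion of stateful routing concrete, here is a generic sketch (not SS-Dedup's actual algorithm, which is truncated above): each superchunk goes to the node already storing the most of its chunk fingerprints, with load as a tie-breaker.

```python
# Generic stateful-routing sketch, not SS-Dedup itself. Nodes, chunk
# sizes, and the tie-break rule are invented for illustration.
import hashlib

NODES = ["node-a", "node-b", "node-c"]
stored = {n: set() for n in NODES}   # chunk fingerprints each node holds
load = {n: 0 for n in NODES}         # unique chunks stored per node

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

def route(superchunk: list[bytes]) -> str:
    """Pick a target node for a superchunk (a batch of chunks)."""
    fps = {fingerprint(c) for c in superchunk}
    # Stateful step: prefer the node with the largest fingerprint overlap
    # (better dedup ratio); break ties toward the least-loaded node.
    best = max(NODES, key=lambda n: (len(fps & stored[n]), -load[n]))
    new = fps - stored[best]         # only these chunks are stored anew
    stored[best] |= new
    load[best] += len(new)
    return best

print(route([b"a", b"b", b"c"]))     # empty state: all nodes tie, first wins
print(route([b"a", b"b", b"d"]))     # overlap pulls it to the same node
```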
To track security and compliance requirements and perform problem diagnosis, administrators of cloud computing systems need to monitor significant system changes occurring on the set of cloud instances under their supervision. Given the large number of instances (virtual machines, containers), possibly operating under multiple configurations, such changes are difficult to track. Standard solutions...
In today's technology industry, where machine learning has become essential, the effectiveness of algorithms ultimately depends on a robust data pipeline, and fast model prototyping and tuning require easy feature discovery and consumption. Careful management of ETL processes and the datasets they produce is key both to model development in the research stage and to model execution in the production environment...
There is a need for comprehensive solutions to address the challenges of spatio-temporal data quality assessment. Emphasis is often placed on the quality assessment of individual observations from sensors but not on the sensors themselves nor upon site metadata such as location and timestamps. The focus of this paper is on the development and evaluation of a representative and comprehensive, interpolation-based...
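As a hedged illustration of what an interpolation-based check can look like (the paper's actual method is truncated above), the sketch below estimates each sensor from its neighbours by inverse-distance weighting and flags large residuals; the coordinates, values, and threshold are invented.

```python
# Illustrative leave-one-out quality check, not the paper's method:
# estimate each station by inverse-distance weighting (IDW) from the
# others and flag observations that deviate strongly from the estimate.
import numpy as np

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([10.1, 10.3, 9.9, 25.0])     # last sensor looks anomalous

def idw_estimate(i: int, power: float = 2.0) -> float:
    """Estimate station i from all other stations (leave-one-out IDW)."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    mask = np.arange(len(values)) != i
    w = 1.0 / d[mask] ** power
    return float(np.sum(w * values[mask]) / np.sum(w))

for i, v in enumerate(values):
    est = idw_estimate(i)
    flag = "SUSPECT" if abs(v - est) > 3.0 else "ok"   # illustrative threshold
    print(f"sensor {i}: observed={v:.1f} estimated={est:.1f} {flag}")
```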
People develop personal information collections consisting of distributed web resources, both as reminders that resources exist and as a means of rapid access to those resources. Managing such collections is necessary to preserve their value. Unexpected changes within distributed collections can cause them to become outdated, requiring revisions to or removal of no-longer-appropriate resources and replacements...
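A minimal sketch of one way to detect such unexpected changes, assuming resources are reachable over HTTP and that byte-level differences are what matters; the URL and recorded digest are placeholders.

```python
# Minimal change-detection sketch: hash each resource's current content
# and compare it with the digest recorded when it was added.
import hashlib
import urllib.request

collection = {  # placeholder URL and digest
    "https://example.org/page": "previously-recorded-sha256-digest",
}

def current_digest(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

for url, recorded in collection.items():
    try:
        changed = current_digest(url) != recorded
    except OSError:
        changed = True                   # unreachable counts as a change
    if changed:
        print(f"review needed: {url}")   # revise, replace, or remove
```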
Although statistical and machine learning methods require the input data to be in a tabular format, in real-world applications data are often stored across several tables in a relational database. How to build a single mining table from a relational database is a critical pre-processing step of any classification method, because including the right attributes may dramatically boost the accuracy of...
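A small pandas sketch of the pre-processing step this abstract describes, flattening a one-to-many relation into a single mining table; the tables and columns are invented for illustration.

```python
# Illustrative flattening of a relational schema into one mining table:
# aggregate the one-to-many side, then join onto the target table so each
# row describes exactly one customer (the unit of classification).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [20.0, 35.0, 12.5],
})

features = (orders.groupby("customer_id")["amount"]
            .agg(order_count="count", amount_sum="sum")
            .reset_index())
mining_table = customers.merge(features, on="customer_id", how="left")
print(mining_table)   # one row per customer, ready for a classifier
```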
Assigning global unique persistent identifiers (GUPIs) to datasets has the goal of improving their accessibility and simplifying how they are referenced and reused. However, as repositories receive more and more complex data, attesting to the identity of datasets attached to persistent identifiers over time is becoming more challenging. This is due to the nature of scientific research data, which is generated...