Achieving high-quality clustering is one of the most well-known problems in data mining, and k-means is by far the most commonly used clustering algorithm. It converges fairly quickly, but a good solution is not guaranteed: the clustering quality is highly dependent on the selection of the initial centroids. Moreover, when the number of clusters increases, it starts to suffer from...
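The abstract above notes that k-means quality hinges on the initial centroids. As an illustrative sketch (not code from the paper), a plain pure-Python Lloyd's algorithm run from several random seeds makes that dependence visible: different seeds can converge to different local optima with different inertia (sum of squared distances to the nearest centroid).

```python
import random

def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's algorithm; result quality depends on the random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Recompute centroids as cluster means (keep the old centroid if a cluster emptied).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids) for p in points)

# Three well-separated blobs; an unlucky init (two centroids in one blob) merges two of them.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
scores = {s: inertia(data, kmeans(data, 3, seed=s)) for s in range(10)}
```

Comparing `scores` across seeds shows why initialization schemes such as k-means++ exist: the same data and the same algorithm can yield different inertia depending purely on the starting centroids.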
The growth in the capacity and capability of NAND Flash-based storage systems has changed the face of data-oriented computational systems. These systems have become both more capable and more flexible in how they are used. With these changes come both increased potential and increased user complexity. While many systems attempt to hide this complexity through the addition of more layers of storage caches, the...
In this paper, we propose a new method for addressing post-purchase recommendations for a dynamic marketplace. The proposed method uses the transactional data as the primary data source to mine co-purchase relationships. The item listings from the transactional data are mapped to their static ‘cluster’ representation and a cluster-cluster directed graph is generated. Clusters have explicit definitions...
Genomic analysis [1] usually includes a pipeline of three stages: sequence alignment, data conversion, and advanced analysis. The analysis pipeline needs to handle hundreds of gigabytes of data as well as run complex analytics algorithms, which traditionally take a long time to execute (20+ hours) for a full-genome analysis. Parallelizing the execution of analytics algorithms is one way to speed...
A recent trend in big data analytics is to provide heterogeneous architectures that support hardware specialization. Considering the time dedicated to creating such hardware implementations, an analysis that estimates how much benefit we gain, in terms of speed and energy efficiency, by offloading various functions to hardware is necessary. This work analyzes data mining and machine...
This paper describes a functional view of a privacy architecture based on a shared-services model. The architecture exposes 7 functional management components: Master Management, Privacy Monitoring, Private Data Identification, Policy Management, Privacy Service Injection, Privacy Logging, and Privacy Analytics for (re)use by multiple applications operating in heterogeneous Big Data environments....
R is a free, powerful, open-source software package with extensive statistical computing and graphics capabilities. Due to its high-level expressiveness and multitude of domain-specific packages, R has become a popular tool for data analysis in many scientific fields. While there are a number of packages that enable running R in parallel using the Message Passing Interface (MPI) across multiple...
With rapidly growing computing power, ultra-high-resolution Earth science simulations spanning long periods of time have become feasible. However, it is still very challenging to distribute and analyze the huge amount of simulation results, which can exceed 100 TB. One key reason is that typical Earth science data are represented in NetCDF, which is not supported by the popular and powerful Hadoop Distribute...
Big Data constitutes an opportunity for companies to empower their analyses. However, at the moment there is no standard way of approaching Big Data projects. This, coupled with the complex nature of Big Data, means that many Big Data projects fail or rarely obtain the expected return on investment. In this paper, we present a methodology for tackling Big Data projects in a systematic way, avoiding...
Erasure codes such as Reed-Solomon (RS) codes are widely used to improve data reliability in distributed storage systems. Although erasure codes greatly reduce the storage overhead compared to replication schemes, repairing a failed node is still very costly in terms of network bandwidth. To address this problem, we employ the Zigzag code, an MDS array code with the optimal repair property,...
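The repair cost the abstract mentions can be seen even in the simplest erasure code: a single XOR parity block (RAID-5-style, far simpler than RS or Zigzag codes, and purely illustrative). Storage overhead is only (k+1)/k, e.g. 1.25x for k=4, versus 3x for triple replication, but repairing one lost block requires reading all k surviving blocks; this read amplification is exactly the network-bandwidth cost that codes with better repair properties aim to reduce.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

k = 4
data = [bytes([i] * 8) for i in range(k)]   # four 8-byte data blocks
parity = xor_blocks(data)                   # one parity block; total storage is (k+1)/k of the data

# Repairing one lost block means reading ALL k surviving blocks over the network.
lost = 2
survivors = [b for i, b in enumerate(data) if i != lost] + [parity]
recovered = xor_blocks(survivors)

assert recovered == data[lost]
```

A single parity block tolerates only one failure; RS codes generalize this to any number of parities, which is why they dominate in practice despite the repair cost.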
Big data analysis technologies are becoming more widely used in industry. The ever-increasing data volume, however, puts data analytic platforms such as Hadoop under constant pressure. Several compression methods have been made available on the Hadoop platform to effectively reduce data size and efficiently deliver data between cluster nodes. In the Hadoop context, compressed data can be categorized...
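As a rough, Hadoop-independent illustration of the compression trade-offs the abstract alludes to, the Python standard library's gzip and bz2 modules can be compared on repetitive record data. (In Hadoop a further distinction matters: a plain gzip file is not splittable across mappers, while a bzip2 file is, so codec choice affects parallelism as well as size.)

```python
import gzip, bz2

# Repetitive, text-like payload, typical of the log/record data analytic jobs shuffle.
payload = b"userid,timestamp,action\n" * 10_000

gz = gzip.compress(payload)   # gzip: fast, widely supported
bz = bz2.compress(payload)    # bzip2: slower, usually smaller on text

ratios = {name: len(blob) / len(payload) for name, blob in [("gzip", gz), ("bzip2", bz)]}
```

Both codecs shrink this payload dramatically, which is the point: smaller data means less disk I/O and less traffic between cluster nodes, at the price of CPU time spent compressing and decompressing.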
Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot support self-describing data formats such as NetCDF, which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by...
Healthcare applications typically require big data management as well as intensive computation. This is especially true of recently developed next-generation sequencing technology, which increases interest in processing the huge amount of information in a timely fashion. In this paper, we focus on testing whether healthcare applications can scale well on commercial big data platforms that implement...
Cloud services are widely used across the globe to store and analyze Big Data, and security breaches of these services, exposing huge amounts of private data, regularly make the news. This paper studies the current security threats to Cloud Services, Big Data, and Hadoop. The paper analyzes a newly proposed Big Data security system based on the EnCoRe system...
Existing Big Data analytics platforms, such as Hadoop, lack support for user activity monitoring. Several diagnostic tools, such as Ganglia, Ambari, and Cloudera Manager, are available to monitor the health of a cluster; however, they do not provide algorithms to detect security threats or perform user activity monitoring. Hence, there is a need to develop a scalable system that can detect malicious user...
Hadoop has emerged as the de facto state-of-the-art system for MapReduce-based data analytics. The reliability of Hadoop systems depends in part on how well they handle failures. Currently, Hadoop handles machine failures by re-executing all the tasks of the failed machines (i.e., executing recovery tasks). Unfortunately, this elegant solution is entirely entrusted to the core of Hadoop and hidden from...
The success of the Hadoop MapReduce programming model has greatly propelled research in big data analytics. In recent years, there has been growing interest in the High Performance Computing (HPC) community in using Hadoop-based tools for processing scientific data. This interest is due to the facts that data movement has become prohibitively expensive and high-performance data analytics has become an important part...
Many emerging Semantic Web applications combine and aggregate data across domains for analysis. Such analytical queries compute aggregates over multiple groupings of data, resulting in query plans with complex grouping-aggregation constraints. In the context of an RDF analytical query, each such grouping maps to a graph pattern subquery with multiple join operations, and related groups often result...