This paper studies one-scan approximation algorithms for streaming data mining (SDM). Despite the importance of pattern discovery in streaming data, this issue has not yet been sufficiently addressed in the big data community. In this context, we briefly review the previously proposed SDM methods. A recent work addresses their limitations using the technique of online compression. It is based...
Large amounts of data are generated every day, creating new challenges and opportunities that lead to extraordinary new knowledge and discoveries in many application domains ranging from science and engineering to business. One of the main challenges in this era of Big Data is how to efficiently manage and analyse data at such scale. This is challenging not only due to the size of the data,...
Generating the maximum number of visual patterns by uncovering the entire space of possible visual designs remains a challenge within the construction process of information visualization. Users interact with different mindsets consisting of design, data analysis, application development, and hardware resource usage. Therefore, they desire a flexible and productive interface that keeps them clued...
Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute...
As urban populations grow, cities face many challenges related to transportation, resource consumption, and the environment. Ride sharing has been proposed as an effective approach to reduce traffic congestion, gasoline consumption, and pollution. Despite great promise, researchers and policy makers lack adequate tools to assess tradeoffs and benefits of various ride-sharing strategies. Existing approaches...
We present a transaction model which simultaneously supports different consistency levels, which include serializable transactions for strong consistency, and weaker consistency models such as causal snapshot isolation (CSI), CSI with commutative updates, and CSI with asynchronous updates. This model is useful in managing large-scale replicated data with different consistency guarantees to make suitable...
Recommendation systems play an important role in suggesting relevant information to users. In this paper, we introduce community-wise social interactions as a new dimension for recommendations and present a social recommendation system using collaborative filtering and community detection approaches. We use (i) a community detection algorithm to extract friendship relations among users by analyzing...
Intervals have become prominent in data management as they are the main data structure to represent a number of key data types such as temporal or genomic data. Yet, there exists no solution to compactly store and efficiently query big interval data. In this paper we introduce CINTIA — the Checkpoint INTerval Index Array — an efficient data structure to store and query interval data, which achieves...
Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. Typically, it scales to large volumes of data through blocking: similar entities are clustered into blocks so that it suffices to perform comparisons only within each block. Meta-blocking further increases efficiency by cleaning the overlapping blocks from unnecessary comparisons. However,...
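The blocking idea described in this abstract can be illustrated with a minimal sketch. It assumes simple token blocking (one block per lowercase token), which is a common baseline and not necessarily the scheme used by the authors; the names `token_blocks` and `candidate_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(records):
    """Group record ids by each lowercase token they contain (token blocking)."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    return blocks


def candidate_pairs(blocks):
    """Collect the distinct record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        # Comparisons are generated only inside a block, never across blocks.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical toy records, made up for illustration only.
records = {
    1: "Jon Smith Boston",
    2: "John Smith Boston",
    3: "Maria Garcia Madrid",
    4: "M Garcia Madrid",
}

blocks = token_blocks(records)
pairs = candidate_pairs(blocks)

# Number of comparisons a brute-force, all-pairs approach would need.
all_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Keeping the pairs in a set drops comparisons that recur in several overlapping blocks. This is only a crude stand-in: the meta-blocking mentioned in the abstract goes further, and its details are not shown here.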
Data historians[1] are today transitioning from their traditional role as record-keepers and planners, to tools that provide the required flexibility and responsiveness to customers' requirements in terms of the type and volume of data stored, archived and queried. Added dimensions to these requirements are the need for high performance and scalability. Businesses are realizing that traditional database...
In this paper, we propose and implement a key-value store that supports MPI while allowing applications to access it at any time without having to be declared in the same MPI communication world. This feature may significantly simplify the application design and allow programmers to leverage the power of the key-value store in an intuitive way. In our preliminary experimental results captured from a supercomputer at...
Big Data presents challenges for predictive analytic algorithms due to the possibility of non-stationary populations. Concept drift detection algorithms can be used to detect changes in the underlying distribution so that models can be retrained. Most concept drift detection methods are known to scale to a relatively low number of features (a few hundred). However, in many areas, datasets with thousands or even tens...
Cloud environments usually feature several geographically distributed data centers. In order to increase the scalability of applications, many Cloud providers partition data and distribute these partitions across data centers to balance the load. However, poorly chosen partitions can lead to distributed transactions. This is particularly expensive when applications require...
Communication traces help developers of high-performance computing (HPC) applications understand and improve their codes. When run on large-scale HPC facilities, the scalability of tracing tools becomes a challenge. To address this problem, traces can be clustered into groups of processes that exhibit similar behavior. Instead of collecting trace information for each individual node, it then suffices...
In-memory data grid (IMDG) is a new technology that enables scalable and low-latency processing of big data by sharding it across the RAM of multiple servers. In this paper, we explore the design space of IMDGs to identify their advantages and avoid their drawbacks. We present the performance tradeoffs of IMDGs using unit tests on core distributed operations and data structures. For evaluation, we...
The wide popularity of graphs in areas such as Semantic Web and Social Network has created the need to develop efficient methods to store and process graph data. However, the unique structure of graphs renders traditional data handling methods and storage structures inefficient when dealing with large volumes of data. Existing graph storage structures either compromise scalability by adopting...
In this paper we present a framework to enable data-intensive Spark workloads on MareNostrum, a petascale supercomputer designed mainly for compute-intensive applications. As far as we know, this is the first attempt to investigate optimized deployment configurations of Spark on a petascale HPC setup. We detail the design of the framework and present some benchmark data to provide insights into the...
For datacenter applications that require tight synchronization, transactions are commonly employed for achieving concurrency while preserving correctness. Unfortunately, distributed transactions are hard to scale due to the decentralized lock acquisition and coordination protocols they employ. We investigate the use of a centralized lock broker architecture to improve the efficiency/scalability for...
Many systems have been developed for machine learning at scale. Performance has steadily improved, but there has been relatively little work on explicitly defining or approaching the limits of performance. In this paper we describe the application of roofline design, an approach borrowed from computer architecture, to large-scale machine learning. In roofline design, one exposes ALU, memory, and network...
Interactive database exploration is a key task in information mining. Relational databases have long been used as a critical infrastructure component to access and analyze large volumes of data in a variety of applications, including ad-hoc analytics over big data, large-scale data warehouses that support business-intelligence tools, and services for scientific-data exploration. To aid the users of...