Real-life systems involving interacting objects are typically modeled as graphs and can often grow very large in size. Revealing the community structure of such systems is crucial in helping us better understand their complex nature. However, the ever-increasing size of real-world graphs, and our evolving perception of what a community is, make the task of community detection very challenging. One...
Most real-world applications comprise databases having multiple tables. The problem becomes further complicated in the realm of Big Data, where related information is spread over different data repositories. However, data mining techniques are usually applied on a single flat table. This work focuses on generating a mining table by aggregating information from multiple local tables and external data sources...
Identity resolution capability for social networking profiles is important for a range of purposes, from open-source intelligence applications to forming semantic web connections. Yet replication of research in this area is hampered by the lack of access to ground-truth data linking the identities of profiles from different networks. Almost all data sources previously used by researchers are no longer...
When analyzing sensitive data in a cloud-deployed Hadoop stack, data-in-transit security needs to be enabled, especially in the underlying storage tier. This, however, will affect the performance of the system and may partially offset the cost benefits of the cloud. In this paper, we discuss two strategies for securing HBase deployments in the cloud. For both, we present benchmarking results which...
Scientific datasets are steadily growing in size, due to increasing resolution and scale. Unstructured meshes are essential to certain fields of engineering and science, but they present special challenges for efficient access and processing. The work described in this paper accelerates range queries for very large unstructured meshes using the GPU. Prior work in the area introduced a preprocessing...
Parallel Factor Analysis (PARAFAC) is used in many scientific disciplines to decompose multimodal datasets ('tensors') into principal factors to uncover multilinear relationships in the data. Today's popular implementations of PARAFAC are single-server solutions that do not scale well to big datasets. This paper presents the design, implementation, and testing of a Big Data-enabled Parallel PARAFAC...
In this paper, we characterize the behavior of “big” and “fast” data analysis frameworks, in multi-tenant, shared settings for which computing resources (CPU and memory) are limited, an increasingly common scenario used to increase utilization and lower cost. We study how popular analytics frameworks behave and interfere with each other under such constraints. We empirically evaluate Hadoop, Spark,...
Clustering large-scale data has become an important challenge that has motivated several recent works. While the emphasis has been on the organization of massive data into disjoint groups, this work instead considers the identification of non-disjoint groups. In this setting, it is possible for a data object to belong simultaneously to several groups since many real-world applications...
Point of interest (POI) recommendation, a service which can help people discover useful and interesting locations, has emerged rapidly with the development of location-based social networks (LBSNs) like Foursquare, Gowalla and Wechat. The large number of check-in histories makes it possible to mine the preferences of each user and then to provide accurate personalized POI recommendation. In real-world...
In cloud systems, efficient resource provisioning is needed to maximize resource utilization while reducing the Service Level Objective (SLO) violation rate, which is important to cloud providers for high profit. Several methods have been proposed to provide efficient provisioning. However, the previous methods do not consider leveraging the complementarity of jobs' requirements on different resource...
We introduce GraphFlow, a big graph framework that is able to encode complex data science experiments as a set of high-level workflows. GraphFlow combines the Spark big data processing platform and the Galaxy workflow management system to offer a set of components for graph processing using a novel interaction model for creating and using complex workflows. GraphFlow contributes an easy-to-use interface...
Although it is crucial to transmit important information to those who require it during disasters, neither of the following questions has been answered: Who contributes to information diffusion? How do users construct helpful relationships in social media? Unfortunately, most previous research has focused on the scale of information diffusion, instead of the flow of information and the paths traveled...
Video is an increasingly important method of information-sharing on the Web. Services such as YouTube, Vimeo, and Liveleak are platforms that support uploading User-Generated Content. Users tend to seek related information during or after watching an informative video by finding and reading comments on Web services. However, existing services only support sorting by recentness (newest) or rating (LIKES...
Implementing trajectory data stream analysis in parallel raises technical issues of data partitioning and of improving the analysis operations. In this paper, we define the trajectory analysis problem as discovering trajectory companies of moving objects. We develop a discovery workflow in parallel batch processing. We solve technical issues of data partitioning and data locality in the steps of analysis...
In the past two decades, new developments in computing, sensing and crowdsourced data have resulted in an explosion in the availability of quantitative information. The possibilities of analyzing this so-called “big data” to inform research and the decision-making process are virtually endless. In general, analyses have to be done across multiple data sets in order to bring out the most value of big...
In 2010 the popular paper by Kwak et al. [11] presented the first comprehensive study of Twitter as it appeared in 2009, using most of the Twitter network at the time. Since then, Twitter's popularity and usage have exploded, experiencing a 10-fold increase. As of 2015, it has more than 500 million users, out of which 316 million are active, i.e. logging into the service at least once a month. In...
This paper explores the relationship between TV viewership ratings for Scandinavia's most popular talk show, Skavlan, and public opinions expressed on its Facebook page. The research aim is to examine whether the activity on social media affects the number of viewers per episode of Skavlan, how the viewers are affected by discussions on the talk show, and whether this creates debate on social media...
In this work, we propose Max-Node sampling, a novel sampling algorithm for data collection. The goal of Max-Node is to maximize the number of nodes observed in the sample, given a budget constraint. Max-Node is based on the intuition that networks contain many densely connected regions (i.e., communities) that may be only weakly connected to one another, and to maximize the number of nodes observed,...
The scale of scientific data generated by experimental facilities and simulations on high-performance computing facilities has been growing rapidly. In many cases, this data needs to be transferred rapidly and reliably to remote facilities for storage, analysis, sharing, etc. At the same time, users want to verify the integrity of the data by computing a checksum after the data has been written to disk...
Fault tolerance is an important challenge for supporting critical big data analytic operations. Most existing solutions only provide fault-tolerant data replication, requiring failed queries to be restarted. This approach is insufficient for long-running, time-sensitive analytic queries, due to lost query progress. Several solutions provide intra-query fault tolerance. However, these focus on distributed...