Big data refers to broad data sets that have been used in many fields. Processing a huge data set is time-consuming, not only because of its sheer volume, but also because data types and structures can be diverse and complex. Currently, many data mining and machine learning techniques are being applied to big data problems; some of them can construct a good learning algorithm in terms...
This paper discusses the relation between dorm arrangement and student performance. The k-means algorithm, an unsupervised learning algorithm, is the main tool used in the analysis. Students are grouped into several clusters according to the similarity of their performance scores. This paper analyzes the clustering result by comparing it with the actual dorm arrangement. In the end, drawbacks...
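As a rough illustration of the approach this abstract describes, the sketch below clusters students by performance scores with a plain k-means loop and then checks how well each dorm lines up with a single cluster. It is a minimal sketch under stated assumptions, not the paper's method: the score tuples, the dorm labels, the choice of k = 2, and the helper names `kmeans` and `dorm_purity` are all hypothetical, since the abstract does not specify the features, the number of clusters, or how the comparison is made.

```python
import random
from collections import Counter, defaultdict


def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on tuples of numbers; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids (assumption)
    labels = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels, centroids


def dorm_purity(labels, dorms):
    """For each dorm, the fraction of its students in the dorm's majority cluster."""
    by_dorm = defaultdict(list)
    for lab, dorm in zip(labels, dorms):
        by_dorm[dorm].append(lab)
    return {d: Counter(labs).most_common(1)[0][1] / len(labs)
            for d, labs in by_dorm.items()}


# Hypothetical data: (exam score, coursework score) per student, plus a dorm label.
scores = [(90, 85), (88, 92), (91, 89), (55, 60), (58, 52), (60, 57)]
dorms = ["A", "A", "B", "B", "C", "C"]

labels, centroids = kmeans(scores, k=2)
purity = dorm_purity(labels, dorms)
print(labels, purity)
```

A purity close to 1.0 for a dorm means its residents mostly fall into the same performance cluster; values near 1/k suggest the dorm is mixed. This is only one possible way to compare clusters with dorm assignments.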
We define interestingness hotspots as contiguous regions in space which are interesting based on a domain expert's notion of interestingness captured by an interestingness function. This paper centers on finding interestingness hotspots on very large gridded datasets which are quite common in scientific computing. Mining large gridded datasets with a lot of variables and measurements requires a scalable...
In both industry and academia, seismic exploration does not yet have the capability to illuminate physical dynamics at high resolution and in real time. The major bottleneck in real-time monitoring today is transferring large volumes of raw data for post-processing. Although the computation capacity and sampling rate of sensors have increased exponentially, we still have challenges in terms...
Rumor detection in streaming social media is a significant but challenging problem. In this paper, we present a method to identify rumor patterns in the streaming social media environment. We first propose patterns that combine structural and behavioral properties of rumors to distinguish false rumors from valid news. A novel graph-based pattern matching algorithm is also described to detect...
Graph keyword search is the process of extracting small subgraphs that contain a set of query keywords from a graph. This problem is challenging because there are many constraints, including distance, keyword, search time, index size, and memory constraints, while the size of the data is growing very rapidly. Existing greedy algorithms guarantee...
Recent advances in microscopy imaging and genomics have created an explosion of patient data in the pathology domain. Whole-slide images (WSIs) of tissues can now capture disease processes as they unfold in high resolution, recording the visual cues that have been the basis of pathologic diagnosis for over a century. Each WSI contains billions of pixels and up to a million or more microanatomic objects...
Implementing database operations on parallel platforms has gained a lot of momentum in the past decade. A number of studies have shown the potential of using GPUs to speed up database operations. In this paper, we present empirical evaluations of a state-of-the-art work published in SIGMOD'08 on GPU-based join processing. In particular, this work presents four major join algorithms and a number of join-related...
Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute...
Empirical studies show that random initialization in Particle Swarm Optimization (PSO) based Fuzzy c-means (FCM) methods affects computational performance, especially on big data. As the data points in high-density areas are more likely to be near the cluster centroids, we design a new algorithm to guide the initialization according to the data density patterns. Our algorithm is initialized...
The rise of big science techniques is reshaping the provisioning of computing resources and scientific software in large science facilities. As facilities gear up for data-intensive computing infrastructure, a wave of facility-based big science computing platforms is emerging. This paper presents a new computing paradigm for designing an HPC data analysis platform, named Data Optimised Computing...
Influence among objects is prevalent in graph-structured data. However, most existing research efforts detect influence among objects from snapshots of homogeneous graphs. In this paper, we study a new problem of detecting time-evolving influence among objects from dynamic heterogeneous graphs. We propose a probabilistic graphical model, Time-evolving Influence Model (TIM), to capture the temporal...
Genomic analysis [1] usually includes a pipeline of three stages: sequence alignment, data conversion, and advanced analysis. The analysis pipeline needs to handle hundreds of gigabytes of data and run complex analytics algorithms, which traditionally takes a long time (20+ hours) for a full-genome analysis. Parallelizing the execution of analytics algorithms is one way to speed...
Geoscience gives insights into our surroundings and benefits many aspects of our lives. Nowadays, with massive numbers of sensors deployed to sense all kinds of environmental parameters, tens of billions, even trillions, of sensed data items are collected and need to be analyzed for surveillance or other purposes. From many perspectives, users always issue queries according to specific spatial and temporal predicates...
Designing materials that are resistant to extreme temperatures and brittleness relies on assessing structural dynamics of samples. Algorithms are critically important to characterize material deformation under stress conditions. Here, we report on our design of coarse-grain parallel algorithms for image quality assessment based on structural information and on crack detection of gigabyte-scale experimental...
In this study we analyzed a series of LiDAR point clouds acquired over Taijiang district (part of Fujian province, China). The objective was to detect and extract the water surface area from individual LiDAR point clouds in parallel. To this end, interactive visualization of fine-grained data, global cluster algorithms, and statistical investigation were applied. We first rasterized point clouds...
Graphical models (GMs) provide a popular framework for big data analytics because they often lend themselves to distributed and parallel processing by exploiting graph-based ‘local’ structures. A GM models correlated random variables, and max-product Belief Propagation (BP) is the most popular heuristic for computing the most likely assignment in GMs. In the past years, it has been proven...
In modern web-scale applications that collect data from different sources, entity conflation is a challenging task due to various data quality issues. In this paper, we propose a robust and distributed framework to perform conflation on noisy data in the Microsoft Academic Service dataset. Our framework contains two major components. In the offline component, we train a GBDT model to determine whether...
In this paper, we focus on distance-based outlier detection in an uncertain dataset, which is very useful in large social networks. Based on the x-tuple model and the possible world semantics, we propose the concepts of tuple outlier score, top-k probability, and top-(k1, k2) distance-based outlier. We then design an algorithm using a dynamic programming technique to calculate tuple outlier scores and...
In this paper, we discuss data-driven discovery challenges of the Big Data era. We observe that recent innovations in collecting, accessing, organizing, integrating, and querying massive amounts of data from a wide variety of sources have brought statistical data mining and machine learning under more scrutiny and evaluation than ever before as means of gleaning insights from the data. In that context,...