Multi-label data with high dimensionality arise frequently in data mining and machine learning. Using high-dimensional data directly is not only time-consuming but also computationally unreliable. Supervised dimensionality reduction approaches are based on the assumption that there are large amounts of labeled data. It is infeasible to label a large number of training samples in practice...
We develop a warped correlation finder to identify correlated user accounts in social media websites such as Twitter. The key observation is that humans cannot be highly synchronous for a long duration; thus, highly synchronous user accounts are most likely bots. Existing bot detection methods are mostly supervised, requiring a large amount of labeled data to train, and do not consider cross-user...
Graph classification methods have gained increasing attention in different domains, such as classifying the functions of molecules or detecting bugs in software programs. Similarly, predicting events in manufacturing operations data can be compactly modeled as a graph classification problem. Feature representations of graphs are usually found by mining discriminative sub-graph patterns that are non-uniformly...
In this paper, we propose an interactive constrained independent topic analysis for text mining. Independent Topic Analysis (ITA) is a method for extracting independent topics from document data using independent component analysis. With independent topic analysis, it is possible to extract topics that are maximally independent of one another. By extracting the independent topic, it is easy...
Thanks to recent advances in sequencing technology, thousands of genomes have been sequenced. This sequence data can be fruitfully employed in diagnosis, drug design, etc. Genome-wide Association Study (GWAS) focuses on this important problem of extracting useful information from genomic data. As an example, a comparison of different genomes could shed light on the causes of different diseases. Human...
Discovering and modeling lead-lag relations is a critical task in a variety of domains, including energy management, financial markets and environment monitoring. This task becomes more challenging when processing massive and highly dynamic data sources, often produced by sensors and live feeds that collect data about evolving entities in the real world. To cope with this data volume and velocity,...
The unprecedented expansion of user-generated content in recent years demands more effort in information filtering in order to extract high-quality information from the huge amount of available data. In particular, topic detection from microblog streams is the first step toward monitoring and summarizing social data. This task is challenging due to the short and noisy characteristics of microblog content...
Today's dynamic computing deployment for commercial and scientific applications is propelling us to an era where minor inefficiencies can snowball into significant performance and operational bottlenecks. Data center operations increasingly rely on sensor-based control systems for key decision insights. The increased sampling frequencies, cheaper storage costs and prolific deployment of sensors...
Given a set of events of two different types (e.g. locations of crime incidents/road accidents) in geographic space and minimum density and area thresholds, spatial regions of high correlation discovery (RHC) aims to determine rectangular-shaped areas of high correlation between two event types. RHC discovery is important to many fields like transportation engineering, criminology, and epidemiology...
In this study, we focus on the extraction of latent topic transitions from POS data. POS analysis is conducted to obtain the frequent patterns of customer behavior. The fundamental method for POS analysis is market basket analysis. With market basket analysis, the sets of products that are often bought at the same time can be extracted. In market basket analysis, however, the effect of...
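As a rough illustration of the market basket idea mentioned in this abstract (finding sets of products that are often bought together), the following minimal sketch counts how often each pair of products co-occurs in a transaction. The function name `pair_support` and the sample baskets are hypothetical and are not taken from the paper, which goes on to study topic transitions rather than plain pair counts.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets that contain each product pair.

    `baskets` is a list of transactions, each a list of product names.
    This is plain pair counting, an assumed simplification of market
    basket analysis, not the method proposed in the abstract.
    """
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical POS transactions, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk"],
]
support = pair_support(baskets)
for pair, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

Pairs with high support are candidates for "often bought at the same time"; a real analysis would typically also apply a minimum-support threshold and consider larger item sets.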
As systems grow more complex, exception detection becomes both more important and more difficult. For most complex systems, such as cloud platforms, exception detection is mainly conducted by analyzing a large amount of telemetry data collected from systems at runtime. Time series data and event data are two major types of telemetry data. Techniques of correlation analysis are important tools that...
In this paper, a novel, adequate and concise information extraction approach is explored to provide a promising alternative for revealing the intrinsic structure of cyclostationary signals, such as communication signals. A novel graph-based signal representation is proposed to interpret the spectral correlation function as a graph and its adjacency matrix. This graph can represent the proposed...
Among many Big Data applications are those that deal with data streams. A data stream is a sequence of data points with timestamps that possesses the properties of transiency, infiniteness, uncertainty, concept drift, and multi-dimensionality. In this paper we propose an outlier detection technique called Orion that addresses all the characteristics of data streams. Orion looks for a projected dimension...
The amount of data currently produced emphasizes the importance of techniques for efficient data processing. Searching big data collections according to the similarity of data corresponds well to human perception. This paper is focused on similarity search using the concept of sketches – compact bit-string representations of data objects compared by Hamming distance, which can be used for filtering...
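To make the sketch-and-filter idea in this abstract concrete, here is a minimal example that compares bit-string sketches by Hamming distance and keeps only candidates within a distance threshold. It assumes sketches are already available as Python integers; how the sketches are built from data objects is not described in the visible part of the abstract, and the names `hamming` and `filter_candidates` and the sample values are hypothetical.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")


def filter_candidates(query: int, sketches: Dict[str, int], max_dist: int) -> List[str]:
    """Return ids of objects whose sketch is within max_dist of the query sketch."""
    return [
        obj_id
        for obj_id, sketch in sketches.items()
        if hamming(query, sketch) <= max_dist
    ]


# Hypothetical 8-bit sketches, for illustration only.
sketches = {"a": 0b10110010, "b": 0b10110011, "c": 0b01001101}
candidates = filter_candidates(0b10110010, sketches, max_dist=2)
print(candidates)
```

In a filtering setup of this kind, only the surviving candidates would then be compared with the original, more expensive similarity function.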
Bursty behavior normally indicates that the workload generated by data accesses happens in short, uneven spurts. In order to handle the bursts, the physical resources of IT devices have to be configured to offer capacity that goes far beyond the average resource utilization, thus satisfying the performance requirements. However, this kind of fat provisioning wastes resources when the system does...
It is crucial for Internet companies to provide highly reliable web-based services. Web-based services typically have many components running on large-scale infrastructure with complex interactions. Although it is an indispensable part of high reliability, diagnosis remains a thorny problem. With the growth of system scale and complexity, it becomes even more difficult. In this paper, we propose...
Differential privacy (DP) has emerged as a popular standard for privacy protection and received great attention from the research community. However, practitioners often find DP cumbersome to implement, since it requires additional protocols (e.g., for randomized response, noise addition) and changes to existing database systems. To avoid these issues we introduce Explode, a platform for differentially...
In this paper, the key indicators of sports tourism competitiveness were selected using factor analysis. Using the factor analysis method in SPSS 22.0, sports tourism data for each city and county of Hainan were analyzed, and the sports tourism competitiveness of each city and county was evaluated by comprehensive scores...
The problem of mining a network of time series data naturally arises in many research areas, including energy systems, quantitative finance, bioinformatics, environmental monitoring, and traffic monitoring. Among others, two emerging challenges shared by manifold applications are (1) the modeling of temporal-spatial dependence with contextual information and (2) the design of efficient learning algorithms...
What exactly is ‘Big Data’, and for what purposes and applications is it really efficient? Between the commercial promises made by industrial actors and the Cassandra-like cautions of some whistle-blowers, we propose a singular Big Data field to investigate with Inductive Data-Driven Algorithms: developing collections. Last but not least, we investigate the innovative possibility to curate ‘figural’...