We propose and study a novel problem of mining news text and social media jointly to discover controversial points in news, which enables many applications such as highlighting controversial points in news articles for readers, revealing controversies in news and their trends over time, and quantifying the controversy of a news source. We design a controversy scoring function to discover the most...
As the ability to store and process massive amounts of user behavioral data increases, new approaches continue to arise for leveraging the wisdom of the crowds to gain insights that were previously very challenging to discover by text mining alone. For example, through collaborative filtering, we can learn previously hidden relationships between items based upon users' interactions with them, and...
Content is one of the most essential parts of products on e-commerce websites such as eBay. It drives not only user engagement but also traffic from search engines, based on its relevance. Generating content for products, however, comes with a wide set of challenges due to the complexity of commerce at scale, and requires new applications in text processing and information extraction...
Many modern big data applications deal with graph-structured data, such as databases of molecular compounds represented as graphs of atoms and bonds, or “structured interaction networks” in biological and social networks, where nodes refer to entities (proteins, people, etc.) and edges represent their relationships. Central to high-performance graph analytics over such data is to locate patterns...
Predicting ad click-through rates is the core problem in display advertising, which has received much attention from the machine learning community in recent years. In this paper, we present an online learning algorithm for click-through rate prediction, namely Follow-The-Regularized-Factorized-Leader (FTRFL), which incorporates the Follow-The-Regularized-Leader (FTRL-Proximal) algorithm with per-coordinate...
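For orientation, the following is a minimal sketch of the plain FTRL-Proximal update for logistic regression with per-coordinate learning rates, the component the abstract names. It is not the FTRFL algorithm itself, whose factorized part is not described in the text above. The class name, the hyperparameter values, and the toy feature names are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Plain FTRL-Proximal for logistic regression on sparse binary features.

    Illustrative sketch only: hyperparameter defaults are assumptions.
    """

    def __init__(self, alpha=0.5, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-coordinate accumulated adjusted gradients
        self.n = defaultdict(float)  # per-coordinate sum of squared gradients

    def _weight(self, i):
        # Closed-form weight; L1 regularization zeroes out small coordinates.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        step = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / step

    def predict(self, features):
        # Each active feature has value 1, so the score is a sum of weights.
        score = sum(self._weight(i) for i in features)
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, y):
        p = self.predict(features)
        for i in features:
            g = p - y  # gradient of the log loss for a feature with value 1
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g


# Hypothetical toy stream: "ad_a" is always clicked, "ad_b" never is.
model = FTRLProximal()
for _ in range(20):
    model.update(["ad_a"], 1)
    model.update(["ad_b"], 0)

p_a = model.predict(["ad_a"])
p_b = model.predict(["ad_b"])
```

Because each coordinate keeps its own `n` value, frequently seen features take smaller steps than rare ones, which is the per-coordinate behavior the abstract refers to.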
Semantic knowledge is usually added to topic models to improve topic coherence. However, it is hard to judge whether semantic information is related to a topic without using complicated lexical characteristics. In this paper, we demonstrate a novel model called the Cloud Transformation Model, which can easily judge whether semantic information is related to a topic, and integrate semantic information into...
This paper proposes a Contrarian Probabilistic Model (CPM) to evaluate the effectiveness of contrarians' investment in preferred stocks using big data from Tradeline. CPM accommodates the unique features of investment data which are often correlated, nested, heterogeneous, non-normal with missing values. The clustering and statistical inference are integrated in CPM, which enables joint investment...
Big data refers to broad data sets that are used in many fields. Processing a huge data set is time-consuming, not only because of its sheer volume but also because data types and structures can be varied and complex. Currently, many data mining and machine learning techniques are being applied to big data problems; some of them can construct a good learning algorithm in terms...
Understanding bike trip patterns in a bike sharing system is important for researchers designing models for station placement and bike scheduling. By bike trip patterns, we refer to the large number of bike trips observed between two stations. However, due to privacy and operational concerns, bike trip data are usually not made publicly available. In this paper, instead of relying on time-consuming...
This paper discusses the relation between dorm arrangement and student performance. The k-means algorithm, an unsupervised learning method, is the main tool used in the analysis. Students are grouped into clusters according to the similarity of their performance scores. This paper analyzes the clustering result by comparing it with the actual dorm arrangement. In the end, drawbacks...
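The workflow this abstract outlines (cluster students by score, then compare clusters with dorm assignments) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the scores and dorm labels are made-up toy values, the initial centroids are simply the first k points, and the comparison is a plain cross-tabulation.

```python
from collections import Counter


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means. Assumption: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical data: (math score, language score) per student, plus their dorm.
scores = [(90, 85), (55, 60), (88, 92), (58, 52), (91, 89), (50, 57)]
dorms = ["A", "B", "A", "B", "A", "B"]

labels, centroids = kmeans(scores, k=2)

# Cross-tabulate dorm against cluster to see how well they line up.
table = Counter(zip(dorms, labels))
```

In this toy data every student in dorm A lands in one cluster and every student in dorm B in the other; real data would rarely separate so cleanly, and the cross-tabulation makes the degree of overlap visible.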
A mechanism for identifying bandings in large "zero-one" N-dimensional data sets, using a sampling technique, is presented. The challenge of identifying bandings in data is the large number of potential permutations that need to be considered. To circumvent this, a banding score mechanism is proposed that avoids the need to consider large numbers of permutations. This has been incorporated...
Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on ranking functions in order to determine an overall score for each of the objects across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine which prevents the pre-computation of the...
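To make the setting concrete, here is a small sketch of a top-k query whose ranking function is only known at query time. It assumes a linear weighted-sum ranking function, which is a common choice but is not stated in the abstract; the `top_k` helper, the attribute names, and the hotel records are hypothetical.

```python
import heapq


def top_k(objects, weights, k):
    """Return the k objects with the highest weighted-sum score.

    `weights` maps attribute name to weight and is supplied per query,
    so scores cannot be computed ahead of time.
    """
    def score(obj):
        return sum(w * obj[attr] for attr, w in weights.items())

    return heapq.nlargest(k, objects, key=score)


# Hypothetical objects with attributes normalized to [0, 1].
hotels = [
    {"name": "h1", "rating": 0.9, "cheapness": 0.2},
    {"name": "h2", "rating": 0.6, "cheapness": 0.8},
    {"name": "h3", "rating": 0.4, "cheapness": 0.9},
    {"name": "h4", "rating": 0.8, "cheapness": 0.5},
]

# Two users with different preferences get different answers.
quality_first = [h["name"] for h in top_k(hotels, {"rating": 0.8, "cheapness": 0.2}, k=2)]
budget_first = [h["name"] for h in top_k(hotels, {"rating": 0.2, "cheapness": 0.8}, k=2)]
```

The same data yields a different top-2 for each weight vector, which illustrates why results cannot simply be precomputed for all users.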
We define interestingness hotspots as contiguous regions in space which are interesting based on a domain expert's notion of interestingness captured by an interestingness function. This paper centers on finding interestingness hotspots on very large gridded datasets which are quite common in scientific computing. Mining large gridded datasets with a lot of variables and measurements requires a scalable...
Phishing websites are becoming a major threat to information security in social networks. The attacks not only reduce users' trust but also harm the interests of the third parties who develop the platform. To address the time lag in passive phishing website detection, this paper proposes a solution that proactively discovers phishing websites based on a blacklist, in which the anomalies of...
In this paper we present SciSpark, a Big Data framework that extends Apache™ Spark for scaling scientific computations. The paper details the initial architecture and design of SciSpark. We demonstrate how SciSpark achieves parallel ingesting and partitioning of earth science satellite and model datasets. We also illustrate the usability and extensibility of SciSpark by implementing aspects of the...
In both industry and academia, seismic exploration does not yet have the capability to illuminate physical dynamics at high resolution and in real time. The major bottleneck in real-time monitoring today is transferring large volumes of raw data for post-processing. Although the computation capacity and sampling rate of sensors have increased exponentially, we still have challenges in terms...
The use of large amounts of data has immense potential for global economic growth and the competitiveness of countries with high technological standards. Vast amounts of data from different sources are collected and analyzed in order to seek economic profit and competitive advantages for companies and society in general. To yield profit, such data needs to be analyzed, processed, and interpreted...
This paper discusses a project that studied the relationship between citizen trust and social protest using visual analysis of approximately 11 million sentiment-classified tweets from the period of the 2014 Brazilian World Cup. The results of the study reveal that the 2014 World Cup protests in Brazil sprang from a wide range of grievances coupled with a relative sense of deprivation compared with...
The extreme volume of video data and its staggering growth rate inevitably put unprecedented pressure on any large-scale video sharing and hosting system. Among the efforts to mitigate this pressure, content-based video similarity search is becoming more and more important with the exponential growth of the data size. Though various approaches have been proposed to address this problem, they are mainly focusing...