The continuously growing wealth of data has radically changed the data science landscape. At the same time, Big Data tools have seen important progress in terms of optimising performance and scalability. However, applying them in practical deployment settings is still a challenging task that is highly dependent on the particularities of the data. In this paper, we present our experiences with implementing...
Remote Access Trojans (RATs) provide cyber criminals with unlimited access to infected endpoints. Using the victim's access privileges, they can access and steal sensitive business and personal data, including intellectual property and personally identifiable information. However, due to attack evolution, targeted attacks utilize modified versions of known signatures, which means that IDS rules that...
People sensing data have been successfully utilized in various domains to support a more livable place with on-demand transport systems, a green environment, a profitable economy and interactive governance. However, their potential in supporting the design of places has not been widely studied and explained. As an on-going multidisciplinary project in Singapore, “Livable Places” mines valuable insights from...
Herein we present a novel big-data framework for healthcare applications. Healthcare data is well suited for big-data processing and analytics because of the variety, veracity and volume of these types of data. In recent times, many areas within healthcare have been identified that can directly benefit from such treatment. However, setting up these types of architecture is not trivial. We present a...
Low latency and high availability of an app or a web service are key, amongst other factors, to the overall user experience (which in turn directly impacts the bottom line). Exogenic and/or endogenic factors often give rise to breakouts in cloud data, which makes maintaining high availability and delivering high performance very challenging. Existing breakout detection techniques are not suitable for...
With the enormous amount of data generated through the internet, sensors, and the Internet of Things, it becomes too overwhelming for humans to examine it all. One solution is to reduce the data to a set of statistics. The perspective in this paper is the opposite, namely that most of this data is just background noise, and the interesting parts are those that deviate from background noise, the parts that...
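The "deviation from background noise" perspective can be illustrated with a minimal sketch (an assumption of this listing, not the paper's actual method): model the background as a mean and standard deviation, and report only the values that fall far outside it.

```python
import statistics

def deviations(values, threshold=3.0):
    """Return the values that deviate from the 'background noise',
    modeled here as points more than `threshold` standard deviations
    from the mean. A toy illustration, not the paper's technique."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / sd > threshold]
```

Note how this reverses the "reduce to statistics" approach: the statistics are computed only to discard the unremarkable majority of the data.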
Close monitoring of ICU patients is a necessity for health care providers. Prediction of mortality of ICU patients based on the monitored data is an active research area. If the probability of survival (or death) of a patient could be predicted early enough, proper and timely attention could be given to the patient, saving the patient's life. Most of the existing work in this regard tries to predict mortality...
We propose a novel iterative unified clustering algorithm for data with both continuous and categorical variables, in the big data environment. Clustering is a well-studied problem and finds several applications. However, none of the existing big data clustering works discusses the challenge of mixed-attribute datasets, with both categorical and continuous attributes. We study an application in the health care...
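The core difficulty with mixed-attribute data is defining a single dissimilarity measure over both variable types. A common k-prototypes-style formulation (shown here as a hedged illustration, not the paper's algorithm) combines squared Euclidean distance on the continuous attributes with a weighted mismatch count on the categorical ones:

```python
def mixed_distance(x, y, cat_idx, gamma=1.0):
    """Dissimilarity between two mixed-attribute records.
    cat_idx: set of positions holding categorical attributes.
    gamma: weight trading off categorical mismatches against
    continuous squared differences (a tuning assumption)."""
    d = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in cat_idx:
            d += gamma * (a != b)   # 0/1 mismatch for categories
        else:
            d += (a - b) ** 2       # squared Euclidean for numerics
    return d
```

Any centroid-based iterative scheme can then cluster with this measure, using means for continuous attributes and modes for categorical ones.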
This paper introduces three interpolation methods that enrich complex evolving region trajectories that are captured every day from numerous ground-based and space-based solar observatories. The interpolation module takes a trajectory as its input and generates an enriched trajectory with interpolated time-geometry pairs. We created three different interpolation techniques: MBR-Interpolation...
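As a hedged illustration of what interpolating a time-geometry pair might involve (the paper's MBR-Interpolation may differ), one can linearly interpolate each corner of two minimum bounding rectangles observed at different timestamps:

```python
def interpolate_mbr(mbr1, mbr2, t1, t2, t):
    """Linearly interpolate a minimum bounding rectangle
    (x_min, y_min, x_max, y_max) between observations at t1 and t2.
    An illustrative sketch, not necessarily the paper's method."""
    if not t1 <= t <= t2:
        raise ValueError("t must lie within [t1, t2]")
    w = (t - t1) / (t2 - t1)  # fraction of the way from t1 to t2
    return tuple(a + w * (b - a) for a, b in zip(mbr1, mbr2))
```

Enrichment then consists of emitting such interpolated geometries at the timestamps missing from the captured trajectory.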
The introduction of Advanced Metering Infrastructures in electricity networks brings new means of dealing with issues influencing financial margins and system-safety problems, thanks to the information reported continuously by smart meters. One such issue is the detection of Non-Technical Losses (NTLs) in electric power grids. We introduce a data-driven method, called Structure&Detect, to identify...
The k-nearest neighbor (kNN) join has recently attracted considerable attention due to its broad applications. However, processing kNN joins is very expensive due to the quadratic nature of the join operation. Furthermore, since there is an increasing trend of applications to deal with big data, computing kNN joins becomes more challenging. In order to process such big data, parallel and distributed...
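The quadratic cost the abstract refers to is visible in the brute-force formulation of the kNN join, sketched below (an illustration of the problem definition, not the paper's distributed algorithm): every point of R is compared against every point of S.

```python
import heapq

def knn_join(R, S, k):
    """Brute-force kNN join: for each point r in R (tuples of numbers),
    find its k nearest neighbours in S under Euclidean distance.
    Performs |R| * |S| distance computations -- the quadratic cost
    that motivates parallel and distributed approaches."""
    result = {}
    for r in R:
        dists = ((sum((a - b) ** 2 for a, b in zip(r, s)) ** 0.5, s)
                 for s in S)
        result[r] = [s for _, s in heapq.nsmallest(k, dists)]
    return result
```

Distributed schemes typically partition R and S so that each worker only compares a subset of the pairs, trading replication for reduced per-node cost.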
Static Index Pruning is a performance optimization technique for search engines that attempts to identify and remove index postings that are unlikely to lead to top results for typical user queries. The goal is to obtain a much smaller inverted index that can quickly return results that are (almost) as good as those for the unpruned index. We make two contributions: First, we improve on previous results...
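The pruning idea described here can be sketched in a few lines (a toy illustration of the general technique; the paper's actual pruning criterion is more refined): for each term, keep only the top-scoring fraction of postings, on the assumption that the rest rarely contribute to top-ranked results.

```python
def prune_index(index, keep_fraction=0.5):
    """Toy static index pruning: index maps term -> list of
    (doc_id, score) postings. Keep only the highest-scoring
    keep_fraction of each posting list (at least one posting),
    yielding a smaller inverted index."""
    pruned = {}
    for term, postings in index.items():
        ranked = sorted(postings, key=lambda p: p[1], reverse=True)
        keep = max(1, int(len(ranked) * keep_fraction))
        pruned[term] = ranked[:keep]
    return pruned
```

The quality/size trade-off lies entirely in how the per-posting score is estimated; a score that mispredicts which postings reach the top-k degrades result quality.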
Top-k join is an essential tool for data analysis, since it enables selective retrieval of the k best combined results that come from multiple different input datasets. In the context of Big Data, processing top-k joins over huge datasets requires a scalable platform, such as the widely popular MapReduce framework. However, such a solution does not necessarily imply efficient processing, due to inherent...
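A centralized baseline makes the scalability problem concrete (this sketch is an illustration of the operator, not the paper's MapReduce algorithm): scoring every pair from the two inputs grows multiplicatively with their sizes, even though only k results survive.

```python
import heapq

def topk_join(R, S, k, score):
    """Naive top-k join: evaluate score(r, s) for every pair in
    R x S and keep the k highest-scoring combinations, using a
    min-heap of the current best k. Returns (score, r, s) tuples
    in descending score order."""
    heap = []  # min-heap; heap[0] is the weakest of the best k so far
    for r in R:
        for s in S:
            item = (score(r, s), r, s)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True)
```

Efficient distributed variants avoid materializing the full pair space, e.g. by bounding the best possible score of unseen combinations and pruning early.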
Building Information Modeling needs better strategies for schema interoperability in order to begin solving some of the problems the building industry faces including discrepancies in simulation tool data, missing or incorrect data, and gaps in data sourcing transparency. Addressing these challenges so far has often resulted in further “siloed” translation tools that only work for the few formats...
The in-memory data processing framework, Apache Spark, has been stealing the limelight for low-latency interactive applications, iterative and batch computations. Our early experience study [17] has shown that Apache Spark can be enhanced to leverage advanced features (e.g., RDMA) on high-performance networks (e.g., InfiniBand and RoCE) to improve the performance of the shuffle phase. With the fast evolving...
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets...
This paper proposes an intercloud brokerage method for system infrastructure deployments of genomic big data analytics workflows. The proposed method utilizes a conjunction of universally quantified atomic formulas to describe requirements given by users, and selects combinations of cloud services based on logical reasoning by the replacement of definite clause sets created from the conjunction of the...
The construction of data analysis infrastructures that handle continuously accumulating data is quickly becoming an essential requirement for many organizations such as the U.S. Department of Energy (DOE). While DOE supports some of the largest computing facilities in the world, new analysis infrastructures like Apache Spark are difficult to implement. In this paper, we propose an on-demand Spark...
Apache Spark enables fast computations and greatly accelerates analytics applications by efficiently utilizing the main memory and caching data for later use. At its core, Apache Spark uses data structures called RDDs (Resilient Distributed Datasets) to give a unified view of the distributed data. However, the data represented in the RDDs remains unencrypted, which can result in leakage of confidential...
Managed Hadoop in the cloud, especially SQL-on-Hadoop, has been gaining attention recently. On Platform-as-a-Service (PaaS), analytical services like Hive and Spark come pre-configured for general-purpose use and ready to deploy, giving companies a quick entry point and on-demand deployment of ready SQL-like solutions for their big data needs. This study evaluates cloud services from an end-user perspective,...