Due to its simplicity and scalability, MapReduce has become a de facto standard computing model for big data processing. Since the original MapReduce model was only appropriate for embarrassingly parallel batch processing, many follow-up studies have focused on improving the efficiency and performance of the model. Spark follows one of these recent trends by providing in-memory processing capability...
Thanks to their high availability, scalability, and usability, cloud databases have become one of the dominant cloud services. However, since cloud users do not physically possess their data, data integrity may be at risk. In this paper, we present a novel protocol that utilizes the crowdsourcing paradigm to provide practical data integrity assurance in key-value cloud databases. The main advantage of...
Labeling each instance in a large-scale data set is extremely labor- and time-consuming. One way to alleviate this problem is active learning, which aims to discover the most valuable instances for labeling to construct a powerful classifier with low generalization error. Considering both informativeness and representativeness provides a promising way to design a practical active learning method. However,...
Nowadays, robotics database management systems are on the rise. These systems ensure reliable data storage, and big data analytics demands new structures and methods for collecting, recording, and analyzing enterprise data. This paper deals with NoSQL databases, which underlie the continual growth of data for which new data management solutions have emerged....
Researchers often wish to study data stored in separate locations, such as when several research entities wish to make inferences from their combined data. The most common solution is to centralize the data in one location. However, certain types of data can be difficult to transfer between entities due to legal or practical reasons. This makes centralizing these types of data problematic. A possible...
High-performance distributed memory applications often load or receive data in a format that differs from what the application uses. One such difference arises from how the application distributes data for parallel processing. Data must be redistributed from how it was laid out by the producer to how the application needs the data to be laid out amongst its processes. In this paper, we present a large-scale...
When running data-intensive scientific workflows in a multi-data-center environment, massive data movement is inevitable. The emergence of cloud computing technologies offers a new way to develop scientific workflow systems, and using dataset replicas to reduce data transfer among data centers is an important issue. In this paper, we propose a group based genetic algorithm which...
Since RDF triples are modeled as graphs, we cannot directly adopt existing solutions from relational databases and XML technologies. Thus, there are still a number of open problems in the area of Linked Data. We present a hybrid method between centralized and distributed approaches. By using auxiliary indexes based on the MBB approximation, our approach can retrieve distributed Linked Data efficiently...
Many data-intensive applications like MapReduce are network-bound in data centers because they transfer a massive number of flows across successive processing stages. Data flows in such an incast or shuffle transfer are highly correlated and aggregated at the receiver side. Prior work aims to aggregate the correlated flows of each transfer as early as possible during the transmission phase, so as to directly...
In the emerging field of big data, a large volume of data has to be managed, and operating on data of this volume becomes easier when it is sorted and structured. The data can be structured using a simple algorithm, i.e. an index algorithm, which stores and categorizes data on the basis of their application. This in turn will be very beneficial at the business level as well as at the software level.
Cloud Computing allows users to control substantial computing power for complex data processing, generating huge and complex data. However, the virtual resources requested by users are rarely utilized to their full capacities. To mitigate this, providers often perform over-commitment to maximize profit, which can result in node overloading and consequent task eviction. This paper presents a novel...
Fast data analytics at an increasingly large scale has become a critical task in any Internet service company. For example, at Baidu, the major search engine company in China, PB-scale volumes of Web and business data are constantly acquired and analyzed in a timely manner for the purposes of evaluating product revenue, tracking product demand activities in the market, predicting user behavior, upgrading...
Data pre-processing for machine learning methods is a key step in the knowledge discovery process. Depending on the nature of the data, pre-processing might take the majority of the data analysis time. Correctly prepared data guarantees precise and reliable results of data analysis. This paper analyses the influence of initial data pre-processing on attack detection accuracy using Decision Trees, Naïve...
Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes the end-to-end large-scale machine learning applications for high-throughput training in a distributed environment with a high-level API. This approach...
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to...
For single-owner multi-user wireless sensor networks (WSNs), there is a demand for a user privacy-preserving access control protocol. Firstly, we propose a new access control protocol based on an efficient attribute-based signature. In the protocol, users need to pay for queries, and the protocol achieves fine-grained access control and privacy protection. Then, the protocol is analyzed...
In recent decades, more and more time series data has been collected in many fields, and especially in industry, where the volume has increased greatly. One of the most common types of data visualization is the line chart, but industrial time series datasets are so large that drawing them as a line chart takes much more time. In this case, we must reduce dimensionality...
This paper analyzed the challenges of data management in army data engineering, such as large data volume, data heterogeneity, a high rate of data generation and update, strict timing requirements for data processing, and widely separated data sources. We discussed the disadvantages of traditional data management technologies in dealing with these problems. We also highlighted the key problems of data management...
MapReduce has emerged as a popular programming model in the field of data-intensive computing. This is due to its simplistic design, which provides ease of use for programmers, and its framework implementations such as Hadoop, which have been adopted by large business and technology companies. One significant issue in practical MapReduce applications is data skew: the imbalance in the amount of data...