Access to data plays a major role in designing and performing efficient data computation and analyses in a distributed environment. Existing approaches access data via a variety of methods and offer various benefits and drawbacks based on the use case. Our original use case was the computational analysis of environmental sequence data, or metagenomics. Unlike other workflows that often reduce the...
The efficiency and reliability of big data computing applications frequently depend on the ease with which they can manage and move large distributed data. For example, in X-ray science, both raw data and various derived data must be moved between experiment halls and archives, supercomputers, and user workstations for reconstruction, analysis, visualization, storage, and other purposes. Throughout,...
Tremendous developments in Information Technology (IT) have enabled us to store and process huge amounts of data at unprecedented rates. This phenomenon largely impacts business processes. The field of process discovery, originating from the area of process mining, is concerned with automatically discovering process models from event data related to the execution of business processes. In this paper,...
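To make concrete what discovering a model from event data involves, here is a minimal Python sketch; the event log and activity names are invented for illustration, and the directly-follows counts it computes are only the raw material that discovery algorithms (such as the Alpha miner) turn into an actual process model.

    from collections import Counter

    # Toy event log: each trace is the ordered list of activities recorded
    # for one process instance (e.g., one purchase order).
    event_log = [
        ["register", "check", "approve", "archive"],
        ["register", "check", "reject", "archive"],
        ["register", "approve", "archive"],
    ]

    # Directly-follows relation: how often activity a is immediately
    # followed by activity b across all traces. Discovery algorithms build
    # process models (e.g., Petri nets) on top of relations like this one.
    directly_follows = Counter(
        (a, b)
        for trace in event_log
        for a, b in zip(trace, trace[1:])
    )

    for (a, b), count in directly_follows.most_common():
        print(f"{a} -> {b}: {count}")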
Multi-tenant Software-as-a-Service (SaaS) applications are increasingly built on combinations of cloud storage technologies and providers in a so-called multi-cloud setup. One advantage is that such a setup helps satisfy the different -- sometimes even contrasting -- storage requirements of different customer organizations (tenants). In such a multi-cloud environment, the application data is distributed...
Data access is key to science driven by distributed high-throughput computing (DHTC), an essential technology for many major research projects such as High Energy Physics (HEP) experiments. However, achieving efficient data access becomes quite difficult when many independent storage sites are involved because users are burdened with learning the intricacies of accessing each system and keeping careful...
The increasing size of datasets is challenging for machine learning, and Big Data frameworks, such as Apache Spark, have shown promise for facilitating model building on distributed resources. Conformal prediction is a mathematical framework that makes it possible to assign valid confidence levels to object-specific predictions. This contrasts with current best practices, where the overall confidence level for predictions...
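As a rough illustration of what object-specific confidence levels mean, the following is a minimal split (inductive) conformal prediction sketch for regression; the data, model, and scikit-learn usage are assumptions made for the example and are not the Spark-based implementation the abstract refers to.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Toy regression data (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.random((500, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(500)

    # Split off a calibration set, as in split (inductive) conformal prediction.
    X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_train, y_train)

    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.abs(y_cal - model.predict(X_cal))

    # For significance level epsilon, take the finite-sample-corrected quantile
    # of the scores; the resulting per-object interval covers the true value
    # with probability >= 1 - epsilon (here 90%).
    epsilon = 0.1
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - epsilon)))
    q = np.sort(scores)[min(k, n) - 1]

    x_new = rng.random((1, 3))
    y_hat = model.predict(x_new)[0]
    print(f"90% prediction interval: [{y_hat - q:.3f}, {y_hat + q:.3f}]")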
Nowadays, large enterprises maintain huge amounts of data in multiple backend systems, including traditional database systems and the recently popular big data systems. Telecom providers, for example, keep key business data (e.g., billing information) in database systems, whereas the huge volume of log data resides on HDFS with Hive. How to provide insightful analytics on such data becomes a...
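One common way to query such a mix of backends is to federate them through a single engine. The PySpark sketch below illustrates the idea under assumed table names, columns, and a hypothetical JDBC URL; none of these details come from the paper.

    from pyspark.sql import SparkSession

    # Hypothetical setup: billing data lives in a relational database (via JDBC),
    # call-log data lives in a Hive table on HDFS. Names and URLs are invented.
    spark = (
        SparkSession.builder
        .appName("cross-system-analytics")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read the key business data from the RDBMS.
    billing = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/billing")
        .option("dbtable", "invoices")
        .option("user", "analyst")
        .option("password", "secret")
        .load()
    )

    # Read the large log data from Hive on HDFS.
    call_logs = spark.table("logs.call_records")

    # Join the two sources and aggregate, e.g., total call minutes per invoice.
    result = (
        call_logs.join(billing, on="customer_id")
        .groupBy("invoice_id")
        .agg({"duration_minutes": "sum"})
    )
    result.show()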
In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research we developed the Dresden Web Tables Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC) which contains 3.6 billion web pages and is 266TB in size. As the vast majority of HTML tables are used for layout purposes and only...
The Web of Data is an increasingly rich source of information, which makes it useful for Big Data analysis. However, there is no guarantee that this Web of Data will provide the consumer with truthful and valuable information. Most research has focused on Big Data's Volume, Velocity, and Variety dimensions. Unfortunately, Veracity and Value, often regarded as the fourth and fifth dimensions, have...
The advent of Big Data has brought many challenges and opportunities in distributed systems, which have only been amplified by the rate at which data grows. There is a need to rethink the software stack for supporting data-intensive computing and big data analytics. Over the past decade, data analytics applications have turned to finer-grained tasks that are shorter in duration and far greater in number...
Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With new emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further exacerbated. While traditionally the process of filling in missing...
As Twitter usage increases worldwide, it becomes an important part of the big data ecosystem. People across the globe tweeting and retweeting large numbers of tweets instantaneously results in exponential growth of information diffusion. This in turn can cause information bubbles. As data from Twitter and other microblogs are used in predictive analytics in many areas such as stock price...