Many books and papers describe how to do data science. While those texts are useful, it can also be important to reflect on anti-patterns, i.e. common classes of errors seen when large communities of researchers and commercial software engineers use, and misuse, data mining tools. This technical briefing will present those errors and show how to avoid them.
Software engineering is a complex field with diverse specialties. With the growth of Internet-based applications, information security plays an important role in the software development process. Finding software engineers who have expertise in information security requires considerable effort. Stack Overflow is the largest social Q&A website in the field of software engineering. Stack Overflow contains...
We introduce Candoia, a platform and ecosystem for building Mining Software Repositories (MSR) tools. The platform is designed to support building of MSR tools by providing necessary tools and abstractions that hide the complex details of version control, bug databases, source code programming languages and forges. The ecosystem allows easy sharing and accessing of MSR apps for researchers and practitioners...
Handling and analysing large data, whether at industrial or research level, has always been problematic. These problems grow as the number of machine-dedicated software packages increases. Large data processing and analysis is error-prone and time-consuming when moving data from data generation to data analysis. In this paper, first, methods of data generation, methods to move data from...
Big Data Analysis (BDA) has recently attracted considerable interest and curiosity from scientists in various fields. Given the size and complexity of big data, it is pivotal to uncover its hidden patterns, bursts of activity, correlations and laws. Complex network analysis could be an effective method for this purpose because of its powerful data organization and visualization ability. Besides the general...
Test cases are an essential tool in software quality assurance: they ensure that code behaves as specified in the requirements. However, writing test cases brings not only benefits but also a cost: the programmer has to formulate the test cases and maintain them when the tested source code changes. Such costs become prohibitive particularly for start-ups or small enterprises, which often prefer...
The use of Application Programming Interfaces (APIs) is pervasive in software systems; it makes the development of new software much easier, but remembering large APIs with sophisticated usage protocols is arduous for software developers. Code recommendation systems alleviate this burden by providing developers with a ranked list of API usages that are estimated to be most useful to their development...
Source code comments are valuable to keep developers' explanations of code fragments. Proper comments help code readers understand the source code quickly and precisely. However, developers sometimes delete valuable comments since they do not know about the readers' knowledge and think the written comments are redundant. This paper describes a study of lost comments based on edit operation histories...
Stack Overflow is one of the most popular question-and-answer sites for programmers. However, there are a great number of duplicate questions that are expected to be detected automatically in a short time. In this paper, we introduce two approaches to improve the detection accuracy: splitting body into different types of data and using word-embedding to treat word ambiguities that are not contained...
Identifying dependencies between classes is an essential activity when maintaining and evolving software applications. It is also known that JavaScript developers often use classes to structure their projects. This happens even in legacy code, i.e., code implemented in JavaScript versions that do not provide syntactical support to classes. However, identifying associations and other dependencies between...
A large corpus of software-related documents is available on the Web, and these documents offer a unique opportunity to learn from what developers are saying or asking about the code snippets that they are discussing. For example, the natural language in a bug report provides information about what is not functioning properly in a particular code snippet. Previous research has mined information...
A key goal of this research is to understand the relationship between the adoption of software library versions and the libraries' release cycles. In detail, we conducted an empirical study of the release cycles of 23 libraries and how they were adopted by 415 Apache Software Foundation (ASF) client projects. Our preliminary findings show that software projects are quicker to update earlier rapid-release libraries...
Change distilling algorithms compute a sequence of fine-grained changes that, when executed in order, transform a given source AST into a given target AST. The resulting change sequences are used in the field of mining software repositories to study source code evolution. Unfortunately, detecting and specifying source code evolutions in such a change sequence is cumbersome. We therefore introduce...
In this paper, we present a collection of Modern Code Review data for five open source projects. The data showcases mined data from both an integrated peer review system and source code repositories. We present an easy-to-use and richer data structure to retrieve the (1) People, (2) Process and (3) Product aspects of the peer review. This paper presents the extraction methodology, the dataset structure,...
One of the many effects of social media in software development is the flourishing of very large communities of practice where members share a common interest, such as programming languages, frameworks, and tools. These communities of practice use many different communication channels but little is known about how these communities create, share, and curate knowledge using such channels. In this paper,...
GitHub, one of the most popular social coding platforms, is the platform of reference when mining Open Source repositories to learn from past experiences. In recent years, a number of research papers have been published reporting findings based on data mined from GitHub. As the community continues to deepen its understanding of software engineering thanks to the analysis performed on this platform,...
Exception handling is a powerful tool provided by many programming languages to help developers deal with unforeseen conditions. Java is one of the few programming languages to enforce an additional compilation check on certain subclasses of the Exception class through checked exceptions. As part of this study, empirical data was extracted from software projects developed in Java. The intent...
Issue tracking systems store valuable data for testing hypotheses concerning maintenance, building statistical prediction models and (recently) investigating developer affectiveness. For the latter, issue tracking systems can be mined to explore developers' emotions, sentiments and politeness, affects for short. However, research on affect detection in software artefacts is still in its early...
Developers summarize their changes to code in commit messages. When a message seems “unusual”, however, this casts doubt on the quality of the code contained in the commit. We trained n-gram language models and used cross-entropy as an indicator of commit message “unusualness” of over 120,000 commits from open source projects. Build statuses collected from Travis-CI were used as a proxy for code quality...
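The idea of scoring a commit message by its cross-entropy under an n-gram language model can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a bigram model, whitespace tokenization, and add-one (Laplace) smoothing, and the names `tokenize`, `BigramModel` and the sample messages are hypothetical. A higher cross-entropy means the message is less predictable given the training messages, i.e. more "unusual".

```python
import math
from collections import Counter


def tokenize(message):
    """Lowercase, split on whitespace, and add start/end markers (an assumption)."""
    return ["<s>"] + message.lower().split() + ["</s>"]


class BigramModel:
    """Bigram language model with add-one smoothing (illustrative choice)."""

    def __init__(self, messages):
        self.context_counts = Counter()  # how often each token appears as a context
        self.bigram_counts = Counter()   # how often each (previous, current) pair appears
        vocab = set()
        for message in messages:
            tokens = tokenize(message)
            vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1
        # Reserve one extra slot for tokens never seen in training.
        self.vocab_size = len(vocab) + 1

    def prob(self, prev, cur):
        """Smoothed estimate of P(cur | prev); never zero."""
        numerator = self.bigram_counts[(prev, cur)] + 1
        denominator = self.context_counts[prev] + self.vocab_size
        return numerator / denominator

    def cross_entropy(self, message):
        """Average negative log2 probability per bigram, in bits."""
        tokens = tokenize(message)
        pairs = list(zip(tokens, tokens[1:]))
        return -sum(math.log2(self.prob(p, c)) for p, c in pairs) / len(pairs)


# Hypothetical training messages standing in for a real commit corpus.
training = ["fix typo in readme", "fix bug in parser", "update readme"]
model = BigramModel(training)

typical_score = model.cross_entropy("fix typo in readme")
unusual_score = model.cross_entropy("asdf qwer zxcv")
print(f"typical: {typical_score:.2f} bits, unusual: {unusual_score:.2f} bits")
```

In a study of this kind, such scores would then be compared against an external signal such as build status; the smoothing method and n-gram order used here are placeholders rather than the settings reported in the paper.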
In this paper, we present a curated collection of 2833 C# solutions taken from GitHub. We encode the data in a new intermediate representation (IR) that facilitates further analysis by restricting the complexity of the syntax tree and by avoiding implicit information. The dataset is intended as a standardized input for research on recommendation systems for software engineering, but is also useful...