We aim to analyze popular GitHub repositories, find the traits they have in common, and look for correlations between their attributes.
Open-source contribution has flourished with platforms like GitHub and Bitbucket. This trend has allowed projects and software to be published openly and enabled people from all over the world to contribute to them. With open source so prevalent in technology culture, we decided to analyze some of the most popular GitHub repositories to find trends that might help explain what makes these projects so popular, and to examine trends in the technologies software engineers are using these days.
We initially used the API provided by GitHub to collect data. However, its hourly rate limit quickly became problematic for us.
To get around the rate limit, we used BeautifulSoup to scrape GitHub repository pages for the data we needed.
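As an illustration of the scraping step, here is a minimal sketch of parsing repository stats with BeautifulSoup. The HTML snippet and class names below are hypothetical stand-ins; the real markup on github.com differs and changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed HTML standing in for a fetched repository page.
# Real class names on github.com differ and change over time.
html = """
<div class="repo-stats">
  <a class="stars">52,104</a>
  <a class="forks">19,873</a>
</div>
"""

def parse_repo_stats(page_html):
    """Pull star and fork counts out of a repository page."""
    soup = BeautifulSoup(page_html, "html.parser")
    stars = soup.find("a", class_="stars").get_text(strip=True)
    forks = soup.find("a", class_="forks").get_text(strip=True)
    return {"stars": int(stars.replace(",", "")),
            "forks": int(forks.replace(",", ""))}

print(parse_repo_stats(html))  # {'stars': 52104, 'forks': 19873}
```

In practice each page would be fetched first (e.g. with `requests`) and the parsed records written to disk.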
Web scraping large amounts of data on a single machine is not feasible. Hence, we leveraged Google Cloud Platform to distribute the load, and used the Twilio API to send us notifications while the processes ran in the cloud.
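We don't detail above exactly how the work was split across machines, but one simple scheme is round-robin sharding of the repository list, with one shard per cloud instance; a sketch:

```python
def shard(items, num_workers):
    """Round-robin split: worker i gets items i, i+n, i+2n, ..."""
    return [items[i::num_workers] for i in range(num_workers)]

# Hypothetical repo URLs split across 3 cloud VMs.
repos = [f"https://github.com/org/repo{i}" for i in range(10)]
chunks = shard(repos, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```

Each VM then scrapes only its own chunk, keeping per-machine load (and per-IP request volume) roughly equal.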
After that, our data was scattered across many separate files. To ease our analysis effort, we combined them into a single JSON file.
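The merge step can be sketched as follows, assuming each worker wrote a JSON file containing a list of repository records (the file layout here is an assumption):

```python
import json
import tempfile
from pathlib import Path

def merge_json_files(directory, out_path):
    """Combine per-worker JSON files (each a list of records) into one file."""
    combined = []
    for path in sorted(Path(directory).glob("*.json")):
        combined.extend(json.loads(path.read_text()))
    Path(out_path).write_text(json.dumps(combined))
    return combined

# Demo on throwaway files standing in for the scraper output.
tmp = Path(tempfile.mkdtemp())
(tmp / "worker_a.json").write_text(json.dumps([{"name": "repo1"}]))
(tmp / "worker_b.json").write_text(json.dumps([{"name": "repo2"}]))
merged = merge_json_files(tmp, tmp / "combined.json")
print(len(merged))  # 2
```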
Now that we had the data, it was time to clean it up. The dataset contains many outliers. For example, the Linux repository is one of the most popular and best-known code bases, but it has so many contributors that GitHub displays the count as infinite, which could skew our analysis. Therefore, we removed the outliers, which cut the number of repositories almost in half.
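We don't spell out above how outliers were detected; one common approach is the interquartile-range (IQR) rule, sketched here on a made-up contributor-count field:

```python
import statistics

def remove_outliers(repos, field, k=1.5):
    """Keep repos whose `field` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    vals = sorted(r[field] for r in repos)
    q1, _, q3 = statistics.quantiles(vals, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [r for r in repos if lo <= r[field] <= hi]

# Made-up contributor counts; the huge value plays the role of Linux.
repos = [{"contributors": c} for c in [5, 8, 9, 10, 12, 10000]]
print(len(remove_outliers(repos, "contributors")))  # 5
```

The same filter can be applied per feature (stars, forks, commits, and so on) before analysis.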
GitHub is supposed to be an open-source code-sharing platform. However, we found that many popular repositories don't contain a single line of code. Most of them are tutorials or collections of best practices for certain frameworks or languages. We thought these repositories were still very valuable and believed there must be traits they have that the actual code-based repositories don't, so we wanted to investigate.
It turns out that 492 repositories contain code and 41 don't contain a single line of code.
Based on our analysis, there are many differences between repositories that contain code (legit repositories) and repositories that don't (non-legit repositories). For instance, legit repositories have more than twice as many branches and commits, and more than three times as many open issues and milestones. Non-legit repositories are far smaller than legit ones, since most contain only text or links to other content. Nonetheless, on average, non-legit repositories receive slightly more attention than legit ones, with more stars and watchers.
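Comparisons like these boil down to averaging each metric within each group. A plain-Python sketch (the field names are assumptions about our record layout):

```python
def group_means(repos, group_key, metric):
    """Average `metric` separately for each value of `group_key`."""
    totals, counts = {}, {}
    for r in repos:
        g = r[group_key]
        totals[g] = totals.get(g, 0) + r[metric]
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

# Made-up records; the real dataset's field names may differ.
repos = [
    {"has_code": True, "branches": 40},
    {"has_code": True, "branches": 60},
    {"has_code": False, "branches": 20},
]
means = group_means(repos, "has_code", "branches")
print(means)  # {True: 50.0, False: 20.0}
```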
Since the repositories differ distinctly based on whether they contain code, we built a decision tree classifier to determine whether a repository is legit. We used half the dataset as the training set and the other half as the test set. Our model correctly predicted 243 out of 266 repositories, an accuracy of about 0.91.
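A sketch of that workflow with scikit-learn, using made-up feature rows in place of our real dataset:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical rows [branches, commits, size_kb]; label 1 = contains code.
# The real features and values differ; this only shows the fit/score workflow.
X = [[40, 900, 5000], [55, 1200, 8000], [60, 1500, 9000], [35, 800, 4000],
     [5, 60, 100], [8, 90, 150], [4, 40, 80], [10, 120, 200]] * 10
y = [1, 1, 1, 1, 0, 0, 0, 0] * 10

# Half for training, half for testing, as in our actual experiment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

`score` here is plain accuracy: the fraction of test repositories classified correctly.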
We were interested in which languages these popular repositories use, to see whether the choice of primary language matters to the star count. First, we investigate the primary language of each repository, that is, the language that makes up the largest percentage of the code in it.
The least common primary languages among popular repositories are:
1. Haskell
2. Batchfile
3. Crystal
4. TeX
5. Makefile
6. XSLT
7. Vue
8. Assembly
9. Perl
10. ApacheConf
Unsurprisingly, the pure, lazy, functional language Haskell takes the crown as the least popular. What are the most frequently used languages among popular repositories?
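Both rankings come from the same tally of primary languages. A sketch with `collections.Counter` (the record fields here are assumptions):

```python
from collections import Counter

# Made-up records standing in for our dataset.
repos = [
    {"name": "repoA", "primary_language": "JavaScript"},
    {"name": "repoB", "primary_language": "Python"},
    {"name": "repoC", "primary_language": "JavaScript"},
]
lang_counts = Counter(r["primary_language"] for r in repos)
print(lang_counts.most_common())      # most common first
print(lang_counts.most_common()[-1])  # least common last
```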
Below are just some miscellaneous fun things we found in this dataset.
We noticed that popular repositories that contain code are hard to contribute to because they are already very built out. Some of these popular projects are products that were open-sourced by companies, with no public commit history from their early stages. Adding code to these projects is difficult because they are already stable; this is evident from the trends in branches created, forks, commit history, and open pull requests.

On analyzing contributor counts, we noticed that the majority of popular repositories containing code had fewer than 100 unique contributors, while the majority of repositories without code had fewer than 200. This trend is explained by the fact that contributing to repositories with code requires significantly more time and effort than contributing to repositories without code.

Milestones are used to track progress on groups of issues or pull requests in a repository. Repositories without code have on average zero milestones, while repositories with code have fewer than four. Popular repositories need milestones because they have a significant number of concurrent contributions; the same holds for the number of projects.

The majority of popular repositories that contain code have received fewer than 10,000 stars on GitHub. However, stars are fairly evenly distributed up to 20,000 stars for repositories without code. This can be attributed to the smaller total number of popular repositories without code.
This is a playground for finding correlations. You can pick any two features from our dataset and see whether there is anything interesting between them.
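Under the hood, a playground like this only needs a correlation measure for any pair of feature columns. A minimal Pearson correlation, shown on made-up star/fork columns:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical feature columns; strongly linearly related, so r is near 1.
stars = [1000, 2000, 3000, 4000]
forks = [100, 210, 290, 400]
r = pearson(stars, forks)
print(round(r, 3))
```

Values near +1 or -1 suggest a strong linear relationship between the two chosen features; values near 0 suggest none.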
Originally, we wanted to find what makes a repository popular. However, when collecting data, we pivoted and gathered only popular repositories, ignoring ordinary ones. As a result, our data was skewed toward popular repositories, and we couldn't analyze the differences between the two kinds. Second, we realized that collecting data is one of the biggest challenges in data science. Data is very valuable, and most sources place strict limits on how much of it we could obtain (such as the GitHub rate limit).
You can check out the following sources to learn more about this project.