Star Wars

An Analysis of Popular GitHub Repositories

About Star Wars

We aim to analyze popular GitHub repositories to identify the traits they share and how different parameters correlate across them.

Open-source contribution has advanced considerably with platforms like GitHub and Bitbucket. They allow projects and software to be published openly and enable people from all over the world to contribute to them. With open source so prevalent in technology culture, we decided to analyze some of the most popular GitHub repositories to find trends that might explain what makes these projects so popular, and to see which technologies software engineers are using these days.

Data Collection

We initially used the GitHub API to collect data. However, its hourly rate limit quickly became a problem for us.
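
A minimal sketch of how a client can tell when it has hit the limit, assuming the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers that the GitHub REST API returns. The helper name `seconds_to_wait` is hypothetical and only serves the illustration.

```python
import time

def seconds_to_wait(headers, now=None):
    """Return how long to sleep before the next API call.

    `headers` is a mapping of response headers. GitHub reports the calls left
    in the current window in X-RateLimit-Remaining and the time the window
    resets (Unix epoch seconds) in X-RateLimit-Reset.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    return max(0.0, float(reset_at - now))  # quota exhausted: wait until reset

# Illustrative values only: quota exhausted, window resets 1000 seconds from "now".
wait = seconds_to_wait(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "2000"}, now=1000
)
```

With a plain dict the header names are case-sensitive, so a real client would want a case-insensitive header mapping.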

To avoid the rate limit, we used Beautiful Soup to scrape GitHub repository pages for the data we needed.
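
The scraping idea can be sketched as follows. To stay dependency-free, this example swaps in the standard library's `html.parser` in place of Beautiful Soup, and the HTML snippet and the `counter` class name are invented for illustration; they are assumptions, not GitHub's actual markup.

```python
from html.parser import HTMLParser

class CounterParser(HTMLParser):
    """Collect the text of every element that carries a given CSS class.

    Simplification: capture stops at the first closing tag, so nested
    elements inside the target are not handled.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.values = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self._capturing = True
            self.values.append("")

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.values[-1] += data

def parse_count(text):
    """Turn a display string such as ' 12,345 ' into an integer."""
    return int(text.strip().replace(",", ""))

# Hypothetical markup standing in for a repository page.
SAMPLE_HTML = '<a href="#"><span class="counter"> 12,345 </span> stars</a>'
parser = CounterParser("counter")
parser.feed(SAMPLE_HTML)
star_count = parse_count(parser.values[0])
```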

Scraping large amounts of data from a single machine is not feasible. Hence, we used Google Cloud Platform to distribute the load and the Twilio API to notify us when something happened while the processes were running in the cloud.
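
One simple way to distribute such a workload is to give each machine its own slice of the repository list. The sketch below assumes a round-robin split; how the work was actually partitioned is not specified here, and the names `shard` and `repo-N` are hypothetical.

```python
def shard(items, num_workers):
    """Split `items` into `num_workers` round-robin slices of near-equal size."""
    return [items[i::num_workers] for i in range(num_workers)]

# Ten placeholder repository names spread over three workers.
repos = ["repo-%d" % i for i in range(10)]
shards = shard(repos, 3)
```

Each worker would then scrape only `shards[worker_index]`, so no repository is fetched twice.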

After that, our data was spread across many separate files. To simplify the analysis, we combined them into a single JSON file.
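
The merge step can be sketched like this, assuming each input file holds a JSON list of repository records. The function name and the directory layout are assumptions made for the example.

```python
import glob
import json
import os

def merge_json_files(input_dir, output_path):
    """Concatenate every *.json file in `input_dir` into one JSON list.

    Each input file is assumed to contain a list of repository records.
    Returns the number of records written.
    """
    merged = []
    for path in sorted(glob.glob(os.path.join(input_dir, "*.json"))):
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f)
    return len(merged)
```

Writing the output outside `input_dir` keeps a second run from merging the combined file back into itself.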

Now that we had the data, it was time to clean it up. The dataset contains many outliers. For example, the Linux repository is one of the most popular and best-known code bases, but its contributor count is effectively infinite, which would distort our analysis. We therefore tried to identify and remove all outliers, which cut the number of repositories almost in half.
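
The exact outlier rule is not stated here, so the following is only a sketch of one common approach, the interquartile-range (IQR) rule, applied to a single numeric field. The function name, the field name in the example, and the multiplier `k=1.5` are assumptions.

```python
import statistics

def drop_outliers(records, key, k=1.5):
    """Keep records whose `key` value lies within k * IQR of the quartiles."""
    values = [r[key] for r in records]
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in records if low <= r[key] <= high]

# Toy data: one repository with a contributor count far above the rest.
sample = [{"contributors_count": c} for c in [10, 12, 14, 15, 16, 18, 20, 5000]]
cleaned = drop_outliers(sample, "contributors_count")
```

Applying such a filter to several fields in turn can remove a large share of the dataset, which is consistent with losing almost half of the repositories.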

Data Analysis

Legitness

GitHub is supposed to be an open-source code sharing platform. However, we found that many popular repositories don't contain a single line of code. Most of them are tutorials or best practices for certain frameworks or languages. We thought these repositories were still very valuable and believed they must have some traits that the code-based repositories don't. Therefore, we wanted to investigate this issue.

It turns out that there are 492 repositories that contain code and 41 repositories that don't have a single line of code.

Based on our analysis, there are many differences between repositories that contain code (legit repositories) and repositories that don't (non-legit repositories). For instance, legit repositories have more than twice as many branches and commits. In addition, legit repositories have more than 3 times as many open issues and milestones. Non-legit repositories are far smaller than legit ones, since most of them only contain text or links to other content. Nonetheless, on average, non-legit repositories receive slightly more attention than legit ones, as they have more stars and watchers.

The two kinds of repositories differ distinctly, so we built a decision tree classifier to determine whether a repository is legit. We used half of the dataset as the training set and the other half as the testing set. Our model predicted 243 of the 267 test repositories correctly, an accuracy of about 0.910 (243/267).
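
To illustrate the train/test procedure, here is a deliberately simplified sketch: a depth-1 decision tree (a "stump") on a single feature, written in plain Python. It is not our model. The toy data, the `commits` field and the function names are assumptions, and a real analysis would normally use a library implementation such as scikit-learn's `DecisionTreeClassifier`.

```python
def train_stump(rows, labels, feature):
    """Fit a depth-1 decision tree on one numeric feature.

    Picks the threshold that misclassifies the fewest training rows when
    `row[feature] > threshold` is taken to mean "legit" (label True).
    """
    values = sorted({row[feature] for row in rows})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(rows) + 1
    for t in candidates:
        errors = sum((row[feature] > t) != label for row, label in zip(rows, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def accuracy(rows, labels, feature, threshold):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum((row[feature] > threshold) == label for row, label in zip(rows, labels))
    return correct / len(rows)

# Toy dataset: (features, is_legit). Values are invented for the example.
data = [
    ({"commits": 5}, False), ({"commits": 8}, False),
    ({"commits": 900}, True), ({"commits": 1200}, True),
    ({"commits": 12}, False), ({"commits": 700}, True),
    ({"commits": 3000}, True), ({"commits": 20}, False),
]
train, test = data[0::2], data[1::2]  # half for training, half for testing
threshold = train_stump([r for r, _ in train], [y for _, y in train], "commits")
test_accuracy = accuracy([r for r, _ in test], [y for _, y in test], "commits", threshold)
```

The reported score is computed the same way as `accuracy` above: the number of correct test predictions divided by the size of the test set.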

Languages

We wanted to know which languages these popular repositories use, to see whether the choice of primary language matters for the star count. First, we look at the primary language of each repository, that is, the language that makes up the largest share of the code in that repository.
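
The definition of a primary language can be sketched as follows, assuming each repository is described by a mapping from language name to bytes of code (the shape returned by GitHub's repository languages endpoint). The sample numbers are invented.

```python
from collections import Counter

def primary_language(byte_counts):
    """Return the language with the most bytes, or None for a code-free repo."""
    if not byte_counts:
        return None
    return max(byte_counts, key=byte_counts.get)

# Invented language breakdowns for four repositories; the last has no code.
repos = [
    {"JavaScript": 9000, "HTML": 1200, "CSS": 300},
    {"Python": 5000, "Shell": 100},
    {"JavaScript": 400, "TypeScript": 350},
    {},
]
primaries = [primary_language(r) for r in repos]
# Count how many repositories have each language as their primary one.
share = Counter(p for p in primaries if p is not None)
```

Dividing each count in `share` by the number of repositories gives percentages like the ones quoted in this section.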

JavaScript is the most popular primary language among the repositories in the dataset: 47.6% of repositories use mostly JavaScript. All of the top 10 languages are well known and widespread in academia and industry. This is expected, as people are more likely to look at or contribute to repositories written in common languages. But which languages are the least popular among the popular repositories?

The least popular languages used by popular repositories are:

1. Haskell
2. Batchfile
3. Crystal
4. TeX
5. Makefile
6. XSLT
7. Vue
8. Assembly
9. Perl
10. ApacheConf

Unsurprisingly, the pure, lazy, functional language Haskell takes the crown as the least popular language. What are the most frequently used languages among popular repositories?

We can clearly see that most of those popular repositories must have something to do with websites since web languages such as HTML, CSS and JavaScript dominate.

We also separated the repositories by language. Interestingly, Haskell dominates in several fields, such as branches_count, commits_count, contributors_count and pull_requests_count. Almost every Haskell repository has a homepage, downloads and a wiki, as well as open issues. Perhaps the language is a bit difficult to follow, so extra resources help a repository become popular. JavaScript and Java repositories get the most stars, while C++ and C# receive the fewest. Golang has the most contributors; it is definitely an up-and-coming language and popular in the open-source community. Java repositories have the largest code bases. This is expected, as the language itself is extremely verbose.
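
A per-language comparison like the one above boils down to grouping repositories by language and averaging a feature. The sketch below assumes each record has a `language` field (a hypothetical name) next to features such as `contributors_count`; the sample values are invented.

```python
from collections import defaultdict
from statistics import mean

def mean_by_language(repos, feature):
    """Average `feature` over the repositories of each primary language."""
    groups = defaultdict(list)
    for repo in repos:
        groups[repo["language"]].append(repo[feature])
    return {lang: mean(values) for lang, values in groups.items()}

# Invented records for illustration.
sample = [
    {"language": "Go", "contributors_count": 300},
    {"language": "Go", "contributors_count": 100},
    {"language": "Java", "contributors_count": 50},
]
averages = mean_by_language(sample, "contributors_count")
```

Because a mean is sensitive to a few very large repositories, a language with only a handful of entries in the dataset can top such a ranking easily.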

Miscellaneous

Below are just some miscellaneous fun things we found in this dataset.

We noticed that popular repositories that contain code are hard to commit to because they are already mature. Some of these projects were products open sourced by companies, so GitHub holds no record of commits from their early stages. Adding code to these projects is difficult because they are already stable. This is suggested by the trends in branches created, forks, commit history, and open pull requests.

Looking at contributor counts, the majority of popular repositories that contain code had fewer than 100 unique contributors, while the majority of repositories without code had fewer than 200. This trend can be explained by the fact that contributing to repositories with code is much harder and requires significantly more time and effort than contributing to repositories with no code.

Milestones are used to track progress on groups of issues or pull requests in a repository. Repositories with no code have 0 milestones on average, and repositories with code have fewer than four. Popular repositories have milestones that need to be met because they have a significant number of concurrent contributions. This is also the case for the number of projects.

A majority of popular repositories that contain code have received fewer than 10,000 stars on GitHub, whereas stars for repositories with no code are spread fairly evenly up to 20,000. This can be attributed to the small number of popular repositories with no code.

Star Playground

This is a playground for finding correlations. You can pick any two features from our dataset and see if there is anything interesting between them.
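
One way to quantify the relationship between two features is the Pearson correlation coefficient. Which measure the playground uses is not stated here, so treat this as an illustrative sketch with invented numbers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented feature columns: forks that grow in step with stars.
stars = [1000, 2000, 3000, 4000]
forks = [200, 400, 600, 800]
r = pearson(stars, forks)
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates none. The function raises `ZeroDivisionError` if either feature is constant.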

Takeaways

Originally, we wanted to find out what makes a repository popular. However, when we collected the data, we pivoted and gathered only popular repositories, ignoring ordinary ones. Hence, our data was skewed toward popular repositories, and we couldn't analyze the differences between the two kinds. Secondly, we realized that collecting data is one of the biggest challenges in data science. Data is very valuable, and most providers limit how much of it can be obtained from them (such as the GitHub rate limit).

Want to Learn More?

You can check out the following sources to learn more about this project.