By Micah Martin
GitHub is an online hub for source code and projects from around the world. Many well-known programs, such as the Linux kernel, are hosted publicly on the site. Problems arise when inexperienced developers inadvertently push sensitive information to their project's repository. New projects are often the most prone to these accidental leaks. They happen so often that attackers have used GitHub's and Google's search engines to watch for files containing sensitive strings and grab them almost instantly. From the other side of the fence, developers have realized how sensitive these files can be, and tools have been built to find potentially sensitive information in repositories. Developers race to remove their mistakes before attackers can grab the keys.
A Different Approach
Beyond the convenience of sharing and maintaining code online, many of GitHub's most useful features come from its foundation, Git. Git was built as a version control system for the development of Linux. A version control system logs every change made to a project's files in units called commits. While most developers remove private files from the current version of a program, many overlook the history of the project. GitHub allows searching of commit history, and that is what I leverage to find valuable information. This technique vastly lengthens the window during which valuable information can be stolen, because deleting the information from the latest version is not enough. Once commits that might hold valuable information are found, we can clone the repository and sift through the data to extract it. My target for this experiment is RSA private key files. We find these by searching for commits where the author removed a private key from the repository. My workflow for this technique is shown in the following diagram.
A single POST request to GitHub puts ten links instantly at our fingertips. Each of these ten projects potentially contains valuable information, and sifting through them is where the challenge begins. Here is what the query looks like:
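The original post shows the query as a screenshot. As a sketch, a commit search of this shape can be built and fetched as follows; the search term here is illustrative, not necessarily the author's exact query:

```shell
# Build a GitHub commit-search URL. The search term is illustrative,
# not the author's exact query.
term="BEGIN RSA PRIVATE KEY"
encoded=$(printf '%s' "$term" | sed 's/ /+/g')
url="https://github.com/search?q=%22${encoded}%22&type=Commits"
echo "$url"
# Fetching the first page of results would then be:
#   curl -s "$url" -o results.html
```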
By curling the link and using grep to find the string "Browse the", we get a list of 'a' tags that looks like this:
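The tags themselves appeared as an image in the original post. Structurally, the extraction step might look like the following; the sample markup is my own approximation of GitHub's results page at the time, not a verbatim copy:

```shell
# Approximation of the markup on GitHub's results page (structure assumed).
html='<a href="/someuser/someproject/commit/abc123def">Browse the repository at this point in the history</a>'

# Keep only the anchors around the "Browse the" text, then pull out their hrefs.
links=$(printf '%s\n' "$html" | grep 'Browse the' | grep -o 'href="[^"]*"')
echo "$links"
```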
These links can be parsed for the name of the project and its author, as well as the exact commit in which the private key was removed. Now the parsing begins.
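Splitting such a link into its parts is a one-liner with cut; the link below is a made-up example:

```shell
# Hypothetical link pulled from the search results.
# GitHub commit URLs have the shape /<author>/<project>/commit/<hash>.
link="/someuser/someproject/commit/abc123def"

author=$(echo "$link" | cut -d/ -f2)
project=$(echo "$link" | cut -d/ -f3)
hash=$(echo "$link" | cut -d/ -f5)
echo "$author $project $hash"
```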
I could either sift through the GitHub HTML or use the Git tools to find the information. I chose to write a program in Bash that would let me parse the files and gather information. My primary reason for choosing Bash was to leverage the simplicity of existing tools, though the project might deserve a rewrite in another language. My workflow involves cloning the repository. This method lets me find context for the private information, such as other config files containing the IP address of a server that uses the private key. Parsing the website would be faster and would probably yield more results. When going after information that does not need context, such as AWS API tokens or Slack API keys, web parsing would certainly be the quicker choice.
Git has features for comparing commits, viewing files from old commits, and finding changes within files. I built a process around them to recover files that could contain keys from each commit. First, we check that the commit's parent still exists:
git cat-file -t $commit~1
The command "git cat-file -t" returns the type of the Git object passed to it, and the "~1" suffix refers to the parent of the commit. If we get an error when looking up this object, the parent commit does not exist. In that case, either the commit where the key was removed is now the first commit in the repository's history, or the repository has been scrubbed. Hopefully, for the author's sake, they have scrubbed it in one of several ways. Git has its own diff tool for comparing objects: passing the commit where the change occurred along with its parent returns a list of files that changed. We then get the contents of each of those files as they were before the removal and parse them for keys. A code example of this process is shown below.
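The original post shows this step as a screenshot. A minimal sketch of the same logic, with function and variable names of my own choosing, could look like this (run inside a clone of the target repository):

```shell
# Sketch of the recovery step described above; names are mine, not the
# author's exact script.
recover_removed_files() {
    commit=$1    # the commit in which the key was removed

    # Does the parent of this commit still exist as a git object?
    if ! git cat-file -t "${commit}~1" >/dev/null 2>&1; then
        # No parent: the commit is the root, or history was scrubbed.
        return 1
    fi

    # List the files changed by the removal, then print each one as it
    # existed in the parent commit (i.e. with the key still present).
    git diff --name-only "${commit}~1" "$commit" | while read -r f; do
        git show "${commit}~1:$f" 2>/dev/null
    done
}
```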
As we can see, this checks for two methods of scrubbing: rebasing the HEAD onto a fixed commit, or removing the files from all previous commits. Rebasing is the easier of the two, but you lose all version control before the new HEAD. Scrubbing files is more effort, as it goes through every past commit and deletes all mention of the file ever existing. However, there are tools to help scrub files, which should ease some of the burden.
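BFG Repo-Cleaner is one such tool; Git itself can also do the job with filter-branch. As a sketch (the function wrapper is mine), removing every trace of an id_rsa file from a repository's history might look like:

```shell
# Remove every trace of a file from a repository's history using
# git filter-branch. This is a sketch; BFG Repo-Cleaner does the same
# job faster, but filter-branch ships with git itself.
scrub_file() {
    file=$1

    # Rewrite each commit with the file deleted from its index, dropping
    # commits that become empty as a result.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
        --index-filter "git rm --cached --ignore-unmatch $file" \
        --prune-empty -- --all

    # Delete the backup refs, expire reflogs, and repack so the old
    # objects are really gone from the local repository.
    git for-each-ref --format='%(refname)' refs/original/ |
        xargs -n1 git update-ref -d
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
}
```

After this, the rewritten history still has to be force-pushed, and anyone who already cloned the repository still has the key, so the key itself must also be revoked.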
At this point we have the contents of the file at the time the commit occurred. To save space, I gather only the changes to the file rather than the entire file. The parser then handles the files using sed and grep. A simple check determines whether the entire key sits on one line; if so, the key is split back out line by line. Sed is then used to crop out just the lines containing the key. Parsing is one of the things I enjoy doing in the Linux shell: rather than writing my own functions for specific tasks, simple one-liners solve the problem at hand. The parsing function drops each key into its own file. Old keys are stashed for later. By chaining these steps together, I managed to get very effective output. Starting with 100 search results, I collected around 17 keys.
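The one-liners themselves are not reproduced in the post; a rough equivalent of the steps just described could look like the following (GNU sed assumed, and only RSA PEM markers are targeted):

```shell
# Rough equivalent of the parsing step described above (GNU sed assumed).
extract_key() {
    # If both markers sit on one line, the key was flattened; put the
    # markers back on their own lines first.
    sed '/BEGIN RSA PRIVATE KEY.*END RSA PRIVATE KEY/{
        s/-----BEGIN/\n-----BEGIN/
        s/PRIVATE KEY-----/PRIVATE KEY-----\n/
        s/-----END/\n-----END/
    }' |
    # Then crop out just the lines from the BEGIN marker to the END marker.
    sed -n '/-----BEGIN RSA PRIVATE KEY-----/,/-----END RSA PRIVATE KEY-----/p'
}
```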
What I Learned
At this point the script is not capable of finding context and simply supplies the project name, though I hope to add this over time. Collecting keys does not by itself give a practical use for them, but even with no context I was able to gather general information and statistics. Not all search results contain keys: on average, one in seven results contains a valid private key. Different combinations of search terms bring in new projects. The success rate of searching through commit history, rather than the current branch of the repository, was greater than I expected. Finding sensitive information is simple enough with just GitHub's website, but automating the collection yielded private key files from a simple command-line tool. Creating the tool was even easier than I imagined. It is not perfect, but I believe it highlights how important scrubbing a repository's history really is. Below is the output of two pages of search results. With just 20 repositories and one search term, 7 private keys have been compromised.
Every commit found by my scraper came from an author who realized their key was on GitHub and removed it. However, steps can be taken beyond scrubbing the repository to mitigate an attack and limit the damage. Over half of the SSH keys I gathered had no password set. Setting a secure passphrase on the private key would slow an attacker down from using any stolen keys, buying time to remove the keys from servers or accounts. Several of the keys I found had not even been revoked. I believe my research projects one major headline for developers: in the event of a mistake, simply removing the private key is not enough. With a few more simple steps, however, this kind of enumeration can be stopped dead in its tracks.
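Adding a passphrase to an existing key takes a single ssh-keygen command. The demo below works on a throwaway key in a temp directory; in practice you would point -f at your real key file:

```shell
# Demo in a temp dir: generate an unprotected key, then protect it.
dir=$(mktemp -d)
ssh-keygen -t rsa -b 2048 -N '' -f "$dir/id_rsa" -q

# -p changes the passphrase in place: -P gives the old (empty) passphrase,
# -N sets the new one.
ssh-keygen -p -f "$dir/id_rsa" -P '' -N 'a long, strong passphrase' -q
```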
My scraper can be found on my GitHub for testing. I encourage you to test this technique and learn how to scrub repositories. I created a simple repository to test my script against and to practice repository scrubbing. You can find these repositories here:
To scrub a repository's history of sensitive information, check out BFG Repo-Cleaner.