By Daniel Monteagudo
With the proliferation of cloud computing, containerization has come to the forefront as a system for creating repeatable and easily portable deployments of code that can be run practically anywhere. The system that has been taking a leading role in this containerization revolution has been Docker. This container technology represents its configuration in plain text files called Dockerfiles.
Building these Dockerfiles is an iterative process, and the vast majority of Dockerfiles build on lower-level images that are themselves defined by other Dockerfiles. This means that many high-level Dockerfiles are a house of cards built upon layers and layers of lower-level Dockerfiles. Considering this led me to wonder whether any of those lower-level Docker images contain sensitive information or defaults such as passwords or private keys.
With the earlier paragraphs in mind, the idea I had was to enumerate a large number of Dockerfiles somehow, procedurally download them, and search them for keywords such as ‘password’, ‘key’, etc. I then planned to perform some data analysis to determine how prevalent a problem plaintext passwords in Dockerfiles actually are.
The first step of this process was to aggregate these Dockerfiles. To do this, I turned to a neat piece of Google tech called BigQuery, which allows you to query terabytes of data in seconds. One interesting bit of information about BigQuery is that it includes a fairly wide and constantly growing array of public datasets. Conveniently, one of these datasets contains seemingly every commit on GitHub. Querying this information was surprisingly easy, since BigQuery understands SQL. The query in BigQuery is shown below.
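The original query is not reproduced in this text, so here is a rough sketch of what such a query might look like, wrapped in Python so it can be reused by a download script. The table name `bigquery-public-data.github_repos.files` and its `repo_name`, `ref`, and `path` columns are my assumption about which public dataset was used, the 100,000 limit is taken from the sample size mentioned later in the post, and both function names are hypothetical.

```python
RAW_HOST = "https://raw.githubusercontent.com"


def build_dockerfile_query(limit=100000):
    """Build a SQL string that lists files whose path ends in 'Dockerfile'.

    The table and column names are assumptions, not the author's query.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT repo_name, ref, path\n"
        "FROM `bigquery-public-data.github_repos.files`\n"
        "WHERE path LIKE '%Dockerfile'\n"
        f"LIMIT {int(limit)}"
    )


def raw_url(repo_name, ref, path):
    """Turn one result row into a raw-file URL that can be downloaded."""
    prefix = "refs/heads/"
    # Strip the git ref prefix so only the branch name remains.
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    return f"{RAW_HOST}/{repo_name}/{branch}/{path}"
```

The string returned by `build_dockerfile_query()` can be pasted into the BigQuery console, and `raw_url` applied to each row of the exported CSV.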
To find these files, I queried for every file on GitHub whose path ended in ‘Dockerfile’. Once I downloaded the results as a CSV, the fun data analysis began. For a preliminary analysis, I wrote a Python script that requests all of the Dockerfiles listed in the BigQuery results and searches them for offending words like ‘password’, ‘passphrase’, ‘secret’, ‘private’, and ‘key’. The code then outputs a CSV file in which every row contains a URL, the offending word found in the file, and the line containing that word. This allows us to analyze which offending words show up most often in files, as well as which Dockerfiles have the most mentions of passwords. A screenshot of the Python code as it stands is shown below, but I will upload the code itself to GitHub once I fix the ridiculous variable names.
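The scanning step described above can be sketched roughly as follows. This is not the script from the screenshot, just a minimal illustration that works on Dockerfile text already in memory; the function names and the CSV column headers are hypothetical, and the matching is a simple case-insensitive substring check, which is an assumption on my part.

```python
import csv
import io

TRIGGER_WORDS = ("password", "passphrase", "secret", "private", "key")


def find_trigger_lines(url, dockerfile_text, words=TRIGGER_WORDS):
    """Return (url, word, line) tuples for every trigger word on every line."""
    rows = []
    for line in dockerfile_text.splitlines():
        lowered = line.lower()
        for word in words:
            # Substring match, so DB_PASSWORD and private_key.pem both count.
            if word in lowered:
                rows.append((url, word, line.strip()))
    return rows


def rows_to_csv(rows):
    """Serialize the matches as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url", "word", "line"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Note that a line such as `ADD private_key.pem /root/` produces two rows, one for ‘private’ and one for ‘key’, which is why counting unique URLs rather than rows matters later on.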
Once this code ran, I opened up the file in RStudio to inspect the output data. An initial count of the unique URLs in the dataset showed that 18,969 of the initial 100,000 Dockerfiles, about 19 percent, were flagged for containing at least one of the trigger words.
19% of all Dockerfiles containing hardcoded credentials or keys would be a pretty dramatic finding, so I’m tempted to stop there. However, it is pretty clear that at this point we have a fair number of false positives, most likely from people using environment variables whose names contain the word ‘key’ or ‘password’. To try to determine the number of actual passwords stored, I did some manual poring through the data, and it became quite clear to me that the vast majority of these Dockerfiles contained no actual passwords; they generally just contained mentions of environment variables or references to other files on the system. I suspect that some of those referenced files might also be on GitHub, but I didn’t have time to crawl the file tree that way.
To separate actual passwords from random strings containing the word ‘password’, I decided to use the entropy of the flagged line. The entropy of the whole line yielded confusing results, probably because other symbols in the line made it seem more random than it was. To remedy this, I split each line into substrings and used the maximum entropy of any single substring. This yielded much better results, and I could quickly look at the data set and see a fair number of SSH keys and even some API keys (!!). Unfortunately, there was fairly little in the way of actual passwords in any of the data I looked at.
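As a rough illustration of the two measures, here is a minimal sketch that assumes per-character Shannon entropy in bits and a split on whitespace, quotes, `=` and `:`. The post does not say which entropy formula or which delimiters were actually used, so both are assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def shannon_entropy(text):
    """Per-character Shannon entropy of text, in bits."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def max_token_entropy(line):
    """Split a line into tokens and return the highest token entropy."""
    # The delimiter set is an assumption: whitespace, quotes, '=' and ':'.
    tokens = [t for t in re.split(r"[\s=:\"']+", line) if t]
    return max((shannon_entropy(t) for t in tokens), default=0.0)
```

With this definition a token such as `abcd` scores 2 bits while `aaaa` scores 0, so a long random-looking key stands out from an ordinary word even when the rest of the line is plain Dockerfile syntax.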
I plotted the entropy of the strings against their max entropy below, mainly to show the distribution of trigger words. It looks like the vast majority of the data involves the word ‘key’, so next time I should most likely focus more on the key results.
Stating how many of the Dockerfiles actually contain passwords is tricky, but after looking at the data for a while, I think it is fair to say that nearly everything with either a max_entropy or a total entropy above 5 contains a password or a key. Filtering on these criteria leaves only 512 Dockerfiles at fault. This means that, assuming our sample is representative of all Dockerfiles, roughly 0.5% of all Dockerfiles contain either a default credential or a hardcoded key.
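The filter and the resulting percentage can be expressed in a few lines. This is only a sketch: the original analysis was done in RStudio, and the dictionary keys `url`, `max_entropy`, and `total_entropy` are hypothetical column names.

```python
def flagged_urls(rows, threshold=5.0):
    """Return the set of URLs with any row above the entropy threshold."""
    return {
        row["url"]
        for row in rows
        if row["max_entropy"] > threshold or row["total_entropy"] > threshold
    }


def share_flagged(flagged_count, sample_size):
    """Percentage of the sample that was flagged."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 100.0 * flagged_count / sample_size
```

Plugging in the numbers from the post, `share_flagged(512, 100000)` gives about 0.51 percent, against about 19 percent for the 18,969 Dockerfiles flagged by keyword alone.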
The results of this research were not as bad as I had initially expected. If only 1 in 200 Dockerfiles contains something it shouldn’t, that means the authors of the other 199 are doing at least an okay job writing secure configurations for their containerized services. The two most common mistakes I found in Dockerfiles were hardcoding SSH keys and hardcoding API keys, which are both pretty prevalent problems in the rest of software development, so they weren’t particularly surprising. I was a little bit disappointed at the lack of plaintext passwords I found, but I noticed that a fair number of Dockerfiles specified an empty password, so perhaps that would be worth researching further.
Ideally, given more time, I’d love to expand on the research outlined in this blog post to do a more comprehensive search for passwords and actually try to create a dependency tree for the flawed Dockerfiles I did find. In addition, I would have liked to run the analysis on a much larger sample set and potentially assign a reputation score to each Dockerfile based on GitHub statistics such as the number of forks or contributors, or based on the number of other Dockerfiles in the data set that reference it.