By Zachary Hilton –
Typically, search algorithms aim for precision and optimization when iterating over the search space. This tends to work well in many situations, especially when the search space is small enough, but when that space is infinitely vast, how do you find anything?
The Basic Brute Force Approach
To many, the answer is straightforward: brute force. We pick a starting point and iterate from there. For passwords, this is well defined: we start at the lowest character count and cycle through all possible character combinations. To speed things up, we add basic intelligence, trying the most likely passwords first before falling back to general brute force. This is done in a number of ways, but one of the most common is working through a list of words and known passwords before moving on to brute-force iteration. This method is incredibly effective against passwords, but is often overlooked in other complex search spaces. One search space of particular interest is mapping a website on the internet. Some tools already exist for this, but they fail to act as intelligently as they could.
Of note is DirBuster, a tool that takes a list of directory and file names and attempts to find paths that are not linked on a web server. This is about as close as we come to our dictionary attack on passwords, but it can still go further. DirBuster not only checks the directories and files in its list; it also fuzzes the URL, adding some basic randomization. Though this helps find lazy attempts at hiding URLs, it fails to account for the unknown, never approaching the problem of new directories and files that are not on the list.
Adding Some Intelligence
With that in mind, we can add varying levels of intelligence to our brute force. First, we can make it more predictive, analyzing the known list of directories and files and generating likely new ones. To take it a step further, we can instead analyze the very names of the directories to predict likely future names. Though this may sound complicated, it is actually fairly simple to do with a construct known as Markov chains.
Markov chains are a way to move from one state to another by chaining probabilities together. This can be done simply, with no regard for the past, or by using knowledge of past states to dynamically modify the probabilities of future states. In short, Markov chains let us model the transitions from one state to another probabilistically. In our case, these states can be a number of things, each altering the way our search behaves.
One possible state is the directory. We can split the URL on the directory separator (typically ‘/’) and use each directory or filename as a state. We then build a table of the transitions, recording the probability of one directory following another across all observed URLs. To generate new URLs, we start at a random directory and follow its path, using random numbers to decide which transition to take, until the path terminates naturally. Each time we do this, we generate a URL that should look very similar to those we fed into the generator.
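A minimal sketch of this directory-level chain might look like the following. The training URLs, the `<START>`/`<END>` sentinel states, and the function names are all illustrative assumptions, not part of any existing tool:

```python
import random
from collections import defaultdict

def build_chain(urls):
    """Count transitions between consecutive path segments."""
    chain = defaultdict(lambda: defaultdict(int))
    for url in urls:
        segments = [s for s in url.strip("/").split("/") if s]
        # Sentinel states mark where paths begin and end (an assumption
        # of this sketch, so generation can terminate naturally).
        states = ["<START>"] + segments + ["<END>"]
        for cur, nxt in zip(states, states[1:]):
            chain[cur][nxt] += 1
    return chain

def generate(chain):
    """Walk from <START>, choosing each successor by observed frequency."""
    path, state = [], "<START>"
    while True:
        choices, weights = zip(*chain[state].items())
        state = random.choices(choices, weights=weights)[0]
        if state == "<END>":
            return "/" + "/".join(path)
        path.append(state)

# Hypothetical training set scraped from the target
training = ["/admin/login.php", "/admin/users.php", "/static/css/main.css"]
chain = build_chain(training)
print(generate(chain))
```

With such a small training set the output simply resembles the inputs; the recombination effect only becomes interesting once many URLs share common segments.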
A more interesting state, though, is the character. This technique is commonly used in natural language processing to generate likely words. First, choose how many characters of history to retain, then build a similar transition table, this time over the characters of all URLs (including the directory separators), and let it generate a string. The output will likely be a URL that differs from our training URLs but still looks like a normal URL. If we keep the history to just a few characters, the table stays small enough to hold in memory and can be called repeatedly to generate a seemingly endless supply of plausible URLs. To take this concept a step further, we can replace all zero-probability transitions in the table with some arbitrarily low value, such as 0.001 (or lower, depending on the data set), artificially injecting the ability to create new paths. This produces more invalid URLs, which should be screened out, but adds fuzzing on a broader scale while still maintaining a semblance of normality.
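The character-level variant, including the low-value floor for zero-probability transitions, could be sketched like this. The three-character history, the `^`/`$` padding characters, and the floor value are assumptions of this sketch:

```python
import random
from collections import defaultdict

ORDER = 3  # characters of history to retain

def build_char_chain(urls, order=ORDER):
    """Map each `order`-character history to counts of the next character."""
    chain = defaultdict(lambda: defaultdict(int))
    for url in urls:
        padded = "^" * order + url + "$"  # "^" pads the start, "$" marks the end
        for i in range(len(padded) - order):
            chain[padded[i:i + order]][padded[i + order]] += 1
    return chain

def generate(chain, order=ORDER, floor=0.0):
    """Sample one character at a time until the end marker. A floor > 0
    stands in for the zero-probability transitions, letting any seen
    character follow any history (broader fuzzing, more invalid URLs)."""
    alphabet = {c for nexts in chain.values() for c in nexts}
    history, out = "^" * order, []
    while True:
        counts = dict(chain.get(history, {}))
        if floor > 0:
            for c in alphabet:
                counts.setdefault(c, floor)
        chars, weights = zip(*counts.items())
        nxt = random.choices(chars, weights=weights)[0]
        if nxt == "$":
            return "".join(out)
        out.append(nxt)
        history = history[1:] + nxt
```

Calling `generate(chain)` with the default `floor=0.0` stays within transitions seen in training, while `generate(chain, floor=0.001)` enables the broader fuzzing described above.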
Ok, so now we can generate as many URLs as we wish. They will probe deeper than DirBuster and work off the premise that humans tend to use similar schemes throughout language, allowing us to predict unknown directories not found in our training set by dynamically generating strings that match our target's URL “language.” What more can we do? Well, what about parameters and file extensions? We can use similar methods for those too. File extensions can be predicted by analyzing the ones we have already seen, letting us tailor naming schemes to the known supported languages, such as PHP, ASP, JS, CSS, etc. Parameters are not much different: we can take known values for the left side of a parameter and known values for the right side and fuzz those as well.
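The parameter side of this can be as simple as recombining the two halves. A sketch, where the observed query strings are entirely hypothetical:

```python
import itertools

# Hypothetical parameters observed while crawling the target
observed = ["id=42", "page=home", "id=7", "lang=en"]

# Split each into its left side (name) and right side (value)
names = {p.split("=", 1)[0] for p in observed}
values = {p.split("=", 1)[1] for p in observed}

# Recombine every known name with every known value to fuzz
# combinations the crawl never showed us directly
candidates = ["?{}={}".format(n, v)
              for n, v in itertools.product(sorted(names), sorted(values))]
print(candidates[:3])
```

A Markov table over the names and values, as with directories, would refine this further than the plain cross product shown here.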
Putting this all together, we can either build our own version of DirBuster or modify the source code of the ZAP add-on to generate more likely URLs, increasing the number of discovered URLs before we have to resort to general brute force. To make the system even more powerful, we can add every discovered path to the base list and use it to further refine our table, approaching an ever more accurate model of the URL “language” as it is used today.