By Kenneth Nero
There is little more interesting, and infuriating, in academia than encountering a problem for which there seems to be no solution. Among those who have taken CSEC-380 at RIT, it’s not an uncommon belief that homework 3 is one of the most difficult assignments of the course. Without giving too much away, this homework requires students to build and utilize their own web-based requests library for a variety of tasks. Whether one is dealing with misconfigured caching, properly parsing mis-attributed response codes, or understanding the intricacies of how HTTP header flags interact with one another, things can get understandably frustrating; web protocols are notorious for their uneven implementation by different developers. I want to take you on a brief and somewhat meandering journey down the rabbit hole of RIT’s questionable web configuration and demonstrate why – according to my hypothesis – proper maintenance of one’s configuration files is paramount.
The focus of our attention today will be this site: https://www.rit.edu/ntid/alumninews/. I would first like to credit the site’s designer, whom I’ve been in contact with for the latter half of this process, for doing a fantastic job of merging two content delivery platforms – WordPress and Drupal – into a digestible and informative site for NTID’s alumni. With pleasantries out of the way, I want to quote verbatim part of 380’s homework 3 document for all to read:
…Your agent is getting pretty powerful, but it is still only limited to processing one page, websites consist of hundreds of pages. Modify your user-agent to be able to crawl an entire website…
This aspect of the assignment asked students to crawl an entire website, following links to some arbitrary depth. With crawling links came the impetus for today’s post – following redirects. Typically, following redirects is a simple feat: if you receive a 301 or 302 response, you look into the response headers, snag the Location URL, and send the same request you just made (with minor modifications to the Host header and whatnot) over to the new location. Simple, right?
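The Location-handling step described above can be sketched roughly like this. This is a minimal illustration, not the actual homework code – the function name and the header handling are my own:

```python
from urllib.parse import urljoin

def next_location(status, headers, current_url):
    """Given a response's status code and header dict, return the absolute
    URL to follow next, or None if the response is not a redirect."""
    if status not in (301, 302):
        return None
    # Header names are case-insensitive, so normalize before looking up.
    location = {k.lower(): v for k, v in headers.items()}.get("location")
    if location is None:
        return None
    # The Location value may be relative; resolve it against the current URL.
    return urljoin(current_url, location)
```

A hand-rolled requests library would then rebuild the outgoing request – new Host header, new path – from whatever this returns.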
No. Not simple.
Not simple at all.
Because sites like https://www.rit.edu/ntid/alumninews/ exist.
With the most basic HTTP/1.1 headers attached to a GET request (Host, Accept, Content-Type, and User-Agent), a client capable of basic redirect following but with no support for cookies will be thrown into a loop by this site. If you request the HTTP version of the site, it will redirect you to the secure variant; should you request the secure variant, you’ll find yourself redirected back to the HTTP address until a cache is established. This is suboptimal for many reasons, not the least of which being that it caused my web crawler to spectacularly crash and burn when this site – and at least 20 other RIT sites I’ve yet to adequately identify – started hogging all available threads, causing what was already a 3-hour program to loop indefinitely. The issue was solved by adding a simple redirect counter which would bail after 6 or 7 loops, but a notable quirk was that this website was perfectly reachable through a modern browser. With the complaint-based exposition out of the way, let’s briefly analyze what our browser is seeing (all relevant .har files will be made available at the end).
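The redirect-counter fix is simple enough to sketch. Here `fetch` is a hypothetical stand-in for whatever single-request primitive the homework library exposes, and for brevity the sketch assumes Location values are already absolute:

```python
class TooManyRedirects(Exception):
    """Raised when a redirect chain never settles on a final response."""

def fetch_following_redirects(url, fetch, max_redirects=7):
    """Follow 301/302 responses, giving up after max_redirects hops.

    `fetch` is any callable taking a URL and returning a
    (status, headers, body) tuple for a single request.
    """
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status not in (301, 302):
            return status, headers, body
        url = headers["Location"]
    raise TooManyRedirects(f"gave up after {max_redirects} redirects at {url}")
```

Against a site like the one above, which just ping-pongs between schemes, this bails quickly instead of tying up a crawler thread forever.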
Figure 1. Initial request to website: 301 redirect from https to http
Figure 2. First 302 redirect to HTTPS from HTTP
Figure 3. Second 301 redirect from HTTPS to HTTP
Figure 4. Second 302 redirect from HTTP to HTTPS
Figure 5. Functional HTTPS request, session ID and redirect count varying from first
With the data before me, three main questions came to mind regarding this back and forth: does the upgrade-insecure-requests header actually make any difference, why are there so many Set-Cookie headers in the last response, and why am I getting two different kinds of redirects (302 vs. 301)? I assumed at this point that the issue might very well be on my end and went about testing my first theory through ZAP, utilizing breakpoints and manual editing of the requests. Once the upgrade-insecure-requests header was removed, nothing changed and the same back and forth we saw prior continued to occur. Now assured that the issue wasn’t caused by attempted upgrades by Firefox, my attention turned to potential server-side issues.
Through examination of the responses received, the server type is clearly Apache, giving me a decent place to begin. After some research into what controls both the redirect status returned for a given page and the setting of cookies on Apache servers, I found the answer which would eventually lead me to my hypothesis – the .htaccess file. It would be amiss, however, to assume only one .htaccess file is in play; my best guess for the problem being experienced above is as follows: two .htaccess files exist – one in the directory /ntid/, and one in /ntid/alumninews/. Each file is capable of setting cookies and issuing redirects to specific pages, but a quirk exists in how such files are processed: if an overriding .htaccess file modifies the URL to be visited, the next queued directives and subsequent .htaccess files downstream of the directory will assess the changed URL instead of the original one.
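To make the hypothesis concrete, here is the kind of mod_rewrite rule pair that could produce the observed ping-pong. These fragments are entirely hypothetical – the actual directives on RIT’s server are unknown to me – but rules of this shape would behave as described:

```apache
# /ntid/.htaccess (hypothetical) – push traffic down to HTTP with a 301
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^alumninews/ http://www.rit.edu/ntid/alumninews/ [R=301,L]

# /ntid/alumninews/.htaccess (hypothetical) – push traffic back to HTTPS with a 302
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://www.rit.edu/ntid/alumninews/ [R=302,L]
```

Note that in per-directory context the [L] flag only ends the current pass; the client re-requests the rewritten URL and the other file gets its turn, which is exactly how two files can hand a request back and forth indefinitely.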
This behavior would explain the alternating 301/302 responses our browser experienced, as the root file would redirect one way and the subdirectory’s another. As for the cookie-setting behavior and why our final response sets a duplicate cookie: it is unlikely that the .htaccess file responsible for setting these values was ever intended to be hit twice (or deferred to by its own sub-file), and thus its cookie-write directives contain no rewrite clauses or checks for the presence of previously set cookies. To say this behavior is harmless would be misleading, as the session ID intended to stay within the HTTPS space is persisted across HTTP redirects – opening the potential for session-hijacking exploits should an attacker be monitoring network traffic. A simple fix would be to add an END condition within the subdirectory’s .htaccess file to prevent the parent’s rules from reassessing the modified URL. Alternatively, one could remove the subdirectory’s .htaccess file altogether and simply fold its contents into the parent’s, alleviating any ambiguity at the cost of a larger single file.
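Under the same hypothetical rule set, the fix could look something like this – again a sketch, assuming Apache 2.4’s mod_rewrite (the END flag does not exist in 2.2), and with SESSID and its value as illustrative placeholders since real session values would come from the application:

```apache
# /ntid/alumninews/.htaccess (hypothetical fix)
RewriteEngine On

# Stop ALL rewrite processing once we've settled on the HTTPS URL,
# so the parent /ntid/.htaccess never reassesses the rewritten request.
RewriteCond %{HTTPS} off
RewriteRule ^ https://www.rit.edu/ntid/alumninews/ [R=302,END]

# Only set the session cookie when one isn't already present, and mark
# it Secure (the second-to-last field) so it never rides a plain-HTTP hop.
RewriteCond %{HTTPS} on
RewriteCond %{HTTP_COOKIE} !SESSID=
RewriteRule ^ - [CO=SESSID:placeholder:.rit.edu:0:/:1:1]
```

The END flag addresses the loop, the HTTP_COOKIE guard addresses the duplicate Set-Cookie headers, and the Secure field addresses the session-exposure concern all at once.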
As I was about to submit this blog post, the designer of the site got back to me with, in part, the following message:
Normally, I believe this would throw my initial theory right out the window; however, the lack of a .htaccess file in both locations doesn’t necessarily mean one doesn’t exist within the WordPress structure itself – simply that it is not visible to the individuals managing the filesystem. The architects of this configuration may have instead used a built-in plugin or other functionality to determine how the website reacts to incoming traffic, reducing the visibility of the configuration in exchange for ease of access.
Regardless of whether such a file is present, knowing the WordPress sites are soon to be decommissioned and moved to Drupal brings me a level of relief, as this behavior is likely to be addressed with the move. I hope this post has been informative, entertaining, and maybe even slightly helpful for any other individuals experiencing similar redirect-related issues in their own Apache environments.