Testing a Metric Against College Websites’ Disclosure of Faculty CII

By Aaron Karenchak


            In an age where people post everything about themselves online, down to what they ate that day, this information is easy to access and abuse. Web crawling is an automated, if time-consuming, information-gathering technique with applications that span both good and bad practices. On one end, search engines use their own web crawlers to discover and index pages. On the other end, malicious adversaries use the same tools to scope out information that could be used for nefarious purposes.

I created a crawlability metric that defines how readily available contact identifiable information (CII) is on institutions’ websites. The metric is defined by how much CII is found in relation to the number of sites visited and the time required to do so. From the results, an attacker might prefer to use a web crawler against an institution that has a higher crawlability metric, since a higher metric means more CII for the same cost.


            I hypothesized that the crawlability metric of an institution’s website would be proportional to the size of the institution. It seems reasonable that the more people an institution has, the more contact information its site would hold and the more lax its anti-crawling measures would be.

Data Collection

            I gathered three types of data that can be classed as contact identifiable information: phone numbers, email addresses, and location addresses. Location addresses include anything that can be identified as an address, whether a home address or a work address. I gathered the data from a small list of colleges of varying sizes.

            I used a modified Python web crawler from homework 3 that automates gathering the contact identifiable information for each institution on the list. The crawler uses sockets to connect to each website and records the information returned by custom HTTP GET requests. I collected data for around two months.
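To make the socket-based approach concrete, here is a minimal sketch of a hand-built HTTP GET request sent over a socket. It is not the homework 3 crawler; the function names, the User-Agent string, and the TLS handling are assumptions made for illustration only.

```python
import socket
import ssl


def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "User-Agent: example-crawler/0.1",  # hypothetical identifier
        "Accept: text/html",
        "Connection: close",  # ask the server to close so recv() ends
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (status_line, body_text)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    # Note: a chunked body is returned as-is; this sketch does not decode it.
    return status_line, body.decode("utf-8", errors="replace")


def fetch(host, path="/", use_tls=True, timeout=10):
    """Send one GET request over a socket and return (status_line, body)."""
    port = 443 if use_tls else 80
    raw_sock = socket.create_connection((host, port), timeout=timeout)
    if use_tls:
        context = ssl.create_default_context()
        sock = context.wrap_socket(raw_sock, server_hostname=host)
    else:
        sock = raw_sock
    with sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Calling `fetch("example.edu", "/faculty")` (a placeholder host and path) would return the status line and the HTML body, and that body is what the CII extraction step works on. A real crawler would also need redirect handling, politeness delays, and respect for robots.txt.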

            I ran regular expressions over the HTML returned by the GET requests to extract the CII I wanted. The three patterns I settled on were:

Emails: re.compile(r"[\w\d.]+@[\w\d\.]+\.[a-zA-Z]{2,}")

Phone Numbers: re.compile(r"(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4})")

Locational Addresses: re.compile(r"(?:One|two|three|four|five|six|seven|eight|nine|\d{1,4}) [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|park|parkway|pkwy|circle|cir|boulevard|blvd)(?=\s|\.|<br ?\/>|$)", re.IGNORECASE)

The email and phone number regexes did a great job of filtering out the bad matches and keeping the good ones. In contrast, the address regex was good at finding real addresses but also caught text that is not an address, because ordinary English phrases can have the same shape: a number, a few words, and then a word such as ‘park’ or ‘court’.
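As a rough illustration of how the three patterns behave, the sketch below applies them to a small HTML fragment. The fragment, the names in it, and the `extract_cii` helper are made up for this example and are not part of the collected data.

```python
import re

EMAIL_RE = re.compile(r"[\w\d.]+@[\w\d\.]+\.[a-zA-Z]{2,}")
PHONE_RE = re.compile(
    r"(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4})"
)
ADDRESS_RE = re.compile(
    r"(?:One|two|three|four|five|six|seven|eight|nine|\d{1,4}) [\w\s]{1,20}"
    r"(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr"
    r"|court|ct|park|parkway|pkwy|circle|cir|boulevard|blvd)"
    r"(?=\s|\.|<br ?\/>|$)",
    re.IGNORECASE,
)


def extract_cii(html):
    """Return the emails, phone numbers, and addresses found in an HTML string."""
    return {
        "emails": EMAIL_RE.findall(html),
        "phones": PHONE_RE.findall(html),
        "addresses": ADDRESS_RE.findall(html),
    }


# Made-up fragment: one real-looking address and one sentence that only
# looks like an address to the pattern.
sample_html = (
    "<p>Contact: jdoe@example.edu or (555) 123-4567</p>"
    "<p>123 Main Street<br />Parking is 5 minutes from the park.</p>"
)

found = extract_cii(sample_html)
for kind, matches in found.items():
    print(kind, matches)
```

On this fragment the address pattern picks up both "123 Main Street" and the phrase "5 minutes from the park", which is the kind of false positive described above.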

            In addition to the three kinds of CII, I recorded the average time it took to run the script and the average number of sites visited per run. These two values are labeled the ‘cost’ of getting the CII, and the cost is the divisor in the metric calculation. Time is an obvious cost, but the number of sites visited is also a cost: to gather the data, the crawler has to load each individual site, which takes time and memory since the crawler uses multithreading.


            A wide variety of data was gathered over the two months. The general trend was that the amount of CII gathered grew in proportion to the size of the student body, as shown in Figures 1-3 below.

Figure 1: Phone Numbers

Figure 2: Emails

Figure 3: Locational Addresses

            The crawlability metric was calculated by summing the different CII gathered and dividing by the cost, which was the time spent and the number of sites visited. Most of the calculated metrics fell between 0 and 2, with an average of 0.911. In Figure 4, the metric values for schools with fewer than 20,000 students are scattered fairly evenly between 0 and 2, but as the student body grows, the metric levels out at around 0.8. A lower metric means it costs more to get the same amount of information, while a higher metric means you get more out of the same cost. My conclusion is that for schools with smaller student bodies it is hard to predict how easy it is to gather CII, but as the college gets larger, the metric converges to a common value. Larger colleges generally have a lot of data available to gather, but the cost of gathering it, in time and memory, is proportionally large as well. That common value is close to the overall average, so gathering the data is neither especially easy nor especially hard.
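The calculation can be sketched in a few lines. The text does not spell out how time and sites visited are combined into a single cost, so this sketch assumes they are simply added. That choice, the function name, and the sample numbers are all assumptions for illustration, not values from Table 1.

```python
def crawlability_metric(phones, emails, addresses, avg_seconds, avg_sites):
    """Total CII found divided by the cost of finding it."""
    total_cii = phones + emails + addresses
    # ASSUMPTION: cost is the sum of run time and sites visited.
    # The original write-up does not state how the two are combined.
    cost = avg_seconds + avg_sites
    if cost <= 0:
        raise ValueError("cost must be positive")
    return total_cii / cost


# Hypothetical counts for one institution (not taken from the collected data).
example_metric = crawlability_metric(
    phones=120, emails=300, addresses=30, avg_seconds=250.0, avg_sites=250
)
print(round(example_metric, 3))
```

Under this assumption, an institution yielding 450 items of CII at a combined cost of 500 scores 0.9, close to the reported average of 0.911.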

Looking back at my hypothesis, I was correct that institutions with a larger student body would have proportionally larger amounts of CII, but I was wrong to believe that larger institutions would have a larger metric. It is more accurate to say that the metric stabilizes toward an average value as the sampled institution gets larger.

Figure 4: Metric Graph

            In conclusion, from an attacker’s point of view, institutions with a smaller student body have an unpredictable cost for collecting data, while the cost of crawling a larger institution is roughly known in advance. Larger institutions would more likely be the choice: not only is the cost known, but the adversary also gets proportionally more useful data for their nefarious means. Future work for this project would be either to expand the types of CII collected or, preferably, to sample a larger number of institutions. With a larger sample, I could get more concrete data on the value of my metric.


Table 1:  Data Gathered
