Using Python to Scrape Emails from Microsoft Outlook and to Determine Potential Malicious Phishing Websites

By Tyler Olexa

A Nigerian Prince is willing to offer you millions of dollars if you just click this link. What do you do? Well, you click, obviously; ‘millions of dollars’ is a lot of money. Pretty soon your computer is filled with ransomware and running significantly slower, and you blame the dang ol’ videogames your son has been putting on your laptop. Little did you know, the Nigerian Prince is actually a man named Tom, sitting in the basement of a Motel 6, and definitely not giving you millions of dollars.

Phishing has become a major problem for both the corporate world and everyday email users, and the number of phishing emails received per year continues to rise. Companies like Facebook, Amazon, and most banking websites have repeatedly been in the news for being impacted by large phishing campaigns. [3] As phishing tactics become more complicated and more companies are affected, the need for a better way to analyze and stop phishing emails is higher than ever. This post explains one way to gain more insight into phishing emails by scraping them for links that could be malicious. This is not a thoroughly tested approach and the results will not always be accurate, but the ideas behind the system are grounded in facts.

There are several good methods for scanning links and files for malicious content. In this example, I use ‘virustotal.com’ and ‘urlscan.io’ to learn about each site submitted. VirusTotal gives a limited amount of information: simply whether other scanners track the site as clean or malicious. URLscan gives more in-depth information, such as where the servers hosting the website are located, HTTPS usage on the site, and smaller statistics like whether the site was flagged as malicious or how many ads were blocked on the page.

The benefit of this program is a hassle-free way to pull emails out of Microsoft Outlook and effortlessly scan them in Python. To start, Microsoft Outlook must be configured to run macros, as we will use a Visual Basic macro embedded in Outlook to download the emails to our PC.

To enable macros:

  1. Open Outlook and open the Options tab.
  2. Go to the Trust Center tab and click Trust Center Settings.
  3. Click Macro Settings and enable macros. Note that enabling all macros does potentially create a security concern, so enabling notifications for macros is more secure.

At this point, we can create our macro to download emails from Outlook. To access the Developer tab in Outlook, you must first make it visible. This is done by going into the Outlook Options and customizing the ribbon to show the Developer tab as seen below.

Once this is enabled, you can access the Developer tab and create macros. From the Developer tab, you can click Visual Basic to bring up your VBA project. I created a module by right-clicking on my currently open project and wrote code to download the currently selected email. With a bit of help from the interwebs, I was able to successfully create this macro. [5] Note that this code currently downloads only the one selected email, but it can be modified to download more.

Once this code is saved, you can run the macro by selecting it from the developer tab after selecting an email. Luckily, we have two emails that we are going to be taking a look at today with the subjects ‘Promo’ and ‘More_Promos’. By running our macro, these two emails will be downloaded as HTML files.

These two emails are pretty unique as they hold HTML that we can easily analyze in Python using the BeautifulSoup library. This method can scan any email saved as HTML, and extending it to handle other email formats would also be possible. These two emails are formatted to look and feel like a phishing email (with a little sarcasm, of course). The ‘Promo’ email contains a harmless link to ‘www.google.com’ while the ‘More_Promos’ email contains a link to a malicious website, taken from a site containing a long list of malicious sites for testing purposes. [6]

DO NOT GO TO THE LINK WE USE IN THIS EXAMPLE!

Now that we have our emails downloaded and secured in a folder, we can hop into our Python code and begin to determine how we will analyze them. To start, we have to create accounts for virustotal.com and urlscan.io. This can be done by simply providing an email and password. This gives access to an API key that can be found in your profile on each site. These API keys provide access to a REST API where we can query their database with links and files. To provide a bit of security in this program, I store the credentials in a file ‘credentials.txt’ so the program can pull them in.
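As a rough sketch of that step, assuming ‘credentials.txt’ holds the VirusTotal key on its first line and the urlscan.io key on its second (the order the full script at the end of this post reads them in), the keys could be loaded like this. The function name load_api_keys and the dictionary keys are made up for illustration.

```python
from pathlib import Path


def load_api_keys(path="credentials.txt"):
    """Read two API keys from a text file, one per line.

    Assumed layout (hypothetical, matching the order used in this post):
    line 1 = VirusTotal key, line 2 = urlscan.io key. Blank lines are skipped.
    """
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected two API keys, found {}".format(len(lines)))
    return {"virustotal": lines[0], "urlscan": lines[1]}
```

Failing early when a key is missing makes a bad credentials file easier to spot than a confusing error from one of the APIs later on.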

These API keys are then passed into the respective functions for VirusTotal and URLscan so we can use them to query the websites. To start, we will extract the links from the emails using the BeautifulSoup library. We will start with the first email, Promo. When we run our parse() function, we will be able to extract ‘http://www.google.com’ from the email. If there are other links in the email, it will extract and scan these as well.
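To illustrate the idea without any third-party packages, here is a small sketch that swaps BeautifulSoup for the standard library's html.parser. It is not the parse() function from my script; the class name LinkCollector, the helper extract_links, and the sample email body are all invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up email body for illustration.
sample = '<html><body><p>Dear Customer, <a href="http://www.google.com">click</a> here!</p></body></html>'
print(extract_links(sample))  # ['http://www.google.com']
```

Anchor tags without an href, such as named anchors, are skipped, which mirrors the href=True filter in the BeautifulSoup version.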

Now that we have the links from the email, we can submit them to VirusTotal and URLscan. We’ll call our VirusTotal function first. This sends a GET request to the website, passing in our API key and the link we want to query. VirusTotal, as mentioned before, responds with a limited amount of information: simply whether other scanners track the site as clean or malicious. We store the number of positives, meaning the number of scanners that found the site malicious, and the total number of scanners so we can compute a percentage.
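The percentage itself is simple arithmetic. The sketch below works on a plain dictionary shaped like the ‘positives’ and ‘total’ fields my script reads, so it runs without an API key. The helper name summarize_vt and the numbers in fake_report are made up, and the guard against a zero total is my own addition.

```python
def summarize_vt(report):
    """Return (positives, total, percent) from a VirusTotal-style report dict."""
    positives = report.get("positives", 0)
    total = report.get("total", 0)
    # Avoid dividing by zero when no scanners reported on the link.
    percent = positives * 100 / total if total else 0.0
    return positives, total, percent


# Made-up numbers, not real scan results.
fake_report = {"positives": 4, "total": 80}
print(summarize_vt(fake_report))  # (4, 80, 5.0)
```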

Shortly after, we query URLScan as well. This site is a bit more complex to use, as the API is less intuitive, but the information gleaned is quite useful for analyzing a website. The script submits the link with a POST request, waits 30 seconds for the scan to finish, and then fetches the result. We pull back a bunch of data, namely the countries the website originates from, whether or not the site is using HTTPS, what percentage of the site is secure, any malicious information found on the site, the number of ads blocked on the page, and whether the website originates from the United States. We’ll chat a bit more about countries further in the analysis. Once we have this data back, we can start to analyze it to predict whether the site and the email itself are malicious. Since URLScan is a newer site and is not fully maintained, there are times when it crashes and returns garbage results, so I put the request itself in a try/except block to keep it from crashing the entire script.
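Here is a sketch of the defensive part of that step, working on an already-downloaded result dictionary rather than a live request. The field paths follow the ones my script reads; the ‘adBlocked’ key is my assumption for where the ad count lives, and the function name extract_urlscan_stats and the sample data are invented for illustration.

```python
def extract_urlscan_stats(result):
    """Pull the fields used in this post out of a urlscan-style result dict.

    Returns None when expected fields are missing, so the caller can fall
    back to a VirusTotal-only analysis instead of crashing.
    """
    try:
        tls = result["stats"]["tlsStats"][0]
        return {
            "countries": tls["countries"],
            "is_https": tls["securityState"] == "secure",
            "malicious": result["stats"]["malicious"],
            "ads_blocked": result["stats"]["adBlocked"],  # assumed key name
            "secure_percentage": result["stats"]["securePercentage"],
            "is_us": result["page"]["country"] == "US",
        }
    except (KeyError, IndexError, TypeError):
        return None


# Made-up result for illustration, not real scan output.
fake_result = {
    "stats": {
        "tlsStats": [{"countries": ["US"], "securityState": "secure"}],
        "malicious": 0,
        "adBlocked": 2,
        "securePercentage": 100,
    },
    "page": {"country": "US"},
}
print(extract_urlscan_stats(fake_result))
print(extract_urlscan_stats({"stats": {}}))
```

Catching only the lookup errors, rather than every exception, keeps genuine bugs visible while still tolerating incomplete scan results.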

At this point, we can analyze the data sent back from these two scans. I check the fields we searched for and, if they meet certain criteria, I increment a ‘malicious total’ variable. At the end of the analysis, I divide the number of malicious criteria found by 9, the total number of criteria. This gives an estimate of the malicious intent of the email and website. Through my research, I found that phishing campaigns use a specific set of phrases to invoke fear or excitement in the user. [1] I check the body of the email for several of these phrases, such as ‘Verify Your Account’, ‘Failure to Respond within’, and ‘Click this link’. If any of these phrases are there, we tick our malicious criteria up by one. We also tick the malicious criteria up if the site is not using HTTPS, as this makes it a vulnerable site to begin with and puts your data at risk; if URLscan reports malicious content on the site; if the website has over 30 ads on the initial homepage; if 50% or less of the site is secured with HTTPS; and if VirusTotal reports the site as malicious, which counts for up to three criteria depending on how many scanners flag it.
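The scoring boils down to counting hits and dividing by nine. A minimal sketch of the phrase check and the final percentage follows; the names PHRASES, has_phishing_phrase, and malicious_percentage are made up, and matching case-insensitively is my own simplification rather than what the script does.

```python
PHRASES = ["verify your account", "failure to respond", "click this link"]
TOTAL_CRITERIA = 9


def has_phishing_phrase(body, phrases=PHRASES):
    """True if any phrase appears in the body, ignoring case."""
    lowered = body.lower()
    return any(p in lowered for p in phrases)


def malicious_percentage(hits, total_criteria=TOTAL_CRITERIA):
    """Share of criteria that were triggered, as a percentage."""
    return hits / total_criteria * 100


# Made-up email text for illustration.
body = "Dear Customer, please Verify Your Account today."
hits = 1 if has_phishing_phrase(body) else 0
print("{:.2f}%".format(malicious_percentage(hits)))  # 11.11%
```

Because the phrase check contributes at most one hit, an email stuffed with suspicious wording cannot outweigh the link-based criteria on its own.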

The final characteristic that we check is where the website originates from. In a study done in 2018, it was found that most phishing emails originate in the United States, China, Russia, and Germany. For our purposes, we must assume that links from the United States are ‘okay’, but we will uptick the criteria for links from China, Russia, and Germany. Depending on the country of origin you are running the script from, this may need to be modified.
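One way to keep that list easy to modify is to pass it in as a parameter. This sketch assumes the scan results report ISO two-letter country codes; the names FLAGGED_COUNTRIES and from_flagged_country are hypothetical.

```python
FLAGGED_COUNTRIES = {"CN", "RU", "DE"}  # China, Russia, Germany


def from_flagged_country(countries, flagged=FLAGGED_COUNTRIES):
    """True if any country code in the scan result is on the flagged list."""
    return any(code.upper() in flagged for code in countries)


print(from_flagged_country(["US", "DE"]))  # True
print(from_flagged_country(["US"]))        # False
```

Someone running the script from Germany, for example, could call from_flagged_country(codes, flagged={"CN", "RU", "US"}) instead.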

At this point, we have our script ready and we can test it on our first two emails. The first email, ‘Promo’ with the non-malicious link, will give us this result.

It successfully grabs our email, extracts ‘google.com’, and searches it. We always print out how many scanners found the site malicious, which in this case is zero. The only thing increasing the chance of the link being malicious is that the script correctly identified the intent of the email text as malicious. That is one criterion out of nine, which gives us a low 11.11% chance of the link being malicious.

When we use our second email, that actually has a malicious link, we get this result.

A bit of a different response! Four scanners on VirusTotal found this link malicious and the site does not use HTTPS. This is a big warning sign, and the user should be warned not to visit that site.

This system is obviously still small on a grand scale. True phishing detectors have not been fully implemented in the corporate environment today because there are so many factors that go into phishing. Many studies show that the grammar in phishing emails is also normally worse than in a standard email. A machine learning approach to analyzing the language within the email could help give a more accurate response. Focusing on where the email originates can create a dilemma: not every email from China will be malicious, and not every email from the United States will be safe. Determining other characteristics of an email and having another trust anchor in place could make this more reliable. Another area of future work would be to analyze attachments on the email as well. The most common attachments containing malicious content in 2018 were PDF files, as the world has gotten more knowledgeable about phishing and will not click executables as readily as before. Modifying our macro to save the attachments from the email and then scanning those using the APIs through Python could give us more information about the intent of the email, as it may find the hash of well-known malware.

While the world continues to improve its phishing knowledge, phishing remains a problem today. This system can help to do an initial scan on emails sent through Microsoft Outlook to confirm or deny suspicions about possibly malicious emails and hopefully add a bit of security to corporate life.

This is the full script for my implementation:

import requests
import bs4
import json
import time
import sys


def get_credentials():

    # One API key per line: VirusTotal first, then urlscan.io.
    apis = []
    with open('credentials.txt', 'r') as f:
        for line in f:
            apis.append(line.strip())
    return apis


def virus_total(link, api):

    url = 'https://www.virustotal.com/vtapi/v2/url/report'
    params = {'apikey': api, 'resource': link}
    response = requests.get(url, params=params).json()
    # 'positives' is the number of scanners that flagged the link;
    # 'total' is the number of scanners that checked it.
    positives = response['positives']
    total = response['total']
    return [positives, total, (positives/total)*100]


def url_scan(link, api):

    url = 'https://urlscan.io/api/v1/scan'
    header = {'Content-Type': 'application/json', 'API-Key': api}
    data = '{"url" : "%s"}' % link
    try:
        response = requests.post(url, headers=header, data=data).json()
        time.sleep(30)
        r = requests.get(response['api']).json()
        countries = r['stats']['tlsStats'][0]['countries']
        ishttps = (r['stats']['tlsStats'][0]['securityState'] == 'secure')
        malicious = (r['stats']['malicious'])
        adBlocked = (r['stats']['adBlocked'])
        securePercentage = (r['stats']['securePercentage'])
        isUS = (r['page']['country'] == 'US')

        return [countries, ishttps, malicious, adBlocked, securePercentage, isUS]
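    except Exception:
        # The scan failed or returned incomplete data; the caller falls back
        # to a VirusTotal-only analysis.
        return None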


def parse(filename):

    # The saved email is expected to be UTF-16LE encoded HTML.
    with open(filename, 'r', encoding="utf-16LE") as f:
        soup = bs4.BeautifulSoup(f, "html.parser")
    links = []
    for link in soup.find_all('a', href=True):
        links.append(link['href'])
    return links, soup


def analyze(vt, us, link, soup):

    common_words = ['Please verify your account', "Failure to respond", "Dear friend", "Dear Friend", "Dear Customer", \
                    "Dear customer", "Dearest", "Dear valued customer", "Click", "click"]

    total = 0



    print("    - {} out of {} scanners found this site malicious.".format(vt[0], vt[1]))
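    # Count the email text as a single criterion, however many phrases match.
    email_html = soup.prettify()
    if any(words in email_html for words in common_words):
        print("    - Malicious intent was determined from the email's text.")
        total += 1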
    # Country codes are assumed to be ISO two-letter codes (CN = China).
    if 'CN' in us[0] or 'RU' in us[0] or 'EG' in us[0] or 'DE' in us[0]:
        print("    - Email originates from possibly malicious geographic location.")
        total += 1

    if not us[1]:
        print("    - Link is not HTTPS. Information on site will not be secure.")
        total += 1
    if us[2] != 0:
        print("    - Link possibly leads to malicious content")
        total += 1
    if us[3] > 30:
        print("    - Link has over 30 ads on the initial website. Malicious content is possible")
        total += 1
    if us[4] <= 50:
        print("    - Site is not fully implemented for HTTPS. Data may not be secure")
        total += 1
    if vt[0] > 2:
        total += 1
    if vt[0] > 3:
        total += 1
    if vt[0] > 4:
        total += 1

    print("\n    {}% chance of {} being malicious.\n".format((total/9)*100, link))


def analyze_no_us(vt, link):

    print("    - {} out of {} scanners found {} malicious.".format(vt[0], vt[1], link))
    if vt[0] > 3:
        print("Website is very likely malicious. Do not go to this link.")
    else:
        print("Information is limited as the link was not fully scannable. Continue with caution")


def main():

    if len(sys.argv) != 2:
        print("Wrong command.")
        sys.exit(1)
    parsed, soup = parse(sys.argv[1])
    apis = get_credentials()
    print("Scanning links, please hold . . . ")
    for link in parsed:
        virus_api = apis[0]
        url_api = apis[1]
        vt_stats = virus_total(link, virus_api)
        us_stats = url_scan(link, url_api)
        print("Analyzing {} . . .".format(link))
        if us_stats is None:
            print("Accurate stats could not be understood. Testing differently . . .")
            analyze_no_us(vt_stats, link)
        else:
            analyze(vt_stats, us_stats, link, soup)


if __name__ == "__main__":
    main()

These are the sources used throughout my research:

https://www.virustotal.com/#/home/upload   – VirusTotal for scanning links

https://urlscan.io/  – URLScan for scanning links

[1] http://www.24hoursupport.com/support-center/internet-security/protect-yourself/phishing/

[2] https://www.fireeye.com/content/dam/fireeye-www/global/en/current-threats/pdfs/rpt-top-spear-phishing-words.pdf

[3] https://blog.alertlogic.com/must-know-phishing-statistics-2018/

[4] https://www.forcepoint.com/zh-hant/blog/security-labs/new-phishing-research-5-most-dangerous-email-subjects-top-10-hosting-countries

[5] https://answers.microsoft.com/en-us/msoffice/forum/msoffice_outlook-mso_win10-mso_2016/save-current-email-as-html-through-vba/18710fd3-7cbf-4f81-ab56-7ef66830e062

[6] http://www.malwaredomainlist.com/hostslist/hosts.txt

