Malware Analysis Lab: Introduction to Non-Executable/Non-Binary Malware

Section I: Introduction

During my internship at MITRE in the summer of 2021, I worked on the Cross Domain Solution Open Source Discovery Project and the Threat Assessment Enclave (TAE) Update. A major theme in both projects was protecting against the dangers of non-executable/non-binary file types like Microsoft Office documents (Word, PowerPoint, Excel) and PDFs. In the TAE Update project specifically, I was tasked with finding, researching, testing, and writing modules for tools focused on detecting and analyzing malware of those file types. While it wasn’t a surprise that this type of malware is a problem, it made me realize that the Malware Analysis/Reverse Engineering classes, workshops, and other activities I had participated in focused mostly (if not only) on executables/binary files. This inspired me to write a Malware Analysis Lab focusing on non-executable/non-binary file types and the tools used to detect and analyze them.

The remainder of this blog is structured as follows. Section II describes the format of the Lab and other related resources, while Section III is my Answer Key for this Lab. The key includes either my own response to each question from completing the Lab or excerpts from resources on the tools (like their GitHub page, the author’s website, etc.). Section IV is the Resources Section, which includes the main resources used in making, not completing, this Lab. Section V is the Works Cited, which includes all the resources that were used in completing the Lab or making the key, or that are otherwise relevant/helpful.

Note that this Lab is an introduction to the field and its tools. It does not go deep into the functionality of the malware samples, as I myself am inexperienced with analyzing this type of malware. This only further supports my point that, as common as these types of malware are, there isn’t enough focus on them in classes and activities.

Section II: Lab Format and Description

The questions for this Lab will focus on the subject itself (non-executable/non-binary malware), the malware samples, and the tools being used. To answer these questions, students/users will not only need to use the tools for analysis but also perform some research on them. The malware samples used in this Lab come from Malware Bazaar and are referred to as Doc_sample.docx and pdf_sample.pdf for the Word Document and PDF samples respectively. I selected them by filtering by filetype, randomly selecting a few, looking up their SHA-256 hashes on VirusTotal and HybridAnalysis, and choosing the one per filetype with the most detections and details. The reason for this was so the samples would be somewhat interesting to analyze. For certain questions in the Lab, the user/student is required to use random files on their host machine. In my report, I used random homework or school-related files. The random files will not be provided, since the exercise is meant to help the student/user see that normal files can also contain malicious indicators and that a file having those indicators does not guarantee that it is malicious. For this Lab, we will be exploring parts of both the oletools tool suite and Didier Stevens’ toolset. Specifically, we will explore the use and output of oleid, olevba, MacroRaptor, pdfid, and pdf-parser.
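
For readers who want to reproduce the hash lookup step, here is a minimal sketch of computing a file's SHA-256 with Python's standard hashlib module. The function name sha256_of_file and the chunk size are my own choices for illustration; the resulting hex string is what you would paste into the search box of a service like VirusTotal.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so large samples do not need to fit in memory.
    `sha256_of_file` and `chunk_size` are illustrative names, not part of any tool
    used in this Lab.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file("pdf_sample.pdf") on the sample from the Resources Section should reproduce the SHA256 value listed there, which is a quick way to confirm you downloaded the right file.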

Please note, all the samples and tools used in this Lab are linked in the Resources Section at the end.

Section III: My Lab Report/Answer Key

Lab Report Section I: General Questions

  1. Before examining the malware samples, what kind of malware could someone say these are? In other words, by just knowing these files are malware, what kind of malware could you say these are? Explain your answer.
    1. There is no wrong answer as long as the explanation/justification is valid or reasonable.
    2. Without looking at the malicious code, one could say these would be considered Trojans. One reason for this is that these files appear to be legitimate office files while secretly performing malicious activities. Just like the Trojan horse from Greek Mythology, the outer appearance of these files appears harmless (even helpful) whereas in reality, they contain malicious code that’ll spring into action once past your defenses.
  2. Why is malware in these formats (office files like PDFs, Word Documents, etc.) especially concerning? Consider the use and exchanging of executables vs files of these formats.
    1. There is no wrong answer as long as the explanation/justification is valid or reasonable.
    2. Malware of these file types is concerning because these types of files are used more often, in more diverse environments, and in nearly every aspect of our daily lives. For example, I use and interact with Word Documents and PDFs in every one of my classes on a daily basis. On the other hand, I only use executables directly (as in not opening an application but double-clicking the icon or running it in the terminal/command line) occasionally, in two of my classes at most.
    3. In short, since people use these file types far more often than executables, common users are more likely to accept, open, and overlook them.

Lab Report Section II: Doc_sample.docx Questions

  1. The main component of the oleid output is a simple table of indicators, their values, a risk level, and a description of the indicator. If your output is multicolored, explain what each color represents in the output for the malware sample. Which of the values in the Indicator column are actually indicators? Provide Screenshot(s) to support your answer.
    1. In the sample there are 3 colors: Blue, Green, and Red (see Figure 2.1.1).
      1. Blue means the row contains purely informational data (not indicators but background data/metadata). 
      2. Green means the indicator and value combination in that row does not tell the user anything about the risk of the file being malicious. This is not a sign that the file is safe; the row simply provides no information that can tell us the file is malicious. As Dugald Bell, Martin Rees, and Carl Sagan all have said, “absence of evidence is not evidence of absence” (5).
        1. For example, the row for the “Encrypted” indicator and “False” value combination is in Green. This means the sample isn’t encrypted, and the sample not being encrypted does not tell us anything about the risk or chance of the file being malicious.
      3. Red means the indicator and value combination in that row indicates there is a high risk of the file being malicious. The Description for that row will go into a little more detail as to justify that rating.
        1. For example, the row for the “VBA Macros” indicator and “Yes, suspicious” value combination is in Red because the file contains a VBA Macro with suspicious keywords in it and suggests we investigate further with olevba and mraptor/MacroRaptor.
    2. While these are the colors we are guaranteed to see in the Lab, I want to go over all of them and how they are assigned to a row based on reading the source code for oleid (3). 
      1. The text color used for a row is determined by the Risk Rating given by the tool. Red is for High Risk, Yellow is for Medium Risk, White is for Low Risk, and Green is for No Risk. The default text color is used if there is an Error when determining the Risk or if the Risk Rating was set to Unknown (this can be confusing since the default text color could be one of the specified colors). Blue is for when the Risk Rating is set to “info”, which means the row is purely informational.
    3. The actual indicators in our output are the Encrypted, VBA Macros, XLM Macros, and External Relationships indicators. The rest are purely informational metadata about the sample.
    4. Figure 2.1.1: oleid Output
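
The risk-to-color rules described in the answer above can be summarized as a small lookup table. This is only a sketch of the mapping as I described it, not oleid's actual code: the names RISK_TO_COLOR, DEFAULT_COLOR, and color_for_risk, as well as the lowercase risk labels, are assumptions made for illustration.

```python
# Mapping from risk rating to row color, as described in the answer above.
# The keys are illustrative labels; oleid's own constants may be spelled differently.
RISK_TO_COLOR = {
    "high": "red",
    "medium": "yellow",
    "low": "white",
    "none": "green",
    "info": "blue",
}

# Used when the rating is an error, unknown, or anything not listed above.
DEFAULT_COLOR = "default"


def color_for_risk(risk):
    """Return the display color for a risk rating, falling back to the default color."""
    return RISK_TO_COLOR.get(str(risk).strip().lower(), DEFAULT_COLOR)
```

Writing it this way makes the confusing case explicit: an Unknown or Error rating falls through to whatever the terminal's default text color happens to be, which may look identical to one of the named colors.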

  2. For our sample, what does this output tell us? Use all the actual indicators in your answer and feel free to use the informational ones as well. Provide Screenshot(s) to support your answer.
    1. The user must identify the High Risk indicator and explain what it means in their answer.
    2. The oleid output, as seen in Figure 2.1.1, tells us that the file is not encrypted, contains no XLM Macros, and has no External Relationships (like links). It also tells us that the file contains a VBA Macro with suspicious keywords in it, which we should (and will) investigate further with olevba and MacroRaptor. This means that the file contains a VBA Macro that might be trying to perform potentially malicious or suspicious actions/tasks. We also learn that the original author of this file was Rayan Marty, who could (potentially) also be the malware author.
  3. Experiment with oleid on a few random Word Document files on your machine. These should not be malware samples but normal files. What do you see? Are there any files with malicious indicators? Try to find one that has them and explain what this false positive indicates. Provide Screenshot(s) to support your answer. You do not have to show the contents of the actual files, just the oleid output.
    1. In one file (see Figure 2.3.1), we see a new indicator, ObjectPool, which has a Low Risk. In another file (see Figure 2.3.2), we see multiple External Relationships, which indicate a High Risk.
    2. Since I know these files and where they came from (they are homework assignments and/or their templates from mycourses), I also know they aren’t actually malicious but contain features that are used to indicate malware. This shows that malware authors use common features to perform malicious actions and that the presence of an indicator doesn’t guarantee the file is malicious.
    3. Figure 2.3.1: “Blog Assignment (2).doc” oleid Output
    4. Figure 2.3.2: “CSEC-380-Homework6-R5.docx” oleid Output
  4. Describe the different flags that MacroRaptor has and what they indicate. Then, show which ones the sample has and what that means. Provide Screenshot(s) to support your answer.
    1. The 3 flags in MacroRaptor are: A for Auto Execute (AutoExec), W for Write, and X for Execute (see Figure 2.4.1). To quote the GitHub page for MacroRaptor (2):
      1. A: Auto-execution trigger
        1. Automatically executes based on a trigger like opening or closing the file.
      2. W: Write to the file system or memory
      3. X: Execute a file or any payload outside the VBA context
    2. Our sample was given the A and X flags (see Figure 2.4.1) which resulted in it being labeled suspicious. These flags indicate our sample will try to use an auto-execution trigger to execute a file or payload outside the VBA Macro.
    3. Figure 2.4.1: MacroRaptor Output for Doc_sample.docx
  5. What are the types of keywords that are found in our sample? Pick a few keywords, explain what they mean, and how malware could use them. Provide Screenshot(s) to support your answer.
    1. While the keywords and their meanings have definite right answers (since they are in the output), the explanation of how the malware uses each feature is correct as long as it makes sense/is reasonable.
    2. As can be seen in Figure 2.5.1, there are AutoExec (Auto Execute) and Suspicious keywords found in our sample.
      1. The Document_open keyword is the only AutoExec keyword found. This keyword indicates that the sample attempts to automatically execute the macro when the file is opened (4). This could (and probably does) mean the malware is trying to execute a file or payload right when the victim opens the file.
      2. The Create keyword is labeled as Suspicious and means the sample is trying to execute a file or command using WMI. This is probably either the rest of the malware or the payload hidden in the sample.
      3. The showwindow keyword is labeled as Suspicious and means the sample is attempting to hide the application. This could be to trick the victim into thinking the file is broken and nothing is happening (like an evasion technique).
      4. The GetObject keyword is labeled as Suspicious and means the sample may get another OLE object that is currently running. This could be how the malware spreads and/or persists (planting its payload into another file). Another possibility is that our sample is just the set-up piece that loads the actual payload into another file that executes it.
      5. The ChrW keyword is labeled as Suspicious and means the sample may attempt to obfuscate specific strings. The malware is likely obfuscating its strings to evade detection, to encrypt/encode its communications, or just to hide its functionality and characteristics.
      6. The Hex Strings keyword and the Base64 keyword are labeled as Suspicious and mean the sample has hex-encoded strings and Base64 strings. These indicate either that the sample is obfuscated or that it may use these strings to attempt obfuscation. As I said before, the malware could be using these strings/obfuscation to evade detection, encrypt/encode its communications, or hide its characteristics or functionality.
    3. Figure 2.5.1: End of the olevba Output for Doc_sample.docx
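
To make the MacroRaptor verdict from the flags question concrete, here is a minimal sketch of how the A, W, and X flags could combine into a "suspicious" label. It assumes the rule from the mraptor wiki (2) as I understand it, namely that a macro is flagged when it has an auto-execution trigger and also writes or executes. The function names format_flags and is_suspicious are hypothetical, and this is not MacroRaptor's own code.

```python
def format_flags(autoexec, write, execute):
    """Render the three flags as a string such as 'A-X', with '-' for a missing flag."""
    pairs = (("A", autoexec), ("W", write), ("X", execute))
    return "".join(letter if present else "-" for letter, present in pairs)


def is_suspicious(autoexec, write, execute):
    """Assumed rule: suspicious when A and (W or X) holds."""
    return bool(autoexec and (write or execute))


# Our sample was given the A and X flags but not W.
sample_flags = format_flags(autoexec=True, write=False, execute=True)
sample_verdict = is_suspicious(autoexec=True, write=False, execute=True)
```

Under this rule a macro that only writes and executes, without an automatic trigger, would not be labeled suspicious, which matches the idea that the trigger is what lets the payload run without the victim doing anything beyond opening the file.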

Lab Report Section III: pdf_sample.pdf Questions

  1. Explain the difference between pdfid and pdf-parser.
    1. As you can find on the author’s website (7)(9), pdf-parser analyzes the PDF file to find fundamental elements (like objects) used in the file while pdfid simply scans it while looking for certain PDF keywords. The author himself even recommends that users use pdfid to triage PDFs before analyzing the suspicious ones with pdf-parser. 
    2. In short, pdfid is a keyword scanner meant for triaging while pdf-parser is a tool that parses/dissects a PDF looking for fundamental elements and is meant for in-depth analysis.
  2. By looking at the website, explain which keywords can indicate the file is malware, when, and why. Then specify which indicators are in the sample and what they might mean for the malware. Provide Screenshot(s) to support your answer. Hint: Look up Didier Stevens PDF Tools.
    1. As per the author’s website (9):
      1. /Page gives an indication of the number of pages in the PDF document. Most malicious PDF document have only one page.
      2. /ObjStm counts the number of object streams. An object stream is a stream object that can contain other objects, and can therefore be used to obfuscate objects (by using different filters).
      3. /JS and /JavaScript indicate that the PDF document contains JavaScript. Almost all malicious PDF documents that I’ve found in the wild contain JavaScript (to exploit a JavaScript vulnerability and/or to execute a heap spray). Of course, you can also find JavaScript in PDF documents without malicious intentions.
      4. /AA and /OpenAction indicate an automatic action to be performed when the page/document is viewed. All malicious PDF documents with JavaScript I’ve seen in the wild had an automatic action to launch the JavaScript without user interaction.
      5. The combination of automatic action  and JavaScript makes a PDF document very suspicious.
      6. /JBIG2Decode indicates if the PDF document uses JBIG2 compression. This is not necessarily and indication of a malicious PDF document, but requires further investigation.
    2. In our sample, as seen below, the keywords that can indicate malware that appear are /ObjStm and /OpenAction. However, it is also suspicious that the /Page is set to 0 since no PDF or file should be 0 pages long (it is always at least 1).
    3. Figure 3.2.1: pdfid Output for pdf_sample.pdf
  3. Create an empty text file on your host. At the top, put “%PDF-1.6” (this tricks the tool into believing this file is a PDF) and then copy some or all of the keywords (/Page to /Colors) in the previous output into the file. Then run pdfid against the file. What do you see and why does this make sense? Provide Screenshot(s) to support your answer.
    1. As you can see in Figures 3.3.1 and 3.3.2, besides the /Colors keyword, every keyword I hardcoded into the fake PDF file (pdf.txt) was found once by pdfid. This makes sense since pdfid is a keyword scanner (more specifically a string scanner): after finding the PDF magic number (“%PDF-1.6”), it simply scans the given file for the list of strings/keywords we have seen.
    2. Figure 3.3.1: pdf.txt File Contents
    3. Figure 3.3.2: pdfid Output for pdf.txt
  4. Find the stats of the PDF sample using the -a argument and discuss them (mainly focus on the Indirect Objects and Search Keywords). Provide Screenshot(s) to support your answer.
    1. You can see the stats of the sample in Figure 3.4.1, which shows that there are 14 indirect objects and 3 Keyword hits. Of the 14 indirect objects, there was 1 Catalog object, 1 Embedded File, 4 XObjects, 1 XRef object, and 6 other miscellaneous indirect objects. Of the 3 Keyword hits, there was 1 hit for OpenAction, 1 hit for AcroForm, and 1 hit for Embedded files.
    2. Figure 3.4.1: pdf-parser with -a Option Output
  5. Using the information from the previous answer, search the PDF for XObjects and then reference one of the XObject objects. Use the man page/--help option to figure out which arguments you need to use (how to do this is also available online from the author). Provide Screenshot(s) to support your answer and show that you completed the process.
    1. As long as the student shows the whole process in screenshots, that should be enough. They can explain as well but it isn’t necessary as long as the screenshots show the commands they used and the beginning of the output (at least).
    2. As you can see in the Figures below, I first used the --help option to see the available options and used the hint in the question to determine I needed to use the --search option. I used the --search option to look for XObjects and found several. Then, I used the --reference option to pull up a specific XObject (Object 29 to be specific).
    3. Figure 3.5.1: Part of pdf-parser --help Option Output
    4. Figure 3.5.2: pdf-parser --search (Search) XObject --raw Options Output Part 1
    5. Figure 3.5.3: pdf-parser --search (Search) XObject --raw Options Output Part 2
    6. Figure 3.5.4: pdf-parser --reference 29 --raw Options Output
  6. In your opinion, are these great malware tools or just analysis tools?
    1. There is no wrong answer as long as the explanation/justification is valid or reasonable.
    2. In my opinion, these tools are better suited for the general breakdown and analysis of PDF files, seeing as they are just a parser and a string scanner (7)(9). While they can provide data relevant to Malware Analysis, that is not their explicit purpose. They are similar to the strings tool/command: the data they provide can be used in Malware Analysis, but that isn’t what they do explicitly. Neither tool tells the user that a file is suspicious or malicious; they just provide data that someone can interpret as a malicious indicator.
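
The keyword-scanning behavior seen in the pdf.txt experiment can be imitated in a few lines of Python. The sketch below is a rough approximation I wrote for illustration, not pdfid's implementation: the KEYWORDS list only covers the names quoted from the author's website, count_pdf_keywords is a hypothetical name, and the real tool handles cases this ignores (such as obfuscated names and the special /Colors check).

```python
import re

# Keyword names quoted from the author's website earlier in this section.
KEYWORDS = ["/Page", "/ObjStm", "/JS", "/JavaScript", "/AA", "/OpenAction", "/JBIG2Decode"]


def count_pdf_keywords(data, keywords=KEYWORDS):
    """Count occurrences of each keyword in raw bytes that start with a PDF header.

    A keyword only counts when it is not followed by another letter or digit,
    so "/Pages" is not counted as "/Page".
    """
    if not data.startswith(b"%PDF-"):
        raise ValueError("no %PDF- header at the start of the data")
    counts = {}
    for keyword in keywords:
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts


# A fake "PDF" in the spirit of pdf.txt: just a header followed by keywords.
fake_pdf = b"%PDF-1.6\n/Page\n/Pages\n/JS\n/JavaScript\n/OpenAction\n"
fake_counts = count_pdf_keywords(fake_pdf)
```

Because the scanner never checks whether the keywords sit inside valid PDF objects, a text file with the right header is enough to produce hits, which is the same reason these tools report data and leave the verdict to the analyst.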

Section IV: Resources

Malware Samples

PDF 

MalwareBazaar | SHA256 54c3c13b6bd236bab7971c6635866b4ca335727e6f96f66491edabae3cbc65cd

Word Document

MalwareBazaar | SHA256 000444a623568f34fca2d4281a5bb95c13686514625941b4c53c0db63762a872

PDF Parser and PDFiD

https://blog.didierstevens.com/programs/pdf-tools/

Oletools

https://github.com/decalage2/oletools

Section V: Works Cited

(1) Lagadec, P. (n.d.). Decalage2/oletools: Oletools – python tools to analyze MS OLE2 files (structured storage, compound file binary format) and MS Office documents, for malware analysis, forensics and debugging. GitHub. Retrieved April 11, 2022, from https://github.com/decalage2/oletools

(2) Lagadec, P. (n.d.). MRAPTOR · Decalage2/oletools wiki. GitHub. Retrieved April 11, 2022, from https://github.com/decalage2/oletools/wiki/mraptor

(3) Lagadec, P. (n.d.). Oletools/oleid.py at master · Decalage2/oletools. GitHub. Retrieved April 11, 2022, from https://github.com/decalage2/oletools/blob/master/oletools/oleid.py

(4) Lagadec, P. (n.d.). OLEVBA · Decalage2/oletools wiki. GitHub. Retrieved April 11, 2022, from https://github.com/decalage2/oletools/wiki/olevba

(5) quoteresearch. (2019, September 17). Absence of evidence is not evidence of absence. Quote Investigator. Retrieved April 11, 2022, from https://quoteinvestigator.com/2019/09/17/absence/

(6) Stevens, D. (2008, October 20). Analyzing a malicious pdf file. Didier Stevens. Retrieved April 11, 2022, from https://blog.didierstevens.com/2008/10/20/analyzing-a-malicious-pdf-file/

(7) Stevens, D. (2009, March 31). PDFiD. Didier Stevens. Retrieved April 11, 2022, from https://blog.didierstevens.com/2009/03/31/pdfid/

(8) Stevens, D. (2010, December 29). Quickpost: About the physical and logical structure of PDF Files. Didier Stevens. Retrieved April 11, 2022, from https://blog.didierstevens.com/2008/04/09/quickpost-about-the-physical-and-logical-structure-of-pdf-files/

(9) Stevens, D. (2021, August 18). PDF Tools. Didier Stevens. Retrieved April 11, 2022, from https://blog.didierstevens.com/programs/pdf-tools/ 

written by Elijah Heilman
