By Caleb Chesley
Log analysis has long been a staple of computer forensics and incident response. While Intrusion Detection Systems (IDS) exist to alert security personnel to an initial intrusion, simply halting an attack is insufficient. Even if an attack is halted before serious damage occurs, it is vital that the event be thoroughly investigated to locate lingering persistence mechanisms and identify defensive failings. Frequently this process involves delving into various logs, from network logs produced by sources like Zeek to machine logs from auditd or a local antivirus. The double-edged sword of this technique is the sheer amount of information logs provide; while applications can be configured to record great detail, presenting it through even a well-featured text editor can make it difficult to identify relevant information among a wall of characters. Reading logs this way presents an additional hurdle when following the trail of an attacker through logs from multiple sources. It can be incredibly difficult to conceptualize a path leading through various types of logs as one jumps from document to document. After hours of consuming this content, an analyst can easily become desensitized and unable to process it effectively.
A tool for addressing this problem can be made using graphs. I use the term “graph” as it appears in discrete mathematics, where a graph is made up of a set of nodes and the edges that connect them. A node represents some entity, while an edge depicts a relationship between two such entities. Graphs can be either directed graphs, where edges include an arrow indicating the direction of the relationship, or non-directed graphs, where edges are simply a line. This is illustrated in the example below. The edge indicating a family relationship between the two person entities is non-directed because the relationship cannot be unidirectional; Person A cannot be family to Person B without Person B also being family to Person A. The next example displays a directed graph. Person A owns the Dog, but the Dog does not also own Person A.
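The distinction between the two edge types can also be sketched in a few lines of Python. This is only an illustration of the concept, not part of Neht Graff; the `Graph` class and the `FAMILY` and `OWNS` labels are names made up for this example. A non-directed edge is stored here as a pair of directed edges, one in each direction.

```python
from collections import defaultdict


class Graph:
    """A tiny labeled graph; each edge is stored as (neighbor, label)."""

    def __init__(self):
        self.adjacency = defaultdict(list)

    def add_directed(self, source, label, target):
        # One-way relationship: only the source records the edge.
        self.adjacency[source].append((target, label))

    def add_undirected(self, a, label, b):
        # Two-way relationship: stored as a pair of directed edges.
        self.add_directed(a, label, b)
        self.add_directed(b, label, a)

    def related(self, source, label):
        # Nodes reachable from `source` over an edge with this label.
        return [node for node, edge_label in self.adjacency[source]
                if edge_label == label]


g = Graph()
g.add_undirected("Person A", "FAMILY", "Person B")
g.add_directed("Person A", "OWNS", "Dog")

print(g.related("Person B", "FAMILY"))  # ['Person A']
print(g.related("Dog", "OWNS"))         # []
```

Asking the Dog what it owns returns nothing, while the family relationship can be followed from either person.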
Graphs are incredibly versatile in the types of data they can depict. If applied to something such as a network log, it is easy to conceptualize the types of entities and relationships a graph could represent. To demonstrate this, I will use a tool called Neht Graff which I developed as an intern with the threat intelligence team at Workday, Inc. alongside my mentor Max Hill. The tool has since been open sourced and is available on GitHub: https://github.com/Workday/neht-graff. Neht Graff provides a web interface for visualizing and querying graph data. The tool is designed to be paired with Neo4j (https://neo4j.com/neo4j-graph-database/), a graph database. With a suitable schema, Neo4j can ingest various kinds of log files and turn them into a directed graph like the one described above. The graph can then be queried using Neo4j’s Cypher query language to select nodes and edges based on specific properties of the entity or relationship.
In this example I will discuss the Neo4j schema included with Neht Graff, which is intended to ingest logs from the popular Zeek IDS, formerly known as Bro. The dataset used in the demonstration was obtained from https://www.secrepo.com/ and was generated by running packet captures from the Mid-Atlantic CCDC 2012 through Zeek. Zeek sections its logs into classes of network traffic based on factors such as protocol, as shown in the graphic below. The connection log shown at the top is by far the largest, showing connections between two hosts using TCP, UDP, or ICMP. The other log files contain traffic which frequently overlaps with the connection log but adds information relevant to the protocol or type of traffic. For instance, FTP traffic will contain the credentials used in communication between the client and FTP server, while file traffic will include an MD5 hash of files sent.
Clearly, interpreting Zeek data can require sifting through many types of logs that can be hard to correlate. Importantly, the various log types tend to have overlapping fields, which allows us to build a better model for visualizing them. If we examine the log snippets below, we see that nearly every entry includes both a source and a destination IP address. This is the case for every log type except weird.log, which contains traffic that could not be classified.
With each entry containing a source and destination IP address, our approach will be to treat the host as one form of entity and the connection as the other. This is illustrated below with an example from the Neht Graff repository. The two blue nodes represent hosts, with the orange node representing a connection on port 80 which was used to transfer Nv2-PC.exe between hosts. The edges indicate the relationship between each entity, with ORIG indicating the initiator of the connection while RESP indicates the recipient.
To build these nodes and relationships from Zeek data, we first use Cypher queries to create constraints on the database. These allow the IP address to be a unique identifier for our host entity, while the UID serves as the identifier for the other entity types, which represent various types of connection. While a few example constraints are shown below, a constraint statement is necessary for each data type you intend to visualize as a node before uploading any data.
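As a rough sketch of what such statements look like, the Python snippet below generates one uniqueness constraint per node label. It is not taken from the Neht Graff schema: the `Host` and `Conn` labels and the `address` property match the query used later in this article, but the `uid` property name and the `Ftp` label are assumptions made for illustration. The `ON ... ASSERT` form is the syntax used by older Neo4j releases; Neo4j 5 expects the `FOR ... REQUIRE` form instead, so the helper can emit either.

```python
# Hypothetical mapping of node label -> property that must be unique.
# Only Host/address and Conn appear in this article; the rest is assumed.
UNIQUE_KEYS = {
    "Host": "address",
    "Conn": "uid",
    "Ftp": "uid",
}


def constraint_statements(unique_keys, legacy=True):
    """Build one Cypher uniqueness constraint per node label."""
    statements = []
    for label, prop in unique_keys.items():
        if legacy:
            # Syntax accepted by older Neo4j releases (3.x and 4.x).
            statements.append(
                f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"
            )
        else:
            # Syntax required by Neo4j 5.
            statements.append(
                f"CREATE CONSTRAINT FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
            )
    return statements


for statement in constraint_statements(UNIQUE_KEYS):
    print(statement)
```

Each printed line is a statement that would be run against the database once, before any data is loaded.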
Because Neo4j has built-in support for importing CSV data, Neht Graff includes a script called log_to_csv.py for converting Zeek logs to CSV files before upload. Once the logs are converted, we can begin uploading them to be processed into nodes and relationships. While the more specialized logs such as ftp or http will each be represented by a single class of entities in our graph, the connection log will be used to make nodes for both hosts and generic connections. The method for this is shown in the two code blocks below:
Generate connection nodes:
Generate host nodes:
Each line in conn.log contains a source and destination IP along with a variety of other information about the connection. Conn.log will be loaded in the first round to generate connection nodes with the bulk of the metadata. The second round looks at only the source and destination IP addresses and creates nodes based on these. Because of the constraints we configured earlier, only one node will be generated for a given IP address.
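The conversion and the two loading rounds can be mimicked in memory with plain Python. This is a sketch of the idea only, not the repository's log_to_csv.py or its Cypher import statements: the function names are hypothetical, the sample values are invented, and the sample uses a trimmed subset of conn.log columns. The `#fields` header line and the `uid`, `id.orig_h`, and `id.resp_h` column names follow Zeek's tab-separated log layout.

```python
import csv
import io

# Invented sample in Zeek's tab-separated layout (values are made up).
SAMPLE_CONN_LOG = "\n".join([
    "#separator \\x09",
    "#fields\tuid\tid.orig_h\tid.resp_h\tid.resp_p\tproto",
    "#types\tstring\taddr\taddr\tport\tenum",
    "C1\t192.168.1.5\t10.0.0.8\t80\ttcp",
    "C2\t192.168.1.5\t10.0.0.9\t21\ttcp",
    "C3\t10.0.0.9\t192.168.1.5\t53\tudp",
    "#close\t2012-03-16-12-00-00",
])


def zeek_to_csv(text):
    """Convert Zeek TSV text to CSV text, using the #fields line as header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.startswith("#fields"):
            writer.writerow(line.split("\t")[1:])  # drop the "#fields" tag
        elif line and not line.startswith("#"):
            writer.writerow(line.split("\t"))      # ordinary data row
    return out.getvalue()


def build_conn_nodes(rows):
    """Round 1: one connection node per UID, carrying the row's metadata."""
    return {row["uid"]: dict(row) for row in rows}


def build_host_nodes(rows):
    """Round 2: one host node per distinct IP, whichever side it was on."""
    hosts = {}
    for row in rows:
        for field in ("id.orig_h", "id.resp_h"):
            # setdefault plays the role of the uniqueness constraint.
            hosts.setdefault(row[field], {"address": row[field]})
    return hosts


rows = list(csv.DictReader(io.StringIO(zeek_to_csv(SAMPLE_CONN_LOG))))
conns = build_conn_nodes(rows)
hosts = build_host_nodes(rows)
print(len(conns), "connection nodes")  # 3 connection nodes
print(sorted(hosts))  # ['10.0.0.8', '10.0.0.9', '192.168.1.5']
```

Three rows mention six addresses in total, yet only three host nodes result, because 192.168.1.5 and 10.0.0.9 each appear more than once.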
Now that we have uploaded our data to be parsed into nodes, we must generate the relationships. The fields shared by all connection entities are the source and destination IP addresses, which can be used to link each connection to its two corresponding host nodes. Cypher is designed such that queries are visually analogous to the relationships they represent. The following query reads: for the host entity with the IP address 192.168.1.5, display the host, relationships, and connection entities where the host has an ORIG relationship to the connection.
MATCH (h:Host)-[r:ORIG]->(c:Conn) WHERE h.address = '192.168.1.5' RETURN h,r,c
The MERGE keyword in the following statement is what generates the ORIG relationship between each host and connection entity. The statement pairs each host with every connection whose source IP matches the host's IP, joining them with an ORIG relationship directed from the host to the connection.
We then do the same for destination hosts using the RESP relationship:
And repeat for other connection types like FTP:
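The linking logic described above can be approximated in plain Python. This sketch is not the Cypher used by Neht Graff: the function name, the sample data, and the `Ftp` label are hypothetical, and the RESP edge is assumed to point from the host to the connection in the same way as ORIG. Storing edges in a set gives the same idempotent behavior as MERGE, since linking the same pair twice leaves a single edge.

```python
# Invented sample data: two hosts, one generic connection, one FTP session.
HOSTS = {"192.168.1.5", "10.0.0.8"}
CONNS = {"C1": {"id.orig_h": "192.168.1.5", "id.resp_h": "10.0.0.8"}}
FTP = {"F1": {"id.orig_h": "10.0.0.8", "id.resp_h": "192.168.1.5"}}


def link_hosts(hosts, entities, label):
    """Return (host, relationship, (label, uid)) edges for known hosts."""
    edges = set()
    for uid, props in entities.items():
        for field, rel_type in (("id.orig_h", "ORIG"), ("id.resp_h", "RESP")):
            address = props[field]
            if address in hosts:
                # Edge is directed from the host to the connection entity.
                edges.add((address, rel_type, (label, uid)))
    return edges


edges = link_hosts(HOSTS, CONNS, "Conn") | link_hosts(HOSTS, FTP, "Ftp")
# Linking the same entities again adds nothing new, as with MERGE.
edges |= link_hosts(HOSTS, CONNS, "Conn")

# Rough equivalent of matching (h:Host)-[:ORIG]->(c) for one address.
originated = sorted(e[2] for e in edges
                    if e[0] == "192.168.1.5" and e[1] == "ORIG")
print(len(edges))   # 4
print(originated)   # [('Conn', 'C1')]
```

The same function handles both the generic connections and the FTP sessions, which is possible only because every entity type shares the source and destination address fields.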
Now that the nodes and relationships have been generated, we can use Neht Graff to visualize and search the data. The visualization takes the form of a web application running on a Flask server, using D3.js force-directed graphs to display nodes and relationships. The following query will display 200 file nodes:
By zooming in and clicking on a node, we can view additional metadata.