
The Value of Data

August 12, 2015

Mike MacIntyre

Last week we introduced ourselves with the blog “Log thy Data”, but the truth is that if organisations don’t log relevant data, we are not in a position to help them exploit it and improve security. Sometimes data owners are not aware of the full scope of services, use cases or missions (in Panaseer language) that their data can unlock.

This leads them to make capture and storage decisions based on the single mission that they do understand. So we intend to explore how logging the right data can increase the total value of a data source and – when combined with the right technology – service the needs of multiple stakeholders.

To help illustrate the point, let’s start with Web Proxy Logs. They are a rich source of data and a vital tool in the security toolkit; I would be amazed if you have not reviewed them in some capacity as part of a SecOps team. Not only are they easy for an analyst to comprehend, they are a great example of a single source supporting multiple missions. Mission examples include:

  1. Hunt (or Security Analytics) – Most threats still utilise a web attack vector, which makes these logs a great place to look for tell-tale data footprints beyond standard SIEM rules.
  2. Threat Intelligence – The indicators of compromise (IOCs) that vendors regularly issue contain a large proportion of hostnames and IP addresses, which can be swept against past and present web logs to identify whether collisions have occurred (see the sketch after this list). How else do you know if you’ve been compromised?
  3. Incident Response – These logs are certain to be queried during an IR mission to help determine the full extent of a compromise.
  4. Strategy – Bulk analysis of the historical web logs can help answer “what if” questions. For example, “What proportion of users would be affected if we blocked all traffic that is uncategorised by the proxy?” (a tactic sometimes known as a speed bump).
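
As a flavour of what that IOC sweep can look like in practice, here is a minimal Python sketch. Everything in it (the file names, the CSV column names) is an assumption for illustration, not a description of any particular product:

```python
import csv

# Minimal IOC sweep. The file names and the CSV layout ('host',
# 'timestamp', 'client_ip' columns) are hypothetical.
def sweep_iocs(ioc_path, proxy_log_path):
    with open(ioc_path) as f:
        iocs = {line.strip().lower() for line in f if line.strip()}

    hits = []
    with open(proxy_log_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["host"].lower() in iocs:
                hits.append(row)  # a collision: an IOC seen in our traffic
    return hits

if __name__ == "__main__":
    for hit in sweep_iocs("iocs.txt", "proxy_logs.csv"):
        print(hit["timestamp"], hit["client_ip"], hit["host"])
```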

This all sounds sensible, right? Except that too many times I have gone to work with a customer, only to discover that the missions described above are prohibited (or certainly limited) for one simple reason: the customer did not log the correct fields. What’s more, when challenged as to why, the answer is often “why would you need those?”. A full list of fields I’ve used previously is shown at the end of the blog, but here are three that have proved more challenging to obtain than I expected, or whose value the customer failed to realise.

HTTP Referer

Although deemed a security risk and slowly disappearing due to the increasing use of HTTPS, the HTTP referer field captures the origin of indirect web requests. This is incredibly useful when attempting to determine which web requests can be linked together. In an IR mission, when reviewing potential infection vectors (drive-by attacks, for example), being able to reconstruct the referral path that took a user from a compromised domain to the malware domain is an important piece of knowledge. How are IR teams able to do this attribution without this vital piece of data?!
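
To make the referral-path idea concrete, here is a rough Python sketch of walking referer fields backwards from a malware URL. The log layout (a 'url' and a 'referer' field per record) is assumed; real logs would also need filtering by user and time window:

```python
# Rough sketch of reconstructing a referral chain from proxy logs.
# Assumes each record carries 'url' and 'referer' fields.
def referral_chain(records, final_url, max_hops=10):
    by_url = {r["url"]: r for r in records}  # index requests by landing URL
    chain = [final_url]
    current = final_url
    for _ in range(max_hops):
        record = by_url.get(current)
        if not record or not record.get("referer"):
            break  # chain starts here: a direct request, or not logged
        current = record["referer"]
        chain.append(current)
    return list(reversed(chain))  # compromised site first, malware last

records = [
    {"url": "http://news.example.com/story", "referer": ""},
    {"url": "http://ads.example.net/redirect",
     "referer": "http://news.example.com/story"},
    {"url": "http://malware.example.org/payload.exe",
     "referer": "http://ads.example.net/redirect"},
]
print(referral_chain(records, "http://malware.example.org/payload.exe"))
```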

In the Hunt mission I have found this field to be a strong feature in differentiating malicious and benign traffic. Malware authors’ attempts to make malicious traffic blend in with normal traffic often stop at constructing plausible hostnames and URLs; they forget the finer detail of how the user ended up making that request. Perhaps it’s because our adversaries know that many organisations don’t log this information! Regardless, at particular points in a threat actor’s campaign (namely command and control) the lack of a referrer can be a telling indicator.
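
One way to operationalise that observation, sketched under the assumption of records carrying 'host' and 'referer' fields:

```python
from collections import Counter

# Toy hunt heuristic: surface hosts that are both rarely visited and
# never arrive with a referrer, a combination that beacon-style
# command and control traffic can exhibit.
def referrerless_rare_hosts(records, max_visits=5):
    visits = Counter(r["host"] for r in records)
    has_referer = {r["host"] for r in records if r.get("referer")}
    return sorted(h for h, n in visits.items()
                  if n <= max_visits and h not in has_referer)
```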

User Agent

The user agent header, a string denoting the software agent making requests on behalf of a user (e.g. a web browser), also falls into the category of fields that our adversaries neglect when generating malicious web traffic. This lapse has always surprised me, but it is not uncommon for malware authors to hard-code a user agent string rather than determining it at runtime (this is clearly not the case for all malware). As such, it again serves a hunt mission well to profile this field for abnormalities.
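
A minimal sketch of what that profiling can look like, again with an assumed field name:

```python
from collections import Counter

# Rank user agent strings by frequency and surface the rarest values
# for analyst review. Hard-coded or malformed agents from malware
# tend to sit in this long tail.
def rare_user_agents(records, threshold=3):
    counts = Counter(r["user_agent"] for r in records)
    return sorted((n, ua) for ua, n in counts.items() if n <= threshold)
```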

In the same way that it enables better threat detection, I’ve found the user agent string to be a valuable thread to pull in an investigation. In a previous life, working on an IR mission targeting the energy sector, we were given only a single hostname by some government friends. A quick review of the customer’s logs revealed the infected host, but more valuable was a secondary search (or pivot) around the user agent, which exhibited some unusual characteristics: we uncovered another four or so infected assets communicating with different domains. Then, after knocking together a simple regular expression, more infected assets and malicious domains were uncovered. In two (or three, if you count the regex) simple hops, centred around the user agent string, the full scale of the compromise had been uncovered. How much longer would it have taken us to determine this had the customer not logged that field?
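
The pivot itself can be as simple as a pattern match over the user agent column. A hypothetical Python sketch (the regex is invented for illustration; it is not the one from that engagement):

```python
import re

# Hypothetical pivot: generalise the unusual traits of one user agent
# so that close variants used by other infected assets are also caught.
PIVOT = re.compile(r"Mozilla/4\.0 \(compatible; MSIE \d\.0; Win32\)")

def pivot_on_user_agent(records):
    hits = [r for r in records if PIVOT.search(r["user_agent"])]
    assets = {r["client_ip"] for r in hits}   # further infected hosts
    domains = {r["host"] for r in hits}       # further malicious domains
    return assets, domains
```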

Client IP Address (or Username)

Finally, I can’t think of an instance when the client (or source) IP address hasn’t been logged. Yet I know of numerous instances where the field has provided little value. Why? Because the customer’s network architecture (e.g. a Citrix estate) or logging infrastructure (e.g. no proxy authentication) has prevented me from determining which specific user made a web request. In many cases this doesn’t matter, but what if I want to track a user’s activity over multiple days? What if, as part of a hunt mission, I am building profiles of normal user behaviour in order to detect deviation from that baseline? If I can’t attribute the traffic to a user, then how can I know what is normal for them? Bear this in mind when vendors are selling you machine learning algorithms that detect ‘normal’. Can you benefit from such algorithms given your current logging limitations?
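
As a sketch of why attribution matters here, the toy baseline below is only meaningful if every record can be tied to a username rather than a shared gateway IP (field names assumed):

```python
from collections import defaultdict

# Toy per-user baseline: the set of destination hosts each user
# normally visits. Behind a Citrix estate or an unauthenticated proxy,
# records collapse onto a handful of shared IPs and this tells you nothing.
def build_baselines(history):
    baselines = defaultdict(set)
    for r in history:
        baselines[r["username"]].add(r["host"])
    return baselines

def deviations(baselines, new_records):
    # Hosts a user has never been seen visiting before.
    return [r for r in new_records
            if r["host"] not in baselines.get(r["username"], set())]
```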

I recognise that extensive logging adds to the storage and bandwidth requirements (the referer and user agent fields in particular can double the record length), but weigh that against the potential benefits to investigation, detection and remediation, and then consider asking your team or vendor to start logging these fields as standard.

Below is a full list of fields that have helped us out in Hunt, Threat Intelligence and IR missions in the past.

Web Proxy Log Fields:

  • Timestamp
  • Source (or Client) IP Address
  • Destination IP Address
  • Username
  • URL
  • Request Method
  • Bytes Up
  • Bytes Down
  • User Agent
  • Referer URL
  • HTTP Status Code
  • Proxy Category