I've opted to write this entry in a condensed format, so for further context I suggest grabbing the slides and following along with my presentation. However, much of what I spoke about is contained within. Some people remarked to me post-presentation that they wished they had seen my talk before they embarked on their journey in collecting security logs.
One thing I'll warn you about is that I may skip things here since I covered them verbally in the talk. Additionally, the notes that form most of this entry were initially strictly for me, so the odd typo or grammatical error is to be expected.
A copy of the slides can be downloaded here.
Getting a running start...
To give you a backgrounder on who I am: I've been working in various information security roles for the past decade, the past 3.5 years of which (as of this writing) have been as the senior analyst for a natural resources company. The company I work for has about 10,000 employees scattered globally and has some interesting challenges: namely a need to defend both corporate and industrial control assets, plus geographical challenges that I never thought about until I came onboard. We process and store anywhere between 170 and 250 GB of data within our security log software daily, with a year's retention.
You're done with command line tools like grep, awk, and cut, and you're done with data going into the aether, so now you want to collect your logs and have them in a central repository. You have figured out that the tools of old are not faster (they're not) and now you're embarking on looking for a software solution.
Here's the mess you'll encounter:
This is a sampling of the smörgåsbord that is security log collection. All of these displayed above have different use cases, different feature sets, and you will be bombarded with buzz terms like "machine learning" and "threat intelligence". Vendors are going to be super eager when they get a whiff of you having a budget and will do anything to convince you that their solution is the best option. I'm not going to tell you what to choose but I will tell you what to consider.
Right off the bat, you must try to keep this simple, at least in the short term. The first six months of you using your new kit are going to be you implementing it, getting it configured right, and then pulling your hair out because you understand just a fraction of what it is doing. It is super tempting to aim for these really neat features that on the surface appear to solve all of your woes, but realistically you need to set expectations and set them early so you don't get blind-sided when you discover that the features aren't living up to them.
Knowing your network before you dive in is super important. Do you know everything that is on your network? When your network is small (say a 20 person company), there probably aren't a lot of legacy things, or at least if there are, you know what they are. However, as time has gone on, your large organisation probably hasn't been so lucky and you have oddball things scattered about that have become long-forgotten yet somehow important.
Annoyingly, not every device is going to have an effective method for log collection! Even security appliances can fall victim to this issue! In one case, I had a proxy server that had only one output for its logs, and at the time we were sending them to an analytics product from the same vendor. We chose to ditch that product and have the proxy send its data directly to our log collection software, namely because it could do a much better and faster job at answering questions and generating reports.
Not everything needs to be collected either. Your brain doesn't store all the information it is fed. While you're reading this, your eyes are capturing approximately 30 GB of data (let's just run with the idea of your brain storing bits here). Neuroscientists estimate that you could keep anywhere between 10 TB and 2.5 PB within, meaning that within a whole day you'd be full! Of course, your brain is very clever and discards much of that information unless it is important. You need to know what you want to keep, otherwise things will become way too much to handle!
If your team is large enough, maybe host your security logs yourself! It's a lot of work, but then you have full control over the log collection. However, you need to be prepared to have lots of storage capacity. How long do you want to keep it around?
My organisation collects 200 GB per day and we're about to migrate 72 TB of our data to our own infrastructure. Can you host 72 TB? Can you backup 72 TB? Do you need to collect a year's worth of data?
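The storage math is worth doing up front. Here's a back-of-the-envelope sketch using my numbers (the doubling factor for backups is an assumption, not a rule):

```python
# Back-of-the-envelope retention sizing for a log store.
daily_gb = 200          # average daily ingest in GB
retention_days = 365    # how long you want to keep data around

raw_tb = daily_gb * retention_days / 1000  # decimal TB
print(f"Raw storage needed: {raw_tb:.0f} TB")  # 73 TB

# Don't forget backups and indexing overhead; a conservative
# fudge factor can easily double the requirement.
with_overhead = raw_tb * 2
print(f"With a backup copy: {with_overhead:.0f} TB")  # 146 TB
```

Run that against your own ingest rate before you promise anyone a year of retention.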
However, on the flip side, the advantage of having someone else host your log collection is that it takes the infrastructure challenges off of your plate. Make sure your SLA includes backups and storage redundancy! And keep in mind that you may want to retrieve that data should you decide to pull it into your own environment.
You’re now freeing up time and energy to devote to responding to incidents or integrating the log collection into other systems. However, one challenge you may face is that you will lose a lot of flexibility in terms of configurations.
You're going to find that, since you chose a SaaS solution, certain configurations are not going to be supported. Or, after badgering the vendor to improve something, they come up with a solution that their own support team cannot make sense of, so when it breaks you find yourself pulling your hair out because they won't grasp the urgency of the situation.
Do you have change control? It is quite possible that a new device or a change to an existing device could lead to a problem with log collection. Will a change to an existing device cause the log output to change or stop all together? Will you be able to support the log collection?
Things are in place--now what?
Make sure that everyone is aware of what your intentions are for these logs. Knowing the difference between collecting, monitoring, hunting, and predicting when it comes to all this is super important in setting expectations.
Truthfully, never promise more than collecting: if you're using this in an incident response role, you want to set the expectations for how you'll deal with a situation before something occurs. You may be creating alerts (monitoring) based on existing indicators, and you're probably going to do the same for hunting.
Predicting is something I highly doubt you’re going to achieve. You can of course use the software to forecast, but it probably isn’t going to tell you when you’re about to get breached. If any product could do that, security would be “solved”.
You need to know what you’re going to do with this data. Is it for digital forensics? Incident response? Employee behaviour?
Can you measure the amount of data and events per second (EPS) you're going to generate? I won't get into how you'll want to do this, as there are many guides out there for doing it effectively, but it is something you'll want to consider before signing that licence agreement. At 200 GB per day, we're eating up almost a quarter of our daily Internet traffic sending our data over to an AWS cluster.
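That said, a rough EPS estimate only takes a few lines. The average event size below is an assumption for illustration, so measure your own before trusting the answer:

```python
# Rough events-per-second (EPS) estimate from daily ingest volume.
# The average event size is an assumption -- sample your real logs!
daily_bytes = 200 * 1000**3   # 200 GB/day, decimal units
avg_event_bytes = 500         # assumed average size of one event

events_per_day = daily_bytes / avg_event_bytes
eps = events_per_day / 86400  # seconds in a day
print(f"~{eps:,.0f} events/second sustained")
```

Licences are often priced on exactly this number, so it pays to know it before the vendor tells you.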
It's quite possible you'll end up with duplicate data. Do you need to collect data from the router sitting in front of your firewall?
Who is responsible for the data? Who is supposed to be involved in any changes? This is an opportunity to learn the RACI model.
Working with the data
Let's talk about what I am collecting.
The biggest set of logs you'll probably collect will be your event logs. You have your typical Application, System, and Security logs, but the Windows Event Log system is a lot more extensible than just that. Some security products create their own event logs (such as many endpoint solutions) and having that data collected can make a whole lot of difference in incident response.
On the subject of endpoint security, there are a lot of products out on the market that will keep a ledger of sorts to help in digital forensics. However, your organisation may be unable to afford such software. An alternative is to make use of Sysmon, a free tool from Microsoft's Sysinternals suite. Actions such as specific user behaviour and various other events not otherwise captured by the standard event logs can be recorded. This has proven very, very useful in determining the spread of malware within my organisation.
However, bear in mind that event logs are probably going to be the bulk of your log collection so you should determine how much traffic you’re going to generate from those details alone.
One thing to take advantage of here is forwarding your event logs to a central spot. This is a feature that exists within Windows (Windows Event Forwarding) and can be very useful in avoiding installing too many collector agents -- especially useful for workstations, such as what I mentioned here with Sysmon. Bear in mind that you will need to do some rewrites on the log data to correctly tag its origin, as the source and host fields may end up getting lost.
Proxy data is very useful as well, but something to keep in mind is that if your web filtering solution is doing HTTPS inspection, you may need to ensure that recording this activity is permissible. I am not your risk and legal team, so I won't elaborate further on this point, but it may be very important to keep in mind.
When it comes to DNS, it’s surprisingly easy to capture and record this traffic. Assuming you have centralised your name servers, you don’t need to do much in the way of logging to disk to make this effective. There are ways to just outright capture the traffic by monitoring the network traffic itself. You can either just sniff traffic right on the DNS servers or mirror traffic going to them to a dedicated collector running the capture software.
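To give a sense of how simple the capture side can be, here's a sketch of pulling the queried name out of a raw DNS query payload -- the kind of parsing your capture software does for you once it has the packet. (This ignores DNS name compression, which doesn't appear in the question section of queries anyway, and the sample packet is hand-crafted.)

```python
# Minimal parser for the question name in a DNS query payload
# (the bytes after the UDP header). Enough to log who asked for what.
def parse_qname(payload: bytes) -> str:
    labels = []
    i = 12  # skip the fixed 12-byte DNS header
    while payload[i] != 0:  # names end with a zero-length label
        length = payload[i]
        labels.append(payload[i + 1:i + 1 + length].decode("ascii"))
        i += 1 + length
    return ".".join(labels)

# A hand-crafted A-record query for example.com:
sample = (b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
          b"\x07example\x03com\x00\x00\x01\x00\x01")
print(parse_qname(sample))  # example.com
```

In practice you'd let a dedicated capture tool on the mirror port do this, but it shows how little work stands between raw traffic and useful log data.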
DHCP is also useful (and can be captured the same way) for spotting rogue devices and for tying historical DNS lookups back to a device. If you don't have any sort of network access control, this could be a way to supplement one.
Firewall data: not much to be said. If you're lucky enough to have a firewall that does packet inspection and thus can tag the application data, or is able to tag users and IP addresses, this can be extremely useful.
Be careful with Netflow! Holy heck have I ever seen this one go sideways when far too many devices are capable of it. Netflow is super useful for determining lateral movement within the network, but because of how noisy it is you may find it nigh-impossible to make use of. In my situation, I am only using it for egress and ingress traffic at sites that our main firewalls do not cover and that are set up in a split tunnel configuration, meaning that Internet traffic isn't captured by our usual means. Netflow can easily surpass the amount of data that event logs generate.
Lastly, I have lots of other random logs I am collecting from mainframes, various database software, and from Internet-facing appliances.
What about your cloud data? You should own that data and if your vendor says you don’t then it’s time to consider going elsewhere.
Many SaaS solutions do offer log data either via an API or syslog (usually over TLS). However, it may not be documented all that well and there is a good chance that you’ll either have to write some of your own code or have something sitting within your DMZ to capture this traffic.
I’ve had situations where the vendor has provided the log data but it’s only what they deem as “important” and not the general activity. Be prepared for this to happen and do not hesitate to demand a feature request to change this.
A grotesque myriad of log formats
You're going to find three popular HTTP daemon log formats: Apache, W3C (from the World Wide Web Consortium, the standards body for the web), and Microsoft IIS. IIS is kind of ridiculous, as they modeled their format on W3C's, but like all things Microsoft from the late 90s and early noughties, they opted to sort of go their own way.
This is something you’re going to have to face and you have a few options for dealing with it. In a lot of cases, the software you use will do field extractions for you automatically if the product is mainstream. You may luck out and the vendor or a very nice person may have written a solution for you. However, even if a solution is provided, you have weird edge cases that arise.
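As an illustration of what a field extraction looks like under the hood, here's a named-group pattern for Apache's Common Log Format. The sample line is made up, and W3C and IIS differ just enough that each needs its own pattern:

```python
import re

# Field extraction for Apache's Common Log Format.
CLF = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

line = ('203.0.113.7 - alice [10/Oct/2019:13:55:36 -0600] '
        '"GET /index.html HTTP/1.1" 200 2326')
m = CLF.match(line)
print(m.group("host"), m.group("status"))  # 203.0.113.7 200
```

Mainstream log software ships patterns like this for you; the pain starts when your source is just different enough that none of them quite fit.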
It’s very tempting to log all traffic on your file server. If you’re small enough, it’s probably no big deal, but what happens as you scale? Just like I mentioned earlier with Netflow traffic, it can become far too difficult to sift out what is a real threat and what is normal activity. Consider evaluating the event IDs you absolutely want and filter out whatever you don’t. This can be useful in reducing the amount of traffic and storage required.
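A filter can be as simple as an allowlist of event IDs. The IDs below (successful logon, failed logon, process creation, service install) are common examples for illustration, not a recommendation -- build your own list from what you actually hunt on:

```python
# Keep only the Windows security event IDs you care about; drop the rest.
# 4624 = logon, 4625 = failed logon, 4688 = process creation,
# 7045 = service installed. These are examples, not a complete list!
KEEP = {4624, 4625, 4688, 7045}

def worth_keeping(event: dict) -> bool:
    return event.get("EventID") in KEEP

events = [{"EventID": 4624}, {"EventID": 5156}, {"EventID": 4688}]
kept = [e for e in events if worth_keeping(e)]
print(len(kept))  # 2 -- the noisy 5156 (WFP connection) is dropped
```

Whether the filter lives on the endpoint agent, the forwarder, or the indexer depends on your product, but the principle is the same: decide up front what earns its storage.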
I guess I am still harping on event logs here, but by default they include helper details that to you and me are super useful. However, when you have millions of events per day, those details add up and are not worth keeping around. Consider filtering that stuff out too and you may find that you save at least a third of your storage requirements!
These are the typical contents of a Windows event log. Out of the box your software solution should support the format, and if not then I have no idea what you're using and you're probably going to want to reconsider everything you're doing. However, it will by default only deal with your typical System, Application, and Security event logs.
In this case, we have an output from Sysmon, allowing my organisation to see what is happening on a machine--if you have the ability to set this up, Sysmon data is an absolute goldmine of information. However, if you look at the message field, you'll notice that it starts to differ.
Okay. You’ve fixed the Sysmon issue, but now you’re like me and you’re collecting AV logs. In this case, SCCM manages our built-in Windows AV solution but in order to get the data out, we had to create a trigger within MSSQL to dump the event into an event log. It works great but take a look at the Path and DN fields. This would not be extracted properly with the same solution as Sysmon.
There are times however where everything just sucks. The above is an output that couldn't be extracted properly. It was awful. None of the data was consistent and the log software would just do everything improperly. I hated it so much but I fortunately had a solution after much complaining to the vendor (I'll elaborate a bit later).
To fix most of these, you'll want to learn regular expressions. They're absolutely worth learning, but they do require a lot, and I mean a lot, of time to learn effectively. I am very rusty with them these days, but there are solutions for writing them without having to spend too much time getting beyond the basics. I recommend working with RegEx101 if you want to get a start on things.
However, don't get too creative as it doesn't fix everything.
Shout out to Ex-Parrot for this disastrous regular expression.
This was not written by hand. If your regular expressions are getting to this point, you’re going to hate life. Regex is NOT a solution to all of your woes and does not mean 100% perfection. You will NOT achieve perfection in your extractions--but you will get something functional with enough work.
Do not use regular expressions to parse XML either. Your software should be able to work with it natively (as well as JSON, which can be regex'd but shouldn't need to).
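For example, pulling fields out of a Windows event rendered as XML takes only a few lines with a standard XML parser. The event below is trimmed for illustration:

```python
import xml.etree.ElementTree as ET

# Parse a (trimmed) Windows event rendered as XML -- no regex required.
xml_event = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>4624</EventID>
    <Computer>WS01.example.local</Computer>
  </System>
</Event>"""

ns = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}
root = ET.fromstring(xml_event)
event_id = root.find("e:System/e:EventID", ns).text
computer = root.find("e:System/e:Computer", ns).text
print(event_id, computer)  # 4624 WS01.example.local
```

A real parser won't choke on attribute ordering, whitespace, or escaping the way a hand-rolled regex will.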
If your syslog output has CEF (Common Event Format) as an option: use it. There are variations of it per vendor, but it's night and day in contrast to other log formats.
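For a taste of why CEF is easier to deal with, here's a deliberately naive parser sketch. The sample event is made up, and real CEF has escaping rules (for `\|` and `\=`) and space-containing extension values that this ignores:

```python
# Minimal CEF (Common Event Format) header/extension split.
# Naive on purpose: ignores escaping and spaces inside extension values.
def parse_cef(line: str) -> dict:
    parts = line.split("|", 7)  # seven header fields, then the extension
    fields = dict(zip(
        ["cef_version", "vendor", "product", "device_version",
         "signature_id", "name", "severity"], parts[:7]))
    ext = {}
    for token in parts[7].split():
        if "=" in token:
            key, value = token.split("=", 1)
            ext[key] = value
    fields["extensions"] = ext
    return fields

sample = ("CEF:0|Acme|Firewall|1.0|100|blocked connection|5|"
          "src=10.0.0.5 dst=203.0.113.9")
parsed = parse_cef(sample)
print(parsed["name"], parsed["extensions"]["src"])
```

Compare that with writing a fresh regex for every vendor's freeform syslog message and the appeal is obvious.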
In the case of the vendor with the terrible output, after much complaining and pointing out that their other products can do better, they provided us with a JSON output pushed over HTTP. It has been the most workable data I have received yet, so sometimes vendors can improve, and improve really well, in this space if you ask them to!
Things will break, I promise you
Prepare for things to break and prepare to not panic. You must accept that somehow everything will break. Have everything documented and understand the impact of missing or delayed data.
If any amount of downtime or interruption causes a compliance issue, you must prepare for it either through redundancy or risk acceptance. These are things I cannot walk you through but know who to consult as it is important!
Regardless of whether you're dealing with one, two, or fifteen time zones, you should always set everything to UTC. This will make it easier for you to build timelines. Having accurate time also means you can correlate with other sources effectively.
Have a central NTP source too. Time can be a few seconds off but any more than that and it makes correlation very difficult!
Time will break without you even trying. One of the times here is correct and one of them is not. This will become a headache.
As I mentioned earlier, one of the things I deal with in my organisation is industrial control--you may have heard it referred to as "real-time systems", "process control", or "SCADA", but they're all one and the same. There's a huge concern for safety, and as a result we do monitor some aspects of our IC environments.
Be very, very careful when it comes to monitoring these spaces as even though you’re listening, you’re not necessarily passive about it. I highly recommend skipping to the part of the video where I use the above image as I go into detail about how using TCP instead of UDP can lead to trouble.
If you’re a team of one, then you’re responsible for everything that breaks!
If not, then you need to be able to identify the problems and then determine where the fault lies. Have your partners within your department involved in these situations and make sure that they’re aware of what their involvement needs to be. You may not have access to that firewall that is no longer sending syslog but they do. These teams may want to identify your log collection software as at fault so ensure that you have checked everything on your end and done the appropriate tests. Don’t be afraid to use netcat for example!
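If netcat isn't handy, a few lines of Python do the same job for proving syslog packets actually leave your box. The host, port, and message below are placeholders:

```python
import socket

# Poor man's netcat: fire a test syslog message at a collector so you
# can prove packets leave your box before blaming the other team.
def send_test_syslog(host: str, port: int, message: str) -> None:
    # <134> = facility local0 (16), severity informational (6):
    # priority = 16 * 8 + 6, per the classic BSD syslog format.
    payload = f"<134>{message}".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))

# Placeholder destination -- point this at your real collector.
send_test_syslog("127.0.0.1", 5514, "log-pipeline test message")
```

Pair it with a packet capture on the receiving end and you can quickly narrow the fault to your side, the network, or theirs.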
Be prepared for things to break and have a plan to deal with it. This also includes identifying the risks involved.
Don’t hesitate to hire a consultant and make use of them. They’re assets and can make your life easier. You're burning money if they're doing nothing.
This is probably the most effective security software you’ll use, but it’s not a holy grail so don’t treat it as such.
Lastly, this was the first talk I've given since coming out as queer. I really appreciate those who were in attendance and appreciated the questions and feedback.