The Subtleties of Enterprise Security Vulnerability Management — at Scale

Enterprises face some interesting challenges as they grow. Once you scale to any real size, tasks that once seemed simple become unmanageable, difficult, even chaotic. One of those tasks is vulnerability management, and while it may sound trivial, I assure you it is, indeed, not.



At small scale


When vulnerability management is done at a small scale, before you get into the large enterprise space, it is relatively simple. You fire up your favorite vulnerability scanner, scan your IP space, which is presumably well-defined, and manually validate the results with one of your ace security analysts. Simple.


Managing the whole thing isn’t too tough and can even be done on a spreadsheet (for the ultra-low-budget SMB), or via a pretty dashboard and management interface. These interfaces usually feature re-test capabilities, trend reports, and deeper insight into the types of issues you face. This is, of course, on a small scale.


Even in a situation where your network is segmented, and you have multiple environments, you only need to worry about getting access to the segment and getting permission to scan the various production and non-production systems. Still relatively simple on a small scale.


Factor in the added complexity of having sensitive network segments, such as those governed by PCI, or something similar and still … it’s a matter of getting signatures and executing purposefully and carefully. Again, relatively simple — on a small scale.



At the enterprise scale


Once you hit enterprise scale with hundreds of thousands of nodes, you start running into scaling issues. There are multiple problems with assessing vulnerabilities at that scale, but the two most prominent are completeness and certainty.


Sure, you get technology that scales to a server/sensor model, where you can place sensors inside sensitive segments and dedicate sensors to specific IP ranges. SPI Dynamics did this with the AMP Server and WebInspect a decade ago, and many other vendors do it well today, but as with many things in the security space, technology is not the issue.


When I say that completeness is an issue, I don't mean just being able to scan everything in a meaningful amount of time (although that is definitely something to think about). I'm talking about something as seemingly simple as knowing your IP space, and having a good sense of where endpoints and nodes exist, where they are likely to be, and in what quantities. You need to know this so you can appropriately plan resources for scanning. Completeness also refers to the challenge of scheduling regular, repetitive scanning of the various environments you have. From test, staging, or production environments to general network, application hosting, and third-party space, the challenge of getting permission to scan your space at regular, meaningful intervals is, in itself, a problem. Then there's that stipulation, "in a meaningful amount of time," which we continue to struggle with in the enterprise space.
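As a rough illustration of the inventory side of completeness, here is a sketch that checks scan results against the IP space you believe you own, and flags hosts that show up outside it. All ranges, names, and the `coverage` helper are hypothetical, not from any particular scanner's API:

```python
import ipaddress

# Hypothetical inventory: the CIDR blocks you believe you own, per environment.
inventory = {
    "prod": ipaddress.ip_network("10.10.0.0/16"),
    "staging": ipaddress.ip_network("10.20.0.0/18"),
}

def coverage(scanned_hosts, inventory):
    """Split scanned hosts into those inside known IP space and surprises.

    Surprise hosts mean your inventory is incomplete, which undermines
    any claim of scan completeness.
    """
    known, unknown = [], []
    for host in scanned_hosts:
        addr = ipaddress.ip_address(host)
        if any(addr in net for net in inventory.values()):
            known.append(addr)
        else:
            unknown.append(addr)
    return known, unknown

# One host inside known space, one outside it.
known, unknown = coverage(["10.10.1.5", "192.168.9.9"], inventory)
```

In practice the "inventory" would come from an asset database or IPAM system; the point is simply that completeness is measurable once you have something authoritative to measure against.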


Consider an environment such as a private cloud segment, where in many cases developers set up, use, and tear down virtual servers faster than you can scan them. How you define completeness has everything to do with how you're going to approach that environment, and whether you care if you scan every IP address that shows up.


The other thing you need to give serious consideration to when you hit any serious scale is certainty. When you are scanning 100 IP addresses you can manually go through and validate what your scanner turns up for false positives. When you’re scanning a million IP addresses or more … this gets a little tricky.


On one hand, you can simply trust your automation, and your vendor, that it’s doing its job with a very low false-positive rate (which I’m pretty sure most vendors claim anyway). This is fine until you run into situations where you ask someone to fix a false positive, and they start to question all of your results.


I wish I could say this is uncommon.


On the other hand, you could attempt to manually validate some subset of the total scan results. Doing this on a million IP addresses, even for critical issues alone, gets tricky, especially when "critical" is subjective to specific environments and specific issues on specific systems. Handling this at the enterprise scale is difficult.



Making it work at scale


How does one make this work at scale? Automation and a smart audit strategy are my best enterprise-tested answer.


First, you have to leverage your automation and develop a plan. Decide what your exposure-window tolerance is, then build out your scanning environment with that in mind. If your exposure window is 24 hours, you need enough automation to complete a scan of any environment it's attached to in roughly 18 hours. This allows for six hours of what I'll simply refer to as wiggle room.

Next, determine how thorough you want your scans to be, and what you'll be scanning for. Each environment will likely have a slightly tailored scan policy, so that you're not scanning for all 50,000 signatures on every node. Otherwise you'll never finish, and you're likely to blow up production systems along the way. (Not that it's happened to me before ... )

Now that you're scanning all the critical stuff you want to scan (hopefully that's your entire environment, at least at some level) with a meaningful policy, you need to figure out how to scale the certainty of your results.
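The capacity planning described above reduces to simple arithmetic. Here is a minimal sketch; the throughput figure and host count are illustrative assumptions, not measurements from any real scanner:

```python
import math

def scanners_needed(total_hosts, hosts_per_scanner_per_hour,
                    window_hours, wiggle_hours=6):
    """How many scan engines are needed to finish inside the exposure
    window while leaving wiggle room for re-tests and failures."""
    usable = window_hours - wiggle_hours  # e.g. a 24h window -> 18h of scanning
    if usable <= 0:
        raise ValueError("wiggle room consumes the entire window")
    capacity_per_scanner = hosts_per_scanner_per_hour * usable
    return math.ceil(total_hosts / capacity_per_scanner)

# Illustrative only: 900k hosts, 2,500 hosts/hour per engine, 24-hour window.
engines = scanners_needed(900_000, 2_500, 24)  # -> 20 engines
```

Real throughput varies wildly with scan policy depth and network segmentation, which is exactly why each environment gets its own tailored policy before you size the fleet.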


Environments of different criticality levels receive different levels of audit/verification scrutiny. In an environment where you can ill afford to miss something or report a false positive, you'll perhaps want to do a 25 percent random-sample analysis of your critical and high issues; in less critical IP space you may be okay with a 10 percent random sample of critical issues only. This gives you a chance to plow through your vulnerability library in (you guessed it) a meaningful amount of time so you can move on to fixing the issues, and there an entirely different battle begins.


Hopefully I’ve given you some insight into this very critical issue, which I know many of you are facing right now but haven’t found good scalable solutions for. As I work with more and more organizations to design, implement and test these types of strategies I’ll keep sharing (anonymously, of course) these types of lessons learned so that the benefit is maximized.


Good luck, and remember you can always get ahold of me to discuss this issue more in depth, suggest changes/efficiencies, or ask for help.


Follow the Wh1t3 Rabbit!

Drew Maness(anon) | ‎06-29-2013 05:52 PM

Very good points Raf.  


In its simplest form, vulnerability management, or I should say the success of vulnerability management, depends on the company's process for building from a secure baseline and on its patch management process. The three form a cyclical, symbiotic process, each feeding the others. If you don't have a strong build-image process and/or patch management, then vulnerability management becomes an exercise in futility. In fact, in my experience the real cause of a poor VM program is the lack of a strong feedback loop into both the build-image and patch management processes. To fix it, companies often throw more resources and tools at the VM process (e.g., "we need to reduce false positives"), but these are just delay tactics that avoid addressing the real issues.


GIGO - Garbage In, Garbage Out. VM is like cleaning a theater after a show. The more you sell at the concession stand, the more garbage you are going to pick up at the end of the show.


Build Image/process -> Patch Management -> Vulnerability Management


This seems like a simple loop. We build a system with the latest patches and best practices, we patch as necessary, and we use vulnerability management to validate our process and ensure no new vulnerability has been introduced. However, here is how it works in reality.


Build Image/process -> Systems are built however the admin feels like doing it this week. Yeah, they'll use the "Gold Image," but this image is only updated once, maybe twice, a year. If they remember, they may install the latest patches.


Patch Management -> Hey, we're still busy with our manual build process, and the systems are running fine. Yeah, I know I got the Tuesday patch notice, but no one is even going to check for another month. Even after that I have two weeks to thirty days to fix it, and they won't really scream for 90 days, just before the PCI report is due. So I'll get to it in a couple of months and do a bunch at once. Besides, the system is working fine. What's the big deal?


Vulnerability Management -> Hey, I run these reports once a month and there's a bunch of stuff you need to fix. Hello ... hello ... There, I reported it to management ... Hello? Hello? OK, it's been 90 days; can you please fix just these high-severity ones? I need to run the PCI report. Great, it's fixed. I'll run this month's report. Hey, why are these systems that were fixed now showing the same vulnerability from three months ago? And these new systems are just as bad.


How should it work?


Vulnerability Management fixes & Patch Management -> Build Image/process -> Patch Management -> Vulnerability Management.


- The feedback loop from patch management and vulnerability management into the build image/process should happen as often as your scans/patching. At least monthly you should be making a new build image and updating your build procedures (not all fixes are patches). This is similar to a firebreak, where firefighters jump ahead of the fire and dig a trench so the fire can go no further.


- Patch Management: the feedback loop into the build image/process is as important as, if not more important than, applying the patches to all systems. (Firebreak.)


- Vulnerability Management: the feedback loop into the build image should be verifying patch management, not directing what "absolutely has to get done."


This is all a long-winded way of saying: if your build process is crap, and your patch management process is crap, and you have no feedback loop into either, then there is absolutely nothing you can do to make vulnerability management work.

Drew Maness(anon) | ‎06-30-2013 12:01 PM


Also, in a truly virtualized, automated environment (yes, "cloud," but it needed defining), where you can build your image nightly from source and you FIFO (first in, first out) your instances so that no one instance is up longer than 24 hours (I could argue two weeks to a month), in this environment ...


You no longer need Patch Management, and


Vulnerability management just becomes part of your Rugged DevOps, oops, I mean part of security acceptance in your sprint's SDLC ... you know, test before you put it in production. Which means instead of scanning all of production, you are truly just managing the vulnerabilities of the system and can focus the team's attention on finding new vulnerabilities.

Lee Parker(anon) | ‎06-30-2013 07:59 PM

Our problem is that we have a large amount of integration testing on top of a large number of systems to test. I didn't really see anyone mention the testing cycle outside of software development. It is the testing cycle that slows our patching process.


The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation