Transcript
Parad: I would really like you to imagine something. Imagine that you’re an engineer who is on call, it’s Saturday evening and you get paged. There is a critical issue in one of your team’s services, multiple customers affected. You know the service, you wrote the code, so you hope this one will be quick. You start the company-issued laptop. You really dislike this machine. It’s slow, clunky. You don’t like the operating system. You can’t even install an ad blocker for your browser. You log in, type in the password, your password is older than 90 days, please change it. You set up a new password, incrementing the last digit as one does, you’re in. You go to your incident management portal. This website is not available. You’re not in the office, so you need to start the VPN. You do that, it fails to connect, wrong password. Ok, you try again, you’re in. You look at the incident report, the issue is actually really well described. There’s even an error with full stack trace.
Then you read that the issue is intermittent, so you lose some of that enthusiasm. You don’t have access to production, you can’t reproduce the issue locally, which means you’ll have to go to the prod test environment, which is supposed to be exactly like production, but for developers to test in. It’s never a good time. You log in to the environment, obviously it’s a mess. Your service doesn’t even work correctly, it’s misconfigured. You know this one, it’s an easy fix. It’s just a few tweaks in your cloud console, a configuration change. You go and do that. You log in to your cloud console; again, you have no permissions to access that resource. You spend the next half hour rummaging through the console, and finally you actually manage to grant yourself those missing permissions. You carry on. You hack the prod test environment, you start investigating. You track the issue down to one of the libraries that you’ve been using. This is strange, the code should really work. You try googling.
The first result from Stack Exchange seems promising. You click, this website is dangerous, it has been blocked. You’re still on VPN. You disconnect that, carry on. It turns out that the library has been automatically upgraded to the latest version, which should be a good thing. It’s an easy fix. You make the necessary changes. You create a pull request, and the build fails. Of course, it fails. You didn’t expect it to work the first time. It’s your vulnerability scanner throwing some of those silly errors that everyone just keeps ignoring, so you do that. You jiggle the build, it passes. Time to deploy, and it fails. Of course, it fails.
The connection to your credentials vault in test environment times out, as it often does, so you retry and retry and retry, and it works. Your fix is live. Since you hacked the prod test and you still have no access to production, you actually can’t be really sure if you’ve fixed the issue for good. You reluctantly close the laptop and hope for no more alerts till Monday. If we look at this whole situation, we can see that the actual valuable productive work was really easy and straightforward. We have encountered a lot of tiny, yet annoying obstacles along the way that slowed us down at every corner, broke our flow. Those are only some of the examples of productivity sacrifices that companies make in the name of security.
Background
I’m Dorota. I work at Authress, which is an authentication and authorization API. If you’re making software, you want your users to log in or have different permissions and roles. Instead of coding those bits yourself, you can use our tool. Based on what Authress does, dealing with identity and access, you can imagine that our customers take it for granted that we’re secure. We’re also a Swiss-based company, which means we’re actually under legal obligation to follow the principles of privacy by default, and privacy by design, which actually means security by default. We’re also a private for-profit company. We’re not a startup, but we happen to be a very small company. This means that every one of our engineers has to be happy and productive. We really can’t afford to implement all those security best practices, the ones that make your engineers’ lives a little bit harder. We really had to find a different way to be secure by default.
Traditionally, and it is still a very common mindset among security professionals, we tend to think about security as building this impenetrable citadel, this bastion, so well defended that no one can get in. No one. Not even our own employees. Building such citadels is difficult and they’re awkward in everyday use. What we often end up with is something that’s more like this, security theater, or as I like to call it, security circus.
Making Security BLISS
What if, instead of building that citadel or settling for some bouncy castle, we could make security BLISS? BLISS is a framework that will help you identify how to replace some of those annoying, cumbersome security practices with ones that don’t impede productivity. Let’s see what that means. B stands for Bulkheads. It’s a feature on all submarines (maybe not the Titan): when a section of the ship suffers a catastrophic failure, it gets sealed off so that the problem doesn’t spread. We can apply the same principle in software security. When part of a system gets compromised, it should be possible to seal that part off or remove it so that the problem is contained. Think of it as reducing the blast radius. How well we can apply this principle is going to be severely constrained by our architecture choices.
If we’re using microservices, with each service deployed independently and services communicating solely through REST APIs, then the service boundaries are our natural bulkheads. It’s really easy. If we have a monolith, we have to get a little more creative with how we apply bulkheads. Things like deploying separate instances for different customers or using containers are good examples. Regardless of how we do it, we want to have those bulkheads spread throughout our systems and processes. L stands for Levels. What are we protecting? Not all data is equally sensitive, and not all operations are equally critical. There are different levels of risk. It makes no sense to put a single thick wall in front of everything. We should have some gradation, different levels in our protection strategy, so that we can add a little bit of protection for less important things, a little bit more on top, and a little bit more, and so on, proportionate to the level of risk.
I stands for Impact. When we think about cybersecurity, the first thing that tends to come to mind is preventing the attacks, making it less likely that we get breached. This was a very good way of thinking in the early days of the internet. Nowadays, when you have something out there, it is getting compromised. It’s only a matter of time. Instead of putting all our efforts towards making it less likely that an attack will succeed, we should try to limit the impact first. We get compromised, so what? Do we have bulkheads in place to limit the blast radius? What can the attackers do with their access? Do we have more levels of protection further down the line? What’s the worst that can happen? How can we lessen that impact? The first S stands for Simplicity. If we put practices in place that are difficult to follow, people will find workarounds.
If we keep adding tools to our tool chain, that’s more vulnerabilities to patch. If you need five years of experience in security in order to even understand our security strategy, who’s going to execute on it? As a general rule, the simpler something is, the easier it is to secure. Notice that there is some tension between this and the other principles I’ve listed. Adding bulkheads means adding complexity, because now we’re no longer dealing with one thing, we have multiple things. On the other hand, what is simpler? One big system that does everything or multiple systems that do one thing each? Depends. Same goes for levels. Different levels of protection make things more complicated.
On the other hand, having those levels lets us keep things simple for, hopefully, the majority of cases. We want simple, not simplistic. The best way to think of it is whenever there’s a tradeoff, we should opt for the simpler option. The last S is my favorite. It’s creating a Pit of Success. It should be so easy to do the right, secure thing that we fall into that pit of success almost by accident. To put it differently, doing the wrong thing should be annoying and cumbersome. It should require us going out of our way. The default option, the one we take without thinking, should be the right one. We should be really thoughtful about where we place our obstacles. Ideally, the obstacles should be reserved for the attackers and for things that are extra risky, that we only do rarely. To make security BLISS, we have bulkheads to reduce the blast radius. We have different levels of protection. We focus on minimizing the impact of incidents. We keep things simple. All of that lets us create a pit of success.
Applying the BLISS Principles to Resolve Productivity Issues
Let’s have a look at the situation that I asked you to imagine in the beginning and see how we can apply these principles to resolve all those little productivity issues without sacrificing actual security. First thing to get in our way was our machine. It needs a password to start. The password has to be changed. It’s slow, clunky. We don’t like the operating system. We couldn’t even configure the device in the way we wanted. That tends to affect our productivity. This happens all the time. Companies not only restrict our hardware and operating system, they also install mobile device management, MDM software. If you’re really lucky, you may not notice it’s there.
Most of the time, your device will be slightly slower. The battery will drain faster. You may be prevented from installing little quality-of-life tools that help you, or an update may be pushed on you at the most awkward moment. That is a tiny but constant drain on our productivity. If you’re really unlucky, you may not even be able to do any work at all. Anyone remember CrowdStrike? What are we protecting? Is our laptop really that special? The reason why companies do that lies in this old-fashioned approach to security, the idea that we can build that citadel, secure the perimeter to keep our assets secure. We have to stop the attackers at all costs. It is costly, not only having to manage those devices, it costs us engineers’ productivity.
The worst of all is that it actually doesn’t do that much for security. It reminds me of this old XKCD comic. The reality is that most successful attacks happen because of phishing. Restricting our operating system or installing MDM software doesn’t protect us from phishing. We know that because companies that do it get phished all the time.
Maybe instead of trying so hard to protect that device and paying the productivity tax, we should try and minimize the impact instead. Our machine gets compromised, so what? What’s on that laptop that could be valuable to the attacker? Let’s see. We have emails, chat history, maybe some internal documents. If we make it easy to store and access these in the cloud, we should have no local copies. Unless you’re an executive or you work in HR, your emails aren’t really all that interesting. What else do we have? Credentials. Why? Because if we force our engineers to do annoying things like connecting to a VPN every time they’re not in the office, some credentials will inevitably end up as part of some script in plain text on that machine. Don’t ask how I know. If we create a pit of success everywhere credentials are involved, this shouldn’t happen. Let’s keep that in mind. What else is on the machine that’s precious? Source code.
If I want to do development on the machine without pulling all my hair out, I have to save that Git SSH key, which means whoever has my device has access to see all the source code that I can see. What’s more, they can make commits. No. Obviously, we could then double down on protecting our device, or do something even more convoluted like always working only in virtual machines. That’s a lousy tradeoff.
Let’s stop and think for a moment, what are we protecting? The attackers are almost never after the source code itself. It doesn’t have inherent value. Your code is actually not that special. What they’re after is credentials, data, and your production environment. Maybe instead of trying so hard to prevent our source code from leaking, we should look at the impact instead. I have some good news, everyone. If you are doing all of the usual, common-sense continuous integration and continuous delivery practices, you are already sufficiently minimizing the security risks of those commits coming from a filthy, unprotected machine. If you are doing CI/CD, I bet you are protecting your code on multiple levels already. Here’s the thing, this is a very standard, basic, cookie-cutter CI/CD pipeline. We write and commit some code, we create a pull request, some tests run, and it gets merged and deployed to production.
First step, that’s the least risky one. We have a little bit of protection here. Does every engineer need access to all our repositories? If we have a monorepo, we already made our tradeoffs. Otherwise, let’s restrict access to only those repositories that are relevant for this particular engineer at a given time. In order for those potentially malicious commits to end up in our production, we actually have to go ahead and create a pull request, which typically involves going to some UI, GitHub, GitLab, whatever you’re using, clicking some buttons. I can only do that if I provide my credentials. We not only need to have the device, we also need credentials.
That’s not all. Assuming all the tests pass, typically, under a normal setup, in order for that pull request to get merged, we have another person do a code review and approve it, which means another engineer will have to go to that UI and click some buttons. They need to provide their credentials. That is a very powerful way to minimize security risks: have another human scrutinizing what’s going on. We do it all the time.
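As a rough illustration of what enforcing this level of protection can look like, here is a sketch that turns on branch protection requiring a pull request review and a passing build, using GitHub’s REST API. The organization, repository, token, and check name are placeholders, and other Git servers have equivalent settings.

```python
# Hypothetical sketch: require "another human reviews every change" plus a passing
# build on the main branch. Owner, repo, token, and check name are placeholders.
import requests

OWNER, REPO, BRANCH = "example-org", "example-service", "main"
TOKEN = "ghp_example"  # a token with admin rights on the repo, stored securely

protection = {
    "required_status_checks": {"strict": True, "contexts": ["build"]},
    "enforce_admins": True,  # no exceptions, not even for admins
    "required_pull_request_reviews": {"required_approving_review_count": 1},
    "restrictions": None,  # no push allow-list; reviews and checks do the work
}

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    json=protection,
    timeout=30,
)
resp.raise_for_status()
```

The exact fields depend on your Git server; the point is that the rule lives in configuration, not in everyone’s good intentions.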
For all of that to work, every single commit in existence has to go through our pipeline. No exceptions. No manual tinkering, no logging on to a server to copy files. Which means our builds have to be reliable. No random errors that force our engineers to go in and manually push the build through. Here’s the kicker: investing in hardening our CI/CD pipelines is actually an investment in security. If you need any more excuses, just do it for security. We really want to keep our pipelines simple so that there are fewer things to break. Anything that can run as an async process goes out of the pipeline: code coverage analysis, vulnerability scans.
Now that we have looked at the impact of the engineering machine getting compromised and applied multiple levels of protection to our code, we should be able to loosen our grip on managing our device. That means no need for a complex password to start. If a device supports biometric login, we’re going to use that. Otherwise, a simple password is fully sufficient. We are not going to install MDM software or manage the device in any way. We just don’t trust any device. We’re also not going to restrict the operating system in any way. If someone wants to run Arch and feel smug about it, let them. If you’re really paranoid, you can always force full drive encryption. Most of the time, that’s overkill. This is, in fact, fully sufficient from a security standpoint for the vast majority of businesses out there.
We’ve liberated our device. What happened next was we tried to access our incident management portal and we couldn’t. We had to be on our internal corporate network, which in this case required a VPN. You can’t just secure a corner of the internet and think it’s good. That’s the citadel mindset all over again. We’re not going to do that. We’re not going to do that because it encourages bad behaviors, like trusting someone without credentials just because their IP address matches, or engineers storing their credentials in plain text to connect automatically. It really sets our engineers up for failure. Let’s turn that into a pit of success.
Instead of relying on an IP address or every tool’s internal password database, we’re going to do something very obvious. We’re going to rely on our corporate identity provider to authenticate through single sign-on. I know this is obvious. If you’re writing internal tools, no matter how small or insignificant, just add that SSO capability. It’s super easy to do nowadays. It doesn’t require much work. If you’re managing third-party tools, same story, enable SSO. It is so worth it. Single sign-on is a good developer experience, because I only need to enter my password once when I start my workday. The only other time I do that is when I do something risky that requires another level of protection. If our engineers are used to typing in their passwords all the time, they’re more susceptible to phishing.
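To illustrate how little work this can be for an internal tool, here is a minimal sketch that validates a token issued by an OIDC-compliant corporate identity provider using the PyJWT library. The issuer URL, JWKS path, and audience are made-up placeholders; your identity provider’s documentation will give you the real values.

```python
# Minimal sketch, assuming an OIDC-compliant corporate identity provider.
# Issuer, JWKS URL, and audience below are placeholders for illustration.
import jwt
from jwt import PyJWKClient

ISSUER = "https://login.example-corp.com"   # hypothetical corporate IdP
AUDIENCE = "internal-incident-portal"       # hypothetical app/client identifier
jwks_client = PyJWKClient(f"{ISSUER}/.well-known/jwks.json")

def authenticate(bearer_token: str) -> dict:
    """Verify the SSO token and return the user's claims; raises if invalid."""
    signing_key = jwks_client.get_signing_key_from_jwt(bearer_token)
    return jwt.decode(
        bearer_token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=ISSUER,
    )
```

A small helper like this in front of every internal tool means there is no per-tool password database to leak and nothing for engineers to type in over and over.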
Speaking of passwords, they are annoying. Not only do we have to memorize them, it takes time to type them in. If we add more rules on top of that, like uppercase, lowercase, numbers, special characters, and we also force people to change their passwords on the regular, what’s going to happen is that the password ends up on a sticky note right next to my keyboard, or it gets reused. Probably both. That doesn’t look like a pit of success to me. If your company requires any arbitrary password complexity rules for anything, or requires people to change their passwords on a timer, I hope no one does it, but if you do, find whoever is responsible for that process and tell them to stop. Cancel it. Don’t do that. Don’t mess with people’s password selection strategy, and don’t force anyone to ever change their password unless you suspect they’ve been compromised. Please, whatever you do, don’t be that company.
My theory is that this got rejected because it is actually a valid Perl program. I could be wrong. I hope what I’m saying is really nothing new to you, because it’s been in the NIST guidelines since at least 2017. The guidelines are currently under revision. The new version goes even further: the SHOULD NOT gets changed to SHALL NOT. Here’s a link. Forward it to whoever needs to see it. When setting up passwords, let’s create a pit of success for ourselves. Does everyone remember the Correct Horse Battery Staple password strategy? Here’s a reminder. The idea is to make the password easy to memorize but difficult to guess.
The single most relevant factor determining password strength is length. We’re going to make it super long, 30 characters, 60 characters, maybe more. We are going to be kind to ourselves, and we’re going to pick words or phrases that make sense to us. Here are some examples of very good, very strong passwords. As you can see, all of them are over 30 characters long. They’re words and phrases that make sense, at least to me. Good luck guessing them. One could argue that you should not use just a string of plain English words like the first one, but it is still so much better than writing the password down or reusing it. Better still, misspell words. Invent words. Mix languages. Just have fun.
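If you would rather not pick the words yourself, a few lines of code can do it. Here is a sketch using Python’s standard secrets module; the word list path is only an example, and any large word list will work.

```python
# Sketch of a Correct-Horse-Battery-Staple style passphrase generator.
# The word list path is an example; substitute any word list you like.
import secrets

with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip().isalpha()]

def passphrase(num_words: int = 6, separator: str = " ") -> str:
    """Pick random words using a cryptographically secure RNG."""
    return separator.join(secrets.choice(words) for _ in range(num_words))

print(passphrase())  # six common words easily exceed 30 characters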
Should Engineers Have Access to Production?
We’ve created a pit of success around accessing our internal corporate resources, which incidentally also freed us to browse the web however we want. Next, we tried to reproduce our problem, investigate our issue, and we stumbled because we had no access. Now I am curious, what do you think? Should engineers have access to production? Who thinks they should have no access to production? This is a common misconception among security people: no one should have access to production. No one.
In fact, there’s really nothing wrong with testing in production. I, in fact, think everyone should do it, as long as we have exactly the same kind of access as our customers. Think about it. Our customers are isolated from each other. One customer cannot mess with another customer’s account. We have bulkheads in place, we’re good. What engineers really should have no access to is customer data, which includes any copies, and production infrastructure. At the same time, we need to be able to test our stuff, which means messing with data or infrastructure directly sometimes. This implies different levels of access depending on the environment and the situation. Access control is hard. From a security standpoint, we want to give everyone the least amount of access possible, least privilege. From a productivity standpoint, we don’t ever want to block anyone from doing what they need to. Fortunately, all cloud providers give us a very granular way to define our permissions.
Unfortunately, that almost always ends up in a tangled mess of rules and policies, and no one really knows what’s going on. This is a real-life example. You always end up with either too much access or not enough. If you’ve ever played with IAM in your cloud account for a big environment, you know what I’m talking about. This is a problem, unless we apply bulkheads. Every cloud provider lets us attach multiple accounts to a single billing account. Let’s use that. Let’s have multiple cloud accounts. How many? That depends on your architecture and team structure. It’s probably more than you think.
At the very least, we want to have two accounts per team: one exclusively for production, with no manual access whatsoever, and one for testing. Really, if your software is modular, whatever can get deployed separately, that’s a single production cloud account. It’s not enough to have one big account for everything. Split it. This is also what all cloud providers recommend. That way, because our accounts are smaller, we can keep our access controls simple. Not only do we see what’s going on, we can give everyone just the right level of access, not more. If one account gets compromised because our engineer got phished, we can just terminate that account, spin up a new one, and hopefully we have backups. Speaking of permissions and access, when defining your permissions and roles, carefully consider who can grant access. It should be different from who has access.
All of this security I’m talking about means nothing if everyone can just go in and grant themselves full admin rights. By applying bulkheads, we can now test in production safely and mess with our test infrastructure if we need to.
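As one concrete way to get those account-level bulkheads, here is a hedged sketch of creating separate per-team member accounts under an existing AWS Organization with boto3. The team name, account names, and email addresses are placeholders; GCP projects and folders or Azure subscriptions play the same role on other clouds.

```python
# Sketch: one production and one test account per team, created under an
# existing AWS Organization. Names and email addresses are placeholders.
import boto3

org = boto3.client("organizations")

for env in ("prod", "test"):
    response = org.create_account(
        Email=f"payments-team-{env}@example.com",   # hypothetical root email
        AccountName=f"payments-{env}",
        IamUserAccessToBilling="DENY",
    )
    print(response["CreateAccountStatus"]["State"])  # creation is asynchronous
```

Because each account is small, the IAM policies inside it stay small too, and terminating a compromised account does not take the rest of the business down with it.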
We also previously talked about how our CI/CD pipeline protects our code on multiple levels, and for that, it has to be robust, and we should basically keep it simple. There’s one more trick I would like to share, because not many people know about it, and it is a great illustration of how simple means more secure. We said all the code has to go through the pipeline, no manual access to production infrastructure, but our production software sometimes needs to authenticate with third parties like cloud providers or payment providers and whatnot. How do we get the credentials into our production, automatically, without exposing those credentials along the way? One still very common way is to use environment variables that get stored in our Git server secrets and injected into our pipeline and passed along our application control plane.
That is a lot of places where we’re potentially exposing our credentials to parts of our infrastructure that don’t even need to know about them. A much better way is to use some secrets vault, usually a third party, that stores the secrets securely and that gets accessed by our production. That is much better, already very good. There are fewer moving pieces, fewer places to break, but we’re still exposing our credentials to that third party, and they get stored in their internal database, which can get compromised. This may be good enough, very little risk there, but we can simplify it even further and make it even more secure. What we’re going to do is, instead of a secrets vault, we’re going to use a key management service, which is something every cloud provider offers. It’s a very simple piece of technology, no database involved whatsoever. We’re going to use our KMS to encrypt our credentials, and we are going to store our credentials in Git, encrypted.
At runtime, our application will basically call the KMS to decrypt the credentials and get access. That way, if our source code leaks, because we’re working on an unprotected machine, the credentials are encrypted, no big deal. If our pipeline gets compromised, there are no credentials anywhere. This is really simple and also secure.
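A minimal sketch of this pattern with AWS KMS and boto3 follows; the key alias and the credential value are placeholders, and the same idea works with any cloud’s key management service.

```python
# Sketch: encrypt a third-party credential once, commit the ciphertext to Git,
# and decrypt it at runtime. Key alias and secret value are placeholders.
import base64
import boto3

kms = boto3.client("kms")
KEY_ID = "alias/app-credentials"  # hypothetical KMS key alias

def encrypt_for_git(plaintext: str) -> str:
    """Run once in a setup step; commit the returned string to the repository."""
    blob = kms.encrypt(KeyId=KEY_ID, Plaintext=plaintext.encode())["CiphertextBlob"]
    return base64.b64encode(blob).decode()

def decrypt_at_runtime(ciphertext_b64: str) -> str:
    """Called by the application in production; only its role may use the key."""
    blob = base64.b64decode(ciphertext_b64)
    return kms.decrypt(CiphertextBlob=blob)["Plaintext"].decode()
```

The only access control that matters is who may call decrypt with that key, which is exactly the kind of small, per-account permission that stays easy to reason about.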
Accounting for Emergencies
Throughout this process, we have removed some obstacles, but we actually put other ones in place. We said all the code has to go through our pipeline, no manual tinkering with infrastructure. We also added requirements for merging. What if you are that poor engineer who is on call Saturday night, and the world is on fire? Sometimes, and it should be rare, we have to do something out of band. We have to deploy an emergency patch. We have to manually change the database configuration. I say we have to because something horrible is happening. Maybe we are hard down. Maybe we’re bleeding tons of money every minute. We need to have some way of bypassing all our security mechanisms in an emergency. We have to be careful here, because we don’t want this process to be ripe for abuse.
If it’s easier to do development this way, everyone will do it. Here’s where we have to actually think about our obstacles. We want that emergency path to be easy to follow, because it’s an emergency and the engineer is stressed already, but it should be really cumbersome to trigger. Maybe it’s an email to a group. Maybe we write a short essay on what’s happening and why. Maybe there are some extra annoying buttons to click in some UI to get a temporary elevation of our permissions. Maybe we always have at least two people on call. Regardless of how we set it up, we want to have a full audit trail: what happened, and when? We want to look at that and review it later. Without this formal, pre-thought-out emergency process, engineers will get creative under stress and we will end up with backdoors all over the place.
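One possible shape for such a break-glass path, sketched with AWS STS: the engineer has to type out a real reason, the elevation is time-boxed, and the reason lands in the session name and session tags so it shows up in the audit log for later review. The role ARN, minimum reason length, and duration are all assumptions, not a prescription.

```python
# Sketch of a break-glass elevation: cumbersome to trigger, time-boxed,
# and fully auditable. The role ARN and durations are assumptions.
import re
import time
import boto3

BREAK_GLASS_ROLE = "arn:aws:iam::123456789012:role/break-glass"  # hypothetical

def emergency_access(reason: str, incident_id: str):
    if len(reason.strip()) < 50:
        raise ValueError("Write a real explanation of what is on fire and why.")
    session_name = re.sub(
        r"[^a-zA-Z0-9+=,.@_-]", "_", f"{incident_id}-{int(time.time())}"
    )[:64]
    creds = boto3.client("sts").assume_role(
        RoleArn=BREAK_GLASS_ROLE,
        RoleSessionName=session_name,      # shows up in CloudTrail for review
        DurationSeconds=3600,              # access expires on its own
        Tags=[{"Key": "reason", "Value": reason[:256]}],  # logged session tag
    )["Credentials"]
    # ...notify the on-call group or open a review ticket here...
    return creds
```

The friction lives in the ceremony (the written reason, the notification), not in the mechanics, so the stressed engineer can still move fast once they have committed to the paper trail.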
Security Mandates
I talked a lot about replacing or changing your security practices. Some of those changes may be fully within your sphere of control. Other practices may actually be mandated to you by someone way above your pay grade with security in their job title. That doesn’t mean we’re stuck with those mandates. When companies hire a chief information security officer and invest in a security organization, it is actually almost never about security. These people are there to ensure compliance, which is a very important role, but it has nothing to do with security.
In fact, these two are sometimes at odds with each other. This is one of my favorite recent examples. This happens all the time. The person asked to remain anonymous, but it is a company that I bet you have all heard about. The job of a CISO is to convince third-party stakeholders that your company does a good enough job minimizing security risks. Those stakeholders can come from different places. There are three main sources of compliance requirements: certifications, regulations, and insurance policies. The stakeholders representing these, so auditors, government officials, insurance salespeople, have no idea about your business. They don’t know how software is made. They may not even have any technical knowledge. Your CISO needs to convince them that you’re secure. That’s a really tough job. It may be tempting to optimize for the ease of audit. It may be tempting to just copy what someone else did that got them to pass the audit, even if it makes no sense in your situation. That’s not the only way.
Contrary to popular opinion, none of those compliance requirement sources are prescriptive in nature. It really is up to us to define the scope, the means, and the implementation. Another misconception is that we need to provide some proof that we’re doing the right thing. It is never about proof. You can’t prove anything when humans and computers are involved. What it really is about is crafting a convincing narrative. This is what your CISO is after: a solid, well-structured narrative that will convince the third parties.
If you want to help your CISO, document how you think about security within your area. Do it for yourselves, for your new hires. Only you know your system best. Describe your bulkheads. Talk about how you classify different parts of your system and what levels of protection you have. Explain what you’re doing to minimize the impact of incidents. Keep this narrative simple, in plain English, so that any non-technical person can understand it. With that, you really should be able to push back on a lot of those pesky, annoying security practices. With that, your security can also be BLISS.