When your identity platform serves 4 million users across 3,000 applications for federal agencies, an outage isn't an inconvenience—it's a mission failure. Yet most organizations treat identity infrastructure with less operational rigor than their databases or networks, discovering their blind spots only when something breaks catastrophically.
This session draws on hard-won lessons from operating FedRAMP High and DoD IL5 identity infrastructure, including recent crises that exposed gaps in our operational maturity. You'll learn how we transformed reactive firefighting into systematic risk management: identifying high-criticality components before they fail, building observability that actually surfaces problems, and creating runbooks that work at 2 AM when your on-call engineer has never seen the failure mode before.
We'll cover the unglamorous but essential work: how a centralized logging platform with 10,000+ shards became an operational liability, why "it's always worked" is the most dangerous phrase in identity operations, and what it takes to build a culture where operational risk is everyone's responsibility. You'll leave with a practical framework for assessing the operational maturity of your own identity infrastructure, along with concrete steps to close the gaps before your next incident forces the issue.