Job description
Join Onyx Security and help build the control plane for autonomous AI in a fast-moving, high-impact environment. Onyx is building the control layer that enables enterprises to safely adopt and scale AI agents across the organization, with the visibility, governance, and operational control needed to understand what agents exist, what they do, how they behave, and what risks they introduce.
About the role
We're looking for an all-round infrastructure engineer who writes production-grade code, designs systems from first principles, and owns everything from infrastructure-as-code to a multi-cloud ML serving stack. You'll join a small, elite infrastructure team responsible for the full stack.
What you will do
Lead infrastructure architecture across the company and take end-to-end ownership of Onyx's infrastructure: cloud environments, platform tooling, internal services, and production uptime
Design and build production-grade services and internal platform tooling; you write code at the same bar as the product engineering team
Operate and scale the AI/ML serving infrastructure behind real-time LLM routing, model inference, and AI security enforcement in production
What you will bring
6+ years in infrastructure, platform, DevOps, or SRE roles - with strong software engineering ability, not just operational experience
Production-grade cloud expertise (AWS, GCP, or Azure) - deep understanding of compute, networking, data services, identity, and security across cloud providers
Hands-on experience building and running AI/ML infrastructure in production, including model serving and inference workloads
Kubernetes depth - you operate clusters in production, understand the internals, and debug the problems others escalate
Infrastructure-as-code at scale - you've designed module architectures, managed state across environments, and integrated it all into CI/CD
Experience building an internal developer platform - you've designed the tooling, pipelines, and abstractions that engineering teams depend on
Strong coding ability in Go, Python, or Node.js - you build and own services, not just glue scripts
Security-first thinking embedded in how you design and operate systems
Why Onyx
The problem space is rare - you'll build infrastructure at the intersection of AI and enterprise security, where the technical challenges are deep and real
Full ownership from day one - small team, massive surface area, and your decisions shape the platform
The scale is coming fast - hyper-growth means you'll solve problems most engineers only read about
You'll be surrounded by engineers who hold themselves to an exceptionally high bar