Job Description
United States
As a Senior DevOps Site Reliability Engineer, you:
- Have an automation-first mindset
- Are passionate about performance, stability, and security
- Believe in a proactive approach of prevention over mitigation, and mitigation over fixing
- Are comfortable with change
- Have had a positive experience working for a startup before
- Are a U.S. citizen with an active clearance, or are willing and able to undergo the clearance process, including a polygraph
Required Skills – advanced knowledge of:
- AWS: 3+ years of hands-on experience (AWS Architect / DevOps / SysOps certification preferred)
- Infrastructure as Code (Terraform)
- Ansible automation
- Kubernetes: 2+ years of in-depth experience deploying production applications and orchestrating containers
- Kubernetes scheduling, networking, security, and load balancing
- CI/CD (GitLab, Jenkins or Bamboo)
- Python, Perl, or Golang
- Best practices for IT operations in an always-up, always-available, mission-critical service
Desired Experience:
- Implementing observability and monitoring in AWS, using Splunk / ELK / similar
- EKS, ECS, ECR
- Working in an agile environment, focused on rapid cycles and CD
- Supporting, analyzing, and troubleshooting large-scale distributed mission-critical systems
- Building software and/or platforms where security, regulatory compliance and high availability are critical
- Strong understanding of Information Security in various environments
Responsibilities:
- Implement and support FedRAMP and other applicable USG standards, policies, and regulations
- Set up, integrate, and maintain a scalable, stable set of CI/CD tools to support development, testing, and security scanning
- Be accountable for a large-scale SaaS application with a mission-critical customer base
- Manage multiple tools, infrastructure, and roles in a fast-paced environment
- Own the availability of our SaaS infrastructure and application
- Implement best-in-class AWS solutions using infrastructure as code
- Collaborate with engineering and product to continuously improve service availability and quality
- Be involved in the entire production lifecycle: code deployments, infrastructure management, and troubleshooting
- Share ownership with the Dev team, own service availability and proactive issue prevention, and use structured troubleshooting to mitigate issues
- Work closely with our Dev and DevOps teams to ensure that our production services are secure, scalable, performant, and resilient