Job Description
United States
About the Role
Responsibilities:
You will design, develop, and implement end-to-end infrastructure solutions for our multi-tenant, microservices-based SaaS applications. You will own system reliability, scalability, performance, and security.
Implement and continuously improve CI/CD pipelines.
Set up monitoring and alerting across various layers (App, Network, and OS levels) of the service.
Ensure accessibility, security, reliability, availability, and performance of our infrastructure.
Troubleshoot and resolve production issues and alerts, and participate in 24/7 on-call production support rotations.
Skill Details
Technical Skills
10+ years of experience in DevOps or SRE (Site Reliability Engineering) roles, with ownership of large-scale enterprise SaaS services in production environments.
Significant experience with AWS public cloud technologies and the implementation of large-scale container clusters: EKS, IAM, Infrastructure as Code (Terraform), and containers (Docker and Kubernetes).
Strong programming/scripting skills in one or more languages (Python, Go, Ruby, Bash, etc.), with strong Linux OS and networking fundamentals.
Experience building monitoring systems to ensure high availability, performance, and security integrity (e.g., ELK stack, Pingdom, Opsgenie/PagerDuty, Kiali, Weave Scope, CloudWatch, CloudTrail).
Hands-on experience operating microservices architecture-based SaaS products, REST web services, SSO (Okta, Auth0), EC2-RDS, MySQL, and Elasticsearch.
Understanding of backup strategies and disaster recovery for RDS and Elasticsearch.
AWS Solutions Architect certification strongly preferred.
Experience with capacity sizing to meet the requirements and SLAs of the target state and, where applicable, of the transition to it.
Self-motivated and excited about the ambiguity, opportunity, and self-direction required at an early-stage startup.