Job Description
Job details
Shift and schedule
- Weekends as needed
- Evenings as needed
Full job description
What working at Hexaware offers:
Hexaware is a dynamic and innovative IT organization committed to delivering cutting-edge solutions to our clients worldwide. We pride ourselves on fostering a collaborative and inclusive work environment where every team member is valued and empowered to succeed.
With an ever-expanding portfolio of capabilities, we dig deep to identify the source of our motivation. Although technology is at the core of our solutions, it is still people and their passion that fuel Hexaware’s commitment to creating smiles.
We are always interested in, and want to support, both the professional and the personal you. We offer a wide array of programs to help you expand your skills and supercharge your career. We help you discover your passion, the driving force that makes you smile, innovate, create, and make a difference every day.
Job Title: Site Reliability Engineer (SRE)
Location: McLean, VA (5 days onsite per week)
Job Summary
We are seeking a Site Reliability Engineer (SRE) who combines software engineering expertise with IT operations to ensure the reliability, availability, scalability, and performance of critical systems and services.
Key Responsibilities
- System Reliability: Design, implement, and maintain automated solutions to ensure high availability, resiliency, and scalability of applications and services.
- Incident Management: Respond to production incidents, develop protocols to minimize downtime, conduct post-mortems, and implement preventive measures.
- Monitoring & Observability: Set up and manage monitoring systems to track performance metrics, ensuring system health and addressing potential issues proactively.
- Performance Optimization: Analyze system performance, identify bottlenecks, and optimize for speed, scalability, and resource utilization.
- Automation: Leverage automation tools to reduce manual interventions and ensure efficiency, repeatability, and minimal human error.
- Collaboration: Work closely with stakeholders to support new features, deployments, and compliance initiatives.
- Capacity Planning: Forecast resource needs and plan for future growth to maintain system stability and scalability.
- Documentation: Create and maintain up-to-date documentation for systems, processes, and troubleshooting procedures.
- Continuous Improvement: Stay current with emerging technologies and practices to design and deliver best-in-class solutions.
Required Qualifications
- 8+ years of total experience.
- Strong sense of accountability and ownership to identify and drive improvements.
- Excellent communication skills to convey complex information clearly and persuasively.
- Ability to work independently and collaboratively in a fast-paced environment, including evenings/weekends as needed.
- Technical expertise in:
  - End-to-end observability solutions (Elastic Observability, Elastic APM, distributed tracing, OpenTelemetry).
  - Linux/Unix system administration and cloud infrastructure (AWS, Azure, Google Cloud).
  - Programming/scripting languages and frameworks (Java, Python, Go, Bash, Spring Boot, PySpark).
  - Data management and data warehousing (MongoDB, Snowflake, SQL).
  - CI/CD, configuration management, and infrastructure-as-code tools (Jenkins, Ansible, Terraform).
  - Containerization and orchestration (Docker, Kubernetes, EKS).
  - Networking, databases, and distributed systems.
- Experience with incident response and post-mortem processes.
- Bachelor’s degree in Computer Science, Information Technology, or equivalent experience.
Equal Opportunities Employer: