At JFrog, we’re reinventing DevOps to help the world’s greatest companies innovate -- and we want you along for the ride. This is a special place with a unique combination of brilliance, spirit and just all-around great people. Here, if you’re willing to do more, your career can take off. And since software plays a central role in everyone’s lives, you’ll be part of an important mission. Thousands of customers, including the majority of the Fortune 100, trust JFrog to manage, accelerate, and secure their software delivery from code to production -- a concept we call “liquid software.” Wouldn't it be amazing if you could join us in our journey?
In this role, you will be part of our Site Reliability Engineering group, implementing and operating robust, scalable and highly available cloud-native systems and services that ensure the reliability, performance and efficiency of JFrog product infrastructure and services.
You will collaborate with cross-functional Product & Platform Engineering and Customer Success groups to identify and address operational and architectural challenges and debt, driving improvements and innovation in JFrog systems and processes.
Additionally, you will provide enterprise capability tools and environment automation tools to drive our business.
You should be able to take on-call duty outside business hours and on weekends as and when required.
As a Sr. SRE at JFrog you will…
- Support all aspects of JFrog Cloud Platform & Product operations on a day-to-day basis and maintain continuous availability, reliability, durability, scale and uptime.
- Work with Product & Engineering teams to promote best practices for cloud reliability & fault tolerance.
- Define the reliability governance & operational transformation strategy, roadmap and enforcement.
- Define and build innovative solution methodologies and assets around infrastructure, cloud migration, lifecycle and deployment operations at scale.
- Advocate for and drive adoption of chaos engineering, mentoring teams to build tools and strategies for problem prevention, detection and fault mitigation. Plan & coordinate GameDay runs.
- Participate in launch readiness for NPI application & platform services, platform management and capacity planning.
- Create sustainable systems and services through automation and uplift.
- Balance feature development speed and reliability with well-defined error budgets.
- Develop, test, and deploy automated solutions and automated decision analytics to replace manual processes.
- Gather and analyze metrics from systems and applications to assist in performance tuning and fault finding.
To be a Sr. SRE at JFrog you need…
- At least 5 years of overall relevant experience.
- Proven experience as an SRE or in a similar role.
- Strong understanding of system design principles, distributed systems, and cloud infrastructure.
- Technology experience, including IaaS/PaaS/Serverless (Azure, GCP and/or Amazon Web Services) based on K8s, infrastructure automation/orchestration technologies, servers, storage and high-availability architecture.
- Foundational understanding of Application Servers, Web Servers, WAF, networks, storage and databases.
- Experience with infrastructure automation tools (For example: Terraform) and containerization technologies (For example: Docker, Kubernetes)
- Proficient in programming and scripting languages (For example: Python/Go/Bash, etc.)
- Deep knowledge of monitoring and observability tools (For example: Prometheus, Grafana, ELK stack/Coralogix, New Relic, etc.)
- Hands-on experience with chaos engineering tools (For example: Gremlin/Litmus/Chaos Toolkit, etc.)
- Hands-on experience with alerting tools (For example: PagerDuty/Opsgenie, etc.)
- GitOps experience with SCM tools (For example: Git, Bitbucket, etc.)
- Experience with orchestration tools (For example: Jenkins/Spinnaker/ArgoCD, etc.)
- Excellent understanding of Scalability/HA processes and techniques in Cloud & K8s.
- Familiarity with security best practices and experience in designing secure systems.
- Strong intellectual curiosity and drive for continuous improvement, able to take initiative and learn on the fly. Ability to work independently with minimal supervision.
- Strong communication and interpersonal skills to collaborate effectively with diverse teams.
- Excellent problem-solving and troubleshooting skills. Highly team-oriented, practicing collaboration as a key to success.
- Experience working in mission-critical environments and the ability to work well under pressure in a technically challenging environment.
- Ability to mentor peers & new technical hires