What is the role?
NMI is seeking a Senior DevOps Engineer with deep Linux virtualization experience who is passionate about running applications in an exceedingly high-availability environment within our SRE organization. You will work with similarly skilled professionals in a rapidly growing environment, level up your observability and automation skills, maintain a mission-critical, four-nines availability platform, and participate in environment modernization.
The SRE team is responsible for the operation of all hardware and software within the production and SDLC environments. These environments span a global network connecting numerous sites, which must be highly available 24x7 with a minimum target of 99.99% availability. As a Senior DevOps Engineer, the successful applicant will be a core member of the SRE team, with the opportunity to work with experts in the infrastructure, networking, and DevOps space.
The Ideal Candidate:
- Has a track record of implementing low-toil solutions to traditionally high-touch operational or administrative tasks.
- Has a deep technical background and can engage with engineers on the nuances of complex systems, while also being able to zoom out and see the bigger picture.
- Enjoys being challenged to find creative solutions using both legacy and cutting-edge technology. This is code for: we have a legacy system that has to be maintained and improved, while we also look at new technology and tools to improve resiliency, performance, ease of administration, and observability. It’s not all “the fun stuff”.
- Wants to work with a globally distributed team of similarly skilled professionals, and is comfortable building relationships with teammates up to thousands of miles away.
- Is as comfortable in a shell or Vim as an accountant is in QuickBooks.
- Refuses to believe a service or appliance is production-ready until they have the metrics and alerts to prove it.
Key Duties:
- Administration - Participate in maintenance and operations of our production environment, including patching, deployment, server administration, and troubleshooting, either manually or using configuration-as-code tooling.
- Reliability & Performance - Ensure the reliability, availability, and performance of services. Respond to incidents and resolve them before they become customer-impacting.
- Collaboration - Work closely with teammates and with the software and security teams to rapidly meet customer, business, and compliance needs.
- Automation - Drive the automation of operational tasks, and ensure our infrastructure is treated more like cattle than pets.
- Observability - Develop and maintain internal, commercial, and OSS tools to improve system health, performance, and deployment.
- Continuous Improvement - Drive never-ending improvement in SRE processes, tools, and methodologies. Take a leading role in blameless post-mortems to avoid repeat issues or mistakes and clearly document all lessons learned for others. If you love writing actionable documentation, we’d love to set up an interview.
- On-Call - Participate in a rotating 24x7 on-call schedule with your team to ensure availability of services across the production environment.
This is a fully remote role (work anywhere in the US); however, if you live within a reasonable commuting distance, we’d love to see you in the office from time to time! Periodic travel to company colocation facilities (typically 1-4 times a year) will be required, at company expense.
Essential Skills & Experience:
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
- 5+ years of experience in Site Reliability Engineering, DevOps, System Administration, or similar roles.
- Deep experience working in colocation facilities – we have a hybrid footprint, and if you have only worked in the public cloud space, this role is not a great fit for you.
- Experience using Puppet, Ansible, or other common configuration-as-code tooling to deploy and configure systems.
- Strong familiarity with Linux systems (any distro is fine, but we have a preference for RHEL downstreams).
- Experience using Proxmox, VMware, or KVM as virtualization platforms for large-scale production environments.
- Experience administering enterprise-grade SANs and load balancers is necessary to be successful in this role.
- Demonstrated proficiency in one or more scripting or programming languages (e.g., Python, Go, Bash/Zsh).
- Some experience with MySQL (any variant) is required.
- Strong problem-solving skills and a passion for reliability and performance.
Preferred Skills & Experience:
- Experience with F5 BIG-IP LTMs or NetApp SANs is highly desirable.
- Experience using Grafana, Prometheus, and the ELK stack for observability is highly desirable.
- Kubernetes experience is a significant plus, as is a burning desire to learn it.
- Experience working with SaaS-based WAF/DDoS protection services such as Silverline, Cloudflare, or Akamai is preferred.
- Certifications in cloud platforms or relevant technologies are nice to have.
- Prior experience on a team following common agile processes such as Kanban or Scrum would be valuable.
- Experience in the start-up to scale-up space will be very valuable. We are not a calcified, enormous enterprise, and we move quickly.
- GitLab experience is a plus.