Job Summary
We are looking for SRE professionals who bring software engineering principles to
infrastructure and operations problems, with the north star goal of creating highly
scalable and reliable systems.
SRE responsibilities include establishing service level thresholds, often manifested as
service-level objectives (SLOs), which help inform whether or not a release gets
greenlighted. An SRE function will typically be measured across a set of key reliability
areas, namely: system performance, availability, latency, efficiency, monitoring,
capacity planning and emergency response.

Key Responsibilities
● Monitor, measure and improve the reliability, availability and scalability of IT
Infrastructure, applications and services
● Participate in 24x7 rotational shifts to handle production operations issues
● Engage in incident response and participate in post-mortem analysis to
investigate root causes and capture contributing factors for remediation
● Perform analytics on previous incidents and trend/usage patterns to better
predict issues and take proactive actions
● Identify manual routine operational practices and build robust automation
capabilities using code and modern tools
● Collaborate with Product Developers and business stakeholders to gather
requirements for enabling and improving performance monitoring for
applications and services
● Design and build custom tools as needed to support process optimization, and
continually challenge the status quo to improve operational efficiency
● Engage in service capacity planning and demand forecasting, software
performance analysis and system tuning
● Create meaningful dashboards/reports for application telemetry and
infrastructure health to proactively identify performance constraints and
bottlenecks

Technical Requirements

● Strong understanding of cloud-based architecture and cloud operations.
Hands-on experience with Azure
● Experience in administration/build/management of Linux systems
● Foundational understanding of Infrastructure and Platform Technology stacks
● Strong understanding of networking concepts and theories, such as different
protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers,
and load balancing
● Working knowledge of Infrastructure and Application monitoring platforms
● Understanding of core DevOps practices (CI/CD pipelines, release
management, etc.)
● Ability to write code in at least one modern programming language (Python,
JavaScript, Ruby, etc.). Additional scripting skills are preferred
● Understanding of and experience with configuration management platforms
(Chef/Puppet/Ansible)
● Prior experience in Cloud management automation tools
(Terraform/CloudFormation etc.) is preferred
● Experience with source code management software and API automation is
preferred
● Strong understanding of the architecture and operations of container orchestration
tools, e.g. Kubernetes
● Ability to understand how an application works and how it is architected
● Understanding of databases and SQL

Professional Attributes
● Service-availability-oriented mindset with a proactive approach to problem
solving. An ideal candidate should be able to develop automated solutions to
prevent recurring problems
● Possesses the ability and willingness to challenge the status quo and optimize
current procedures and processes
● Strong sense of ownership and an ability to drive cross-functional process
improvement
● Possesses excellent interpersonal, written and verbal communication skills
● Analytical and logical approach to problem-solving and a willingness to automate
repetitive tasks and reduce manual/reactive workload

Education: Technical graduates (BCA, B.Sc, B.Tech, MCA, M.Sc, M.Tech) with strong
data structure and algorithm skills

Experience: 1-2 years

Site Reliability Engineer - Azure
