Keeping the monitoring system up to date
Defining and collecting metrics that enable prompt incident response
Automating tools and building new ones for incident detection, response, investigation, and escalation
Responding to critical incidents
Managing the resolution of critical incidents (working with other teams to resolve incidents on a 24/7 basis)
Experience in a similar or related IT position
Ability to analyze web server logs (nginx, HAProxy, IIS)
Understanding of how web applications work
Experience with Linux
Knowledge of SQL (basic level or higher)
Understanding of the OSI network model
Experience automating routine operations with scripts (Bash, PowerShell, etc.)
Experience with Grafana, Prometheus, InfluxDB, Elasticsearch, Zabbix
Understanding of Continuous Integration/Continuous Delivery principles