The other day I stumbled upon a job opening from EUROPEAN DYNAMICS. They are looking for a System Administrator / Engineer in Greece.
In short, the company is a large European IT integrator, a dinosaur with 1100+ employees, deeply entrenched in government contracts across the European Union. Think projects worth hundreds of millions of euros, bureaucracy, and high demands for security and stability. Everything is serious.
Let’s imagine a dialogue somewhere in their Athens office. Let’s call the characters Dimitris (Department Head) and Elena (HR).
Dimitris: Elena, we have a problem again. Yiannis is leaving. We urgently need another person to monitor the servers. To ensure everything runs like clockwork, backups are made, access is controlled. And most importantly – to be on call at night and on weekends if something goes down. We cannot afford downtime; clients from the European Commission will not understand.
Elena: Understood. 5+ years of experience, Linux, virtualization… anything else?
Dimitris: Yes, someone who understands monitoring and security. Someone who can quickly bring a downed service back up by hand. We need a reliable person who will put out fires.
Here it is, the classic pain point of a large and not very agile company. They are not looking for an engineer for development, but a “firefighter” for existing infrastructure. A person whose main task is to sit and wait for something to break, and then heroically fix it. And for this heroism, they are willing to pay a salary, health insurance, and even language courses. Nice, but inefficient.
What if we look at this task not as a position for a person, but as a process that can be automated with AI?
The modern approach to this problem is called AIOps — Artificial Intelligence for IT Operations. It’s not one magic product, but a concept where routine tasks of monitoring, diagnosis, and troubleshooting are handed over to smart algorithms. Instead of hiring a person for round-the-clock vigilance over dashboards, the company could build a system that does it itself.
How would this look in practice?
Step 1: Data Centralization. Instead of a person sifting through logs from different systems, all metrics, logs, and traces are collected into a unified data lake. Tools like Splunk, Datadog, or ELK Stack exist for this. This is the foundation.
Step 2: Implementing AI Analytics. We unleash machine learning models on the collected data. Datadog or Dynatrace can do this out of the box. The system learns the “normal” behavior of the infrastructure. It understands what server load is typical for Monday morning, and what is typical for Saturday night. Any deviation from the norm is an anomaly. Ideally, the AI notices it minutes before the problem becomes critical and users find out about it.
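To make the idea of “normal for Monday morning versus Saturday night” concrete, here is a minimal sketch of a seasonal baseline. It is not how Datadog or Dynatrace work internally. It is a plain z-score per weekday-and-hour slot, and the function names, the threshold of 3, and the synthetic CPU data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and return
    {(weekday, hour): (mean, stdev)} for slots with at least two samples."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        slot: (mean(values), stdev(values))
        for slot, values in buckets.items()
        if len(values) >= 2
    }


def is_anomaly(baseline, ts, value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations away from
    the typical value for the same weekday and hour."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot, so no judgement
    mu, sigma = baseline[slot]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Synthetic history (made up): four weeks of hourly CPU load,
# busy during weekday office hours, quiet otherwise.
start = datetime(2024, 1, 1)  # a Monday
history = []
for h in range(24 * 7 * 4):
    ts = start + timedelta(hours=h)
    busy = ts.weekday() < 5 and 8 <= ts.hour < 18
    base = 70.0 if busy else 15.0
    jitter = (h % 5) - 2  # small deterministic noise
    history.append((ts, base + jitter))

baseline = build_baseline(history)
monday_9am = datetime(2024, 1, 29, 9)
saturday_11pm = datetime(2024, 2, 3, 23)

print(is_anomaly(baseline, monday_9am, 70.0))     # same load, normal slot
print(is_anomaly(baseline, saturday_11pm, 70.0))  # same load, unusual slot
```

The same 70% load is ordinary on Monday morning and suspicious on Saturday night, which is exactly the distinction a fixed threshold alert cannot make.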
Step 3: Automated Response. This is the most interesting part. Instead of a late-night call to a tired sysadmin, the system itself can take action.
A simple example: AI notices an anomalous increase in memory consumption on one of the web servers.
– Before: Monitoring sends an alert -> on-call admin wakes up -> connects -> analyzes logs -> restarts the service. Time lost: 15-30 minutes.
– With AIOps: the AI detects the anomaly -> checks it against the knowledge base and sees that in 95% of past cases a service restart resolved it -> automatically launches an Ansible playbook that safely restarts the affected service -> problem solved in 30 seconds. The engineer simply receives a report in the morning about the work done.
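The AIOps path above can be sketched in a few lines. This is an illustration under assumptions, not a production design: the knowledge base is a hard-coded dictionary, the playbook names (restart_service.yml, clean_tmp.yml), the host web-03, and the service_name variable are all hypothetical, and the sketch defaults to a dry run so that nothing is executed. Only the ansible-playbook command with its --limit and --extra-vars options is real.

```python
import subprocess

# Hypothetical knowledge base: anomaly type -> (playbook, historical success rate).
RUNBOOKS = {
    "memory_growth": ("restart_service.yml", 0.95),
    "disk_tmp_full": ("clean_tmp.yml", 0.90),
}

ESCALATE = "escalate to on-call engineer"


def build_command(anomaly_type, host, service, min_success=0.9):
    """Return an ansible-playbook command for a known anomaly, or None
    when there is no runbook or its track record is too weak."""
    entry = RUNBOOKS.get(anomaly_type)
    if entry is None:
        return None
    playbook, success_rate = entry
    if success_rate < min_success:
        return None
    return [
        "ansible-playbook", playbook,
        "--limit", host,
        "--extra-vars", f"service_name={service}",
    ]


def remediate(anomaly_type, host, service, dry_run=True):
    """Run the matching playbook, or only report the command when dry_run is set."""
    cmd = build_command(anomaly_type, host, service)
    if cmd is None:
        return ESCALATE
    if dry_run:
        return "would run: " + " ".join(cmd)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return "resolved" if result.returncode == 0 else ESCALATE


print(remediate("memory_growth", "web-03", "nginx"))
print(remediate("unknown_spike", "web-03", "nginx"))
```

Anything the knowledge base does not recognize still goes to a human, so the automation only takes over the cases it has seen many times before.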
How to overcome distrust? Managers like Dimitris will say: “I won’t trust a machine to restart production! What if it makes a mistake?” And this is a normal fear. Implementation must be gradual.
– Advisor Mode. Initially, AI does nothing on its own. It only analyzes and sends alerts with recommendations: “Anomaly X detected, recommend performing action Y. 95% probability of success.” The engineer reviews, agrees, and clicks the “Execute” button.
– Gradual Automation. Start with the safest and most frequent operations. Restarting a non-critical service, clearing temporary files. As the team sees that it works, trust grows.
– Transparency. Every AI action must be logged and fully understandable to engineers.
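The three principles above fit into one small policy gate. The sketch below is one possible shape, with assumed names throughout: the AUTO_APPROVED allowlist, the action identifiers, the 0.9 confidence floor, and the log fields are illustrative and not taken from any specific product.

```python
import json
from datetime import datetime, timezone

# Hypothetical allowlist: actions considered safe to run unattended.
# Gradual automation means growing this set as trust grows.
AUTO_APPROVED = {"restart_noncritical_service", "clear_temp_files"}


def decide(action, confidence, mode="advisor", min_confidence=0.9):
    """Return 'execute' only in auto mode, for allowlisted actions,
    with enough confidence. Everything else is a recommendation."""
    if mode == "auto" and action in AUTO_APPROVED and confidence >= min_confidence:
        return "execute"
    return "recommend"


def audit_record(anomaly, action, confidence, decision, now=None):
    """Serialize one decision as a JSON line an engineer can read later."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "time": now.isoformat(),
        "anomaly": anomaly,
        "action": action,
        "confidence": confidence,
        "decision": decision,
    })


# Advisor mode: the engineer still clicks "Execute".
d1 = decide("restart_noncritical_service", 0.95, mode="advisor")
# Auto mode: allowlisted and confident, so the system acts on its own.
d2 = decide("restart_noncritical_service", 0.95, mode="auto")
# Auto mode, but the action is not on the allowlist.
d3 = decide("restart_database", 0.99, mode="auto")

for action, conf, d in [
    ("restart_noncritical_service", 0.95, d1),
    ("restart_noncritical_service", 0.95, d2),
    ("restart_database", 0.99, d3),
]:
    print(audit_record("memory_growth", action, conf, d))
```

Switching from advisor to auto is then a configuration change per action, which makes it easy to roll back if the team loses confidence.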
And how to check if AI works better than a human? Very simply, by metrics that any manager loves.
1. MTTR (Mean Time to Resolution): the average time to resolve an incident. Compare how long a human took to fix a typical problem with how long the automated system takes. For the routine restart described earlier, 15-30 minutes versus 30 seconds is a difference of well over tenfold.
2. Number of Incidents. AIOps allows not only quick fixes but also problem prediction. The system will say in advance: “Attention, in 3 hours, disk Z will run out of space with 80% probability.” A person will have time to react before an outage occurs.
3. Cost of Ownership. Calculate the salary, taxes, bonuses, insurance, onboarding, and training costs of a sysadmin over 3 years. Compare this with the cost of software licenses plus the work of an engineer who will set up and maintain the system. The numbers might surprise you.
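The first two metrics are easy to compute once incidents and disk usage are recorded. The sketch below uses made-up numbers and a deliberately simple straight-line forecast. A real AIOps product would use a richer model and could attach a probability to the prediction, which this sketch does not attempt. The function names are assumptions.

```python
from statistics import mean


def mttr_minutes(durations):
    """Mean time to resolution for a list of incident durations in minutes."""
    return mean(durations)


def hours_until_full(usage_percent, interval_hours=1.0):
    """Fit a least-squares line to evenly spaced disk-usage samples and
    extrapolate to 100%. Returns None if there are fewer than two
    samples or usage is flat or shrinking."""
    n = len(usage_percent)
    if n < 2:
        return None
    xs = [i * interval_hours for i in range(n)]
    x_bar, y_bar = mean(xs), mean(usage_percent)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, usage_percent))
    slope = sxy / sxx  # percentage points per hour
    if slope <= 0:
        return None
    return (100.0 - usage_percent[-1]) / slope


# Illustrative incident durations in minutes (not real measurements).
manual = [15, 20, 30, 25]
automated = [0.5, 0.5, 1.0, 1.0]
print(mttr_minutes(manual), mttr_minutes(automated))

# Illustrative hourly disk usage in percent, growing 3 points per hour.
disk_z = [82, 85, 88, 91, 94]
print(hours_until_full(disk_z))
```

With these sample values the forecast says the disk has about two hours left, which is the kind of early warning that turns an outage into a routine cleanup.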
Instead of looking for another “hero” who will fight routine and burn out on night shifts, EUROPEAN DYNAMICS could invest in building a tireless, self-learning, and super-fast system. And direct people, those experienced engineers, to more creative and complex tasks: to infrastructure development, not its endless repair. For this is where true engineering art lies.
Source: https://www.linkedin.com/jobs/view/4410037116/