2008-10-21

2 hours of panic

Just after most of my colleagues left work for the day, our production ARS server stopped working. It is impossible to access the system in any way. I check the error log and find that the service has crashed five times in a row. Then my connection to the machine freezes completely and I cannot even read the log. Still no access to the AR-System.

Panic!


I call server maintenance (SM) and ask if they have noticed the problem. They say they have and that they have informed the helpdesk. I explain what I experienced and ask if any planned changes are supposed to be implemented this evening, or if there are any other known activities. No, there was nothing. It was getting scarier, and it started to look as if something was terribly wrong with our system. Good thing Girlwing is at a course tonight, so it doesn't matter if I work late...

The server suddenly comes alive again and the SM-guy logs in and tries to check the server information. The server freezes again and nothing can be done. We keep talking about what the problem could be and I am starting to have waking nightmares about evil viruses.

While the system was responsive I searched for all the information I could find about imports and reports, and nothing was supposed to run at this hour. So I was starting to rule out heavy load on the ARS service. I relaxed, and the system became unresponsive again...

A couple of minutes later it is again possible to do simple searches in the ARS, and soon the Windows OS is workable again. The SM-guy continues to check the server information. The CPU is running at 100% and the memory is all eaten by a bunch of cscript processes. Then everything freezes again... It keeps doing this the whole time while we are talking about the incident and searching for the cause.

Then suddenly, after two hours of hard work, he says... "The system hangs every time I relate those incidents to each other".

I am silent for a couple of seconds... then I say... "So if you stop relating incidents to each other, the system will work well and helpdesk will be able to access the incident application?".

He agrees to wait a couple of hours before continuing and I rest my case...

...

What had happened was that some computer had been switched off before its monitoring had been turned off, so hundreds of incidents had been automatically generated and sent to SM, filling their support group basket. It is a lot of work to close each incident manually because you have to fill in a closure code and a cause description. Instead, it is possible to select a lot of incidents, mark them as related to a parent incident and then close the parent incident, and the related ones will get closed automatically. This is a smart feature!

The problem with this is that there is a poorly written filter that executes on every save/modification of an incident and launches a cscript process on the Windows server to build an XML file. Microsoft is not well known for its efficient, small-footprint applications. So when the SM-guy modifies 50 incidents at the same time, there will be 50 cscript processes running on the server, consuming all the available memory including some virtual memory on disk. And I can only guess what will happen when 50 identical cscript processes try to update the same XML file at the same time. You will probably get a lot of waiting... and helpdesk personnel that cannot do their work.
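To show what I would rather see than one process per save, here is a minimal sketch in Python. It is not the actual filter or the cscript script, which I have not shown here; the class name `SerializedXmlExporter`, the file name `incidents.xml` and the XML layout are all made up for the illustration. The idea is that every save only drops an incident ID on a queue, and a single worker drains the queue and rewrites the XML file once per batch, so 50 simultaneous saves never turn into 50 writers fighting over the same file.

```python
import os
import queue
import tempfile
import threading
import xml.etree.ElementTree as ET

_STOP = object()  # sentinel telling the worker to finish


class SerializedXmlExporter:
    """Hypothetical exporter: many producers, exactly one writer of the XML file."""

    def __init__(self, path):
        self.path = path
        self.writes = 0  # number of times the file was rewritten
        self._pending = queue.Queue()
        self._root = ET.Element("incidents")
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, incident_id):
        # Called on every save; cheap, never touches the file itself.
        self._pending.put(incident_id)

    def close(self):
        # Ask the worker to stop and wait until the last batch is written.
        self._pending.put(_STOP)
        self._worker.join()

    def _run(self):
        stopping = False
        while not stopping:
            batch = [self._pending.get()]  # block until there is work
            while True:  # then grab everything else already queued
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            changed = False
            for item in batch:
                if item is _STOP:
                    stopping = True
                else:
                    ET.SubElement(self._root, "incident", id=str(item))
                    changed = True
            if changed:
                self._write()

    def _write(self):
        # Write to a temporary file and swap it in, so readers never
        # see a half-written document.
        tmp = self.path + ".tmp"
        ET.ElementTree(self._root).write(tmp, encoding="utf-8", xml_declaration=True)
        os.replace(tmp, self.path)
        self.writes += 1


def demo(count=50):
    """Simulate `count` simultaneous saves; return (incidents in file, file writes)."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "incidents.xml")
        exporter = SerializedXmlExporter(path)
        savers = [threading.Thread(target=exporter.submit, args=(i,)) for i in range(count)]
        for saver in savers:
            saver.start()
        for saver in savers:
            saver.join()
        exporter.close()
        saved = ET.parse(path).getroot().findall("incident")
        return len(saved), exporter.writes


if __name__ == "__main__":
    incidents, writes = demo(50)
    print(incidents, "incidents exported in", writes, "file write(s)")
```

How many times the file gets written depends on timing, but it can never be more than the number of saves, and there is only ever one writer. Whether a real ARS filter can hand its work to a queue like this is an assumption on my side.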

1 comment:

Peter said...

Scary.
Really scary.

I am not surprised though. Concurrency is something people for some reason fail to take into account far too often, and when they do they sometimes fail miserably anyway... assuming that file IO is thread safe or some such other retarded assumption.