Over the last few weeks I've seen a few servers become unreachable because of memory usage.
These issues appear to have been caused by processes being blocked by the kernel while it flushes a large backlog of dirty pages (data held in memory that has not yet been written to disk).
A typical error in the messages log looks like:
kernel: INFO: task crond:2828 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
To resolve this issue, I ran the following command. It lowers vm.dirty_ratio, the percentage of memory that may fill with dirty pages before writing processes are forced to flush them to disk themselves, from the default of 40% down to 10%.
echo "vm.dirty_ratio=10" >> /etc/sysctl.conf
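
Appending with >> adds a duplicate line every time the command is re-run. A minimal sketch of a helper that updates the setting in place instead is shown below. The function name set_sysctl_conf and its optional third argument (the file to edit) are my own illustration, not part of any standard tool, and the sketch assumes GNU sed for the -i option.

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): set KEY to VALUE in a
# sysctl-style config file, replacing an existing entry if one is present.
# Usage: set_sysctl_conf KEY VALUE [FILE]   (FILE defaults to /etc/sysctl.conf)
set_sysctl_conf() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}

    # Escape the dots in the key so they match literally in the patterns below.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')

    if grep -q "^[[:space:]]*${key_re}[[:space:]]*=" "$file" 2>/dev/null; then
        # An entry already exists: rewrite it in place (GNU sed assumed for -i).
        sed -i "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        # No entry yet: append a new one.
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Example, to be run as root:
#   set_sysctl_conf vm.dirty_ratio 10
```

Editing /etc/sysctl.conf only affects the values loaded at boot. To load the file immediately, run sysctl -p as root, then confirm the live value with sysctl vm.dirty_ratio.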