Troubleshooting performance issues in Linux

Performance problems are caused by bottlenecks in one or more hardware subsystems, depending on the profile of resource usage on your system. The main subsystems to consider, in roughly decreasing order of likelihood, are disk I/O, memory (RAM), CPU and network capacity.

In our experience, there are common causes of bottlenecks for given types of systems, but sometimes the cause is less obvious.

Common performance bottlenecks

Database systems are usually sensitive to disk I/O and often require a lot of RAM. Anti-virus/spam software may need CPU and possibly network capacity (RBL lookups and other distributed checks). Application servers (Java, PHP, Ruby, Python, etc.) may do a lot of CPU-intensive processing.

Also, sometimes a configuration is not sustainable even if the hardware is top of the line: maybe it forks too many new processes, opens too many file descriptors, or is simply buggy. For example, we have seen programs doing long "sleep()" calls for no clear reason; resource usage was minimal, but the system was sluggish.
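One quick way to look for such misbehaving processes is to check each process's state and kernel wait channel with ps (a sketch; the exact format options can vary slightly between ps implementations):

```shell
#!/bin/sh
# STAT shows the process state (S = sleeping, D = uninterruptible I/O
# wait) and WCHAN the kernel function the process is waiting in.
# Handy for spotting processes stuck in long sleeps while overall
# resource usage looks minimal.
ps -eo pid,stat,wchan:30,comm | head -n 15
```

A process sitting in "S" state with a wait channel like hrtimer_nanosleep for long stretches is a candidate for the kind of pointless sleeping described above.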

Investigating performance issues

To troubleshoot performance issues, your strategy will depend on the nature of the problem. Is the system always slow, or is the problem irregular, appearing as suddenly as it goes away? When exactly does it occur (so the relevant logs can be checked)? And what is the main symptom: does a site stop loading temporarily, or do database requests stop working?

Using the sysstat package to get historical resource usage information

Whatever the case, start by gathering resource usage information. Once you get the hang of it, most of the time you'll be able to spot the root cause of the slowness very easily.

First of all, install sysstat on your server, so you'll get detailed statistics about CPU, memory, disk and other resource usage.

For EL6/CentOS6 based systems:
[root@box ~]# yum install sysstat
[root@box ~]# chkconfig sysstat on
[root@box ~]# chkconfig --list sysstat
sysstat 0:off 1:on 2:on 3:on 4:on 5:on 6:off
[root@box ~]# service sysstat start

For more modern systems such as CentOS 7, install with "yum install sysstat". On Debian 9 or later and recent Ubuntu releases, install with "apt-get install sysstat", then edit the file /etc/default/sysstat if it exists and change the ENABLED variable to "true". Then start the sysstat daemon:

[root@box ~]# systemctl start sysstat
[root@box ~]# systemctl enable sysstat

Once started, sysstat will begin collecting data immediately. Later you'll be able to run "sar -r" or "sar -b" to get memory or disk I/O statistics, respectively. "sar -A" shows a full report.

Don't worry if these numbers are meaningless to you right now; they will be useful for analyzing performance issues later. If you open a support ticket regarding performance, please mention that sysstat is installed and collecting data. That helps us a lot.
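As a sketch of how you might mine the collected data later, the snippet below runs an awk one-liner over sample "sar -r"-style output (embedded here as a stand-in for the real command, since column layouts vary between sysstat versions) to find the sampling interval with the highest memory usage:

```shell
#!/bin/sh
# Find the interval with the highest %memused in "sar -r"-style output.
# This assumes %memused is the 4th column, as in classic sysstat output;
# sar_sample is a stand-in for the real "sar -r" invocation.
sar_sample() {
cat <<'EOF'
12:00:01 kbmemfree kbmemused %memused kbbuffers kbcached
12:10:01    102400    225280    68.75      5120     95232
12:20:01     20480    307200    93.75      2048     40960
12:30:01    153600    174080    53.12      6144    102400
EOF
}
sar_sample | awk 'NR > 1 { if ($4 > max) { max = $4; when = $1 } }
                  END { printf "peak %%memused %s at %s\n", max, when }'
```

On a real system you would pipe "sar -r" (optionally with -s/-e to bound the time window) into the same awk filter.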

We also recommend supplementing this with our memmon.sh script, a scheduled task that is installed on most of our new servers. It regularly logs a full process list and records quick memory and load information to a handy file under /var/log/.

Troubleshooting irregular, sudden slowness

If your system is having irregular, sudden problems, or issues that become critical very quickly, periodic sampling of server data may miss them or represent them inaccurately. Sudden problems require real-time analysis.

The sysstat package provides many tools to help with that task. Be aware of the "Heisenberg effect" of analysis tools: sometimes the tool itself is intrusive enough to make the problem worse.
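A minimal, low-overhead sampler you can leave running while waiting for the problem to reappear might look like this (the interval and count are just example values; redirect the output to a file under /var/log/ to keep a history):

```shell
#!/bin/sh
# Append a timestamped load-average line every few seconds.  Reading
# /proc/loadavg is cheap, so this sampler is unlikely to make the
# problem worse.
INTERVAL=2
COUNT=3
i=0
while [ "$i" -lt "$COUNT" ]; do
    echo "$(date '+%F %T') load: $(cut -d' ' -f1-3 /proc/loadavg)"
    i=$((i + 1))
    [ "$i" -lt "$COUNT" ] && sleep "$INTERVAL"
done
```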

Analyzing disk usage with iostat

Sample usage:

root@box:~# iostat -x 5
Linux (staff.rimuhosting.com)      03/24/07

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.00    0.05    0.11   99.83

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
xvda9        0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00    12.80     0.00    5.60   5.60   0.00
xvda1        0.00   0.07  0.01  0.12    0.51    1.57     0.26     0.79    15.77     0.00   11.29   6.41   0.08

This is the output of a system under heavy stress:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.67    0.00   24.30   44.02    0.00    0.00

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
hda               1.39  6058.37 14.54 81.67   272.51 50165.74   524.22   131.41 1433.21  10.36  99.68
hdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
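In the stressed output above, hda is saturated: %util is effectively 100% and requests wait well over a second (await) before being served. The raw counters that iostat derives these figures from live in /proc/diskstats, and you can peek at them directly:

```shell
#!/bin/sh
# Dump the raw per-device I/O counters iostat works from.  Per the
# kernel's iostats documentation, field 4 is completed reads, field 8
# is completed writes, and field 13 is milliseconds spent doing I/O
# (the basis of iostat's %util).  loop/ram pseudo-devices are skipped.
awk '$3 !~ /^(loop|ram)/ {
    printf "%-10s reads=%-10s writes=%-10s io_ms=%s\n", $3, $4, $8, $13
}' /proc/diskstats
```

These counters only ever increase, which is why tools like iostat sample them twice and report the difference over the interval.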

Memory usage

The easiest way to analyze system memory usage is to use "free -m":

# free -m
             total       used       free     shared    buffers     cached
Mem:           320        314          6          0          5         93
-/+ buffers/cache:        215        104
Swap:          127        110         17

The most important number here is the free figure in the "-/+ buffers/cache" row, in this case 104, which is around 30% of physical memory (320). This is a normal (but not excellent) figure, so this system's memory usage is healthy. If that number were much lower, it would probably mean the system needs more memory.

On this particular server, though, that figure only looks healthy because Linux had just OOM-killed a few processes, freeing a lot of RAM; so remember to check dmesg when analyzing memory usage on a system.

Detailed discussion

Note that when "free -m" shows some swap space in use, that doesn't necessarily mean the system is out of memory right now. It may reflect a temporary need sometime in the past; the pages are still there because clearing the swap is expensive, even if the pages in it are inactive.

Actually, the cost is not in 'clearing' the swap, but in writing those pages out again if more room is needed later. So Linux just leaves them there, because the VM manager will usually need to swap those pages out again anyway. Pages present both in swap and in physical memory (RAM) show up as the SwapCached figure in /proc/meminfo.

Also, pages that belong to a process but were swapped out at some point will stay on disk 'forever', or until the process needs them again, that is, until a major page fault requests the page. So a typical swap area can hold very old pages of long-running processes that haven't touched them in a long time; they were written out at some point to make room for 'hot' pages.

So when analyzing memory usage, instead of just looking at the current swap usage, also look at the 'freeable' memory (free + cached + buffers) to get a better idea.
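As a sketch, both the 'freeable' figure and SwapCached can be pulled straight from /proc/meminfo:

```shell
#!/bin/sh
# Compute "freeable" memory (free + buffers + page cache) and report
# SwapCached, reading the kB values straight from /proc/meminfo.
awk '/^(MemFree|Buffers|Cached|SwapCached):/ { v[$1] = $2 }
     END {
         freeable = v["MemFree:"] + v["Buffers:"] + v["Cached:"]
         printf "freeable: %d MB (free %d + buffers %d + cached %d)\n",
                freeable / 1024, v["MemFree:"] / 1024,
                v["Buffers:"] / 1024, v["Cached:"] / 1024
         printf "swap-cached: %d MB\n", v["SwapCached:"] / 1024
     }' /proc/meminfo
```

Newer kernels also expose a MemAvailable line in /proc/meminfo, which is a more accurate estimate of the same idea.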

A non-zero swap usage figure, on the other hand, just means your system had a higher need for memory at some point in the past. That doesn't mean more RAM is really needed now. Some daily processes called by cron, like logwatch, are known to use a lot of memory, and it's fine for them to dip into swap a bit during that job, since performance is not usually important then.

Just make sure to run expensive cron jobs during off-peak hours.

Also don't forget to check dmesg for OOM kills. If Linux has just killed several processes because there was not enough free memory, "free -m" will show a lot of free space, which can be misleading.
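A quick check along those lines (note that reading the kernel ring buffer may require root on some systems, so this falls back gracefully):

```shell
#!/bin/sh
# Look for recent OOM-killer activity in the kernel ring buffer.
if dmesg 2>/dev/null | grep -iE 'out of memory|oom-?kill'; then
    echo "OOM events found above - free memory figures may be misleading"
else
    echo "no OOM events found in dmesg"
fi
```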

Further reading

Optimizing Linux(R) Performance: A Hands-On Guide to Linux(R) Performance Tools (HP Professional Series)
by Phillip G. Ezolt (Author)