
Troubleshooting performance issues in Linux

Performance problems are caused by bottlenecks in one or more hardware subsystems, depending on the profile of resource usage on your system. Some elements to consider:

The big picture

There is no golden rule for troubleshooting performance issues. There are many different causes of bottlenecks and no definitive mapping between "classes of systems" and "causes". In our experience there are common causes of bottlenecks for given kinds of systems, but don't let that fool you into thinking those always apply.

Common performance bottlenecks

Usually, database systems have I/O-bound resource usage and require a lot of RAM, anti-virus software uses many CPU cycles, and anti-spam software may stress both CPU and network (RBL and other distributed checks). Application servers (Java, PHP, Ruby, Python, etc.) may use a lot of CPU as well.

Also, sometimes a given system is just not scalable enough, even if the hardware is good enough. Maybe it forks too many new processes, opens too many file descriptors, or is simply buggy. We've seen many programs doing long "sleep()" calls for no clear reason. In that case resource usage will be minimal, but the system will still feel sluggish.
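If you suspect a process is in that state, one quick check is to attach strace to it and see whether it spends its time in sleep/wait calls rather than doing real work (a sketch - replace 1234 with the PID of the suspect process):

root@box:~# strace -c -p 1234    # summarize which system calls it makes; Ctrl-C to stop
root@box:~# strace -f -p 1234    # watch calls live; mostly nanosleep/select/poll means it is just waiting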

Investigating performance issues

To troubleshoot performance issues, your strategy will depend on the nature of the problem. Is the system always slow, or is the problem irregular - appearing as suddenly as it goes away?

Troubleshooting constant slowness

Constant problems are much easier to spot. In this case, it is advisable to have historical statistics for resource usage in your system.

Using the sysstat package to get historical resource usage information

What you should do in any case is gather resource usage information. Once you get the hang of it, most of the time you'll be able to spot the root cause of slowness very easily.

First of all, install sysstat on your server, so you'll get detailed statistics about CPU, memory, disk and other resource usage. On Debian-based systems:

[root@box ~]# apt-get install sysstat
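On RedHat-based systems, yum can be used instead:

[root@box ~]# yum install sysstat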

Next, make sure data collection starts at boot. For RedHat-based systems:

[root@box ~]# chkconfig sysstat on
[root@box ~]# chkconfig --list sysstat
sysstat 0:off 1:on 2:on 3:on 4:on 5:on 6:off

For Debian-based systems, including Ubuntu, edit the file /etc/default/sysstat and set the ENABLED variable to "true".
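After the edit, the relevant line in /etc/default/sysstat should look something like this:

[root@box ~]# grep ENABLED /etc/default/sysstat
ENABLED="true"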

Then start the sysstat daemon:

[root@box ~]# /etc/init.d/sysstat start

sysstat starts collecting data from now on. Later you'll be able to run, for example, "sar -r" or "sar -b" to get memory or I/O (disk) statistics, respectively. "sar -A" shows a full report.
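For example (a sketch - the daily data files live under /var/log/sa/ on RedHat-based systems and /var/log/sysstat/ on Debian-based ones, named after the day of the month):

[root@box ~]# sar -r 5 3                  # memory usage, 3 samples 5 seconds apart
[root@box ~]# sar -b                      # I/O transfer rates for today so far
[root@box ~]# sar -r -f /var/log/sa/sa15  # memory figures recorded on the 15th of the month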

Don't worry if these numbers are meaningless to you right now; we'll be able to use them to analyze performance issues better. If you later open a support ticket regarding performance, please mention that sysstat is installed and collecting data. That helps us a lot.

Troubleshooting irregular, sudden slowness

If, on the other hand, your system is having irregular, sudden problems, gathering historical data may not help much. Sudden problems will require real-time analysis.

The sysstat package provides many tools to help with that task. Be aware of the "Heisenberg effect" of analysis tools - sometimes the analysis tool is so intrusive that it makes the problem worse.
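Two relatively low-overhead starting points from sysstat itself (a sketch; adjust the 5-second interval to taste):

root@box:~# mpstat -P ALL 5    # per-CPU utilization, printed every 5 seconds
root@box:~# pidstat 5          # per-process CPU usage, printed every 5 seconds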

Analyzing disk usage with iostat

Sample usage ("-x" asks for extended statistics and "5" is the sampling interval in seconds; the first report shows averages since boot):

root@box:~# iostat -x 5
Linux (staff.rimuhosting.com)      03/24/07

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.00    0.05    0.11   99.83

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
xvda9        0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00    12.80     0.00    5.60   5.60   0.00
xvda1        0.00   0.07  0.01  0.12    0.51    1.57     0.26     0.79    15.77     0.00   11.29   6.41   0.08

This is the output of a system under heavy stress:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.67    0.00   24.30   44.02    0.00    0.00

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
hda               1.39  6058.37 14.54 81.67   272.51 50165.74   524.22   131.41 1433.21  10.36  99.68
hdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
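The numbers to look at here are %util (the disk is busy nearly 100% of the time), await (requests wait well over a second on average) and avgqu-sz (a long queue of outstanding requests): the hda disk is saturated. To find out which processes are generating that I/O, pidstat from sysstat can help, or a tool such as iotop if it is installed; a rough sketch:

root@box:~# pidstat -d 5    # per-process read/write rates every 5 seconds (needs a reasonably recent kernel)
root@box:~# iotop -o        # only show processes currently doing I/O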

Memory usage

The easiest way to analyze system memory usage is to use "free -m":

# free -m
             total       used       free     shared    buffers     cached
Mem:           320        314          6          0          5         93
-/+ buffers/cache:        215        104
Swap:          127        110         17

The most important number here is the free figure in the "-/+ buffers/cache" row - in this case 104 MB, which is around 30% of physical memory (320 MB). That's a normal (though not excellent) figure, so memory usage on this system is healthy. If that number were much lower, it would probably mean the system needs more memory.

On this server in particular, though, that's only the case because Linux has just OOM-killed a few processes, which freed up a lot of RAM - so remember to check dmesg when analyzing memory usage on a system.

Detailed discussion

Note that when "free -m" shows some swap space in use, that doesn't necessarily mean the system is out of memory right now. It may just reflect a temporary need sometime in the past, and the pages are still there because it's expensive to clear the swap, even if they are inactive.

Strictly speaking, the cost is not in 'clearing' the swap, but in writing those pages back out to swap if more room is needed again later. So Linux just leaves the copy in swap: if the page has to be evicted again and it hasn't been modified in the meantime, it can simply be dropped from RAM without another write to disk. Pages that live both in swap and in physical memory (RAM) show up as the SwapCached figure in /proc/meminfo.

Also, pages which are part of a process but were swapped out to disk sometime in the past will stay there 'forever', or until the process needs them again - that is, until a major page fault happens and the page is requested. So a typical swap area can carry very old pages of long-running processes which haven't touched those pages in a long time; they were put there at some point to make room for other 'hot' pages.

So when analyzing memory usage, instead of just looking at current swap usage, also look at the 'freeable memory' - free + cached + buffers - to get a better idea.
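A quick way to pull those figures straight from the kernel, plus a one-liner that sums them from the "free -m" layout shown above (a sketch - newer versions of free arrange the columns differently):

root@box:~# egrep '^(MemFree|Buffers|Cached|SwapCached):' /proc/meminfo
root@box:~# free -m | awk '/^Mem:/ {print "freeable (MB):", $4 + $6 + $7}'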

The swap usage figure, on the other hand, just means that your system had a higher need for memory at some point in the past. That doesn't necessarily mean more RAM is needed. Some daily processes called by cron, like logwatch, are known to use a lot of memory, and it's OK for them to hit swap a bit during that job, since performance usually isn't critical then.

Just make sure to run expensive cron jobs during off-peak hours.

Also don't forget to check for oom-kills in dmesg.  If Linux has just killed several processes because there was not enough free memory, "free -m" will show a lot of free space, which would be misleading.
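A quick check (the exact wording varies between kernel versions):

root@box:~# dmesg | egrep -i 'oom|out of memory'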

Further reading

Optimizing Linux(R) Performance: A Hands-On Guide to Linux(R) Performance Tools (HP Professional Series), by Phillip G. Ezolt