Dstat helps you figure out why your computer is running slow

Linux users have a large repertoire of tools at their fingertips for figuring out the current workload on system resources. You may already have used tools such as iostat and netstat. These and other tools are either aimed at professionals or display only part of the picture. The dstat tool, on the other hand, suits both novice and experienced administrators because it offers well-structured, color-coded output. With only a little practice, even a less experienced user can spot the processes that place a significant load on the system.

Dag Johan Maarten Wieers from Belgium was the primary developer for dstat, which is written in Python. Originally, he intended the program to bring the functions of the well-known tools ifstat, iostat, netstat, and vmstat together in one place to give the user a comprehensive view of network, disk, and memory status.

The tool also includes numerous extensions that display metrics for many different applications. Although the kernel provides the standard measurement values in the usual way via a virtual proc filesystem, the software has its own modules to read the measurement values for applications.

The program is available for installation as a binary package for all current distributions. If necessary, you can tie in additional sources from the community when the package is not in the repositories of the distribution you are using.

Simple Yet Powerful

On Debian-based systems that don't come with the package already installed, just use the sudo apt-get install dstat command for adding the dstat package. It is usually known by the same name on other distributions as well. The package manager for the distribution easily takes care of installation.

Typically, the Debian package deposits the program in /usr/bin/. The modules that deliver the actual functionality are found chiefly in /usr/share/dstat/. When called, the software relies on modules from this directory. If you want to write your own extension for the program, this directory contains many helpful examples.

Dstat has numerous options for reading out only the information that is relevant at the moment. If you call the tool without further specification, it will behave as if you had combined the -c, -d, -n, -g, and -y options (Figure 1). You can get additional information about the use of the various options by calling the manpage via man dstat. See also Table 1.

Figure 1: If called without parameters, dstat will deliver a set of values that may make it possible to get an initial impression of bottlenecks.

Table 1

Dstat Switches

Switch          Function
-c              Displays CPU metrics
-d              Outputs bandwidth for disk use
-g              Activates paging statistics
-l              Load average according to Linux kernel
-m              Displays values for the RAM
-n              Outputs bandwidth of the network throughput
-s              Activates swapping statistics
-y              Output of important system values
--disk-tps      Number of disk operations per second
--net-packets   Number of packets passing through the network interfaces
--thermal       Reads out temperature sensors
--top-cpu       Displays the process causing the largest CPU load
--top-io        Displays the application with the highest disk throughput
--top-mem       Displays the application with the largest memory use
--bw            Activates a different color profile
--nocolor       Deactivates all colors

The program displays the desired values in the form of a table and preserves constant widths for the columns. The choice of colors is optimized for dark backgrounds. An option exists to switch via --bw to a design for light backgrounds or via --nocolor to a monochromatic display.

The first lines of output echo the implicitly activated switches. The following line names the five large areas under which the individual columns are grouped. Accordingly, in the example from Figure 1, the program shows the metrics for the CPU, hard drive, network, paging, and system.

During testing, the system was basically running in idle. Therefore, the use of the color red, as seen in the last column in Figure 1, merely serves to provide a better overview rather than an indication of any problem. However, dstat utilizes color changes inside a single column to indicate altered conditions such as the switch from idle mode into full load. The color green is therefore not necessarily indicative of a healthy condition, because the program sometimes uses colors randomly.

To end the program, you only need to press the Ctrl+C keys. If you know from the beginning how long you want the program to run, you can specify this when you call it. For example, the program will run for 5 seconds with the call dstat 1 5 and update the output in intervals of 1 second.

The Processor at a Glance

When reviewing the system to figure out why it is running more slowly than usual, it makes sense to look at the CPU metrics. To do this, call dstat as follows:

$ dstat -c -y -l --proc-count --top-cpu

Figure 2 once again shows the CPU and System areas, produced by -c and -y, respectively. Additionally, dstat displays the development of the load (-l), the number of processes (--proc-count), and the process that uses the most CPU resources (--top-cpu). For a system with multiple processors and cores, the tool summarizes the workload of all of the CPUs. If you would like a detailed report, you can use the -C option together with a comma-separated list of the cores to monitor.

Figure 2: In addition to standard CPU metrics, the example indicates which process eats up the largest share of computing time.

The first area of the output, for the CPU, contains the typical values reported by such tools. These include usr and sys, which indicate what percentage of the used CPU time was spent in user space and kernel space. The idl column contains the percentage of unused CPU capacity.

This last value is especially important for determining whether there are potential issues. If it is high, then the system is mostly idle. The wai column shows the percentage of time the CPU spent waiting for I/O to complete. If these values are high, then it is possible that a bottleneck has formed in the I/O subsystem.

The columns labeled hiq and siq show the share of CPU time spent servicing hardware and software interrupts, respectively. High values indicate heavy use of the system, but they do not necessarily mean that a problem has occurred.
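As a rough illustration of where CPU columns like these can come from, the following Python sketch derives usr, sys, idl, wai, hiq, and siq as shares of CPU time from two samples of the aggregate cpu line in /proc/stat. The function names and sample figures are invented for this example, and counting nice time as part of usr is an assumption; none of this is taken from dstat's own source.

```python
def parse_cpu_line(line):
    """Map the aggregate 'cpu' line of /proc/stat to dstat-style column names."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    user, nice, system, idle, iowait, irq, softirq = (int(v) for v in fields[1:8])
    return {
        "usr": user + nice,  # assumption: nice time is counted as user time
        "sys": system,
        "idl": idle,
        "wai": iowait,
        "hiq": irq,
        "siq": softirq,
    }


def cpu_percentages(before, after):
    """Return each column's share of the CPU time elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Two made-up samples taken one interval apart (values are clock ticks).
CPU_BEFORE = "cpu  1000 0 500 8000 400 50 50 0 0 0"
CPU_AFTER = "cpu  1100 0 550 8800 440 55 55 0 0 0"

cpu_share = cpu_percentages(parse_cpu_line(CPU_BEFORE), parse_cpu_line(CPU_AFTER))
print(cpu_share)
```

On a live system, you could read the first line of /proc/stat twice, one second apart, and pass both lines through the same two functions.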

The system area is divided into the int column for the total number of interrupts and csw for context switches. A context switch occurs whenever, due to multitasking, the kernel pauses one process so that another can use the CPU. If the number of context switches is higher than normal, this can indicate that the CPU cannot keep up with the tasks at hand. However, this holds true only if the idle time value referred to above is close to zero.

The third large field load-avg shows the system load for the last 60 seconds, the last 5 minutes and the last 15 minutes as reported by the kernel. In a Linux system, the load value serves as the standard indicator of whether the system is overloaded or idle. The software determines the load values by checking how many processes are waiting for execution on a particular CPU. The next column proc indicates the current number of processes that are running.

The last column, most-expensive, shows which process is currently using the most CPU resources. As long as the load remains smaller than the number of available processing cores, the computer will be half asleep. When the load climbs to more than twice the number of available cores, problems will arise, because the CPU can no longer handle the tasks at hand. Each value in a row represents a snapshot for the most recent second. If the CPU measurements are unexpectedly high, then it is a good idea to take a closer look at the memory usage.
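The rule of thumb about load and core count can be written down in a few lines. The following Python sketch is only an illustration: the function names and the three labels ("relaxed", "busy", "overloaded") are made up for this example, and the thresholds are simply those stated above.

```python
import os


def classify_load(load, cores):
    """Apply the rule of thumb to a load average and a core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if load < cores:
        return "relaxed"      # load is below the number of cores
    if load > 2 * cores:
        return "overloaded"   # load is more than twice the number of cores
    return "busy"             # anything in between


def current_verdict():
    """Classify the 1-minute load average of the machine running this code."""
    one_minute, _five, _fifteen = os.getloadavg()
    return classify_load(one_minute, os.cpu_count() or 1)


print(classify_load(1.5, 4))
print(classify_load(6.0, 4))
print(classify_load(9.0, 4))
```

Calling current_verdict() on a Linux machine applies the same rule to the live 1-minute load average.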

Gobbling Up Memory

The memory capacity of older systems is very limited. If such a system does not react as expected and the CPU metrics are inconclusive, then a memory-hungry process could be the culprit. To figure out whether this is true, call dstat with the parameters -g -m -s --top-mem and take a look at the pertinent values (Figure 3).

Figure 3: Dstat helps you locate memory hogs quickly.

The relevant fields Paging (-g), Memory usage (-m), Swap (-s), and Most expensive (--top-mem) have to do with the virtual system memory as well as the real main memory available to applications for completing their tasks.

Normally, idle processes do not permanently require memory, because the kernel will page out some or all of the memory for an idle process to the hard disk. See the second column in Figure 3 with the heading Page Out. Once this has taken place, additional main memory becomes available for other processes. The reverse happens when the system reads the memory back into RAM as soon as the application requires it. See the first column in Figure 3 with the heading Page In.

The Memory Usage area, on the other hand, shows the metrics for the real memory. It tells you how much RAM is currently being used by programs (used), how much the kernel is using for buffers (buff), how much holds cached file data from storage media (cach), and how much RAM is still available (free).

In general, the greater the value in the used column and the smaller the value under free, the greater the system load caused by the programs. If the unused RAM capacity is tending toward zero, the kernel will begin swapping. This means that it will write memory reserved by the processes onto the hard disk. Because the disk works considerably more slowly than main memory, the entire system will slow down. The third area in the output shown in Figure 3 indicates how much data the kernel has swapped out (used) and how much capacity is left for swapping (free).
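To make the relationship between these columns concrete, here is a small Python sketch that derives comparable figures from text in the format of /proc/meminfo. The sample values and function names are invented, and the formula for used (total minus free, buffers, and cache) is an assumption for illustration; the exact calculation in dstat may differ between versions.

```python
# Made-up excerpt in the format of /proc/meminfo (values in kB).
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          200000 kB
Cached:          1800000 kB
SwapCached:            0 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""


def parse_meminfo(text):
    """Turn 'Key:  value kB' lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            values[key.strip()] = int(rest.split()[0])
    return values


def memory_summary(info):
    """Compute figures comparable to the Memory Usage and Swap areas (in kB)."""
    return {
        # assumption: 'used' excludes buffers and cache
        "used": info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"],
        "buff": info["Buffers"],
        "cach": info["Cached"],
        "free": info["MemFree"],
        "swap_used": info["SwapTotal"] - info["SwapFree"],
        "swap_free": info["SwapFree"],
    }


mem = memory_summary(parse_meminfo(SAMPLE_MEMINFO))
print(mem)
```

To try it on a real machine, pass the contents of /proc/meminfo to parse_meminfo instead of the sample string.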

The --top-mem option, which shows the largest consumer of resources, also gives you information quickly. The benchmarking tool Mbw [6] was used for Figure 3 to determine the amount of memory taken up by the application Compiz. Dstat shows how the kernel increasingly uses paging and moves more and more data into the swap area. This explains why the system reacted slowly when dstat was called.

When looking at memory load, you will find the following relationship. The more actively the swapping and paging is taking place, the slower the system will work. Swapping particularly diminishes the I/O bandwidth available on the hard disk.

Data Throughput

Just as with CPU and RAM workloads, the data storage devices containing the operating system and applications have an impact on the performance of a system. Dstat has options that permit measurements of disk access and data throughput per second. These are usually labeled "Disk I/O" (Figure 4).

Figure 4: Dstat shows a change to green when writing occurs to indicate that there has been a change.

The dstat -d --disk-util --disk-tps command gives an overview of all data pertinent to these actions. The dsk/total (-d) area shows the read and write throughput for all data storage devices combined. The second area shows the current utilization in percent for each known storage device; in Figure 4, this is the column sda produced by the --disk-util option. The third area, also labeled dsk/total, indicates the number of read and write operations per time interval (--disk-tps).

It is a warning sign when the hard disk workload rises significantly. Once it reaches 100 percent, the system grinds to a halt. At this point, the time spent waiting for disk I/O (CPU statistics, column wai) increases dramatically.
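Throughput, utilization, and operations per second can all be derived from the counters the kernel exposes in /proc/diskstats. The following Python sketch shows one way to do that from two samples of a single device line. The sample lines, the function names, and the assumption that the samples lie exactly one second apart are all made up for illustration; this is not dstat's own code.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats_line(line):
    """Pick the counters needed here out of one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),            # reads completed
        "sectors_read": int(f[5]),
        "writes": int(f[7]),           # writes completed
        "sectors_written": int(f[9]),
        "io_ms": int(f[12]),           # milliseconds spent doing I/O
    }


def disk_rates(before, after, interval_s):
    """Turn two samples into per-second throughput, utilization, and ops."""
    d = {k: after[k] - before[k] for k in before if k != "name"}
    return {
        "read_bytes_per_s": d["sectors_read"] * SECTOR_BYTES / interval_s,
        "write_bytes_per_s": d["sectors_written"] * SECTOR_BYTES / interval_s,
        "util_percent": 100.0 * d["io_ms"] / (interval_s * 1000.0),
        "tps": (d["reads"] + d["writes"]) / interval_s,
    }


# Two made-up samples for the device sda, assumed to be one second apart.
DISK_BEFORE = "8 0 sda 1000 0 20000 500 2000 0 40000 900 0 1200 1400"
DISK_AFTER = "8 0 sda 1010 0 22048 520 2030 0 44096 950 0 1700 1500"

disk = disk_rates(parse_diskstats_line(DISK_BEFORE), parse_diskstats_line(DISK_AFTER), 1.0)
print(disk)
```

In this sample the device spent 500 of the 1,000 elapsed milliseconds doing I/O, which corresponds to a utilization of 50 percent.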

Data Streams

It is often a good idea to take a look at the network when you want to know about the current resource consumption on a system. Depending on the data volume flow, the kernel and the application can lay claim to computing time. The option combination -n --net-packets lets you see how things are going with the current load (Figure 5).

Figure 5: Data streaming over the network also affects the condition of the system.

The net/total area summarizes all of the network interfaces and indicates the bandwidth for the incoming (recv) and outgoing (send) data traffic in bytes or kbytes per time interval. The pkt/total category, on the other hand, refers to the number of data packets received (#recv) and sent (#send) on all network interfaces.

The information found under net/total may provide insight into bottlenecks. In the test, for example, the bandwidth for receiving data amounted to about 1.2 Mbit/s with a 16-Mbit DSL connection with 2 Mbit/s downstream. This was an indication that not much more breathing space was available.
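For comparing such readings with the nominal speed of a link, it helps to convert bytes per second into Mbit/s. The Python sketch below sums the byte and packet counters from text in the format of /proc/net/dev and performs that conversion. The sample data and function names are invented, and skipping the loopback interface lo in the totals is an assumption of this example.

```python
NET_BEFORE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    5000      50    0    0    0     0          0         0     5000      50    0    0    0     0       0          0
  eth0: 1000000    1000    0    0    0     0          0         0   200000     800    0    0    0     0       0          0
"""

NET_AFTER = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    6000      60    0    0    0     0          0         0     6000      60    0    0    0     0       0          0
  eth0: 1150000    1120    0    0    0     0          0         0   230000     830    0    0    0     0       0          0
"""


def parse_net_dev(text, skip=("lo",)):
    """Sum byte and packet counters over all interfaces not listed in skip."""
    totals = {"recv": 0, "send": 0, "#recv": 0, "#send": 0}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or name.strip() in skip or len(fields) < 10:
            continue
        totals["recv"] += int(fields[0])   # received bytes
        totals["#recv"] += int(fields[1])  # received packets
        totals["send"] += int(fields[8])   # transmitted bytes
        totals["#send"] += int(fields[9])  # transmitted packets
    return totals


def net_rates(before, after, interval_s):
    """Per-second rates between two sets of totals."""
    return {k: (after[k] - before[k]) / interval_s for k in before}


def to_mbit_per_s(bytes_per_s):
    """Convert bytes per second to megabits per second."""
    return bytes_per_s * 8 / 1_000_000


net = net_rates(parse_net_dev(NET_BEFORE), parse_net_dev(NET_AFTER), 1.0)
print(net, to_mbit_per_s(net["recv"]))
```

In the sample, 150,000 bytes received in one second correspond to 1.2 Mbit/s.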

Combinations

Each option lets you keep an eye on specific individual resources. If necessary, you can combine various options, for example, to observe both disk I/O and the network load simultaneously.

It also makes sense to take a look at the dstat manpage, which shows numerous additional options for gathering useful information. You could try combining -n --net-packets with the --top-io option to see what happens. Some of the dstat options prove useful in monitoring servers. For example, there are specific options useful for Postfix and MySQL.
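If you find yourself typing the same combinations repeatedly, a small wrapper can assemble them. The following Python sketch builds a dstat command line from a list of switches plus the delay and count arguments described earlier, and runs it only if dstat is found on the system. The helper names dstat_command and run_dstat are hypothetical; the switches used are those mentioned in this article.

```python
import shutil
import subprocess


def dstat_command(options, delay=1, count=None):
    """Build an argument list such as ['dstat', '-d', '-n', '1', '5']."""
    if delay < 1:
        raise ValueError("delay must be at least 1 second")
    argv = ["dstat", *options, str(delay)]
    if count is not None:
        argv.append(str(count))
    return argv


def run_dstat(options, delay=1, count=5):
    """Run dstat if it is installed; return its exit code, or None if missing."""
    if shutil.which("dstat") is None:
        return None
    return subprocess.run(dstat_command(options, delay, count)).returncode


# Disk and network throughput, packet counts, and the top I/O process,
# updated every second for five updates.
combined = dstat_command(["-d", "-n", "--net-packets", "--top-io"], delay=1, count=5)
print(" ".join(combined))
```

Calling run_dstat with the same list of switches starts the actual measurement and returns None on machines where dstat is not installed.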

Conclusion

The all-around tool dstat offers many options for taking a look into all areas of a system. As soon as a computer slows down, you can easily identify where the resources are being consumed via a quick invocation of dstat. The numerous modules also offer the possibility for gathering metrics on various applications, which can be especially interesting for server administrators.

Because dstat is implemented in Python, many measurements are susceptible to a minimal deviation. After all, dstat itself consumes resources. The dstat manpage has good instructions for how to determine the resource usage of dstat itself.