IN THIS CHAPTER
Troubleshooting boot loaders
Troubleshooting system initialization
Fixing software packaging problems
Checking network interface issues
Dealing with memory problems
Using rescue mode
In any complex operating system, lots of things can go wrong. You can fail to save a file because you are out of disk space. An application can crash because the system is out of memory. The system can fail to boot up properly for, well, lots of different reasons.
In Linux, the dedication to openness and the focus on making the software run with maximum efficiency has led to an amazing number of tools you can use to troubleshoot every imaginable problem. In fact, if software isn't working as you would like, you even have the ultimate opportunity to rewrite the code yourself (although we don't cover how to do that here).
This chapter takes on some of the most common problems you can run into on a Linux system and describes the tools and procedures you can use to overcome those problems. Topics are broken down by areas of troubleshooting, such as the boot process, software packages, networking, memory issues, and rescue mode.
Before you can begin troubleshooting a running Linux system itself, that system needs to boot up. For a Linux system to boot up, a series of things has to happen. A Linux system installed directly on a PC architecture computer goes through the following steps to boot up:
The exact activities that occur at each of these points have undergone a transformation in recent years. Boot loaders are changing to accommodate new kinds of hardware. The initialization process is changing so services can start more efficiently, based on dependencies and in reaction to the state of the system (such as what hardware is plugged in or what files exist) rather than a static boot order.
Troubleshooting the Linux boot process begins when you turn on your computer and ends when all the services are up and running. At that point, typically a graphical or text-based login prompt is available from the console, ready for you to log in.
After reading the short descriptions of startup methods, go to “Starting from the firmware” to understand what happens at each stage of the boot process and where you might need to troubleshoot. Because the general structure of the Linux boot process is the same for the three Linux systems featured here (Fedora, RHEL, and Ubuntu), I go through the boot process only once, but I describe the differences between them as I go.
It's up to the individual Linux distribution how the services associated with the running Linux system are started. After the boot loader starts the kernel, how the rest of the activities (mounting filesystems, setting kernel options, running services, and so on) are done is all managed by the initialization process.
As I describe the boot process, I focus on two different types of initialization: System V init and systemd. I also briefly mention a third type called Upstart, which has been deployed until recently on Debian and Ubuntu distribution, but is now being replaced with systemd. (Ubuntu 14.04, LTS used in this book still uses Upstart by default.)
The System V init facility consists of the init process (the first process to run after the kernel itself), an /etc/inittab file that directs all start-up activities, and a set of shell scripts that starts each of the individual services. The first Fedora releases and up to RHEL 5 used the System V init process. RHEL 6 contains a sort of hybrid of System V init, with the init process itself replaced by the Upstart init process.
System V init was developed for UNIX System V at AT&T in the mid-1980s when UNIX systems first incorporated the start-up of network interfaces and the services connected to them.
It has been supplanted only over the past few years by Upstart and systemd to better suit the demands of modern operating systems.
In System V init, sets of services are assigned to what is referred to as runlevels. For example, the multi-user runlevel can start basic system services, network interfaces, and network services. Single-user mode just starts enough of the basic Linux system so someone can log in from the system console, without starting network interfaces or services. After a System V init system is up and running, you can use commands such as reboot, shutdown, and init to change runlevels. You can use commands such as service and chkconfig to start/stop individual services or enable/disable services, respectively.
The System V init scripts are set to run in a specific order, with each script having to complete before the next can start. If a service fails, there is no provision for that service to restart automatically. In contrast, systemd and Upstart were designed to address these and other System V init shortcomings.
The systemd facility is quickly becoming the present and future of the initialization process for many Linux systems. It was adopted in Fedora 15, in RHEL 7, and is scheduled to replace Upstart in Debian and Ubuntu 15.04. Although systemd is more complicated than System V init, it also offers many more features, such as these:
When a systemd-enabled Linux system starts up, the first running process (PID 1) is the systemd daemon (instead of the init daemon). Later, the primary command for managing systemd services is the systemctl command. Managing systemd journal (log) messages is done with the journalctl command. You also have the ability to use old-style System V init commands, such as init, poweroff, reboot, runlevel, and shutdown to manage services.
As noted earlier, the Debian and Ubuntu Linux distributions used the Upstart project for a while to replace the older System V init facility. Although those distributions plan to move soon to systemd, if you are using a recent Debian or Ubuntu (pre-14.10) system, chances are that Upstart is controlling the startup of your system services.
Like systemd, Upstart allowed services to start in parallel, after their particular dependencies were met. Another of Upstart's major improvements over System V init is that it can start services by reacting when certain events occur (such as when a piece of hardware is connected).
If you are using a Linux system the employs Upstart, you should know a few things about it. The first process that starts on an Upstart system is still called the init process, but it is actually an Upstart daemon. After it is running, services in an Upstart system register events with the Upstart daemon. When an event occurs, Upstart can start, stop, or change a process to react to that event.
Because Upstart is being phased out of the Linux distributions covered in this book (Fedora, RHEL, Debian, and Ubuntu), our discussion of understanding and troubleshooting the boot process is focused on the traditional System V init process and the newer systemd facility.
When you physically turn on a computer, firmware is loaded to initialize the hardware and find an operating system to boot. On PC architectures, that firmware has traditionally been referred to as BIOS (Basic Input Output System). In recent years, a new type of firmware has become available (to replace BIOS on some computers), called UEFI (Unified Extensible Firmware Interface). The two are mutually exclusive.
UEFI was designed to allow a secure boot feature, which can be used to ensure that only operating systems whose components have been signed can be used during the boot process. UEFI can still be used with nonsigned operating systems by disabling the secure boot feature.
For Ubuntu, Secure boot was first supported in 12.04.2. RHEL 7 also officially supports secure boot. The main job of BIOS and UEFI firmware is to initialize the hardware and then hand off control of the boot process to a boot loader. The boot loader then finds and starts the operating system. After an operating system is installed, typically you should let the firmware do its work and not interrupt it.
There are, however, occasions when you want to interrupt the firmware. For this discussion, we focus on how BIOS generally works. Right after you turn on the power, you should see a BIOS screen that usually includes a few words noting how to go into Setup mode and change the boot order. If you press the function key noted (often F1, F2, or F12) to choose one of those two items, here's what you can do:
For my Dell workstation, after I see the BIOS screen, I immediately press the F2 function key to go to into Setup or F12 to temporarily change the boot order. The next sections explore what you can troubleshoot from the Setup and Boot Order screens.
As I already noted, you can usually let the BIOS start without interruption and have the system boot up to the default boot device (probably the hard drive). However, here are some instances when you may want to go into Setup mode and change something in the BIOS:
If you can't get an operating system booted at all, the BIOS Setup screen may be the only way to determine the system model, processor type, and other information you need to search for help or call for support.
For example, let's say your computer has two network interface cards (NICs). You want to use the second NIC to install Linux over a network, but the installer keeps trying to use the first NIC to connect to the network. You can disable the first NIC so the installer doesn't even see the NIC when it tries to connect to the network. Or, you can keep the NIC visible to the computer, but simply disable the NIC's ability to PXE boot.
Maybe you have an audio card and you want to disable the integrated audio on the motherboard. That can be done in the BIOS as well.
Conversely, sometimes you want to enable a device that has been disabled. Perhaps you were given a computer that had a device disabled in the BIOS. From the operating system, for example, it may look like you don't have a parallel (LPT) port or CD drive. Looking at the BIOS tells you whether those devices are not available simply because they have been disabled in the BIOS.
Depending on the hardware attached to your computer, a typical boot order might boot a CD/DVD drive first, then the hard drive, then a USB device, and finally the network interface card. The BIOS would go to each device, looking for a boot loader in the master boot record for that device. If the BIOS finds a boot loader, it starts it. If no boot loader is located, the BIOS moves on to the next device, until all are tried. If no boot loader is found, the computer fails to boot.
One problem that could occur with the boot order is that the device you want to boot may not appear in the boot order at all. In that case, going to the Setup screen, as described in the previous section, to either enable the device or change a setting to make it bootable, may be the thing to do.
If the device you want to boot from does appear in the boot order, typically you just have to move the arrow key to highlight the device you want and press Enter. The following are reasons for selecting your own device to boot:
Assuming you get past any problems you have with the BIOS, the next step is for the BIOS to start the boot loader.
Typically, the BIOS finds the master boot record on the first hard disk and begins loading that boot loader in stages. Chapter 9, “Installing Linux,” describes the GRUB boot loader that is used with most modern Linux systems, including RHEL, Fedora, and Ubuntu. The GRUB boot loader in RHEL 6, described here, is an earlier version than the GRUB 2 boot loader included with RHEL 7, Fedora, and Ubuntu. (Later, I introduce you to the GRUB 2 boot loader as well.)
In this discussion, I am interested in the boot loader from the perspective of what to do if the boot loader fails or what ways you might want to interrupt the boot loader to change the behavior of the boot process.
Here are a few ways in which the boot loader might fail in RHEL 6 and some ways you can overcome those failures:
This probably means that the master boot record portion of GRUB was found, but when GRUB looked on the hard drive to find the next stage of the boot process and a menu of operating systems to load, it could not find them. Sometimes this happens when the BIOS detects the disks in the wrong order and looks for the grub.conf file on the wrong partition.
One workaround to this problem, assuming grub.conf is on the first partition of the first disk, is to list the contents of this file and enter the root, kernel, and initrd lines manually. To list the file, type cat (hd0,0)/grub/grub.conf. If that doesn't work, try hd0,1 to access the next partition on that disk (and so on) or hd1,0 to try the first partition of the next disk (and so on). When you find the lines representing the grub.conf file, manually type the root, kernel, and initrd lines for the entry you want (replacing the location of the hard drive you found on the root line). Then type boot. The system should start up, and you can go and manually fix your boot loader files. See Chapter 9 for more information on the GRUB boot loader.
If the BIOS finds the boot loader in the master boot record of the disk and that boot loader finds the GRUB configuration files on the disk, the boot loader starts a countdown of about three to five seconds. During that countdown, you can interrupt the boot loader (before it boots the default operating system) by pressing any key.
When you interrupt the boot loader, you should see a menu of available entries to boot. Those entries can represent different available kernels to boot. But they may also represent totally different operating systems (such as Windows, BSD, or Ubuntu).
Here are some reasons to interrupt the boot process from the boot menu to troubleshoot Linux:
Why would you boot to different runlevels for troubleshooting? Runlevel 1 bypasses authentication, so you boot directly to a root prompt. This is good if you have forgotten the root password and need to change it (type passwd to do that). Runlevel 3 bypasses the start of your desktop interface. Go to runlevel 3 if you are having problems with your video driver and want to try to debug it without it trying to automatically start up the graphical interface.
You may want to add kernel options to add features to the kernel or temporarily disable hardware support for a particular component. For example, adding init=/bin/bash causes the system to bypass the init process and go straight to a shell (similar to running init 1). In RHEL 7, adding 1 as a kernel option is not supported, so init=/bin/bash is the best way to get into a sort of single-user mode. Adding nousb would temporarily disable the USB ports (presumably to make sure anything connected to those ports would be disabled as well).
Assuming you have selected the kernel you want, the boot loader tries to run the kernel, including the content of the initial RAM disk (which contains drivers and other software needed to boot your particular hardware).
After the kernel starts, there isn't much to do except watch for potential problems. For RHEL, you see a Red Hat Enterprise Linux screen with a slow-spinning icon. If you want to watch messages detailing the boot process scroll by, press the Esc key.
At this point, the kernel tries to load the drivers and modules needed to use the hardware on the computer. The main things to look for at this point (although they may scroll by quickly) are hardware failures that may prevent some feature from working properly. Although much more rare than it used to be, there may be no driver available for a piece of hardware, or the wrong driver may get loaded and cause errors.
In addition to scrolling past on the screen, messages produced when the kernel boots are copied to the kernel ring buffer. As its name implies, the kernel ring buffer stores kernel messages in a buffer, throwing out older messages after that buffer is full. After the computer boots up completely, you can log into the system and type the following command to capture these kernel messages in a file (then view them with the less command):
# dmesg > /tmp/kernel_msg.txt # less /tmp/kernel_msg.txt
I like to direct the kernel messages into a file (choose any name you like) so the messages can be examined later or sent to someone who can help debug any problems. The messages appear as components are detected, such as your CPU, memory, network cards, hard drives, and so on.
In Linux systems that support systemd, kernel messages are stored in the systemd journal. So instead of using the dmesg command, you can run journalctl to see kernel messages from boot time to the present. For example, here are kernel messages output from a RHEL 7 system:
# journalctl -k
Sep 07 12:03:07 host kernel: CPU0 microcode updated early to revision
0xbc
Sep 07 12:03:07 host kernel: Initializing cgroup subsys cpuset
Sep 07 12:03:07 host kernel: Initializing cgroup subsys cpu
Sep 07 12:03:07 host kernel: Initializing cgroup subsys cpuacct
Sep 07 12:03:07 host kernel: Linux version 3.10.0-123.6.3.el7.x86_64
Sep 07 12:03:07 host kernel: Command line:
BOOT_IMAGE=/vmlinuz-3.10.0-123.6.3.el7.x86_64 root=/dev/mapper/vg
Sep 07 12:03:07 host kernel: e820: BIOS-provided physical RAM map:
...
Look for drivers that fail to load or messages that show that certain features of the hardware failed to be enabled. For example, I once had a TV tuner card (for watching television on my computer screen) that set the wrong tuner type for the card that was detected. Using information about the TV card's model number and the type of failure, I found that passing an option to the card's driver allowed me to try different settings until I found the one that matched my tuner card.
In describing how to view kernel startup messages, I have gotten ahead of myself a bit. Before you can log in and see the kernel messages, the kernel needs to finish bringing up the system. As soon as the kernel is done initially detecting hardware and loading drivers, it passes off control of everything else that needs to be done to boot the system to the initialization system.
The first process to run on a system where the kernel has just started depends on the initialization facility that system is using. For System V init, the first process to run is the init process. For systemd, the first process is systemd. Depending on which you see running on your system (type ps -ef | head to check), follow either the System V or systemd descriptions below. RHEL 6, which contains a hybrid of Upstart and System V init, is used in the example of System V initialization.
Most Linux systems up to a few years ago used System V init to initialize the services on the Linux system. In RHEL 6, when the kernel hands off control of the boot process to the init process, the init process checks the /etc/inittab file for directions on how to boot the system.
The inittab file tells the init process what the default runlevel is and then points to files in the /etc/init directory to do such things as remap some keystrokes (such as Ctrl+Alt+Delete to reboot the system), start virtual consoles, and identify the location of the script for initializing basic services on the system: /etc/rc.sysinit.
When you're troubleshooting Linux problems that occur after the init process takes over, two likely culprits are the processing by the rc.sysinit file and the runlevel scripts.
As the name implies, the /etc/rc.sysinit script initializes many basic features on the system. When that file is run by init, rc.sysinit sets the system's hostname, sets up the /proc and /sys filesystems, sets up SELinux, sets kernel parameters, and performs dozens of other actions.
One of the most critical functions of rc.sysinit is to get the storage set up on the system. In fact, if the boot process fails during processing of rc.sysinit, in all likelihood, the script was unable to find, mount, or decrypt the local or remote storage devices needed for the system to run.
The following is a list of some common failures that can occur from tasks run from the rc.sysinit file and ways of dealing with those failures.
# mount -o remount,rw / # vim /etc/fstab # mount -a # reboot
NOTE
The vim command is used particularly when editing the /etc/fstab file because it knows the format of that file. When you use vim, the columns are in color and some error checking is done. For example, entries in the Mount Options field turn green when they are valid and black when they are not.
Other features are set up by the rc.sysinit file as well. The rc.sysinit script sets the SELinux mode and loads hardware modules. The script constructs software RAID arrays and sets up Logical Volume Management volume groups and volumes. Troubles occurring in any of these areas are reflected in error messages that appear on the screen after the kernel boots and before runlevel processes start up.
In Red Hat Enterprise Linux 6.x and earlier, when the system first comes up, services are started based on the default runlevel. There are seven different runlevels, from 0 to 6. The default runlevel is typically 3 (for a server) or 5 (for a desktop). Here are descriptions of the runlevels in Linux systems up to RHEL 6:
Runlevels are meant to set the level of activity on a Linux system. A default runlevel is set in the /etc/inittab file, but you can change the runlevel any time you like using the init command. For example, as root, you might type init 0 to shutdown, init 3 if you want to kill the graphical interface (from runlevel 5) but leave all other services up, or init 6 to reboot.
Normal default runlevels (in other words, the runlevel you boot to) are 3 (for a server) and 5 (for a desktop). Often, servers don't have desktops installed, so they boot to runlevel 3 so they don't incur the processing overhead or the added security risks for having a desktop running on their web servers or file servers.
You can go either up or down with runlevels. For example, an administrator doing maintenance on a system may boot to runlevel 1 and then type init 3 to boot up to the full services needed on a server. Someone debugging a desktop may boot to runlevel 5 and then go down to runlevel 3 to try to fix the desktop (such as install a new driver or change screen resolution) before typing init 5 to return to the desktop.
The level of services at each runlevel is determined by the runlevel scripts that are set to start. There are rc directories for each runlevel: /etc/rc0.d/, /etc/rc1.d/, /etc/rc2.d/, /etc/rc3.d/, and so on. When an application has a startup script associated with it, that script is placed in the /etc/init.d/ directory and then symbolically linked to a file in each /etc/rc?.d/ directory.
Scripts linked to each /etc/rc?.d directory begin with either the letter K or S, followed by two numbers and the service name. A script beginning with K indicates that the service should be stopped, while one beginning with an S indicates it should be started. The two numbers that follow indicate the order in which the service is started. Here are a few files you might find in the /etc/rc3.d/ directory, which are set to start up (with a description of each to the right):
This example of a few services started from the /etc/rc3.d directory should give you a sense of the order in which processes boot up when you enter runlevel 3. Notice that the sysstat service (which gathers system statistics) and the iptables service (which creates the system's firewall) are both started before the networking interfaces are started. Those are followed by rsyslog (system logging service) and then the various networked services.
By the time the runlevel scripts start, you should already have a system that is basically up and running. Unlike some other Linux systems that start all the scripts for runlevel 1, then 2, then 3, and so on, RHEL goes right to the directory that represents the runlevel, first stopping all services that begin with K and starting all those that begin with S in that directory.
As each S script runs, you should see a message saying whether the service started. Here are some things that might go wrong during this phase of system startup:
If you cannot get past a hanging service, you can reboot into an interactive startup mode, where you are prompted before starting each service. To enter interactive startup mode in RHEL, reboot and interrupt the boot loader (press any key when you see the 5 second countdown). Highlight the entry you want to boot, and type e. Highlight the kernel line, and type e. Then add the word confirm to the end of the kernel line, press Enter, and type b to boot the new kernel.
Figure 21.1 shows an example of the messages that appear when RHEL boots up in interactive startup mode.
FIGURE 21.1
Confirm each service in RHEL interactive startup mode.
Most messages shown in Figure 21.1 are generated from rc.sysinit.
After the Welcome message, udev starts (to watch for new hardware that is attached to the system and load drivers as needed). The hostname is set, Logical Volume Management (LVM) volumes are activated, all filesystems are checked (with the added LVM volumes), any filesystems not yet mounted are mounted, the root filesystem is remounted read-write, and any LVM swaps are enabled. Refer to Chapter 12 for further information on LVM and other partition and filesystem types.
The last “Entering interactive startup” message tells you that rc.sysinit is finished and the services for the selected runlevel are ready to start. Because the system is in interactive mode, a message appears asking if you want to start the first service (sysstat). Type Y to start that service and go to the next one.
After you see the broken service requesting to start, type N to keep that service from starting. If, at some point, you feel the rest of the services are safe to start, type C to continue starting the rest of the services. After your system comes up, with the broken services not started, you can go back and try to debug those individual services.
One last comment about startup scripts: The /etc/rc.local file is one of the last services to run at each runlevel. As an example, in runlevel 5, it is linked to /etc/rc5.d/S99local. Any command you want to run every time your system starts up can be put in the rc.local file.
You might use rc.local to send an e-mail message or run a quick iptables firewall rule when the system starts. In general, it's better to use an existing startup script or create a new one yourself (so you can manage the command or commands as a service). Know that the rc.local file is a quick and easy way to get some commands to run each time the system boots.
The latest versions of Fedora, RHEL, and soon Ubuntu use systemd instead of System V init as their initialization system. When the systemd daemon (/usr/lib/systemd/systemd) is started after the kernel starts up, it sets in motion all the other services that are set to start up. In particular, it keys off the contents of the /etc/systemd/system/default.target file. For example:
# cat /etc/systemd/system/default.target
...
[Unit]
Description=Graphical Interface
Documentation=man:systemd.special(7)
Requires=multi-user.target
After=multi-user.target
Conflicts=rescue.target
Wants=display-manager.service
AllowIsolate=yes
[Install]
Alias=default.target
The default.target file is actually a symbolic link to a file in the /lib/systemd/system directory. For a server, it may be linked to the multi-user.target file; for a desktop, it is linked to the graphical.target file (as is shown here).
Unlike with the System V init facility, which just runs service scripts in alphanumeric order, the systemd service needs to work backward from the default.target to determine which services and other targets are run. In this example, default.target is a symbolic link to the graphical.target file. When you list the contents of that file, you can see the following:
By continuing to discover what those two units require, you can find what else is required. For example, multi-user.target requires the basic.target (which starts a variety of basic services) and display-manager.service (which starts up the display manager, gdm) to launch a graphical login screen.
To see services the multi-user.target starts, list contents of the /etc/systemd/system/multi-user.target.wants directory. For example:
# ls /etc/systemd/system/multi-user.target.wants/
abrt-ccpp.service chronyd.service nfs.target
abrtd.service crond.service nmb.service
abrt-oops.service cups.path remote-fs.target
abrt-vmcore.service httpd.service rngd.service
abrt-xorg.service irqbalance.service smb.service
atd.service mcelog.service sshd.service
auditd.service mdmonitor.service vmtoolsd.service
autofs.service ModemManager.service vsftpd.service
avahi-daemon.service NetworkManager.service
These files are symbolic links to files that define what starts for each of those services. On your system, these may include remote shell (sshd), printing (cups), auditing (auditd), networking (NetworkManager), and others. Those links were added to that directory either when the package for a service is installed or when the service is enabled from a systemctl enable command.
Keep in mind that, unlike System V init, systemd can start, stop, and otherwise manage unit files that represent more than just services. It can manage devices, automounts, paths, sockets, and other things. After systemd has started everything, you can log into the system to investigate and troubleshoot any potential problems.
After you log in, running the systemctl command lets you see every unit file that systemd tried to start up. Here is an example:
# systemctl
UNIT LOAD ACTIVE SUB
DESCRIPTION
proc-sys-fs-binfmt_misc.automount loaded active waiting
Arbitrary Executable File Formats File System
sys-devices-pc...:00:1b.0-sound-card0.device loaded active plugged
631xESB/632xESB High Definition Audio Control
sys-devices-pc...:00:1d.2-usb4-4\x2d2.device loaded active plugged
DeskJet 5550
...
-.mount loaded active mounted
/
boot.mount loaded active mounted
/boot
... autofs.service loaded active running Automounts filesystems on demand cups.service loaded active running CUPS Printing Service httpd.service loaded failed failed The Apache HTTP Server
From the systemctl output, you can see whether any unit file failed. In this case, you can see that the httpd.service (your Web server) failed to start. To further investigate, you can run journalctl -u for that service to see whether any error messages were reported:
# journalctl -u httpd.service
...
Sep 07 18:40:52 host systemd[1]: Starting The Apache HTTP Server...
Sep 07 18:40:53 host httpd[16365]: httpd: Syntax error on line 361 of
/etc/httpd/conf/httpd.conf: Expected </Director> but saw
</Directory>
Sep 07 18:40:53 host systemd[1]: httpd.service:
main process exited, code=exited, status=1/FAILURE
Sep 07 18:40:53 host systemd[1]: Failed to start The Apache HTTP
Server.
Sep 07 18:40:53 host systemd[1]: Unit httpd.service entered failed
state.
From the output, you can see that there was a mismatch of the directives in the httpd.conf file (I had Director instead of Directory). After that was corrected, I could start the service (systemctl start httpd). If more unit files appear as failed, you can run the journalctl -u command again, using those unit filenames as arguments.
The next section describes how to troubleshoot issues that can arise with your software packages.
Software packaging facilities (such as yum for RPM and apt-get for DEB packages) are designed to make it easier for you to manage your system software. (See Chapter 10, “Getting and Managing Software,” for the basics on how to manage software packages.) Despite efforts to make it all work, however, sometimes software packaging can break.
The following sections describe some common problems you can encounter with RPM packages on a RHEL or Fedora system and how you can overcome those problems.
Sometimes, when you try to install or upgrade a package using the yum command, error messages tell you that the dependent packages you need to do the installation you want are not available. This can happen on a small scale (when you try to install one package) or a grand scale (where you are trying to update or upgrade your entire system).
Because of the short release cycles and larger repositories of Fedora and Ubuntu, inconsistencies in package dependencies are more likely to occur than they are in smaller, more stable repositories (such as those offered by Red Hat Enterprise Linux). To avoid dependency failures, here are a few good practices you can follow:
When packages are added to the repository, as long as the repository maintainers run the right commands to set up the repository (and you don't use outside repositories), everything you need to install a selected package should be available. However, when you start using third-party repositories, those repositories may have dependencies on repositories that they can't control. For example, if a repository creates a new version of its own software that requires later versions of basic software (such as libraries), the versions they need might not be available from the Fedora repository.
NOTE
If you use the apt-get command in Ubuntu to update your packages, keep in mind that there are different meanings to the update and upgrade options in Ubuntu with apt-get than with the yum command (Fedora and RHEL).
In Ubuntu, apt-get update causes the latest packaging metadata (package names, version numbers, and so on) to be downloaded to the local system. Running apt-get upgrade causes the system to upgrade any installed packages that have new versions available, based on the latest downloaded metadata.
In contrast, every time you run a yum command in Fedora or RHEL, the latest metadata about new packages is downloaded. When you then run yum update, you get the latest packages available for the current release of Fedora or RHEL. When you run yum upgrade, the system actually attempts to upgrade to a whole new version of those distributions (such as Fedora 20 to Fedora 21).
When you encounter a dependency problem, here are a few things you can do to try to resolve the problem:
Even if it works when you first install it, a package someone just handed to you might not have a way to be upgraded. A package from a third-party repository may break if the creators don't provide a new version when dependent packages change.
Because most Linux systems keep the two most recent kernels, you can reboot, interrupt GRUB, and select the previous (still working) kernel to boot from. That gets your system up and running, with the old kernel and working drivers, while you look for a more permanent fix.
The longer-term solution is to get a new driver that has been rebuilt for your current kernel. Sites such as rpmfusion.org build third-party, non-open source driver packages and upgrade those drivers when a new kernel is available. With the rpmfusion.org repository enabled, your system should pick up the new drivers when the new kernel is added.
As an alternative to sites such as rpmfusion.org, you can go straight to the website for the manufacturer and try to download their Linux drivers (Nvidia offers Linux drivers for its video cards), or if source code is available for the driver, you can try to build it yourself.
# yum -y --exclude=somepackage update
If you do have dependency problems during preupgrade, you can work them out before the upgrade actually takes place. Removing packages that have dependency problems is one way to deal with the problem (yum remove somepackage). After the upgrade is done, you can often add the package back in, using a version of the package that may be more consistent with the new repositories you are using.
Using cron for Software Updates
The cron facility provides a means of running commands at predetermined times and intervals. You can set the exact minute, hour, day, or month that a command runs. You can configure a command to run every five minutes, every third hour, or at a particular time on Friday afternoon.
If you want to use cron to set up nightly software updates, you can do that as the root user by running the crontab -e command. That opens a file using your default editor (vi command by default) that you can configure as a crontab file. Here's an example of what the crontab file you create might look like:
# min hour day/month month day/week command 59 23 * * * yum -y update | mail root@localhost
A crontab file consists of five fields, designating day and time, and a sixth field, containing the command line to run. I added the comment line to indicate the fields. Here, the yum -y update command is run, with its output mailed to the user root@localhost. The command is run at 59 minutes after hour 23 (11:59 p.m.). The asterisks (*) are required as placeholders, instructing cron to run the command on every day of the month, month, and day of the week.
When you create a cron entry, make sure you either direct the output to a file or pipe the output to a command that can deal with the output. If you don't, any output is sent to the user that ran the crontab -e command (in this case, root).
In a crontab file, you can have a range of numbers, a list of numbers, or skip numbers. For example, 1, 5, or 17 in the first field causes the command to be run 1, 5, and 17 minutes after the hour. An */3 in the second field causes the command to run every three hours (midnight, 3 a.m., 6 a.m., and so on). A 1-3 in the fourth field tells cron to run the command in January, February, and March. Days of the week and months can be entered as numbers or words.
For more information on the format of a crontab file, type man 5 crontab. To read about the crontab command, type man 1 crontab.
Information about all the RPM packages on your system is stored in your local RPM database. Although it happens much less often than it did with earlier releases of Fedora and RHEL, it is possible for the RPM database to become corrupted. This stops you from installing, removing, or listing RPM packages.
If you find that your rpm and yum commands are hanging or failing and returning an rpmdb open fails message, you can try rebuilding the RPM database. To verify that there is a problem in your RPM database, you can run the yum check command. Here is an example of what the output of that command looks like with a corrupted database:
# yum check
error: db4 error(11) from dbenv->open: Resource temporarily
unavailable
error: cannot open Packages index using db4 - Resource temporarily
unavailable (11)
error cannot open Packages database in /var/lib/rpm
CRITICAL:yum.main:
Error: rpmdb open fails
The RPM database and other information about your installed RPM packages are stored in the /var/lib/rpm directory. You can remove the database files that begin with __db* and rebuild them from the metadata stored in other files in that directory.
Before you start, it's a good idea to back up the /var/lib/rpm directory. Then you need to remove the old __db* files and rebuild them. Type the following commands to do that:
# cp -r /var/lib/rpm /tmp # cd /var/lib/rpm # rm __db* # rpm --rebuilddb
New __db* files should appear after a few seconds in that directory. Try a simple rpm or yum command to make sure the databases are now in order.
Just as RPM has databases of locally installed packages, the Yum facility stores information associated with Yum repositories in the local /var/cache/yum directory. Cached data includes metadata, headers, packages, and yum plug-in data.
If there is ever a problem with the data cached by yum, you can clean it out. The next time you run a yum command, necessary data is downloaded again. Here are some reasons for cleaning out your yum cache:
As packages are added and removed from the repository, the metadata has to be updated, or your system will be working from old packaging information. By default, if you run a yum command, yum checks for new metadata if the old metadata is more than 90 minutes old (or by however many minutes metadata_expire= is set to in the /etc/yum.conf file).
If you suspect the metadata is obsolete but the expire time has not been reached, you can run yum clean metadata to remove all metadata, forcing new metadata to be uploaded with the next upload. Alternatively, you could run yum makecache to get metadata from all repositories up to date.
To clean out all packages, metadata, headers, and other data stored in the /var/cache/yum directory, type the following:
# yum clean all
At this point, your system gets up-to-date information from repositories the next time a yum command is run.
The next section covers information about network troubleshooting.
With more and more of the information, images, video, and other content we use every day now available outside our local computers, a working network connection is required on almost every computer system. So, if you drop your network connection or can't reach the systems you want to communicate with, it's good to know that there are many tools in Linux for looking at the problem.
For client computers (laptops, desktops, and handheld devices), you want to connect to the network to reach other computer systems. On a server, you want your clients to be able to reach you. The following sections describe different tools for troubleshooting network connectivity for Linux client and server systems.
You open your web browser, but are unable to get to any website. You suspect that you are not connected to the network. Maybe the problem is with name resolution, but it may be with the connection outside your local network.
To check whether your outgoing network connections are working, you can use many of the commands described in Chapter 14, “Administering Networking.” You can test connectivity using a simple ping command. To see if name-to-address resolution is working, use host and dig.
The following sections cover problems you can encounter with network connectivity for outgoing connections and what tools to use to uncover the problems.
To see the status of your network interfaces, use the ip command. The following output shows that the loopback interface (lo) is up (so you can run network commands on your local system), but eth0 (your first wired network card) is down (state DOWN). If the interface had been up, an inet line would show the IP address of the interface. Here, only the loopback interface has an inet address (127.0.0.1).
# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN
qlen 1000
link/ether f0:de:f1:28:46:d9 brd ff:ff:ff:ff:ff:ff
In RHEL 7 and Fedora, by default, network interfaces are now named based on how they are connected to the physical hardware. For example, in RHEL 7, you might see a network interface of enp11s0. That would indicate that the NIC is a wired Ethernet card (en) on PCI board 11 (p11) and slot 0 (s0). A wireless card would start with wl instead of en. The intention is to make the NIC names more predictable, because when the system is rebooted, it is not guaranteed which interfaces would be named eth0, eth1, and so on by the operating system.
For a wired connection, make sure your computer is plugged into the port on your network switch. If you have multiple NICs, make sure the cable is plugged into the correct one. If you know the name of a network interface (eth0, p4p1, or other), to find which NIC is associated with the interface, type ethtool -p eth0 from the command line and look behind your computer to see which NIC is blinking (Ctrl+C stops the blinking). Plug the cable into the correct port.
If, instead of seeing an interface that is down, the ip command shows no interface at all, check that the hardware isn't disabled. For a wired NIC, the card may not be fully seated in its slot or the NIC may have been disabled in the BIOS.
On a wireless connection, you may click the NetworkManager icon and not see an available wireless interface. Again, it could be disabled in the BIOS. However, on a laptop, check to see if there is a tiny switch that disables the NIC. I've seen several people shred their networking configurations only to find that this tiny switch on the front or side of their laptops had been switched to the off position.
If your network interface is up, but you still can't reach the host you want to reach, try checking the route to that host. Start by checking your default route. Then try to reach the local network's gateway device to the next network. Finally, try to ping a system somewhere on the Internet:
# route
route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.0.0 * 255.255.255.0 U 2 0 0 eth0
default 192.168.0.1 0.0.0.0 UG 0 0 eth0
The default line shows that the default gateway (UG) is at address 192.168.0.1 and that the address can be reached over the eth0 card. Because there is only the eth0 interface here and only a route to the 192.168.0.0 network is shown, all communication not addressed to a host on the 192.168.0.0/24 network is sent through the default gateway (192.168.0.1). The default gateway is more properly referred to as a router.
To make sure you can reach your router, try to ping it. For example:
# ping -c 2 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
From 192.168.0.105 icmp_seq=1 Destination Host Unreachable
From 192.168.0.105 icmp_seq=2 Destination Host Unreachable
--- 192.168.0.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss
The “Destination Host Unreachable” message tells you that the router is either turned off or not physically connected to you (maybe the router isn't connected to the switch you share). If the ping succeeds and you can reach the router, the next step is to try an address beyond your router.
Try to ping a widely accessible IP address. For example, the IP address for the Google public DNS server is 8.8.8.8. Try to ping that (ping -c2 8.8.8.8). If that ping succeeds, your network is probably fine, and it is most likely your hostname-to-address resolution that is not working properly.
If you can reach a remote system, but the connection is very slow, you can use the traceroute command to follow the route to the remote host. For example, this command shows each hop taken en route to http://www.google.com:
The output shows the time taken to make each hop along the way to the Google site. Instead of traceroute, you can use the mtr command (yum install mtr) to watch the route taken to a host. With mtr, the route is queried continuously, so you can watch the performance of each leg of the journey over time.
If you cannot reach remote hosts by name, but you can reach them by pinging IP addresses, your system is having a problem with hostname resolution. Systems connected to the Internet do name-to-address resolution by communicating to a domain name system (DNS) server that can provide them with the IP addresses of the requested hosts.
The DNS server your system uses can be entered manually or picked up automatically from a DHCP server when you start your network interfaces. In either case, the names and IP addresses of one or more DNS servers end up in your /etc/resolv.conf file. Here is an example of that file:
search example.com nameserver 192.168.0.254 nameserver 192.168.0.253
When you ask to connect to a hostname in Fedora or Red Hat Enterprise Linux, the /etc/hosts file is searched; then the first name server entry in resolv.conf is queried; then each subsequent name server is queried. If a hostname you ask for is not found, all those locations are checked before you get some sort of “Host Not Found” message. Here are some ways of debugging name-to-address resolution:
# host www.google.com 192.168.0.254 # dig @192.168.0.254 www.google.com
With NetworkManager enabled, you can't just add name server entries to the /etc/resolv.conf file because NetworkManager overwrites that file with its own name server entries. Instead, add a PEERDNS=no line to the ifcfg file for the network interface (for example, ifcfg-eth0 in the /etc/sysconfig/network-scripts directory). Then set DNS1=192.168.0.254 (or whatever your DNS server's IP address is). The new address is used the next time you restart your networking.
If you are using the network service, instead of NetworkManager, you can still use PEERDNS=no to prevent the DHCP server from overwriting your DNS addresses. However, in that case, you can edit the resolv.conf file directly to set your DNS server addresses.
The procedures just described for checking your outgoing network connectivity apply to any type of system, whether it is a laptop, desktop, or server. For the most part, incoming connections are not an issue with laptops or desktops because most requests are simply denied. However, for servers, the next section describes ways of making your server accessible if clients are having trouble reaching the services you provide from that server.
If you are troubleshooting network interfaces on a server, there are different considerations than on a desktop system. Because most Linux systems are configured as servers, you should know how to troubleshoot problems encountered by those who are trying to reach your Linux servers.
I'll start with the idea of having an Apache web server (httpd) running on your Linux system, but no web clients can reach it. The following sections describe things you can try to see where the problem is.
To be a public server, your system's hostname should be resolvable so any client on the Internet can reach it. That means locking down your system to a particular, public IP address and registering that address with a public DNS server. You can use a domain registrar (such as http://www.networksolutions.com) to do that.
When clients cannot reach your website by name from their web browsers, if the client is a Linux system, you can go through ping, host, traceroute, and other commands described in the previous section to track down the connectivity problem. Windows systems have their own version of ping that you can use from those systems.
If the name-to-address resolution is working to reach your system and you can ping your server from the outside, the next thing to try is the availability of the service.
From a Linux client, you can check if the service you are looking for (in this case httpd) is available from the server. One way to do that is using the nmap command.
The nmap command is a favorite tool for system administrators checking for various kinds of information on networks. However, it is a favorite cracker tool as well because it can scan servers, looking for potential vulnerabilities. So it is fine to use nmap to scan your own systems to check for problems. But know that using nmap on another system is like checking the doors and windows on someone's house to see if you can get in. You look like an intruder.
Checking your own system to see what ports to your server are open to the outside world (essentially, checking what services are running) is perfectly legitimate and easy to do. After nmap is installed (yum install nmap), use your system hostname or IP address to use nmap to scan your system to see what is running on common ports:
# nmap 192.168.0.119
Starting Nmap 5.21 ( http://nmap.org ) at 2012-06-16 08:27 EDT
Nmap scan report for spike (192.168.0.119)
Host is up (0.0037s latency).
Not shown: 995 filtered ports
PORT STATE SERVICE
21/tcp open ftp
22/tcp open ssh
80/tcp open http
443/tcp open https
631/tcp open ipp
MAC Address: 00:1B:21:0A:E8:5E (Intel Corporate)
Nmap done: 1 IP address (1 host up) scanned in 4.77 seconds
The preceding output shows that TCP ports are open to the regular (http) and secure (https) web services. When you see that the state is open, it indicates that a service is listening on the port as well. If you get to this point, it means your network connection is fine and you should direct your troubleshooting efforts to how the service itself is configured (for example, you might look in /etc/httpd/conf/httpd.conf to see if specific hosts are allowed or denied access).
If TCP ports 80 and/or 443 are not shown, it means they are being filtered. You need to check whether your firewall is blocking (not accepting packets to) those ports. If the port is not filtered, but the state is closed, it means that the httpd service either isn't running or isn't listening on those ports. The next step is to log into the server and check those issues.
From your server, you can use the iptables command to list the filter table rules that are in place. Here is an example:
# iptables -vnL Chain INPUT (policy ACCEPT 0 packets, 0 bytes) pkts bytes target prot opt in out source destination ... 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:80 0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:443 ...
For the RHEL 7 and Fedora 20 systems where the firewalld service is enabled, you can use the Firewall configuration window to open the ports needed. With the public Zone and Services tab selected, click the check boxes for http and https to immediately open those ports for all incoming traffic.
If your system is using the basic iptables service, there should be firewall rules like the two shown in the preceding code among your other rules. If there aren't, add those rules to the /etc/sysconfig/iptables file. Here are examples of what those rules might look like:
-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 443 -j ACCEPT
With the rules added to the file, clear out all your firewall rules (systemctl stop iptables.service or service iptables stop), and then start them again (systemctl start iptables.service or service iptables start).
If the firewall is still blocking client access to the web server ports, here are a few things to check in your firewall:
If the port is now open, but the service itself is closed, check that the service itself is running and listening on the appropriate interfaces.
If there seems to be nothing blocking client access to your server through the actual ports providing the service you want to share, it is time to check the service itself. Assuming the service is running (depending on your system, type service httpd status or systemctl status httpd to check), the next thing to check is that it is listening on the proper ports and network interfaces.
The netstat command is a great general-purpose tool for checking network services. The following command lists the names and process IDs (p) for all processes that are listening (l) for TCP (t) and UDP (u) services, along with the port number (n) they are listening on. The following command line filters out all lines except those associated with the httpd process:
# netstat -tupln | grep httpd tcp 0 0 :::80 :::* LISTEN 2567/httpd tcp 0 0 :::443 :::* LISTEN 2567/httpd
The previous example shows that the httpd process is listening on port 80 and 443 for all interfaces. It is possible that the httpd process might be listening on selected interfaces. For example, if the httpd process were only listening on the local interface (127.0.0.1) for HTTP requests (port 80) the entry would look as follows:
tcp 0 0 127.0.0.1:80 :::* LISTEN 2567/httpd
For httpd, as well as for other network services that listen for requests on network interfaces, you can edit the service's main configuration file (in this case, /etc/httpd/conf/httpd.conf) to tell it to listen on port 80 for all addresses (Listen 80) or a specific address (Listen 192.168.0.100:80).
Troubleshooting performance problems on your computer is one of the most important, although often elusive, tasks you need to complete. Maybe you have a system that was working fine, but begins to slow down to a point where it is practically unusable. Maybe applications begin to just crash for no apparent reason. Finding and fixing the problem may take some detective work.
Linux comes with many tools for watching activities on your system and figuring out what is happening. Using a variety of Linux utilities, you can do such things as find out which processes are consuming large amounts of memory or placing high demands on your processors, disks, or network bandwidth. Solutions can include:
To troubleshoot performance problems in Linux, you use some of the basic tools for watching and manipulating processes running on your system. Refer to Chapter 6, “Managing Running Processes,” if you need details on commands such as ps, top, kill, and killall. In this section, I add commands such as memstat to dig a little deeper into what processes are doing and where things are going wrong.
The most complex area of troubleshooting in Linux relates to managing virtual memory. The next sections describe how to view and manage virtual memory.
Computers have ways of storing data permanently (hard disks) and temporarily (random access memory, or RAM, and swap space). Think of yourself as a CPU, working at a desk trying to get your work finished. You would put data that you want to keep permanently in a filing cabinet across the room (that's like hard disk storage). You would put information that you are currently using on your desk (that's like RAM memory on a computer).
Swap space is a way of extending RAM. It is really just a place to put temporary data that doesn't fit in RAM but is expected to be needed by the CPU at some point. Although swap space is on the hard disk, it is not a regular Linux filesystem in which data is stored permanently.
Compared to disk storage, random access memory has the following attributes:
It is important to understand the difference between temporary (RAM) and permanent (hard disk) storage, but that doesn't tell the whole story. If the demand for memory exceeds the supply of RAM, the kernel can temporarily move data out of RAM to an area called swap space.
If we revisit the desk analogy, this would be like saying, “There is no room left on my desk, yet I have to add more papers to it for the projects I'm currently working on. Instead of storing papers I'll need soon in a permanent file cabinet, I'll have one special file cabinet (like a desk drawer) to hold those papers I'm still working with, but that I'm not ready to store permanently or throw away.”
Refer to Chapter 12, “Managing Disks and Filesystems,” for more information on swap files and partitions and how to create them. For the moment, however, there are a few things you should know about these kinds of swap areas and when they are used:
The general rule has always been that swapping is bad and should be avoided. However, some would argue that, in certain cases, more aggressive swapping can actually improve performance.
Think of the case where you open a document in a text editor and then minimize it on your desktop for several days as you work on different tasks. If data from that document were swapped out to disk, more RAM would be available for more active applications that could put that space to better use. The performance hit would come the next time you needed to access the data from the edited document and the data was swapped in from disk to RAM. The settings that relate to how aggressively a system swaps are referred to as swappiness.
As much as possible, Linux wants to make everything that an open application needs immediately available. So, using the desk analogy, if I am working on nine active projects and there is space on the desk to hold the information I need for all nine projects, why not leave them all within reach on the desk? Following that same way of thinking, the kernel sometimes keeps libraries and other content in RAM that it thinks you might eventually need, even if a process is not looking for it immediately.
The fact that the kernel is inclined to store information in RAM that it expects may be needed soon (even if it is not needed now) can cause an inexperienced system administrator to think that the system is almost out of RAM and that processes are about to start failing. That is why it is important to know the different kinds of information being held in memory—so you can tell when real out-of-memory situations can occur. The problem is not just running out of RAM; it is running out of RAM when only nonswappable data is left.
Keep this general overview of virtual memory (RAM and swap) in mind, as the next section describes ways to go about troubleshooting issues related to virtual memory.
Let's say that you are logged into a Linux desktop, with lots of applications running, and everything begins to slow down. To find out if the performance problems have occurred because you have run out of memory, you can try commands such as top and ps to begin looking for memory consumption on your system.
To run the top command to watch for memory consumption, type top and then type a capital M. Here is an example:
# top
top - 22:48:24 up 3:59, 2 users, load average: 1.51, 1.37, 1.15
Tasks: 281 total, 2 running, 279 sleeping, 0 stopped, 0 zombie
Cpu(s): 16.6%us, 3.0%sy, 0.0%ni, 80.3%id, 0.0%wa, 0.0%hi,
0.2%si, 0.0%st
Mem: 3716196k total, 2684924k used, 1031272k free, 146172k
buffers
Swap: 4194296k total, 0k used, 4194296k free, 784176k
cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6679 cnegus 20 0 1665m 937m 32m S 7.0 25.8 1:07.95 firefox
6794 cnegus 20 0 743m 181m 30m R 64.8 5.0 1:22.82
npviewer.bin
3327 cnegus 20 0 1145m 116m 66m S 0.0 3.2 0:39.25 soffice.bin 6939 cnegus 20 0 145m 71m 23m S 0.0 2.0 0:00.97 acroread 2440 root 20 0 183m 37m 26m S 1.3 1.0 1:04.81 Xorg 2795 cnegus 20 0 1056m 22m 14m S 0.0 0.6 0:01.55 nautilus
There are two lines (Mem and Swap) and four columns of information (VIRT, RES, SHR, and %MEM) relating to memory in the top output. In this example, you can see that RAM is not exhausted from the Mem line (only 268492k of 3716196k is used) and that nothing is being swapped to disk from the Swap line (0k used).
However, adding up just these first six lines of output in the VIRT column, you would see that 4937MB of memory has been allocated for those applications, which exceeds the 3629MB of total RAM (3716196k) that is available. That's because the VIRT column shows only the amount of memory that has been promised to the application. The RES line shows the amount of nonswappable memory that is actually being used, which totals only 1364MB.
Notice that, when you ask to sort by memory usage by typing a capital M, top knows to sort on that RES column. The SHR column shows memory that could potentially be shared by other applications (such as libraries), and %MEM shows the percentage of total memory consumed by each application.
If you think that the system is reaching an out-of-memory state, here are a few things to look for:
In the short term, you can do several things to deal with this out-of-memory condition:
This action is the equivalent of cleaning your desk and putting all but the most critical information into the trash or into a file cabinet. You may need to retrieve information again shortly from a file cabinet, but you almost surely don't need it all immediately. Keep top running in one Terminal window to see the Mem line change as you type the following (as root) into another Terminal window:
# echo 3 > /proc/sys/vm/drop_caches
To enable Alt+SysRq keystrokes, the system must have already set /proc/sys/kernel/sysrq to 1. An easy way to do this is to add kernel.sysrq = 1 to the /etc/sysctl.conf file. Also, you must run the Alt+SysRq keystrokes from a text-based interface (such as the virtual console you see when you press Ctrl+Alt+ F2).
With kernel.sysrq set to 1, you can kill the process on your system with the highest OOM score by pressing Alt+SysRq+f from a text-based interface. A listing of all processes running on your system appears on the screen, with the name of the process that was killed listed at the end. You can repeat those keystrokes until you have killed enough processes to be able to access the system normally from the shell again.
NOTE
There are many other Alt+SysRq keystrokes you can use to deal with an unresponsive system. For example, Alt+SysRq+e terminates all processes except for the init process. Alt+SysRq+t dumps a list of all current tasks and information about those tasks to the console. To reboot the system, press Alt+SysRq+b. See the sysrq.txt file in the /usr/share/doc/kernel-doc*/Documentation directory for more information about Alt+SysRq keystrokes.
If your Linux system becomes unbootable, your best option for fixing it is probably to go into rescue mode. To go into rescue mode, you bypass the Linux system installed on your hard disk and boot some rescue medium (such as a bootable USB key or boot CD). After the rescue medium boots, it tries to mount any filesystems it can find from your Linux system so you can repair any problems.
For many Linux distributions, the installation CD or DVD can serve as boot media for going into rescue mode. Here's an example of how to use a Fedora installation DVD to go into rescue mode to fix a broken Linux system:
Right now, the root of the filesystem (/) is from the filesystem that comes on the rescue medium. To troubleshoot your installed Linux system, however, you can type the following command:
# chroot /mnt/sysimage
Now the /mnt/sysimage directory becomes the root of your filesystem (/) so it looks like the filesystem installed on your hard disk. Here are some things you can do to repair your system while you are in rescue mode:
When you are finished fixing your system, type exit to exit the chroot environment and return to the filesystem layout that the live medium sees. If you are completely finished, type reboot to restart your system. Be sure to pop out the medium before the system restarts.
Troubleshooting problems in Linux can start from the moment you turn on your computer. Problems can occur with your computer BIOS, boot loader, or other parts of the boot process that you can correct by intercepting them at different stages of the boot process.
After the system has started, you can troubleshoot problems with software packages, network interfaces, or memory exhaustion. Linux comes with many tools for finding and correcting any part of the Linux system that might break down and need fixing.
The next chapter covers the topic of Linux security. Using the tools described in that chapter, you can provide access to those services you and your users need, while blocking access to system resources that you want to protect from harm.
The exercises in this section enable you to try out useful troubleshooting techniques in Linux. Because some of the techniques described here can potentially damage your system, I recommend you do not use a production system that you cannot risk damaging. See Appendix B for suggested solutions.
These exercises relate to troubleshooting topics in Linux. They assume you are booting a PC with standard BIOS. To do these exercises, you need to be able to reboot your computer and interrupt any work it may be doing.