Diagnose and fix network problems yourself
A recent and typical case of Linux network failure was the friend who rang up to say his "network had stopped". As error reports go, this is on a par with the classic Apollo 13 line "Houston, we've had a problem", though a little less life-threatening. Luckily, Linux has a goodly collection of network tools to help us figure out exactly what had gone wrong. (To eliminate any stress-inducing suspense, let me reveal that we eventually discovered that he had been disconnected by his ISP as a result of forgetting to renew his subscription.)
So, follow along with us now as we review some of the network diagnostic tools in Linux and see how to use them to get answers to the question "what's wrong with my network?"
The most important thing, when you're troubleshooting something, is to have some idea how it's supposed to work in the first place. Does your machine have a static IP address, and if so, what should it be? Does it use DHCP, and if so, where is the DHCP server, and what range of IP addresses is it expected to allocate? Do you have a broadband modem directly connected to your machine, or do you have a separate broadband router to which you connect via ethernet or wireless?
Our methodology in this tutorial is to take a "bottom up" approach. We start by checking the really low-level stuff first, then gradually work our way up to higher levels. The sequence of tests we'll perform is summarised (approximately) in Figure 1, below. This is a good, systematic approach for network connections that have never worked. On the other hand, if it was working fine yesterday, it's generally faster to start at the top and work your way down.
Figure 1: summary of the testing sequence.
Can Linux find your network card?
In this instance, the first question to ask is: "Is Linux seeing your network interfaces?" You may be able to answer this by looking through the boot-time messages from the kernel using the command dmesg:
# dmesg | grep eth
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth0: e1000_watchdog: NIC Link is Up 10 Mbps Half Duplex
Alternatively, try listing the devices on the bus with lspci:
# lspci | grep Ethernet
01:01.0 Ethernet controller: Intel Corporation 82547EI
02:01.0 Ethernet controller: Intel Corporation 82540EM
Failure at this stage suggests faulty or unsupported hardware.
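A third place to look is /sys/class/net, where the kernel creates one entry for every interface it has registered. The following is a minimal sketch rather than a standard tool: the function name list_interfaces and its optional directory argument are our own inventions for illustration.

```shell
# List the network interfaces the kernel has registered, skipping loopback.
# The optional directory argument is an assumption of this sketch; it lets
# you point the function somewhere other than /sys/class/net.
list_interfaces() (
    dir=${1:-/sys/class/net}
    for path in "$dir"/*; do
        [ -e "$path" ] || continue        # empty directory: glob did not match
        name=${path##*/}
        [ "$name" = "lo" ] && continue    # loopback is always present
        printf '%s\n' "$name"
    done
)
```

If list_interfaces prints nothing, the kernel has not registered any network hardware apart from loopback, which points to the same conclusion as an empty lspci or dmesg result.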
Does it have an IP address assigned?
Assuming that the kernel knows your network card is there, the next question is: does it have an IP address assigned? The simplest command to use for this is ifconfig:
# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:0C:F1:96:A3:F7
          inet addr:192.168.0.3  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:f1ff:fe96:a3f7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:306 errors:0 dropped:0 overruns:0 frame:0
          TX packets:261 errors:0 dropped:0 overruns:0 carrier:0
          collisions:8 txqueuelen:10
          RX bytes:43074 (42.0 KiB)  TX bytes:34480 (33.6 KiB)
          Base address:0xac00 Memory:ff7e0000-ff800000
The important line here is the second one, which shows an assigned IP address of 192.168.0.3. If you do not see such a line, then it follows that there is no assigned IP address. Even if there is an IP address assigned, give a moment's thought to whether it's a valid address for the network you're on.
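On some newer distributions ifconfig is not installed by default, and the ip command from the iproute2 package reports the same information. Here is a small sketch, assuming the one-line-per-address format printed by "ip -o -4 addr show"; the helper name ipv4_of is made up for this example.

```shell
# Print the IPv4 address(es) of one interface, e.g. "192.168.0.3/24".
# Reads `ip -o -4 addr show` output on standard input and prints nothing
# if the interface has no IPv4 address. (ipv4_of is a hypothetical name.)
ipv4_of() {
    awk -v dev="$1" '$2 == dev && $3 == "inet" { print $4 }'
}

# Typical use on a live system:
#   ip -o -4 addr show | ipv4_of eth0
```

An empty result here means the same thing as a missing "inet addr" line in the ifconfig output: the interface has no IPv4 address.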
In an operational environment, several times we have experienced networks running into trouble after the introduction of a machine that turned out to be running an (unintentional) DHCP server configured with a pool of addresses that weren't valid for that network. If a machine was rebooted, it had about a 50/50 chance of getting a valid IP address from the "real" DHCP server, or a rogue address from the imposter.
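Checking whether an address is "valid for the network you're on" comes down to comparing the network part of the address with the network you expect. The sketch below does that arithmetic in plain shell; the function names ip_to_int and in_subnet are our own, and it assumes well-formed dotted-quad input without leading zeros and a shell with 64-bit arithmetic.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() (
    IFS=.
    set -- $1                         # split on dots into $1..$4
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if address $1 lies inside network $2 with prefix length $3.
in_subnet() (
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    # Build the netmask: the top $3 bits set, the rest clear.
    mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
)

# Example:
#   in_subnet 192.168.0.3 192.168.0.0 24 && echo "address fits this network"
```

A machine that has picked up a lease from a rogue DHCP server would typically fail this test against the network you expected it to join.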
If your network interface has no IP address, check the system configuration files: Is the interface configured to be started at boot time? If so, is it configured to use DHCP, or does it have a static IP address? The files you need to look in for this are distro-specific.
On Fedora and Red Hat the filename would be of the form /etc/sysconfig/network-scripts/ifcfg-eth*, on SUSE it would be /etc/sysconfig/network/ifcfg-eth*, and on Ubuntu it would be /etc/network/interfaces. (Aren't standards a marvellous thing; don't you just adore all these gratuitous distribution-specific differences?) Of course, all these distributions have graphical tools to allow you to inspect and edit these settings; for example, Figure 2, below, shows Fedora's system-config-network tool.
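If you prefer the command line to the graphical tools, the questions "is it started at boot?" and "DHCP or static?" can be answered by picking a few keywords out of those files. This is a sketch only, assuming the classic file formats at the paths named above ("iface eth0 inet dhcp" lines on Ubuntu, BOOTPROTO= and ONBOOT= lines on Fedora and Red Hat); the function names are hypothetical.

```shell
# Print the configuration method ("dhcp", "static", ...) that a
# Debian/Ubuntu-style interfaces file declares for one interface.
# $1 = interface name, $2 = file (defaults to /etc/network/interfaces)
iface_method() {
    awk -v dev="$1" '$1 == "iface" && $2 == dev && $3 == "inet" { print $4 }' \
        "${2:-/etc/network/interfaces}"
}

# Print the BOOTPROTO and ONBOOT settings from a Red Hat-style ifcfg file.
# $1 = file, e.g. /etc/sysconfig/network-scripts/ifcfg-eth0
ifcfg_summary() {
    grep -E '^(BOOTPROTO|ONBOOT)=' "$1"
}
```

On an Ubuntu machine, for example, "iface_method eth0" printing nothing at all suggests that eth0 is simply not configured in that file.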
Figure 2: Fedora's system-config-network tool.
Normally, the initialisation of an interface is buried deep in a boot-time script, and the interaction with the DHCP server can be difficult to observe. However, you may be able to see the DHCP activity by running the script ifup directly, or by running dhclient. This program handles the dialogue with the DHCP server and the setting of network parameters:
# dhclient
Internet Systems Consortium DHCP Client V3.0.5-RedHat
Copyright 2004-2006 Internet Systems Consortium.
All rights reserved.
For info, please visit http://www.isc.org/sw/dhcp/
Listening on LPF/eth1/00:0e:0c:01:d3:a0
Sending on LPF/eth1/00:0e:0c:01:d3:a0
Listening on LPF/eth0/00:0c:f1:96:a3:f7
Sending on LPF/eth0/00:0c:f1:96:a3:f7
Sending on Socket/fallback
DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 4
DHCPOFFER from 192.168.0.1
DHCPREQUEST on eth0 to 255.255.255.255 port 67
DHCPACK from 192.168.0.1
bound to 192.168.0.3 -- renewal in 125868 seconds.
This particular system has two network interfaces, eth0 and eth1. We see that eth0 obtained an IP address from the DHCP server at 192.168.0.1. The eth1 interface tried to do the same (it transmitted a DHCPDISCOVER) but didn't get a reply, which isn't surprising in this case as it wasn't actually connected to anything.
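When all you want from that dialogue is the address that was finally handed out, the "bound to" message is the one to look for. A rough sketch, assuming the message wording shown in the transcript above (other DHCP clients and versions may phrase it differently); lease_address is a name we have invented.

```shell
# Print the address from dhclient's "bound to ..." message, if there is one.
# Reads dhclient's messages on standard input. (Hypothetical helper.)
lease_address() {
    sed -n 's/.*bound to \([0-9.]*\).*/\1/p'
}

# On a live system, 2>&1 folds both output streams into the pipe:
#   dhclient eth0 2>&1 | lease_address
```

No output means no lease was obtained, as was the case for eth1 in the example above.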
Can you ping your router?
If you've got a valid IP address, a good next step might be to test if you can ping one of the other machines on your network. A successful ping looks something like this:
# ping -c1 192.168.0.6
PING 192.168.0.6 (192.168.0.6) 56(84) bytes of data.
64 bytes from 192.168.0.6: icmp_seq=1 ttl=64 time=0.468 ms
--- 192.168.0.6 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.468/0.468/0.468/0.000 ms
and an unsuccessful one looks like this:
# ping -c 1 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
From 192.168.0.3 icmp_seq=1 Destination Host Unreachable
--- 192.168.0.2 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
The message "Destination Host Unreachable" usually means that the target machine (192.168.0.2 in this example) isn't connected or isn't running, and so failed to respond to my machine's ARP request to find its MAC address. It could also mean that your machine can't find a route to reach the local network; the most likely reason for this is that it has an IP address that's not actually part of the local network.
It's also possible that you have a more complex routing problem, but that's unlikely on a typical home network that has only one (default) route. If you don't have any other machines on your network, you can try pinging your router. (You do know the IP address of your router, right?)
If you can't ping your local router, then you obviously have a local problem. If you have a wired network, check the cabling and the little green lights on the network interfaces at each end.
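If you are not sure of your router's address, the routing table usually holds it as the default gateway. The sketch below picks it out of the text printed by "ip route show"; it assumes the usual "default via ADDRESS dev INTERFACE" line, and default_gateway is a hypothetical helper name.

```shell
# Print the gateway address from the first default route, if there is one.
# Reads `ip route show` output on standard input.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# On a live system:
#   ip route show | default_gateway
#   ping -c1 "$(ip route show | default_gateway)"
```

If nothing is printed, the machine has no default route, which would explain why it can reach local machines but nothing beyond them.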
Is your firewall blocking the traffic?
At some point in your diagnosis, it's worth checking to see whether your firewall settings are screwed down too tight. A quick-and-dirty way to do this favoured by many sysadmins in a hurry is to flush all the firewall rules with the command
# iptables -F
and then try again. If this solves the problem, then at least you know that the problem that's been causing you grief is firewall-related. At that point you should reboot (to re-establish the firewall) and investigate further. Do not be tempted to leave the firewall disabled; this is a Bad Idea!
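One caveat before you flush: iptables -F deletes the rules but leaves each chain's default policy alone. On a firewall whose policy is DROP, a flush therefore blocks everything instead of opening things up (and can cut off a remote login). Here is a small sketch for checking first, assuming the "-P CHAIN POLICY" lines printed by "iptables -S"; drop_policies is our own name for the helper.

```shell
# Print the built-in chains whose default policy is DROP.
# Reads `iptables -S` output on standard input; policy lines look like
# "-P INPUT DROP". (drop_policies is a hypothetical helper name.)
drop_policies() {
    awk '$1 == "-P" && $3 == "DROP" { print $2 }'
}

# On a live system, as root:
#   iptables -S | drop_policies
```

If a chain is listed, its policy can be relaxed for the duration of the test with, for example, "iptables -P INPUT ACCEPT"; the reboot afterwards restores the original settings along with the rules.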
Do you have an ADSL connection to your ISP?
If you can ping your router, it's time to start widening your net, so to speak. There might be some more little green lights on your router (and if you can find the instruction book you may even be able to figure out what they mean!) that will allow you to determine if the ADSL modem in your router has successfully connected to your ISP.
Some broadband routers also provide various web-based administration screens that you can use to determine the status of your connection. Figure 3, below, shows one such example. The things to look for here are the Connection Status setting, and the IP address that the ISP has assigned to your outward-facing network connection. (You probably don't much care what that IP address actually is, you just want to confirm that there is one!)
Figure 3: web-based administration screens that you can use to determine the status of your connection.
Try disconnecting and re-connecting manually, and see if you can figure out at what point it fails. If you can't get a connection, you should obviously check the cabling from the router to the phone line (and it's worth plugging a phone handset into the line to check if you get a dial tone), but if this all looks OK, a call to your ISP's tech support line is probably in order. Make a flask of coffee and grab a good book first, though... those call queues can be long!
Can you ping the target system?
If you appear to have a good connection to your ISP, it's time to continue up the stack with the testing. Try pinging a known external machine using its IP address. For example, Linux Format's web server is at 184.108.40.206. (Of course, it's entirely possible that this will change before you get to read this, but it will serve for the purpose of this example.)
# ping -c1 220.127.116.11
PING 18.104.22.168 (22.214.171.124) 56(84) bytes of data.
64 bytes from 126.96.36.199: icmp_seq=1 ttl=56 time=24.3 ms
--- 188.8.131.52 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 24.367/24.367/24.367/0.000 ms
If this works, your network connectivity is actually in quite good shape. As a final test, try pinging the machine using its DNS name:
# ping -c1 www.linuxformat.com
PING www.linuxformat.com (184.108.40.206) 56(84) bytes of data.
64 bytes from www.linuxformat.com (220.127.116.11): icmp_seq=1 ttl=56 time=24.2 ms
--- www.linuxformat.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 24.249/24.249/24.249/0.000 ms
DNS failures show up very quickly with this test; for example:
$ ping www.prophylactic.gov
ping: unknown host www.prophylactic.gov
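The two ping tests above, by address and then by name, can be rolled into one small function. This is a sketch under stated assumptions rather than a finished tool: check_ip_vs_name is a hypothetical name, and the PING_CMD variable is our own addition that lets you substitute another command for ping. Bear in mind that some hosts are set up to ignore pings, so pick a target you know normally answers.

```shell
# Compare reachability by address and by name, to separate DNS trouble
# from general connectivity trouble.
# $1 = IP address of a known host, $2 = DNS name of the same host.
# PING_CMD (an assumption of this sketch) overrides the ping command.
check_ip_vs_name() (
    ping_cmd=${PING_CMD:-ping}
    if ! $ping_cmd -c1 "$1" >/dev/null 2>&1; then
        echo "no reply from $1: connectivity problem"
    elif ! $ping_cmd -c1 "$2" >/dev/null 2>&1; then
        echo "$1 replies but $2 does not: suspect DNS"
    else
        echo "both $1 and $2 reply"
    fi
)
```

You would call it with the address and name of the same machine, for instance the two values used in the ping examples above.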
If you can ping a machine using its IP address but not using its DNS name, it's high time that you investigated your DNS configuration. The best utility for this is dig. Here's a sample (and successful) run. Don't be intimidated by the level of detail that is shown in the output; the important thing to note is the A record returned in the ANSWER section:
# dig www.linuxformat.com
; <<>> DiG 9.4.0 <<>> www.linuxformat.com
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23236
;; flags: qr rd ra; QUERY:1, ANSWER:2, AUTHORITY:2, ADDITIONAL:2
;; QUESTION SECTION:
;www.linuxformat.com. IN A
;; ANSWER SECTION:
www.linuxformat.com. 3600 IN A 18.104.22.168
;; AUTHORITY SECTION:
linuxformat.com. 300 IN NS ns0.future.net.uk.
linuxformat.com. 300 IN NS ns1.future.net.uk.
;; ADDITIONAL SECTION:
ns0.future.net.uk. 104 IN A 22.214.171.124
ns1.future.net.uk. 104 IN A 126.96.36.199
;; Query time: 323 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Thu Mar 26 21:42:40 2009
;; MSG SIZE rcvd: 134
If the DNS lookup fails, you need to distinguish a couple of cases: The first case is when DNS can't find the machine you're looking for. Here's an example of an attempt to look up a machine that simply doesn't exist:
# dig prophylactic.gov
; <<>> DiG 9.4.0 <<>> prophylactic.gov
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13168
;; flags: qr rd ra; QUERY:1, ANSWER:0, AUTHORITY:1, ADDITIONAL:0
;; QUESTION SECTION:
;prophylactic.gov. IN A
;; AUTHORITY SECTION:
gov. 2560 IN SOA a.gov.zoneedit.com. govcontact.zoneedit.com. 1183644065 3600 900 1814400 86400
Notice the NXDOMAIN report for the status of the enquiry, and the absence of an ANSWER section like the one we saw in the previous lookup. Assuming you've entered a valid machine name, this kind of failure is somebody else's problem.
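Since the status field in the header is what separates a healthy lookup from a failed one, it can be handy to pull it out on its own. A minimal sketch, assuming the header layout shown in the dig transcripts above; the function name dig_status and the NO_REPLY label (printed when dig's output has no header at all, because no server answered) are our own.

```shell
# Print the status field from dig's reply header (NOERROR, NXDOMAIN, ...),
# or NO_REPLY if the output contains no header at all.
# Reads dig output on standard input. NO_REPLY is a label invented here.
dig_status() (
    status=$(sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    echo "${status:-NO_REPLY}"
)

# On a live system:
#   dig www.linuxformat.com | dig_status
```

NOERROR with an ANSWER section means the lookup worked; NXDOMAIN means a server replied but the name does not exist.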
Can you find your DNS server?
The second case of DNS failure is the situation where your machine can't find a DNS server. This indicates a problem that is potentially closer to home:
# dig www.linuxformat.co.uk
; <<>> DiG 9.4.0 <<>> www.linuxformat.co.uk
;; global options: printcmd
;; connection timed out; no servers could be reached
If this happens, take a look at the file /etc/resolv.conf. This is where Linux records its idea of where its DNS servers are. If you use DHCP to configure your networking, the IP addresses of your DNS servers are supplied by your DHCP server. If you have a static setup, you probably used a graphical network configuration tool such as Fedora's system-config-network to specify the location of your DNS servers. In either case, the results are written into this file. Is there a valid nameserver IP address in this file? Can you ping it directly?
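Both questions are quick to answer from the shell. The sketch below lists the nameserver entries so that each can be pinged or queried in turn; list_nameservers is a hypothetical helper name, and it assumes the standard "nameserver ADDRESS" lines of resolv.conf.

```shell
# Print each nameserver address listed in a resolv.conf-style file.
# $1 = file (defaults to /etc/resolv.conf)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# To ping every configured server in turn:
#   for ns in $(list_nameservers); do ping -c1 "$ns"; done
# To send a query to one particular server, bypassing resolv.conf:
#   dig @192.168.1.254 www.linuxformat.com
```

The address 192.168.1.254 in the last comment is simply the server reported in the successful dig run earlier; substitute whatever your own file contains.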
If all else fails in your diagnostic attempts, try looking at the network traffic with wireshark, a packet trace utility previously known as ethereal. As a diagnostic tool, we do tend to view wireshark as a "last resort": not because of any weakness in the program (wireshark is actually a great piece of software) but because debugging network problems by examining the detailed packet traffic requires a very detailed knowledge of TCP/IP and the overlying application protocols. Also (depending on the problem) you may need a "third party" machine on the network in order to observe the traffic.
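Figure 4, below, shows a simple example of a wireshark capture of the packet traffic resulting from the command
# ping 192.168.0.42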
run on the machine with the IP address 192.168.0.3. Take a look at the upper of wireshark's three display panels; it shows a one-line summary for each packet captured. The middle and bottom panels let us drill down into the contents of the individual packets, but for our present purposes we don't need to go there.
Figure 4: this screengrab shows a simple example of a Wireshark capture of the packet traffic resulting from the command 'ping 192.168.0.42'
The message is clear and simple: the machine 192.168.0.3 is trying to use ARP to discover the MAC address of the machine it's trying to ping. It tries three times, at one-second intervals, but gets no reply.
So we can conclude that there's nothing wrong with 192.168.0.3 - it's able to get packets out onto the network with the correct source IP address - but that the machine 192.168.0.42 simply isn't there.
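For this particular symptom there is a lighter-weight check that doesn't need a packet capture: the kernel records the outcome of its ARP attempts in the neighbour cache. The following sketch assumes the output format of iproute2's "ip neigh show", in which the state is the last word on each line; neigh_state is a name we have made up.

```shell
# Print the neighbour-cache state recorded for one IP address.
# Reads `ip neigh show` output on standard input; the state is the last
# word on the line (REACHABLE, STALE, INCOMPLETE, FAILED, ...).
neigh_state() {
    awk -v ip="$1" '$1 == ip { print $NF }'
}

# On a live system, straight after the failed ping:
#   ip neigh show | neigh_state 192.168.0.42
```

A state of INCOMPLETE or FAILED tells the same story as the capture: the ARP requests went out but nobody answered.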
First published in Linux Format magazine