Eduroam connectivity issues on Android 2.3.*

Since reports from users are on the increase, this blog post describes briefly the issue with eduroam connectivity on Android devices. Please be aware of it and inform your users, should they ask for advice.

The problem affects only some versions (2.3.3+). There appears to be no pattern in which devices or versions are affected and which aren’t. So far we have had reports concerning Samsung and Motorola smartphones and one HTC tablet. The lack of pattern appears to be attributable to manufacturers’ use of custom software, although there’s no official stance on it.

The behavior is somewhat similar to using incorrect Remote Account credentials, in that the device goes into a loop of Scanning -> Authenticating -> Connecting -> Disconnected. The cause turns out to be a bug in Android, which is unable to handle phase 2 802.1x authentication. A quick investigation revealed that the authentication request never reaches the RADIUS server.

The problem is described here: http://code.google.com/p/android/issues/detail?id=15631

The above page is a few months old now, but due to the random nature of the problem we haven’t had an avalanche of reports so far (or they were misinterpreted). My own device runs 2.3.4 and doesn’t exhibit the problem; however, we had a user with 2.3.5 experiencing the connectivity problem only a few days ago, so clearly the issue remains unsolved.

Having explored the few suggestions provided on the bug description page, I’m afraid there’s no workaround at the moment. We suggest upgrading the operating system on affected devices in the hope of fixing it, but we have no evidence that this actually works.

If you are aware of other fixes or have any comments on this particular problem, please email networks@oucs.ox.ac.uk

Posted in Uncategorized

VPN NAT Changes

What is this post about?

We are planning to make a minor change to the way our VPNs NAT clients. For those who are interested, this blog post explains why and how we are doing this. Please note that these days NAT is used as a general term encompassing both NAT (Network Address Translation) and PAT (Port Address Translation). I’ll be specific in this post.

Problem summary

The original VPN config didn’t use NAT or PAT and had a client pool of 129.67.116.0/22, which was advertised to the various departmental and collegiate IT staff working at Oxford University. This pool became exhausted, but we didn’t want to be seen to be favouring our own services by taking whatever IPs we liked. There are also lots of local firewalls around the University which are aware of this range, so we migrated to PAT, taking 2 IPs from the above range per ASA.

At the time of writing (Jan 2012), the ASA VPNs are configured as follows, where X and Y are two adjacent IPs and are unique to each ASA:

object network nat_inside_local
 subnet 10.16.0.0 255.255.240.0
object network nat_outside_pool
 range 129.67.119.24X 129.67.119.24Y
nat (vpn-outside,vpn-inside) \
 source dynamic nat_inside_local pat-pool nat_outside_pool

This means that hosts in 10.16.0.0/20 whose traffic hits the vpn-outside interface and is destined for the vpn-inside interface, will be PATed onto 129.67.119.24X. When all 65K ports have been used up, .24Y will be used. The issue we are having is that IP/Port combinations are being re-used too quickly for our CERT team to be certain (ha ha) of who is being naughty.

Solution

To get around this we will move to using Dynamic NAT with a generous range of IPs, falling back to a PAT IP if they are exhausted. We need to take care when choosing the pools though, as some IPs are reserved for existing VPN infrastructure. As such, the allocation of the pools will be asymmetrical, as to me this seems cleaner than giving 9 hosts from .117 to node 1.

  • node0 will have 129.67.116.1 - 129.67.117.255
  • node1 will have 129.67.118.0 - 129.67.119.235

The /22 contains a few addresses which will look odd: 116.255, 117.[0|255], 118.[0|255] and 119.0 are all legitimate host addresses. We will waste 118.0 as it means both PAT addresses will be similar, which should make life a bit easier.

116.0 and 119.255 are the only network and broadcast addresses.
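
For anyone who wants to double-check the arithmetic, here is a small Python sketch using the standard ipaddress module. It confirms which addresses in the /22 are the real network and broadcast addresses, and counts the Dynamic NAT addresses each node gets from the ranges used in the final config. The helper name range_size is my own, purely for illustration.

```python
import ipaddress

# The client block from the post.
block = ipaddress.ip_network("129.67.116.0/22")


def range_size(first: str, last: str) -> int:
    """Number of addresses in an inclusive first-last range."""
    return int(ipaddress.ip_address(last)) - int(ipaddress.ip_address(first)) + 1


# Addresses that would be network/broadcast addresses of a /24 but are
# ordinary host addresses inside the /22.
odd_looking = [
    "129.67.116.255", "129.67.117.0", "129.67.117.255",
    "129.67.118.0", "129.67.118.255", "129.67.119.0",
]

usable = set(block.hosts())  # everything except network and broadcast
all_usable = all(ipaddress.ip_address(a) in usable for a in odd_looking)

# Dynamic NAT ranges as they appear in the final config.
node0_size = range_size("129.67.116.2", "129.67.117.255")
node1_size = range_size("129.67.118.2", "129.67.119.235")

print(block.network_address, block.broadcast_address)
print(all_usable, node0_size, node1_size)
```

The two sizes differ because node1 stops at .235, which is the asymmetry mentioned above.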

The final config will be as follows. I’ve renamed the groups to simplify migration and hopefully make it very clear what is going on.

vpn-0

object network D-NAT-RANGE
 range 129.67.116.2 129.67.117.255
object network PAT-HOST
 host 129.67.116.1
object network INSIDE-POOL
 subnet 10.16.0.0 255.255.240.0
object-group network OUTSIDE-POOL
 network-object object D-NAT-RANGE
 network-object object PAT-HOST

nat (vpn-outside,vpn-inside) \
 source dynamic INSIDE-POOL OUTSIDE-POOL

vpn-1

object network D-NAT-RANGE
 range 129.67.118.2 129.67.119.235
object network PAT-HOST
 host 129.67.118.1
object network INSIDE-POOL
 subnet 10.16.16.0 255.255.240.0

object-group network OUTSIDE-POOL
 network-object object D-NAT-RANGE
 network-object object PAT-HOST

nat (vpn-outside,vpn-inside) source dynamic INSIDE-POOL OUTSIDE-POOL

All three objects are unique on each host.

Staging tests

I used our lab to mimic the production environment. We tested with four clients, so we needed to artificially shrink the Dynamic NAT range to 2 IPs so that PAT would be triggered and we could verify it worked. We used only one ASA for the same reason. Here is the config:

object network D-NAT-RANGE
 range 192.168.30.12 192.168.30.13
object network PAT-HOST
 host 192.168.30.31
object network INSIDE-POOL
 subnet 10.0.0.0 255.255.255.0
object-group network OUTSIDE-POOL
 network-object object D-NAT-RANGE
 network-object object PAT-HOST

nat (vpn-outside,vpn-inside) source dynamic INSIDE-POOL OUTSIDE-POOL

Verification

The first two hosts to connect were NATed to 192.168.30.12 and .13 respectively. The remaining two hosts used PAT on .31. All hosts were able to reach the appropriate fake external networks hosted on an area of the lab only reachable from the VPN pool.

With three laptops connected we see the first two use 1-1 dynamic NAT and the third uses PAT:

vpn-dev-0# show xlate
4 in use, 7 most used
Flags: D - DNS, i - dynamic, r - portmap, s - static, I - identity, T - twice
NAT from vpn-outside:10.0.0.1 to vpn-inside:192.168.30.12 \
 flags i idle 0:00:09 timeout 3:00:00
NAT from vpn-outside:10.0.0.2 to vpn-inside:192.168.30.13 \
 flags i idle 0:00:17 timeout 3:00:00
TCP PAT from vpn-outside:10.0.0.3/53013 to vpn-inside:192.168.30.31/42959 \
 flags ri idle 0:00:00 timeout 0:00:30
TCP PAT from vpn-outside:10.0.0.3/53012 to vpn-inside:192.168.30.31/20628 \
 flags ri idle 0:00:01 timeout 0:00:30

Now with four laptops connected the additional client also used PAT:

vpn-dev-0# show xlate 
36 in use, 36 most used
Flags: D - DNS, i - dynamic, r - portmap, s - static, I - identity, T - twice
<snip>
UDP PAT from vpn-outside:10.0.0.4/52004 to vpn-inside:192.168.30.31/60697 \
 flags ri idle 0:00:45 timeout 0:00:30
UDP PAT from vpn-outside:10.0.0.4/137 to vpn-inside:192.168.30.31/252 \
 flags ri idle 0:01:25 timeout 0:00:30
NAT from vpn-outside:10.0.0.1 to vpn-inside:192.168.30.12 \
 flags i idle 0:00:00 timeout 3:00:00
NAT from vpn-outside:10.0.0.2 to vpn-inside:192.168.30.13 \
 flags i idle 0:00:10 timeout 3:00:00
ICMP PAT from vpn-outside:10.0.0.3/51467 to vpn-inside:192.168.30.31/26251 \
 flags ri idle 0:01:51 timeout 0:00:30

Our policy is in use:

vpn-dev-0# show nat
Manual NAT Policies (Section 1)
1 (vpn-outside) to (vpn-inside) \
 source dynamic INSIDE-POOL OUTSIDE-POOL
translate_hits = 59, untranslate_hits = 126
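
If you need to check this sort of output for more than a handful of clients, a short script can do the tallying. The following is only a sketch: it assumes the wrapped lines of show xlate have been rejoined, and the regular expression covers just the line shapes seen in the lab output above, not every form the ASA can print. The names LINE and summarise are illustrative.

```python
import re
from collections import Counter

# Sample taken from the lab output above, with wrapped lines rejoined.
SAMPLE = """\
NAT from vpn-outside:10.0.0.1 to vpn-inside:192.168.30.12 flags i idle 0:00:09 timeout 3:00:00
NAT from vpn-outside:10.0.0.2 to vpn-inside:192.168.30.13 flags i idle 0:00:17 timeout 3:00:00
TCP PAT from vpn-outside:10.0.0.3/53013 to vpn-inside:192.168.30.31/42959 flags ri idle 0:00:00 timeout 0:00:30
TCP PAT from vpn-outside:10.0.0.3/53012 to vpn-inside:192.168.30.31/20628 flags ri idle 0:00:01 timeout 0:00:30
"""

# Optional protocol, then NAT or PAT, then real and mapped addresses
# (each with an optional /port suffix for PAT entries).
LINE = re.compile(
    r"^(?:(?P<proto>TCP|UDP|ICMP) )?(?P<kind>NAT|PAT) from "
    r"[\w-]+:(?P<real>[\d.]+)(?:/\d+)? to [\w-]+:(?P<mapped>[\d.]+)(?:/\d+)?"
)


def summarise(text):
    """Map each real client IP to (kind, mapped IP, number of translations)."""
    seen = {}
    counts = Counter()
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # header, flags legend, <snip> and so on
        real = match.group("real")
        seen[real] = (match.group("kind"), match.group("mapped"))
        counts[real] += 1
    return {ip: (kind, mapped, counts[ip]) for ip, (kind, mapped) in seen.items()}


summary = summarise(SAMPLE)
for ip, info in sorted(summary.items()):
    print(ip, *info)
```

Clients reported as NAT hold a 1-1 address from the dynamic range, while those reported as PAT have fallen back to the shared address.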

Summary

We plan to make this change live during our next maintenance release.

Posted in Cisco Networks, Documentation, General Maintenance, VPN

How to generate graphs with gnuplot

Introduction

During the JANET Carrier Ethernet Trial we took part in, I needed to plot some data based on our testing and came across gnuplot. It is actually quite simple to use and we’re using it more and more, so I thought I’d share some of what I’ve learned.

Process

First you need to generate a text file of data which you wish to graph (whitespace is fine as a delimiter).

Here is a sample of some data I created. It is the number of unique users of OWL Visitor (our guest wireless service) per day:

2011-08-17 670
2011-08-18 666
2011-08-19 619
2011-08-20 470
2011-08-21 368
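
How you produce this file depends on where your data lives. As a rough sketch in Python, assuming you can extract one (date, username) pair per login event (that input format is my assumption here, not a description of how OWL Visitor actually logs), counting unique users per day looks like this:

```python
from collections import defaultdict

# Hypothetical input: one (ISO date, username) pair per login event.
events = [
    ("2011-08-17", "alice"),
    ("2011-08-17", "bob"),
    ("2011-08-17", "alice"),  # repeat login, counted once
    ("2011-08-18", "carol"),
]


def daily_unique(events):
    """Return 'date count' lines, one per day, sorted by date."""
    users_by_day = defaultdict(set)
    for day, user in events:
        users_by_day[day].add(user)
    return [f"{day} {len(users)}" for day, users in sorted(users_by_day.items())]


lines = daily_unique(events)
print("\n".join(lines))
```

Redirecting the printed lines to a file gives the whitespace-delimited format shown above.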

Install gnuplot on the server you are using. At the time of writing it is available natively in RedHat and Debian. You need to generate a config file; man gnuplot is your friend here.

#!/usr/bin/gnuplot
# gnuplot script file for plotting unique visitor users over time
reset
set terminal png

set xdata time
# The data file holds dates only (e.g. 2011-08-17), so there is no time part
set timefmt "%Y-%m-%d"
set format x "%d/%m"

set xlabel "Date (day/month)"
set ylabel "Number of unique visitor users"

set title "Visitor Users over time"
set key below
set grid

plot "/home/networks/unique_visitors.csv" using 1:2 title "Visitors"

Hopefully the config file above is fairly self explanatory. To generate the graph simply run the following:

/usr/bin/gnuplot visitor_users.gp > /home/networks/visitor_users.png

Where visitor_users.gp is the name of the config file above. Here is the result, using a larger dataset:

You can then use a cronjob to update the data and replot the graph regularly.

Using variables in the config file

If you would like to manipulate the data you are plotting on the fly, for example to scale something down, you can. An example is probably best here.

Here is a small subset of the data:

2011-08-01T09:10:03 31106 630881 15746233 27439 609104 15924148 128 8133029 32533776
2011-08-01T09:20:04 31106 630929 15747201 27439 609152 15925609 128 8133029 32533776
2011-08-01T09:30:03 31106 631020 15750202 27447 609203 15928230 128 8144560 32584722
2011-08-01T09:40:03 31106 631078 15751874 27453 609238 15930112 128 8144560 32584722
2011-08-01T10:00:03 31110 631196 15754712 27455 609310 15933583 128 112198088 40867109
2011-08-01T10:10:03 31115 631354 15760272 27455 609333 15935724 128 112203231 40895646
2011-08-01T10:20:02 31117 631425 15763701 27460 609471 15941256 128 112203231 40895646
2011-08-01T10:30:03 31121 631558 15766657 27461 609489 15943780 128 112204920 40903491

and the config:

#!/usr/bin/gnuplot
reset
set terminal png

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S"
set format x "%d/%m"

set xlabel "Date (day/month)"
set ylabel "Number of NP Blocks / 10000"
# set ylabel "Number of NP Blocks / 10000 (log)"
# set log y

set title "NP Blocks over time"
set key below
set grid

plot "/home/netdisco/np-blocks.csv" using 1:($2/10000) title "NP1_0" , \
"" using 1:($3/10000) title "NP1_1", \
"" using 1:($4/10000) title "NP1_2", \
"" using 1:($5/10000) title "NP2_0", \
"" using 1:($6/10000) title "NP2_1", \
"" using 1:($7/10000) title "NP2_2", \
"" using 1:($8/10000) title "NP3_0", \
"" using 1:($9/10000) title "NP3_1", \
"" using 1:($10/10000) title "NP3_2"

Again, here is the result:

Hopefully that has been a useful primer on gnuplot, happy graphing!

Posted in Documentation, Productivity, Trend Analysis, Wireless

Maintenance Work On Eduroam

Just a slightly uneventful blog post aimed at our IT staff in colleges, departments and other units to let you know about some of the grittier routine work on Eduroam. This is a warts and all account of real life events and problems. You can let me know of any errors or ambiguity in the comments, on IRC or via an email to networks at OUCS.

Specifically, we recently had Janet Roaming Support (JRS) visit on a 2 day paid consultancy basis to review the eduroam deployment, with a key focus on the RADIUS configuration. If you’re not familiar, the RADIUS service is what authenticates a user when they attempt to log in to eduroam. The problem JRS had actively contacted us about was that our server was sending requests for user@ox.ax.uk (i.e. a misspelt authentication realm) to the JRS national service. Having dealt with misconfigured DNS clients making around 38.5 million requests a day (~420 requests/second) to our DNS servers, I was a little sceptical at first about the level of denial of service they were complaining about (in the order of 1 request every 4 seconds), which didn’t do relations much good. What I hadn’t realised at the time was that (as they later explained, if I remember correctly) they were being forced to run the RADIUS service in a single threaded debug mode as part of their national level logging requirements for the eduroam parent organisation. I believe that situation has since changed; however, it was still clear that our RADIUS setup was in need of maintenance and was falling foul of more than one requirement of the eduroam provision, such as the level of logging.

So the background to the initial problem is that someone on, for instance, an Android phone types in their username and adds @ox.ac.uk as the realm, but through either auto-correction or a bad key press the ‘ac’ becomes ‘ax’. The device then fails to connect, the user gives up and, unknown to the user, the phone keeps trying to connect at regular intervals. About 4 phones university wide might cause more than 4k connections a day to Janet’s RADIUS servers, and this number would get worse with time. On top of this common typo, there are users who get confused and type in their email address. Accepting logins for these typo domains is not a good solution as (other objections aside) eduroam wouldn’t work for those users at other sites. Another solution offered is to contact each user we see with a typo rejection in the logs, which sounds good at first, but there are various issues that complicate contacting the users:

  • The user might also have typed their actual login name wrong, so instead of contacting ‘dept0123’ I’ll contact ‘dept0213’, who will have something to say on the matter
  • It eats up considerable time (I could automate it but the first issue would cause problems)
  • The majority of affected people appear to just ignore emails when contacted about this (local IT Support might get to physically meet them but I don’t)

Despite this I have performed a few checks and contacts when between projects; the first issue seems to have occurred once. Only one person has replied to say it’s all working and thanks for the assistance. So contacting users helps improve the quality of our provision, but it’s not a long term solution for preventing devices DoS’ing Janet’s national service. Hence the correct solution, in terms of preventing pointless upstream traffic, is to stop these typo authentication requests going to Janet and reject them locally (this doesn’t change our public provision behaviour since Janet is going to reject anything for ‘ox.ax.uk’ anyway).
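
To make the decision concrete, here is the logic sketched in Python. This is not our FreeRADIUS configuration, just an illustration of the idea: the function name route_request and the typo list are hypothetical (only ox.ax.uk is a realm we have actually discussed here), and rejecting identities with no realm at all is an assumption made for this sketch.

```python
LOCAL_REALM = "ox.ac.uk"

# Illustrative typo list; only ox.ax.uk is mentioned in this post.
KNOWN_TYPOS = {"ox.ax.uk"}


def route_request(identity: str) -> str:
    """Return 'local', 'reject' or 'proxy' for an outer identity."""
    user, sep, realm = identity.rpartition("@")
    realm = realm.lower()
    if not sep or not user:
        # Assumption for this sketch: bare usernames are not sent upstream.
        return "reject"
    if realm == LOCAL_REALM:
        return "local"   # authenticate against our own users
    if realm in KNOWN_TYPOS:
        return "reject"  # known misspelling: never forward upstream
    return "proxy"       # genuine visitor realm: hand to the national service


for example in ("dept0123@ox.ac.uk", "dept0123@ox.ax.uk", "someone@example.ac.uk"):
    print(example, "->", route_request(example))
```

The point is simply that the typo check happens before the proxy decision, so the upstream servers never see the request.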

In terms of implementing the fix, we had two senior team members confident in RADIUS configuration, but one had left and the other had been promoted to a management position (currently mostly taken up by the new shared data centre). So I attempted a fix earlier this year, but I’m unfamiliar with RADIUS, the configuration was complex and sadly my solution did not work as expected. We’re torn between multiple tasks and services and I didn’t have the time to devote to testing and background reading that I would have liked. So I had to roll back the changes, and in doing so I rolled back slightly too far, causing a cryptographic key (used between our server and Janet’s) to be wrong, which was noticed and corrected within about 36 hours.

Since the issue of the bad logins was still ongoing, I requested, and had approved, a visit from JRS on a contract basis to check the configuration and, on a second day, implement any changes needed. I knew they were familiar with FreeRADIUS and worked with it each day; they were of course also familiar with the ideal way an eduroam service should work. This went well, with JRS picking up various ways to make the service more efficient and also picking up errors in our published documentation and, unexpectedly, in the physical eduroam wireless provision at one Oxford site. A college with its own independent Wireless LAN Controllers and access points was advertising WPA2/AES and, oddly, WPA/AES (instead of WPA/TKIP), so I’ve contacted them to ask them to move to WPA2 only. This avoids Windows clients having to make yet another eduroam profile, as the WPA type has to be statically configured in the default wireless supplicant and is normally WPA/TKIP. I’m aware of TKIP’s shortcomings, but WPA2 is the preferred solution, if for nothing more than to avoid reconfiguring less than perfect clients. Summary: if in doubt, please just offer WPA2/AES. JRS also recommended moving to WPA2 sitewide, which is something I agree with, but with Oxford’s local independent political layout I’m unsure I could ever state that ‘Oxford is WPA2 only site wide’ and be accurate. I hear stories that one unit still offers WEP, which is a little soul crushing. I’m not sure what the long term solution to this is in Oxford’s environment. It might be that the OUCS networks physical installation teams are briefed to keep their devices looking for eduroam and to report any WPA/AES sites found when installing services for colleges or doing maintenance on other physical provisions, and then to gently push those units to a WPA2 only provision.

Of the changes made, some were important for communication with Janet, like stopping the typo mistakes from creating a denial of service against the Janet servers. Others at first looked unneeded (like changing the configuration file format from a FreeRADIUS 1 style layout to a FreeRADIUS 2 style layout) but were about long term support of the service: any questions to JRS and similar would be a lot easier to handle with the configuration syntax in a modern format. Going through the configuration line by line also highlighted places where the default performance values were being used and could be increased to match the more modern hardware the RADIUS service now runs on, compared to when the configuration was written. We also separated the RADIUS service for the VPN from that provided to eduroam using virtual servers (similar to Apache virtual sites configuration, if you’re familiar with that).

It didn’t go perfectly. Moving the VPN service to a permanent location in the configuration, from a dynamically created list of 802.1x clients in a database table, accidentally caused an iptables rule to be dropped by an automated process, but due to Murphy’s Law this happened only after we had finished testing on the test server and then on the live service. I got the call about this at 6pm that day and had it fixed by 6:10; new VPN connections had been affected, as authentication requests to the RADIUS servers had been dropped. I sent an announcement message to let IT support staff know of the outage. Internally we log VPN logins to both a flat file and SQL, and as part of moving to the virtual sites format I missed out the statement that logs to SQL. This was highlighted the next day by the security team, as it affected their response to infected hosts on the VPN network, and was promptly fixed.

Since then I’ve done some contacting of users as mentioned earlier, and I need to correct our website links to the JRS Acceptable Usage Policy, among other recommendations in the final JRS report. Locally I’ve also been trying to reduce the number of misconfigured access points to zero. We can see in the server logs units with heavyweight access points where the shared secret is incorrect, so I’ve been contacting each one I see to get them fixed. I think there’s only one now broken out of about 6 at the start of this week (we’ve between 2000 and 3000 WAPs if you include the hospitals so this is not so bad, but it’s good to see them fixed). I can automate this slightly but the IT support contact for each unit isn’t yet standardised (edit: someone points out there is a push for it-support@$foo.ox.ac.uk which I’m aware of, but I wasn’t sure it’s fully in place yet, but yes I could manually catch bounces which would be less work than emailing every incident) so I don’t believe I can fully automate this, but it’s something I can look into. Sadly I’ll have to make a note of it and move on, as there are many other services that also need attention and this automation would be lower priority than, for instance, security issues.

I’ve a number of other services and projects to mention, but that’s enough for one day.

Posted in General Maintenance, Wireless

OUCS Backbone Network Naming and Numbering Conventions

Introduction

This blog post is intended to help ITSS in Oxford to better understand how the centrally provided network fits together with their own local networks. It is also hoped it will assist them in assessing the impact of any reboots we need to do for software and hardware updates.

Devices

The OUCS backbone consists of 12 Cisco Catalyst 6500s and around 200 Cisco Catalyst 3750s.

There are three types of 6500:

  1. 1 x JANET BGP router (COUCS3)
  2. 2 x Core Switches (BOUCS and BMUS)
  3. 9 x Aggregation Switches (CXYX)

The network is arranged in a dual star topology, with all ‘C’ Aggregation Routers having a ten gigabit fibre connection to both ‘B’ Core Switches.

From the diagram hopefully it is clear that either BOUCS or BMUS can be rebooted without an outage. If any of the C Routers are rebooted then the outage extends to all VLANs which rely on that 6500. If COUCS3 is rebooted then internal connections will not be impacted but our access to JANET will be. Note that we plan to install a second link in the near future.

There are three types of 3750s (FroDo or Front Door / point of presence switches):

  1. Building FroDo
  2. MDX FroDo
  3. Distributor FroDo

Generally, each building has its own FroDo. Where multiple Units share a building they will each have one port for their main connection and those using OWL phase 1 will share the centrally provided LIN (Location Independent) network ports. There is a FroDo in each of the Telecoms MDX rooms. Finally, due to the various routes which the fibre takes around the city, it is occasionally necessary to deploy a 3750 to aggregate additional FroDos. This is common in areas with a high density of annexes such as Iffley Road.

The management IP subnet allocated to the FroDo network is 172.16.0.0/20.

Numbering convention

Each ‘C’ Aggregation router has a corresponding number as follows:

Device    Number
COUCS1    0
CENG      1
CSUR      2
CMUS      3
CZOO      5
CIND      6
CASH      7
CIHS      8
COUCS2    9

Each FroDo is numbered based on the C Router it is connected to. For example, the first 3750 connected to COUCS1 will be called FroDo-1 and will resolve to 172.16.0.1. The first FroDo to connect to CZOO will be called FroDo-501 and will resolve to 172.16.5.1.
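
Put another way, the two examples suggest that the hundreds part of the FroDo number gives the third octet and the remainder gives the fourth. The Python sketch below encodes that reading; note that the rule is inferred from the two examples above rather than being a formal specification, and the function name frodo_address is illustrative.

```python
import ipaddress

# Management subnet allocated to the FroDo network.
MGMT_NET = ipaddress.ip_network("172.16.0.0/20")


def frodo_address(number: int) -> ipaddress.IPv4Address:
    """Management IP for FroDo-<number>, assuming number = router * 100 + index."""
    router, index = divmod(number, 100)
    addr = ipaddress.ip_address(f"172.16.{router}.{index}")
    if addr not in MGMT_NET:
        raise ValueError(f"FroDo-{number} falls outside {MGMT_NET}")
    return addr


print(frodo_address(1))    # first FroDo on COUCS1
print(frodo_address(501))  # first FroDo on CZOO
```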

Connection Types

Each Unit has a main L3 connection. This is provided as a L2 VLAN presented on an access port on the  building FroDo and trunked up to the adjacent C Router where the SVI is located. Some Units also have a L2 annexe VLAN. In this case the VLAN is trunked from the main site FroDo, through both core switches to the annexe FroDos, where it is presented as an access port in the annexe VLAN, with or without double tagging (Q-in-Q). This allows Units to put all their annexes behind one firewall for example, although it has the disadvantage of creating a large L2 (failure) domain which is a Very Bad Idea. See http://blogs.oucs.ox.ac.uk/networks/2011/02/04/mac-flaps-why-are-they-bad/ for more on this. Some Annexes have their own L3 connection which is less convenient but better network design. In a future version of the backbone we hope to be able to offer VPLS to provide both flexibility and scalability, but I digress.

Tracking where your connections are

Using the LG (Looking Glass) tool, available here: https://networks.oucs.ox.ac.uk/, you can check which device(s) your networks are fed from. LG will show you where the L3 interfaces for your routed networks are, and which devices your annexes connect to at L2.

For your routed VLAN(s), the L2 connection will be to a FroDo, and that FroDo will connect directly to the C Router which hosts your L3 gateway as I mentioned earlier. For annexe sites which connect back at L2 to your main site, you will have visibility of the local device they connect to at L2. If this is a FroDo, there is no way for you to see which C Router that FroDo connects to using LG, although this can be deduced based on the third octet of the FroDo IP. The numbers in the table above show what that is for each C Router.

An example may help here. Let’s say Chaucer College connect to FroDo-501. Their main subnet might be 129.67.10.0/24. CZOO would host an SVI for VLAN 501 with an address of 129.67.10.254 (we always take the highest usable address in the subnet). That VLAN would be trunked to FroDo-501 and presented as an access port. Let’s say they own a building on Banbury Road and would like their users there to also be on 129.67.10.0/24. We would present VLAN 551 (for example) as an additional access port on FroDo-501 and trunk it through BOUCS and BMUS to COUCS1 and then FroDo-1, where it would be presented as an access port. Easy for the IT staff, as long as there are no loops at either end – they are propagated through the core and impact all users. Scale that up to 4 or 5 annexes and you see why I don’t like this and why we ask everyone to run STP. But I digress again…
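
As an aside, the "highest usable address" rule for the gateway is easy to compute. Here is a minimal Python sketch; the function name gateway_for is illustrative, and it assumes an ordinary subnet with a broadcast address (so not a /31 or /32).

```python
import ipaddress


def gateway_for(subnet: str) -> ipaddress.IPv4Address:
    """Highest usable host address: one below the broadcast address."""
    net = ipaddress.ip_network(subnet)
    return net.broadcast_address - 1


print(gateway_for("129.67.10.0/24"))
```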

So now you get an email from us saying we’re going to be rebooting all the 6500s for a software update over the summer, and you would like to know on which days your users will lose service during the announced maintenance period. Keep in mind that your annexe connections will go down when your main C Router is rebooted, and again when the uplink C Router from the annexe FroDo is rebooted if this is different. So with our example, the Chaucer College ITSS Fred Bloggs would check LG for their network and see something like this:

Looking Glass 1.4, using Oxford Directory 2.4
Given Vlan "501", displaying Unit Chaucer College
Chaucer College (cha):
itss01: Fred Bloggs fred.bloggs@chaucer.ox.ac.uk
4 further IT officers (use --all-itss to show)
Registered networks
129.67.10.0/24: Chaucer
Layer 3 interfaces
czoo.backbone.ox.ac.uk Vlan501 (up) Chaucer
129.67.10.254/24
Registered vlans
501: Chaucer
551: Chaucer Annexes
Layer 2 ports
v501 chaucer.frodo.ox.ac.uk     Gi1/0/1  [aGfu] Chaucer main
v551 chaucer.frodo.ox.ac.uk     Gi1/0/2  [aGfu] Chaucer BR Annexe
banbury-road.frodo.ox.ac.uk     Gi1/0/11 [aGfu] Chaucer BR Annexe

Now Fred wants to know what banbury-road.frodo.ox.ac.uk is connected to:

$ host banbury-road.frodo.ox.ac.uk
banbury-road.frodo.ox.ac.uk has address 172.16.0.1

The third octet is 0 so the annexe relies on COUCS1 and CZOO for its connectivity.
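
The same deduction can be scripted if you have several annexes to check. The following Python sketch uses the numbering table from earlier in this post; the function names are illustrative, and the main router is passed in by hand because LG is what tells you where your L3 interface lives.

```python
import ipaddress

# Third octet -> C router, from the numbering table above.
C_ROUTERS = {
    0: "COUCS1", 1: "CENG", 2: "CSUR", 3: "CMUS", 5: "CZOO",
    6: "CIND", 7: "CASH", 8: "CIHS", 9: "COUCS2",
}


def uplink_router(frodo_ip: str) -> str:
    """C router a FroDo hangs off, deduced from its management IP."""
    third_octet = ipaddress.ip_address(frodo_ip).packed[2]
    return C_ROUTERS[third_octet]


def routers_affecting(main_router: str, annexe_frodo_ips):
    """C routers whose reboot would interrupt the annexe VLAN."""
    return sorted({main_router, *(uplink_router(ip) for ip in annexe_frodo_ips)})


print(uplink_router("172.16.0.1"))
print(routers_affecting("CZOO", ["172.16.0.1"]))
```

For Fred, that means watching the reboot dates for both routers in the returned list.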

Posted in Backbone Network, Cisco Networks, Documentation, General Maintenance

Firewall firefighting

The intention of this post is to explain what’s been happening with the University Firewall, what we’ve been doing about it and what we intend to do.

The University Firewall Service is provided by a pair of Cisco FWSMs running as an active/standby failover pair in a Cisco Catalyst 6500 chassis.

Over the past month or so there have been a couple of fifteen-minute interruptions to the University’s Internet connection.  Our investigations suggested that the FWSMs may have been to blame.  We contacted the Cisco TAC (Technical Assistance Centre) for a comprehensive diagnosis but since we were running an old version of the FWSM firmware, they wanted us to upgrade to the latest version before helping us.  This firmware upgrade was scheduled for early on the morning of Tuesday 28th June.

During the evening of Monday 27th the active FWSM entered a state of continually rebooting. The standby FWSM did not take over, which resulted in the University being cut off from the Internet. Networks staff came in to the office on a voluntary basis and applied an emergency workaround. This consisted of bypassing the firewalls completely and recreating the ruleset as an ACL (Access Control List). An ACL doesn’t provide connection tracking like a firewall does, but since the firewall policy is default open an ACL offers very similar functionality in our case.

On Tuesday morning the FWSMs were upgraded as planned, put back into service, and the ACL removed.

On Wednesday afternoon, in an unrelated incident, an IOS bug was triggered which led to a number of backbone Catalyst 6500s rebooting, resulting in the loss of network connectivity for ten minutes. The trigger for this bug is now known and we have put measures in place to prevent a repeat. The reboot of the FWSMs’ 6500 caused them to fall over (which they shouldn’t), so we put the ACL back in service.

Now that our FWSMs are running the latest software we have once again sought help from the Cisco TAC.  The FWSMs are giving indications that they are not coping with our traffic load even though it is significantly lower than Cisco’s specification.  On the basis that the FWSMs are suffering from a hardware fault, Cisco is sending us a pair of new FWSMs which we hope will arrive early next Monday.  Assuming that they do arrive in time, we’ll prepare them on Monday and then put them into service during the standard maintenance window on Tuesday 5th July.

EDIT 4th July: the replacement hardware arrived right at the end of the day so no swap-outs tomorrow morning.

Posted in Firewall

The Week Before World IPv6 Day

So the big news is that as of this morning www.ox.ac.uk / ox.ac.uk has AAAA records and is hence reachable via IPv6, so currently the university IPv6 presence for World IPv6 day will be:

Websites

Other services:

  • irc.ox.ac.uk (relevant blog post although the service is now hosted by the Systems Development Team)
  • webcache.ox.ac.uk (relevant blog post)
  • ntp.oucs.ox.ac.uk (was previously ntp6.oucs.ox.ac.uk which is now CNAME’d)

The Maths Institute and the Ashmolean Museum?

Yes, both local units are taking part as IPv6 early adopters. We can’t currently offer IPv6 to all units until we have a working IPAM for IPv6, since at present we have to hand edit forward and reverse zone files to add IPv6 records, which isn’t scalable. The issue we’re having with the off the shelf solutions is integration with Single Sign On; specifically, a lot of vendors (and indeed internal staff) don’t understand the term and confuse it with shared sign on, or with a common authentication source that is passed the user’s credentials.

Cambridge has a similar political makeup to ourselves and a homegrown DNS management system like our own; however, I believe theirs is actively maintained by someone dedicated to DNS/DHCP and is based on a database backend. Sadly our own is almost a decade old and uses flat files. The front end is about 4k lines of code and the backend 3.5k, and the author has retired, leaving no documentation lines in the code. Altering this code is risky and the changes needed for IPv6 support would be non trivial.

Any other issues?

Yes. There are some changes we need to make to the way we respond to security incidents (blocking infected/compromised hosts etc) as the current mechanism is causing some CPU load on the switches, but what might initially seem a trivial problem requires a major rewrite of a backend application that manages blocks and displays the current blocked hosts to ITSS.

As of this week we’ve also discovered that the way we’re suppressing IPv6 auto configuration on networks is imperfect, in that Mac OS X hosts prior to 10.6.4 will configure IPv6 with null information. There appears to be no workaround for this in our provisioning, so the options are:

  • Upgrade all Mac OS X hosts on the client network to 10.6.4 or above
  • Don’t enable IPv6 on any network with Mac OS X hosts
  • Don’t suppress IPv6 discovery on the network, meaning devices will automatically assign themselves an IPv6 address

For the moment we’re simply warning the end units that this issue exists. They can have auto discovery on or off for their network.

And finally there’s the university IPv6 firewall, separate to the IPv4 firewall. We want to replace the current trial system with a production failover capable system. The last couple of weeks (in fact, it seems every week since February) have been incredibly busy and I didn’t get as much testing or preparation done for the production replacement as I’d have liked. As a result it was not surprising that in this morning’s production test it didn’t work, and the troubleshooting was made awkward by 101 minor issues associated with not preparing enough. I reverted it after ~5 minutes. I did set up the new solution and send traffic across it on our air gapped test network prior to this morning’s work. I think the main problem was that I wanted to do more tests that we simply ran out of time for. To a lesser extent, even if we’d had the time, the test network is not a perfect mirror of the production environment (for example, it doesn’t get Cisco 6500s for cost reasons), so there would still have been some smaller margin for unexpected error.

I’ll probably remove the AAAA’s for www.ox.ac.uk/ox.ac.uk in advance and then attempt another changeover on Tuesday morning in the ja.net at risk period depending on how much progress I can make today and Monday.

So the main site is on native IPv6 and will be staying on this after the 8th June?

Sadly not. The main website involves the participation of 5 teams. One (our own, based at OUCS) provides the core networking to the unit, one administers the virtual machine the webserver is on (NSMS), another university department administers the underlying hardware and local network, an external contractor provides the CMS that makes up the site, and the final team is the Public Affairs Directorate, which has political control of the site, its funding and what happens to it.

Despite early optimism there’s been an issue with approval for IPv6 to be enabled on the underlying local network (our own team acts in an ISP role, we don’t have political or technical control to the edge), so instead the IPv6 provision for the main site is via a reverse proxy. Essentially this is a webserver listening on IPv6 and then making requests on the client’s behalf to the main IPv4 site.
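For anyone who hasn’t met the idea before, here is a minimal Python sketch of what such a proxy does. It is an illustration only, not the software we run: the upstream address (192.0.2.10, a documentation address), the port and all the names are made up, and a production proxy would also handle errors, other request methods and header pass-through.

```python
import socket
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical IPv4-only origin; 192.0.2.10 is a documentation address.
UPSTREAM = "http://192.0.2.10"


def upstream_url(path):
    """Map the path requested over IPv6 onto the IPv4 origin."""
    return UPSTREAM + path


class IPv6HTTPServer(HTTPServer):
    # Listen on an IPv6 socket instead of the default IPv4 one.
    address_family = socket.AF_INET6


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Fetch the page from the IPv4 site on the client's behalf ...
        with urllib.request.urlopen(upstream_url(self.path), timeout=10) as resp:
            body = resp.read()
            status = resp.status
            ctype = resp.headers.get("Content-Type", "text/html")
        # ... and relay it back over the IPv6 connection.
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    # "::" binds to all IPv6 addresses on the host.
    return IPv6HTTPServer(("::", port), ProxyHandler)

# To run it: make_server().serve_forever()
```

The point to notice is that the origin server never sees an IPv6 packet; only the proxy host needs IPv6 connectivity.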

A reverse proxy? A well respected academic doing work in the IPv6 field told me that there’s little value in taking part in World IPv6 day with a reverse proxy

If I were judging an organisation’s commitment by their IPv6 involvement and they had used a reverse proxy then, depending on how proud they were about it, I might indeed question their dedication, since they’ve not actually made the larger changes needed for native IPv6 to their core systems.

However, from our present viewpoint I see the other argument: due to the internal problems mentioned, we had the option of either not taking part or using a reverse proxy with no native option. Taking part has advantages, specifically gathering information, getting the various teams experienced in IPv6 configuration and gaining political support and understanding among management that future work is needed (we are not all well respected academics, some internal people just don’t believe IPv6 is needed and assume it is simply the cause of issues).

So in summary, I agree with the viewpoint; however, I think we gain value from any IPv6 progress that can be made in the university, no matter how small.

What might a local IT support office be doing at their unit?

Posted in IPv6 | 1 Comment

Budget High Availability ASA testing

The problem

We’re looking at setting up a management network behind a couple of ASAs.

My requirements and prerequisites are:

  1. No L2 end to end VLANs through the core. That is bad and wrong.
  2. A total site failure at one site must not take down hosts at the other site or any services run on the ASAs. This testing won’t get as far as the VPN side of things; today I’m just looking at routing.
  3. Routing can be static or dynamic. I’ll use static today because my test switch doesn’t have an OSPF licence and I’m not in a RIP kind of mood.
  4. The ASAs need to be physically at different sites.
  5. We can use private fibre.

It will cost about £5K to get all the optics and interface cards we’d need to do proper dual site ASAs, with dual uplinks and HSRP enabled at the other end. I’m looking into an alternative method which relies only on dark fibre connecting the inside network switches and uses a different routed connection at each site. One issue is that the ASA configs are synced exactly. Since I want network connectivity to survive a failover and I can’t send the same network to both sites in a scalable, redundant way, I’ll need to use two ports on each ASA and only connect one at each site. On failover, the main port will be down and the second connection up, so I’ll then want the default route to change accordingly.

Summary

What, bored already? Okay my conclusion is that the ASAs can be made to failover to a second routed connection, but it is dog slow.

Network Diagram

[Figure: network diagram]

Step by step

Set up active / standby

ASA 1

failover
failover lan unit primary
failover lan interface failover-link Ethernet0/3
failover interface ip failover-link 10.1.1.1 255.255.255.252 standby 10.1.1.2

ASA 2

failover
failover lan unit secondary
failover lan interface failover-link Ethernet0/3
failover interface ip failover-link 10.1.1.1 255.255.255.252 standby 10.1.1.2

Configure dual uplinks

The config will be replicated across the two ASAs. Site A will have its ‘ISP’ connection on E0/0, Site B will use E0/1.

!
interface Ethernet0/0
nameif ISP-10
security-level 0
ip address 192.168.10.2 255.255.255.0
!
interface Ethernet0/1
nameif ISP-20
security-level 0
ip address 192.168.20.2 255.255.255.0
!

Uplink notes

The other end of the ISP links is a 3750 switch. E0/0 on the first ASA is connected to an access port in VLAN 10, E0/1 on the second ASA is connected to an access port in VLAN 20. The SVIs are given 192.168.10.1 and 192.168.20.1 respectively.

Static routes and tracking

We will configure static route tracking which allows us to change our default route if the link fails. For a production service we’d also configure the pair to failover on uplink failure.

First we configure the ASAs to keep an eye on their ISP gateways (sla_id 1 and 2):

sla monitor 1
 type echo protocol ipIcmpEcho 192.168.10.1 interface ISP-10
sla monitor schedule 1 life forever start-time now
sla monitor 2
 type echo protocol ipIcmpEcho 192.168.20.1 interface ISP-20
sla monitor schedule 2 life forever start-time now

Now we’ll configure the ASAs to track the sla_ids:

track 1 rtr 1 reachability
!
track 2 rtr 2 reachability

Finally we define the static routes, setting them to drop out if the gateway IP is not reachable and making the main ISP the default. We could have ignored all the steps above and just used the metrics (the 1 and 2 that follow the gateway addresses below; sorry I chose 1 and 2, that is a bit confusing in this context), but then the second route would only be used if the ASA interface went down, which isn’t the only failure scenario.

route ISP-10 0.0.0.0 0.0.0.0 192.168.10.1 1 track 1
route ISP-20 0.0.0.0 0.0.0.0 192.168.20.1 2 track 2
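To make the intended behaviour explicit, here is a small Python model of how I understand the selection to work: of the routes whose tracked object is up, the one with the lowest metric is installed. This is a sketch of the logic only, not of the ASA internals; the class and function names are my own.

```python
from dataclasses import dataclass


@dataclass
class TrackedRoute:
    interface: str
    gateway: str
    metric: int      # administrative distance: lower is preferred
    track_id: int    # the "track N" object the route depends on


ROUTES = [
    TrackedRoute("ISP-10", "192.168.10.1", 1, 1),
    TrackedRoute("ISP-20", "192.168.20.1", 2, 2),
]


def default_route(routes, track_up):
    """Return the route that would be installed, or None.

    track_up maps a track id to True when its SLA probe gets replies.
    """
    usable = [r for r in routes if track_up.get(r.track_id, False)]
    return min(usable, key=lambda r: r.metric) if usable else None
```

With both tracks up the ISP-10 route wins; with only track 2 up the ISP-20 route is used; with neither up there is no default route at all, which matches the "Gateway of last resort is not set" state.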

Testing

First let’s enable debugging so that we can see exactly what happens:

logging enable
logging timestamp
logging console debugging
sdc-asa# debug track

Tracked IP unreachable tests

I won’t repeat all the debug output here but here are the interesting bits:

sdc-asa# failover active
May 24 2011 14:47:56: %ASA-1-104001: (Secondary) Switching to ACTIVE
 - Set by the config command.
sdc-asa# show route <snip>
Gateway of last resort is not set
C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link
May 24 2011 14:48:41: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.20.1, distance 2,
table Default-IP-Routing-Table, on interface ISP-20
sdc-asa# show route
<snip>
Gateway of last resort is 192.168.20.1 to network 0.0.0.0
C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link
S*   0.0.0.0 0.0.0.0 [2/0] via 192.168.20.1, ISP-20

As you can see, it takes 45 seconds for the alternate default route to appear in the routing table of the second ASA after failover. Let’s try failing back over.

sdc-asa# failover active
May 24 2011 14:56:35: %ASA-1-104001: (Primary) Switching to ACTIVE
 - Set by the config command.

sdc-asa# show route
<snip>
Gateway of last resort is not set

C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link

Track: 1 Change #11 rtr 1, reachability Down->Up
May 24 2011 14:56:59: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.10.1,
distance 1, table Default-IP-Routing-Table, on interface ISP-10
show route

<snip>

Gateway of last resort is 192.168.10.1 to network 0.0.0.0

C    192.168.10.0 255.255.255.0 is directly connected, ISP-10
C    192.168.20.0 255.255.255.0 is directly connected, ISP-20
C    10.1.1.0 255.255.255.252 is directly connected, failover-link
S*   0.0.0.0 0.0.0.0 [1/0] via 192.168.10.1, ISP-10

This time it took 24 seconds, which is better but still considerably worse than the subsecond failover time we can achieve with HSRP and cross site, dual uplinks from the ASAs. Repeat testing showed that primary -> secondary was always c. 30 seconds, but secondary -> primary could be much faster:

May 24 2011 15:02:54: %ASA-1-104001: (Primary) Switching to ACTIVE
 - Set by the config command.
May 24 2011 15:02:59: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.10.1,
distance 1, table Default-IP-Routing-Table, on interface ISP-10
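If you want to check the arithmetic, the gap between the "Switching to ACTIVE" line and the "Adding tracked route" line can be worked out from the log timestamps. A quick Python sketch, with the timestamps copied from the output above (the function name and labels are my own):

```python
from datetime import datetime

FMT = "%b %d %Y %H:%M:%S"


def seconds_between(start, end):
    """Seconds from one ASA log timestamp to another."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# (switch to ACTIVE, tracked route added) pairs copied from the output above
tests = {
    "primary -> secondary": ("May 24 2011 14:47:56", "May 24 2011 14:48:41"),
    "secondary -> primary": ("May 24 2011 14:56:35", "May 24 2011 14:56:59"),
    "secondary -> primary (repeat)": ("May 24 2011 15:02:54", "May 24 2011 15:02:59"),
}

for name, (start, end) in tests.items():
    print(f"{name}: {seconds_between(start, end)}s")
```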

Repeat test with physical interface failure

Here the failover took over a minute (61 and 88 seconds in the logs below), but it did still work. The ASA tracks its interfaces by default so no additional config was needed. As you can see, the failover times are rather uninspiring.

! Secondary to active

May 24 2011 15:11:13: %ASA-6-721002: (WebVPN-Secondary)
HA status change: event HA_STATUS_PEER_STATE, my state Standby Ready,
peer state Failed.
Switching to Active
May 24 2011 15:11:13: %ASA-1-104001: (Secondary)
Switching to ACTIVE - Other unit wants me Active.
Primary unit switch reason: Interface check.

May 24 2011 15:12:41: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.20.1,
distance 2, table Default-IP-Routing-Table, on interface ISP-20

! Primary returns to active

May 24 2011 15:10:58: %ASA-6-721002:(WebVPN-Primary) HA status change:
event HA_STATUS_PEER_STATE, my state Standby Ready, peer state Failed.

Switching to Active
May 24 2011 15:10:58: %ASA-1-104001:
(Primary) Switching to ACTIVE - Other unit wants me Active.
Secondary unit switch reason: Interface check.

May 24 2011 15:11:59: %ASA-6-622001:
Adding tracked route 0.0.0.0 0.0.0.0 192.168.10.1,
distance 1, table Default-IP-Routing-Table, on interface ISP-10

Now what?

Next time I’m going to see whether it is possible / desirable to run a VPN on this setup.

Posted in Cisco Networks, Firewall | Leave a comment

Joe Job Spam Run

The university received two spam campaigns. The first uses a forged sender to make a university address look like the sender; the second uses forged university addresses (i.e. addresses, not accounts) in an outgoing campaign to other sites, resulting in backscatter messages to Oxford account holders. This page answers some of the common end user and IT officer queries relating to this.

One of my coworkers got a spam apparently sent from my address, have I or the mail server been hacked?

Probably not if it was in the recent long weekend (29th April – 2nd May). Queries that I answered this morning received a variation on the following:

If it was this weekend then it was part of a Joe job email run; the key word here is ‘address’, not account. Emails are like postcards: they can be signed as ‘from’ anyone.
In this case someone has used Oxford addresses as the forged sender address in a spam campaign. This is known as a “joe job”: http://en.wikipedia.org/wiki/Joe_job

I’m afraid this is an aspect of the way email works that is misused by the spammers.
Note that the spams of this type from this weekend have a spam score of 25-30. If you are on Nexus, even the least sensitive filter option will catch these; there are instructions here:

http://www.oucs.ox.ac.uk/nexus/email/

I’m not keen on canned responses but there were too many queries not to prepare some form of template for it.

Are there any official university pages with generic information about spam?

Yes, try the chain and junk mail page

I’ve whitelisted ox.ac.uk as an incoming sender to my account…

Please don’t do this: it’s not needed and it will cause you more spam. Firstly, the ‘from’ address on incoming spam can be forged to an Oxford address and so bypass your filter settings. Secondly, mail from internal to internal hosts is not spam scored by the central mail relay, so for example mail from Engineering to Physics will never have a spam score applied by the university mail relay when sent via internal servers.

I had spam sent to me and I blocked the sender address in my account but now I get the same message from another address, and another…

The sender address is forged. It’s like writing who the sender is on a postcard: you could write anything you like, so there isn’t much point in blocking based on sender address for the standard types of spam.
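To show how little is involved, here is a Python sketch using the standard library’s email package. The From header is just text supplied by whoever composes the message. All addresses are made up and nothing is actually sent.

```python
from email.message import EmailMessage


def build_message(claimed_from, to, subject, body):
    """Build a message whose From header is whatever the author claims."""
    msg = EmailMessage()
    msg["From"] = claimed_from   # free text: nothing checks it at this point
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


# Same content, different claimed senders: blocking one address does nothing
# about the next one the spammer invents. All addresses here are made up.
first = build_message("alice@example.org", "you@example.org", "hello", "same text")
second = build_message("bob@example.net", "you@example.org", "hello", "same text")
```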

An email from a coworker in my same unit went to my spam folder, I thought you didn’t spam score internal mail?

Most likely you have Microsoft Outlook and the local client based mail filtering option is on. It uses rules from Microsoft rather than the university mail server rules and is best switched off due to a high number of false positives.

I’ve heard that you don’t filter any messages but pass them on to local units?

I was a bit surprised to hear someone suggest this. We don’t silently delete any email. We do SMTP time message rejection based on a number of criteria to reject the majority of spam and delivery attempts from compromised hosts, then we spam score the remainder and pass it on. We don’t accept mail and then silently drop it. We either accept and deliver or refuse to accept the message. We most certainly do apply anti spam techniques to incoming mail, to reject it as soon as possible. There are OUCS pages relating to the main mail relay.

Well then, why aren’t you filtering/rejecting this weekend’s messages?

We are. The messages getting to users’ inboxes are the tip of the iceberg: the majority of connection attempts in this spam run will have been rejected at the first delivery stage by our mail servers using various techniques. The emails that are accepted are then spam scored.

Why don’t you just block the sending host?

In my experience of attempting this, only the more quasi-legal advertising companies use a single address, a handful of addresses or a single network. We let the automatic blacklist updates that we receive take out the majority of sending hosts.

Why did this message get through? Here is my message header

From: university.address@ox.ac.uk
Sent: 03 May 2011 01:03
To: some.user@ox.ac.uk
Subject: from Cornelia
I'm an hot brunette girl, and I'm searching for a man to chat with [...]
I have registered my profile at:  www.some-site-beingspamvertised.ru

This isn’t a message header (there are instructions for obtaining message headers here: http://www.oucs.ox.ac.uk/email/headers/). It’s not that we’re being picky: the message headers tell us a lot of technical information about the message, such as which servers it went through and what score it got. Showing a message header usually results in an immediate explanation for a mail issue, since the majority of the information needed is usually contained in it.

I don’t know anything about message headers, just tell me how to get them, I use Lotus notes..

The networks team that run the mail relay don’t know about your local mail clients; we only know about message delivery (e.g. from the outside world to Nexus or to your unit’s own mail server, or between internal mail servers). Your primary point of contact is your unit’s local IT officers, who will know far more than I or my team members do about what choices your unit has made and about common issues and configuration with your chosen mail client.

I’m an IT officer, I’m looking at a message header, can you explain what’s going on roughly?

Yes. The first line we trust is where our mail relay takes the message. We know that the IP address it records as being the connecting server is correct (SMTP runs over TCP, not UDP, so sending packets with a forged IP address would be rather difficult since a three way handshake must complete, and our server’s replies go to the address that claims to be contacting us). Any other lines before this (i.e. earlier hops, which appear further down the header) may have been forged by the connecting server.


Received: from 188-115-172-147.broadband.tenet.odessa.ua ([188.115.172.147])
by relay0.mail.ox.ac.uk with esmtp (Exim 4.75)
(envelope-from <kfaczek@sbe-ltd.co.uk>)
id 1QHCLG-00051H-15 for some-address@herald.ox.ac.uk;
Tue, 03 May 2011 10:56:30 +010

So in the above line, our mail server relay0.mail.ox.ac.uk has accepted the mail from a server at 188.115.172.147. The spammer is using the older email addresses the university used to use as its source of contact addresses. We don’t care where our mail relay delivered the message next internally for this incident, so those lines aren’t shown.


Received: from 188.115.172.147(helo=herald.ox.ac.uk) by herald.ox.ac.uk with
esmtpa (Exim 4.69) (envelope-from) id 1MM13H-1826ej-31 for
<some-address@herald.ox.ac.uk>; Tue, 3 May 2011 11:56:29 +020

Here the connecting server at 188.115.172.147 has added a totally fake log line, perhaps to try to confuse analysis and/or to see if some form of whitelist will cause the message to be accepted due to the suggestion that an internal mail server has already processed it.
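The same reasoning can be sketched in Python: Received lines are added at the top as a message travels, so anything below the line written by our own relay was supplied by the connecting host and can’t be trusted. The header text here is a trimmed copy of the two lines above, and the function name is my own.

```python
from email import message_from_string

TRUSTED_RELAY = "relay0.mail.ox.ac.uk"

# Trimmed copy of the two Received lines discussed above
RAW = """\
Received: from 188-115-172-147.broadband.tenet.odessa.ua ([188.115.172.147])
 by relay0.mail.ox.ac.uk with esmtp (Exim 4.75)
 id 1QHCLG-00051H-15 for some-address@herald.ox.ac.uk
Received: from 188.115.172.147(helo=herald.ox.ac.uk) by herald.ox.ac.uk with
 esmtpa (Exim 4.69) id 1MM13H-1826ej-31 for <some-address@herald.ox.ac.uk>
Subject: example

body
"""


def split_received(raw, trusted_relay=TRUSTED_RELAY):
    """Return (trusted, untrusted) Received headers, newest first.

    Headers are prepended as a message travels, so everything below the
    line written by our own relay was supplied by the connecting host.
    """
    hops = message_from_string(raw).get_all("Received", [])
    for i, hop in enumerate(hops):
        if f"by {trusted_relay}" in hop:
            return hops[: i + 1], hops[i + 1 :]
    return [], hops  # our relay never saw it: trust nothing


trusted, untrusted = split_received(RAW)
```

Here `trusted` holds the relay0 line and `untrusted` holds the faked herald line. Matching on the relay’s name is good enough for a sketch; a real tool would work from a list of known internal hosts.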

[...]
x-oxmail-spam-level: ***********************************
x-oxmail-spam-status: score=35.1
tests=FH_HELO_EQ_D_D_D_D,HELO_DYNAMIC_IPADDR2,OX_RBL_MAPS[...]

This is the important bit, we’ve accepted the message so the sending host has passed a number of tests, but now we’ve spam scored the message.
Each test that is failed raises the score, we can see the message has a high spam score due to a high number of failed tests.
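If you are scripting against these headers, the score and the list of failed tests can be pulled out with a couple of regular expressions. This Python sketch assumes the "score=… tests=…" layout seen in the excerpt above; I have not checked it against every variation the relay can produce, and the function name is my own.

```python
import re


def parse_spam_status(value):
    """Split an x-oxmail-spam-status value into (score, [test names])."""
    score = float(re.search(r"score=(-?\d+(?:\.\d+)?)", value).group(1))
    tests_match = re.search(r"tests=([A-Za-z0-9_,]+)", value)
    tests = tests_match.group(1).split(",") if tests_match else []
    return score, tests


score, tests = parse_spam_status(
    "score=35.1 tests=FH_HELO_EQ_D_D_D_D,HELO_DYNAMIC_IPADDR2,OX_RBL_MAPS"
)
```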

I sent a copy of some spam to the OUCS phishing address, they didn’t seem too keen…

They’re only resourced to tackle phishing incidents targeting university account credentials – they can take actions to prevent users’ accounts being compromised (we have a legal obligation not to send spam) but standard spam isn’t the same. The phishing contact address is staffed by members of the security and networks team that have other tasks and can’t manually tackle each individual spam.

Ok, well here is my message headers, you should do something about this…

[...]

x-oxmail-spam-level: ******************************
x-oxmail-spam-status: score=30.5

Please turn on your account’s filter options or assist the user you are supporting to do so. We recommend treating anything with a spam score over 5 as probably spam, with the occasional false positive (hence we recommend moving it to a folder, not automatically deleting it). Anything with a spam score of over 12 is always spam (with the specific exception of the university security team, who email each other malware links as part of their daily work). This message scored over 30, which is high enough that even a very lax setting will filter it out.

What about SPF! I’ve heard SPF will fix things like this and I use it on my personal domain…

SPF isn’t a great solution – there are knock on issues with implementing it, it doesn’t solve all that many problems and some political changes would have to be made. Your personal domain isn’t complex, if implemented at Oxford we’d need to enforce/ensure that everyone is using the university mail servers when sending as anyone@unit.ox.ac.uk and ensure they are not using external mail servers (such as that provided by their ISP). If we achieved that then we would probably implement DKIM instead as a better technology. Note that SPF and DKIM assist with anti spam techniques but do not cure it.

I run a department mail server so based on your advice I’m going to silently delete any mail with a score over 5

Please don’t do this. You will delete legitimate correspondence, which is bad postmaster-ship: your users will come to think of email as silently unreliable and raise support queries to track each lost message. Email filtering is not a boolean (true or false, 1 or 0) operation. Messages over 5 are probably spam, with the occasional false positive; the recommendation is to filter these to a user’s spam folder. Messages over 12 should always be spam.

Ok, I run an internal mail server that accepts incoming mail from the central mail relays, what should I be doing?

Take note of the oxmail spam score. If there is no score (not 0, but no score at all, i.e. no x-oxmail-spam-level header) then the message has come from an internal server and I’d recommend you don’t run your own spam filter on it, but deliver it to the user. It’s very rare that an internal address is compromised and sends to internal addresses.

  • The X-Oxmail score includes scoring from SMTP time checks; it’s recommended that you use the score or at least take it into account in your own scoring mechanism
  • We suggest you filter messages with a score over 5 to a user’s spam folder, so they can check for false positives, but you or your users might change that level
  • If your users have Outlook deployed to them, turn off the Outlook based local spam filtering as it causes issues and will flag internal mails. It does not relate to the scoring applied by the mail relay but is controlled by Microsoft.
  • Check that postmaster@ and abuse@ for your ox.ac.uk domain work.
  • If Oxmail delivers something to your mailservers that your product flags as spam, please accept the message and spam score it to oblivion. If you drop the connection Oxmail will have to assume your server had a network issue and will try again and again for 10 days, then send a delivery failure message back to the sender (which, if the sender is forged, is backscatter and may result in a blacklisting of the university mail service).
  • Remember you will always have some degree of spam – there is no perfect cure on the internet to date, no matter what any vendor says or how clean your gmail account appears.
  • Turn off your unit level firewall’s port 25/SMTP inspection function – it causes issues
  • Check your SMTP logs first before raising queries with OUCS and you’ll answer most of your queries
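As a rough illustration of the delivery policy described above, here is a Python sketch. The function name and the dict-of-headers input are my own invention, and the threshold of 5 is the suggested default, not a fixed rule. The important property is that the only outcomes are "inbox" and "spam-folder": nothing is silently deleted or rejected after the relay has handed it over.

```python
def handle_message(headers):
    """Decide what a unit mail server does with mail from the central relay.

    headers: dict of lower-cased header names to values.
    Returns "inbox" or "spam-folder"; never "delete" or "reject".
    """
    status = headers.get("x-oxmail-spam-status")
    if status is None:
        # No score at all (not a score of 0): internal mail, deliver as-is.
        return "inbox"
    score = float(status.split("score=")[1].split()[0])
    # Over 5 is probably spam: file it where the user can check for
    # false positives instead of deleting it.
    return "spam-folder" if score > 5 else "inbox"
```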

Are there any stats? Can I see the filtering in action?

Yes there are graphs linked from the mail relay statistics page. You can see the increase in hosts rejected this weekend due to being blacklisted on the lists we utilise on the rejections graph

I have more questions/you left something out

End users can email help@oucs.ox.ac.uk , IT officers can get in touch with us about aspects of the server at networks@oucs.ox.ac.uk

Posted in Mail Relay | Leave a comment

DNS troubleshooting

I thought I’d write a quick reference for support staff not familiar with DNS troubleshooting

The basics:

DNS requests query a server to ask, for instance, what the IP address of a website is when all you know is the name (the common use from a desktop user’s perspective at least). For instance, if you want to visit the Google homepage your web browser will cause a DNS lookup to ask the DNS server what the IP address for www.google.co.uk is. Once the browser knows this it will then attempt an HTTP connection to that address, without the user having to memorise IP numbers.

DNS is not complicated, it is quite basic – it might help to think of it as similar to a phone directory lookup.

Hence there are issues that cannot be caused by DNS. For example if all traffic to and from your site is fine with the exception of traffic on a specific port, then this is a firewalling issue, not a DNS issue. You’ll still be able to resolve addresses, but not make connections. At this point, some managers might scream “but that doesn’t matter, I still can’t connect to the site, just fix it!”. These steps of ruling out one service or another are important for troubleshooting where the aim is to narrow down/rule out the possible causes to find the real cause and hence the correct fix in as short a time as possible. Experience and intuition may help, but guessing and leaping to conclusions hinders.

So here’s how to troubleshoot DNS issues, or rule them out as the cause of your problem, using tests and results:

If I want to check my local caching resolvers are answering queries:

Your host’s configuration (visible with ipconfig /all on Windows and a cat of /etc/resolv.conf on Linux) lists the DNS resolvers your client is currently using, e.g.:

nameserver 129.67.1.1
nameserver 163.1.2.1
nameserver 129.67.1.180

In short layman’s terms, your client asks these servers where websites and similar are; the servers then go and query the DNS servers that own the domain in question. We can use nslookup (which can be found on both Windows and Linux) to query DNS servers, so here we ask a specific DNS server where an example website is:

$ nslookup www.ja.net 129.67.1.1
Server:        129.67.1.1
Address:    129.67.1.1#53

Non-authoritative answer:
Name:    www.ja.net
Address: 212.219.98.101

As a result we now know that

  1. 129.67.1.1 is responding to DNS queries
  2. www.ja.net can be found at 212.219.98.101

It’s non-authoritative because our DNS resolver does not own the definitive data for the zone, it’s simply passing on what it has been told.
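For the curious, here is roughly what is on the wire. This Python sketch builds a query for an A record in the RFC 1035 format and checks the AA ("authoritative answer") flag in a reply header, which as far as I know is what nslookup bases its "Non-authoritative answer" line on. The function names and the transaction id are arbitrary, and nothing is sent over the network.

```python
import struct


def build_query(name, txid=0x1234):
    """Encode a DNS query for the A record of name (RFC 1035 wire format)."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ending with a zero byte
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN
    return header + qname + struct.pack("!HH", 1, 1)


def is_authoritative(response):
    """True when the AA (authoritative answer) flag is set in a reply."""
    flags = struct.unpack("!H", response[2:4])[0]
    return bool(flags & 0x0400)
```

A caching resolver passing on what it has been told leaves the AA flag clear; a server answering from zone data it holds itself sets it.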

Which DNS servers are authoritative for a domain?

Sometimes people are suspicious of the local resolver, thinking they need to send a support email to check it’s correct. It’s possible to check from any client that the resolver’s record is the same as that on the authoritative servers for a domain, by querying the DNS servers for that domain directly.

To find the list of nameservers we can use nslookup

$ nslookup -querytype=NS uclan.ac.uk

But let’s also introduce dig as an alternative to nslookup at this point. Windows users will either have to download it, log in to linux.ox.ac.uk [ssh and use your SSO account] or use a website based version. Take the +short off if you want the full gory details.

$ dig uclan.ac.uk NS +short
jans2.uclan.ac.uk.
jans.uclan.ac.uk.
ns1.ja.net.

We can query these DNS servers directly for a domain if we suspect an issue with local resolvers.

e.g. with nslookup

$ nslookup www.uclan.ac.uk jans.uclan.ac.uk
Server:        jans.uclan.ac.uk
Address:    193.61.255.89#53
Name:    www.uclan.ac.uk
Address: 193.61.253.9

or via dig

$ dig www.uclan.ac.uk @jans.uclan.ac.uk +short
193.61.253.9

Under what circumstances would a local resolver give a different answer to an authoritative server?

When a record has been updated after the resolver had already queried it, and the TTL (an instruction from the DNS server about how long querying machines should store the record for rather than ask again) has not yet expired. Which leads us on to the next query, which happens several times a year:

I’ve just changed a record for my external domain and I’m getting different answers from the university nameservers!

In this example a domain has had a TTL of 24 hours (i.e. it’s telling software that queries it to please cache the record for 24 hours and not ask again until that time is up). Someone has then changed a record; we can see the cached record on our resolvers with the following command:

$ dig www.oxford-union.org @<resolver address>

In this case resolvers 0, 1 and 2 have respectively:

www.oxford-union.org.   84759   IN      A       213.129.83.29
www.oxford-union.org.   5020    IN      A       89.167.235.71
www.oxford-union.org.   1672    IN      A       89.167.235.71

where the number is the TTL in seconds that the domain has stated the record is to be cached for at the time of query, minus the seconds it’s been in our cache.

They have a default TTL of 86400 on the new record, which is 24 hours; I assume they had that on their old record. We can see we’ve 5020 seconds (about 84 minutes) until the last of the old records expires and a new lookup is performed.
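The arithmetic can be checked with a few lines of Python. The remaining TTLs are copied from the dig output above; like the text, the sketch assumes the old record also had a 24 hour TTL.

```python
FULL_TTL = 86400  # the 24 hour TTL on the record

# Remaining TTLs reported by resolvers 0, 1 and 2 in the output above
remaining = {"resolver 0": 84759, "resolver 1": 5020, "resolver 2": 1672}

for name, ttl_left in remaining.items():
    cached_for = FULL_TTL - ttl_left
    print(f"{name}: cached {cached_for}s ago, expires in {ttl_left / 60:.0f} min")

# Resolver 0 already has the new answer; of the two still holding the old
# one, the last to drop it is the one with the most time left.
stale = {k: v for k, v in remaining.items() if k != "resolver 0"}
last_to_expire = max(stale, key=stale.get)
```

Resolver 0 fetched its copy 1641 seconds before the query, which is why it already holds the new address.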

Yes, but I changed my site? Please flush/reload your nameservers as your DNS is broken

Before making a critical DNS change, reduce your TTL in advance. For instance you might change a 24 hour TTL to a 5 minute one, more than 24 hours before the change takes place; that way all visitors will see at most 5 minutes of difference in results on the day the change is made.
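The timing rule is simple enough to sketch in Python: the short TTL has to be published at least one full old TTL before the change, because a cache may have fetched the long-TTL record just before you shortened it. The function name and the example date are made up for illustration.

```python
from datetime import datetime, timedelta


def lower_ttl_deadline(change_at, old_ttl_seconds):
    """Latest moment to publish the short TTL before a planned change.

    Caches may hold the old long-TTL answer for up to old_ttl_seconds, so the
    short TTL has to be published at least that long before the change.
    """
    return change_at - timedelta(seconds=old_ttl_seconds)


# Hypothetical change at 09:00 on 8 June with the existing 24 hour TTL
change_at = datetime(2011, 6, 8, 9, 0)
deadline = lower_ttl_deadline(change_at, 86400)
```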

Do not leave the TTL at a high value, make a change to your domain records and then email every popular service provider asking what’s wrong with their DNS, asking why they still have the old record cached and demanding they fix it. That method is not scalable/sustainable (imagine if every site/domain on the internet did that).

But why aren’t the resolvers’ cached records in sync with each other?

The DNS resolvers/caches are not in sync with each other – they don’t need to be, they are operating as a standards compliant DNS should. The authoritative DNS servers are in sync (they hold the same records for the domains they ‘own’).
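A toy model in Python may make this clearer. Each resolver keeps its own cache and only goes back to the authoritative data when its own copy has expired, so two resolvers that first looked the name up at different times can legitimately hold different answers. The class, names and addresses (documentation ranges) are invented for illustration and are nothing like a real resolver implementation.

```python
class ResolverCache:
    """A toy cache: one answer per name, kept until its TTL runs out."""

    def __init__(self):
        self._store = {}  # name -> (answer, expiry time)

    def lookup(self, name, now, authoritative, ttl):
        answer, expires = self._store.get(name, (None, 0))
        if now >= expires:
            # Nothing cached, or the TTL has run out: ask the authoritative data.
            answer, expires = authoritative[name], now + ttl
            self._store[name] = (answer, expires)
        return answer


zone = {"www.example.org": "192.0.2.1"}
r0, r1 = ResolverCache(), ResolverCache()

# r1 caches the old record, then the owner changes it
r1.lookup("www.example.org", now=0, authoritative=zone, ttl=86400)
zone["www.example.org"] = "192.0.2.2"

# An hour later r0 looks it up for the first time; r1 still has its cached copy
new = r0.lookup("www.example.org", now=3600, authoritative=zone, ttl=86400)
old = r1.lookup("www.example.org", now=3600, authoritative=zone, ttl=86400)
```

Here `new` is the updated address and `old` is the cached one; once r1’s TTL runs out it fetches the new record too, with no flush needed.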

None of my domain resolves at all, it’s only the university affected, it must be an issue with your DNS

Remember to verify the facts being reported to you before acting on them. DNS has caching effects, so it could be that some sites have older records cached, affecting what the user is reporting. A typical scenario might be that the user sees it working on their home broadband (where the resolvers have the record cached) but not in the university and so defines the problem as being with the university.

For instance, let’s create a scenario where a student magazine site is claiming the university DNS is broken as their site will not load in the university. The first thing we do is a quick check of what is being reported:

~$ host www.cherwell.org
Host cherwell.org not found: 3(NXDOMAIN)

OK, so before we leap to conclusions let’s ask the authoritative nameservers for the domain what’s going on. First we need to know what the nameservers are:

$ dig cherwell.org NS
[...]
;; ANSWER SECTION:
cherwell.org.           86186   IN      NS      ns1.ospl.org.
cherwell.org.           86186   IN      NS      ns2.ospl.org.

Now we query them

$ dig www.cherwell.org @ns1.ospl.org.
dig: couldn't get address for 'ns1.ospl.org.': not found

Er, that shouldn’t happen. Let’s double-check… (and repeat for the second nameserver)

$ host ns1.ospl.org
Host ns1.ospl.org not found: 3(NXDOMAIN)
$ dig ns1.ospl.org
$

So (only in this scenario – the site in question doesn’t have this issue in real life at the time of writing) the issue here is that our nameservers can’t query records for a domain whose published nameservers don’t resolve – we can’t find them in order to ask them questions. It won’t just be our site affected, but it may be reported as such by users while cached records are still present at other service providers.

Just as a comparison, here’s how it should be for those same commands, using a different site

$ dig oxfordstudent.com NS +short
ns1.flirble.org.
ns4.flirble.org.
ns0.flirble.org.
ns2.flirble.org.
ns3.flirble.org.

each of which resolves fine

$ dig ns4.flirble.org. +short
207.162.195.200

Most of the internet is down! Your resolvers are broken!

Stay calm, troubleshoot the problem in a controlled manner. Gather repeatable/testable evidence. Start with the most basic assumptions:

  • Do you have network connectivity – can you ping your gateway?
  • Which DNS resolvers are you using (university or local?) e.g. cat /etc/resolv.conf or ipconfig /all
    • Note that if using your own resolver and looking up external domains the central university DNS will not be involved
  • Can you perform a name lookup against your resolvers from the answer above, e.g. dig www.oucs.ox.ac.uk @163.1.2.1
  • Can you perform a name lookup of the specific site you want?
    • If not who runs the authoritative name servers for that domain? dig example.com NS
    • Now what happens if we query these name servers directly? dig www.example.com @ns1.example.com
    • If they don’t know their own records then they’ve broken their domain
    • If they do know but the resolvers don’t, then you’ve broken local resolvers. (this is not the same as having a cached record however)
  • If you can perform a name lookup of the site and get a correct answer, then it is not DNS that is the issue. You have ruled out DNS and can now concentrate on other areas of troubleshooting. (E.g. some other cause – service not configured to listen, host down etc).
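The checklist above can be written down as a small decision function. This Python sketch takes the True/False outcome of each manual test and names the most likely culprit; the argument names and the wording of the results are my own, and it deliberately ignores the cached-record caveat mentioned in the list.

```python
def diagnose(gateway_ok, resolver_answers, site_resolves,
             authoritative_answers=None):
    """Walk the checklist above and name the most likely culprit.

    Each argument is the True/False result of the corresponding manual test;
    authoritative_answers is only needed when the site lookup fails.
    """
    if not gateway_ok:
        return "no network connectivity"
    if not resolver_answers:
        return "resolver not answering"
    if site_resolves:
        return "not DNS: look at firewalling, the service or the host"
    if authoritative_answers:
        return "local resolvers are at fault"
    return "the domain's own DNS is broken"
```

For the made-up cherwell.org scenario earlier, the gateway and resolvers were fine, the site lookup failed and the authoritative servers could not be reached, so the function returns "the domain's own DNS is broken".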
Posted in DNS | Leave a comment