Copyright © 2005 jsd
Routing for Linux-IPsec
John S. Denker
First of all, before considering what a Virtually Private Network ought to look like, let’s recall what a Physically Private Network looks like. Suppose Moe goes down to the store and buys some interface cards and some wire. He strings the physically private wire from his East machine to his West machine. He installs the interface cards and configures them as eth2 in each case. Therefore the picture looks like:
                eth2~~~~(private)~~~~eth2
               /                         \
            West                         East
           /    \                       /    \
Sunset===eth0  eth1-----\       /-----eth1  eth0===Sunrise
                         (cloud)
                          /   \
                 attackers     foo.net

Figure 1
Moe assigns suitable unique IP addresses to the eth2 interface on each machine. Now West can act like a router. It can advertise to its friends on Sunset-net that it can handle traffic destined for East and for the Sunrise-net. West also still has a route to foo.net. West sends traffic to foo.net over its public interface (eth1) but sends traffic to Sunrise-net over its private interface (eth2).
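As a rough illustration of the routing decision West makes here, the following Python sketch picks an outgoing interface by longest-prefix match. The prefixes are documentation-only addresses that I have made up for the example; nothing in the setup above dictates them.

```python
import ipaddress

# Hypothetical prefixes, chosen only for illustration.
ROUTES_ON_WEST = [
    # (destination prefix, outgoing interface)
    (ipaddress.ip_network("0.0.0.0/0"), "eth1"),        # default: public side (foo.net etc.)
    (ipaddress.ip_network("192.0.2.0/24"), "eth0"),     # Sunset-net, directly attached
    (ipaddress.ip_network("198.51.100.0/24"), "eth2"),  # Sunrise-net, via the private wire
]

def pick_interface(dest, routes=ROUTES_ON_WEST):
    """Return the interface of the most specific route that matches dest."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, ifc) for net, ifc in routes if addr in net]
    if not matches:
        raise LookupError(f"no route to {dest}")
    # Longest prefix wins, as in ordinary IP routing.
    _, ifc = max(matches, key=lambda m: m[0].prefixlen)
    return ifc
```

With these assumed prefixes, a Sunrise-net destination leaves via eth2, while anything else that is not local falls through to the default route on eth1.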
In the case of subnet-to-subnet traffic (where we wish to route traffic between the Sunrise subnet and the Sunset subnet) or host-to-host traffic (from East itself to West itself), then the eth2 interfaces can easily be given 10-net addresses. Alas this might not work for traffic from host to subnet. If we want to allow host-to-subnet traffic, we should probably give the eth2 interfaces their own nonprivate addresses.
A second possibility (OK in some cases) would be to arrange that Sunrise-bound traffic originating on West uses the address of eth0 as its source address when it leaves via eth2 (but not when it leaves via eth1). (This could be achieved by setting eth2 to have the same “default source address” as eth0.)
AFAICT in no case does it make sense to have eth2 use the same address as eth1; that would make it unnecessarily difficult for applications on West (or anywhere on the Sunset subnet) to control whether they talk to East over the public link or the private link.
All this generalizes nicely to the case of more than two machines. East, West, and N−2 other machines can all plug into a private hub via private wires. This is shown in figure 2.
                                         Orient
                                         /    \
                              /~~~~~~eth2      eth0===|
                          (hub)                       |
                eth2~~~~~~/   \~~~~~~eth2           (...)
               /                         \            |
            West                         East         |
           /    \                       /    \        |
Sunset===eth0  eth1-----\       /-----eth1  eth0===Sunrise
                         (cloud)
                          /   \
                 attackers     foo.net

Figure 2
Note that both East and Orient are advertising a route to Sunrise.
It is clear that Orient.eth2 needs to have an address different from East.eth2, because otherwise West wouldn’t know how to do the ARP that it must do before putting packets on the private wire.
A disadvantage of the scheme shown in figure 2 is that East sees the packets destined for Orient, and vice versa. East and Orient are protected from the attackers out in the cloud, but they are not protected from each other. This disadvantage is remedied by the scheme shown in figure 3, which does away with the hub and uses point-to-point wires instead.
It turns out that figure 3 will be a good model for how IPsec tunnels work.
Note that there is no private wire directly between East and Orient; if you want such a wire you need to install it in addition to the West-East and West-Orient wires. Otherwise you can route the traffic from East to Orient via West, with the risk that West will snoop it.
                                         Orient
                                         /    \
                eth3~~~~~~~~~~~~~~~~~eth2      eth0===|
               /                                      |
              / eth2~~~~~~~~~~~~~~~~~eth2           (...)
             / /                         \            |
            West                         East         |
           /    \                       /    \        |
Sunset===eth0  eth1-----\       /-----eth1  eth0===Sunrise
                         (cloud)
                          /   \
                 attackers     foo.net

Figure 3
The obvious question is, why should not a Virtually Private Network provide applications with exactly the same interface as a Physically Private Network? Well, it can’t be exactly the same, as discussed in section 2.2, but first let’s see how nearly the same we can make it.
The best design is analogous to figure 3. We treat each tunnel as a point-to-point connection, as shown in figure 4:
                          (cloud)                       Dawn
                             .   \~~~~~eth1        eth2===|
                             .             \      /
                             .              Orient
                             .             /      \
                       /~~~~~~~~~~~~ipsec:1        eth0===|
                ipsec:2      .                            |
                ipsec:1~~~~~~~~~~~~~ipsec:1             (...)
               /             .             \              |
            West             .              East          |
           /    \            .             /    \         |
Sunset===eth0  eth1-----\    .    /-----eth1  eth0===Sunrise
                          (cloud)
                           /   \
                  attackers     foo.net

Figure 4
In figure 4, the ipsec:1 and ipsec:2 interfaces are abstractions. In some respects they are just like the physical devices eth2 and eth3 in figure 3, but in other respects they are different. From the point of view of an application program (or a routing daemon) we want the ipsec devices to behave as much like physical devices as possible. They are capable of sending and receiving IP packets (the raw packets) just like other devices. On the other hand, we don’t need to physically open the computer cabinet to install them, and we don’t need to physically run wires to make connections. They also create encrypted packets (ESP envelopes) which must be transported through some other (physical) links. As shown by the dotted line in figure 4, Orient must have a connection (via its eth1) to the same cloud that West is connected to, so that there is a physical path to carry the envelopes. These envelopes are causally related to, but must be kept conceptually distinct from, the raw packets.
We should not take the notion of “virtual wire” too literally. IPsec (as its name implies) is an IP security protocol. It carries IP and nothing else. IP is a layer-3 creature. In contrast, a real ethernet wire has some very specific layer-2 functionality (such as broadcast, multicast, and unicast) plus the ability to carry more than 100 different layer-3 protocols, of which IP is just one.
Suppose you want your “virtual wire” to carry something other than plain old IP; perhaps NetBEUI, Banyan VINES, Novell IPX, or whatever. You can’t expect IPsec to do the job by itself. The obvious solution is to establish IPsec connections (presumably host-to-host connections using transport mode). Then run something like GRE (Generic Routing Encapsulation) or L2TP (Layer 2 Tunneling Protocol) which handles the non-IP traffic, encapsulates it in IP, and sends the IP over the IPsec connections. Maybe someday somebody will write some sort of “L2sec” protocol to provide secure layer-2 connectivity all in one package, but don’t count on it.
See the comparison between tunnel mode and GRE+encryption in section 3.1.
One can imagine a multi-access scheme similar to figure 4, except that the ipsec device is analogous to an ethernet card rather than to a point-to-point link. However, we will not discuss it because it is a less-accurate analogy to the way IPsec actually operates. With a multi-access device, if you have connections between A and B, plus connections between B and C, you automatically have a connection between A and C, but that is not the case for IPsec tunnels.
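The difference between the two models can be made concrete with a toy Python sketch. The host names A, B and C are the placeholders from the paragraph above, and both functions are my own simplification, not a description of any real IPsec code.

```python
# Tunnels are unordered pairs of hosts; A, B, C are placeholder names.
tunnels = {frozenset(("A", "B")), frozenset(("B", "C"))}

def directly_connected_p2p(x, y, links=tunnels):
    """Point-to-point model: connected only if a tunnel joins exactly x and y."""
    return frozenset((x, y)) in links

def directly_connected_multiaccess(x, y, links=tunnels):
    """Multi-access model: every host on the shared medium reaches every other."""
    members = set().union(*links)
    return x != y and x in members and y in members
```

Under the point-to-point model A and C have no direct connection, which matches how IPsec tunnels behave; under the multi-access model they would.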
Multi-access has other disadvantages, as discussed below.
Suppose West is a FreeS/WAN box that is connected to two different wild-side ISPs. It has two different wild-side addresses. It would be very nice to set up two connections to Sunrise-net. This would allow load-sharing, and it would also improve reliability if we could arrange failover from one connection to the other.
Unfortunately, if you try to set up multiple connections to the same subnet (Sunrise subnet) using FreeS/WAN version 1, the result is fratricide. The second connection is treated as a replacement for the first.
You can work around this by not making subnet connections. Instead, as suggested by Joe Patteson, make host-to-host connections between East and West, and then run GRE-encapsulated traffic over these connections, as discussed in section 2.2. You can perform load-balancing and failover with respect to the GRE virtual devices.
If East has N ISPs and West has M ISPs, you could conceivably want M×N IPsec connections.
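The M×N count is just the number of (West address, East address) pairs. A small Python sketch, using made-up documentation addresses for the wild-side interfaces:

```python
from itertools import product

def candidate_connections(west_addrs, east_addrs):
    """Enumerate every (west wild-side address, east wild-side address) pair."""
    return list(product(west_addrs, east_addrs))

# Hypothetical addresses: West has M=2 ISPs, East has N=3.
WEST = ["192.0.2.1", "198.51.100.1"]
EAST = ["203.0.113.1", "203.0.113.65", "203.0.113.129"]
PAIRS = candidate_connections(WEST, EAST)
```

Here PAIRS holds 2×3 = 6 candidate connections, one per pair of ISP addresses.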
Note the following imperfect parallelism:
On the other hand, it is not 100% correct to view tunnel mode as IPIP plus encryption, because real tunnel mode allows the system to verify that the correct subnet is being carried over the tunnel; that is, the inner headers can be checked. In contrast, using IPIP+encryption or GRE+encryption, the inner headers are just data in some obscure format that cannot be checked by the IPsec system per se.
On the third hand, since FreeS/WAN relies on you, the user, to implement the security policy anyway, using the firewalling system, you can perfectly well check the packets as they enter the GRE device. So all the necessary functionality can be achieved.
Also note that the GRE links, as the name implies, were designed with routers in mind, so their up/down state is meaningful, and their metric is meaningful to routing daemons.
It sure would be nice if the next-generation IPsec system could do this itself, without requiring us to resort to GRE. We need multiple tunnels to the same subnet, with meaningful up/down state and meaningful metrics. And we need the ipsec device to play nicely with routing daemons.
Sometimes West (or somebody on the Sunset subnet) will want to send traffic to East without IPsec encryption. This might be useful for maintenance if the IPsec tunnel is misbehaving. Such traffic should be routed to the East.eth1 interface.
More often, though, West (and folks on the Sunset subnet) will want to send traffic to East with IPsec encryption. This traffic should be routed to the East.ipsec interface. Therefore...
Conclusion 1: From West’s point of view it is important for the East.ipsec device to have an address different from the East.eth1 device.
This conclusion stands in contrast to the way things are done in linux-ipsec version 1, where the ipsec device is “attached” to a physical interface and shares its IP address.
The numbering of the ipsec devices is important when we consider the details of the interaction between the tunneling system and the higher-level routing system (gated). When a tunnel goes up or down, we can notify gated that the corresponding interface (ipsec:N) is going up or down.
When considering the gated interaction, it helps to keep the following in mind:
Note that in contrast, for multi-access interfaces such as ethers, the “unique address” is the near-end address (source address). This inconsistency is not well documented. The far-end numbering trick puts an icky conceptual burden on folks like us, for whom gated is not the central focus of our lives. But it does open the door to having an unlimited number of point-to-point interfaces.
We continue our discussion of the destination addresses of raw packets. These are the packets that ordinary applications send and receive. The destination address determines how the raw packets find their way to the ipsec device as opposed to the eth1 device. (In a later section I will discuss what gets marked on the outer headers of ESP envelopes, and how they get routed.)
Let’s consider the case of traffic originating on the Sunset subnet and bound for the Dawn subnet. In this case, we can look up the Dawn address in the routing table and discover that such traffic is handled by the ipsec:2 device. The address of Orient will be listed in the routing table as the “nexthop” for such traffic.
Conclusion 2: From West’s point of view, it is important for the East.ipsec device to have an address different from the Orient.ipsec device. This is noncontroversial; linux-ipsec has always had this property.
We now switch our focus from destination addresses to source addresses.
For traffic that West is just forwarding (including subnet-to-subnet traffic) this is a non-issue; the packets keep whatever source-address was put on them at their origin. OTOH, we have to do something intelligent for traffic that originates on West.
The essential requirement is that packets (raw packets) originating on West need to have a source address that is meaningful to the ultimate recipient, so that it can be used for return traffic. For host-to-host traffic, a private 10-net address suffices. However, for host-to-subnet traffic (e.g. West to Sunrise-net), a nonprivate source address will be needed if (and only if) the subnet contains routers that are not under local “private” control.
In some cases, it may suffice to use the same address as West.eth0. This works if and only if hosts on Sunrise-net have enough control over their routing to ensure that return traffic that is supposed to be secure is routed via East to West.ipsec and not routed to the real unencrypted West.eth0. In the situation diagrammed in figure 4 there is no physical route from Sunrise to West.eth0. In a real-world situation things could be much messier, because West could be advertising the physical unencrypted route to West.eth0, while East is advertising the encrypted tunnelled route to the same address. With enough work, you could handle any particular case, but it might be hard to handle the general case in a provably-correct way.
The proof of correctness would be a whole lot easier if West.ipsec had its own IP address. Then traffic could be routed to that address in the natural way. Therefore:
Conclusion 3: From West’s point of view,
Again, this conclusion stands in contrast to the design of linux-ipsec version 1, in which each ipsec device is “attached” to a physical interface such as eth1, shares its IP address, and sends ESP envelopes only over the attached interface.
Note that the design sketched above means that KLIPS need not contain any code to implement the rightsubnet and leftsubnet parameters. Packets destined for the VPN would find their way to the ipsec device according to the same routing rules that apply to the Physically Private Network described in section 1.
Indeed, the leftsubnet and rightsubnet parameters might or might not remain in ipsec.conf; their function could equally well be performed by entries in ifcfg-ipsec:*.
In the linux-ipsec version 1.5 documentation, in file configuration.html it says:
Normally "no". Set to "yes" if the IP forwarding option is disabled in your network configuration. (This can be set as a kernel configuration option or later. e.g. on Redhat, it’s in /etc/sysconfig/network and on SuSE you can adjust it with Yast.) Linux FreeS/WAN will then enable forwarding when starting up and turn it off when going down. This is used to ensure that no packets will be forwarded before IPSEC comes up and takes control.
1) It is important for newbies to test basic (non-IPsec) forwarding before they start fooling with IPsec extruded subnets. Otherwise the mailing list will be flooded with cries for routine help.
2) OTOH recommending forwardcontrol=no does not seem very security-friendly. Unless the luser uses forwardcontrol (or does something even more wizardly with ipchains) then "ipsec setup stop" becomes a security hole in some plausible circumstances. Also there is a brief window during system startup when packets that should be encrypted pass in the clear, because forwarding is enabled but pluto hasn’t done its job yet.
3) The documented behavior of this feature is suboptimal. When IPsec is going down, it shouldn’t turn forwarding off – it should put it back the way it found it, whatever that may have been. Among other things this means forwardcontrol=yes won’t interfere with testing.
4) In the short run, a compromise between consideration (1) and consideration (2) is to recommend something like:
5) In the longer run, a good redesign would get rid of this whole issue. If we establish distinct addresses for the public interface and private interface of the IPsec hosts with which we communicate, then
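Consideration (3) above, putting forwarding back the way it was found rather than unconditionally turning it off, can be sketched in a few lines of Python. This is only a model: a plain dict stands in for the kernel's forwarding setting, and the names `forwarding_enabled` and `ip_forward` are mine, not anything taken from the FreeS/WAN scripts.

```python
from contextlib import contextmanager

@contextmanager
def forwarding_enabled(state, key="ip_forward"):
    """Enable forwarding while IPsec is up, then restore the prior value.

    `state` is a plain dict standing in for the kernel setting.
    """
    previous = state.get(key, False)   # remember how we found it
    state[key] = True                  # IPsec coming up: enable forwarding
    try:
        yield state
    finally:
        state[key] = previous          # IPsec going down: restore, don't force off
```

If forwarding was already on before IPsec started (say, because somebody was testing plain forwarding), it is still on afterwards; if it was off, it goes back to off.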
The proposed restructuring is not just an increase in elegance; it is a crucial increase in functionality. Note that in figure 4, West has two routes to Sunrise-net: one via East and one via Orient. It is important to allow both of these tunnels to exist, because:
Note that for load-switching, you don’t want to wait until one tunnel fails before bringing up its sibling. For load-sharing, it is even more obvious that you want both siblings to exist at the same time.
In the world of IP, it is perfectly normal, indeed commendable, to have multiple routes to the same place. The ambiguity as to how to reach the destination can be removed by comparing the cost associated with each route, or (failing that) removed arbitrarily.
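A minimal sketch of that selection rule, in Python, might look like the following. The `Route` record and its field names are assumptions made for the example; the metrics are arbitrary.

```python
from dataclasses import dataclass

@dataclass
class Route:
    dest: str        # destination subnet, used here simply as a label
    via: str         # outgoing interface, e.g. "ipsec:1"
    metric: int      # lower is preferred
    up: bool = True  # link state, as reported to the routing daemon

def best_route(routes, dest):
    """Pick the cheapest usable route to dest; ties go to the first one listed."""
    usable = [r for r in routes if r.dest == dest and r.up]
    if not usable:
        return None
    return min(usable, key=lambda r: r.metric)

# Two sibling tunnels to the same subnet, as in figure 4 (metrics are made up).
ROUTES = [Route("Sunrise", "ipsec:1", 10), Route("Sunrise", "ipsec:2", 20)]
```

Both siblings exist at once; marking the first one down makes `best_route` fall over to the second without any tunnel having to be created at that moment.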
This stands in contrast to linux-ipsec version 1, which only allows one tunnel to a given subnet. It identifies tunnels by the subnet they serve, so if you install a second one it simply steals the route from the first one. This code was put in to handle renewal and expiration of IPsec Security Associations (SAs), but it has unfortunate side effects.
There will never be a one-to-one correspondence between SAs and ipsec:N aliases, because an SA comprises quite a number of things:
A given ipsec:N alias (link) must in general handle more than one SA (multiple subnet routes).
A craftier way to handle renewal would be to check if the new SA specifies the same subnet and the same link as an older one; if so, it can supersede the older one. If it specifies the same subnet and a new link, we need to create a new ipsec:N alias and a new entry in the routing table.
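The renewal rule just described could be sketched like this in Python. The class, its attribute names, and the use of a plain string as the "link" identifier are all hypothetical; a real implementation would key on whatever identifies the remote gateway.

```python
class TunnelTable:
    """Tracks which (subnet, link) pairs have an SA and assigns ipsec:N aliases."""

    def __init__(self):
        self.alias_for_link = {}   # link id -> "ipsec:N"
        self.sa_for = {}           # (subnet, link id) -> SA label
        self.routes = []           # (subnet, alias) routing-table entries

    def install_sa(self, subnet, link, sa):
        """Return 'superseded' if an older SA was replaced, else 'added'."""
        key = (subnet, link)
        if key in self.sa_for:
            # Same subnet and same link: this is a renewal.
            self.sa_for[key] = sa
            return "superseded"
        if link not in self.alias_for_link:
            # New link: create a new ipsec:N alias for it.
            self.alias_for_link[link] = f"ipsec:{len(self.alias_for_link) + 1}"
        self.sa_for[key] = sa
        self.routes.append((subnet, self.alias_for_link[link]))
        return "added"
```

Installing a second SA for Sunrise via the same link replaces the first, whereas an SA for Sunrise via a different link adds a second route instead of stealing the first one.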
The foregoing talks about routing above the tunnel layer. Now let’s talk about routing below the tunnel layer.
IMHO the best way to proceed is to recognize that any tunnel really consists of two parts: a top half and a bottom half. One half sits in the OSI layer model below the routing and firewalling, and acts like a device; the other half sits above the firewalling and routing stuff and acts like an application. One possible viewpoint is to view it as a linear process, as shown in figure 5.
 (1) generic applications
      \
       \
     (2) firewall
          |
     (3) routing tables
       /
      /
 (4a) IPSEC virtual ifc
          |
 (4b) KLIPS “application”
      \
       \
     (5) firewall
          |
     (6) routing tables
           \
            \
        (7) real ifcs

Figure 5
Note that raw packets are routed to the ipsec device (step 3) and ESP envelopes are routed to the real interfaces (step 6). If the ipsec device shares the IP address of one of the real interfaces, this can cause routing loops.
Linux-ipsec version 1 circumvented this problem by circumventing the routing machinery, skipping steps 5 and 6 and connecting KLIPS directly to the real interface. The linux-ipsec developers have described this as a kludge.
Other ways of solving this problem include (a) giving the ipsec device its own IP address, and/or (b) using policy routing to implement multiple routing tables, so that raw packets use one table and ESP envelopes use another.
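To see why a shared address loops and why option (a) cures it, here is a deliberately oversimplified Python model: a single routing table with exact-match lookups, in which every packet routed to the ipsec device is re-emitted as an envelope and routed again. The function name and the addresses are invented for the illustration.

```python
def deliver(dest, table, envelope_dest, max_passes=8):
    """Route dest through one table until it reaches a real interface.

    Each time the lookup selects the ipsec device, the packet is wrapped in
    an ESP envelope addressed to envelope_dest and routed again.
    Returns the real interface name, or "loop" if it never gets there.
    """
    for _ in range(max_passes):
        device = table[dest]
        if device != "ipsec":
            return device        # reached a real interface
        dest = envelope_dest     # KLIPS wraps the packet; now route the envelope
    return "loop"

# Hypothetical addresses for East.
SHARED = {"203.0.113.2": "ipsec"}                           # one address for both roles
DISTINCT = {"10.9.0.2": "ipsec", "203.0.113.2": "eth1"}     # East.ipsec vs East.eth1
```

With the shared table the envelope is routed straight back into the ipsec device; with distinct addresses the raw packet goes to the ipsec device and the envelope goes out eth1.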
Another viewpoint on the same situation is shown in figure 6.
                       . . . . . . . . .
                       |               ^
  (1) generic        (4b) KLIPS        ^
    applications     “application”     ^
            \         /                ^
             \       /                 ^
           (2) firewall                ^
                 |                     ^
           (3) routing                 ^
               tables                  ^
             /       \                 ^
            /         \                ^
   (4a) IPSEC        (7) real ifcs     ^
   virtual ifc             |           ^
        |            physical links    ^
        |                              ^
        |                              ^
        . . . . . . . . . . . . . . .  ^

Figure 6
Typically the generic application emits a packet which follows the path (1)-(2)-(3)-(4a) whereupon the raw packet is absorbed. For present purposes it is advantageous to think of a completely new packet, an ESP envelope, being created; it follows the path (4b)-(2)-(3)-(7). (Of course there is a causal connection between the raw packet and the ESP envelope, as shown by the dotted line, but the routing strategy doesn’t know or care about that.)
Figure 6 makes it somewhat easier to see how what’s going on fits into the OSI layer model.
There are a number of advantages to changing KLIPS so that it sends ESP envelopes through the kernel routing tables, rather than having KLIPS bypass the kernel functions and implement its own kludgy routing.
The advantages (expressed using West’s point of view; West=left; East=right) include:
1) We don’t need to put the IP address of our peer (East) into the ipsec.conf file. For incoming connections, we use whatever IP address our peer happens to be using; then we use RSA for identification and authentication. For outgoing connections, we use DNS (preferably secure DNS) to find the wild-side address of our peer; then we authenticate.
2) We don’t need to put the IP address of rightnexthop in the file. We never did. The only reason for ever putting it there is if you wanted East’s .conf file to be identical to West’s .conf file – and if you are using RSA you can’t achieve that anyway.
3) We don’t need to put the IP address of leftnexthop in the file. Kernel routing of the envelopes takes care of that. Leftnexthop is only necessary if you are kludgily bypassing the kernel routing machinery.
4) We don’t need to put the IP address of West’s real interface(s) in the .conf file. Ifconfig can take care of that, based on the contents of network-scripts/ifcfg* files. If we are going to do that, we should also create an ifcfg-ipsec file, to assign an address to the ipsec virtual device; otherwise Pluto would have a hard time figuring out how to orient the connections in the .conf file.
5) We no longer need the concept of “eroutes” AFAICT. Policy routing of the raw packets (on their way to the ipsec interface) seems powerful enough to implement all the selectors mentioned in reference 9 section 4.4.2.
6) Suppose you have an IPsec portal with several physical interfaces and hundreds of tunnels. Suppose the ISP that serves one of the wild-side physical interfaces (eth1 for example) changes the IP address on the fly. You would really like to be able to start using the new address without disrupting any of the tunnels. Under linux-ipsec version 1, such a renumbering would be a disaster; one would need to rewrite ipsec.conf, then stop and restart the entire ipsec system. This would cause all tunnels – even tunnels using interfaces that did not get renumbered – to go down for at least one minute.
This results in less code needed, less documentation needed, and more generality.
IMHO, from the user’s point of view, there is no advantage (indeed many disadvantages) to having the ipsec device “associated” with a physical device. It seems clear that ESP envelopes should be treated the same way as pluto traffic. That is, they are just a host-to-host conversation. The fact that they might have some payload that might be cryptologically related to some other packets (raw packets) is irrelevant. These envelopes should be routed on their own merits, according to their own outer headers, which for routing purposes are the only headers they have.
Conclusion 4: The ESP envelopes, like the pluto traffic, should be routed according to the kernel routing tables, according to the usual rules.
Note that this leads immediately to another argument in favor of Conclusion 1.
Also note that this, too, is very, very, very important, because it allows you to do routing below the tunnel layer. It gives you additional ways to do aggregation of slow links and/or recovery from failed links.
The foregoing analysis applies to tunnel mode. It would be nice to make transport mode work also. Conclusion 4 can be most easily implemented if raw packets can be routed differently from the ESP envelopes. This in turn is easiest if the outer-header destination address is different from the inner-header destination address. Well, in transport mode there is no inner header.
We can still make the scheme diagrammed in figure 4 work in transport mode. One way to do it would be to have KLIPS set the firewall mark (FWMARK) on the packets it emits. Then we use the policy routing mechanism to route them differently.
An alternative approach would be to tell the policy routing mechanism to look at the protocol numbers and handle ESP traffic in a special way. This would cause ambiguities if somebody tried to route ESP packets through a box that was also terminating IPsec traffic, but for typical applications that shouldn’t be much of a problem.
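The two classification strategies can be compared with a small Python model of ordered policy rules. The rule format, the mark value 1, and the table names are assumptions for the sketch; the only fixed fact is that ESP is IP protocol number 50.

```python
ESP_PROTOCOL = 50   # IANA-assigned IP protocol number for ESP

def select_table(packet, rules, default="main"):
    """Return the routing table named by the first rule that matches."""
    for field, value, table in rules:
        if packet.get(field) == value:
            return table
    return default

# Hypothetical rule sets: one keyed on a firewall mark, one on the protocol.
MARK_RULES = [("fwmark", 1, "envelopes")]
PROTO_RULES = [("protocol", ESP_PROTOCOL, "envelopes")]

# Sample packets: an envelope emitted (and marked) by the local KLIPS,
# an ESP packet merely passing through, and an ordinary TCP packet.
LOCAL_ENVELOPE = {"protocol": ESP_PROTOCOL, "fwmark": 1}
TRANSIT_ESP = {"protocol": ESP_PROTOCOL}
RAW_TCP = {"protocol": 6}
```

The transit packet shows the ambiguity mentioned above: the protocol-number rule sends it to the envelope table just like a locally generated envelope, whereas the mark-based rule leaves it in the main table.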
No matter what you do with ESP traffic, you ought to do something intelligent with Pluto traffic, which needs to go unaltered out the real interface. (This is not actually critical, because KLIPS already checks for Pluto traffic and applies the null transformation to it.)
If you want transport mode to coexist with routing daemons such as gated, it is necessary to implement the ipsec devices as point-to-point links. If the ipsec device is multi-access and shares a source address with one of the real interfaces, then gated sees a conflict. If it has some other address, return traffic won’t do the right thing.
Another approach to implementing transport mode would be to give the ipsec device a different address, but have KLIPS perform a NAT (aka masquerade) function, rewriting the destination address. This would make transport mode look more like tunnel mode. Maybe that’s a good thing; it’s not clear why anybody would prefer transport mode to tunnel mode.
In general, NAT causes problems of its own, but it’s hard to see how this particular sort of NAT could cause problems.
My guess is that such a NATting ipsec implementation could be made compatible with non-NATting implementations, but I haven’t thought through the details.
The “identity” of a Security Association depends in part on the IP address of the endpoints. Under this scheme, the identity should be bound to the virtual address, i.e. the address of the ipsec interface, not the address of whatever physical interface(s) carry the ESP envelope. Once the raw data has been put into an ESP envelope, how the envelope gets routed should be outside the scope of IPsec.
Consider a situation that has one fixed host and a lot of road warriors. Then suppose the fixed host suddenly reboots, or does an ipsec setup restart. It would be very nice to remember the address where each RW was last seen, and automatically --up each connection. With very high probability the RW is still there, and the --up will succeed.
But freeswan (as of version 1.5) makes no attempt to remember where the RWs were last seen. If the fixed host restarts, the RWs are high and dry; until their connections time out at their end, they will not be able to say anything useful to the fixed end. This could go on for an hour or more (depending on ikelifetime). This definitely does not uphold the quality standards that my customers expect.
So here’s a suggestion: In the .conf file on the left (fixed) end, the right=... item should be replaced by *two* items, which might take the form:
    respond-to-right=188.8.131.52/8      # road warrior wildcard
    initiate-to-right=184.108.40.206     # last known domicile of this moat
and we can use the _updown script or whatever to keep the value of initiate-to-right up to date. It could do this by reading a template containing the constant data, then filling in the variable data, and finally writing the ready-to-use .conf file.
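The template-filling step might look roughly like this in Python. Keep in mind that respond-to-right and initiate-to-right are the parameters proposed above, not options that pluto actually understands, and that the conn name and addresses below are placeholders I made up.

```python
from string import Template

# Constant data lives in the template; variable data is filled in per update.
CONN_TEMPLATE = Template(
    "conn $name\n"
    "    respond-to-right=$wildcard\n"
    "    initiate-to-right=$last_seen\n"
)

def render_conn(name, wildcard, last_seen):
    """Produce a ready-to-use conn stanza with the latest known peer address."""
    return CONN_TEMPLATE.substitute(name=name, wildcard=wildcard, last_seen=last_seen)

# Hypothetical values, as an updown hook might supply them.
STANZA = render_conn("rw-alice", "0.0.0.0/0", "192.0.2.17")
```

Each time a road warrior shows up at a new address, the hook would call `render_conn` again with the new `last_seen` value and rewrite the file, so that a restart of the fixed host can initiate toward the right place.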
As a workaround, the desired behavior can be obtained by defining two conns for each user; one that operates in RSA roadwarrior mode for incoming connections, and one that operates with a fixed IP address for outgoing connections. But this is inelegant and complicated, and it is totally unreasonable to expect each user to re-invent this. An elegant solution needs to be built into pluto.
This requires pluto to write to disk and keep state that persists across reboots, which it heretofore has not done.
1) See the short-term forwardcontrol suggestions in section 3.4.
2) See the address-inertia suggestions in section 6.
3) Another thing that seems like it might be relatively simple would be to allow multiple routes to the same subnet via different gateways. This might involve only a few lines of code, to check the nexthop identification in addition to the current check for subnet.
Note that this check should use the proper "remote host identification" which in the case of a road warrior must be its RSA identification not its IP address, which could change at any time (because of DHCP renewals).
4) In linux-ipsec version 1, KLIPS has an elaborate mechanism for bypassing the kernel routing tables. The long-term sensible goal is to rip all that out and rely on the kernel to do the routing. In the meantime, one could imagine modifying it to achieve some of the objectives outlined above. (This would be more work in the long run, and may or may not be less work in the short run.) This would have the minor advantage of allowing said features to work on kernels that antedate support for policy routing.
The starting point would be to establish the conceptual distinction between the virtual address of the ipsec device and the physical address(es) of the real device(s) over which the bypass mechanism routes the envelopes. For each conn in the .conf file, there would be a rightVirtualInterfaceIP=... statement. Somewhere else (perhaps in the preamble of the .conf file) there would be a few statements describing how to route the envelopes. Also (as previously mentioned) it would be necessary to orient the local ipsec device either using ifconfig or a this-end=... statement in the preamble.
Under this scheme, it would appear that
This is important to anybody who wants to use IPsec in a large, secure, high-availability system. This will also make it easier for the Muggles to understand.
If anybody has questions, or sees a flaw in this design, please speak up. Maybe we can work together to clarify and/or fix it. Of course, if somebody has a better design that will get the IPsec job done and play nicely with gated-like objects, please let the rest of us know about it!