[techtalk] RHL HA or Acronyms can be fun...

Thu Jan 25 11:43:50 EST 2001

Hey guys...

Anyone out there have experience with Red Hat's HA software?  I've downloaded
it and am toying around, and have a few issues with the architecture (I'm
using the example for setting up a five-node cluster I got off their
website).  Here's the deal:

I've got two LVS routers using NAT to pass traffic to two
webservers.  The primary LVS router has its real interfaces up, as well as
an alias on the external interface for the virtual webserver and an alias on
the internal interface for the virtual NAT router address.  The backup LVS
router just has its real interfaces up, and the webservers have theirs as
well, happily oblivious to the fact that they're part of a cluster at all.

So the heartbeat via pulse runs over the real external interfaces
of the two LVS machines.  When a failure is detected, the backup router
brings up its aliased interfaces and takes over the role of primary,
receiving traffic for the virtual server and acting as the virtual NAT router
for the real webservers.  When the failed ex-primary box comes back up, it
becomes the hot backup.

Here are my 2 main issues:

1)  Pulse cannot tell the difference between a source and a destination
failure.  I.e., if the backup LVS router for some reason has a broken
connection to the network, it will assume the primary server has failed
because it does not receive a heartbeat.  It will then bring up its virtual
aliases, begin ARP spoofing, and attempt to route.  The webservers
will all use lvs2 as their router (I deduced this experimentally as well as
theoretically).  But since lvs2 has no external connection, it cannot route,
so the cluster fails.  The same might also be the case if the external
network connection on the primary failed, because the primary would leave
its internal virtual interfaces up AND the backup would also bring its
internal interfaces up.  Therefore two machines would be responding to the
ARPs.  I don't know exactly what the result of that scenario would be.

Now, I BELIEVE this can be solved if I used direct routing instead of NAT.
This way, since the webservers would be replying directly to the
clients, they would only be dependent on the routers for incoming requests,
and I haven't yet figured out a plausible scenario where pulse would allow
both external virtual interfaces to be up at the same time.  Even if it did,
who cares, as long as the packets get to the webservers.

2)  Pulse only has knowledge of one interface at a time.  Therefore, if the
internal interface on the primary LVS router goes down, and with it the
connection to the webservers, pulse will not transfer control to the backup
because it continues to get a heartbeat through the external interface.
Thus the cluster fails, because lvs1 continues to act as the virtual server
without being able to communicate with the real servers.
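
To make issue 2 concrete, here's a rough Python sketch of the decision I'd
like to see made.  This is NOT how pulse works and not any pulse API; the
LinkState model, can_serve() and choose_active() are names I made up, and it
assumes each router can somehow learn the state of both NICs on both boxes.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    """Health of one LVS router's two real interfaces (hypothetical model)."""
    external_up: bool  # external NIC can reach the outside network
    internal_up: bool  # internal NIC can reach the real webservers


def can_serve(state: LinkState) -> bool:
    """A NAT router is only useful if both sides of the path work."""
    return state.external_up and state.internal_up


def choose_active(lvs1: LinkState, lvs2: LinkState, current: str) -> str:
    """Return the name of the router that should hold the virtual addresses.

    Keeps the current holder while it can still serve, fails over only if
    the other router can actually serve, and otherwise stays put.
    """
    holder, other = (lvs1, lvs2) if current == "lvs1" else (lvs2, lvs1)
    other_name = "lvs2" if current == "lvs1" else "lvs1"
    if can_serve(holder):
        return current
    if can_serve(other):
        return other_name
    return current  # neither can serve; moving the addresses would not help
```

With lvs1's internal NIC down and lvs2 healthy, choose_active() picks lvs2,
which is the failover that a heartbeat on the external interface alone never
triggers.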

What I need to know is whether or not there's a better way to configure
pulse to account for these situations.  Perhaps it could run multiple
instances that communicate with one another and make decisions based on the
states of all of the NICs.  Or it could try to heartbeat more than one
server, in an effort to diagnose exactly where the failure sits.  Or is that
asking too much?
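
Here's roughly what I mean by heartbeating more than one server, as a small
Python sketch.  Again, this is my own assumption of how it could work, not
something pulse offers: the "witnesses" (say, the default gateway or some
other box on the external network) and the diagnose() function are
hypothetical, and the actual probing is left out.

```python
from enum import Enum
from typing import Sequence


class Diagnosis(Enum):
    ALL_OK = "peer is alive"
    PEER_FAILED = "peer looks dead, our own link is fine"
    LOCAL_FAILED = "our own link looks dead"


def diagnose(peer_alive: bool, witnesses_alive: Sequence[bool]) -> Diagnosis:
    """Guess where a failure sits from one round of heartbeat results.

    peer_alive:      did we hear the other LVS router's heartbeat?
    witnesses_alive: did each extra (hypothetical) witness host answer?
    """
    if peer_alive:
        return Diagnosis.ALL_OK
    if any(witnesses_alive):
        # We can reach somebody, so the silence is probably the peer's fault.
        return Diagnosis.PEER_FAILED
    # Nobody answers (or no witnesses are configured): assume our own
    # connection is broken and do not take over.
    return Diagnosis.LOCAL_FAILED


def backup_should_take_over(peer_alive: bool,
                            witnesses_alive: Sequence[bool]) -> bool:
    """The backup brings up the virtual aliases only on a real peer failure."""
    return diagnose(peer_alive, witnesses_alive) is Diagnosis.PEER_FAILED
```

In the scenario from issue 1, the backup with the broken network connection
would hear neither the primary nor any witness, so it would stay quiet
instead of grabbing the virtual addresses.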

Thanks in advance for your advice/answers...

-Brian

-----------------------
Brian J. Sweeney
"I want to know God's thoughts ... the rest are details." -Albert Einstein
Systems Admin, imagedog
bsweeney at imagedog.com