[Techtalk] Is it the server???

Thu Apr 4 13:14:53 EST 2002

Heya --

Quoth Michelle Murrain (Thu, Apr 04, 2002 at 12:15:46PM -0500):
> This is SO strange.

	But those are the interesting problems!  [grin]

> I'm still getting a lot of "unreachable" errors in the tcpdumps, and
> there is no question that the null connections match with the
> unreachable errors.  Mail is still getting delayed.

	Good diagnostic information.  Is the ICMP unreachable error
actually *to* 192.168.1.1, or is that just an address you swapped in to
substitute for your real IP?  If that's the actual error, it's telling
you that your box doesn't know how to get to that private-space network.
If your box is in normal routable space, it shouldn't be trying to
reach private-space IPs directly.  (Normally.  I would need to know more
about your network topology to say for sure.)
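
	If it helps when sorting through the tcpdump output, here is a
rough Python sketch for checking whether an address sits in RFC 1918
private space.  The function name is my own invention, and the sample
addresses other than 192.168.1.1 are made up for illustration.

```python
import ipaddress

# The three private ranges reserved by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address falls inside one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)


if __name__ == "__main__":
    # 192.168.1.1 is from the thread; the others are illustrative only.
    for addr in ("192.168.1.1", "10.20.30.40", "172.32.0.1", "198.51.100.7"):
        kind = "private (RFC 1918)" if is_rfc1918(addr) else "not RFC 1918"
        print(f"{addr:15} {kind}")
```

Anything this flags as private should only appear as a destination if
your box has a route to that network.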

	Can you ASCII art up a network map for us, complete with your
mail server, all boxes you're seeing the errors on in tcpdump, and your
DNS server?  And let us know which boxes actually have routable IPs and
which don't?

> Outgoing smtp seems to work fine. I picked a few of the hosts that I'm
> having trouble with (that have unreachable errors), and regular pings
> show very little (if any) packet loss, whereas ping flooding shows
> serious packet loss (like 25%). I thought at first it might be the
> ethernet cable between the server and the router, but it's not - I
> swapped several out.  When I try to send out large packets from the
> router, they all get dropped.

	Large packets dying while small ones get through usually points
to either a connectivity problem somewhere along the path or a firewall.
Your ISP needs to take care of that, since you won't have access to the
relevant equipment.  Was the ping flooding showing packet loss because
their pipe isn't as big as yours?  (If you're on a T3 and you ping flood
a machine on a T1, you're going to see serious packet loss simply
because their pipe can handle only about 1/30th the bandwidth that yours
can, and the other 29/30ths are getting dropped.)
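
	To put numbers on the T3/T1 comparison, here is a back-of-envelope
sketch in Python.  The line rates are the standard nominal figures, not
anything measured on your network, and the model is deliberately crude:
it assumes the flood actually fills the sender's pipe and ignores
queuing and other traffic.

```python
# Nominal line rates in Mbit/s (standard figures, not from the thread).
T1_MBPS = 1.544
T3_MBPS = 44.736


def flood_loss_fraction(sender_mbps: float, receiver_mbps: float) -> float:
    """Fraction of a line-rate flood that the slower link cannot carry.

    Simplified model: ignores queuing, reply traffic, and other load.
    """
    if sender_mbps <= receiver_mbps:
        return 0.0
    return 1.0 - receiver_mbps / sender_mbps


if __name__ == "__main__":
    ratio = T3_MBPS / T1_MBPS
    loss = flood_loss_fraction(T3_MBPS, T1_MBPS)
    print(f"A T3 is about {ratio:.1f} times the size of a T1")
    print(f"Loss if flooding at full T3 rate: {loss:.1%}")
```

So "1/30th" is a fair approximation.  A real ping flood rarely fills
the whole pipe, so treat the loss figure as an upper bound.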

	Have you tried replacing all the physical connections in your
path with known good ones?  (I don't think it's your side of things, but
it never hurts to be sure.)  Change ports on the switch or hub, as well
as swapping out the Ethernet cable.

> Suggestions? I have a trouble ticket in with my ISP, but they seem a bit
> clueless.  

	Honestly, that's not unexpected.  When I was in a
customer-support job at an ISP, I would have been totally stymied by
something like this.  Most people with the necessary understanding of
networking and protocols won't take a customer-facing phone-answering
job.  Escalate within your ISP if necessary, keep sending them all the
evidence you can to help them troubleshoot, and hopefully you'll
eventually get someone clueful on the phone.

> Just 'cause I was curious, I did a ping flood from a different box
> within the same network, and guess what - way, way, less packet loss.
> (like 1%) So it seems like it's the server, right? If so - is it a bad
> ethernet card?  Or can something else be going on?

	It could be the server, or it could be the connections from your
local network to those remote sites.  Do you still see the packet loss
when going from the non-server machine on your local network to the
machines you're having the issues with on the remote network?  Do the
server and that other box get their DNS from the same place?

	I am wondering if this is what's happening:

Remote mail server begins to connect to local mail server.
Local mail server queries the DNS server, to make sure the remote mail
  server is who it says it is.  "Where's this remote mail server,
  192.168.1.1?"
DNS server says, "Uh, what?  I don't have a mapping for that.
  ServFail."
Mail server says, "Fine.  Router, connect me to 192.168.1.1."
Local router says, "The hell?  I don't know how to get to 192.168.1.1!
  ICMP error -- network unreachable."

	Lather, rinse, repeat.
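
	If the DNS step in that exchange is a reverse (PTR) lookup, which
is my assumption and not something the tcpdump has confirmed, then the
name being queried would look like the output of this small Python
sketch.  The function name is hypothetical.

```python
import ipaddress


def ptr_query_name(address: str) -> str:
    """Build the name a resolver queries for a reverse (PTR) lookup."""
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    # Address taken from the ICMP errors quoted in this thread.
    print(ptr_query_name("192.168.1.1"))  # 1.1.168.192.in-addr.arpa
```

A DNS server with no zone for that in-addr.arpa name has nothing to
answer with, which would fit the failure described above.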

	There are a few problems with this hypothesis, though.  If this
kept happening every time, mail would never get through.  Obviously, it
is getting through, if somewhat delayed, so there's something else going
on, too.  The configuration error could be on the DNS server (IME,
ServFails usually mean a borked DNS setup), or on the mailserver (asking
for bad information).  But if it's on the mailserver, I wonder why it
was working before and isn't now.  Maybe the change in IP address needs
to be reflected somewhere, and hasn't been?  If you have tcpdump info
from a good SMTP connection from those same servers, could you post
that, too?  (And let me know which addresses have been changed to RFC
1918 addys and which haven't.)

Theorizing,
Raven

"Argh!  All these clocks are the same!"
  -- RavenBlack, on unexpected and new synchronicity