Aug 12, 2011

Networking issues....

Lately, I've been seeing a lot of strange networking issues pop up on things that previously worked flawlessly, and I honestly couldn't figure out what was going on.  Some things, like World of Warcraft, were unaffected, but games I recently started playing, like League of Legends, wouldn't work at all.  They would appear to connect... but would eventually just "lag" out.  Some websites also suffered, including my own server hosted in Amazon's EC2.

So... what did I do?  I initially avoided the problem by VPN'ing into my work network and doing everything from there.  Because that worked, I assumed (incorrectly) that my ISP had routing issues or something along those lines.  I played games that way for a few months, and it never occurred to me that my problems were NOT due to routing issues at my ISP.  After all, they're a small-time local company.  Bigger companies have done far worse, and they pay technicians an exorbitant amount to "fix" such problems.  (I'm still battling with AT&T & GTA here over another issue.)

So... what was the *actual* problem, and what was the fix?

I remember from one of my old Cisco classes that when the MTU is set wrong, packets larger than what the path can actually carry get thrown into the bit bucket.  The most common MTUs are 1500 and 1492.  (1500 is pretty standard, but when tunneling it's not uncommon to see 1492 because of the overhead of encapsulating the packet; PPPoE, for example, adds 8 bytes.)

In a normal situation, you ping a server and a response comes back.  When you do a basic ICMP ping in Windows using the spiffy "ping" command, a 32-byte payload is sent to the remote site (not counting the header overhead) and a response comes back.  To test the MTU, you need to increase the size of the ICMP packet up to the size of the MTU and tell it not to fragment: -l sets the payload size and -f sets the Don't Fragment flag.

C:\> ping www.google.com -f -l 1500

Well, guess what?

Packet needs to be fragmented but DF set.

Hmmm... well, let's try a smaller packet.

C:\> ping www.google.com -f -l 1492
Packet needs to be fragmented but DF set.

OK... that's interesting... let's try smaller still...


C:\> ping www.google.com -f -l 1400
Reply from 209.85.157.99: bytes=1400 time=58ms TTL=54
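
With one size that works and one that doesn't, the rest is just narrowing the gap.  Here's a rough sketch of doing that as a binary search instead of by hand.  The function names are my own invention, the host is just the one from my tests, and the check for "TTL=" in the output is an assumption based on what a successful reply line looks like above.

```python
import subprocess
from typing import Callable


def largest_working_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes probe(low) is True and that failures are monotonic:
    once a size is too big, every larger size is too big as well.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid        # mid got through, so the answer is mid or larger
        else:
            high = mid - 1   # mid was too big, so the answer is smaller
    return low


def windows_ping_fits(size: int, host: str = "www.google.com") -> bool:
    """Send one don't-fragment ping with `size` payload bytes (Windows ping syntax).

    Assumption: a successful reply line contains "TTL=", as in the output above.
    """
    result = subprocess.run(
        ["ping", host, "-f", "-l", str(size), "-n", "1"],
        capture_output=True,
        text=True,
    )
    return "TTL=" in result.stdout
```

Calling largest_working_size(windows_ping_fits, 1400, 1500) should take about seven pings to settle on a value.  A single dropped reply would throw the search off, so it's worth re-checking the final answer by hand.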

OK... so 1400 worked.  Let's keep trying bigger values until we discover what actually works.  In my case, everything up to and including 1464 worked, but nothing bigger.  (Since -l counts only the payload, that's 1464 plus 28 bytes of IP and ICMP headers, or a 1492-byte packet on the wire.)  Well... what's the MTU set on the interface in Windows 7?

C:\> netsh interface ipv4 show subinterfaces
   MTU  MediaSenseState   Bytes In  Bytes Out  Interface
------  ---------------  ---------  ---------  -------------
4294967295                1          0     393572  Loopback Pseudo-Interface 1
  1504                1  196616510   13092640  Local Area Connection 3

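1504???  Wow... that's not going to work very often.  Let's fix that (requires an administrative cmd prompt).  I went with 1464, the largest ping size that worked; that's on the conservative side, but I know it gets through.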

C:\> netsh interface ipv4 set subinterface "Local Area Connection 3" mtu=1464 store=persistent
Ok. 
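
If you want to check the setting from a script rather than by eye, the listing is easy enough to parse.  This is only a sketch: parse_subinterfaces is a name I made up, and it assumes the column layout shown above (four numeric columns followed by the interface name).  SAMPLE is the listing from before I made the change.

```python
from typing import Dict

SAMPLE = """\
   MTU  MediaSenseState   Bytes In  Bytes Out  Interface
------  ---------------  ---------  ---------  -------------
4294967295                1          0     393572  Loopback Pseudo-Interface 1
  1504                1  196616510   13092640  Local Area Connection 3
"""


def parse_subinterfaces(output: str) -> Dict[str, int]:
    """Map interface name -> MTU from `netsh interface ipv4 show subinterfaces` text.

    Assumption: each data row is MTU, MediaSenseState, Bytes In, Bytes Out,
    then the interface name (which may contain spaces).
    """
    mtus = {}
    for line in output.splitlines():
        fields = line.split(None, 4)  # at most 5 fields; the name keeps its spaces
        # Header and separator rows are skipped because their first
        # four fields are not all numeric.
        if len(fields) == 5 and all(f.isdigit() for f in fields[:4]):
            mtus[fields[4].strip()] = int(fields[0])
    return mtus


print(parse_subinterfaces(SAMPLE))
# {'Loopback Pseudo-Interface 1': 4294967295, 'Local Area Connection 3': 1504}
```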


Let's try connecting to stuff and see if that helps.  Hey!!!  Everything works now!

So... what have we learned?  There's an unspoken *standard* MTU that almost everything uses: 1500.  Behind a tunnel it's lower, commonly 1492 for PPPoE (as used with DSL), and VPNs take their own bite as well.  This does NOT mean that your ISP *will* permit packets of that size to reach the Internet.  It might end up being configured as a double tunnel (starts at 1500... some sort of tunnel to the DSLAM... and another tunnel to the endpoint), or there might be a combination of several things.

Windows has a mechanism for adjusting to the path on the fly (Path MTU Discovery), but it depends on routers sending back ICMP "fragmentation needed" messages, and it can fail (like in my case).  Nothing will directly tell you that your MTU is set too high; you'll just randomly see connections go "stale" with no indication that something is wrong.  *Some* routers will let you set the MTU manually, but I was not so lucky.  So... the final result: I set the MTU manually on the PC, and now everything works.
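
For what it's worth, the numbers from my tests line up once you count headers.  The size given to ping -l is only the ICMP payload; the IPv4 and ICMP headers ride on top of it.  A quick sketch of the arithmetic, assuming a plain IPv4 header with no options (the constant and function names are just mine):

```python
IPV4_HEADER = 20     # bytes, assuming no IP options
ICMP_HEADER = 8      # bytes, ICMP echo header
PPPOE_OVERHEAD = 8   # bytes, PPPoE header (6) plus PPP protocol field (2)


def packet_size(ping_payload: int) -> int:
    """Size of the IP packet produced by a ping with this payload."""
    return ping_payload + IPV4_HEADER + ICMP_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented packet at this MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


print(packet_size(1464))        # 1492
print(1500 - PPPOE_OVERHEAD)    # 1492
print(max_ping_payload(1500))   # 1472
```

So my largest working ping (1464) corresponds to a 1492-byte packet, which is exactly what a single PPPoE hop would leave out of 1500.  It also means a ping with -l 1500 can't get through unfragmented even on a healthy 1500-byte link, and that an interface MTU of 1492 would probably have worked here too; 1464 just leaves some extra headroom.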

You might run into this same situation.  It's very difficult to diagnose, as everyone just assumes a successful ping is a good indication that your connection is working perfectly.  Now that more and more network appliances and adapters support jumbo frames (MTU ~9000), there are going to be more situations where discovering the MTU fails, and this will need to be addressed.  Perhaps this will end up being a useful resource to others.