Aug 12, 2011

Networking issues....

Lately, I've been seeing a lot of strange networking issues pop up randomly on things that previously worked flawlessly.  I honestly couldn't figure out what was going on.  Some things remained unaffected, like World of Warcraft, but games I recently started playing, like League of Legends, wouldn't work at all.  They would appear to connect... but would just "lag out" eventually.  Some websites also suffered... including my own server hosted in Amazon's EC2.  So... what did I do?  I avoided the problem by VPN'ing into my work network & doing whatever from there.  Doing this, I assumed (incorrectly) that my ISP had routing issues or something along those lines.  I've been playing games that way for a few months... and it never occurred to me that my problems were NOT due to routing issues by my ISP... After all, they're a small-time local company.  Bigger companies have done far worse... and they pay technicians an exorbitant amount to "fix" such problems.  (still battling with AT&T & GTA here over another issue)

So... what was the *actual* problem?  And what was the fix?

I remember from one of my old Cisco classes that packets larger than the MTU get thrown to the bit-bucket... the most common MTU for most networks is either 1500 or 1492.  (1500 is pretty standard... but when tunneling... it's not uncommon to see 1492 because of the overhead of encapsulating the packet)

In a normal situation... you ping a server... and a response comes back.  Well... when you do a basic ICMP ping in Windows using the spiffy "ping" command... a 32-byte payload is sent to the remote site (not counting the 28 bytes of IP/ICMP header overhead) and a response comes back.  For a test of MTU... you need to increase the size of the payload until the full packet (payload plus those 28 bytes of headers) reaches the size of the MTU... and tell it not to fragment.

C:\> ping www.google.com -f -l 1500

well guess what?

Packet needs to be fragmented but DF set.

Hmmm.... well let's try a smaller packet

C:\> ping www.google.com -f -l 1492
Packet needs to be fragmented but DF set.

ok... that's interesting... let's try smaller still...


C:\> ping www.google.com -f -l 1400
Reply from 209.85.157.99: bytes=1400 time=58ms TTL=54

ok... so 1400 worked... let's keep trying bigger values until we discover the largest one that actually works.  In my case... everything up to and including 1464 worked... but nothing bigger.  Well... what's the MTU set to on the interface in Windows 7?....

C:\> netsh interface ipv4 show subinterfaces
   MTU  MediaSenseState   Bytes In  Bytes Out  Interface
------  ---------------  ---------  ---------  -------------
4294967295                1          0     393572  Loopback Pseudo-Interface 1
  1504                1  196616510   13092640  Local Area Connection 3

1504??? wow.... that's bigger than what will actually make it through... let's fix that... (requires an administrative cmd prompt)

C:\> netsh interface ipv4 set subinterface "Local Area Connection 3" mtu=1464 store=persistent
Ok. 
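Just as a sanity check (my own habit, not a required step)... with the interface MTU now set to 1464, the largest payload that should ping cleanly with the don't-fragment flag is 1464 minus the 28 bytes of headers... 1436.

C:\> ping www.google.com -f -l 1436

That one should come back with a normal reply... and anything larger should now get rejected locally with the fragmentation error instead of silently disappearing somewhere upstream.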


Let's try connecting to stuff & see if that helps.    Hey!!!  Everything works now!  So... what have we learned?  There's an unspoken *standard* MTU that almost everything uses... it's 1500.  When tunneling traffic... it's 1492 (like for VPNs or PPPoE as used in DSL).  That does NOT mean your ISP *will* permit packets of that size to reach the Internet.  It might end up being configured in a double-tunnel... (starts at 1500... some sort of tunnel to the DSLAM... and another tunnel to the endpoint...) or there might be a combination of several things.  Windows does have a mechanism for adjusting the MTU on the fly (path MTU discovery)... but that can fail... (like in my case).  There is nothing that will directly indicate that your MTU is set too high... other than randomly seeing connections go "stale" without any indication that something is wrong.  *Some* routers will let you manually set the MTU... but I was not so lucky.  So... the final result... I set the MTU manually... and now everything works.
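A bit of quick math ties this all together (my own back-of-the-napkin figuring, based on the 28 bytes of IP/ICMP headers that ping's -l value doesn't count):

   1464 bytes of ping payload
  +  20 bytes of IP header
  +   8 bytes of ICMP header
  = 1492 bytes on the wire

...which lines up exactly with the classic PPPoE number.  So setting the interface MTU to 1492 would probably have worked just as well... 1464 is simply the largest-payload figure and errs on the safe side.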

You might run into this same situation... it's very difficult to diagnose... as everyone just assumes a successful ping means your connection is working perfectly.  Now that more & more network appliances and adapters support jumbo frames (MTU ~9000)... there are going to be more situations where MTU discovery fails... and this will need to be addressed.  Perhaps this will end up being a useful resource to others.
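If you'd rather not do the guess-and-check by hand, here's a quick-and-dirty batch sketch of the same probing loop.  It's my own throwaway, not an official tool of any kind... it assumes www.google.com answers pings, English-language ping output, and that the real limit is somewhere between 1228 and 1500 bytes:

@echo off
rem Walk the ping payload down 4 bytes at a time until a reply makes it through un-fragmented.
set SIZE=1472
:probe
ping -n 1 -f -l %SIZE% www.google.com | find "bytes=" >nul
if %errorlevel%==0 (
  echo Largest working payload: %SIZE% ... so the path MTU is roughly %SIZE% plus 28
  goto :eof
)
set /a SIZE-=4
if %SIZE% GEQ 1200 goto probe
echo Nothing made it through down to 1200 bytes... something else is wrong.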

Feb 11, 2011

WOOHOOO! I recovered my blogger account! (no thanks to google)

I must admit, I've become a Google fanboy over the past few years... it's been one service I can honestly say "just works."  It's always been reliable, amazingly simple to use (from the user side), and it still provides full-featured, well-documented public APIs, builds on public standards, and is a strong supporter of the open-source community... and even more amazingly... it's FREEEEEEEEEEEEEEEEEEEEEEEEE.  Admittedly, they pay the bills with advertising on everything you do... and who knows what else... but honestly, the ads are subtle enough that I don't mind them... and any other avenues of revenue generation they may use, I have not yet found personally intrusive.  No crazy psychedelic blinky banners of doom.... which all too often make me never want to return to a site.
That being said, Google is also notorious for acquiring other "bits" to improve their internet footprint.  This "brain-dump" is stored on one such service.  Google acquired Blogger... and after quite a while decided to integrate the authentication schemes.  This is exactly what caused me to almost completely lose my blog... and reevaluate my faith in Google.  Truthfully... even though I was able to recover my blog... it was not due to any direct communication with Google... nor to any of their many help documents... and further still... not even to their forums or the many helpful people who tech-support Google's stuff.  It is much like me (an individual) trying to get Microsoft to acknowledge a bug in one of their products... without forking out crazy amounts of cash.  In short... I followed every step they recommended to get help... and ultimately it ended up in a forum which never gets read or responded to by anyone.
In Google's defense... I can honestly appreciate the complexities & difficulties of trying to merge two completely separate authentication systems into one.  You can't simply keep both sets of usernames/passwords alive... eventually one has got to prevail.  There was notice that Google wanted everyone to transition to a Google account, and I thought that since my account was associated with my Google Apps enabled domain, everything was good to go.  Well... sadly, it wasn't.  I'm not exactly sure when Google finally flipped the switch... but thereafter, when I tried to log into my account I was presented with a blank Blogger account.  My site still had all my posts, but browsing to it while logged in only showed me the traditional guest page.  I tried posting on forums... password resets... there's even a "recover-your-account" page which seems forever offline with no ETA for being back online.  I was about to give up, remove the DNS entries and perhaps look elsewhere for something else.
Well, as you can tell from the fact that you're reading this... I didn't give up.  I somehow stumbled upon this page: http://www.google.com/support/accounts/bin/answer.py?answer=27443 which I didn't have a lot of faith in... but figured, what the heck?  It's worth a try.  Following the "Do you use Gmail with this account?" path with the "yes" option was completely useless... but for kicks... I said no... and ended up at the password-assistance page (https://www.google.com/accounts/recovery) which, after I submitted my email address at my domain... came back with a rather strange option to send me a reset-password link... the strange part was that the email address was "myadmin%mydomain.tld@gtempaccount.com".  Well... that's not my email address... but the gtempaccount part looked like perhaps Google went & associated my blogger account with a bogus temporary account.  For kicks... I logged in using my old password with "myadmin%mydomain.tld@gtempaccount.com" as my username.... and voilà... it worked.
Just to be safe... I went and invited my proper Google account as an admin on the blog & then removed the gtempaccount.
It's a shame that there's a huge number of posts out there that seem to have fallen on the same deaf ears.  I listen to many other bloggers/podcasters/etc... who have run into various issues with Google's tools... and it seems when *they* have a problem, Google jumps as fast as they can to fix it ASAP... Google would be fools not to... but I searched repeatedly across every document Google offered... followed their advice to the letter, submitted multiple requests for help in their forums... waited over a month and still didn't even get the time of day.  It seems that Google's users are only as important as the number of their followers.  But hey... it's freeeeeeeeeee...

To all who find themselves in my shoes... I can truly feel your pain.  It may not work for you, but try logging in with yourolduseracct%yourdomain.tld@gtempaccount.com and your old password.

Feb 10, 2011

Hyper-V Time Sync issues... FIXED!

I've been using Microsoft Hyper-V Server for a while now, and I've run into an issue with Linux guest operating systems where the clock would skew VERY badly.  Installing NTPD didn't help... it's designed to gently discipline a clock that's only slightly off, and it simply can't keep up with drift this severe... and running a cron job to re-set the time just wasn't accurate enough.
The time would skew more than a few minutes in the space of 1 hour.  This is VERY unacceptable.
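If you want to see just how bad the drift is without touching anything, a quick check (assuming ntpdate is installed and you can reach a public pool) is to query an NTP server without setting the clock:

ntpdate -q pool.ntp.org

On an affected guest, the offset it reports just keeps growing.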

But!  as the title suggests... there is a fix.  (for me it's a simple fix... for others... may not be so simple)  So, here we go!

The problem occurs because the clock in an OS isn't based on the "hardware clock" that is kept alive by a battery... that one is only sorta-accurate and only reports whole-second increments.  Instead, the OS's clock is based on counting a set number of CPU cycles.  On a physical machine, this provides a clock that is much finer-grained than the traditional hardware time.  When you are doing highly time-sensitive things, you need that finer-grained clock.  For example, VoIP (ulaw RTP audio streams) traditionally breaks the audio up into 20ms chunks, puts each chunk into a packet and sends it.  On the other end, those packets are put into a special sort of buffer that takes those 20ms bits and reassembles them into a continuous stream of audio.  If you only had a clock accurate to within 1 second... you'd have some SERIOUS delay in conversations.

Today, a VERY large number of things in computers require a highly accurate clock.  Rather than each application trying to keep its own clock, operating systems provide APIs that every application can rely on for an accurate time source.  I am not 100% sure how every operating system does it, but I do know that Linux has one such kernel clock that is not based on the hardware clock.  There are kernel options that define how fine-grained its tick is... (100, 250, or 1000 ticks per second) but that's not really very relevant to this topic.  In short, during the startup process, the kernel initializes the OS clock from the hardware clock... works out some sort of algorithm for the number of CPU cycles per "tick", and continues to count from there... and on shutdown sets the hardware clock back to the OS's clock.  Typically, in the middle, services like NTPD can keep the OS's clock much closer to the actual time as defined by NIST.
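As a side note... not required for the fix, but handy for poking around... on reasonably recent kernels you can see which clock source the kernel is actually counting with:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource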

So, what goes wrong in a virtual environment? (not just Hyper-V)  Well, CPU cycles are virtual.  There are several different things at play, all of which can make the number of cycles per tick a variable rather than a constant.  Most virtual server frameworks (if not all) provide some sort of compensation so the guest OSes *appear* to be getting a constant number of cycles, but this wreaks havoc if the guest OS doesn't quite understand what the host has done.  In the case of Hyper-V, extra CPU cycles are thrown at the guest OS periodically to try & push the clock forward when it thinks the guest might have missed some.  This *can* help, but most Linux OSes just steadily count the extra CPU cycles, and the OS clock skews forward.  The fix?  Well, this is where it gets a bit more tricky.

A Kernel Module Saves the Day!!  Actually, this idea isn't as strange as it sounds.  Other virtualization frameworks have "integration" tools that do exactly this... plus other functions which we really aren't worried about at this point.  We want Linux guests in Hyper-V to keep time!  Microsoft was sooo thoughtful to provide us with the tools we need.  The "Linux Integration Services v2.1 for Windows Server 2008 Hyper-V R2" package was written specifically for this purpose!  We're saved!  ...or are we?  Well, if you read the fine print, it's only supported on a CRAZY-short list of Linux operating systems:

SUSE Linux Enterprise Server 10 SP3 x86 and x64 (up to 4 vCPU)
SUSE Linux Enterprise Server 11 x86 and x64 (up to 4 vCPU)
Red Hat Enterprise Linux 5.2, 5.3, 5.4, and 5.5 x86 and x64 (up to 4 vCPU)
Well... that's a start.  Microsoft appears to want to be friends with the Linux community... heck, they even went as far as to get several pieces of the Linux Integration Services into the kernel.  Wow!  Microsoft writing kernel drivers?   AMAZING!.... wait... why doesn't my OS work then?  Well, the down side is that Microsoft got the drivers into the kernel's staging tree (after getting called out for a GPL violation in the original package), but hasn't exactly kept them polished and packaged... so you can't expect to see the hyper-v bits in any mainstream Linux repositories anytime soon...

But this is not the end!  The Linux Integration Services are still there and can still be useful.  There are a few *gotchas*, but we're mainly focused on one feature... time sync.

So, without any further ado... here's what you need to do:

1) Download the Linux Integration Services package from Microsoft, and extract the files to someplace convenient.  We're only really interested in the LinuxIC v21.iso at this point.

2) Attach the .iso to your guest and mount the virtual cdrom someplace convenient.
mkdir /mnt/cdrom; mount /dev/cdrom /mnt/cdrom
3) Make a copy of the cdrom contents on the local guest OS.  (the cdrom isn't writable <shock>)
mkdir /opt/linux_ic_v21_rtm; cp -r /mnt/cdrom/* /opt/linux_ic_v21_rtm
4) Get your guest OS ready to build a kernel module.  I'm using Debian 5, but your OS should have something similar... (basically, you just need to install the build tools & kernel source/headers)
apt-get install build-essential linux-source module-assistant
m-a update && m-a prepare
5) Fix one line in the script/determine_os script.  Apparently, Microsoft only wanted to build the module for kernels 2.6.27 or greater.  Unfortunately, I'm running 2.6.26.

This may be a bit iffy, but for my kernel, all I needed to do was change line 40 from:
if [ $KERNEL_VER -ge 27 ]
to:
if [ $KERNEL_VER -ge 26 ]
This may work for other kernel versions, and the entire script could probably be modified to support more kernels, but I am not 100% sure of what is & is not supported.  I figured that 2.6.26 has very few (if any) differences in the system clock functions.  (the entire package was designed to work with kernels 2.6.27 and kernel 2.6.9)
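If you'd rather not fire up an editor, a sed one-liner should do the same thing (assuming line 40 reads exactly as shown above):

sed -i '40s/-ge 27/-ge 26/' /opt/linux_ic_v21_rtm/script/determine_os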

6)  Build *only* the hv_timesource module.  The other bits are very kernel-version specific.  On the other side of the coin... they do contain the other nifty paravirtual drivers... but I am not a kernel developer (or any kind of programmer) and can't tell you how to fix their compile errors.
cd /opt/linux_ic_v21_rtm; make hv_timesource
7)  If all goes well, the hv_timesource.ko module will be built!  Finally, we just need to load it.
insmod src/hv_timesource.ko
Final notes:  At this point, the module should be loaded, and the clock shouldn't drift anymore!  That being said, the time may still be wrong, so it's worth setting it once.  You can use "ntpdate" or even pull the time from the hardware clock using "hwclock --hctosys".  This should at least get you started, but you'll still need to make the module load automatically on startup so it's there after reboots (a rough sketch for Debian is below)... and if you upgrade the kernel, you may need to rebuild the module manually.
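Here's a rough sketch of how I'd wire up the auto-loading on Debian... the target directory under /lib/modules is just my own choice rather than anything official, and it assumes the module was built against the kernel you're actually running:

cp src/hv_timesource.ko /lib/modules/$(uname -r)/kernel/drivers/
depmod -a
echo hv_timesource >> /etc/modules

Debian loads every module listed in /etc/modules at boot, and depmod makes sure modprobe can actually find the new .ko.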

I'd be a very happy person if Microsoft would split their "integration services" package into pieces.  They do have closed-source bits (which is what got them in trouble with the GPL in the first place) that they can keep as an add-on package... but I honestly can't see any reason why this bit should be kept out of the mainstream kernel releases.  This is yet another example of Microsoft playing the "see, we integrate with Linux" game... without actually integrating with Linux.

Sep 30, 2010

Terminal Services (revised)

... as promised... more TS info... this time with less fluff.  I started re-reading my previous post and quickly decided... mmmkay, too much fluff.  It needed a rewrite... so this time I'm not going to be quite so "non-tech" friendly.  Forgive me if I gloss over several things & get right to the meat of it.

In short... here's the goal we're shooting for when it comes to getting redundancy going...


Shamelessly "borrowed" from technet's site.
Although you see 5 servers in that model... we only have 2 to play with (in this particular environment)... but hey, the process is still similar.  Let's cut this down a bit & see what we end up with.


Woohooo, 3 servers... makin' progress...  of course, we're not even touching the gateway services yet... and the session broker ends up becoming the weak link.  What if it dies? ... evil phone calls.

Lemme break down the steps in simple terms:

  1. The client does a DNS lookup for our farm's name... in the example above, "Farm1".  In production... this will be a publicly accessible FQDN.
  2. The DNS server responds with a list of multiple answers... one for each "session host" server.  (see the zone-file sketch after this list)
  3. The client randomly picks one of the servers from the list and connects to it.  If that server is unavailable... the client picks another one from the list.
  4. The randomly selected server talks to the session broker behind the scenes & asks where the client needs to go.
  5. The session broker responds with where the client needs to go.
  6. The session host tells the client where to connect...
  7. The client connects to that specific server...
  8. Finally, the session host updates the session broker with its new connection.
Ideally... that's how the magic all works!  Woohoo!
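For reference, the round-robin part is nothing fancy... just multiple A records for the same name.  A sketch of what that zone data might look like (the names and addresses are made up for illustration):

farm1.example.com.    IN  A    192.168.1.21   ; session host 1
farm1.example.com.    IN  A    192.168.1.22   ; session host 2

The DNS server hands out both addresses, the client picks one at random... and that's the extent of the "load balancing" at this stage.  The session broker takes over from there.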

Ok... let's add a few more pieces to this puzzle.





Those purple servers are Gateway Servers...  If you look at the previous diagrams... the session hosts use private IP addresses... the job of the gateway server is to securely bridge the gap between the public Internet and the private network.  Why 2? .... and why doesn't one have any lines connected to it?  Well... round-robin DNS is round-robin DNS... in this particular example... the random server the client found was the top one.  The client's connection sticks with that server.  Had the client randomly picked the other server... the lines would all point to that one instead.


In order for this to work... the client needs an additional config option set to use the gateway servers.  Not really a big deal, as you'll probably end up deploying a .rdp file (like the sketch below) or setting up a web portal for your users.  When a gateway is thrown into the mix, the clients actually use an HTTPS connection rather than the standard RDP (tcp 3389) connection... which adds a *needed* additional layer of security.
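Here's a rough sketch of the handful of lines in a .rdp file that matter for the gateway piece... the host names are placeholders and this isn't a complete file, just the gateway/RemoteApp related settings:

full address:s:farm1.example.com
gatewayhostname:s:gateway.example.com
gatewayusagemethod:i:1
gatewayprofileusagemethod:i:1
remoteapplicationmode:i:1
remoteapplicationprogram:s:||notepad

The gateway* lines tell the client to tunnel through the gateway over HTTPS, and the remoteapplication* lines are what flip it from a full desktop into RemoteApp mode.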

So... the process changes a bit.
  1. The client does a DNS lookup for the gateway server's name.
  2. The client gets back multiple answers... and randomly picks one.
  3. The client connects to that gateway server using basic HTTPS authentication... and starts an RPC over HTTP session.
  4. The gateway server does another round-robin DNS lookup for the session host farm name.
  5. A list of session host servers is returned... and a random one is picked...
  6. The gateway server passes the RPC session over to that session host.
and from there it behaves like normal... juggling around between servers until the new session lands on the correct server.  (either resuming a disconnected session, or starting a new one on the least-used server.)

And now, for the final piece.  That connection broker represents a single point of failure... and we're trying to build a setup with minimal points of failure.  So... the "Microsoft answer" is to throw yet another server into the mix as a failover.


So... there we have the "Microsoft" plan.  6 servers... just to go from a single session host to adding a 2nd session host to the mix.  Admittedly, this is a great model for scaling up to a HUGE deployment.  Most small businesses won't really need that level of scalability.  A single server can easily host 20-50 sessions... or even more with the right hardware and moderately lightweight applications.

In my situation, and probably many others... I can't justify that sort of expense when more than half of those servers will sit relatively idle... so I set out to consolidate the roles each server performs... while still keeping the redundancy and all the benefits of each role.

12:54AM .... and tomorrow is another day.  I'll lay out how I consolidated each role onto 2 servers tomorrow.

Sep 29, 2010

Terminal Server Fun.... (or not?)


I've been circling around & around with various problems trying to build a more redundant setup for Microsoft's "hot-new" feature "RemoteApp".

At first glance... and in simple environments... this is a really cool technology.  Imagine paying a single license fee for (most) programs per server... avoiding the constant upgrading of workstations... and using that budget to invest in your servers instead.

In a single-server setup... this works pretty well... barring a few crazy things... which are outside the scope of this particular post... but I promise to bring them up sometime.  That information is pretty useful when considering Microsoft's Remote Desktop Services (previously known as Terminal Services).

I'm sure there are many other admins out there playing the "why not use Citrix" card.  Well... if we all had bottomless budgets to play with... I would have loved to consider the possibility.  Citrix is simply crazy expensive to implement.  The majority of the features Citrix offers... Microsoft already offers in the base Remote Desktop setup.  And everything Citrix offers requires that you *first* purchase Microsoft's licenses, not only for the OS, but additionally for each client.  Strange... isn't it?  Admittedly, Citrix has done a lot of additional work for you to make some things easier to set up, configure and deploy.

So... what is this whole Terminal Services thingie anyway?  Long story short... you've probably already seen how someone can remotely control your computer... or how you can connect to your home computer through the internet... etc.  Now imagine 1 computer that can host multiple "sessions" and basically act like multiple remote computers... on 1 server... configured once.  Terminal Services is exactly that.  Install Office... install company programs... etc... on one server... and have multiple users log into it & run whatever programs they need.

Now for the twist.  RemoteApp (which is only available in Windows Server 2008 and above) does some spiffy "jedi mind tricks" and doesn't give you a whole remote desktop... instead it simply draws the individual programs on YOUR desktop.  The program is still running on the remote server... it's just being displayed locally.  Spiffy, huh?

So... let's start a pretty diagram of what's going on...
Terminal Server Magic

Seems simple?  Well, in a small office this might work... but once you get to a slightly larger office where you need to consider more than one server, you start running into problems.  On paper, you simply add another server... riight??? ... in reality it becomes quite a bit more complicated.

So... what happens behind the scenes?  According to Microsoft... to do it "correctly"... you actually need 6 servers: 2 gateway servers, 2 broker servers and 2 session host servers.  So your little project quickly skyrockets into a HUGE undertaking... which gets rather expensive... hardware and software both.  Seriously?  2x the performance... for 6x the cost?  Does that really seem realistic?  Of course not.

Well, my goal is to simply expand the operation to 2 servers... to get roughly twice the capacity... and a bit of redundancy.  (there's some overhead, since the servers will be doing some additional tasks... so it's not exactly 2x the growth)  We'll still make use of some of the Microsoft model... but put the extra work on the servers we already have.

First... What does each role do?

1.  Gateways:
This is more of a layer of security than anything...  The stock security in the RDP protocol is almost non-existent.  It's enough to keep honest people out... but won't do much beyond that.  The gateway service wraps the RDP traffic inside HTTPS... which is significantly more secure (but can also add some additional headaches).  Plus, you can add a nice web portal for users to log in and run their applications without having to manually install a million .rdp files or shortcuts to everything.

2.  Connection Brokers:
Since we're running multiple servers... if a user gets accidentally disconnected... what's the chance they'll reconnect to the same server?  With 2 servers, 50/50.  Which means you'll have a 99.99999% chance of getting phone calls...  I hate those kinds of phone calls.  So... we have connection brokers that match a disconnected session back up with the correct user.  The connection broker in 2008 R2 can also act as a load balancer... which helps spread the workload evenly between the session hosts.

3.  Session Hosts:
This is where the programs are actually run.  All the horsepower needs to be here.  Not much else to say.

Well... even if I had 6 servers... I probably still wouldn't divvy them up that way.  The gateway service & connection broker service rarely tick over 1% CPU usage & don't use enough memory to justify dedicating servers to them.  I've seen it suggested that the connection brokers should be thrown onto the domain controllers... but I don't really like putting a bunch of unrelated stuff on my domain controllers.  (yes, I know most DCs sit idle for the most part... it's still not my preference.)

So... my solution... is to-be-continued.  It's 12:18AM... and I gotta get to work in the morning.

Sep 28, 2010

Day 1

Well, it's day 1... I just got my domain last night & already have it tied into Google Apps.  Hover.com is kick-butt.  I highly recommend it to anyone from a mid-sized company down to a home user who wants to set up a domain.  I only ran into one small bug when setting up some DNS records, but I suspect it will be resolved quickly.  (not really that important anyway)
 
My goals for creating this site are to have a repository where I can dump random information that I may want to share with friends/family/random people in chat channels... and to be a reference for taking "enterprise-grade" technologies and making them more accessible to small businesses.  (technologies like clustering/HA of nearly all types, SANs, redundancy, virtual servers, etc...) as well as random tidbits that I've picked up along the way.  (I dabble in nearly everything "computing" related)
 
To start with... I am not a respecter of brands.  I won't ever buy something simply because it's the "next" in the series from that manufacturer.  There are brands of products I've tested and been very disappointed with, and that may influence future purchases... but who wouldn't be influenced?  I am not necessarily loyal to Microsoft or any flavor of *nix.  I feel that each has its place in the world... sometimes one is better for the task than the others.  The fact that I don't prefer one solution or another does not invalidate the option... but like all people in this world... I do have opinions too.  I can also appreciate the value of the mighty dollar.  Not everyone in the world has a bottomless pit of money to draw from... wait... who besides Bill Gates does?
 
Anyhow... this is becoming a long & boring post... so on that note... let's move on to something more interesting.