Computing

reverse proxying with apache and mod_proxy_html

I've been fighting to get some reverse proxy things working today at work. Some Python application servers that speak HTTP live on servers with private IP addresses behind the firewall, but they need to be reachable from the outside world via an HTTPS portal that does authentication checking with mod_authnz_ldap. Basically, https://example.com/app1/ needs to go to http://app1:8888/. I figured out much of what is below with the help of: http://www.apachetutor.org/admin/reverseproxies.

Apache's mod_proxy seemed like it would be simple enough to use, and 2 lines of config file changes later, the first page was working. However, redirects from the app servers were sending the client to internal addresses, which didn't work, and absolute URLs in HTML from the app server needed to be changed to include the /app1/ prefix used on the externally facing server. Enter mod_proxy_html.
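
To make those two problems concrete, here is a minimal Python 3 sketch of the rewriting that has to happen somewhere between the app server and the client. It is only an illustration of the idea, not how Apache implements it; the function names are made up for this example, and the two addresses are the ones from the top of this post.

```python
INTERNAL = "http://app1:8888/"          # address only reachable behind the firewall
EXTERNAL = "https://example.com/app1/"  # address the outside world sees


def rewrite_location(location):
    """Map a redirect target on the internal server to its public equivalent."""
    if location.startswith(INTERNAL):
        return EXTERNAL + location[len(INTERNAL):]
    return location  # redirects to other hosts are left alone


def rewrite_absolute_path(path, prefix="/app1"):
    """Prefix a root-relative link (e.g. /static/site.css) with the proxy path."""
    if path.startswith("//"):
        return path  # protocol-relative URL pointing at another host
    if path.startswith("/") and not path.startswith(prefix + "/"):
        return prefix + path
    return path


print(rewrite_location("http://app1:8888/login?next=/"))
print(rewrite_absolute_path("/static/site.css"))
```

The first function covers the redirect problem (a Location header pointing at the internal host), the second covers links inside the HTML itself, which is the part mod_proxy_html is for.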

mod_proxy_html is a third-party module that allows content modification, including replacing link addresses with different addresses. I downloaded and installed it on the proxy server, but it wasn't working. Turning up debugging with

LogLevel debug
ProxyHTMLLogVerbose On

gave me the following message: "No links configured: nothing for proxy-html filter to do", and Google only had one result for this: mod_proxy_html.c, the source code for mod_proxy_html with the error message in it! It turns out that much of the documentation for mod_proxy_html is out of date, and in mod_proxy_html 3.0 the link tag definitions have been removed from the code and must be included in the configuration. Had I looked at the config file provided with the download (instead of the one I'd been writing from howtos), this wouldn't have happened, but it's surprising Google hasn't indexed anyone else running into this! The fix for this was to include the following in my config:

ProxyHTMLLinks  a               href
ProxyHTMLLinks  area            href
ProxyHTMLLinks  link            href
ProxyHTMLLinks  img             src longdesc usemap
ProxyHTMLLinks  object          classid codebase data usemap
ProxyHTMLLinks  q               cite
ProxyHTMLLinks  blockquote      cite
ProxyHTMLLinks  ins             cite
ProxyHTMLLinks  del             cite
ProxyHTMLLinks  form            action
ProxyHTMLLinks  input           src usemap
ProxyHTMLLinks  head            profile
ProxyHTMLLinks  base            href
ProxyHTMLLinks  script          src for
ProxyHTMLLinks  iframe          src

ProxyHTMLEvents onclick ondblclick onmousedown onmouseup \
                onmouseover onmousemove onmouseout onkeypress \
                onkeydown onkeyup onfocus onblur onload \
                onunload onsubmit onreset onselect onchange

An Apache restart later, and HTML links were getting rewritten. Neat! On to the next problem: the app servers in question have lots of hardcoded absolute URLs, many of them in CSS and JS files. The mod_proxy_html technical guide has an initial solution to this, using a regular expression like:

ProxyHTMLURLMap url\(http://internal.example.com([^\)]*)\) url(http://proxy.example.com$1) Rihe

However, this only works on inline CSS, because mod_proxy_html only processes HTML content types and not the text/css that CSS files are sent as. A workaround is setting the PROXY_HTML_FORCE environment variable, but in addition to forcing mod_proxy_html to look at CSS files, this forces it to process image files, etc., which uses up too much CPU for our use case. Doh!
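
For anyone wondering what that rule actually does to a stylesheet, here is a small Python 3 sketch that applies the same pattern with the re module. This is just the regular expression from the rule above translated to Python (with the dots escaped and case-insensitive matching turned on as an assumption), not anything mod_proxy_html runs itself; the sample CSS is made up.

```python
import re

# The ProxyHTMLURLMap pattern above, written as a Python regex.
CSS_URL = re.compile(r"url\(http://internal\.example\.com([^\)]*)\)", re.IGNORECASE)


def rewrite_css_urls(css):
    """Point url(...) references at the proxy instead of the internal host."""
    return CSS_URL.sub(r"url(http://proxy.example.com\1)", css)


sample = "body { background: url(http://internal.example.com/img/bg.png); }"
print(rewrite_css_urls(sample))
# prints: body { background: url(http://proxy.example.com/img/bg.png); }
```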

Setting up each application server as a vhost instead is a lot simpler (the 2 lines of config I started with here are enough), and while it's less than ideal, we have wildcard SSL certificates, so having https://app1.example.com/ isn't the end of the world and doesn't require any additional IP addresses.

GPX GPS trace files and elevation gain

I carry a GPS with me on long bike rides and pull the resulting trace into Google Earth and Garmin's MapSource software. Google Earth is nice for looking at, but doesn't provide much useful information, and MapSource is pretty awful to look at (and will only run in Windows so I have to boot up VMware) but does provide elevation maps (as well as the ability to load maps). I recently started using a bike computer with cadence, and a heart rate monitor, and the last missing piece of information was total elevation gain over a ride. This information is nowhere in MapSource or Google Earth.

I can get GPX format (the standard interchangeable format for GPS information) files out of MapSource and it's just XML, so after trying several tools online and several programs I downloaded that didn't work, I wrote a quick Python script to get me the info I want. Hopefully this will help someone else:

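from xml.dom import minidom

FEET_PER_METER = 3.2808399  # GPX stores elevation in meters

doc = minidom.parse('./file.gpx')

min_ele = None
max_ele = None
gain = 0.0
loss = 0.0
last = None  # previous elevation; None until the first point is read

for node in doc.getElementsByTagName("ele"):
        if node.firstChild is None:
                continue  # skip empty <ele/> elements
        cur = float(node.firstChild.data)
        if max_ele is None or cur > max_ele:
                max_ele = cur
        if min_ele is None or cur < min_ele:
                min_ele = cur
        # Compare against None rather than 0 so a point at exactly 0 m still counts
        if last is not None:
                if cur > last:
                        gain += cur - last
                elif cur < last:
                        loss += last - cur
        last = cur

if last is None:
        raise SystemExit("no <ele> elements found")

print("max: %.2fft" % (max_ele * FEET_PER_METER))
print("min: %.2fft" % (min_ele * FEET_PER_METER))
print("gain: %.2fft" % (gain * FEET_PER_METER))
print("loss: %.2fft" % (loss * FEET_PER_METER))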

So for my 43 mile ride on Sunday:
max: 1110.63ft
min: 773.16ft
gain: 3328.98ft
loss: 3232.78ft

Getting those numbers was a lot harder than it should have been! Good ride though..
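
One caveat with summing every up and down between consecutive points: GPS elevation readings jitter, so small wobbles get counted as climbing too. A common way to tame that is to ignore changes below some threshold. Here is a minimal Python 3 sketch of that idea; the function name and the 3 meter threshold are assumptions picked for illustration, not something the script above does, and the sample elevations are invented.

```python
def elevation_gain(elevations, threshold=0.0):
    """Sum climbs and descents in meters, ignoring changes smaller than threshold."""
    gain = 0.0
    loss = 0.0
    if not elevations:
        return gain, loss
    ref = elevations[0]  # last elevation that was accepted as a real change
    for cur in elevations[1:]:
        delta = cur - ref
        if abs(delta) >= threshold:
            if delta > 0:
                gain += delta
            else:
                loss -= delta
            ref = cur
    return gain, loss


points = [100.0, 101.0, 100.0, 101.0, 110.0, 105.0]
print(elevation_gain(points))                 # every wobble counted
print(elevation_gain(points, threshold=3.0))  # only changes of 3 m or more
```

With a threshold of 0 this gives the same totals as the script above; raising it trades some real small climbs for less noise, so the right value depends on the GPS.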

Griffin PowerMate and Rhythmbox

I was going through some drawers and stumbled across my good old Griffin PowerMate that I got back before I started using Linux. It controlled iTunes in Mac OS X 10.1 and was great because I could change volume and pause music without having to change programs or anything. These days I use Rhythmbox in Linux to listen to music and there's not a plugin for it. Yet!

Rhythmbox supports plugins written in Python, a guy has some skeleton Python code for talking to the PowerMate, and that means something could work out!

I got the PowerMate working by compiling and loading the powermate module for 2.6 Linux kernels (in 2.6.23 it's in Device Drivers -> Input device support -> Miscellaneous devices -> Griffin PowerMate and Contour Jog support) and adding a udev rule:

# /etc/udev/rules.d/45-powermate.rules
KERNEL=="event*", SYSFS{product}=="Griffin PowerMate", NAME="powermate", GROUP="users", MODE="0660"

I plugged it in, catted /dev/powermate, and with each twist or push it spit out garbage to the screen. Success!
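
That "garbage" is a stream of binary Linux input events, each one a timestamp followed by a type, a code, and a value. Here is a minimal Python 3 sketch of decoding them by hand. It assumes the classic struct input_event layout with native long-sized timestamp fields, and the helper names are made up for this example; powermate.py does its own parsing, this is just to show what's in the stream.

```python
import struct

# struct input_event: tv_sec, tv_usec, type, code, value (native sizes)
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)

EV_KEY = 1  # button press/release
EV_REL = 2  # knob rotation


def decode_event(raw):
    """Unpack one raw event and return (type, code, value)."""
    tv_sec, tv_usec, ev_type, code, value = struct.unpack(EVENT_FORMAT, raw)
    return ev_type, code, value


def describe(raw):
    """Turn one raw event into a short human-readable string."""
    ev_type, code, value = decode_event(raw)
    if ev_type == EV_KEY:
        return "button pressed" if value else "button released"
    if ev_type == EV_REL:
        return "turned %+d" % value
    return "other event (type %d)" % ev_type


def dump_events(path="/dev/powermate"):
    """Print a line per event until interrupted (needs read access to the device)."""
    with open(path, "rb") as dev:
        while True:
            raw = dev.read(EVENT_SIZE)
            if len(raw) < EVENT_SIZE:
                break
            print(describe(raw))
```

Calling dump_events() with the device plugged in should print a line for every twist or push instead of garbage.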

A quick glance through everything shows that Rhythmbox doesn't support threads and the Python code here uses polling, so I'd need to delve into the Rhythmbox docs to figure out the best way to do that. But Rhythmbox also exposes itself through DBus, and there are some examples of using this around the internet. In a few minutes, I hacked together something dirty to cover the basics, and perhaps later on I'll make something that works as a Rhythmbox module. Right now pushing the button is play/pause, turning it adjusts the volume, and the LED shows volume when playing and pulses slowly when paused. Here ya go:

#!/usr/bin/python

import powermate
import dbus

# Event types as they appear in event[2]
EVENT_BUTTON_PRESS = 1
EVENT_RELATIVE_MOTION = 2

# Make sure Rhythmbox is running, then get its Player interface over DBus
bus = dbus.SessionBus()
bus.start_service_by_name('org.gnome.Rhythmbox')
proxy_obj = bus.get_object('org.gnome.Rhythmbox', '/org/gnome/Rhythmbox/Player')
player = dbus.Interface(proxy_obj, 'org.gnome.Rhythmbox.Player')

pm = powermate.PowerMate("/dev/powermate")

def show_volume():
    # LED brightness follows the current volume
    pm.SetLEDState(int(player.getVolume() * 255), 0, 0, 0, 0)

while True:
    event = pm.WaitForEvent(-1)
    # event[4] == 0 on a button event means the button was released
    if event[2] == EVENT_BUTTON_PRESS and event[4] == 0:
        player.playPause(1)
        if player.getPlaying():
            show_volume()
        else:
            # Pulse slowly while paused
            pm.SetLEDState(255, 252, 1, 1, 1)
    elif event[2] == EVENT_RELATIVE_MOTION and player.getPlaying():
        # Scale the knob movement into a relative volume change
        player.setVolumeRelative(event[4] * 0.02)
        show_volume()

Download powermate.py, save the code above as whatever.py in the same directory, run it, and you'll be able to control Rhythmbox with your PowerMate in Linux!

F-Spot EXIF information mangling

I use F-Spot to manage my photographs. It's fast, clean, simple, and does everything in my current workflow which is JPG on camera -> YYYY/MM/DD folders -> Gallery on my website. Once I start shooting RAW it will get a little more complicated, but F-Spot keeps moving forward so hopefully they'll come up with a plan for that.

When uploading images to Gallery, I noticed that my photo timestamps were off. Conveniently, there was a discussion about this on the F-Spot mailing list at the same time and it turns out that every time you import an image in F-Spot, it adjusts the EXIF Timestamp information based on your timezone. Basically, if you're 5 hours away from GMT, on import F-Spot writes to the file that the image was taken 5 hours later than it actually was. Not only does it do this once, but if you re-import images into F-Spot for whatever reason it does this again, again, and again.
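
To put numbers on that, here is a tiny Python 3 sketch of how the damage stacks up. The function name is made up, and it simply assumes each import adds the full timezone offset to the timestamp, as described above; it doesn't touch any files.

```python
from datetime import datetime, timedelta


def shifted_timestamp(original, utc_offset_hours, imports):
    """Timestamp after `imports` imports, each one adding the timezone offset."""
    return original + timedelta(hours=utc_offset_hours * imports)


taken = datetime(2007, 12, 31, 23, 59, 0)  # a New Year's Eve photo
for n in range(4):
    print(n, shifted_timestamp(taken, 5, n))
```

After a single import the photo already claims to be from the next year, and a third import pushes it into the following afternoon.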

This was a bit of a surprise because EXIF information written by the camera shouldn't be changed by an import program! I thought I'd lost all the actual capture date/times of my ~30,000 photos, and was getting pretty upset that software would do this, but after digging through EXIF headers from all the cameras I've had, it turns out that the "DateTimeOriginal" was still good! I disabled F-Spot's ability to write metadata to files (which means I'll have to stop tagging images until this is all resolved upstream) and wrote a little script to fix my files. If you've run into this and would like your original EXIF information back so that photos taken on New Year's Eve as the year ticks over aren't at some hour after sunrise on Jan 1st, use this! Just replace $directory with the path to your photo library, save it to a file named "fixer.pl" and run "perl fixer.pl". Note that you'll need find and jhead installed.

EDIT: Note! I looked at this again with my 40D and new version of f-spot. It seems that now the correct EXIF header is "DateTimeDigitized" and _NOT_ "DateTimeOriginal". Please verify things on your setup before running this random script you found on the internet!

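#!/usr/bin/perl

use strict;
use warnings;

my $directory = "/media/photos/";

# Find every .jpg/.jpeg file (any case) under $directory
my @files = `find "$directory" -type f -iregex '.*\\.\\(jpg\\|jpeg\\)'`;

foreach my $file (@files) {
        chomp $file;
        my $dateline = `jhead -v "$file" | grep DateTimeOriginal`;
        # Only touch the file if the tag was found with a quoted value
        if ($dateline =~ /"([^"]*)"/) {
                my $date = $1;
                # jhead wants YYYY:MM:DD-HH:MM:SS, EXIF has a space
                $date =~ s/ /-/g;
                # List form of system() keeps odd filenames away from the shell
                system("jhead", "-ts$date", $file);  # rewrite the EXIF timestamp
                system("jhead", "-ft", $file);       # set file time to match
        }
}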

10+ years of internetting

I just realized that my Yahoo profile is now over 10 years old! I apparently created it on February 12, 1998, and while I know that I had AOL at home before that for perhaps around a year, I can't find any indication that they provide account creation dates anywhere in their system. (And back before AIM was properly integrated, I had to switch from ckdake to theckdake when we canceled AOL and didn't get to switch back to ckdake until perhaps college?)

Before AOL, I got online once or twice at a friend's house, but I know my first experience online was at the 99X booth at some Olympic experience thing at the 1996 Atlanta Olympics. They had a web browser, I typed "games" in the address bar and alas, couldn't get to any games. Needless to say, I didn't realize that the internet was good to have until later.

In 1999, I purchased my first domain name: ithought.org (for $70 a year or something stupid expensive from Network Solutions) and it's still the one I use for all my servers. ckdake.com finally showed up in 2004.

Things sure have come a long way in ~12 years!

PHP Security, Round 2

As I've noticed from watching hits on my site here, many of you have read my page on PHP security using mod_fastcgi and suexec. The logic on that page still holds, but Gentoo decided to make the switch from mod_fastcgi to mod_fcgid and it broke all sorts of things for me. I got things scratched back together without any security on my old server, and with the installation of my new server a few weeks ago, I set things up more securely again. I still think this is the way to go for a server where many of the virtual hosts will seldom see traffic, but if you're running lots of high traffic sites and have a little bit of RAM overhead, you might want to check out this article on mpm-peruser.

For this setup, I decided to stick to some standards. This means no more changing the suexec directory, using /data/, or anything like that. Other than that, the key differences from last time:

  • All configuration is now done with a setup script instead of using a MySQL database. There was not really any point in the host names being in a database, and setup/teardown is easier to write as just a bash script.
  • Some hosts have PHP, some don't, so no point in setting up all the overhead if a host isn't going to use PHP.
  • Most hosts won't have any interest in having their own logs. Statistics can be done using client side things such as Google Analytics, and Apache is happier writing all the logs to 1 place instead of hundreds. I also have split-logs running when logs are rotated, so logs can easily be gathered per-site as needed, just not real time by one of my hosting customers. I've never known of one of my customers using live access to their logs.
  • php.ini files are now stored with the wrapper script in the site's cgi-bin directory, and file system extended attributes are used to protect them. This means no separate home for php.ini files, and it's easier for users to see what their PHP configuration is.

The script isn't quite ready for sharing yet, but here's what you can do to get a setup like this:

  1. on Gentoo, make sure your USE contains: suexec, apache2, cgi, fastcgi, session.
  2. on Gentoo, "emerge apache php mod_fcgid". On other platforms, consult your docs (or just download mod_fcgid and use apxs to install it; it should be pretty seamless)
  3. Set up your global configuration. On Gentoo, this is done for you, but make sure this gets loaded into your global apache configuration:
    LoadModule fcgid_module modules/mod_fcgid.so
    SocketPath /var/run/fcgidsock
    SharememPath /var/run/fcgid_shm
    
    <Location /fcgid>
    SetHandler fcgid-script
    Options ExecCGI
    allow from all
    </Location>
    
  4. Add a user and group for your first virtual host, test.example.com. call em "example" if you like
  5. Set up the directory tree for the virtual host:
    /var/www/test.example.com/
    /var/www/test.example.com/tmp
    /var/www/test.example.com/htdocs
    /var/www/test.example.com/htdocs/cgi-bin
    
  6. Make some files:
    <!-- /var/www/test.example.com/htdocs/test.html -->
    hello HTML world!
    
    <?php
    /* /var/www/test.example.com/htdocs/test.php  */
    print("hello PHP world!");
    ?>
    
    #!/bin/sh
    # /var/www/test.example.com/htdocs/cgi-bin/fphp
    PHPRC=/var/www/test.example.com/htdocs/cgi-bin/
    export PHPRC
    PHP_FCGI_CHILDREN=2
    export PHP_FCGI_CHILDREN
    PHP_FCGI_MAX_REQUESTS=25000
    export PHP_FCGI_MAX_REQUESTS
    exec /usr/bin/php-cgi
    
  7. Copy your php.ini to /var/www/test.example.com/htdocs/cgi-bin/ (the directory PHPRC points at) and edit it so that directories are right. All you'll likely need to change is upload_tmp_dir and session.save_path, but you may want to change others.
  8. Set the fphp wrapper to be executable, and make sure permissions are set on it so that it is owned by your test user/group and other users can't mess with it. If things don't work later, this is a frequent culprit.
  9. Set the immutable bit on php.ini and fphp (you'll need to be using extended file system attributes on your filesystem to do this, check your OS documentation for details) by running 'chattr +i /var/www/test.example.com/htdocs/cgi-bin/*'. You'll need to undo this with chattr -i if you want to change these files in the future.
  10. Set up this host's configuration:
    <VirtualHost *:80>
            DocumentRoot /var/www/test.example.com/htdocs/
            ServerName test.example.com
            SuexecUserGroup example example
            <Directory /var/www/test.example.com/htdocs/>
                    Options +SymLinksIfOwnerMatch
                    AllowOverride All
                    Order allow,deny
                    Allow from all
                    DirectoryIndex index.html index.php
                    AddType application/x-httpd-fastphp .php
                    Action application/x-httpd-fastphp /cgi-bin/fphp
            </Directory>
    
            <Directory /var/www/test.example.com/htdocs/cgi-bin/>
                    SetHandler fcgid-script
                    FCGIWrapper /var/www/test.example.com/htdocs/cgi-bin/fphp .php
                    Options +ExecCGI -Includes
                    allow from all
            </Directory>
    
    </VirtualHost>
    
    <VirtualHost *:80>
            ServerName aerospace.com
            Redirect Permanent / http://test.example.com/
    </VirtualHost>
    
  11. Give apache a restart and that should be it!

Check out the processes running on your server, and after you hit test.php you should see a php-cgi process running as the example user. If you have problems, error_log and suexec_log in /var/log/apache2/ (or /var/log/httpd/) tend to tell you everything you need to know.
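
Since the per-host steps are the same every time, they're easy to template. My setup script isn't ready to share, so here is a minimal stand-in sketch in Python 3 that just prints the directories and the VirtualHost block for one host. The function names are made up for this example, and the template mirrors the configuration from step 10 (minus the redirect vhost); it doesn't create users, write files, or touch Apache.

```python
VHOST_TEMPLATE = """\
<VirtualHost *:80>
        DocumentRoot {root}/htdocs/
        ServerName {host}
        SuexecUserGroup {user} {group}
        <Directory {root}/htdocs/>
                Options +SymLinksIfOwnerMatch
                AllowOverride All
                Order allow,deny
                Allow from all
                DirectoryIndex index.html index.php
                AddType application/x-httpd-fastphp .php
                Action application/x-httpd-fastphp /cgi-bin/fphp
        </Directory>

        <Directory {root}/htdocs/cgi-bin/>
                SetHandler fcgid-script
                FCGIWrapper {root}/htdocs/cgi-bin/fphp .php
                Options +ExecCGI -Includes
                allow from all
        </Directory>
</VirtualHost>
"""


def site_directories(host, base="/var/www"):
    """Directories from step 5 for one virtual host."""
    root = "%s/%s" % (base, host)
    return [root, root + "/tmp", root + "/htdocs", root + "/htdocs/cgi-bin"]


def render_vhost(host, user, group=None, base="/var/www"):
    """Fill in the step 10 configuration for one virtual host."""
    return VHOST_TEMPLATE.format(
        root="%s/%s" % (base, host),
        host=host,
        user=user,
        group=group or user,  # default to a group named after the user
    )


for path in site_directories("test.example.com"):
    print(path)
print(render_vhost("test.example.com", "example"))
```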

And oh yeah, want to use APC to speed up your PHP applications significantly under this setup? Just install APC, then add the configuration for it to the bottom of the php.ini for any hosts that you want to enable this on. Given that APC isn't 100% perfect and crashes sometimes, the beauty of the fcgid setup is that a crash will only take out the php-cgi process, and the fcgid manager will just start a new one like nothing happened.

Adobe Fast Web View

Adobe Fast Web View is a very lightly documented but seemingly often used feature in Adobe Acrobat Reader. From the user's point of view, it does what it says and makes pages of a PDF show up in their browser before the entire PDF is completely downloaded, but it's a bit more complicated from a server operator's point of view. And, it is enabled by default when installing Adobe Acrobat Reader.

We recently moved sugarcrm.com and some other web properties from stand-alone web servers to a clustered solution involving NFS, load balancers, database replication, etc. It was a pretty complex migration and we're pretty sure that we're running a handful of applications on this cluster that nobody has ever clustered before, so needless to say we ran into our share of gotchas. (Other than one web server that seems to be cursed..) One of them was very strange, and involved PDFs: everything worked fine in all browsers on all platforms until Adobe Acrobat Reader entered the picture. Some number of PDFs would lock up the browser and never load, but only when the PDFs were served from the cluster through the load balancer. When served from one of the cluster web servers but not through the load balancer, everything would work perfectly! Also, with the Adobe plugin disabled, the PDFs would save perfectly and be viewable every time.

Using livehttpheaders it was apparent that 2 HTTP requests were being made, so my guess was that the browser would do a GET for the PDF, but when the Adobe plugin took over, it was sending a new HTTP request (all about HTTP). This shouldn't be an issue, but things weren't working! I installed Wireshark on my Windows test installation and dug deeper. Immediately, I noticed that all the data packets in the response coming from the server were fragmented. This typically means that there is an MTU mismatch somewhere along the path. However, with some Googling around for PDF files I noticed the same behavior on 50% of the sites I hit, and those PDFs were working fine in the Adobe plugin. Regardless, Igor and I set out to tinker with the MTUs on the web servers and load balancer. Changing the MTU from 1500 down to 1400 did change which PDFs would load in the plugin, but it didn't fix all of them. Strange!

Again looking in the Wireshark traces, we saw what looked like a TCP reset loop (read all the details about TCP here). After the first part of data came through successfully, every packet from the server was a RST and the Adobe plugin just sat there waiting for data that was never going to arrive. We poked around the load balancer looking for anything that could cause this but no luck. Googling around for this PDF problem, the only solutions we found were recommendations to disable "Fast Web View." What's that? This gave us another thing to search for and led us to a server-side solution in this forum topic. For whatever reason, the load balancer was breaking HTTP requests with a "Request-Range" header, and Adobe Acrobat Reader was using this to attempt to make the PDF load faster. In retrospect, this makes sense, but it sure was a time consuming thing to discover! If you run into this, the solution is to add the following to your Apache configuration file (or something equivalent if you use lighttpd or something else, we found examples of this happening with other server software):

LoadModule headers_module modules/mod_headers.so
...
<FilesMatch "\.(mp3|zip|pdf)$">
    Header unset Accept-Ranges
    RequestHeader unset Range
    RequestHeader unset Unless-Modified-Since
    RequestHeader unset If-Range
</FilesMatch>
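
If you haven't dealt with byte ranges before, here is roughly what the client is asking for and what the config above switches off. This is a minimal Python 3 sketch of how a server might answer a single-range request; the function names are made up, it only handles one range per request, and it isn't how Apache or the load balancer actually implement it.

```python
def parse_byte_range(header, size):
    """Return inclusive (start, end) for a single 'bytes=...' range, else None."""
    if not header.startswith("bytes="):
        return None
    spec = header[len("bytes="):]
    if "," in spec:
        return None  # multiple ranges are not handled in this sketch
    first, sep, last = spec.partition("-")
    if not sep:
        return None
    try:
        if first == "":
            # Suffix form: "bytes=-500" means the last 500 bytes
            length = int(last)
            if length <= 0:
                return None
            start, end = max(size - length, 0), size - 1
        else:
            start = int(first)
            end = int(last) if last else size - 1
    except ValueError:
        return None
    end = min(end, size - 1)
    if start < 0 or start > end:
        return None
    return start, end


def respond(body, range_header=None):
    """Return (status, headers, payload) for a GET with an optional Range header."""
    if range_header is None:
        # Without a Range header the whole file is sent, as with the config above
        return 200, {"Accept-Ranges": "bytes"}, body
    rng = parse_byte_range(range_header, len(body))
    if rng is None:
        return 416, {"Content-Range": "bytes */%d" % len(body)}, b""
    start, end = rng
    headers = {"Content-Range": "bytes %d-%d/%d" % (start, end, len(body))}
    return 206, headers, body[start:end + 1]


print(respond(b"0123456789", "bytes=2-5"))
```

With the headers stripped, every request falls into the plain 200 case and the plugin gets the whole file in one piece, which is slower for big PDFs but at least it works.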

Don't buy Apple Routers

Consider yourself warned! It's Apple policy to phase out older routers in such a way that they are no longer usable. Case in point: I have two of the second generation Apple AirPort base stations, called "Dual Ethernet" in some places and "Snow" in others. (See the Wikipedia page for more details). They work great, have wireless and wired access, let you use a dial-up connection or dial into them from your remote network, and do 802.11b perfectly. One was at my parents' house for a while, another at my girlfriend's.

Both are now at my house and I wanted to set them up for other things. First, I found this article on their factory default settings which led me to the one on how to reset the things so that I could reload the software. All seemed to make sense and sounded pretty straightforward. My only Mac is a MacBook Pro for work with the newest version of Mac OS X and AirPort utilities, so I seemed to meet all the requirements. However, the routers were not showing up in the Airport Admin Utility. Weird.

I searched around about this problem and found many people with the same issue but no solution, and after digging around on Apple's website some more, I found this Airport Software Compatibility Table. So the "Dual Ethernet" router isn't supported in 10.4 or newer? That seemed strange. I downloaded the AirPort 4.2 for Mac installer, but running on Mac OS X 10.5 it claimed that it required "10.3.3 or newer." Last I checked, 10.5 was newer than 10.3 but *shrug*. I gave Apple tech support a call and after explaining the issue, I was told "Your cheapest option is going to just be buying new routers." WHAT? I have two perfectly fine routers that each cost $300 new, and it turns out Apple decided to just remove support for them from the configuration tool? Wow.

So I dug around some more. I remembered using Windows to configure them at one point, so I booted up Windows XP on the MacBook Pro and searched out the older software. It is still available here: AirPort 4.1 Download for Windows and will allow you to configure your older Apple routers. However, this isn't feasible for me because I don't typically have a Windows installation that I can get to for the planned uses of these routers, so they're now for sale. Check Craigslist, Facebook, or even comment here if you want them. Sold to any reasonable offer. These things work great and are only limited by Apple removing functionality from new versions of their software and not providing a downgrade path (older versions of OS X won't install on these x86 laptops). Even though I recently got a new AirPort 802.11n router (that also works great), I will not be buying another piece of Apple networking hardware.

(Also, I had lots of phone line problems and ended up switching from Speakeasy Business DSL to Comcast Business Cable. I've had to deal with Comcast on the phone once already and their support system is a complete mess, but for the same $ I'm getting my static IP and 15Mb download/2Mb upload instead of the 1.5Mb download/384kb upload.)