[Techtalk] How to write web proxy in Python?

Almut Behrens almut-behrens at gmx.net
Fri Jan 10 11:41:14 EST 2003


On Thu, Jan 09, 2003 at 03:53:04PM -0500, Amanda Babcock wrote:
> Hello all,
> 
> I need to write a very simple web proxy that should only do one thing:
> replace ^M's with ^J's in incoming pages.  (I think some broken Javascript
> with ^M's and no ^J's is preventing me from using my university's online
> classes via Opera :(
> 
> I have heard that the fastest language to write a webserver in is Python.
> I figure the same would be true for a proxy.  I have Python on my box, but
> I don't know the language, and I've never done socket programming in any 
> language.  I don't require the proxy to accept multiple clients - pages 
> will display in one page of the browser, though it does have frames (does 
> that make it look like multiple clients?).
> 
> If doing this in Python is not as simple as, say, C, that's fine too.
> Just need a clue where to start.
> 
> (If only I could just write "cat <HTTP input> | sed s/^M/^J/g | <browser>"...)


Writing a simple web proxy in a scripting language like Python, Perl,
etc. actually isn't that difficult... I know you have Python in
mind, but looking through my collection of scripts, I could only find a
suitable Perl script to present here as an example -- sorry ;)
(There's nothing wrong with Python, I've just been programming in
Perl for much longer, so considerably more Perl code snippets
have accumulated here...)

The example is an absolutely bare-bones HTTP proxy, but for your purpose
it should do without too much tweaking.  Okay, here it is, slightly
modified, because substituting ^Js for ^Ms in *any* type of data
(jpg/gif images and such) would not be a good idea. So we're going to
do it for HTML and JavaScript MIME types only... (you can of course
extend the list if you need to).


#!/usr/bin/perl

use IO::Socket;
use IO::Select;

$PROXY_PORT = 8080;         # port you want the proxy to listen on
$PROXY_BIND = 'localhost';  # local address you want it to bind to

%do_filter = map {$_=>1} qw(text/html application/x-javascript);
                            # might want to change this list ^

$proxy = IO::Socket::INET->new(
    Proto     => 'tcp',
    Listen    => SOMAXCONN,
    Reuse     => 1,
    LocalPort => $PROXY_PORT,
    LocalAddr => $PROXY_BIND,
) or die "can't setup proxy server: $@\n";

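# Let the OS reap finished children (works on most Unix systems). A handler
# that loops on a blocking wait() would stall the parent, and with it
# accept(), for as long as any other child is still running.
$SIG{CHLD} = 'IGNORE';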

while (my $client = $proxy->accept()) {

    my $kidpid = fork();  die "cannot fork" unless defined $kidpid;

    if ($kidpid) {
        close $client;  # no longer needed in parent
        next;
    }
    close $proxy;       # no longer needed in child

    my ($host_port, $request, $url);
    while (<$client>) {        # get HTTP request from browser
        $request .= $_;
        last if /^\s*$/;
        # determine where to connect to
        ($url, $host_port) = m#^GET (http://([^/]+)\S*)# if !$host_port;
    }
    if (!$host_port) { die "no host to connect to!\n"; }
    $host_port .= ":80" if $host_port !~ /:\d+$/;  # default port

    # connect to remote server
    my $server = IO::Socket::INET->new($host_port) or die "remote server: $@\n";

    $server->send($request);

    my ($response, $mimetype);
    while (<$server>) {        # read server's response
        $response .= $_;
        last if /^\s*$/;
        ($mimetype) = /^Content-Type:\s*([\w\/-]+)/i if !$mimetype;
    }
    $client->send($response);  # headers only so far

    # log/debug
    print "fetching $url  [$mimetype]\n";
    
    my $slct = IO::Select->new($server);
    while($slct->can_read()) {
        my $nbytes = read $server, $response, 2**16;
        last if !$nbytes;      # socket closed by peer

        ### here's where you can fiddle with the content of the response:
        $response =~ s/\r/\n/g if $do_filter{$mimetype};

        $client->send($response);  # return data to browser
    }
    exit;  # child done
}

Just cut & paste the code into a file, edit the settings as required,
and start it. Then set up your browser to use this web/HTTP proxy (if
you leave the settings as they are, this would be "localhost:8080"), and
navigate as usual to the site in question...
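
If you'd rather check the proxy from a script before touching the
browser settings, something like the following Python sketch should do.
The helper names (opener_via_proxy, fetch) are made up for this example,
and it assumes the proxy is running with the default "localhost:8080"
settings.

```python
import urllib.request


def opener_via_proxy(proxy="localhost:8080"):
    """Build a urllib opener that sends plain-HTTP requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    return urllib.request.build_opener(handler)


def fetch(url, proxy="localhost:8080"):
    """Fetch url through the proxy and return the raw body bytes."""
    with opener_via_proxy(proxy).open(url, timeout=10) as response:
        return response.read()
```

With the proxy running, b"\r" in fetch("http://example.com/") should be
False for an HTML page (example.com is only a placeholder here; use the
page that is giving you trouble).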

And if you'd like to run it in the background, don't forget to
comment out the print statement above, or redirect the terminal output
to some file (./httpproxy >logfile & ). To terminate the proxy, use
kill <pid>, or simply ^C when it's the foreground job in the shell.
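
Since the original question was about Python, here is a rough sketch of
the same steps as the Perl script above, for comparison. It is not a
tested translation: the function names (parse_target, read_headers,
content_type, fix_newlines, handle, serve) are made up for this sketch,
it uses one thread per connection instead of fork(), and it assumes
Python 3.8 or later (for socket.create_server). Like the Perl version, it
handles one request per connection and relies on the remote server
closing the connection when the page is complete.

```python
import socket
import threading

# MIME types whose bodies get rewritten (same list as in the Perl script).
FILTER_TYPES = {"text/html", "application/x-javascript"}


def parse_target(request_line):
    """Extract (host, port) from a proxy-style 'GET http://host[:port]/...' line."""
    _method, url, _rest = request_line.split(None, 2)
    if not url.startswith("http://"):
        raise ValueError("no host to connect to")
    hostport = url[len("http://"):].split("/", 1)[0]
    host, _, port = hostport.partition(":")
    return host, int(port) if port else 80  # default port


def read_headers(stream):
    """Read lines up to and including the first blank line."""
    lines = []
    for line in stream:
        lines.append(line)
        if not line.strip():
            break
    return b"".join(lines)


def content_type(headers):
    """Return the lowercased MIME type from a header block, or None."""
    for line in headers.split(b"\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            return value.split(b";")[0].strip().decode("ascii", "replace").lower()
    return None


def fix_newlines(mimetype, chunk):
    """Replace CR with LF, but only for the MIME types we want to filter."""
    if mimetype in FILTER_TYPES:
        return chunk.replace(b"\r", b"\n")
    return chunk


def handle(client):
    """Relay one request/response pair, filtering the response body."""
    with client, client.makefile("rb") as from_client:
        request = read_headers(from_client)
        if not request:
            return  # browser connected and went away
        first_line = request.decode("latin-1").splitlines()[0]
        host, port = parse_target(first_line)
        with socket.create_connection((host, port)) as server, \
                server.makefile("rb") as from_server:
            server.sendall(request)
            headers = read_headers(from_server)
            client.sendall(headers)  # headers only so far
            mimetype = content_type(headers)
            print("fetching", first_line, "[%s]" % mimetype)
            while True:
                chunk = from_server.read1(65536)
                if not chunk:  # socket closed by peer
                    break
                client.sendall(fix_newlines(mimetype, chunk))


def serve(bind="localhost", port=8080):
    """Accept browser connections forever, one thread per connection."""
    with socket.create_server((bind, port)) as proxy:
        while True:
            client, _addr = proxy.accept()
            threading.Thread(target=handle, args=(client,), daemon=True).start()
```

Calling serve() starts the proxy on localhost:8080 and runs until it is
interrupted. Because a CR and an LF are one byte each, the substitution
does not change the body length, so a Content-Length header sent by the
server stays valid.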

Note that the script in this form does not support HTTP POST requests,
though that could be added without much fuss, in case you should need
to POST HTML forms etc. Just let me know (or try to add it yourself).
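
To give an idea of what POST support involves: the request-line pattern
would have to accept POST as well as GET, and after the blank line that
ends the headers, the proxy has to read exactly Content-Length more bytes
from the browser and forward them too. Sketched in Python (the language
the original question asked about), with made-up helper names
(content_length, read_request) and an in-memory stream standing in for
the browser socket:

```python
import io


def content_length(headers):
    """Return the Content-Length value from a header block, or 0 if absent."""
    for line in headers.split(b"\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            return int(value.strip())
    return 0


def read_request(stream):
    """Read the request headers, then the body announced by Content-Length."""
    head = []
    for line in stream:
        head.append(line)
        if not line.strip():  # blank line ends the headers
            break
    headers = b"".join(head)
    body = stream.read(content_length(headers))
    return headers, body


# Small demonstration; the URL and form data are placeholders.
raw = (b"POST http://example.com/form HTTP/1.0\r\n"
       b"Content-Length: 7\r\n"
       b"\r\n"
       b"a=1&b=2")
headers, body = read_request(io.BytesIO(raw))
print(body)  # b'a=1&b=2'
```

The proxy would then send headers followed by body to the remote server
instead of the headers alone.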


A completely different approach would be to use a tool like 'netcat'
(often also called 'nc') and set up a simple port-forwarding relay, or
something slightly more advanced, as shown in the example shell scripts
'webproxy' or 'webrelay' that come with the package. Somewhere in the
processing pipe you should be able to insert the appropriate sed
command...
Still, things might not really turn out to be that much easier in the
end (compared to the above script) -- but I wanted to mention the tool
anyway, as it doesn't seem to be too well known, not even among
experienced admins or programmers.

From the package description: netcat is a "TCP/IP swiss army knife",
and, in fact, it's a very useful little program for a multitude of
different purposes where you don't want to get into real programming
right away. The README from the package contains lots of discussion
and example usages. Actually, I wonder why it never made it into the
standard suite of Unix utilities like 'cat' and friends...

So, to play around with it, you might need to install the appropriate
package for your distro. For Debian, this would be as simple as
"apt-get install netcat", for example.

Have fun,

Almut


PS: does anyone know what the "official" MIME type for JavaScript is?
Is it application/x-javascript, or text/javascript, text/x-javascript,
or something else again? I've come across several of those, so I'm no
longer sure about any of them.  Apache, for example, serves .js files as
application/x-javascript. Browsers either don't seem to care or behave
inconsistently... A little googling also raised more questions
than it answered.
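
Until that is settled, a filter can simply accept all of the variants
mentioned above. A small Python sketch of that check (Python only because
that is what the original question asked about; the names FILTER_TYPES
and should_filter are made up for the example):

```python
# HTML plus the three JavaScript types named above.
FILTER_TYPES = {
    "text/html",
    "application/x-javascript",
    "text/javascript",
    "text/x-javascript",
}


def should_filter(content_type_value):
    """True if a Content-Type header value names one of FILTER_TYPES."""
    # Drop parameters such as "; charset=..." and ignore case.
    mimetype = content_type_value.split(";", 1)[0].strip().lower()
    return mimetype in FILTER_TYPES
```

In the Perl script, the equivalent change is just adding the extra types
to the qw(...) list that fills %do_filter.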



