[prog] 'protecting' perl code
Almut Behrens
almut-behrens at gmx.net
Mon May 10 23:47:17 EST 2004
G'day Jacinta, and everyone else!
On Fri, May 07, 2004 at 08:45:42AM +0200, I wrote:
> I'll add some more on this in my reply to Jacinta's mails.
> I'm afraid this is going to be more than just three lines, though, so
> I'll do that some time later the day - as soon as I get around to it.
Okay, took me some time to get around to it...
Anyway, having seen your remarks, it seems to me that there's some
basic misunderstanding. Maybe I didn't express myself clearly enough.
I'll give it another try below.
On Fri, May 07, 2004 at 12:33:31PM +1000, Jacinta Richardson wrote:
> Almut wrote:
> > The key concept is "source filters". Perl provides low-level mechanisms
> > to intervene with how the parser handles the stream of source code that
> > makes up the script. Type 'perldoc perlfilter' to get a concise but
> > good description of what this is all about.
>
> Think twice before writing your own source filter. I've seen enough
> cases where things break due to source filters making mistakes. And the
> breakage can be really subtle and next to impossible to find sometimes.
> Even Damian Conway's switch used to get caught up on some strange cases.
I agree that adding any code anywhere always carries more potential for
problems than not adding it. OTOH, I haven't encountered too many
strange cases so far with the version I implemented for my client.
I know, Perl is a complex language that allows you to do all kinds of
tricky things. Yes, no doubt. But the only issue that ever popped up
in practice was that he couldn't read some POD documentation that he
had embedded into his modules (he tried to open the (encrypted) script
itself a second time from within the script to generate a usage
message... -- quite an obvious failure mode, IMHO. Even that was solved
easily by moving the POD parts into the unencrypted section at the top
of the scripts).
I agree, however, that you should have a fair idea of what you're doing...
> > Here's a rough sketch of what needs to be done (more details upon
> > request):
> >
> > * write the source filter as an extension module in C. Although you
> > can, in principle, write source filters in perl (see Filter::Simple,
> > for example), this would be too easy to reverse engineer.
>
> My goodness that sounds painful...
I don't think there are many other options. Note that I'm talking
about the part doing the decryption, not the application scripts that
the user writes. The decryption module is generic and has to be written
and tested just once.
This is remotely similar to the part of the SSH daemon doing the
decrypting of the stream of bytes that the client sends -- its inner
workings are totally independent of the content that's sent across the
line...
The user-scripts are encrypted using some appropriate cryptographic
algorithm, of which there are many around. The purpose is just to
make it impossible (within the limits of the algorithm used) to derive
the cleartext from the encrypted script alone -- without having to
reverse engineer the perl interpreter containing the decryption logic.
The main problem is that you have no other way than to distribute the
decryption routines plus the password (or random seed) together with
the encrypted program. Thus, anyone can try any tools they have to
find out how the decryption works, or simply to somehow capture the
source code from somewhere within the interpreter's memory, when pieces
of it are available in their final decrypted form.
(Even if you don't embed the password, the user needs to know it to be
able to run the program, so that wouldn't make much of a difference.)
Due to this, you have a point of severe vulnerability that you can't
get around. That's where obfuscation comes in. It's your only option.
You need to make every effort to make that crucial piece of code as
impenetrable as possible for someone trying to reverse engineer it. [1]
In my implementation I decided to use a modified version of the famous
RC4 algorithm. The basic idea of this algorithm is to bitwise XOR your
data with a pseudo-random sequence of bytes. XOR'ing that same sequence
a second time recovers the original data. Nice and simple, but still
amazingly effective, if the pseudo random number generator (PRNG) is
good. The reason I'm using a different PRNG than the one normally used
with RC4 is as follows: if I were using the standard RC4 PRNG, anyone
guessing that RC4 is being used could download the algorithm from
somewhere on the net. In that case they'd only have to somehow extract
the random seed (password) from the perl binary. Still not necessarily
trivial, if
you hide it well, but in any case far easier than additionally having
to figure out how the algorithm itself works. Remember, it's all about
obfuscation.
>
> > * strip all symbol information from the resulting perl binary, to make
> > debugging even more difficult.
>
> .... there is a reason for that symbol information .... make sure your
> code works before you do this step. Debugging after this point will be
> like having your fingernails pulled out
Again, I'm talking about the perl interpreter itself. Like many other
binaries, it is regularly distributed in stripped form (for whatever
reason). This is usually okay. Also, in this particular case, the
end user will typically not _want_ to debug your code anyway...
The idea is simply to build your own special perl binary that contains
an additional, statically linked-in module that decrypts the incoming
source stream when activated (i.e. the interpreter can still be used as
usual with unencrypted scripts).
>
> > * add a good amount of 'dummy' code with the only purpose to confuse,
> > i.e. code that does seemingly useful operations, while the real
> > required functionality happens as hidden side effects in the background.
>
> Okay at this point I thoroughly disagree. Unless the code you're
> obscuring is 100% perfect and will never ever change (and even the hello
> world program isn't that) this is the fastest way possible to make your code
> unusable.
> (...)
> I can't say strongly enough how fundamentally flawed this idea is. From
> a Software Engineering point of view I can't even really contemplate it.
Hopefully it has become clear in the meantime, that I'm _not_ proposing
to do some munging or obfuscating of the perl application code...
The point is that the PRNG mentioned above typically consists of only a
few hundred machine code instructions -- not nearly enough to make life
really hard for anyone trying to understand what's being done. That's
where I'm suggesting to add some dummy code -- simply to inflate the
size of the code that has to be reverse engineered.
If you make a big, highly complex mess out of the PRNG code by adding
lots of useless stuff, with hundreds of hidden side effects, chances
are good that a potential cracker will give up before succeeding.
(It goes without saying that you'd better comment this code at least
ten times as thoroughly as you normally would, or else you'll find
yourself in the position of the cracker when you look at your own code
some months later...)
At the time I implemented this, I showed the source code to a few of
my co-worker gurus and asked them to tell me what the code was really
doing. Even though they could look at the C code (with comments
removed), they were not able to figure it out in a reasonable amount
of time.
Now, consider that the cracker sees this mess in the form of assembly
instructions with all symbols removed (i.e. with numeric addresses
only, giving no hint whatsoever as to what the variables and functions
were originally meant for).
Nothing anyone wants to wade through, really...
>
> * Almut forgot the good ole "make all the variable names look like
> rubbish and remove all possible whitespace".
Do I sense some latent hostility here? ;)
>
> This won't buy you a lot. The perltidy program will cope with the
> whitespace and substitution can slowly fix the rest.
>
> > The general idea is that anyone running the perl interpreter in a
> > debugger (like gdb) would have a hard time figuring out what's _really_
> > going on...
>
> And likewise for the poor person who has to maintain it.
Again, as outlined above, I'm not proposing to mess around with your
perl code. What I'm talking about here is that piece of decryption code
that you need to embed into the perl binary. The scripts themselves can
remain as neat and tidy as they ever were (with a few exceptions like the
POD issue mentioned before). When the regular development and testing
has been done, they're simply encrypted before being shipped to whoever
you don't want to see the code.
A typical script then looks something like:

    #!/usr/local/bin/plcrun
    use PLC;
    PLCv1:MD5:<md5sum><salt><encrypted code ...>

or, an individual *.pm module:

    use PLC;
    PLCv1:MD5:<md5sum><salt><encrypted code ...>
Before the "use PLC;" that activates the decryption in the interpreter,
there can optionally be any regular unencrypted perl code. Likewise,
encrypted and unencrypted modules can be mixed (so you don't need to
encrypt all the system-supplied perl modules). "plcrun" stands for the
modified perl interpreter. "PLC" is the extension module statically
linked into this interpreter (any other namespace can be used, of
course).
<encrypted code> is the binary byte sequence resulting from XOR'ing the
original perl code with the random pattern, as described above. <salt>
is simply an additional random 32-bit value (e.g. timestamp) which is
combined with the internal random seed (the idea is to get different
encrypted sequences every time, even for the same cleartext - similar
to the salt being used by crypt(3) for hashing passwords).
> > Depends. As a general rule, I'd say: if someone is willing to pay you
> > (or your company) for it, why not just do it... :)
> > (yes, sure, do some serious consulting first -- however, they may have
> > their specific reasons for wanting a non-standard solution... well,
> > you get the idea).
>
> As a professional who abides by the SAGE-AU ethics I disagree here too.
> If an acceptable solution already exists (although I don't know whether
> PAR or ActiveState's options are acceptable) then it's wrong to make a
> client pay for you to reengineer the wheel.
In principle I agree with you here, of course. No business relationship
would last long if you pulled a fast one on your clients. But sometimes
your client doesn't yet know about the things he wants. I deliberately
made the above statement a bit provocative to get the point across.
However, we might still disagree on the details, and it's probably best
to simply agree to disagree right from the start.
This is all a matter of opinion. My take on it is that if both sides
are happy with what they got, then that's fine. And by happy I mean:
truly, deeply, entirely happy, not just superficially being polite.
The whole advertising industry, for example, exists for the sole
purpose of making people buy things they otherwise wouldn't. So I guess
I'm in good company.
Assume someone did buy a car that, objectively, is much too large for
him, but the advertising machinery succeeded in tricking him into
buying it. As long as he's happy with what he got for his money, what's
so bad about that? A large part of our capitalist society works by
these principles. I didn't opt for this form of society; I was born
into it.
But this is getting totally off-topic now...
Cheers,
Almut
[1] One particularly mean trick is to make the program behave
differently when run in a debugger. At the machine code level,
debuggers need to clear the processor's instruction prefetch queue
between executing instructions in single step mode (due to the
context switch between program and debugger).
So, if you're witty enough to modify some code fragment at runtime
(after it's been loaded into the prefetch queue), the processor will
execute the re-fetched, modified instructions in the debugger, while
executing the original code during normal operation. This way, you can
make the program take a completely different path when someone is
trying to spy on you.
Not every run-of-the-mill cracker knows about such techniques, so they
can be fooled quite effectively. Even if they are aware of the trick,
chances are good you'll still lure them into one of your traps if you
intersperse your code with several of them.
In the M$-DOS era, some of the harder-to-crack copy protection schemes
made use of such nasty tricks. Under unix it's a bit more difficult to
do, as the code segment is typically write-protected -- but only fools
would think...