Anyone with an email address can expect to receive attachments in a multitude of formats. Unfortunately, some formats cannot be read using free software. This is especially true if our email buddies are still involved in the arguably risky practice of using proprietary programs in conjunction with their email readers.
Many free software advocates adopt a policy of ignoring all email with attachments dependent on closed source software, opting instead to lecture the sender on the importance of open standards. Others may not like missing out on the fun to be had from attachments being forwarded amongst their peers. If you find yourself in this situation, the techniques outlined in this article may serve as a partial solution.
There is not much a Linux user can do if the entire contents of an attachment are encoded using a jealously guarded secret algorithm. Very often, however, the problematic file is merely a thin proprietary envelope enclosing a loose collection of data objects that use well-known encoding standards. For instance, some MS Word documents being forwarded around the Net contain ordinary JPG and PNG images embedded within the file. If we can find a way to remove the envelope, reading these enclosed files becomes a straightforward matter. The following sections describe how this can be accomplished using a little Python scripting together with a few image viewing and manipulation tools available on most Linux distributions.
Before tackling the problem of the embedded images we can easily
view any readable text using the strings utility:
strings proprietary.file | less
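The same utility can hint at embedded images if we search its output
for the signatures of common image formats (-n 3 lowers the minimum
string length so that the three-character PNG marker is reported):
strings proprietary.file | grep JFIF
strings -n 3 proprietary.file | grep PNG
strings proprietary.file | grep GIF8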
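To see why the minimum length matters, here is a rough Python 3 approximation of what strings does. It is a sketch only: the function name printable_runs is made up for this illustration, and the real utility's notion of a printable character may differ slightly.

```python
import re

def printable_runs(data: bytes, min_len: int = 4):
    """Yield (offset, text) for each run of printable ASCII at least min_len long.

    Assumption: "printable" here means bytes 0x20-0x7e plus tab.
    """
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")
```

With the default minimum of four characters, a lone "PNG" surrounded by binary bytes is never reported, which is why the PNG search above needs the shorter limit.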
We need to find where exactly each image is located within the
file. A little Python will help to find possible embedded images and
report their positions as a byte offset:
from string import find

#read in proprietary data
fh = open( "proprietary.file" )
dat = fh.read()
fh.close()

#search for JFIF
x = -1
while 1:
    x = find(dat,"JFIF",x+1)
    if x<0: break
    #file actually started 6 bytes earlier
    print x - 6
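The same search extends to GIF and PNG signatures, with the file name
optionally taken from the command line:
#!/usr/bin/python
from string import find
from sys import argv

#signature keyword and its offset from the true start of the file
headers = [("GIF8",0), ("PNG",1), ("JFIF",6)]

filepath = "proprietary.file"
if len(argv)>1: filepath = argv[1]

fh = open(filepath )
dat = fh.read()
fh.close()

for kw,off in headers:
    #start at -1 so that a match at byte 0 is not skipped
    x = -1
    while 1:
        x = find(dat,kw,x+1)
        if x<0: break
        print kw,"file begins at byte",x - off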
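The listings in this article are written for Python 2, where string.find and the print statement are available. For readers on Python 3, the same search might be sketched as follows. The function name find_signatures and the constant SIGNATURES are illustrative choices, not part of the original scripts.

```python
# Signature keyword and the number of bytes that precede it in a real file header.
SIGNATURES = [(b"GIF8", 0), (b"PNG", 1), (b"JFIF", 6)]

def find_signatures(data: bytes, signatures=SIGNATURES):
    """Return (keyword, start_offset) for every signature hit in data."""
    hits = []
    for keyword, lead in signatures:
        pos = data.find(keyword)
        while pos >= 0:
            # A hit closer to the start than its lead cannot be a full header.
            if pos >= lead:
                hits.append((keyword.decode("ascii"), pos - lead))
            pos = data.find(keyword, pos + 1)
    return hits
```

The data would be read with open(path, "rb") so that it arrives as bytes. As in the original script, a hit only marks a possible image, since the keyword may also occur by chance in unrelated data.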
Now that we know where each image is likely to start, how do we display
them? ImageMagick's display utility can help here. Suppose
our proprietary file contains a JPEG image beginning at byte 1000.
We can use tail to remove all the bytes that precede it and pipe
the rest to display. Because tail -c +N counts bytes from 1 while our
offsets count from 0, we pass 1001:
tail -c +1001 proprietary.file | display -
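The off-by-one between zero-based offsets and tail's one-based byte count is easy to get wrong, so here is a small Python 3 sketch of the same step. Both function names are hypothetical, and show_image_at assumes that ImageMagick's display is installed and on the PATH.

```python
import subprocess

def read_from_offset(path: str, offset: int) -> bytes:
    """Return everything from the zero-based byte offset to the end of the file.

    These are the bytes that `tail -c +<offset + 1> path` would print.
    """
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read()

def show_image_at(path: str, offset: int) -> None:
    """Pipe the bytes from offset onward to ImageMagick's display."""
    subprocess.run(["display", "-"], input=read_from_offset(path, offset), check=False)
```

For the example above, show_image_at("proprietary.file", 1000) would correspond to the tail command with +1001.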
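The following script automates this, launching display once for each
candidate offset:
#!/usr/bin/python
from string import find
from sys import argv
from os import system

#signature keyword and its offset from the true start of the file
headers = [("GIF8",0), ("PNG",1), ("JFIF",6)]

filepath = "proprietary.file"
if len(argv)>1: filepath = argv[1]

fh = open(filepath )
dat = fh.read()
fh.close()

for kw,off in headers:
    #start at -1 so that a match at byte 0 is not skipped
    x = -1
    while 1:
        x = find(dat,kw,x+1)
        if x<0: break
        #tail counts bytes from 1, hence the +1
        system("tail -c +%d %s | display -" % (x - off + 1, filepath))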
ImageMagick throws away any excess data fed to it after reading to
the end of the image segment. If we want to separate the image data
completely for storage as individual files, we also need to find the
end of each image. One way to do this is to use a modified binary chop
algorithm.
Listing 3
#!/usr/bin/python
from string import find
from sys import argv
from commands import getstatusoutput

#keyword, offset from true start, converter used as a validity test, extension
headers = [("GIF8",0,"giftopnm","gif"), ("PNG",1,"pngtopnm","png"),
           ("JFIF",6,"djpeg","jpg")]

filepath = "proprietary.file"
if len(argv)>1: filepath = argv[1]

fh = open(filepath )
dat = fh.read()
fh.close()

inum = 0
for kw,off,conv,ext in headers:
    x = -1
    while 1:
        x = find(dat,kw,x+1)
        if x<0: break
        beg = x - off
        #possible image located -- find end by binary chop
        #upper bound is everything from the image start to end of file
        s1 = len(dat) - beg
        s0 = 1
        sz = s1
        while s0<s1:
            (stat,output) = getstatusoutput("tail -c +%d %s | head -c %d | %s >/dev/null" % (beg + 1, filepath, sz, conv))
            if stat:
                #failed -- possibly too small
                if sz == s1:
                    #failed -- probably invalid data
                    print "failed... no image here"
                    break
                elif sz == s0:
                    #we've found the length -- write out image
                    imgname = "image%03d.%s" % (inum, ext)
                    print "writing",imgname
                    fh = open( imgname, "wb")
                    fh.write(dat[beg :beg+s1])
                    fh.close()
                    inum = inum + 1
                    break
                s0 = sz
            else:
                #might be too big -- try smaller
                s1 = sz
            sz = int((s0+s1)/2)
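The binary chop in Listing 3 is easier to follow when separated from the shell plumbing. The Python 3 sketch below is one way to express it, not the listing's exact logic: the names find_image_length and decodes_with are invented here, and the search assumes that once a chunk is long enough to decode, adding more bytes never makes it fail.

```python
import subprocess

def find_image_length(data: bytes, beg: int, decodes_ok) -> int:
    """Return the smallest n for which decodes_ok(data[beg:beg+n]) is true, or 0 if none.

    Assumes decodes_ok is false for an empty chunk and stays true once it becomes true.
    """
    hi = len(data) - beg
    if hi <= 0 or not decodes_ok(data[beg:beg + hi]):
        return 0  # even the whole remainder fails: probably not an image
    lo = 0  # invariant: a chunk of length lo fails, a chunk of length hi succeeds
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if decodes_ok(data[beg:beg + mid]):
            hi = mid  # long enough, try shorter
        else:
            lo = mid  # too short, try longer
    return hi

def decodes_with(converter: str):
    """Build a test that feeds a chunk to a converter such as djpeg and checks its exit status."""
    def decodes_ok(chunk: bytes) -> bool:
        result = subprocess.run([converter], input=chunk,
                                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0
    return decodes_ok
```

A call such as find_image_length(dat, beg, decodes_with("djpeg")) would then play the role of the inner while loop, provided the converter reports truncated input through a non-zero exit status as Listing 3 expects.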
This article has shown how to write scripts that extract data objects, encoded using platform-independent open standards, from within proprietary files. It should be a simple task to extend these scripts for handling other image formats and even other types of data objects, such as sound and music files. Note that there are many file formats that frustrate the techniques described here via a layer of simple encryption and/or obfuscation.
Even if one has access to the appropriate proprietary application for reading a particular email attachment, the scripts outlined above can be useful for avoiding any possible macro viruses or security exploits specific to that application.
And finally, a word of warning. Some countries have vaguely worded laws that can be interpreted in such a way that these scripts may be considered illegal copyright circumvention devices. This may or may not be relevant to you depending on the country where you reside. As is always the case when mixing open and closed source systems, your mileage may vary.
[Editor's note: The Python Imaging Library (PIL) provides a way to work with images from within a larger program. You can open an image and read its type and dimensions, transform it, create thumbnails, etc. -Iron.]
Adrian J Chung