Wednesday, 31 July 2013

Lines of the dead

Your humble zombie has been casting around for another topic to blog about.  I found a number of things but they didn't warrant a single blog post.  Therefore this post will cover several single line commands that are quite simple but powerful at the same time.   Hopefully they will illustrate the power of the GNU/Linux CLI, and maybe prove useful to you in your work.

Recovering a single record from the MFT is easily do-able, but why would you want to, after all the istat command in the sleuthkit can parse any MFT records (and in the most useful and understandable way amongst all the forensic tools that I have looked at)?   Well, MFT parsing tools generally report on the data in the "allocated" area of the MFT record, they don't report the residual data in the record.   The residual data could contain information useful to your investigation.   It is important to understand that when the data is written to an MFT record for a NEW file, it zeros out the data after the end of record marker.   Therefore, any residual data that you find in an MFT record MUST have once been part of the allocated area of the record.    So, how did I extract the records?   A simple one-line command for each record (I am using an arbitrary record number in this example) like this:
icat -o 2048 bad-guy.E01 0 | dd bs=1024 count=1 skip=9846 > MFT9846.dat

The first part before the pipe simply outputs the contents of record 0 on the partition starting at sector 2048 on my disk image.   Record 0 on an NTFS file system is the MFT itself, so the entire MFT is output.  We then pipe the output to the dd command, specifying a block size of 1024 bytes (the size of each record), and skipping a number of records until we get to the one we want.  The output is redirected to a file - job done!
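If you want several records, it can be quicker to export the MFT once and slice records out of the exported file.   What follows is only a sketch of that idea: the function name extract_mft_record and the file name mft.raw are my own placeholders, and it assumes 1024-byte records as in the one-liner above.

```shell
# Sketch: pull one fixed-size record out of an MFT that has already been
# exported to a file. extract_mft_record and mft.raw are hypothetical names.
extract_mft_record () {
    mft_file="$1"     # exported MFT (output of icat ... 0)
    record_no="$2"    # record number to extract
    out_file="$3"     # where to write the 1024-byte record
    dd if="$mft_file" of="$out_file" bs=1024 skip="$record_no" count=1 2>/dev/null
}

# Example usage (commented out, as it needs the sleuthkit and a disk image):
# icat -o 2048 bad-guy.E01 0 > mft.raw
# extract_mft_record mft.raw 9846 MFT9846.dat
```

A record extracted this way should start with the ASCII signature FILE when viewed in a hex viewer, and any residual data sits after the end of record marker.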

Here is a way to identify files that contain more than an arbitrary number of instances of a particular string.   An example of how this might be useful:
In several examinations I have found web-pages in the web cache that contain chat records, these are generally from websites that use flash based chat applications.   I noted that the chat messages were generally preceded by the string "said:".   Thus you would see the username followed by the string "said: " followed by the message.   I therefore automate the process of searching for web pages that contain SEVERAL instances of the string "said: ".    Obviously if we looked for web pages that contained a single instance of the string "said: " we are going to get a number of false positives, as that string might appear in a web based story, or very frequently a news story quoting a source.
So, we could find all the web-pages in the mounted file system, remembering that if it is a Windows box being investigated there is a likelihood that there will be white space in file names and file paths so we need to set our Internal Field Separator environmental variable to new lines only, like this:
IFS='
'
That's IFS=' followed by a press of the return key followed by a single ' character.   Theoretically you should be able to type: IFS='\n' but it doesn't seem to work for me, thus I explicitly use the return key.
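The reason IFS='\n' fails is that single quotes keep the backslash and the n as two literal characters.   In bash (this is an assumption about your shell, it won't work in plain sh) the $'...' quoting form does interpret the escape.   A small sketch with two made-up file paths shows the effect:

```shell
# Assumes bash; $'\n' is expanded to a real newline character.
OLDIFS="$IFS"
IFS=$'\n'
count=0
# Two example paths containing spaces (invented for illustration)
for f in $(printf 'My Documents/index one.html\nTemp Files/chat log.htm\n'); do
    count=$((count + 1))
done
IFS="$OLDIFS"
echo "$count"
```

With the separator set to new lines only, the loop sees each path as one item, spaces and all.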

find /disk/image -type f -exec file {} \; | awk -F: '$2 ~ /HTML document/ {print $1}' > /tmp/livewebs.txt

This command conducts a file signature check on each file in the live file system and looks at the description field in the output for the string "HTML document".   If there is a match, the file path and file name are sent to a file in my /tmp directory.

I can then process my list of files like this:

cat /tmp/livewebs.txt | while read -r LWC ; do egrep -H -c -e " said:" -e " says:" -e "msgcontent" "$LWC" | awk -F :  '$2 > 2 { print $1 }' ; done > /tmp/livewebchat.txt

The above command reads my list of files, line by line, and searches each one for 3 regular expressions ( said:, says: and msgcontent).   The -H option for egrep reports the file path and file name, the -c option reports the number of matching lines for each file.   So, a list of file paths/names is generated, each followed by a colon, followed by the number of matching lines for that file.   The result is piped to awk, the field separator is set to a colon (giving us 2 fields).  The second field is checked to see if the number of matches is greater than 2; if it is, the first field (the file name/path) is sent to a new file in the /tmp directory.   I could then use that new file to copy out all of the files:

cat /tmp/livewebchat.txt | while read -r i ; do cp "$i" /home/forensicotd/cases/badguy/webchat ; done

Obviously I would need to do that again with any deleted files I've recovered with the tsk_recover command or any unallocated web pages I have recovered with photorec.   I would need to do something similar with gzipped files as well, as they could be compressed web pages - so use zgrep instead of egrep.   Remember that simply copying files out is a bad idea - you need to use the FUNCTION I posted to ensure that files don't get over-written in the event of a file name collision.
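For the gzipped files, a minimal sketch of the zgrep variant might look like this.   The function name count_chat_lines is made up for illustration, and I have folded the three patterns into one extended regular expression rather than using several -e options:

```shell
# Sketch: count the lines in a gzip-compressed file that match any of the
# chat markers. count_chat_lines is a hypothetical helper name.
count_chat_lines () {
    zgrep -c -E ' said:| says:|msgcontent' "$1"
}

# Example usage (the path is a placeholder):
# count_chat_lines /disk/image/some/cached/page.gz
```

The count can then be compared against the same threshold of 2 that was used for the uncompressed pages.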

Another pain in dealing with web-pages is actually viewing them.  You REALLY shouldn't be hooked up to the 'net whilst you are doing forensics, if you are, then you are doing it WRONG!  But browsers often struggle to load raw web-pages when they are trying to fetch remote content referenced in the html of your web page.    So, one thing we could do with the web pages listed in our livewebchat.txt file is remove all the html tags, effectively converting them to text files that we can view easily in a text editor, or the terminal.   We can do that with the html2text program.  Even better, we can output the full path and filename at the top of our converted webpage so that if we find anything interesting we know the exact file path that the file came from.  Here is a function that I wrote to convert the web-page to text, output the file path/name, check if the output file name already exists and adjust the file name to prevent it over-writing a file with an identical file name:

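suswebs () {
filepath="$1"
filename=`basename "$1"`
msg=`html2text "$1"`
# bump a [n] counter in the name until it no longer clashes with an existing file
while [ -e "${WDIR}/${filename}.txt" ]; do
        filename=`echo "$filename" | perl -pe 's/(\[(\d+)\])?(\..*)?$/"[".(1+$2)."]$3"/e;'`
done
# full source path on the first line, converted text after it
printf '%s\n%s\n' "${filepath}" "${msg}" > "${WDIR}/${filename}.txt"
}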

To use the function you need to set the WDIR variable to the location where you want your processed web-pages to go to, then feed it the list of files, so something like this:
cat /tmp/livewebchat.txt | while read -r l ; do suswebs "$l" ; done

Obviously you can add all of these commands to your linux preview disk.

Monday, 1 July 2013

Maintaining the dead

In this post I will walk you through working with the new overlayfs feature in recent Linux kernels, and how to use that to make changes to a Linux Live distro truly permanent.

For most of my forensic previewing I use the CAINE forensic GNU/Linux distro.  You can fully customise the distribution to fit your needs.  Most of the stuff I blog about I have added to my distro to automate the previewing of suspect devices.

The CAINE guys recently released a new version, called PULSAR.  They have necessarily had to make some changes due to how Ubuntu is now packaged (CAINE is based on the Ubuntu distro).  Previously, you could download an .iso to burn to CD and a DD image to copy to a thumbdrive.   One of the major changes is that the version designed to be installed to a USB device (or indeed a hard disk on your workstation) now comes as an .iso.   I did some major surgery to the previous versions then used the remastersys scripts bundled in CAINE to generate my own customised .iso.   Obviously, it isn't that simple any more, so I thought I would walk you through making persistent changes to the latest CAINE distro.

The basic steps to customising CAINE are:
Download NBCAINE 4.0
Install to thumbdrive
Make your changes to the distro
Merge your changes into the system
Create your new .iso

The first thing you need to do is download NBCAINE 4.0 from the CAINE website.  You then need to use a live-disk installer to install the .iso to your thumbdrive.  I use UNETBOOTIN for this, but there are other installers out there for you to experiment with.  You will also need a clean USB stick, formatted as FAT32.  So all you need to do now is plug in your thumbdrive, launch unetbootin and configure it like this:

You need to make sure you have selected Ubuntu 12.04 Live as your Distribution.  Then select the path to your nbcaine .iso image.   You then need to configure how much space you require to preserve files across reboots.  I make lots of changes so I select 2000 MB - you will need an 8 GB thumbdrive if you go this big.  I then select the partition on my thumb drive that the distro is going to be installed to.
Once you click on "OK" the installation will start. Once complete, you can reboot, select your thumb drive as your boot device to boot into the CAINE environment.

One of the big differences you will find in CAINE 4.0 is that the entire structure is radically different from version 2.0.  Previously, when using the dd image of caine, you would have found the expected linux directory tree in the root of the partition, directories such as bin, boot,  home, var, usr, etc.  That directory structure does still exist, however it is contained inside a squashfs file system, that is decompressed at boot.   The important files are contained within the "casper" directory in the root of your .iso.  We should have a look at the root directory to see what's there - this will help us understand how the CAINE system works:

-rw-r--r--  1 fotd fotd  176 2013-03-15 17:47 autorun.inf
drwx------  2 fotd fotd 4.0K 2013-06-28 16:07 casper
-rw-r--r--  1 fotd fotd  1.9G 2013-06-28 16:49 casper-rw
drwx------  2 fotd fotd 4.0K 2013-06-26 13:31 .disk
drwx------  4 fotd fotd 4.0K 2013-06-26 13:31 FtkImager
drwx------  2 fotd fotd 4.0K 2013-06-26 13:34 install
drwx------  2 fotd fotd 4.0K 2013-06-26 13:34 isolinux
drwx------  2 fotd fotd 4.0K 2013-06-26 13:34 lang
-r--r--r--  1 fotd fotd  32K 2013-06-26 13:35 ldlinux.sys
-rw-r--r--  1 fotd fotd  52K 2013-03-16 16:38 md5sum.txt
-rw-r--r--  1 fotd fotd  55K 2013-06-26 13:35 menu.c32
-rw-r--r--  1 fotd fotd 1.4K 2013-03-15 17:56 NirLauncher.cfg
-rwxr-xr-x  1 fotd fotd 102K 2013-03-15 17:47 NirLauncher.exe
drwx------  2 fotd fotd  36K 2013-06-26 13:31 NirSoft
drwx------  2 fotd fotd 4.0K 2013-06-26 13:35 piriform
drwx------  2 fotd fotd 4.0K 2013-06-26 13:35 preseed
-rw-r--r--  1 fotd fotd  192 2013-03-16 15:51 README.diskdefines
drwx------  2 fotd fotd 8.0K 2013-06-26 13:35 sysinternals
-rw-r--r--  1 fotd fotd 1.7K 2013-06-26 13:35 syslinux.cfg
-rw-r--r--  1 fotd fotd  25K 2013-06-26 13:35 ubnfilel.txt
-rw-r--r--  1 fotd fotd  20M 2013-03-16 15:53 ubninit
-rw-r--r--  1 fotd fotd 4.7M 2013-03-16 15:53 ubnkern
-rw-r--r--  1 fotd fotd 1.6K 2013-06-26 13:31 ubnpathl.txt
-rw-r--r--  1 fotd fotd    0 2013-03-16 16:37 ubuntu
drwx------  7 fotd fotd 4.0K 2013-06-26 13:35 utilities

Many of the above files are for the live incident response side of things.  The structures that we are interested in are the casper-rw file and the casper directory.  Remember when we used unetbootin and were asked how much space we wanted to reserve for persistent changes?  Whatever value you enter, the casper-rw file is created with a size corresponding to that value.  In our case we specified 2000 MB, which is about 1.9 GB, hence our casper-rw file is 1.9 GB.   It is this file that captures and stores any changes we make to the CAINE system.   Let's have a look inside the "casper" directory:

-rw-r--r--  1 fotd fotd  62K 2013-06-28 16:39 filesystem.manifest
-rw-r--r--  1 fotd fotd  62K 2013-06-28 16:39 filesystem.manifest-desktop
-rw-r--r--  1 fotd fotd   10 2013-06-28 16:39 filesystem.size
-rw-r--r--  1 fotd fotd 1.5G 2013-06-28 16:41 filesystem.squashfs
-rw-r--r--  1 fotd fotd  20M 2013-03-16 15:53 initrd.gz
-rw-r--r--  1 fotd fotd  192 2013-03-16 15:51 README.diskdefines
-rw-r--r--  1 fotd fotd 4.7M 2013-03-16 15:53 vmlinuz

The main file of interest here is the filesystem.squashfs.   This file contains the compressed Linux file system, it gets decompressed and mounted read-only at boot.   The previously described casper-rw is an ext2 file system that is mounted read-write at boot - it retains all the changes that are made.  Using the new overlayfs system, the casper-rw overlays the decompressed filesystem.squashfs, but it is interpreted by the O/S as a single unified file system.  We should bear in mind that the files "filesystem.size", "filesystem.manifest" (which lists all the installed packages) and "filesystem.manifest-desktop" have to accurately reflect various properties of the filesystem.squashfs file.  Thus, if we add programs to our installation we have got to update all of the "filesystem.*" files.

Using overlayfs is useful for retaining changes over reboots, however it does mean that the system is somewhat bloated.   What we really need to do is merge the changes in the casper-rw file system down into the compressed filesystem.squashfs, then we don't need our large 2000MB casper-rw file any more (or at least can use a much smaller file once we have made our major changes).

So, I have figured out how to do this...and it isn't actually that difficult.  All we need is to make sure we have the squashfs tools installed and a recent Linux kernel that supports overlayfs.   To check if your kernel supports overlayfs, simply have a look at your /proc/filesystems file and see if it is listed.  If not you will need to install the overlayroot package, which is hopefully in your package manager.      So let's go through this step-by-step. We've created our bootable thumb drive, booted into that environment, and made all the changes that we want to make.  So shut down and reboot into your Linux environment, plug in your CAINE thumbdrive and mount the partition in read write mode.
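The /proc/filesystems check can be scripted along these lines.   This is just a sketch: has_overlay is a made-up helper name, and the optional argument is only there so the check can be pointed at a copy of the file.

```shell
# Sketch: report whether an overlay file system type is registered.
# has_overlay is a hypothetical helper name.
has_overlay () {
    fs_list="${1:-/proc/filesystems}"
    # Ubuntu kernels of this era list "overlayfs"; later mainline kernels list "overlay"
    grep -qwE 'overlay(fs)?' "$fs_list" 2>/dev/null
}

if has_overlay; then
    echo "overlay support is listed"
else
    echo "overlay is not listed (the module may simply not be loaded yet)"
fi
```

Bear in mind that /proc/filesystems only shows file systems that are currently registered, so a missing entry does not always mean the kernel cannot support it.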

Change into a working directory in your linux distro.  We are going to create two mount points, mount our filesystem.squashfs and casper-rw file systems, then overlay one over the other with the overlayfs before creating a new filesystem.squashfs.   So, let's assume my thumb drive is mounted at /media/usb, now we create our mount points in our working directory:
mkdir caine-ro
mkdir caine-rw

Now we mount our casper-rw file in our caine-rw mount point:
sudo mount -o loop /media/usb/casper-rw caine-rw
We mount our filesystem.squashfs in our caine-ro mount point:
sudo mount -o loop /media/usb/casper/filesystem.squashfs caine-ro

Now we need to overlay one file system over the other using the overlayfs system.  Conceptually, overlayfs has an "upperdir" which takes precedence and a "lowerdir".  Obviously the "upperdir" gets overlaid on top of the "lowerdir", in our scenario the caine-rw is our "upperdir" and caine-ro is our "lowerdir".

Now we can do our file system overlaying:
sudo mount -t overlayfs caine-ro -o lowerdir=caine-ro,upperdir=caine-rw caine-rw
We now have our two file systems overlaid with each other, we can now make a new filesystem.squashfs to replace the one on our thumb drive...but at which mount point is our unified filesystem from which we need to create our new filesystem.squashfs?  Remember it is the "upperdir" that takes precedence, which in our case is caine-rw.  So we need to create our new filesystem.squashfs from that mountpoint, we can do that like this:
sudo mksquashfs caine-rw/ filesystem.squashfs
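A note for anyone on a later kernel: the mainline overlay driver is registered as "overlay" rather than "overlayfs", and it can stack several read-only layers by listing them in lowerdir separated by colons, topmost layer first.   The sketch below only builds that option string; overlay_ro_opts and the caine-merged mount point are names I have made up, and I have not verified how a later kernel treats deletions recorded in a casper-rw written by the older overlayfs, so treat it as an assumption to test rather than a recipe.

```shell
# Sketch: build the -o string for a read-only, multi-layer overlay mount.
# overlay_ro_opts is a hypothetical helper; the first argument is the top layer.
overlay_ro_opts () {
    printf 'lowerdir=%s:%s' "$1" "$2"
}

# Example (needs root, so left commented out; caine-merged is an assumed mount point):
# mkdir caine-merged
# sudo mount -t overlay overlay -o "$(overlay_ro_opts caine-rw caine-ro)" caine-merged
# sudo mksquashfs caine-merged/ filesystem.squashfs
```

Because nothing is written to the merged view, a read-only overlay is enough for feeding mksquashfs.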

We now have our replacement filesystem.squashfs that contains all our changes.
We now need to update the rest of the "filesystem.*" files to reflect changes in our filesystem.squashfs.  Let's update our filesystem.manifest:
sudo chroot caine-rw/ dpkg-query -W --showformat='${Package} ${Version}\n' > filesystem.manifest

Now we can update our filesystem.manifest-desktop with this command:
cp filesystem.manifest filesystem.manifest-desktop

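Unmount our mounted filesystems.  The overlay is stacked on top of the casper-rw mount at caine-rw, so caine-rw has to be unmounted twice before caine-ro can be released:
sudo umount caine-rw
sudo umount caine-rw
sudo umount caine-ro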

Finally, we can update the filesystem.size file - we'll need to mount our NEW filesystem.squashfs to do this:
sudo mount -o loop filesystem.squashfs caine-ro
printf $(sudo du -sx --block-size=1 caine-ro | cut -f1) > filesystem.size

At this point we have everything we need in our working directory, we just need to copy all of the "filesystem.*" files to our mounted thumb drive containing our CAINE system.  Remember they need to go into the "casper" directory:

sudo cp filesystem.* /media/usb/casper/

Let's unmount our filesystem.squashfs:
sudo umount caine-ro

To finish up we need to go and work on our mounted thumbdrive.

We can make a new (much smaller) casper-rw to hold any minor future changes we want to make.   We need to delete the existing casper-rw, create a new one, then format it - we'll use ext2 in our example:

sudo rm /media/usb/casper-rw
sudo dd if=/dev/zero of=/media/usb/casper-rw bs=1M count=300
sudo mkfs.ext2 -F /media/usb/casper-rw

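Finally we need to make a new md5sum.txt file to reflect our changes (leaving md5sum.txt itself out of the list, as it is still being written while the command runs):
cd /media/usb
sudo find . -type f ! -name md5sum.txt -exec md5sum {} \; > md5sum.txt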

All done!  Remember to unmount all your various mounted filesystems relating to your work with the CAINE installation.

Sunday, 19 May 2013

Unwinding the dead

Your humble zombie forensicator has been quiet lately.  This was partly due to the prolonged bitterly cold conditions that generally render us un-ambulatory until the thaw.  However I have also been trying to get my python chops in order and have been studying the iTunes backup format.
So, I have written some scripts to help parsing the iTunes backup that I am going to share with you.

I have had a run of cases where significant information has been found in the iTunes backups on computers that I have looked at.  If you weren't aware, owners of iPhone/iPad/iPod mobile devices can hook them up to their computers for backing-up purposes.   The resulting backups are a nuisance to process - all the files have been re-named based on hash values and have no file extension.  However, the backups contain significant amounts of data that can be useful to a forensic examiner.   The main files of note are generally sqlite databases that can contain chat/messaging/email messages.  There are a number of popular chat/messaging/email apps on the iPhone that appear over and over in the backups that I have looked at, they are generally:
Apple sms
google email

Luckily, there is lots of info on the 'net about how to deal with the sqlite databases that are created by the iPhone apps.   You should really start with the utterly essential Linux Sleuthing blog.  There is an essential series of articles on processing sqlite databases.   I suppose the main thing to bear in mind when dealing with sqlite databases is NOT to rely on automated tools.  By all means, use them to get an overview of the type of data available, but the most effective approach is to process them manually.   Understand the table and schema layouts and execute your own queries to extract data across tables.  Often automated tools will just dump all the data for each table sequentially.  It is only by doing cross table queries that you can marry up phone numbers to the owners of those phone numbers or screen names to unique user names.  There is no way I can improve on the info available in the Linux Sleuthing blog so please visit him to get the skinny on sqlite queries.

The main problem with processing iTunes backups is that there are so many apps that may be used for messaging/chat/email that it is nearly impossible to keep up with what ones are out there, what format they store their data in and where it can be found in the backup.   The first step for me in examining an iTunes backup is to try and establish what kind of device it is that has been backed up.  This is fairly simple, you just need to view the contents of the Info.plist file which is XML so it can be viewed in any text viewer.  A plist is essentially a series of key-value pairs, so you just need to find the "Product Type" key and look at the corresponding value.   If you can't find the Info.plist then look for the Manifest.plist.   This is in more recent backups and is a binary plist file.   Just download the plutil tool and generate the Info.plist like this:
plutil -i Manifest.plist -o Info.plist
You now have a text version of the Manifest.plist which you can examine in a text editor.
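Finding the device type in the XML can also be done from the terminal.   The following is only a sketch: product_type is a made-up helper name, and it assumes the value sits on the line directly after the key, which is how a pretty-printed XML plist is normally laid out.

```shell
# Sketch: print the value that follows the "Product Type" key in an XML plist.
# product_type is a hypothetical helper name.
product_type () {
    grep -A1 '<key>Product Type</key>' "$1" |
        sed -n 's:.*<string>\(.*\)</string>.*:\1:p'
}

# Example usage:
# product_type Info.plist
```

If the plist is laid out differently (for example all on one line) this will print nothing, in which case fall back to reading the file in a text viewer.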

Next thing to do is recreate the original directory structure of the backup and restore the files to their original locations and file names.
The whole point of these backups is that the iTunes software can restore it to your original device in the case that your device encounters a problem.  So how does iTunes know the original directory structure and file names of the files in the backup?   When the initial backup is performed a file called Manifest.mbdb is created.  This is a binary database file that maps the original path and file names to their new hash-based file name in the backup.   So, all we need to do is create a text version of the Manifest.mbdb, so we can read it and understand the original structure of the backup.   We can generate our text version of the Manifest database with THIS python script.
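As an aside on those hash-based names: going by the publicly documented layout of Manifest.mbdb-era backups (this comes from outside this post, so treat it as an assumption for any other backup version), the file name is the SHA-1 of the domain, a hyphen and the relative path.   A quick sketch, with backup_name as a made-up helper name:

```shell
# Sketch: compute the name a file is given inside an mbdb-era iTunes backup.
# backup_name is a hypothetical helper; $1 = domain, $2 = path within the domain.
backup_name () {
    printf '%s-%s' "$1" "$2" | sha1sum | cut -d' ' -f1
}

# The SMS database lives at Library/SMS/sms.db in the HomeDomain
backup_name HomeDomain Library/SMS/sms.db
```

For the SMS database this should give 3d0d7e5fb2ce288813306e4d4636395e047a3d28, a file name that will be familiar if you have browsed these backups by hand.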

Then it is a case of recreating the original directory structure and file names - more on this later.

Once we have the original directory/file structure we can browse it in our file browser of choice.   It is important to remember that not all chat is saved in sqlite databases.   Some services such as ICQ save the chat in text based log files.  To complicate matters there is one format for the ICQ-free chat service and another format for the ICQ-paid chat service.   The ICQ-free chat logs appear to be in a pseudo-JSON type format and not at all easily readable.   I say pseudo-JSON type because my python interpreter could not recognise it as JSON (using the json module) and even online JSON validators stated that the data was not valid JSON.   The ICQ-paid logs are a bit more user friendly and very interesting to forensicators.  They appear to be a record of everything that happens within the ICQ environment once the ICQ app is launched.  This means that the chat messages are buried a long way down in the log file and are not straightforward to read.   No matter, both ICQ formats were amenable to scripting via a couple of python scripts that I wrote.

As a salutary example of not relying on automated tools, I processed an iTunes backup of an iPhone in a commercial forensic tool (mainly used for processing mobile phones).  In the "chat" tab of the output of the commercial tool, it listed 140 ICQ-Free chat messages and 0 ICQ-Paid chat messages.  Using my technique on the same backup I recovered 3182 ICQ-Free chat messages and 187 ICQ-Paid chat messages.

Having warned against relying on automated tools, I have produced an err....automated tool for processing the iTunes backups in the way I have described.  It comprises one bash script + 2 python scripts.  What it does is "unbackup" the backup and execute sqlite queries on some of the more well known sqlite databases.  However, it also copies out any processed sqlite databases and any unprocessed databases so that you can manually interrogate them.

So, I'll let you have the scripts if you promise to manually process the sqlite databases that are recovered (and visit the Linux Sleuthing blog)!   BTW, if you do find any interesting databases, if you could let me know the path, file name and sqlite query, I can add the functionality to my script.

My project is hosted at google code, HERE

I have, of course, added the scripts to my previewing home brew, so now any iTunes backups should be automatically detected and processed.

Thursday, 10 January 2013

Resurrecting the dead

Unlike zombies, deleted files will not miraculously return to life on their own, we need to either undelete them or carve them if there is no file system meta data to help us.   So I want to blog about file carving, this will be over two posts, the first post will deal with theory, the second post will look at a couple of tools that I use.

In theory, file carving is straight forward enough - just look for a file header and extract the file.   The practice is a lot more complicated.
Let's consider carving files out of unallocated space.   The first consideration is "what is unallocated space?".   Imagine a deleted file in an NTFS file system, where the file data has not been overwritten.   Both the file data and the meta data are intact, the MFT entry has the flag set to indicate that the file is deleted, thus the clusters are available for allocation by another file.
Do you consider this file to be in unallocated space?   Some people say yes, as the relevant clusters are not allocated to a live file, some say no as the relevant clusters ARE allocated, albeit to a deleted file.  In many ways the question is academic, it doesn't matter what you consider to be unallocated space, it matters what your file carving tool considers to be unallocated space.  If you don't know what your tool considers to be unallocated space then how do you know if you have recovered all of the potentially recoverable files?

Another consideration is what strategy you are going to use.   File carving tools have different approaches to the problem of carving.   One approach is to search the entire data-stream, byte-by-byte, looking for file signatures.   This is the most thorough approach, however it is the most time consuming approach and will potentially lead to a number of false positives.   Some file signatures are maybe only 2 bytes in length, by pure chance we can expect those 2 bytes to appear on a hard disk a number of times.  Those 2 bytes may or may not represent the file header that you are interested in.   Figuring out if they are relevant headers or false positives can be quite challenging.

One way to reduce the number of false positives is to search for file signatures at the block (or cluster) level.   As the file signature is normally at the start of a file, we only need to look at the cluster boundary - as that is where the start of files will be.   Any file signatures found here are unlikely to be false positives, what's more our carving routines will be a lot quicker.   The downside to this is that valid files may get missed, especially if there is a new file system overlaying an old file system.   The cluster boundary for the OLD file system may not fall at the cluster boundary for the NEW file system.   Imagine a 500 GB hard drive with a single partition filling the disk, when formatted the block size may be 16 sectors.  If a user then shrinks that partition to 400GB and creates a new partition in the remaining 100GB, the block size might be set at 8 sectors.   You would need your carving tool to search for headers at the 16 sector boundary for the first partition, and at the 8 sector boundary for the second partition.   Maybe a better solution would be to search for signatures at the sector boundary?   This would ensure that all block (cluster) boundaries were searched but increase both the time taken and the risk of finding false positives.   Searching at the sector boundary means that there is also a possibility of files embedded in other files not being found if they are not saved to disk at the sector boundaries (not sure if this is possible, I have never tested it).
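To make the boundary idea concrete, here is a rough sketch that checks only the first bytes of each sector (or cluster, if you pass a bigger size) for a JPEG-style header.   It is not how any particular carving tool works; find_jpeg_sectors is a made-up name, the ff d8 ff signature is just one example, and it relies on the -w option of GNU od.

```shell
# Sketch: list the sector numbers whose first three bytes are ff d8 ff.
# find_jpeg_sectors is a hypothetical helper name.
find_jpeg_sectors () {
    img="$1"            # raw image or exported unallocated space
    ssize="${2:-512}"   # bytes per sector (or per cluster)
    # od prints one line of hex bytes per sector; -v stops it collapsing repeats
    od -An -v -tx1 -w"$ssize" "$img" |
        awk '$1 == "ff" && $2 == "d8" && $3 == "ff" { print NR - 1 }'
}

# Example usage (unalloc.dd is a placeholder file name):
# find_jpeg_sectors unalloc.dd 512     # sector boundaries
# find_jpeg_sectors unalloc.dd 4096    # 8-sector cluster boundaries
```

A header that starts anywhere other than the first byte of a sector is deliberately ignored, which is exactly the trade-off described above.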

Once you have decided your strategy, the problems don't end there.   From a programmer's point of view, how do you define when your tool stops carving and resumes searching for file headers?   This is probably the biggest problem facing programmers of carving tools.   Some files have footers, so you could program your carving tool to just keep going until it gets to the file footers.   But what happens if the footer is overwritten or is part of a fragmented file...your tool will just keep carving data out until it reaches the end of the disk or eventually finds a footer many, many clusters further into the disk.   There are different potential solutions to this problem, one is to set a maximum file size so that your tool stops carving at a certain point, even if no footers are found.   Another solution is to stop carving once your tool finds another header.   The problem here is deciding what file type header should be your stop point.  If you are carving for jpgs, do you keep carving until you find another jpg header or any type of header?   If your carving engine does byte-by-byte carving, then if you are using "any known file signature" as your stop point you risk ending the carving prematurely if your tool finds a "false positive" header.  You can combine the approaches as Jesse Kornblum did when coding the "foremost" file carver - that is to say, once you start carving, carve until max file size or footer found.   In fact there are now quite a few different approaches to the problems posed by file carving, a good overview can be found in this PRESENTATION.
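The maximum file size safeguard is easy to picture in the shell.   This sketch covers only that half of the problem (it does not look for footers), and carve_max with its arguments is invented for illustration:

```shell
# Sketch: copy at most max_bytes out of an image, starting at a byte offset.
# carve_max is a hypothetical helper name.
carve_max () {
    img="$1"; offset="$2"; max_bytes="$3"; out="$4"
    # tail -c +N starts at byte N counting from 1, hence offset + 1
    tail -c +"$((offset + 1))" "$img" | head -c "$max_bytes" > "$out"
}

# Example usage: carve up to 10 MB starting at sector 4096 of a placeholder image
# carve_max unalloc.dd $((4096 * 512)) $((10 * 1024 * 1024)) carved.jpg
```

If fewer than max_bytes remain in the image, you simply get whatever is left.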

Ultimately, once you understand how your file carving tool works, there is no "right way" or "wrong way" to do file carving.  The file signature searching engine in Encase is very thorough, however it uses a "byte-by-byte" strategy meaning that there are many false positives and it doesn't really do file carving as it doesn't export the found files. My own preferences depend on what I am looking for, generally for unallocated space I will carve at the sector or cluster boundary, for swap and hiberfil files I do byte-by-byte carving.   I will do a step by step post in the next few days on a couple of the file carving tools that I use routinely.  One of them, photorec, is another one of the tools that I use on just about every case I can think of.

Wednesday, 2 January 2013

Traces of the dead

I have blogged about bulk_extractor on several occasions.   As it is such an essential and useful tool for forensicators, I thought I would do a post dedicated to the tool.

Bulk_extractor is a tool that scans a disk, disk image or directory of files for potentially evidential data.   It has a number of scanners that search for various artifacts such as urls, email addresses, credit card numbers, telephone numbers and json strings.   The recovery of json strings is particularly useful as a number of chat clients will generate chat messages in json format.

The url recovery is another extremely useful feature.  We probably have our favourite web history parsers and web history recovery tools.   However recovering all the available web browser history on a disk is incredibly difficult.  Once again you really need to know what your tool of choice is doing.   So, if you have a tool that claims to recover Internet Explorer history do you understand how it works?  Maybe it is looking for the index.dat file signature?  This is useful...up to a point.   It will work fine for index.dat files stored in contiguous clusters, but what happens if the index.dat file is fragmented?  Under these circumstances your tool may only recover the first chunk of index.dat data, as this chunk contains the index.dat header.   Your tool may therefore miss thousands of entries that reside in the other fragmented chunks.   Most respectable tools for recovering IE index.dat look for consistent and unique features associated with each record in an index.dat file, thus ensuring that as many entries as possible are recovered.  Other web browser history may be stored in an sqlite database, finding the main database header is simple enough, even analysing the sqlite file to establish if it is a web history file is simple.  However, it gets much more difficult if there is only a chunk of the database available in unallocated space.   Some tools are able to recover web history from such fragments in some circumstances - does your tool of choice do this?  Have you tested your assumptions?
Some web history is stored in such a way that there are no consistent and unique record features in the web history file.  Opera web history has a simple structure that doesn't stretch much beyond storing the web page title, the url and a single unix timestamp.   There are no "landmarks" in the file that a programmer can use to recover the individual records if the web history gets fragmented on the disk.   Yahoo browser web history files pose much the same problem.  
bulk_extractor overcomes these problems by simply searching for urls on the disk.  It ingests 16mb chunks of data from your input stream (disk or disk image), extracts all the ascii strings and analyses them to see if there are any strings that have the characteristics of a url, i.e. they start with "http[s]" or "ftp" and have the structure of a domain name.   In this way you can be confident that you have recovered as much web history as possible.   However, there is a big downside here - you will also recover LOTS of urls that aren't part of a web history file.  You will recover urls that are in text files and pdf files, but most likely urls that are hyper-links from raw web pages.   Fortunately, the output from bulk_extractor can help you here.   Bulk_extractor will create a simple text file of the urls that it finds.  Each entry lists the byte offset of the url, the url itself and a context entry.  The context entry shows the url with a number of bytes either side of it - it does what it says on the tin, gives you the context that the url was found in.  I have split a line of output from bulk extractor for ease of viewing.  The first line shows the byte offset of the disk where the url was found, the second line shows the url, the 3rd line shows the url in context as it appears on the disk.
opyright" href="" />\x0A    <title>

As can be seen, the url, when viewed in context,  is preceded by "href=", this indicates that the url is actually a hyperlink from a raw web page.
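As a rough sketch of how the context entry can be put to work, the snippet below separates likely hyperlinks from other hits.  It assumes the url file is tab-separated (offset, url, context) and uses a made-up sample file called sample_url.txt with invented offsets and urls, so treat it as an illustration rather than a finished filter.

```shell
#!/bin/bash
# Build a tiny, invented feature file: offset <TAB> url <TAB> context
printf '1000\thttp://example.com/a\t<a href="http://example.com/a">link\n'  > sample_url.txt
printf '2000\thttp://example.com/b\tURL http://example.com/b Visited\n'     >> sample_url.txt
printf '3000\thttp://example.com/c\tlink href="http://example.com/c" />\n' >> sample_url.txt

# Keep only hits whose context does NOT contain href=, i.e. drop likely hyperlinks
awk -F'\t' '$3 !~ /href=/' sample_url.txt > not_hyperlinks.txt

cat not_hyperlinks.txt
```

Only the hit at the invented offset 2000 survives, as its context carries no "href=".  The same idea can be reversed (match instead of exclude) if it is the hyperlinks you are interested in.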
Bulk_extractor doesn't stop there though.  It will also analyse the recovered url and generate some more useful data.  One of the files that it generates is url_searches.txt - this contains any urls associated with searches.  The file will also show the search terms used and the number of times that the search urls appear on the disk, so a couple of lines of output might look like this:
n=5 firefox
n=4 google+video

You may need to either parse individual records with your favourite web history parser or you may have to do it manually if your favourite web history recovery tool fails to recover the url that bulk_extractor found - this has happened to me on several occasions!

bulk_extractor also creates histograms, these are files that show the urls along with the number of times the urls appear on the  disk, sorted in order of popularity.  Some sample output will look like this:

n=1715 (utf16=1715)
n=300 (utf16=2)
n=292 (utf16=292)
n=228 (utf16=49)

Notice how urls that are multi-byte encoded are also recovered by default.   Obviously there are going to be a LOT of urls that will appear on most hard drives, urls associated with microsoft, mozilla, verisign etc.
You can download some white lists that will suppress those types of urls (and emails and other features recovered by bulk_extractor).   Bulk_extractor's url analysis also discovers facebook id numbers and skydrive id numbers.   Of course, it is trivially simple to write your own scripts to analyse the urls to discover other interesting urls.  I have written some to identify online storage, secure transactions and online banking urls.
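To give a flavour of that kind of script, here is a minimal sketch.  It assumes the histogram lines are in the form n=COUNT, a tab, then the url, and it runs against an invented sample file.  The keyword list is purely hypothetical and would need extending for real casework.

```shell
#!/bin/bash
# Invented histogram lines in the form: n=COUNT <TAB> url
printf 'n=40\thttps://www.dropbox.com/home\n'            > sample_histogram.txt
printf 'n=12\thttp://www.example.com/news\n'             >> sample_histogram.txt
printf 'n=7\thttps://onlinebanking.example.org/login\n'  >> sample_histogram.txt

# Hypothetical keyword list for online storage - extend it to suit the case
grep -Ei 'dropbox|skydrive|drive\.google' sample_histogram.txt > storage_urls.txt

# Secure transactions: any url that starts with https:
awk -F'\t' '$2 ~ /^https:/' sample_histogram.txt > secure_urls.txt

cat storage_urls.txt secure_urls.txt
```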

bulk_extractor will also recover email addresses and creates histograms for them.   The important thing to remember here is that there isn't (currently) any pst archive decompression built into bulk_extractor.   Therefore if your suspect is using outlook email client you will have to process any pst archive structures manually.   Other than that, the email address recovery works in exactly the same way as url recovery.

Using bulk_extractor can be a bit daunting, depending on what you are looking for and searching through.  But to recover urls and emails on a disk image called bad_guy.dd the command is:
bulk_extractor -E email -o bulk bad_guy.dd

By default ALL scanners are turned on (the more scanners enabled, the longer it will take to run).  Using the -E switch disables all the scanners EXCEPT the named scanner, in our case the "email" scanner (this scanner recovers email addresses and urls).  The -o option is followed by the directory name you want to send the results to (which must not currently exist - bulk_extractor will create the directory).

I urge you all to download and experiment with it, there is a Windows gui that will process E01 image files and directory structures.   There are few cases that I can think of when I don't run bulk_extractor, there are a number of occasions where I have recovered crucial urls or json chat fragments missed by other tools.

Monday, 31 December 2012

Words of the dead

So, in most forensic exams we probably search the hard drive for keywords, using a process I sometimes jokingly refer to as "keyword searching".  

Referring to the process as keyword searching is something of a misnomer, as often it isn't words that we are looking for, it could be numbers or even a pattern of characters.   Therefore, I urge you to put this keyword searching silliness behind you and use the more accurate term of "pattern matching".

It has to be said that for the longest time, performing pattern matching over a disk or disk image in linux was soul destroying (for those of you still lucky enough to have a soul).   It was possible to use grep across the disk, but memory exhaustion problems abounded.  It was not possible to match patterns in compressed data using this approach.   Also there were problems with matching patterns that were multi-byte encoded.   Certainly, doing this in a preview on the suspect machine could take a week or more.    The way I approached the problem was to create a database of files in the live set, then use that data to exclude all the binary files then just grep each of the remaining files for my keywords.   However, file cluster slack was not searched with this approach.   Processing unallocated space was a case of outputting each cluster, running strings across it and then searching the raw strings for my patterns - very time consuming!

So, my rotting, black heart leapt for joy when the mighty Simson Garfinkel released the every-forensicator-should-have-it bulk_extractor program.   One of the many features of the program is the ability to search for a pattern or list of patterns on a disk (or disk image).   Bulk_extractor knows nothing about partitions or file systems, it treats the input data as a raw stream.   One of the great things about it is the ability to decompress/decode data on the fly and search that for your patterns.   In addition, BE will perform recursion to a depth of 5 by default (but that can be tweaked by passing some options to it).  This recursion means that if BE finds some compressed or encoded data it will search the decompressed data for more compressed or encoded data and decompress/decode that data on the fly, it will drill down 5 layers by default.   Thus if there is a compressed docx file that has been encoded as a base64 email attachment that has been placed in a zip archive, then BE will be able to drill down through all those layers to get to the "plain text" inside the docx file.   In addition, it will search for your patterns even if they are multi-byte encoded, with UTF-16 for instance.  As BE knows nothing about file systems, the data in unallocated space, swap files etc gets processed.  Does your forensic tool of choice process compressed/encoded data in unallocated space when you are pattern matching?  Have you tested your assumptions????
The only caveat to using BE is that when handling compressed/encoded data, BE is searching for file signatures - it is therefore unlikely that fragmented files will be fully processed, only the first chunk that contains the file header stands a chance of being processed.
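To see why that layered decoding matters, here is a toy demonstration using nothing more than gzip, base64 and grep.  It is not how BE works internally, just an illustration of a phrase that a plain search misses until both layers are peeled away.

```shell
#!/bin/bash
# Wrap a known phrase in two layers: gzip compression, then base64 encoding
layered=$(printf 'dead forensicator\n' | gzip -c | base64)

# A plain search of the encoded text finds nothing...
raw_hits=$(printf '%s\n' "$layered" | grep -c 'dead forensicator' || true)

# ...but removing both layers first exposes the phrase
decoded_hits=$(printf '%s\n' "$layered" | base64 -d | gzip -dc | grep -c 'dead forensicator' || true)

echo "raw hits: $raw_hits, decoded hits: $decoded_hits"
```

A grep across the raw disk is stuck at the first of those two searches, which is exactly the gap that BE's recursion fills.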

The main thing to remember is that your search patterns are CASE SENSITIVE!  You have to take this into account when preparing your list of patterns.   If your list contained "dead forensicator", then the pattern "Dead Forensicator" would NOT be matched (actually not even "dead forensicator" would be matched either - we will come on to that).   You could of course add the pattern "Dead Forensicator" to your list, but then the pattern "DEAD FORENSICATOR" would not be matched.   Luckily BE is able to handle regular expressions, in fact the patterns you are looking for really need to be written as regular expressions.   This means that you will need to "escape" any white space in your patterns, you do this by putting a "\" character in front of any white space - this tells BE to treat the white space as literally white space (it has a special meaning otherwise).  So, to match the pattern "dead forensicator", you will have to write it as "dead\ forensicator".   If you are not familiar with regular expressions (regexes) then you should make every effort to get acquainted with them, using regexes will make you a much more effective forensicator.    You probably can't be sure if the pattern you are searching for will be in upper case or lower case or a combination of both.   Regexes allow you to deal with any combination of upper and lower case.   If we were searching for either the upper or lower case letter "a" we could include in our pattern list both "a" and "A" - however this is not so good when we are searching for multi-byte patterns.   To search for both upper and lower case letter "a", we can use square brackets in a regex, so our pattern would be [aA].   Taking this further, to search for "dead forensicator" in any combination of upper and lower case you would write your regex as:
[dD][eE][aA][dD]\ [fF][oO][rR][eE][nN][sS][iI][cC][aA][tT][oO][rR]
Remember to escape the white space between the two words!
So you can write your list of regexes and put them in a file that you will pass to BE when you run it.
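Typing those bracket expressions by hand gets old quickly, so here is a minimal sketch of a helper that builds them from a plain keyword list.  The function name to_ci_regex and the file names keywords.txt and regex.txt are my own choices for illustration.  Note that it only handles letters and white space - any other regex special characters in your keywords are passed through untouched, so you would still need to escape those yourself.

```shell
#!/bin/bash
# to_ci_regex (hypothetical helper): read plain keywords on stdin, one per line,
# and print a case-insensitive bracket regex for each on stdout
to_ci_regex () {
    awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (tolower(c) != toupper(c))      # a letter: match either case
                out = out "[" tolower(c) toupper(c) "]"
            else if (c == " ")                 # escape white space
                out = out "\\ "
            else                               # digits, punctuation: leave alone
                out = out c
        }
        print out
    }'
}

printf 'dead forensicator\n' > keywords.txt
to_ci_regex < keywords.txt > regex.txt
cat regex.txt
```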

The options for BE are really daunting when you first look at them, but take your time experimenting, as the benefits of using this tool are stunning.   BE has a number of scanners, most of which are turned ON by default.   When pattern matching, we will need to turn a number of them off.   We will also need to give BE a directory to send the results to and the directory cannot already exist (BE will create it for us).  There is a BE gui that we can use on Windows, but this will only process disk images, the Linux CLI version will process a physical disk (I use BE a lot in previewing a suspect machine from my custom previewing CD).   To show you the command you will need, let's assume that I have a dd image of a suspect disk in my home directory called bad_guy.dd.  I want to search the image for my list of regexes in my home directory called regex.txt and I want to send the results to a folder called bulk in my home directory.   My command would be:
bulk_extractor -F regex.txt -x accts -x kml -x gps -x aes -x json -x elf -x vcard -x net -x winprefetch -x winpe -x windirs -o bulk bad_guy.dd

The -F switch is used to specify my list of regexes, I then disable a number of scanners using the -x option for each one, the -o option specifies the directory I want the results to go to, finally I pass the name of the image file (or disk) that I want searching.   Thereafter, BE is surprisingly fast and delightfully thorough!    At the end of the processing there will be a file called "find.txt" in my bulk directory that lists all the patterns that have been matched along with the BYTE OFFSET for each match.  This is particularly useful when I suspect that evidence is going to be in web pages in unallocated space and I know that the suspect uses a browser that will cache web pages with gzip compression - BE will still get me the evidence without having to go through the pain of extracting all the gzip files from unallocated space and processing them manually.
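Once find.txt exists, a quick summary of what was matched is a handy first look.  The sketch below assumes the file is tab-separated (byte offset, matched text, context) with comment lines starting with "#", and it runs against an invented stand-in file rather than real output.

```shell
#!/bin/bash
# Invented stand-in for find.txt: offset <TAB> matched text <TAB> context
printf '# comment line\n'                                       > sample_find.txt
printf '5120\tdead forensicator\tthe dead forensicator was\n'   >> sample_find.txt
printf '9216\tDead Forensicator\tby Dead Forensicator on\n'     >> sample_find.txt
printf '20480\tdead forensicator\ta dead forensicator said\n'   >> sample_find.txt

# Skip comment lines, fold case, then count hits per matched string
grep -v '^#' sample_find.txt | cut -f2 | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn > find_summary.txt

cat find_summary.txt
```

Folding the case before counting means all the upper/lower case variants of a pattern are reported as one entry.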

Anyhows, we now have a list of all the matched patterns on the suspect disk.  We probably would like to know which files the matched patterns appear in (at the moment we only know the physical byte offset of each matched pattern).   This is no problem, there is more Linux goodness we can do to determine if any of the matched patterns appear in live/deleted files.   All we need to do is to run the fiwalk program from Simson Garfinkel that "maps" all the live and deleted files, then run Simson's python script.   So step 1 is to run fiwalk across the disk image, we will use the -X option to generate an .xml file of the output.  Our command would be:
fiwalk -X  fiwalk_badguy.xml bad_guy.dd

We can then tell the script to use the xml output from fiwalk to process the find.txt file from BE.  Note that you need to have python 3.2 (at least) installed!   Our command would be:
python3 --featurefile find.txt --xmlfile fiwalk_badguy.xml bulk FILEPATHS

So, we need to use python3 to launch the python script, we tell it to use the find.txt feature file, the xmlfile we have just generated, we then pass it the name of the directory that contains our find.txt file, finally we specify a new directory FILEPATHS that the result is going to be put into.   Upon completion you will find a file called "annotated_find.txt" that lists all your pattern matches with file paths and file names if the match appears in a live/deleted file.

The bulk_extractor pattern matching is simples, admittedly resolving file names to any matches is a tinsy-winsy bit gnarly, but it is worth the effort.   It is a lot simpler running BE from the windows GUI against an E01 file.   But you can do like I have done, write a script and add it to your previewing linux disk to automate the whole thing.

One word of advice, you will need the forked version of sleuthkit to get fiwalk running nicely, you can get that version from github at:

Running a test on CAINE 3 shows that fiwalk is not working, hopefully it will be fixed soon.   However, you can still run bulk_extractor to do your pattern matching from the boot CD and save yourself a lot of time!

Finally, happy new year to all you lucky breathers.   Thanks for all the page views, feel free to comment on any of my ramblings or contact me if you have any problems with implementing any scripts or suggestions.   I am off for my last feed of the year...yum,yum!

Wednesday, 19 September 2012

Hiding The Dead Revisited

In a previous POST, I looked at automating the detection of encrypted data.   Assuming that you find such data...then what?   You will need to know the software that opens the cyphertext, the password and possibly a key.   There is some stuff we can do to try and establish these parameters. Bearing in mind we have ALREADY done a file signature check on our system, we can use this data to help us.

First let's look at some other routines to try and detect encrypted data.   Some programs create cyphertext with a recognisable signature, you simply need to add those signatures to your custom magic file, which you will store under the /etc directory.

Some crypto programs generate the cyphertext file with a consistent file extension.
We can search our file signature database for those with something like this, assuming our database is called listoffiles and saved in the /tmp directory:

awk -F: '{print $1}' /tmp/listoffiles | egrep -i '\.jbc$|\.dcv$|\.pgd$|\.drvspace$|\.asc$|\.bfe$|\.enx$|\.enp$|\.emc$|\.cryptx$|\.kgb$|\.vmdf$|\.xea$|\.fca$|\.fsh$|\.encrypted$|\.axx$|\.xcb$|\.xia$|\.sa5$|\.jp!$|\.cyp$|\.gpg$|\.sdsk$' > /tmp/encryptedfiles.txt

In this example we are asking awk to look at the file path + file name field in our database and return only the files with certain file extensions associated with encryption.

We can also detect EFS encrypted files, no matter what size.   We could use the sleuthkit  command "istat" to parse each MFT record in the file system and search that for the "encrypted" flag there.   However, this is going to be very time consuming, a quicker way would be to simply look at our file signature database.   If you try and run the file command on an EFS encrypted file, you will get an error message, and the error message will be recorded in your signature database.   This is a result of the unusual permissions that are assigned to EFS encrypted files, you can't run any Linux commands against the file without receiving an error message.   I have not seen the error message in any non-EFS encrypted files, so the presence of this error message is a very strong indicator that the file is EFS encrypted.   We can look for the error message like this:
awk -F: '$2 ~ /ERROR/ {print $1}' /tmp/listoffiles > /tmp/efsfiles.txt
So, we have now run a number of routines to try and identify encrypted data, including entropy testing on unknown file types, signature checking, file extension analysis and testing for the error message associated with EFS encryption.

For EFS encrypted files and encrypted files with known extensions, we can figure out what package was used to create the cyphertext (we can look up the file extensions at   But what about our files that have maximum entropy values?
First, we might want to search our file database for executables associated with encryption, we could do something like this:

awk -F: '$1 ~ /\.[Ee][Xx][Ee]$/ {print $0}' $reppath/tmp/listoffiles | egrep -i 'crypt|steg|pgp|gpg|hide|kremlin' | awk -F: '{print $1}' > /tmp/enc-progs.txt

We have used awk to search the file path/name portion of our file database for executable files, sent the resulting lines to egrep to search for strings associated with encryption, then sent those results back to awk to print just the file path/name portion of the results and redirected the output to a file.    Hopefully we will now have a list of executable files associated with encryption.

We can now have a look for any potential encryption keys.   We have all the information we need already, we just need to do a bit more analysis.  Encryption keys (generally!) have two characteristics that we can look for:
1)  They don't have a known file signature, therefore they will be described as simply     "data" in our file data base.
2)  They have a fixed size, which is a power of two, and will most likely be 256, 512, 1024, 2048 bits...I emphasise BITS.

So our algorithm will be to analyse only unknown files to establish their size and return only files that are 256, 512, 1024 or 2048 bits.   We can use the "stat" command to establish file size, the output looks like this:

fotd-VPCCA cases # stat photorec.log 
File: `photorec.log'
Size: 1705            Blocks: 8          IO Block: 4096   regular file
Device: 806h/2054d      Inode: 3933803     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-09-19 11:52:48.933412006 +0100
Modify: 2012-09-19 11:52:12.581410471 +0100
Change: 2012-09-19 11:52:12.581410471 +0100

The important thing to remember is that the file size in the output is in BYTES, so we actually need to look for files that are exactly 32, 64, 128 or 256 BYTES in size, as they map to being 256, 512,1024 or 2048 BITS.
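The conversion is just a division by 8, which a couple of lines of shell arithmetic will confirm:

```shell
#!/bin/bash
# Convert common key lengths from bits to bytes (8 bits per byte)
for bits in 256 512 1024 2048
do
    echo "$bits bits = $((bits / 8)) bytes"
done
```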

So our code would look something like this:

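KEYREC () {
    # isolate the "Size" field (in bytes) from the stat output for this file
    FILESIZE=`stat "$1" | awk '/^ *Size:/ {print $2}'`
    # 32, 64, 128 or 256 bytes = 256, 512, 1024 or 2048 bits
    if [ "$FILESIZE" = "32" -o "$FILESIZE" = "64" -o "$FILESIZE" = "128" -o "$FILESIZE" = "256" ]
    then
        echo "$1" >> /tmp/enckeys.txt
    fi
}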

awk -F: '$2 ~ /^\ data/ {print $1}' /tmp/listoffiles > /tmp/datafiles.txt
while read -r i ; do KEYREC "$i" ; done < /tmp/datafiles.txt

The last two lines of code search the description part of our list of files and signatures for unknown files (using the string \ data as the indicator) and send just the file path/name for those results to a file.   That file is then read and every line fed into a function that runs the stat command on each file, isolates the "Size" field of the output and tests whether the size matches our criteria for being consistent with encryption keys.   Now you will get some false positives (only a handful), but by looking at those files in a hex viewer you will be able to eliminate those files that aren't encryption keys - if they have ascii text in them, then they aren't encryption keys!
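If the handful turns out to be a bigger handful, that eyeball check can be roughly automated.  The sketch below uses a hypothetical helper called looks_like_text which treats a file as text only if EVERY byte is printable ascii or white space.  That threshold is my own assumption, and the two candidate files are invented for the demonstration.

```shell
#!/bin/bash
# looks_like_text (hypothetical helper): succeed if every byte of the file is
# printable ascii or white space, which suggests it is not an encryption key
looks_like_text () {
    total=$(( $(wc -c < "$1") ))
    printable=$(( $(LC_ALL=C tr -cd '[:print:][:space:]' < "$1" | wc -c) ))
    [ "$total" -gt 0 ] && [ "$total" -eq "$printable" ]
}

# Two invented 32-byte candidates: one plain text, one with binary content
printf 'this is just some text, not key\n' > candidate_text.bin
printf '\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\177' > candidate_key.bin

for f in candidate_text.bin candidate_key.bin
do
    if looks_like_text "$f"
    then
        echo "$f: ascii text, probably not a key"
    else
        echo "$f: binary, worth a closer look"
    fi
done
```

Anything reported as binary still deserves a look in the hex viewer, this just thins out the obvious text files first.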

Now we have the cyphertext, the program used to decrypt the data plus the key, all we need now is the password.  We are going to again use the tool every forensicator MUST have, bulk_extractor.   One of the many features of bulk extractor is to extract all of the ascii strings from a hard drive and de-duplicate them leaving us with a list of unique ascii strings.   It may well be that the user's crypto password has been cached to the disk - often in the swap file.   We will probably want to do the string extraction at the physical disk level, as opposed to the logical disk.   We will need several gigabytes of space on an external drive as the list of strings is going to be very large, the command to extract all the strings on the first physical disk and send results to an external drive mounted at /mnt/usbdisk would be:

bulk_extractor -E wordlist -o /mnt/usbdisk /dev/sda

We need to do a bit more work, we can't realistically try every ascii string that bulk_extractor generates.  The password is likely to be long, with a mixture of upper/lower case characters + numbers.   You can use a regex to search for strings with those characteristics to narrow down the number of potential passwords (Google is your friend here!).
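One possible way of doing that narrowing down is to chain a few simple greps rather than craft one monster regex.  This is only a sketch: the sample word list is invented and the minimum length of 10 characters is an arbitrary threshold that you should adjust to the case.

```shell
#!/bin/bash
# Invented stand-in for the de-duplicated string list
printf '%s\n' 'password' 'Tr0ub4dorHorse99' 'ALLUPPERCASE123' 'short1A' 'correcthorsebattery' > sample_wordlist.txt

# Keep strings of 10+ characters (an arbitrary threshold) that contain
# at least one lower case letter, one upper case letter and one digit
grep -E '^.{10,}$' sample_wordlist.txt | grep '[[:lower:]]' | grep '[[:upper:]]' | grep '[[:digit:]]' > candidates.txt

cat candidates.txt
```

Each grep in the chain throws away the strings that fail one test, so only strings meeting all four conditions reach candidates.txt.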

So, detecting cyphertext and encrypted compressed archives, identifying potential crypto keys and potential passwords is doable with a surprisingly small amount of code.    For those on a budget, this solution costs about 10 pence (the cost of a recordable CD) if you want to use the suspect's processing power to do your analysis.