Wednesday 31 July 2013

Lines of the dead

Your humble zombie has been casting around for another topic to blog about.  I found a number of things, but none of them warranted a blog post of their own.  Therefore this post will cover several one-line commands that are simple but powerful at the same time.  Hopefully they will illustrate the power of the GNU/Linux CLI, and maybe prove useful to you in your work.

Recovering a single record from the MFT is easily doable, but why would you want to, when the istat command in the sleuthkit can parse any MFT record (and in the most useful and understandable way amongst all the forensic tools that I have looked at)?  Well, MFT parsing tools generally report on the data in the "allocated" area of the MFT record; they don't report the residual data in the record.  That residual data could contain information useful to your investigation.  It is important to understand that when data is written to an MFT record for a NEW file, the data after the end-of-record marker is zeroed out.  Therefore, any residual data that you find in an MFT record MUST once have been part of the allocated area of that record.  So, how did I extract the records?  A simple one-line command for each record (I am using an arbitrary record number in this example) like this:
icat -o 2048 bad-guy.E01 0 | dd bs=1024 count=1 skip=9846 > MFT9846.dat

The first part before the pipe simply outputs the contents of record 0 on the partition starting at sector 2048 on my disk image.  Record 0 on an NTFS file system is the MFT itself, so the entire MFT is output.  We then pipe the output to the dd command, specifying a block size of 1024 bytes (the size of each record) and skipping 9846 records until we get to the one we want.  The output is redirected to a file - job done!
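If you want to double-check the record size before running the dd stage (it is almost always 1024 bytes, but it is worth confirming), fsstat will report it, and xxd lets you eyeball the residual data in the extracted record.  A quick sketch, using the same image and offset as above:

fsstat -o 2048 bad-guy.E01 | grep -i "mft entry size"
xxd MFT9846.dat | less

Adjust the bs= value in the dd stage if fsstat reports anything other than 1024.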

Here is a way to identify files that contain an arbitrary number of instances of a particular string.  Another example of how this might be useful:
In several examinations I have found web pages in the web cache that contain chat records; these generally come from websites that use flash-based chat applications.  I noted that the chat messages were generally preceded by the string "said:", so you would see the username, followed by the string "said: ", followed by the message.  I therefore automate the process of searching for web pages that contain SEVERAL instances of the string "said: ".  Obviously, if we looked for web pages that contained a single instance of the string "said: " we would get a number of false positives, as that string might appear in a web-based story, or very frequently in a news story quoting a source.
So, we could find all the web pages in the mounted file system, remembering that if it is a Windows box being investigated there is a likelihood that there will be white space in file names and file paths, so we need to set our Internal Field Separator environment variable to new lines only, like this:
IFS='
'
That's IFS=' followed by a press of the return key, followed by a single ' character.  Theoretically you should be able to type IFS='\n', but that doesn't work - in bash it sets IFS to the two literal characters \ and n.  IFS=$'\n' does the same job as the explicit return key if you prefer to keep it on one line, but I explicitly use the return key.

find /disk/image -type f -exec file {} \; | awk -F: '$2 ~ /HTML document/ {print $1}' > /tmp/livewebs.txt

This command conducts a file signature check on each file in the live file system and looks at the description field in the output for the string "HTML document"; if there is a match, the file path and file name are sent to a file in my /tmp directory.

I can then process my list of files like this:

cat /tmp/livewebs.txt | while read LWC ; do egrep -H -c -e " said:" -e " says:" -e "msgcontent" "$LWC" | awk -F: '$2 > 2 { print $1 }' ; done > /tmp/livewebchat.txt

The above command reads my list of files, line by line, and searches each one for 3 regular expressions ( said:, says: and msgcontent).  The -H option for egrep reports the file path and file name, the -c option reports the number of matching lines for each file.  So a list of file paths/names is generated, each followed by a colon and the number of matching lines.  The result is piped to awk with the field separator set to a colon (giving us 2 fields).  The second field is checked to see if the count is greater than 2; if it is, the first field (the file name/path) is sent to a new file in the /tmp directory.  I could then use that new file to copy out all of the files:

cat /tmp/livewebchat.txt | while read i ; do cp "$i" /home/forensicotd/cases/badguy/webchat ; done 

Obviously I would need to do that again with any deleted files I've recovered with the tsk_recover command, or any unallocated web pages I have recovered with photorec.  I would need to do something similar with gzipped files as well, as they could be compressed web pages - so use zgrep instead of egrep.  Remember that simply copying files out is a bad idea - you need to use the FUNCTION I posted to ensure that files don't get over-written in the event of a file name collision.
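For the gzipped candidates, the same idea works with zgrep.  Here is a minimal sketch, assuming you have built a list of compressed web pages in a hypothetical /tmp/livegz.txt:

cat /tmp/livegz.txt | while read GZ ; do
  hits=`zgrep -c -e " said:" -e " says:" -e "msgcontent" "$GZ"`
  [ "$hits" -gt 2 ] && echo "$GZ"
done >> /tmp/livewebchat.txt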

Another pain in dealing with web pages is actually viewing them.  You REALLY shouldn't be hooked up to the 'net whilst you are doing forensics - if you are, then you are doing it WRONG!  But browsers often struggle to render raw web pages because they try to fetch the remote content referenced in the html.  So, one thing we can do with the web pages listed in our livewebchat.txt file is remove all the html tags, effectively converting them to text files that we can view easily in a text editor, or the terminal.  We can do that with the html2text program.  Even better, we can output the full path and filename at the top of our converted web page so that if we find anything interesting we know the exact file path that the file came from.  Here is a function that I wrote to convert the web page to text, output the file path/name, and check if the file name already exists, adjusting it to avoid over-writing a file with an identical name:

suswebs () {
filepath="$1"
filename=`basename "$1"`
msg=`html2text "$1"`
# if a file of that name has already been written, bump a [n] counter in the name
while [ -e "${WDIR}/${filename}.txt" ]; do
        filename=`echo "$filename" | perl -pe 's/(\[(\d+)\])?(\..*)?$/"[".(1+$2)."]$3"/e;'`
done
# original path on the first line, converted page below it
printf '%s\n%s\n' "${filepath}" "${msg}" > "${WDIR}/${filename}.txt"
}

To use the function you need to set the WDIR variable to the location where you want your processed web-pages to go to, so something like this:
WDIR=/home/forensicotd/cases/badguy/webchat 
cat /tmp/livewebchat.txt | while read l ; do suswebs "$l" ; done

Obviously you can add all of these commands to your linux preview disk.



Monday 1 July 2013

Maintaining the dead

In this post I will walk you through working with the overlayfs feature in recent Linux kernels, and how to use it to make changes to a Linux live distro truly permanent.

For most of my forensic previewing I use the CAINE forensic GNU/Linux distro.  You can fully customise the distribution to fit your needs.  Most of the stuff I blog about I have added to my distro to automate the previewing of suspect devices.

The CAINE guys recently released a new version, called PULSAR.  They have necessarily had to make some changes due to how Ubuntu is now packaged (CAINE is based on the Ubuntu distro).  Previously, you could download an .iso to burn to CD and a dd image to copy to a thumbdrive.  One of the major changes is that the version designed to be installed to a USB device (or indeed a hard disk on your workstation) now comes as an .iso.  I did some major surgery to the previous versions then used the remastersys scripts bundled in CAINE to generate my own customised .iso.  Obviously, it isn't that simple any more, so I thought I would walk you through making persistent changes to the latest CAINE distro.

The basic steps to customising CAINE are:
Download NBCAINE 4.0
Install to thumbdrive
Make your changes to the distro
Merge your changes into the system
Create your new .iso

The first thing you need to do is download NBCAINE 4.0 from the CAINE website.  You then need to use a live-disk installer to install the .iso to your thumbdrive.  I use UNETBOOTIN for this, but there are other installers out there for you to experiment with.  You will need a clean USB stick, formatted as FAT32.  So all you need to do now is plug in your thumbdrive, launch unetbootin and configure it like this:

You need to make sure you have selected Ubuntu 12.04 Live as your Distribution.  Then select the path to your nbcaine .iso image.  You then need to configure how much space you require to preserve files across reboots.  I make lots of changes so I select 2000 MB - you will need an 8 GB thumbdrive if you go this big.  I then select the partition on my thumb drive that the distro is going to be installed to.
Once you click on "OK" the installation will start.  Once complete, you can reboot and select your thumb drive as your boot device to boot into the CAINE environment.

One of the big differences you will find in CAINE 4.0 is that the on-disk structure is radically different from version 2.0.  Previously, when using the dd image of CAINE, you would have found the expected Linux directory tree in the root of the partition - directories such as bin, boot, home, var, usr, etc.  That directory structure does still exist, however it is contained inside a squashfs file system that is decompressed at boot.  The important files are contained within the "casper" directory in the root of your .iso.  We should have a look at the root directory to see what's there - this will help us understand how the CAINE system works:

-rw-r--r--  1 fotd fotd  176 2013-03-15 17:47 autorun.inf
drwx------  2 fotd fotd 4.0K 2013-06-28 16:07 casper
-rw-r--r--  1 fotd fotd  1.9G 2013-06-28 16:49 casper-rw
drwx------  2 fotd fotd 4.0K 2013-06-26 13:31 .disk
drwx------  4 fotd fotd 4.0K 2013-06-26 13:31 FtkImager
drwx------  2 fotd fotd 4.0K 2013-06-26 13:34 install
drwx------  2 fotd fotd 4.0K 2013-06-26 13:34 isolinux
drwx------  2 fotd fotd 4.0K 2013-06-26 13:34 lang
-r--r--r--  1 fotd fotd  32K 2013-06-26 13:35 ldlinux.sys
-rw-r--r--  1 fotd fotd  52K 2013-03-16 16:38 md5sum.txt
-rw-r--r--  1 fotd fotd  55K 2013-06-26 13:35 menu.c32
-rw-r--r--  1 fotd fotd 1.4K 2013-03-15 17:56 NirLauncher.cfg
-rwxr-xr-x  1 fotd fotd 102K 2013-03-15 17:47 NirLauncher.exe
drwx------  2 fotd fotd  36K 2013-06-26 13:31 NirSoft
drwx------  2 fotd fotd 4.0K 2013-06-26 13:35 piriform
drwx------  2 fotd fotd 4.0K 2013-06-26 13:35 preseed
-rw-r--r--  1 fotd fotd  192 2013-03-16 15:51 README.diskdefines
drwx------  2 fotd fotd 8.0K 2013-06-26 13:35 sysinternals
-rw-r--r--  1 fotd fotd 1.7K 2013-06-26 13:35 syslinux.cfg
-rw-r--r--  1 fotd fotd  25K 2013-06-26 13:35 ubnfilel.txt
-rw-r--r--  1 fotd fotd  20M 2013-03-16 15:53 ubninit
-rw-r--r--  1 fotd fotd 4.7M 2013-03-16 15:53 ubnkern
-rw-r--r--  1 fotd fotd 1.6K 2013-06-26 13:31 ubnpathl.txt
-rw-r--r--  1 fotd fotd    0 2013-03-16 16:37 ubuntu
drwx------  7 fotd fotd 4.0K 2013-06-26 13:35 utilities

Many of the above files are for the live incident response side of things.  The structures that we are interested in are the casper-rw file and the casper directory.  Remember when we used unetbootin and were asked how much space we wanted to reserve for persistent changes?  Whatever value you enter, the casper-rw file is created with a size corresponding to that value.  In our case we specified 2000 MB, which is about 1.9 GB, hence our casper-rw file is 1.9 GB.  It is this file that captures and stores any changes we make to the CAINE system.  Let's have a look inside the "casper" directory:

-rw-r--r--  1 fotd fotd  62K 2013-06-28 16:39 filesystem.manifest
-rw-r--r--  1 fotd fotd  62K 2013-06-28 16:39 filesystem.manifest-desktop
-rw-r--r--  1 fotd fotd   10 2013-06-28 16:39 filesystem.size
-rw-r--r--  1 fotd fotd 1.5G 2013-06-28 16:41 filesystem.squashfs
-rw-r--r--  1 fotd fotd  20M 2013-03-16 15:53 initrd.gz
-rw-r--r--  1 fotd fotd  192 2013-03-16 15:51 README.diskdefines
-rw-r--r--  1 fotd fotd 4.7M 2013-03-16 15:53 vmlinuz


The main file of interest here is filesystem.squashfs.  This file contains the compressed Linux file system; it gets decompressed and mounted read-only at boot.  The previously described casper-rw is an ext2 file system that is mounted read-write at boot - it retains all the changes that are made.  Using the overlayfs system, the casper-rw overlays the decompressed filesystem.squashfs, but the O/S presents them as a single unified file system.  We should bear in mind that the files "filesystem.size", "filesystem.manifest" (which lists all the installed packages) and "filesystem.manifest-desktop" have to accurately reflect various properties of the filesystem.squashfs file.  Thus, if we add programs to our installation we have to update all of the "filesystem.*" files.
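If you want to see what those bookkeeping files look like before touching anything, a quick peek does no harm (the paths assume the thumb drive is mounted at /media/usb):

head -3 /media/usb/casper/filesystem.manifest    # one "package version" pair per line
cat /media/usb/casper/filesystem.size            # size of the uncompressed file system in bytes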

Using overlayfs is useful for retaining changes over reboots, however it does mean that the system is somewhat bloated.  What we really need to do is merge the changes in the casper-rw file system down into the compressed filesystem.squashfs; then we don't need our large 2000MB casper-rw file any more (or at least we can use a much smaller file once we have made our major changes).

So, I have figured out how to do this...and it isn't actually that difficult.  All we need is to make sure we have the squashfs tools installed and a recent Linux kernel that supports overlayfs.  To check if your kernel supports overlayfs, simply have a look at your /proc/filesystems file and see if it is listed.  If not, you will need to install the overlayroot package, which is hopefully in your package manager.  So let's go through this step-by-step.  We've created our bootable thumb drive, booted into that environment, and made all the changes that we want to make.  So shut down and reboot into your Linux environment, plug in your CAINE thumbdrive and mount the partition in read-write mode.
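Checking for the two prerequisites is quick - note that depending on your kernel the file system may be listed as "overlayfs" or "overlay", and the package name below is the Debian/Ubuntu one:

grep -i overlay /proc/filesystems
dpkg -l squashfs-tools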

Change into a working directory in your Linux distro.  We are going to create two mount points, mount our filesystem.squashfs and casper-rw file systems, then overlay one over the other with overlayfs before creating a new filesystem.squashfs.  So, let's assume my thumb drive is mounted at /media/usb; now we create our mount points in our working directory:
mkdir caine-ro
mkdir caine-rw

Now we mount our casper-rw file in our caine-rw mount point:
sudo mount -o loop /media/usb/casper-rw caine-rw
We mount our filesystem.squashfs in our caine-ro mount point:
sudo mount -o loop /media/usb/casper/filesystem.squashfs caine-ro

Now we need to overlay one file system over the other using the overlayfs system.  Conceptually, overlayfs has an "upperdir" which takes precedence and a "lowerdir".  Obviously the "upperdir" gets overlaid on top of the "lowerdir", in our scenario the caine-rw is our "upperdir" and caine-ro is our "lowerdir".

Now we can do our file system overlaying:
sudo mount -t overlayfs caine-ro -o lowerdir=caine-ro,upperdir=caine-rw caine-rw
We now have our two file systems overlaid with each other, so we can make a new filesystem.squashfs to replace the one on our thumb drive...but which mount point holds the unified file system from which we need to create our new filesystem.squashfs?  Remember it is the "upperdir" that takes precedence, which in our case is caine-rw.  So we need to create our new filesystem.squashfs from that mount point, which we can do like this:
sudo mksquashfs caine-rw/ filesystem.squashfs

We now have our replacement filesystem.squashfs that contains all our changes.
We now need to update the rest of the "filesystem.*" files to reflect the changes in our filesystem.squashfs.  Let's update our filesystem.manifest:
sudo chroot caine-rw/ dpkg-query -W --showformat='${Package} ${Version}\n' > filesystem.manifest

Now we can update our filesystem.manifest-desktop with this command:
cp filesystem.manifest filesystem.manifest-desktop

Unmount our mounted filesystems:
sudo umount caine-r*

Finally, we can update the filesystem.size file - we'll need to mount our NEW filesystem.squashfs to do this:
sudo mount -o loop filesystem.squashfs caine-ro
printf $(sudo du -sx --block-size=1 caine-ro | cut -f1) > filesystem.size

At this point we have everything we need in our working directory, we just need to copy all of the "filesystem.*" files to our mounted thumb drive containing our CAINE system.  Remember they need to go into the "casper" directory:

sudo cp filesystem.* /media/usb/casper/

Let's unmount our filesystem.squashfs:
sudo umount caine-ro

To finish up we need to go and work on our mounted thumbdrive.

We can make a new (much smaller) casper-rw to hold any minor future changes we want to make.  We need to delete the existing casper-rw, create a new one, then format it - we'll use ext2 in our example:

sudo rm /media/usb/casper-rw
sudo dd if=/dev/zero of=/media/usb/casper-rw bs=1M count=300
sudo mkfs.ext2 -F /media/usb/casper-rw

Finally we need to make a new md5sum.txt file to reflect our changes:
cd /media/usb
sudo find . -type f -exec md5sum {} \; > md5sum.txt

All done!  Remember to unmount all your various mounted filesystems relating to your work with the CAINE installation.






Sunday 19 May 2013

Unwinding the dead

Your humble zombie forensicator has been quiet lately.  This was partly due to the prolonged bitterly cold conditions that generally render us un-ambulatory until the thaw.  However I have also been trying to get my python chops in order and have been studying the iTunes backup format.
So, I have written some scripts to help parse iTunes backups, and I am going to share them with you.

I have had a run of cases where significant information has been found in the iTunes backups on computers that I have looked at.  If you weren't aware, owners of iPhone/iPad/iPod mobile devices can hook them up to their computers for backup purposes.  The resulting backups are a nuisance to process - all the files have been re-named based on hash values and have no file extension.  However, the backups contain significant amounts of data that can be useful to the forensic examiner.  The main files of note are generally sqlite databases that can contain chat/messaging/email messages.  There are a number of popular chat/messaging/email apps on the iPhone that appear over and over in the backups that I have looked at; they are generally:
skype
whatsapp
yahoo
Apple sms
google email

Luckily, there is lots of info on the 'net about how to deal with the sqlite databases that are created by the iPhone apps.  You should really start with the utterly essential Linux Sleuthing blog, which has an excellent series of articles on processing sqlite databases.  I suppose the main thing to bear in mind when dealing with sqlite databases is NOT to rely on automated tools.  By all means use them to get an overview of the type of data available, but the most effective approach is to process them manually.  Understand the table layouts and schema and execute your own queries to extract data across tables.  Often automated tools will just dump all the data for each table sequentially.  It is only by doing cross-table queries that you can marry up phone numbers to the owners of those phone numbers, or screen names to unique user names.  There is no way I can improve on the info available in the Linux Sleuthing blog, so please visit it to get the skinny on sqlite queries.
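As an illustration of the kind of cross-table query I mean, here is a sketch against an iOS-6-era Apple sms.db, which keeps the phone number in a handle table and the message text in a message table.  Treat the table and column names as assumptions and check the schema of your own copy with .schema first:

sqlite3 -header sms.db "SELECT h.id AS number,
  datetime(m.date + 978307200, 'unixepoch') AS sent, m.text
  FROM message m JOIN handle h ON m.handle_id = h.ROWID
  ORDER BY m.date;"

The 978307200 constant shifts Apple's 2001-based timestamps onto the unix epoch so that sqlite's datetime() can render them.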

The main problem with processing iTunes backups is that there are so many apps that may be used for messaging/chat/email that it is nearly impossible to keep up with which ones are out there, what format they store their data in and where it can be found in the backup.  The first step for me in examining an iTunes backup is to try and establish what kind of device it is that has been backed up.  This is fairly simple: you just need to view the contents of the Info.plist file, which is XML, so it can be viewed in any text viewer.  A plist is simply a series of key-value pairs, so you just need to find the "Product Type" key and look at the corresponding value.  If you can't find the Info.plist then look for the Manifest.plist.  This is present in more recent backups and is a binary plist file.  Just download the plutil tool and generate the Info.plist like this:
plutil -i Manifest.plist -o Info.plist
You now have a text version of the Manifest.plist which you can examine in a text editor.
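To pull out just the device model, grep for the key and the line that follows it (in the backups I have looked at the key is "Product Type", with the value on the next line):

grep -A 1 "Product Type" Info.plist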

Next thing to do is recreate the original directory structure of the backup and restore the files to their original locations and file names.
The whole point of these backups is that the iTunes software can restore them to your original device if your device encounters a problem.  So how does iTunes know the original directory structure and file names of the files in the backup?  When the initial backup is performed, a file called Manifest.mbdb is created.  This is a binary database file that maps the original path and file names to their new hash-based file names in the backup.  So, all we need to do is create a text version of the Manifest.mbdb, so we can read it and understand the original structure of the backup.  We can generate our text version of the Manifest database with THIS python script.

Then it is a case of recreating the original directory structure and file names - more on this later.
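The mapping itself is worth knowing even if a script does the work for you: in these mbdb-era backups each stored file is named with the SHA-1 hash of its domain and relative path, joined by a hyphen.  A quick sanity check, using the well-known SMS database path as an example:

echo -n "HomeDomain-Library/SMS/sms.db" | sha1sum

The resulting hash should match the name of one of the extensionless files sitting in the backup directory.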

Once we have the original directory/file structure we can browse it in our file browser of choice.  It is important to remember that not all chat is saved in sqlite databases.  Some services such as ICQ save the chat in text-based log files.  To complicate matters, there is one format for the ICQ-free chat service and another format for the ICQ-paid chat service.  The ICQ-free chat logs appear to be in a pseudo-JSON type format and not at all easily readable.  I say pseudo-JSON type because my python interpreter could not recognise it as JSON (using the json module) and even online JSON validators stated that the data was not valid JSON.  The ICQ-paid logs are a bit more user friendly and very interesting to forensicators.  They appear to be a record of everything that happens within the ICQ environment once the ICQ app is launched.  This means that the chat messages are buried a long way down in the log file and are not straightforward to read.  No matter, both ICQ formats were amenable to scripting via a couple of python scripts that I wrote.

As a salutary example of not relying on automated tools: I processed an iTunes backup of an iPhone in a commercial forensic tool (mainly used for processing mobile phones).  In the "chat" tab of its output, the commercial tool listed 140 ICQ-Free chat messages and 0 ICQ-Paid chat messages.  Using my technique on the same backup I recovered 3182 ICQ-Free chat messages and 187 ICQ-Paid chat messages.

Having warned against relying on automated tools, I have produced an err....automated tool for processing iTunes backups in the way I have described.  It comprises one bash script + 2 python scripts.  What it does is "unbackup" the backup and execute sqlite queries on some of the more well known sqlite databases.  However, it also copies out any processed sqlite databases and any unprocessed databases so that you can manually interrogate them.

So, I'll let you have the scripts if you promise to manually process the sqlite databases that are recovered (and visit the Linux Sleuthing blog)!  BTW, if you do find any interesting databases, let me know the path, file name and sqlite query and I can add the functionality to my script.

My project is hosted at google code, HERE

I have, of course, added the scripts to my previewing home brew, so now any iTunes backups should be automatically detected and processed.

Thursday 10 January 2013

Resurrecting the dead

Unlike zombies, deleted files will not miraculously return to life on their own; we need to either undelete them or carve them if there is no file system metadata to help us.  So I want to blog about file carving.  This will be over two posts: the first post will deal with theory, the second post will look at a couple of tools that I use.

In theory, file carving is straightforward enough - just look for a file header and extract the file.  The practice is a lot more complicated.
Let's consider carving files out of unallocated space.  The first consideration is "what is unallocated space?".  Imagine a deleted file in an NTFS file system, where the file data has not been overwritten.  Both the file data and the metadata are intact; the MFT entry has the flag set to indicate that the file is deleted, thus the clusters are available for allocation by another file.
Do you consider this file to be in unallocated space?  Some people say yes, as the relevant clusters are not allocated to a live file; some say no, as the relevant clusters ARE allocated, albeit to a deleted file.  In many ways the question is academic: it doesn't matter what you consider to be unallocated space, it matters what your file carving tool considers to be unallocated space.  If you don't know what your tool considers to be unallocated space, then how do you know if you have recovered all of the potentially recoverable files?

Another consideration is what strategy you are going to use.  File carving tools have different approaches to the problem of carving.  One approach is to search the entire data stream, byte by byte, looking for file signatures.  This is the most thorough approach, however it is the most time consuming and will potentially lead to a number of false positives.  Some file signatures are maybe only 2 bytes in length; by pure chance we can expect those 2 bytes to appear on a hard disk a number of times.  Those 2 bytes may or may not represent the file header that you are interested in.  Figuring out if they are relevant headers or false positives can be quite challenging.

One way to reduce the number of false positives is to search for file signatures at the block (or cluster) level.  As the file signature is normally at the start of a file, we only need to look at the cluster boundary - as that is where the start of files will be.  Any file signatures found here are unlikely to be false positives; what's more, our carving routines will be a lot quicker.  The downside to this is that valid files may get missed, especially if there is a new file system overlaying an old file system.  The cluster boundary for the OLD file system may not fall at the cluster boundary for the NEW file system.  Imagine a 500 GB hard drive with a single partition filling the disk; when formatted, the block size may be 16 sectors.  If a user then shrinks that partition to 400GB and creates a new partition in the remaining 100GB, the block size might be set at 8 sectors.  You would need your carving tool to search for headers at the 16 sector boundary for the first partition, and the 8 sector boundary for the second partition.  Maybe a better solution would be to search for signatures at the sector boundary?  This would ensure that all block (cluster) boundaries were searched, but increase both the time taken and the risk of finding false positives.  Searching at the sector boundary means that there is also a possibility of files embedded in other files not being found, if they are not saved to disk at sector boundaries (not sure if this is possible, I have never tested it).
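Before settling on a boundary it is worth letting the sleuthkit tell you the actual geometry rather than guessing.  A quick sketch, assuming a raw image called bad_guy.dd with a file system starting at sector 2048:

mmls bad_guy.dd                                      # partition layout, with the start sector of each partition
fsstat -o 2048 bad_guy.dd | grep -i "cluster size"   # cluster size of the file system at that offset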

Once you have decided your strategy, the problems don't end there.  From a programmer's point of view, how do you decide when your tool stops carving and resumes searching for file headers?  This is probably the biggest problem facing programmers of carving tools.  Some files have footers, so you could program your carving tool to just keep going until it gets to the file footer.  But what happens if the footer is overwritten or is part of a fragmented file...your tool will just keep carving data out until it reaches the end of the disk or eventually finds a footer many, many clusters further into the disk.  There are different potential solutions to this problem.  One is to set a maximum file size so that your tool stops carving at a certain point, even if no footer is found.  Another solution is to stop carving once your tool finds another header.  The problem here is deciding what file type header should be your stop point: if you are carving for jpgs, do you keep carving until you find another jpg header, or any type of header?  If your carving engine does byte-by-byte carving and you are using "any known file signature" as your stop point, you risk ending the carving prematurely if your tool finds a "false positive" header.  You can combine the approaches, as Jesse Kornblum did when coding the "foremost" file carver - that is to say, once you start carving, carve until the maximum file size is reached or a footer is found.  In fact there are now quite a few different approaches to the problems posed by file carving; a good overview can be found in this PRESENTATION.
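If you use foremost, that max-size/footer trade-off is exactly what its configuration file expresses.  A sketch of a custom entry and the corresponding run - the columns are roughly extension, case-sensitivity, maximum carve size in bytes, header and optional footer, but check the comments in the default foremost.conf for the exact syntax of your version (myfore.conf is just whatever file you save the entry into):

# extension   case   max size    header         footer
jpg           y      20000000    \xff\xd8\xff   \xff\xd9

foremost -c myfore.conf -i bad_guy.dd -o carved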

Ultimately, once you understand how your file carving tool works, there is no "right way" or "wrong way" to do file carving.  The file signature searching engine in Encase is very thorough, however it uses a "byte-by-byte" strategy, meaning that there are many false positives, and it doesn't really do file carving as it doesn't export the found files.  My own preferences depend on what I am looking for: generally for unallocated space I will carve at the sector or cluster boundary, while for swap and hiberfil files I do byte-by-byte carving.  I will do a step by step post in the next few days on a couple of the file carving tools that I use routinely.  One of them, photorec, is another one of the tools that I use on just about every case I can think of.


Wednesday 2 January 2013

Traces of the dead

I have blogged about bulk_extractor on several occasions.   As it is such an essential and useful tool for forensicators, I thought I would do a post dedicated to the tool.

Bulk_extractor is a tool that scans a disk, disk image or directory of files for potentially evidential data.  It has a number of scanners that search for various artifacts such as urls, email addresses, credit card numbers, telephone numbers and json strings.  The recovery of json strings is particularly useful as a number of chat clients will generate chat messages in json format.

The url recovery is another extremely useful feature.  We probably all have our favourite web history parsers and web history recovery tools.  However, recovering all the available web browser history on a disk is incredibly difficult.  Once again you really need to know what your tool of choice is doing.  So, if you have a tool that claims to recover Internet Explorer history, do you understand how it works?  Maybe it is looking for the index.dat file signature?  This is useful...up to a point.  It will work fine for index.dat files stored in contiguous clusters, but what happens if the index.dat file is fragmented?  Under these circumstances your tool may only recover the first chunk of index.dat data, as this chunk contains the index.dat header.  Your tool may therefore miss thousands of entries that reside in the other fragmented chunks.  Most respectable tools for recovering IE index.dat look for consistent and unique features associated with each record in an index.dat file, thus ensuring that as many entries as possible are recovered.  Other web browser history may be stored in an sqlite database; finding the main database header is simple enough, and even analysing the sqlite file to establish if it is a web history file is simple.  However, it gets much more difficult if there is only a chunk of the database available in unallocated space.  Some tools are able to recover web history from such fragments in some circumstances - does your tool of choice do this?  Have you tested your assumptions?
Some web history is stored in such a way that there are no consistent and unique record features in the web history file.  Opera web history has a simple structure that doesn't stretch much beyond storing the web page title, the url and a single unix timestamp.  There are no "landmarks" in the file that a programmer can use to recover the individual records if the web history gets fragmented on the disk.  Yahoo browser web history files pose much the same problem.
bulk_extractor overcomes these problems by simply searching for urls on the disk.  It ingests 16MB chunks of data from your input stream (disk or disk image), extracts all the ascii strings and analyses them to see if there are any strings that have the characteristics of a url, i.e. they start with "http[s]" or "ftp" and have the structure of a domain name.  In this way you can be confident that you have recovered as much web history as possible.  However, there is a big downside here - you will also recover LOTS of urls that aren't part of a web history file.  You will recover urls that are in text files and pdf files, but most likely urls that are hyperlinks in raw web pages.  Fortunately, the output from bulk_extractor can help you here.  Bulk_extractor will create a simple text file of the urls that it finds.  For each hit it lists the byte offset of the url, the url itself and a context entry; the context entry does what it says on the tin - it shows the url with a number of bytes either side of it, giving you the context that the url was found in.  I have split a line of output from bulk_extractor over three lines for ease of viewing.  The first line shows the byte offset on the disk where the url was found, the second line shows the url, the third line shows the url in context as it appears on the disk:
81510931
http://www.gnu.org/copyleft/fdl.html
opyright" href="http://www.gnu.org/copyleft/fdl.html" />\x0A    <title>

As can be seen, the url, when viewed in context, is preceded by "href=", which indicates that the url is actually a hyperlink in a raw web page.
Bulk_extractor doesn't stop there though.  It will also analyse the recovered urls and generate some more useful data.  One of the files that it generates is url_searches.txt - this contains any urls associated with searches.  The file also shows the search terms used and the number of times that the search urls appear on the disk, so a couple of lines of output might look like this:
n=5 firefox
n=4 google+video

You may need to parse individual records with your favourite web history parser, or you may have to do it manually if your favourite web history recovery tool fails to recover a url that bulk_extractor found - this has happened to me on several occasions!

bulk_extractor also creates histograms - files that show the urls along with the number of times each url appears on the disk, sorted in order of popularity.  Some sample output will look like this:


n=1715 http://www.microsoft.com/contentredirect.asp. (utf16=1715)
n=1067 http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
n=735 http://www.mozilla.org/MPL/
n=300 http://www.mozilla.org/xbl (utf16=2)
n=292 http://go.microsoft.com/fwlink/?LinkId=23127 (utf16=292)
n=228 http://home.netscape.com/NC-rdf#Name (utf16=49)
n=223 http://ocsp.verisign.com
n=220 http://www.DocURL.com/bar.htm

Notice how urls that are stored in multi-byte (utf-16) encodings are also recovered by default - hence the utf16= counts.  Obviously there are going to be a LOT of urls that will appear on most hard drives: urls associated with microsoft, mozilla, verisign etc.
You can download some white lists that will suppress those types of urls (and emails and other features recovered by bulk_extractor).  Other url analysis that bulk_extractor performs includes discovering facebook id numbers and skydrive id numbers.  Of course, it is trivially simple to write your own scripts to analyse the urls to discover other interesting urls.  I have written some to identify online storage, secure transaction and online banking urls.
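As an example of that kind of post-processing, here is a sketch that pulls likely online-storage urls out of the url.txt feature file; the feature itself is the second tab-separated column, and the domain list is purely illustrative:

cut -f2 bulk/url.txt | egrep -i 'dropbox\.com|mediafire\.com|rapidshare\.com|skydrive' | sort | uniq -c | sort -rn | head -20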

bulk_extractor will also recover email addresses and create histograms for them.  The important thing to remember here is that there isn't (currently) any pst archive decompression built into bulk_extractor.  Therefore if your suspect is using the outlook email client you will have to process any pst archive structures manually.  Other than that, the email address recovery works in exactly the same way as url recovery.

Using bulk_extractor can be a bit daunting, depending on what you are looking for and searching through.  But to recover urls and emails from a disk image called bad_guy.dd the command is:
bulk_extractor -E email -o bulk bad_guy.dd

By default ALL scanners are turned on (the more scanners enabled, the longer it will take to run); using the -E switch disables all the scanners EXCEPT the named scanner, in our case the "email" scanner (which recovers email addresses and urls).  The -o option is followed by the directory name you want to send the results to (which must not already exist - bulk_extractor will create the directory).
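When it finishes, the feature files land in the output directory; the ones discussed above look something like this (exact file names vary a little between versions):

ls bulk/
email.txt  email_histogram.txt  report.xml  url.txt  url_histogram.txt  url_searches.txt  ...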

I urge you all to download and experiment with it; there is a Windows gui that will process E01 image files and directory structures.  There are few cases that I can think of where I don't run bulk_extractor, and there have been a number of occasions where I have recovered crucial urls or json chat fragments missed by other tools.