Monday 31 December 2012

Words of the dead

So, in most forensic exams we probably search the hard drive for keywords, using a process I sometimes jokingly refer to as "keyword searching".  

Referring to the process as keyword searching is something of a misnomer, as often it isn't words that we are looking for - it could be numbers or even a pattern of characters.  Therefore, I urge you to put this keyword-searching silliness behind you and use the more accurate term "pattern matching".

It has to be said that for the longest time, performing pattern matching over a disk or disk image in Linux was soul destroying (for those of you still lucky enough to have a soul).  It was possible to run grep across the disk, but memory-exhaustion problems abounded.  It was not possible to match patterns in compressed data using this approach.  There were also problems with matching patterns that were multi-byte encoded.  Certainly, doing this in a preview on the suspect machine could take a week or more.  The way I approached the problem was to create a database of files in the live set, use that data to exclude all the binary files, then grep each of the remaining files for my keywords.  However, file and cluster slack was not searched with this approach.  Processing unallocated space was a case of outputting each cluster, running strings across it and then searching the raw strings for my patterns - very time consuming!
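For the curious, that old workflow can be sketched as a shell function - purely illustrative, and the mount point and pattern file names are hypothetical:

```shell
# A rough sketch of the old manual approach: walk the live file set,
# use file(1) to weed out the binaries, grep the rest.
grep_live_files() {
    mountpoint="$1"   # e.g. a read-only mount of the suspect file system
    patterns="$2"     # grep patterns, one per line
    # Walk every regular file; only grep the ones file(1) calls text.
    # (Simplistic - it will choke on file names containing newlines.)
    find "$mountpoint" -type f | while read -r f; do
        if file "$f" | grep -q 'text'; then
            grep -H -f "$patterns" "$f"
        fi
    done
}
# usage: grep_live_files /mnt/suspect keywords.txt
```

Even then, slack space and unallocated clusters go untouched - which is exactly the gap described below.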

So, my rotting, black heart leapt for joy when the mighty Simson Garfinkel released the every-forensicator-should-have-it bulk_extractor program.  One of the many features of the program is the ability to search for a pattern, or list of patterns, on a disk (or disk image).  Bulk_extractor knows nothing about partitions or file systems; it treats the input data as a raw stream.  One of the great things about it is the ability to decompress/decode data on the fly and search that for your patterns.  In addition, BE will perform recursion to a depth of 5 by default (that can be tweaked by passing some options to it).  This recursion means that if BE finds some compressed or encoded data, it will search the decompressed data for more compressed or encoded data and decompress/decode that on the fly, drilling down 5 layers by default.  Thus if there is a compressed docx file that has been encoded as a base64 email attachment that has been placed in a zip archive, then BE will be able to drill down through all those layers to get to the "plain text" inside the docx file.  In addition, it will find your patterns even if they are multi-byte encoded, with UTF-16 for instance.  As BE knows nothing about file systems, the data in unallocated space, swap files etc gets processed.  Does your forensic tool of choice process compressed/encoded data in unallocated space when you are pattern matching?  Have you tested your assumptions?
The only caveat to using BE is that when handling compressed/encoded data it is searching for file signatures - fragmented files are therefore unlikely to be fully processed; only the first fragment, the one containing the file header, stands a chance of being processed.

The main thing to remember is that your search patterns are CASE SENSITIVE!  You have to take this into account when preparing your list of patterns.  If your list contained "dead forensicator", then the pattern "Dead Forensicator" would NOT be matched (actually, not even "dead forensicator" would be matched - we will come on to that).  You could of course add the pattern "Dead Forensicator" to your list, but then the pattern "DEAD FORENSICATOR" would not be matched.

Luckily, BE is able to handle regular expressions - in fact, the patterns you are looking for really need to be written as regular expressions.  This means that you will need to "escape" any white space in your patterns by putting a "\" character in front of it - this tells BE to treat the white space literally as white space (it has a special meaning otherwise).  So, to match the pattern "dead forensicator", you will have to write it as "dead\ forensicator".  If you are not familiar with regular expressions (regexes), you should make every effort to get acquainted with them - using regexes will make you a much more effective forensicator.

You probably can't be sure whether the pattern you are searching for will be in upper case, lower case or a combination of both.  Regexes allow you to deal with any combination.  If we were searching for either the upper or lower case letter "a", we could include both "a" and "A" in our pattern list - however, this is not so good when we are searching for multi-byte patterns.  Instead, we can use square brackets in a regex, so our pattern would be [aA].  Taking this further, to search for "dead forensicator" in any combination of upper and lower case you would write your regex as:
[dD][eE][aA][dD]\ [fF][oO][rR][eE][nN][sS][iI][cC][aA][tT][oO][rR]
Remember to escape the white space between the two words!
So you can write your list of regexes and put them in a file that you will pass to BE when you run it.
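Writing those bracketed regexes by hand soon gets tedious, so you may prefer to generate them.  Here is one possible sketch (keywords.txt and regex.txt are hypothetical file names): each letter of a plain keyword is folded into a [xX] bracket and each space is escaped:

```shell
# Hypothetical helper: turn a list of plain keywords (one per line in
# keywords.txt) into case-insensitive regexes for BE's pattern list.
printf 'dead forensicator\n' > keywords.txt

awk '{
    out = ""
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c ~ /[A-Za-z]/) out = out "[" tolower(c) toupper(c) "]"  # fold case
        else if (c == " ")  out = out "\\ "                          # escape white space
        else                out = out c                              # digits etc pass through
    }
    print out
}' keywords.txt > regex.txt

cat regex.txt
# [dD][eE][aA][dD]\ [fF][oO][rR][eE][nN][sS][iI][cC][aA][tT][oO][rR]
```

That output matches the hand-written regex above, and it scales to however many keywords you have.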

The options for BE are really daunting when you first look at them, but take your time experimenting, as the benefits of using this tool are stunning.  BE has a number of scanners, most of which are turned ON by default.  When pattern matching, we will need to turn a number of them off.  We will also need to give BE a directory to send the results to, and that directory cannot already exist (BE will create it for us).  There is a BE GUI that we can use on Windows, but it will only process disk images; the Linux CLI version will also process a physical disk (I use BE a lot when previewing a suspect machine from my custom previewing CD).  To show you the command you will need, let's assume that I have a dd image of a suspect disk in my home directory called bad_guy.dd, my list of regexes is in a file in my home directory called regex.txt, and I want to send the results to a folder called bulk in my home directory.  My command would be:
bulk_extractor -F regex.txt -x accts -x kml -x gps -x aes -x json -x elf -x vcard -x net -x winprefetch -x winpe -x windirs -o bulk bad_guy.dd

The -F switch is used to specify my list of regexes; I then disable a number of scanners using the -x option for each one; the -o option specifies the directory I want the results to go to; finally, I pass the name of the image file (or disk) that I want searched.  Thereafter, BE is surprisingly fast and delightfully thorough!  At the end of the processing there will be a file called "find.txt" in my bulk directory that lists all the patterns that have been matched, along with the BYTE OFFSET of each match.  This is particularly useful when I suspect that evidence is going to be in web pages in unallocated space and I know that the suspect uses a browser that caches web pages with gzip compression - BE will still get me the evidence without the pain of extracting all the gzip files from unallocated space and processing them manually.
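Since find.txt is plain text - as far as I know, tab-separated with the offset, the matched feature and some surrounding context, plus comment lines starting with "#" - you can triage it with standard tools.  A sketch, faking up a tiny find.txt purely for illustration:

```shell
# Fake a miniature find.txt for illustration (a real one comes from BE):
mkdir -p bulk
printf '# BANNER\n100\tdead forensicator\tcontext one\n200\tdead forensicator\tcontext two\n300\tbadguy\tcontext three\n' > bulk/find.txt

# Tally how often each pattern was hit, most frequent first:
grep -v '^#' bulk/find.txt | cut -f2 | sort | uniq -c | sort -rn
# -> 2 hits for "dead forensicator", 1 for "badguy"
```

A ten-second tally like this tells you which patterns are worth chasing before you start resolving offsets to files.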

Anyhows, we now have a list of all the matched patterns on the suspect disk.  We probably would like to know which files the matched patterns appear in (at the moment we only know the physical byte offset of each matched pattern).   This is no problem, there is more Linux goodness we can do to determine if any of the matched patterns appear in live/deleted files.   All we need to do is run the fiwalk program from Simson Garfinkel, which "maps" all the live and deleted files, then run Simson's identify_filenames.py python script.   So step 1 is to run fiwalk across the disk image; we will use the -X option to generate an .xml file of the output.  Our command would be:
fiwalk -X fiwalk_badguy.xml bad_guy.dd

We can then tell identify_filenames.py to use the xml output from fiwalk to process the find.txt file from BE.  Note that you need to have Python 3.2 (at least) installed!   Our command would be:
python3 identify_filenames.py --featurefile find.txt --xmlfile fiwalk_badguy.xml bulk FILEPATHS

So, we use python3 to launch the script; we tell it to use the find.txt feature file and the xml file we have just generated; we then pass it the name of the directory that contains our find.txt file; finally, we specify a new directory, FILEPATHS, that the results will be put into.  Upon completion you will find a file called "annotated_find.txt" that lists all your pattern matches, with file paths and file names where a match appears in a live/deleted file.

The bulk_extractor pattern matching is simples; admittedly, resolving file names for the matches is a teensy-weensy bit gnarly, but it is worth the effort.  It is a lot simpler running BE from the Windows GUI against an E01 file.  But you can do like I have done: write a script and add it to your previewing Linux disk to automate the whole thing.
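The steps above can be stitched together into something like this hypothetical wrapper function - the scanner list mirrors the commands earlier, while the location of identify_filenames.py and all the paths are assumptions you would adjust for your own previewing disk:

```shell
# Hypothetical glue for a previewing disk: BE pattern matching, then
# fiwalk to map the file system, then identify_filenames.py to resolve
# each match to a file path.  All paths/names are assumptions.
run_pattern_search() {
    image="$1"      # disk or image, e.g. bad_guy.dd
    regexes="$2"    # regex list, e.g. regex.txt
    out="$3"        # BE output directory - must not already exist

    bulk_extractor -F "$regexes" \
        -x accts -x kml -x gps -x aes -x json -x elf -x vcard \
        -x net -x winprefetch -x winpe -x windirs \
        -o "$out" "$image" || return 1

    fiwalk -X "$out/fiwalk.xml" "$image" || return 1

    python3 identify_filenames.py --featurefile find.txt \
        --xmlfile "$out/fiwalk.xml" "$out" "$out/FILEPATHS"
}
# usage: run_pattern_search bad_guy.dd regex.txt bulk
```

Each step bails out if the previous one failed, since identify_filenames.py is useless without both the BE output and the fiwalk XML.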

One word of advice - you will need the forked version of sleuthkit to get fiwalk running nicely; you can get that version from GitHub at:
https://github.com/kfairbanks/sleuthkit

Running a test on CAINE 3 shows that fiwalk is not working; hopefully it will be fixed soon.   However, you can still run bulk_extractor to do your pattern matching from the boot CD and save yourself a lot of time!

Finally, happy new year to all you lucky breathers.   Thanks for all the page views, feel free to comment on any of my ramblings or contact me if you have any problems with implementing any scripts or suggestions.   I am off for my last feed of the year...yum,yum!