Thursday, 10 January 2013

Ressurecting the dead

Unlike zombies, deleted files will not miraculously return to life on their own, we need to either undelete them or carve them if there is no file system meta data to help us.   So I want to blog about file carving, this will be over two posts, the first post will deal with theory, the second post will look at a couple of tools that I use.

In theory, file carving is straight forward enough - just look for a file header and extract the file.   The practice is a lot more complicated.
Let's consider carving files out of unallocated space.   The first consideration is "what is unallocated space?".   Imagine a deleted file in an NTFS file system, where the file data not been overwritten, Both the file data and the meta data are intact, the MFT entry has the flag set to indicate that the file is deleted, thus the clusters are available for allocation by another file.
Do you consider this file to be in unallocated space?   Some people say yes, as the relevant clusters are not allocated to a live file, some say no as the relevant clusters ARE allocated, albeit to a deleted file.  In many ways the question is academic, it doesn't matter what you consider to be unallocated space, it matters what your file carving tools considers to be unallocated space.  If you don't know what your tool considers to be unallocated space then how do you know if you have recovered all of the potentially recoverable files?

Another consideration is what strategy are you going to use.   File carving tools have different approaches to the problem of carving.   Once approach is to search the entire data-stream, byte-by-byte looking for file signatures.   This is the most thorough approach, however it is the most time consuming approach and will potentially lead to a number of false positives.   Some file signatures are maybe only 2 bytes in length, by pure chance we can expect those 2 bytes to appear on a hard disk a number of times.  Those 2 bytes may or may not represent the file header that you are interested in.   Figuring out if they are relevant headers or false positives can be quite challenging.

One way to reduce the number of false positives is to search for file signatures at the block (or cluster) level.   As the file signature is normally at the start of a file, we only need to look at the cluster boundary - as that where the start of files will be.   Any file signatures found here are unlikely to be false positives, what's more our carving routines will be a lot quicker.   The downside to this is that valid files may get missed, especially if there is a new file system overlaying an old file system.   The cluster boundary for the OLD file system may not fall at the cluster boundary for the NEW file system.   Imagine a 500 GB hard drive with a single partition filling the disk, when formatted the block size may be 16 sectors.  If a user then shrinks that partition to 400GB and creates a new partition in the remaining 100GB, the block size might be set at 8 sectors.   You would need your carving tool to search for headers at the 16 sector boundary for the first partition, and 8 sector boundary at the second partition.   Maybe a better solution would be to search  for signatures at the sector boundary?   This would ensure that all block (cluster) boundaries were searched but increase both the time taken and the risk of finding false positives.   Searching at the sector boundary means that there is also a possibility of files embedded in other files not being found if they are not saved to disk at the sector boundaries (not sure if this is possible, I have never tested it).

Once you have decided your strategy, the problems don't end there.   From a programmers point of view, how do you define when your tool stops carving and resumes searching for file headers?   This is probably the biggest problem facing programmers of carving tools.   Some files have footers, so you could program your carving tool to just keep going until it gets to the file footers.   But what happens if the footer is overwritten or is part of a fragmented file...your tool will just keep carving data out until it reaches the end of the disk or eventually finds a footer many, many clusters further into the disk.   There are different potential solutions to this problem, one is to set a maximum file size so that your tool stops carving at a certain point, even if no footers are found.   Another solution is to stop carving once your tool finds another header.   The problem here is deciding what file type header should be your stop point.  If you are carving for jpgs, do you start carving until you find another jpg header or any type of header?   If your carving engine does byte-by-byte carving, then if you are using "any known file signature" as your stop point you risk ending the carving prematurely if your tool finds a "false positive" header.  You can combine the approaches as Jesse Kornblum did when coding the "foremost" file carver - that is to say, once you start carving carve until max file size or footer found.   In fact there are now quite a few different approaches to the problems posed by file carving, a good overview can be found in this PRESENTATION.

Ultimately, once you understand how your file carving tool works, there is no "right way" or "wrong way" to do file carving.  The file signature searching engine in Encase is very through, however it uses a "byte-by-byte" strategy meaning that there are many false positives and it doesn't really do file carving as it doesn't export the found files. My own preferences depend on what I am looking for, generally for unallocated space I will carve at the sector or cluster boundary, for swap and hiberfil files I do byte_by_byte carving.   I will do a step by step post in the next few days on a couple of the file carving tools that I use routiney.  One of them, photorec, is another one of the tools that I use on just about every case I can think of.

Wednesday, 2 January 2013

Traces of the dead

I have blogged about bulk_extractor on several occasions.   As it is such an essential and useful tool for forensicators, I thought I would do a post dedicated to the tool.

Bulk_extractor is tool that scans a disk, disk image or directory of files for potentially evidential data.   It has a number of scanners that search for various artifacts such as urls, email addresses, credit card numbers, telephone numbers and json strings.   The recovery of json strings is particularly useful as a number of chat clients will generate chat messages in json format.  

The url recovery is another extremely useful feature.  We probably have our favourite web history parsers and web history recovery tools.   However recovering all the available web browser history on a disk is incredibly different.  Once again you really need to know what your tool of choice is doing.   So, if you have a tool that claims to recover Internet Explorer history do you understand how it works?  Maybe it is looking for the index.dat file signature?  This is useful...up to a point.   It will work fine for index.dat files stored in contiguous clusters, but what happens if the index.dat file is fragmented?  Under these cirumstances you tool may only recover the first chunk of index.dat data, as this chunk contains the index.dat header.   Your tool may therefore miss thousands of entries that reside in the other fragmented chunks.   Most respectable tools for recovering IE index.dat look for consistent and unique features associated with each record in an index.dat file, thus ensuring that as many entries as possible are recovered.  Other web browser history may be stored in a mysql database, finding the main database header is simple enough, even analysing the sqlite file to establish if it is a web history file is simply.  However, it get much more difficult if there is only a chunk of the database available in unallocated space.   Some tools are able to recover web history from such fragments in some circumstances - does your tool of choice do this?  Have you tested your assumptions.  
Some web history is stored in such a way that there are no consistent and unique record features in the web history file.  Opera web history has a simple structure that doesn't stretch much beyond storing the web page title, the url and a single unix timestamp.   There are no "landmarks" in the file that a programmer can use to recover the individual records if the web history gets fragmented on the disk.   Yahoo browser web history files pose much the same problem.  
bulk_extractor overcomes these problems by simply searching for urls on the disk.  It ingests 16mb chunks of data from your input stream (disk or disk image), extracts all the ascii strings and analyses them to see if there are any strings that have the characteristics of a url i.e they start with "http[s]" or "ftp" and have the structure of a domain name.   In this way you can be confident that you have recovered as much web history as possible.   However, there is a big downside here - you will also recover LOTS of urls that aren't part of a web history file.  You will recover urls that are in text files, pdf files but most likely urls that are hyper-links from raw web pages.   Fortunately, the output from bulk_extractor can help you here.   Bulk_extractor will create a simple text file of the urls that it finds.  It will first list the byte offset of the url, the url itself and a context entry, this shows the url with a number of bytes either side of it - it does what it says on the tin, gives you the context that the url was found in.  I have split a line of output from bulk extractor for ease of viewing.  The first line shows the byte offset of the disk where the url was found, the second line shows the url, the 3rd line shows the url in context as it appears on the disk.
opyright" href="" />\x0A    <title>

As can be seen, the url, when viewed in context,  is preceded by "href=", this indicates that the url is actually a hyperlink from a raw web page.
Bulk_extractor doesn't stop there though.  It will also analyse the recovered url and generate some more useful data.  One of the files that it generates is url_searches.txt - this contains any urls associated with searches.  The file will also show the search terms used and the number of times that the search urls appear on the disk, so a couple of lines of output might look like this:
n=5 firefox
n=4 google+video

You may need to either parse individual records with your favourite web history parser or you may have to do it manually if your favourite web history recovery tool fails to recover the url that bulk_extractor found - this has happened to me on several occasions!

bulk_extractor also creates histograms, these are files that show the urls along with the number of times the urls appear on the  disk, sorted in order of popularity.  Some sample output will look like this:

n=1715 (utf16=1715)
n=300 (utf16=2)
n=292 (utf16=292)
n=228 (utf16=49)

Notice how urls that are multi-byte encoded are also recovered by default.   Obviously there are going to be a LOT of urls that will appear on most hard drives, urls associated with microsoft, mozilla, verisign etc.
You can download some white lists that will suppress those type of urls (and emails and other features recovered by bulk_extractor).   Other types of url analysis that bulk_extractor does is to discover facebook id numbers, and skydrive id numbers.   Of course, it is trivially simple to write your own scripts to analyse the urls to discover other interesting urls.  I have written some to identify online storage, secure transactions and online banking urls.

bulk_extractor will also recover email addresses and creates historgrams for them.   The important thing to remember here is that there isn't (currently) any pst archive decompression built into bulk_extractor.   Therefore if your suspect is using outlook email client you will have to process any pst archive structures manually.   Other than that, the email address recovery works in exactly the same way as url recovery.

Using bulk_extractor can be a bit daunting, depending what you are looking for and searching through.  But to recover urls and emails on a disk image called bad_guy.dd the command is:
bulk_extractor -E email -o bulk bad_guy.dd

By default ALL scanners are turned on (the more scanners enabled, the longer it will take to run) using the -E switch disables all the scanners EXCEPT the named scanner, in our case the "email" scanner (this scanner recovers email addresses and urls).  The -o option is followed by the directory name you want to send the results to (which must not currently exist - bulk_extractor will create the directory).

I urge you all to download and experiment with it, there is a Windows gui that will process E01 image files and directory structures.   There are few cases that I can think of when I don't run bulk_extractor, there are a number of occasions where I have recovered crucial urls or json chat fragments missed by other tools.