So, in most forensic exams we probably search the hard drive for keywords, using a process I sometimes jokingly refer to as "keyword searching".
Referring to the process as keyword searching is something of a misnomer: often it isn't words that we are looking for, but numbers or a particular pattern of characters. Therefore, I urge you to put the keyword searching silliness behind you and use the more accurate term of "pattern matching".
It has to be said that for the longest time, performing pattern matching over a disk or disk image in Linux was soul destroying (for those of you still lucky enough to have a soul). It was possible to run grep across the disk, but memory exhaustion problems abounded, patterns in compressed data could not be matched with this approach, and there were also problems matching patterns that were multi-byte encoded. Doing this in a preview on the suspect machine could take a week or more. The way I approached the problem was to create a database of files in the live set, use that data to exclude all the binary files, then grep each of the remaining files for my keywords. However, file slack was not searched with this approach. Processing unallocated space was a case of outputting each cluster, running strings across it and then searching the raw strings for my patterns - very time consuming!
So, my rotting, black heart leapt for joy when the mighty Simson Garfinkel released the every-forensicator-should-have-it bulk_extractor program. One of the many features of the program is the ability to search for a pattern or list of patterns on a disk (or disk image). Bulk_extractor knows nothing about partitions or file systems; it treats the input data as a raw stream. One of the great things about it is the ability to decompress/decode data on the fly and search that for your patterns. In addition, BE will perform recursion to a depth of 5 by default (though that can be tweaked by passing some options to it). This recursion means that if BE finds some compressed or encoded data, it will search the decompressed data for more compressed or encoded data and decompress/decode that on the fly, drilling down 5 layers by default. Thus if there is a compressed docx file that has been encoded as a base64 email attachment that has been placed in a zip archive, BE will be able to drill down through all those layers to get to the "plain text" inside the docx file. In addition, it will find your patterns even if they are multi-byte encoded, with UTF-16 for instance. As BE knows nothing about file systems, the data in unallocated space, swap files etc gets processed too. Does your forensic tool of choice process compressed/encoded data in unallocated space when you are pattern matching? Have you tested your assumptions?
The only caveat to using BE is that when handling compressed/encoded data, BE is searching for file signatures - it is therefore unlikely that fragmented files will be fully processed; only the first chunk, the one containing the file header, stands a chance of being processed.
The main thing to remember is that your search patterns are CASE SENSITIVE! You have to take this into account when preparing your list of patterns. If your list contained "dead forensicator", then the pattern "Dead Forensicator" would NOT be matched (actually, not even "dead forensicator" would be matched - we will come on to that). You could of course add the pattern "Dead Forensicator" to your list, but then the pattern "DEAD FORENSICATOR" would not be matched. Luckily BE is able to handle regular expressions; in fact, the patterns you are looking for really need to be written as regular expressions. This means that you will need to "escape" any white space in your patterns by putting a "\" character in front of it - this tells BE to treat the white space as literal white space (it has a special meaning otherwise). So, to match the pattern "dead forensicator", you will have to write it as "dead\ forensicator". If you are not familiar with regular expressions (regexes) then you should make every effort to get acquainted with them - using regexes will make you a much more effective forensicator. You probably can't be sure whether the pattern you are searching for will be in upper case, lower case or a combination of both; regexes allow you to deal with any combination. If we were searching for either the upper or lower case letter "a" we could include both "a" and "A" in our pattern list - however, this does not scale to patterns longer than a single character. To search for both the upper and lower case letter "a", we can use square brackets in a regex, so our pattern would be [aA]. Taking this further, to search for "dead forensicator" in any combination of upper and lower case you would write your regex as:
[dD][eE][aA][dD]\ [fF][oO][rR][eE][nN][sS][iI][cC][aA][tT][oO][rR]
Remember to escape the white space between the two patterns!
So you can write your list of regexes and put them in a file that you will pass to BE when you run it.
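For illustration, a pattern file is simply one expression per line. A minimal (entirely made-up) example might look like this:
dead\ forensicator
[dD][eE][aA][dD]\ [fF][oO][rR][eE][nN][sS][iI][cC][aA][tT][oO][rR]
buried\ treasure
zombie[0-9][0-9]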
The options for BE are really daunting when you first look at them, but take your time experimenting, as the benefits of using this tool are stunning. BE has a number of scanners, most of which are turned ON by default. When pattern matching, we will need to turn a number of them off. We will also need to give BE a directory to send the results to, and that directory cannot already exist (BE will create it for us). There is a BE GUI that we can use on Windows, but it will only process disk images; the Linux CLI version will also process a physical disk (I use BE a lot when previewing a suspect machine from my custom previewing CD). To show you the command you will need, let's assume that I have a dd image of a suspect disk in my home directory called bad_guy.dd. I want to search the image for my list of regexes, held in a file in my home directory called regex.txt, and I want to send the results to a folder called bulk in my home directory. My command would be:
bulk_extractor -F regex.txt -x accts -x kml -x gps -x aes -x json -x elf -x vcard -x net -x winprefetch -x winpe -x windirs -o bulk bad_guy.dd
The -F switch is used to specify my list of regexes, I then disable a number of scanners using the -x option for each one, the -o option specifies the directory I want the results to go to, and finally I pass the name of the image file (or disk) that I want searched. Thereafter, BE is surprisingly fast and delightfully thorough! At the end of the processing there will be a file called "find.txt" in my bulk directory that lists all the patterns that have been matched, along with the BYTE OFFSET of each match. This is particularly useful when I suspect that evidence is going to be in web pages in unallocated space and I know that the suspect uses a browser that caches web pages with gzip compression - BE will still get me the evidence without the pain of extracting all the gzip files from unallocated space and processing them manually.
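If you want a quick summary of which patterns got the most hits, something along these lines should do it, assuming the usual tab-separated feature file layout of byte offset, matched pattern and context (comment lines start with a "#"):
grep -v '^#' bulk/find.txt | awk -F'\t' '{print $2}' | sort | uniq -c | sort -rn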
Anyhows, we now have a list of all the matched patterns on the suspect disk. We would probably like to know which files the matched patterns appear in (at the moment we only know the physical byte offset of each match). This is no problem - there is more Linux goodness we can use to determine whether any of the matched patterns appear in live/deleted files. All we need to do is run the fiwalk program from Simson Garfinkel, which "maps" all the live and deleted files, then run Simson's identify_filenames.py python script. So step 1 is to run fiwalk across the disk image; we will use the -X option to generate an .xml file of the output. Our command would be:
fiwalk -X fiwalk_badguy.xml bad_guy.dd
We can then tell identify_filenames.py to use the xml output from fiwalk to process the find.txt file from BE. Note that you need to have python 3.2 (at least) installed! Our command would be:
python3 identify_filenames.py --featurefile find.txt --xmlfile fiwalk_badguy.xml bulk FILEPATHS
So, we use python3 to launch the python script, tell it to use the find.txt feature file and the xml file we have just generated, then pass it the name of the directory that contains our find.txt file, and finally specify a new directory, FILEPATHS, that the results will be put into. Upon completion you will find a file called "annotated_find.txt" that lists all your pattern matches, with file paths and file names where the match appears in a live/deleted file.
The bulk_extractor pattern matching is simples; admittedly resolving file names for the matches is a tinsy-winsy bit gnarly, but it is worth the effort. It is a lot simpler running BE from the Windows GUI against an E01 file. But you can do as I have done: write a script and add it to your previewing Linux disk to automate the whole thing.
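As a rough sketch (not my exact previewing script), a wrapper chaining the three steps together might look something like this - it assumes identify_filenames.py lives in the current directory and takes the image and the regex list as its two arguments:
#!/bin/bash
# usage: ./patternmatch.sh bad_guy.dd regex.txt
IMAGE="$1"
REGEXES="$2"
# pattern match across the raw image with the unneeded scanners disabled
bulk_extractor -F "$REGEXES" -x accts -x kml -x gps -x aes -x json -x elf -x vcard -x net -x winprefetch -x winpe -x windirs -o bulk "$IMAGE"
# map the live and deleted files to an xml file
fiwalk -X fiwalk_output.xml "$IMAGE"
# resolve each match in find.txt to a file path where possible
python3 identify_filenames.py --featurefile find.txt --xmlfile fiwalk_output.xml bulk FILEPATHS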
One word of advice, you will need the forked version of sleuthkit to get fiwalk running nicely, you can get that version from github at:
https://github.com/kfairbanks/sleuthkit
Running a test on CAINE 3 shows that fiwalk is not working; hopefully it will be fixed soon. However, you can still run bulk_extractor to do your pattern matching from the boot CD and save yourself a lot of time!
Finally, happy new year to all you lucky breathers. Thanks for all the page views, feel free to comment on any of my ramblings or contact me if you have any problems with implementing any scripts or suggestions. I am off for my last feed of the year...yum,yum!
Wednesday, 19 September 2012
Hiding The Dead Revisited
In a previous POST, I looked at automating the detection of encrypted data. Assuming that you find such data...then what? You will need to know the software that opens the cyphertext, the password and possibly a key. There is some work we can do to try to establish these parameters. Bearing in mind we have ALREADY done a file signature check on our system, we can use that data to help us.
First, let's look at some other routines to try to detect encrypted data. Some programs create cyphertext with a recognisable signature; you simply need to add those signatures to your custom magic file, which you will store under the /etc directory.
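A custom magic entry is just an offset, a value type, the expected value and a description on one line. For example (the signature values and program names here are entirely made-up placeholders):
0	string	MYCRYPT0	MadeUpCrypt encrypted container
0	belong	0xcafeface	AnotherCrypt cyphertext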
Some crypto programs generate the cyphertext file with a consistent file extension.
We can search our file signature database for those with something like this, assuming our database is called listoffiles and saved in the /tmp directory:
awk -F: '{print $1}' /tmp/listoffiles | egrep -i '\.jbc$|\.dcv$|\.pgd$|\.drvspace$|\.asc$|\.bfe$|\.enx$|\.enp$|\.emc$|\.cryptx$|\.kgb$|\.vmdf$|\.xea$|\.fca$|\.fsh$|\.encrypted$|\.axx$|\.xcb$|\.xia$|\.sa5$|\.jp!$|\.cyp$|\.gpg$|\.sdsk$' > /tmp/encryptedfiles.txt
In this example we are asking awk to look at the file path + file name field in our database and return only the files with certain file extensions associated with encryption.
We can also detect EFS encrypted files, no matter what size they are. We could use the Sleuthkit command "istat" to parse each MFT entry in the file system and search it for the "encrypted" flag. However, this is going to be very time consuming; a quicker way is to simply look at our file signature database. If you try to run the file command on an EFS encrypted file you will get an error message, and that error message will be recorded in your signature database. This is a result of the unusual permissions assigned to EFS encrypted files - you can't run Linux commands against the file without receiving an error message. I have not seen the error message on any non-EFS encrypted files, so the presence of this error message is a very strong indicator that the file is EFS encrypted. We can look for the error message like this:
awk -F: '$2 ~ /ERROR/ {print $1}' /tmp/listoffiles > /tmp/efsfiles.txt
So, we have now run a number of routines to try to identify encrypted data, including entropy testing on unknown file types, signature checking, file extension analysis and testing for the error message associated with EFS encryption.
For EFS encrypted files and encrypted files with known extensions, we can figure out what package was used to create the cyphertext (we can look up the file extensions at www.fileext.com). But what about our files that have maximum entropy values?
First, we might want to search our file database for executables associated with encryption; we could do something like this:
awk -F: '$1 ~ /\.[Ee][Xx][Ee]$/ {print $0}' $reppath/tmp/listoffiles | egrep -i 'crypt|steg|pgp|gpg|hide|kremlin' | awk -F: '{print $1}' > /tmp/enc-progs.txt
We have used awk to search the file path/name portion of our file database for executable files, sent the resulting lines to egrep to search for strings associated with encryption, then sent those results back to awk to print just the file path/name portion and redirected the output to a file. Hopefully we now have a list of executable files associated with encryption.
We can now have a look for any potential encryption keys. We have all the information we need already, we just need to do a bit more analysis. Encryption keys (generally!) have two characteristics that we can look for:
1) They don't have a known file signature, therefore they will be described as simply "data" in our file data base.
2) They have a fixed size, which is a power of two, and will most likely be 256, 512, 1024 or 2048 bits...I emphasise BITS.
So our algorithm will be to analyse only the unknown files, establish their size, and return only those that are exactly 256, 512, 1024 or 2048 bits. We can use the "stat" command to establish file size; the output looks like this:
fotd-VPCCA cases # stat photorec.log
File: `photorec.log'
Size: 1705 Blocks: 8 IO Block: 4096 regular file
Device: 806h/2054d Inode: 3933803 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2012-09-19 11:52:48.933412006 +0100
Modify: 2012-09-19 11:52:12.581410471 +0100
Change: 2012-09-19 11:52:12.581410471 +0100
The important thing to remember is that the file size in the output is in BYTES, so we actually need to look for files that are exactly 32, 64, 128 or 256 BYTES in size, as they map to being 256, 512,1024 or 2048 BITS.
So our code would look something like this:
KEYREC () {
# isolate the "Size" field (in bytes) from the stat output
FILESIZE=`stat "$1" | awk '/Size/ {print $2}'`
# 64, 128, 256 or 32 bytes equals 512, 1024, 2048 or 256 bits
if [ "$FILESIZE" = "64" -o "$FILESIZE" = "128" -o "$FILESIZE" = "256" -o "$FILESIZE" = "32" ]
then
echo "$1" >> /tmp/enckeys.txt
fi
}
awk -F: '$2 ~ /^\ data/ {print $1}' /tmp/listoffiles > /tmp/datafiles.txt
cat /tmp/datafiles.txt | while read i ; do KEYREC "$i" ; done
The last two lines of code search the description part of our list of files and signatures for unknown files (using the string " data" as the indicator) and send just the file path/name of each result to a file. That file is then read and every line fed into a function that runs the stat command on the file, isolates the "Size" field of the output and tests whether the size matches our criteria for being consistent with an encryption key. You will get some false positives (only a handful); by looking at those files in a hex viewer you will be able to eliminate the ones that aren't encryption keys - if they have ascii text in them, then they aren't encryption keys!
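A quick way to eyeball the candidates without opening each one by hand is something like this (assuming xxd is on your system):
while read f ; do echo "==== $f" ; xxd "$f" | head -4 ; done < /tmp/enckeys.txt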
Now we have the cyphertext, the program used to decrypt the data, plus the key; all we need now is the password. We are going to again use the tool every forensicator MUST have, bulk_extractor. One of the many features of bulk_extractor is the ability to extract all of the ascii strings from a hard drive and de-duplicate them, leaving us with a list of unique ascii strings. It may well be that the user's crypto password has been cached to the disk - often in the swap file. We will probably want to do the string extraction at the physical disk level, as opposed to the logical disk. We will need several gigabytes of space on an external drive, as the list of strings is going to be very large; the command to extract all the strings on the first physical disk and send the results to an external drive mounted at /mnt/usbdisk would be:
bulk_extractor -E wordlist -o /mnt/usbdisk /dev/sda
We need to do a bit more work, as we can't realistically try every ascii string that bulk_extractor generates. The password is likely to be long, with a mixture of upper/lower case characters and numbers. You can use a regex to search for strings with those characteristics to narrow down the number of potential passwords (Google is your friend here!).
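For example, assuming your build of grep has Perl-regex support (-P) and that the deduplicated strings end up one per line in a file such as wordlist.txt under the output directory, something like this would keep only strings of 10 or more non-space characters containing at least one lower case letter, one upper case letter and one digit:
grep -P '^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])\S{10,}$' /mnt/usbdisk/wordlist.txt > /tmp/candidate_passwords.txt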
So, detecting cyphertext and encrypted compressed archives, identifying potential crypto keys and potential passwords is doable with a surprisingly small amount of code. For those on a budget, this solution costs about 10 pence (the price of a recordable CD) if you want to use the suspect's processing power to do your analysis.
Copying the dead
In a previous POST we looked at doing a file signature check on all the files in the live set, then using awk to search the file descriptor field in the resultant database.
Once we do our awk search we send the results to a text file, so now we need to know how to use that text file to further process our results.
Let's imagine we have searched for the string "image data" in our file description field to identify all our graphic files, and created a text file of the file names and file paths, called live_pics.txt, in our /tmp folder with this command:
awk -F: '$2 ~ /image\ data/ {print $1}' listoffiles > /tmp/live_pics.txt
One thing you definitely DON'T want to do is use the text file in a loop and the cp (copy command) to copy data out, like this:
cat /tmp/live_pics.txt | while read IMG ; do cp $IMG /mnt/case/images ; done
Our command reads the live_pics.txt file line by line and copies each file out to a single directory on an external drive mounted at /mnt/case. The reason we don't do this is that if we have two files with the same name in our live_pics list (but in different directories), the cp command will copy out the first file and then overwrite it with the second - because a file with that name already exists in our receiving directory. Also, if a file name in our list happens to start with a "-" character, cp will interpret the remainder of the string as an option, resulting in an error message. In addition, if there is any white space in the file path or file name, the shell will split it into separate words and the copy will fail. Here is my solution to the problem: I use a function that checks whether a file with that name already exists in the output directory and, if so, appends [1], [2] etc. to the name. I had to overcome my fear and loathing of perl to introduce a perl regular expression for renaming the file. I set the Internal Field Separator variable ($IFS) to a newline, so the shell treats only a newline character as the end of a field (preserving any white space in the file paths). I also include a "--" after the -p option to tell cp that we have finished with our options. Here is the function and a few lines of code to show how you would use it:
filecp () {
filepath="$1"
filename=`basename "$1"`
# if a file with this name already exists in the output directory,
# append/increment [1], [2] etc. before the extension until the name is unique
while [ -e "$dir/$filename" ]; do
filename=`echo "$filename" | perl -pe 's/(\[(\d+)\])?(\..*)?$/"[".(1+$2)."]$3"/e;'`
done
cp -p -- "$filepath" "$dir/$filename"
}
# use a real newline as the field separator so white space in paths survives the loop
IFS='
'
dir=/mnt/case/images
cat /tmp/live_pics.txt | while read IMG ; do filecp "$IMG" ; done
unset IFS
Obviously the same principle applies to any list of files that you want to copy out from the file system, so the code can be integrated into any of your scripts in your previewing system. If you haven't been using some of the defensive programming techniques in this code when using the cp command, you really need this code!
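For example, the same function could be pointed at documents instead of pictures - the file command describes PDFs with the string "PDF document", so something like this would do it:
awk -F: '$2 ~ /PDF\ document/ {print $1}' listoffiles > /tmp/live_pdfs.txt
IFS='
'
dir=/mnt/case/documents
cat /tmp/live_pdfs.txt | while read DOC ; do filecp "$DOC" ; done
unset IFS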
Understanding the dead
Crimminey! All this stuff about code pages and unicode and UTF is about as comprehensible as the unearthly groaning and hissing that I use to verbally communicate. Let your humble dead forensicator try and put it all into some semblance of order - we need to know this for when I talk about "keyword searching" in a future post. We will dispel some myths along the way, things that have been wrongly suggested to me as being true, over the years.
Let's skip computing's very early history and jump to the early IBM machines.
They used punch cards to store data; the most widely adopted punch card had 80 columns and 12 rows. The card was punched at intersections of the rows and columns, and the locations of the punches represented numbers and letters. However, there was no need to distinguish between upper and lower case letters as there were no word processing packages (Microsoft Office Victorian never really took off, I guess). Only a small set of other characters needed to be represented. As a result, the character range that was represented was only 47 characters: 0-9, A-Z and some other characters. So at this point in time, the concept of binary storage was absent from computers. All that changed in the mid-1950s with the advent of IBM's computer hard disk. The first model, the IBM 350, had a 30 million BIT capacity.
The cost of storing a single BIT was eye-wateringly expensive. There was a necessity to not only move to the concept of binary encoding, but also to ensure the smallest data unit (later named a byte) was as efficient as possible. The old punch cards could represent 47 characters, all these characters could be represented with just 6 bits in a data unit, so IBM grouped together their bits into groups of 6 to form their data units. Underlying these groupings was the concept of a truth table.
Each group of 6 bits was assigned a numeric value and a truth table was consulted to see what character the numeric value represented - this concept is still used today, it is often referred to as "character encoding".
IBM soon spotted the limitation of 6 bit groupings, especially when they came to thinking about word processing: simply adding lower case letters to the truth table would require 73 characters, before you even consider the additional punctuation marks. This led to 7 bit groupings, which were introduced on the IBM 355 hard disk. This meant that 128 characters could be represented in the truth table for a 7 bit scheme - all upper/lower case characters, punctuation marks and the main maths characters could easily be represented, with still some space in our truth table for more at a later date.
Looking back, we can laugh and ask why they didn't simply use an 8 bit byte, as 8 is a power of 2 and thus much more binary-friendly. Well, the reality was that the IBM 355 stored 6 million 7 bit groupings, and the cost of storing a single bit was in the tens of dollars region. Moving to an 8 bit grouping would mean that the 8th bit would seldom get used (there were already unclaimed slots in our 7 bit truth table), resulting in thousands of dollars of redundant storage. There was a conflict here between the coders and the engineers: using 7 bit bytes made for less efficient code, using 8 bit bytes made for inefficient storage. Someone had to win, and it was the coders who emerged glorious and victorious once the fog of battle had cleared. Thus the term byte was coined and the 8 bit grouping settled on.
Interestingly, IBM actually experimented with 9 bit bytes at some point to allow error checking. However, 8 bit groupings won on the basis of efficiency. So, historically there have been many battles regarding the optimum number of bits to group into data units. This explains why IP addresses and the payload of internet data packets are described as "octets" - the early developers were explicitly stating that the bits should be put into 8 bit groupings, as opposed to any of the other approaches to grouping bits then in existence.
8 bit bytes were settled on, but it was important that the truth table for these bytes was standardised across all systems. Much like the VHS vs Betamax war (those of you, like myself, born at a more comfortable distance from the apocalypse will recognise that I am talking about video tape standards here), there were two main competing systems vying to be the agreed standard for establishing the truth tables. One was EBCDIC, a full 8 bit character encoding truth table that was an extension of an older code called BCDIC, which was a means of encoding those 47 punch card characters with 6 bits. The main competing standard was ASCII (American Standard Code for Information Interchange).
MYTH No 1. ASCII is an 8 bit encoding scheme.
Not true, it wasn't then and isn't now. It is a 7 bit encoding scheme.
The fact that EBCDIC used the same number of bits as the newly defined 8 bit byte may explain why for most of the 1970s it was winning the war to be the standard. Ultimately, ASCII offered more advantages over EBCDIC, in that the alphabet characters were sequential in the truth table and the unused 8th bit could be utilised as a parity bit for error checking; thus ASCII became the accepted standard for character interchange coding.
So now we had our 8 bit data transmission/storage unit, known as a byte and an agreed standard for our truth table - ASCII. As ASCII used 7 bits, 128 characters could be defined. The first 32 characters in the ASCII truth table are non printable e.g Carriage Return, Line Feed, Null, etc, the remaining character space defines the upper/lower case letters, the numbers 0-9, punctuation marks and basic mathematical operators.
All would have been well if only the English speaking world wanted to use computers...obviously the ASCII table allowed for English letter characters. Unsurprisingly, citizens of other nations wanted to use computers and use them in their native language. This posed a significant problem, how to transmit and store data in a multitude of languages? English and a lot of Western European languages use the same characters for letters, they are Latin Script letters. Even then, there were problems with cedillas, accents and umlauts being appended to letters. Things got more complicated moving into central Europe and Eastern Europe, where Greek Script and Cyrillic script are used to represent the written form of languages there.
The solution was to use that 8th bit in ASCII, which gave 128 additional slots in the truth table; developers could define characters for their language script using those slots. The resultant truth tables were known as "code pages".
Myth 2: Code pages contain only non-English language characters.
Not true, most code pages are an extension of the ASCII standard, thus the code pages contain the same characters as the 7 bit ASCII encoding + up to 128 additional language characters.
The implication here is that if I prepare a text file with my GB keyboard, using English language characters, and save the file with any of the code pages designed for use with, say, Cyrillic script, then the resultant data will be the same as if I had saved the file using plain old ASCII, because my text only uses characters in the original ASCII character set. So if you are looking for the word "zombie" in a data set but don't know whether any code pages have been used, there is no need to worry: even if the word was encoded using any of the extended ASCII code pages, you don't have to experiment with any code page search parameters.
Myth 3: The bad guys can hide ASCII messages by saving them in extended ASCII code pages. Not true..see above!
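You can demonstrate this to yourself with iconv and xxd (if you have them to hand) - converting a pure-ASCII string into a Cyrillic code page such as CP1251 produces exactly the same bytes as the ASCII original:
printf 'zombie' | xxd
printf 'zombie' | iconv -f ASCII -t CP1251 | xxd
# both commands show the same bytes: 7a 6f 6d 62 69 65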
So, the use of code pages solved the problem of encoding some foreign language characters. I say "some" because there are a number of languages that have far more characters in their alphabet than can be stored in the 128 slots at the upper end of the ASCII table. This fact, coupled with the near impossibility of translating some code pages into others, led to the development of unicode. Unicode is essentially a superset of all the code pages. It is a truth table that seeks to assign a numeric value to every "character" in every language in the world; in addition, the truth table also includes symbols used in other fields such as science and mathematics. Each slot in the unicode truth table is referred to as a "code point". The concept of "characters" also needs extending: many of the "character" renderings in Arabic script depend on the characters either side of them, and in other languages several letters may be combined to form a single character. These rendered forms are referred to as "glyphs", and the unicode standard emphasises the concept of glyphs. It follows that if your software has looked up the numeric value of a "character" in the unicode truth table, then that software (or your O/S) must have the corresponding glyph installed to display the "character" correctly to the user.
Well in excess of 1,000,000 code points exist in unicode. Fundamentally, then, computers perform word processing (and other operations involving character strings) with numbers. Those numbers are code points: the code point is looked up in the unicode truth table and the corresponding glyph displayed on the screen. Those numbers also need to be represented in binary. No problem...we could represent all of the numbers (or code points) in our unicode truth table with 3 bytes. However, this approach is RAM inefficient. Unicode divides its full repertoire of characters into "planes" of 65,536 code points. The first plane contains all the characters for modern languages, therefore we only need 2 bytes to represent those - we are wasting one byte per glyph if we use a 3 byte value. The problem is even worse when dealing with English characters: their code points are at the start of the unicode truth table (making this part of unicode backwardly compatible with ASCII), so they only need 1 byte each - 2 bytes are therefore being wasted using a 3 byte scheme. What was needed were schemes to encode those numbers (code points!); the most popular currently in use are:
UTF-8
UTF-16
UTF-32
Myth 4: UTF-8 uses 1 byte, UTF-16 uses 2 bytes, UTF-32 uses 4 bytes.
Not entirely true. UTF-16 does use a 16 bit encoding scheme, UTF-32 uses a 32 bit encoding scheme but UTF-8 is a variable length scheme, it can use 8 - 32 bits.
There are pros and cons for each encoding method. UTF-32 means that all characters can be encoded with a fixed length value occupying 4 bytes; however, this is inefficient, for the previously stated reasons. UTF-8 is a variable length encoding. If the encoded code point value is for an ASCII text character then the first bit is set to zero and the remaining 7 bits (we only need one byte for ASCII!) are used to store the value of the code point. It is fully backward compatible with the original ASCII standard. Thus if you are searching for an English language string in a UTF-8 data set, you don't need to set any special parameters in your search configuration.
If the UTF-8 encoded character requires more than 1 byte then the first bit(s) are set to reflect the number of bytes used in the array (which can be 2,3 or 4 bytes). Thus in a two byte array, the first 2 bits are set to 1, in a 3 byte array the first 3 bits are set to 1, and so on.
UTF-16 encoding uses a fixed 16 bit array; however, for characters above the first plane of Unicode, a pair of 16 bit values (a surrogate pair) is needed.
Conceptually it is quite straightforward; however, with UTF-16 and UTF-32 there is an issue of endianness: should the encoded value be stored big endian or little endian?
One approach is to use a Byte Order Mark - simply a marker at the start of the encoded data that indicates the endianness being used.
From a programmer's point of view, you want your application to know what encoding scheme is being used to encode the characters. Your average user doesn't really care what encoding scheme is being used, as long as the string is rendered correctly. The problem is for us forensicators who, given a large dataset (especially the stuff in unallocated space), really need to know what encoding schemes are in use, as this will greatly affect our keyword searching strategy. There are no fixed signatures in a stream of characters encoded with the old school code pages. Certainly you can experiment by viewing files in different code pages to see if the resultant data looks like a valid and coherent script - some of the forensic tools can be configured to do this as part of their analysis routines. Forensicators in the English speaking world (and those using a written language based on Latin script) have got it fairly easy. But what if your suspect is a foreign national? Even if they have the English language version of the OS installed on their machine, an English character keyboard and an English language word processing package, this doesn't exclude the possibility of them having documents, emails or web pages in a foreign language on their system.
Myth 5: You can find the word "hello" in a foreign language data set by typing the string "hello" into your keyword searching tool and selecting the appropriate code page for the relevant foreign language.
This is so wrong, and I have heard this view being expressed several times.
You need to understand that when doing keyword searching, programmatically you are actually doing number searching, those numbers being the code points in a particular truth table. It follows that first of all you need to know the type of truth table used to encode the characters, but before you do that you need to translate the English word you are looking for (hello) into the specific language you are interested in (and there may be several different possible translations). For a language not based on Latin script there will be no congruence between the old school code pages and the new school unicode encoding schemes, so you will need to search for the code point numbers as encoded in UTF-8, UTF-16, UTF-32 and any relevant code pages. There are some tools, particularly e-discovery tools, that advertise that they can detect foreign language data, but that is different from detecting the encoding method used.
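As a quick illustration (assuming a UTF-8 terminal and that iconv is installed), here is a Russian word pushed through three different encodings - each produces a completely different byte pattern, so each would need its own search term:
printf 'привет' | xxd                               # the UTF-8 bytes
printf 'привет' | iconv -f UTF-8 -t CP1251 | xxd    # the code page bytes
printf 'привет' | iconv -f UTF-8 -t UTF-16LE | xxd  # the UTF-16 little endian bytes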
All in all, foreign language detection in forensics is challenging to say the least. So many issues about digital forensics are discussed, yet I see very little around foreign language handling and character encoding. If you have any experience in this field then please feel free to comment.
One of the programs on my previewing system will attempt to identify the language used in non-English Office documents, PDFs, emails and web pages; it will also output the data in plain text so that it can be copied and pasted into Google Translate to get at least an idea of the theme of the data. I will post that code in a future posting; for now, any comments about this whole topic will be gratefully received.
Lets skip computing very early history and jump to the early IBM machines.
They used punch cards to store data, the most widely adopted punch card had 80 columns and 12 rows. The card was punched at intersections of the rows and colums, the locations of the punches represented numbers and letters. However, the re was no need to distinguish between upper and lower case letters as there were no word processing packages (Microsoft Office Victorian never really took off I guess). Only a small set of other characters needed to be represented. As a result, the character range that was represented was only 47, 0-9 A-Z and some other characters. So at this point in time, the concept of binary storage was absent from computers. All that changed with in the mid-1950s with the advent of IBMs computer hard disk. The first model the IBM 350 had 30 million BIT capacity.
The cost of storing a single BIT was eye-wateringly expensive. There was a necessity to not only move to the concept of binary encoding, but also to ensure the smallest data unit (later named a byte) was as efficient as possible. The old punch cards could represent 47 characters, all these characters could be represented with just 6 bits in a data unit, so IBM grouped together their bits into groups of 6 to form their data units. Underlying these groupings was the concept of a truth table.
Each group of 6 bits was assigned a numeric value and a truth table was consulted to see what character the numeric value represented - this concept is still used today, it is often referred to as "character encoding".
IBM soon spotted the limitation of 6 bit groupings, especially when they came to thinking about word processing, simply adding lower case letters to the truth table would require 73 characters, that is before you even the consider the additional punctuation marks. This lead to 7 bit groupings which were introduced on the IBM 355 hard disk. This meant that 127 characters could be represented in the truth table for a 7 bit scheme - all upper/lower case characters, punctuation marks, the main math characters could easily be represented, with still some space in our truth table for more at a later date.
Looking back, we can laugh and ask why they didn't simply use an 8 bit byte, as 8 is a power of 2 and thus much more binary-friendly. Well the reality was that the IBM 355 stored 6 million 7 bit groupings. The cost of storing a single bit was in the 10s of dollars region. Moving to an 8 bit grouping would mean that the 8th bit seldom get used (there were already unclaimed slots in our 7 bit truth table) resulting in 1000s of dollars of redundant storage. There was a conflict here between the coders and the engineers, using 7 bit bytes made for less efficient code, using 8 bit bytes made for inefficient storage. Someone had to win, it was the coders who emerged glorious and victorious once the fog of battle had cleared. Thus the term byte was coined and 8 bit grouping settled on.
Interestingly IBM actually experimented with 9 bit bytes at some point to allow error checking. However, 8 bits groupings won on the basis of efficiency. So, historically there has been many battles regarding the optimum number of grouping bits into data units. This explains why IP addresses and the payload of internet data packets are encoded as "octets" - they earlier developers were explicitly stating that the bits should be put into 8 bit groupings, as opposed to using any of the other approaches to grouping bits then in existence.
8 bit bytes were settled on, but it was important that the truth table for these bytes were standardised across all systems. Much like the VHS VS Betamax war (those of you, like myself, born at a more comfortable distance from the apocalypse will recognise that I am talking about video tape player standards here), there were two main competing systems for being the agreed standard for establishing the truth tables. EBCDIC, which was an extension of an older code called BCDIC, which was a means of encoding those 47 characters on the old punch cards with 6 bits. EBCDIC was a full 8 bit character encoding truth table. The main competing standard was ASCII (American Standard Code for Information Interchange).
MYTH No 1. ASCII is an 8 bit encoding scheme.
Not true, it wasn't then and isn't now. It is a 7 bit encoding scheme.
The fact that EBCDIC used the same number of bits as the newly defined 8 bit byte, may explain why for most of the 1970s it was winning the war to be the standard. Ultimately, ASCII offered more advantages over EBCDIC, in that the alphabet characters were sequential in the truth table and the unused 8th bit could be utililised as a parity bit for error checking, thus ASCII became the accepted standard for character interchange coding.
So now we had our 8 bit data transmission/storage unit, known as a byte and an agreed standard for our truth table - ASCII. As ASCII used 7 bits, 128 characters could be defined. The first 32 characters in the ASCII truth table are non printable e.g Carriage Return, Line Feed, Null, etc, the remaining character space defines the upper/lower case letters, the numbers 0-9, punctuation marks and basic mathematical operators.
All would have been well if only the English speaking world wanted to use computers...obviously the ASCII table allowed for English letter characters. Unsurprisingly, citizens of other nations wanted to use computers and use them in their native language. This posed a significant problem, how to transmit and store data in a multitude of languages? English and a lot of Western European languages use the same characters for letters, they are Latin Script letters. Even then, there were problems with cedillas, accents and umlauts being appended to letters. Things got more complicated moving into central Europe and Eastern Europe, where Greek Script and Cyrillic script are used to represent the written form of languages there.
The solution was to use that 8th bit in ASCII, which gave 128 additional slots in the truth table, developers could define characters for their language script in using those slots. The resultant truth tables were known as "code pages".
Myth 2: Code pages contain only non-English language characters.
Not true, most code pages are an extension of the ASCII standard, thus the code pages contain the same characters as the 7 bit ASCII encoding + up to 128 additional language characters.
The implication here is that if I prepare a text file with my GB keyboard, using English language characters and save the file with any of the code pages designed for use with, say cyrillic script, then the resultant data will be the same as if I saved the file using plain old ASCII. My text only uses characters in the original ASCII defined character set. So if you are looking for the word "zombie" on a data set, but don't know if any code pages have been used, no need to worry, if the word was encoded used any of the extended ASCII code pages, you don't have to experiment with any code page search parameters.
Myth 3: The bad guys can hide ASCII messages by saving them in extended ASCII code pages. Not true..see above!
So, the use of code pages solved the problem of encoding some foreign language characters. I say "some" because there are a number of languages that have far more characters in their alphabet that can be stored in the 128 slots at the upper end of the ASCII table. This fact, coupled with the near impossibility of translating some code pages into others, lead to the development of unicode. Unicode is essentially a super set of all the code pages. It is a truth table that seeks to assign a numeric value for every "character" in every language in the world, in addition the truth table also includes symbols used in other fields such as science and mathematics. Each slot in the unicode truth table is referred to as being a "code point". The concept of "characters" also needs extending. Many of the "character" renderings in Arabic script are dependent on the rendering of characters either side of them. In other languages, several letters may be combined to form a single character. These combined letters are referred to as being "glyphs", the unicode standard emphasises the concept of glyphs. It follows that if your software has looked up the numeric value of a "character" in the unicode truth table, then that software (or your O/S) must have the corresponding glyph installed to display it the "character" correctly to the user.
Well in excess of 1,000,000 code points exist in unicode. Fundamentally, then, computers perform word processing (and other operations involving character strings) with numbers. Those numbers are code points, the code point is looked up in the unicode truth table and the corresponding glyph displayed on the screen. Those numbers also need to be represented in binary. No problem...we could represent all of the numbers (or code points) in our unicode truth table with 3 bytes. However, this approach is RAM inefficient. Unicode divides it's full repertoire of characters in "planes" of 65,536. The first plane contains all the characters for modern languages, therefore we only need 2 bytes to represent those, we are wasting one byte per glyph if we use a 3 byte value. The problem is even worse when dealing with the English characters, their code points are at the start of the unicode truth table, (making this part of unicode backwardly compatible with ASCII), so only need 1 byte to represent a character - 2 bytes are therefore being wasted using a 3 byte scheme. What was needed were schemes to encode those numbers (code points!), the most popular currently in use are:
UTF-8
UTF-16
UTF-32
Myth 4: UTF-8 uses 1 byte, UTF-16 uses 2 bytes, UTF-32 uses 4bytes.
Not entirely true. UTF-16 does use a 16 bit encoding scheme, UTF-32 uses a 32 bit encoding scheme but UTF-8 is a variable length scheme, it can use 8 - 32 bits.
There are pros and cons for each encoding method, UTF-32 means that all characters can be encoded with a fixed length value occupying 4 bytes. However, this is inefficient, for the previous stated reasons. UTF-8 is a variable length encoding. If the encoded code point value is for an ASCII text character then the first bit is set to zero, the remaining 7 bits (we only need one byte for ASCII !!), are used to store the value of the code point. It is fully backward compatible with the orginal ASCII standard. Thus if you are searching for an English language string in a UTF-8 data set then you don't need to set any special parameters with your search configuration.
If the UTF-8 encoded character requires more than 1 byte then the first bit(s) are set to reflect the number of bytes used in the array (which can be 2,3 or 4 bytes). Thus in a two byte array, the first 2 bits are set to 1, in a 3 byte array the first 3 bits are set to 1, and so on.
UTF-16 encoding use a fixed 16 bit array, however for characters in the above the first plane of Unicode, two pairs are needed.
Conceptually it is quite straight forward, however, with UTF-16 and UTF-32 there is an issue of endian-ness, should the encoded value be in big endian or little endian?
One approach is to use Byte Order Marking, this is simply a pair of bytes at the start of the encoded data that indicate the endian-ness being used.
From a programmers point of view, you want your application to know what encoding scheme is being used to encode the characters. Your average user doesn't really care as long what encoding scheme is being used as long as the string is being rendered correctly. The problem is for us forensicators who given a large dataset (especially that stuff in unallocated space), really need to know what encoding schemes are being used as this will greatly affect their keyword searching strategy. There are no fixed signatures in a stream of characters that are encoded with the old school code pages. Certainly you can experiment by viewing files in different code pages to see if the resultant data looks like a valid and coherent script - some of the forensic tools can be configured to do this as part of their analysis routines. The forensicators in the English speaking world (and those using the a written language that is based on Latin Script) have got it fairly easy. But what about if your suspect is a Foreign National? Even if they have the English language version OS installed on their machine, and English character keyboard and the English language word processing package, this doesn't exclude the possibility of them having documents, emails, or web pages in a foreign language on their system.
Myth 5: You can find the word "hello" in a foreign language data set by typing the string "hello" into your keyword searching tool and selecting the appropriate code page for the relevant foreign language.
This is so wrong, and I have heard this view being expressed several times.
You need to understand that when doing keyword searching you are, programmatically, actually doing number searching, those numbers being the code points in a particular truth table. It follows that first of all you need to know the type of truth table used to encode the characters, but before you do that you need to translate the English word you are looking for (hello) into the specific language you are interested in (and there may be several different possible translations). In a non-Latin script based language there will be no congruence between the old school code pages and the new school Unicode encoding schemes. So you will need to search for the code point numbers as encoded in UTF-8, UTF-16, UTF-32 and any relevant code pages. There are some tools, particularly e-discovery tools, that advertise that they can detect foreign language based data, but this is different from detecting the encoding method used.
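To make the "searching for numbers" point concrete, here is a rough sketch (the search term and image name are purely illustrative) of hunting for an ASCII-range term in its UTF-16 little endian form with GNU grep; for a non-Latin script you would first translate the term, then work out its encoded byte values - iconv can help with that:
grep -aoP 'h\x00e\x00l\x00l\x00o' /path/to/image.dd     # "hello" as UTF-16LE - each character followed by a 0x00 byte
echo -n "hello" | iconv -f UTF-8 -t UTF-16LE | xxd -p   # prints the raw hex bytes you are actually hunting for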
All in all, foreign language detection in forensics is challenging to say the least. So many issues about digital forensics are discussed, yet I see very little around foreign language handling and character encoding. If you have any experience in this field then please feel free to comment.
One of the programs on my previewing system will attempt to identify the language used in non-English Office documents, PDFs, emails and web pages; it will also output the data in plain text so that it can be copied and pasted into google translate to get at least an idea of the theme of the data. I will post that code in a future posting, for now any comments about this whole topic will be gratefully received.
Monday, 3 September 2012
Identifying the dead redux
In my previous POST I showed you how to do a file signature check on every file in the file system. That should result in us having a flat, text based database that has this format:
The inestimable Barry Grundy points out that this is only useful if you ensure that the description of all the graphic files in your magic file contains the string "image data". You should NOT edit your system magic file, simply create your custom magic file, the format is simple, and place it in your /etc/ directory.
/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators
The format of the file is two fields separated by a colon. The first field is the full path and filename, the second field (after the colon) is a description of the file taken from the magic file. Notice that the file command can identify ASCII text files.
Remember that our database is a file called listoffiles, it is in the path $reppath/tmp/.
If we wanted to create a list of graphic files then we could use a command like this:
grep 'image data' $reppath/tmp/listoffiles | cut -d: -f 1 > $reppath/tmp/liveimagefiles.txt
The above command is searching each line of our file list for the string "image data", then piping the results to the cut command which prints the first field of any matched lines, so the command returns the following:
/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg
In the example above, everything works fine, but now consider this data set.
/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators
/media/sda1/stuff/image data.doc: Microsoft Office Document
Using our previous command we now get these results:
/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg
/media/sda1/stuff/image data.doc
The final line is a false positive. We can avoid this happening by only searching the second field (the file description) of each line of the file list and printing the first field of any matches. We can use the awk command to do this. In its simplest form awk will break down a line of input into fields. By default the field separator is white space, however you can change the field separator with the -F option, then select which field to print with the print statement. So, we could print all the file paths and names (the first field in our listoffiles) with this command:
awk -F: '{print $1}' $reppath/tmp/listoffiles
If we wanted to print just the descriptions (the second field) then we could use this:
awk -F: '{print $2}' $reppath/tmp/listoffiles
We can see that awk uses the nomenclature $1, $2, $3 etc to name each field in the lines of input.
If we wanted to search the second field for our "image data" pattern then we could use this command:
awk -F: '$2 ~ /image\ data/ {print $1}' $reppath/tmp/listoffiles
We use the -F option to set the field separator to a colon, we put our statement in single quotes, the $2 ~ means go to the second field and match the following pattern (which has to be inside the / / characters), then print the first field of any matching lines. The white space in our pattern to be matched has to be escaped with a \ character. We can redirect the output to a file, meaning that we have a complete list of all the graphic file paths and names in a single file. We can read that list of graphic files to copy out our graphic images...but there are some problems we need to overcome....
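For instance, to capture the awk results in a file and get a quick count of the graphic files found (the output filename is just an example):
awk -F: '$2 ~ /image\ data/ {print $1}' $reppath/tmp/listoffiles > $reppath/tmp/liveimagefiles.txt
wc -l $reppath/tmp/liveimagefiles.txt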
Identifying the dead
When previewing a system, once you get down to the file system level, the first thing you might want to do is create a database of live files and what those files actually are - what is known as a file signature analysis. Linux generally relies on file signatures to identify file types, rather than file extension (which is the Windows way).
Therefore there is already an expansive file signature database installed on all Linux systems named "magic" (often found at /usr/share/misc). In addition you can create your own magic file with your customised signatures. The magic file is used extensively by the "file" command. The file command compares the signature in a regular file to the signatures in the magic database(s) and returns a description of the file based on the content of the magic file. Most forensicators understand the significance of doing a file signature check as opposed to relying on file extension.
A user can change a file extension, the firefox web browser caches files to its web cache with no file extension, the opera browser caches files with a .tmp file extension. So if you are viewing graphic files via your forensic suite, or even if you have exported them to view in a file browser, have you checked whether your tool of choice is working on the file extension or the file signature? It may be that you are missing 1000's of images in, say, the firefox cache based on faulty assumptions...make sure you check, chaps! Sorting files by file extension is about as effective as aiming for the centre of mass on a zombie in the hope of stopping them. Remember: head shots to destroy zombies, file signature analysis to identify file types. Simples!
Anyway, we can start to create our previewing script. Remember with my file system mounting SCRIPT all the file systems were mounted under the /media node, thus /dev/sda1 gets mounted at /media/sda1, /dev/sda2 at /media/sda2 etc etc. The script also initiates a loop to process each mounted file system in turn and exports some variables that can be used by our processing script. Also, our external drive is mounted at /mnt/cases. Here is the relevant part of the script:
for i in `cat /etc/mtab | grep media | egrep -iv 'cd|dvd'| awk -F\  '{print $2}'`
do
export volpath=`echo $i`   #eg /media/sda1
export suspart=`echo $volpath | sed 's/media/dev/g'`   #eg /dev/sda1
export fsuspart=`echo $suspart | sed 's/\///g'`   #eg devsda1
export susdev=`echo $suspart | sed 's/[0-9]*//g'`   #eg /dev/sda
export tsusdev=$susdev   #eg /dev/sda
dirname=`echo $suspart | sed 's/\//_/g'`   #eg _dev_sda1
ddirname=`echo $susdev | sed 's/\//_/g'`   #eg _dev_sda
sudo mkdir -m 777 $evipath/$csname/$evnum/$ddirname   #eg /mnt/cases/BADGUY_55-08/ABC1/_dev_sda1
sudo mkdir -m 777 $evipath/$csname/$evnum/$ddirname/$dirname
cd $evipath/$csname/$evnum/$ddirname/$dirname
export reppath=`pwd`   #eg /mnt/cases/BADGUY_55-08/ABC1/_dev_sda1
sudo mkdir -m 777 findings
sudo mkdir -m 777 Report
sudo mkdir -m 777 tmp
First thing to note is that Forensicator has been a bad zombie by putting his variables in lower case, it is much better programming practice to put them in upper case to make the code easier to read.
The first line initiates our loop; it is isolating all the file systems mounted under /media by looking in the /etc/mtab file, then excluding our cd/dvd drive in case we have booted the system from CD (as opposed to thumb drive). Further down it references some variables called $evipath, $csname and $evnum. If you look earlier in the diskmount script you will see that they were created during an interactive session, when the user was prompted for input, like this:
export evipath=/mnt/cases
echo -n "What is the case name (NO SPACES OR FORWARD SLASHES)? > "
read csname   #eg BADGUY_55-08
export csname
echo -n "What is the evidence number of suspect system (NO SPACES OR FORWARD SLASHES)? > "
read evnum   #eg ABC1
export evnum
echo -n "What is your rank and name? > "
read examiner   #eg DC_Sherlock_Holmes
export examiner
The "read" command is great for getting user input assigned to a variable, the value is then exported for use by other scripts. I have commented the code (anything after the # character) to show you an example of the what the variable value looks like. So, we have a mounted external drive, we have created a case directory structure on it, the topmost directory is the case name, the next directory down the tree is that of the evidence number, inside that will be a directory for each physical device eg. _dev_sda, inside that there will be a directory for each partition (_dev_sda1, _dev_sda2, etc etc), inside each of those will be 3 directories named "findings", "Report" and "tmp". We have also, as part of our loop, created at variable called $reppath (short for Report Path), this variable points at the partitions directory on our ouput drive, so that data can be sent to the findings|Report|tmp directory, an example of a $reppath variable value would be something like:
/mnt/cases/BADGUY_55-08/ABC1/_dev_sda1
If I wanted to create a database of all the files and their descriptions for each partition, the code would therefore be:
find $volpath -type f -exec file {} \; >> $reppath/tmp/listoffiles
The $volpath is the mounted partition, eg /media/sda1. The database is called listoffiles (it is just a simple text file), when the $reppath variable gets expanded the full file name and path would be something like:
/mnt/cases/BADGUY_55-08/ABC1/_dev_sda1/tmp/listoffiles
The syntax for the find command is a bit weird if you aren't familiar with all the options. The command is saying: find all entities in the path (for instance) /media/sda1, confine the results to regular files (-type f), execute the file command (-exec file) for each entity found ({}) and redirect the results to my listoffiles.
This is how I would do the file signature check, once this database is created, I script out interrogating the database for certain file types then processing those. The code I have, and will be publishing here, will process the live set, deleted set and unallocated space along with the interpartition gaps and ambient data such as swap/hiberfil/memory dumps, it will export various files out for review, hunt for encrypted files, process compressed data and various archive formats, create storyboards of any movie files, do virus checking, recovers and processes 25 different chat/messaging formats, processes all the major email formats, processes p2p history files, does complete URL recovery and analysis, and lots more. This is all done automatically with a single command.
Sunday, 2 September 2012
A plague of previewing...the rising
So you have created your preview disk, booted the system and now want to start your previewing. You can use some forensic tools such as the SLEUTHKIT to do some analysis, but we might want to use a lot of the tools built into linux...for that we need to mount the file systems in a forensically sound way, and probably mount some external storage to send our exported data to.
There are a number of fantastic tools available to us in linux for discovering physical disk, partition table and file system information, before we mount those file systems.
We will also want to know what exists in the gaps between partitions....is there raw data or maybe a complete, but deleted, file system? If a file system fails to mount, we would want to know why...is the file system encrypted? If the file system is BitLocker protected, maybe we want to scan the partition for a recovery key? We will also want to mount an external drive in read/write mode to collect any results from our analysis. This can be quite a challenging task at the terminal, so I have written a script that will automatically perform most of the above tasks in seconds. You can add this script to your boot CD. Note there are a couple of dependencies that should be available in your system's package manager to install, the deps are:
gdisk (to handle GPT partitions)
sleuthkit
hachoir tools set
hdparm (to get detailed info about IDE hard disks)
sdparm (to get detailed info about SATA/SCSI hard disks)
bulk_extractor
So this will be the flow of our script:
Check if the user is root.
Get a list of partitions
Get list of mounted partitions - we need to do this in case we have booted using a thumbdrive, we want to exclude our boot media from the analysis.
Get list of physical disks (in case we want to do anything at the physical disk level)
Check for presence of GPT partition tables.
Check each partition for presence of bitlocker signatures.
Mount all mountable partitions under the /media node in Read Only mode.
Check for the presence of any AppleMac partitions, and mount any partitions found.
For any file system that fails to mount, conduct an entropy test on a sample of 5MB of data from the partition - warn the user if the file system appears to be encrypted (I will cover this in a bit more detail in a future post; see the sketch after this list for the general idea).
Create mount point for external drive
Prompt user to plug in external drive
Detect and mount partition on external drive in Read/Write mode
Create case directory structure on external drive
Create report containing details of interrogated system, hard disk info, partition info, RAM/Processor info
Process each mounted partition in turn (we can do a simple check to determine if a Windows or MacOS installation is present, then launch another script appropriate for that OS; if no OS is detected, assume it is a storage disk and process accordingly).
Once each mounted partition is processed, image each inter-partition gap in turn to the external drive.
Analyse each inter-partition gap to see if it has a valid file system; if so, mount the file system and process it, else treat it as raw data and process it with a different script.
Check for presence of Linux swap partitions and process same
If any bit-locker signatures found, launch process to look for recovery key on drive.
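To give a flavour of the entropy test mentioned in the list above, here is a minimal sketch, assuming the 'ent' utility is installed and using /dev/sda2 purely as a stand-in for whichever partition refused to mount:
dd if=/dev/sda2 bs=1M count=5 2>/dev/null | ent   # sample the first 5MB of the partition
# entropy close to 8 bits per byte across the sample suggests encrypted (or heavily compressed) data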
In the script you will see that I have commented out the lines to launch the various analysis scripts for each partition. We'll look at some of the interesting types of automated analysis that we can do in future posts. For now, you can find the shell script HERE.
You may wonder why I use the "loop" option to the mount command, this is to prevent the journal in some journalling file systems being MODIFIED.
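As an illustration only (the device and mount point are made up), a read only mount with the loop option looks something like this:
sudo mount -o ro,loop /dev/sda1 /media/sda1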
The inter-partition gap analysis is something that isn't always done in computer forensics; the layout of many computer forensic suites doesn't lend itself to easy analysis of these areas, so this step is often overlooked. If you aren't looking in the inter-partition gaps routinely then you are doing your analysis incorrectly.
Friday, 31 August 2012
Task saturated
Most LE forensicator types will know what it is like to be overwhelmed with urgent stuff, the situation not helped when your supervisor is shouting at you to get stuff done and then, when you have gone home, your supervisor phoning you up to shout at you some more (and we love that). So it is nice to remember that, no matter how crushing the challenges facing you, things could be worse. Thus, I commend to you the fictional(?) zombie stylings of STEPHEN KNIGHT.
This all too credible tale of zombie mayhem is a milestone in the canon of the walking dead. All too often zombies are unfairly portrayed as thoughtless automatons, unable to perform any tasks beyond the standard ripping and tearing of flesh. In this book, a detachment of soldiers are heavily outnumbered and trapped in a tower block (a block of flats for those who speak the Queen's English), their efforts to escape hampered by the appearance of zombie special forces soldiers who have managed to retain some memories of their training. In fact, these uber-zombies are so adept that they are able to reduce the most complicated and seemingly insoluble problems to very simple solutions....blow stuff up!!!
Relentlessly gripping, this is essential reading. This is the first part of a trilogy; will the tale end happily with the walking dead ascending to global dominance, or will the jammy living score an unlikely and undeserved victory?
Thursday, 30 August 2012
Facebook artifacts
It is widely known that Facebook artifacts can be cached to the disk, a couple of years ago the chat artifacts were written as plain text files that could be found in the web browser cache for IE users. This is no longer the case, however some artifacts can still be recovered from a hard disk, particularly from the swap file.
The artifacts are in json format, but facebook are fond of updating their infrastructure, so the internal structure of the json artifacts may change frequently. Do you have a tool for recovering such artifacts? Is it keeping pace with changes to the structure of the artifacts? I am going to show you a way to not only recover the artifacts and parse them to generate more user friendly output, but also how to ensure you can stay up-to-date with changes facebook make to the structure of the artifacts. This technique will also allow you to recover other chat/messaging artifacts such as Yahoo IM2.
What we need to do is recover ALL the json artifacts from the disk, save them to file then write a script to parse out any messages. I use the mighty and utterly essential bulk_extractor tool to recover the json artifacts (amongst many other things). If you haven't used this tool then you absolutely MUST get hold of it, there are virtually no cases where I don't deploy it. I will cover more uses for the tool in future posts, and there is a Windows version with a nice GUI on the download page, but for now we'll look at the recovery and parsing of json data. We can do this on the suspect system, having customised our boot disk and generated the .iso as per my PREVIOUS POST. Alternatively you could run bulk_extractor against a disk image.
You will need to run bulk_extractor as root from the terminal. There are lots of options for bulk_extractor; here we are just going to run it with only the json scanner enabled.
The command would be:
bulk_extractor -E json -o /home/fotd/bulk /dev/sda
The -E json option turns off all the scanners except the json scanner, then we specify an output directory and the physical device that we want to process.
Upon completion we find a text file called json.txt, this will contain all the json strings preceded by the disk offset that they were found at.
The json strings can be very long, sometimes tens of thousands of characters in length...so I can't really show you any of the interesting strings here. Ideally you want to view the strings without line wrapping; the gedit text editor can do this for you. Trying to manually review the json and identify facebook artifacts is the road to insanity, so we'll script the processing of the json strings. What we are going to do is search our json strings for facebook artifacts; the newer ones will have the string "msg_body" within the first 200 characters. So we can read our json file, line by line, looking for the term "msg_body". If a json string matches this criterion we will search it a bit deeper, looking for other landmarks in the string that are indicators of facebook artifacts. We can use those landmarks as field separators for awk, to isolate structures in the string such as the message content, message date, author id etc. Here is a chunk of code that is representative of the script:
jbproc () {
# check whether "msg_body" appears in the first 200 characters of the json string
if test `echo $CHATFILE | head -c 200 | egrep -m 1 -o msg_body | head -n 1`
then
echo "new single message found"
MSGTYPE=OFFLINE_MESSAGE
# use the escaped markup sequences left in the cached json as awk field separators
SUBJ=`echo $CHATFILE | awk -F'5Cu003Cp>' '{print $2}' | awk -F'5Cu003C' '{print $1}'`
UTIME=`echo $CHATFILE | awk -Ftimestamp '{print $2}' | awk -F, '{print $1}' | awk -F: '{print $2}'`
# the json timestamp is unixtime multiplied by 1000, so divide by 1000 and convert to human readable
HTIME=`date -d @$(($UTIME/1000))`
# strip commas out of the message text so the csv columns don't get broken
TEXT=`echo $CHATFILE | awk -Fcontent\ noh '{print $2}' | awk -F5Cu003Cp\> '{print $2}' | awk -F5Cu003C '{print $1}' | sed 's/,/ /g'`
SNDID=`echo $CHATFILE | awk -Fsender_fbid '{print $2}' | awk -F, '{print $1}' | awk -F: '{print $2}'`
SNDNME=`echo $CHATFILE | awk -Fsender_name '{print $2}' | awk -F, '{print $1}' | awk -F: '{print $2}'`
RCPTID=NONE
RCPNME=NONE
MSGID=NONE
# bulk_extractor prefixes each json string with the disk offset it was found at
OFFSET=`echo $CHATFILE | awk '{print $1}'`
echo "$SNDNME,$SNDID,$SUBJ,$TEXT,$MSGTYPE,$HTIME,$MSGID,$RCPTID,$RCPNME,$OFFSET," >> $OUTFILE
fi
}
OUTFILE=FACEBOOK_MSGS.csv
echo "Sender Name,Sender ID,Msg Subject,Message Content,Msg Type,Message Date/Time,Message ID,Recipient ID,Recipient Name,Offset," > $OUTFILE
cat json.txt | while read CHATFILE ; do jbproc $CHATFILE ; done
The final line submits each line of our json.txt file to a function called jbproc. The test at the top of the function checks to see if the term msg_body appears in the first 200 characters of the line. Note that we pipe the result of egrep to the head command; if we didn't do this and our test found 2 instances of "msg_body" then our script would fall over, as the test command will only accept a single result. The rest of the script is fairly straightforward. In the TEXT variable you want to make sure you remove any commas, as the output is going to a comma-delimited file - otherwise your formatting is going to be messed up. The time stamp in the json is a unixtime value multiplied by a thousand, so we need to divide the number by 1000 then convert the value to human readable with the date command. All our variable values are echoed out to our spreadsheet. The code snippet above just deals with one type of facebook artifact; you can download the full script that processes all the various facebook artifacts HERE. Save the script into /usr/local/bin, make it executable, then change into the directory containing your json.txt file, run the script and you will find a spreadsheet containing all the parsed output in the same directory.
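As a worked example of that timestamp conversion, using a made-up value of 1346577890123 milliseconds:
date -d @$((1346577890123/1000))   # divide by 1000 to get unixtime in seconds, then let date render it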
The big advantage of this approach is that if facebook change their json output, then you can quickly see what the changes are by checking the bulk_extractor generated json.txt file, then simply edit the script to reflect the new changes.
If you are going to use the script, let me know how results compare to any other tools that you might be using to recover facebook artifacts.
Wednesday, 29 August 2012
Customising your boot CD
Trying to roll your own version of Linux used to be more stressful than being trapped in a basement during the zombie apocalypse with someone who suddenly develops flu-like symptoms. Mercifully, this is no longer the case (depending on what distro you are using); a set of 'remastersys' scripts is included on some distros to simplify the process. I am assuming that you know some basic linux for this post!
The engine of my preview system is the CAINE bootable CD (version 2.0). To create your own system you will need to get the dd image (if you want the simplest solution). Once you have downloaded and unzipped the image, you just need to write the image to a 2GB thumbdrive (or larger). Remember that you are sending the image to the physical device, not to a file in a file system on your thumbdrive. The first sector of the image needs to be written to the first sector on the thumbdrive. To do this you can use any linux distro, just open a terminal and type:
sudo dd if=/path/to/image of=/dev/sdb
This assumes that your thumbdrive is identified as /dev/sdb. You will be prompted for the root password for the system, type it in and wait several minutes until the copying is complete. You now have a bootable thumbdrive that you can boot any computer with - assuming the BIOS supports USB booting. You can go ahead and boot any machine with the thumbdrive, simply configure the BIOS to boot from the USB drive.
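Before running dd it is worth double checking which device node the thumbdrive has actually been given - the commands below are just one way to do it, and your device names may differ:
sudo fdisk -l   # list the disks and partitions the kernel can see - the thumbdrive should be obvious by its size
dmesg | tail    # or check the kernel messages generated when the drive was plugged in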
Bear in mind that it is unlikely that any wireless card drivers are going to be available, so you should anticipate using an ethernet cable to access the internet.
Once the system is up and running, you can add and remove programs using the Synaptic package manager, or by downloading and installing programs manually.
Any scripts that I publish here, or you download from other sites, I recommend putting in the /usr/local/bin directory.
Remember to make all scripts executable by changing the permissions, like this:
sudo chmod a+x /usr/local/bin/scriptname.sc
The root password is "caine".
All you need to do now is generate the .iso image so you can create a bootable CD of your distro. You can do this with the remastersys scripts - there is a nice GUI available in the menu to help you along. These screenshots show the path to the GUI and the resulting dialog boxes:
Make sure you select the modify option for the first run, we are going to have to configure remastersys with some parameters. Once you select modify, the following dialog box appears:
What we need to do with the above options is select the directory where we want the .iso created. The .iso will be about 600MB, although other temporary files will be written, so figure you will need 1.2GB of space. You probably won't have space on your thumbdrive, so we can send the .iso image to a directory where we are going to mount an external drive or thumbdrive. The "working directory" field is the FULL PATH we want to send our .iso file to. In my example, I have a directory at /stuff where I mount an external drive to receive the .iso image. It is VERY IMPORTANT that you also put the same path into the "Files to Exclude" field - otherwise all the data on your external drive will be included in the .iso!! Give your .iso file a name in the Filename field then select the "Go back to main menu" option, and you will be taken to this dialog box:
Select the dist option and press OK. A terminal will open and you will be able to follow the progress of your .iso being created. You will find the .iso in a directory called "remastersys" in the path that you specified previously. Make sure that you have mounted your external drive in READ/WRITE mode at the directory you specified in the settings.
Your .iso image can now be burned to as many CDs as you like.
Now I must feed...but when my grim business is done we will start on some scripting.
Forensic previewing with Linux
I s'pose the first question is why do any form of previewing? Most LE agencies that I know of experience very high volumes of requests for digital forensic services, and this often leads to backlogs in cases. To combat this, agencies have adopted a wide variety of responses, including combinations of outsourcing, prioritising cases, triaging, introducing KPIs, and setting policies to only view the files in the live set on some systems. All of these approaches are practical solutions, however they do have some drawbacks, such as cost or the risk of evidence being missed. The squeeze on budgets is already impacting on many agencies' abilities to outsource cases, hire more staff or buy new hardware/software. My approach is to make sure every piece of digital storage is processed, using open source tools. Often my processes go deeper than what is done during a "full" forensic examination in many labs. Costs of this approach are minimal - some external drives and some 4-port KVM switches so that many systems can be previewed in parallel. You really want to make sure that your costly forensic tools are being focussed on media that is known to contain evidence. Of course, the commercial forensic tools (many of which I like A LOT!) do give you the ability to do some previewing, however this ties up your software dongles and your forensic workstations. You also need to be at the keyboard, as the approach often involves configuring a process, running it, then configuring the next process and running it and so on...
I leverage the processing power of the suspect system, processing it in a forensically sound manner by booting the suspect system with a forensic CD then running a single program to do the analysis that I require, selecting various options depending on circumstances. The processing may take many hours (sometimes up to 18 hours), however my forensic workstation and software are free to tackle systems that I have previously processed and found evidence on. I can view the output from my previewing in a couple of hours and establish if there is evidence on the system or not. Therefore, I overcome the problem of potential evidence being missed that exists in some other approaches to reducing backlogs of cases. Most linux forensic boot disks can be installed to a workstation to process loose hard disks and USB storage devices. Ultimately the majority of storage devices can be eliminated from the need to undergo costly and time-consuming forensic examinations, only disks/media known to contain evidence are processed, and everything that is seized gets looked at...double gins all round.
So, that is about as serious and po-faced as this blog is going to get. The next post will look at customising your forensic boot CD for your own needs (it's A LOT simpler than most people realise).
Tuesday, 28 August 2012
Dawn of the forensicator of the dead
YALG! Yet Another Linux Geek blogging. I'm afraid I no longer have a functioning cerebrum so can't compete with most of the very intelligent (and annoyingly alive) linux DFIR types out there. Hopefully this blog will have a vague appeal to those who have dipped their toe into the murky world of Linux and then pathetically run away screaming at the incomprehensibility of it all, or indeed those who haven't yet had the sheer life-affirming joy of spending years in the trenches of front line forensics and want to know a smidgeon more.
I have developed a system of "enhanced previewing" of computer systems and storage devices that allowed my team to get rid of the soul-destroying weight of computer backlogs. As it is a Linux system unencumbered with dingle-dangle-dongles and suchlike, it can be deployed on as many systems as space allows. You simply boot the suspect's machine with the CD or thumbdrive, and then in a forensically sound manner (or at least I hope it is, or I am going to look VERY foolish) plunder the drive for potential evidence, which is then exported out to an external drive, by invoking a single command (or pressing an icon for you GUI jockeys!). The output can be viewed on any PC; those machines that don't contain any evidence can be eliminated from further examination. Those that do contain evidence can then be tortured by the forensic tools of your choice - however the granular nature of the output of my system means that you know where to look for the evidence, e.g. in zip files in unallocated space, and can therefore do your forensic evidential recovery much more quickly. The system is essentially a Linux boot CD which I have customised + 50,000 lines of ham-fisted bash scripting wot I writ.
I have learned many painful lessons (at least they would have been painful if I had a working Parietal Lobe) and some interesting stuff. So if you want to know how to recover specific file types PROPERLY, identify encrypted data, recover encryption keys, review hundreds of hours of movie footage in minutes, classify files according to the language that they are written in, process all types of email, recover facebook artifacts reliably and loads of other stuff, then stay tuned. I will be sharing bash/shell code that you can mock unrelentingly (and possibly use in your cases). So there will be some very basic stuff and some more challenging stuff....now what first?