Thursday, 30 August 2012

Facebook artifacts

It is widely known that Facebook artifacts can be cached to the disk, a couple of years ago the chat artifacts were written as plain text files that could be found in the web browser cache for IE users.   This is no longer the case, however some artifacts can still be recovered from a hard disk, particularly from the swap file.
The artifacts are in json format, but facebook are fond of updating their infrastructure so the internal structure of the json artifacts may change frequently.  Do you have tool for recovering such artifacts?  Is it keeping pace with changes to the structure of the artifacts?  I am going to show you a way to not only recover the artifacts and parse them to generate more user friendly output, but also how to ensure you can stay up-to-date with changes facebook make to the structure of the artifacts.   This technique will also allow you to recover other chat/messaging artifacts such as Yahoo  IM2.

What we need to do is recover ALL the json artifacts from the disk, save them to file then write a script to parse out any messages.  I use the mighty and utterly essential   bulk_extractor tool to recover the json artifacts (amongst many other things).  If you haven't used this tool then you absolutely MUST get hold of it, there are virtually no cases where I don't deploy it.  I will cover more uses for the tool in future posts, there is a Widows version with a nice GUI on the download page, but for now we'll look at the recovery and parsing of json data.  We can do this on a the suspect system, having customised our boot disk and generated the .iso as per my PREVIOUS POST.  Alternatively you could run bulk extractor against a disk image.
You will need to run bulk_extractor as root from the terminal.  There are lots of options for bulk_extractor we are just going to run it with the json scanner enabled.
The command would be:
bulk_extractor -E json -o /home/fotd/bulk /dev/sda
The -E json option turns off all the scanners except the json scanner, we then specify and output directory and the physical device that we want to process.
Upon completion we find a text file called json.txt, this will contain all the json strings preceded by the disk offset that they were found at.

The json strings can be very long, sometimes 10s of thousands of characters in I can't really show you any of the interesting strings here.   Ideally you want to view the strings without line wrapping, the gedit text editor can do this for you.  Trying to manually review the json and identify facebook artifacts is the road to insanity, so we'll can script the processing of the json strings.   What we are going to do is search our json strings for facebook artifacts, the newer ones will have the string "msg_body" within the 1st 200 characters.   So we can read our json file, line by line, looking for the term "msg_body", if a json string matches this criteria we will search it a bit deeper looking for other landmarks in the string that are an indicator of facebook artifacts.  We can use those landmarks as field separators for awk, to isolate structures in the string, such as the message content, message date, author id etc.  Here is a chunk of code that is representative of the script:

jbproc () {

if test `echo $CHATFILE | head -c 200 | egrep -m 1 -o msg_body | head -n 1`
   echo "new single message found"
   SUBJ=`echo $CHATFILE | awk -F'5Cu003Cp>' '{print $2}' | awk -F'5Cu003C' '{print $1}'`
   UTIME=`echo $CHATFILE | awk -Ftimestamp '{print $2}' | awk -F, '{print $1}' | awk -F: '{print $2}'`
   HTIME=`date -d @$(($UTIME/1000))`
   TEXT=`echo $CHATFILE | awk -Fcontent\ noh '{print $2}' | awk -F5Cu003Cp\> '{print $2}' | awk -F5Cu003C '{print $1}' | sed 's/,/ /g' | sed 's/,/ /g'`
   SNDID=`echo $CHATFILE | awk -Fsender_fbid '{print $2}' | awk -F, '{print $1}'| awk -F: '{print $2}'`
   SNDNME=`echo $CHATFILE | awk -Fsender_name '{print $2}' | awk -F, '{print $1}'| awk -F: '{print $2}'`
   OFFSET=`echo $CHATFILE | awk '{print $1}'`
echo "Sender Name,Sender ID,Msg Subject,Message Content,Msg Type,Message Date/Time,Message ID,Recipient ID,Recipient Name,Offset," > $FACEBOOK_MSGS.csv
cat json.txt | while read CHATFILE ; do  jbproc $CHATFILE ; done

The final line submits each line of our json.txt file to a function called jbproc.  The first line of the function checks to see if the term msg_body appears in the first 200 characters of the line.  Note that we pipe the result of egrep to the head command. If we didn't do this and our test found 2 instances of "msg_body" then our script would fall over as the test command will only accept a single result.  The rest of the script is fairly straightforward.  In the TEXT variable you want to make sure you remove any commas, as the output is going to a comma deliminated file - otherwise you formatting is going to be messed up.   The time stamp in the json is a unixtime value multiplied by a thousand, so we need to divide the number by 1000 then convert the value to human readable with the date command.  All our variable values are echoed out to our spread sheet.   The code snippet above just deals with one type of facebook artifact, you can download the full script that processes all the various facebook artifacts HERE.  Save the script into /usr/local/bin, make it executable then change into the directory containing your json.txt file, run the script and you will find a spreadsheet containing all the parsed output in the same directory.

The big advantage of this approach is that if facebook change their json output, then you can quickly see what the changes are by checking the bulk_extractor generated json.txt file, then simply edit the script to reflect the new changes.

If you are going to use the script, let me know how results compare to any other tools that you might be using to recover facebook artifacts.


  1. F of the Dead:

    Graet work...I however it is a bit over my head. I have a ton of data to look at and I am having trouble figuring out how to use your process.


  2. How fast should the script process a 65MB json.txt file?


  3. There are too many unknown's here to say, MLB. It depends on your processor, how many json strings are facebook chat strings, etc. An hour or 2, maybe?