Monday, 3 September 2012

Identifying the dead redux

In my previous post I showed you how to run a file signature check on every file in the file system. That should leave us with a flat, text-based database in this format:


/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators

Each line has two fields separated by a colon. The first field is the full path and filename; the second field (after the colon) is a description of the file taken from the magic file. Notice that the file command can identify ASCII text files.
Remember that our database is a file called listoffiles, stored in $reppath/tmp/.
If we wanted to create a list of graphic files, we could use a command like this:
grep 'image data' $reppath/tmp/listoffiles | cut -d: -f 1 > $reppath/tmp/liveimagefiles.txt

The above command searches each line of our file list for the string "image data", then pipes the results to the cut command, which prints the first field of any matched lines. The output is redirected to liveimagefiles.txt, which ends up containing the following:

/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg

The inestimable Barry Grundy points out that this is only useful if the description of every graphic file type in your magic file contains the string "image data". You should NOT edit your system magic file. Instead, create your own custom magic file (the format is simple) and place it in your /etc/ directory.

In the example above everything works fine, but now consider this data set:


/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators
/media/sda1/stuff/image data.doc: Microsoft Office Document

Using our previous command we now get these results:

/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg
/media/sda1/stuff/image data.doc

The final line is a false positive. We can avoid this by searching only the second field (the file description) of each line of the file list and printing the first field of any matches. We can use the awk command to do this. In its simplest form awk breaks a line of input into fields. By default the field separator is white space, but you can change it with the -F option and then select which field to print with the print statement. So, we could print all the file paths and names (the first field in our listoffiles) with this command:
awk -F: '{print $1}' $reppath/tmp/listoffiles

If we wanted to print just the descriptions (the second field) then we could use this:
awk -F: '{print $2}' $reppath/tmp/listoffiles
We can see that awk uses the nomenclature $1, $2, $3 etc to name each field in the lines of input.
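
A quick way to see the field splitting in action is to feed awk a single line in the same format. This is only a minimal sketch: the sample line is copied from the listing above, and the variable names line, path and desc are made up for the illustration.

```shell
# One line in the same "path: description" format as listoffiles.
line='/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators'

# -F: splits on the colon; $1 is the path, $2 is the description.
path=$(printf '%s\n' "$line" | awk -F: '{print $1}')
desc=$(printf '%s\n' "$line" | awk -F: '{print $2}')

echo "path: [$path]"
echo "desc: [$desc]"
```

Note that the description keeps the space that follows the colon, because awk splits on the colon alone.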

If we wanted to search the second field for our "image data" pattern then we could use this command:
awk -F: '$2 ~ /image\ data/ {print $1}' $reppath/tmp/listoffiles

We use the -F option to set the field separator to a colon and put our awk program in single quotes. The $2 ~ means "look in the second field for the following pattern" (which has to be inside the / / characters); the action then prints the first field of any matching line. The space in the pattern is escaped with a \ character here, although awk also matches an unescaped space inside / / literally. We can redirect the output to a file, giving us a complete list of all the graphic file paths and names in a single file. We can then read that list to copy out our graphic images... but there are some problems we need to overcome....
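
To compare the two approaches side by side, here is a small self-contained sketch. It assumes a throwaway directory created with mktemp standing in for $reppath, and a three-line sample of listoffiles; neither is real case data. The awk pattern is written with a plain space, which awk matches literally.

```shell
# Throwaway stand-in for $reppath (an assumption for this demo only).
reppath=$(mktemp -d)
mkdir -p "$reppath/tmp"

# Small sample in the same format as listoffiles.
cat > "$reppath/tmp/listoffiles" <<'EOF'
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/image data.doc: Microsoft Office Document
EOF

# grep matches anywhere on the line, so the .doc file slips in.
grep_hits=$(grep 'image data' "$reppath/tmp/listoffiles" | cut -d: -f 1)

# awk tests only the description field, then writes the paths to a file.
awk -F: '$2 ~ /image data/ {print $1}' "$reppath/tmp/listoffiles" > "$reppath/tmp/liveimagefiles.txt"

echo "grep hits:"
printf '%s\n' "$grep_hits"
echo "awk hits:"
cat "$reppath/tmp/liveimagefiles.txt"
```

The grep list includes the "image data.doc" path, while liveimagefiles.txt holds only the PNG and JPEG paths.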



