/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators
Each line of the file holds two fields separated by a colon. The first field is the full path and filename; the second field (after the colon) is a description of the file taken from the magic file. Notice that the file command can identify ASCII text files.
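As a minimal sketch of that two-field layout, a single line can be split at the first colon using shell parameter expansion. The sample line is taken from the listing above; the variable names line, path and desc are only illustrative and are not part of the script we are building.

```shell
#!/bin/sh
# One sample line in the "path: description" layout produced by the file command.
line='/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators'

# Everything before the first colon is the path (first field).
path=${line%%:*}
# Everything after the first ": " is the description (second field).
desc=${line#*: }

printf 'path: %s\n' "$path"
printf 'desc: %s\n' "$desc"
```

This only behaves well when the path itself contains no colon, which is an assumption about the data rather than a guarantee.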
Remember that our database is a file called listoffiles, located in the path $reppath/tmp/.
If we wanted to create a list of graphic files then we could use a command like this:
grep 'image data' $reppath/tmp/listoffiles | cut -d: -f 1 > $reppath/tmp/liveimagefiles.txt
The above command searches each line of our file list for the string "image data", then pipes the results to the cut command, which prints the first field of any matched lines. The output is redirected to liveimagefiles.txt, which then contains the following:
/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg
In the example above, everything works fine, but now consider this data set.
/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators
/media/sda1/stuff/image data.doc: Microsoft Office Document
Using our previous command we now get these results:
/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg
/media/sda1/stuff/image data.doc
The final line is a false positive. We can avoid this by searching only the second field (the file description) of each line of the file list and printing the first field of any matches. We can use the awk command to do this. In its simplest form awk will break down a line of input into fields. By default the field separator is white space; however, you can change the field separator by using the -F option, then select which field to print with the print statement. So, we could print all the file paths and names (the first field in our listoffiles) with this command:
awk -F: '{print $1}' $reppath/tmp/listoffiles
If we wanted to print just the descriptions (the second field) then we could use this:
awk -F: '{print $2}' $reppath/tmp/listoffiles
We can see that awk uses the nomenclature $1, $2, $3 etc to name each field in the lines of input.
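To see the field numbering in action, the following sketch feeds one sample line from the listing above to awk and prints the number of fields along with each field. NF is awk's built-in count of fields on the current line; the shell variable names are only illustrative.

```shell
#!/bin/sh
# One sample line from the file list.
line='/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract'

# With -F: awk splits the line at each colon; NF holds the number of fields found.
fields=$(printf '%s\n' "$line" | awk -F: '{ print NF }')
first=$(printf '%s\n' "$line" | awk -F: '{ print $1 }')
second=$(printf '%s\n' "$line" | awk -F: '{ print $2 }')

echo "fields: $fields"
echo "first: $first"
echo "second:$second"
```

Note that the second field keeps the space that follows the colon, because awk splits exactly at the separator.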
If we wanted to search the second field for our "image data" pattern then we could use this command:
awk -F: '$2 ~ /image\ data/ {print $1}' $reppath/tmp/listoffiles
We use the -F option to set the field separator to a colon, and we put our awk program in single quotes. The $2 ~ means look at the second field and match it against the following pattern (which has to be inside the / / characters), then print the first field of any matching lines. The white space in our pattern is escaped here with a \ character, although awk also accepts an unescaped space inside the / / characters. We can redirect the output to a file, meaning that we have a complete list of all the graphic file paths and names in a single file. We can read that list of graphic files to copy out our graphic images... but there are some problems we need to overcome.
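The difference between the two approaches can be checked with a small self-contained sketch. It assumes a scratch directory created with mktemp standing in for $reppath/tmp, and a cut-down copy of the sample data set; the names workdir, list and grep_hits.txt are hypothetical. The awk pattern is written with a plain space, which awk's regular expressions treat as an ordinary character.

```shell
#!/bin/sh
# Hypothetical scratch directory standing in for $reppath/tmp.
workdir=$(mktemp -d)
list="$workdir/listoffiles"

# A cut-down copy of the sample data set, including the awkward file name.
cat > "$list" <<'EOF'
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phantom.zip: Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/image data.doc: Microsoft Office Document
EOF

# grep matches anywhere on the line, so the .doc file slips through.
grep 'image data' "$list" | cut -d: -f 1 > "$workdir/grep_hits.txt"

# awk tests only the second field (the description).
awk -F: '$2 ~ /image data/ {print $1}' "$list" > "$workdir/liveimagefiles.txt"

# Count the lines in each result file.
grep_count=$(awk 'END { print NR }' "$workdir/grep_hits.txt")
awk_count=$(awk 'END { print NR }' "$workdir/liveimagefiles.txt")
echo "grep matched $grep_count lines, awk matched $awk_count lines"
```

With this sample data the grep version picks up the .doc file as well as the two images, while the awk version lists only the two images.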