Crimminey! All this stuff about code pages and Unicode and UTF is about as comprehensible as the unearthly groaning and hissing that I use to verbally communicate. Let your humble dead forensicator try to put it all into some semblance of order - we need to know this for when I talk about "keyword searching" in a future post. We will dispel some myths along the way, things that have been wrongly suggested to me as true over the years.
Let's skip computing's very early history and jump to the early IBM machines.
They used punch cards to store data; the most widely adopted punch card had 80 columns and 12 rows. The card was punched at intersections of the rows and columns, and the locations of the punches represented numbers and letters. However, there was no need to distinguish between upper and lower case letters as there were no word processing packages (Microsoft Office Victorian never really took off, I guess). Only a small set of other characters needed to be represented. As a result, the character range represented was only 47: 0-9, A-Z and some other characters. So at this point in time, the concept of binary storage was absent from computers. All that changed in the mid-1950s with the advent of IBM's computer hard disk. The first model, the IBM 350, had a capacity of 30 million bits.
Storing a single bit was eye-wateringly expensive. There was a need not only to move to the concept of binary encoding, but also to ensure the smallest data unit (later named a byte) was as efficient as possible. The old punch cards could represent 47 characters, and all of those characters could be represented with just 6 bits, so IBM grouped their bits into groups of 6 to form their data units. Underlying these groupings was the concept of a truth table.
Each group of 6 bits was assigned a numeric value, and a truth table was consulted to see what character the numeric value represented - this concept is still used today and is often referred to as "character encoding".
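To make the idea concrete, here is a toy illustration in Python (the table below is invented for illustration - it is not IBM's actual 6 bit code):

```python
# A toy "truth table": numeric values mapped to the characters they represent.
# The values are made up for illustration, not taken from IBM's real 6 bit code.
truth_table = {1: "A", 2: "B", 3: "C", 17: "7", 18: "8"}

stored_value = 2                  # the number read back from storage
print(truth_table[stored_value])  # -> "B": the character that number represents
```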
IBM soon spotted the limitation of 6 bit groupings, especially when they came to thinking about word processing: simply adding lower case letters to the truth table would require 73 characters, and that is before you even consider the additional punctuation marks. This led to 7 bit groupings, which were introduced with the IBM 355 hard disk. This meant that 128 characters could be represented in the truth table for a 7 bit scheme - all upper/lower case letters, punctuation marks and the main mathematical characters could easily be represented, with space still left in our truth table for more at a later date.
Looking back, we can laugh and ask why they didn't simply use an 8 bit byte, as 8 is a power of 2 and thus much more binary-friendly. Well, the reality was that the IBM 355 stored 6 million 7 bit groupings. The cost of storing a single bit was in the tens-of-dollars region. Moving to an 8 bit grouping would mean that the 8th bit would seldom be used (there were already unclaimed slots in our 7 bit truth table), resulting in thousands of dollars of redundant storage. There was a conflict here between the coders and the engineers: using 7 bit bytes made for less efficient code, while using 8 bit bytes made for inefficient storage. Someone had to win, and it was the coders who emerged glorious and victorious once the fog of battle had cleared. Thus the term byte was coined and the 8 bit grouping settled on.
Interestingly, IBM actually experimented with 9 bit bytes at some point to allow error checking. However, 8 bit groupings won on the basis of efficiency. So, historically, there have been many battles over the optimum number of bits to group into a data unit. This explains why IP addresses and the payload of internet data packets are described as "octets" - the early developers were explicitly stating that the bits should be put into 8 bit groupings, as opposed to any of the other approaches to grouping bits then in existence.
8 bit bytes were settled on, but it was important that the truth table for these bytes was standardised across all systems. Much like the VHS vs Betamax war (those of you, like myself, born at a more comfortable distance from the apocalypse will recognise that I am talking about video tape standards here), there were two main competing systems vying to be the agreed standard for the truth tables. The first was EBCDIC, an extension of an older code called BCDIC, which was a means of encoding those 47 punch card characters with 6 bits; EBCDIC itself was a full 8 bit character encoding truth table. The main competing standard was ASCII (American Standard Code for Information Interchange).
MYTH No 1. ASCII is an 8 bit encoding scheme.
Not true, it wasn't then and isn't now. It is a 7 bit encoding scheme.
The fact that EBCDIC used the same number of bits as the newly defined 8 bit byte may explain why, for most of the 1970s, it was winning the war to be the standard. Ultimately, ASCII offered more advantages than EBCDIC, in that the alphabet characters were sequential in the truth table and the unused 8th bit could be utilised as a parity bit for error checking, so ASCII became the accepted standard for character interchange coding.
So now we had our 8 bit data transmission/storage unit, known as a byte, and an agreed standard for our truth table - ASCII. As ASCII uses 7 bits, 128 characters could be defined. The first 32 characters in the ASCII truth table are non-printable, e.g. Carriage Return, Line Feed, Null, etc.; the remaining character space defines the upper/lower case letters, the numbers 0-9, punctuation marks and basic mathematical operators.
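You can see that table in action with Python's built-in ord() and chr() functions, which convert between a character and its numeric value:

```python
# ASCII in action: every character maps to a number between 0 and 127.
print(ord("A"), ord("a"), ord("0"))         # 65 97 48
print(repr(chr(13)), repr(chr(10)))         # '\r' '\n' - Carriage Return and Line Feed, two of the 32 non-printables
print(max(ord(c) for c in "Hello zombie"))  # 122 - everything fits comfortably below 128
```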
All would have been well if only the English speaking world wanted to use computers... obviously the ASCII table only allowed for English letter characters. Unsurprisingly, citizens of other nations wanted to use computers, and to use them in their native language. This posed a significant problem: how to transmit and store data in a multitude of languages? English and a lot of Western European languages use the same characters for letters - Latin script letters. Even then, there were problems with cedillas, accents and umlauts being appended to letters. Things got more complicated moving into Central and Eastern Europe, where Greek script and Cyrillic script are used to represent the written form of the local languages.
The solution was to use that 8th bit in ASCII, which gave 128 additional slots in the truth table; developers could define characters for their language script using those slots. The resultant truth tables were known as "code pages".
Myth 2:
Code pages contain only non-English language characters.
Not true. Most code pages are an extension of the ASCII standard, so a code page contains the same characters as the 7 bit ASCII encoding plus up to 128 additional language characters.
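A quick way to see this in Python (cp1251, a Cyrillic code page, is used purely as an example) is to look at where the byte values fall:

```python
# In an extended ASCII code page the lower half is plain ASCII and the upper
# half (values 128-255) holds the extra language characters.
print("z".encode("cp1251")[0])   # 122 - identical to the 7 bit ASCII value
print("ж".encode("cp1251")[0])   # 230 - a Cyrillic letter occupying one of the upper 128 slots
```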
The implication here is that if I prepare a text file with my GB keyboard, using English language characters, and save the file with any of the code pages designed for use with, say, Cyrillic script, then the resultant data will be the same as if I had saved the file using plain old ASCII - my text only uses characters in the original ASCII character set. So if you are looking for the word "zombie" in a data set but don't know whether any code pages have been used, there is no need to worry: if the word was encoded using any of the extended ASCII code pages, you don't have to experiment with any code page search parameters.
Myth 3: The bad guys can hide ASCII messages by saving them in extended ASCII code pages. Not true..see above!
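If you want to convince yourself, a minimal check (leaning on Python's bundled codecs as stand-ins for the old code pages) shows the bytes come out identical:

```python
# For characters within the original 7 bit ASCII range, the extended code pages
# all produce exactly the same bytes - "hiding" an ASCII message this way changes nothing.
text = "zombie"
for enc in ["ascii", "cp1252", "cp1251", "iso8859_5", "cp866"]:
    print(f"{enc:10} {text.encode(enc)}")   # every line prints b'zombie'
```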
So, the use of code pages solved the problem of encoding some foreign language characters. I say "some" because there are a number of languages that have far more characters in their alphabet than can be stored in the 128 slots at the upper end of the ASCII table. This fact, coupled with the near impossibility of translating some code pages into others, led to the development of Unicode. Unicode is essentially a superset of all the code pages. It is a truth table that seeks to assign a numeric value to every "character" in every language in the world; in addition, the truth table also includes symbols used in other fields such as science and mathematics. Each slot in the Unicode truth table is referred to as a "code point". The concept of "characters" also needs extending. Many of the "character" renderings in Arabic script depend on the characters either side of them. In other languages, several letters may be combined to form a single character. These rendered forms are referred to as "glyphs", and the Unicode standard emphasises the concept of glyphs. It follows that if your software has looked up the numeric value of a "character" in the Unicode truth table, then that software (or your OS) must have the corresponding glyph installed to display the "character" correctly to the user.
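To make "code point" concrete, here is a small Python sketch printing the numeric slot that Unicode assigns to a few arbitrarily chosen characters:

```python
# Every character occupies a numbered slot (a code point) in the Unicode truth table.
for ch in ["A", "é", "Ж", "中", "😀"]:
    print(f"U+{ord(ch):04X}  {ch}")
# "A" is U+0041 - the same value it has in ASCII;
# the emoji sits way up at U+1F600.
```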
Well in excess of 1,000,000 code points exist in Unicode. Fundamentally, then, computers perform word processing (and other operations involving character strings) with numbers. Those numbers are code points; the code point is looked up in the Unicode truth table and the corresponding glyph displayed on the screen. Those numbers also need to be represented in binary. No problem... we could represent every number (or code point) in our Unicode truth table with 3 bytes. However, this approach is RAM inefficient. Unicode divides its full repertoire of characters into "planes" of 65,536 code points. The first plane contains the characters for most modern languages, so we only need 2 bytes to represent those - we are wasting one byte per glyph if we use a 3 byte value. The problem is even worse when dealing with English characters: their code points are at the start of the Unicode truth table (making this part of Unicode backwardly compatible with ASCII), so they only need 1 byte per character - 2 bytes are therefore being wasted under a 3 byte scheme. What was needed were schemes to encode those numbers (code points!); the most popular currently in use are:
UTF-8
UTF-16
UTF-32
Myth 4: UTF-8 uses 1 byte, UTF-16 uses 2 bytes, UTF-32 uses 4 bytes.
Not entirely true. UTF-16 does use 16 bit units and UTF-32 uses 32 bit units, but UTF-8 is a variable length scheme - it can use anywhere from 8 to 32 bits (and, as we will see shortly, UTF-16 sometimes needs two of its 16 bit units).
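A quick way to see the difference, using Python's codec names (utf-8, utf-16-le, utf-32-le) as stand-ins for the raw schemes:

```python
# Bytes needed to encode the same character under each scheme.
for ch in ["A", "é", "中", "😀"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes - variable length
          len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes - the emoji lies outside the first plane
          len(ch.encode("utf-32-le")))  # always 4 bytes
```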
There are pros and cons to each encoding method. UTF-32 means that every character can be encoded with a fixed length value occupying 4 bytes; however, this is inefficient, for the previously stated reasons. UTF-8 is a variable length encoding. If the encoded code point is an ASCII text character then the first bit is set to zero and the remaining 7 bits (we only need one byte for ASCII!) are used to store the value of the code point. It is fully backward compatible with the original ASCII standard. Thus if you are searching for an English language string in a UTF-8 data set, you don't need to set any special parameters in your search configuration.
If the UTF-8 encoded character requires more than 1 byte, then the leading bits of the first byte are set to reflect the number of bytes used in the sequence (which can be 2, 3 or 4 bytes). Thus in a two byte sequence the first 2 bits are set to 1, in a 3 byte sequence the first 3 bits are set to 1, and so on.
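You can see those leading bits by dumping the UTF-8 bytes in binary (a small Python sketch; the sample characters are arbitrary):

```python
# The leading bits of the first byte announce how many bytes the character uses;
# the continuation bytes that follow always begin with 10.
for ch in ["A", "é", "中", "😀"]:
    print(ch, [f"{b:08b}" for b in ch.encode("utf-8")])
# 0xxxxxxx                    -> 1 byte (plain ASCII)
# 110xxxxx 10xxxxxx           -> 2 bytes
# 1110xxxx 10xxxxxx 10xxxxxx  -> 3 bytes
# 11110xxx 10xxxxxx ...       -> 4 bytes
```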
UTF-16 encoding uses fixed 16 bit units; however, for characters above the first plane of Unicode, a pair of 16 bit units (a "surrogate pair") is needed.
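For example (using Python's utf-16-be codec so that the byte order is explicit), a character beyond the first plane comes out as four bytes - one surrogate pair:

```python
# U+1F600 lies above the first plane, so UTF-16 spends two 16 bit units on it.
ch = "😀"
data = ch.encode("utf-16-be")
print(data.hex(" "), "-", len(data), "bytes")   # d8 3d de 00 - 4 bytes
```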
Conceptually it is quite straightforward; however, with UTF-16 and UTF-32 there is the issue of endian-ness: should the encoded value be stored big endian or little endian?
One approach is to use a Byte Order Mark (BOM) - a short run of bytes at the start of the encoded data that indicates the endian-ness being used (two bytes for UTF-16, four for UTF-32).
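The BOM values themselves are fixed, and Python's codecs module exposes them as constants if you want to check what you are looking at:

```python
import codecs

# The BOM is just the code point U+FEFF written at the front of the stream;
# which way round its bytes appear reveals the endian-ness of what follows.
print(codecs.BOM_UTF16_LE.hex())  # fffe - little endian UTF-16
print(codecs.BOM_UTF16_BE.hex())  # feff - big endian UTF-16
print(codecs.BOM_UTF32_LE.hex())  # fffe0000 - little endian UTF-32
print(codecs.BOM_UTF8.hex())      # efbbbf - UTF-8 files sometimes carry one too
```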
From a programmer's point of view, you want your application to know what encoding scheme is being used to encode the characters. Your average user doesn't really care what encoding scheme is being used, as long as the string is rendered correctly. The problem is for us forensicators who, given a large data set (especially that stuff in unallocated space), really need to know what encoding schemes are in use, as this will greatly affect our keyword searching strategy. There are no fixed signatures in a stream of characters encoded with the old school code pages. Certainly you can experiment by viewing files in different code pages to see if the resultant data looks like a valid and coherent script - some of the forensic tools can be configured to do this as part of their analysis routines. Forensicators in the English speaking world (and those using a written language based on Latin script) have it fairly easy. But what if your suspect is a foreign national? Even if they have an English language OS installed on their machine, an English character keyboard and an English language word processing package, this doesn't exclude the possibility of them having documents, emails or web pages in a foreign language on their system.
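One crude way to experiment along those lines, sketched in Python with a made-up byte sample and an assumed list of candidate code pages, is to decode the same bytes under each candidate and eyeball which result reads as coherent script:

```python
# Decode the same mystery bytes under several candidate code pages;
# at most one of them is likely to produce coherent text.
mystery = bytes([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2])   # illustrative sample, origin unknown
for enc in ["cp1252", "cp1251", "koi8_r", "iso8859_7"]:
    try:
        print(f"{enc:10} -> {mystery.decode(enc)}")
    except UnicodeDecodeError:
        print(f"{enc:10} -> (not valid in this code page)")
```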
Myth 5: You can find the word "hello" in a foreign language data set by typing the string "hello" into your keyword searching tool and selecting the appropriate code page for the relevant foreign language.
This is so wrong, and I have heard this view being expressed several times.
You need to understand that when doing keyword searching you are, programmatically, actually doing number searching, those numbers being the code points in a particular truth table. It follows that, first of all, you need to know the type of truth table used to encode the characters; but before that you need to translate the English word you are looking for (hello) into the specific language you are interested in (and there may be several different possible translations). In a non-Latin script based language there will be no congruence between the old school code pages and the new school Unicode encoding schemes. So you will need to search for the code point numbers as encoded in UTF-8, UTF-16, UTF-32 and any relevant code pages. There are some tools, particularly e-discovery tools, that advertise that they can detect foreign language based data, but this is different from detecting the encoding method used.
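A hedged sketch of that idea in Python: translate the term first (here the Russian "привет" stands in for "hello"), generate its byte pattern under each candidate encoding, then scan the raw data for every pattern. The encoding list and the toy data buffer are assumptions for illustration, not a definitive recipe.

```python
# Build the byte patterns for one search term under several encodings,
# then scan a raw byte buffer (e.g. a dump of unallocated space) for each pattern.
def search_all_encodings(data: bytes, term: str, encodings: list) -> dict:
    hits = {}
    for enc in encodings:
        try:
            pattern = term.encode(enc)
        except UnicodeEncodeError:
            continue                      # the term cannot be represented in this code page
        offsets, start = [], data.find(pattern)
        while start != -1:
            offsets.append(start)
            start = data.find(pattern, start + 1)
        hits[enc] = (pattern, offsets)
    return hits

term = "привет"   # one possible translation of "hello"
encodings = ["utf-8", "utf-16-le", "utf-16-be", "cp1251", "koi8_r", "cp866"]

# Toy buffer containing the word encoded in two different ways.
blob = b"\x00junk" + term.encode("cp1251") + b"\xff\xff" + term.encode("utf-8")
for enc, (pattern, offsets) in search_all_encodings(blob, term, encodings).items():
    print(f"{enc:10} pattern {pattern.hex()}  hits at {offsets}")
```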
All in all, foreign language detection in forensics is challenging to say the least. So many issues about digital forensics are discussed, yet I see very little around foreign language handling and character encoding. If you have any experience in this field then please feel free to comment.
One of the programs on my previewing system will attempt to identify the language used in non-English Office documents, PDFs, emails and web pages; it will also output the data as plain text so that it can be copied and pasted into Google Translate to get at least an idea of the theme of the data. I will post that code in a future posting; for now, any comments about this whole topic will be gratefully received.