Hex (base 16) encoding
Martin McBride, 2017-04-09
Tags binary encoding hex base16
Categories binary encoding data formats
This is probably the simplest method of encoding binary data. It is the least efficient of all the methods surveyed, because the encoded data is twice the size of the raw binary (slightly more once line breaks are added). But it has the advantage of being human readable.
Hex encoding is ubiquitous, but there is no single widely accepted standard. RFC 3548, since obsoleted by RFC 4648, attempts to define a standard for Base16 (i.e. hex) encoding, but bear in mind that these are relatively recent documents, and a large number of people had been implementing their own idea of hex encoding for several decades before that.
In this scheme, each binary byte is represented by a two-character encoding. Those characters hold an ASCII version of the byte’s hex value. For instance, the binary value 0x1A would be represented by the two-character text string "1A".
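As a rough illustration of the mapping, the sketch below splits a byte into its upper and lower four bits and looks each half up in a table of hex digits. The function name encode_byte and the table name HEX_DIGITS are hypothetical, chosen only for this example, and upper-case output is an assumption.

```python
# Hypothetical lookup table: index 0-15 maps to its hex digit (upper case assumed).
HEX_DIGITS = "0123456789ABCDEF"


def encode_byte(value: int) -> str:
    """Return the two-character hex representation of a single byte."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte (0-255)")
    high = value >> 4    # upper four bits select the first character
    low = value & 0x0F   # lower four bits select the second character
    return HEX_DIGITS[high] + HEX_DIGITS[low]


print(encode_byte(0x1A))  # 1A
```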
It is recommended that the encoder should include a line break (CR LF pair) at regular intervals, perhaps every 80 characters of output. This ensures compatibility with systems that process data on a line-by-line basis, and it also makes it easier to view the data in a text editor.
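One way an encoder might add those line breaks is sketched below. The function name hex_encode_wrapped and its line_length parameter are assumptions made for this example. The built-in bytes.hex() method produces lower-case digits, so the result is converted to upper case here.

```python
def hex_encode_wrapped(data: bytes, line_length: int = 80) -> str:
    """Hex-encode data, inserting a CR LF pair after every line_length characters."""
    if line_length <= 0:
        raise ValueError("line_length must be positive")
    text = data.hex().upper()  # two hex characters per input byte
    # Cut the encoded text into chunks of at most line_length characters.
    lines = [text[i:i + line_length] for i in range(0, len(text), line_length)]
    return "\r\n".join(lines)


# 50 zero bytes encode to 100 characters: one line of 80, then one of 20.
print(hex_encode_wrapped(bytes(50)))
```

Joining the chunks, rather than appending a break after each one, means the output does not end with a trailing line break. Whether a final CR LF is wanted is a design choice left to the encoder.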
As a practical example, consider how we would encode the following sequence of 5 bytes:
0x12 0x34 0x56 0x78 0x9A
This would simply be translated into a ten-character string:
123456789A
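The example can be checked with Python's standard library. The snippet below is only a quick verification, using the built-in bytes.hex() method and base64.b16encode, which both perform this kind of encoding.

```python
import base64

data = bytes([0x12, 0x34, 0x56, 0x78, 0x9A])

# bytes.hex() returns lower-case digits, so convert to upper case.
encoded = data.hex().upper()
print(encoded)       # 123456789A
print(len(encoded))  # 10

# base64.b16encode returns upper-case ASCII bytes.
encoded_b16 = base64.b16encode(data).decode("ascii")
print(encoded_b16)   # 123456789A
```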
A decoder might encounter data which does not completely conform to the specification above. It is then up to the decoder to decide whether to ignore the discrepancy, or indicate an error. Without any widely accepted specification for hex encoding, it is to a large extent a matter of opinion how seriously each error should be taken. Here are some of the main error cases:
Whitespace characters - if the data contains spaces, line breaks and other whitespace characters, it is probably safe to ignore them and decode the data as if they were not there. On the other hand, a decoder should not rely on the data having line feeds, and should be able to cope with arbitrarily long lines.
Lower case characters - hex data might contain characters a-f instead of A-F. A decoder should accept this data, and treat upper and lower case characters as identical.
Illegal characters - if the data contains other characters (symbols, letters beyond F, etc.), then the situation is more serious. The encoder probably didn't put these characters in, so their presence tends to indicate data corruption or a similar problem.
Incomplete last byte - if the data stream contains an odd number of hex characters, then the data might have been truncated or otherwise corrupted.
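The cases above could be handled along the following lines. This is a minimal sketch rather than a standard implementation. The names hex_decode and HexDecodeError are hypothetical, and the decision to reject illegal characters and odd-length input outright, rather than attempt any recovery, is an assumption. A real decoder might reasonably choose differently.

```python
import string


class HexDecodeError(ValueError):
    """Hypothetical error type raised when the input cannot be decoded."""


def hex_decode(text: str) -> bytes:
    """Decode hex text, tolerating whitespace and mixed-case digits."""
    # Whitespace characters: ignore spaces, tabs and line breaks entirely,
    # so lines may be any length, or absent altogether.
    digits = [ch for ch in text if not ch.isspace()]

    # Illegal characters: anything outside 0-9, a-f, A-F is treated as an error.
    for position, ch in enumerate(digits):
        if ch not in string.hexdigits:
            raise HexDecodeError(
                f"illegal character {ch!r} at hex digit position {position}"
            )

    # Incomplete last byte: an odd number of digits suggests truncation.
    if len(digits) % 2 != 0:
        raise HexDecodeError("odd number of hex digits; data may be truncated")

    # Lower case characters: int(..., 16) accepts both cases.
    return bytes(
        int(digits[i] + digits[i + 1], 16) for i in range(0, len(digits), 2)
    )


print(hex_decode("12 34\r\n56 78 9a"))
```

For whitespace-free input, the built-in bytes.fromhex() performs a similar job, although its handling of the error cases is fixed by the Python implementation rather than chosen by the caller.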