Hex (base 16) encoding

By Martin McBride, 2017-04-09
Tags: binary encoding hex base16
Categories: binary encoding data formats

This is probably the simplest method of encoding binary data. It is also the least efficient of the methods surveyed, because the encoded data is twice the size of the raw binary (slightly more if line breaks are added). But it has the advantage of being human-readable.

Hex encoding is ubiquitous, but there is no single widely accepted standard. RFC 3548 (published in 2003 and since superseded by RFC 4648) defines a standard for Base16 (i.e. hex) encoding, but bear in mind that a large number of people had been implementing their own idea of hex encoding for several decades before that.

Algorithm

In this scheme, each byte is represented by two characters: the ASCII digits of the byte’s hex value, most significant digit first. For instance, the byte value 0x1A is represented by the 2-character text string "1A".
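The mapping described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the names `HEX_DIGITS` and `hex_encode` are made up for this example, and uppercase output is an assumption (lowercase is equally common).

```python
# Lookup table of hex digits; uppercase is an assumption made for this sketch.
HEX_DIGITS = "0123456789ABCDEF"

def hex_encode(data: bytes) -> str:
    """Encode each byte as two hex characters, high nibble first."""
    out = []
    for byte in data:
        out.append(HEX_DIGITS[byte >> 4])    # top 4 bits
        out.append(HEX_DIGITS[byte & 0x0F])  # bottom 4 bits
    return "".join(out)

print(hex_encode(bytes([0x1A])))  # prints 1A
```

Each byte is split into two 4-bit halves, and each half indexes into the 16-entry digit table, which is why the output is always exactly two characters per input byte.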

It is recommended that the encoder should include a line break (CR LF pair) every so often, perhaps every 80 characters of output. This ensures compatibility with systems which process data on a line-by-line basis, and it also makes it easier to view the data in a text editor.
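One possible way to add the line breaks is sketched below. The function name `hex_encode_wrapped` and its `width` parameter are hypothetical, and the sketch assumes an even line width so that the two characters of a byte are never split across lines. It relies on the built-in `bytes.hex()` method (Python 3.5 and later).

```python
def hex_encode_wrapped(data: bytes, width: int = 80) -> str:
    """Hex-encode data, inserting a CR LF pair after every `width` characters."""
    if width <= 0 or width % 2 != 0:
        # An even width keeps both characters of a byte on the same line.
        raise ValueError("width must be a positive even number")
    text = data.hex().upper()
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return "\r\n".join(lines)

# 50 bytes give 100 hex characters: one full line of 80 and one of 20.
wrapped = hex_encode_wrapped(bytes(range(50)))
print(wrapped)
```

Note that this sketch puts line breaks only between lines, not after the final one; whether a trailing break is wanted depends on the consuming system.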

Example

As a practical example, consider how we would encode the following sequence of 5 bytes:

0x12 0x34 0x56 0x78 0x9A

This would simply be translated into a ten-character string:

"123456789A"
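The example can be checked with Python's built-in `bytes.hex()` and `bytes.fromhex()` methods. This is only a quick verification; `bytes.hex()` produces lowercase digits, so the result is converted to uppercase to match the string above.

```python
data = bytes([0x12, 0x34, 0x56, 0x78, 0x9A])

encoded = data.hex().upper()      # bytes.hex() returns lowercase digits
print(encoded)                    # prints 123456789A
print(len(encoded))               # prints 10

decoded = bytes.fromhex(encoded)  # round-trip back to the original bytes
print(decoded == data)            # prints True
```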

Error Conditions

A decoder might encounter data which does not completely conform to the specification above. It is then up to the decoder to decide whether to ignore the discrepancy or indicate an error. Without any widely accepted specification for hex encoding, it is to a large extent a matter of opinion as to how seriously each error should be taken. Here are some of the main error cases:

Whitespace characters - if the data contains spaces, line breaks and other whitespace characters, it is probably safe to ignore them and decode the data as if they were not there. On the other hand, a decoder should not rely on the data having line feeds and should be able to cope with arbitrarily long lines.

Lowercase characters - hex data might contain characters a-f instead of A-F. A decoder should accept this data, and treat upper and lower case characters as identical.

Illegal characters - if the data contains other characters (symbols, letters beyond F, etc.), then the situation is more serious. The encoder probably didn't put these characters in, so their presence tends to indicate data corruption or similar.

Incomplete last byte - if the data stream contains an odd number of hex characters, then the data might have been truncated or otherwise corrupted.
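The cases above could be handled by a decoder along the following lines. This is a sketch of one reasonable policy, not a standard: it ignores whitespace, accepts either case, and treats illegal characters and an odd digit count as errors. The names `HexDecodeError` and `hex_decode` are invented for this example.

```python
import string

class HexDecodeError(ValueError):
    """Raised when the input cannot be decoded under this sketch's policy."""

def hex_decode(text: str) -> bytes:
    digits = []
    for position, ch in enumerate(text):
        if ch.isspace():
            continue  # whitespace (spaces, CR, LF, tabs) is silently skipped
        if ch not in string.hexdigits:
            raise HexDecodeError(
                f"illegal character {ch!r} at position {position}")
        digits.append(ch)
    if len(digits) % 2 != 0:
        raise HexDecodeError("odd number of hex digits; data may be truncated")
    # int(..., 16) accepts both upper and lower case digits.
    return bytes(int(digits[i] + digits[i + 1], 16)
                 for i in range(0, len(digits), 2))

print(hex_decode("12 34\r\n56 78 9a") == bytes([0x12, 0x34, 0x56, 0x78, 0x9A]))
```

A more lenient decoder might skip illegal characters or drop a trailing half-byte instead of raising; which choice is right depends on how much the application trusts its input.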
