What is binary encoding?
What is binary encoding, and why is it useful? All is explained here.
Binary, ASCII and Text data
Binary data is a sequence of 8-bit bytes, where each byte can have a value between 0x00 and 0xFF. In general, we can’t assume much about this data, except that any byte could potentially have any value.
ASCII data represents text as a sequence of bytes. In the ASCII system, byte values in the range 0x00 to 0x7F are used to represent English language letters (upper and lower case), numerals, punctuation symbols, and various "control characters". Byte values of 0x80 and above have no defined meaning in ASCII.
Since ASCII data is not expected to contain byte values of 0x80 or greater (ie with the most significant bit set), it is often called 7-bit data.
Printable characters in ASCII are values in the range 0x21 to 0x7E, which includes letters a-z, A-Z, digits 0-9 and all standard punctuation.
Whitespace in ASCII consists of the space character (0x20), carriage return (CR, 0x0D), line feed (LF, 0x0A) and tab (0x09).
Text data is ASCII data which only contains printable and whitespace characters.
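The definitions above can be sketched as a pair of small checks. This is only an illustration: the function names `is_ascii` and `is_text` are our own, and the byte ranges are taken directly from the definitions in this section.

```python
# Byte values taken from the definitions above.
PRINTABLE = set(range(0x21, 0x7F))      # 0x21 to 0x7E inclusive
WHITESPACE = {0x20, 0x0D, 0x0A, 0x09}   # space, CR, LF, tab


def is_ascii(data: bytes) -> bool:
    """True if no byte has its most significant bit set (7-bit data)."""
    return all(b < 0x80 for b in data)


def is_text(data: bytes) -> bool:
    """True if every byte is a printable or whitespace ASCII character."""
    return all(b in PRINTABLE or b in WHITESPACE for b in data)


print(is_text(b"Hello, world!\r\n"))  # True
print(is_ascii(b"\x00\x07abc"))       # True: control characters are still ASCII
print(is_text(b"\x00\x07abc"))        # False: NUL and BEL are not text
print(is_ascii(b"\xff\x80"))          # False: most significant bit is set
```

Note that every piece of text data is ASCII data, but not the other way round: ASCII data may still contain control characters.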
Problems with Binary Data
If a system is designed to handle text data, it might make certain assumptions about that data. This can easily cause the system to fail if binary data is passed through it. Here are some of the most common problems:
Line endings - different computer operating systems have different conventions for representing line endings. Some use a CR character, some use an LF character, and some use CR followed by LF. Some systems try to be helpful by automatically substituting these characters. This is great for genuine text data but absolutely disastrous for binary data.
Tab substitution - in a similar way, some systems automatically replace tab characters with multiple spaces, or vice versa.
Special characters - some systems assign special meanings to particular non-printable characters. For instance, some text systems use "end of data" control characters, and might terminate the data when they find such a character. Typically NUL (0x00), Ctrl-D (0x04) or Ctrl-Z (0x1A) are used for this purpose. Some systems even emit a beep when they encounter the BEL character (0x07)!
Line length - some systems process text on a line-by-line basis, and they often make assumptions about how long text lines will be (eg 80 characters maximum). If a file is encountered where the lines are too long, it might lead to data loss, program errors, or even a crash.
As we noted earlier, lines are delimited by either CR, LF or CRLF characters. But in a binary file, there is no reason to suppose that these characters will appear regularly, if at all.
Rejection - some systems scan the data for non-text characters, and simply refuse to process binary data.
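To make a couple of these problems concrete, here is a minimal sketch of two imaginary text-handling steps applied to binary data. Both functions are hypothetical stand-ins for what a text-based system might do internally; they are not taken from any real system.

```python
def normalise_line_endings(data: bytes) -> bytes:
    """Hypothetical 'helpful' text system: converts every CRLF to a lone LF."""
    return data.replace(b"\r\n", b"\n")


def truncate_at_nul(data: bytes) -> bytes:
    """Hypothetical text system that treats NUL (0x00) as end of data."""
    return data.split(b"\x00", 1)[0]


# Six bytes of binary data that happen to contain CR, LF and NUL.
binary = bytes([0x89, 0x0D, 0x0A, 0x00, 0xFF, 0x10])

after_newlines = normalise_line_endings(binary)
after_nul = truncate_at_nul(binary)

print(binary.hex(" "))          # 89 0d 0a 00 ff 10
print(after_newlines.hex(" "))  # 89 0a 00 ff 10  (one byte silently lost)
print(after_nul.hex(" "))       # 89 0d 0a        (everything after NUL lost)
```

In both cases the system has done something perfectly sensible for text, yet the binary data that comes out is no longer the data that went in.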
A Solution – Binary Encoding
We have listed some of the possible problems with processing binary data in a text-based system. Of course, some systems are more robust than others, but you are likely to encounter one or more of these types of problems in many cases.
A solution to this problem is to use binary encoding. Before passing our binary data through a text-based system, we encode it as a (longer) sequence of text characters. When we get the data back out of the system, we must decode it to obtain our original data.
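As a minimal sketch of the idea, we can use hexadecimal as the encoding, where each byte becomes two printable characters. Hexadecimal is chosen here only because it is simple; it is one possible scheme among many, and the function names `encode_hex` and `decode_hex` are our own.

```python
def encode_hex(data: bytes) -> str:
    """Encode arbitrary bytes as text using only the characters 0-9 and a-f."""
    return data.hex()


def decode_hex(text: str) -> bytes:
    """Recover the original bytes from the hexadecimal text."""
    return bytes.fromhex(text)


# Binary data containing several bytes that text systems might mishandle.
original = bytes([0x00, 0x0D, 0x0A, 0x1A, 0xFF])

encoded = encode_hex(original)
print(encoded)                       # 000d0a1aff
print(len(original), len(encoded))   # 5 10

# Decoding gives back exactly the bytes we started with.
assert decode_hex(encoded) == original
```

The encoded form is twice as long as the original, which is the price paid for using only safe characters. More compact schemes exist, but the encode, transfer, decode pattern is the same.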
We obviously need to be careful about whitespace characters, because they might not be transferred reliably. On the other hand, line breaks are clearly necessary, because CR or LF characters are needed to split the data into manageable line lengths. Most encoding schemes therefore use only printable characters for the encoded data, but allow line breaks to be present and ignore them when decoding.
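The following sketch shows one way this might work, again using hexadecimal as the example encoding. The line width of 64 characters and the function names are assumptions made for illustration, not part of any particular standard.

```python
def encode_wrapped(data: bytes, width: int = 64) -> str:
    """Hex-encode the data, then split it into lines of at most `width` characters."""
    hex_text = data.hex()
    lines = [hex_text[i:i + width] for i in range(0, len(hex_text), width)]
    return "\n".join(lines)


def decode_wrapped(text: str) -> bytes:
    """Discard all whitespace (spaces, tabs, CR, LF), then decode the hex."""
    return bytes.fromhex("".join(text.split()))


original = bytes(range(256))
wrapped = encode_wrapped(original)

# Simulate a system that converts LF line endings to CRLF in transit.
mangled = wrapped.replace("\n", "\r\n")

assert decode_wrapped(wrapped) == original
assert decode_wrapped(mangled) == original   # the substitution does no harm

print(max(len(line) for line in wrapped.splitlines()))  # 64
```

Because the decoder ignores whitespace entirely, it does not matter whether the line breaks arrive as CR, LF or CRLF, or whether extra ones have been added along the way.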