This project provides OCR-optimized error correction for printed codes, coupons, IDs, and labels.
The goal is to print a compact code with Reed-Solomon ECC and OCR-safe parity text. When OCR misreads one or more symbols, the decoder can detect that the scanned code is invalid, identify which symbol positions are inconsistent, and correct the original message when the number of errors is within the configured Reed-Solomon limit.
This is especially useful when the print environment is not ideal. Examples include dot-matrix printers with missing pins, ribbons that are running out of ink, low-resolution printing, labels that easily collect dirt, or any workflow where printed characters can be partially damaged before OCR reads them.
For controlled printed-code workflows, this can push OCR reliability close to 100%, provided the scan quality is good enough and the number of OCR mistakes stays within the correction capacity.
ReedSolomonForOcr implements Reed-Solomon over GF(256).
- Symbol size: 8 bits
- Maximum codeword length: 255 symbols
- Encoding format: `message + parity`
- Correction capacity: up to `floor(nsym / 2)` unknown symbol errors
For example, nsym=10 adds 10 parity symbols and can correct up to 5 unknown symbol errors.
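The arithmetic behind these limits can be sketched in a few lines. The helper name `rs_budget` is hypothetical and not part of the module; it only restates the numbers given above (255-symbol codewords, `floor(nsym / 2)` correctable errors) and assumes the parity symbols count toward the 255-symbol limit.

```python
GF256_MAX_CODEWORD = 255  # maximum codeword length in symbols, as stated above


def rs_budget(nsym):
    """Hypothetical helper: length and correction limits for a parity count."""
    if not 0 < nsym < GF256_MAX_CODEWORD:
        raise ValueError("nsym must be between 1 and 254")
    return {
        "parity_symbols": nsym,
        # Assumption: message and parity share the 255-symbol codeword.
        "max_message_symbols": GF256_MAX_CODEWORD - nsym,
        # Unknown-position errors cost two parity symbols each.
        "max_correctable_errors": nsym // 2,
    }


budget = rs_budget(10)
print(budget["max_message_symbols"])     # 245
print(budget["max_correctable_errors"])  # 5
```

Raising `nsym` buys more correction capacity at the cost of a longer printed code and a shorter maximum message.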
The normal byte-oriented API works with integer symbols in the range 0..255:

    from importlib.util import module_from_spec, spec_from_file_location

    # The file name contains hyphens, so a plain `import` statement cannot
    # load it; load the module by path instead.
    spec = spec_from_file_location("reed_solomon_ocr", "reed-solomon-ocr.py")
    module = module_from_spec(spec)
    spec.loader.exec_module(module)
    ReedSolomonForOcr = module.ReedSolomonForOcr

    rs = ReedSolomonForOcr(nsym=10)  # 10 parity symbols
    message = ReedSolomonForOcr.bytes_to_symbols(b"HELLO-123")
    codeword = rs.encode(message)  # message + parity

    is_valid = rs.check(codeword)
    decoded = rs.correct(codeword)

    assert is_valid
    assert decoded == message

OCR mistakes often come from look-alike characters. When reducing a character set to avoid confusion, choose the most distinctive character from each look-alike group.
Recommended alphanumeric choices:
- For `0` and `O`: keep neither. Remove both `0` and `O`.
- For `1`, `I`, and `l`: keep only the number `1`. Remove capital `I` and lowercase `l`.
- For `2` and `Z`: keep the number `2`. Remove capital `Z`.
- For `5` and `S`: keep the number `5`. Remove capital `S`.
- For `8` and `B`: keep the number `8`. Remove capital `B`.
- For `6` and `G`: keep the number `6`. Remove capital `G`.
- For `V` and `U`: keep capital `U`. Remove capital `V`.
Ultimate safe character list (this strictest set also leaves out `1`):
- Safe numbers: `2 3 4 5 6 7 8 9`
- Safe letters: `A C D E F H J K L M N P Q R T U W X Y`
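As a minimal sketch, the safe list can be derived by filtering digits and uppercase letters against the look-alike characters named above. The variable names are illustrative only, and the set also drops `1`, since the safe-number list does not include it.

```python
import string

# Look-alike characters removed by the recommendations above, plus "1",
# which the final safe-number list also leaves out.
CONFUSABLE = set("0O1IBGSVZ")

# Candidate pool: digits and uppercase letters only, so lowercase "l"
# never enters the alphabet in the first place.
candidates = string.digits + string.ascii_uppercase

safe_alphabet = "".join(ch for ch in candidates if ch not in CONFUSABLE)

print(safe_alphabet)       # 23456789ACDEFHJKLMNPQRTUWXY
print(len(safe_alphabet))  # 27
```

The result has 27 characters: 8 digits and 19 letters.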
The codec uses this OCR-safe alphabet for parity text:
23456789ACDEFHJKLMNPQRTUWXY
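Because Reed-Solomon symbols are GF(256) bytes with 256 possible values, and the OCR-safe alphabet has only 27 characters, one parity byte cannot fit into one OCR-safe character. Two characters give 27 × 27 = 729 combinations, which is enough, so the implementation encodes each parity byte as two OCR-safe characters. With nsym=10, the 10 parity bytes therefore become 20 OCR-safe characters.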
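One way to map a byte onto two characters is plain base-27 positional encoding. The sketch below illustrates the idea only: the exact mapping used inside `ReedSolomonForOcr` is not described here, and the function names `byte_to_safe_pair` and `safe_pair_to_byte` are hypothetical.

```python
SAFE_ALPHABET = "23456789ACDEFHJKLMNPQRTUWXY"  # the 27-character alphabet above
BASE = len(SAFE_ALPHABET)


def byte_to_safe_pair(value):
    """Encode one byte as two safe characters (assumed base-27 scheme)."""
    if not 0 <= value <= 255:
        raise ValueError("value must be a byte (0..255)")
    high, low = divmod(value, BASE)  # high is at most 9 because 255 // 27 == 9
    return SAFE_ALPHABET[high] + SAFE_ALPHABET[low]


def safe_pair_to_byte(pair):
    """Decode two safe characters back into one byte."""
    if len(pair) != 2:
        raise ValueError("expected exactly two characters")
    # str.index raises ValueError for characters outside the alphabet.
    high = SAFE_ALPHABET.index(pair[0])
    low = SAFE_ALPHABET.index(pair[1])
    value = high * BASE + low
    if value > 255:
        raise ValueError("pair does not encode a byte")
    return value


parity_bytes = [0, 26, 27, 255]
parity_text = "".join(byte_to_safe_pair(b) for b in parity_bytes)
print(parity_text)  # 222Y32CF
```

Under this scheme only 256 of the 729 possible pairs are valid, so some OCR misreads in the parity text can be rejected before Reed-Solomon decoding even starts.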
To encode a message and get its parity as OCR-safe text:

    rs = ReedSolomonForOcr(nsym=10)
    message = ReedSolomonForOcr.bytes_to_symbols(b"HELLO-123")
    message_symbols, safe_parity = rs.encode_with_ocr_safe_parity(message)

    print(message_symbols)
    print(safe_parity)

Print or store both:

- `message_symbols`: the original message symbols
- `safe_parity`: OCR-safe parity characters
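Before handing scanned parity text to the decoder, a cheap pre-check can flag characters that cannot be valid. The sketch below is not part of the module: `find_unsafe_positions` is a hypothetical helper, and it assumes the parity text is exactly two characters per parity symbol with no separators.

```python
SAFE_ALPHABET = "23456789ACDEFHJKLMNPQRTUWXY"


def find_unsafe_positions(parity_text, nsym):
    """Return indexes of characters that are outside the OCR-safe alphabet."""
    expected_len = 2 * nsym  # assumption: two characters per parity symbol
    if len(parity_text) != expected_len:
        raise ValueError(
            f"expected {expected_len} parity characters, got {len(parity_text)}"
        )
    return [i for i, ch in enumerate(parity_text) if ch not in SAFE_ALPHABET]


scanned = "2Y3OCF"  # OCR produced the letter O, which is not in the alphabet
print(find_unsafe_positions(scanned, nsym=3))  # [3]
```

A non-empty result means OCR produced a character that was never printed, so the scan can be retried or the affected parity symbol treated as suspect.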
To rebuild the full codeword from the two parts and check it:

    codeword = rs.codeword_from_ocr_safe_parity(message_symbols, safe_parity)
    assert rs.check(codeword)

To correct a damaged message using the parity text:

    corrupted_message = message_symbols[:]
    corrupted_message[0] ^= 0x55  # simulate one misread symbol
    decoded = rs.correct_with_ocr_safe_parity(corrupted_message, safe_parity)
    assert decoded == message

To convert between bytes and symbol lists:

    symbols = ReedSolomonForOcr.bytes_to_symbols(b"ABC123")
    data = ReedSolomonForOcr.symbols_to_bytes(symbols)

The module also exposes wrapper functions:

    codeword = module.rs_encode_msg(message, nsym=10)
    decoded = module.rs_correct_msg(codeword, nsym=10)

    message_symbols, safe_parity = module.rs_encode_msg_with_ocr_safe_parity(message, nsym=10)
    decoded = module.rs_correct_msg_with_ocr_safe_parity(message_symbols, safe_parity, nsym=10)

To run the main script and the unit tests:

    python3 main.py
    python3 -m unittest -v