(comments)

Original link: https://news.ycombinator.com/item?id=43366671

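A Hacker News thread discusses recommendations for designing magic numbers for binary file formats. The author of the original post on hackers.town recommends that a magic number include a zero byte, a byte sequence that is invalid UTF-8, and at least one byte with the high bit set. These requirements are meant to keep files from being misidentified as standard text formats such as ASCII or UTF-8. Specifically, the high-bit requirement distinguishes the file from ASCII, the invalid UTF-8 sequence prevents it from being mistaken for UTF-8 text, and the zero byte prevents it from being recognized as text in general. Some users question the rationale, and others offer explanations and point to the author's follow-up post for clarification. Other users check whether the ELF magic number meets these recommendations and find that it does not fully comply.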


  • Original
    Recommendations for designing magic numbers of binary file formats (hackers.town)
    39 points by _Microft 8 hours ago | hide | past | favorite | 9 comments










       SHOULD include a zero byte
    
    I guess it is expected to be at the end of the magic number so that it acts as a null-terminated string?

       MUST include a byte sequence that is invalid UTF-8
    
    I guess it is to differentiate a text file from a specific format?

       MUST include at least one byte with the high bit set
    
    Any reason?


    I think the idea of all these is to make the file not be recognised as text (which doesn't allow nulls), ASCII (which doesn't use the high bit), UTF-8 (which doesn't allow invalid UTF-8 sequences).

    Basically so that no valid file in this binary format will be misidentified as a text file.
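The reasoning in this comment can be sketched as a naive text sniffer. This is only an illustration under stated assumptions: `sniff` is a hypothetical helper, not how any particular tool works, and real detectors also consider legacy 8-bit encodings that this sketch ignores. The PNG signature is included as a familiar magic number that does set the high bit.

```python
def sniff(head: bytes) -> str:
    """Classify a byte prefix the way a naive text sniffer might (hypothetical helper)."""
    if b"\x00" in head:
        return "binary"  # NUL bytes do not appear in ordinary text
    if all(b < 0x80 for b in head):
        return "ascii"  # no byte has the high bit set
    try:
        head.decode("utf-8")
        return "utf-8"  # high-bit bytes present, but they form valid UTF-8
    except UnicodeDecodeError:
        return "binary"  # high-bit bytes that are not valid UTF-8


print(sniff(b"\x7fELF"))              # ELF magic: every byte is 7-bit
print(sniff(b"\x89PNG\r\n\x1a\n"))    # PNG signature: starts with 0x89
```

Under this sketch the ELF magic alone looks like ASCII, while the PNG signature is rejected because 0x89 cannot start a UTF-8 sequence.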



    The author explains their reasoning in the next post: https://hackers.town/@zwol/114155807716413069


    Well, the 3rd point follows from the second: all sequences without the high bit set are valid ASCII, and all valid ASCII sequences are valid UTF-8.
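A small check of that implication, as a sketch rather than a proof: it only tests every two-byte combination of 7-bit values, and `is_valid_utf8` is a hypothetical helper name. It also shows that the converse does not hold, since a high-bit byte can still be part of valid UTF-8.

```python
from itertools import product


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every pair of bytes below 0x80 is ASCII, and therefore valid UTF-8.
seven_bit_ok = all(is_valid_utf8(bytes(p)) for p in product(range(0x80), repeat=2))
print(seven_bit_ok)

# The high bit alone is not enough to make a sequence invalid.
print(is_valid_utf8(b"\xc3\xa9"))  # the UTF-8 encoding of U+00E9
print(is_valid_utf8(b"\xc3\x28"))  # lead byte followed by a non-continuation byte
```

So a magic number that contains invalid UTF-8 necessarily has a byte with the high bit set, but setting the high bit does not by itself satisfy the invalid-UTF-8 rule.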


    Is anyone able to break down why those requirements are desirable?


    Off the top of my head, most are to make it as clear as possible that the file is binary and NOT text:

    > MUST be the very first N bytes in the file

    For every system to be able to parse it without loading the entire file

    > MUST be at least four bytes long, eight is better

    To reduce risk of two different binary files on the same system having the same magic number

    > MUST include at least one byte with the high bit set

    To avoid wrongful identification as an ASCII file (ASCII doesn't use the high bit)

    > MUST include a byte sequence that is invalid UTF-8

    To avoid wrongful identification as a UTF-8 text file

    > SHOULD include a zero byte

    To avoid wrongful identification as ANY text file



    Why is ELF a good example?

        7F 45 4C 46
    
    - MUST be the very first N bytes in the file -> check

    - MUST be at least four bytes long, eight is better -> check, but only four

    - MUST include at least one byte with the high bit set -> nope

    - MUST include a byte sequence that is invalid UTF-8 -> nope

    - SHOULD include a zero byte -> nope

    So, just 1.5 out of 5. Not good.
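The tally above can be reproduced with a short sketch. `check_magic` and its rule labels are hypothetical names chosen for illustration. The placement rule ("very first N bytes") is left out because it depends on the file layout, not on the bytes of the magic number.

```python
def check_magic(magic: bytes) -> dict[str, bool]:
    """Evaluate a magic number against the content rules quoted in the thread."""
    try:
        magic.decode("utf-8")
        invalid_utf8 = False
    except UnicodeDecodeError:
        invalid_utf8 = True
    return {
        "at least four bytes": len(magic) >= 4,
        "eight bytes or more": len(magic) >= 8,
        "high bit set": any(b >= 0x80 for b in magic),
        "invalid UTF-8": invalid_utf8,
        "zero byte": b"\x00" in magic,
    }


ELF_MAGIC = bytes.fromhex("7F454C46")
for rule, ok in check_magic(ELF_MAGIC).items():
    print(f"{rule}: {'check' if ok else 'nope'}")
```

For the four ELF bytes, only the minimum-length rule passes, which matches the "check, but only four" line in the comment.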

    By the way, does anyone know the reason it starts with DEL (7F) specifically?



    It's (7F) ELF


    Hmm, I would expect that to be 31F, if it stood for "ELF" in correct Hexspeak.





