I Designed a Custom Protocol for My App

Original link: https://blog.roj.dev/how-i-designed-a-custom-protocol-for-my-app

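Rast is an experimental project for efficiently detecting spelling errors in Central Kurdish (Sorani) text, especially over a network connection. Its key component is **K8**, a new 8-bit coding standard for Kurdish designed to overcome the inefficiency of UTF-8 for non-ASCII characters. K8 also includes a footer for backward compatibility with existing UTF-8 characters. The project uses a custom transport protocol focused on minimizing data transfer. After the initial transmission of details, the system sends error *references* rather than resending the full error details. The error data structure contains the number of errors and details, offsets indicating where the errors are in the text, and compact headers for titles and descriptions. The protocol improves efficiency by caching error details and linking them to specific errors by index, which reduces redundancy over the WebSocket connection. Bit-based streaming was considered, but its development cost proved too high.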


Original text

Introduction

Rast (in Kurdish: “ڕاست” [raːst]) is a new experimental project of mine for detecting orthographical errors in texts written in the Central Kurdish language, also known as Sorani.

I designed it to work efficiently with long texts over a duplex network connection. To do this, I first created K8, an 8-bit coding standard for Kurdish.

K8

K8 exists because Kurdish characters are non-ASCII and take two bytes each when encoded in UTF-8, which makes them inefficient for binary protocols.
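To make the size difference concrete, here is a small Python check. It only measures UTF-8 and compares it with one byte per character. It does not implement K8 itself, and it ignores K8's version byte and optional footer.

```python
# Compare the UTF-8 size of a Kurdish word with one byte per character.
text = "سڵاو"  # four Kurdish letters

# Each letter lies in U+0080..U+07FF, so UTF-8 needs 2 bytes per letter.
utf8_size = len(text.encode("utf-8"))

# An 8-bit code needs one byte per letter (version byte and footer not counted).
one_byte_size = len(text)

print(utf8_size, one_byte_size)  # 8 4
```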

It also stays backward compatible with UTF-8 characters it does not cover: when needed, those characters are encoded in an optional footer.

Below is an example of that.

00 -- version
97 -- س
A1 -- ڵ
8A -- ا
A6 -- و
AB -- ،
20 -- space
00
00
00
21 -- !
01 -- footer start
D0
BC -- м
D0
B8 -- и
D1
80 -- р

The above is a representation of the literal سڵاو، мир!.
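The example can be decoded with a short Python sketch. The table below contains only the byte values shown above, not the full K8 table. The rules it applies are read off the example and are assumptions rather than a specification: the first byte is the version, a 00 byte in the body stands for a character stored in the footer, 01 starts the footer, the footer is plain UTF-8, and ASCII bytes map to themselves. The names `K8_TABLE`, `PLACEHOLDER`, `FOOTER_START` and `decode_k8` are made up for this illustration.

```python
# Partial table: only the byte values that appear in the example.
K8_TABLE = {0x97: "س", 0xA1: "ڵ", 0x8A: "ا", 0xA6: "و", 0xAB: "،"}
PLACEHOLDER = 0x00   # assumed: stands for a character stored in the footer
FOOTER_START = 0x01  # assumed: separates the body from the UTF-8 footer


def decode_k8(data: bytes) -> str:
    version, payload = data[0], data[1:]
    if version != 0:
        raise ValueError(f"unsupported version {version}")
    # Split the payload into the 8-bit body and the optional UTF-8 footer.
    body, sep, footer = payload.partition(bytes([FOOTER_START]))
    extras = iter(footer.decode("utf-8")) if sep else iter("")
    out = []
    for b in body:
        if b == PLACEHOLDER:
            out.append(next(extras))  # take the next footer character
        elif b < 0x80:
            out.append(chr(b))        # assumed: ASCII maps to itself
        else:
            out.append(K8_TABLE[b])
    return "".join(out)


example = bytes.fromhex("00 97 A1 8A A6 AB 20 00 00 00 21 01 D0 BC D0 B8 D1 80")
print(decode_k8(example))
```

Under these assumptions the 18 input bytes decode to the 10-character literal, with the three placeholder bytes filled in order from the footer.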

K8 with this footer-based compatibility is used in Rast's URL state, and the version without the footer is used in the transport protocol, as described below.

The Transport Protocol

The goal of the project is straightforward: receive a stream of text, stream back a list of errors.

Errors are made up of their details, which are two strings of text: a generic title and a specific description.

Each component of an error’s detail is transported only once. Afterwards, both the server and the client keep a reference to it for the lifetime of the WebSocket connection.
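The send-once idea can be sketched in Python as a per-connection cache. This is only an illustration of the concept: the class `DetailCache`, the `"full"` and `"ref"` tags, and the choice of sequential indexes are hypothetical, and the way the real protocol distinguishes a full string from a reference is not shown here.

```python
class DetailCache:
    """Per-connection cache: a string is sent in full once, then by index."""

    def __init__(self) -> None:
        self._index_of = {}  # text -> index assigned on first send

    def encode(self, text: str) -> tuple:
        if text in self._index_of:
            # Already sent on this connection: send only the reference.
            return ("ref", self._index_of[text])
        # First time: remember it and send the full text.
        self._index_of[text] = len(self._index_of)
        return ("full", text)


cache = DetailCache()
first = cache.encode("Misspelled word")
second = cache.encode("Misspelled word")
print(first, second)  # ('full', 'Misspelled word') ('ref', 0)
```

The receiving side would keep the mirror-image list, appending each full string as it arrives so that both ends agree on the indexes.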

Below is a brief representation of it.

+-------------------------------------------------------------------+ header
|              uint16 - error count                                 | header
+-------------------------------------------------------------------+ header
|              uint16 - detail count                                | header
+-----------------------+-------------------------------------------+ errors
| uint16 - error offset | uint8 error length                        | errors
+-----------------------+-------------------------------------------+ errors
| ..................... | ..................                        | errors
+-----------------------+-------------------------------------------+ details
|  uint8 title length   | uint8 desc length | uint16 errorCount     | details
+-----------------------+-------------------------------------------+ details
| ..................... | ..................                        | details
+-----------------------+-------------------------------------------+ details
|          title        |    description   | uint16[] error_indexes | details
+-----------------------+-------------------------------------------+ details
| ..................... | ..................                        | details
+-----------------------+-------------------------------------------+ details

Here are some details that might have been missed above:

The first two bytes of each packet, error_count, give the number of errors found in the text input.
The next two bytes, detail_count, give the number of error details returned in this round.
The next group of bytes, 3 * error_count long, marks the position (offset and length) of each error in the text.
Next come detail_count headers, one per error detail, each four bytes long.
The last group holds the error details themselves and the indexes of the errors they apply to.

The title and description fields each hold either an arbitrary cache index or the human-readable information about the error encoded in K8, depending on whether they were already sent earlier in the connection.
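The layout can be walked through with a minimal Python parser. Two things here are assumptions: the byte order (big-endian is chosen arbitrarily) and the handling of title and description, which are returned as raw bytes because the rule for telling a cache index from K8 text is not covered by the diagram. The name `parse_packet` is hypothetical.

```python
import struct


def parse_packet(buf: bytes):
    pos = 0
    # Header: uint16 error count, uint16 detail count (big-endian assumed).
    error_count, detail_count = struct.unpack_from(">HH", buf, pos)
    pos += 4

    # Errors: uint16 offset + uint8 length, 3 bytes each.
    errors = []
    for _ in range(error_count):
        errors.append(struct.unpack_from(">HB", buf, pos))
        pos += 3

    # Detail headers: uint8 title length, uint8 desc length, uint16 index count.
    headers = []
    for _ in range(detail_count):
        headers.append(struct.unpack_from(">BBH", buf, pos))
        pos += 4

    # Detail bodies: title, description, then the uint16 error indexes.
    details = []
    for title_len, desc_len, n_indexes in headers:
        title = buf[pos:pos + title_len]
        pos += title_len
        desc = buf[pos:pos + desc_len]
        pos += desc_len
        indexes = list(struct.unpack_from(f">{n_indexes}H", buf, pos))
        pos += 2 * n_indexes
        details.append((title, desc, indexes))

    return errors, details


# A hand-built packet: two errors that share one detail.
packet = (
    struct.pack(">HH", 2, 1)
    + struct.pack(">HB", 5, 3) + struct.pack(">HB", 20, 4)
    + struct.pack(">BBH", 2, 1, 2)
    + b"\x97\xa1" + b"\x8a" + struct.pack(">2H", 0, 1)
)
errors, details = parse_packet(packet)
print(errors, details)
```

In this hand-built packet the fixed-size part takes 4 + 3 * 2 + 4 * 1 = 14 bytes, and the single detail body adds 2 + 1 + 4 = 7 more.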

Conclusion

This protocol took a while to design, and I like how it turned out. I considered using bit-based streaming, but I could not get it into production because of the development cost. If there are updates in the future, I will write about them below.