Rust's Block Pattern

Original link: https://notgull.net/block-pattern/

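## The “Block Pattern” in Rust

This idiom, called the “block pattern”, takes advantage of the fact that blocks are valid expressions in Rust to produce cleaner, more robust code. The core idea is to wrap a sequence of operations (for example, loading and parsing configuration data) in a block whose value is assigned to a single variable. Instead of declaring several intermediate variables that clutter the function's namespace, the pattern confines them to the scope *inside the block*. This states the block's purpose up front (for example, `let config = { ... }`) and reduces the chance of name collisions or accidental reuse of temporaries. Resources are also managed tightly, because variables inside the block are dropped when the block ends. Beyond clarity, the block pattern offers “erasure of mutability”: by declaring mutable variables *inside the block*, you limit their scope and prevent accidental modification elsewhere in the function, which reduces potential bugs. Refactoring into a separate function gives similar benefits, but the block pattern keeps the code flow inline for simpler operations and avoids the ceremony of a separate function and its parameter list. It is a simple technique that can noticeably improve the readability and maintainability of Rust code.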

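## Summary of Rust's “Block Pattern”

A Hacker News discussion centers on a Rust idiom called the “block pattern”, which relies on Rust blocks being expressions. The pattern scopes variables locally and groups related statements without introducing a new function. The core idea is to create and manipulate temporaries inside a block and return only the final result. This keeps unneeded variables out of the surrounding scope and can improve readability. Error handling with the `?` operator inside a block also came up; note that in a plain block `?` still returns from the enclosing function, and it is the not-yet-stable “try blocks” that would let an error stop at the block boundary. The discussion highlights the pattern's usefulness for managing temporaries, reducing mutability, and handling complex initialization. Similar concepts exist in other languages (such as Scala's expressions and C++ lambdas), but Rust's syntax makes the pattern especially concise. The stabilization of the related “try blocks” was also a topic of interest, since it would extend what the pattern can do. Overall, the block pattern offers a concise and powerful way to structure Rust code, improving clarity and reducing potential bugs.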

Original text

Here’s a little idiom that I haven’t really seen discussed anywhere, that I think makes Rust code much cleaner and more robust.

I don’t know if there’s an actual name for this idiom; I’m calling it the “block pattern” for lack of a better word. I find myself reaching for it frequently in code, and I think other Rust code could become cleaner if it followed this pattern. If there’s an existing name for this, please let me know!

The pattern comes from blocks in Rust being valid expressions. For example, this code:

let foo = 1 + 2;

…is equal to this code:

let foo = { 1 + 2 };

…which is, in turn, equal to this code:

let foo = {
    let x = 1;
    let y = 2;
    x + y
};
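As a minimal sketch of what "blocks are expressions" implies (the function names below are made up for illustration), the value of a block is its tail expression, the bindings made inside it do not escape, and a block can sit anywhere an expression is expected:

```rust
/// Returns the value of a block whose tail expression is `x + y`.
fn block_value() -> i32 {
    let foo = {
        let x = 1;
        let y = 2;
        x + y // no trailing semicolon: this is the block's value
    };
    // `x` and `y` are out of scope here; only `foo` remains.
    foo
}

/// A block can also be used as an operand, like any other expression.
fn block_as_operand() -> i32 {
    10 * {
        let x = 1;
        let y = 2;
        x + y
    }
}
```

If the tail expression ended with a semicolon, the block would evaluate to the unit type `()` instead, and `block_value` would no longer type-check.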

So, why does this matter?

Let’s say you have a function that loads a configuration file, then sends a few HTTP requests based on that config file. In order to load that config file, first you need to load the raw bytes of that file from the disk. Then you need to parse whatever the format of the configuration file is. For the sake of having a complex enough program to demonstrate the value of this pattern, let’s say it’s JSON with comments. You would need to remove the comments first using the regex crate, then parse the resulting JSON with something like serde-json.

Such a function would look like this:

use regex::{Regex, RegexBuilder};
use std::{fs, sync::LazyLock};

/// Format of the configuration file.
#[derive(serde::Deserialize)]
struct Config { /* ... */ }

// Always make sure to cache your regexes!
static STRIP_COMMENTS: LazyLock<Regex> = LazyLock::new(|| {
    RegexBuilder::new(r"//.*").multi_line(true).build().expect("regex build failed")
});

/// Function to load the config and send some HTTP requests.
fn foo(cfg_file: &str) -> anyhow::Result<()> {
    // Load the raw bytes of the file.
    let config_data = fs::read(cfg_file)?;

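    // Convert to a string so the regex can work on it.
    // `String::from_utf8` takes ownership of the byte vector.
    let config_string = String::from_utf8(config_data)?;

    // Strip out all comments (`replace` would only remove the first match).
    let stripped_data = STRIP_COMMENTS.replace_all(&config_string, "");

    // Parse as JSON. The annotation tells serde which type to deserialize into.
    let config: Config = serde_json::from_str(&stripped_data)?;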

    // Do some work based on this data.
    send_http_request(&config.url1)?;
    send_http_request(&config.url2)?;
    send_http_request(&config.url3)?;

    Ok(())
}

This is fairly simple, and just leverages a few Rust crates and language features to parse JSON and then do something with it.

However, there are a few weaknesses here. In the foo function, we declare four new variables (config_data, config_string, stripped_data, config), yet only one of them (config) is used after the configuration parsing. In addition, let’s say you didn’t know what this code was for going in, and you didn’t have these comments (or you had bad comments). One might ask why you’re declaring the regular expression STRIP_COMMENTS, or why you’re loading data from a file.

When I write code, I try to make it immediately obvious what the purpose of the code is, and why it’s written that way. This is why I generally avoid C’s “bottom-up” strategy for organizing code. It’s like being given a few screws and being expected to implicitly understand that it should be built into a chair. In Rust, I like that you are able to define your top-level functions first, and then go down and define all the bits and pieces after.

Although, we can do a little bit better. What if we organized the foo function like this:

/// Function to load the config and send some HTTP requests.
fn foo(cfg_file: &str) -> anyhow::Result<()> {
    // Load the configuration from the file.
    let config = {
        // Cached regular expression for stripping comments.
        static STRIP_COMMENTS: LazyLock<Regex> = LazyLock::new(|| {
            RegexBuilder::new(r"//.*").multi_line(true).build().expect("regex build failed")
        });

        // Load the raw bytes of the file.
        let raw_data = fs::read(cfg_file)?;

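        // Convert to a string so the regex can work on it.
        // `String::from_utf8` takes ownership of the byte vector.
        let data_string = String::from_utf8(raw_data)?;

        // Strip out all comments (`replace` would only remove the first match).
        let stripped_data = STRIP_COMMENTS.replace_all(&data_string, "");

        // Parse as JSON, naming the target type explicitly.
        serde_json::from_str::<Config>(&stripped_data)?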
    };

    // Do some work based on this data.
    send_http_request(&config.url1)?;
    send_http_request(&config.url2)?;
    send_http_request(&config.url3)?;

    Ok(())
}

In this function, we’ve moved all of the configuration-related code (parsing, loading, even the static regex) into the block. This works because a Rust block may contain items, statements and expressions, which is why we were able to move everything inside. This pattern has three immediate advantages:

The block starts with the intent of the code (let config = ...). We can see that we’re working to resolve some kind of configuration object right off the bat. Only then do we move into the implementation details of the code.
It reduces pollution of the namespace of both the foo function and the top-level module. Now in foo, the variable names config_data, config_string et al are no longer used. In addition to allowing these variable names to be re-used, it makes this code a lot more “idiot-proof”. If someone else were to edit the foo function, they would only be able to use config. They wouldn’t be able to use the raw_data or STRIP_COMMENTS items, which are only meant to be used by the config parser.
The temporaries inside the block, such as data_string and stripped_data, go out of scope at the end of the block, which means they are dropped, freeing up resources.
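One detail of the example is worth spelling out: `?` inside a plain block returns from the enclosing function, not just from the block. Here is a minimal, std-only sketch of the same shape. The `retries` field, the `load_retries` function and the `key = value` input format are assumptions made up for illustration; they are not part of the original example.

```rust
/// Hypothetical stand-in for the article's `Config`.
#[derive(Debug, PartialEq)]
struct Config {
    retries: u32,
}

/// Parses input such as "retries = 3 // per request".
fn load_retries(raw: &[u8]) -> Result<Config, Box<dyn std::error::Error>> {
    let config = {
        // On invalid UTF-8, `?` returns early from `load_retries` itself.
        let text = std::str::from_utf8(raw)?;

        // Strip `//` comments line by line.
        let stripped: String = text
            .lines()
            .map(|line| line.split("//").next().unwrap_or(""))
            .collect::<Vec<_>>()
            .join("\n");

        // Find the text after "retries =" on the first matching line.
        let value = stripped
            .lines()
            .find_map(|line| line.trim().strip_prefix("retries"))
            .and_then(|rest| rest.trim().strip_prefix('='))
            .ok_or("missing `retries` key")?;

        Config {
            retries: value.trim().parse()?,
        }
    };

    // `text`, `stripped` and `value` are out of scope here; only `config` is left.
    Ok(config)
}
```

Calling `load_retries(b"retries = 3 // per request")` yields `Config { retries: 3 }`, while invalid UTF-8 or a missing key makes the whole function return an `Err`.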

As an aside, you would get all three of these advantages by refactoring the block out into its own function, too. However, this pattern has two key advantages over that:

The code flow is still inline with the rest of the function. For shorter blocks, this improves reading comprehension, since it means you don’t have to go to a different part of the code to fully understand the function.
If the block uses a lot of variables from the surrounding function, you don’t need to pass each of them explicitly as a parameter.
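To make the parameter-list point concrete, here is a small sketch comparing the two approaches. The functions `report_inline`, `report_extracted` and `format_report` are hypothetical names chosen for illustration.

```rust
/// Inline version: the block reads `count`, `sum` and `max` straight from
/// the enclosing scope.
fn report_inline(samples: &[f64]) -> String {
    let count = samples.len();
    let sum: f64 = samples.iter().sum();
    let max = samples.iter().cloned().fold(f64::MIN, f64::max);

    let report = {
        let mean = if count == 0 { 0.0 } else { sum / count as f64 };
        format!("n={count} mean={mean:.1} max={max:.1}")
    };
    report
}

/// Extracted version: every value the helper needs becomes a parameter.
fn format_report(count: usize, sum: f64, max: f64) -> String {
    let mean = if count == 0 { 0.0 } else { sum / count as f64 };
    format!("n={count} mean={mean:.1} max={max:.1}")
}

fn report_extracted(samples: &[f64]) -> String {
    let count = samples.len();
    let sum: f64 = samples.iter().sum();
    let max = samples.iter().cloned().fold(f64::MIN, f64::max);
    format_report(count, sum, max)
}
```

Both versions produce the same string. With three values the helper is still manageable; the block starts to pay off as the number of captured variables grows, while the helper remains the better choice once the logic is reused elsewhere.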

There is one more benefit that’s not exposed in the above example: erasure of mutability. Let’s say you construct some object for use in a later part of the function:

let mut data = vec![];
data.push(1);
data.extend_from_slice(&[4, 5, 6, 7]);

data.iter().for_each(|x| println!("{x}"));
return data[2];

The issue is that data is declared as mutable, which means the rest of the function can mutate it. Since a lot of bugs come from data being mutated when it isn’t supposed to be mutated, we’d like to restrict the mutability of the data to a certain area of the function. This is also possible with the block pattern:

let data = {
    let mut data = vec![];
    data.push(1);
    data.extend_from_slice(&[4, 5, 6, 7]);
    data
};

data.iter().for_each(|x| println!("{x}"));
return data[2];

This effectively “closes” the mutability to a certain section of the function.
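As a sketch of what this buys you in practice (the function names are illustrative), the first function below shows that a later mutation is rejected at compile time. The second shows that mutability can still be regained deliberately by rebinding the owned value, so the pattern guards against accidents; it is not a hard guarantee.

```rust
/// Builds the vector inside a block, then exposes an immutable binding.
fn third_element() -> i32 {
    let data = {
        let mut data = vec![];
        data.push(1);
        data.extend_from_slice(&[4, 5, 6, 7]);
        data
    };

    // Uncommenting the next line fails to compile, because the outer
    // `data` binding is not declared `mut`:
    // data.push(8);

    data[2]
}

/// Mutability can be regained on purpose by rebinding the owned value.
fn rebind_as_mutable() -> Vec<i32> {
    let data = {
        let mut data = vec![1];
        data.extend_from_slice(&[4, 5, 6, 7]);
        data
    };

    let mut data = data; // explicit opt-in, visible at the point of use
    data.push(8);
    data
}
```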

Closing Thoughts

I don’t know if this pattern is already well known to the Rust community. Even if it isn’t, I figure it’s still a good idea to bring it to people who may be inexperienced in Rust.