在C语言中实现泛型类型

在C语言中实现泛型类型
Implementing Generic Types in C

原始链接: https://btmc.substack.com/p/implementing-generic-types-in-c

C语言缺乏原生泛型类型，开发者不得不模拟实现。常用的方法有四种：模板宏、模板头文件、类型擦除（void*）以及内联宏。模板宏笨拙且易错，通常不推荐使用。模板头文件利用`#define`和`#include`，提供了更好的语法和类型检查，但是需要仔细管理命名冲突和多重定义问题。使用`void*`的类型擦除允许单独编译，这对动态链接库或外部函数接口(FFI)至关重要，但是它牺牲了类型安全，并且需要显式地管理大小。内联宏最为用户友好，避免了名字改编和定义问题。然而，它们可能导致代码膨胀，并且缺乏合适的语法高亮。像CC和stb_ds这样的库改进了这种方法。最终，模板头文件和内联宏的选择取决于个人偏好。模板头文件更容易开发，而内联宏提供了更友好的用户体验。

Hacker News 最新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交登录在 C 语言中实现泛型类型 (btmc.substack.com) 21 分，来自 sirwhinesalot，1 小时前 | 隐藏 | 过去 | 收藏 | 讨论加入我们 6 月 16-17 日在旧金山举办的 AI 初创公司学校！指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系我们搜索：

（评论） 2024-09-02

（评论） 2024-07-16

（评论） 2024-07-01

（评论） 2025-03-17

原文

One of the most annoying things about programming in C is the lack of support for generic types, also known as parametric polymorphism. Not to be confused with the abomination that is _Generic.

I wish C had something like the following:

struct Vector<T> {
  T* data;
  size_t capacity;
  size_t length;
};

void vector_resize<T>(Vector<T>* v, size_t capacity) {
  v->data = realloc(v->data, capacity * sizeof(T));
  v->capacity = capacity;
}

void vector_append<T>(Vector<T>* v, T element) {
  if (v->length >= v->capacity) {
    vector_resize(v, v->capacity ? v->capacity * 2 : 8);
  }
  v->data[v->length] = element;
  v->length++;
}

void vector_clear<T>(Vector<T>* v) {
  free(v->data);
  v->data = NULL;
  v->capacity = 0;
  v->length = 0;
}

Vector<char*> v = {0};
vector_resize(&v, 10);
vector_push(&v, "abc");
vector_clean(&v);

C++ can easily do the above with templates, and I personally quite enjoy programming in “C with methods, templates, namespaces and overloading”, avoiding the more complicated elements of C++ (like move semantics) while still benefiting from it.

But sometimes you really want to program in pure C for whatever reason. There isn’t really a perfect way to do the above in C, but there are ways, some better than others.

I’m aware of the following four methods:

Template Macros
Template Headers
Type Erasure with void*
Inlining Macros

We’ll go over each of them and discuss their pros and cons.

C++ templates are an implementation of generic types based on monomorphization. What does that mean? It’s basically copy paste + find & replace.

When you write:

template<typename T> 
T id(T data) { return data; }
...
int i = id(10);
char* str = id("abc");

What the compiler spits out is something like:

int id_int(int data) { return data; }
char* id_char(char* data) { return data; }
...
int i = id_int(10);
char* str = id_char("abc");

Though the names of the functions aren’t nice and obvious like that, they’re actually unreadable mangled gibberish. The important point is that the templated code gets “copy-pasted” each time you call it with a different type, and everywhere T was used the concrete type is used instead.

We can implement a similar thing in C using macros for our Vector struct:

#define _VECTOR_TEMPLATE_INSTANCE(T) \
typedef struct { \
  T* data; \
  size_t capacity; \
  size_t length; \
} Vector_##T; \
\
void vector_resize_##T(Vector_##T* v, size_t capacity) { \
  v->data = realloc(v->data, capacity * sizeof(T)); \
  v->capacity = capacity; \
} \
\
void vector_append_##T(Vector_##T* v, T element) { \
  if (v->length >= v->capacity) { \
    vector_resize_##T(v, v->capacity ? v->capacity * 2 : 8); \
  } \
  v->data[v->length] = element; \
  v->length++; \
} \
\
void vector_clear_##T(Vector_##T* v) { \
  free(v->data); \
  v->data = NULL; \
  v->capacity = 0; \
  v->length = 0; \
}
#define VECTOR_TEMPLATE_INSTANCE(T) _VECTOR_TEMPLATE_INSTANCE(T)

The two levels of macros are necessary for Vector_##T to work. Without them it’ll just become Vector_T. The preprocessor is weird.

Anyway, we can then call this macro like so:

VECTOR_TEMPLATE_INSTANCE(int)

And all the code in the macro gets copy pasted with T replaced with int.

There are many issues with this particular solution. The first and most obvious is just how hideous it is. All those backslashes and pound signs… ugh. You also get no syntax highlighting or autocompletion and errors will only show up once you invoke the macro. But that’s not all! The following doesn’t work:

VECTOR_TEMPLATE_INSTANCE(char*)

Because the “name mangling” will be borked due to the *. The solution is to typedef pointer types or wrap them in a struct, or to add an extra “name” parameter to the macro like:

VECTOR_TEMPLATE_INSTANCE(char*, str)

But we also need to consider where to call this macro, because we need to be careful not to call it with the same type in multiple places, or we might end up with multiple definitions for the same function.

For an example of this approach in the wild, check out M*LIB. Personally I find this way of implementing generics in C to be all downsides.

The idea of the template header is similar to the template macro, but using a “parameterized” header file instead:

#ifndef T
#error you need to define T before including this header
#else

typedef struct {
  T* data;
  size_t capacity;
  size_t length;
} CAT(Vector,T);

void CAT(vector_resize,T)(CAT(Vector,T)* v, size_t capacity) {
  v->data = realloc(v->data, capacity * sizeof(T));
  v->capacity = capacity;
}

void CAT(vector_append,T)(CAT(Vector,T)* v, T element) {
  if (v->length >= v->capacity) {
    CAT(vector_resize,T)(v, v->capacity ? v->capacity * 2 : 8);
  }
  v->data[v->length] = element;
  v->length++;
}

void CAT(vector_clear,T)(CAT(Vector,T)* v) {
  free(v->data);
  v->data = NULL;
  v->capacity = 0;
  v->length = 0;
}

#undef T
#endif

We can then use this header file like so:

#define T int
#include "vector_template.h"

We’ve gotten rid of the backslashes, we get proper syntax highlighting, and by just adding a #define T int line temporarily we get type checking and autocompletion and all that good stuff.

The CAT macro is needed to implement the indirection for ##T we used in the macro version. There are ways to make it a little prettier (see below) but I wanted to show the basic solution first.

We still have the naming issue when using pointer types and the multiple definition issue, but thankfully they can be worked around without too much trouble by doing the following:

#ifndef T
#error you need to define T before including this header
#else

#ifndef SUFFIX
#define SUFFIX T
#endif

#define G(N) CAT(N, SUFFIX)

typedef struct {
  T* data;
  size_t capacity;
  size_t length;
} G(Vector);

void G(vector_resize)(G(Vector)* v, size_t capacity);
void G(vector_append)(G(Vector)* v, T element);
void G(vector_clear)(G(Vector)* v);

#ifdef VECTOR_IMPLEMENTATION

void G(vector_resize)(G(Vector)* v, size_t capacity) {
  v->data = realloc(v->data, capacity * sizeof(T));
  v->capacity = capacity;
}

void G(vector_append)(G(Vector)* v, T element) {
  if (v->length >= v->capacity) {
    G(vector_resize)(v, v->capacity ? v->capacity * 2 : 8);
  }
  v->data[v->length] = element;
  v->length++;
}

void G(vector_clear)(G(Vector)* v) {
  free(v->data);
  v->data = NULL;
  v->capacity = 0;
  v->length = 0;
}

#endif

#undef G
#undef SUFFIX
#undef T
#endif

We add an extra parameter called SUFFIX (which is mapped to T by default) to allow changing the suffix in the type and function names (could also have been PREFIX or NAME or whatever you want to call it). The CAT macro is wrapped in the G macro to make things nicer to look at and automatically apply the suffix. Lastly we employ the same approach used by header-only libraries to separate the function declarations from their implementation, solving the “careful where you use this” issue.

For an example of this approach in the wild, check out STC.

I mentioned that C++ generics use monomorphization, but that’s not the only way to implement generic types. Java, for example, uses type erasure.

Basically, when you write the following in Java:

class Vector<T> {
  public T[] data;
  public size_t capacity;
  public size_t length;
}

Vector<int> v = new Vector<int>();

What the compiler is actually doing is:

class Vector {
  public object[] data;
  public size_t capacity;
  public size_t length;
}

Vector v = new Vector();

The actual element type has disappeared. T and int have been erased. This approach to generics has pros and cons, but for our purposes the important part is that we can actually do something pretty similar in C without any macros:

typedef struct {
  void* data;
  size_t capacity;
  size_t length;
} Vector;

void vector_resize(Vector* v, size_t capacity, size_t t_size) {
  v->data = realloc(v->data, capacity * t_size);
  v->capacity = capacity;
}

void vector_append(Vector* v, void* element, size_t t_size) {
  if (v->length >= v->capacity) {
    vector_resize(v, v->capacity ? v->capacity * 2 : 8, t_size);
  }
  size_t offset = v->length * t_size;
  memcpy((char*)v->data + offset, element, t_size);
  v->length++;
}

void* vector_get(Vector* v, size_t index, size_t t_size) {
  size_t offset = index * t_size;
  return (char*)v->data + offset;
}

void vector_clear(Vector* v) {
  free(v->data);
  v->data = NULL;
  v->capacity = 0;
  v->length = 0;
}

We can use it as follows:

Vector v = {0};
vector_resize(&v, 10, sizeof(double));
double element = 4.0,
vector_push(&v, &element, sizeof(double));
vector_clean(&v);

Not a macro in sight, wonderful. But we had to pay a price. Every function now needs an extra parameter so we know the size of the data we are working with. We also can’t index the vector data directly, we have to treat it as an opaque blob of bytes to write into (using memcpy). We need vector_get for the same reason.

We also have to pass elements by pointer. That means we need the element to be an lvalue so we can take its address. That’s why 4.0 is stored in a variable first.

The biggest issue is the same one Java had before version 5, there’s no type safety:

Vector v = {0};
vector_push(&v, &some_int, sizeof(double));
vector_push(&v, &some_double, sizeof(char*));
vector_push(&v, &some_char_ptr, sizeof(int));

We can improve upon the sizeof issue by storing size information on the vector itself:

typedef struct {
  void* data;
  size_t capacity;
  size_t length;
  size_t t_size;
} Vector;

Vector vector_init(Vector* v, size_t element_size) {
  return (Vector){.t_size = element_size};
}

void vector_append(Vector* v, void* element) {
  if (v->length >= v->capacity) {
    vector_resize(v, v->capacity ? v->capacity * 2 : 8);
  }
  size_t offset = v->length * v->t_size;
  memcpy((char*)v->data + offset, element, v->t_size);
  v->length++;
}

void vector_resize(Vector* v, size_t capacity) { ... }
void* vector_get(Vector* v, size_t index) { ... }
void vector_clear(Vector* v) { ... }

Parameter t_size is a specific instance of a more general t_info structure, which stores all the information the container needs from the element type to function. Storing this information in the container, however, does not prevent us from appending a double to a vector of ints.

The main reason to use this approach is separate compilation. You can compile the code for a vector once and then link to it many time. Useful for dynamic libraries or FFI-related stuff, but it doesn’t provide much value when programming in C itself.

For an example of this approach in the wild, check out GLib.

This one is essentially a “fixed” version of the template macro solution. The idea is to have individual macros for each operation, none of which declare anything:

#define Vector(T) struct { \
  T* data; \
  size_t length; \
  size_t capacity; \
}

#define vector_resize(v, capacity) { \
  v->data = realloc(v->data, capacity * sizeof(v->data[0])); \
  v->capacity = capacity; \
}

#define vector_append(v, element) { \
  if (v->length >= v->capacity) { \
    vector_resize(v, v->capacity ? v->capacity * 2 : 8); \
  } \
  v->data[v->length] = element; \
  v->length++; \
}

#define vector_clear(v) { \
  free(v->data); \
  v->data = NULL; \
  v->capacity = 0; \
  v->length = 0; \
}

We can use it as follows:

typedef Vector(char*) Vector_str;

Vector_str v = {0};
vector_resize(&v, 10);
vector_append(&v, "abc");
vector_clean(&v);

Compared to the template macro solution this has massive advantages. There’s no “name mangling” issue because the macros don’t declare anything. We don’t need to worry about where these macros are invoked for the same reason. Outside of function arguments we don’t even need the typedef.

As with the template macro approach, this approach also suffers from a lack of syntax highlighting and typechecking, plus those damn backslashes again. It also causes a bit of code bloat since the same code is getting inlined all over the place.

One way to minimize those issues is to shift some of the work to helper functions. In the our Vector example none of the macros is large enough or complicated enough to justify doing this, but for a HashMap you’ll definitely want to do it.

Regardless, since the we’re working with multiple smaller macros rather than one big unwieldy macro, it’s a lot more bearable.

For an example of this approach in the wild, check out CC (though CC goes even further with typeof tricks to not require the typedef). There’s also Sean Barrett’s stb_ds library which avoids the typedef by hiding the length and capacity behind the data pointer itself in memory.

I do not understand their hatred of the typedef, but both libraries are good.

I’ve gone over four approaches to implement Generic types in C using different techniques, each with pros and cons. Well, except the first one, template macros, where I can’t really find any pro, only cons.

The type erasure (void*) approach only makes sense if FFI or dynamic linking is involved, or you really must avoid code bloat at all costs.

The inlining macro approach is the most pleasant to use, but it is a bit annoying to develop since it requires programming inside macros (bleh). For more complex data structures than a dynamic array you’ll want to shift as much work to helper functions as possible, making it a bit of a hybrid with the void* approach.

The template header version is somewhat annoying to use but is in my opinion the most pleasant to actually program. It also works ok in an FFI or dynamic linking setting one level higher. For example, a graphics library might expose Vector_texture and Vector_vertex types and such.

So, to me, there are actually only two choices when all you care about is programming in C itself: template headers or inlining macros. The former is nicer to program while the latter is nicer to use. Try them out, see which one you prefer.

在C语言中实现泛型类型 Implementing Generic Types in C

在C语言中实现泛型类型
Implementing Generic Types in C