Fixing C Strings
32 points by ushakov 6 months ago | 65 comments
- WalterBright 6 months ago
struct str { char *dat; sz len; };
It's the same solution D uses, except that it's a builtin type, and works for all arrays. I proposed this solution for C:
https://www.digitalmars.com/articles/C-biggest-mistake.html
It's hard to overstate what a huge win this is. D has had 23 years of experience with it, and the virtual elimination of array overflow bugs is just win, win, win.
I will never understand why C keeps adding extensions consisting of marginal features, and ignores this foundational fix. I guess they still aren't tired of buffer overflow bugs always being the #1 security vulnerability of shipped C code (and C++, too!).
- Levitating 6 months ago
> It's the same solution D uses
As well as most other languages and many C codebases right? Often with a separate length/capacity so the buffer can be larger than the string.
- WalterBright 6 months ago
True, but it turns out that very few arrays need to be resizeable.
- volemo 6 months ago
Why this and not the following?
struct str { sz len; char dat[]; };
- WalterBright 6 months ago
Those are called "length prefixed strings", or more simply "Pascal strings". The difficulty is one has to reallocate and copy to represent a substring.
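To make the substring point concrete, here is a minimal sketch in C. It is not taken from the article or the linked proposal: it assumes `sz` is a signed size type such as `ptrdiff_t`, and `str_from_cstr`, `str_sub` and `str_eq` are hypothetical helper names. With the pointer-plus-length layout, a substring is just a new pointer and a new length into the same bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t sz;              /* assumption: the thread's `sz` is a signed size type */

struct str { char *dat; sz len; };

/* Wrap a NUL-terminated C string; the terminator is not counted in len. */
static struct str str_from_cstr(char *s)
{
    struct str r = { s, (sz)strlen(s) };
    return r;
}

/* View of bytes [start, end) of s. No allocation and no copy. */
static struct str str_sub(struct str s, sz start, sz end)
{
    assert(0 <= start && start <= end && end <= s.len);
    struct str r = { s.dat + start, end - start };
    return r;
}

/* Byte-wise equality of two views. */
static int str_eq(struct str a, struct str b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.dat, b.dat, (size_t)a.len) == 0);
}
```

With the length-prefixed layout (`struct str { sz len; char dat[]; }`) the length is stored in front of the bytes, so a view that starts in the middle of the string has nowhere to keep its own length without a fresh allocation and a copy.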
- b3orn 6 months ago
Having a pointer in the struct allows you to increase the string's capacity without changing all the references to it.
- wat10000 6 months ago
Changing pointers to include length would require an ABI break on pretty much all platforms. You’d either have to recompile the world or have some sort of bridging thing that converts calls from C-with-fat-pointers to standard C. And even recompiling the world wouldn’t be enough, since lots of C code relies on being able to do things like cast pointers to integers, manipulate them as numbers, then cast them back. That’s UB by the standard but platforms and compilers can and often do define that behavior to be something useful.
You could say, well, forget binary compatibility and forget nasty code that bit-twiddles pointers. But then why are you even using C? Those are the things that set it apart.
Clang is trying to solve this with annotations that allow the programmer to construct fat pointers, either as structures or just implicitly by having the length in a variable somewhere, and enforcing those bounds in the compiler. Seems promising. https://clang.llvm.org/docs/BoundsSafetyImplPlans.html
- WalterBright 6 months ago
> Changing pointers to include length would require an ABI break on pretty much all platforms.
Fixed with my proposal: https://www.digitalmars.com/articles/C-biggest-mistake.html
- Gibbon1 6 months ago
Yeah, but not having standard buffer and slice types, along with safe APIs that require their use, is unforgivable.
I'm also of the opinion that backwards compatibility with null-terminated strings is actually terrible. Because you want people to eventually go, oh this code uses gross null-terminated strings, let's fix that.
- Koshkin 6 months ago
> I will never understand why C keeps
Well, I, for one, do like the idea of C (in contrast to D or C++) still being sort of the lowest-level high-level programming language - one that's just a notch above the assembler.
- gizmo686 6 months ago
There is nothing particularly low level about null-terminated strings. It is just a convention that a bunch of standard library functions follow. Then there are a dozen variants of each of those standard library functions, most of which end up making you pass in some form of a length parameter anyway, because the convention is so terrible.
At this point, I think we might be in a better world if C simply did not offer any string API in the first place. If
- II2II 6 months ago
Expressing strings as null terminated or in a data structure that includes the size and data has no relationship to it being a higher or lower level language. They still need to determine where a string starts and ends in memory. The means of doing so is different, the assembly language representation will be slightly different, but the language isn't hiding anything behind an abstraction. Contrast that to, say, Pascal where the length of the string is hidden from the developer and can only be accessed through function calls. (It probably should be, but that is beside the point.)
- pjmlp 6 months ago
Depending on the Pascal implementation, the length might be accessible actually, but better not mess directly with it.
- gf000 6 months ago
> one that's just a notch above the assembler.
That has never been true, unless you are writing programs against a PDP-11. C compilers can change even the O complexity of your algorithms, it's nowhere close to assembly.
- teo_zero 6 months ago
> C compilers can change even the O complexity of your algorithms
I'd be curious to see an example...
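One commonly cited illustration, given here as a sketch rather than anything from the thread: a summation loop that is O(n) as written can legally be compiled into an O(1) closed form, and Clang is often reported to do so at -O2. The function names are hypothetical, and whether a particular compiler and flag set performs the rewrite is an assumption worth checking in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Written as an O(n) loop. An optimizer may emit a closed form instead,
   because the observable result is identical. */
static uint64_t sum_below(uint64_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; i++)
        total += i;
    return total;
}

/* The O(1) closed form n*(n-1)/2, dividing the even factor first so the
   intermediate product does not overflow earlier than the loop would. */
static uint64_t sum_below_closed(uint64_t n)
{
    return (n % 2 == 0) ? (n / 2) * (n - 1) : n * ((n - 1) / 2);
}

/* Both versions agree, which is what permits the substitution. */
static void check_equivalence(void)
{
    for (uint64_t n = 0; n <= 1000; n++)
        assert(sum_below(n) == sum_below_closed(n));
}
```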
- WalterBright 6 months ago
My proposal https://www.digitalmars.com/articles/C-biggest-mistake.html does not take anything away from C's low level abilities.
- pjmlp 6 months ago
The usual C mythology continuously spread since the mists of UNIX epoch.
- gpderetta 6 months ago
What's lower level about C compared to C++ or D?
- WalterBright 6 months ago
Nothing. You can get just as down and dirty in D as you can in C. You just don't have to suffer under the preprocessor as it blasts your kingdom. And you've got a fully functional inline assembler, and modules.
- kevin_thibedeau 6 months ago
> Current compilers warn you if the format string doesn’t match its arguments. But this only works on functions that have the same signature as printf so it doesn’t work on my implementation.
GCC has the format attribute that lets you have printf type checking on your own variadic functions:
https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/Common-Functio...
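A minimal sketch of how that attribute is typically applied. `buf_printf` and the `PRINTF_LIKE` macro are hypothetical names chosen for illustration; the attribute itself is GCC's `format(printf, string-index, first-to-check)`, which Clang also accepts. The two numbers are the 1-based positions of the format string and of the first variadic argument.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Expands to the attribute on GCC-compatible compilers, to nothing elsewhere. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, first_arg) \
    __attribute__((format(printf, fmt_idx, first_arg)))
#else
#define PRINTF_LIKE(fmt_idx, first_arg)
#endif

/* Hypothetical helper: argument 3 is the format string, and the
   variadic arguments to check start at position 4. */
static int buf_printf(char *buf, size_t cap, const char *fmt, ...)
    PRINTF_LIKE(3, 4);

static int buf_printf(char *buf, size_t cap, const char *fmt, ...)
{
    assert(fmt != NULL);
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, cap, fmt, ap);  /* returns the untruncated length */
    va_end(ap);
    return n;
}
```

With the attribute in place, a call such as `buf_printf(b, sizeof b, "%d", "text")` should draw the same -Wformat warning that a mismatched printf call would.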
- WalterBright 6 months ago
So does D:
https://dlang.org/spec/pragma.html#printf
It was a huge win, at least for me. I implemented it because I was really sick of mismatches. (Although I was careful to use the right formats, when refactoring I'd change a type and then the printf's would go awry. Having the compiler flag them made for quick fixing.)
- samatman 6 months ago
Zig builds string formatting at comptime, so if the format string doesn't match the arguments, the program won't compile. It's nice.
- norir 6 months ago
Format strings are a technique that I think largely should be left behind anyway. String interpolation is in my experience usually shorter, easier to read and will always be checked by the compiler.
- jonhohle 6 months ago
It’s shorter until you need any type of formatting. It’s hard to get more terse than `%8.2f`.
- gizmo686 6 months ago
You can do that with string interpolation too. In Python it would be:
print(f"{foo:8.2f}")
compared with C-style printf:
printf("%8.2f", foo)
- rowanG077 6 months ago
Also hard to be more cryptic. The number of times I've re-googled that syntax, or what exactly those numbers mean, is basically uncountable.
- simscitizen 6 months ago
There are quite a few of these "better C string" idioms floating around.
Another one to consider is e.g. https://github.com/antirez/sds (used by Redis), which instead stores the string contents in-line with the metadata.
- ropejumper 6 months ago
Two people have already mentioned things like storing the length inline or including a null-terminator to be backwards-compatible. What's described there is basically the same as std::string_view or &str, and to me one of the biggest reasons to use these structures is that your particular view of the string doesn't interfere with someone else's. You can slice your string in the middle and just look at it piecewise without bothering anyone else.
Choosing between these trade-offs just depends on what you're doing. I'd definitely choose this pattern if I were to write a parser for instance.
- Dwedit 6 months ago
The problem with string views is that they are borrowing the parent string, so you'd need to hold a strong reference to the parent string. This is easy to do in a garbage collected language, because you don't have to do anything. But it's a lot more complicated if you need to do this with reference counting. Do you make every single string view update the reference counter? Do you make a special lighter string view that doesn't keep a counted reference, and is subject to memory safety issues?
- ropejumper 6 months ago
Yep, you're right. One way to make this less of a problem is to make this distinction at the type level, having both an owned_string and a string_view for example. You can even make owned_string store its length inline.
- wruza 6 months ago
These are regular questions in languages with (and without) reference counting, what’s so special about string views?
- conradludgate 6 months ago
Typically you need 4 pointers to represent a strong reference count for a string view.
* One for the start of the source string, with an inline strong count
* One for the end of the source string so you know how much to deallocate (only really applicable to Rust)
* One for the start of the view
* One for the end of the view
32 bytes for each string view is quite a lot. Depending on context you could use 32bit lengths instead of end-pointers if you're OK with <4GB strings, saving 8 bytes.
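The arithmetic above can be sketched as C struct layouts. The struct and field names are hypothetical, and the byte counts in the comments assume a typical 64-bit platform with 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four pointers: 32 bytes with 8-byte pointers. The strong count is
   assumed to live inline at the start of the source allocation. */
struct counted_view {
    char *src_start;   /* start of the owning allocation */
    char *src_end;     /* end of the owning allocation */
    char *view_start;  /* first byte of the view */
    char *view_end;    /* one past the last byte of the view */
};

/* Same information with 32-bit lengths instead of end pointers:
   8 + 8 + 4 + 4 = 24 bytes, limited to strings under 4 GB. */
struct counted_view32 {
    char *src_start;
    char *view_start;
    uint32_t src_len;
    uint32_t view_len;
};

/* Length of the view in the four-pointer form. */
static size_t counted_view_len(struct counted_view v)
{
    assert(v.view_start <= v.view_end);
    return (size_t)(v.view_end - v.view_start);
}
```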
- Dwedit 6 months ago
There's basically no distinction between a string view and an array slice. It's borrowing an array, and the view is nothing but a reference to the parent, start position, and length.
But views are also implemented as a plain pointer and a length, and that's where the memory safety issues from borrowing begin.
- jdblair 6 months ago
I've done something similar, but unlike the author, I always reserved one extra byte and I always null terminated the string. This was so I could use existing string output functions.
- cozzyd 6 months ago
Why not have the null terminator so you can pass to normal printf?
You could even do something crazy with packing a null byte with sz on 64-bit systems (since you will never have a string that long anyway...)
- D4ckard 6 months ago
Yes, there are really cool packing techniques. See this talk for example: https://www.youtube.com/watch?v=kPR8h4-qZdk
I don't include the null-terminator because I use this type in my own environment where I never use null-terminated strings so there is no need for it.
- bigpingo 6 months ago
printf("%.*s", len, str);
lets you pass the length as an int argument.
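A small sketch of that technique applied to a pointer-plus-length string. `struct str`, `sz` and `str_format` are assumptions made for illustration; `snprintf` is used instead of `printf` only so the result can be inspected. The precision argument must be an `int`, hence the range check and cast.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

typedef ptrdiff_t sz;                 /* assumed definition of `sz` */
struct str { char *dat; sz len; };

/* Writes "[<contents>]" into out. The view needs no NUL terminator:
   "%.*s" prints at most s.len bytes starting at s.dat. */
static int str_format(char *out, size_t cap, struct str s)
{
    assert(s.len >= 0 && s.len <= INT_MAX);
    return snprintf(out, cap, "[%.*s]", (int)s.len, s.dat);
}
```

One caveat: `%.*s` still stops early at an embedded NUL byte, so it only prints the whole view when the data contains none.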
- up2isomorphism 6 months ago
For all the complaints, all you need to do is include another .h file from some string lib and that’s it.
But I would say that in 95% of cases, using a fixed-length char array with strncpy will work just fine.
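For reference, a sketch of the fixed-array-plus-strncpy pattern. `copy_truncate` is a hypothetical helper; the one detail it handles is that `strncpy` does not write a terminator when the source is at least as long as the count, so the last byte is set explicitly.

```c
#include <assert.h>
#include <string.h>

/* Copy src into dst (capacity cap), truncating if necessary and always
   leaving dst NUL-terminated. */
static void copy_truncate(char *dst, size_t cap, const char *src)
{
    assert(cap > 0);
    strncpy(dst, src, cap - 1);  /* copies at most cap-1 bytes, pads short copies with zeros */
    dst[cap - 1] = '\0';         /* strncpy alone would leave a long src unterminated */
}
```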
- superjared 6 months ago
The bstring library[0] has been around a _long_ time.
- codr7 6 months ago
I would consider putting the buffer last in the structure and making it flexible to allow skipping one allocation.
- ncruces 6 months ago
That misses the point. These are passed by value.
- Levitating 6 months ago
> I liked this kind of pattern at the bottom of OpenAI's site :)
Where on OpenAI's site do I find a footer like that?
- Quis_sum 6 months ago
Sorry, but there is a significant misunderstanding: There is no such thing as a string in C. What you call a string is a pointer to char (typically "int8") - nothing more nothing less. The \0 termination is just a convention/convenience to avoid passing the bounds of the memory segment, resp. when to stop processing earlier.
Once you go down the route proposed by many of the comments here - why not enhance it to deal with UTF8... Or rather implement a proper "array" type? What about the lack of multidimensional arrays instead of the pointer to pointer to ... approach? Idiosyncrasies such as "int a[2][3];" being of type "int *" and not "int **"?
C was never intended to shield you from mistakes, but rather replace a macro assembler. ANSI C addressed some of the issues in the original K&R C, but that is about it.
If your use case would benefit from all of these protections, there are plenty of higher level language alternatives...
- kelnos 6 months ago
That's incorrect. If I write this in my .c file:
char *s = "Hi";
The compiler will not treat that as a mere pointer to char when allocating space for it in the binary. It will see that the rhs is surrounded by double quote characters, and allocate 3 bytes for it, instead of 2, and put a NUL byte after the bytes for 'H' and 'i'.
NUL-terminated strings are absolutely a part of the language. Certainly you can make and store strings in a different way if you'd like, but the language itself defines what a string and string literal is.
- Quis_sum 6 months ago
s is still a pointer to character. This is just an optimised shorthand for (assuming ASCII):
char s[3];
Which is then initialised with: 0x48,0x69,0x00
There simply is no such thing as a string type in C.
All the "string" functions work on a char pointer which is incremented until it points to a 0.
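Both positions can be checked with a short sketch (`literal_sizes` is a hypothetical name). The terminator is added by the compiler as part of the literal, while the resulting object is still just an array of char, or a pointer to its first element.

```c
#include <assert.h>
#include <string.h>

static void literal_sizes(void)
{
    char arr[] = "Hi";        /* array initialised from the literal */
    const char *ptr = "Hi";   /* pointer to the literal's first char */

    assert(sizeof "Hi" == 3); /* the literal itself is 'H', 'i', '\0' */
    assert(sizeof arr == 3);  /* the array takes its size from the literal */
    assert(arr[2] == '\0');   /* terminator supplied by the compiler */
    assert(strlen(ptr) == 2); /* library functions stop at the first 0 byte */
}
```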
- __d 6 months ago
String literals are one place where the compiler implements the null-termination, so it is built-in in that sense.
As per the OP’s example, a wrapper macro like their STR can work around this.
- Quis_sum 6 months ago
I don't have an issue with the OP's example, which is quite nifty indeed, despite the penalties incurred.
- teo_zero 6 months ago
Good attempt at a topic that annoys many programmers.
I see a problem with the separation between str and str_buf, though: you create new strings with the latter, but most functions take the former as arguments. Do you convert them every time? Isn't your code littered with str_from_buf()?
Put it in another way, it's like the mess with const that you mention in your article. If str is the type you use for a const read-only string, and str_buf for a non-const mutable string, you would like to pass a non-const even to those functions that "only" require a const. (I say "only" because being const is a weaker requirement than being mutable; the fact that it's more wordy is another thing that C's syntax makes confusing, but this is an entirely different topic!)
It would be nice if the compiler could be instructed to automatically cast str_buf into str and not vice versa, just like it does for non-const to const.
The only way out I can think of would be to get rid of the two types and only use the one with the cap field, with the convention that if cap is zero, then the string is read-only. The drawback is that certain mistakes are only detected at run-time and not enforced by the compiler. For example, a function that takes a string s and replaces every substring s1 with s2 could have the following prototype in the two-type system:
replace(str_buf s, str s1, str s2);
And it would be immediate to recognize that you cannot pass a read-only string as the first argument. With a one-type system you lose this ability.
Oh well, I guess if a perfect solution existed, it would have been adopted by the C committee, wouldn't it? /s
- kelnos 6 months ago
> Do you convert them every time?
No, the article addresses this: since the memory layout of the first two struct members is the same in both structs, you can use a pointer to str_buf anywhere a function calls for a pointer to str, after casting it.
- teo_zero 6 months ago
> you can use a pointer to str_buf anywhere a function calls for a pointer to str
Yes, you could, but I see no function mentioned in TFA that wants a pointer to str, only functions that want a str: print_str(), print_fmt(), com_write(). At the same time, the functions that return strings return a struct, never a pointer: str_new(), str_from_range(), str_from_buf(), fmt_buf_new(), and the pseudo-function STR().
To use the memory layout trick you should go through reference + cast + dereference:
*((struct str *)&...)
My question still holds: is the code littered with such conversion artifacts?
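For comparison, a sketch of the explicit-conversion alternative to the cast. The struct layouts are assumptions based on the discussion (the article's actual definitions may differ), and `count_byte` is a hypothetical by-value consumer. Copying the two fields avoids the pointer cast, and any aliasing questions it raises, at the cost of a visible call at each site.

```c
#include <assert.h>
#include <stddef.h>

typedef ptrdiff_t sz;                            /* assumed definition of `sz` */

struct str     { char *dat; sz len; };           /* read-only view (assumed layout) */
struct str_buf { char *dat; sz len; sz cap; };   /* growable buffer (assumed layout) */

/* Explicit conversion: copy the two shared fields by value. */
static struct str str_from_buf(struct str_buf b)
{
    assert(0 <= b.len && b.len <= b.cap);
    struct str s = { b.dat, b.len };
    return s;
}

/* Hypothetical consumer that takes a str by value, like the article's
   print_str(): counts occurrences of c in s. */
static sz count_byte(struct str s, char c)
{
    sz n = 0;
    for (sz i = 0; i < s.len; i++)
        if (s.dat[i] == c)
            n++;
    return n;
}
```

A call site then reads `count_byte(str_from_buf(b), ',')`, which is exactly the kind of conversion noise the question is about.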
- zwnow 6 months ago
Never had a string related bug in any programming language in 4 years. I sincerely don't know what people talk about when they claim strings are buggy? What kinda tasks do these happen in?
- Koshkin 6 months ago
It's just that the "traditional" implementations of the operations on C strings (strcat etc.) are considered unsafe - which they are, strictly speaking. (But, to be fair, I haven't ever had problems using them, either.)
- paulddraper 6 months ago
I don’t know who you’re talking to or what they’re saying, so I can’t say.
This article is about C strings FYI.
- zwnow 6 months ago
The article claims they are buggy, that's what I am referring to.
- zabzonk 6 months ago
I have been using null terminated strings since the mid 1970s - before using C, and have never had any problems with them. I have never seen an explanation from someone that has that makes any sense.
- atiedebee 6 months ago
In my experience, the standard library is inconsistent with its 0 terminator handling.
fgets will treat the length passed as the capacity of the buffer, and terminate the last byte with a 0.
scanf however treats the length as the number of characters to read, meaning that you need a capacity of n+1 to make sure the 0 terminator is stored properly as well.
It's quite easy to mess up placing the 0 terminator yourself too. It's an overall unnecessary burden that could've been fixed quite easily.
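The inconsistency can be seen side by side in a short sketch. The helper names are hypothetical, `sscanf` stands in for `scanf` so the input can come from a string, and `tmpfile()` is used only to give `fgets` a stream to read. Both helpers use a 5-byte buffer and end up storing 4 characters, but the number passed is 5 in one case and 4 in the other.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* fgets: the count is the buffer capacity, terminator included.
   Returns the number of characters stored, or 0 on failure. */
static size_t chars_from_fgets(const char *input)
{
    assert(input != NULL);
    char buf[5];
    FILE *f = tmpfile();
    if (f == NULL)
        return 0;
    fputs(input, f);
    rewind(f);
    char *r = fgets(buf, sizeof buf, f);  /* 5 means: at most 4 chars + '\0' */
    fclose(f);
    return r ? strlen(buf) : 0;
}

/* scanf family: the field width counts characters only, so the buffer
   must be width + 1 bytes to hold the terminator as well. */
static size_t chars_from_sscanf(const char *input)
{
    assert(input != NULL);
    char buf[4 + 1];
    int n = sscanf(input, "%4s", buf);    /* 4 means: at most 4 chars, then '\0' */
    return n == 1 ? strlen(buf) : 0;
}
```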
- zabzonk 6 months ago
the standard library certainly has a few problems, if you haven't read the docs, but that does not mean that i do.
- kelnos 6 months ago
I'm sorry that not every programmer in the world has achieved your level of supreme perfection. We shouldn't design our languages or stdlib with the assumption that everyone will have read every line of documentation about them, and (even if they have) remember everything every time they sit down in front of their text editor. That's unrealistic.
I don't actually believe you that you've been programming for 50 years and never misused a string or a string API in C or a language with similar string handling. But even if I did believe you, it wouldn't matter. Many people make mistakes, and those mistakes have cost people a lot of time, money, and stress. If you've not read about any of these instances, then I suggest you've been living under a rock and are incredibly out of touch.
Or you're just trolling.