How to actually use regex.h

2022-08-18T14:40:18Z

Following an idea from solene, I recently modified the vger code to use a regex to extract the various elements of a request.

Since vger is written in C, I went for the library included in OpenBSD: regex.h.

I found the regex.3 manpage a bit hard to read. What I really missed were examples to check that I had understood everything, especially about extracting substrings rather than merely testing whether a regex matches.

So, here is how to use regex.h.

vger

regex.3

The regex we'll use

Let's say we have the string: "My name is Paul Muad'Dib".

We need to get the first and last name. So we could write this stupid regex: "^.* (.*) (.*)$"

Subexpressions are the parts enclosed in parentheses.

Variable declarations

First, we need a regex_t variable to store our compiled regex.

We store the matches in an array of regmatch_t structures that we'll call "match". This array must be large enough to hold the full match plus each substring.

In our case, we hope to match 2 substrings, so the array needs 2+1 elements: match[0] describes the whole match, and match[1] and match[2] the two subexpressions.

The ed source code uses a define to store this size, which I think is clever (they set it to 30 oO).

This gives us:

#define SE_MAX 3 /* number of expected subexpressions + 1 */
...
regex_t reg;
regmatch_t match[SE_MAX];
size_t nmatch = SE_MAX;

Run the regex

This is quite easy: we call regcomp to compile the regex and regexec to run it.

regcomp(&reg, regex, REG_EXTENDED);
regexec(&reg, s, nmatch, match, 0);

Extract the substrings

Now, match[1] should describe the first name and match[2] the last name.

reg.re_nsub, which regcomp sets to the number of parenthesized subexpressions in the regex, is 2 here.

Each item in match has two members, rm_so and rm_eo, storing the start and end offsets of the match relative to the original string.

In other words, rm_so is the position in the original string where the subexpression starts, and rm_eo is the position of the first character after it.

In our case, the first name starts at offset 11 (counting from 0) and ends just before offset 15.

My name is Paul Muad'Dib
           ^   ^
           |   |
           |   +-> match[1].rm_eo = 15
           |
           +-> match[1].rm_so = 11

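If a subexpression did not take part in the match, its rm_so and rm_eo are both set to -1. They are also equal when the subexpression matched an empty string.

So we first check that the subexpression matched something: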

if ((len = match[i].rm_eo - match[i].rm_so) > 0) {

Doing so, we get the length of the substring to copy.

Then, we can copy the match located at s + match[i].rm_so, s being our original string.

memcpy(first, s + match[i].rm_so, len);

Full code

To handle errors, we call regerror to store the error message in a buffer "buf":

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SE_MAX 3	/* number of expected subexpressions + 1 (whole match) */

int
main(void)
{
	regex_t reg;
	regmatch_t match[SE_MAX];
	size_t nmatch = SE_MAX;
	size_t len = 0;
	int ret = 0;
	char buf[BUFSIZ] = {'\0'};
	char *regex = "^.* (.*) (.*)$";
	char *s = "My name is Paul Muad'Dib";
	char first[10] = {'\0'};
	char last[10] = {'\0'};

	if ((ret = regcomp(&reg, regex, REG_EXTENDED)) != 0) {
		/* reg is undefined after a failed regcomp: skip regfree */
		regerror(ret, &reg, buf, sizeof(buf));
		fprintf(stderr, "regcomp: %s\n", buf);
		return 1;
	}
	if ((ret = regexec(&reg, s, nmatch, match, 0)) != 0) {
		regerror(ret, &reg, buf, sizeof(buf));
		fprintf(stderr, "regexec: %s\n", buf);
		goto stop;
	}
	/* match[0] is the whole match, subexpressions start at 1 */
	for (size_t i = 1; i <= reg.re_nsub && i < nmatch; i++) {
		/* skip unmatched (-1, -1) and empty subexpressions */
		if ((len = match[i].rm_eo - match[i].rm_so) > 0) {
			switch (i) {
			case 1:
				/* keep one byte for the terminating '\0' */
				if (len < sizeof(first))
					memcpy(first, s + match[i].rm_so, len);
				break;
			case 2:
				if (len < sizeof(last))
					memcpy(last, s + match[i].rm_so, len);
				break;
			}
		}
	}
	printf("first name: %s\nlast name: %s\n", first, last);
 stop:
	regfree(&reg);
	return ret != 0;
}