How to build dependable bare-metal ARM firmware with UNIX tools

For software developers, the world of hardware and firmware can be an exciting change. Firmware catapults your logic into the physical world. Rather than moving text between forms and a database, you can move motors. Rather than listening for an API call, you can listen for SONAR or GPS signals.

This is the guide I wish I had when first starting embedded development. It cultivates professional embedded programming habits from the start. We’ll skip beginner ecosystems like Arduino, and get the most out of hardware with bare metal programming.

The low-level approach allows you to:

  • Choose from a variety of chips to match project requirements.
  • Use a real-time OS, if desired, instead of a “superloop,” for more natural multitasking.
  • Make the most of hardware resources. No Arduino, and no embedded Linux. Projects have nearly instant boot times.
  • Avoid bugs in intermediate libraries by using a smaller software stack.
  • Do full remote debugging with the ability to breakpoint and inspect variables and registers.
  • Achieve MISRA conformance if necessary, for safety-critical systems.

In particular, we target the ARM architecture, due to its popularity. While the examples use STMicroelectronics hardware, we avoid their vendor IDE and hardware abstraction layer (HAL). The principles in this guide work with chips from any ARM vendor. Rather than proprietary IDEs and libraries, we’ll use entirely open source tools in a Unix environment (like BSD, Linux, or macOS). Here’s why:

  • Your project won’t “bit rot.” Once it builds, it will continue to build for years to come.
  • Leverage a mature toolset, like POSIX Make, C99, GCC/LLVM, and GDB/LLDB. They’re either already on your system, or easy to install with the OS package manager.
  • Use the ubiquitous CMSIS hardware interface. ARM contractually obligates its hardware vendors to supply CMSIS implementations for their products.
  • Let official manuals rather than 3rd party libraries be the source of truth. The register names in CMSIS match terminology in the hardware reference manuals.

Using a strong foundation of toolchain and libraries, we’ll build the same simple “blinky” project in four different ways. We’ll see the boot-up sequence of CMSIS vs the standard library crt0 system. We’ll try writing the program with and without an RTOS, and try dynamic vs static memory allocation. We’ll also see an example of a fault handler, and how to do remote debugging.

By the end of the guide, you can venture confidently into building, flashing, and debugging more complex projects. The guide constructs examples based on product datasheets and first principles; it’s not a copy of existing demos or code snippets.

Download the guide below. For the cost of a sandwich you’ll be up and running.

Download the eBook ($12)

Pleasant debugging with GDB and DDD

GDB is an old and ubiquitous debugger for Linux and BSD systems that has extensive language, processor, and binary format support. Its interface is a little cryptic, but learning GDB pays off.

This article is a set of miscellaneous configuration and scripting tricks that illustrate reusable principles. It assumes you’re familiar with the basics of debugging, like breakpoints, stepping, inspecting variables, etc.

GDB front ends

By default, GDB provides a terse line-based terminal. You need to explicitly ask to print the source code being debugged, the values of variables, or the current list of breakpoints. There are four ways to customize this interface. Ordered from basic to complicated, they are:

  1. Get used to the default behavior. Then you’ll be comfortable on any system with GDB installed. However, this approach does forego some real conveniences.
  2. Enable the built-in GDB TUI mode with the -tui command line flag (available since GDB version 7.5). The TUI creates Curses windows for source, registers, commands, etc. It’s easier to trace execution through the code and spot breakpoints than in the default interface.
  3. Customize the UI using scripting, sourced from your .gdbinit. Some good examples are projects like gdb-dashboard and gef.
  4. Use a graphical front-end that communicates with an “inferior” GDB instance. Front ends either use the GDB machine interface (MI) to communicate, or they screen scrape sessions directly.

In my experiments, the TUI mode (option two) seemed promising, but it has some limitations:

  • no persistent window to display variables or the call stack
  • no ability to set or clear breakpoints by mouse
  • no value inspection with mouse hover
  • mouse scroll wheel didn’t work for me on OpenBSD+xterm
  • no interactive structure/pointer exploration
  • no historical value tracking for variables (aside from GDB’s Linux-only process record and replay)

Ultimately I chose option four, with the Data Display Debugger (DDD). It’s fairly ancient, and requires configuration changes to work at all with recent versions of GDB. However, it has a lot of features delivered in a 3MB binary, with no library dependencies other than a Motif-compatible UI toolkit. DDD can also control GDB sessions remotely over SSH.

DDD screenshot

Fixing DDD freeze on startup

As a front-end, DDD translates user actions to text commands that it sends to GDB. Newer front-ends use GDB’s unambiguous machine interface (MI), but DDD never got updated for that. It parses the standard text interface, essentially screen scraping GDB’s regular output. This causes some problems, but there are workarounds.

Upon starting DDD, the first serious error you’ll run into is the program locking up with this message:

Waiting until GDB gets ready...

The freeze happens because DDD is waiting for the prompt (gdb). However, it never sees that prompt, because DDD itself incorrectly changed the prompt at startup.

To fix this error, you must explicitly set the prompt and unset the extended-prompt. In ~/.ddd/init include this code:

Ddd*gdbSettings: \
unset extended-prompt\n\
set prompt (gdb) \n

The root of the problem is that during DDD’s first run, it probes all GDB settings and saves them into its .ddd/init file for consistency in future runs. It probes by running show settingname for each setting. However, it misinterprets the results for these settings:

  • exec-direction
  • extended-prompt
  • filename-display
  • interactive-mode
  • max-value-size
  • mem inaccessible-by-default
  • mpx bound
  • record btrace bts
  • record btrace pt
  • remote interrupt-sequence
  • remote system-call-allowed
  • tdesc

The incorrect detection is especially bad for extended-prompt. GDB reports the value as not set, which DDD interprets – not as the lack of a value – but as text to set for the extended prompt. That text overrides the regular prompt, causing GDB to output not set as its actual prompt.

Honoring gdbinit changes

As mentioned, DDD probes and saves all GDB settings during first launch. While specifying all settings in ~/.ddd/init might make for deterministic behavior on local and remote debugging sessions, it’s inflexible. I want ~/.gdbinit to be the source of truth.

Thus you should:

  • Delete all Ddd*gdbSettings other than the prompt ones above, and
  • Set Ddd*saveOptionsOnExit: off to prevent DDD from putting the values back.
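
Putting the two together, a minimal ~/.ddd/init might contain little more than:

Ddd*gdbSettings: \
unset extended-prompt\n\
set prompt (gdb) \n
Ddd*saveOptionsOnExit: off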

Dark mode

DDD’s default color scheme is a bit glaring. For dark mode in the code window, console, and data display panel, set these resources:

Ddd*XmText.background:             black
Ddd*XmText.foreground:             white
Ddd*XmTextField.background:        black
Ddd*XmTextField.foreground:        white
Ddd*XmList.background:             black
Ddd*XmList.foreground:             white
Ddd*graph_edit.background:         #333333
Ddd*graph_edit.edgeColor:          red
Ddd*graph_edit.nodeColor:          white
Ddd*graph_edit.gridColor:          white

UTF-8 rendering

By default, DDD uses X core fonts. All its resources, like Ddd*defaultFont, can pick from only those legacy fonts, which don’t properly render UTF-8. For proper rendering, we have to change the Motif rendering table to use the newer FreeType (XFT) fonts. Pick an XFT font you have on your system; I chose Inconsolata:

Ddd*renderTable: rt
Ddd*rt*fontType: FONT_IS_XFT
Ddd*rt*fontName: Inconsolata
Ddd*rt*fontSize: 8

The change applies to all UI areas of the program except the data display window. That window comes from an earlier codebase bolted on to DDD, and I don’t know how to change its rendering. AFAICT, you can choose only legacy fonts there, with Ddd*dataFont and Ddd*dataFontSize.

Although international graphemes are garbled in the data display window, you can inspect UTF-8 variables by printing them in the GDB console, or by hovering the mouse over variable names for a tooltip display.

Remote GDB configuration

DDD interacts with GDB through the terminal like a user would, so it can drive debugging sessions over SSH just as easily as local sessions. It also knows how to fetch remote source files, and find remote program PIDs to which GDB can attach. DDD’s default program for running commands on a remote inferior is remsh or rsh, but it can be customized to use SSH:

Ddd*rshCommand: ssh -t

In my experience, the -t is needed, or else GDB warnings and errors can appear out of order with the (gdb) prompt, making DDD hang.

To debug a remote GDB over SSH, pass the --host option to DDD. I usually include these command-line options:

ddd --debugger gdb --host admin@example.com --no-exec-window

(I specify the remote debugger command as gdb when it differs from my local inferior debugger command of egdb from the OpenBSD devel/gdb port.)

GDB tricks

Useful execution commands

Beyond the basics of run, continue and next, don’t forget some other handy commands.

  • finish - execute until the current function returns, and break in caller. Useful if you accidentally go too deep, or if the rest of a function is of no interest.
  • until - execute until reaching a later line. You can use this on the last line of a loop to run through the rest of the iterations, break out, and stop.
  • start - create a temporary breakpoint on the first line of main() and then run. Starts the program and breaks right away.
  • step vs next - how to remember the difference? Think a flight of “steps” goes downward, “stepping down” into subroutines. Whereas “next” is the next contiguous source line.

Batch mode

GDB can be used non-interactively, with predefined scripts, to create little utility programs. For example, the poor man’s profiler is a technique of calling GDB repeatedly to sample the call stack of a running program. It sends the results to awk to tally where most wall clock time (as opposed to just CPU time) is being spent.
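
Here’s a minimal sketch of that technique in shell, assuming the target’s PID is in $pid. The awk tally is approximate, since not every frame line contains “ in ”:

for i in 1 2 3 4 5 6 7 8 9 10; do
    gdb --batch -ex "thread apply all bt" -p "$pid" 2>/dev/null
    sleep 1
done |
awk '/ in / { for (i = 1; i <= NF; i++) if ($i == "in") print $(i+1) }' |
sort | uniq -c | sort -rn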

A related idea is using GDB to print information about a core dump without leaving the UNIX command line. We can issue a single GDB command to list the backtraces for all threads, plus all stack frame variables and function arguments. Notice the print settings customized for clean, verbose output.

# show why program.core died

gdb --batch \
  -ex "set print frame-arguments all" \
  -ex "set print pretty on" \
  -ex "set print addr off" \
  -ex "thread apply all bt full" \
  /path/to/program program.core

You can put this incantation (minus the final program and core file paths) into a shell alias (like bt) so you can run it more easily. To test, you can generate a core by running a program and sending it SIGQUIT with Ctrl-\. Adjusting ulimit -c may also be necessary to save cores, depending on your OS.
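
For instance, as an alias, the program and core paths you pass get appended after the -ex options:

alias bt='gdb --batch -ex "set print frame-arguments all" -ex "set print pretty on" -ex "set print addr off" -ex "thread apply all bt full"'

# usage: bt /path/to/program program.core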

User-defined commands

GDB allows you to define custom commands that can do arbitrarily complex things. Commands can set breakpoints, display values, and even call to the shell.

Here’s an example that does a few of these things. It traces the system calls made by a single function of interest. The real work happens by shelling out to OpenBSD’s ktrace(1). (An equivalent tracing utility should exist for your operating system.)

define ktrace
    # if a user presses enter on a blank line, GDB will by default
    # repeat the command, but we don't want that for ktrace

    dont-repeat

    # set a breakpoint for the specified function, and run commands
    # when the breakpoint is hit

    break $arg0
    commands
        # don't echo the commands to the user
        silent

        # set a convenience variable with the result of a C function
        set $tracepid = (int)getpid()

        # eval (GDB 7.2+) interpolates values into a command, and runs it
        eval "set $ktraceout=\"/tmp/ktrace.%d.out\"", $tracepid
        printf "ktrace started: %s\n", $ktraceout
        eval "shell ktrace -a -f %s -p %d", $ktraceout, $tracepid

        printf "\nrun \"ktrace_stop\" to stop tracing\n\n"

        # "finish" continues execution for the duration of the current
        # function, and then breaks
        finish

        # After commands that continue execution, like finish does,
        # we lose control in the GDB breakpoint. We cannot issue
        # more commands here
    end

    # GDB automatically sets $bpnum to the identifier of the created breakpoint
    set $tracebp = $bpnum
end

define ktrace_stop
    dont-repeat

    # consult $ktraceout and $tracebp set by ktrace earlier

    eval "shell ktrace -c -f %s", $ktraceout
    del $tracebp
    printf "ktrace stopped for %s\n", $ktraceout
end

Here’s a demonstration with a simple program. It has two functions that involve different kinds of system calls:

#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <unistd.h>

void delay(void)
{
	sleep(1);
}

void alert(void)
{
	puts("Hello");
}

int main(void)
{
	alert();
	delay();
}

After loading the program into GDB, here’s how to see which syscalls the delay() function makes. Tracing is focused to just that function, and doesn’t include the system calls made by any other functions, like alert().

(gdb) ktrace delay
Breakpoint 1 at 0x1a10: file sleep.c, line 7.
(gdb) run
Starting program: sleep
ktrace started: /tmp/ktrace.5432.out

run "ktrace_stop" to stop tracing

main () at sleep.c:20
(gdb) ktrace_stop
ktrace stopped for /tmp/ktrace.5432.out

The trace output is a binary file, and we can use kdump(1) to view it, like this:

$ kdump -f /tmp/ktrace.5432.out
  5432 sleep    CALL  kbind(0x7f7ffffda6a8,24,0xa0ef4d749fb64797)
  5432 sleep    RET   kbind 0
  5432 sleep    CALL  nanosleep(0x7f7ffffda748,0x7f7ffffda738)
  5432 sleep    STRU  struct timespec { 1 }
  5432 sleep    STRU  struct timespec { 0 }
  5432 sleep    RET   nanosleep 0

This shows that, on OpenBSD, sleep(3) calls nanosleep(2).

On a related note, another way to get insight into syscalls is by setting catchpoints to break on a call of interest. This is a Linux-only feature.
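
For example, to break whenever the traced program enters or returns from nanosleep:

(gdb) catch syscall nanosleep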

Hooks

GDB treats specially any user-defined commands whose names begin with hook- or hookpost-. It runs hook-foo (hookpost-foo) automatically before (after) a user runs the command foo. In addition, a pseudo-command “stop” exists for hooking the moment execution stops at a breakpoint.

As an example, consider automatic variable displays. GDB can automatically print the value of expressions every time the program stops with, e.g. display varname. However, what if we want to display all local variables this way?

There’s no direct expression to do it with display, but we can create a hook:

define hook-stop
    # do it conditionally
    if $display_locals_flag
        # dump the values of all local vars
        info locals
    end
end

# commands to (de)activate the display

define display_locals
    set $display_locals_flag = 1
end

define undisplay_locals
    set $display_locals_flag = 0
end

To be fair, the TUI single key mode binds info locals to the v key, so our hook is less useful in TUI mode than it first appears.

Python API

Simple helper functions

GDB exposes a Python API for finer control over the debugger. GDB scripts can include Python directly in designated blocks. For instance, right in .gdbinit we can access the Python API to get call stack frame information.

In this example, we’ll trace function calls matching a regex. If no regex is specified, we’ll match all functions visible to GDB, except low level functions (which start with underscore).

# drop into python to access frame information

python
    # this module contains the GDB API

    import gdb

    # define a helper function we can use later in a user command
    #
    # it prints the name of the function in the specified frame,
    # with indentation depth matching the stack depth

    def frame_indented_name(frame):
        # frame.level() is not always available,
        # so we traverse the list and count depth

        f = frame
        depth = 0
        while f:
            depth += 1
            f = f.older()
        return "%s%s" % ("  " * depth, frame.name())
end

# trace calls of functions matching a regex

define ftrace
    dont-repeat

    # we'll set possibly many breakpoints, so record the
    # starting number of the group

    set $first_new = 1 + ($bpnum ? $bpnum : 0)

    if $argc < 1
        # by default, trace all functions except those that start with
        # underscore, which are low-level system things
        #
        # rbreak sets multiple breakpoints via a regex

        rbreak ^[a-zA-Z]
    else
        # or match based on ftrace argument, if passed

        rbreak $arg0
    end
    commands
        silent
        
        # drop into python again to use our helper function to
        # print the name of the newest frame

        python print(frame_indented_name(gdb.newest_frame()))

        # then immediately keep going
        cont
    end

    printf "\nTracing enabled. To disable, run:\n\tdel %d-%d\n", $first_new, $bpnum
end

To use ftrace, put breakpoints at either end of an area of interest. When you arrive at the first breakpoint, run ftrace with an optional regex argument. Then, continue the debugger and watch the output.
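
A session might look like this (the breakpoint name is illustrative):

(gdb) break tm_insert
(gdb) run
(gdb) ftrace
(gdb) continue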

Here’s sample trace output from inserting a key-value into a treemap (tm_insert()) in my libderp library. You can see the “split” and “skew” operations happening in the underlying balanced AA-tree.

tm_insert
  malloc
    omalloc
  malloc
    omalloc
          map
          insert
  internal_tm_insert
    derp_strcmp
    internal_tm_insert
      derp_strcmp
      internal_tm_insert
        derp_strcmp
        internal_tm_insert
        internal_tm_skew
        internal_tm_split
      internal_tm_skew
      internal_tm_split
    internal_tm_skew
    internal_tm_split

Pretty printing

GDB allows you to customize the way it displays values. For instance, you may want to inspect Unicode strings when working with the ICU library. ICU’s internal encoding for UChar is UTF-16. GDB has no way to know that an array ostensibly containing numbers is actually a string of UTF-16 code units. However, using the Python API, we can convert the string to a form GDB understands.

While a bit esoteric, this example provides the template you would use to create pretty printers for any type.

import gdb.printing, re

# a pretty printer 

class UCharPrinter:
    'Print ICU UChar string'

    def __init__(self, val):
        self.val = val

    # tell gdb to print the value in quotes, like a string
    def display_hint(self):
        return 'string'

    # the actual work...
    def to_string(self):
        p_c16 = gdb.lookup_type('char16_t').pointer()
        return self.val.cast(p_c16).string('UTF-16')

# bookkeeping that associates the UCharPrinter with the types
# it can handle, and adds an entry to "info pretty-printer"

class UCharPrinterInfo(gdb.printing.PrettyPrinter):
    # friendly name for printer
    def __init__(self):
        super().__init__('UChar string printer')
        self._re = re.compile(r'^UChar [\[*]')
  
    # is UCharPrinter appropriate for val?
    def __call__(self, val):
        if self._re.match(str(val.type)):
            return UCharPrinter(val)

While it’s nice to create code such as the pretty printer above, the code won’t do anything until we tell GDB how and when to load it. You can certainly dump Python code blocks into your ~/.gdbinit, but that’s not very modular, and can load things unnecessarily.

I prefer to organize the code in dedicated directories like this:

mkdir -p ~/.gdb/{py-modules,auto-load}

The ~/.gdb/py-modules directory is for user modules (like the ICU pretty printer), and ~/.gdb/auto-load is for scripts that GDB automatically loads at certain times.

Having created those directories, tell GDB to consult them. Add this to your ~/.gdbinit:

add-auto-load-safe-path /home/foo/.gdb
add-auto-load-scripts-directory /home/foo/.gdb/auto-load

Now, when GDB loads a library like /usr/lib/baz.so.x.y on behalf of your program, it will also search for ~/.gdb/auto-load/usr/lib/baz.so.x.y-gdb.py and load it if it exists. To see which libraries GDB loads for an application, enable verbose mode, and then start execution.

(gdb) set verbose
(gdb) start

...
Reading symbols from /usr/libexec/ld.so...
Reading symbols from /usr/lib/libpthread.so.26.1...
Reading symbols from ...

On my machine for an application using ICU, GDB loaded /usr/local/lib/libicuuc.so.20.1 (among others). To enable the ICU pretty printer, I create a matching auto-load file:

# ~/.gdb/auto-load/usr/local/lib/libicuuc.so.20.1-gdb.py

import gdb.printing
import printers.libicuuc

gdb.printing.register_pretty_printer(
    gdb.current_objfile(),
    printers.libicuuc.UCharPrinterInfo())

The final question is how the auto-loader resolves the printers.libicuuc module. We need to add ~/.gdb/py-modules to the Python system path. I use a little trick: a file in the appropriate directory that detects its own location and adds it to sys.path:

# ~/.gdb/py-modules/add-syspath.py

import sys, os

sys.path.append(os.path.dirname(os.path.realpath(__file__)))

Then just source the file from ~/.gdbinit:

source /home/foo/.gdb/py-modules/add-syspath.py

After doing that, save the ICU pretty printing code as ~/.gdb/py-modules/printers/libicuuc.py, and the import printers.libicuuc statement will find it.
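
The resulting layout under ~/.gdb looks like this:

~/.gdb
├── auto-load
│   └── usr/local/lib/libicuuc.so.20.1-gdb.py
└── py-modules
    ├── add-syspath.py
    └── printers
        └── libicuuc.py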

DDD features

In addition to providing a graphical user interface, DDD has a few features of its own.

Historical values

Each time the program stops at a breakpoint, DDD records the values of all displayed variables. You can place breakpoints strategically to sample the historical values of a variable, and then view or plot them on a graph.

For instance, compile this program with debugging information enabled, and load it in DDD:

int main(void)
{
	unsigned x = 381;
	while (x != 1)
		x = (x % 2 == 0) ? x/2 : 3*x + 1;
	return 0;
}

  1. Double click to the left of the x = ... line to set a breakpoint. Right click the stop sign icon that appears, and select Properties…. In the dialog box, click Edit >> and enter continue into the text box. Apply your change and close the dialog. This breakpoint will stop, record the value of x, then immediately continue running.

  2. Set a breakpoint on the return 0 line.

  3. Select GDB console from the View menu (or press Alt-1).

  4. Run start in the GDB console to run the program and break at the first line.

  5. Double click the “x” variable to add it to the graphical display. (If you don’t put it in the display window, DDD won’t track its values over time.)

  6. Select Continue from the Program menu (or press F9). You’ll see the displayed value of x updating rapidly.

  7. When execution stops at the last breakpoint, run graph history x in the GDB console. It will output an array of all previous values:

    (gdb) graph history x
    history x = {0, 381, 1144, 572, 286, 143, 430, 215, 646, 323, 970, 485,
    1456, 728, 364, 182, 91, 274, 137, 412, 206, 103, 310, 155, 466, 233, 700, 350,
    175, 526, 263, 790, 395, 1186, 593, 1780, 890, 445, 1336, 668, 334, 167, 502,
    251, 754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 958, 479, 1438, 719,
    2158, 1079, 3238, 1619, 4858, 2429, 7288, 3644, 1822, 911, 2734, 1367, 4102,
    2051, 6154, 3077, 9232, 4616, 2308, 1154, 577, 1732, 866, 433, 1300, 650, 325,
    976, 488, 244, 122, 61, 184, 92, 46, 23, 70, 35, 106, 53, 160, 80, 40, 20, 10,
    5, 16, 8, 4, 2, 1}

graph of values

To see the values plotted graphically, run

graph plot `graph display x`

DDD sends the data to gnuplot to render the graph. (Be sure to set Ddd*plotTermType: x11 in ~/.ddd/init, or else DDD will hang with a dialog saying “Starting Gnuplot…”.)

Interesting shortcuts

DDD has some shortcuts that aren’t obvious from the interface, but which I found interesting in the documentation.

  • Control-doubleclick on the left of a line to set a temporary breakpoint, or on an existing breakpoint to delete it. Control double clicking in the data window dereferences in place, rather than creating a new display.
  • Click and drag a breakpoint to a new line, and it moves while preserving all its properties.
  • Click and hold buttons to reveal special functions. For instance, on the watch button to set a watchpoint on change or on read.
  • Pressing Esc (or the interrupt button) acts like an impromptu breakpoint.
  • By default, typing into the source window redirects keystrokes to the GDB console, so you don’t have to focus the console to issue commands.
  • Control-Up/Down changes the stack frame quickly.
  • The data window can display more than individual local variables. Go to Data -> Status Displays to access checkboxes of other common displays, like the backtrace, or all local vars at once.
  • Pressing F1 shows help specific to whatever control is under the mouse cursor.
  • GDB by default tries to confirm kill/detach when you quit. Use ‘set confirm off’ to disable the prompt.

Practical parsing with Flex and Bison

Although parsing is often described from the perspective of writing a compiler, there are many common smaller tasks where it’s useful. Reading file formats, talking over the network, creating shells, and analyzing source code are all easier using a robust parser.

By taking time to learn general-purpose parsing tools, you can go beyond fragile homemade solutions, and inflexible third-party libraries. We’ll cover Lex and Yacc in this guide because they are mature and portable. We’ll also cover their later incarnations as Flex and Bison.

Above all, this guide is practical. We’ll see how to properly integrate parser generators into your build system, how to create thread-safe parsing modules, and how to parse real data formats. I’ll motivate each feature of the parser generator with a concrete problem it can solve. And, I promise, none of the typical calculator examples.

Lexical scanning

People usually use two stages to process structured text. The first stage, lexing (aka scanning), breaks the input into meaningful chunks of characters. The second, parsing, groups the scanned chunks following potentially recursive rules. However, a nice lexing tool like Lex can be useful on its own, even when not paired with a parser.

The simplest way to describe Lex is that it runs user-supplied C code blocks for regular expression matches. It reads a list of regexes and constructs a giant state machine which attempts to match them all “simultaneously.”

A lex input file is composed of three possible sections: definitions, rules, and helper functions. The sections are delimited by %%. Lex transforms its input file into a plain C file that can be built using an ordinary C compiler.

Here’s an example. We’ll match the strings cot, cat, and cats. Our actions will print a replacement for each.

/* catcot.l */

%{
#include <stdio.h>
%}

%%

cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }

To build it:

# turn the input into an intermediate C file
lex -t catcot.l > catcot.c

# compile it
cc -o catcot catcot.c -ll

(Alternately, build it in one step with make catcot. Even in the absence of a Makefile, POSIX make has suffix rules that handle .l files.)

The program outputs simple substitutions:

echo "the cat on the cot joined the cats" | ./catcot
the thankless pet on the portable bed joined the anti-herd

The reason it prints non-matching words (such as “the”) is that there’s an implicit rule matching any character (.) and echoing it. In most real parsers we’ll want to override that.
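
For instance, a catch-all rule of our own at the end would swallow unmatched characters instead of echoing them (we’ll use this trick in later examples):

.|\n  ;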

Here’s what’s happening inside the scanner. Lex reads the regexes and generates a state machine to consume input. Below is a visualization of the states, with transitions labeled by input character. The circles with a double outline indicate states that trigger actions.

cat state machine

Note there’s no notion of word boundaries in our lexer; it operates on characters alone. For instance:

echo "catch!" | ./catcot
thankless petch!

That sounds rather like an insult.

An important subtlety is how Lex handles multiple eligible matches. It picks the longest possible match available, and in the case of a tie, picks the matching pattern defined earliest.

To illustrate, suppose we add a looser regex, c.t, first.

%%
c.t { printf("mumble mumble"); } 
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }

Lex detects that the rule masks cat and cot, and outputs a warning:

catcot.l:10: warning, rule cannot be matched
catcot.l:11: warning, rule cannot be matched

It still compiles though, and behaves like this:

echo "the cat on the cot joined the cats" | ./catcot
the mumble mumble on the mumble mumble joined the anti-herd

Notice that it still matched cats, because cats is longer than c.t.

Compare what happens if we move the loose regex to the end of our rules. It can then pick up whatever strings get past the others.

%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }
c.t { printf("mumble mumble"); } 

It acts like this:

echo "cut the cot" | ./catcot
mumble mumble the portable bed

Now’s a good time to take a detour and observe how our user-defined code acts in the generated C file. Lex creates a function called yylex(), and inserts the code blocks verbatim into a switch statement. When using lex with a parser, the parser will call yylex() to retrieve tokens, named by integers. For now, our user-defined code isn’t returning tokens to a parser, but doing simple print statements.

/* catcot.c (generated by lex) */

int yylex (void)
{
	/* ... */
	switch ( yy_act )
	{
		/* ... */

		case 1:
		YY_RULE_SETUP
		#line 9 "catcot.l"
		{ printf("portable bed"); }
			YY_BREAK
		case 2:
		YY_RULE_SETUP
		#line 10 "catcot.l"
		{ printf("thankless pet"); }
			YY_BREAK
		case 3:
		YY_RULE_SETUP
		#line 11 "catcot.l"
		{ printf("anti-herd"); }
			YY_BREAK

		/* ... */
	}
	/* ... */
}

As mentioned, a lex file consists of three sections:

DEFINITIONS

%%

RULES

%%

HELPER FUNCTIONS

The definitions section is where you can embed C code to include headers and declare functions used in rules. The definitions section can also define friendly names for regexes that can be reused in the rules.

The rules section, as we saw, contains a list of regexes and associated user code.

The final section is where to put the full definitions of helper functions. This is also where you’d put the main() function. If you omit main(), the Lex library provides one that simply calls yylex(). This default main() implementation (and implementations for a few other functions) is available by linking your lex-generated C code with the -ll compiler flag.
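
That default main() is roughly equivalent to:

int main(void)
{
	while (yylex() != 0)
		;
	return 0;
}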

Let’s see a short, fun example: converting Roman numerals to decimal. Thanks to lex’s behavior of matching longer strings first, it can read the single-letter numerals, but look ahead for longer subtractive forms like “IV” or “XC.”

/* roman-lex.l */

/* the %{ ... %} enclose C blocks that are copied
   into the generated code */

%{
#include <stdio.h>
#include <stdlib.h>

/* globals are visible to user actions and main() */

int total;
%}

%%

 /*<- notice the whitespace before this comment, which
      is necessary for comments in the rules section */

 /* the basics */

I  { total +=    1; }
V  { total +=    5; }
X  { total +=   10; }
L  { total +=   50; }
C  { total +=  100; }
D  { total +=  500; }
M  { total += 1000; }

 /* special cases match with preference
    because they are longer strings */

IV { total +=    4; }
IX { total +=    9; }
XL { total +=   40; }
XC { total +=   90; }
CD { total +=  400; }
CM { total +=  900; }

 /* ignore final newline */

\n ;

 /* but die on anything else */

.  {
	fprintf(stderr, "unexpected: %s\n", yytext);
	exit(EXIT_FAILURE);
}

%%

/* provide our own main() rather than the implementation
   from lex's library linked with -ll */

int main(void)
{
	/* only have to call yylex() once, since our
	   actions don't return */
	yylex();

	fprintf(yyout, "%d\n", total);
	return EXIT_SUCCESS;
}
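
Build and run it the same way as the earlier example; POSIX make’s suffix rules do the work:

$ make roman-lex

$ echo MCMXCIX | ./roman-lex
1999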

More realistic scanner

Now that we’ve seen Lex’s basic operation in the previous section, let’s consider a useful example: syntax highlighting. Detecting keywords in syntax is a problem that lex can handle by itself, without help from yacc.

Because lex and yacc are so old (predating C), and used in so many projects, you can find grammars already written for most languages. For instance, we’ll take quut’s C specification for lex, and modify it to do syntax highlighting.

This relatively short program accurately handles the full complexity of the language. It’s easiest to understand by reading in full. See the inline comments for new and subtle details.

/* c.l syntax highlighter */

%{
/* POSIX for isatty, fileno */
#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* declarations are visible to user actions */

enum FG
{
	fgRED      = 31,   fgGREEN    = 32,
	fgORANGE   = 33,   fgCYAN     = 36,   
	fgDARKGREY = 90,   fgYELLOW   = 93
};

void set_color(enum FG);
void reset_color(void);
void color_print(enum FG, const char *);

void consume_comment(void);
%}

/* named regexes we can use in rules */

O   [0-7]
D   [0-9]
NZ  [1-9]
L   [a-zA-Z_]
A   [a-zA-Z_0-9]
H   [a-fA-F0-9]
HP  (0[xX])
E   ([Ee][+-]?{D}+)
P   ([Pp][+-]?{D}+)
FS  (f|F|l|L)
IS  (((u|U)(l|L|ll|LL)?)|((l|L|ll|LL)(u|U)?))
CP  (u|U|L)
SP  (u8|u|U|L)
ES  (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))
WS  [ \t\v\n\f]

%%

 /* attempting to match and capture an entire multi-line
    comment could strain lex's buffers, so we match the
    beginning, and call consume_comment() to deal with
    the ensuing characters, in our own less resource-
    intensive way */

"/*"      {
	set_color(fgDARKGREY);

	/* For greater flexibility, we'll output to lex's stream, yyout.
	   It defaults to stdout. */
	fputs(yytext, yyout);

	consume_comment();
	reset_color();
}

 /* single-line comments can be handled the default way.
    The yytext variable is provided by lex and points
    to the characters that match the regex */

"//".*    {
	color_print(fgDARKGREY, yytext);
}

 /* highlight preprocessor directives */

^[ \t]*#.*      {
	color_print(fgRED, yytext);
}

 /* you can use the same code block for multiple regexes */

auto     |
bool     |
char     |
const    |
double   |
enum     |
extern   |
float    |
inline   |
int      |
long     |
register |
restrict |
short    |
size_t   |
signed   |
static   |
struct   |
typedef  |
union    |
unsigned |
void     |
volatile |
_Bool    |
_Complex {
	color_print(fgGREEN, yytext);
}

break    |
case     |
continue |
default  |
do       |
else     |
for      |
goto     |
if       |
return   |
sizeof   |
switch   |
while    {
	color_print(fgYELLOW, yytext);
}

 /* we use the named regexes heavily below; putting
    them in curly brackets expands them */

{L}{A}*  {

	/* without this rule, keywords within larger words
	   would be highlighted, like the "if" in "life" --
	   this rule prevents that because it's a longer match */

	fputs(yytext, yyout);
}

{HP}{H}+{IS}?               |
{NZ}{D}*{IS}?               |
"0"{O}*{IS}?                |
{CP}?"'"([^'\\\n]|{ES})+"'" |
{D}+{E}{FS}?                |
{D}*"."{D}+{E}?{FS}?        |
{D}+"."{E}?{FS}?            |
{HP}{H}+{P}{FS}?            |
{HP}{H}*"."{H}+{P}{FS}?     |
{HP}{H}+"."{P}{FS}?         {
	color_print(fgCYAN, yytext);
}

({SP}?\"([^"\\\n]|{ES})*\"{WS}*)+ {
	color_print(fgORANGE, yytext);
}

 /* explicitly mention the default rule */

. ECHO;

%%

/* definitions of the functions we declared earlier */

/* the color functions use ANSI escape codes, and may
   not be portable across all terminal emulators. */

void set_color(enum FG c)
{
	fprintf(yyout, "\033[%d;1m", c);
}

void reset_color(void)
{
	fputs("\033[0m", yyout);
}

void color_print(enum FG c, const char *s)
{
	set_color(c);
	fputs(s, yyout);
	reset_color();
}

/* this function directly consumes characters in lex
   using the input() function. It pulls characters
   from the same stream that the regex state machine
   reads. */
void consume_comment(void)
{
	int c;

	/* EOF in lex is 0, which is different from
	   the EOF macro in the C standard library */
	while ((c = input()) != 0)
	{
		putchar(c);
		if (c == '*')
		{
			while ((c = input()) == '*')
				putchar(c);
			if (c == 0) break;
			putchar(c);
			if (c == '/') return;
		}
	}
}

int main(void)
{
	if (!isatty(fileno(stdout)))
	{
		/* a more flexible option would be to make the
		   color changing functions do nothing, but that's
		   too much fuss for an example program */

		fputs("Stdout is not a terminal\n", stderr);
		return EXIT_FAILURE;
	}
	/* since we'll be changing terminal color, be sure to
	   reset it for any program termination event */
	atexit(reset_color);

	/* let our lex rules do the rest */
	yylex();
	return EXIT_SUCCESS;
}

Using a scanner as a library

One of the biggest areas of improvement between classic lex/yacc and flex/bison is the ability of the latter to generate code that’s easier to embed into a larger application. Lex and yacc are designed to create standalone programs, with user-defined code blocks stuck inside. When classic lex and yacc work together, they use a bunch of global variables.

Flex and Bison, on the other hand, can generate thread-safe functions with uniquely prefixed names that can be safely linked into larger programs. To demonstrate, we’ll do another scanner (with Flex this time).

The following Rube Goldberg contraption uses Flex to split words on whitespace and call a user-supplied callback for each word. There’s certainly an easier non-Flex way to do this task, but this example illustrates how to encapsulate Flex code into a reusable library.

/* words.l */

/* don't generate functions we don't need */
%option nounput noinput noyywrap

/* generate a scanner that's thread safe */
%option reentrant

/* Generate "words" rather than "yy" as a prefix, e.g.
   wordslex() rather than yylex(). This allows multiple
   Flex scanners to be linked with the same application */
%option prefix="words"

%%

[^ \t\n]+ {
	/* the return statement causes yylex to stop and return */
	return 1; /* our code for a word token */
}

  /* do nothing for any other characters, don't
     output them as would be the default behavior */
.|\n	  ;

%%

/* Callers interact with this function, which neatly hides
   the Flex inside.

   Also, we'll call "yy" functions like "yylex()" inside,
   and Flex will rename them in the resulting C file to
   calls with the "words" prefix, like "wordslex()"

   Zero return means success, nonzero is a Flex error
   code. */

int words_callback(char *s, void (*f)(const char *))
{
	/* in the reentrant mode, we maintain our
	   own scanner and its associated state */
	int i;
	yyscan_t scanner;
	YY_BUFFER_STATE buf;

	if ((i = yylex_init(&scanner)) != 0)
		return i;

	/* read from a string rather than a stream */
	buf = yy_scan_string(s, scanner);

	/* Each time yylex finds a word, it returns nonzero.
	   It resumes where it left off when we call it again */
	while ((i = yylex(scanner)) > 0)
	{
		/* call the user supplied function f with
		   yytext of the match */
		f(yyget_text(scanner));
	}

	/* clean up */
	yy_delete_buffer(buf, scanner);
	yylex_destroy(scanner);
	return 0;
}

Build it like this:

# generate scanner, build object file
flex -t words.l > words.c
cc -c words.c

# verify that all public text symbols are prefixed by "words"
nm -g words.o | grep " T "

A calling program can use our library without seeing any Flex internals.

/* test_words.c */

#include <stdio.h>

/* words_callback defined in the object file -- you could put
   this declaration in a header file words.h */
int words_callback(char *, void (*)(const char *));

void print_word(const char *w)
{
	puts(w);

	/* if you want to use the parameter w in the future, you
	   need to duplicate it in memory whose lifetime you control */
}

int main(void)
{
	words_callback(
		"The quick brown fox\n"
		"jumped over the lazy dog\n",
		&print_word
	);
	return 0;
}

To build the program, just link it with words.o.

cc -o test_words test_words.c words.o
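
Running the test program prints one word per line:

$ ./test_words
The
quick
brown
fox
jumped
over
the
lazy
dog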

Parsing

Now that we’ve seen how to identify tokens with a scanner, let’s learn how a parser can act on the tokens using recursive rules. Yacc/byacc/bison are LALR (look-ahead LR) parsers, and Bison supports more powerful modes if desired.

Mental model of LR parsing

LR parsers build bottom-up toward a goal, shifting tokens onto a stack and combining (“reducing”) them according to rules. It’s helpful to get a mental model for this process, so let’s jump into a simple example and simulate what yacc does.

Here’s a yacc grammar with a single rule to build a result called foo. We specify that foo is comprised of lex tokens A, B, and C.

%token A B C

%%

foo: A B C

Yacc transforms the grammar into a state machine which looks like this:

foo: A B C

The first rule in the file (and the only rule in our case) becomes yacc’s goal. Yacc begins in state 0, with the implicit rule 0: $accept: • foo $end. The parse will be accepted if we can produce a foo followed immediately by the end of input. The bullet point indicates our progress reading the input. In state 0 it’s at the beginning, meaning we haven’t read anything yet.

Initially there’s no lookahead token, so yacc calls yylex() to get one. If lex produces an A, we follow the state transition to state 1. Because the arrow is a solid line, not dashed, yacc “shifts” the token to its token stack. It also pushes state 1 onto a state stack, which now holds states 0 and 1.

State 1 is trying to satisfy the rule which it calls rule 1, namely 1 foo: A • B C. The bullet point after the A indicates we’ve seen the A already. Don’t confuse the state numbers and rule numbers – yacc numbers them independently.

Yacc continues processing input, shifting tokens and moving to states 3 and 5 if lex produces the expected tokens. If, at any point, lex produces a token not matching any transitions in the current state, then yacc reports a syntax error and terminates. (There’s a way to do error recovery, but that’s another topic.)

State 5 has seen all necessary tokens for rule 1: 1 foo: A B C •. Yacc continues to the diamond marked “R1,” which is a reduction action. Yacc “reduces” rule 1, popping the A, B, C terminal tokens off the stack and pushing a single non-terminal foo token. When it pops the three tokens, it pops the same number of states (states 5, 3, and 1). Popping three states lands us back in state 0.

State 0 has a dashed line going to state 2 that matches the foo token that was just reduced. The dashed line means “goto” rather than “shift,” because rule 0 doesn’t have to shift anything onto the token stack. The previous reduction already took care of that.

Finally, state 2 asks lex for another token, and if lex reports EOF, that matches $end and sends us to state 4, which ties a ribbon on it with the Acc(ept) action.

From what we’ve seen so far, each state may seem to be merely tracking progress through a single rule. However, states actually track all legal ways forward from tokens previously consumed. A single state can track multiple candidate rules. For instance:

%token A B C

%%

 /* foo is either x or y */

foo: x | y;

 /* x and y both start with an A */

x: A B;

y: A C;

For this grammar, yacc produces the following state machine:

foo : x | y

In state 1 we’ve seen token A, and so rules 3 and 4 are both in the running to reduce an x or y. On a B or C token, the possibilities narrow to a single rule (in state 5 or 6).

Also notice that our rule foo : x | y doesn’t occur verbatim in any states. Yacc separates it into 1 foo: x and 2 foo: y. Thus, the numbered rules don’t always match the rules in the grammar one-to-one.

Yacc can also peek ahead by one token to choose which rule to reduce, without shifting the “lookahead” token. In the following grammar, rules x and y match the same tokens. However, the foo rule can say to choose x when followed by a B, or y when followed by a C:

%token A B C

%%

foo : x B | y C;

x : A;

y : A;

Note multiple reductions coming out of state 1 in the generated state machine:

lookahead for the first state

The presence of a bracketed token ([C]) exiting state 1 indicates that the state uses lookahead. If the state sees token C, it reduces rule 4. Otherwise it reduces rule 3. Lookahead tokens remain to be read when following a dashed-line (goto) action, such as from state 0 to state 4.

Ambiguous grammars

While yacc is a powerful tool to transform a grammar into a state machine, it may not operate the way you intend on ambiguous grammars. These are grammars with a state that could proceed in more than one way with the same input.

As grammars get complicated, it’s quite possible to create ambiguities. Let’s look at small examples that make it easier to see the mechanics of the conflict. That way, when it happens in a real grammar, we’ll have a better feeling for it.

In the following example, the input A B matches both x and y B. There’s no reason for yacc to choose one construction over the other when reducing to foo. So why does this matter, you ask? Don’t we get to foo either way? Yes, but real parsers will have different user code assigned to run per rule, and it matters which code block gets executed.

%token A B

%%

foo : x | y B ;

x : A B ;

y : A ;

The state machine shows ambiguity at state 1:

shift/reduce conflict

At state 1, when the next token is B, the state could shift the token and enter state 5 (attempting to reduce x). It could also reduce y and leave B as lookahead. This is called a shift/reduce conflict. Yacc’s policy in such a conflict is to favor a shift over a reduce.
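
Yacc also flags the problem at generation time; bison, for example, prints a warning like:

$ bison foo.y
foo.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]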

Alternately, we can construct a grammar with a state that has more than one eligible reduction for the same input. The purest toy example would be foo : A | A, generating:

reduce/reduce conflict

In a reduce/reduce conflict, yacc chooses to reduce the conflicting rule presented earlier in the grammar.

Constructing semantic values

While matching tokens, parsers typically build a user-defined value in memory to represent features of the input. Once the parse reaches the goal state and succeeds, then the user code will act on the memory value (or pass it along to a calling program).

Yacc stores the semantic values from parsed tokens in variables ($1, $2, …) accessible to code blocks, and it provides a variable ($$) for assigning the semantic result of the current code block.

Let’s see it in action. We won’t do a hackneyed calculator, but let’s still make a parser that operates on integers. Integer values allow us to avoid thinking about memory management.

We’ll revisit the roman numeral example, and this time let lex match the digits while yacc combines them into a final result. It’s actually more cumbersome than our earlier way, but illustrates how to work with semantic parse values.

There are some comments in the example below about portability between yacc variants. The three most prominent variants, in order of increasing features, are: the POSIX interface, roughly matching the original AT&T yacc functionality; byacc (Berkeley Yacc); and GNU Bison.

/* roman.y  (plain yacc) */

%{
#include <stdio.h>

/* declarations to fix warnings from sloppy
   yacc/byacc/bison code generation. For instance,
   the code should have a declaration of yylex. */

int yylex(void);

/* The POSIX specification says yyerror should return
   int, although bison documentation says the value is
   ignored. We match POSIX just in case. */

int yyerror(const char *s);
%}

/* tokens our lexer will produce */
%token NUM

%%

/* The first rule is the final goal. Yacc will work
   backward trying to arrive here. This "results" rule
   is a stub we use to print the value from "number." */

results :
  number { fprintf(yyout, "%d\n", $1); }
;

/* as the lexer produces more NUMs, keep adding them */

number :

  /* this is a common pattern for saying number is one or
     more NUMs.  Notice we specify "number NUM" and not
     "NUM number". In yacc recursion, think "right is wrong
     and left is right." */

  number NUM { $$ = $1 + $2; }

  /* base case, using default rule of $$ = $1 */

| NUM
;

The corresponding lexer matches individual numerals, and returns them with their semantic values.

/* roman.l */

%{
/* The .tab.h file is generated by yacc, and we'll explain
   it later */

#include "roman.tab.h"

/* lex communicates semantic token values to yacc through
   a shared global variable */

extern int yylval;
%}

/* when using flex (rather than vanilla lex) fix
   unused function warnings by adding:

%option noinput nounput
*/

%%

 /* The constant for NUM comes from roman.tab.h,
    and was generated because we declared
    "%token NUM" in roman.y */

I  { yylval =    1; return NUM; }
V  { yylval =    5; return NUM; }
X  { yylval =   10; return NUM; }
L  { yylval =   50; return NUM; }
C  { yylval =  100; return NUM; }
D  { yylval =  500; return NUM; }
M  { yylval = 1000; return NUM; }

IV { yylval =    4; return NUM; }
IX { yylval =    9; return NUM; }
XL { yylval =   40; return NUM; }
XC { yylval =   90; return NUM; }
CD { yylval =  400; return NUM; }
CM { yylval =  900; return NUM; }

 /* ignore final newline */
\n ;

 /* As a default action, return the ascii value of
    the character as if it were a token identifier.
    The values from roman.tab.h are offset above 256 to
    be above any ascii value, so there's no ambiguity

    Our parser won't be expecting these values, so
    they will lead to a syntax error */
.  { return *yytext; }

To review: lex generates a yylex() function, and yacc generates yyparse() that calls yylex() repeatedly to get new token identifiers. Lex actions copy semantic values to yylval which Yacc copies into $-variables accessible in parser rule actions.

Building an executable roman from the input files roman.y and roman.l requires explanation. With appropriate command line flags, yacc will create the files roman.tab.c and roman.tab.h from roman.y. Lex will create roman.lex.c from roman.l, using token identifiers in roman.tab.h.

In short, here are the build dependencies for each file:

build dependency graph

And here’s how to express it all in a Makefile.

# put together object files from lexer and parser, and
# link the yacc and lex libraries (in that order, to pick
# main() from yacc's library rather than lex's)

roman : roman.tab.o roman.lex.o
	$(CC) -o $@ roman.tab.o roman.lex.o -ly -ll

# tell make which files yacc will generate
#
# an explanation of the arguments:
# -b roman  -  name the files roman.tab.*
# -d        -  generate a .tab.h file too

roman.tab.h roman.tab.c : roman.y
	$(YACC) -d -b roman $?

# the object file relies on the generated lexer, and
# on the token constants 

roman.lex.o : roman.tab.h roman.lex.c

# can't use the default suffix rule because we're
# changing the name of the output to .lex.c

roman.lex.c : roman.l
	$(LEX) -t $? > $@

And now, the moment of truth:

$ make

$ echo MMMCMXCIX | ./roman
3999

Using a parser as a library

In this example we’ll parse LISP S-expressions, limited to string and integer atoms. There’s more going on in this one, such as memory management, different semantic types per token, and packaging the lexer and parser together into a single thread-safe library. This example requires Bison.

/* lisp.y  (requires Bison) */

/* a "pure" api means communication variables like yylval
   won't be global variables, and yylex is assumed to
   have a different signature */

%define api.pure true

/* change prefix of symbols from yy to "lisp" to avoid
   clashes with any other parsers we may want to link */

%define api.prefix {lisp}

/* generate much more meaningful errors rather than the
   uninformative string "syntax error" */

%define parse.error verbose

/* Bison offers different %code insertion locations in
   addition to yacc's %{ %} construct.

   The "top" location is good for headers and feature
   flags like the _XOPEN_SOURCE we use here */

%code top {
	/* XOPEN for strdup */
	#define _XOPEN_SOURCE 600
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* Bison versions 3.7.5 and above provide the YYNOMEM
	   macro to allow our actions to signal the unlikely
	   event that they couldn't allocate memory. Thanks
	   to the Bison team for adding this feature at my
	   request. :) YYNOMEM causes yyparse() to return 2.

	   The following conditional define allows us to use
	   the functionality in earlier versions too. */

	#ifndef YYNOMEM
	#define YYNOMEM goto yyexhaustedlab
	#endif
}

/* The "requires" code location is designed for defining
   data types that we can use as yylval's for tokens. Code
   in this section is also added to the .tab.h file for
   inclusion by calling code */

%code requires {
	enum sexpr_type {
		SEXPR_ID, SEXPR_NUM, SEXPR_PAIR, SEXPR_NIL
	};

	struct sexpr
	{
		enum sexpr_type type;
		union
		{
			int   num;
			char *id;
		} value;
		struct sexpr *left, *right;
	};
}

/* These are the semantic types available for tokens,
   which we name num, str, and node.

   The %union construction is classic yacc as well. It
   generates a C union and sets its as the YYSTYPE, which
   will be the type of yylval */

%union
{
	int num;
	char *str;
	struct sexpr *node;
}

/* Add another argument in yyparse() so that we
   can communicate the parsed result to the caller.
   We can't return the result directly, since the
   return value is already reserved as an int, with
   0=success, 1=error, 2=nomem

   NOTE
   In our case, the param is a data pointer. However,
   if it were a function pointer (such as a callback),
   then its type would have to be put behind a typedef,
   or else parse-param will mangle the declaration. */

%parse-param {struct sexpr **result}

/* param adds an extra param to yyparse (like parse-param)
   but also causes yyparse to send the value to yylex.
   In our case the caller will initialize their own scanner
   instance and pass it through */

%param {void *scanner}

/* the "provides" location adds the code to our generated
   parser, but also to the .tab.h file for use by callers */

%code provides {
	void sexpr_free(struct sexpr *s);
}

/* unqualified %code is for internal use, things that
   our actions can see. These declarations prevent
   warnings.  Notice the final param in each that came
   from the %param directive above */

%code {
	int lisperror(void *foo, char const *msg, const void *s);
	int lisplex(void *lval, const void *s);
}

/* Now when we declare tokens, we add their type
   in brackets. The type names come from our %union */

%token <str> ID
%token <num> NUM

/* whereas tokens come from the lexer, these
   non-terminals are defined in the parser, and
   we set their types with %type */

%type <node> start sexpr pair list members atom

/* if there's an error partway through parsing, the
   caller wouldn't get a chance to free memory for
   the work in progress. Bison will clean up the memory
   if we provide destructors, though. */

%destructor { free($$); } <str>
%destructor { sexpr_free($$); } <node>

%%

 /* once again we use a dummy non-terminal to perform
    a side-effect, in this case setting *result */

start :
  sexpr   { *result = $$ = $1; return 0; }
;

sexpr :
  atom
| list
| pair
;

list :

  /* This is a shortcut: we use the ascii value for
     parens '('=40, ')'=41 as their token codes.
     Thus we don't have to define a bunch of crap
     manually like LPAREN, RPAREN */

  '(' members ')' { $$ = $2; }

| '('')' {
	struct sexpr *nil = malloc(sizeof *nil);
	if (!nil) YYNOMEM;
	*nil = (struct sexpr){.type = SEXPR_NIL};
	$$ = nil;
  }
;

members :
  sexpr {
	struct sexpr *s = malloc(sizeof *s),
				 *nil = malloc(sizeof *nil);
	if (!s || !nil) {
		free(s), free(nil);
		YYNOMEM;
	}
	*nil = (struct sexpr){.type = SEXPR_NIL};

	/* convention: we assume that a previous parser
	   value like $1 is non-NULL, else it would have
	   died already with YYNOMEM. We're responsible
	   for checking only our own allocations */

	*s = (struct sexpr){
		.type = SEXPR_PAIR,
		.left = $1,
		.right = nil
	};
	$$ = s;
  }
| sexpr members {
	struct sexpr *s = malloc(sizeof *s);

	/* Another important memory convention: we
	   can't trust that our lexer successfully
	   allocated its yylvalue, because the signature
	   of yylex doesn't communicate failure. We
	   assume NULL in $1 means alloc failure and
	   we report that. The only other way to signal
	   from yylex would be to make a fake token to
	   represent out-of-memory, but that's harder */

	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_PAIR,
		.left = $1,
		.right = $2
	};
	$$ = s;
  }
;

pair :
  '(' sexpr '.' sexpr ')' {
	struct sexpr *s = malloc(sizeof *s);
	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_PAIR,
		.left = $2,
		.right = $4
	};
	$$ = s;
  }
;

atom :
  ID {
	if (!$1) YYNOMEM;

	struct sexpr *s = malloc(sizeof *s);
	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_ID,
		.value.id = strdup($1)
	};
	if (!s->value.id)
	{
		free(s);
		YYNOMEM;
	}
	$$ = s;
  }
| NUM {
	struct sexpr *s = malloc(sizeof *s);
	if (!s) YYNOMEM;
	*s = (struct sexpr){
		.type = SEXPR_NUM,
		.value.num = $1
	};
	$$ = s;
  }
;

%%

/* notice the extra parameters required
   by %param and %parse-param */

int lisperror(void *yylval, char const *msg, const void *s)
{
	(void)yylval;
	(void)s;
	return fprintf(stderr, "%s\n", msg);
}

/* useful internally by us, and externally by callers */

void sexpr_free(struct sexpr *s)
{
	if (!s)
		return;
	
	if (s->type == SEXPR_ID)
		free(s->value.id);
	else if (s->type == SEXPR_PAIR)
	{
		sexpr_free(s->left);
		sexpr_free(s->right);
	}
	free(s);
}

The parser does the bulk of the work. We just need to pair it with a scanner that reads atoms and parens.

/* lisp.l */

/* disable unused functions so we don't
   get compiler warnings about them */

%option noyywrap nounput noinput
%option noyyalloc noyyrealloc noyyfree

/* change our prefix from yy to lisp */

%option prefix="lisp"

/* use the pure parser calling convention */

%option reentrant bison-bridge

%{
#include "lisp.tab.h"

#define YY_EXIT_FAILURE ((void)yyscanner, EXIT_FAILURE)

/* XOPEN for strdup */
#define _XOPEN_SOURCE 600
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* seems like a bug that I have to do this, since flex
   should know prefix=lisp and match bison's LISPSTYPE */
#define YYSTYPE LISPSTYPE

int lisperror(const char *msg);
%}

%%

[[:alpha:]][[:alnum:]]* {
	/* The memory that yytext points to gets overwritten
	   each time a pattern matches. We need to give the caller
	   a copy. Also, if strdup fails and returns NULL, it's up
	   to the caller (the parser) to detect that.

	   Notice yylval is a pointer to union now.  It's passed
	   as an arg to yylex in pure parsing mode */

	yylval->str = strdup(yytext);
	return ID;
}

[-+]?[[:digit:]]+ {
	long n = strtol(yytext, NULL, 10);

	if (n < INT_MIN || n > INT_MAX)
		lisperror("Number out of range");
	yylval->num = (int)n;
	return NUM;
}

[[:space:]]  ; /* ignore */

/* this is a handy rule to return the ASCII value
   of any other character. Importantly, parens */

. { return *yytext; }

Finally, here’s how to call the parser from a regular program.

/* driver_lisp.c */

#include <stdio.h>
#include <stdlib.h>

#define YYSTYPE LISPSTYPE
#include "lisp.tab.h"
#include "lisp.lex.h"

void sexpr_print(struct sexpr* s, unsigned depth)
{
	for (unsigned i = 0; i < depth; i++)
		printf("  ");
	switch (s->type)
	{
		case SEXPR_ID:
			puts(s->value.id);
			break;
		case SEXPR_NUM:
			printf("%d\n", s->value.num);
			break;
		case SEXPR_PAIR:
			puts(".");
			sexpr_print(s->left, depth+1);
			sexpr_print(s->right, depth+1);
			break;
		case SEXPR_NIL:
			puts("()");
			break;
		default:
			abort();
	}
}

int main(void)
{
	int i;
	struct sexpr *expr;
	yyscan_t scanner;

	if ((i = lisplex_init(&scanner)) != 0)
		exit(i);

	int e = lispparse(&expr, scanner);
	printf("Code = %d\n", e);
	if (e == 0 /* success */)
	{
		sexpr_print(expr, 0);
		sexpr_free(expr);
	}

	lisplex_destroy(scanner);
	return 0;
}

To build it, reuse the Makefile pattern from the earlier roman example to create analogous lisp.lex.o and lisp.tab.o targets. This example requires Flex and Bison specifically, so set LEX=flex and YACC=bison at the top of the Makefile to override the system defaults for those programs. Finally, compile driver_lisp.c and link it with the object files.
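
Here's a minimal sketch of such a Makefile. The target names and the flex --header-file flag are assumptions based on the file names used above, not copied from the roman example:

LEX  = flex
YACC = bison

driver_lisp: driver_lisp.o lisp.tab.o lisp.lex.o
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ driver_lisp.o lisp.tab.o lisp.lex.o

driver_lisp.o: lisp.tab.h lisp.lex.h
lisp.lex.o: lisp.tab.h

lisp.tab.c lisp.tab.h: lisp.y
	$(YACC) -d -b lisp lisp.y

lisp.lex.c lisp.lex.h: lisp.l
	$(LEX) --header-file=lisp.lex.h -o lisp.lex.c lisp.l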

Here’s the program in action:

$ echo "(1 () (2 . 3) (4))" | ./driver_lisp
Code = 0
.
  1
  .
    ()
    .
      .
        2
        3
      .
        .
          4
          ()
        ()

Designing against an RFC

Internet Request For Comment (RFC) documents describe the syntax of many protocols and data formats. They often include complete Augmented Backus-Naur Form (ABNF) grammars, which we can convert into robust yacc parsers.

Let’s examine RFC 4180, which describes the comma-separated value (CSV) format. It’s pretty simple, but has problematic edge cases: commas in quoted values, quoted quotes, raw newlines in quoted values, and blank-as-a-value.

Here’s the full grammar from the RFC. Notice how alternatives are specified with “/” rather than “|”, and how ABNF has the constructions *(zero-or-more-things) and [optional-thing]:

file = [header CRLF] record *(CRLF record) [CRLF]

header = name *(COMMA name)

record = field *(COMMA field)

name = field

field = (escaped / non-escaped)

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

non-escaped = *TEXTDATA

COMMA = %x2C

CR = %x0D

DQUOTE =  %x22

LF = %x0A

CRLF = CR LF

TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

The grammar makes no distinction between lexing and parsing, although the uppercase identifiers hint at lexer tokens. While it may be tempting to translate to yacc top-down, starting at the file level, I’ve found the most productive way is to start with lexing.

We can combine most of the grammar into two lex rules to match fields:

%%

\"([^"]|\"\")*\" {
	/* this is what the ABNF calls "escaped" */

	/* TODO: copy un-escaped internals to yylval */

	return FIELD;
}

[^",\r\n]+ {
	/* This is *almost* what the ABNF calls "un-escaped,"
	   except it won't match an empty field, like
	   a,,b
	    ^---- this

	   Actually, even if we tried matching an empty string,
	   the comma or crlf would prove a longer match and
	   trump this one.
	*/

	/* TODO: capture the value to yylval */

	/* no need to bother yacc with two token types, we
	   call them both FIELD. */
	return FIELD;
}

 /* handle both UNIX and DOS style, per the spec */
\n|\r\n    { return CRLF; }

 /* catch the comma, and any other unexpected thing */
.          { return *yytext; }

With FIELD out of the way, here’s what’s left to translate:

file = [header CRLF] record *(CRLF record) [CRLF]

header = name *(COMMA name)

record = field *(COMMA field)

name = field

Let’s also drop the designation of the first row as the “header.” The application can choose to treat the first ordinary row as a header if desired. This simplifies the grammar to:

file = record *(CRLF record) [CRLF]

record = field *(COMMA field)

At this point it’s easy to convert to yacc.

%token CRLF FIELD

%%

file :
  record
| file CRLF record
;

record :
  field.opt
| record ',' field.opt
;

 /* Here is where we handle the potentially blank
    non-escaped FIELD. The ".opt" suffix doesn't mean
    anything to yacc, it's just a reminder for us that
    this *may* match a FIELD, or nothing at all */
field.opt :
  /* empty */
| FIELD
;

Matching blank fields is tricky. There are three fields in a,,b, no way around it. That means we have to identify some value (either a non-terminal symbol, or a terminal token) out of thin air between characters of input. As a corollary, given that we have to honor blank fields as existing, we’re forced to interpret e.g. a 0-byte file as one record with a single blank field.

We handled the situation with an empty yacc rule in field.opt. Empty rules let the parser reduce a non-terminal without consuming any input, deciding from the lookahead token alone. It may also be possible to use fancy tricks in the lexer (like trailing context and start conditions) to match empty non-escaped fields, but I think an empty parser rule is more elegant.

Three notes about empty rules:

  1. We wrote the empty rule in a way that plain yacc can understand. If you want to use a Bison extension, you can write empty rules as %empty, which distinguishes them from accidentally missing rules.
  2. Bison’s --graph visualization doesn’t render empty rules properly. Use the -v option and examine the textual .output file to see the rule.
  3. Adding multiple empty rules can be a common source of reduce/reduce conflicts. I ran into this in early experiments parsing CSV, and the Bison manual section 5.6 provides a great example. A minimal demonstration follows this list.
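
To illustrate note 3, here's a minimal grammar (invented for this article, not the manual's example). On empty input the parser could reduce either a or b, and yacc has no way to decide:

%%

start : a | b ;
a : /* empty */ ;
b : /* empty */ ;

Running yacc or bison on this grammar reports a reduce/reduce conflict.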

Now that we’ve seen the structure of the grammar, let’s fill in the skeleton to process the CSV content. From now on, examples in this article will use my libderp library for basic data structures like maps and vectors.

/* csv.l */

%{
#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <string.h>

/* the union in csv.tab.h requires the vector type, and
   plain yacc doesn't have "%code requires" to provide
   the include like Bison, so we include derp/vector.h */
#include <derp/vector.h>
#include "csv.tab.h"
%}

%%

\"([^"]|\"\")*\" {
	/* yyleng is precomputed strlen(yytext) */
    size_t i, n = yyleng;
    char *s;

    s = yylval.str = calloc(n, 1);
    if (!s)
        return FIELD;

	/* copy yytext, changing "" to " */
    for (i = 1 /*skip 0="*/; i < n-1; i++)
    {
        *s++ = yytext[i];
        if (yytext[i] == '"')
            i++; /* skip second one */
    }
    return FIELD;
}

[^",\r\n]+ { yylval.str = strdup(yytext); return FIELD; }
\n|\r\n    { return CRLF; }
.          { return *yytext; }

The complete parser below combines values from the lexer into full records, using the vector type. It then prints each record and frees it.

/* csv.y  (plain yacc) */

%{
	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>

	/* for the vector datatype and v_ functions */
	#include <derp/vector.h>
	/* for helper function derp_free */
	#include <derp/common.h>

	int yylex(void);
	int yyerror(const char *s);
	bool one_empty_field(vector *);
%}

%union
{
	char *str;
	vector *record;
}

%token CRLF
%token <str> FIELD
%type <str> field.opt
%type <record> record

/* in bison, add this:

%destructor { free($$); } <str>
%destructor { v_free($$); } <record>
*/

%%

file :
  consumed_record
| file CRLF consumed_record
;

/* A record can be constructed in two ways, but we want to
   run the same side effect for either case. We add an
   intermediate non-terminal symbol "consumed_record" just
   to perform the action. In library code, this would be a
   good place to send the record to a callback function. */

consumed_record :
  record {
	/* a record comprised of exactly one blank field is a
	   blank record, which we can skip */
	if (!one_empty_field($1))
	{
		size_t n = v_length($1);
		printf("#fields = %zu\n", n);
		for (size_t i = 0; i < n; i++)
			printf("\t%s\n", (char*)v_at($1, i));
	}
	v_free($1);
  }
;

record :
  field.opt {
	/* In our earlier example, lisp.y, we showed how to check
	   for memory allocation failure. We skip that here for
	   brevity. */

	vector *r = v_new();
	v_dtor(r, derp_free, NULL);
	v_append(r, $1);
	$$ = r;
  }
| record ',' field.opt {
	v_append($1, $3);
	$$ = $1;
  }
;

field.opt :
  /* empty */ { $$ = calloc(1,1); }
| FIELD
;

%%

bool one_empty_field(vector *r)
{
	return v_length(r) == 1 && *((char*)v_first(r)) == '\0';
}

int yyerror(const char *s)
{
	return fprintf(stderr, "%s\n", s);
}

Build it using the steps shown for the earlier examples. You’ll also need to link with libderp version 0.1.0; the project readme shows how.

Next, verify with test cases:

# https://en.wikipedia.org/wiki/Comma-separated_values#Example

$ ./csv << EOF
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
EOF
#fields = 5
        Year
        Make
        Model
        Description
        Price
#fields = 5
        1997
        Ford
        E350
        ac, abs, moon
        3000.00
#fields = 5
        1999
        Chevy
        Venture "Extended Edition"
        
        4900.00
#fields = 5
        1999
        Chevy
        Venture "Extended Edition, Very Large"
        
        5000.00
#fields = 5
        1996
        Jeep
        Grand Cherokee
        MUST SELL!
air, moon roof, loaded
        4799.00
# extra testing for empty fields before crlf and eof

$ printf ",\n," | ./csv
#fields = 2
        
        
#fields = 2
        
        

Parsing a more complicated RFC

IRCv3 extends the Internet Relay Chat (IRC) protocol with useful features. Its core syntactical change to support new features is message tagging. We’ll write a parser to extract information from RFC 1459 messages, including IRCv3 tags.

The BNF from this standard is written in a slightly different dialect than that of the CSV RFC.

<message>       ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params] <crlf>

<tags>          ::= <tag> [';' <tag>]*
<tag>           ::= <key> ['=' <escaped_value>]
<key>           ::= [ <client_prefix> ] [ <vendor> '/' ] <key_name>
<client_prefix> ::= '+'
<key_name>      ::= <non-empty sequence of ascii letters, digits, hyphens ('-')>
<escaped_value> ::= <sequence of zero or more utf8 characters except NUL, CR, LF, semicolon (`;`) and SPACE>
<vendor>        ::= <host>
<host>          ::= see RFC 952 [DNS:4] for details on allowed hostnames

<prefix>        ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
<nick>          ::= <letter> { <letter> | <number> | <special> }
<command>       ::= <letter> { <letter> } | <number> <number> <number>
<SPACE>         ::= ' ' { ' ' }
<params>        ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
<middle>        ::= <Any *non-empty* sequence of octets not including SPACE
                    or NUL or CR or LF, the first of which may not be ':'>
<trailing>      ::= <Any, possibly *empty*, sequence of octets not including
                      NUL or CR or LF>

<user>          ::= <nonwhite> { <nonwhite> }
<letter>        ::= 'a' ... 'z' | 'A' ... 'Z'
<number>        ::= '0' ... '9'
<crlf>          ::= CR LF

As before, it’s helpful to start from the bottom up, applying the power of lex regexes. However, we run into the problem that most of the tokens match almost anything. The same string could conceivably be a host, nick, user, key_name, and command all at once. Lex would match the string with whichever rule comes first in the grammar.

Yacc can’t easily pass lex any clues about which tokens it expects based on what has come before; lex is on its own. For this reason, the designers of lex gave it a way to keep state. Rules can be tagged with a start condition, marking them eligible only in certain states, and rule actions can enter new states before returning.

/* Incomplete irc.l, showing start conditions and patterns.

   This lexer produces the following tokens:
   SPACE COMMAND MIDDLE TRAILING TAG PREFIX ':' '@'
*/

/* It's nice to prefix the regex names with "re_"
   to see them better in the rules */

re_space    [ ]+
re_host     [[:alnum:]][[:alnum:]\.\-]*
re_nick     [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user     [~[:alpha:]][[:alnum:]]*
re_keyname  [[:alnum:]\-]+
re_keyval   [^ ;\r\n]*
re_command  [[:alpha:]]+|[[:digit:]]{3}
re_middle   [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*

/* Declare start conditions. The "%x" means
   they are exclusive, vs "%s" for inclusive. */

%x IN_TAGS IN_PREFIX IN_PARAMS

%%

 /* these patterns are not tagged with a start
    condition, and are active in the default state
    of INITIAL. They will match only when none of
    the exclusive conditions are active. They
    *would* match on inclusive states (but we have
    none).

    The BEGIN command changes state. */

@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }

{re_space} { return SPACE; }
{re_command} {
	/* TODO: construct yylval */
	BEGIN IN_PARAMS;
	return COMMAND;
}


 /* these patterns will only match IN_TAGS, which
    as we saw earlier, gets activated from the
    INITIAL state when "@" is encountered */

<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})?  {
	/* TODO: construct yylval */
	return TAG;
}
<IN_TAGS>{re_space} {
	BEGIN INITIAL;
	return SPACE;
}
<IN_TAGS>; { return ';'; }


<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
	/* TODO: construct yylval */
	BEGIN INITIAL;
	return PREFIX;
}


<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
	/* TODO: construct yylval */
	return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
	/* TODO: construct yylval */
	BEGIN INITIAL;
	return TRAILING;
}


 /* the "*" state applies to all states,
    including INITIAL and the exclusive ones */

<*>\n|\r\n  ; /* ignore */

We’ll revisit the lexer to fill in details for assigning yylval. First, let’s see the parser and its data types.

/* irc.y  (Bison only)

   Using Bison mostly for the %code positions, making 
   it easier to use libderp between flex and bison.

   - WARNING -
   There is absolutely no memory hygiene in this example.
   We don't check for allocation failure, and we don't free
   things when done. See the earlier lisp.y/.l examples
   for guidance about that.
*/

/* output more descriptive messages than "syntax error" */
%define parse.error verbose

%code top {
	#define _XOPEN_SOURCE 600
	#include <stdio.h>
	#include <stdlib.h>
}

%code requires {
	#include <derp/list.h>
	#include <derp/treemap.h>

	struct prefix
	{
		char *host;
		char *nick;
		char *user;
	};

	/* building an irc_message is the overall
	   goal for this parser */
	struct irc_message
	{
		treemap *tags;
		struct prefix *prefix;
		char *command;
		list *params;
	};
}

%code provides {
	int yyerror(char const *msg);
	int yylex(void);
	void message_print(struct irc_message *m);
}

%union
{
	char *str;
	struct prefix *prefix;
	treemap *map;
	struct map_pair *pair;
	list *list;
	struct irc_message *msg;
}

%token          SPACE
%token <str>    COMMAND MIDDLE TRAILING
%token <pair>   TAG
%token <prefix> PREFIX

%type <msg> message tagged_message prefixed_message
%type <map> tags
%type <list> params

%%

 /* Like in the CSV example, we start with a dummy
    rule just to add side-effects */

final :
  tagged_message { message_print($1); }
;

 /* Messages begin with two optional components,
    a set of tags and a prefix.
 
    <message> ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params]
 
    Rather than making a single message rule with
    tons of variations (and duplicated code), I chose
    to build the message in stages.
 
    tagged_message <- prefixed_message <- message
 
    A prefixed_message adds prefix information, or
    passes the message along verbatim if there is none.
    Similarly for tagged_message. */

tagged_message :

  /* When a rule matches more than one symbol,
     it's helpful to add Bison "named references"
     in brackets. Thus, below, the rule can refer to
     $ts rather than $2, or $msg rather than $4.
     That makes it much easier to rearrange symbols
     while you're experimenting. */

  '@' tags[ts] SPACE prefixed_message[msg] {
	$msg->tags = $ts;
	$$ = $msg;
  }

  /* here's the pass-through case when there are
     no tags on the message */

| prefixed_message
;

prefixed_message :
  ':' PREFIX[pfx] SPACE message[msg] {
	$msg->prefix = $pfx;
	$$ = $msg;
  }
| message
;

message :
  COMMAND[cmd] params[ps] {
	struct irc_message *m = malloc(sizeof *m);
	*m = (struct irc_message) {
		.command=$cmd, .params=$ps
	};
	$$ = m;
  }
;

tags :
  TAG {
	treemap *t = tm_new(derp_strcmp, NULL);
	tm_insert(t, $1->k, $1->v);
	$$ = t;
  }
| tags[ts] ';' TAG[t] {
	tm_insert($ts, $t->k, $t->v);
	$$ = $ts;
  }
;

params :
  SPACE TRAILING {
	$$ = l_new();
	l_prepend($$, $2);
  }
| SPACE MIDDLE[mid] params[ps] {
	l_prepend($ps, $mid);
	$$ = $ps;
  }
| %empty {
	$$ = l_new();
  }
;

%%

int yyerror(char const *msg)
{
	return fprintf(stderr, "%s\n", msg);
}

void message_print(struct irc_message *m)
{
	if (m->tags)
	{
		struct tm_iter  *it = tm_iter_begin(m->tags);
		struct map_pair *p;

		puts("Tags:");
		while ((p = tm_iter_next(it)) != NULL)
			printf("\t'%s'='%s'\n", (char*)p->k, (char*)p->v);
		tm_iter_free(it);
	}
	if (m->prefix)
		printf("Prefix: Nick %s, User %s, Host %s\n",
		       m->prefix->nick, m->prefix->user,
			   m->prefix->host);
	if (m->command)
		printf("Command: %s\n", m->command);
	if (!l_is_empty(m->params))
	{
		puts("Params:");
		for (list_item *li = l_first(m->params); li; li = li->next)
			printf("\t%s\n", (char*)li->data);
	}
}

Returning to the lexer, here is the code with all the details filled in to construct yylval for the tokens.

/* irc.l  - complete file */

%option noyywrap nounput noinput

%{
#include "irc.tab.h"

#define _XOPEN_SOURCE 600

#include <limits.h>
#include <stdlib.h>
#include <string.h>
%}

re_space    [ ]+
re_host     [[:alnum:]][[:alnum:]\.\-]*
re_nick     [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user     [~[:alpha:]][[:alnum:]]*
re_keyname  [[:alnum:]\-]+
re_keyval   [^ ;\r\n]*
re_command  [[:alpha:]]+|[[:digit:]]{3}
re_middle   [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*

%x IN_TAGS IN_PREFIX IN_PARAMS

%%

@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }

{re_space} { return SPACE; }
{re_command} {
	yylval.str = strdup(yytext);
	BEGIN IN_PARAMS;
	return COMMAND;
}


<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})?  {
	struct map_pair *p = malloc(sizeof *p);
	char *split = strchr(yytext, '=');
	if (split)
		*split = '\0';
	*p = (struct map_pair){
		.k = strdup(yytext),
		.v = split ? strdup(split+1) : calloc(1,1)
	};
	yylval.pair = p;
	return TAG;
}
<IN_TAGS>{re_space} {
	BEGIN INITIAL;
	return SPACE;
}
<IN_TAGS>; { return ';'; }


<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
	struct prefix *p = malloc(sizeof *p);
	if (!p)
		goto done;
	*p = (struct prefix){0};
	char *bang = strchr(yytext, '!'),
	     *at   = strchr(yytext, '@');
	if (!bang && !at)
	{
		p->host = strdup(yytext);
		goto done;
	}
	if (bang) *bang = '\0';
	if (at) *at = '\0';
	p->nick = strdup(yytext);
	if (bang)
		p->user = strdup(bang+1);
	if (at)
		p->host = strdup(at+1);
done:
	yylval.prefix = p;
	BEGIN INITIAL;
	return PREFIX;
}


<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
	yylval.str = strdup(yytext);
	return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
	yylval.str = strdup(yytext+1); /* trim : */
	BEGIN INITIAL;
	return TRAILING;
}

<*>\n|\r\n  ; /* ignore */

Build irc.y and irc.l according to our typical pattern (and link with libderp). Here’s an example of the IRCv3 parser in action:

# Try an example from
# https://ircv3.net/specs/extensions/message-tags#examples

$ ./irc <<EOF
@aaa=bbb;ccc;example.com/ddd=eee :nick!ident@host.com PRIVMSG me :Hello
EOF
Tags:
        'aaa'='bbb'
        'ccc'=''
        'example.com/ddd'='eee'
Prefix: Nick nick, User ident, Host host.com
Command: PRIVMSG
Params:
        me
        Hello

Further resources

  • POSIX (issue 7) specifications for Lex and Yacc. (To view POSIX docs locally, try begriffs/posix-man.)
  • Lex & Yacc, 2nd ed by John R. Levine, Tony Mason, Doug Brown. Levine subsequently wrote an updated book called flex & bison: Text Processing Tools. However I got the older version to get a better feel for history and portability.
  • To bridge the gap between core knowledge and the latest features, consult the GNU Bison manual and the Flex manual. (You can build the Flex manual from source, or download version 2.6.4 that I’ve pre-built for you as PDF.)
  • Effective Flex & Bison by Chris L. verBurg is a collection of tips for “correctness, efficiency, robustness, complexity, maintainability and usability.” It’s clear Chris has plenty of experience writing real-world parsers.
  • Vim has classic yacc highlighting built in, but you can add support for Bison extensions with justinmk/vim-syntax-extra.
]]>
Dynamic linking best practices https://begriffs.com/posts/2021-07-04-shared-libraries.html 2021-07-04T00:00:00Z 2021-07-04T00:00:00Z In this article we’ll learn how to build shared libraries and install them properly on several platforms. For guidance, we’ll examine the goals and history of dynamic linking on UNIX-based operating systems.

Content for the article comes from researching how to create a shared library, wading through sloppy conventions that people recommend online, and testing on multiple Unix-like systems. Hopefully it can set the record straight and help improve the quality of open source libraries.

The common UNIX pattern

The design typically used nowadays for dynamic linking (in BSD, MacOS, and Linux) came from SunOS in 1988. The paper Shared Libraries in SunOS neatly explains the goals, design, and implementation.

The authors’ main motivations were saving disk and memory space, and upgrading libraries (or the OS) without needing to relink programs. The resource usage motivation is probably less important on today’s powerful personal computers than it was in 1988. However, the flexibility to upgrade libraries is as useful as ever, as well as the ability to easily inspect which library versions each application uses.

Dynamic linking is not without its critics, and isn’t appropriate in all situations. It runs a little slower because of position-independent code (PIC) and late loading. (The SunOS paper called it a “classic space/time trade-off.”) The complexity of the loader on some systems offers increased attack surface. Finally, upgraded libraries may affect some programs differently than others, for instance breaking those that rely on undocumented behavior.

At compile time the link editor resolves symbols in specified libraries, and makes a note in the resulting binary to load those libraries. At runtime, applications call code to map the shared library symbols in memory at the correct memory addresses.

SunOS and subsequent UNIX-like systems added compile-time flags to the linker (ld) to generate – or link against – dynamically linked libraries. The designers also added a special system library (ld.so) with code to find and load other libraries for an application. The pre-main() initialization routine of a program loads ld.so and runs it from within the program to find and load the rest of the required libraries.

Versioning

As mentioned, applications can take advantage of updated libraries without needing recompilation. Library updates can be classified in three categories:

  1. Implementation improvements for the current interface. Bug fixes, performance. (Patch release)
  2. New features, additions to the interface. (Minor release)
  3. Backward-incompatible change to the interface or its operation. (Major release)

An application linked against a library at a given major release will continue to work properly when loading any newer minor or patch release. Applications may not work properly when loading a different major release, or an earlier minor release than that used at link time.

Multiple applications can exist on a machine at once, and each may require different releases of a single library. The system should provide a way to store multiple library releases and load the right one for each app. Different systems have different ways to do it, as we’ll see later.

Version identifiers

Each library release can be marked with a version identifier (or “version”) which seeks to capture information about the library’s release history. There are multiple ways to map release history to a version identifier.

The two most common mapping systems are semantic versioning and libtool versioning. Semantic versioning counts the number of releases of various kinds that have happened, and writes them in lexicographic order. Libtool versioning counts distinct library interfaces.

Semantic versioning is written as major.minor.patch and libtool as current:revision:age. The intuition is that current counts interface changes. Any time the interface changes, whether in a minor or major way, current increases. Here’s how each system would record the same history of release events:

Event     Semver   Libtool
Initial   1.0.0    1:0:0
Minor     1.1.0    2:0:1
Minor     1.2.0    3:0:2
Patch     1.2.1    3:1:2
Major     2.0.0    4:0:0
Patch     2.0.1    4:1:0
Patch     2.0.2    4:2:0
Minor     2.1.0    5:0:1

Here’s how applications answer the question, “Can I load a given library?”

Semver: Does the library to be loaded have the same major version as the library I linked with, and a minor version at least as big?

Libtool: Is the current interface number of the library I linked with between current - age and current of the library to be loaded?
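
As a sketch, here are the two checks in C (a direct transcription of the questions above, not code from any real loader):

#include <stdbool.h>

/* semver: same major version, and at least the minor
   version that was present at link time */
bool semver_compatible(int linked_major, int linked_minor,
                       int lib_major, int lib_minor)
{
	return lib_major == linked_major
	    && lib_minor >= linked_minor;
}

/* libtool: the interface number we linked against must fall
   within [current - age, current] of the library to load */
bool libtool_compatible(int linked_current,
                        int lib_current, int lib_age)
{
	return linked_current >= lib_current - lib_age
	    && linked_current <= lib_current;
}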

We’ll be using semantic versioning in this guide, because libtool versioning is only relevant to libtool, a tool to abstract library creation across platforms. I believe we can make portable libraries without libtool. I mention both systems only to show that there’s more than one way to build version identifiers.

One final note: version identifiers say that things have changed, but omit what changed. More complicated systems exist to track library compatibility. Solaris, for instance, developed a system called symbol versioning. Symbol versioning chases space savings at the expense of operational complexity, and we’ll consider it later.

API vs ABI

One subtlety of versioning is that changes can happen in either a library’s programming interface (API) or binary interface (ABI). A C library’s programming interface is defined through its header files. A backward-incompatible API change means a program written for the previous version would not compile when including headers from the new version.

By contrast, a binary interface is a runtime concept. It concerns the calling conventions for functions, or the memory layout (and meaning) of data shared between program and library. The ABI ensures compatibility at load and run-time, while the API ensures compatibility at compile and link time.

The two interfaces usually change hand-in-hand, and people sometimes confuse them. It’s possible for one to change without the other, though.

Examples of breaking ABI, but API stability:

In these library changes, application code doesn’t need to change, but does need to be recompiled with the new library headers in order to work at runtime.

  • A changed numerical value behind a #define constant. A program compiled before the change would pass the wrong value to the library.
  • Reordered elements in a struct. The program and library would read different offsets in memory while thinking they refer to the same element, a definite ABI break (see the sketch after this list). Even adding an element after the others changes the structure’s size, and hence its layout within an array, so a field appended at the end may or may not affect a particular library’s ABI.
  • Widening function arguments. For instance changing a short int argument to a long int on an architecture/compiler where their size differs. Recompilation would be necessary to handle e.g. sign extension, or the offset of the next argument.
  • Other languages, including C++, have more opportunities for surprise ABI breakage.
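
To make the struct reordering case concrete, here's a sketch with invented field names:

/* v1 of the library's header */
struct foo { int id; long value; };

/* v2 reorders the fields. The API is unchanged: code naming
   "id" and "value" still compiles. But a program compiled
   against v1 reads "id" at offset 0, where a v2 library now
   stores the start of "value". */
struct foo { long value; int id; };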

Examples of ABI stability, but breaking API:

In these library changes, application code would need to be modified to compile successfully against the new library, even though code compiled before the change could load and call the library without issue.

  • Changing an argument from const foo * to foo *. A pointer to a const object cannot be implicitly converted to a pointer to a non-const object. The ABI doesn’t care though, and moves the same bytes. (If the library does in fact modify the dereferenced value, it may be an unpleasant surprise to the application of course.)
  • Changing the name of a struct element, while keeping its meaning and leaving it in the same position relative to the other elements.

It’s usually easy to tell when you’ve added functionality vs broken backward compatibility, but there are tools to check for sure. For instance, the ABI Compliance Checker can detect breakages in C and C++ libraries.

In light of the versioning discussion earlier, which changes should the version identifier describe? At the very least, the ABI. When the loader is searching for a library, the ABI determines whether a library would be compatible at runtime. However, I think a more conservative versioning scheme is wise, where you bump a version when either the API or ABI change. You’ll end up with potentially more library versions installed, but each shared API/ABI version will provide guarantees at both compilation and runtime.

Variance of linker and loader by system

Linkers (ld, lld)

After compiling object files, the compiler front-end (gcc, clang, cc, c99) will invoke the linker (ld, lld) to find unresolved symbols and match them across object files or in shared libraries. The linker searches only the shared libraries requested by the front-end, in the order specified on the command line. If an unresolved symbol is found in a listed library, the linker marks a dependency on that library in the generated executable.

The -l option adds a library to the list of candidates for symbol search. To add libfoo.so (or libfoo.dylib on Mac), specify -lfoo. The linker looks for the library files in its search path. To add directories to the default search path(s), use -L, for instance -L/usr/local/lib.

What happens if multiple versions of a library exist in the same directory? For instance two major versions, libfoo.so.1 and libfoo.so.2? OpenBSD knows about version numbers, and would pick the highest version automatically for -lfoo. Linux and Mac would match neither, because they’re looking for an exact match of libfoo.so (or libfoo.dylib). Similarly, what if both a static and dynamic library exist in the same directory, libfoo.a and libfoo.so? All systems will choose the dynamic one.

Greater control is necessary. The GNU linker has a colon option to solve the problem, for instance -l:libfoo.so.1. However, other linkers (such as Mac’s) don’t have it, so a truly portable build shouldn’t rely on it. Some systems solve the problem by creating a symlink from libfoo.so to the specific library desired. However, when done in a system location like /usr/local/lib, it nominates a single inflexible link-time version for the whole system. I’ll suggest a different solution later that involves storing link-time files in a separate place from load-time libraries.

Loaders (ld.so, dyld)

At launch time, programs with dynamic library dependencies load and run ld.so (or dyld on Mac) to find and load the rest of their dependencies. The loader inspects DT_NEEDED ELF tags (or LC_LOAD_DYLIB commands in Mach-O on Mac) to determine which library filenames to find on the system. Interestingly, these values are not specified by the program developer, but by the library developer. They are extracted from the libraries themselves at link time.

Dynamic libraries contain an internal “runtime name” called SONAME in ELF, or install_name in Mach-O. An application may link against a file named libfoo.so, but the library’s SONAME can say, “search for me under the filename libfoo.so.1.2 at load time.” The loader cares only about filenames; it never consults SONAMEs. Conversely, the linker’s output cares only about SONAMEs, not input library filenames.
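
You can inspect these names directly. For example (the library and program names here are placeholders):

# ELF: the SONAME a library advertises, and the NEEDED
# entries the linker recorded in a program
readelf -d libfoo.so.1.2.3 | grep SONAME
readelf -d myapp | grep NEEDED

# Mach-O: a library's install_name
otool -D libfoo.1.dylib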

Loaders in different operating systems go about finding dependent libraries slightly differently. OpenBSD’s ld.so is very true to the SunOS model, and understands semantic versions. For instance, if asked to load libfoo.so.1.2, it will attempt to find libfoo.so.1.x with the largest x ≥ 2. FreeBSD also claims to have this behavior, but I didn’t observe it in my tests.

In 1995, Solaris 2.5 created a way to track semantic versioning at the symbol level, rather than for the entire library. With symbol versioning there would be a single e.g. libfoo.so file that simply grows over time. Every function inside is marked with a version number. The same function name can even exist under multiple versions with different implementations.

The advantage of symbol versioning is that it can save space. In the alternative, where versioning is per-library rather than per-symbol, a large percentage of object code is often copied unchanged from one library version to the next. The disadvantages of symbol versioning are:

  1. It’s harder to see exactly which versions are installed on a system. Versions are hidden within libraries, rather than visible in filenames.
  2. Library developers have to maintain a separate symbol mapfile for the linker.

Symbol versioning quickly found its way into Linux, and became a staple of Glibc. Because of Linux’s symbol versioning preference, its ld.so doesn’t make any effort to rendezvous with the latest minor library version (à la SunOS or OpenBSD). Ld.so searches for an exact match between SONAME and filename.

However, even on Linux, most libraries don’t use symbol versioning. Also, their SONAMEs typically record only a major version (like libfoo.so.2). Within that major version, you just have to hope the hidden minor version is new enough for all applications compiled or installed on the system. If an app relies on functions added in a later minor library version, it’ll crash when it attempts to call them. (Setting the environment variable LD_BIND_NOW=1 will attempt to resolve all symbols at program start instead, to detect the failure up front.)
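
For example, to surface the failure at launch instead of mid-run:

# eagerly resolve every symbol; if the installed library is
# too old to provide one, the program fails at startup
LD_BIND_NOW=1 ./myapp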

MacOS uses an entirely different object format (Mach-O rather than ELF), and a differently named loader library (dyld rather than ld.so). Mac’s dynamically linked libraries are named .dylib, and their version numbers precede the extension.

Native Mac applications are usually installed into their own dedicated directories, with libraries bundled inside. Thus the loader has special provisions for finding libraries, like the keywords @executable_path, @loader_path and @rpath in the install_name. MacOS supports system libraries too, with dyld consulting the DYLD_FALLBACK_LIBRARY_PATH, by default $(HOME)/lib:/usr/local/lib:/lib:/usr/lib.

Like Linux, Mac does an exact name match – no minor version rendezvous. Unlike Linux, libraries can record their full semantic version internally, and a “compatibility” version. The compatibility version gets copied into an application at link time, and says the application requires at least that version at runtime.

For example, libfoo.1.dylib with full version 1.2.3 should have a compatibility version of 1.2.0 according to the rules of semantic versioning. An application linked against it would refuse to load libfoo with lesser minor version, like 1.1.5. At load time, the user would see a clear error:

dyld: Library not loaded: libfoo.1.dylib
  Referenced from: myapp
  Reason: Incompatible library version: myapp requires version 1.2.0 or later,
          but libfoo.1.dylib provides version 1.1.5

Portable best practices

Linking

Standard practice is to create symlinks libfoo.so -> libfoo.so.x -> libfoo.so.x.y.z in a shared system directory. The first link (without the version number) is for linking at build time. The problem is that it’s pinned to one version: there’s no portable way to select which version to link against when multiple versions are installed.

Also, standard practice pays even less attention to versioning header files. Sometimes whichever version was most recently installed overwrites them in /usr/local/include. Sometimes the headers are maintained only at the major version level, in /usr/local/include/libfoo-n.

To solve these problems, I suggest bundling all development (linking) library files together into a different directory structure per version. Since I advocated earlier that the “total” library version should be bumped whenever the API or ABI changes, the same version safely applies to headers and binaries.

First choose an installation PREFIX. If the system has an /opt directory, pick that, otherwise /usr/local. In this directory, add dynamic and/or static libraries, headers, man pages, and pkg-config files as desired:

$PREFIX/libfoo-dev.x.y.z
├── libfoo.pc
├── libfoo-static.pc
├── include
│   └── foo
│       ├── ...
│       └── ...
├── lib
│   ├── libfoo.so (or dylib or dll)
│   └── static
│       └── libfoo.a
└── man
    ├── ...
    └── ...

Linking against libfoo.x.y.z is easy. In a Makefile, set your flags like this:

CFLAGS  += -I/opt/libfoo-dev.x.y.z/include 
LDFLAGS += -L/opt/libfoo-dev.x.y.z/lib
LDLIBS  += -lfoo

# an example suffix rule using the flags
.c:
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)

Version flexibility with pkg-config

Pkg-config can allow an application to express a range of acceptable library versions, rather than hardcoding a specific one. In a configure script, we’ll test for the library’s presence and version, and output the flags to config.mk:

# supposing we require libfoo 1.x for x >= 1
pkg-config --print-errors 'libfoo >= 1.1, libfoo < 2.0'

# save flags to config.mk
cat > config.mk <<-EOF
	CFLAGS += $(pkg-config --cflags libfoo)
	LDFLAGS += $(pkg-config --libs-only-L libfoo)
	LDLIBS += $(pkg-config --libs-only-l libfoo)
EOF

Then our Makefile becomes:

include config.mk

.c:
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)

To choose a specific version of libfoo, we can add it to the pkg-config search path and run the configure script:

# make desired libfoo version visible to pkg-config
export PKG_CONFIG_PATH="/opt/libfoo-dev.x.y.z:$PKG_CONFIG_PATH"

./configure
make

To create pkg-config .pc files for a library, see Dan Nicholson’s guide. In order to offer both a static and dynamic library, the best way I could imagine was to release separate files, libfoo.pc and libfoo-static.pc that differ in their -L flag. One uses lib and another lib/static. (Pkg-config’s --static flag is a bit of a misnomer, and just passes items in Libs.private in addition to Libs in the build process.)
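
For reference, a minimal libfoo.pc for this layout might look like the following (the paths and version are placeholders following the directory structure shown earlier):

prefix=/opt/libfoo-dev.x.y.z
includedir=${prefix}/include
libdir=${prefix}/lib

Name: libfoo
Description: An example library
Version: x.y.z
Cflags: -I${includedir}
Libs: -L${libdir} -lfoo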

Loading

This section talks about installing dynamic libraries for system-wide loading. Libraries installed for this purpose are not meant to link with at compile time, but to load at runtime.

ELF installation (BSD/Linux)

ELF objects don’t have much version metadata. SONAME is about it. That, combined with the lackluster behavior of loaders on some systems, means the traditional installation technique doesn’t work too well.

Let’s review the traditional way to install ELF libraries, and then a safer method I designed.

Traditional installation method

  1. For version x.y.z, compile libfoo.so with SONAME libfoo.so.x
  2. Copy libfoo.so to /usr/local/lib/libfoo.so.x.y.z
  3. Create symlink libfoo.so.x -> libfoo.so.x.y.z

This way allows a sysadmin to see exactly which versions are installed, and to have multiple major versions installed at once. It doesn’t allow multiple minor versions per major (although usually only the latest minor is needed), and more importantly doesn’t offer protection against loading too old a minor version.

Safer installation method

  1. For version x.y.z, compile libfoo.so with SONAME libfoo.so.x.y

    # use compilation flags
    -shared -Wl,-soname,libfoo.so.${MAJOR}.${MINOR}
  2. Copy libfoo.so to /usr/local/lib/libfoo.so.x.y.z

  3. Backfill minor version symlinks in DEST:

    i=0
    while [ $i -le "$MINOR" ]; do
    	ln -fs "libfoo.so.$VER" "$DEST/libfoo.so.$MAJOR.$i"
    	i=$((i+1))
    done

At the cost of potentially a lot of minor version symlinks, this technique emulates the SunOS and OpenBSD behavior of minor version rendezvous. Also, because the SONAME has major.minor granularity, it will protect against loading too old a minor version.
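
For example, with MAJOR=1, MINOR=2, and VER=1.2.3, the backfill loop leaves these entries in $DEST (listing abbreviated):

libfoo.so.1.0 -> libfoo.so.1.2.3
libfoo.so.1.1 -> libfoo.so.1.2.3
libfoo.so.1.2 -> libfoo.so.1.2.3
libfoo.so.1.2.3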

(As an alternative to the symlinks, FreeBSD has libmap.conf)

Mach-O installation (MacOS)

Mach-O has more version metadata inside than ELF, so a traditional install works fine here.

  1. For version x.y.z, compile libfoo.dylib with

    • install_name libfoo.x.dylib
    • current version x.y.z
    • compatibility version x.y
    # use compilation flags
    -dynamiclib -install_name "libfoo.${MAJOR}.dylib" \
                -current_version ${VER} \
                -compatibility_version ${MAJOR}.${MINOR}.0
  2. Copy libfoo.dylib to /usr/local/lib/libfoo.x.dylib

It’s important to set the compatibility version correctly so that Mac’s dyld will prevent loading too old a minor version. To upgrade the library, overwrite libfoo.x.dylib with a build of a later internal minor release.

Example code

For an example of how to build a library portably, and install it conveniently for the linker and loader, see begriffs/libderp. It’s my first shared library, where I tested the ideas for this article.

]]>
Tips for stable and portable software https://begriffs.com/posts/2020-08-31-portable-stable-software.html 2020-08-31T00:00:00Z 2020-08-31T00:00:00Z After several years’ involvement with quickly evolving programming languages, I’ve come to appreciate stability. I’d like to make my programs easy to build on a wide variety of systems with minimal adjustment. I’d like them to keep working long into the future as environments change.

To think about stability more clearly, let’s divide a functioning program into its layers. Then we can examine development choices one layer at a time.

[Figure: concentric circles of program resources]

The more features a program needs, the further out it must reach through the layers.

Layer 0: Programming language

Choose a language with multiple implementations and a standard

Every language has to start somewhere, often as an implementation by a single person or small group. At this stage the language evolves rapidly, and to be fair it’s this stage that advances the state of the art.

However, using a language in its single-implementation stage means you’re committing a percentage of your energy to the “research project” of the language itself. You’ll deal with breaking changes (including tools), and experimental dead-ends.

If you love the idea behind a new language, or believe it’s a winner and that your early familiarity will pay off, then go for it! Otherwise use a language that has advanced beyond a single implementation. That way you can focus on your domain of expertise rather than keeping up with a language research agenda.

Languages get to the next stage when groups of people fork them for new situations and architectures. Some people add features, other people discover difficulties in their environments. Stakeholders then debate and reach consensus through a standardization process. The end result is that the standard, rather than a particular software artifact, defines the language and has the final say.

Naturally the whole thing takes a while. Standardized languages are going to be fairly old. They’ll miss out on recent ideas, but will be well understood. Here are some mature languages with standards:

  • Ada
  • C
  • Common Lisp
  • ECMAScript
  • Pascal
  • SQL

I’ve been using C lately because of its portability, simple (yet expressive) abstract machine model, and deep compatibility with POSIX and foundational libraries.

Avoid – or wrap – compiler language extensions

If you’re using a language with a standard, take advantage of it. First, choose a specific version of the standard. Older versions are generally more widely supported, but have fewer features. In the C world I usually pick C99 because it has some conveniences over C89, and is still supported pretty much everywhere (although only partially on Windows).

Consult your compiler documentation to see if the compiler can catch accidental uses of non-standard behavior. In clang or gcc, add the following flags to your Makefile:

# enforce a specific version of the standard
CFLAGS += -std=c99 -pedantic

Substitute another version for “c99” as desired. The pedantic flag rejects all programs that use forbidden extensions, and some other programs that do not follow ISO C.

If you do want to use compiler extensions (such as those in gcc or clang), wrap them behind your own macros so that the code stays portable. The PostgreSQL project does this kind of thing in c.h. Here’s an example at random:

/*
 * Use "pg_attribute_always_inline" in place of "inline" for functions that
 * we wish to force inlining of, even when the compiler's heuristics would
 * choose not to.  But, if possible, don't force inlining in unoptimized
 * debug builds.
 */
#if (defined(__GNUC__) && __GNUC__ > 3 && defined(__OPTIMIZE__)) || defined(__SUNPRO_C) || defined(__IBMC__)
/* GCC > 3, Sunpro and XLC support always_inline via __attribute__ */
#define pg_attribute_always_inline __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
/* MSVC has a special keyword for this */
#define pg_attribute_always_inline __forceinline
#else
/* Otherwise, the best we can do is to say "inline" */
#define pg_attribute_always_inline inline
#endif

Notice how they adapt to various compilers and provide a final fallback. Of course, avoiding extensions in the first place is the simplest option, when possible.

Layer 1: Standard library

Learn it, and consult the standard

Take time to learn your language’s standard library. It’s a freebie, you get it wherever your program goes. Read about the library functions in the language standard, since they will be covered there.

Gaining knowledge of the standard library can help reduce reliance on unnecessary third-party libraries. The ECMAScript world, for instance, is rife with micro-libraries that attempt to supplement the ECMA standard’s real or perceived shortcomings.

The size of a single-implementation language’s library is a trade-off between ease of implementation and ease of use. A giant library like that in the Go language makes it harder for creators of would-be rival implementations, and thus slows the progress to a robust standard.

To learn more about the C standard library, see my article.

Learn the rationale and gotchas

Because standards bodies avoid breaking existing codebases, and because stable languages are slow to change, there will be weird or dangerous functions in the standard library. However the dangers are well known and documented in supporting literature, unlike the dangers in new, relatively untested systems.

Here are some great books for C:

  • “The CERT C Coding Standard” by Robert C. Seacord (ISBN 978-0321984043). Illustrates potential insecurity with, among other things, the standard library. Lists real code that caused vulnerabilities.
  • “The Standard C Library” by P. J. Plauger (ISBN 978-0131315099). Thorough details about the C89 stdlib.
  • “C Traps and Pitfalls” by Andrew Koenig (ISBN 978-0201179286).
  • “C Programming FAQs” by Steve Summit (ISBN 978-0201845198). I can see why these were historically the most frequently asked questions. I asked many of them myself.

Also the C99 standard has an accompanying rationale document. It talks about alternate designs considered and rejected.

Layer 2: POSIX

Similarly to how competing C implementations led to the C standard, the Unix wars led to POSIX. POSIX specifies a “lowest common denominator” interface that many operating systems honor to a greater or lesser degree.

Read the spec, compare with man pages

Whenever you use system calls outside the C standard library, check whether they’re part of POSIX, and if their official description differs from your local man pages. The Open Group offers a free searchable HTML version of POSIX.1. As of this writing it’s POSIX.1-2017 (which is POSIX.1-2008 plus two technical corrigenda).

There’s one more complication: POSIX.1-2008 (aka “Issue 7”) isn’t fully supported everywhere. (For instance I found that macOS doesn’t support pthread barriers, semaphores, or asynchronous thread cancellation.) I think the root cause is that 2008 requires thread and real-time functionality that was previously in optional extensions. If you stick to functionality in POSIX.1-2001 (aka Issue 6) you should be safe on all reasonably recent platforms.

Activate a version

To call POSIX functions you must define the _POSIX_C_SOURCE “feature test” macro before including header files. Select a specific POSIX version by using one of these values:

Edition   Release year   Macro value
1         1988           (N/A)
2         1990           1
3         1992           2
4         1993           199309L
5         1995           199506L
6         2001           200112L
7         2008           200809L

Header files hide or reveal functions based on the feature test macro. For example, the getline() function from Issue 7 allocates memory and reads a line.

/* line.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h> /* ssize_t */

int main(void)
{
	char *line = NULL;
	size_t len = 0;
	ssize_t read;
	while ((read = getline(&line, &len, stdin)) != -1)
		printf("Length %zd: %s", read, line);
	free(line);
	return 0;
}

Trying to use getline() on Issue 6 (POSIX.1-2001) fails:

$ cc -std=c99 -pedantic -Werror -D_POSIX_C_SOURCE=200112L line.c -o line

line.c:10:17: error: implicit declaration of function 'getline' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        while ((read = getline(&line, &len, stdin)) != -1)
                       ^
1 error generated.

Selecting Issue 7 with -D_POSIX_C_SOURCE=200809L fixes it.

Important note: setting _POSIX_C_SOURCE will hide non-POSIX operating system extras in the standard headers. The best practice is to separate your source files into those that are POSIX conformant, and those (hopefully few) that aren’t. Compile the latter without the feature macro and link them all together at the end.
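
Here's a sketch of that split in a Makefile (the file names are invented for illustration):

POSIX_CFLAGS = -D_POSIX_C_SOURCE=200809L

app: main.o os_extras.o
	$(CC) $(LDFLAGS) -o app main.o os_extras.o

# strictly conformant code gets the feature test macro
main.o: main.c
	$(CC) $(CFLAGS) $(POSIX_CFLAGS) -c main.c

# code needing OS extras compiles without it, so the
# headers still reveal the non-POSIX functions
os_extras.o: os_extras.c
	$(CC) $(CFLAGS) -c os_extras.c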

Use POSIX in the build process too

POSIX defines the interface for not just the library functions discussed earlier, but for the shell and common tools too. If you use those tools for your builds then you don’t need to install any extra software on destination machines to compile your project.

Probably the most common sources of accidental lock-in are bashisms and GNU extensions to Make. For scripts, use sh, and use (POSIX) make for Makefiles. Too many projects use GNU features needlessly. In fact, learning the portable subset of Make features leads to cleaner, more reliable builds.

This is a topic for an entire article of its own. Chris Wellons wrote a nice tutorial about it. Also “Managing Projects with make” by Andrew Oram (ISBN 0-937175-90-0) is a little book that’s packed with good advice.

Layer 3: Operating system extras

Operating systems include useful functionality beyond POSIX. For instance extensions to pthreads (setting reader-writer preference or thread processor affinity), access to specialized hardware (like audio or graphics), alternate I/O interfaces and semantics, and functions for safety like strlcpy or pledge.

Three ways to use these features portably are to:

  1. wrap them in your own interface and conditionally compile the implementation, or
  2. build a static shim library (“libcompat”) as part of your project to use when functionality is missing, or
  3. link to a third party library that abstracts the details.

We’ll talk about third-party libraries later. Let’s look at option one now.

Detecting OS functions

Consider the example of generating random data. It requires help from the OS since POSIX offers only pseudo-random numbers.

We’ll split our Makefile into two parts:

  1. Makefile – specifies targets, dependencies and rules, that hold on all systems
  2. config.mk – sets macros and build flags specific to the local system

The Makefile will include the specifics of config.mk like this:

# inside the Makefile...

# set up common options and then...

include config.mk

We’ll generate config.mk with a configure script. A developer will run the script before their first build to detect the environment options. The most primitive way for configure to work would be to parse uname and make decisions based on what OS or distro it sees. A more accurate way is to directly probe for the needed OS C functions.

To see if a C function exists, we can just try compiling test snippets of code and see if they succeed. You might think this is awkward or that it requires cluttering your project with test code, but it’s actually pretty elegant.

First make this shell script helper function:

compiles ()
{
	stage="$(mktemp -d)"
	echo "$2" > "$stage/test.c"
	(cc -Werror "$1" -o "$stage/test" "$stage/test.c" >/dev/null 2>&1)
	cc_success=$?
	rm -rf "$stage"
	return $cc_success
}

The compiles() function takes two arguments: optional compiler flags, and the source code to attempt to compile. The $1 is deliberately left unquoted in the cc call, so that empty flags vanish rather than becoming an empty-string argument that the compiler would mistake for a file name.

Let’s use the helper to check for OS random number generators. The BSD world offers arc4random_buf to get random bytes, and Linux offers getrandom. The configure script can check for each feature like this:

if compiles "" "
	#include <stdint.h>
	#include <stdlib.h>
	int main(void)
	{
		void (*p)(void *, size_t) = arc4random_buf;
		return (intptr_t)p;
	}"
then
	echo "CFLAGS += -DHAVE_ARC4RANDOM" >> config.mk
fi

if compiles "-D_POSIX_C_SOURCE=200112L" "
	#include <stdint.h>
	#include <sys/types.h>
	#include <sys/random.h>
	int main(void)
	{
		ssize_t (*p)(void *, size_t, unsigned int) = getrandom;
		return (intptr_t)p;
	}"
then
	echo "CFLAGS += -DHAVE_GETRANDOM" >> config.mk
fi

See? Not too bad. These code snippets test not only whether the functions exist, but also check their type signatures. Notice how the second example is compiled with POSIX for the ssize_t type, while the first example is intentionally not marked POSIX conformant because doing so would hide the extra function arc4random_buf that BSD puts in stdlib.h.

Wrap OS functions behind your own

It’s helpful to isolate the use of non-portable functions in a distinct translation unit, and export your own interface on top. That way it’s more straightforward to set up conditional compilation in one place, or to refactor in the future.

Let’s continue the example from the previous section of generating random bytes. With the hard work of OS feature detection behind us, we can wrap the differing OS interfaces behind our own function:

#include <stdint.h>
#include <stdlib.h>
#ifdef HAVE_GETRANDOM
#include <sys/random.h>
#endif

void get_random_bytes(void *buf, size_t n)
{
#if defined HAVE_ARC4RANDOM  /* BSD */
	arc4random_buf(buf, n);
#elif defined HAVE_GETRANDOM /* Linux */
	getrandom(buf, n, 0);
#else
#error OS does not provide recognized function to get entropy
#endif
}

The Makefile defines HAVE_ARC4RANDOM or HAVE_GETRANDOM using CFLAGS when the corresponding functions exist. The code can just use ifdefs. Notice the #error in the #else case to fail compilation with a clear message on unsupported platforms.

The degree of portability we strive for causes trade-offs. Example: we could add a fallback to reading /dev/random. The configure script from the previous section could check whether the device exists:

if test -c /dev/random; then
	echo "CFLAGS += -DHAVE_DEVRANDOM" >> config.mk
fi

Using that information, we could add another #elif in get_random_bytes() so that it can potentially work on more systems. However, in this case, the increased portability would require a change in interface. Since fopen() or fread() on /dev/random could fail, our function would need to return bool. Currently the OS functions we’re calling can’t fail, so a void return is fine.
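
Here’s a sketch of how get_random_bytes() might look with that widened interface. It’s illustrative rather than drop-in: the bool return now has to propagate to every caller.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef HAVE_GETRANDOM
#include <sys/types.h>
#include <sys/random.h>
#endif

bool get_random_bytes(void *buf, size_t n)
{
#if defined HAVE_ARC4RANDOM  /* BSD */
	arc4random_buf(buf, n);
	return true;
#elif defined HAVE_GETRANDOM /* Linux */
	/* treat a short read as failure, for simplicity */
	return getrandom(buf, n, 0) == (ssize_t)n;
#elif defined HAVE_DEVRANDOM /* fallback: can fail at runtime */
	FILE *fp = fopen("/dev/random", "rb");
	bool ok = fp && fread(buf, 1, n, fp) == n;
	if (fp)
		fclose(fp);
	return ok;
#else
#error OS does not provide recognized function to get entropy
#endif
}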

Test on multiple OSes and hardware

The true test of portability is, of course, building and running on multiple operating systems, compilers, and hardware architectures. It can be surprising to see what assumptions this can uncover. Testing portability early and often makes it easier to keep a program shipshape.

The PostgreSQL project, for instance, maintains a bunch of disparate machines known as the “buildfarm.” Buildfarm members each have their own OS, compiler, and architecture. The team compiles every new feature on these machines and runs the test suite there.

Focusing on the architectures alone, we can see an impressive variety in the buildfarm:

Even if you have no intention to run on these architectures, testing there will lead to better code. (See my article C Portability Lessons from Weird Machines.)

Layer 4: third-party libraries

Many languages have their own application-level package managers, but C has no exclusive package manager. The language has too much history and spans too many environments to have locked into one. Instead people build dependencies from source, or use the OS package manager.

Build with pkg-config

Linking to libraries requires knowing their path, name, and compiler settings. Additionally we want to know which version is installed and whether it’s in-bounds. Since there’s no application-level package manager for C, we need to use another tool to discover installed libraries.

The most cross-platform way to find – and build against – dependency libraries is pkg-config. The tool allows you to query system packages, regardless of how they were installed. To be compatible with pkg-config, each library foo provides a libfoo.pc file containing keys and values like this:

prefix=/usr/local
exec_prefix=${prefix}
includedir=${prefix}/include
libdir=${exec_prefix}/lib

Name: libfoo
Description: The foo library
Version: 1.2.3
Cflags: -I${includedir}/foo
Libs: -L${libdir} -lfoo

The pkg-config executable can query the metadata and provide flags for your Makefile. Call it from your configure script like this:

# check that a sufficient version is installed
pkg-config --print-errors 'libfoo >= 1.0'

# save flags to config.mk
cat >> config.mk <<-EOF
	CFLAGS += $(pkg-config --cflags libfoo)
	LDFLAGS += $(pkg-config --libs-only-L libfoo)
	LDLIBS += $(pkg-config --libs-only-l libfoo)
EOF

Notice the LDLIBS vs LDFLAGS distinction. LDLIBS are options that need to go at the very end of the build line. The default POSIX make suffix rules don’t mention LDLIBS, but here’s a rule you can use instead:

.c:
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)

Sometimes an operating system will include extra functionality and package it up as a portable library you can use on other operating systems. In this case you can use pkg-config conditionally.

For instance, OpenBSD spun off the LibreSSL project (a more usable OpenSSL). OpenBSD includes the functionality internally, while other systems install it as a package. In the configure script, just do an operating system check:

# LibreSSL
case "$(uname -s)" in
	OpenBSD)
		# included with OS
		echo 'LDLIBS += -ltls' >> config.mk
		;;
	*)
		# requires a package
		pkg-config --print-errors 'libtls >= 2.5.0'
		cat >> config.mk <<-EOF
			CFLAGS += $(pkg-config --cflags libtls)
			LDFLAGS += $(pkg-config --libs-only-L libtls)
			LDLIBS += $(pkg-config --libs-only-l libtls)
		EOF
esac

For more information about pkg-config, see Dan Nicholson’s guide.

Compensating for the standard library

The C standard library has no generic collections. You have to write your own linked lists, trees, and hash tables. Real Programmers™ might like this, but I don’t.

POSIX offers limited help with their interface in search.h:

  • Binary search tree. This interface has worked for me, although twalk() provides no argument for passing auxiliary data to the callback; the callback needs to consult a global or thread-local variable for that. The quality of implementation may vary as well, likely with regard to how (or whether) the tree is balanced. (See the sketch after this list.)
  • Queue. Very basic functions to insert or delete from a doubly linked (possibly circular) list. It takes void*, but expects a structure whose first two members are pointers to the same structure type (forward and backward pointers).
  • Hash table. Unnecessarily constrained interface. It creates a single hash table in hidden memory. You can destroy the table and later make another, but can never have more than one active at a time anywhere in the callstack. Obviously not thread safe, but that seems to be the least of its problems.
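
Here’s a minimal sketch of the binary search tree functions (tsearch, tfind, twalk). They’re part of the XSI option, so compile with -D_XOPEN_SOURCE=700 rather than plain _POSIX_C_SOURCE:

#include <search.h>
#include <stdio.h>
#include <string.h>

static int cmp(const void *a, const void *b)
{
	return strcmp(a, b);
}

/* twalk visits internal nodes up to three times;
   print each node only once, at its in-order visit */
static void print_node(const void *node, VISIT order, int depth)
{
	(void)depth;
	if (order == postorder || order == leaf)
		puts(*(char *const *)node);
}

int main(void)
{
	void *root = NULL;
	char *words[] = {"banana", "apple", "cherry"};
	int i;

	for (i = 0; i < 3; i++)
		tsearch(words[i], &root, cmp);

	twalk(root, print_node); /* apple, banana, cherry */

	if (tfind("apple", &root, cmp))
		puts("apple is present");
	return 0;
}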

To go beyond that, you’ll have to use third-party libraries. Many well-known libraries seem pretty bloated (GLib, tbox, Apache Portable Runtime). I found a smaller, cleaner library called simply C Algorithms. Haven’t used it in a project yet, but it looks stable and well tested. I also built the library locally with added pedantic C99 flags and got no warnings.

Two other stable libraries (code snippets?) which have received a lot of use over the years are Uthash and BSD’s queue(3) (browse queue.h from OpenBSD, or the FreeBSD variant).

Uthash describes itself this way:

“Any C structure can be stored in a hash table using uthash. Just add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. Then use these macros to store, retrieve or delete items from the hash table.”

The BSD queue code has been used and improved all the way back to the 1990s. It provides macros to create and manipulate singly-linked lists, simple queues, lists, and tail queues. The man page is quite good.

The functionality differs between the OpenBSD and FreeBSD codebases. I use the OpenBSD version, though it has a little less functionality. In particular, FreeBSD adds the STAILQ (singly-linked tail queue) and a list swap operation. There was once a CIRCLEQ for circular queues, but it used dodgy coding practices and was removed.
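
Here’s a minimal sketch of the singly-linked list macros, assuming a vendored copy of the BSD header:

#include <stdio.h>
#include <stdlib.h>

#include "queue.h" /* the vendored BSD header */

struct node
{
	int value;
	SLIST_ENTRY(node) link;  /* embeds the forward pointer */
};

SLIST_HEAD(node_list, node); /* declares struct node_list */

int main(void)
{
	struct node_list head = SLIST_HEAD_INITIALIZER(head);
	struct node *n;
	int i;

	for (i = 0; i < 3; i++)
	{
		n = malloc(sizeof *n);
		n->value = i;
		SLIST_INSERT_HEAD(&head, n, link);
	}

	SLIST_FOREACH(n, &head, link)
		printf("%d\n", n->value); /* prints 2 1 0 */

	while (!SLIST_EMPTY(&head))
	{
		n = SLIST_FIRST(&head);
		SLIST_REMOVE_HEAD(&head, link);
		free(n);
	}
	return 0;
}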

Both Uthash and Queue are header files with macros that you vendor into your project and include rather than linking against. In general I consider “header-only libraries” to be undesirable because they abuse the notion of a translation unit, bloat object files, and make debugging harder. However I’ve used these libraries and they do work well.

User interface

The fewer UI features a program requires, the more portable it will be and the fewer opportunities there will be for it to mess up. (Does your command line app really need to output an emoji rocket ship or animated-in-place text spinner?)

The lowest common denominator is the standard I/O library in C, or its equivalent in other languages. Reading and writing text, pretending to be a teletype.

The next level of sophistication is static output but an input line you can modify (like the fancier teletypes that could edit a line before sending). Different terminals support intraline editing differently, and you should use a library to handle it. The classic is GNU readline. Readline provides this functionality:

  • Moving the text cursor (vi and emacs modes)
  • Searching the command history
  • Controlling a kill ring
  • Using tab completion

Its license is straight up GPL though, not even LGPL. There are more permissive knockoffs like libedit (requires ncurses), or linenoise (which is restricted to VT100 terminals/emulators).
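
A minimal readline sketch; link with -lreadline:

#include <stdio.h>
#include <stdlib.h>

#include <readline/readline.h>
#include <readline/history.h>

int main(void)
{
	char *line;

	/* readline provides cursor movement and history keys,
	   and returns a malloc'd string (NULL on end-of-file) */
	while ((line = readline("demo> ")) != NULL)
	{
		if (*line)
			add_history(line); /* recall with up-arrow or reverse search */
		printf("you entered: %s\n", line);
		free(line);
	}
	return 0;
}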

Going up yet another level is the text user interface (TUI), where the whole screen is your canvas, but you draw on it with text. Historically terminal control codes diverged wildly, so a standard programming interface was born, X/Open Curses. The most popular implementation is ncurses, which adds some nonstandard extensions as well.

Curses handles these tasks:

  • Terminal capability detection
  • “Raw” mode keyboard input
  • Cursor motion
  • Line drawing
  • Highlighting, underlining
  • Inserting and deleting lines and characters
  • Status line
  • Area clear
  • Windows
  • Color
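
Here’s a minimal curses sketch touching a few of those tasks; link with -lcurses (or -lncurses):

#include <curses.h>

int main(void)
{
	initscr();         /* detect the terminal and enter curses mode */
	cbreak();          /* deliver keystrokes immediately, no line buffering */
	noecho();          /* don't echo what's typed */

	box(stdscr, 0, 0); /* draw lines around the screen edge */
	attron(A_BOLD);    /* highlighting */
	mvaddstr(2, 4, "Hello from curses; press any key to quit.");
	attroff(A_BOLD);

	refresh();         /* flush changes to the terminal */
	getch();           /* wait for one key */
	endwin();          /* restore the terminal */
	return 0;
}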

To stop pretending the computer is an archaic device from the 70s, you can use the cross-platform SDL2 library. It gives low level access to audio, keyboard, mouse, joystick, and graphics hardware. The platform support really is impressive. Everything from Unix, Mac, and Windows to mobile and web rendering.

Finally, for a classic native desktop application with widgets, the most stable and portable choice is probably Motif. The interface is stark, but it runs everywhere, and won’t change or break on you.

Sample of Motif widgets

The Motif Programming Manual (free download) says this by way of introduction:

So why motif? Because it remains what it has long been: the common native windowing toolkit for all the UNIX platforms, fully supported by all the major operating system vendors. It is still the only truly industrial strength toolkit capable of supporting large scale and long term projects. Everything else is tainted: it isn’t ready or fully functionally complete, or the functional specification changes in a non-backwards-compatible manner per release, or there are performance issues. Perhaps it doesn’t truly port across UNIX systems, or it isn’t fully ICCCM compliant with software written in any other toolkit on the desktop, or there are political battles as various groups try to control the specification for their own purposes. […] With motif, you know where you are: it’s stable, it’s robust, it’s professionally supported, and it all works.

A reference manual is also available for download.

I was a little skeptical that it would be supported on macOS, but I tried the hello world example and, sure enough, it worked fine on XQuartz. I think there’s value in using Motif rather than a monster like GTK.

]]>
Create impeccable MIME email from markdown https://begriffs.com/posts/2020-07-16-generating-mime-email.html 2020-07-16T00:00:00Z 2020-07-16T00:00:00Z The goal

I want to create emails that look their best in all mail clients, whether graphical or text based. Ideally I’d write a message in a simple format like Markdown, and generate the final email from the input file. Additionally, I’d like to be able to include fenced code snippets in the message, and make them available as attachments.

Demo

I created a utility called mimedown that reads markdown through stdin and prints multipart MIME to stdout.

Let’s see it in action. Here’s an example message:

## This is a demo email with code

Hey, does this code look fishy to you?

```crash.c
#include <stdio.h>

int main(void)
{
	char a[] = "string literal";
	char *p  = "string literal";

	/* capitalize first letter */
	p[0] = a[0] = 'S';
	printf("a: %s\np: %s\n", a, p);
	return 0;
}
```

It blows up when I compile it and run it:

```compile.txt
$ cc -std=c99 -pedantic -Wall -Wextra crash.c -o crash
$ ./crash
Bus error: 10
```

Turns out we're invoking undefined behavior.

* The C99 spec, appendix J.2 Undefined Behavior mentions this case:
  > The program attempts to modify a string literal (6.4.5).
* Steve Summit's C FAQ [question 1.32](http://c-faq.com/decl/strlitinit.html)
  covers the difference between an array initialized with string literal vs a
  pointer to a string literal constant.
* The SEI CERT C Coding standard
  [STR30-C](https://wiki.sei.cmu.edu/confluence/display/c/STR30-C.+Do+not+attempt+to+modify+string+literals)
  demonstrates the problem with non-compliant code, and compares with compliant
  fixes.

After running it through the generator and emailing it to myself, here’s how the result looks in the Fastmail web interface:

rendered in fastmail

Notice how the code blocks are displayed inline and are available as attachments with the correct MIME type.

I intentionally haven’t configured Mutt to render HTML, so it falls back to the text alternative in the message, which also looks good. Notice how the message body is interleaved with Content-Disposition: inline attachments for each code snippet.

code and text in Mutt

The email generator also creates references for external urls. It substitutes the urls in the original body text with references, and consolidates the links into a bibliography of type text/uri-list at the end of the message. Here’s another Mutt screenshot of the end of the message, with red circles added.

links as references

The generated MIME structure of our sample message looks like this:

  I     1 <no description>          [multipa/alternativ, 7bit, 3.1K]
  I     2 ├─><no description>            [multipa/mixed, 7bit, 1.7K]
  I     3 │ ├─><no description>      [text/plain, 7bit, utf-8, 0.1K]
  I     4 │ ├─>crash.c                 [text/x-c, 7bit, utf-8, 0.2K]
  I     5 │ ├─><no description>      [text/plain, 7bit, utf-8, 0.1K]
  I     6 │ ├─>compile.txt           [text/plain, 7bit, utf-8, 0.1K]
  I     7 │ ├─><no description>      [text/plain, 7bit, utf-8, 0.5K]
  I     8 │ └─>references.uri     [text/uri-list, 7bit, utf-8, 0.2K]
  I     9 └─><no description>         [text/html, 7bit, utf-8, 1.3K]

At the outermost level, the message is split into two alternatives: HTML and multipart/mixed. Within the multipart/mixed part is a succession of message text and code snippets, all with inline disposition. The final mixed item is the list of referenced urls (if necessary).

Other niceties

Lines of the message body are re-flowed to at most 72 characters, to conform to historical length constraints. Additionally, to accommodate narrow terminal windows, mimedown uses a technique called format=flowed. This is a clever standard (RFC 3676) which adds trailing spaces to any lines that we would like the client reader to re-flow, such as those in paragraphs.

Neither hard wrapping nor format=flowed is applied to code block fences in the original markdown. Code snippets are turned into verbatim attachments and won’t be mangled.

Finally, the HTML version of the message is tasteful and conservative. It should display properly on any HTML client, since it validates with ISO HTML (ISO/IEC 15445:2000, based on HTML 4.01 Strict).

Try it yourself

Clone it here: github.com/begriffs/mimedown. It’s written in portable C99. The only build dependency is the cmark library for parsing markdown.

]]>
Logging TLS session keys in LibreSSL https://begriffs.com/posts/2020-05-25-libressl-keylogging.html 2020-05-25T00:00:00Z 2020-05-25T00:00:00Z LibreSSL is a fork of OpenSSL that improves code quality and security. It was originally developed for OpenBSD, but has since been ported to several platforms (Linux, *BSD, HP-UX, Solaris, macOS, AIX, Windows) and is now the default TLS provider for some of them.

When debugging a program that uses LibreSSL, it can be useful to see decrypted network traffic. Wireshark can decrypt TLS if you provide the secret session key, however the session key is difficult to obtain. It is different from the private key used for functions like tls_config_set_keypair_file(), which merely secures the initial TLS handshake with asymmetric cryptography. The handshake establishes the session key between client and server using a method such as Diffie-Hellman (DH). The session key is then used for efficient symmetric cryptography for the remainder of the communication.

Web browsers, from their Netscape provenance, will log session keys to a file specified by the environment variable SSLKEYLOGFILE when present. Netscape packaged this behavior in its Network Security Services library.

OpenSSL and LibreSSL don’t implement that NSS behavior, although OpenSSL allows code to register a callback for when TLS key material is generated or received. The callback receives a string in the NSS Key Log Format.
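
For illustration, here’s a sketch of that callback mechanism as it appears in OpenSSL 1.1.1 and later (SSL_CTX_set_keylog_callback). Note this is the raw OpenSSL API, not libtls:

#include <stdio.h>
#include <stdlib.h>

#include <openssl/ssl.h>

/* the callback receives one line at a time,
   already in NSS Key Log Format */
static void keylog_cb(const SSL *ssl, const char *line)
{
	const char *path = getenv("SSLKEYLOGFILE");
	FILE *fp;

	(void)ssl;
	if (path && (fp = fopen(path, "a")) != NULL)
	{
		fprintf(fp, "%s\n", line);
		fclose(fp);
	}
}

void enable_keylog(SSL_CTX *ctx)
{
	SSL_CTX_set_keylog_callback(ctx, keylog_cb);
}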

In addition to refactoring OpenSSL code, LibreSSL offers a simplified TLS interface called libtls. The simplicity makes it more likely that applications will use it safely. However, I couldn’t find an easy way to log session keys for my libtls connection.

I found a somewhat hacky way to do it, and asked their development list whether there’s a better way. From the lack of response, I assume there isn’t yet. Posting the solution here in case it’s helpful for anyone else.

This module provides a tls_dump_keylog() function that appends to the file specified in SSLKEYLOGFILE.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#include <openssl/ssl.h>

/* A copy of the tls structure from libtls/tls_internal.h
 *
 * This is a fragile hack! When the structure changes in libtls
 * then it will be Undefined Behavior to alias it with this.
 * See C99 section 6.5 (Expressions), paragraph 7
 */
struct tls_internal {
	struct tls_config *config;
	struct tls_keypair *keypair;

	struct {
		char *msg;
		int num;
		int tls;
	} error;

	uint32_t flags;
	uint32_t state;

	char *servername;
	int socket;

	SSL *ssl_conn;
	SSL_CTX *ssl_ctx;

	struct tls_sni_ctx *sni_ctx;

	X509 *ssl_peer_cert;
	STACK_OF(X509) *ssl_peer_chain;

	struct tls_conninfo *conninfo;

	struct tls_ocsp *ocsp;

	tls_read_cb read_cb;
	tls_write_cb write_cb;
	void *cb_arg;
};

static void printhex(FILE *fp, const unsigned char* s, size_t len)
{
	while (len-- > 0)
		fprintf(fp, "%02x", *s++);
}

bool tls_dump_keylog(struct tls *tls)
{
	FILE *fp;
	SSL_SESSION *sess;
	unsigned int len_key, len_id;
	unsigned char key[256];
	const unsigned char *id;

	const char *path = getenv("SSLKEYLOGFILE");
	if (!path)
		return false;

	/* potentially nonstrict aliasing */
	sess = SSL_get_session(((struct tls_internal*)tls)->ssl_conn);
	if (!sess)
	{
		fprintf(stderr, "Failed to get session for TLS\n");
		return false;
	}
	len_key = SSL_SESSION_get_master_key(sess, key, sizeof key);
	id      = SSL_SESSION_get_id(sess, &len_id);

	if ((fp = fopen(path, "a")) == NULL)
	{
		fprintf(stderr, "Unable to write keylog to '%s'\n", path);
		return false;
	}
	fputs("RSA Session-ID:", fp);
	printhex(fp, id, len_id);
	fputs(" Master-Key:", fp);
	printhex(fp, key, len_key);
	fputs("\n", fp);
	fclose(fp);
	return true;
}

To use the logfile in Wireshark, right click on a TLS packet, and select Protocol Preferences → (Pre)-Master-Secret log filename.

(Pre)-Master-Secret log filename menu item

In the resulting dialog, add the filename of the logfile. Then you can view the decrypted traffic with Follow → TLS Stream.

Follow TLS stream menu item
]]>
Concurrent programming, with examples https://begriffs.com/posts/2020-03-23-concurrent-programming.html 2020-03-23T00:00:00Z 2020-03-23T00:00:00Z Mention concurrency and you’re bound to get two kinds of unsolicited advice: first that it’s a nightmarish problem which will melt your brain, and second that there’s a magical programming language or niche paradigm which will make all your problems disappear.

We won’t run to either extreme here. Instead we’ll cover the production workhorses for concurrent software – threading and locking – and learn about them through a series of interesting programs. By the end of this article you’ll know the terminology and patterns used by POSIX threads (pthreads).

This is an introduction rather than a reference. Plenty of reference material exists for pthreads – whole books in fact. I won’t dwell on all the options of the API, but will briskly give you the big picture. None of the examples contain error handling because it would merely clutter them.

Table of contents

Concurrency vs parallelism

First it’s important to distinguish concurrency vs parallelism. Concurrency is the ability of parts of a program to work correctly when executed out of order. For instance, imagine tasks A and B. One way to execute them is sequentially, meaning doing all steps for A, then all for B:

(diagram: all of A’s steps run, then all of B’s)

Concurrent execution, on the other hand, alternates doing a little of each task until both are complete:

(diagram: A and B interleaved, a little of each in turn)

Concurrency allows a program to make progress even when certain parts are blocked. For instance, when one task is waiting for user input, the system can switch to another task and do calculations.

When tasks don’t just interleave, but run at the same time, that’s called parallelism. Multiple CPU cores can run instructions simultaneously:

(diagram: A and B executing simultaneously on separate cores)

When a program – even without hardware parallelism – switches rapidly enough from one task to another, it can feel to the user that tasks are executing at the same time. You could say it provides the “illusion of parallelism.” However, true parallelism has the potential for greater processor throughput for problems that can be broken into independent subtasks. Some ways of dealing with concurrency, such as multi-threaded programming, can exploit hardware parallelism automatically when available.

Some languages (or more accurately, some language implementations) are unable to achieve true multi-threaded parallelism. Ruby MRI and CPython for instance use a global interpreter lock (GIL) to simplify their implementation. The GIL prevents more than one thread from running at once. Programs in these interpreters can benefit from I/O concurrency, but not extra computational power.

Our first concurrent program

Languages and libraries offer different ways to add concurrency to a program. UNIX for instance has a bunch of disjointed mechanisms like signals, asynchronous I/O (AIO), select, poll, and setjmp/longjmp. Using these mechanisms can complicate program structure and make programs harder to read than sequential code.

Threads offer a cleaner and more consistent way to achieve concurrency. For I/O they’re usually clearer than polling or callbacks, and for processing they are more efficient than Unix processes.

Crazy bankers

Let’s get started by adding concurrency to a program to simulate a bunch of crazy bankers sending random amounts of money from one bank account to another. The bankers don’t communicate with one another, so this is a demonstration of concurrency without synchronization.

Adding concurrency is the easy part. The real work is in making threads wait for one another to ensure a correct result. We’ll see a number of mechanisms and patterns for synchronization later, but for now let’s see what goes wrong without synchronization.

/* banker.c */

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define N_ACCOUNTS 10
#define N_THREADS  20
#define N_ROUNDS   10000

/* 10 accounts with $100 apiece means there's $1,000
   in the system. Let's hope it stays that way...  */
#define INIT_BALANCE 100

/* making a struct here for the benefit of future
   versions of this program */
struct account
{
	long balance;
} accts[N_ACCOUNTS];

/* Helper for bankers to choose an account and amount at
   random. It came from Steve Summit's excellent C FAQ
   http://c-faq.com/lib/randrange.html */
int rand_range(int N)
{
	return (int)((double)rand() / ((double)RAND_MAX + 1) * N);
}

/* each banker will run this function concurrently. The
   weird signature is required for a thread function */
void *disburse(void *arg)
{
	size_t i, from, to;
	long payment;

	/* idiom to tell compiler arg is unused */
	(void)arg;

	for (i = 0; i < N_ROUNDS; i++)
	{
		/* pick distinct 'from' and 'to' accounts */
		from = rand_range(N_ACCOUNTS);
		do {
			to = rand_range(N_ACCOUNTS);
		} while (to == from);

		/* go nuts sending money, try not to overdraft */
		if (accts[from].balance > 0)
		{
			payment = 1 + rand_range(accts[from].balance);
			accts[from].balance -= payment;
			accts[to].balance   += payment;
		}
	}
	return NULL;
}

int main(void)
{
	size_t i;
	long total;
	pthread_t ts[N_THREADS];

	srand(time(NULL));

	for (i = 0; i < N_ACCOUNTS; i++)
		accts[i].balance = INIT_BALANCE;

	printf("Initial money in system: %d\n",
		N_ACCOUNTS * INIT_BALANCE);

	/* start the threads, using whatever parallelism the
	   system happens to offer. Note that pthread_create
	   is the *only* function that creates concurrency */
	for (i = 0; i < N_THREADS; i++)
		pthread_create(&ts[i], NULL, disburse, NULL);

	/* wait for the threads to all finish, using the
	   pthread_t handles pthread_create gave us */
	for (i = 0; i < N_THREADS; i++)
		pthread_join(ts[i], NULL);

	for (total = 0, i = 0; i < N_ACCOUNTS; i++)
		total += accts[i].balance;

	printf("Final money in system: %ld\n", total);
}

The following simple Makefile can be used to compile all the programs in this article:

.POSIX:
CFLAGS = -std=c99 -pedantic -D_POSIX_C_SOURCE=200809L -Wall -Wextra
LDLIBS = -lpthread

.c:
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)

We’re overriding make’s default suffix rule for .c so that -lpthread comes after the source input file. This Makefile will work with any of our programs. If you have foo.c you can simply run make foo and it knows what to do without your needing to add any specific rule for foo to the Makefile.

Data races

Try compiling and running banker.c. Notice anything strange?

Threads share memory directly. Each thread can read and write variables in shared memory without any overhead. However when threads simultaneously read and write the same data it’s called a data race and generally causes problems.

In particular, threads in banker.c have data races when they read and write account balances. The bankers program moves money between accounts, however the total amount of money in the system does not remain constant. The books don’t balance. Exactly how the program behaves depends on thread scheduling policies of the operating system. On OpenBSD the total money seldom stays at $1,000. Sometimes money gets duplicated, sometimes it vanishes. On macOS the result is generally that all the money disappears, or even becomes negative!

The property that money is neither created nor destroyed in a bank is an example of a program invariant, and it gets violated by data races. Note that parallelism is not required for a race, only concurrency.

Here’s the problematic code in the disburse() function:

payment = 1 + rand_range(accts[from].balance);
accts[from].balance -= payment;
accts[to].balance   += payment;

The threads running this code can be paused or interleaved at any time. Not just between any of the statements, but partway through arithmetic operations which may not execute atomically on the hardware. Never rely on “thread inertia,” which is the mistaken feeling that the thread will finish a group of statements without interference.

Let’s examine exactly how statements can interleave between banker threads, and the resulting problems. The columns of the table below are threads, and the rows are moments in time.

Here’s a timeline where two threads read the same account balance when planning how much money to transfer. It can cause an overdraft.

Overdrafting

  1. Thread A: payment = 1 + rand_range(accts[from].balance);
  2. Thread B: payment = 1 + rand_range(accts[from].balance);

     At this point, thread B’s payment-to-be may be in excess of the true balance because thread A has already earmarked some of the money unbeknownst to B.

  3. Thread A: accts[from].balance -= payment;
  4. Thread B: accts[from].balance -= payment;

     Some of the same dollars could be transferred twice and the originating account could even go negative if the overlap of the payments is big enough.

Here’s a timeline where the debit made by one thread can be undone by that made by another.

Lost debit

  Thread A: accts[from].balance -= payment;
  Thread B: accts[from].balance -= payment; (at the same moment)

  If -= is not atomic, the threads might switch execution after reading the balance and after doing arithmetic, but before assignment. Thus one assignment would be overwritten by the other. The “lost update” creates extra money in the system.

Similar problems can occur when bankers have a data race in destination accounts. Races in the destination account would tend to decrease total money supply. (To learn more about concurrency problems, see my article Practical Guide to SQL Transaction Isolation).

Locks and deadlock

In the example above, we found that a certain section of code was vulnerable to data races. Such tricky parts of a program are called critical sections. We must ensure each thread gets all the way through the section before another thread is allowed to enter it.

To give threads mutually exclusive access to a critical section, pthreads provides the mutually exclusive lock (mutex for short). The pattern is:

pthread_mutex_lock(&some_mutex);

/* ... do things in the critical section ... */

pthread_mutex_unlock(&some_mutex);

Any thread calling pthread_mutex_lock on a previously locked mutex will go to sleep and not be scheduled until the mutex is unlocked (and any other threads already waiting on the mutex have gone first).

Another way to look at mutexes is that their job is to preserve program invariants. The critical section between locking and unlocking is a place where a certain invariant may be temporarily broken, as long as it is restored by the end. Some people recommend adding an assert() statement before unlocking, to help document the invariant. If an invariant is difficult to specify in an assertion, a comment can be useful instead.
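
In the banker program, for instance, the end of the critical section could assert its invariant like this (a sketch: combined is a hypothetical local saved just after taking both locks, and it needs <assert.h>):

long combined = accts[from].balance + accts[to].balance;

/* ... move the money ... */

/* a transfer preserves the combined balance */
assert(accts[from].balance + accts[to].balance == combined);

pthread_mutex_unlock(&accts[to].mtx);
pthread_mutex_unlock(&accts[from].mtx);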

A function is called thread-safe if multiple invocations can safely run concurrently. A cheap, but inefficient, way to make any function thread-safe is to give it its own mutex and lock it right away:

/* inefficient but effective way to protect a function */

pthread_mutex_t foo_mtx = PTHREAD_MUTEX_INITIALIZER;

void foo(/* some arguments */)
{
	pthread_mutex_lock(&foo_mtx);

	/* we're safe in here, but it's a bottleneck */

	pthread_mutex_unlock(&foo_mtx);
}

To see why this is inefficient, imagine if foo() was designed to output characters to a file specified in its arguments. Because the function takes a global lock, no two threads could run it at once, even if they wanted to write to different files. Writing to different files should be independent activities, and what we really want to protect against are two threads concurrently writing the same file.

The amount of data that a mutex protects is called its granularity, and smaller granularity can often be more efficient. In our foo() example, we could store a mutex for every file we write, and have the function choose and lock the appropriate mutex. Multi-threaded programs typically add a mutex as a member variable to data structures, to associate the lock with its data.

Let’s update the banker program to keep a mutex in each account and prevent data races.

/* banker_lock.c */

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define N_ACCOUNTS 10
#define N_THREADS  100
#define N_ROUNDS   10000

struct account
{
	long balance;
	/* add a mutex to prevent races on balance */
	pthread_mutex_t mtx;
} accts[N_ACCOUNTS];

int rand_range(int N)
{
	return (int)((double)rand() / ((double)RAND_MAX + 1) * N);
}

void *disburse(void *arg)
{
	size_t i, from, to;
	long payment;

	(void)arg;

	for (i = 0; i < N_ROUNDS; i++)
	{
		from = rand_range(N_ACCOUNTS);
		do {
			to = rand_range(N_ACCOUNTS);
		} while (to == from);

		/* get an exclusive lock on both balances before
		   updating (there's a problem with this, see below) */
		pthread_mutex_lock(&accts[from].mtx);
		pthread_mutex_lock(&accts[to].mtx);
		if (accts[from].balance > 0)
		{
			payment = 1 + rand_range(accts[from].balance);
			accts[from].balance -= payment;
			accts[to].balance   += payment;
		}
		pthread_mutex_unlock(&accts[to].mtx);
		pthread_mutex_unlock(&accts[from].mtx);
	}
	return NULL;
}

int main(void)
{
	size_t i;
	long total;
	pthread_t ts[N_THREADS];

	srand(time(NULL));

	/* set the initial balance, but also create a
	   new mutex for each account */
	for (i = 0; i < N_ACCOUNTS; i++)
		accts[i] = (struct account)
			{100, PTHREAD_MUTEX_INITIALIZER};

	for (i = 0; i < N_THREADS; i++)
		pthread_create(&ts[i], NULL, disburse, NULL);

	puts("(This program will probably deadlock, "
	     "and need to be manually terminated...)");

	for (i = 0; i < N_THREADS; i++)
		pthread_join(ts[i], NULL);

	for (total = 0, i = 0; i < N_ACCOUNTS; i++)
		total += accts[i].balance;

	printf("Total money in system: %ld\n", total);
}

Now everything should be safe. No money being created or destroyed, just perfect exchanges between the accounts. The invariant is that the total balance of the source and destination accounts is the same before we transfer the money as after. It’s broken only inside the critical section.

As a side note, at this point you might think it would be more efficient to take a single lock at a time, like this:

  • lock the source account
  • withdraw money into a thread local variable
  • unlock the source account
  • (danger zone!)
  • lock the destination account
  • deposit the money
  • unlock the destination account

This would not be safe. During the time between unlocking the source account and locking the destination, the invariant does not hold, yet another thread could observe this state. For instance a report running in another thread just at that time could read the balance of both accounts and observe money missing from the system.

We do need to lock both accounts during the transfer. However the way we’re doing it causes a different problem. Try to run the program. It gets stuck forever and never prints the final balance! Its threads are deadlocked.

Deadlock is the second villain of concurrent programming, and happens when threads wait on each others’ locks, but no thread unlocks for any other. The case of the bankers is a classic simple form called the deadly embrace. Here’s how it plays out:

Deadly embrace

  1. Thread A: lock account 1
  2. Thread B: lock account 2
  3. Thread A: lock account 2

     At this point thread A is blocked, because thread B already holds a lock on account 2.

  4. Thread B: lock account 1

     Now thread B is blocked too, because thread A holds a lock on account 1. However thread A will never unlock account 1, because thread A is blocked!

The problem happens because threads lock resources in different orders, and because they refuse to give locks up. We can solve the problem by addressing either of these causes.

The first approach to preventing deadlock is to enforce a locking hierarchy. This means the programmer comes up with an arbitrary order for locks, and always takes “earlier” locks before “later” ones. The terminology comes from locks in hierarchical data structures like trees, but it really amounts to using any kind of consistent locking order.

In our case of the banker program we store all the accounts in an array, so we can use the array index as the lock order. Let’s compare.

/* the original way to lock mutexes, which caused deadlock */

pthread_mutex_lock(&accts[from].mtx);
pthread_mutex_lock(&accts[to].mtx);
/* move money */
pthread_mutex_unlock(&accts[to].mtx);
pthread_mutex_unlock(&accts[from].mtx);

Here’s a safe way, enforcing a locking hierarchy:

/* lock mutexes in earlier accounts first */

#define MIN(a,b) ((a) < (b) ? (a) : (b))
#define MAX(a,b) ((a) < (b) ? (b) : (a))

pthread_mutex_lock(&accts[MIN(from, to)].mtx);
pthread_mutex_lock(&accts[MAX(from, to)].mtx);
/* move money */
pthread_mutex_unlock(&accts[MAX(from, to)].mtx);
pthread_mutex_unlock(&accts[MIN(from, to)].mtx);

/* notice we unlock in opposite order */

A locking hierarchy is the most efficient way to prevent deadlock, but it isn’t always easy to contrive. It also creates a potentially undocumented coupling between different parts of a program, which need to collaborate in the convention.

Backoff is a different way to prevent deadlock which works for locks taken in any order. It takes a lock, but then checks whether the next is obtainable. If not, it unlocks the first to allow another thread to make progress, and tries again.

/* using pthread_mutex_trylock to dodge deadlock */

while (1)
{
	pthread_mutex_lock(&accts[from].mtx);
	
	if (pthread_mutex_trylock(&accts[to].mtx) == 0)
		break; /* got both locks */

	/* didn't get the second one, so unlock the first */
	pthread_mutex_unlock(&accts[from].mtx);
	/* force a sleep so another thread can try --
	   include <sched.h> for this function */
	sched_yield();
}
/* move money */
pthread_mutex_unlock(&accts[to].mtx);
pthread_mutex_unlock(&accts[from].mtx);

One tricky part is the call to sched_yield(). Without it the loop would immediately try to grab the lock again, competing as hard as it can with other threads that could make more productive use of the lock. That causes livelock, where threads fight for access to the locks. The sched_yield() call relinquishes the processor, sending the calling thread to the back of the scheduler’s run queue.

Despite its flexibility, backoff is definitely less efficient than a locking hierarchy because it can make wasted calls to lock and unlock mutexes. Try modifying the banker program with these approaches and measure how fast they run.

Condition variables

After safely getting access to a shared variable with a mutex, a thread may discover that the value of the variable is not yet suitable for the thread to act upon. For instance, if the thread was looking for an item to process in a shared queue, but found the queue was empty. The thread could poll the value, but this is inefficient. Pthreads provides condition variables to allow threads to wait for events of interest or notify other threads when these events happen.

Condition variables are not themselves locks, nor do they hold any value of their own. They are merely events with a programmer-assigned meaning. For example, a structure representing a queue could have a mutex for safely accessing the data, plus some condition variables. One to represent the event of the queue becoming empty, and another to announce when a new item is added.

Before getting deeper into how condition variables work, let’s see one in action with our banker program. We’ll measure contention between the bankers. First we’ll increase the number of threads and accounts, and keep statistics about how many bankers manage to get inside the disburse() critical section at once. Any time the max score is broken, we’ll signal a condition variable. A dedicated thread will wait on it and update a scoreboard.

/* banker_stats.c */

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

/* increase the accounts and threads, but make sure there are
 * "too many" threads so they tend to block each other */
#define N_ACCOUNTS 50
#define N_THREADS  100
#define N_ROUNDS   10000

#define MIN(a,b) ((a) < (b) ? (a) : (b))
#define MAX(a,b) ((a) < (b) ? (b) : (a))

struct account
{
	long balance;
	pthread_mutex_t mtx;
} accts[N_ACCOUNTS];

int rand_range(int N)
{
	return (int)((double)rand() / ((double)RAND_MAX + 1) * N);
}

/* keep a special mutex and condition variable
 * reserved for just the stats */
pthread_mutex_t stats_mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  stats_cnd = PTHREAD_COND_INITIALIZER;
int stats_curr = 0, stats_best = 0;

/* use this interface to modify the stats */
void stats_change(int delta)
{
	pthread_mutex_lock(&stats_mtx);
	stats_curr += delta;
	if (stats_curr > stats_best)
	{
		stats_best = stats_curr;
		/* signal new high score */
		pthread_cond_broadcast(&stats_cnd);
	}
	pthread_mutex_unlock(&stats_mtx);
}

/* a dedicated thread to update the scoreboard UI */
void *stats_print(void *arg)
{
	int prev_best;

	(void)arg;

	/* we never return, nobody needs to
	 * pthread_join() with us */
	pthread_detach(pthread_self());

	while (1)
	{
		pthread_mutex_lock(&stats_mtx);

		prev_best = stats_best;
		/* go to sleep until stats change, and always
		 * check that they actually have changed */
		while (prev_best == stats_best)
			pthread_cond_wait(
				&stats_cnd, &stats_mtx);

		/* overwrite current line with new score */
		printf("\r%2d", stats_best);
		pthread_mutex_unlock(&stats_mtx);

		fflush(stdout);
	}
}

void *disburse(void *arg)
{
	size_t i, from, to;
	long payment;

	(void)arg;

	for (i = 0; i < N_ROUNDS; i++)
	{
		from = rand_range(N_ACCOUNTS);
		do {
			to = rand_range(N_ACCOUNTS);
		} while (to == from);

		pthread_mutex_lock(&accts[MIN(from, to)].mtx);
		pthread_mutex_lock(&accts[MAX(from, to)].mtx);

		/* notice we still have a lock hierarchy, because
		 * we call stats_change() after locking all account
		 * mutexes (stats_mtx comes last) */
		stats_change(1); /* another banker in crit sec */
		if (accts[from].balance > 0)
		{
			payment = 1 + rand_range(accts[from].balance);
			accts[from].balance -= payment;
			accts[to].balance   += payment;
		}
		stats_change(-1); /* leaving crit sec */

		pthread_mutex_unlock(&accts[MAX(from, to)].mtx);
		pthread_mutex_unlock(&accts[MIN(from, to)].mtx);
	}
	return NULL;
}

int main(void)
{
	size_t i;
	long total;
	pthread_t ts[N_THREADS], stats;

	srand(time(NULL));

	for (i = 0; i < N_ACCOUNTS; i++)
		accts[i] = (struct account)
			{100, PTHREAD_MUTEX_INITIALIZER};

	for (i = 0; i < N_THREADS; i++)
		pthread_create(&ts[i], NULL, disburse, NULL);

	/* start thread to update the user on how many bankers
	 * are in the disburse() critical section at once */
	pthread_create(&stats, NULL, stats_print, NULL);

	for (i = 0; i < N_THREADS; i++)
		pthread_join(ts[i], NULL);

	/* not joining with the thread running stats_print,
	 * we'll let it disappear when main exits */

	for (total = 0, i = 0; i < N_ACCOUNTS; i++)
		total += accts[i].balance;

	printf("\nTotal money in system: %ld\n", total);
}

With fifty accounts and a hundred threads, not all threads will be able to be in the critical section of disburse() at once. It varies between runs. Run the program and see how well it does on your machine. (One complication is that making all threads synchronize on stats_mtx may throw off the measurement, because there are threads who could have executed independently but now must interact.)

Let’s look at how to properly use condition variables. We notified threads of a new event with pthread_cond_broadcast(&stats_cnd). This function marks all threads waiting on stats_cnd as ready to run.

Sometimes multiple threads are waiting on a single cond var. A broadcast will wake them all, but sometimes the event source knows that only one thread will be able to do any work. For instance if only one item is added to a shared queue. In that case the pthread_cond_signal function is better than pthread_cond_broadcast. Unnecessarily waking multiple threads causes overhead. In our case we know that only one thread is waiting on the cond var, so it really makes no difference.

Remember that it’s never wrong to use a broadcast, whereas in some cases it might be wrong to use a signal. Signal is just an optimized broadcast.

The waiting side of a cond var ought always to have this pattern:

pthread_mutex_lock(&mutex);
while (!PREDICATE)
	pthread_cond_wait(&cond_var, &mutex);
pthread_mutex_unlock(&mutex);

Condition variables are always associated with a predicate, and the association is implicit in the programmer’s head. You shouldn’t reuse a condition variable for multiple predicates. The intention is that code will signal the cond var when the predicate becomes true.

Before testing the predicate we lock a mutex that covers the data being tested. That way no other thread can change the data immediately after we test it (also pthread_cond_wait() requires a locked mutex). If the predicate is already true we needn’t wait on the cond var, so the loop falls through, otherwise the thread begins to wait.

Condition variables allow you to make this series of events atomic: unlock a mutex, register our interest in the event, and block. Without that atomicity another thread might awaken to take our lock and broadcast before we’ve registered ourselves as interested. Without the atomicity we could be blocked forever.

When pthread_cond_wait() returns, the calling thread awakens and atomically gets its mutex back. It’s all set to check the predicate again in the loop. But why check the predicate? Wasn’t the cond var signaled because the predicate was true, and isn’t the relevant data protected by a mutex? There are three reasons to check:

  1. If the condition variable had been broadcast, other threads might have been listening, and another might have been scheduled first and might have done our job. The loop tests for that interception.
  2. On some multiprocessor systems, making condition variable wakeup completely predictable might substantially slow down all cond var operations. Such systems allow spurious wakeups, and threads need to be prepared to check if they were woken appropriately.
  3. It can be convenient to signal on a loose predicate. Threads can signal the variable when the event seems likely, or even mistakenly signal, and the program will still work. For instance, we signal when stats_best gets a new high score, but we could have chosen to signal at every invocation of stats_change().

Given that we have to pass a locked mutex to pthread_cond_wait(), which we had to create, why don’t cond vars come with their own built-in mutex? The reason is flexibility. Although you should use only one mutex with a cond var, there can be multiple cond vars for the same mutex. Think of the example of the mutex protecting a queue, and the different events that can happen in the queue.
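
To make that concrete, here’s a sketch of a bounded queue where one mutex pairs with two condition variables, one per event:

#include <pthread.h>
#include <stddef.h>

#define CAP 8

struct queue
{
	int items[CAP];
	size_t head, len;
	pthread_mutex_t mtx;     /* one mutex guards the data... */
	pthread_cond_t nonempty; /* ...with a cond var per event */
	pthread_cond_t nonfull;
};
/* initialize mtx and both cond vars with PTHREAD_MUTEX_INITIALIZER
   and PTHREAD_COND_INITIALIZER, as in the programs above */

void enqueue(struct queue *q, int x)
{
	pthread_mutex_lock(&q->mtx);
	while (q->len == CAP) /* predicate for nonfull */
		pthread_cond_wait(&q->nonfull, &q->mtx);
	q->items[(q->head + q->len++) % CAP] = x;
	pthread_cond_signal(&q->nonempty); /* one consumer can proceed */
	pthread_mutex_unlock(&q->mtx);
}

int dequeue(struct queue *q)
{
	int x;
	pthread_mutex_lock(&q->mtx);
	while (q->len == 0) /* predicate for nonempty */
		pthread_cond_wait(&q->nonempty, &q->mtx);
	x = q->items[q->head];
	q->head = (q->head + 1) % CAP;
	q->len--;
	pthread_cond_signal(&q->nonfull); /* one producer can proceed */
	pthread_mutex_unlock(&q->mtx);
	return x;
}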

Other synchronization primitives

Barriers

It’s time to bid farewell to the banker programs, and turn to something more lively: Conway’s Game of Life! The game has a set of rules operating on a grid of cells that determines which cells live or die based on how many living neighbors each has.

The game can take advantage of multiple processors, using each processor to operate on a different part of the grid in parallel. It’s a so-called embarrassingly parallel problem because each section of the grid can be processed in isolation, without needing results from other sections.

Barriers ensure that all threads have reached a particular stage in a parallel computation before allowing any to proceed to the next stage. Each thread calls pthread_barrier_wait() to rendezvous with the others. One of the threads, chosen randomly, will see the PTHREAD_BARRIER_SERIAL_THREAD return value, which nominates that thread to do any cleanup or preparation between stages.

/* life.c */

#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* mandatory in POSIX.1-2008, but check laggards like macOS */
#include <unistd.h>
#if !defined(_POSIX_BARRIERS) || _POSIX_BARRIERS < 0
#error your OS lacks POSIX barrier support
#endif

/* dimensions of board */
#define ROWS 32
#define COLS 78
/* how long to pause between rounds */
#define FRAME_MS 100
#define THREADS 4

/* proper modulus (in C, '%' is merely remainder) */
#define MOD(x,N) (((x) < 0) ? ((x) % (N) + (N)) : ((x) % (N)))

bool alive[ROWS][COLS], alive_next[ROWS][COLS];
pthread_barrier_t tick;

/* Should a cell live or die? Using ssize_t because we have
   to deal with signed arithmetic like row-1 when row=0 */
bool fate(ssize_t row, ssize_t col)
{
	ssize_t i, j;
	short neighbors = 0;

	assert(0 <= row && row < ROWS);
	assert(0 <= col && col < COLS);

	/* joined edges form a torus */
	for (i = row-1; i <= row+1; i++)
		for (j = col-1; j <= col+1; j++)
			neighbors += alive[MOD(i, ROWS)][MOD(j, COLS)];
	/* don't count self as a neighbor */
	neighbors -= alive[row][col];

	return neighbors == 3 ||
		(neighbors == 2 && alive[row][col]);
}

/* overwrite the board on screen */
void draw(void)
{
	ssize_t i, j;

	/* clear screen (non portable, requires ANSI terminal) */
	fputs("\033[2J\033[1;1H", stdout);

	flockfile(stdout);
	for (i = 0; i < ROWS; i++)
	{
		/* putchar_unlocked is thread safe when stdout is locked,
		   and it's as fast as single-threaded putchar */
		for (j = 0; j < COLS; j++)
			putchar_unlocked(alive[i][j] ? 'X' : ' ');
		putchar_unlocked('\n');
	}
	funlockfile(stdout);
	fflush(stdout);
}

void *update_strip(void *arg)
{
	ssize_t offset = *(ssize_t*)arg, i, j;
	struct timespec t;

	t.tv_sec = 0;
	t.tv_nsec = FRAME_MS * 1000000;

	while (1)
	{
		if (pthread_barrier_wait(&tick) ==
			PTHREAD_BARRIER_SERIAL_THREAD)
		{
			/* we drew the short straw, so we're on graphics duty */

			/* could have used pointers to multidimensional
			 * arrays and swapped them rather than memcpy'ing
			 * the array contents, but it makes the code a
			 * little more complicated with dereferences */
			memcpy(alive, alive_next, sizeof alive);
			draw();
			nanosleep(&t, NULL);
		}

		/* rejoin at another barrier to avoid data race on
		   the game board while it's copied and drawn */
		pthread_barrier_wait(&tick);
		for (i = offset; i < offset + (ROWS / THREADS); i++)
			for (j = 0; j < COLS; j++)
				alive_next[i][j] = fate(i, j);
	}

	return NULL;
}

int main(void)
{
	pthread_t *workers;
	ssize_t *offsets;
	size_t i, j;

	assert(ROWS % THREADS == 0);
	/* main counts as a thread, so need only THREADS-1 more */
	workers = malloc(sizeof(*workers) * (THREADS-1));
	offsets = malloc(sizeof(*offsets) * THREADS); /* one offset per strip */

	srand(time(NULL));
	for (i = 0; i < ROWS; i++)
		for (j = 0; j < COLS; j++)
			alive_next[i][j] = rand() < (int)((RAND_MAX+1u) / 3);

	pthread_barrier_init(&tick, NULL, THREADS);
	for (i = 0; i < THREADS-1; i++)
	{
		offsets[i] = i * ROWS / THREADS;
		pthread_create(&workers[i], NULL, update_strip, &offsets[i]);
	}

	/* use current thread as a worker too */
	offsets[i] = i * ROWS / THREADS;
	update_strip(&offsets[i]);

	/* shouldn't ever get here */
	pthread_barrier_destroy(&tick);
	free(offsets);
	free(workers);
	return EXIT_SUCCESS;
}

It’s a fun example although slightly contrived. We’re adding a sleep between rounds to slow down the animation, so it’s unnecessary to chase parallelism. Also there’s a memoized algorithm called hashlife we should be using if pure speed is the goal. However our code illustrates a natural use for barriers.

Notice how we wait at the barrier twice in rapid succession. After emerging from the first barrier, one of the threads (chosen at random) copies the new state to the board and draws it. The other threads run ahead to the next barrier and wait there so they don’t cause a data race writing to the board. Once the drawing thread arrives at the barrier with them, then all can proceed to calculate cells’ fate for the next round.

Barriers are guaranteed to be present in POSIX.1-2008, but are optional in earlier versions of the standard. Notably macOS is stuck at an old version of POSIX. Presumably they’re too busy “innovating” with their keyboard touchbar to invest in operating system fundamentals.

Spinlocks

Spinlocks are implementations of mutexes optimized for fine-grained locking. Often used in low level code like drivers or operating systems, spinlocks are designed to be the most primitive and fastest sync mechanism available. They’re generally not appropriate for application programming. They are only truly necessary for situations like interrupt handlers when a thread is not allowed to go to sleep for any reason.

Aside from that scenario, it’s better to just use a mutex, since mutexes are pretty efficient these days. Modern mutexes often try a short-lived internal spinlock and fall back to heavier techniques only as needed. Mutexes also sometimes use a wait queue called a futex, which can take a lock in user-space whenever there is no contention from another thread.

When attempting to lock a spinlock, a thread runs a tight loop repeatedly checking a value in shared memory for a sign it’s safe to proceed. Spinlock implementations use special atomic assembly language instructions to test that the value is unlocked and lock it. The particular instructions vary per architecture, and can be performed in user space to avoid the overhead of a system call.

While waiting for a lock, the loop doesn’t block the thread, but instead continues running and burns CPU energy. The technique works only on true multi-processor systems, or a uniprocessor system with preemption enabled. On a uniprocessor system with cooperative threading the loop could never be interrupted, and will livelock.

In POSIX.1-2008 spinlock support is mandatory. In previous versions the presence of this feature was indicated by the _POSIX_SPIN_LOCKS macro. Spinlock functions start with pthread_spin_.
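
The interface mirrors mutexes; a minimal sketch:

#include <pthread.h>

pthread_spinlock_t lock;

void setup(void)
{
	/* PTHREAD_PROCESS_PRIVATE: visible to one process's threads only */
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
}

void critical(void)
{
	pthread_spin_lock(&lock); /* spins, burning CPU, until acquired */
	/* ... keep this region very short ... */
	pthread_spin_unlock(&lock);
}

void teardown(void)
{
	pthread_spin_destroy(&lock);
}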

Reader-writer locks

Whereas a mutex enforces mutual exclusion, a reader-writer lock allows concurrent read access. Multiple threads can read in parallel, but all block when a thread takes the lock for writing. The increased concurrency can improve application performance. However, blindly replacing mutexes with reader-writer locks “for performance” doesn’t work. Our earlier banker program, for instance, could suffer from duplicate withdrawals if it allowed multiple readers in an account at once.

Below is an rwlock example. It’s a password cracker I call 5dm (md5 backwards). It aims for maximum parallelism searching for a preimage of an MD5 hash. Worker threads periodically poll whether one among them has found an answer, and they use a reader-writer lock to avoid blocking on each other when doing so.

The example is slightly contrived, in that the difficulty of brute forcing passwords increases exponentially with their length. Using multiple threads reduces the time by only a constant factor – but 4x faster is still 4x faster on a four core computer!

The example below uses MD5() from OpenSSL. To build it, include this in our previous Makefile:

CFLAGS  += `pkg-config --cflags libcrypto`
LDFLAGS += `pkg-config --libs-only-L libcrypto`
LDLIBS  += `pkg-config --libs-only-l libcrypto`

To run it, pass in an MD5 hash and max preimage search length. Note the -n in echo to suppress the newline, since newline is not in our search alphabet:

$ time ./5dm $(echo -n 'fun' | md5) 5
fun

real  0m0.067s
user  0m0.205s
sys	  0m0.007s

Notice how 0.2 seconds of CPU time elapsed in parallel, but the user got their answer in 0.067 seconds.

On to the code:

/* 5dm.c */

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <openssl/md5.h>
#include <pthread.h>

/* build arbitrary words from the ascii between ' ' and '~' */
#define ASCII_FIRST ' '
#define ASCII_LAST  '~'
#define N_ALPHA (1 + ASCII_LAST - ASCII_FIRST)
/* refuse to search beyond this astronomical length */
#define LONGEST_PREIMAGE 128

#define MAX(x,y) ((x)<(y) ? (y) : (x))

/* a fast way to enumerate words, operating on an array in-place */
unsigned word_advance(char *word, unsigned delta)
{
	if (delta == 0)
		return 0;
	if (*word == '\0')
	{
		*word++ = ASCII_FIRST + delta - 1;
		*word = '\0';
	}
	else
	{
		char c = *word - ASCII_FIRST;
		*word = ASCII_FIRST + ((c + delta) % N_ALPHA);
		if (c + delta >= N_ALPHA)
			return 1 + word_advance(word+1, 1 /* not delta */);
	}
	return 1;
}

/* pack each pair of ASCII hex digits into single bytes */
bool hex2md5(const char *hex, unsigned char *b)
{
	int offset = 0;
	if(strlen(hex) != MD5_DIGEST_LENGTH*2)
		return false;
	while (offset < MD5_DIGEST_LENGTH*2)
	{
		if (sscanf(hex+offset, "%2hhx", b++) == 1)
			offset += 2;
		else
			return false;
	}
	return true;
}

/* random things a worker will need, since thread
 * functions receive only one argument */
struct goal
{
	/* input */
	pthread_t *workers;
	size_t n_workers;
	size_t max_len;
	unsigned char hash[MD5_DIGEST_LENGTH];

	/* output */
	pthread_rwlock_t lock;
	char preimage[LONGEST_PREIMAGE];
	bool success;
};

/* custom starting word for each worker, but shared goal */
struct task
{
	struct goal *goal;
	char initial_preimage[LONGEST_PREIMAGE];
};

void *crack_thread(void *arg)
{
	struct task *t = arg;
	unsigned len, changed;
	unsigned char hashed[MD5_DIGEST_LENGTH];
	char preimage[LONGEST_PREIMAGE];
	int iterations = 0;

	strcpy(preimage, t->initial_preimage);
	len = strlen(preimage);

	while (len <= t->goal->max_len)
	{
		MD5((const unsigned char*)preimage, len, hashed);
		if (memcmp(hashed, t->goal->hash, MD5_DIGEST_LENGTH) == 0)
		{
			/* success -- tell others to call it off */
			pthread_rwlock_wrlock(&t->goal->lock);

			t->goal->success = true;
			strcpy(t->goal->preimage, preimage);

			pthread_rwlock_unlock(&t->goal->lock);
			return NULL;
		}
		/* each worker jumps ahead n_workers words, and all workers
		   started at an offset, so all words are covered */
		changed = word_advance(preimage, t->goal->n_workers);
		len = MAX(len, changed);

		/* check if another worker has succeeded, but only every
		   thousandth iteration, since taking the lock adds overhead */
		if (iterations++ % 1000 == 0)
		{
			/* in the overwhelming majority of cases workers only read,
			   so an rwlock allows them to continue in parallel */
			pthread_rwlock_rdlock(&t->goal->lock);
			int success = t->goal->success;
			pthread_rwlock_unlock(&t->goal->lock);
			if (success)
				return NULL;
		}
	}
	return NULL;
}

/* launch a parallel search for an md5 preimage */
bool crack(const unsigned char *md5, size_t max_len,
           unsigned threads, char *result)
{
	struct goal g =
	{
		.workers   = malloc(threads * sizeof(pthread_t)),
		.n_workers = threads,
		.max_len   = max_len,
		.success   = false,
		.lock      = PTHREAD_RWLOCK_INITIALIZER
	};
	memcpy(g.hash, md5, MD5_DIGEST_LENGTH);

	struct task *tasks = malloc(threads * sizeof(struct task));

	for (size_t i = 0; i < threads; i++)
	{
		tasks[i].goal = &g;
		tasks[i].initial_preimage[0] = '\0';
		/* offset the starting word for each worker by i */
		word_advance(tasks[i].initial_preimage, i);
		pthread_create(g.workers+i, NULL, crack_thread, tasks+i);
	}

	/* if one worker finds the answer, others will abort */
	for (size_t i = 0; i < threads; i++)
		pthread_join(g.workers[i], NULL);

	if (g.success)
		strcpy(result, g.preimage);

	free(tasks);
	free(g.workers);
	return g.success;
}

int main(int argc, char **argv)
{
	char preimage[LONGEST_PREIMAGE];
	int max_len = 4;
	unsigned char md5[MD5_DIGEST_LENGTH];

	if (argc != 2 && argc != 3)
	{
		fprintf(stderr,
		        "Usage: %s md5-string [search-depth]\n",
		        argv[0]);
		return EXIT_FAILURE;
	}

	if (!hex2md5(argv[1], md5))
	{
		fprintf(stderr,
		       "Could not parse as md5: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	if (argc > 2 && strtol(argv[2], NULL, 10))
		if ((max_len = strtol(argv[2], NULL, 10)) > LONGEST_PREIMAGE)
		{
			fprintf(stderr,
					"Preimages limited to %d characters\n",
					LONGEST_PREIMAGE);
			return EXIT_FAILURE;
		}

	if (crack(md5, max_len, 4, preimage))
	{
		puts(preimage);
		return EXIT_SUCCESS;
	}
	else
	{
		fprintf(stderr,
				"Could not find result in strings up to length %d\n",
		        max_len);
		return EXIT_FAILURE;
	}
}

Although read-write locks can be implemented in terms of mutexes and condition variables, such implementations are significantly less efficient than is possible. Therefore, this synchronization primitive is included in POSIX.1-2008 for the purpose of allowing more efficient implementations in multi-processor systems.

The final thing to be aware of is that an rwlock implementation can choose either reader-preference or writer-preference. When readers and writers are contending for a lock, the preference determines who gets to skip the queue and go first. When there is a lot of reader activity with a reader-preference, then a writer will continually get moved to the end of the line and experience starvation, where it never gets to write. I noticed writer starvation on Linux (glibc) when running four threads on a little 1-core virtual machine. Glibc provides the nonportable pthread_rwlockattr_setkind_np() function to specify a preference.
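
Here is a sketch of requesting writer preference with that attribute. Note that glibc documents the plain PTHREAD_RWLOCK_PREFER_WRITER_NP kind as behaving like reader preference, so the NONRECURSIVE variant is the one that actually helps:

/* glibc-only sketch: request writer preference */
#define _GNU_SOURCE
#include <pthread.h>

pthread_rwlock_t lock;

void init_rwlock(void)
{
	pthread_rwlockattr_t attr;
	pthread_rwlockattr_init(&attr);
	pthread_rwlockattr_setkind_np(&attr,
			PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
	pthread_rwlock_init(&lock, &attr);
	pthread_rwlockattr_destroy(&attr);
}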

You may have noticed that workers in our password cracker use polling to see whether the solution has been found, and whether they should give up. We’ll examine a more explicit method of cancellation in a later section.

Semaphores

Semaphores keep count of, in the abstract, an amount of resource “units” available. Threads can safely add or remove a unit without causing a data race. When a thread requests a unit but there are none, then the thread will block.

A semaphore is like a mix between a lock and a condition variable. Unlike mutexes, semaphores have no concept of an owner. Any thread may release threads blocked on a semaphore, whereas with a mutex the lock holder must unlock it. Unlike a condition variable, a semaphore operates independently of a predicate.

An example of a problem uniquely suited for semaphores would be to ensure that exactly two threads run at once on a task. You would initialize the semaphore to the value two, and allow a bunch of threads to wait on the semaphore. After two get past, the rest will block. When each thread is done, it posts one unit back to the semaphore, which allows another thread to take its place.
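
Here is a sketch of that scenario with six threads and two units. (Keep in mind the macOS caveat coming up shortly.)

/* two_at_a_time.c */

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

sem_t slots;

void *work(void *arg)
{
	sem_wait(&slots);   /* take a unit, or block until one is free */
	printf("worker %ld in\n", (long)arg);
	sleep(1);           /* pretend to work */
	printf("worker %ld out\n", (long)arg);
	sem_post(&slots);   /* give the unit back */
	return NULL;
}

int main(void)
{
	pthread_t workers[6];
	sem_init(&slots, 0, 2); /* two units available */
	for (long i = 0; i < 6; i++)
		pthread_create(workers+i, NULL, work, (void*)i);
	for (int i = 0; i < 6; i++)
		pthread_join(workers[i], NULL);
	sem_destroy(&slots);
	return 0;
}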

In reality, if you’ve got pthreads, you only need semaphores for asynchronous signal handlers. You can use them in other situations, but that is the only place they are truly needed: mutexes aren’t async-signal-safe, and making them so would slow down ordinary mutex operation.

Here’s an example of posting a semaphore from a signal handler:

/* sem_tickler.c */

#include <semaphore.h>
#include <signal.h>
#include <stdio.h>

#include <unistd.h>
#if !defined(_POSIX_SEMAPHORES) || _POSIX_SEMAPHORES < 0
#error your OS lacks POSIX semaphore support
#endif

sem_t tickler;

void int_catch(int sig)
{
	(void) sig;

	signal(SIGINT, &int_catch);
	sem_post(&tickler); /* async-signal-safe */
}

int main(void)
{
	sem_init(&tickler, 0, 0);
	signal(SIGINT, &int_catch);

	for (int i = 0; i < 3; i++)
	{
		sem_wait(&tickler);
		puts("That tickles!");
	}
	puts("(Died from overtickling)");
	return 0;
}

Semaphores aren’t even necessary for proper signal handling. It’s easier to have a thread simply sigwait() than it is to set up an asynchronous handler. Note that the signals being waited for must be blocked beforehand, or a signal could be delivered to its default handler before sigwait() retrieves it. In the example below, the main thread waits, but you can spawn a dedicated thread for this in a real application.

/* sigwait_tickler.c */

#include <signal.h>
#include <stdio.h>

int main(void)
{
	sigset_t set;
	int which;
	sigemptyset(&set);
	sigaddset(&set, SIGINT);
	/* block SIGINT so it is delivered only through sigwait() */
	sigprocmask(SIG_BLOCK, &set, NULL);

	for (int i = 0; i < 3; i++)
	{
		sigwait(&set, &which);
		puts("That tickles!");
	}
	puts("(Died from overtickling)");
	return 0;
}

So don’t feel dependent on semaphores. In fact your system may not have them. The POSIX semaphore API works with pthreads and is present in POSIX.1-2008, but is an optional part of POSIX.1b in earlier versions. Apple, for one, decided to punt: the unnamed semaphore functions like sem_init() are stubbed on macOS to return an error. (Named semaphores created with sem_open() do work there.)

Cancellation

Thread cancellation is generally used when you have threads doing long-running tasks and there’s a way for a user to abort through the UI or console. Another common scenario is when multiple threads set off to explore a search space and one finds the answer first.

Our previous reader-writer lock example was the second scenario, where the threads explored a search space. It was an example of do-it-yourself cancellation through polling. However sometimes threads aren’t able to poll, such as when they are blocked on I/O or a lock. Pthreads offers an API to cancel threads even in those situations.

By default a cancelled thread isn’t immediately blown away, because it may have a mutex locked, be holding resources, or have a potentially broken invariant. The canceller wouldn’t know how to repair that invariant without some complicated logic. The thread to be canceled needs to be written to do cleanup and unlock mutexes.

For each thread, cancellation can be enabled or disabled, and if enabled, may be in deferred or asynchronous mode. The default is enabled and deferred, which allows a cancelled thread to survive until the next cancellation point, such as waiting on a condition variable or blocking on I/O (the pthreads(7) man page has the full list). In a purely computational section of code you can add your own cancellation points with pthread_testcancel().

Let’s see how to modify our previous MD5 cracking example using standard pthread cancellation. Three of the functions are the same as before: word_advance(), hex2md5(), and main(). But we now use a condition variable to alert crack() whenever a crack_thread() returns.

/* 5dm-testcancel.c */

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <openssl/md5.h>
#include <pthread.h>

#define ASCII_FIRST ' '
#define ASCII_LAST  '~'
#define N_ALPHA (1 + ASCII_LAST - ASCII_FIRST)
#define LONGEST_PREIMAGE 128

#define MAX(x,y) ((x)<(y) ? (y) : (x))

unsigned word_advance(char *word, unsigned delta)
{
	if (delta == 0)
		return 0;
	if (*word == '\0')
	{
		*word++ = ASCII_FIRST + delta - 1;
		*word = '\0';
	}
	else
	{
		char c = *word - ASCII_FIRST;
		*word = ASCII_FIRST + ((c + delta) % N_ALPHA);
		if (c + delta >= N_ALPHA)
			return 1 + word_advance(word+1, 1 /* not delta */);
	}
	return 1;
}

bool hex2md5(const char *hex, unsigned char *b)
{
	int offset = 0;
	if(strlen(hex) != MD5_DIGEST_LENGTH*2)
		return false;
	while (offset < MD5_DIGEST_LENGTH*2)
	{
		if (sscanf(hex+offset, "%2hhx", b++) == 1)
			offset += 2;
		else
			return false;
	}
	return true;
}

struct goal
{
	/* input */
	pthread_t *workers;
	size_t n_workers;
	size_t max_len;
	unsigned char hash[MD5_DIGEST_LENGTH];

	/* output */
	pthread_mutex_t lock;
	pthread_cond_t returning;
	unsigned n_done;
	char preimage[LONGEST_PREIMAGE];
	bool success;
};

struct task
{
	struct goal *goal;
	char initial_preimage[LONGEST_PREIMAGE];
};

void *crack_thread(void *arg)
{
	struct task *t = arg;
	unsigned len, changed;
	unsigned char hashed[MD5_DIGEST_LENGTH];
	char preimage[LONGEST_PREIMAGE];
	int iterations = 0;

	strcpy(preimage, t->initial_preimage);
	len = strlen(preimage);

	while (len <= t->goal->max_len)
	{
		MD5((const unsigned char*)preimage, len, hashed);
		if (memcmp(hashed, t->goal->hash, MD5_DIGEST_LENGTH) == 0)
		{
			pthread_mutex_lock(&t->goal->lock);

			t->goal->success = true;
			strcpy(t->goal->preimage, preimage);
			t->goal->n_done++;

			/* alert the boss that another worker is done */
			pthread_cond_signal(&t->goal->returning);
			pthread_mutex_unlock(&t->goal->lock);
			return NULL;
		}
		changed = word_advance(preimage, t->goal->n_workers);
		len = MAX(len, changed);

		if (iterations++ % 1000 == 0)
			pthread_testcancel(); /* add a cancellation point */
	}

	pthread_mutex_lock(&t->goal->lock);
	t->goal->n_done++;
	/* alert the boss that another worker is done */
	pthread_cond_signal(&t->goal->returning);
	pthread_mutex_unlock(&t->goal->lock);
	return NULL;
}

/* cancellation cleanup function that we also call
 * during regular exit from the crack() function */
void crack_cleanup(void *arg)
{
	struct task *tasks = arg;
	struct goal *g = tasks[0].goal;

	/* this mutex unlock pairs with the lock in the crack() function */
	pthread_mutex_unlock(&g->lock);
	for (size_t i = 0; i < g->n_workers; i++)
	{
		pthread_cancel(g->workers[i]);
		/* must wait for each to terminate, so that freeing
		 * their shared memory is safe */
		pthread_join(g->workers[i], NULL);
	}
	/* now it's safe to free memory */
	free(g->workers);
	free(tasks);
}

bool crack(const unsigned char *md5, size_t max_len,
           unsigned threads, char *result)
{
	struct goal g =
	{
		.workers   = malloc(threads * sizeof(pthread_t)),
		.n_workers = threads,
		.max_len   = max_len,
		.success   = false,
		.n_done    = 0,
		.lock      = PTHREAD_MUTEX_INITIALIZER,
		.returning = PTHREAD_COND_INITIALIZER
	};
	memcpy(g.hash, md5, MD5_DIGEST_LENGTH);

	struct task *tasks = malloc(threads * sizeof(struct task));

	for (size_t i = 0; i < threads; i++)
	{
		tasks[i].goal = &g;
		tasks[i].initial_preimage[0] = '\0';
		word_advance(tasks[i].initial_preimage, i);
		pthread_create(g.workers+i, NULL, crack_thread, tasks+i);
	}

	/* coming up to cancellation points, so establish
	 * a cleanup handler */
	pthread_cleanup_push(crack_cleanup, tasks);

	pthread_mutex_lock(&g.lock);
	/* We can't join() on all the workers now because it's up to
	 * us to cancel them after one finds the answer. We have to
	 * remain responsive and not block on any particular worker */
	while (!g.success && g.n_done < threads)
		pthread_cond_wait(&g.returning, &g.lock);
	/* at this point either a thread succeeded or all have given up */
	if (g.success)
		strcpy(result, g.preimage);
	/* mutex unlocked in the cleanup handler */

	/* Use the same cleanup handler for normal exit too. The "1"
	 * argument says to execute the function we had previously pushed */
	pthread_cleanup_pop(1);
	return g.success;
}

int main(int argc, char **argv)
{
	char preimage[LONGEST_PREIMAGE];
	int max_len = 4;
	unsigned char md5[MD5_DIGEST_LENGTH];

	if (argc != 2 && argc != 3)
	{
		fprintf(stderr,
		        "Usage: %s md5-string [search-depth]\n",
		        argv[0]);
		return EXIT_FAILURE;
	}

	if (!hex2md5(argv[1], md5))
	{
		fprintf(stderr,
		       "Could not parse as md5: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	if (argc > 2 && strtol(argv[2], NULL, 10))
		if ((max_len = strtol(argv[2], NULL, 10)) > LONGEST_PREIMAGE)
		{
			fprintf(stderr,
					"Preimages limited to %d characters\n",
					LONGEST_PREIMAGE);
			return EXIT_FAILURE;
		}

	if (crack(md5, max_len, 4, preimage))
	{
		puts(preimage);
		return EXIT_SUCCESS;
	}
	else
	{
		fprintf(stderr,
				"Could not find result in strings up to length %d\n",
		        max_len);
		return EXIT_FAILURE;
	}
}

Using cancellation is actually a little more flexible than our rwlock implementation in 5dm. If the crack() function is running in its own thread, the whole thing can now be cancelled. The cancellation handler will “pass along” the cancellation to each of the worker threads.

Writing general purpose library code that works with threads requires some care. It should handle deferred cancellation gracefully, including disabling cancellation when appropriate and always using cleanup handlers.
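
For example, a library routine might shield its critical region from deferred cancellation like this (a minimal sketch):

/* sketch: protect library internals from deferred cancellation */
void library_routine(void)
{
	int oldstate;

	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
	/* ... lock, update shared state, unlock ... */
	pthread_setcancelstate(oldstate, NULL);
}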

For cleanup handlers, notice the pattern of how we pthread_cleanup_push() the cancellation handler, and later pthread_cleanup_pop() it for regular (non-cancel) cleanup too. Using the same cleanup procedure in all situations makes the code more reliable.

Also notice how the boss thread now cancels workers, rather than the winning worker cancelling the others. You can join a canceled thread, but you can’t cancel an already joined (or detached) thread. If you want to both cancel and join a thread it ought to be done in one place.

Let’s turn our attention to the new worker threads. They are still polling for cancellation, like they polled with the reader-writer locks, but in this case they do it with a new function:

if (iterations++ % 1000 == 0)
	pthread_testcancel();

Admittedly it adds a little overhead to poll every thousandth loop, both with the rwlock, and with the testcancel. It also adds latency to the time between the cancellation request and the thread quitting, since the loop could run up to 999 times in between. A more efficient but dangerous method is to enable asynchronous cancellation, meaning the thread immediately dies when cancelled.

Async cancellation is dangerous because code is seldom async-cancel-safe. Anything that uses locks or works with shared state even slightly can break badly. Async-cancel-safe code can call very few functions, since those functions may not be safe. This includes calling libraries that use something as innocent as malloc(), since stopping malloc part way through could corrupt the heap.

Our crack_thread() function should be async-cancel-safe, at least during its calculation and not when taking locks. The MD5() function from OpenSSL also appears to be safe. Here’s how we can rewrite our function (notice how we disable cancellation before taking a lock):

/* rewritten to use async cancellation */

void *crack_thread(void *arg)
{
	struct task *t = arg;
	unsigned len, changed;
	unsigned char hashed[MD5_DIGEST_LENGTH];
	char preimage[LONGEST_PREIMAGE];
	int cancel_type, cancel_state;

	strcpy(preimage, t->initial_preimage);
	len = strlen(preimage);

	/* async so we don't have to pthread_testcancel() */
	pthread_setcanceltype(
			PTHREAD_CANCEL_ASYNCHRONOUS, &cancel_type);

	while (len <= t->goal->max_len)
	{
		MD5((const unsigned char*)preimage, len, hashed);
		if (memcmp(hashed, t->goal->hash, MD5_DIGEST_LENGTH) == 0)
		{
			/* protect the mutex against async cancellation */
			pthread_setcancelstate(
					PTHREAD_CANCEL_DISABLE, &cancel_state);
			pthread_mutex_lock(&t->goal->lock);

			t->goal->success = true;
			strcpy(t->goal->preimage, preimage);
			t->goal->n_done++;

			pthread_cond_signal(&t->goal->returning);
			pthread_mutex_unlock(&t->goal->lock);
			return NULL;
		}
		changed = word_advance(preimage, t->goal->n_workers);
		len = MAX(len, changed);
	}

	/* restore original cancellation type */
	pthread_setcanceltype(cancel_type, &cancel_type);

	pthread_mutex_lock(&t->goal->lock);
	t->goal->n_done++;
	pthread_cond_signal(&t->goal->returning);
	pthread_mutex_unlock(&t->goal->lock);
	return NULL;
}

Asynchronous cancellation does not appear to work on macOS, but as we’ve seen that’s par for the course on that operating system.

Development tools

Valgrind DRD and Helgrind

DRD and Helgrind are Valgrind tools for detecting errors in multithreaded C and C++ programs. The tools work for any program that uses the POSIX threading primitives or that uses threading concepts built on top of the POSIX threading primitives.

The tools have overlapping abilities like detecting data races and improper use of the pthreads API. Additionally, Helgrind can detect locking hierarchy violations, and DRD can alert when there is lock contention.

Both tools pinpoint the lines of code where problems arise. For example, we can run DRD on our first crazy bankers program:

valgrind --tool=drd ./banker

Here is a characteristic example of an error it emits:

==8524== Thread 3:
==8524== Conflicting load by thread 3 at 0x003090b0 size 8
==8524==    at 0x1088BD: disburse (banker.c:48)
==8524==    by 0x4C324F3: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==8524==    by 0x4E514A3: start_thread (pthread_create.c:456)
==8524== Allocation context: BSS section of /home/admin/banker
==8524== Other segment start (thread 2)
==8524==    at 0x514FD01: clone (clone.S:80)
==8524== Other segment end (thread 2)
==8524==    at 0x509D820: rand (rand.c:26)
==8524==    by 0x108857: rand_range (banker.c:26)
==8524==    by 0x1088A0: disburse (banker.c:42)
==8524==    by 0x4C324F3: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==8524==    by 0x4E514A3: start_thread (pthread_create.c:456)

It finds conflicting loads and stores from lines 48, 51, and 52.

48: if (accts[from].balance > 0)
49: {
50:		payment = 1 + rand_range(accts[from].balance);
51:		accts[from].balance -= payment;
52:		accts[to].balance   += payment;
53: }

Helgrind can identify the lock hierarchy violation in our example of deadlocking bankers:

valgrind --tool=helgrind ./banker_lock
==8989== Thread #4: lock order "0x3091F8 before 0x3090D8" violated
==8989==
==8989== Observed (incorrect) order is: acquisition of lock at 0x3090D8
==8989==    at 0x4C3010C: mutex_lock_WRK (hg_intercepts.c:904)
==8989==    by 0x1089B9: disburse (banker_lock.c:38)
==8989==    by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==8989==    by 0x4E454A3: start_thread (pthread_create.c:456)
==8989==
==8989==  followed by a later acquisition of lock at 0x3091F8
==8989==    at 0x4C3010C: mutex_lock_WRK (hg_intercepts.c:904)
==8989==    by 0x1089D1: disburse (banker_lock.c:39)
==8989==    by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==8989==    by 0x4E454A3: start_thread (pthread_create.c:456)

To identify when there is too much contention for a lock, we can ask DRD to alert us when a thread blocks for more than n milliseconds on a mutex:

valgrind --tool=drd --exclusive-threshold=2 ./banker_lock_hierarchy

Since we throw too many threads at a small number of accounts, we see wait times that cross the threshold, like this one that waited seven ms:

==7565== Acquired at:
==7565==    at 0x483F428: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:888)
==7565==    by 0x483F428: pthread_mutex_lock (drd_pthread_intercepts.c:898)
==7565==    by 0x109280: disburse (banker_lock_hierarchy.c:40)
==7565==    by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565==    by 0x4863FA2: start_thread (pthread_create.c:486)
==7565==    by 0x49764CE: clone (clone.S:95)
==7565== Lock on mutex 0x10c258 was held during 7 ms (threshold: 2 ms).
==7565==    at 0x4840478: pthread_mutex_unlock_intercept (drd_pthread_intercepts.c:978)
==7565==    by 0x4840478: pthread_mutex_unlock (drd_pthread_intercepts.c:991)
==7565==    by 0x109395: disburse (banker_lock_hierarchy.c:47)
==7565==    by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565==    by 0x4863FA2: start_thread (pthread_create.c:486)
==7565==    by 0x49764CE: clone (clone.S:95)
==7565== mutex 0x10c258 was first observed at:
==7565==    at 0x483F368: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:885)
==7565==    by 0x483F368: pthread_mutex_lock (drd_pthread_intercepts.c:898)
==7565==    by 0x109280: disburse (banker_lock_hierarchy.c:40)
==7565==    by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565==    by 0x4863FA2: start_thread (pthread_create.c:486)
==7565==    by 0x49764CE: clone (clone.S:95)

Clang ThreadSanitizer (TSan)

ThreadSanitizer is a clang instrumentation module. To use it, choose CC = clang and add -fsanitize=thread to CFLAGS (and to LDFLAGS, since the flag is needed at link time as well). Then when you build programs, they will be modified to detect data races and print diagnostics to stderr.

Here’s a portion of the output when running the bankers program:

WARNING: ThreadSanitizer: data race (pid=11312)
  Read of size 8 at 0x0000014aeeb0 by thread T2:
    #0 disburse /home/admin/banker.c:48 (banker+0x0000004a4372)

  Previous write of size 8 at 0x0000014aeeb0 by thread T1:
    #0 disburse /home/admin/banker.c:52 (banker+0x0000004a43ba)

TSan can also detect lock hierarchy violations, such as in banker_lock:

WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=10095)
  Cycle in lock order graph: M1 (0x0000014aef78) => M2 (0x0000014aeeb8) => M1

  Mutex M2 acquired here while holding mutex M1 in thread T1:
    #0 pthread_mutex_lock <null> (banker_lock+0x000000439a10)
    #1 disburse /home/admin/banker_lock.c:39 (banker_lock+0x0000004a4398)

    Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning message

  Mutex M1 acquired here while holding mutex M2 in thread T9:
    #0 pthread_mutex_lock <null> (banker_lock+0x000000439a10)
    #1 disburse /home/admin/banker_lock.c:39 (banker_lock+0x0000004a4398)

Mutrace

While Valgrind DRD can identify highly contended locks, it virtualizes the execution of the program under test, and skews the numbers. Other utilities can use software probes to get this information from a test running at full speed. In BSD land there is the plockstat provider for DTrace, and on Linux there is the specially-written mutrace. I had a lot of trouble trying to get plockstat to work on FreeBSD, so here’s an example of using mutrace to analyze our banker program.

mutrace ./banker_lock_hierarchy
mutrace: Showing 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       0   200211   153664    95985      991.349        0.005        0.267 M-.--.
       1   200552   142173    61902      641.963        0.003        0.170 M-.--.
       2   199657   140837    47723      476.737        0.002        0.125 M-.--.
       3   199566   140863    39268      371.451        0.002        0.108 M-.--.
       4   199936   141381    33243      295.909        0.001        0.090 M-.--.
       5   199548   141297    28193      232.647        0.001        0.084 M-.--.
       6   200329   142027    24230      183.301        0.001        0.066 M-.--.
       7   199951   142338    21018      142.494        0.001        0.057 M-.--.
       8   200145   142990    18201      107.692        0.001        0.052 M-.--.
       9   200105   143794    15713       76.231        0.000        0.028 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 1896.903 ms.

mutrace: Results for SMP with 4 processors.

Off-CPU profiling

Typical profilers measure the amount of CPU time spent in each function. However when a thread is blocked by I/O, a lock, or a condition variable, then it isn’t using CPU time. To determine where functions spend the most “wall clock time,” we need to sample the call stack for all threads at intervals, and count how frequently we see each entry. When a thread is off-CPU its call stack stays unchanged.

The pstack program is traditionally the way to get a snapshot of a running program’s stack. It exists on old Unices, and used to be on Linux until Linux made a breaking change. The most portable way to get stack snapshots is using gdb with an awk wrapper, as documented in the Poor Man’s Profiler.

Remember our early condition variable example that measured how many threads entered the critical section in disburse() at once? We asked whether synchronization on stats_mtx threw off the measurement. With off-CPU profiling we can look for clues.

Here’s a script based on the Poor Man’s Profiler:

./banker_stats &
pid=$!

while kill -0 $pid
  do
    gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid
  done | \
awk '
  BEGIN { s = ""; }
  /^Thread/ { print s; s = ""; }
  /^\#/ { if (s != "" ) { s = s "," $4} else { s = $4 } }
  END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1

It outputs limited information, but we can see that waiting for locks in disburse() takes the majority of program time, being present in 872 of our samples. By contrast, waiting for the stats_mtx lock in stats_update() doesn’t appear in our samples at all. It must have had very little effect on our parallelism.

    872 at,__GI___pthread_mutex_lock,disburse,start_thread,clone
     11 at,__random,rand,rand_range,disburse,start_thread,clone
      9 expected=0,,mutex=0x562533c3f0c0,<stats_cnd>,,stats_print,start_thread,clone
      9 __GI___pthread_timedjoin_ex,main
      5 at,__pthread_mutex_unlock_usercnt,disburse,start_thread,clone
      1 at,__pthread_mutex_unlock_usercnt,stats_change,disburse,start_thread,clone
      1 at,__GI___pthread_mutex_lock,stats_change,disburse,start_thread,clone
      1 __random,rand,rand_range,disburse,start_thread,clone

macOS Instruments

Although Mac’s POSIX thread support is pretty weak, its XCode tooling does include a nice profiler. From the Instruments application, choose the profiling template called “System Trace.” It adds a GUI on top of DTrace to display thread states (among other things). I modified our banker program to use only five threads and recorded its run. The Instruments app visualizes every event that happens, including threads blocking and being interrupted:

(Screenshot: thread states in the Instruments timeline)

Within the program you can zoom into the history and hover over events for info.

perf c2c

Perf is a Linux tool to measure hardware performance counters during the execution of a program. Joe Mario created a Perf feature called c2c which detects false sharing of variables between CPUs.

In a multi-core computer, each CPU has its own set of caches, and all CPUs share main memory. Memory is divided into fixed-size blocks (often 64 bytes) called cache lines. Any time a CPU reads or writes memory, it must fetch or store the entire cache line surrounding the desired address. If one CPU has already cached a line, and another CPU writes to that area of memory, the system has to perform an expensive operation to keep the caches coherent.

When two unrelated variables in a program are stored close enough together in memory to be in the same cache line, it can cause a performance problem in multi-threaded programs. If threads running on separate CPUs access the unrelated variables, they cause a tug of war over the underlying cache line, which is called false sharing.

For instance, our Game of Life simulator could potentially have false sharing at the edges of each section of board accessed by each thread. To verify this, I attempted to run perf c2c on an Amazon EC2 instance (since I lack a physical computer running Linux), but got an error that memory events are not supported on the virtual machine. I was running kernel 4.19.0 on Intel Xeon Platinum 8124M CPUs, so I assume this was a security restriction from Amazon.

If you are able to run c2c, and detect false sharing in a multi-threaded program, the solution is to align the variables more aggressively. POSIX provides the posix_memalign() function to allocate bytes aligned on a desired boundary. In our Life example, we could have used an array of pointers to dynamically allocated rows rather than a contiguous two-dimensional array.
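
For example, here is a sketch of allocating each row on its own cache line boundary, assuming a 64-byte line size:

/* sketch: cache-line-aligned row allocation */
#include <stdlib.h>

#define CACHE_LINE 64 /* an assumption; query the real value if possible */

char *alloc_row(size_t cols)
{
	void *row = NULL;
	/* round the size up so neighboring rows never share a line */
	size_t size = (cols + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
	if (posix_memalign(&row, CACHE_LINE, size) != 0)
		return NULL;
	return row;
}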

Intel VTune Profiler

The VTune Profiler is available for free (with registration) on Linux, macOS, and Windows. It works on x86 hardware only of course. I haven’t used it, but their marketing page shows some nice pictures. The tool can visually identify the granularity of locks, present a prioritized list of synchronization objects that hurt performance, and visualize lock contention.

Further reading

]]>
History and effective use of Vim https://begriffs.com/posts/2019-07-19-history-use-vim.html 2019-07-19T00:00:00Z 2019-07-19T00:00:00Z This article is based on historical research and on simply reading the Vim user manual cover to cover. Hopefully these notes will help you (re?)discover core functionality of the editor, so you can abandon pre-packaged vimrc files and use plugins more thoughtfully.

physical books

To go beyond the topics in this blog post, I’d recommend getting a paper copy of the manual and a good pocket reference. I couldn’t find any hard copy of the official Vim manual, and ended up printing this PDF using printme1.com. The PDF is a printer-friendly version of the files $VIMRUNTIME/doc/usr_??.txt distributed with the editor. For a convenient list of commands, I’d recommend the vi and Vim Editors Pocket Reference.

History

Birth of vi

Vi commands and features go back more than fifty years, starting with the QED editor. Here is the lineage:

  • 1966 : QED (“Quick EDitor”) in Berkeley Timesharing System
  • 1969 Jul: moon landing (just for reference)
  • 1969 Aug: QED -> ed at AT&T
  • 1976 Feb: ed -> em (“Editor for Mortals”) at Queen Mary College
  • 1976 : em -> ex (“EXtended”) at UC Berkeley
  • 1977 Oct: ex gets visual mode, vi

(Photo: a hard-copy terminal)

You can discover the similarities all the way between QED and ex by reading the QED manual and ex manual. Both editors use a similar grammar to specify and operate on line ranges.

Editors like QED, ed, and em were designed for hard-copy terminals, which are basically electric typewriters with a modem attached. Hard-copy terminals print system output on paper. Output could not be changed once printed, obviously, so the editing process consisted of user commands to update and manually print ranges of text.

(Photo: a video terminal)

By 1976 video terminals such as the ADM-3A started to be available. The Ex editor added an “open mode” which allowed intraline editing on video terminals, and a visual mode for screen oriented editing on cursor-addressable terminals. The visual mode (activated with the command “vi”) kept an up-to-date view of part of the file on screen, while preserving an ex command line at the bottom of the screen. (Fun fact: the h,j,k,l keys on the ADM-3A had arrows drawn on them, so that choice of motion keys in vi was simply to match the keyboard.)

Learn more about the journey from ed to ex/vi in this interview with Bill Joy. He talks about how he made ex/vi, and some things that disappointed him about it.

Classic vi is truly just an alter-ego of ex – they are the same binary, which decides to start in ex mode or vi mode based on the name of the executable invoked. The legacy of all this history is that ex/vi is refined by use, requires scant system resources, and can operate under limited bandwidth communication. It is also available on most systems and fully specified in POSIX.

From vi to vim

Being a derivative of ed, the ex/vi editor was intellectual property of AT&T. To use vi on platforms other than Unix, people had to write clones that did not share in the original codebase.

Some of the clones:

  • nvi - 1980 for 4BSD
  • calvin - 1987 for DOS
  • vile - 1990 for DOS
  • stevie - 1987 for Atari ST
  • elvis - 1990 for Minix and 386BSD
  • vim - 1991 for Amiga
  • viper - 1995 for Emacs
  • elwin - 1995 for Windows
  • lemmy - 2002 for Windows

We’ll be focusing on that little one in the middle: vim. Bram Moolenaar wanted to use vi on the Amiga. He began porting Stevie from the Atari and evolving it. He called his port “Vi IMitation.” For a full first-hand account, see Bram’s interview with Free Software Magazine.

By version 1.22 Vim was rechristened “Vi IMproved,” matching and surpassing features of the original. Here is the timeline of the next major versions, with some of their big features:

1991 Nov 2 Vim 1.14: First release (on Fred Fish disk #591).
1992 Vim 1.22: Port to Unix. Vim now competes with Vi.
1994 Aug 12 Vim 3.0: Support for multiple buffers and windows.
1996 May 29 Vim 4.0: Graphical User Interface (largely by Robert Webb).
1998 Feb 19 Vim 5.0: Syntax coloring/highlighting.
2001 Sep 26 Vim 6.0: Folding, plugins, vertical split.
2006 May 8 Vim 7.0: Spell check, omni completion, undo branches, tabs.
2016 Sep 12 Vim 8.0: Jobs, async I/O, native packages.

For more info about each version, see e.g. :help vim8. To see plans for the future, including known bugs, see :help todo.txt.

Version 8 included some async job support due to peer pressure from NeoVim, whose developers wanted to run debuggers and REPLs for their web scripting languages inside the editor.

Vim is super portable. By adapting over time to work on a wide variety of platforms, the editor was forced to keep portable coding habits. It runs on OS/390, Amiga, BeOS and BeBox, Macintosh classic, Atari MiNT, MS-DOS, OS/2, QNX, RISC-OS, BSD, Linux, OS X, VMS, and MS-Windows. You can rely on Vim being there no matter what computer you’re using.

In a final twist in the vi saga, the original ex/vi source code was finally released in 2002 under a BSD free software license. It is available at ex-vi.sourceforge.net.

Let’s get down to business. Before getting to odds, ends, and intermediate tricks, it helps to understand how Vim organizes and reads its configuration files.

Configuration hierarchy

I used to think, incorrectly, that Vim reads all its settings and scripts from the ~/.vimrc file alone. Browsing random “dotfiles” repositories can reinforce this notion. Quite often people publish monstrous single .vimrc files that try to control every aspect of the editor. These big configs are sometimes called “vim distros.”

In reality Vim has a tidy structure, where .vimrc is just one of several inputs. In fact you can ask Vim exactly which scripts it has loaded. Try this: edit a source file from a random programming project on your computer. Once loaded, run

:scriptnames

Take time to read the list. Try to guess what the scripts might do, and note the directories where they live.

Was the list longer than you expected? If you have installed loads of plugins the editor has a lot to do. Check what slows down the editor most at startup by running the following and looking at the start.log it creates:

vim --startuptime start.log name-of-your-file

Just for comparison, see how quickly Vim starts without your existing configuration:

vim --clean --startuptime clean.log name-of-your-file

To determine which scripts to run at startup or buffer load time, Vim traverses a “runtime path.” The path is a comma-separated list of directories that each contain a common structure. Vim inspects the structure of each directory to find scripts to run. Directories are processed in the order they appear in the list.

Check the runtimepath on your system by running:

:set runtimepath

My system contains the following directories in the default value for runtimepath. Not all of them even exist in the filesystem, but they would be consulted if they did.

~/.vim
The home directory, for personal preferences.
/usr/local/share/vim/vimfiles
A system-wide Vim directory, for preferences from the system administrator.
/usr/local/share/vim/vim81
Aka $VIMRUNTIME, for files distributed with Vim.
/usr/local/share/vim/vimfiles/after
The “after” directory in the system-wide Vim directory. This is for the system administrator to overrule or add to the distributed defaults.
~/.vim/after
The “after” directory in the home directory. This is for personal preferences to overrule or add to the distributed defaults or system-wide settings.

Because directories are processed in the order they appear in the list, the only thing that is special about the “after” directories is that they come last. There is nothing magical about the word “after.”

When processing each directory, Vim looks for subfolders with specific names. To learn more about them, see :help runtimepath. Here is a selection of those we will be covering, with brief descriptions.

plugin/
Vim script files that are loaded automatically when editing any kind of file. Called “global plugins.”
autoload/
(Not to be confused with “plugin.”) Scripts in autoload contain functions that are loaded only when requested by other scripts.
ftdetect/
Scripts to detect filetypes. They can base their decision on filename extension, location, or internal file contents.
ftplugin/
Scripts that are executed when editing files with known type.
compiler/
Definitions of how to run various compilers or linters, and of how to parse their output. Can be shared between multiple ftplugins. Also not applied automatically, must be called with :compiler
pack/
Container for Vim 8 native packages, the successor to “Pathogen” style package management. The native packaging system does not require any third-party code.

Finally, ~/.vimrc is the catchall for general editor settings. Use it for setting defaults that can be overridden for particular file types. For a comprehensive overview of settings you can choose in .vimrc, run :options.

Third-party plugins

Plugins are simply Vim scripts that must be put into the correct places in the runtimepath in order to execute. Installing them is conceptually easy: download the file(s) into place. The challenge is that it’s hard to remove or update some plugins because they litter subdirectories in the runtimepath with their scripts, and it can be hard to tell which plugin is responsible for which files.

“Plugin managers” evolved to address this need. Vim.org has had a plugin registry going back at least as far as 2003 (as identified by the Internet Archive). However it wasn’t until about 2008 that the notion of a plugin manager really came into vogue.

These tools add plugins’ separate directories to Vim’s runtimepath, and compile help tags for plugin documentation. Most plugin managers also install and update plugin code from the internet, sometimes in parallel or with colorful progress bars.

In chronological order, here is the parade of plugin managers. I based the date ranges on earliest and latest releases of each, or when no official releases are identified, on the earliest and latest commit dates.

  • Mar 2006 - Jul 2014 : Vimball (A distribution format and associated Vim commands)
  • Oct 2008 - Dec 2015 : Pathogen (Deprecated in favor of native vim packages)
  • Aug 2009 - Dec 2009 : Vimana
  • Dec 2009 - Dec 2014 : VAM
  • Aug 2010 - Nov 2010 : Jolt
  • Oct 2010 - Nov 2012 : tplugin
  • Oct 2010 - Feb 2014 : Vundle (Discontinued after NeoBundle ripped off code)
  • Mar 2012 - Mar 2018 : vim-flavor
  • Apr 2012 - Mar 2016 : NeoBundle (Deprecated in favor of dein)
  • Jan 2013 - Aug 2017 : infect
  • Feb 2013 - Aug 2016 : vimogen
  • Oct 2013 - Jan 2015 : vim-unbundle
  • Dec 2013 - Jul 2015 : Vizardry
  • Feb 2014 - Oct 2018 : vim-plug
  • Jan 2015 - Oct 2015 : enabler
  • Aug 2015 - Apr 2016 : Vizardry 2
  • Jan 2016 - Jun 2018 : dein.vim
  • Sep 2016 - Present : native in Vim 8
  • Feb 2017 - Sep 2018 : minpac
  • Mar 2018 - Mar 2018 : autopac
  • Feb 2017 - Jun 2018 : pack
  • Mar 2017 - Sep 2017 : vim-pck
  • Sep 2017 - Sep 2017 : vim8-pack
  • Sep 2017 - May 2019 : volt
  • Sep 2018 - Feb 2019 : vim-packager
  • Feb 2019 - Feb 2019 : plugpac.vim

The first thing to note is the overwhelming variety of these tools, and the second is that each is typically active for about four years before presumably going out of fashion.

The most stable way to manage plugins is to simply use Vim 8’s built-in functionality, which requires no third-party code. Let’s walk through how to do it.

First create two directories, opt and start, within a pack directory in your runtimepath.

mkdir -p ~/.vim/pack/foobar/{opt,start}

Note the placeholder “foobar.” This name is entirely up to you. It classifies the packages that will go inside. Most people throw all their plugins into a single nondescript category, which is fine. Pick whatever name you like; I’ll continue to use foobar here. You could theoretically create multiple categories too, like ~/.vim/pack/navigation and ~/.vim/pack/linting. Note that Vim does not detect duplication between categories and will double-load duplicates if they exist.

Packages in “start” get loaded automatically, whereas those in “opt” won’t load until specifically requested in Vim with the :packadd command. Opt is good for lesser-used packages, and keeps Vim fast by not running scripts unnecessarily. Note that there isn’t a counterpart to :packadd to unload a package.

For this example we’ll add the “ctrlp” fuzzy find plugin to opt. Download and extract the latest release into place:

curl -L https://github.com/kien/ctrlp.vim/archive/1.79.tar.gz \
	| tar zx -C ~/.vim/pack/foobar/opt

That command creates a ~/.vim/pack/foobar/opt/ctrlp.vim-1.79 folder, and the package is ready to use. Back in vim, create a helptags index for the new package:

:helptags ~/.vim/pack/foobar/opt/ctrlp.vim-1.79/doc

That creates a file called “tags” in the package’s doc folder, which makes the topics available for browsing in Vim’s internal help system. (Alternately you can run :helptags ALL once the package has been loaded, which takes care of all docs in the runtimepath.)

When you want to use the package, load it (and know that tab completion works for plugin names, so you don’t have to type the whole name):

:packadd ctrlp.vim-1.79

Packadd includes the package’s base directory in the runtimepath, and sources its plugin and ftdetect scripts. After loading ctrlp, you can press CTRL-P to pop up a fuzzy find file matcher.

Some people keep their ~/.vim directory under version control and use git submodules for each package. For my part, I simply extract packages from tarballs and track them in my own repository. If you use mature packages you don’t need to upgrade them often, plus the scripts are generally small and don’t clutter git history much.

Backups and undo

Depending on user settings, Vim can protect against four types of loss:

  1. A crash during editing (between saves). Vim can protect against this one by periodically saving unwritten changes to a swap file.
  2. Editing the same file with two instances of Vim, overwriting changes from one or both instances. Swap files protect against this too.
  3. A crash during the save process itself, after the destination file is truncated but before the new contents have been fully written. Vim can protect against this with a “writebackup.” To do this, it writes to a new file and swaps it with the original on success, in a way that depends on the “backupcopy” setting.
  4. Saving new file contents but wanting the original back. Vim can protect against this by persisting the backup copy of the file after writing changes.

Before examining sensible settings, how about some comic relief? Here are just a sampling of comments from vimrc files on GitHub:

  • “Do not create swap file. Manage this in version control”
  • “Backups are for pussies. Use version control”
  • “use version control FFS!”
  • “We live in a world with version control, so get rid of swaps and backups”
  • “don’t write backup files, version control is enough backup”
  • “I’ve never actually used the VIM backup files… Use version control”
  • “Since most stuff is on version control anyway”
  • “Disable backup files, you are using a version control system anyway :)”
  • “version control has arrived, git will save us”
  • “disable swap and backup files (Always use version control! ALWAYS!)”
  • “Turn backup off, since I version control everything”

The comments reflect awareness of only the fourth case above (and the third by accident), whereas the authors generally go on to disable the swap file too, leaving one and two unprotected.

Here is the configuration I recommend to keep your edits safe:

" Protect changes between writes. Default values of
" updatecount (200 keystrokes) and updatetime
" (4 seconds) are fine
set swapfile
set directory^=~/.vim/swap//

" protect against crash-during-write
set writebackup
" but do not persist backup after successful write
set nobackup
" use rename-and-write-new method whenever safe
set backupcopy=auto
" patch required to honor double slash at end
if has("patch-8.1.0251")
	" consolidate the writebackups -- not a big
	" deal either way, since they usually get deleted
	set backupdir^=~/.vim/backup//
end

" persist the undo tree for each file
set undofile
set undodir^=~/.vim/undo//

These settings enable backups for writes-in-progress, but do not persist them after successful write because version control etc etc. Note that you’ll need to mkdir ~/.vim/{swap,backup,undo} or else Vim will fall back to the next available folder in the preference list. You should also probably chmod the folders to keep the contents private, because the swap files and undo history might contain sensitive information.

One thing to note about the paths in our config is that they end in a double slash. That ending enables a feature to disambiguate swaps and backups for files with the same name that live in different directories. For instance the swap file for /foo/bar will be saved in ~/.vim/swap/%foo%bar.swp (slashes escaped as percent signs). Vim had a bug until a fairly recent patch where the double slash was not honored for backupdir, and we guard against that above.

We also have Vim persist the history of undos for each file, so that you can apply them even after quitting and editing the file again. While it may sound redundant with the swap file, the undo history is complementary because it is written only when the file is written. (If it were written more frequently it might not match the state of the file on disk after a crash, so Vim doesn’t do that.)

Speaking of undo, Vim maintains a full tree of edit history. This means you can make a change, undo it, then redo it differently and all three states are recoverable. You can see the times and magnitude of changes with the :undolist command, but it’s hard to visualize the tree structure from it. You can navigate to specific changes in that list, or move in time with :earlier and :later which take a time argument like 5m, or the count of file saves, like 3f. However navigating the undo tree is an instance when I think a plugin – like undotree – is warranted.

Enabling these disaster recovery settings can bring you peace of mind. I used to save compulsively after most edits or when stepping away from the computer, but now I’ve made an effort to leave documents unsaved for hours at a time. I know how the swap file works now.

Some final notes: keep an eye on all these disaster recovery files, they can pile up in your .vim folder and use space over time. Also setting nowritebackup might be necessary when saving a huge file with low disk space, because Vim must otherwise make an entire copy of the file temporarily. By default the “backupskip” setting disables backups for anything in the system temp directory.

Vim’s “patchmode” is related to backups. You can use it in directories that aren’t under version control. For instance if you want to download a source tarball, make an edit and send a patch over a mailing list without bringing git into the picture. Run :set patchmode=.orig and any file ‘foo’ Vim is about to write will be backed up to ‘foo.orig’. You can then create a patch on the command line between the .orig files and the new ones.

Include and path

Most programming languages allow you to include one module or file from another. Vim knows how to track program identifiers in included files using the configuration settings path, include, suffixesadd, and includeexpr. The identifier search (see :help include-search) is an alternative to maintaining a tags file with ctags for system headers.

The settings for C programs work out of the box. Other languages are supported too, but require tweaking. That’s outside the scope of this article, see :help include.

If everything is configured right, you can press [i on an identifier to display its definition, or [d for a macro constant. Also when you press gf with the cursor on a filename, Vim searches the path to find it and jump there. Because the path also affects the :find command, some people have the tendency to add ‘**/*’ or commonly accessed directories to the path in order to use :find like a poor man’s fuzzy finder. Doing this slows down the identifier search with directories which aren’t relevant to that task.

A way to get the same level of crappy find capability, without polluting the path, is to just make another mapping. You can then press <Leader><space> (which is typically backslash space) then start typing a filename and use tab or CTRL-D completion to find the file.

" fuzzy-find lite
nmap <Leader><space> :e ./**/

Just to reiterate: the path parameter was designed for header files. If you want more proof, there is even a :checkpath command to see whether the path is functioning. Load a C file and run :checkpath. It will display filenames it was unable to find that are included transitively by the current file. Also :checkpath! with a bang dumps the whole hierarchy of files included from the current file.

By default path has the value “.,/usr/include,,” meaning the working directory, /usr/include, and files that are siblings of the active buffer. The directory specifiers and globs are pretty powerful, see :help file-searching for the details.

In my C ftplugin (more on that later), I also have the path search for include files within the current project, like ./src/include or ./include .

setlocal path=.,,*/include/**3,./*/include/**3
setlocal path+=/usr/include

The ** with a number like **3 bounds the depth of the search in subdirectories. It’s wise to add depth bounds where you can to avoid identifier searches that lock up.

Here are other patterns you might consider adding to your path if :checkpath identifies that files can’t be found in your project. It depends on your system of course.

  • More system includes: /usr/include/**4,/usr/local/include/**3
  • Homebrew library headers: /usr/local/Cellar/**2/include/**2
  • Macports library headers: /opt/local/include/**
  • OpenBSD library headers: /usr/local/lib/\*/include,/usr/X11R6/include/\*\*3

See also: :he [, :he gf, :he :find.

Edit ⇄ compile cycle

The :make command runs a program of the user’s choice to build a project, and collects the output in the quickfix buffer. Each item in the quickfix records the filename, line, column, type (warning/error) and message of each output item. A fairly idiomatic mapping uses bracket commands to move through quickfix items:

" quickfix shortcuts
nmap ]q :cnext<cr>
nmap ]Q :clast<cr>
nmap [q :cprev<cr>
nmap [Q :cfirst<cr>

If, after updating the program and rebuilding, you are curious what the error messages said last time, use :colder (and :cnewer to return). To see more information about the currently selected error use :cc, and use :copen to see the full quickfix buffer. You can populate the quickfix yourself without running :make with :cfile, :caddfile, or :cexpr.

Vim parses output from the build process according to the errorformat string, which contains scanf-like escape sequences. It’s typical to set this in a “compiler file.” For instance, Vim ships with one for gcc in $VIMRUNTIME/compiler/gcc.vim, but has no compiler file for clang. I created the following definition for ~/.vim/compiler/clang.vim:

" formatting variations documented at
" https://clang.llvm.org/docs/UsersManual.html#formatting-of-diagnostics
"
" It should be possible to make this work for the combination of
" -fno-show-column and -fcaret-diagnostics as well with multiline
" and %p, but I was too lazy to figure it out.
"
" The %D and %X patterns are not clang per se. They capture the
" directory change messages from (GNU) 'make -w'. I needed this
" for building a project which used recursive Makefiles.

CompilerSet errorformat=
	\%f:%l%c:{%*[^}]}{%*[^}]}:\ %trror:\ %m,
	\%f:%l%c:{%*[^}]}{%*[^}]}:\ %tarning:\ %m,
	\%f:%l:%c:\ %trror:\ %m,
	\%f:%l:%c:\ %tarning:\ %m,
	\%f(%l,%c)\ :\ %trror:\ %m,
	\%f(%l,%c)\ :\ %tarning:\ %m,
	\%f\ +%l%c:\ %trror:\ %m,
	\%f\ +%l%c:\ %tarning:\ %m,
	\%f:%l:\ %trror:\ %m,
	\%f:%l:\ %tarning:\ %m,
	\%D%*\\a[%*\\d]:\ Entering\ directory\ %*[`']%f',
	\%D%*\\a:\ Entering\ directory\ %*[`']%f',
	\%X%*\\a[%*\\d]:\ Leaving\ directory\ %*[`']%f',
	\%X%*\\a:\ Leaving\ directory\ %*[`']%f',
	\%DMaking\ %*\\a\ in\ %f

CompilerSet makeprg=make

To activate this compiler profile, run :compiler clang. This is typically done in an ftplugin file.

Another example is running GNU Diction on a text document to identify wordy and commonly misused phrases in sentences. Create a “compiler” called diction.vim:

CompilerSet errorformat=%f:%l:\ %m
CompilerSet makeprg=diction\ -s\ %

After you run :compiler diction you can use the normal :make command to run it and populate the quickfix. The final mild convenience in my .vimrc is a mapping to run make:

" real make
map <silent> <F5> :make<cr><cr><cr>
" GNUism, for building recursively
map <silent> <s-F5> :make -w<cr><cr><cr>

Diffs and patches

Vim’s internal diffing is powerful, but it can be daunting, especially the three-way merge view. In reality it’s not so bad once you take time to study it. The main idea is that every window is either in or out of “diff mode.” All windows put in diffmode (with :difft[his]) get compared with all other windows already in diff mode.

For example, let’s start simple. Create two files:

echo "hello, world" > h1
echo "goodbye, world" > h2

vim h1 h2

In vim, split the arguments into their own windows with :all. In the top window, for h1, run :difft. You’ll see a gutter appear, but no difference detected. Move to the other window with CTRL-W CTRL-W and run :difft again. Now hello and goodbye are identified as different in the current chunk. Continuing in the bottom window, you can run :diffg[et] to get “hello” from the top window, or :diffp[ut] to send “goodbye” into the top window. Pressing ]c or [c would move between chunks if there were more than one.

A shortcut would be running vim -d h1 h2 instead (or its alias, vimdiff h1 h2) which applies :difft to all windows. Alternatively, load just h1 with vim h1 and then :diffsplit h2. Remember that fundamentally these commands just load files into windows and set the diff mode.

With these basics in mind, let’s learn to use Vim as a three-way mergetool for git. First configure git:

git config merge.tool vimdiff
git config merge.conflictstyle diff3
git config mergetool.prompt false

Now, when you hit a merge conflict, run git mergetool. It will bring Vim up with four windows. This part looks scary, and is where I used to flail around and often quit in frustration.

+-----------+------------+------------+
|           |            |            |
|           |            |            |
|   LOCAL   |    BASE    |   REMOTE   |
+-----------+------------+------------+
|                                     |
|                                     |
|             (edit me)               |
+-------------------------------------+

Here’s the trick: do all the editing in the bottom window. The top three windows simply provide context about how the file differs on either side of the merge (local / remote), and how it looked prior to either side doing any work (base).

Move within the bottom window with ]c, and for each chunk choose whether to replace it with text from local, base, or remote – or whether to write in your own change which might combine parts from several.

To make it easier to pull changes from the top windows, I set some mappings in my vimrc:

" shortcuts for 3-way merge
map <Leader>1 :diffget LOCAL<CR>
map <Leader>2 :diffget BASE<CR>
map <Leader>3 :diffget REMOTE<CR>

We’ve already seen :diffget, and here our bindings pass an argument of the buffer name that identifies which window to pull from.

Once done with the merge, run :wqa to save all the windows and quit. If you want to abandon the merge instead, run :cq to abort all changes and return an error code to the shell. This will signal to git that it should ignore your changes.

Diffget can also accept a range. If you want to pull in all changes from one of the top windows rather than working chunk by chunk, just run :1,$+1diffget {LOCAL,BASE,REMOTE}. The “+1” is required because there can be deleted lines “below” the last line of a buffer.

The three-way merge is fairly easy after all. There’s no need for plugins like Fugitive, at least not for presenting a simplified view for resolving merge conflicts.

Finally, as of patch 8.1.0360, Vim is bundled with the xdiff library and can create diffs internally. This can be more efficient than shelling out to an external program, and allows for a choice of diff algorithms. The “patience” algorithm often produces more human-readable output than the default, “myers.” Set it in your .vimrc like so:

if has("patch-8.1.0360")
	set diffopt+=internal,algorithm:patience
endif

Buffer I/O

See if this sounds familiar: you’re editing a buffer and want to save it as a new file, so you :w newname. After editing some more, you :w, but it writes over the original file. What you want for this scenario is :saveas newname, which does the write but also changes the filename of the buffer for future writes. Alternately, the :file newname command will change the filename without doing a write.
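
A quick side-by-side of the three (newname is a placeholder):

" write a copy; the buffer stays tied to the original file
:w newname
" write a copy and re-point the buffer at newname
:saveas newname
" re-point the buffer at newname without writing
:file newname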

It also pays off to learn more about the read and write commands. Because r and w are Ex commands, they work with ranges. Here are some variations you might not know about:

:w >>foo append the whole buffer to a file
:.w >>foo append current line to a file
:$r foo read foo into the end of the buffer
:0r foo read foo into the start, moving existing lines down
:.,$w foo write current line and below to a file
:r !ls read ls output into cursor position
:w !wc send buffer to wc and display output
:.!tr 'A-Za-z' 'N-ZA-Mn-za-m' apply ROT-13 to current line
:w|so % chain commands: write and then source buffer
:e! throw away unsaved changes, reload buffer
:hide edit foo edit foo, hide current buffer if dirty

Useless fun fact: we piped a line to tr in an example above to apply a ROT-13 cypher, but Vim has that functionality built in with the g? command. Apply it to a motion, like g?$.

Filetypes

Filetypes are a way to change settings based on the type of file detected in a buffer. They don’t need to be automatically detected though, we can manually enable them to interesting effect. An example is doing hex editing. Any file can be viewed as raw hexadecimal values. GitHub user the9ball created a clever ftplugin script that filters a buffer back and forth through the xxd utility for hex editing.

The xxd utility was bundled as part of Vim 5 for convenience. The Vim todo.txt file mentions they want to make it more seamless to edit binary files, but xxd can take us pretty far.

Here is code you can put in ~/.vim/ftplugin/xxd.vim. Its presence in ftplugin means Vim will execute the script when filetype (aka “ft”) becomes xxd. I added some basic comments to the script.

" without the xxd command this is all pointless
if !executable('xxd')
	finish
endif

" don't insert a newline in the final line if it
" doesn't already exist, and don't insert linebreaks
setlocal binary noendofline
silent %!xxd -g 1
%s/\r$//e

" put the autocmds into a group for easy removal later
augroup ftplugin-xxd
	" erase any existing autocmds on buffer
	autocmd! * <buffer>

	" before writing, translate back to binary
	autocmd BufWritePre <buffer> let b:xxd_cursor = getpos('.')
	autocmd BufWritePre <buffer> silent %!xxd -r

	" after writing, restore hex view and mark unmodified
	autocmd BufWritePost <buffer> silent %!xxd -g 1
	autocmd BufWritePost <buffer> %s/\r$//e
	autocmd BufWritePost <buffer> setlocal nomodified
	autocmd BufWritePost <buffer> call setpos('.', b:xxd_cursor) | unlet b:xxd_cursor

	" update text column after changing hex values
	autocmd TextChanged,InsertLeave <buffer> let b:xxd_cursor = getpos('.')
	autocmd TextChanged,InsertLeave <buffer> silent %!xxd -r
	autocmd TextChanged,InsertLeave <buffer> silent %!xxd -g 1
	autocmd TextChanged,InsertLeave <buffer> call setpos('.', b:xxd_cursor) | unlet b:xxd_cursor
augroup END

" when filetype is set to no longer be "xxd," put the binary
" and endofline settings back to what they were before, remove
" the autocmds, and replace buffer with its binary value
let b:undo_ftplugin = 'setl bin< eol< | execute "au! ftplugin-xxd * <buffer>" | execute "silent %!xxd -r"'

Try opening a file, then running :set ft. Note what type it is. Then :set ft=xxd. Vim will turn into a hex editor. To restore your view, :set ft=foo where foo was the original type. Note that in hex view you even get syntax highlighting, because $VIMRUNTIME/syntax/xxd.vim ships with Vim by default.

Notice the nice use of “b:undo_ftplugin” which is an opportunity for filetypes to clean up after themselves when the user or ftdetect mechanism switches away from them to another filetype. (The example above could use a little work because if you :set ft=xxd then set it back, the buffer is marked as modified even if you never changed anything.)

Ftplugins also allow you to refine an existing filetype. For instance, Vim already has some good defaults for C programming in $VIMRUNTIME/ftplugin/c.vim. I put these extra options in ~/.vim/after/ftplugin/c.vim to add my own settings on top:

" the smartest indent engine for C
setlocal cindent
" my preferred "Allman" style indentation
setlocal cino=Ls,:0,l1,t0,(s,U1,W4

" for quickfix errorformat
compiler clang
" shows long build messages better
setlocal ch=2

" auto-create folds per grammar
setlocal foldmethod=syntax
setlocal foldlevel=10

" local project headers
setlocal path=.,,*/include/**3,./*/include/**3
" basic system headers
setlocal path+=/usr/include

setlocal tags=./tags,tags;~
"                      ^ in working dir, or parents
"                ^ sibling of open file

" the default is menu,preview but the preview window is annoying
setlocal completeopt=menu

iabbrev #i #include
iabbrev #d #define
iabbrev main() int main(int argc, char **argv)

" add #include guard
iabbrev #g _<c-r>=expand("%:t:r")<cr><esc>VgUV:s/[^A-Z]/_/g<cr>A_H<esc>yypki#ifndef <esc>j0i#define <esc>o<cr><cr>#endif<esc>2ki

Notice how the script uses “setlocal” rather than “set.” This applies the changes to just the current buffer rather than the whole Vim instance.

This script also enables some light abbreviations. Like I can type #g and press enter and it adds an include guard with the current filename:

#ifndef _FILENAME_H
#define _FILENAME_H

/* <-- cursor here */

#endif

You can also mix filetypes by using a dot (“.”). Here is one application. Different projects have different coding conventions, so you can combine your default C settings with those for a particular project. The OpenBSD source code follows the style(9) format, so let’s make a special openbsd filetype. Combine the two filetypes with :set ft=c.openbsd on relevant files.

To detect the openbsd filetype we can look at the contents of buffers rather than just their extensions or locations on disk. The telltale sign is that C files in the OpenBSD source contain /* $OpenBSD: in the first line.

To detect them, create ~/.vim/after/ftdetect/openbsd.vim:

augroup filetypedetect
        au BufRead,BufNewFile *.[ch]
                \  if getline(1) =~ 'OpenBSD:'
                \|   setl ft=c.openbsd
                \| endif
augroup END

The Vim port for OpenBSD already includes a special syntax file for this filetype: /usr/local/share/vim/vimfiles/syntax/openbsd.vim. If you recall, the /usr/local/share/vim/vimfiles directory is in the runtimepath and is set aside for files from the system administrator. The provided openbsd.vim script includes a function:

function! OpenBSD_Style()
	setlocal cindent
	setlocal cinoptions=(4200,u4200,+0.5s,*500,:0,t0,U4200
	setlocal indentexpr=IgnoreParenIndent()
	setlocal indentkeys=0{,0},0),:,0#,!^F,o,O,e
	setlocal noexpandtab
	setlocal shiftwidth=8
	setlocal tabstop=8
	setlocal textwidth=80
endfun

We simply need to call the function at the appropriate time. Create ~/.vim/after/ftplugin/openbsd.vim:

call OpenBSD_Style()

Now opening any C or header file with the characteristic comment at the top will be recognized as type c.openbsd and will use indenting options that conform with the style(9) man page.

Don’t forget the mouse

This is a friendly reminder that despite our command-line machismo, the mouse is in fact supported in Vim, and can do some things more easily than the keyboard. Mouse events work even over SSH thanks to xterm turning mouse events into stdin escape codes.

To enable mouse support, set mouse=n. Many people use mouse=a to make it work in all modes, but I prefer to enable it only in normal mode. This avoids creating visual selections when I click links with a keyboard modifier to open them in my browser.

Here are things the mouse can do:

  • Open or close folds (when foldcolumn > 0).
  • Select tabs (beats gt gt gt…)
  • Click to complete a motion, like d<click!>. Similar to the easymotion plugin but without any plugin.
  • Jump to help topics with double click.
  • Drag the status line at the bottom to change cmdheight.
  • Drag edge of window to resize.
  • Scroll wheel.

Misc editing

This section could be enormous, but I’ll stick to a few tricks I learned. The first one that blew me away was :set virtualedit=all. It allows you to move the cursor anywhere in the window. If you enter characters or insert a visual block, Vim will add whatever spaces are required to the left of the inserted characters to keep them in place. Virtual edit mode makes it simple to edit tabular data. Turn it off with :set virtualedit=.

Next are some movement commands. I used to rely a lot on } to jump by paragraphs, and just muscle my way down the page. However the ] character makes more precise motions: by function ]], scope ]}, paren ]), comment ]/, diff block ]c. This series is why the quickfix mapping ]q mentioned earlier fits the pattern so well.

For big jumps I used to try things like 1000j, but in normal mode you can actually just type a percentage and Vim will go there, like 50%. Speaking of scroll percentage, you can see it at any time with CTRL-G. Thus I now do :set noruler and ask to see the info as needed. It’s less cluttered. Kind of the opposite of the trend of colorful patched font powerlines.

After jumping around between tags, files, or within a file, there are some commands to get your bearings. Try :ls, :tags, :jumps, and :marks. Jumping through tags actually creates a stack, and you can press CTRL-T to pop one back. I used to always press CTRL-O to back out of jumps, but it is not as direct as popping the tag stack.

In a project directory that has been indexed with ctags, you can open the editor directly to a tag with -t, like vim -t main. To find tags files more flexibly, set the tags configuration variable. Note the semicolon in the example below that allows Vim to search the current directory upward to the home directory. This way you could have a more general system tags file outside the project folder.

set tags=./tags,**5/tags,tags;~
"                          ^ in working dir, or parents
"                   ^ in any subfolder of working dir
"           ^ sibling of open file

There are some buffer tricks too. Switching to a buffer with :bu can take a fragment of the buffer name, not just a number. Sometimes it’s harder to memorize those numbers than remember the name of a source file. You can navigate buffers with marks too. If you use a capital letter as the name of a mark, you can jump to it across buffers. You could set a mark H in a header, C in a source file, and M in a Makefile to go from one buffer to another.
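
A sketch of that workflow (these are normal-mode keystrokes, not colon commands):

" in the header buffer, set global mark H
mH
" in the C source buffer, set global mark C
mC
" later, from anywhere, jump back to the header
'H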

Do you ever get mad after yanking a word, deleting a word somewhere else, trying to paste the first word in, and then discovering your original yank is overwritten? The Vim registers are underappreciated for this. Inspect their contents with :reg. The most recent yank is kept in register "0, while deletions rotate through registers "1 - "9. So even after a later deletion, "0p pastes the yank. The special registers "+ and "* can copy/paste from/to the system clipboard. They usually mean the same thing, except in X11 setups where "* is the primary selection and "+ is the clipboard.
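
For example:

" yank a word; it lands in register "0 (and the unnamed register)
yiw
" delete a line elsewhere; it rotates into register "1
dd
" paste the yanked word, not the deleted line
"0p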

Another handy hidden feature is the command line window. It’s a buffer that contains your previous commands and searches. Bring it up with q: or q/. Once inside you can move to any line and press enter to run it. You can also edit any of the lines before pressing enter; your changes won’t affect the history entry, since the edited command is merely added to the bottom of the list.

This article could go on and on, so I’m going to call it here. For more great topics, see these help sections: views-sessions, viminfo, TOhtml, ins-completion, cmdline-completion, multi-repeat, scroll-cursor, text-objects, grep, netrw-contents.

]]>
Unicode programming, with examples https://begriffs.com/posts/2019-05-23-unicode-icu.html 2019-05-23T00:00:00Z 2019-05-23T00:00:00Z Most programming languages evolved awkwardly during the transition from ASCII to 16-bit UCS-2 to full Unicode. They contain internationalization features that often aren’t portable or don’t suffice.

Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may look visually confusable.

Human languages are highly varied and internally inconsistent, and any application which treats strings as more than an opaque byte stream must embrace the complexity. Realistically this means using a mature third-party library.

This article illustrates text processing ideas with example programs. We’ll use the International Components for Unicode (ICU) library, which is mature, portable, and powers the international text processing behind many products and operating systems.

IBM (the maintainers of ICU) officially support a C, C++ and Java API. We’ll use the C API here for a better view into the internals. Many languages have bindings to the library, so these concepts should be applicable to your language of choice.

Concepts

Before getting into the example code, it’s important to learn the terminology. Let’s start at the most basic question.

What is a “character?”

“Character” is an overloaded term. What a native speaker of a language identifies as a letter or symbol is often stored as multiple values in the internal Unicode representation. The representation is further obscured by an additional encoding in memory, on disk, or during network transmission.

Let’s start at the abstraction closest to the user: the grapheme cluster. A “grapheme” is a graphical unit that a reader recognizes as a single element of the writing system. It’s the character as a user would understand it. For example, 山, ä and క్క are graphemes. Pieces of a single grapheme always stay together in print; breaking them apart is either nonsense or changes the meaning of the symbol. They are rendered as “glyphs,” i.e. markings on paper or screen which vary by font, style, or position in a word.

You might imagine that Unicode assigns each grapheme a unique number, but that is not true. It would be wasteful because there is a combinatorial explosion between letters and diacritical marks. For instance (o, ô, ọ, ộ) and (a, â, ạ, ậ) follow a pattern. Rather than assigning a distinct number to each, it’s more efficient to assign a number to o and a, and then to each of the combining marks. The graphemes can be built from letters and combining marks e.g. ậ = a + ◌̂ + ◌̣.

In reality Unicode takes both approaches. It assigns numbers to basic letters and combining marks, but also to some of their more common combinations. Many graphemes can thus be created in more than one way. For instance ộ can be specified in five ways:

  • A: U+006f (o) + U+0302 (◌̂) + U+0323 (◌̣)
  • B: U+006f (o) + U+0323 (◌̣) + U+0302 (◌̂)
  • C: U+00f4 (ô) + U+0323 (◌̣)
  • D: U+1ecd (ọ) + U+0302 (◌̂)
  • E: U+1ed9 (ộ)

The numbers (written U+xxxx) for each abstract character and each combining symbol are called “codepoints.” Every Unicode string is expressed as a list of codepoints. As illustrated above, multiple strings of codepoints may render into the same sequence of graphemes.

To meaningfully compare strings codepoint by codepoint for equality, both strings should be represented in a consistent way. A standardized choice of codepoint decomposition for graphemes is called a “normal form.”

One choice is to decompose a string into as many codepoints as possible like examples A and B (with a weighting factor of which combining marks should come first). That is called Normalization Form Canonical Decomposition (NFD). Another choice is to do the opposite and use the fewest codepoints possible like example E. This is called Normalization Form Canonical Composition (NFC).
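
To make this concrete, here is a minimal sketch using the ICU library introduced later in this article (the filename is arbitrary). It normalizes decomposition A to NFC and confirms the result equals precomposed form E:

/*** nfc-demo.c ***/

#include <stdio.h>
#include <stdlib.h>

#include <unicode/unorm2.h>
#include <unicode/ustring.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* form A: o + combining circumflex + combining dot below */
	UChar a[] = { 0x006f, 0x0302, 0x0323, 0 };
	/* form E: the single precomposed codepoint U+1ED9 */
	UChar e[] = { 0x1ed9, 0 };
	UChar norm[8];
	const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);

	unorm2_normalize(nfc, a, -1, norm, 8, &status);
	if (U_FAILURE(status))
	{
		fprintf(stderr, "%s\n", u_errorName(status));
		return EXIT_FAILURE;
	}
	/* prints "same string": NFC reorders the marks and
	 * composes them into U+1ED9 */
	puts(u_strcmp(norm, e) == 0 ? "same string" : "different");
	return EXIT_SUCCESS;
}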

A core concept to remember is that, although codepoints are the building blocks of text, they don’t match up 1-1 with user-perceived characters (graphemes). Operations such as taking the length of an array of codepoints, or accessing arbitrary array positions are typically not useful for Unicode programs. Programs must also be mindful of the combining characters, like diacritical marks, when inserting or deleting codepoints. Inserting U+0061 into the asterisk position U+006f U+0302 (*) U+0323 changes the string “ộ” into “ôạ” rather than “ộa”.

Glyphs vs graphemes

It’s not just fonts that cause graphemes to be rendered into varying glyphs. The rules of some languages cause glyphs to change through contextual shaping. For instance the Arabic letter “heh” has four forms, depending on which sides are flanked by letters. When isolated it appears as ﻩ and in the final/initial/medial position in a word it appears as ﻪ/ﻫ/ﻬ respectively. Similarly, Greek displays lower-case sigma differently at the end of the word (final form) than elsewhere. Some glyphs change based on visual order. In a right-to-left language the starting parenthesis “(” mirrors to display as “)”.

Not only do individual graphemes’ glyphs vary, graphemes can combine to form single glyphs. One way is through ligatures. The latin letters “fi” often join the dot of the i with the curve of the f (presentation form U+FB01 fi). Another way is language irregularity. The Arabic ا and ل, when contiguous, must form ﻻ.

Conversely, a single grapheme can split into multiple glyphs. For instance in some Indic languages, vowels can split and surround preceding consonants. In Bengali, U+09CC ৌ surrounds U+09AE ম to become মৌ.

How are codepoints encoded?

In 1990, Unicode codepoints were 16 bits wide. That choice turned out to be too small for the symbols and languages people wanted to represent, so the committee extended the standard to 21 bits. That’s fine in the abstract, but how the 21 bits are stored in memory or communicated between computers depends on practical factors.

It’s an unusual memory size. Computer hardware doesn’t typically access memory in 21-bit chunks. Networking protocols, too, are better geared toward transmitting eight bits at a time. Thus, codepoints are broken into sequences of more conventionally sized blocks called code units for persistence on disk, transmission over networks, and manipulation in memory.

The Unicode Transformation Formats (UTF) describe different ways to map between codepoints and code units. The transformation formats are named after the bit width of their code units (7, 8, 16, or 32), as well as the endianness (BE or LE). For instance: UTF-8, or UTF-16BE. In addition to the UTFs, there’s another – more complex – encoding called Punycode. It is designed to conform with the limited ASCII character subset used for Internet host names.

A final bit of terminology. A “plane” is a continuous group of 65,536 code points. There are 17 planes, identified by the numbers 0 to 16. Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes (1 through 16) are called “supplementary planes.”
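
Finding the plane of a codepoint is simple arithmetic: divide the codepoint by 0x10000 (a quick sketch, using ICU’s UChar32 type introduced below):

UChar32 c = 0x1F41A;  /* 🐚 */
int plane = c >> 16;  /* 0x1F41A / 0x10000 = 1, a supplementary plane */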

Which encoding should you choose?

For transmission and storage, use UTF-8. Programs which move ASCII data can handle it without modification. Machine endianness does not affect UTF-8, and the byte-sized units work well in networks and filesystems.

Some sites, like UTF-8 Everywhere go even further and recommend using UTF-8 for internal manipulation of text in program memory. However, I would suggest you use whatever encoding your Unicode library favors for this. You’ll be performing operations through the library API, not directly on code units. As we’re seeing, there is too much complexity between glyphs, graphemes, codepoints and code units to be manipulating the units directly. Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.

It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s true that every code unit can hold a full codepoint. However, the relationship between codepoints and glyphs isn’t straightforward, so there isn’t a programmatic advantage to storing the string this way.

UTF-32 also wastes at minimum 11 (32 - 21) bits per codepoint, and typically more. For instance, UTF-16 requires only one 16-bit code unit to encode points in the Basic Multilingual Plane (the most commonly encountered points). Thus UTF-32 can typically double the space required for BMP text.

There are times to manipulate UTF-32, such as when examining a single codepoint. We’ll see examples below.

ICU example programs

The programs in this article are ready to compile and run. They require the ICU C library called ICU4C, which is available on most platforms through the operating system package manager.

ICU provides five libraries for linking (we need the first two):

Package   Contents
icu-uc    Common (uc) and Data (dt/data) libraries
icu-io    Ustdio/iostream library (icuio)
icu-i18n  Internationalization (in/i18n) library
icu-le    Layout Engine
icu-lx    Paragraph Layout

To use ICU4C, set the compiler and linker flags with pkg-config in your Makefile. (Pkg-config may also need to be installed on your computer.)

CFLAGS  = -std=c99 -pedantic -Wall -Wextra \
          `pkg-config --cflags icu-uc icu-io`
LDFLAGS = `pkg-config --libs icu-uc icu-io`

The examples in this article conform to the C89 standard, but we specify C99 in the Makefile because the ICU header files use C99-style (//) comments.

Generating random codepoints

To start getting a feel for ICU’s I/O and codepoint manipulation, let’s make a program to output completely random (but valid) codepoints. You could use this program as a basic fuzz tester, to see whether its output confuses other programs. A real fuzz tester ought to have the ability to take an explicit seed for repeatable output, but we will omit that functionality from our simple demo.

This program has limited portability because it gets entropy from /dev/urandom, a Unix device. To generate good random numbers using only the C standard library, see my other article. Also POSIX provides pseudo-random number functions.

/* for constants like EXIT_FAILURE */
#include <stdlib.h>
/* we'll be using standard C I/O to read random bytes */
#include <stdio.h>

/* to determine codepoint categories */
#include <unicode/uchar.h>
/* to output UTF-32 codepoints in proper encoding for terminal */
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
	long i = 0, linelen;
	/* somewhat non-portable: /dev/urandom is unix specific */
	FILE *f = fopen("/dev/urandom", "rb");
	UFILE *out;
	/* UTF-32 code unit can hold an entire codepoint */
	UChar32 c;
	/* to learn about c */
	UCharCategory cat;

	if (!f)
	{
		fputs("Unable to open /dev/urandom\n", stderr);
		return EXIT_FAILURE;
	}

	/* optional length to insert line breaks */
	linelen = argc > 1 ? strtol(argv[1], NULL, 10) : 0;

	/* have to obtain a Unicode-aware file handle. This function
	 * has no failure return code, it always works. */
	out = u_get_stdout();

	/* read a random 32 bits, presumably forever */
	while (fread(&c, sizeof c, 1, f))
	{
		/* Scale 32-bit value to a number within code planes
		 * zero through fourteen. (Planes 15-16 are private-use)
		 *
		 * The modulo bias is insignificant. The first 65536
		 * codepoints are minutely favored, being generated by
		 * 4370 different 32-bit numbers each. The remaining
		 * 917504 codepoints are generated by 4369 numbers each.
		 */
		c %= 0xF0000;
		cat = u_charType(c);

		/* U_UNASSIGNED are "non-characters" with no assigned
		 * meanings for interchange. U_PRIVATE_USE_CHAR are
		 * reserved for use within organizations, and
		 * U_SURROGATE are designed for UTF-16 code units in
		 * particular. Don't print any of those. */
		if (cat != U_UNASSIGNED && cat != U_PRIVATE_USE_CHAR &&
		    cat != U_SURROGATE)
		{
			u_fputc(c, out);
			if (linelen && ++i >= linelen)
			{
				i = 0;
				/* there are a number of Unicode
				 * linebreaks, but the standard ASCII
				 * \n is valid, and will interact well
				 * with a shell */
				u_fputc('\n', out);
			}
		}
	}

	/* should never get here */
	fclose(f);
	return EXIT_SUCCESS;
}

A note about the mysterious U_UNASSIGNED category, the “non-characters.” These are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. The Unicode Standard sets aside 66 non-character code points. The last two code points of each plane are noncharacters (U+FFFE and U+FFFF on the BMP). In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0…U+FDEF.

Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. They are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.

Manipulating codepoints

We discussed non-characters in the previous section, but there are also Private Use codepoints. Unlike non-characters, those for private use are designated for interchange between systems. However, the precise meaning and glyphs for these characters are specific to the organization using them. The same codepoints can be used for different things by different people.

Unicode provides a large area for private use: a small code block in the BMP, as well as two entire planes, 15 and 16. Because browsers and text editors typically render PUA codepoints as nothing more than empty boxes, we can exploit plane 15 to make a visually confusing code. Ultimately it’s a cheesy substitution cypher, but it’s kind of fun.

Below is a program to shift characters in the BMP to/from plane 15, the Private Use Area A. Example output of an encoded string: 󰁂󰁥󰀠󰁳󰁵󰁲󰁥󰀠󰁴󰁯󰀠󰁤󰁲󰁩󰁮󰁫󰀠󰁹󰁯󰁵󰁲󰀠󰁏󰁶󰁡󰁬󰁴󰁩󰁮󰁥󰀡󰀊

#include <stdio.h>
#include <stdlib.h>
/* for strcmp in argument parsing */
#include <string.h>

#include <unicode/ustdio.h>

void usage(const char *prog)
{
	puts("Shift base multilingual plane to/from PUA-A\n");
	printf("Usage: %s [-d]\n\n", prog);
	puts("Encodes stdin (or decode with -d)");
	exit(EXIT_SUCCESS);
}

int main(int argc, char **argv)
{
	UChar32 c;
	UFILE *in, *out;
	enum { MODE_ENCODE, MODE_DECODE } mode = MODE_ENCODE;

	if (argc > 2)
		usage(argv[0]);
	else if(argc > 1)
	{
		if (strcmp(argv[1], "-d") == 0)
			mode = MODE_DECODE;
		else
			usage(argv[0]);
	}

	out = u_get_stdout();

	in = u_finit(stdin, NULL, NULL);
	if (!in)
	{
		fputs("Error opening stdout as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* u_fgetcx returns UTF-32. U_EOF happens to be 0xFFFF,
	 * not -1 like EOF typically is in stdio.h */
	while ((c = u_fgetcx(in)) != U_EOF)
	{
		/* -1 for UChar32 actually signifies invalid character */
		if (c == (UChar32)0xFFFFFFFF)
		{
			fputs("Invalid character.\n", stderr);
			continue;
		}
		if (mode == MODE_ENCODE)
		{
			/* Move the BMP into the Supplementary
			 * Private Use Area-A, which begins
			 * at codepoint 0xf0000 */
			if (0 < c && c < 0xe000)
				c += 0xf0000;
		}
		else
		{
			/* Move the Supplementary Private Use
			 * Plane down into the BMP */
			if (0xf0000 < c && c < 0xfe000)
				c -= 0xf0000;
		}
		u_fputc(c, out);
	}

	/* if you u_finit it, then u_fclose it */
	u_fclose(in);

	return EXIT_SUCCESS;
}

Examining UTF-8 code units

So far we’ve been working entirely with complete codepoints. This next example gets into their representation as code units in a transformation format, namely UTF-8. We will read a codepoint as a hexadecimal program argument, convert it to between one and four bytes of UTF-8, and print the hex values of those bytes.

/*** utf8.c ***/

#include <stdio.h>
#include <stdlib.h>

#include <unicode/utf8.h>

int main(int argc, char **argv)
{
	UChar32 c;
	/* ICU defines its own bool type to be used
	 * with their macro */
	UBool err = FALSE;
	/* ICU uses C99 types like uint8_t */
	uint8_t bytes[4] = {0};
	/* probably should be size_t not int32_t, but
	 * just matching what their macro expects */
	int32_t written = 0, i;
	char *parsed;

	if (argc != 2)
	{
		fprintf(stderr, "Usage: %s codepoint\n", *argv);
		exit(EXIT_FAILURE);
	}
	c = strtol(argv[1], &parsed, 16);
	if (!*argv[1] || *parsed)
	{
		fprintf(stderr,
			"Cannot parse codepoint: U+%s\n", argv[1]);
		exit(EXIT_FAILURE);
	}

	/* this is a macro, and updates the variables
	 * directly. No need to pass addresses.
	 * We're saying: write to "bytes", tell us how
	 * many were "written", limit it to four */
	U8_APPEND(bytes, written, 4, c, err);
	if (err == TRUE)
	{
		fprintf(stderr, "Invalid codepoint: U+%s\n", argv[1]);
		exit(EXIT_FAILURE);
	}

	/* print in format 'xxd -r' can read */
	printf("0: ");
	for (i = 0; i < written; ++i)
		printf("%2x", bytes[i]);
	puts("");
	return EXIT_SUCCESS;
}

Suppose you compile this to a program named utf8. Here are some examples:

# ascii characters are unchanged
$ ./utf8 61
0: 61

# other codepoints require more bytes
$ ./utf8 1F41A
0: f09f909a

# format is compatible with "xxd"
$ ./utf8 1F41A | xxd -r
🐚

# surrogates (used in UTF-16) are not valid codepoints
$ ./utf8 DC00
Invalid codepoint: U+DC00

Reading lines into internal UTF-16 representation

Unlimited line length

Here’s a useful helper function named u_wholeline() which reads a line of any length into a dynamically allocated buffer. It returns a UChar*, ICU’s standard UTF-16 code unit array.

/* to properly test realloc */
#include <errno.h>
#include <stdlib.h>

#include <unicode/ustdio.h>

/* line Feed, vertical tab, form feed, carriage return,
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
	((c) >= 0xa && (c) <= 0xd) || \
	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

/* allocates buffer, caller must free */
UChar *u_wholeline(UFILE *f)
{
	/* assume most lines are shorter
	 * than 128 UTF-16 code units */
	size_t i, sz = 128;
	UChar c, *s = malloc(sz * sizeof(*s)), *s_new;

	if (!s)
		return NULL;

	/* u_fgetc returns UTF-16, unlike u_fgetcx */
	for (i = 0; (s[i] = u_fgetc(f)) != U_EOF && !NEWLINE(s[i]); ++i)
		/* grow before the next iteration writes s[i+1] */
		if (i + 1 >= sz)
		{
			/* double the buffer when it runs out */
			sz *= 2;
			errno = 0;
			s_new = realloc(s, sz * sizeof(*s));
			if (errno == ENOMEM)
				free(s);
			if ((s = s_new) == NULL)
				return NULL;
		}

	/* if terminated by CR, eat LF */
	if (s[i] == 0xd && (c = u_fgetc(f)) != 0xa)
		u_fungetc(c, f);
	/* s[i] will either be U_EOF or a newline; wipe it */
	s[i] = '\0';

	return s;
}

Limited line length

The previous example reads an entire line. However, reading a limited number of code units from UTF-16 lines is trickier. Truncating a Unicode string is always a little dangerous, since it may split a word and break contextual shaping.

UTF-16 also has surrogate pairs, which are how that transformation format expresses codepoints outside the BMP. Ending a UTF-16 string early can split a surrogate pair unless we take precautions.

The following example reads lines in chunks of at most three UTF-16 code units at a time. If it reads two consecutive codepoints from supplementary planes it will fail. The program accepts a “fix” argument to make it push a final unpaired surrogate back onto the stream for a future read.

/*** codeunit.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>

/* BUFSZ set to be very small so that lines must be read in
 * many chunks. Helps illustrate split surrogate pairs */
#define BUFSZ 4

void printHex(const UChar *s)
{
	while (*s)
		printf("%x ", *s++);
	putchar('\n');
}

/* yeah, slightly annoying duplication */
void printHex32(const UChar32 *s)
{
	while (*s)
		printf("%x ", *s++);
	putchar('\n');
}

int main(int argc, char **argv)
{
	UFILE *in;
	/* read line into ICU's default UTF-16 representation */
	UChar line[BUFSZ];
	/* A buffer to hold codepoints of "line" as UTF-32 code
	 * units.  The length is sufficient because it requires
	 * fewer (or at least no greater) code units in UTF-32 to
	 * encode the string */
	UChar32 codepoints[BUFSZ];
	UChar *final;
	UErrorCode err = U_ZERO_ERROR;

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* read lines one small BUFSZ chunk at a time */
	while (u_fgets(line, BUFSZ, in))
	{
		/* correct for split surrogate pairs only
		 * if the "fix" argument is present */
		if (argc > 1 && strcmp(argv[1], "fix") == 0)
		{
			final = line + u_strlen(line);
			/* want to consider the character before \0
			 * if such exists */
			if (final > line)
				final--;
			/* if it is the lead unit of a surrogate pair */
			if (U16_IS_LEAD(*final))
			{
				/* push it back for a future read, and
				 * truncate the string */
				u_fungetc(*final, in);
				*final = '\0';
			}
		}

		printf("UTF-16    : ");
		printHex(line);
		u_strToUTF32(
			codepoints, BUFSZ, NULL,
			line, -1, &err);
		printf("Error?    : %s\n", u_errorName(err));
		printf("Codepoints: ");
		printHex32(codepoints);

		/* reset potential errors and go for another chunk */
		err = U_ZERO_ERROR;
		*codepoints = '\0';
	}

	u_fclose(in);
	return EXIT_SUCCESS;
}

If the program reads two weird numerals 𝟘𝟙 (different from 01), neither of which is in the BMP, it finds one codepoint but chokes on the broken pair:

$ echo -n 𝟘𝟙 | ./codeunit
UTF-16    : d835 dfd8 d835
Error?    : U_INVALID_CHAR_FOUND
Codepoints: 1d7d8
UTF-16    : dfd9
Error?    : U_INVALID_CHAR_FOUND
Codepoints:

However if we pass the “fix” argument, the program will read two complete codepoints:

$ echo -n 𝟘𝟙 | ./codeunit fix
UTF-16    : d835 dfd8
Error?    : U_ZERO_ERROR
Codepoints: 1d7d8
UTF-16    : d835 dfd9
Error?    : U_ZERO_ERROR
Codepoints: 1d7d9

Perhaps a better way to read a line with limited length is to use a “break iterator” to stop on a word boundary. We’ll see more about that later.

Extracting, iterating codepoints in UTF-16 string

Our next example will rather laboriously remove diacritical marks from a string. There’s an easier way to do this called “transformation,” but doing it manually provides an opportunity to decompose characters and iterate over them with the U16_NEXT macro.

/*** nomarks.c ***/

#include <stdlib.h>

#include <unicode/uchar.h>
#include <unicode/unorm2.h>
#include <unicode/ustdio.h>
#include <unicode/utf16.h>

/* Limit to how many decomposed UTF-16 units a single
 * codepoint will become in NFD. I don't know the
 * correct value here so I chose a value that seems
 * to be overkill */
#define MAX_DECOMP_LEN 16

int main(void)
{
	long i, n;
	UChar32 c;
	UFILE *in, *out;
	UChar decomp[MAX_DECOMP_LEN];
	UErrorCode status = U_ZERO_ERROR;
	UNormalizer2 *norm;

	out = u_get_stdout();

	in = u_finit(stdin, NULL, NULL);
	if (!in)
	{
		/* using stdio functions with stderr and ustdio
		 * with stdout. Mixing the two on a single file
		 * handle would probably be bad. */
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* create a normalizer, in this case one going to NFD */
	norm = (UNormalizer2 *)unorm2_getNFDInstance(&status);
	if (U_FAILURE(status)) {
		fprintf(stderr,
			"unorm2_getNFDInstance(): %s\n",
			u_errorName(status));
		return EXIT_FAILURE;
	}

	/* consume input as UTF-32 units one by one */
	while ((c = u_fgetcx(in)) != U_EOF)
	{
		/* Decompose c to isolate its n combining character
		 * codepoints. Saves them as UTF-16 code units.  FYI,
		 * this function ignores the type of "norm" and always
		 * denormalizes */
		n = unorm2_getDecomposition(
			norm, c, decomp, MAX_DECOMP_LEN, &status
		);

		if (U_FAILURE(status)) {
			fprintf(stderr,
				"unorm2_getDecomposition(): %s\n",
				u_errorName(status));
			u_fclose(in);
			return EXIT_FAILURE;
		}

		/* if c does not decompose and is not itself
		 * a diacritical mark */
		if (n < 0 && ublock_getCode(c) !=
		    UBLOCK_COMBINING_DIACRITICAL_MARKS)
			u_fputc(c, out);

		/* walk canonical decomposition, reuse c variable */
		for (i = 0; i < n; )
		{
			/* the U16_NEXT macro iterates over UChar (aka
			 * UTF-16), advancing by one or two elements as
			 * needed to get a codepoint. It saves the result
			 * in UTF-32. The macro updates i and c. */
			U16_NEXT(decomp, i, n, c);
			/* output only if not combining diacritical */
			if (ublock_getCode(c) !=
			    UBLOCK_COMBINING_DIACRITICAL_MARKS)
				u_fputc(c, out);
		}
	}

	u_fclose(in);
	/* u_get_stdout() doesn't need to be u_fclose'd */
	return EXIT_SUCCESS;
}

Here’s an example of running the program:

$ echo "résumé façade" | ./nomarks
resume facade

Transformation

ICU provides a rich domain specific language for transforming strings. For example, our entire program in the previous section can be replaced by the transformation NFD; [:Nonspacing Mark:] Remove; NFC. This means to perform a canonical decomposition, remove nonspacing marks, and then canonically compose again. (In fact our program above didn’t re-compose.)

The program below echoes stdin to stdout, but passes the output through a transformation.

/*** trans-stream.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

int main(int argc, char **argv)
{
	UChar32 c;
	UParseError pe;
	UFILE *in, *out;
	UTransliterator *t;
	UErrorCode status = U_ZERO_ERROR;
	UChar *xform_id;
	size_t n;

	if (argc != 2)
	{
		fprintf(stderr,
			"Usage: %s \"translation rules\"\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* the UTF-16 string should never be longer than the UTF-8
	 * argv[1], so this should be safe */
	n = strlen(argv[1]) + 1;
	xform_id = malloc(n * sizeof(UChar));
	u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

	/* create transliterator by identifier */
	t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
	                 NULL, -1, &pe, &status);
	/* don't need the identifier any more */
	free(xform_id);
	if (U_FAILURE(status)) {
		fprintf(stderr, "utrans_open(%s): %s\n",
		        argv[1], u_errorName(status));
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* transparently transliterate stdout */
	u_fsettransliterator(out, U_WRITE, t, &status);
	if (U_FAILURE(status)) {
		fprintf(stderr,
		        "Failed to set transliterator on stdout: %s\n",
		        u_errorName(status));
		u_fclose(in);
		return EXIT_FAILURE;
	}

	/* what looks like a simple echo loop actually
	 * transliterate characters */
	while ((c = u_fgetcx(in)) != U_EOF)
		u_fputc(c, out);

	utrans_close(t);
	u_fclose(in);
	return EXIT_SUCCESS;
}

As mentioned, it can emulate our earlier “nomarks” program:

$ echo "résumé façade" | ./trans "NFD; [:Nonspacing Mark:] Remove; NFC"
resume facade

It can also transliterate between scripts like this:

$ echo "miirekkaḍiki veḷutunnaaru?" | ./trans "Telugu"
మీరెక్కడికి వెళుతున్నఅరు?

Applying the transformation to a stream with u_fsettransliterator is a simple way to do things. However I did discover and file an ICU bug which will be fixed in version 65.1.

A more robust way to apply transformations is by manipulating UChar strings directly. The technique is also probably more applicable in real applications.

Here’s a rewrite of trans-stream that operates on strings directly:

/*** trans-string.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

/* max number of UTF-16 code units to accumulate while looking
 * for an unambiguous transliteration. Has to be fairly long to
 * handle names in Name-Any transliteration like
 * \N{LATIN CAPITAL LETTER O WITH OGONEK AND MACRON} */
#define CONTEXT 100

int main(int argc, char **argv)
{
	UErrorCode status = U_ZERO_ERROR;
	UChar c, *end;
	UChar input[CONTEXT] = {0}, *buf, *enlarged;
	UFILE *in, *out; 
	UTransPosition pos;
	int32_t width, sizeNeeded, bufLen;

	size_t n;
	UChar *xform_id;
	UTransliterator *t;

	/* bufLen must be able to hold at least CONTEXT, and
	 * will be increased as needed for transliteration */
	bufLen = CONTEXT;
	buf = malloc(sizeof(UChar) * bufLen);

	if (argc != 2)
	{
		fprintf(stderr,
			"Usage: %s \"translation rules\"\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* allocate and read identifier, like earlier example */
	n = strlen(argv[1]) + 1;
	xform_id = malloc(n * sizeof(UChar));
	u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

	t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
	                 NULL, -1, NULL, &status);
	free(xform_id);
	if (U_FAILURE(status)) {
		fprintf(stderr, "utrans_open(%s): %s\n",
		        argv[1], u_errorName(status));
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	end = input;
	/* append UTF-16 code units one at a time for incremental
	 * transliteration */
	while ((c = u_fgetc(in)) != U_EOF)
	{
		/* we consider at most CONTEXT consecutive code units
		 * for transliteration (minus one for \0) */
		if (end - input >= CONTEXT-1)
		{
			fprintf(stderr,
				"Exceeded max (%i) code units "
				"for context.\n",
				CONTEXT);
			break;
		}
		*end++ = c;
		*end = '\0';

		/* copy string so far to buf to operate on */
		u_strcpy(buf, input);
		pos.start = pos.contextStart = 0;
		pos.limit = pos.contextLimit = end - input;
		sizeNeeded = -1;
		utrans_transIncrementalUChars(
			t, buf, &sizeNeeded, bufLen, &pos, &status
		);
		/* if buf not big enough for transliterated result */
		if (status == U_BUFFER_OVERFLOW_ERROR)
		{
			/* utrans_transIncrementalUChars sets sizeNeeded,
			 * so resize the buffer */
			if ((enlarged =
			     realloc(buf, sizeof(UChar)*sizeNeeded))
			    == NULL)
			{
				fprintf(stderr,
					"Unable to grow buffer.\n");
				/* fail gracefully and display
				 * what we can */
				break;
			}
			buf = enlarged;
			bufLen = sizeNeeded;
			u_strcpy(buf, input);
			pos.start = pos.contextStart = 0;
			pos.limit = pos.contextLimit = end - input;
			sizeNeeded = -1;

			/* one more time, but with sufficient space */
			status = U_ZERO_ERROR;
			utrans_transIncrementalUChars(
				t, buf, &sizeNeeded, bufLen,
				&pos, &status
			);
		}
		/* handle errors other than U_BUFFER_OVERFLOW_ERROR */
		if (U_FAILURE(status)) {
			fprintf(stderr,
				"utrans_transIncrementalUChars(): %s\n",
				u_errorName(status));
			break;
		}

		/* print buf[0 .. pos.start - 1] */
		u_printf("%.*S", pos.start, buf);

		/* Remove the code units which were processed,
		 * shifting back the remaining ones which could
		 * not be unambiguously transliterated. Then hit
		 * the loop to get another code unit and try again. */
		u_strcpy(input, buf+pos.start);
		end = input + (pos.limit - pos.start);
	}

	/* if any leftovers from incremental transliteration */
	if (end > input)
	{
		/* transliterate input array in place, do our best */
		width = end - input;
		utrans_transUChars(
			t, input, NULL, CONTEXT, 0, &width, &status);
		u_printf("%S", input);
	}

	utrans_close(t);
	u_fclose(in);
	free(buf);
	return U_SUCCESS(status) ? EXIT_SUCCESS : EXIT_FAILURE;
}

Punycode

Punycode is a representation of Unicode within the limited ASCII character subset used for internet host names. If you enter a non-ASCII URL into a web browser navigation bar, the browser translates to Punycode before making the actual DNS lookup.

The encoding is part of the more general process of Internationalizing Domain Names in Applications (IDNA), which also normalizes the string.

Note that not all Unicode strings can be successfully encoded. For instance codepoints like “⒈” include a period in the glyph and are used for numbered lists. Converting that dot to the ASCII hostname would inadvertently specify a subdomain. ICU turns the offending character into U+FFFD (the “replacement character”) in the output and returns an error.

The following program uses uidna_nameToASCII or uidna_nameToUnicode as needed to translate between Unicode and punycode.

/*** puny.c ***/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* uidna stands for Internationalized Domain Names in
 * Applications and contains punycode routines */
#include <unicode/uidna.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

void chomp(UChar *s)
{
	/* unicode characters that split lines */
	UChar splits[] =
		{0xa, 0xb, 0xc, 0xd, 0x85, 0x2028, 0x2029, '\0'};
	if (s)
		s[u_strcspn(s, splits)] = '\0';
}

int main(int argc, char **argv)
{
	UFILE *in;
	UChar input[1024], output[1024];
	UIDNAInfo info = UIDNA_INFO_INITIALIZER;
	UErrorCode status = U_ZERO_ERROR;
	UIDNA *idna = uidna_openUTS46(UIDNA_DEFAULT, &status);

	/* default action is performing punycode */
	int32_t (*action)(
			const UIDNA*, const UChar*, int32_t, UChar*, 
			int32_t, UIDNAInfo*, UErrorCode*
		) = uidna_nameToASCII;

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* the "decode" option reverses our action */
	if (argc > 1 && strcmp(argv[1], "decode") == 0)
		action = uidna_nameToUnicode;

	/* u_fgets includes the newline, so we chomp it */
	u_fgets(input, sizeof(input)/sizeof(*input), in);
	chomp(input);

	action(idna, input, -1, output,
		sizeof(output)/sizeof(*output),
		&info, &status);

	if (U_SUCCESS(status) && info.errors!=0)
		fputs("Bad input.\n", stderr);

	u_printf("%S\n", output);

	uidna_close(idna);
	u_fclose(in);
	return 0;
}

Example of using the program:

$ echo "façade.com" | ./puny
xn--faade-zra.com

# not every string is allowed

$ echo "a⒈.com" | ./puny
Bad input.
a�.com

Changing case

The C standard library has functions like toupper which operate on a single character at a time. ICU has equivalents like u_toupper, but working on single codepoints isn’t sufficient for proper casing. Let’s try the naive per-codepoint approach and see why.

/*** pointcase.c ***/

#include <stdlib.h>
#include <string.h>

#include <unicode/uchar.h>
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
	UChar32 c;
	UFILE *in, *out;
	UChar32 (*op)(UChar32) = NULL;

	/* set op to one of the casing operations
	 * in uchar.h */
	if (argc < 2 || strcmp(argv[1], "upper") == 0)
		op = u_toupper;
	else if (strcmp(argv[1], "lower") == 0)
		op = u_tolower;
	else if (strcmp(argv[1], "title") == 0)
		op = u_totitle;
	else
	{
		fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	out = u_get_stdout();
	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* operates on UTF-32 */
	while ((c = u_fgetcx(in)) != U_EOF)
		u_fputc(op(c), out);

	u_fclose(in);
	return EXIT_SUCCESS;
}

# not quite right, ß should become SS:

$ echo "Die große Stille" | ./pointcase upper
DIE GROßE STILLE

# also wrong, final sigma should be ς:

$ echo "ΣΊΣΥΦΟΣ" | ./pointcase lower
σίσυφοσ

As you can see, some graphemes need to “expand” into a greater number of characters, and others are position-sensitive. To do this properly, we have to operate on entire strings rather than individual characters. Here is a program that does it right:

/*** strcase.c ***/

#include <locale.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 1024

/* wrapper function for u_strToTitle with signature
 * matching the other casing functions */
int32_t title(UChar *dest, int32_t destCapacity,
		const UChar *src, int32_t srcLength,
		const char *locale, UErrorCode *pErrorCode)
{
	return u_strToTitle(dest, destCapacity, src,
			srcLength, NULL, locale, pErrorCode);
}

int main(int argc, char **argv)
{
	UFILE *in;
	char *locale;
	UChar line[BUFSZ], cased[BUFSZ];
	UErrorCode status = U_ZERO_ERROR;
	int32_t (*op)(
			UChar*, int32_t, const UChar*, int32_t,
			const char*, UErrorCode*
		) = NULL;

	/* casing is locale-dependent */
	if (!(locale = setlocale(LC_CTYPE, "")))
	{
		fputs("Cannot determine system locale\n", stderr);
		return EXIT_FAILURE;
	}

	if (argc < 2 || strcmp(argv[1], "upper") == 0)
		op = u_strToUpper;
	else if (strcmp(argv[1], "lower") == 0)
		op = u_strToLower;
	else if (strcmp(argv[1], "title") == 0)
		op = title;
	else
	{
		fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* Ideally we should change case up to the last word
	 * break and push the remaining characters back for
	 * a future read if the line was longer than BUFSZ.
	 * Currently, if the string is truncated, the final
	 * character would incorrectly be considered
	 * terminal, which affects casing rules in Greek. */
	while (u_fgets(line, BUFSZ, in))
	{
		op(cased, BUFSZ, line, -1, locale, &status);
		/* if casing increases string length, and goes
		 * beyond buffer size like the german ß -> SS */
		if (status == U_BUFFER_OVERFLOW_ERROR)
		{
			/* Just issue a warning and read another line.
			 * Don't treat it as severely as other errors. */
			fputs("Line too long\n", stderr);
			status = U_ZERO_ERROR;
		}
		else if (U_FAILURE(status))
		{
			fputs(u_errorName(status), stderr);
			break;
		}
		else
			u_printf("%S", cased);
	}

	u_fclose(in);
	return U_SUCCESS(status)
		? EXIT_SUCCESS : EXIT_FAILURE;
}

This works better.

$ echo "Die große Stille" | ./strcase upper
DIE GROSSE STILLE

$ echo "ΣΊΣΥΦΟΣ" | ./strcase lower
σίσυφος

Counting words and graphemes

Let’s make a version of wc (the Unix word count program) that knows more about Unicode. Our version will properly count grapheme clusters and word boundaries.

For example, regular wc gets confused by the ancient Ogham script. This was a series of notches scratched into fence posts, and has a space character which is nonblank.

$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | wc
       1       1      37

One word, you say? Puh-leaze, if your program can’t handle Medieval Irish carvings then I want nothing to do with it. Here’s one that can:

/*** uwc.c ***/

#include <locale.h>
#include <stdlib.h>

#include <unicode/ubrk.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 512

/* line Feed, vertical tab, form feed, carriage return, 
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
	((c) >= 0xa && (c) <= 0xd) || \
	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

int main(void)
{
	UFILE *in;
	char *locale;
	UChar line[BUFSZ];
	UBreakIterator *brk_g, *brk_w;
	UErrorCode status = U_ZERO_ERROR;
	long ngraph = 0, nword = 0, nline = 0;
	size_t len;

	/* word breaks are locale-specific, so we'll obtain
	 * LC_CTYPE from the environment */
	if (!(locale = setlocale(LC_CTYPE, "")))
	{
		fputs("Cannot determine system locale\n", stderr);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	/* create an iterator for graphemes */
	brk_g = ubrk_open(
		UBRK_CHARACTER, locale, NULL, -1, &status);
	/* and another for the edges of words */
	brk_w = ubrk_open(
		UBRK_WORD, locale, NULL, -1, &status);

	/* yes, this is sensitive to splitting end of line
	 * surrogate pairs and can be improved by our previous
	 * function for reading bounded lines */
	while (u_fgets(line, BUFSZ, in))
	{
		len = u_strlen(line);

		ubrk_setText(brk_g, line, len, &status);
		ubrk_setText(brk_w, line, len, &status);

		/* Start at beginning of string, count breaks.
		 * Could have been a for loop, but this looks
		 * simpler to me. */
		ubrk_first(brk_g);
		while (ubrk_next(brk_g) != UBRK_DONE)
			ngraph++;

		ubrk_first(brk_w);
		while (ubrk_next(brk_w) != UBRK_DONE)
			if (ubrk_getRuleStatus(brk_w) ==
			    UBRK_WORD_LETTER)
				nword++;

		/* count the newline if it exists */
		if (len > 0 && NEWLINE(line[len-1]))
			nline++;
	}

	printf("locale  : %s\n"
	       "Grapheme: %zu\n"
	       "Word    : %zu\n"
	       "Line    : %zu\n",
	       locale, ngraph, nword, nline);

	/* clean up iterators after use */
	ubrk_close(brk_g);
	ubrk_close(brk_w);
	u_fclose(in);
	return EXIT_SUCCESS;
}

Much better:

$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | ./uwc
locale  : en_US.UTF-8
Grapheme: 14
Word    : 4
Line    : 1

When comparing strings, we can be more or less strict. A familiar example is case sensitivity, but Unicode provides other options. Comparing strings for equality is a degenerate case of sorting, where the strings must not only be determined as equal, but put in order. Sorting is called “collation” and the Unicode collation algorithm supports multiple levels of increasing strictness.

Level       Description
Primary     base characters
Secondary   accents
Tertiary    case/variant
Quaternary  punctuation

Each level acts as a tie-breaker when strings match in previous levels. When searching we can choose how deep to check before declaring strings equal. To illustrate, consider a text file called words.txt containing these words:

Cooperate
coöperate
COÖPERATE
co-operate
ﬁnal
fides

We will write a program called ugrep, where we can specify a comparison level and search string. If we search for “cooperate” and allow comparisons up to the tertiary level, it matches nothing:

$ ./ugrep 3 cooperate < words.txt
# tertiary requires an exact match, so no results

It is possible to shift certain “ignorable” characters (like ‘-’) down to the quaternary level while conducting the original level 3 search:

$ ./ugrep 3i cooperate < words.txt
4: co-operate

Doing the same search at the secondary level disregards case, but is still sensitive to accents.

$ ./ugrep 2 cooperate < words.txt
1: Cooperate

Once again, we can allow ignorables at this level.

$ ./ugrep 2i cooperate < words.txt
1: Cooperate
4: co-operate

Finally, going only to the primary level, we match words with the same base letters, modulo case and accents.

$ ./ugrep 1 cooperate < words.txt
1: Cooperate
2: coöperate
3: COÖPERATE

Note that the idea of a “base character” is dependent on locale. In Swedish, the letters o and ö are quite distinct, and not minor variants as in English. Setting the locale prior to search restricts the results even at the primary level.

$ LC_COLLATE=sv_SE ./ugrep 1 cooperate < words.txt
1: Cooperate
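
To see the locale dependence in isolation, here is another small sketch, a hypothetical svtest.c under the same assumptions (UTF-8 source file, ucol_strcollUTF8 available). At primary strength, o and ö should compare equal under en_US but differ under sv_SE, where ö is a separate letter:

/*** svtest.c (hypothetical) ***/

#include <stdio.h>
#include <unicode/ucol.h>

/* compare two UTF-8 strings at primary strength in
 * the given locale; 0 means they are equal */
static int primary_cmp(const char *loc,
                       const char *a, const char *b)
{
	UErrorCode status = U_ZERO_ERROR;
	UCollator *col = ucol_open(loc, &status);
	int r;

	ucol_setStrength(col, UCOL_PRIMARY);
	r = (int)ucol_strcollUTF8(col, a, -1, b, -1, &status);
	ucol_close(col);
	return r;
}

int main(void)
{
	printf("en_US: %d\n", primary_cmp("en_US", "o", "ö"));
	printf("sv_SE: %d\n", primary_cmp("sv_SE", "o", "ö"));
}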

One note about the tertiary level: it distinguishes not just case, but also ligature presentation forms. In words.txt the word “ﬁnal” is written with the ﬁ ligature (U+FB01), a difference from plain “fi” that only the tertiary level can see.

$ ./ugrep 3 fi < words.txt
6: fides

# vs

$ ./ugrep 2 fi < words.txt
5: ﬁnal
6: fides

Pretty flexible, right? Let’s see the code.

/*** ugrep.c ***/

#include <locale.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/ucol.h>
#include <unicode/usearch.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 1024

int main(int argc, char **argv)
{
	char *locale;
	UFILE *in;
	UCollator *col;
	UStringSearch *srch = NULL;
	UErrorCode status = U_ZERO_ERROR;
	UChar *needle, line[BUFSZ];
	UColAttributeValue strength;
	int ignoreInsignificant = 0, asymmetric = 0;
	size_t n;
	long i;

	if (argc != 3)
	{
		fprintf(stderr,
			"Usage: %s {1,2,@,3}[i] pattern\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* cryptic parsing for our cryptic options */
	switch (*argv[1])
	{
		case '1':
			strength = UCOL_PRIMARY;
			break;
		case '2':
			strength = UCOL_SECONDARY;
			break;
		case '@':
			/* secondary strength plus asymmetric search:
			 * unaccented letters in the pattern also
			 * match accented letters in the text */
			strength = UCOL_SECONDARY, asymmetric = 1;
			break;
		case '3':
			strength = UCOL_TERTIARY;
			break;
		default:
			fprintf(stderr,
				"Unknown strength: %s\n", argv[1]);
			return EXIT_FAILURE;
	}
	/* length of argv[1] is >0 or we would have died */
	ignoreInsignificant = argv[1][strlen(argv[1])-1] == 'i';

	n = strlen(argv[2]) + 1;
	/* if UTF-8 could encode it in n, then UTF-16
	 * should be able to as well */
	needle = malloc(n * sizeof(*needle));
	u_strFromUTF8(needle, n, NULL, argv[2], -1, &status);

	/* searching is a degenerate case of collation,
	 * so we read the LC_COLLATE locale */
	if (!(locale = setlocale(LC_COLLATE, "")))
	{
		fputs("Cannot determine system collation locale\n",
		      stderr);
		return EXIT_FAILURE;
	}

	if (!(in = u_finit(stdin, NULL, NULL)))
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}

	col = ucol_open(locale, &status);
	ucol_setStrength(col, strength);

	if (ignoreInsignificant)
		/* shift ignorable characters down to
		 * quaternary level */
		ucol_setAttribute(col, UCOL_ALTERNATE_HANDLING,
		                  UCOL_SHIFTED, &status);

	/* Assumes every line fits in BUFSZ. Real code
	 * should handle longer lines, and should not count
	 * a continuation read as a new line number i */
	for (i = 1; u_fgets(line, BUFSZ, in); ++i)
	{
		/* first time through, set up all options */
		if (!srch)
		{
			srch = usearch_openFromCollator(
				needle, -1, line, -1,
				col, NULL, &status
			);
			if (asymmetric)
				usearch_setAttribute(
					srch, USEARCH_ELEMENT_COMPARISON,
					USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD,
					&status
				);
		}
		/* afterward just switch text */
		else
			usearch_setText(srch, line, -1, &status);

		/* check if keyword appears in line */
		if (usearch_first(srch, &status) != USEARCH_DONE)
			u_printf("%ld: %S", i, line);
	}

	usearch_close(srch);
	ucol_close(col);
	u_fclose(in);
	free(needle);

	return EXIT_SUCCESS;
}

Comparing strings modulo normalization

In the concepts section, we saw that a single grapheme can be constructed from different combinations of codepoints. When comparing strings for equality, we usually care whether the user perceives them the same way, not whether they match byte for byte.

The ICU library provides a unorm_compare function which returns a value like strcmp's, and acts in a normalization-independent way. It normalizes both strings incrementally while comparing them, so it can stop early when it finds a difference.

Here is code to check that the five ways of representing ộ are equivalent:

#include <stdio.h>
#include <unicode/unorm.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	UChar s[][4] = {
		{0x006f,0x0302,0x0323,0},
		{0x006f,0x0323,0x0302,0},
		{0x00f4,0x0323,0,0},
		{0x1ecd,0x0302,0,0},
		{0x1ed9,0,0,0}
	};

	const size_t n = sizeof(s)/sizeof(s[0]);
	size_t i;

	for (i = 0; i < n; ++i)
		printf("%zu == %zu: %d\n", i, (i+1)%n,
			unorm_compare(
				s[i], -1, s[(i+1)%n], -1, 0, &status));
}

Output:

0 == 1: 0
1 == 2: 0
2 == 3: 0
3 == 4: 0
4 == 0: 0

A return value of 0 means the strings are equal. (The 0 passed as the fifth argument is an options bitmask; adding U_COMPARE_IGNORE_CASE, for example, would also case-fold the strings while comparing.)

Confusable strings

Because Unicode introduces so many graphemes, there are more possibilities for scammers to confuse people using lookalike glyphs. For instance, domains like adoḅe.com or pаypal.com (with Cyrillic а) can direct unwary visitors to phishing sites. ICU contains an entire module for detecting “confusables,” those strings which are known to look too similar when rendered in common fonts. Each string is assigned a “skeleton” such that confusable strings get the same skeleton.
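
As a minimal sketch of the module (a hypothetical confus.c, assuming a UTF-8 source file and an ICU build with the spoof-detection module enabled), uspoof_areConfusableUTF8 reports whether two strings would be confused:

/*** confus.c (hypothetical) ***/

#include <stdio.h>
#include <unicode/uspoof.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	USpoofChecker *sc = uspoof_open(&status);
	int32_t r;

	/* the second string hides a Cyrillic а; the result
	 * is a bitmask of the ways the strings can be
	 * confused, or 0 if they cannot */
	r = uspoof_areConfusableUTF8(
		sc, "paypal.com", -1, "pаypal.com", -1, &status);

	printf("confusable: %s\n", r ? "yes" : "no");

	uspoof_close(sc);
}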

For an example, see my utility utofu. It has a little extra complexity from its SQLite access code, so I am not reproducing it here. It’s designed to check Unicode strings for changes over time that might indicate spoofing.

The method of operation is this:

  1. Read line as UTF-8
  2. Convert to Normalization Form C for consistency
  3. Calculate skeleton string
  4. Insert UTF-8 version of normalized input and its skeleton into a database if the skeleton doesn’t already exist
  5. Compare the normalized input string to the database string having the corresponding skeleton. If they are not an exact match, die with an error.
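
Steps 2 and 3 might look like this sketch, a hypothetical skeleton.c with a hard-coded input and fixed-size buffers; real code would check the status and returned lengths after each call:

/*** skeleton.c (hypothetical) ***/

#include <unicode/unorm2.h>
#include <unicode/uspoof.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

#define BUFSZ 256

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
	USpoofChecker *sc = uspoof_open(&status);
	UChar in[BUFSZ], nrm[BUFSZ], skel[BUFSZ];

	/* step 1: decode UTF-8 input to UTF-16 */
	u_strFromUTF8(in, BUFSZ, NULL, "adoḅe.com", -1, &status);

	/* step 2: normalize to NFC for consistency */
	unorm2_normalize(nfc, in, -1, nrm, BUFSZ, &status);

	/* step 3: compute the skeleton; the second argument
	 * (type) is deprecated and ignored, so pass 0 */
	uspoof_getSkeleton(sc, 0, nrm, -1, skel, BUFSZ, &status);

	u_printf("%S\n", skel);

	uspoof_close(sc);
	return U_SUCCESS(status) ? 0 : 1;
}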

Further reading

Unicode and internationalization form a huge topic; I could only scratch the surface in this article. I read and enjoyed sections from these books and reference materials, and would recommend them:
