For software developers, the world of hardware and firmware can be an exciting change. Firmware catapults your logic into the physical world. Rather than moving text between forms and a database, you can move motors. Rather than listening for an API call, you can listen for SONAR or GPS signals.
This is the guide I wish I had when first starting embedded development. It cultivates professional embedded programming habits from the start. We’ll skip the beginner ecosystem like Arduino, and get the most out of hardware with bare metal programming.
The low-level approach has several advantages. In particular, we target the ARM architecture, due to its popularity. While the examples use STMicroelectronics hardware, we avoid their vendor IDE and hardware abstraction layer (HAL). The principles in this guide work with chips from any ARM vendor. Rather than proprietary IDEs and libraries, we’ll use entirely open source tools in a Unix environment (like BSD, Linux, or macOS).
Using a strong foundation of toolchain and libraries, we’ll build the same simple “blinky” project in four different ways. We’ll see the boot-up sequence of CMSIS vs the standard library crt0 system. We’ll try writing the program with and without an RTOS, and try dynamic vs static memory allocation. We’ll also see an example of a fault handler, and how to do remote debugging.
By the end of the guide, you’ll be able to venture confidently into building, flashing, and debugging more complex projects. The guide constructs examples based on product datasheets and first principles; it’s not a copy of existing demos or code snippets.
Download the guide below. For the cost of a sandwich you’ll be up and running.
This article is a set of miscellaneous configuration and scripting tricks that illustrate reusable principles. It assumes you’re familiar with the basics of debugging, like breakpoints, stepping, inspecting variables, etc.
By default, GDB provides a terse line-based terminal. You need to explicitly ask to print the source code being debugged, the values of variables, or the current list of breakpoints. There are four ways to customize this interface. Ordered from basic to complicated, they are:

1. Adjust GDB settings in your .gdbinit.
2. Enable GDB’s built-in TUI (text user interface) mode.
3. Load an elaborate community-made configuration into .gdbinit. Some good examples are projects like gdb-dashboard and gef.
4. Run a separate front-end program that controls GDB.

In my experiments, the TUI mode (option two) seemed promising, but it has some limitations.
Ultimately I chose option four, with the Data Display Debugger (DDD). It’s fairly ancient, and requires configuration changes to work at all with recent versions of GDB. However, it has a lot of features delivered in a 3MB binary, with no library dependencies other than a Motif-compatible UI toolkit. DDD can also control GDB sessions remotely over SSH.
As a front-end, DDD translates user actions to text commands that it sends to GDB. Newer front-ends use GDB’s unambiguous machine interface (MI), but DDD never got updated for that. It parses the standard text interface, essentially screen scraping GDB’s regular output. This causes some problems, but there are workarounds.
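You can see the difference yourself by running gdb --interpreter=mi and issuing an MI command once a program is loaded. Each result comes back as a machine-readable record. An abbreviated sample (field details vary by GDB version):

-break-insert main
^done,bkpt={number="1",type="breakpoint",func="main",...}
(gdb)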
Upon starting DDD, the first serious error you’ll run into is the program locking up with this message:
Waiting until GDB gets ready...
The freeze happens because DDD is looking for the prompt (gdb). However, DDD
never sees that prompt because it incorrectly changed the prompt at startup.
To fix this error, you must explicitly set the prompt and unset the
extended-prompt. In ~/.ddd/init include this code:
Ddd*gdbSettings: \
unset extended-prompt\n\
set prompt (gdb) \n
The root of the problem is that during DDD’s first run, it probes all GDB settings, and saves them into its .ddd/init file for consistency in future runs. It probes by running show settingname for all settings. However, it misinterprets the results for a few settings, including the prompt ones.
The incorrect detection is especially bad for extended-prompt. GDB reports
the value as not set, which DDD interprets – not as the lack of a value –
but as text to set for the extended prompt. That text overrides the regular
prompt, causing GDB to output not set as its actual prompt.
As mentioned, DDD probes and saves all GDB settings during first launch. While
specifying all settings in ~/.ddd/init might make for deterministic behavior
on local and remote debugging sessions, it’s inflexible. I want ~/.gdbinit to
be the source of truth.
Thus you should:

1. remove all Ddd*gdbSettings other than the prompt ones above, and
2. set Ddd*saveOptionsOnExit: off to prevent DDD from putting the values back.

DDD’s default color scheme is a bit glaring. For dark mode in the code window, console, and data display panel, set these resources:
Ddd*XmText.background: black
Ddd*XmText.foreground: white
Ddd*XmTextField.background: black
Ddd*XmTextField.foreground: white
Ddd*XmList.background: black
Ddd*XmList.foreground: white
Ddd*graph_edit.background: #333333
Ddd*graph_edit.edgeColor: red
Ddd*graph_edit.nodeColor: white
Ddd*graph_edit.gridColor: white
By default, DDD uses X core fonts. All its resources, like Ddd*defaultFont,
can pick from only those legacy fonts, which don’t properly render UTF-8. For
proper rendering, we have to change the Motif rendering
table to use the newer
FreeType (XFT) fonts. Pick an XFT font you have on your system; I chose
Inconsolata:
Ddd*renderTable: rt
Ddd*rt*fontType: FONT_IS_XFT
Ddd*rt*fontName: Inconsolata
Ddd*rt*fontSize: 8
The change applies to all UI areas of the program except the data display
window. That window comes from an earlier codebase bolted on to DDD, and I
don’t know how to change its rendering. AFAICT, you can choose only legacy
fonts there, with Ddd*dataFont and Ddd*dataFontSize.
Although international graphemes are garbled in the data display window, you can inspect UTF-8 variables by printing them in the GDB console, or by hovering the mouse over variable names for a tooltip display.
DDD interacts with GDB through the terminal like a user would, so it can drive
debugging sessions over SSH just as easily as local sessions. It also knows how
to fetch remote source files, and find remote program PIDs to which GDB can
attach. DDD’s default program for running commands on a remote inferior is
remsh or rsh, but it can be customized to use SSH:
Ddd*rshCommand: ssh -t
In my experience, the -t is needed, or else GDB warnings and errors can
appear out of order with the (gdb) prompt, making DDD hang.
To debug a remote GDB over SSH, pass the --host option to DDD. I usually
include these command-line options:
ddd --debugger gdb --host admin@example.com --no-exec-window
(I specify the remote debugger command as gdb when it differs from my local
inferior debugger command of egdb from the OpenBSD
devel/gdb port.)
Beyond the basics of run, continue and next, don’t forget some other
handy commands.
- finish - execute until the current function returns, and break in the caller. Useful if you accidentally go too deep, or if the rest of a function is of no interest.
- until - execute until reaching a later line. You can use this on the last line of a loop to run through the rest of the iterations, break out, and stop.
- start - create a temporary breakpoint on the first line of main() and then run. Starts the program and breaks right away.
- step vs next - how to remember the difference? Think a flight of “steps” goes downward, “stepping down” into subroutines. Whereas “next” is the next contiguous source line.

GDB can be used non-interactively, with predefined scripts, to create little utility programs. For example, the poor man’s profiler is a technique of calling GDB repeatedly to sample the call stack of a running program. It sends the results to awk to tally where most wall clock time (as opposed to just CPU time) is being spent.
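The technique boils down to something like this sketch, assuming the target process ID is in $PID and that the function name is the fourth field of each “#1” backtrace line (adjust the awk to taste):

# sample the running process's stack every 0.2s, 20 times,
# then count which function appears most often in frame #1
for i in $(seq 1 20); do
  gdb -batch -ex "thread apply all bt" -p "$PID" 2>/dev/null
  sleep 0.2
done |
awk '$1 == "#1" { print $4 }' | sort | uniq -c | sort -rn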
A related idea is using GDB to print information about a core dump without leaving the UNIX command line. We can issue a single GDB command to list the backtraces for all threads, plus all stack frame variables and function arguments. Notice the print settings customized for clean, verbose output.
# show why program.core died
gdb --batch \
-ex "set print frame-arguments all" \
-ex "set print pretty on" \
-ex "set print addr off" \
-ex "thread apply all bt full" \
/path/to/program program.core

You can put this incantation (minus the final program and core file paths) into
a shell alias (like bt) so you can run it more easily. To test, you can
generate a core by running a program and sending it SIGQUIT with Ctrl-\.
Adjusting ulimit -c may also be necessary to save cores, depending on your
OS.
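A typical test session looks something like this (the exact “core dumped” message and core file name vary by OS):

# allow core files of unlimited size in this shell
ulimit -c unlimited
# run the program, then press Ctrl-\ to send SIGQUIT
./program
^\Quit (core dumped)
# inspect the result with the alias described above
bt ./program program.core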
GDB allows you to define custom commands that can do arbitrarily complex things. Commands can set breakpoints, display values, and even call to the shell.
Here’s an example that does a few of these things. It traces the system calls made by a single function of interest. The real work happens by shelling out to OpenBSD’s ktrace(1). (An equivalent tracing utility should exist for your operating system.)
define ktrace
# if a user presses enter on a blank line, GDB will by default
# repeat the command, but we don't want that for ktrace
dont-repeat
# set a breakpoint for the specified function, and run commands
# when the breakpoint is hit
break $arg0
commands
# don't echo the commands to the user
silent
# set a convenience variable with the result of a C function
set $tracepid = (int)getpid()
# eval (GDB 7.2+) interpolates values into a command, and runs it
eval "set $ktraceout=\"/tmp/ktrace.%d.out\"", $tracepid
printf "ktrace started: %s\n", $ktraceout
eval "shell ktrace -a -f %s -p %d", $ktraceout, $tracepid
printf "\nrun \"ktrace_stop\" to stop tracing\n\n"
# "finish" continues execution for the duration of the current
# function, and then breaks
finish
# After commands that continue execution, like finish does,
# we lose control in the GDB breakpoint. We cannot issue
# more commands here
end
# GDB automatically sets $bpnum to the identifier of the created breakpoint
set $tracebp = $bpnum
end
define ktrace_stop
dont-repeat
# consult $ktraceout and $tracebp set by ktrace earlier
eval "shell ktrace -c -f %s", $ktraceout
del $tracebp
printf "ktrace stopped for %s\n", $ktraceout
end
Here’s a demonstration with a simple program. It has two functions that involve different kinds of system calls:
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <unistd.h>
void delay(void)
{
sleep(1);
}
void alert(void)
{
puts("Hello");
}
int main(void)
{
alert();
delay();
}

After loading the program into GDB, here’s how to see which syscalls the
delay() function makes. Tracing is focused to just that function, and doesn’t
include the system calls made by any other functions, like alert().
(gdb) ktrace delay
Breakpoint 1 at 0x1a10: file sleep.c, line 7.
(gdb) run
Starting program: sleep
ktrace started: /tmp/ktrace.5432.out
run "ktrace_stop" to stop tracing
main () at sleep.c:20
(gdb) ktrace_stop
ktrace stopped for /tmp/ktrace.5432.out
The trace output is a binary file, and we can use kdump(1) to view it, like this:
$ kdump -f /tmp/ktrace.5432.out
5432 sleep CALL kbind(0x7f7ffffda6a8,24,0xa0ef4d749fb64797)
5432 sleep RET kbind 0
5432 sleep CALL nanosleep(0x7f7ffffda748,0x7f7ffffda738)
5432 sleep STRU struct timespec { 1 }
5432 sleep STRU struct timespec { 0 }
5432 sleep RET nanosleep 0
This shows that, on OpenBSD, sleep(3) calls nanosleep(2).
On a related note, another way to get insight into syscalls is by setting catchpoints to break on a call of interest. This is a Linux-only feature.
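For example, on a Linux host (the catchpoint number, and the bracketed syscall number, will vary by system):

(gdb) catch syscall nanosleep
Catchpoint 1 (syscall 'nanosleep' [35])
(gdb) run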
GDB gives special treatment to user-defined commands whose names begin with hook- or hookpost-. It runs hook-foo (hookpost-foo) automatically before (after) a user runs the command foo. In addition, a pseudo-command “stop” exists for when execution stops at a breakpoint.
As an example, consider automatic variable
displays. GDB
can automatically print the value of expressions every time the program stops
with, e.g. display varname. However, what if we want to display all local
variables this way?
There’s no direct expression to do it with display, but we can create a hook:
define hook-stop
# do it conditionally
if $display_locals_flag
# dump the values of all local vars
info locals
end
end
# commands to (de)activate the display
define display_locals
set $display_locals_flag = 1
end
define undisplay_locals
set $display_locals_flag = 0
end
To be fair, the TUI single key
mode
binds info locals to the v key, so our hook is less useful in TUI mode than
it first appears.
GDB exposes a Python
API for finer
control over the debugger. GDB scripts can include Python directly in
designated blocks. For instance, right in .gdbinit we can access the Python
API to get call stack frame information.
In this example, we’ll trace function calls matching a regex. If no regex is specified, we’ll match all functions visible to GDB, except low level functions (which start with underscore).
# drop into python to access frame information
python
# this module contains the GDB API
import gdb
# define a helper function we can use later in a user command
#
# it prints the name of the function in the specified frame,
# with indentation depth matching the stack depth
def frame_indented_name(frame):
# frame.level() is not always available,
# so we traverse the list and count depth
f = frame
depth = 0
while (f):
depth = depth + 1
f = f.older()
return "%s%s" % (" " * depth, frame.name())
end
# trace calls of functions matching a regex
define ftrace
dont-repeat
# we'll set possibly many breakpoints, so record the
# starting number of the group
set $first_new = 1 + ($bpnum ? $bpnum : 0)
if $argc < 1
# by default, trace all functions except those that start with
# underscore, which are low-level system things
#
# rbreak sets multiple breakpoints via a regex
rbreak ^[a-zA-Z]
else
# or match based on ftrace argument, if passed
rbreak $arg0
end
commands
silent
# drop into python again to use our helper function to
# print the name of the newest frame
python print(frame_indented_name(gdb.newest_frame()))
# then immediately keep going
cont
end
printf "\nTracing enabled. To disable, run:\n\tdel %d-%d\n", $first_new, $bpnum
end
To use ftrace, put breakpoints at either end of an area of interest. When you arrive at the first breakpoint, run ftrace with an optional regex argument. Then, continue the debugger and watch the output.
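A session looks something like this sketch (addresses and breakpoint numbers are made up):

(gdb) break tm_insert
Breakpoint 1 at 0x1f00
(gdb) run
...
Breakpoint 1, tm_insert (...)
(gdb) ftrace

Tracing enabled. To disable, run:
	del 2-64
(gdb) continue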
Here’s sample trace output from inserting a key-value into a treemap
(tm_insert()) in my libderp library.
You can see the “split” and “skew” operations happening in the underlying
balanced AA-tree.
tm_insert
malloc
omalloc
malloc
omalloc
map
insert
internal_tm_insert
derp_strcmp
internal_tm_insert
derp_strcmp
internal_tm_insert
derp_strcmp
internal_tm_insert
internal_tm_skew
internal_tm_split
internal_tm_skew
internal_tm_split
internal_tm_skew
internal_tm_split
GDB allows you to customize the way it displays values. For instance, you may want to inspect Unicode strings when working with the ICU library. ICU’s internal encoding for UChar is UTF-16. GDB has no way to know that an array ostensibly containing numbers is actually a string of UTF-16 code units. However, using the Python API, we can convert the string to a form GDB understands.
While a bit esoteric, this example provides the template you would use to create pretty printers for any type.
import gdb.printing, re
# a pretty printer
class UCharPrinter:
'Print ICU UChar string'
def __init__(self, val):
self.val = val
# tell gdb to print the value in quotes, like a string
def display_hint(self):
return 'string'
# the actual work...
def to_string(self):
p_c16 = gdb.lookup_type('char16_t').pointer()
return self.val.cast(p_c16).string('UTF-16')
# bookkeeping that associates the UCharPrinter with the types
# it can handle, and adds an entry to "info pretty-printer"
class UCharPrinterInfo(gdb.printing.PrettyPrinter):
# friendly name for printer
def __init__(self):
super().__init__('UChar string printer')
self._re = re.compile(r'^UChar [\[*]')
# is UCharPrinter appropriate for val?
def __call__(self, val):
if self._re.match(str(val.type)):
return UCharPrinter(val)

While it’s nice to create code such as the pretty printer above, the code won’t
do anything until we tell GDB how and when to load it. You can certainly dump
Python code blocks into your ~/.gdbinit, but that’s not very modular, and can
load things unnecessarily.
I prefer to organize the code in dedicated directories like this:
mkdir -p ~/.gdb/{py-modules,auto-load}

The ~/.gdb/py-modules is for user modules (like the ICU pretty printer), and
~/.gdb/auto-load is for scripts that GDB automatically loads at certain times.
Having created those directories, tell GDB to consult them. Add this to your
~/.gdbinit:
add-auto-load-safe-path /home/foo/.gdb
add-auto-load-scripts-directory /home/foo/.gdb/auto-load
Now, when GDB loads a library like /usr/lib/baz.so.x.y on behalf of your
program, it will also search for ~/.gdb/auto-load/usr/lib/baz.so.x.y-gdb.py
and load it if it exists. To see which libraries GDB loads for an application,
enable verbose mode, and then start execution.
(gdb) set verbose
(gdb) start
...
Reading symbols from /usr/libexec/ld.so...
Reading symbols from /usr/lib/libpthread.so.26.1...
Reading symbols from ...
On my machine for an application using ICU, GDB loaded
/usr/local/lib/libicuuc.so.20.1. To enable the ICU pretty printer, I create
an auto-load file:
# ~/.gdb/auto-load/usr/local/lib/libicuuc.so.20.1-gdb.py
import gdb.printing
import printers.libicuuc
gdb.printing.register_pretty_printer(
gdb.current_objfile(),
printers.libicuuc.UCharPrinterInfo())

The final question is how the auto-loader resolves the printers.libicuuc
module. We need to add ~/.gdb/py-modules to the Python system path. I use a
little trick: a file in the appropriate directory that detects its own location
and adds that to the syspath:
# ~/.gdb/py-modules/add-syspath.py
import sys, os
sys.path.append(os.path.dirname(os.path.realpath(__file__)))

Then just source the file from ~/.gdbinit:
source /home/foo/.gdb/py-modules/add-syspath.py
After doing that, save the ICU pretty printing code as
~/.gdb/py-modules/printers/libicuuc.py, and the import printers.libicuuc
statement will find it.
In addition to providing a graphical user interface, DDD has a few features of its own.
Each time the program stops at a breakpoint, DDD records the values of all displayed variables. You can place breakpoints strategically to sample the historical values of a variable, and then view or plot them on a graph.
For instance, compile this program with debugging information enabled, and load it in DDD:
int main(void)
{
unsigned x = 381;
while (x != 1)
x = (x % 2 == 0) ? x/2 : 3*x + 1;
return 0;
}

Double click to the left of the x = ... line to set a breakpoint. Right
click the stop sign icon that appears, and select Properties…. In the
dialog box, click Edit >> and enter continue into the text box. Apply
your change and close the dialog. This breakpoint will stop, record the value
of x, then immediately continue running.
Set a breakpoint on the return 0 line.
Select GDB console from the View menu (or press Alt-1).
Run start in the GDB console to run the program and break at the
first line.
Double click the “x” variable to add it to the graphical display. (If you don’t put it in the display window, DDD won’t track its values over time.)
Select Continue from the Program menu (or press F9). You’ll see the
displayed value of x updating rapidly.
When execution stops at the last breakpoint, run graph history x in the
GDB console. It will output an array of all previous values:
(gdb) graph history x
history x = {0, 381, 1144, 572, 286, 143, 430, 215, 646, 323, 970, 485,
1456, 728, 364, 182, 91, 274, 137, 412, 206, 103, 310, 155, 466, 233, 700, 350,
175, 526, 263, 790, 395, 1186, 593, 1780, 890, 445, 1336, 668, 334, 167, 502,
251, 754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 958, 479, 1438, 719,
2158, 1079, 3238, 1619, 4858, 2429, 7288, 3644, 1822, 911, 2734, 1367, 4102,
2051, 6154, 3077, 9232, 4616, 2308, 1154, 577, 1732, 866, 433, 1300, 650, 325,
976, 488, 244, 122, 61, 184, 92, 46, 23, 70, 35, 106, 53, 160, 80, 40, 20, 10,
5, 16, 8, 4, 2, 1}
To see the values plotted graphically, run
graph plot `graph display x`
DDD sends the data to gnuplot to render the graph. (Be sure to set
Ddd*plotTermType: x11 in ~/.ddd/init, or else DDD will hang with a dialog
saying “Starting Gnuplot…”.)
DDD has some shortcuts that aren’t obvious from the interface, but which I found interesting in the documentation.
By taking time to learn general-purpose parsing tools, you can go beyond fragile homemade solutions, and inflexible third-party libraries. We’ll cover Lex and Yacc in this guide because they are mature and portable. We’ll also cover their later incarnations as Flex and Bison.
Above all, this guide is practical. We’ll see how to properly integrate parser
generators into your build system, how to create thread-safe parsing modules,
and how to parse real data formats. I’ll motivate each feature of the parser
generator with a concrete problem it can solve. And, I promise, none of the
typical calculator examples.
Table of contents
People usually use two stages to process structured text. The first stage, lexing (aka scanning), breaks the input into meaningful chunks of characters. The second, parsing, groups the scanned chunks following potentially recursive rules. However, a nice lexing tool like Lex can be useful on its own, even when not paired with a parser.
The simplest way to describe Lex is that it runs user-supplied C code blocks for regular expression matches. It reads a list of regexes and constructs a giant state machine which attempts to match them all “simultaneously.”
A lex input file is composed of three possible sections: definitions, rules,
and helper functions. The sections are delimited by %%. Lex transforms its
input file into a plain C file that can be built using an ordinary C compiler.
Here’s an example. We’ll match the strings cot, cat, and cats. Our
actions will print a replacement for each.
/* catcot.l */
%{
#include <stdio.h>
%}
%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }

To build it:
# turn the input into an intermediate C file
lex -t catcot.l > catcot.c
# compile it
cc -o catcot catcot.c -ll

(Alternately, build it in one step with make catcot. Even in the absence of a
Makefile, POSIX make has suffix
rules
that handle .l files.)
The program outputs simple substitutions:
echo "the cat on the cot joined the cats" | ./catcot
the thankless pet on the portable bed joined the anti-herd

The reason it prints non-matching words (such as “the”) is that there’s an
implicit rule matching any character (.) and echoing it. In most real parsers
we’ll want to override that.
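One way to override it, which later examples in this article use, is a catch-all rule of your own at the end of the rules section. A sketch:

. ; /* match any other single character, and discard it */

With Flex you can alternatively add %option nodefault, which removes the implicit rule entirely and makes unmatched input an error.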
Here’s what’s happening inside the scanner. Lex reads the regexes and generates a state machine to consume input. Below is a visualization of the states, with transitions labeled by input character. The circles with a double outline indicate states that trigger actions.
Note there’s no notion of word boundaries in our lexer, it’s operating on characters alone. For instance:
echo "catch!" | ./catcot
thankless petch!

That sounds rather like an insult.
An important subtlety is how Lex handles multiple eligible matches. It picks the longest possible match available, and in the case of a tie, picks the matching pattern defined earliest.
To illustrate, suppose we add a looser regex, c.t, first.
%%
c.t { printf("mumble mumble"); }
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }

Lex detects that the rule masks cat and cot, and outputs a warning:
catcot.l:10: warning, rule cannot be matched
catcot.l:11: warning, rule cannot be matched
It still compiles though, and behaves like this:
echo "the cat on the cot joined the cats" | ./catcot
the mumble mumble on the mumble mumble joined the anti-herd
Notice that it still matched cats, because cats is longer than c.t.
Compare what happens if we move the loose regex to the end of our rules. It can then pick up whatever strings get past the others.
%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }
c.t { printf("mumble mumble"); }

It acts like this:
echo "cut the cot" | ./catcot
mumble mumble the portable bed
Now’s a good time to take a detour and observe how our user-defined code acts
in the generated C file. Lex creates a function called yylex(), and inserts
the code blocks verbatim into a switch statement. When using lex with a parser,
the parser will call yylex() to retrieve tokens, named by integers. For now,
our user-defined code isn’t returning tokens to a parser, but doing simple
print statements.
/* catcot.c (generated by lex) */
int yylex (void)
{
/* ... */
switch ( yy_act )
{
/* ... */
case 1:
YY_RULE_SETUP
#line 9 "catcot.l"
{ printf("portable bed"); }
YY_BREAK
case 2:
YY_RULE_SETUP
#line 10 "catcot.l"
{ printf("thankless pet"); }
YY_BREAK
case 3:
YY_RULE_SETUP
#line 11 "catcot.l"
{ printf("anti-herd"); }
YY_BREAK
/* ... */
}
/* ... */
}

As mentioned, a lex file is composed of three sections:
DEFINITIONS
%%
RULES
%%
HELPER FUNCTIONS
The definitions section is where you can embed C code to include headers and declare functions used in rules. The definitions section can also define friendly names for regexes that can be reused in the rules.
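For instance, this hypothetical fragment names a regex D in the definitions section, then expands it with curly brackets in a rule (the c.l example later uses the same technique):

D [0-9]
%%
{D}+ { printf("an integer"); }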
The rules section, as we saw, contains a list of regexes and associated user code.
The final section is where to put the full definitions of helper functions.
This is also where you’d put the main() function. If you omit main(), the
Lex library provides one that simply calls yylex(). This default main()
implementation (and implementations for a few other functions) is available by
linking your lex-generated C code with the -ll compiler flag.
Let’s see a short, fun example: converting Roman numerals to decimal. Thanks to lex’s behavior of matching longer strings first, it can read the single-letter numerals, but look ahead for longer subtractive forms like “IV” or “XC.”
/* roman-lex.l */
/* the %{ ... %} enclose C blocks that are copied
into the generated code */
%{
#include <stdio.h>
#include <stdlib.h>
/* globals are visible to user actions and main() */
int total;
%}
%%
/*<- notice the whitespace before this comment, which
is necessary for comments in the rules section */
/* the basics */
I { total += 1; }
V { total += 5; }
X { total += 10; }
L { total += 50; }
C { total += 100; }
D { total += 500; }
M { total += 1000; }
/* special cases match with preference
because they are longer strings */
IV { total += 4; }
IX { total += 9; }
XL { total += 40; }
XC { total += 90; }
CD { total += 400; }
CM { total += 900; }
/* ignore final newline */
\n ;
/* but die on anything else */
. {
fprintf(stderr, "unexpected: %s\n", yytext);
exit(EXIT_FAILURE);
}
%%
/* provide our own main() rather than the implementation
from lex's library linked with -ll */
int main(void)
{
/* only have to call yylex() once, since our
actions don't return */
yylex();
fprintf(yyout, "%d\n", total);
return EXIT_SUCCESS;
}

Now that we’ve seen Lex’s basic operation in the previous section, let’s consider a useful example: syntax highlighting. Detecting keywords in syntax is a problem that lex can handle by itself, without help from yacc.
Because lex and yacc are so old (predating C), and used in so many projects, you can find grammars already written for most languages. For instance, we’ll take quut’s C specification for lex, and modify it to do syntax highlighting.
This relatively short program accurately handles the full complexity of the language. It’s easiest to understand by reading in full. See the inline comments for new and subtle details.
/* c.l syntax highlighter */
%{
/* POSIX for isatty, fileno */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
/* declarations are visible to user actions */
enum FG
{
fgRED = 31, fgGREEN = 32,
fgORANGE = 33, fgCYAN = 36,
fgDARKGREY = 90, fgYELLOW = 93
};
void set_color(enum FG);
void reset_color(void);
void color_print(enum FG, const char *);
void consume_comment(void);
%}
/* named regexes we can use in rules */
O [0-7]
D [0-9]
NZ [1-9]
L [a-zA-Z_]
A [a-zA-Z_0-9]
H [a-fA-F0-9]
HP (0[xX])
E ([Ee][+-]?{D}+)
P ([Pp][+-]?{D}+)
FS (f|F|l|L)
IS (((u|U)(l|L|ll|LL)?)|((l|L|ll|LL)(u|U)?))
CP (u|U|L)
SP (u8|u|U|L)
ES (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))
WS [ \t\v\n\f]
%%
/* attempting to match and capture an entire multi-line
comment could strain lex's buffers, so we match the
beginning, and call consume_comment() to deal with
the ensuing characters, in our own less resource-
intensive way */
"/*" {
set_color(fgDARKGREY);
/* For greater flexibility, we'll output to lex's stream, yyout.
It defaults to stdout. */
fputs(yytext, yyout);
consume_comment();
reset_color();
}
/* single-line comments can be handled the default way.
The yytext variable is provided by lex and points
to the characters that match the regex */
"//".* {
color_print(fgDARKGREY, yytext);
}
^[ \t]*#.* {
color_print(fgRED, yytext);
}
/* you can use the same code block for multiple regexes */
auto |
bool |
char |
const |
double |
enum |
extern |
float |
inline |
int |
long |
register |
restrict |
short |
size_t |
signed |
static |
struct |
typedef |
union |
unsigned |
void |
volatile |
_Bool |
_Complex {
color_print(fgGREEN, yytext);
}
break |
case |
continue |
default |
do |
else |
for |
goto |
if |
return |
sizeof |
switch |
while {
color_print(fgYELLOW, yytext);
}
/* we use the named regexes heavily below; putting
them in curly brackets expands them */
{L}{A}* {
/* without this rule, keywords within larger words
would be highlighted, like the "if" in "life" --
this rule prevents that because it's a longer match */
fputs(yytext, yyout);
}
{HP}{H}+{IS}? |
{NZ}{D}*{IS}? |
"0"{O}*{IS}? |
{CP}?"'"([^'\\\n]|{ES})+"'" |
{D}+{E}{FS}? |
{D}*"."{D}+{E}?{FS}? |
{D}+"."{E}?{FS}? |
{HP}{H}+{P}{FS}? |
{HP}{H}*"."{H}+{P}{FS}? |
{HP}{H}+"."{P}{FS}? {
color_print(fgCYAN, yytext);
}
({SP}?\"([^"\\\n]|{ES})*\"{WS}*)+ {
color_print(fgORANGE, yytext);
}
/* explicitly mention the default rule */
. ECHO;
%%
/* definitions of the functions we declared earlier */
/* the color functions use ANSI escape codes, and may
not be portable across all terminal emulators. */
void set_color(enum FG c)
{
fprintf(yyout, "\033[%d;1m", c);
}
void reset_color(void)
{
fputs("\033[0m", yyout);
}
void color_print(enum FG c, const char *s)
{
set_color(c);
fputs(s, yyout);
reset_color();
}
/* this function directly consumes characters in lex
using the input() function. It pulls characters
from the same stream that the regex state machine
reads. */
void consume_comment(void)
{
int c;
/* EOF in lex is 0, which is different from
the EOF macro in the C standard library */
while ((c = input()) != 0)
{
putchar(c);
if (c == '*')
{
while ((c = input()) == '*')
putchar(c);
if (c == 0) break;
putchar(c);
if (c == '/') return;
}
}
}
int main(void)
{
if (!isatty(fileno(stdout)))
{
/* a more flexible option would be to make the
color changing functions do nothing, but that's
too much fuss for an example program */
fputs("Stdout is not a terminal\n", stderr);
return EXIT_FAILURE;
}
/* since we'll be changing terminal color, be sure to
reset it for any program termination event */
atexit(reset_color);
/* let our lex rules do the rest */
yylex();
return EXIT_SUCCESS;
}

One of the biggest areas of improvement between classic lex/yacc and flex/bison is the ability of the latter to generate code that’s easier to embed into a larger application. Lex and yacc are designed to create standalone programs, with user-defined code blocks stuck inside. When classic lex and yacc work together, they use a bunch of global variables.
Flex and Bison, on the other hand, can generate thread-safe functions with uniquely prefixed names that can be safely linked into larger programs. To demonstrate, we’ll do another scanner (with Flex this time).
The following Rube Goldberg contraption uses Flex to split words on whitespace and call a user-supplied callback for each word. There’s certainly an easier non-Flex way to do this task, but this example illustrates how to encapsulate Flex code into a reusable library.
/* words.l */
/* don't generate functions we don't need */
%option nounput noinput noyywrap
/* generate a scanner that's thread safe */
%option reentrant
/* Generate "words" rather than "yy" as a prefix, e.g.
wordslex() rather than yylex(). This allows multiple
Flex scanners to be linked with the same application */
%option prefix="words"
%%
[^ \t\n]+ {
/* the return statement causes yylex to stop and return */
return 1; /* our code for a word token */
}
/* do nothing for any other characters, don't
output them as would be the default behavior */
.|\n ;
%%
/* Callers interact with this function, which neatly hides
the Flex inside.
Also, we'll call "yy" functions like "yylex()" inside,
and Flex will rename them in the resulting C file to
calls with the "words" prefix, like "wordslex()"
Zero return means success, nonzero is a Flex error
code. */
int words_callback(char *s, void (*f)(const char *))
{
/* in the reentrant mode, we maintain our
own scanner and its associated state */
int i;
yyscan_t scanner;
YY_BUFFER_STATE buf;
if ((i = yylex_init(&scanner)) != 0)
return i;
/* read from a string rather than a stream */
buf = yy_scan_string(s, scanner);
/* Each time yylex finds a word, it returns nonzero.
It resumes where it left off when we call it again */
while ((i = yylex(scanner)) > 0)
{
/* call the user supplied function f with
yytext of the match */
f(yyget_text(scanner));
}
/* clean up */
yy_delete_buffer(buf, scanner);
yylex_destroy(scanner);
return 0;
}

Build it like this:
# generate scanner, build object file
flex -t words.l > words.c
cc -c words.c
# verify that all public text symbols are prefixed by "words"
nm -g words.o | grep " T "

If you compile with more warnings enabled, the compiler will complain about “unused parameter yyscanner” in several functions. Flex’s reentrant mode adds this parameter to the functions, and the default implementation doesn’t use it.
To fix the warnings, we can provide our own definitions. First, disable some of Flex’s auto-generated functions. Add these options to your lex input file:
%option noyyalloc noyyfree noyyrealloc
Provide the implementations yourself down by words_callback, and add the macro in a code block up by the %options.
/* add in a code block by the %options */
#define YY_EXIT_FAILURE ((void)yyscanner, EXIT_FAILURE)
/* add definitions down by words_callback */
void *wordsalloc(size_t size, void *yyscanner)
{
(void) yyscanner;
return malloc(size);
}
void *wordsrealloc(void * ptr, size_t size, void *yyscanner)
{
(void) yyscanner;
return realloc(ptr, size);
}
void wordsfree(void *ptr, void *yyscanner)
{
(void) yyscanner;
free(ptr);
}
A calling program can use our library without seeing any Flex internals.
/* test_words.c */
#include <stdio.h>
/* words_callback defined in the object file -- you could put
this declaration in a header file words.h */
int words_callback(char *, void (*)(const char *));
void print_word(const char *w)
{
puts(w);
/* if you want to use the parameter w in the future, you
need to duplicate it in memory whose lifetime you control */
}
int main(void)
{
words_callback(
"The quick brown fox\n"
"jumped over the lazy dog\n",
&print_word
);
return 0;
}

To build the program, just link it with words.o.
cc -o test_words test_words.c words.o

Now that we’ve seen how to identify tokens with a scanner, let’s learn how a parser can act on the tokens using recursive rules. Yacc/byacc/bison are LALR (look-ahead LR) parsers, and Bison supports more powerful modes if desired.
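For example, assuming GNU Bison, a single declaration switches the generated parser to the generalized LR (GLR) algorithm, which can handle grammars that LALR cannot:

%glr-parser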
LR parsers build bottom-up toward a goal, shifting tokens onto a stack and combining (“reducing”) them according to rules. It’s helpful to get a mental model for this process, so let’s jump into a simple example and simulate what yacc does.
Here’s a yacc grammar with a single rule to build a result called foo. We specify that foo is comprised of lex tokens A, B, and C.
%token A B C
%%
foo: A B C

Yacc transforms the grammar into a state machine which looks like this:
The first rule in the file (and the only rule in our case) becomes yacc’s
goal. Yacc begins in state 0, with the implicit rule 0: $accept: • foo $end.
The parse will be accepted if we can produce a foo followed immediately by
the end of input. The bullet point indicates our progress reading the input. In
state 0 it’s at the beginning, meaning we haven’t read anything yet.
Initially there’s no lookahead token, so yacc calls yylex() to get one. If
lex produces an A, we follow the state transition to state 1. Because the arrow
is a solid line, not dashed, yacc “shifts” the token to its token stack. It also
pushes state 1 onto a state stack, which now holds states 0 and 1.
State 1 is trying to satisfy the rule which it calls rule 1, namely 1 foo: A • B C. The bullet point after the A indicates we’ve seen the A already. Don’t
confuse the state numbers and rule numbers – yacc numbers them independently.
Yacc continues processing input, shifting tokens and moving to states 3 and 5 if lex produces the expected tokens. If, at any point, lex produces a token not matching any transitions in the current state, then yacc reports a syntax error and terminates. (There’s a way to do error recovery, but that’s another topic.)
State 5 has seen all necessary tokens for rule 1: 1 foo: A B C •. Yacc
continues to the diamond marked “R1,” which is a reduction action. Yacc
“reduces” rule 1, popping the A, B, C terminal tokens off the stack and pushing
a single non-terminal foo token. When it pops the three tokens, it pops the
same number of states (states 5, 3, and 1). Popping three states lands us back
in state 0.
State 0 has a dashed line going to state 2 that matches the foo token that was just reduced. The dashed line means “goto” rather than “shift,” because rule 0 doesn’t have to shift anything onto the token stack. The previous reduction already took care of that.
Finally, state 2 asks lex for another token, and if lex reports EOF, that
matches $end and sends us to state 4, which ties a ribbon on it with the
Acc(ept) action.
From what we’ve seen so far, each state may seem to be merely tracking progress through a single rule. However, states actually track all legal ways forward from tokens previously consumed. A single state can track multiple candidate rules. For instance:
%token A B C
%%
/* foo is either x or y */
foo: x | y;
/* x and y both start with an A */
x: A B;
y: A C;

For this grammar, yacc produces the following state machine:
In state 1 we’ve seen token A, and so rules 3 and 4 are both in the running to reduce an x or y. On a B or C token, the possibilities narrow to a single rule (in state 5 or 6).
Also notice that our rule foo : x | y doesn’t occur verbatim in any states.
Yacc separates it into 1 foo: x and 2 foo: y. Thus, the numbered rules
don’t always match the rules in the grammar one-to-one.
Yacc can also peek ahead by one token to choose which rule to reduce, without shifting the “lookahead” token. In the following grammar, rules x and y match the same tokens. However, the foo rule can say to choose x when followed by a B, or y when followed by a C:
%token A B C
%%
foo : x B | y C;
x : A;
y : A;

Note the multiple reductions coming out of state 1 in the generated state machine:
The presence of a bracketed token ([C]) exiting state 1 indicates that the
state uses lookahead. If the state sees token C, it reduces rule 4. Otherwise
it reduces rule 3. Lookahead tokens remain to be read when following a
dashed-line (goto) action, such as from state 0 to state 4.
While yacc is a powerful tool to transform a grammar into a state machine, it may not operate the way you intend on ambiguous grammars. These are grammars with a state that could proceed in more than one way with the same input.
As grammars get complicated, it’s quite possible to create ambiguities. Let’s look at small examples that make it easier to see the mechanics of the conflict. That way, when it happens in a real grammar, we’ll have a better feeling for it.
In the following example, the input A B matches both x and y B. There’s
no reason for yacc to choose one construction over the other when reducing to
foo. So why does this matter, you ask? Don’t we get to foo either way? Yes,
but real parsers will have different user code assigned to run per rule, and it
matters which code block gets executed.
%token A B
%%
foo : x | y B ;
x : A B ;
y : A ;

The state machine shows ambiguity at state 1:
At state 1, when the next token is B, the state could shift the token and enter state 5 (attempting to reduce x). It could also reduce y and leave B as lookahead. This is called a shift/reduce conflict. Yacc’s policy in such a conflict is to favor a shift over a reduce.
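Running yacc on this grammar produces a warning on stderr. The exact wording varies by implementation; byacc prints something like:

yacc: 1 shift/reduce conflict.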
Alternately, we can construct a grammar with a state that has more than one
eligible reduction for the same input. The purest toy example would be foo : A | A, generating:
In a reduce/reduce conflict, yacc chooses to reduce the conflicting rule presented earlier in the grammar.
While matching tokens, parsers typically build a user-defined value in memory to represent features of the input. Once the parse reaches the goal state and succeeds, then the user code will act on the memory value (or pass it along to a calling program).
Yacc stores the semantic values from parsed tokens in variables ($1,
$2, …) accessible to code blocks, and it provides a variable ($$) for
assigning the semantic result of the current code block.
Let’s see it in action. We won’t do a hackneyed calculator, but let’s still make a parser that operates on integers. Integer values allow us to avoid thinking about memory management.
We’ll revisit the roman numeral example, and this time let lex match the digits while yacc combines them into a final result. It’s actually more cumbersome than our earlier way, but illustrates how to work with semantic parse values.
There are some comments in the example below about portability between yacc variants. The three most prominent variants, in order of increasing features, are: the POSIX interface matching roughly the AT&T yacc functionally, byacc (Berkeley Yacc), and GNU Bison.
/* roman.y (plain yacc) */
%{
#include <stdio.h>
/* declarations to fix warnings from sloppy
yacc/byacc/bison code generation. For instance,
the code should have a declaration of yylex. */
int yylex(void);
/* The POSIX specification says yyerror should return
int, although bison documentation says the value is
ignored. We match POSIX just in case. */
int yyerror(const char *s);
%}
/* tokens our lexer will produce */
%token NUM
%%
/* The first rule is the final goal. Yacc will work
backward trying to arrive here. This "results" rule
is a stub we use to print the value from "number." */
results :
number { fprintf(yyout, "%d\n", $1); }
;
/* as the lexer produces more NUMs, keep adding them */
number :
/* this is a common pattern for saying number is one or
more NUMs. Notice we specify "number NUM" and not
"NUM number". In yacc recursion, think "right is wrong
and left is right." */
number NUM { $$ = $1 + $2; }
/* base case, using default rule of $$ = $1 */
| NUM
;

The corresponding lexer matches individual numerals, and returns them with their semantic values.
/* roman.l */
%{
/* The .tab.h file is generated by yacc, and we'll explain
it later */
#include "roman.tab.h"
/* lex communicates semantic token values to yacc through
a shared global variable */
extern int yylval;
%}
/* when using flex (rather than vanilla lex) fix
unused function warnings by adding:
%option noinput nounput
*/
%%
/* The constant for NUM comes from roman.tab.h,
and was generated because we declared
"%token NUM" in roman.y */
I { yylval = 1; return NUM; }
V { yylval = 5; return NUM; }
X { yylval = 10; return NUM; }
L { yylval = 50; return NUM; }
C { yylval = 100; return NUM; }
D { yylval = 500; return NUM; }
M { yylval = 1000; return NUM; }
IV { yylval = 4; return NUM; }
IX { yylval = 9; return NUM; }
XL { yylval = 40; return NUM; }
XC { yylval = 90; return NUM; }
CD { yylval = 400; return NUM; }
CM { yylval = 900; return NUM; }
/* ignore final newline */
\n ;
/* As a default action, return the ascii value of
the character as if it were a token identifier.
The values from roman.tab.h are offset above 256 to
be above any ascii value, so there's no ambiguity.
Our parser won't be expecting these values, so
they will lead to a syntax error */
. { return *yytext; }

To review: lex generates a yylex() function, and yacc generates yyparse() that
calls yylex() repeatedly to get new token identifiers. Lex actions copy
semantic values to yylval which Yacc copies into $-variables accessible in
parser rule actions.
Building an executable roman from the input files roman.y and roman.l
requires explanation. With appropriate command line flags, yacc will create the
files roman.tab.c and roman.tab.h from roman.y. Lex will create
roman.lex.c from roman.l, using token identifiers in roman.tab.h.
In short, the build dependencies are: the executable roman needs roman.tab.o and roman.lex.o; roman.tab.c and roman.tab.h come from roman.y; roman.lex.c comes from roman.l; and roman.lex.o depends on both roman.lex.c and roman.tab.h.
And here’s how to express it all in a Makefile.
# put together object files from lexer and parser, and
# link the yacc and lex libraries (in that order, to pick
# main() from yacc's library rather than lex's)
roman : roman.tab.o roman.lex.o
$(CC) -o $@ roman.tab.o roman.lex.o -ly -ll
# tell make which files yacc will generate
#
# an explanation of the arguments:
# -b roman - name the files roman.tab.*
# -d - generate a .tab.h file too
roman.tab.h roman.tab.c : roman.y
$(YACC) -d -b roman $?
# the object file relies on the generated lexer, and
# on the token constants
roman.lex.o : roman.tab.h roman.lex.c
# can't use the default suffix rule because we're
# changing the name of the output to .lex.c
roman.lex.c : roman.l
$(LEX) -t $? > $@

And now, the moment of truth:
$ make
$ echo MMMCMXCIX | ./roman
3999

In this example we’ll parse LISP S-expressions, limited to string and integer atoms. There’s more going on in this one, such as memory management, different semantic types per token, and packaging the lexer and parser together into a single thread-safe library. This example requires Bison.
/* lisp.y (requires Bison) */
/* a "pure" api means communication variables like yylval
won't be global variables, and yylex is assumed to
have a different signature */
%define api.pure true
/* change prefix of symbols from yy to "lisp" to avoid
clashes with any other parsers we may want to link */
%define api.prefix {lisp}
/* generate much more meaningful errors rather than the
uninformative string "syntax error" */
%define parse.error verbose
/* Bison offers different %code insertion locations in
addition to yacc's %{ %} construct.
The "top" location is good for headers and feature
flags like the _XOPEN_SOURCE we use here */
%code top {
/* XOPEN for strdup */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Bison versions 3.7.5 and above provide the YYNOMEM
macro to allow our actions to signal the unlikely
event that they couldn't allocate memory. Thanks
to the Bison team for adding this feature at my
request. :) YYNOMEM causes yyparse() to return 2.
The following conditional define allows us to use
the functionality in earlier versions too. */
#ifndef YYNOMEM
#define YYNOMEM goto yyexhaustedlab
#endif
}
/* The "requires" code location is designed for defining
data types that we can use as yylval's for tokens. Code
in this section is also added to the .tab.h file for
inclusion by calling code */
%code requires {
enum sexpr_type {
SEXPR_ID, SEXPR_NUM, SEXPR_PAIR, SEXPR_NIL
};
struct sexpr
{
enum sexpr_type type;
union
{
int num;
char *id;
} value;
struct sexpr *left, *right;
};
}
/* These are the semantic types available for tokens,
which we name num, str, and node.
The %union construction is classic yacc as well. It
generates a C union and sets its as the YYSTYPE, which
will be the type of yylval */
%union
{
int num;
char *str;
struct sexpr *node;
}
/* Add another argument in yyparse() so that we
can communicate the parsed result to the caller.
We can't return the result directly, since the
return value is already reserved as an int, with
0=success, 1=error, 2=nomem
NOTE
In our case, the param is a data pointer. However,
if it were a function pointer (such as a callback),
then its type would have to be put behind a typedef,
or else parse-param will mangle the declaration. */
%parse-param {struct sexpr **result}
/* param adds an extra param to yyparse (like parse-param)
but also causes yyparse to send the value to yylex.
In our case the caller will initialize their own scanner
instance and pass it through */
%param {void *scanner}
/* the "provides" location adds the code to our generated
parser, but also to the .tab.h file for use by callers */
%code provides {
void sexpr_free(struct sexpr *s);
}
/* unqualified %code is for internal use, things that
our actions can see. These declarations prevent
warnings. Notice the final param in each that came
from the %param directive above */
%code {
int lisperror(void *foo, char const *msg, const void *s);
int lisplex(void *lval, const void *s);
}
/* Now when we declare tokens, we add their type
in brackets. The type names come from our %union */
%token <str> ID
%token <num> NUM
/* whereas tokens come from the lexer, these
non-terminals are defined in the parser, and
we set their types with %type */
%type <node> start sexpr pair list members atom
/* if there's an error partway through parsing, the
caller wouldn't get a chance to free memory for
the work in progress. Bison will clean up the memory
if we provide destructors, though. */
%destructor { free($$); } <str>
%destructor { sexpr_free($$); } <node>
%%
/* once again we use a dummy non-terminal to perform
a side-effect, in this case setting *result */
start :
sexpr { *result = $$ = $1; return 0; }
;
sexpr :
atom
| list
| pair
;
list :
/* This is a shortcut: we use the ascii value for
parens '('=40, ')'=41 as their token codes.
Thus we don't have to define a bunch of crap
manually like LPAREN, RPAREN */
'(' members ')' { $$ = $2; }
| '('')' {
struct sexpr *nil = malloc(sizeof *nil);
if (!nil) YYNOMEM;
*nil = (struct sexpr){.type = SEXPR_NIL};
$$ = nil;
}
;
members :
sexpr {
struct sexpr *s = malloc(sizeof *s),
*nil = malloc(sizeof *nil);
if (!s || !nil) {
free(s), free(nil);
YYNOMEM;
}
*nil = (struct sexpr){.type = SEXPR_NIL};
/* convention: we assume that a previous parser
value like $1 is non-NULL, else it would have
died already with YYNOMEM. We're responsible
for checking only our own allocations */
*s = (struct sexpr){
.type = SEXPR_PAIR,
.left = $1,
.right = nil
};
$$ = s;
}
| sexpr members {
struct sexpr *s = malloc(sizeof *s);
/* Another important memory convention: we
can't trust that our lexer successfully
allocated its yylvalue, because the signature
of yylex doesn't communicate failure. We
assume NULL in $1 means alloc failure and
we report that. The only other way to signal
from yylex would be to make a fake token to
represent out-of-memory, but that's harder */
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_PAIR,
.left = $1,
.right = $2
};
$$ = s;
}
;
pair :
'(' sexpr '.' sexpr ')' {
struct sexpr *s = malloc(sizeof *s);
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_PAIR,
.left = $2,
.right = $4
};
$$ = s;
}
;
atom :
ID {
if (!$1) YYNOMEM;
struct sexpr *s = malloc(sizeof *s);
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_ID,
.value.id = strdup($1)
};
if (!s->value.id)
{
free(s);
YYNOMEM;
}
$$ = s;
}
| NUM {
struct sexpr *s = malloc(sizeof *s);
if (!s) YYNOMEM;
*s = (struct sexpr){
.type = SEXPR_NUM,
.value.num = $1
};
$$ = s;
}
;
%%
/* notice the extra parameters required
by %param and %parse-param */
int lisperror(void *yylval, char const *msg, const void *s)
{
(void)yylval;
(void)s;
return fprintf(stderr, "%s\n", msg);
}
/* useful internally by us, and externally by callers */
void sexpr_free(struct sexpr *s)
{
if (!s)
return;
if (s->type == SEXPR_ID)
free(s->value.id);
else if (s->type == SEXPR_PAIR)
{
sexpr_free(s->left);
sexpr_free(s->right);
}
free(s);
}

The parser does the bulk of the work. We just need to pair it with a scanner that reads atoms and parens.
/* lisp.l */
/* disable unused functions so we don't
get compiler warnings about them */
%option noyywrap nounput noinput
%option noyyalloc noyyrealloc noyyfree
/* change our prefix from yy to lisp */
%option prefix="lisp"
/* use the pure parser calling convention */
%option reentrant bison-bridge
%{
#include "lisp.tab.h"
#define YY_EXIT_FAILURE ((void)yyscanner, EXIT_FAILURE)
/* XOPEN for strdup */
#define _XOPEN_SOURCE 600
#include <limits.h>
#include <stdlib.h>
#include <string.h>
/* seems like a bug that I have to do this, since flex
should know prefix=lisp and match bison's LISPSTYPE */
#define YYSTYPE LISPSTYPE
int lisperror(const char *msg);
%}
%%
[[:alpha:]][[:alnum:]]* {
/* The memory that yytext points to gets overwritten
each time a pattern matches. We need to give the caller
a copy. Also, if strdup fails and returns NULL, it's up
to the caller (the parser) to detect that.
Notice yylval is a pointer to union now. It's passed
as an arg to yylex in pure parsing mode */
yylval->str = strdup(yytext);
return ID;
}
[-+]?[[:digit:]]+ {
long n = strtol(yytext, NULL, 10);
if (n < INT_MIN || n > INT_MAX)
lisperror("Number out of range");
yylval->num = (int)n;
return NUM;
}
[[:space:]] ; /* ignore */
/* this is a handy rule to return the ASCII value
of any other character. Importantly, parens */
. { return *yytext; }

Finally, here’s how to call the parser from a regular program.
/* driver_lisp.c */
#include <stdio.h>
#include <stdlib.h>
#define YYSTYPE LISPSTYPE
#include "lisp.tab.h"
#include "lisp.lex.h"
void sexpr_print(struct sexpr* s, unsigned depth)
{
for (unsigned i = 0; i < depth; i++)
printf(" ");
switch (s->type)
{
case SEXPR_ID:
puts(s->value.id);
break;
case SEXPR_NUM:
printf("%d\n", s->value.num);
break;
case SEXPR_PAIR:
puts(".");
sexpr_print(s->left, depth+1);
sexpr_print(s->right, depth+1);
break;
case SEXPR_NIL:
puts("()");
break;
default:
abort();
}
}
int main(void)
{
int i;
struct sexpr *expr;
yyscan_t scanner;
if ((i = lisplex_init(&scanner)) != 0)
exit(i);
int e = lispparse(&expr, scanner);
printf("Code = %d\n", e);
if (e == 0 /* success */)
{
sexpr_print(expr, 0);
sexpr_free(expr);
}
lisplex_destroy(scanner);
return 0;
}

To build it, use the Makefile pattern from roman to create analogous lisp.lex.o and lisp.tab.o. This example requires Flex and Bison, so set LEX=flex and YACC=bison at the top of the Makefile to override whatever system defaults are used for these programs. Since the driver includes lisp.lex.h, also ask Flex to emit a header. Finally, compile driver_lisp.c and link with those object files.
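Here’s a sketch of the adjusted Makefile. It assumes Flex’s --header-file option to produce the lisp.lex.h that driver_lisp.c includes:

LEX = flex
YACC = bison

# compile the driver and link the generated object files
driver_lisp : driver_lisp.c lisp.tab.h lisp.lex.h lisp.tab.o lisp.lex.o
	$(CC) -o $@ driver_lisp.c lisp.tab.o lisp.lex.o

lisp.tab.h lisp.tab.c : lisp.y
	$(YACC) -d -b lisp lisp.y

# emit both the scanner and a header for the driver
lisp.lex.h lisp.lex.c : lisp.l
	$(LEX) --header-file=lisp.lex.h -o lisp.lex.c lisp.l

lisp.lex.o : lisp.lex.c lisp.tab.h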
Here’s the program in action:
$ echo "(1 () (2 . 3) (4))" | ./driver_lisp
Code = 0
.
1
.
()
.
.
2
3
.
.
4
()
()

Internet Request For Comment (RFC) documents describe the syntax of many protocols and data formats. They often include complete Augmented Backus-Naur Form (ABNF) grammars, which we can convert into robust yacc parsers.
Let’s examine RFC 4180, which describes the comma-separated value (CSV) format. It’s pretty simple, but has problematic edge cases: commas in quoted values, quoted quotes, raw newlines in quoted values, and blank-as-a-value.
Here’s the full grammar from the RFC. Notice how alternatives are specified
with “/” rather than “|”, and how ABNF has the constructions
*(zero-or-more-things) and [optional-thing]:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
The grammar makes no distinction between lexing and parsing, although the
uppercase identifiers hint at lexer tokens. While it may be tempting to
translate to yacc top-down, starting at the file level, I’ve found the most
productive way is to start with lexing.
We can combine most of the grammar into two lex rules to match fields:
%%
\"([^"]|\"\")*\" {
/* this is what the ABNF calls "escaped" */
/* TODO: copy un-escaped internals to yylval */
return FIELD;
}
[^",\r\n]+ {
/* This is *almost* what the ABNF calls "un-escaped,"
except it won't match an empty field, like
a,,b
^---- this
Actually, even if we tried matching an empty string,
the comma or crlf would prove a longer match and
trump this one.
*/
/* TODO: capture the value to yylval */
/* no need to bother yacc with two token types, we
call them both FIELD. */
return FIELD;
}
/* handle both UNIX and DOS style, per the spec */
\n|\r\n { return CRLF; }
/* catch the comma, and any other unexpected thing */
. { return *yytext; }

With FIELD out of the way, here’s what’s left to translate:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
Let’s also drop the designation of the first row as the “header.” The application can choose to treat the first ordinary row as a header if desired. This simplifies the grammar to:
file = record *(CRLF record) [CRLF]
record = field *(COMMA field)
At this point it’s easy to convert to yacc.
%token CRLF FIELD
%%
file :
record
| file CRLF record
;
record :
field.opt
| record ',' field.opt
;
/* Here is where we handle the potentially blank
non-escaped FIELD. The ".opt" suffix doesn't mean
anything to yacc, it's just a reminder for us that
this *may* match a FIELD, or nothing at all */
field.opt :
/* empty */
| FIELD
;
Matching blank fields is tricky. There are three fields in a,,b, no way
around it. That means we have to identify some value (either a non-terminal
symbol, or a terminal token) out of thin air between characters of input. As
a corollary, given that we have to honor blank fields as existing, we’re forced
to interpret e.g. a 0-byte file as one record with a single blank field.
We handled the situation with an empty yacc rule in field.opt. Empty rules
allow the parser to produce a symbol without consuming input, reducing when
the lookahead token fits no other rule. Perhaps it’s possible to use fancy
tricks in the lexer (like trailing context and start conditions) to match
empty non-escaped fields too. However, I think an empty parser rule is more
elegant.
Some notes about empty rules:
- Bison allows marking them explicitly with %empty, which distinguishes them from accidentally missing rules.
- Bison’s --graph visualization doesn’t render empty rules properly. Use the -v option and examine the textual .output file to see the rule.
Now that we’ve seen the structure of the grammar, let’s fill in the skeleton to process the CSV content. From now on, examples in this article will use my libderp library for basic data structures like maps and vectors.
/* csv.l */
%{
#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <string.h>
/* the union in csv.tab.h requires the vector type, and
plain yacc doesn't have "%code requires" to provide
the include like Bison, so we include derp/vector.h */
#include <derp/vector.h>
#include "csv.tab.h"
%}
%%
\"([^"]|\"\")*\" {
/* yyleng is precomputed strlen(yytext) */
size_t i, n = yyleng;
char *s;
s = yylval.str = calloc(n, 1);
if (!s)
return FIELD;
/* copy yytext, changing "" to " */
for (i = 1 /*skip 0="*/; i < n-1; i++)
{
*s++ = yytext[i];
if (yytext[i] == '"')
i++; /* skip second one */
}
return FIELD;
}
[^",\r\n]+ { yylval.str = strdup(yytext); return FIELD; }
\n|\r\n { return CRLF; }
. { return *yytext; }
The complete parser below combines values from the lexer into full records, using the vector type. It then prints each record and frees it.
/* csv.y (plain yacc) */
%{
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
/* for the vector datatype and v_ functions */
#include <derp/vector.h>
/* for helper function derp_free */
#include <derp/common.h>
int yylex(void);
int yyerror(const char *s);
bool one_empty_field(vector *);
%}
%union
{
char *str;
vector *record;
}
%token CRLF
%token <str> FIELD
%type <str> field.opt
%type <record> record
/* in bison, add this:
%destructor { free($$); } <str>
%destructor { v_free($$); } <record>
*/
%%
file :
consumed_record
| file CRLF consumed_record
;
/* A record can be constructed in two ways, but we want to
run the same side effect for either case. We add an
intermediate non-terminal symbol "consumed_record" just
to perform the action. In library code, this would be a
   good place to send the record to a callback function. */
consumed_record :
record {
/* a record comprised of exactly one blank field is a
blank record, which we can skip */
if (!one_empty_field($1))
{
size_t n = v_length($1);
printf("#fields = %zu\n", n);
for (size_t i = 0; i < n; i++)
printf("\t%s\n", (char*)v_at($1, i));
}
v_free($1);
}
;
record :
field.opt {
/* In our earlier example, lisp.y, we showed how to check
for memory allocation failure. We skip that here for
brevity. */
vector *r = v_new();
v_dtor(r, derp_free, NULL);
v_append(r, $1);
$$ = r;
}
| record ',' field.opt {
v_append($1, $3);
$$ = $1;
}
;
field.opt :
/* empty */ { $$ = calloc(1,1); }
| FIELD
;
%%
bool one_empty_field(vector *r)
{
return v_length(r) == 1 && *((char*)v_first(r)) == '\0';
}
int yyerror(const char *s)
{
return fprintf(stderr, "%s\n", s);
}
Build it (using the steps shown for earlier examples). You’ll also need to link with libderp version 0.1.0, which you can see how to do in the project readme.
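For instance, a manual build might look like the following sketch. It assumes a libderp.pc file is on the pkg-config search path, and that a small hypothetical driver_csv.c supplies main() in the style of the earlier lisp driver:
flex -o csv.lex.c csv.l
bison -d -b csv csv.y
cc -std=c99 -pedantic $(pkg-config --cflags libderp) \
   -o csv csv.lex.c csv.tab.c driver_csv.c \
   $(pkg-config --libs-only-L --libs-only-l libderp)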
Next, verify with test cases:
# https://en.wikipedia.org/wiki/Comma-separated_values#Example
$ ./csv << EOF
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
EOF
#fields = 5
Year
Make
Model
Description
Price
#fields = 5
1997
Ford
E350
ac, abs, moon
3000.00
#fields = 5
1999
Chevy
Venture "Extended Edition"
4900.00
#fields = 5
1999
Chevy
Venture "Extended Edition, Very Large"
5000.00
#fields = 5
1996
Jeep
Grand Cherokee
MUST SELL!
air, moon roof, loaded
4799.00
# extra testing for empty fields before crlf and eof
$ printf ",\n," | ./csv#fields = 2
#fields = 2
IRCv3 extends the Internet Relay Chat (IRC) protocol with useful features. Its core syntactical change to support new features is message tagging. We’ll write a parser to extract information from RFC 1459 messages, including IRCv3 tags.
The BNF from this standard is written in a slightly different dialect than that of the CSV RFC.
<message> ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params] <crlf>
<tags> ::= <tag> [';' <tag>]*
<tag> ::= <key> ['=' <escaped_value>]
<key> ::= [ <client_prefix> ] [ <vendor> '/' ] <key_name>
<client_prefix> ::= '+'
<key_name> ::= <non-empty sequence of ascii letters, digits, hyphens ('-')>
<escaped_value> ::= <sequence of zero or more utf8 characters except NUL, CR, LF, semicolon (`;`) and SPACE>
<vendor> ::= <host>
<host> ::= see RFC 952 [DNS:4] for details on allowed hostnames
<prefix> ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
<nick> ::= <letter> { <letter> | <number> | <special> }
<command> ::= <letter> { <letter> } | <number> <number> <number>
<SPACE> ::= ' ' { ' ' }
<params> ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
<middle> ::= <Any *non-empty* sequence of octets not including SPACE
or NUL or CR or LF, the first of which may not be ':'>
<trailing> ::= <Any, possibly *empty*, sequence of octets not including
NUL or CR or LF>
<user> ::= <nonwhite> { <nonwhite> }
<letter> ::= 'a' ... 'z' | 'A' ... 'Z'
<number> ::= '0' ... '9'
<crlf> ::= CR LF
As before, it’s helpful to start from the bottom up, applying the power of lex regexes. However, we run into the problem that most of the tokens match almost anything. The same string could conceivably be a host, nick, user, key_name, and command all at once. Lex resolves the ambiguity by taking the longest match, breaking ties in favor of whichever rule comes first in the file.
Yacc can’t easily pass lex any clues about what tokens it expects, given what tokens have come before. Lex is on its own. For this reason, the designers of lex gave it a way to keep a memory. Rules can be tagged with a start condition, saying they are eligible only in certain states. Rule actions can then enter new states prior to returning.
/* Incomplete irc.l, showing start conditions and patterns.
This lexer produces the following tokens:
SPACE COMMAND MIDDLE TRAILING TAG PREFIX ':' '@'
*/
/* It's nice to prefix the regex names with "re_"
to see them better in the rules */
re_space [ ]+
re_host [[:alnum:]][[:alnum:]\.\-]*
re_nick [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user [~[:alpha:]][[:alnum:]]*
re_keyname [[:alnum:]\-]+
re_keyval [^ ;\r\n]*
re_command [[:alpha:]]+|[[:digit:]]{3}
re_middle [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*
/* Declare start conditions. The "%x" means
they are exclusive, vs "%s" for inclusive. */
%x IN_TAGS IN_PREFIX IN_PARAMS
%%
/* these patterns are not tagged with a start
condition, and are active in the default state
of INITIAL. They will match only when none of
the exclusive conditions are active. They
*would* match on inclusive states (but we have
none).
The BEGIN command changes state. */
@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }
{re_space} { return SPACE; }
{re_command} {
/* TODO: construct yylval */
BEGIN IN_PARAMS;
return COMMAND;
}
/* these patterns will only match IN_TAGS, which
as we saw earlier, gets activated from the
INITIAL state when "@" is encountered */
<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})? {
/* TODO: construct yylval */
return TAG;
}
<IN_TAGS>{re_space} {
BEGIN INITIAL;
return SPACE;
}
<IN_TAGS>; { return ';'; }
<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
/* TODO: construct yylval */
BEGIN INITIAL;
return PREFIX;
}
<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
/* TODO: construct yylval */
return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
/* TODO: construct yylval */
BEGIN INITIAL;
return TRAILING;
}
/* the "*" state applies to all states,
including INITIAL and the exclusive ones */
<*>\n|\r\n ; /* ignore */
We’ll revisit the lexer to fill in details for assigning yylval. First, let’s see the parser and its data types.
/* irc.y (Bison only)
Using Bison mostly for the %code positions, making
it easier to use libderp between flex and bison.
- WARNING -
There is absolutely no memory hygiene in this example.
We don't check for allocation failure, and we don't free
things when done. See the earlier lisp.y/.l examples
for guidance about that.
*/
/* output more descriptive messages than "syntax error" */
%define parse.error verbose
%code top {
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
}
%code requires {
#include <derp/list.h>
#include <derp/treemap.h>
struct prefix
{
char *host;
char *nick;
char *user;
};
/* building an irc_message is the overall
goal for this parser */
struct irc_message
{
treemap *tags;
struct prefix *prefix;
char *command;
list *params;
};
}
%code provides {
int yyerror(char const *msg);
int yylex(void);
void message_print(struct irc_message *m);
}
%union
{
char *str;
struct prefix *prefix;
treemap *map;
struct map_pair *pair;
list *list;
struct irc_message *msg;
}
%token SPACE
%token <str> COMMAND MIDDLE TRAILING
%token <pair> TAG
%token <prefix> PREFIX
%type <msg> message tagged_message prefixed_message
%type <map> tags
%type <list> params
%%
/* Like in the CSV example, we start with a dummy
rule just to add side-effects */
final :
tagged_message { message_print($1); }
;
/* Messages begin with two optional components,
a set of tags and a prefix.
<message> ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params]
Rather than making a single message rule with
tons of variations (and duplicated code), I chose
to build the message in stages.
tagged_message <- prefixed_message <- message
A prefixed_message adds prefix information, or
passes the message along verbatim if there is none.
Similarly for tagged_message. */
tagged_message :
/* When a rule matches more than one token,
it's helpful to add Bison "named references"
in brackets. Thus, below, the rule can refer to
$ts rather than $2, or $msg rather than $4.
Makes it way easier to rearrange tokens while
you're experimenting. */
'@' tags[ts] SPACE prefixed_message[msg] {
$msg->tags = $ts;
$$ = $msg;
}
/* here's the pass-through case when there are
no tags on the message */
| prefixed_message
;
prefixed_message :
':' PREFIX[pfx] SPACE message[msg] {
$msg->prefix = $pfx;
$$ = $msg;
}
| message
;
message :
COMMAND[cmd] params[ps] {
struct irc_message *m = malloc(sizeof *m);
*m = (struct irc_message) {
.command=$cmd, .params=$ps
};
$$ = m;
}
;
tags :
TAG {
treemap *t = tm_new(derp_strcmp, NULL);
tm_insert(t, $1->k, $1->v);
$$ = t;
}
| tags[ts] ';' TAG[t] {
tm_insert($ts, $t->k, $t->v);
$$ = $ts;
}
;
params :
SPACE TRAILING {
$$ = l_new();
l_prepend($$, $2);
}
| SPACE MIDDLE[mid] params[ps] {
l_prepend($ps, $mid);
$$ = $ps;
}
| %empty {
$$ = l_new();
}
;
%%
int yyerror(char const *msg)
{
return fprintf(stderr, "%s\n", msg);
}
void message_print(struct irc_message *m)
{
if (m->tags)
{
struct tm_iter *it = tm_iter_begin(m->tags);
struct map_pair *p;
puts("Tags:");
while ((p = tm_iter_next(it)) != NULL)
printf("\t'%s'='%s'\n", (char*)p->k, (char*)p->v);
tm_iter_free(it);
}
if (m->prefix)
printf("Prefix: Nick %s, User %s, Host %s\n",
m->prefix->nick, m->prefix->user,
m->prefix->host);
if (m->command)
printf("Command: %s\n", m->command);
if (!l_is_empty(m->params))
{
puts("Params:");
for (list_item *li = l_first(m->params); li; li = li->next)
printf("\t%s\n", (char*)li->data);
}
}
Returning to the lexer, here is the code with all the details filled in to construct yylval for the tokens.
/* irc.l - complete file */
%option noyywrap nounput noinput
%{
#include "irc.tab.h"
#define _XOPEN_SOURCE 600
#include <limits.h>
#include <stdlib.h>
#include <string.h>
%}
re_space [ ]+
re_host [[:alnum:]][[:alnum:]\.\-]*
re_nick [[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*
re_user [~[:alpha:]][[:alnum:]]*
re_keyname [[:alnum:]\-]+
re_keyval [^ ;\r\n]*
re_command [[:alpha:]]+|[[:digit:]]{3}
re_middle [^: \r\n][^ \r\n]*
re_trailing [^\r\n]*
%x IN_TAGS IN_PREFIX IN_PARAMS
%%
@ { BEGIN IN_TAGS; return *yytext; }
: { BEGIN IN_PREFIX; return *yytext; }
{re_space} { return SPACE; }
{re_command} {
yylval.str = strdup(yytext);
BEGIN IN_PARAMS;
return COMMAND;
}
<IN_TAGS>\+?({re_host}\/)?{re_keyname}(={re_keyval})? {
struct map_pair *p = malloc(sizeof *p);
char *split = strchr(yytext, '=');
if (split)
*split = '\0';
*p = (struct map_pair){
.k = strdup(yytext),
.v = split ? strdup(split+1) : calloc(1,1)
};
yylval.pair = p;
return TAG;
}
<IN_TAGS>{re_space} {
BEGIN INITIAL;
return SPACE;
}
<IN_TAGS>; { return ';'; }
<IN_PREFIX>({re_host})|({re_nick})(!{re_user})?(@{re_host})? {
struct prefix *p = malloc(sizeof *p);
if (!p)
goto done;
*p = (struct prefix){0};
char *bang = strchr(yytext, '!'),
*at = strchr(yytext, '@');
if (!bang && !at)
{
p->host = strdup(yytext);
goto done;
}
if (bang) *bang = '\0';
if (at) *at = '\0';
p->nick = strdup(yytext);
if (bang)
p->user = strdup(bang+1);
if (at)
p->host = strdup(at+1);
done:
yylval.prefix = p;
BEGIN INITIAL;
return PREFIX;
}
<IN_PARAMS>{re_space} { return SPACE; }
<IN_PARAMS>{re_middle} {
yylval.str = strdup(yytext);
return MIDDLE;
}
<IN_PARAMS>:{re_trailing} {
yylval.str = strdup(yytext+1); /* trim : */
BEGIN INITIAL;
return TRAILING;
}
<*>\n|\r\n ; /* ignore */
Build irc.y and irc.l according to our typical pattern (and link with libderp). Here’s an example of the IRCv3 parser in action:
# Try an example from
# https://ircv3.net/specs/extensions/message-tags#examples
$ ./irc <<EOF
@aaa=bbb;ccc;example.com/ddd=eee :nick!ident@host.com PRIVMSG me :Hello
EOF
Tags:
'aaa'='bbb'
'ccc'=''
'example.com/ddd'='eee'
Prefix: Nick nick, User ident, Host host.com
Command: PRIVMSG
Params:
me
Hello
Content for the article comes from researching how to create a shared library, wading through sloppy conventions that people recommend online, and testing on multiple Unix-like systems. Hopefully it can set the record straight and help improve the quality of open source libraries.
The design typically used nowadays for dynamic linking (in BSD, MacOS, and Linux) came from SunOS in 1988. The paper Shared Libraries in SunOS neatly explains the goals, design, and implementation.
The authors’ main motivations were saving disk and memory space, and upgrading libraries (or the OS) without needing to relink programs. The resource usage motivation is probably less important on today’s powerful personal computers than it was in 1988. However, the flexibility to upgrade libraries is as useful as ever, as well as the ability to easily inspect which library versions each application uses.
Dynamic linking is not without its critics, and isn’t appropriate in all situations. It runs a little slower because of position-independent code (PIC) and late loading. (The SunOS paper called it a “classic space/time trade-off.”) The complexity of the loader on some systems offers increased attack surface. Finally, upgraded libraries may affect some programs differently than others, for instance breaking those that rely on undocumented behavior.
At compile time the link editor resolves symbols in specified libraries, and makes a note in the resulting binary to load those libraries. At runtime, applications run loader code that maps the shared libraries into memory and resolves their symbols at the correct addresses.
SunOS and subsequent UNIX-like systems added compile-time flags to the linker
(ld) to generate – or link against – dynamically linked libraries. The
designers also added a special system library (ld.so) with code to find and
load other libraries for an application. The pre-main() initialization
routine of a program loads ld.so and runs it from within the program to find
and load the rest of the required libraries.
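You can watch this machinery from the outside. Assuming a binary named myapp (the name is just for illustration), these commands show the libraries it asks the loader for:
# which libraries the binary declares as dependencies (ELF)
readelf -d myapp | grep NEEDED
# resolve them through the loader's search path
ldd myapp
# macOS (Mach-O) equivalent
otool -L myapp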
As mentioned, applications can take advantage of updated libraries without needing recompilation. Library updates can be classified in three categories: major releases that break backward compatibility, minor releases that add functionality in a backward-compatible way, and patch releases that fix internals without changing the interface.
An application linked against a library at a given major release will continue to work properly when loading any newer minor or patch release. Applications may not work properly when loading a different major release, or an earlier minor release than that used at link time.
Multiple applications can exist on a machine at once, and each may require different releases of a single library. The system should provide a way to store multiple library releases and load the right one for each app. Different systems have different ways to do it, as we’ll see later.
Each library release can be marked with a version identifier (or “version”) which seeks to capture information about the library’s release history. There are multiple ways to map release history to a version identifier.
The two most common mapping systems are semantic versioning and libtool versioning. Semantic versioning counts the number of releases of various kinds that have happened, and writes them in lexicographic order. Libtool versioning counts distinct library interfaces.
Semantic versioning is written as major.minor.patch and libtool as
current:revision:age. The intuition is that current counts interface
changes. Any time the interface changes, whether in a minor or major way,
current increases. Here’s how each system would record the same history of
release events:
| Event | Semver | Libtool |
|---|---|---|
| Initial | 1.0.0 | 1:0:0 |
| Minor | 1.1.0 | 2:0:1 |
| Minor | 1.2.0 | 3:0:2 |
| Patch | 1.2.1 | 3:1:2 |
| Major | 2.0.0 | 4:0:0 |
| Patch | 2.0.1 | 4:1:0 |
| Patch | 2.0.2 | 4:2:0 |
| Minor | 2.1.0 | 5:0:1 |
Here’s how applications answer the question, “Can I load a given library?”
- Semver: is the major version the one I linked against, with an equal or newer minor version?
- Libtool: is the current interface number of the library I linked with between current - age and current of the library to be loaded?
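For example, using the release table above: an application linked when the library was 4:2:0 (semver 2.0.2) saw interface number 4. The later release 5:0:1 (semver 2.1.0) supports interfaces current - age = 4 through 5, so the application loads it. An application linked at interface 3 (semver 1.2.x) cannot load 4:0:0 (semver 2.0.0), which supports only interface 4, so the major upgrade is correctly refused.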
We’ll be using semantic versioning in this guide, because libtool versioning is only relevant to libtool, a tool to abstract library creation across platforms. I believe we can make portable libraries without libtool. I mention both systems only to show that there’s more than one way to build version identifiers.
One final note: version identifiers say that things have changed, but omit what changed. More complicated systems exist to track library compatibility. Solaris, for instance, developed a system called symbol versioning. Symbol versioning chases space savings at the expense of operational complexity, and we’ll consider it later.
One subtlety of versioning is that changes can happen in either a library’s programming interface (API) or binary interface (ABI). A C library’s programming interface is defined through its header files. A backward-incompatible API change means a program written for the previous version would not compile when including headers from the new version.
By contrast, a binary interface is a runtime concept. It concerns the calling conventions for functions, or the memory layout (and meaning) of data shared between program and library. The ABI ensures compatibility at load and run-time, while the API ensures compatibility at compile and link time.
The two interfaces usually change hand-in-hand, and people sometimes confuse them. It’s possible for one to change without the other, though.
Examples of breaking ABI, but API stability: reordering the fields of a struct that appears in a public header, or changing the value of a constant that gets compiled into applications.
In these library changes, application code doesn’t need to change, but does need to be recompiled with the new library headers in order to work at runtime.
Examples of ABI stability, but breaking API:
In these library changes, application code would need to be modified to compile successfully against the new library, even though code compiled before the change could load and call the library without issue.
For example, consider changing a parameter from const foo * to foo *. A pointer
to a const object cannot be implicitly converted to a pointer to a non-const
object, so existing application code may no longer compile. The ABI doesn’t
care though, and moves the same bytes. (If the library does in fact modify the
dereferenced value, it may be an unpleasant surprise to the application of
course.)
It’s usually easy to tell when you’ve added functionality vs broken backward compatibility, but there are tools to check for sure. For instance, the ABI Compliance Checker can detect breakages in C and C++ libraries.
In light of the versioning discussion earlier, which changes should the version identifier describe? At the very least, the ABI. When the loader is searching for a library, the ABI determines whether a library would be compatible at runtime. However, I think a more conservative versioning scheme is wise, where you bump a version when either the API or ABI change. You’ll end up with potentially more library versions installed, but each shared API/ABI version will provide guarantees at both compilation and runtime.
After compiling object files, the compiler front-end (gcc, clang, cc, c99) will invoke the linker (ld, lld) to find unresolved symbols and match them across object files or in shared libraries. The linker searches only the shared libraries requested by the front-end, in the order specified on the command line. If an unresolved symbol is found in a listed library, the linker marks a dependency on that library in the generated executable.
The -l option adds a library to the list of candidates for symbol search.
To add libfoo.so (or libfoo.dylib on Mac), specify -lfoo. The linker
looks for the library files in its search path. To add directories to the
default search path(s), use -L, for instance -L/usr/local/lib.
What happens if multiple versions of a library exist in the same directory? For
instance two major versions, libfoo.so.1 and libfoo.so.2? OpenBSD knows
about version numbers, and would pick the highest version automatically for
-lfoo. Linux and Mac would match neither, because they’re looking for an
exact match of libfoo.so (or libfoo.dylib). Similarly, what if both a
static and dynamic library exist in the same directory, libfoo.a and
libfoo.so? All systems will choose the dynamic one.
Greater control is necessary. GCC has a colon option to solve the problem, for
instance -l:libfoo.so.1. However clang doesn’t have it, so a truly portable
build shouldn’t rely on it. Some systems solve the problem by creating a
symlink from libfoo.so to the specific library desired. However when done in
a system location like /usr/local/lib, it nominates a single inflexible
link-time version for the whole system. I’ll suggest a different solution later
that involves storing link-time files in a separate place from load-time
libraries.
At launch time, programs with dynamic library dependencies load and run ld.so (or dyld on Mac) to find and load the rest of their dependencies. The loader inspects DT_NEEDED ELF tags (or LC_LOAD_DYLIB commands in Mach-O on Mac) to determine which library filename to find on the system. Interestingly, these values are not specified by the program developer, but by the library developer. They are extracted from the libraries themselves at link-time.
Dynamic libraries contain an internal “runtime name” called SONAME in ELF, or
install_name in Mach-O. An application may link against a file named
libfoo.so, but the library SONAME can say, “search for me under the filename
libfoo.so.1.2 at load time.” The loader cares only about filenames; it never
consults SONAMEs. Conversely, the linker’s output cares only about SONAMEs,
not input library filenames.
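To see the runtime name recorded inside an ELF library (the filename here is illustrative):
readelf -d libfoo.so.1.2 | grep SONAME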
Loaders in different operating systems go about finding dependent libraries slightly differently. OpenBSD’s ld.so is very true to the SunOS model, and understands semantic versions. For instance, if asked to load libfoo.so.1.2, it will attempt to find libfoo.so.1.x with the largest x ≥ 2. FreeBSD also claims to have this behavior, but I didn’t observe it in my tests.
In 1995, Solaris 2.5 created a way to track semantic versioning at the symbol level, rather than for the entire library. With symbol versioning there would be a single e.g. libfoo.so file that simply grows over time. Every function inside is marked with a version number. The same function name can even exist under multiple versions with different implementations.
The advantage of symbol versioning is that it can save space. In the alternative, where versioning is per-library rather than per-symbol, a large percentage of object code is often copied unchanged from one library version to the next. The disadvantages are operational: every exported symbol must be annotated (typically through a linker version script), and the single library file accretes old implementations forever.
Symbol versioning quickly found its way into Linux, and became a staple of Glibc. Because of Linux’s symbol versioning preference, its ld.so doesn’t make any effort to rendezvous with the latest minor library version (à la SunOS or OpenBSD). Ld.so searches for an exact match between SONAME and filename.
However, even on Linux, most libraries don’t use symbol versioning. Also, their
SONAMEs typically record only a major version (like libfoo.so.2). Within that
major version, you just have to hope the hidden minor version is new enough for
all applications compiled or installed on the system. If an app relies on
functions added in a later minor library version, it’ll crash when it attempts
to call them. (Setting the environment variable LD_BIND_NOW=1 will attempt to
resolve all symbols at program start instead, to detect the failure up front.)
MacOS uses an entirely different object format (Mach-O rather than ELF), and a
differently named loader library (dyld rather than ld.so). Mac’s dynamically
linked libraries are named .dylib, and their version numbers precede the
extension.
Native Mac applications are usually installed into their own dedicated
directories, with libraries bundled inside. Thus the loader has special
provisions for finding libraries, like the keywords @executable_path,
@loader_path and @rpath in the install_name. MacOS supports system
libraries too, with dyld consulting the DYLD_FALLBACK_LIBRARY_PATH, by
default $(HOME)/lib:/usr/local/lib:/lib:/usr/lib.
Like Linux, Mac does an exact name match – no minor version rendezvous. Unlike Linux, libraries can record their full semantic version internally, and a “compatibility” version. The compatibility version gets copied into an application at link time, and says the application requires at least that version at runtime.
For example, libfoo.1.dylib with full version 1.2.3 should have a
compatibility version of 1.2.0 according to the rules of semantic versioning.
An application linked against it would refuse to load libfoo with lesser minor
version, like 1.1.5. At load time, the user would see a clear error:
dyld: Library not loaded: libfoo.1.dylib
Referenced from: myapp
Reason: Incompatible library version: myapp requires version 1.2.0 or later,
but libfoo.1.dylib provides version 1.1.5
Standard practice is to create symlinks libfoo.so -> libfoo.so.x -> libfoo.so.x.y.z in a shared system directory. The first link (without the version number) is for linking at build time. Problem is, it’s pinned to one version. There’s no portable way to select which version to link against when there are multiple versions installed.
Also, standard practice gives even less care to versioning header files. Sometimes whichever version was most recently installed overwrites them in /usr/local/include. Sometimes the headers are maintained only at the major version level, in /usr/local/include/libfoo-n.
To solve these problems, I suggest bundling all development (linking) library files together into a different directory structure per version. Since I advocated earlier that the “total” library version should be bumped whenever the API or ABI changes, the same version safely applies to headers and binaries.
First choose an installation PREFIX. If the system has an /opt directory, pick that, otherwise /usr/local. In this directory, add dynamic and/or static libraries, headers, man pages, and pkg-config files as desired:
$PREFIX/libfoo-dev.x.y.z
├── libfoo.pc
├── libfoo-static.pc
├── include
│ └── foo
│ ├── ...
│ └── ...
├── lib
│ ├── libfoo.so (or dylib or dll)
│ └── static
│ └── libfoo.a
└── man
├── ...
└── ...
Linking against libfoo.x.y.z is easy. In a Makefile, set your flags like this:
CFLAGS += -I/opt/libfoo-dev.x.y.z/include
LDFLAGS += -L/opt/libfoo-dev.x.y.z/lib
LDLIBS += -lfoo
# an example suffix rule using the flags
.c:
$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)
Pkg-config can allow
an application to express a range of acceptable library versions, rather than
hardcoding a specific one. In a configure script, we’ll test for the library’s
presence and version, and output the flags to config.mk:
# supposing we require libfoo 1.x for x >= 1
pkg-config --print-errors 'libfoo >= 1.1, libfoo < 2.0'
# save flags to config.mk
cat > config.mk <<-EOF
CFLAGS += $(pkg-config --cflags libfoo)
LDFLAGS += $(pkg-config --libs-only-L libfoo)
LDLIBS += $(pkg-config --libs-only-l libfoo)
EOF
Then our Makefile becomes:
include config.mk
.c:
$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)
To choose a specific version of libfoo, we can add it to the pkg-config search path and run the configure script:
# make desired libfoo version visible to pkg-config
export PKG_CONFIG_PATH="/opt/libfoo-dev.x.y.z:$PKG_CONFIG_PATH"
./configure
make
To create pkg-config .pc files for a library, see Dan Nicholson’s
guide. In order to
offer both a static and dynamic library, the best way I could imagine was to
release separate files, libfoo.pc and libfoo-static.pc that differ in their
-L flag. One uses lib and another lib/static. (Pkg-config’s --static
flag is a bit of a misnomer, and just passes items in Libs.private in
addition to Libs in the build process.)
This section talks about installing dynamic libraries for system-wide loading. Libraries installed for this purpose are not meant to link with at compile time, but to load at runtime.
ELF objects don’t have much version metadata. SONAME is about it. That, combined with the lackluster behavior of loaders on some systems, means the traditional installation technique doesn’t work too well.
Let’s review the traditional way to install ELF libraries, and then a safer method I designed.
Traditional installation method
For version x.y.z, compile the library with SONAME libfoo.so.x, copy it to /usr/local/lib/libfoo.so.x.y.z, and create a symlink libfoo.so.x pointing at it (on Linux, ldconfig maintains these SONAME symlinks).
This way allows a sysadmin to see exactly which versions are installed, and to have multiple major versions installed at once. It doesn’t allow multiple minor versions per major (although usually only the latest minor is needed), and more importantly doesn’t offer protection against loading too old a minor version.
Safer installation method
For version x.y.z, compile libfoo.so with SONAME libfoo.so.x.y
# use compilation flags
-shared -Wl,-soname,libfoo.so.${MAJOR}.${MINOR}
Copy libfoo.so to /usr/local/lib/libfoo.so.x.y.z
Backfill minor version symlinks in DEST:
i=0
while [ $i -le "$MINOR" ]; do
ln -fs "libfoo.so.$VER" "$DEST/libfoo.so.$MAJOR.$i"
i=$((i+1))
done
At the cost of potentially a lot of minor version symlinks, this technique emulates the SunOS and OpenBSD behavior of minor version rendezvous. Also, because the SONAME has major.minor granularity, it will protect against loading too old a minor version.
(As an alternative to the symlinks, FreeBSD has libmap.conf.)
Mach-O has more version metadata inside than ELF, so a traditional install works fine here.
For version x.y.z, compile libfoo.dylib with
# use compilation flags
-dynamiclib -install_name "libfoo.${MAJOR}.dylib" \
-current_version ${VER} \
-compatibility_version ${MAJOR}.${MINOR}.0
Copy libfoo.dylib to /usr/local/lib/libfoo.x.dylib
It’s important to set the compatibility version correctly so that Mac’s dyld will prevent loading too old a minor version. To upgrade the library, overwrite libfoo.x.dylib with one of a later internal minor release.
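To confirm the versions were recorded, inspect the library with otool (output abbreviated; the numbers follow the x.y.z placeholders above):
$ otool -L libfoo.x.dylib
libfoo.x.dylib:
	libfoo.x.dylib (compatibility version x.y.0, current version x.y.z)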
For an example of how to build a library portably, and install it conveniently for the linker and loader, see begriffs/libderp. It’s my first shared library, where I tested the ideas for this article.
]]>To think about stability more clearly, let’s divide a functioning program into its layers. Then we can examine development choices one layer at a time.
The more features a program needs, the further out it must reach through the layers.
Arguably the operating system should be listed as the outermost layer rather than third-party libraries, since libraries are often designed to be portable across operating systems.
Every language has to start somewhere, often as an implementation by a single person or small group. At this stage the language evolves rapidly, and to be fair it’s this stage that advances the state of the art.
However, using a language in its single-implementation stage means you’re committing a percentage of your energy to the “research project” of the language itself. You’ll deal with breaking changes (including tools), and experimental dead-ends.
If you love the idea behind a new language, or believe it’s a winner and that your early familiarity will pay off, then go for it! Otherwise use a language that has advanced beyond a single implementation. That way you can focus on your domain of expertise rather than keeping up with a language research agenda.
Languages get to the next stage when groups of people fork them for new situations and architectures. Some people add features, other people discover difficulties in their environments. Stakeholders then debate and reach consensus through a standardization process. The end result is that the standard, rather than a particular software artifact, defines the language and has the final say.
Naturally the whole thing takes a while. Standardized languages are going to be fairly old. They’ll miss out on recent ideas, but will be well understood. Mature languages with standards include C, C++, Common Lisp, ECMAScript, Ada, and Fortran.
I’ve been using C lately because of its portability, simple (yet expressive) abstract machine model, and deep compatibility with POSIX and foundational libraries.
If you’re using a language with a standard, take advantage of it. First, choose a specific version of the standard. Older versions are generally more widely supported, but have fewer features. In the C world I usually pick C99 because it has some conveniences over C89, and is still supported pretty much everywhere (although only partially on Windows).
Consult your compiler documentation to see if the compiler can catch accidental uses of non-standard behavior. In clang or gcc, add the following flags to your Makefile:
# enforce a specific version of the standard
CFLAGS += -std=c99 -pedantic
Substitute another version for “c99” as desired. The pedantic flag rejects all programs that use forbidden extensions, and some other programs that do not follow ISO C.
If you do want to use compiler extensions (such as those in gcc or clang), wrap them behind your own macros so that the code stays portable. The PostgreSQL project does this kind of thing in c.h. Here’s an example at random:
/*
* Use "pg_attribute_always_inline" in place of "inline" for functions that
* we wish to force inlining of, even when the compiler's heuristics would
* choose not to. But, if possible, don't force inlining in unoptimized
* debug builds.
*/
#if (defined(__GNUC__) && __GNUC__ > 3 && defined(__OPTIMIZE__)) || defined(__SUNPRO_C) || defined(__IBMC__)
/* GCC > 3, Sunpro and XLC support always_inline via __attribute__ */
#define pg_attribute_always_inline __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
/* MSVC has a special keyword for this */
#define pg_attribute_always_inline __forceinline
#else
/* Otherwise, the best we can do is to say "inline" */
#define pg_attribute_always_inline inline
#endif
Notice how they adapt to various compilers and provide a final fallback. Of course, avoiding extensions in the first place is the simplest option, when possible.
Take time to learn your language’s standard library. It’s a freebie: you get it wherever your program goes. Read about the library functions in the language standard itself, where their exact behavior is specified.
Gaining knowledge of the standard library can help reduce reliance on unnecessary third-party libraries. The ECMAScript world, for instance, is rife with micro-libraries that attempt to supplement the ECMA standard’s real or perceived shortcomings.
The size of a single-implementation language’s library is a trade-off between ease of implementation and ease of use. A giant library like that in the Go language makes it harder for creators of would-be rival implementations, and thus slows the progress to a robust standard.
To learn more about the C standard library, see my article.
Because standards bodies avoid breaking existing codebases, and because stable languages are slow to change, there will be weird or dangerous functions in the standard library. However the dangers are well known and documented in supporting literature, unlike the dangers in new, relatively untested systems.
Several great books about C catalog these dangers.
Also the C99 standard has an accompanying rationale document. It talks about alternate designs considered and rejected.
Similarly to how competing C implementations led to the C standard, the Unix wars led to POSIX. POSIX specifies a “lowest common denominator” interface that many operating systems honor to a greater or lesser degree.
Whenever you use system calls outside the C standard library, check whether they’re part of POSIX, and if their official description differs from your local man pages. The Open Group offers a free searchable HTML version of POSIX.1. As of this writing it’s POSIX.1-2017 (which is POSIX.1-2008 plus two technical corrigenda).
There’s one more complication: POSIX.1-2008 (aka “Issue 7”) isn’t fully supported everywhere. (For instance I found that macOS doesn’t support pthread barriers, semaphores, or asynchronous thread cancellation.) I think the root cause is that 2008 requires thread and real-time functionality that was previously in optional extensions. If you stick to functionality in POSIX.1-2001 (aka Issue 6) you should be safe on all reasonably recent platforms.
To call POSIX functions you must define the _POSIX_C_SOURCE “feature test”
macro before including header files. Select a specific POSIX version by using
one of these values:
| Edition | Release year | Macro value |
|---|---|---|
| 1 | 1988 | (N/A) |
| 2 | 1990 | 1 |
| 3 | 1992 | 2 |
| 4 | 1993 | 199309L |
| 5 | 1995 | 199506L |
| 6 | 2001 | 200112L |
| 7 | 2008 | 200809L |
Header files hide or reveal functions based on the feature test macro. For example, the getline() function from Issue 7 allocates memory and reads a line.
/* line.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h> /* ssize_t */
int main(void)
{
char *line = NULL;
size_t len = 0;
ssize_t read;
while ((read = getline(&line, &len, stdin)) != -1)
printf("Length %zd: %s", read, line);
free(line);
return 0;
}Trying to use getline() on Issue 6 (POSIX.1-2001) fails:
$ cc -std=c99 -pedantic -Werror -D_POSIX_C_SOURCE=200112L line.c -o line
line.c:10:17: error: implicit declaration of function 'getline' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
while ((read = getline(&line, &len, stdin)) != -1)
^
1 error generated.
Selecting Issue 7 with -D_POSIX_C_SOURCE=200809L fixes it.
Important note: setting _POSIX_C_SOURCE will hide non-POSIX operating
system extras in the standard headers. The best practice is to separate your
source files into those that are POSIX conformant, and those (hopefully few)
that aren’t. Compile the latter without the feature macro and link them all
together at the end.
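Here’s a minimal sketch of that separation in a POSIX Makefile; the file names are hypothetical:
CFLAGS = -std=c99 -pedantic
POSIXFLAGS = -D_POSIX_C_SOURCE=200112L
OBJS = main.o os_extras.o

prog: $(OBJS)
	$(CC) $(LDFLAGS) -o prog $(OBJS) $(LDLIBS)

# the conformant unit gets the feature test macro
main.o: main.c
	$(CC) $(CFLAGS) $(POSIXFLAGS) -c main.c

# the lone non-conformant unit builds without it
os_extras.o: os_extras.c
	$(CC) $(CFLAGS) -c os_extras.c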
POSIX defines the interface for not just the library functions discussed earlier, but for the shell and common tools too. If you use those tools for your builds then you don’t need to install any extra software on destination machines to compile your project.
Probably the most common sources of accidental lock-in are bashisms and GNU extensions to Make. For scripts, use sh, and use (POSIX) make for Makefiles. Too many projects use GNU features needlessly. In fact, learning the portable subset of Make features leads to cleaner, more reliable builds.
This is a topic for an entire article of its own. Chris Wellons wrote a nice tutorial about it. Also “Managing Projects with make” by Andrew Oram (ISBN 0-937175-90-0) is a little book that’s packed with good advice.
Operating systems include useful functionality beyond POSIX. For instance extensions to pthreads (setting reader-writer preference or thread processor affinity), access to specialized hardware (like audio or graphics), alternate I/O interfaces and semantics, and functions for safety like strlcpy or pledge.
Three ways to use these features portably are to detect and call them conditionally at build time, to rely on a third-party library that abstracts them, or to fall back to an implementation of your own.
We’ll talk about third-party libraries later. Let’s look at option one now.
Consider the example of generating random data. It requires help from the OS since POSIX offers only pseudo-random numbers.
We’ll split our Makefile into two parts:
- Makefile – specifies targets, dependencies and rules that hold on all systems
- config.mk – sets macros and build flags specific to the local system
The Makefile will include the specifics of config.mk like this:
# inside the Makefile...
# set up common options and then...
include config.mk
We’ll generate config.mk with a configure script. A developer will run the
script before their first build to detect the environment options. The most
primitive way for configure to work would be to parse the output of uname
and make decisions based on what OS or distro it sees. A more accurate way is
to directly probe for the needed OS C functions.
To see if a C function exists, we can just try compiling test snippets of code and see if they succeed. You might think this is awkward or that it requires cluttering your project with test code, but it’s actually pretty elegant.
First make this shell script helper function:
compiles ()
{
stage="$(mktemp -d)"
echo "$2" > "$stage/test.c"
(cc -Werror "$1" -o "$stage/test" "$stage/test.c" >/dev/null 2>&1)
cc_success=$?
rm -rf "$stage"
return $cc_success
}
The compiles() function takes two arguments: an optional compiler flag, and
the source code to attempt to compile.
Note that mktemp and cc are not POSIX compliant. You can write your own
mktemp function using POSIX primitives, but I wanted to conserve space in
this example. For cc, the spec offers c99 (or c89 in 4th edition POSIX).
However, the c99 utility doesn’t allow controlling warning levels, and I
wanted to specify that warnings be treated as errors. The cc alias is a
common de-facto standard.
Let’s use the helper to check for OS random number generators. The BSD world
offers arc4random_buf to get random
bytes, and Linux offers
getrandom. The
configure script can check for each feature like this:
if compiles "" "
#include <stdint.h>
#include <stdlib.h>
int main(void)
{
void (*p)(void *, size_t) = arc4random_buf;
return (intptr_t)p;
}"
then
echo "CFLAGS += -DHAVE_ARC4RANDOM" >> config.mk
fi
if compiles "-D_POSIX_C_SOURCE=200112L" "
#include <stdint.h>
#include <sys/types.h>
#include <sys/random.h>
int main(void)
{
ssize_t (*p)(void *, size_t, unsigned int) = getrandom;
return (intptr_t)p;
}"
then
echo "CFLAGS += -DHAVE_GETRANDOM" >> config.mk
fi
See? Not too bad. These code snippets test not only whether the functions
exist, but also check their type signatures. Notice how the second example is
compiled with POSIX for the ssize_t type, while the first example is
intentionally not marked POSIX conformant because doing so would hide the
extra function arc4random_buf that BSD puts in stdlib.h.
It’s helpful to isolate the use of non-portable functions in a distinct translation unit, and export your own interface on top. That way it’s more straightforward to set up conditional compilation in one place, or to refactor in the future.
Let’s continue the example from the previous section of generating random bytes. With the hard work of OS feature detection behind us, we can wrap the differing OS interfaces behind our own function:
#include <stdint.h>
#include <stdlib.h>
#ifdef HAVE_GETRANDOM
#include <sys/random.h>
#endif
void get_random_bytes(void *buf, size_t n)
{
#if defined HAVE_ARC4RANDOM /* BSD */
arc4random_buf(buf, n);
#elif defined HAVE_GETRANDOM /* Linux */
getrandom(buf, n, 0);
#else
#error OS does not provide recognized function to get entropy
#endif
}
The Makefile defines HAVE_ARC4RANDOM or HAVE_GETRANDOM using CFLAGS when
the corresponding functions exist. The code can just use ifdefs. Notice the
#error in the #else case to fail compilation with a clear message on
unsupported platforms.
The degree of portability we strive for causes trade-offs. Example: we could
add a fallback to reading /dev/random. The configure script from the
previous section could check whether the device exists:
if test -c /dev/random; then
echo "CFLAGS += -DHAVE_DEVRANDOM" >> config.mk
fi
Using that information, we could add another #elif in get_random_bytes() so
that it can potentially work on more systems. However, in this case, the
increased portability would require a change in interface. Since fopen() or
fread() on /dev/random could fail, our function would need to return bool.
Currently the OS functions we’re calling can’t fail, so a void return is fine.
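Here’s a sketch of that variant. The HAVE_DEVRANDOM branch and the bool interface are the hypothetical changes under discussion:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#ifdef HAVE_GETRANDOM
#include <sys/random.h>
#endif

bool get_random_bytes(void *buf, size_t n)
{
#if defined HAVE_ARC4RANDOM /* BSD: cannot fail */
	arc4random_buf(buf, n);
	return true;
#elif defined HAVE_GETRANDOM /* Linux: treat a short read as failure */
	return getrandom(buf, n, 0) == (ssize_t)n;
#elif defined HAVE_DEVRANDOM /* hypothetical fallback device */
	FILE *f = fopen("/dev/random", "rb");
	if (!f)
		return false;
	size_t got = fread(buf, 1, n, f);
	fclose(f);
	return got == n;
#else
	#error OS does not provide recognized function to get entropy
#endif
}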
The true test of portability is, of course, building and running on multiple operating systems, compilers, and hardware architectures. It can be surprising to see what assumptions this can uncover. Testing portability early and often makes it easier to keep a program shipshape.
The PostgreSQL project, for instance, maintains a bunch of disparate machines known as the “buildfarm.” Buildfarm members each have their own OS, compiler, and architecture. The team compiles every new feature on these machines and runs the test suite there.
Focusing on the architectures alone, the buildfarm shows impressive variety: x86, ARM, POWER, SPARC, s390x, and more.
Even if you have no intention to run on these architectures, testing there will lead to better code. (See my article C Portability Lessons from Weird Machines.)
I’ve been thinking of assembling a buildfarm and offering a paid continuous integration service. If this interests you, please send me an email. I think the project is a good cause, and with enough subscriptions I could cover the electricity and hardware costs.
Many languages have their own application-level package managers, but C has no exclusive package manager. The language has too much history and spans too many environments to have locked into that. Instead people build dependencies from source, or use the OS package manager.
Linking to libraries requires knowing their path, name, and compiler settings. Additionally we want to know which version is installed and whether it’s in-bounds. Since there’s no application-level package manager for C, we need to use another tool to discover installed libraries.
The most cross-platform way to find – and build against – dependency
libraries is
pkg-config. The tool
allows you to query system packages, regardless of how they were installed. To
be compatible with pkg-config, each library foo provides a libfoo.pc file
containing keys and values like this:
prefix=/usr/local
exec_prefix=${prefix}
includedir=${prefix}/include
libdir=${exec_prefix}/lib
Name: libfoo
Description: The foo library
Version: 1.2.3
Cflags: -I${includedir}/foo
Libs: -L${libdir} -lfoo
The pkg-config executable can query the metadata and provide flags for your
Makefile. Call it from your configure script like this:
# check that a sufficient version is installed
pkg-config --print-errors 'libfoo >= 1.0'
# save flags to config.mk
cat >> config.mk <<-EOF
CFLAGS += $(pkg-config --cflags libfoo)
LDFLAGS += $(pkg-config --libs-only-L libfoo)
LDLIBS += $(pkg-config --libs-only-l libfoo)
EOF
Notice the LDLIBS vs LDFLAGS distinction. LDLIBS are options that need to go at the very end of the build line. The default POSIX make suffix rules don’t mention LDLIBS, but here’s a rule you can use instead:
.c:
$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)
Sometimes an operating system will include extra functionality and package it up as a portable library you can use on other operating systems. In this case you can use pkg-config conditionally.
For instance, OpenBSD spun off the LibreSSL project (a more usable OpenSSL). OpenBSD includes the functionality internally. In the configure script just do an operating system check:
# LibreSSL
case "$(uname -s)" in
OpenBSD)
# included with OS
echo 'LDLIBS += -ltls' >> config.mk
;;
*)
# requires a package
pkg-config --print-errors 'libtls >= 2.5.0'
cat >> config.mk <<-EOF
CFLAGS += $(pkg-config --cflags libtls)
LDFLAGS += $(pkg-config --libs-only-L libtls)
LDLIBS += $(pkg-config --libs-only-l libtls)
EOF
esac
For more information about pkg-config, see Dan Nicholson’s guide.
The C standard library has no generic collections. You have to write your own linked lists, trees, and hash tables. Real Programmers™ might like this, but I don’t.
POSIX offers limited help with its interface in search.h: hash tables (hsearch), binary search trees (tsearch), linear search (lsearch), and queues (insque). The interfaces are spartan, though. For example, twalk() doesn’t accept an argument to pass auxiliary data to the callback. The callback needs to consult a global or thread-local variable for that. The quality of implementation may vary as well, likely with regard to how/if the tree is balanced.
To go beyond that, you’ll have to use third-party libraries. Many well-known libraries seem pretty bloated (GLib, tbox, Apache Portable Runtime). I found a smaller, cleaner library called simply C Algorithms. Haven’t used it in a project yet, but it looks stable and well tested. I also built the library locally with added pedantic C99 flags and got no warnings.
Two other stable libraries (code snippets?) which have received a lot of use over the years are Uthash and BSD’s queue(3) (browse queue.h from OpenBSD, or the FreeBSD variant).
Uthash describes itself this way:
“Any C structure can be stored in a hash table using uthash. Just add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. Then use these macros to store, retrieve or delete items from the hash table.”
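A minimal sketch of that workflow, assuming uthash.h is vendored into the project:
#include <stdio.h>
#include <stdlib.h>
#include "uthash.h"

struct user {
	int id;            /* this field acts as the key */
	char name[16];
	UT_hash_handle hh; /* makes the struct hashable */
};

int main(void)
{
	struct user *users = NULL, *u, *found;

	u = malloc(sizeof *u);
	u->id = 42;
	snprintf(u->name, sizeof u->name, "alice");
	HASH_ADD_INT(users, id, u);

	int key = 42;
	HASH_FIND_INT(users, &key, found);
	if (found)
		printf("found user: %s\n", found->name);

	HASH_DEL(users, found);
	free(found);
	return 0;
}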
The BSD queue code has been used and improved all the way back to the 1990s. It provides macros to create and manipulate singly-linked lists, simple queues, lists, and tail queues. The man page is quite good.
The functionality differs in the codebase of OpenBSD and FreeBSD. I use the OpenBSD version, but it has a little less functionality. In particular, FreeBSD adds the STAILQ (singly-linked tail queue), and a list swap operation. There was once a CIRCLEQ for circular queues, but it used dodgy coding practices and was removed.
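And a similar sketch with the queue.h singly-linked list macros (sys/queue.h on the BSDs; vendor the header elsewhere):
#include <stdio.h>
#include <stdlib.h>
#include <sys/queue.h>

struct node {
	int value;
	SLIST_ENTRY(node) link; /* embedded linkage, no separate allocation */
};

int main(void)
{
	SLIST_HEAD(, node) head = SLIST_HEAD_INITIALIZER(head);

	for (int i = 0; i < 3; i++) {
		struct node *n = malloc(sizeof *n);
		n->value = i;
		SLIST_INSERT_HEAD(&head, n, link);
	}

	struct node *n;
	SLIST_FOREACH(n, &head, link)
		printf("%d\n", n->value);
	return 0;
}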
Both Uthash and Queue are header files with macros that you vendor into your project and include rather than linking against. In general I consider “header-only libraries” to be undesirable because they abuse the notion of a translation unit, bloat object files, and make debugging harder. However I’ve used these libraries and they do work well.
The fewer UI features a program requires, the more portable it will be and the fewer opportunities there will be for it to mess up. (Does your command line app really need to output an emoji rocket ship or animated-in-place text spinner?)
The lowest common denominator is the standard I/O library in C, or its equivalent in other languages. Reading and writing text, pretending to be a teletype.
The next level of sophistication is static output but an input line you can modify (like the fancier teletypes that could edit a line before sending). Different terminals support intraline editing differently, and you should use a library to handle it. The classic is GNU readline, which provides line editing with Emacs or vi key bindings, a searchable input history, and programmable tab completion.
Its license is straight up GPL though, not even LGPL. There are more permissive knockoffs like libedit (requires ncurses), or linenoise (which is restricted to VT100 terminals/emulators).
Going up yet another level is the text user interface (TUI), where the whole screen is your canvas, but you draw on it with text. Historically terminal control codes diverged wildly, so a standard programming interface was born, X/Open Curses. The most popular implementation is ncurses, which adds some nonstandard extensions as well.
Curses handles tasks like positioning text anywhere on a virtual screen, updating the real display efficiently with minimal terminal output, and reading keyboard input (including function keys) portably.
To stop pretending the computer is an archaic device from the 70s, you can use the cross-platform SDL2 library. It gives low level access to audio, keyboard, mouse, joystick, and graphics hardware. The platform support really is impressive. Everything from Unix, Mac, and Windows to mobile and web rendering.
Finally, for a classic native desktop application with widgets, the most stable and portable choice is probably Motif. The interface is stark, but it runs everywhere, and won’t change or break on you.
The Motif Programming Manual (free download) says this by way of introduction:
So why motif? Because it remains what it has long been: the common native windowing toolkit for all the UNIX platforms, fully supported by all the major operating system vendors. It is still the only truly industrial strength toolkit capable of supporting large scale and long term projects. Everything else is tainted: it isn’t ready or fully functionally complete, or the functional specification changes in a non-backwards-compatible manner per release, or there are performance issues. Perhaps it doesn’t truly port across UNIX systems, or it isn’t fully ICCCM compliant with software written in any other toolkit on the desktop, or there are political battles as various groups try to control the specification for their own purposes. […] With motif, you know where you are: it’s stable, it’s robust, it’s professionally supported, and it all works.
A reference manual is also available for download.
I was a little skeptical that it would be supported on macOS, but I tried the hello world example and, sure enough, it worked fine on XQuartz. I think there’s value in using Motif rather than a monster like GTK.
]]>I want to create emails that look their best in all mail clients, whether graphical or text based. Ideally I’d write a message in a simple format like Markdown, and generate the final email from the input file. Additionally, I’d like to be able to include fenced code snippets in the message, and make them available as attachments.
I created a utility called mimedown that reads markdown through stdin and prints multipart MIME to stdout.
Let’s see it in action. Here’s an example message:
## This is a demo email with code
Hey, does this code look fishy to you?
```crash.c
#include <stdio.h>
int main(void)
{
char a[] = "string literal";
char *p = "string literal";
/* capitalize first letter */
p[0] = a[0] = 'S';
printf("a: %s\np: %s\n", a, p);
return 0;
}
```
It blows up when I compile it and run it:
```compile.txt
$ cc -std=c99 -pedantic -Wall -Wextra crash.c -o crash
$ ./crash
Bus error: 10
```
Turns out we're invoking undefined behavior.
* The C99 spec, appendix J.2 Undefined Behavior mentions this case:
> The program attempts to modify a string literal (6.4.5).
* Steve Summit's C FAQ [question 1.32](http://c-faq.com/decl/strlitinit.html)
covers the difference between an array initialized with string literal vs a
pointer to a string literal constant.
* The SEI CERT C Coding standard
[STR30-C](https://wiki.sei.cmu.edu/confluence/display/c/STR30-C.+Do+not+attempt+to+modify+string+literals)
demonstrates the problem with non-compliant code, and compares with compliant
fixes.
After running it through the generator and emailing it to myself, here’s how the result looks in the Fastmail web interface:
Notice how the code blocks are displayed inline and are available as attachments with the correct MIME type.
I intentionally haven’t configured Mutt to render HTML, so it falls back to the
text alternative in the message, which also looks good. Notice how the message
body is interleaved with Content-Disposition: inline attachments for each
code snippet.
The email generator also creates references for external urls. It substitutes
the urls in the original body text with references, and consolidates the links
into a bibliography of type text/uri-list at the end of the message. Here’s
another Mutt screenshot of the end of the message, with red circles added.
The generated MIME structure of our sample message looks like this:
I 1 <no description> [multipa/alternativ, 7bit, 3.1K]
I 2 ├─><no description> [multipa/mixed, 7bit, 1.7K]
I 3 │ ├─><no description> [text/plain, 7bit, utf-8, 0.1K]
I 4 │ ├─>crash.c [text/x-c, 7bit, utf-8, 0.2K]
I 5 │ ├─><no description> [text/plain, 7bit, utf-8, 0.1K]
I 6 │ ├─>compile.txt [text/plain, 7bit, utf-8, 0.1K]
I 7 │ ├─><no description> [text/plain, 7bit, utf-8, 0.5K]
I 8 │ └─>references.uri [text/uri-list, 7bit, utf-8, 0.2K]
I 9 └─><no description> [text/html, 7bit, utf-8, 1.3K]
At the outermost level, the message is split into two alternatives: HTML and multipart/mixed. Within the multipart/mixed part is a succession of message text and code snippets, all with inline disposition. The final mixed item is the list of referenced urls (if necessary).
Lines of the message body are re-flowed to at most 72 characters, to conform to historical length constraints. Additionally, to accommodate narrow terminal windows, mimedown uses a technique called format=flowed. This is a clever standard (RFC 3676) which adds trailing spaces to any lines that we would like the client reader to re-flow, such as those in paragraphs.
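As an illustration of format=flowed (trailing spaces are shown here as «·» because they are otherwise invisible), a flowed paragraph in the generated text part looks something like this:

Content-Type: text/plain; charset=utf-8; format=flowed

Lines that the client is allowed to re-flow end with a trailing·
space, like the first two lines of this paragraph, while the
final line of the paragraph ends with a hard break.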
Neither hard wrapping nor format=flowed is applied to fenced code blocks from the original markdown. Code snippets are turned into verbatim attachments and won’t be mangled.
Finally, the HTML version of the message is tasteful and conservative. It should display properly on any HTML client, since it validates with ISO HTML (ISO/IEC 15445:2000, based on HTML 4.01 Strict).
Clone it here: github.com/begriffs/mimedown. It’s written in portable C99. The only build dependency is the cmark library for parsing markdown.
]]>When debugging a program that uses LibreSSL, it can be useful to see decrypted
network traffic. Wireshark can decrypt TLS if
you provide the secret session key, however the session key is difficult to
obtain. It is different from the private key used for functions like
tls_config_set_keypair_file(), which merely secures the initial TLS handshake
with asymmetric cryptography. The handshake establishes the session key between
client and server using a method such as Diffie-Hellman (DH). The session key
is then used for efficient symmetric cryptography for the remainder of the
communication.
Web browsers, from their Netscape provenance, will log session keys to a file specified by the environment variable SSLKEYLOGFILE when present. Netscape packaged this behavior in its Network Security Services library.
OpenSSL and LibreSSL don’t implement that NSS behavior, although OpenSSL allows code to register a callback for when TLS key material is generated or received. The callback receives a string in the NSS Key Log Format.
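With plain OpenSSL (1.1.1 or later), registering that callback looks something like this sketch; the callback name and the file handling are my own:

#include <openssl/ssl.h>
#include <stdio.h>
#include <stdlib.h>

static void keylog_cb(const SSL *ssl, const char *line)
{
	const char *path = getenv("SSLKEYLOGFILE");
	FILE *fp;
	(void)ssl;
	/* each line arrives already in NSS Key Log Format */
	if (path && (fp = fopen(path, "a")) != NULL)
	{
		fprintf(fp, "%s\n", line);
		fclose(fp);
	}
}

/* after creating a context:
   SSL_CTX_set_keylog_callback(ctx, keylog_cb); */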
In addition to refactoring OpenSSL code, LibreSSL offers a simplified TLS interface called libtls. The simplicity makes it more likely that applications will use it safely. However, I couldn’t find an easy way to log session keys for my libtls connection.
I found a somewhat hacky way to do it, and asked their development list whether there’s a better way. From the lack of response, I assume there isn’t yet. Posting the solution here in case it’s helpful for anyone else.
This module provides a tls_dump_keylog() function that appends to the file
specified in SSLKEYLOGFILE.
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <openssl/ssl.h>
/* A copy of the tls structure from libtls/tls_internal.h
*
* This is a fragile hack! When the structure changes in libtls
* then it will be Undefined Behavior to alias it with this.
* See C99 section 6.5 (Expressions), paragraph 7
*/
struct tls_internal {
struct tls_config *config;
struct tls_keypair *keypair;
struct {
char *msg;
int num;
int tls;
} error;
uint32_t flags;
uint32_t state;
char *servername;
int socket;
SSL *ssl_conn;
SSL_CTX *ssl_ctx;
struct tls_sni_ctx *sni_ctx;
X509 *ssl_peer_cert;
STACK_OF(X509) *ssl_peer_chain;
struct tls_conninfo *conninfo;
struct tls_ocsp *ocsp;
tls_read_cb read_cb;
tls_write_cb write_cb;
void *cb_arg;
};
static void printhex(FILE *fp, const unsigned char* s, size_t len)
{
while (len-- > 0)
fprintf(fp, "%02x", *s++);
}
bool tls_dump_keylog(struct tls *tls)
{
FILE *fp;
SSL_SESSION *sess;
	size_t len_key;
	unsigned int len_id;
unsigned char key[256];
const unsigned char *id;
const char *path = getenv("SSLKEYLOGFILE");
if (!path)
return false;
/* potentially nonstrict aliasing */
sess = SSL_get_session(((struct tls_internal*)tls)->ssl_conn);
if (!sess)
{
fprintf(stderr, "Failed to get session for TLS\n");
return false;
}
len_key = SSL_SESSION_get_master_key(sess, key, sizeof key);
id = SSL_SESSION_get_id(sess, &len_id);
if ((fp = fopen(path, "a")) == NULL)
{
fprintf(stderr, "Unable to write keylog to '%s'\n", path);
return false;
}
fputs("RSA Session-ID:", fp);
printhex(fp, id, len_id);
fputs(" Master-Key:", fp);
printhex(fp, key, len_key);
fputs("\n", fp);
fclose(fp);
return true;
}

To use the logfile in Wireshark, right click on a TLS packet, and select Protocol Preferences → (Pre)-Master-Secret log filename.
In the resulting dialog, add the filename to the logfile. Then you can view the decrypted traffic with Follow → TLS Stream.
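For context, a hypothetical call site: once the libtls handshake has completed, call the function once per connection.

struct tls *conn = tls_client();
/* ... tls_configure() and tls_connect() as usual ... */
if (tls_handshake(conn) == 0)
	tls_dump_keylog(conn); /* appends to $SSLKEYLOGFILE */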
We won’t run to either extreme here. Instead we’ll cover the production workhorses for concurrent software – threading and locking – and learn about them through a series of interesting programs. By the end of this article you’ll know the terminology and patterns used by POSIX threads (pthreads).
This is an introduction rather than a reference. Plenty of reference material exists for pthreads – whole books in fact. I won’t dwell on all the options of the API, but will briskly give you the big picture. None of the examples contain error handling because it would merely clutter them.
First it’s important to distinguish concurrency vs parallelism. Concurrency is the ability of parts of a program to work correctly when executed out of order. For instance, imagine tasks A and B. One way to execute them is sequentially, meaning doing all steps for A, then all for B:
Concurrent execution, on the other hand, alternates doing a little of each task until both are all complete:
Concurrency allows a program to make progress even when certain parts are blocked. For instance, when one task is waiting for user input, the system can switch to another task and do calculations.
When tasks don’t just interleave, but run at the same time, that’s called parallelism. Multiple CPU cores can run instructions simultaneously:
When a program – even without hardware parallelism – switches rapidly enough from one task to another, it can feel to the user that tasks are executing at the same time. You could say it provides the “illusion of parallelism.” However, true parallelism has the potential for greater processor throughput for problems that can be broken into independent subtasks. Some ways of dealing with concurrency, such as multi-threaded programming, can exploit hardware parallelism automatically when available.
Some languages (or more accurately, some language implementations) are unable to achieve true multi-threaded parallelism. Ruby MRI and CPython for instance use a global interpreter lock (GIL) to simplify their implementation. The GIL prevents more than one thread from running at once. Programs in these interpreters can benefit from I/O concurrency, but not extra computational power.
Languages and libraries offer different ways to add concurrency to a program. UNIX for instance has a bunch of disjointed mechanisms like signals, asynchronous I/O (AIO), select, poll, and setjmp/longjmp. Using these mechanisms can complicate program structure and make programs harder to read than sequential code.
Threads offer a cleaner and more consistent way to meet these needs. For I/O they’re usually clearer than polling or callbacks, and for processing they are more efficient than Unix processes.
Let’s get started by adding concurrency to a program to simulate a bunch of crazy bankers sending random amounts of money from one bank account to another. The bankers don’t communicate with one another, so this is a demonstration of concurrency without synchronization.
Adding concurrency is the easy part. The real work is in making threads wait for one another to ensure a correct result. We’ll see a number of mechanisms and patterns for synchronization later, but for now let’s see what goes wrong without synchronization.
/* banker.c */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#define N_ACCOUNTS 10
#define N_THREADS 20
#define N_ROUNDS 10000
/* 10 accounts with $100 apiece means there's $1,000
in the system. Let's hope it stays that way... */
#define INIT_BALANCE 100
/* making a struct here for the benefit of future
versions of this program */
struct account
{
long balance;
} accts[N_ACCOUNTS];
/* Helper for bankers to choose an account and amount at
random. It came from Steve Summit's excellent C FAQ
http://c-faq.com/lib/randrange.html */
int rand_range(int N)
{
return (int)((double)rand() / ((double)RAND_MAX + 1) * N);
}
/* each banker will run this function concurrently. The
weird signature is required for a thread function */
void *disburse(void *arg)
{
size_t i, from, to;
long payment;
/* idiom to tell compiler arg is unused */
(void)arg;
for (i = 0; i < N_ROUNDS; i++)
{
/* pick distinct 'from' and 'to' accounts */
from = rand_range(N_ACCOUNTS);
do {
to = rand_range(N_ACCOUNTS);
} while (to == from);
/* go nuts sending money, try not to overdraft */
if (accts[from].balance > 0)
{
payment = 1 + rand_range(accts[from].balance);
accts[from].balance -= payment;
accts[to].balance += payment;
}
}
return NULL;
}
int main(void)
{
size_t i;
long total;
pthread_t ts[N_THREADS];
srand(time(NULL));
for (i = 0; i < N_ACCOUNTS; i++)
accts[i].balance = INIT_BALANCE;
printf("Initial money in system: %d\n",
N_ACCOUNTS * INIT_BALANCE);
/* start the threads, using whatever parallelism the
system happens to offer. Note that pthread_create
is the *only* function that creates concurrency */
for (i = 0; i < N_THREADS; i++)
pthread_create(&ts[i], NULL, disburse, NULL);
/* wait for the threads to all finish, using the
pthread_t handles pthread_create gave us */
for (i = 0; i < N_THREADS; i++)
pthread_join(ts[i], NULL);
for (total = 0, i = 0; i < N_ACCOUNTS; i++)
total += accts[i].balance;
printf("Final money in system: %ld\n", total);
}

The following simple Makefile can be used to compile all the programs in this article:
.POSIX:
CFLAGS = -std=c99 -pedantic -D_POSIX_C_SOURCE=200809L -Wall -Wextra
LDLIBS = -lpthread
.c:
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< $(LDLIBS)

We’re overriding make’s default suffix rule for .c so that -lpthread comes after the source input file. This Makefile will work with any of our programs. If you have foo.c you can simply run make foo and it knows what to do without your needing to add any specific rule for foo to the Makefile.
Try compiling and running banker.c. Notice anything strange?
Threads share memory directly. Each thread can read and write variables in shared memory without any overhead. However when threads simultaneously read and write the same data it’s called a data race and generally causes problems.
In particular, threads in banker.c have data races when they read and write
account balances. The bankers program moves money between accounts, however
the total amount of money in the system does not remain constant. The books
don’t balance. Exactly how the program behaves depends on thread scheduling
policies of the operating system. On OpenBSD the total money seldom stays at
$1,000. Sometimes money gets duplicated, sometimes it vanishes. On macOS the
result is generally that all the money disappears, or even becomes negative!
The property that money is neither created nor destroyed in a bank is an example of a program invariant, and it gets violated by data races. Note that parallelism is not required for a race, only concurrency.
Here’s the problematic code in the disburse() function:
payment = 1 + rand_range(accts[from].balance);
accts[from].balance -= payment;
	accts[to].balance += payment;

The threads running this code can be paused or interleaved at any time. Not just between any of the statements, but partway through arithmetic operations which may not execute atomically on the hardware. Never rely on “thread inertia,” which is the mistaken feeling that the thread will finish a group of statements without interference.
Let’s examine exactly how statements can interleave between banker threads, and the resulting problems. The columns of the table below are threads, and the rows are moments in time.
Here’s a timeline where two threads read the same account balance when planning how much money to transfer. It can cause an overdraft.
| Thread A | Thread B |
|---|---|
| payment = 1 + rand_range(accts[from].balance); | |
| | payment = 1 + rand_range(accts[from].balance); |
| At this point, thread B’s payment-to-be may be in excess of the true balance because thread A has already earmarked some of the money unbeknownst to B. | |
| accts[from].balance -= payment; | |
| | accts[from].balance -= payment; |
| Some of the same dollars could be transferred twice and the originating account could even go negative if the overlap of the payments is big enough. | |
Here’s a timeline where the debit made by one thread can be undone by that made by another.
| Thread A | Thread B |
|---|---|
| accts[from].balance -= payment; | accts[from].balance -= payment; |

If -= is not atomic, the threads might switch execution after reading the balance and after doing arithmetic, but before assignment. Thus one assignment would be overwritten by the other. The “lost update” creates extra money in the system.
Similar problems can occur when bankers have a data race in destination accounts. Races in the destination account would tend to decrease total money supply. (To learn more about concurrency problems, see my article Practical Guide to SQL Transaction Isolation).
In the example above, we found that a certain section of code was vulnerable to data races. Such tricky parts of a program are called critical sections. We must ensure each thread gets all the way through the section before another thread is allowed to enter it.
To give threads mutually exclusive access to a critical section, pthreads provides the mutually exclusive lock (mutex for short). The pattern is:
pthread_mutex_lock(&some_mutex);
/* ... do things in the critical section ... */
pthread_mutex_unlock(&some_mutex);

Any thread calling pthread_mutex_lock on a previously locked mutex will go to
sleep and not be scheduled until the mutex is unlocked (and any other threads
already waiting on the mutex have gone first).
Another way to look at mutexes is that their job is to preserve program
invariants. The critical section between locking and unlocking is a place where
a certain invariant may be temporarily broken, as long as it is restored by the
end. Some people recommend adding an assert() statement before unlocking, to
help document the invariant. If an invariant is difficult to specify in an
assertion, a comment can be useful instead.
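In the banker program that might look like this sketch (the non-negative balance is an invariant I chose for illustration):

pthread_mutex_lock(&accts[from].mtx);
/* ... move money; the invariant may be broken in here ... */
assert(accts[from].balance >= 0); /* document the restored invariant */
pthread_mutex_unlock(&accts[from].mtx);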
A function is called thread-safe if multiple invocations can safely run concurrently. A cheap, but inefficient, way to make any function thread-safe is to give it its own mutex and lock it right away:
/* inefficient but effective way to protect a function */
pthread_mutex_t foo_mtx = PTHREAD_MUTEX_INITIALIZER;
void foo(/* some arguments */)
{
pthread_mutex_lock(&foo_mtx);
/* we're safe in here, but it's a bottleneck */
pthread_mutex_unlock(&foo_mtx);
}

To see why this is inefficient, imagine if foo() was designed
characters to a file specified in its arguments. Because the function takes a
global lock, no two threads could run it at once, even if they wanted to write
to different files. Writing to different files should be independent
activities, and what we really want to protect against are two threads
concurrently writing the same file.
The amount of data that a mutex protects is called its granularity, and
smaller granularity can often be more efficient. In our foo() example, we
could store a mutex for every file we write, and have the function choose and
lock the appropriate mutex. Multi-threaded programs typically add a mutex as a
member variable to data structures, to associate the lock with its data.
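A finer-grained foo() might pair each file with its own lock, along these lines (the struct and function names are hypothetical):

struct logfile
{
	FILE *fp;
	pthread_mutex_t mtx; /* protects fp */
};

void logfile_write(struct logfile *lf, const char *msg)
{
	pthread_mutex_lock(&lf->mtx);
	fputs(msg, lf->fp);
	pthread_mutex_unlock(&lf->mtx);
}

Two threads writing to different logfile structures no longer contend with each other.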
Let’s update the banker program to keep a mutex in each account and prevent data races.
/* banker_lock.c */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#define N_ACCOUNTS 10
#define N_THREADS 100
#define N_ROUNDS 10000
struct account
{
long balance;
/* add a mutex to prevent races on balance */
pthread_mutex_t mtx;
} accts[N_ACCOUNTS];
int rand_range(int N)
{
return (int)((double)rand() / ((double)RAND_MAX + 1) * N);
}
void *disburse(void *arg)
{
size_t i, from, to;
long payment;
(void)arg;
for (i = 0; i < N_ROUNDS; i++)
{
from = rand_range(N_ACCOUNTS);
do {
to = rand_range(N_ACCOUNTS);
} while (to == from);
/* get an exclusive lock on both balances before
updating (there's a problem with this, see below) */
pthread_mutex_lock(&accts[from].mtx);
pthread_mutex_lock(&accts[to].mtx);
if (accts[from].balance > 0)
{
payment = 1 + rand_range(accts[from].balance);
accts[from].balance -= payment;
accts[to].balance += payment;
}
pthread_mutex_unlock(&accts[to].mtx);
pthread_mutex_unlock(&accts[from].mtx);
}
return NULL;
}
int main(void)
{
size_t i;
long total;
pthread_t ts[N_THREADS];
srand(time(NULL));
/* set the initial balance, but also create a
new mutex for each account */
for (i = 0; i < N_ACCOUNTS; i++)
accts[i] = (struct account)
{100, PTHREAD_MUTEX_INITIALIZER};
for (i = 0; i < N_THREADS; i++)
pthread_create(&ts[i], NULL, disburse, NULL);
puts("(This program will probably deadlock, "
"and need to be manually terminated...)");
for (i = 0; i < N_THREADS; i++)
pthread_join(ts[i], NULL);
for (total = 0, i = 0; i < N_ACCOUNTS; i++)
total += accts[i].balance;
printf("Total money in system: %ld\n", total);
}

Now everything should be safe. No money being created or destroyed, just perfect exchanges between the accounts. The invariant is that the total balance of the source and destination accounts is the same before we transfer the money as after. It’s broken only inside the critical section.
As a side note, at this point you might think it would be more efficient to take a single lock at a time, along the lines of this sketch:
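/* UNSAFE sketch: lock only one account at a time */
pthread_mutex_lock(&accts[from].mtx);
payment = 0;
if (accts[from].balance > 0)
{
	payment = 1 + rand_range(accts[from].balance);
	accts[from].balance -= payment;
}
pthread_mutex_unlock(&accts[from].mtx);

if (payment > 0)
{
	pthread_mutex_lock(&accts[to].mtx);
	accts[to].balance += payment;
	pthread_mutex_unlock(&accts[to].mtx);
}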
This would not be safe. During the time between unlocking the source account and locking the destination, the invariant does not hold, yet another thread could observe this state. For instance a report running in another thread just at that time could read the balance of both accounts and observe money missing from the system.
We do need to lock both accounts during the transfer. However the way we’re doing it causes a different problem. Try to run the program. It gets stuck forever and never prints the final balance! Its threads are deadlocked.
Deadlock is the second villain of concurrent programming, and happens when threads wait on each others’ locks, but no thread unlocks for any other. The case of the bankers is a classic simple form called the deadly embrace. Here’s how it plays out:
| Thread A | Thread B |
|---|---|
| lock account 1 | |
| | lock account 2 |
| lock account 2 | |
| At this point thread A is blocked because thread B already holds a lock on account 2. | |
| | lock account 1 |
| Now thread B is blocked because thread A holds a lock on account 1. However thread A will never unlock account 1 because thread A is blocked! | |
The problem happens because threads lock resources in different orders, and because they refuse to give locks up. We can solve the problem by addressing either of these causes.
The first approach to preventing deadlock is to enforce a locking hierarchy. This means the programmer comes up with an arbitrary order for locks, and always takes “earlier” locks before “later” ones. The terminology comes from locks in hierarchical data structures like trees, but it really amounts to using any kind of consistent locking order.
In our case of the banker program we store all the accounts in an array, so we can use the array index as the lock order. Let’s compare.
/* the original way to lock mutexes, which caused deadlock */
pthread_mutex_lock(&accts[from].mtx);
pthread_mutex_lock(&accts[to].mtx);
/* move money */
pthread_mutex_unlock(&accts[to].mtx);
pthread_mutex_unlock(&accts[from].mtx);

Here’s a safe way, enforcing a locking hierarchy:
/* lock mutexes in earlier accounts first */
#define MIN(a,b) ((a) < (b) ? (a) : (b))
#define MAX(a,b) ((a) < (b) ? (b) : (a))
pthread_mutex_lock(&accts[MIN(from, to)].mtx);
pthread_mutex_lock(&accts[MAX(from, to)].mtx);
/* move money */
pthread_mutex_unlock(&accts[MAX(from, to)].mtx);
pthread_mutex_unlock(&accts[MIN(from, to)].mtx);
/* notice we unlock in opposite order */

A locking hierarchy is the most efficient way to prevent deadlock, but it isn’t always easy to contrive. It also creates a potentially undocumented coupling between different parts of a program, which need to collaborate in the convention.
Backoff is a different way to prevent deadlock which works for locks taken in any order. It takes a lock, but then checks whether the next is obtainable. If not, it unlocks the first to allow another thread to make progress, and tries again.
/* using pthread_mutex_trylock to dodge deadlock */
while (1)
{
pthread_mutex_lock(&accts[from].mtx);
if (pthread_mutex_trylock(&accts[to].mtx) == 0)
break; /* got both locks */
/* didn't get the second one, so unlock the first */
pthread_mutex_unlock(&accts[from].mtx);
/* force a sleep so another thread can try --
include <sched.h> for this function */
sched_yield();
}
/* move money */
pthread_mutex_unlock(&accts[to].mtx);
pthread_mutex_unlock(&accts[from].mtx);

One tricky part is the call to sched_yield(). Without it the loop will immediately try to grab the lock again, competing as hard as it can with other threads that could make more productive use of the lock. This causes livelock, where threads fight for access to the locks. The sched_yield() relinquishes the processor, sending the calling thread to the back of the scheduler’s run queue.
Despite its flexibility, backoff is definitely less efficient than a locking hierarchy because it can make wasted calls to lock and unlock mutexes. Try modifying the banker program with these approaches and measure how fast they run.
After safely getting access to a shared variable with a mutex, a thread may discover that the value of the variable is not yet suitable for the thread to act upon. For instance, if the thread was looking for an item to process in a shared queue, but found the queue was empty. The thread could poll the value, but this is inefficient. Pthreads provides condition variables to allow threads to wait for events of interest or notify other threads when these events happen.
Condition variables are not themselves locks, nor do they hold any value of their own. They are merely events with a programmer-assigned meaning. For example, a structure representing a queue could have a mutex for safely accessing the data, plus some condition variables. One to represent the event of the queue becoming empty, and another to announce when a new item is added.
Before getting deeper into how condition variables work, let’s see one in
action with our banker program. We’ll measure contention between the bankers.
First we’ll increase the number of threads and accounts, and keep statistics
about how many bankers manage to get inside the disburse() critical section
at once. Any time the max score is broken, we’ll signal a condition variable. A
dedicated thread will wait on it and update a scoreboard.
/* banker_stats.c */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
/* increase the accounts and threads, but make sure there are
* "too many" threads so they tend to block each other */
#define N_ACCOUNTS 50
#define N_THREADS 100
#define N_ROUNDS 10000
#define MIN(a,b) ((a) < (b) ? (a) : (b))
#define MAX(a,b) ((a) < (b) ? (b) : (a))
struct account
{
long balance;
pthread_mutex_t mtx;
} accts[N_ACCOUNTS];
int rand_range(int N)
{
return (int)((double)rand() / ((double)RAND_MAX + 1) * N);
}
/* keep a special mutex and condition variable
* reserved for just the stats */
pthread_mutex_t stats_mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t stats_cnd = PTHREAD_COND_INITIALIZER;
int stats_curr = 0, stats_best = 0;
/* use this interface to modify the stats */
void stats_change(int delta)
{
pthread_mutex_lock(&stats_mtx);
stats_curr += delta;
if (stats_curr > stats_best)
{
stats_best = stats_curr;
/* signal new high score */
pthread_cond_broadcast(&stats_cnd);
}
pthread_mutex_unlock(&stats_mtx);
}
/* a dedicated thread to update the scoreboard UI */
void *stats_print(void *arg)
{
int prev_best;
(void)arg;
/* we never return, nobody needs to
* pthread_join() with us */
pthread_detach(pthread_self());
while (1)
{
pthread_mutex_lock(&stats_mtx);
prev_best = stats_best;
/* go to sleep until stats change, and always
* check that they actually have changed */
while (prev_best == stats_best)
pthread_cond_wait(
&stats_cnd, &stats_mtx);
/* overwrite current line with new score */
printf("\r%2d", stats_best);
pthread_mutex_unlock(&stats_mtx);
fflush(stdout);
}
}
void *disburse(void *arg)
{
size_t i, from, to;
long payment;
(void)arg;
for (i = 0; i < N_ROUNDS; i++)
{
from = rand_range(N_ACCOUNTS);
do {
to = rand_range(N_ACCOUNTS);
} while (to == from);
pthread_mutex_lock(&accts[MIN(from, to)].mtx);
pthread_mutex_lock(&accts[MAX(from, to)].mtx);
/* notice we still have a lock hierarchy, because
* we call stats_change() after locking all account
* mutexes (stats_mtx comes last) */
stats_change(1); /* another banker in crit sec */
if (accts[from].balance > 0)
{
payment = 1 + rand_range(accts[from].balance);
accts[from].balance -= payment;
accts[to].balance += payment;
}
stats_change(-1); /* leaving crit sec */
pthread_mutex_unlock(&accts[MAX(from, to)].mtx);
pthread_mutex_unlock(&accts[MIN(from, to)].mtx);
}
return NULL;
}
int main(void)
{
size_t i;
long total;
pthread_t ts[N_THREADS], stats;
srand(time(NULL));
for (i = 0; i < N_ACCOUNTS; i++)
accts[i] = (struct account)
{100, PTHREAD_MUTEX_INITIALIZER};
for (i = 0; i < N_THREADS; i++)
pthread_create(&ts[i], NULL, disburse, NULL);
/* start thread to update the user on how many bankers
* are in the disburse() critical section at once */
pthread_create(&stats, NULL, stats_print, NULL);
for (i = 0; i < N_THREADS; i++)
pthread_join(ts[i], NULL);
/* not joining with the thread running stats_print,
	 * we'll let it disappear when main exits */
for (total = 0, i = 0; i < N_ACCOUNTS; i++)
total += accts[i].balance;
printf("\nTotal money in system: %ld\n", total);
}

With fifty accounts and a hundred threads, not all threads will be able to be
in the critical section of disburse() at once. It varies between runs. Run
the program and see how well it does on your machine. (One complication is that
making all threads synchronize on stats_mtx may throw off the measurement,
because there are threads who could have executed independently but now must
interact.)
Let’s look at how to properly use condition variables. We notified threads of a
new event with pthread_cond_broadcast(&stats_cnd). This function marks all
threads waiting on stats_cnd as ready to run.
Sometimes multiple threads are waiting on a single cond var. A broadcast will
wake them all, but sometimes the event source knows that only one thread will
be able to do any work. For instance if only one item is added to a shared
queue. In that case the pthread_cond_signal function is better than
pthread_cond_broadcast. Unnecessarily waking multiple threads causes
overhead. In our case we know that only one thread is waiting on the cond var,
so it really makes no difference.
Remember that it’s never wrong to use a broadcast, whereas in some cases it might be wrong to use a signal. Signal is just an optimized broadcast.
The waiting side of a cond var ought always to have this pattern:
pthread_mutex_lock(&mutex);
while (!PREDICATE)
pthread_cond_wait(&cond_var, &mutex);
pthread_mutex_unlock(&mutex);

Condition variables are always associated with a predicate, and the association is implicit in the programmer’s head. You shouldn’t reuse a condition variable for multiple predicates. The intention is that code will signal the cond var when the predicate becomes true.
Before testing the predicate we lock a mutex that covers the data being tested.
That way no other thread can change the data immediately after we test it (also
pthread_cond_wait() requires a locked mutex). If the predicate is already
true we needn’t wait on the cond var, so the loop falls through, otherwise the
thread begins to wait.
Condition variables allow you to make this series of events atomic: unlock a mutex, register our interest in the event, and block. Without that atomicity another thread might awaken to take our lock and broadcast before we’ve registered ourselves as interested. Without the atomicity we could be blocked forever.
When pthread_cond_wait() returns, the calling thread awakens and atomically
gets its mutex back. It’s all set to check the predicate again in the loop. But
why check the predicate? Wasn’t the cond var signaled because the predicate was
true, and isn’t the relevant data protected by a mutex? There are three reasons
to check:
1. Spurious wakeups: POSIX permits pthread_cond_wait() to return even when no thread signaled, so the predicate may still be false.
2. Stolen wakeups: after a broadcast (or even a signal), another awakened thread may reacquire the mutex first and falsify the predicate again before we run.
3. Liberal signaling: the event may be announced more often than the predicate actually changes. In our program we broadcast only when stats_best gets a new high score, but we could have chosen to signal at every invocation of stats_change().

Given that we have to pass a locked mutex to pthread_cond_wait(), which we
had to create, why don’t cond vars come with their own built-in mutex? The
reason is flexibility. Although you should use only one mutex with a cond var,
there can be multiple cond vars for the same mutex. Think of the example of the
mutex protecting a queue, and the different events that can happen in the
queue.
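Sketched as a data structure, that queue might look like this (the names and capacity are illustrative):

struct queue
{
	pthread_mutex_t mtx;     /* one mutex guards all fields below */
	pthread_cond_t nonempty; /* signaled when an item is added */
	pthread_cond_t nonfull;  /* signaled when space frees up */
	int items[16];
	size_t len;
};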
It’s time to bid farewell to the banker programs, and turn to something more lively: Conway’s Game of Life! The game has a set of rules operating on a grid of cells that determines which cells live or die based on how many living neighbors each has.
The game can take advantage of multiple processors, using each processor to operate on a different part of the grid in parallel. It’s a so-called embarrassingly parallel problem because each section of the grid can be processed in isolation, without needing results from other sections.
Barriers ensure that all threads have reached a particular stage in a parallel
computation before allowing any to proceed to the next stage. Each thread calls
pthread_barrier_wait() to rendezvous with the others. One of the threads,
chosen randomly, will see the PTHREAD_BARRIER_SERIAL_THREAD return value,
which nominates that thread to do any cleanup or preparation between stages.
/* life.c */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
/* mandatory in POSIX.1-2008, but check laggards like macOS */
#include <unistd.h>
#if !defined(_POSIX_BARRIERS) || _POSIX_BARRIERS < 0
#error your OS lacks POSIX barrier support
#endif
/* dimensions of board */
#define ROWS 32
#define COLS 78
/* how long to pause between rounds */
#define FRAME_MS 100
#define THREADS 4
/* proper modulus (in C, '%' is merely remainder) */
#define MOD(x,N) (((x) < 0) ? ((x) % (N) + (N)) : ((x) % (N)))
bool alive[ROWS][COLS], alive_next[ROWS][COLS];
pthread_barrier_t tick;
/* Should a cell live or die? Using ssize_t because we have
to deal with signed arithmetic like row-1 when row=0 */
bool fate(ssize_t row, ssize_t col)
{
ssize_t i, j;
short neighbors = 0;
assert(0 <= row && row < ROWS);
assert(0 <= col && col < COLS);
/* joined edges form a torus */
for (i = row-1; i <= row+1; i++)
for (j = col-1; j <= col+1; j++)
neighbors += alive[MOD(i, ROWS)][MOD(j, COLS)];
/* don't count self as a neighbor */
neighbors -= alive[row][col];
return neighbors == 3 ||
(neighbors == 2 && alive[row][col]);
}
/* overwrite the board on screen */
void draw(void)
{
ssize_t i, j;
/* clear screen (non portable, requires ANSI terminal) */
fputs("\033[2J\033[1;1H", stdout);
flockfile(stdout);
for (i = 0; i < ROWS; i++)
{
/* putchar_unlocked is thread safe when stdout is locked,
and it's as fast as single-threaded putchar */
for (j = 0; j < COLS; j++)
putchar_unlocked(alive[i][j] ? 'X' : ' ');
putchar_unlocked('\n');
}
funlockfile(stdout);
fflush(stdout);
}
void *update_strip(void *arg)
{
ssize_t offset = *(ssize_t*)arg, i, j;
struct timespec t;
t.tv_sec = 0;
t.tv_nsec = FRAME_MS * 1000000;
while (1)
{
if (pthread_barrier_wait(&tick) ==
PTHREAD_BARRIER_SERIAL_THREAD)
{
/* we drew the short straw, so we're on graphics duty */
/* could have used pointers to multidimensional
* arrays and swapped them rather than memcpy'ing
* the array contents, but it makes the code a
* little more complicated with dereferences */
memcpy(alive, alive_next, sizeof alive);
draw();
nanosleep(&t, NULL);
}
/* rejoin at another barrier to avoid data race on
the game board while it's copied and drawn */
pthread_barrier_wait(&tick);
for (i = offset; i < offset + (ROWS / THREADS); i++)
for (j = 0; j < COLS; j++)
alive_next[i][j] = fate(i, j);
}
return NULL;
}
int main(void)
{
pthread_t *workers;
ssize_t *offsets;
size_t i, j;
assert(ROWS % THREADS == 0);
/* main counts as a thread, so need only THREADS-1 more */
workers = malloc(sizeof(*workers) * (THREADS-1));
	offsets = malloc(sizeof(*offsets) * THREADS); /* one offset per worker */
srand(time(NULL));
for (i = 0; i < ROWS; i++)
for (j = 0; j < COLS; j++)
alive_next[i][j] = rand() < (int)((RAND_MAX+1u) / 3);
pthread_barrier_init(&tick, NULL, THREADS);
for (i = 0; i < THREADS-1; i++)
{
offsets[i] = i * ROWS / THREADS;
pthread_create(&workers[i], NULL, update_strip, &offsets[i]);
}
/* use current thread as a worker too */
offsets[i] = i * ROWS / THREADS;
update_strip(&offsets[i]);
/* shouldn't ever get here */
pthread_barrier_destroy(&tick);
free(offsets);
free(workers);
return EXIT_SUCCESS;
}

It’s a fun example although slightly contrived. We’re adding a sleep between rounds to slow down the animation, so it’s unnecessary to chase parallelism. Also there’s a memoized algorithm called hashlife we should be using if pure speed is the goal. However our code illustrates a natural use for barriers.
Notice how we wait at the barrier twice in rapid succession. After emerging from the first barrier, one of the threads (chosen at random) copies the new state to the board and draws it. The other threads run ahead to the next barrier and wait there so they don’t cause a data race writing to the board. Once the drawing thread arrives at the barrier with them, then all can proceed to calculate cells’ fate for the next round.
Barriers are guaranteed to be present in POSIX.1-2008, but are optional in earlier versions of the standard. Notably macOS is stuck at an old version of POSIX. Presumably they’re too busy “innovating” with their keyboard touchbar to invest in operating system fundamentals.
Spinlocks are implementations of mutexes optimized for fine-grained locking. Often used in low level code like drivers or operating systems, spinlocks are designed to be the most primitive and fastest sync mechanism available. They’re generally not appropriate for application programming. They are only truly necessary for situations like interrupt handlers when a thread is not allowed to go to sleep for any reason.
Aside from that scenario, it’s better to just use a mutex, since mutexes are pretty efficient these days. Modern mutexes often try a short-lived internal spinlock and fall back to heavier techniques only as needed. Mutexes also sometimes use a wait queue called a futex, which can take a lock in user-space whenever there is no contention from another thread.
When attempting to lock a spinlock, a thread runs a tight loop repeatedly checking a value in shared memory for a sign it’s safe to proceed. Spinlock implementations use special atomic assembly language instructions to test that the value is unlocked and lock it. The particular instructions vary per architecture, and can be performed in user space to avoid the overhead of a system call.
While waiting for a lock, the loop doesn’t block the thread, but instead continues running and burns CPU energy. The technique works only on true multi-processor systems, or a uniprocessor system with preemption enabled. On a uniprocessor system with cooperative threading the loop could never be interrupted, and will livelock.
In POSIX.1-2008 spinlock support is mandatory. In previous versions the
presence of this feature was indicated by the _POSIX_SPIN_LOCKS macro.
Spinlock functions start with pthread_spin_.
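A minimal sketch of the API (the counter is just for illustration):

#include <pthread.h>

pthread_spinlock_t lock;
long counter;

void counter_init(void)
{
	/* PTHREAD_PROCESS_PRIVATE: not shared across processes */
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
}

void counter_bump(void)
{
	pthread_spin_lock(&lock); /* busy-waits rather than sleeping */
	counter++;
	pthread_spin_unlock(&lock);
}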
Whereas a mutex enforces mutual exclusion, a reader-writer lock allows concurrent read access. Multiple threads can read in parallel, but all block when a thread takes the lock for writing. The increased concurrency can improve application performance. However, blindly replacing mutexes with reader-writer locks “for performance” doesn’t work. Our earlier banker program, for instance, could suffer from duplicate withdrawals if it allowed multiple readers in an account at once.
Below is an rwlock example. It’s a password cracker I call 5dm (md5 backwards). It aims for maximum parallelism searching for a preimage of an MD5 hash. Worker threads periodically poll whether one among them has found an answer, and they use a reader-writer lock to avoid blocking on each other when doing so.
The example is slightly contrived, in that the difficulty of brute forcing passwords increases exponentially with their length. Using multiple threads reduces the time by only a constant factor – but 4x faster is still 4x faster on a four core computer!
The example below uses MD5() from OpenSSL. To build it, include this in our
previous Makefile:
CFLAGS += `pkg-config --cflags libcrypto`
LDFLAGS += `pkg-config --libs-only-L libcrypto`
LDLIBS += `pkg-config --libs-only-l libcrypto`

To run it, pass in an MD5 hash and max preimage search length. Note the -n in
echo to suppress the newline, since newline is not in our search alphabet:
$ time ./5dm $(echo -n 'fun' | md5) 5
fun
real 0m0.067s
user 0m0.205s
sys	0m0.007s

Notice how 0.2 seconds of CPU time elapsed in parallel, but the user got their answer in 0.067 seconds.
On to the code:
/* 5dm.c */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
#include <pthread.h>
/* build arbitrary words from the ascii between ' ' and '~' */
#define ASCII_FIRST ' '
#define ASCII_LAST '~'
#define N_ALPHA (1 + ASCII_LAST - ASCII_FIRST)
/* refuse to search beyond this astronomical length */
#define LONGEST_PREIMAGE 128
#define MAX(x,y) ((x)<(y) ? (y) : (x))
/* a fast way to enumerate words, operating on an array in-place */
unsigned word_advance(char *word, unsigned delta)
{
if (delta == 0)
return 0;
if (*word == '\0')
{
*word++ = ASCII_FIRST + delta - 1;
*word = '\0';
}
else
{
char c = *word - ASCII_FIRST;
*word = ASCII_FIRST + ((c + delta) % N_ALPHA);
if (c + delta >= N_ALPHA)
return 1 + word_advance(word+1, 1 /* not delta */);
}
return 1;
}
/* pack each pair of ASCII hex digits into single bytes */
bool hex2md5(const char *hex, unsigned char *b)
{
int offset = 0;
if(strlen(hex) != MD5_DIGEST_LENGTH*2)
return false;
while (offset < MD5_DIGEST_LENGTH*2)
{
if (sscanf(hex+offset, "%2hhx", b++) == 1)
offset += 2;
else
return false;
}
return true;
}
/* random things a worker will need, since thread
* functions receive only one argument */
struct goal
{
/* input */
pthread_t *workers;
size_t n_workers;
size_t max_len;
unsigned char hash[MD5_DIGEST_LENGTH];
/* output */
pthread_rwlock_t lock;
char preimage[LONGEST_PREIMAGE];
bool success;
};
/* custom starting word for each worker, but shared goal */
struct task
{
struct goal *goal;
char initial_preimage[LONGEST_PREIMAGE];
};
void *crack_thread(void *arg)
{
struct task *t = arg;
unsigned len, changed;
unsigned char hashed[MD5_DIGEST_LENGTH];
char preimage[LONGEST_PREIMAGE];
int iterations = 0;
strcpy(preimage, t->initial_preimage);
len = strlen(preimage);
while (len <= t->goal->max_len)
{
MD5((const unsigned char*)preimage, len, hashed);
if (memcmp(hashed, t->goal->hash, MD5_DIGEST_LENGTH) == 0)
{
/* success -- tell others to call it off */
pthread_rwlock_wrlock(&t->goal->lock);
t->goal->success = true;
strcpy(t->goal->preimage, preimage);
pthread_rwlock_unlock(&t->goal->lock);
return NULL;
}
/* each worker jumps ahead n_workers words, and all workers
started at an offset, so all words are covered */
changed = word_advance(preimage, t->goal->n_workers);
len = MAX(len, changed);
/* check if another worker has succeeded, but only every
thousandth iteration, since taking the lock adds overhead */
if (iterations++ % 1000 == 0)
{
/* in the overwhelming majority of cases workers only read,
so an rwlock allows them to continue in parallel */
pthread_rwlock_rdlock(&t->goal->lock);
int success = t->goal->success;
pthread_rwlock_unlock(&t->goal->lock);
if (success)
return NULL;
}
}
return NULL;
}
/* launch a parallel search for an md5 preimage */
bool crack(const unsigned char *md5, size_t max_len,
unsigned threads, char *result)
{
struct goal g =
{
.workers = malloc(threads * sizeof(pthread_t)),
.n_workers = threads,
.max_len = max_len,
.success = false,
.lock = PTHREAD_RWLOCK_INITIALIZER
};
memcpy(g.hash, md5, MD5_DIGEST_LENGTH);
struct task *tasks = malloc(threads * sizeof(struct task));
for (size_t i = 0; i < threads; i++)
{
tasks[i].goal = &g;
tasks[i].initial_preimage[0] = '\0';
/* offset the starting word for each worker by i */
word_advance(tasks[i].initial_preimage, i);
pthread_create(g.workers+i, NULL, crack_thread, tasks+i);
}
/* if one worker finds the answer, others will abort */
for (size_t i = 0; i < threads; i++)
pthread_join(g.workers[i], NULL);
if (g.success)
strcpy(result, g.preimage);
free(tasks);
free(g.workers);
return g.success;
}
int main(int argc, char **argv)
{
char preimage[LONGEST_PREIMAGE];
int max_len = 4;
unsigned char md5[MD5_DIGEST_LENGTH];
if (argc != 2 && argc != 3)
{
fprintf(stderr,
"Usage: %s md5-string [search-depth]\n",
argv[0]);
return EXIT_FAILURE;
}
if (!hex2md5(argv[1], md5))
{
fprintf(stderr,
"Could not parse as md5: %s\n", argv[1]);
return EXIT_FAILURE;
}
if (argc > 2 && strtol(argv[2], NULL, 10))
if ((max_len = strtol(argv[2], NULL, 10)) > LONGEST_PREIMAGE)
{
fprintf(stderr,
"Preimages limited to %d characters\n",
LONGEST_PREIMAGE);
return EXIT_FAILURE;
}
if (crack(md5, max_len, 4, preimage))
{
puts(preimage);
return EXIT_SUCCESS;
}
else
{
fprintf(stderr,
"Could not find result in strings up to length %d\n",
max_len);
return EXIT_FAILURE;
}
}

Although read-write locks can be implemented in terms of mutexes and condition variables, such implementations are significantly less efficient than is possible. Therefore, this synchronization primitive is included in POSIX.1-2008 for the purpose of allowing more efficient implementations in multi-processor systems.
The final thing to be aware of is that an rwlock implementation can choose
either reader-preference or writer-preference. When readers and writers are
contending for a lock, the preference determines who gets to skip the queue and
go first. When there is a lot of reader activity with a reader-preference, then
a writer will continually get moved to the end of the line and experience
starvation, where it never gets to write. I noticed writer starvation on
Linux (glibc) when running four threads on a little 1-core virtual machine.
Glibc provides the nonportable pthread_rwlockattr_setkind_np() function to
specify a preference.
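A sketch of setting that preference on glibc (nonportable, hence the _np suffix):

#include <pthread.h>

pthread_rwlockattr_t attr;
pthread_rwlock_t lock;

pthread_rwlockattr_init(&attr);
pthread_rwlockattr_setkind_np(&attr,
	PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
pthread_rwlock_init(&lock, &attr);
pthread_rwlockattr_destroy(&attr);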
You may have noticed that workers in our password cracker use polling to see whether the solution has been found, and whether they should give up. We’ll examine a more explicit method of cancellation in a later section.
Semaphores keep count of, in the abstract, an amount of resource “units” available. Threads can safely add or remove a unit without causing a data race. When a thread requests a unit but there are none, then the thread will block.
A semaphore is like a mix between a lock and a condition variable. Unlike mutexes, semaphores have no concept of an owner. Any thread may release threads blocked on a semaphore, whereas with a mutex the lock holder must unlock it. Unlike a condition variable, a semaphore operates independently of a predicate.
An example of a problem uniquely suited for semaphores would be to ensure that exactly two threads run at once on a task. You would initialize the semaphore to the value two, and allow a bunch of threads to wait on the semaphore. After two get past, the rest will block. When each thread is done, it posts one unit back to the semaphore, which allows another thread to take its place.
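A sketch of that scenario (the function names are mine):

#include <semaphore.h>

sem_t slots;

void init(void)
{
	/* two units: at most two threads inside work() at once */
	sem_init(&slots, 0, 2);
}

void work(void)
{
	sem_wait(&slots); /* take a unit, blocking if none remain */
	/* ... do the task ... */
	sem_post(&slots); /* put the unit back */
}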
In reality, if you’ve got pthreads, you only need semaphores for asynchronous signal handlers. You can use them in other situations, but this is the only place they are needed. Mutexes aren’t async signal safe. Making them so would be much slower than an implementation that isn’t async signal safe, and would slow down ordinary mutex operation.
Here’s an example of posting a semaphore from a signal handler:
/* sem_tickler.c */
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#if !defined(_POSIX_SEMAPHORES) || _POSIX_SEMAPHORES < 0
#error your OS lacks POSIX semaphore support
#endif
sem_t tickler;
void int_catch(int sig)
{
(void) sig;
signal(SIGINT, &int_catch);
	sem_post(&tickler); /* async signal safe */
}
int main(void)
{
sem_init(&tickler, 0, 0);
signal(SIGINT, &int_catch);
for (int i = 0; i < 3; i++)
{
sem_wait(&tickler);
puts("That tickles!");
}
puts("(Died from overtickling)");
return 0;
}

Semaphores aren’t even necessary for proper signal handling. It’s easier to
have a thread simply sigwait() than it is to set up an asynchronous handler.
In the example below, the main thread waits, but you can spawn a dedicated
thread for this in a real application.
/* sigwait_tickler.c */
#include <signal.h>
#include <stdio.h>
int main(void)
{
sigset_t set;
int which;
sigemptyset(&set);
	sigaddset(&set, SIGINT);
	/* POSIX requires SIGINT to be blocked before sigwait() */
	pthread_sigmask(SIG_BLOCK, &set, NULL);
for (int i = 0; i < 3; i++)
{
sigwait(&set, &which);
puts("That tickles!");
}
puts("(Died from overtickling)");
return 0;
}

So don’t feel dependent on semaphores. In fact your system may not have them. The POSIX semaphore API works with pthreads and is present in POSIX.1-2008, but is an optional part of POSIX.1b in earlier versions. Apple, for one, decided to punt, so the semaphore functions on macOS are stubbed to return error codes.
Thread cancellation is generally used when you have threads doing long-running tasks and there’s a way for a user to abort through the UI or console. Another common scenario is when multiple threads set off to explore a search space and one finds the answer first.
Our previous reader-writer lock example was the second scenario, where the threads explored a search space. It was an example of do-it-yourself cancellation through polling. However sometimes threads aren’t able to poll, such as when they are blocked on I/O or a lock. Pthreads offers an API to cancel threads even in those situations.
By default a cancelled thread isn’t immediately blown away, because it may have a mutex locked, be holding resources, or have a potentially broken invariant. The canceller wouldn’t know how to repair that invariant without some complicated logic. The thread to be canceled needs to be written to do cleanup and unlock mutexes.
For each thread, cancellation can be enabled or disabled, and if enabled, may be in deferred or asynchronous mode. The default is enabled and deferred, which allows a cancelled thread to survive until it reaches the next cancellation point, such as waiting on a condition variable or blocking on I/O (the standard enumerates the full list).
In a purely computational section of code you can add your own cancellation
points with pthread_testcancel().
Let’s see how to modify our previous MD5 cracking example using standard
pthread cancellation. Three of the functions are the same as before:
word_advance(), hex2md5(), and main(). But we now use a condition
variable to alert crack() whenever a crack_thread() returns.
/* 5dm-testcancel.c */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
#include <pthread.h>
#define ASCII_FIRST ' '
#define ASCII_LAST '~'
#define N_ALPHA (1 + ASCII_LAST - ASCII_FIRST)
#define LONGEST_PREIMAGE 128
#define MAX(x,y) ((x)<(y) ? (y) : (x))
unsigned word_advance(char *word, unsigned delta)
{
if (delta == 0)
return 0;
if (*word == '\0')
{
*word++ = ASCII_FIRST + delta - 1;
*word = '\0';
}
else
{
char c = *word - ASCII_FIRST;
*word = ASCII_FIRST + ((c + delta) % N_ALPHA);
if (c + delta >= N_ALPHA)
return 1 + word_advance(word+1, 1 /* not delta */);
}
return 1;
}
bool hex2md5(const char *hex, unsigned char *b)
{
int offset = 0;
if(strlen(hex) != MD5_DIGEST_LENGTH*2)
return false;
while (offset < MD5_DIGEST_LENGTH*2)
{
if (sscanf(hex+offset, "%2hhx", b++) == 1)
offset += 2;
else
return false;
}
return true;
}
struct goal
{
/* input */
pthread_t *workers;
size_t n_workers;
size_t max_len;
unsigned char hash[MD5_DIGEST_LENGTH];
/* output */
pthread_mutex_t lock;
pthread_cond_t returning;
unsigned n_done;
char preimage[LONGEST_PREIMAGE];
bool success;
};
struct task
{
struct goal *goal;
char initial_preimage[LONGEST_PREIMAGE];
};
void *crack_thread(void *arg)
{
struct task *t = arg;
unsigned len, changed;
unsigned char hashed[MD5_DIGEST_LENGTH];
char preimage[LONGEST_PREIMAGE];
int iterations = 0;
strcpy(preimage, t->initial_preimage);
len = strlen(preimage);
while (len <= t->goal->max_len)
{
MD5((const unsigned char*)preimage, len, hashed);
if (memcmp(hashed, t->goal->hash, MD5_DIGEST_LENGTH) == 0)
{
pthread_mutex_lock(&t->goal->lock);
t->goal->success = true;
strcpy(t->goal->preimage, preimage);
t->goal->n_done++;
/* alert the boss that another worker is done */
pthread_cond_signal(&t->goal->returning);
pthread_mutex_unlock(&t->goal->lock);
return NULL;
}
changed = word_advance(preimage, t->goal->n_workers);
len = MAX(len, changed);
if (iterations++ % 1000 == 0)
pthread_testcancel(); /* add a cancellation point */
}
pthread_mutex_lock(&t->goal->lock);
t->goal->n_done++;
/* alert the boss that another worker is done */
pthread_cond_signal(&t->goal->returning);
pthread_mutex_unlock(&t->goal->lock);
return NULL;
}
/* cancellation cleanup function that we also call
* during regular exit from the crack() function */
void crack_cleanup(void *arg)
{
struct task *tasks = arg;
struct goal *g = tasks[0].goal;
/* this mutex unlock pairs with the lock in the crack() function */
pthread_mutex_unlock(&g->lock);
for (size_t i = 0; i < g->n_workers; i++)
{
pthread_cancel(g->workers[i]);
/* must wait for each to terminate, so that freeing
* their shared memory is safe */
pthread_join(g->workers[i], NULL);
}
/* now it's safe to free memory */
free(g->workers);
free(tasks);
}
bool crack(const unsigned char *md5, size_t max_len,
unsigned threads, char *result)
{
struct goal g =
{
.workers = malloc(threads * sizeof(pthread_t)),
.n_workers = threads,
.max_len = max_len,
.success = false,
.n_done = 0,
.lock = PTHREAD_MUTEX_INITIALIZER,
.returning = PTHREAD_COND_INITIALIZER
};
memcpy(g.hash, md5, MD5_DIGEST_LENGTH);
struct task *tasks = malloc(threads * sizeof(struct task));
for (size_t i = 0; i < threads; i++)
{
tasks[i].goal = &g;
tasks[i].initial_preimage[0] = '\0';
word_advance(tasks[i].initial_preimage, i);
pthread_create(g.workers+i, NULL, crack_thread, tasks+i);
}
/* coming up to cancellation points, so establish
* a cleanup handler */
pthread_cleanup_push(crack_cleanup, tasks);
pthread_mutex_lock(&g.lock);
/* We can't join() on all the workers now because it's up to
* us to cancel them after one finds the answer. We have to
* remain responsive and not block on any particular worker */
while (!g.success && g.n_done < threads)
pthread_cond_wait(&g.returning, &g.lock);
/* at this point either a thread succeeded or all have given up */
if (g.success)
strcpy(result, g.preimage);
/* mutex unlocked in the cleanup handler */
/* Use the same cleanup handler for normal exit too. The "1"
	 * argument says to execute the function we had previously pushed */
pthread_cleanup_pop(1);
return g.success;
}
int main(int argc, char **argv)
{
char preimage[LONGEST_PREIMAGE];
int max_len = 4;
unsigned char md5[MD5_DIGEST_LENGTH];
if (argc != 2 && argc != 3)
{
fprintf(stderr,
"Usage: %s md5-string [search-depth]\n",
argv[0]);
return EXIT_FAILURE;
}
if (!hex2md5(argv[1], md5))
{
fprintf(stderr,
"Could not parse as md5: %s\n", argv[1]);
return EXIT_FAILURE;
}
if (argc > 2 && strtol(argv[2], NULL, 10))
if ((max_len = strtol(argv[2], NULL, 10)) > LONGEST_PREIMAGE)
{
fprintf(stderr,
"Preimages limited to %d characters\n",
LONGEST_PREIMAGE);
return EXIT_FAILURE;
}
if (crack(md5, max_len, 4, preimage))
{
puts(preimage);
return EXIT_SUCCESS;
}
else
{
fprintf(stderr,
"Could not find result in strings up to length %d\n",
max_len);
return EXIT_FAILURE;
}
}

Using cancellation is actually a little more flexible than our rwlock
implementation in 5dm. If the crack() function is running in its own thread,
the whole thing can now be cancelled. The cancellation handler will “pass
along” the cancellation to each of the worker threads.
Writing general purpose library code that works with threads requires some care. It should handle deferred cancellation gracefully, including disabling cancellation when appropriate and always using cleanup handlers.
For cleanup handlers, notice the pattern of how we pthread_cleanup_push() the
cancellation handler, and later pthread_cleanup_pop() it for regular
(non-cancel) cleanup too. Using the same cleanup procedure in all situations
makes the code more reliable.
Also notice how the boss thread now cancels workers, rather than the winning worker cancelling the others. You can join a canceled thread, but you can’t cancel an already joined (or detached) thread. If you want to both cancel and join a thread it ought to be done in one place.
Let’s turn our attention to the new worker threads. They are still polling for cancellation, like they polled with the reader-writer locks, but in this case they do it with a new function:
if (iterations++ % 1000 == 0)
	pthread_testcancel();

Admittedly it adds a little overhead to poll every thousandth iteration, both with the rwlock and with the testcancel. It also adds latency between the cancellation request and the thread quitting, since the loop could run up to 999 more times in between. A more efficient but dangerous method is to enable asynchronous cancellation, meaning the thread immediately dies when cancelled.
Async cancellation is dangerous because code is seldom async-cancel-safe.
Anything that uses locks or works with shared state even slightly can break
badly. Async-cancel-safe code can call very few functions, since those
functions may not be safe. This includes calling libraries that use something
as innocent as malloc(), since stopping malloc part way through could corrupt
the heap.
Our crack_thread() function should be async-cancel-safe, at least during its
calculation and not when taking locks. The MD5() function from OpenSSL also
appears to be safe. Here’s how we can rewrite our function (notice how we
disable cancellation before taking a lock):
/* rewritten to use async cancellation */
void *crack_thread(void *arg)
{
struct task *t = arg;
unsigned len, changed;
unsigned char hashed[MD5_DIGEST_LENGTH];
char preimage[LONGEST_PREIMAGE];
int cancel_type, cancel_state;
strcpy(preimage, t->initial_preimage);
len = strlen(preimage);
/* async so we don't have to pthread_testcancel() */
pthread_setcanceltype(
PTHREAD_CANCEL_ASYNCHRONOUS, &cancel_type);
while (len <= t->goal->max_len)
{
MD5((const unsigned char*)preimage, len, hashed);
if (memcmp(hashed, t->goal->hash, MD5_DIGEST_LENGTH) == 0)
{
/* protect the mutex against async cancellation */
pthread_setcancelstate(
PTHREAD_CANCEL_DISABLE, &cancel_state);
pthread_mutex_lock(&t->goal->lock);
t->goal->success = true;
strcpy(t->goal->preimage, preimage);
t->goal->n_done++;
pthread_cond_signal(&t->goal->returning);
pthread_mutex_unlock(&t->goal->lock);
return NULL;
}
changed = word_advance(preimage, t->goal->n_workers);
len = MAX(len, changed);
}
/* restore original cancellation type */
pthread_setcanceltype(cancel_type, &cancel_type);
pthread_mutex_lock(&t->goal->lock);
t->goal->n_done++;
pthread_cond_signal(&t->goal->returning);
pthread_mutex_unlock(&t->goal->lock);
return NULL;
}

Asynchronous cancellation does not appear to work on macOS, but as we’ve seen that’s par for the course on that operating system.
DRD and Helgrind are Valgrind tools for detecting errors in multithreaded C and C++ programs. The tools work for any program that uses the POSIX threading primitives or that uses threading concepts built on top of the POSIX threading primitives.
The tools have overlapping abilities like detecting data races and improper use of the pthreads API. Additionally, Helgrind can detect locking hierarchy violations, and DRD can alert when there is lock contention.
Both tools pinpoint the lines of code where problems arise. For example, we can run DRD on our first crazy bankers program:
valgrind --tool=drd ./banker

Here is a characteristic example of an error it emits:
==8524== Thread 3:
==8524== Conflicting load by thread 3 at 0x003090b0 size 8
==8524== at 0x1088BD: disburse (banker.c:48)
==8524== by 0x4C324F3: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==8524== by 0x4E514A3: start_thread (pthread_create.c:456)
==8524== Allocation context: BSS section of /home/admin/banker
==8524== Other segment start (thread 2)
==8524== at 0x514FD01: clone (clone.S:80)
==8524== Other segment end (thread 2)
==8524== at 0x509D820: rand (rand.c:26)
==8524== by 0x108857: rand_range (banker.c:26)
==8524== by 0x1088A0: disburse (banker.c:42)
==8524== by 0x4C324F3: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==8524== by 0x4E514A3: start_thread (pthread_create.c:456)
It finds conflicting loads and stores from lines 48, 51, and 52.
48: if (accts[from].balance > 0)
49: {
50: payment = 1 + rand_range(accts[from].balance);
51: accts[from].balance -= payment;
52: accts[to].balance += payment;
53: }

Helgrind can identify the lock hierarchy violation in our example of deadlocking bankers:
valgrind --tool=helgrind ./banker_lock

==8989== Thread #4: lock order "0x3091F8 before 0x3090D8" violated
==8989==
==8989== Observed (incorrect) order is: acquisition of lock at 0x3090D8
==8989== at 0x4C3010C: mutex_lock_WRK (hg_intercepts.c:904)
==8989== by 0x1089B9: disburse (banker_lock.c:38)
==8989== by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==8989== by 0x4E454A3: start_thread (pthread_create.c:456)
==8989==
==8989== followed by a later acquisition of lock at 0x3091F8
==8989== at 0x4C3010C: mutex_lock_WRK (hg_intercepts.c:904)
==8989== by 0x1089D1: disburse (banker_lock.c:39)
==8989== by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==8989== by 0x4E454A3: start_thread (pthread_create.c:456)
To identify when there is too much contention for a lock, we can ask DRD to alert us when a thread blocks for more than n milliseconds on a mutex:
valgrind --tool=drd --exclusive-threshold=2 ./banker_lock_hierarchy

Since we throw too many threads at a small number of accounts, we see wait times that cross the threshold, like this one that waited seven ms:
==7565== Acquired at:
==7565== at 0x483F428: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:888)
==7565== by 0x483F428: pthread_mutex_lock (drd_pthread_intercepts.c:898)
==7565== by 0x109280: disburse (banker_lock_hierarchy.c:40)
==7565== by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565== by 0x4863FA2: start_thread (pthread_create.c:486)
==7565== by 0x49764CE: clone (clone.S:95)
==7565== Lock on mutex 0x10c258 was held during 7 ms (threshold: 2 ms).
==7565== at 0x4840478: pthread_mutex_unlock_intercept (drd_pthread_intercepts.c:978)
==7565== by 0x4840478: pthread_mutex_unlock (drd_pthread_intercepts.c:991)
==7565== by 0x109395: disburse (banker_lock_hierarchy.c:47)
==7565== by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565== by 0x4863FA2: start_thread (pthread_create.c:486)
==7565== by 0x49764CE: clone (clone.S:95)
==7565== mutex 0x10c258 was first observed at:
==7565== at 0x483F368: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:885)
==7565== by 0x483F368: pthread_mutex_lock (drd_pthread_intercepts.c:898)
==7565== by 0x109280: disburse (banker_lock_hierarchy.c:40)
==7565== by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565== by 0x4863FA2: start_thread (pthread_create.c:486)
==7565== by 0x49764CE: clone (clone.S:95)
ThreadSanitizer is a clang instrumentation module. To use it, choose CC = clang and add -fsanitize=thread to CFLAGS. Then when you build programs,
they will be modified to detect data races and print statistics to stderr.
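If you’d like a self-contained program to try the sanitizer on, here’s a minimal sketch (race.c is hypothetical, not part of the banker example) with an obvious data race on a shared counter:

/* race.c -- build with: clang -g -fsanitize=thread race.c -lpthread */
#include <pthread.h>
#include <stdio.h>

long counter; /* shared and unsynchronized on purpose */

static void *bump(void *arg)
{
	long i;
	(void)arg;
	for (i = 0; i < 100000; i++)
		counter++; /* racy read-modify-write */
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, bump, NULL);
	pthread_create(&b, NULL, bump, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("final value: %ld\n", counter);
	return 0;
}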
Here’s a portion of the output when running the bankers program:
WARNING: ThreadSanitizer: data race (pid=11312)
Read of size 8 at 0x0000014aeeb0 by thread T2:
#0 disburse /home/admin/banker.c:48 (banker+0x0000004a4372)
Previous write of size 8 at 0x0000014aeeb0 by thread T1:
#0 disburse /home/admin/banker.c:52 (banker+0x0000004a43ba)
TSan can also detect lock hierarchy violations, such as in banker_lock:
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=10095)
Cycle in lock order graph: M1 (0x0000014aef78) => M2 (0x0000014aeeb8) => M1
Mutex M2 acquired here while holding mutex M1 in thread T1:
#0 pthread_mutex_lock <null> (banker_lock+0x000000439a10)
#1 disburse /home/admin/banker_lock.c:39 (banker_lock+0x0000004a4398)
Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning message
Mutex M1 acquired here while holding mutex M2 in thread T9:
#0 pthread_mutex_lock <null> (banker_lock+0x000000439a10)
#1 disburse /home/admin/banker_lock.c:39 (banker_lock+0x0000004a4398)
While Valgrind DRD can identify highly contended locks, it virtualizes the execution of the program under test, and skews the numbers. Other utilities can use software probes to get this information from a test running at full speed. In BSD land there is the plockstat provider for DTrace, and on Linux there is the specially-written mutrace. I had a lot of trouble trying to get plockstat to work on FreeBSD, so here’s an example of using mutrace to analyze our banker program.
mutrace ./banker_lock_hierarchy

mutrace: Showing 10 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
0 200211 153664 95985 991.349 0.005 0.267 M-.--.
1 200552 142173 61902 641.963 0.003 0.170 M-.--.
2 199657 140837 47723 476.737 0.002 0.125 M-.--.
3 199566 140863 39268 371.451 0.002 0.108 M-.--.
4 199936 141381 33243 295.909 0.001 0.090 M-.--.
5 199548 141297 28193 232.647 0.001 0.084 M-.--.
6 200329 142027 24230 183.301 0.001 0.066 M-.--.
7 199951 142338 21018 142.494 0.001 0.057 M-.--.
8 200145 142990 18201 107.692 0.001 0.052 M-.--.
9 200105 143794 15713 76.231 0.000 0.028 M-.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 1896.903 ms.
mutrace: Results for SMP with 4 processors.
Typical profilers measure the amount of CPU time spent in each function. However when a thread is blocked by I/O, a lock, or a condition variable, then it isn’t using CPU time. To determine where functions spend the most “wall clock time,” we need to sample the call stack for all threads at intervals, and count how frequently we see each entry. When a thread is off-CPU its call stack stays unchanged.
The pstack program is traditionally the way to get a snapshot of a running
program’s stack. It exists on old Unices, and used to be on Linux until Linux
made a breaking change. The most portable way to get stack snapshots is using
gdb with an awk wrapper, as documented in the Poor Man’s
Profiler.
Remember our early condition variable example that measured how many threads
entered the critical section in disburse() at once? We asked whether
synchronization on stats_mtx threw off the measurement. With off-CPU
profiling we can look for clues.
Here’s a script based on the Poor Man’s Profiler:
./banker_stats &
pid=$!
while kill -0 $pid
do
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid
done | \
awk '
BEGIN { s = ""; }
/^Thread/ { print s; s = ""; }
/^\#/ { if (s != "" ) { s = s "," $4} else { s = $4 } }
END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1

It outputs limited information, but we can see that waiting for locks in
disburse() takes the majority of program time, being present in 872 of our
samples. By contrast, waiting for the stats_mtx lock in stats_update()
doesn’t appear in our sample at all. It must have had very little effect on
our parallelism.
872 at,__GI___pthread_mutex_lock,disburse,start_thread,clone
11 at,__random,rand,rand_range,disburse,start_thread,clone
9 expected=0,,mutex=0x562533c3f0c0,<stats_cnd>,,stats_print,start_thread,clone
9 __GI___pthread_timedjoin_ex,main
5 at,__pthread_mutex_unlock_usercnt,disburse,start_thread,clone
1 at,__pthread_mutex_unlock_usercnt,stats_change,disburse,start_thread,clone
1 at,__GI___pthread_mutex_lock,stats_change,disburse,start_thread,clone
1 __random,rand,rand_range,disburse,start_thread,clone
Although Mac’s POSIX thread support is pretty weak, its Xcode tooling does include a nice profiler. From the Instruments application, choose the profiling template called “System Trace.” It adds a GUI on top of DTrace to display thread states (among other things). I modified our banker program to use only five threads and recorded its run. The Instruments app visualizes every event that happens, including threads blocking and being interrupted:
Within the program you can zoom into the history and hover over events for info.
Perf is a Linux tool to measure hardware performance counters during the execution of a program. Joe Mario created a Perf feature called c2c which detects false sharing of variables between CPUs.
In a NUMA multi-core computer, each CPU has its own set of caches, and all CPUs share main memory. Memory is divided into fixed size blocks (often 64 bytes) called cache lines. Any time a CPU reads or writes memory, it must fetch or store the entire cache line surrounding the desired address. If one CPU has already cached a line, and another CPU writes to that area in memory, the system has to perform an expensive operation to make the caches coherent.
When two unrelated variables in a program are stored close enough together in memory to be in the same cache line, it can cause a performance problem in multi-threaded programs. If threads running on separate CPUs access the unrelated variables, it can cause a tug of war between their underlying cache line, which is called false sharing.
For instance, our Game of Life simulator could potentially have false sharing at the edges of each section of board accessed by each thread. To verify this, I attempted to run perf c2c on an Amazon EC2 instance (since I lack a physical computer running Linux), but got an error that memory events are not supported on the virtual machine. I was running kernel 4.19.0 on Intel Xeon Platinum 8124M CPUs, so I assume this was a security restriction from Amazon.
If you are able to run c2c, and detect false sharing in a multi-threaded program, the solution is to align the variables more aggressively. POSIX provides the posix_memalign() function to allocate bytes aligned on a desired boundary. In our Life example, we could have used an array of pointers to dynamically allocated rows rather than a contiguous two-dimensional array.
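As a sketch of that remedy (the 64-byte line size is an assumption; real code should query the hardware), each thread’s hot data can be placed on its own cache line:

#include <stdlib.h>

#define CACHE_LINE 64 /* assumed line size */

/* allocate a counter aligned to a full cache line and padded
 * to fill it, so no other variable can share the line */
long *make_private_counter(void)
{
	void *p;
	if (posix_memalign(&p, CACHE_LINE, CACHE_LINE) != 0)
		return NULL;
	*(long *)p = 0;
	return p;
}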
The VTune Profiler is available for free (with registration) on Linux, macOS, and Windows. It works on x86 hardware only of course. I haven’t used it, but their marketing page shows some nice pictures. The tool can visually identify the granularity of locks, present a prioritized list of synchronization objects that hurt performance, and visualize lock contention.
To go beyond the topics in this blog post, I’d recommend getting a paper copy
of the manual and a good pocket reference. I couldn’t find any hard copy of the
official Vim manual, and ended up printing this PDF
using printme1.com. The PDF is a printer-friendly
version of the files $VIMRUNTIME/doc/usr_??.txt distributed with the editor.
For a convenient list of commands, I’d recommend the vi and Vim Editors Pocket
Reference.
Vi commands and features go back more than fifty years, starting with the QED editor. Here is the lineage:

You can discover the similarities all the way between QED and ex by reading the QED manual and ex manual. Both editors use a similar grammar to specify and operate on line ranges.
Editors like QED, ed, and em were designed for hard-copy terminals, which are basically electric typewriters with a modem attached. Hard-copy terminals print system output on paper. Output could not be changed once printed, obviously, so the editing process consisted of user commands to update and manually print ranges of text.

By 1976 video terminals such as the ADM-3A started to be available. The Ex editor added an “open mode” which allowed intraline editing on video terminals, and a visual mode for screen-oriented editing on cursor-addressable terminals. The visual mode (activated with the command “vi”) kept an up-to-date view of part of the file on screen, while preserving an ex command line at the bottom of the screen. (Fun fact: the h,j,k,l keys on the ADM-3A had arrows drawn on them, so that choice of motion keys in vi was simply to match the keyboard.)
Learn more about the journey from ed to ex/vi in this interview with Bill Joy. He talks about how he made ex/vi, and some things that disappointed him about it.
Classic vi is truly just an alter-ego of ex – they are the same binary, which decides to start in ex mode or vi mode based on the name of the executable invoked. The legacy of all this history is that ex/vi is refined by use, requires scant system resources, and can operate under limited bandwidth communication. It is also available on most systems and fully specified in POSIX.
Being a derivative of ed, the ex/vi editor was intellectual property of AT&T. To use vi on platforms other than Unix, people had to write clones that did not share in the original codebase.
Some of the clones:
We’ll be focusing on that little one in the middle: vim. Bram Moolenaar wanted to use vi on the Amiga. He began porting Stevie from the Atari and evolving it. He called his port “Vi IMitation.” For a full first-hand account, see Bram’s interview with Free Software Magazine.
By version 1.22 Vim was rechristened “Vi IMproved,” matching and surpassing features of the original. Here is the timeline of the next major versions, with some of their big features:
| 1991 Nov 2 | Vim 1.14: First release (on Fred Fish disk #591). |
| 1992 | Vim 1.22: Port to Unix. Vim now competes with Vi. |
| 1994 Aug 12 | Vim 3.0: Support for multiple buffers and windows. |
| 1996 May 29 | Vim 4.0: Graphical User Interface (largely by Robert Webb). |
| 1998 Feb 19 | Vim 5.0: Syntax coloring/highlighting. |
| 2001 Sep 26 | Vim 6.0: Folding, plugins, vertical split. |
| 2006 May 8 | Vim 7.0: Spell check, omni completion, undo branches, tabs. |
| 2016 Sep 12 | Vim 8.0: Jobs, async I/O, native packages. |
For more info about each version, see e.g. :help vim8. To see plans for
the future, including known bugs, see :help todo.txt.
Version 8 included some async job support due to peer pressure from NeoVim, whose developers wanted to run debuggers and REPLs for their web scripting languages inside the editor.
Vim is super portable. By adapting over time to work on a wide variety of platforms, the editor was forced to keep portable coding habits. It runs on OS/390, Amiga, BeOS and BeBox, Macintosh classic, Atari MiNT, MS-DOS, OS/2, QNX, RISC-OS, BSD, Linux, OS X, VMS, and MS-Windows. You can rely on Vim being there no matter what computer you’re using.
In a final twist in the vi saga, the original ex/vi source code was finally released in 2002 under a BSD free software license. It is available at ex-vi.sourceforge.net.
Let’s get down to business. Before getting to odds, ends, and intermediate tricks, it helps to understand how Vim organizes and reads its configuration files.
I used to think, incorrectly, that Vim reads all its settings and scripts from the ~/.vimrc file alone. Browsing random “dotfiles” repositories can reinforce this notion. Quite often people publish monstrous single .vimrc files that try to control every aspect of the editor. These big configs are sometimes called “vim distros.”
In reality Vim has a tidy structure, where .vimrc is just one of several inputs. In fact you can ask Vim exactly which scripts it has loaded. Try this: edit a source file from a random programming project on your computer. Once loaded, run
:scriptnames
Take time to read the list. Try to guess what the scripts might do, and note the directories where they live.
Was the list longer than you expected? If you have installed loads of plugins
the editor has a lot to do. Check what slows down the editor most at startup by
running the following and look at the start.log it creates:
vim --startuptime start.log name-of-your-fileJust for comparison, see how quickly Vim starts without your existing configuration:
vim --clean --startuptime clean.log name-of-your-fileTo determine which scripts to run at startup or buffer load time, Vim traverses a “runtime path.” The path is a comma-separated list of directories that each contain a common structure. Vim inspects the structure of each directory to find scripts to run. Directories are processed in the order they appear in the list.
Check the runtimepath on your system by running:
:set runtimepath
My system contains the following directories in the default value for
runtimepath. Not all of them even exist in the filesystem, but they would be
consulted if they did.
Because directories are processed in the order they appear in the list, the only thing special about the “after” directories is that they come last. There is nothing magical about the word “after.”
When processing each directory, Vim looks for subfolders with specific names.
To learn more about them, see :help runtimepath. Here is a selection of those
we will be covering, with brief descriptions.
| compiler | definitions of how to invoke compilers or linters and parse their output; activated manually with :compiler |
| ftdetect | scripts to detect and set a buffer’s filetype |
| ftplugin | scripts that run when a buffer’s filetype is set |
| pack | container for Vim 8 native packages |
| plugin | scripts that run once at startup |
Finally, ~/.vimrc is the catchall for general editor settings. Use it
for setting defaults that can be overridden for particular file types.
For a comprehensive overview of settings you can choose in .vimrc, run
:options.
Plugins are simply Vim scripts that must be put into the correct places in the runtimepath in order to execute. Installing them is conceptually easy: download the file(s) into place. The challenge is that it’s hard to remove or update some plugins because they litter subdirectories in the runtimepath with their scripts, and it can be hard to tell which plugin is responsible for which files.
“Plugin managers” evolved to address this need. Vim.org has had a plugin registry going back at least as far as 2003 (as identified by the Internet Archive). However it wasn’t until about 2008 that the notion of a plugin manager really came into vogue.
These tools add plugins’ separate directories to Vim’s runtimepath, and compile help tags for plugin documentation. Most plugin managers also install and update plugin code from the internet, sometimes in parallel or with colorful progress bars.
In chronological order, here is the parade of plugin managers. I based the date ranges on earliest and latest releases of each, or when no official releases are identified, on the earliest and latest commit dates.
The first thing to note is the overwhelming variety of these tools, and the second is that each is typically active for about four years before presumably going out of fashion.
The most stable way to manage plugins is to simply use Vim 8’s built-in functionality, which requires no third-party code. Let’s walk through how to do it.
First create two directories, opt and start, within a pack directory in your runtimepath.
mkdir -p ~/.vim/pack/foobar/{opt,start}

Note the placeholder “foobar.” This name is entirely up to you. It classifies the packages that will go inside. Most people throw all their plugins into a single nondescript category, which is fine. Pick whatever name you like; I’ll continue to use foobar here. You could theoretically create multiple categories too, like ~/.vim/pack/navigation and ~/.vim/pack/linting. Note that Vim does not detect duplication between categories and will double-load duplicates if they exist.
Packages in “start” get loaded automatically, whereas those in “opt” won’t load
until specifically requested in Vim with the :packadd command. Opt is good
for lesser-used packages, and keeps Vim fast by not running scripts
unnecessarily. Note that there isn’t a counterpart to :packadd to unload a
package.
For this example we’ll add the “ctrlp” fuzzy find plugin to opt. Download and extract the latest release into place:
curl -L https://github.com/kien/ctrlp.vim/archive/1.79.tar.gz \
| tar zx -C ~/.vim/pack/foobar/opt

That command creates a ~/.vim/pack/foobar/opt/ctrlp.vim-1.79 folder, and the package is ready to use. Back in vim, create a helptags index for the new package:
:helptags ~/.vim/pack/foobar/opt/ctrlp.vim-1.79/doc
That creates a file called “tags” in the package’s doc folder, which makes the
topics available for browsing in Vim’s internal help system. (Alternately you
can run :helptags ALL once the package has been loaded, which takes care of
all docs in the runtimepath.)
When you want to use the package, load it (and know that tab completion works for plugin names, so you don’t have to type the whole name):
:packadd ctrlp.vim-1.79
Packadd includes the package’s base directory in the runtimepath, and sources its plugin and ftdetect scripts. After loading ctrlp, you can press CTRL-P to pop up a fuzzy find file matcher.
Some people keep their ~/.vim directory under version control and use git submodules for each package. For my part, I simply extract packages from tarballs and track them in my own repository. If you use mature packages you don’t need to upgrade them often, plus the scripts are generally small and don’t clutter git history much.
Depending on user settings, Vim can protect against four types of loss:

1. A crash during editing, losing changes made since the last save. The swap file, updated periodically, preserves them.
2. Editing the same file with two instances of Vim and overwriting your own work. The swap file warns about this too.
3. A crash during the write itself, after the destination file has been truncated but before the new content is fully on disk. The write backup guards against this.
4. Saving bad content over a good file and wanting the old version back. Persistent backups and undo history cover this.
Before examining sensible settings, how about some comic relief? Here is just a sampling of comments from vimrc files on GitHub:
The comments reflect awareness of only the fourth case above (and the third by accident), whereas the authors generally go on to disable the swap file too, leaving one and two unprotected.
Here is the configuration I recommend to keep your edits safe:
" Protect changes between writes. Default values of
" updatecount (200 keystrokes) and updatetime
" (4 seconds) are fine
set swapfile
set directory^=~/.vim/swap//
" protect against crash-during-write
set writebackup
" but do not persist backup after successful write
set nobackup
" use rename-and-write-new method whenever safe
set backupcopy=auto
" patch required to honor double slash at end
if has("patch-8.1.0251")
" consolidate the writebackups -- not a big
" deal either way, since they usually get deleted
set backupdir^=~/.vim/backup//
end
" persist the undo tree for each file
set undofile
set undodir^=~/.vim/undo//
These settings enable backups for writes-in-progress, but do not persist them
after successful write because version control etc etc. Note that you’ll need
to mkdir ~/.vim/{swap,backup,undo} or else Vim will fall back to the next
available folder in the preference list. You should also probably chmod the
folders to keep the contents private, because the swap files and undo history
might contain sensitive information.
One thing to note about the paths in our config is that they end in a double
slash. That ending enables a feature to disambiguate swaps and backups for
files with the same name that live in different directories. For instance the
swap file for /foo/bar will be saved in ~/.vim/swap/%foo%bar.swp (slashes
escaped as percent signs). Vim had a bug until a fairly recent patch where the
double slash was not honored for backupdir, and we guard against that above.
We also have Vim persist the history of undos for each file, so that you can apply them even after quitting and editing the file again. While it may sound redundant with the swap file, the undo history is complementary because it is written only when the file is written. (If it were written more frequently it might not match the state of the file on disk after a crash, so Vim doesn’t do that.)
Speaking of undo, Vim maintains a full tree of edit history. This means you can
make a change, undo it, then redo it differently and all three states are
recoverable. You can see the times and magnitude of changes with the
:undolist command, but it’s hard to visualize the tree structure from it. You
can navigate to specific changes in that list, or move in time with :earlier
and :later which take a time argument like 5m, or the count of file saves,
like 3f. However navigating the undo tree is an instance when I think a plugin
– like undotree – is warranted.
Enabling these disaster recovery settings can bring you peace of mind. I used to save compulsively after most edits or when stepping away from the computer, but now I’ve made an effort to leave documents unsaved for hours at a time. I know how the swap file works now.
Some final notes: keep an eye on all these disaster recovery files, they can pile up in your .vim folder and use space over time. Also setting nowritebackup might be necessary when saving a huge file with low disk space, because Vim must otherwise make an entire copy of the file temporarily. By default the “backupskip” setting disables backups for anything in the system temp directory.
Vim’s “patchmode” is related to backups. You can use it in directories that
aren’t under version control. For instance if you want to download a source
tarball, make an edit and send a patch over a mailing list without bringing
git into the picture. Run :set patchmode=.orig and any file ‘foo’ Vim is
about to write will be backed up to ‘foo.orig’. You can then create a patch
on the command line between the .orig files and the new ones.
Most programming languages allow you to include one module or file from
another. Vim knows how to track program identifiers in included files using
the configuration settings path, include, suffixesadd, and includeexpr.
The identifier search (see :help include-search) is an alternative to
maintaining a tags file with ctags for system headers.
The settings for C programs work out of the box. Other languages are supported
too, but require tweaking. That’s outside the scope of this article, see :help include.
If everything is configured right, you can press [i on an identifier to
display its definition, or [d for a macro constant. Also when you press gf
with the cursor on a filename, Vim searches the path to find it and jump there.
Because the path also affects the :find command, some people have the
tendency to add ‘**/*’ or commonly accessed directories to the path in order
to use :find like a poor man’s fuzzy finder. Doing this slows down the
identifier search with directories which aren’t relevant to that task.
A way to get the same level of crappy find capability, without polluting the path, is to just make another mapping. You can then press <Leader><space> (which is typically backslash space) then start typing a filename and use tab or CTRL-D completion to find the file.
" fuzzy-find lite
nmap <Leader><space> :e ./**/
Just to reiterate: the path parameter was designed for header files. If you
want more proof, there is even a :checkpath command to see whether the path
is functioning. Load a C file and run :checkpath. It will display filenames
it was unable to find that are included transitively by the current file. Also
:checkpath! with a bang dumps the whole hierarchy of files included from the
current file.
By default path has the value “.,/usr/include,,” meaning the working directory,
/usr/include, and files that are siblings of the active buffer. The directory
specifiers and globs are pretty powerful, see :help file-searching for the
details.
In my C ftplugin (more on that later), I also have the path search for include files within the current project, like ./src/include or ./include .
setlocal path=.,,*/include/**3,./*/include/**3
setlocal path+=/usr/include
The ** with a number like **3 bounds the depth of the search in subdirectories. It’s wise to add depth bounds where you can to avoid identifier searches that lock up.
Here are other patterns you might consider adding to your path if :checkpath
identifies that files can’t be found in your project. It depends on your system
of course.
/usr/include/**4,/usr/local/include/**3
/usr/local/Cellar/**2/include/**2
/opt/local/include/**
/usr/local/lib/*/include,/usr/X11R6/include/**3

See also: :he [, :he gf, :he :find.
The :make command runs a program of the user’s choice to build a project, and
collects the output in the quickfix buffer. Each item in the quickfix records
the filename, line, column, type (warning/error) and message of each output
item. A fairly idiomatic mapping uses bracket commands to move through quickfix
items:
" quickfix shortcuts
nmap ]q :cnext<cr>
nmap ]Q :clast<cr>
nmap [q :cprev<cr>
nmap [Q :cfirst<cr>
If, after updating the program and rebuilding, you are curious what the error
messages said last time, use :colder (and :cnewer to return). To see more
information about the currently selected error use :cc, and use :copen to
see the full quickfix buffer. You can populate the quickfix yourself without
running :make with :cfile, :caddfile, or :cexpr.
Vim parses output from the build process according to the errorformat string, which contains scanf-like escape sequences. It’s typical to set this in a “compiler file.” For instance, Vim ships with one for gcc in $VIMRUNTIME/compiler/gcc.vim, but has no compiler file for clang. I created the following definition for ~/.vim/compiler/clang.vim:
" formatting variations documented at
" https://clang.llvm.org/docs/UsersManual.html#formatting-of-diagnostics
"
" It should be possible to make this work for the combination of
" -fno-show-column and -fcaret-diagnostics as well with multiline
" and %p, but I was too lazy to figure it out.
"
" The %D and %X patterns are not clang per se. They capture the
" directory change messages from (GNU) 'make -w'. I needed this
" for building a project which used recursive Makefiles.
CompilerSet errorformat=
\%f:%l%c:{%*[^}]}{%*[^}]}:\ %trror:\ %m,
\%f:%l%c:{%*[^}]}{%*[^}]}:\ %tarning:\ %m,
\%f:%l:%c:\ %trror:\ %m,
\%f:%l:%c:\ %tarning:\ %m,
\%f(%l,%c)\ :\ %trror:\ %m,
\%f(%l,%c)\ :\ %tarning:\ %m,
\%f\ +%l%c:\ %trror:\ %m,
\%f\ +%l%c:\ %tarning:\ %m,
\%f:%l:\ %trror:\ %m,
\%f:%l:\ %tarning:\ %m,
\%D%*\\a[%*\\d]:\ Entering\ directory\ %*[`']%f',
\%D%*\\a:\ Entering\ directory\ %*[`']%f',
\%X%*\\a[%*\\d]:\ Leaving\ directory\ %*[`']%f',
\%X%*\\a:\ Leaving\ directory\ %*[`']%f',
\%DMaking\ %*\\a\ in\ %f
CompilerSet makeprg=make
To activate this compiler profile, run :compiler clang. This is typically
done in an ftplugin file.
Another example is running GNU Diction on a text document to identify wordy and commonly misused phrases in sentences. Create a “compiler” called diction.vim:
CompilerSet errorformat=%f:%l:\ %m
CompilerSet makeprg=diction\ -s\ %
After you run :compiler diction you can use the normal :make command to run
it and populate the quickfix. The final mild convenience in my .vimrc is a
mapping to run make:
" real make
map <silent> <F5> :make<cr><cr><cr>
" GNUism, for building recursively
map <silent> <s-F5> :make -w<cr><cr><cr>
Vim’s internal diffing is powerful, but it can be daunting, especially the
three-way merge view. In reality it’s not so bad once you take time to study
it. The main idea is that every window is either in or out of “diff mode.” All
windows put in diffmode (with :difft[his]) get compared with all other windows
already in diff mode.
For example, let’s start simple. Create two files:
echo "hello, world" > h1
echo "goodbye, world" > h2
vim h1 h2

In vim, split the arguments into their own windows with :all. In the top
window, for h1, run :difft. You’ll see a gutter appear, but no difference
detected. Move to the other window with CTRL-W CTRL-W and run :difft again.
Now hello and goodbye are identified as different in the current chunk.
Continuing in the bottom window, you can run :diffg[et] to get “hello” from
the top window, or :diffp[ut] to send “goodbye” into the top window. Pressing
]c or [c would move between chunks if there were more than one.
A shortcut would be running vim -d h1 h2 instead (or its alias, vimdiff h1 h2) which applies :difft to all windows. Alternatively, load just h1 with vim h1 and then :diffsplit h2. Remember that fundamentally these commands just
load files into windows and set the diff mode.
With these basics in mind, let’s learn to use Vim as a three-way mergetool for git. First configure git:
git config merge.tool vimdiff
git config merge.conflictstyle diff3
git config mergetool.prompt false

Now, when you hit a merge conflict, run git mergetool. It will bring Vim
up with four windows. This part looks scary, and is where I used to flail
around and often quit in frustration.
+-----------+------------+------------+
| | | |
| | | |
| LOCAL | BASE | REMOTE |
+-----------+------------+------------+
| |
| |
| (edit me) |
+-------------------------------------+
Here’s the trick: do all the editing in the bottom window. The top three windows simply provide context about how the file differs on either side of the merge (local / remote), and how it looked prior to either side doing any work (base).
Move within the bottom window with ]c, and for each chunk choose whether to
replace it with text from local, base, or remote – or whether to write in your
own change which might combine parts from several.
To make it easier to pull changes from the top windows, I set some mappings in my vimrc:
" shortcuts for 3-way merge
map <Leader>1 :diffget LOCAL<CR>
map <Leader>2 :diffget BASE<CR>
map <Leader>3 :diffget REMOTE<CR>
We’ve already seen :diffget, and here our bindings pass an argument of the
buffer name that identifies which window to pull from.
Once done with the merge, run :wqa to save all the windows and quit. If you
want to abandon the merge instead, run :cq to abort all changes and return an
error code to the shell. This will signal to git that it should ignore your
changes.
Diffget can also accept a range. If you want to pull in all changes from one
of the top windows rather than working chunk by chunk, just run :1,$+1diffget {LOCAL,BASE,REMOTE}. The “+1” is required because there can be deleted lines
“below” the last line of a buffer.
The three-way merge is fairly easy after all. There’s no need for plugins like Fugitive, at least for presenting a simplified view for resolving merge conflicts.
Finally, as of patch 8.1.0360, Vim is bundled with the xdiff library and can create diffs internally. This can be more efficient than shelling out to an external program, and allows for a choice of diff algorithms. The “patience” algorithm often produces more human-readable output than the default, “myers.” Set it in your .vimrc like so:
if has("patch-8.1.0360")
set diffopt+=internal,algorithm:patience
endif
See if this sounds familiar: you’re editing a buffer and want to save it as a
new file, so you :w newname. After editing some more, you :w, but it writes
over the original file. What you want for this scenario is :saveas newname,
which does the write but also changes the filename of the buffer for future
writes. Alternately, the :file newname command will change the filename
without doing a write.
It also pays off to learn more about the read and write commands. Because r and w are Ex commands, they work with ranges. Here are some variations you might not know about:
| :w >>foo | append the whole buffer to a file |
| :.w >>foo | append current line to a file |
| :$r foo | read foo into the end of the buffer |
| :0r foo | read foo into the start, moving existing lines down |
| :.,$w foo | write current line and below to a file |
| :r !ls | read ls output into cursor position |
| :w !wc | send buffer to wc and display output |
| :.!tr 'A-Za-z' 'N-ZA-Mn-za-m' | apply ROT-13 to current line |
| :w|so % | chain commands: write and then source buffer |
| :e! | throw away unsaved changes, reload buffer |
| :hide edit foo | edit foo, hide current buffer if dirty |
Useless fun fact: we piped a line to tr in an example above to apply a ROT-13
cypher, but Vim has that functionality built in with the g? command.
Apply it to a motion, like g?$.
Filetypes are a way to change settings based on the type of file detected in a buffer. They don’t need to be automatically detected, though; we can enable them manually to interesting effect. An example is hex editing. Any file can be viewed as raw hexadecimal values. GitHub user the9ball created a clever ftplugin script that filters a buffer back and forth through the xxd utility for hex editing.
The xxd utility was bundled as part of Vim 5 for convenience. The Vim todo.txt file mentions they want to make it more seamless to edit binary files, but xxd can take us pretty far.
Here is code you can put in ~/.vim/ftplugin/xxd.vim. Its presence in ftplugin
means Vim will execute the script when filetype (aka “ft”) becomes xxd. I added
some basic comments to the script.
" without the xxd command this is all pointless
if !executable('xxd')
finish
endif
" don't insert a newline in the final line if it
" doesn't already exist, and don't insert linebreaks
setlocal binary noendofline
silent %!xxd -g 1
%s/\r$//e
" put the autocmds into a group for easy removal later
augroup ftplugin-xxd
" erase any existing autocmds on buffer
autocmd! * <buffer>
" before writing, translate back to binary
autocmd BufWritePre <buffer> let b:xxd_cursor = getpos('.')
autocmd BufWritePre <buffer> silent %!xxd -r
" after writing, restore hex view and mark unmodified
autocmd BufWritePost <buffer> silent %!xxd -g 1
autocmd BufWritePost <buffer> %s/\r$//e
autocmd BufWritePost <buffer> setlocal nomodified
autocmd BufWritePost <buffer> call setpos('.', b:xxd_cursor) | unlet b:xxd_cursor
" update text column after changing hex values
autocmd TextChanged,InsertLeave <buffer> let b:xxd_cursor = getpos('.')
autocmd TextChanged,InsertLeave <buffer> silent %!xxd -r
autocmd TextChanged,InsertLeave <buffer> silent %!xxd -g 1
autocmd TextChanged,InsertLeave <buffer> call setpos('.', b:xxd_cursor) | unlet b:xxd_cursor
augroup END
" when filetype is set to no longer be "xxd," put the binary
" and endofline settings back to what they were before, remove
" the autocmds, and replace buffer with its binary value
let b:undo_ftplugin = 'setl bin< eol< | execute "au! ftplugin-xxd * <buffer>" | execute "silent %!xxd -r"'
Try opening a file, then running :set ft. Note what type it is. Then run :set ft=xxd. Vim will turn into a hex editor. To restore your view, run :set ft=foo
where foo was the original type. Note that in hex view you even get syntax
highlighting because $VIMRUNTIME/syntax/xxd.vim ships with Vim by default.
Notice the nice use of “b:undo_ftplugin” which is an opportunity for filetypes
to clean up after themselves when the user or ftdetect mechanism switches away
from them to another filetype. (The example above could use a little work
because if you :set ft=xxd then set it back, the buffer is marked as modified
even if you never changed anything.)
Ftplugins also allow you to refine an existing filetype. For instance, Vim
already has some good defaults for C programming in
$VIMRUNTIME/ftplugin/c.vim. I put these extra options in
~/.vim/after/ftplugin/c.vim to add my own settings on top:
" the smartest indent engine for C
setlocal cindent
" my preferred "Allman" style indentation
setlocal cino="Ls,:0,l1,t0,(s,U1,W4"
" for quickfix errorformat
compiler clang
" shows long build messages better
setlocal ch=2
" auto-create folds per grammar
setlocal foldmethod=syntax
setlocal foldlevel=10
" local project headers
setlocal path=.,,*/include/**3,./*/include/**3
" basic system headers
setlocal path+=/usr/include
setlocal tags=./tags,tags;~
" ^ in working dir, or parents
" ^ sibling of open file
" the default is menu,preview but the preview window is annoying
setlocal completeopt=menu
iabbrev #i #include
iabbrev #d #define
iabbrev main() int main(int argc, char **argv)
" add #include guard
iabbrev #g _<c-r>=expand("%:t:r")<cr><esc>VgUV:s/[^A-Z]/_/g<cr>A_H<esc>yypki#ifndef <esc>j0i#define <esc>o<cr><cr>#endif<esc>2ki
Notice how the script uses “setlocal” rather than “set.” This applies the changes to just the current buffer rather than the whole Vim instance.
This script also enables some light abbreviations. Like I can type #g
and press enter and it adds an include guard with the current filename:
#ifndef _FILENAME_H
#define _FILENAME_H
/* <-- cursor here */
#endif

You can also mix filetypes by using a dot (“.”). Here is one application.
Different projects have different coding conventions, so you can combine your
default C settings with those for a particular project. The OpenBSD source code
follows the style(9) format, so let’s make a
special openbsd filetype. Combine the two filetypes with :set ft=c.openbsd on
relevant files.
To detect the openbsd filetype we can look at the contents of buffers rather
than just their extensions or locations on disk. The telltale sign is that C
files in the OpenBSD source
contain /* $OpenBSD: in the first line.
To detect them, create ~/.vim/after/ftdetect/openbsd.vim:
augroup filetypedetect
au BufRead,BufNewFile *.[ch]
\ if getline(1) =~ 'OpenBSD:'
\| setl ft=c.openbsd
\| endif
augroup END
The Vim
port for
OpenBSD already includes a special syntax file for this filetype:
/usr/local/share/vim/vimfiles/syntax/openbsd.vim. If you recall, the
/usr/local/share/vim/vimfiles directory is in the runtimepath and is set
aside for files from the system administrator. The provided openbsd.vim script
includes a function:
function! OpenBSD_Style()
setlocal cindent
setlocal cinoptions=(4200,u4200,+0.5s,*500,:0,t0,U4200
setlocal indentexpr=IgnoreParenIndent()
setlocal indentkeys=0{,0},0),:,0#,!^F,o,O,e
setlocal noexpandtab
setlocal shiftwidth=8
setlocal tabstop=8
setlocal textwidth=80
endfun
We simply need to call the function at the appropriate time. Create
~/.vim/after/ftplugin/openbsd.vim:
call OpenBSD_Style()
Now opening any C or header file with the characteristic comment at the top will be recognized as type c.openbsd and will use indenting options that conform with the style(9) man page.
This is a friendly reminder that despite our command-line machismo, the mouse is in fact supported in Vim, and can do some things more easily than the keyboard. Mouse events work even over SSH thanks to xterm turning mouse events into stdin escape codes.
To enable mouse support, set mouse=n. Many people use mouse=a to make it
work in all modes, but I prefer to enable it only in normal mode. This avoids
creating visual selections when I click links with a keyboard modifier to open
them in my browser.
Here are things the mouse can do:
Open and close folds (when foldcolumn > 0).

This section could be enormous, but I’ll stick to a few tricks I learned. The
first one that blew me away was :set virtualedit=all. It allows you to move
the cursor anywhere in the window. If you enter characters or insert a visual
block, Vim will add whatever spaces are required to the left of the inserted
characters to keep them in place. Virtual edit mode makes it simple to edit
tabular data. Turn it off with :set virtualedit=.
Next are some movement commands. I used to rely a lot on } to jump by
paragraphs, and just muscle my way down the page. However the ] character
makes more precise motions: by function ]], scope ]}, paren ]), comment
]/, diff block ]c. This series is why the quickfix mapping ]q mentioned
earlier fits the pattern so well.
For big jumps I used to try things like 1000j, but in normal mode you can
actually just type a percentage and Vim will go there, like 50%. Speaking of
scroll percentage, you can see it at any time with CTRL-G. Thus I now do :set noruler and ask to see the info as needed. It’s less cluttered. Kind of the
opposite of the trend of colorful patched font powerlines.
After jumping around between tags, files, or within a file, there are some
commands to get your bearings. Try :ls, :tags, :jumps, and :marks.
Jumping through tags actually creates a stack, and you can press CTRL-T to pop
one back. I used to always press CTRL-O to back out of jumps, but it is not as
direct as popping the tag stack.
In a project directory that has been indexed with ctags, you can open the
editor directly to a tag with -t, like vim -t main. To find tags files more
flexibly, set the tags configuration variable. Note the semicolon in the
example below that allows Vim to search the current directory upward to the
home directory. This way you could have a more general system tags file
outside the project folder.
set tags=./tags,**5/tags,tags;~
" ^ in working dir, or parents
" ^ in any subfolder of working dir
" ^ sibling of open file
There are some buffer tricks too. Switching to a buffer with :bu can take a
fragment of the buffer name, not just a number. Sometimes it’s harder to
memorize those numbers than remember the name of a source file. You can
navigate buffers with marks too. If you use a capital letter as the name of a
mark, you can jump to it across buffers. You could set a mark H in a header, C
in a source file, and M in a Makefile to go from one buffer to another.
Do you ever get mad after yanking a word, deleting a word somewhere else,
trying to paste the first word in, and then discovering your original yank is
overwritten? The Vim registers are underappreciated for this. Inspect their
contents with :reg. As you delete text, previous deletions rotate through the
registers "1 - "9, while your most recent yank stays in "0. So "0p pastes the
last yank even after a later deletion. The
special registers "+ and "* can copy/paste from/to the system clipboard.
They usually mean the same thing, except in some X11 setups that distinguish
primary and secondary selection.
Another handy hidden feature is the command line window. It’s a buffer
that contains your previous commands and searches. Bring it up with q: or
q/. Once inside you can move to any line and press enter to run it. However
you can also edit any of the lines before pressing enter. Your changes won’t
affect the line (the new command will merely be added to the bottom of the
list).
This article could go on and on, so I’m going to call it here. For more great topics, see these help sections: views-sessions, viminfo, TOhtml, ins-completion, cmdline-completion, multi-repeat, scroll-cursor, text-objects, grep, netrw-contents.
Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may look visually confusable.
Human languages are highly varied and internally inconsistent, and any application which treats strings as more than an opaque byte stream must embrace the complexity. Realistically this means using a mature third-party library.
This article illustrates text processing ideas with example programs. We’ll use the International Components for Unicode (ICU) library, which is mature, portable, and powers the international text processing behind many products and operating systems.
IBM (the maintainers of ICU) officially support a C, C++ and Java API. We’ll use the C API here for a better view into the internals. Many languages have bindings to the library, so these concepts should be applicable to your language of choice.
Table of Contents:
Before getting into the example code, it’s important to learn the terminology. Let’s start at the most basic question.
“Character” is an overloaded term. What a native speaker of a language identifies as a letter or symbol is often stored as multiple values in the internal Unicode representation. The representation is further obscured by an additional encoding in memory, on disk, or during network transmission.
Let’s start at the abstraction closest to the user: the grapheme cluster. A “grapheme” is a graphical unit that a reader recognizes as a single element of the writing system. It’s the character as a user would understand it. For example, 山, ä and క్క are graphemes. Pieces of a single grapheme always stay together in print; breaking them apart is either nonsense or changes the meaning of the symbol. They are rendered as “glyphs,” i.e. markings on paper or screen which vary by font, style, or position in a word.
You might imagine that Unicode assigns each grapheme a unique number, but that is not true. It would be wasteful because there is a combinatorial explosion between letters and diacritical marks. For instance (o, ô, ọ, ộ) and (a, â, ạ, ậ) follow a pattern. Rather than assigning a distinct number to each, it’s more efficient to assign a number to o and a, and then to each of the combining marks. The graphemes can be built from letters and combining marks e.g. ậ = a + ◌̂ + ◌̣.
In reality Unicode takes both approaches. It assigns numbers to basic letters and combining marks, but also to some of their more common combinations. Many graphemes can thus be created in more than one way. For instance ộ can be specified in five ways:

A. U+006F U+0302 U+0323 (o, then combining circumflex, then combining dot below)
B. U+006F U+0323 U+0302 (o, then combining dot below, then combining circumflex)
C. U+00F4 U+0323 (ô, then combining dot below)
D. U+1ECD U+0302 (ọ, then combining circumflex)
E. U+1ED9 (the precomposed ộ)
The numbers (written U+xxxx) for each abstract character and each combining symbol are called “codepoints.” Every Unicode string is expressed as a list of codepoints. As illustrated above, multiple strings of codepoints may render into the same sequence of graphemes.
To meaningfully compare strings codepoint by codepoint for equality, both strings should be represented in a consistent way. A standardized choice of codepoint decomposition for graphemes is called a “normal form.”
One choice is to decompose a string into as many codepoints as possible like examples A and B (with a weighting factor of which combining marks should come first). That is called Normalization Form Canonical Decomposition (NFD). Another choice is to do the opposite and use the fewest codepoints possible like example E. This is called Normalization Form Canonical Composition (NFC).
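Here’s a minimal sketch of normalization with ICU (the input string is built by hand from codepoints; error handling is trimmed). It composes a fully decomposed “ộ” to NFC:

#include <unicode/unorm2.h>
#include <unicode/ustdio.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* ICU hands out shared singleton normalizer instances */
	const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
	/* "ộ" in NFD: o, combining dot below, combining circumflex */
	UChar decomposed[] = {0x006F, 0x0323, 0x0302, 0};
	UChar composed[8];
	int32_t len = unorm2_normalize(nfc, decomposed, -1,
	                               composed, 8, &status);
	if (U_SUCCESS(status))
		/* prints 1: the single precomposed codepoint U+1ED9 */
		u_printf("%d UTF-16 code unit(s): %S\n", len, composed);
	return 0;
}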
A core concept to remember is that, although codepoints are the building blocks of text, they don’t match up 1-1 with user-perceived characters (graphemes). Operations such as taking the length of an array of codepoints, or accessing arbitrary array positions are typically not useful for Unicode programs. Programs must also be mindful of the combining characters, like diacritical marks, when inserting or deleting codepoints. Inserting U+0061 into the asterisk position U+006f U+0302 (*) U+0323 changes the string “ộ” into “ôạ” rather than “ộa”.
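For operations in terms of user-perceived characters, ICU provides break iteration. This sketch (error handling trimmed) counts grapheme clusters rather than codepoints:

#include <unicode/ubrk.h>
#include <stdio.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* "ộ" as three codepoints: still one grapheme cluster */
	UChar s[] = {0x006F, 0x0323, 0x0302, 0};
	UBreakIterator *bi =
		ubrk_open(UBRK_CHARACTER, "en", s, -1, &status);
	int graphemes = 0;
	if (U_SUCCESS(status))
	{
		ubrk_first(bi);
		/* each boundary past the start ends one cluster */
		while (ubrk_next(bi) != UBRK_DONE)
			graphemes++;
		ubrk_close(bi);
	}
	printf("%d grapheme cluster(s)\n", graphemes); /* prints 1 */
	return 0;
}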
It’s not just fonts that cause graphemes to be rendered into varying glyphs. The rules of some languages cause glyphs to change through contextual shaping. For instance the Arabic letter “heh” has four forms, depending on which sides are flanked by letters. When isolated it appears as ﻩ and in the final/initial/medial position in a word it appears as ﻪ/ﻫ/ﻬ respectively. Similarly, Greek displays lower-case sigma differently at the end of the word (final form) than elsewhere. Some glyphs change based on visual order. In a right-to-left language the starting parenthesis “(” mirrors to display as “)”.
Not only do individual graphemes’ glyphs vary, graphemes can combine to form single glyphs. One way is through ligatures. The latin letters “fi” often join the dot of the i with the curve of the f (presentation form U+FB01 fi). Another way is language irregularity. The Arabic ا and ل, when contiguous, must form ﻻ.
Conversely, a single grapheme can split into multiple glyphs. For instance in some Indic languages, vowels can split and surround preceding consonants. In Bengali, U+09CC ৌ surrounds U+09AE ম to become মৌ.
In 1990, Unicode codepoints were 16 bits wide. That choice turned out to be too small for the symbols and languages people wanted to represent, so the committee extended the standard to 21 bits. That’s fine in the abstract, but how the 21 bits are stored in memory or communicated between computers depends on practical factors.
It’s an unusual memory size. Computer hardware doesn’t typically access memory in 21-bit chunks. Networking protocols, too, are better geared toward transmitting eight bits at a time. Thus, codepoints are broken into sequences of more conventionally sized blocks called code units for persistence on disk, transmission over networks, and manipulation in memory.
The Unicode Transformation Formats (UTF) describe different ways to map between codepoints and code units. The transformation formats are named after the bit width of their code units (7, 8, 16, or 32), as well as the endianness (BE or LE). For instance: UTF-8, or UTF-16BE. In addition to the UTFs, there’s another – more complex – encoding called Punycode. It is designed to conform with the limited ASCII character subset used for Internet host names.
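To see the difference between code units and codepoints concretely, here’s a sketch that walks a UTF-16 string containing a supplementary-plane character (U+1F600, stored as a surrogate pair):

#include <unicode/ustring.h>
#include <unicode/utf16.h>
#include <stdio.h>

int main(void)
{
	/* U+1F600 encoded as the surrogate pair D83D DE00 */
	UChar s[] = {0xD83D, 0xDE00, 0};
	int32_t units = u_strlen(s), i = 0, points = 0;
	UChar32 c;
	while (i < units)
	{
		/* U16_NEXT reads one codepoint, advancing i by 1 or 2 */
		U16_NEXT(s, i, units, c);
		points++;
	}
	printf("%d code units, %d codepoint(s)\n", units, points);
	return 0;
}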
A final bit of terminology. A “plane” is a continuous group of 65,536 code points. There are 17 planes, identified by the numbers 0 to 16. Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes (1 through 16) are called “supplementary planes.”
For transmission and storage, use UTF-8. Programs which move ASCII data can handle it without modification. Machine endianness does not affect UTF-8, and the byte-sized units work well in networks and filesystems.
Some sites, like UTF-8 Everywhere go even further and recommend using UTF-8 for internal manipulation of text in program memory. However, I would suggest you use whatever encoding your Unicode library favors for this. You’ll be performing operations through the library API, not directly on code units. As we’re seeing, there is too much complexity between glyphs, graphemes, codepoints and code units to be manipulating the units directly. Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.
It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s true that every code unit can hold a full codepoint. However, the relationship between codepoints and glyphs isn’t straightforward, so there isn’t a programmatic advantage to storing the string this way.
UTF-32 also wastes at minimum 11 (32 - 21) bits per codepoint, and typically more. For instance, UTF-16 requires only one 16-bit code unit to encode points in the Basic Multilingual Plane (the most commonly encountered points). Thus UTF-32 can typically double the space required for the BMP.
There are times to manipulate UTF-32, such as when examining a single codepoint. We’ll see examples below.
The programs in this article are ready to compile and run. They require the ICU C library called ICU4C, which is available on most platforms through the operating system package manager.
ICU provides five libraries for linking (we need the first two):
| Package | Contents |
|---|---|
| icu-uc | Common (uc) and Data (dt/data) libraries |
| icu-io | Ustdio/iostream library (icuio) |
| icu-i18n | Internationalization (in/i18n) library |
| icu-le | Layout Engine |
| icu-lx | Paragraph Layout |
To use ICU4C, set the compiler and linker flags with pkg-config in your Makefile. (Pkg-config may also need to be installed on your computer.)
CFLAGS = -std=c99 -pedantic -Wall -Wextra \
`pkg-config --cflags icu-uc icu-io`
LDFLAGS = `pkg-config --libs icu-uc icu-io`

The examples in this article conform to the C89 standard, but we specify C99 in the Makefile because the ICU header files use C99-style (//) comments.
To start getting a feel for ICU’s I/O and codepoint manipulation, let’s make a program to output completely random (but valid) codepoints. You could use this program as a basic fuzz tester, to see whether its output confuses other programs. A real fuzz tester ought to have the ability to take an explicit seed for repeatable output, but we will omit that functionality from our simple demo.
This program has limited portability because it gets entropy from /dev/urandom, a Unix device. To generate good random numbers using only the C standard library, see my other article. Also POSIX provides pseudo-random number functions.
/* for constants like EXIT_FAILURE */
#include <stdlib.h>
/* we'll be using standard C I/O to read random bytes */
#include <stdio.h>
/* to determine codepoint categories */
#include <unicode/uchar.h>
/* to output UTF-32 codepoints in proper encoding for terminal */
#include <unicode/ustdio.h>
int main(int argc, char **argv)
{
long i = 0, linelen;
/* somewhat non-portable: /dev/urandom is unix specific */
FILE *f = fopen("/dev/urandom", "rb");
UFILE *out;
/* UTF-32 code unit can hold an entire codepoint */
UChar32 c;
/* to learn about c */
UCharCategory cat;
if (!f)
{
fputs("Unable to open /dev/urandom\n", stderr);
return EXIT_FAILURE;
}
/* optional length to insert line breaks */
linelen = argc > 1 ? strtol(argv[1], NULL, 10) : 0;
/* have to obtain a Unicode-aware file handle. This function
* has no failure return code, it always works. */
out = u_get_stdout();
/* read a random 32 bits, presumably forever */
while (fread(&c, sizeof c, 1, f))
{
		/* Scale the 32-bit value to a number within code
		 * planes zero through fourteen. (Planes 15-16 are
		 * private use.) Cast through uint32_t first: the raw
		 * bits may form a negative number, and C's % operator
		 * keeps the sign of its left operand.
		 *
		 * The modulo bias is insignificant. The first 65536
		 * codepoints are minutely favored, each generated by
		 * 4370 different 32-bit values. The remaining 917504
		 * codepoints are generated by 4369 values each.
		 */
		c = (UChar32)((uint32_t)c % 0xF0000);
cat = u_charType(c);
/* U_UNASSIGNED are "non-characters" with no assigned
* meanings for interchange. U_PRIVATE_USE_CHAR are
* reserved for use within organizations, and
* U_SURROGATE are designed for UTF-16 code units in
* particular. Don't print any of those. */
if (cat != U_UNASSIGNED && cat != U_PRIVATE_USE_CHAR &&
cat != U_SURROGATE)
{
u_fputc(c, out);
if (linelen && ++i >= linelen)
{
i = 0;
/* there are a number of Unicode
* linebreaks, but the standard ASCII
* \n is valid, and will interact well
* with a shell */
u_fputc('\n', out);
}
}
}
/* should never get here */
fclose(f);
return EXIT_SUCCESS;
}

A note about the mysterious U_UNASSIGNED category, the "non-characters." These are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. The Unicode Standard sets aside 66 non-character code points. The last two code points of each plane are noncharacters (U+FFFE and U+FFFF on the BMP). In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0…U+FDEF.
Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. They are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.
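ICU can identify noncharacters programmatically through a binary property. Here's a minimal sketch (the file name nonchar.c is my own invention) using u_hasBinaryProperty from uchar.h:

/*** nonchar.c ***/
#include <stdio.h>
#include <unicode/uchar.h>

int main(void)
{
	/* U+FDD0 is in the BMP's contiguous noncharacter range,
	 * U+FFFE ends every plane, and U+0041 is the letter A */
	UChar32 points[] = {0xFDD0, 0xFFFE, 0x0041};
	size_t i;
	for (i = 0; i < sizeof points / sizeof *points; ++i)
		printf("U+%04X noncharacter? %s\n",
		       (unsigned)points[i],
		       u_hasBinaryProperty(points[i],
		           UCHAR_NONCHARACTER_CODE_POINT) ? "yes" : "no");
	return 0;
}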
We discussed non-characters in the previous section, but there are also Private Use codepoints. Unlike non-characters, those for private use are designated for interchange between systems. However the precise meaning and glyphs for these characters is specific to the organization using them. The same codepoints can be used for different things by different people.
Unicode provides a large area for private use: a small code block in the BMP, as well as two entire planes, 15 and 16. Because no browser or text editor will render PUA codepoints beyond (typically) empty boxes, we can exploit plane 15 to make a visually confusing code. Ultimately it's a cheesy substitution cipher, but it's kind of fun.
Below is a program to shift characters in the BMP to/from plane 15, the Supplementary Private Use Area-A. (Encoded output renders as nothing but rows of empty boxes, so there's little point reproducing it here.)
#include <stdio.h>
#include <stdlib.h>
/* for strcmp in argument parsing */
#include <string.h>
#include <unicode/ustdio.h>
void usage(const char *prog)
{
puts("Shift base multilingual plane to/from PUA-A\n");
printf("Usage: %s [-d]\n\n", prog);
puts("Encodes stdin (or decode with -d)");
exit(EXIT_SUCCESS);
}
int main(int argc, char **argv)
{
UChar32 c;
UFILE *in, *out;
enum { MODE_ENCODE, MODE_DECODE } mode = MODE_ENCODE;
if (argc > 2)
usage(argv[0]);
else if(argc > 1)
{
if (strcmp(argv[1], "-d") == 0)
mode = MODE_DECODE;
else
usage(argv[0]);
}
out = u_get_stdout();
in = u_finit(stdin, NULL, NULL);
if (!in)
{
		fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* u_fgetcx returns UTF-32. U_EOF happens to be 0xFFFF,
* not -1 like EOF typically is in stdio.h */
while ((c = u_fgetcx(in)) != U_EOF)
{
/* -1 for UChar32 actually signifies invalid character */
if (c == (UChar32)0xFFFFFFFF)
{
fputs("Invalid character.\n", stderr);
continue;
}
if (mode == MODE_ENCODE)
{
/* Move the BMP into the Supplementary
* Private Use Area-A, which begins
* at codepoint 0xf0000 */
if (0 < c && c < 0xe000)
c += 0xf0000;
}
else
{
/* Move the Supplementary Private Use
* Plane down into the BMP */
if (0xf0000 < c && c < 0xfe000)
c -= 0xf0000;
}
u_fputc(c, out);
}
/* if you u_finit it, then u_fclose it */
u_fclose(in);
return EXIT_SUCCESS;
}So far we’ve been working entirely with complete codepoints. This next example gets into their representation as code units in a transformation format, namely UTF-8. We will read the codepoint as a hexadecimal program argument, and convert it to between 1-4 bytes in UTF-8, and print the hex values of those bytes.
/*** utf8.c ***/
#include <stdio.h>
#include <stdlib.h>
#include <unicode/utf8.h>
int main(int argc, char **argv)
{
UChar32 c;
/* ICU defines its own bool type to be used
* with their macro */
UBool err = FALSE;
/* ICU uses C99 types like uint8_t */
uint8_t bytes[4] = {0};
/* probably should be size_t not int32_t, but
* just matching what their macro expects */
int32_t written = 0, i;
char *parsed;
if (argc != 2)
{
fprintf(stderr, "Usage: %s codepoint\n", *argv);
exit(EXIT_FAILURE);
}
c = strtol(argv[1], &parsed, 16);
if (!*argv[1] || *parsed)
{
fprintf(stderr,
"Cannot parse codepoint: U+%s\n", argv[1]);
exit(EXIT_FAILURE);
}
/* this is a macro, and updates the variables
* directly. No need to pass addresses.
* We're saying: write to "bytes", tell us how
* many were "written", limit it to four */
U8_APPEND(bytes, written, 4, c, err);
if (err == TRUE)
{
fprintf(stderr, "Invalid codepoint: U+%s\n", argv[1]);
exit(EXIT_FAILURE);
}
/* print in format 'xxd -r' can read */
printf("0: ");
for (i = 0; i < written; ++i)
		printf("%02x", bytes[i]);
puts("");
return EXIT_SUCCESS;
}

Suppose you compile this to a program named utf8. Here are some examples:
# ascii characters are unchanged
$ ./utf8 61
0: 61
# other codepoints require more bytes
$ ./utf8 1F41A
0: f09f909a
# format is compatible with "xxd"
$ ./utf8 1F41A | xxd -r
🐚
# surrogates (used in UTF-16) are not valid codepoints
$ ./utf8 DC00
Invalid codepoint: U+DC00

Here's a useful helper function named u_wholeline() which reads a line of any length into a dynamically allocated buffer. It returns the line as UChar*, ICU's standard UTF-16 code unit array.
/* to properly test realloc */
#include <errno.h>
#include <stdlib.h>
#include <unicode/ustdio.h>
/* line feed, vertical tab, form feed, carriage return,
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
((c) >= 0xa && (c) <= 0xd) || \
(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )
/* allocates buffer, caller must free */
UChar *u_wholeline(UFILE *f)
{
/* assume most lines are shorter
* than 128 UTF-16 code units */
size_t i, sz = 128;
UChar c, *s = malloc(sz * sizeof(*s)), *s_new;
if (!s)
return NULL;
/* u_fgetc returns UTF-16, unlike u_fgetcx */
	for (i = 0; (s[i] = u_fgetc(f)) != U_EOF && !NEWLINE(s[i]); ++i)
		/* grow while a slot remains: the next write and the
		 * final NUL terminator must both stay in bounds */
		if (i >= sz - 1)
{
/* double the buffer when it runs out */
sz *= 2;
errno = 0;
s_new = realloc(s, sz * sizeof(*s));
if (errno == ENOMEM)
free(s);
if ((s = s_new) == NULL)
return NULL;
}
/* if terminated by CR, eat LF */
if (s[i] == 0xd && (c = u_fgetc(f)) != 0xa)
u_fungetc(c, f);
/* s[i] will either be U_EOF or a newline; wipe it */
s[i] = '\0';
return s;
}
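As a quick usage sketch (my own, assuming the helper is pasted into the same file), here is a driver that echoes stdin line by line:

#include <stdio.h>
#include <stdlib.h>
#include <unicode/ustdio.h>

UChar *u_wholeline(UFILE *f); /* the helper above */

int main(void)
{
	UChar *line;
	UFILE *in = u_finit(stdin, NULL, NULL);
	if (!in)
	{
		fputs("Error opening stdin as UFILE\n", stderr);
		return EXIT_FAILURE;
	}
	while (!u_feof(in))
	{
		if (!(line = u_wholeline(in)))
			return EXIT_FAILURE; /* allocation failed */
		/* echo it; note this prints one extra blank line
		 * when the input ends with a newline */
		u_printf("%S\n", line);
		free(line);
	}
	u_fclose(in);
	return EXIT_SUCCESS;
}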
The previous example reads an entire line. However, reading a limited number of code units from UTF-16 lines is trickier. Truncating a Unicode string is always a little dangerous, due to possibly splitting a word and breaking contextual shaping.

UTF-16 also has surrogate pairs, which are how that transformation format expresses codepoints outside the BMP. Without the proper precautions, ending a UTF-16 string early can split a surrogate pair.
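For the record, the pair arithmetic works like this: subtract 0x10000 from the codepoint, then the high surrogate is 0xD800 plus the top ten bits and the low surrogate is 0xDC00 plus the bottom ten bits. Taking U+1D7D8 (which appears in the output below): 0x1D7D8 - 0x10000 = 0xD7D8, so the high surrogate is 0xD800 + (0xD7D8 >> 10) = 0xD835 and the low surrogate is 0xDC00 + (0xD7D8 & 0x3FF) = 0xDFD8.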
The following example reads lines in chunks of at most three UTF-16 code units at a time. If it reads two consecutive codepoints from supplementary planes it will fail. The program accepts a “fix” argument to make it push a final unpaired surrogate back onto the stream for a future read.
/*** codeunit.c ***/
#include <stdlib.h>
#include <string.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>
/* BUFSZ set to be very small so that lines must be read in
* many chunks. Helps illustrate split surrogate pairs */
#define BUFSZ 4
void printHex(const UChar *s)
{
while (*s)
printf("%x ", *s++);
putchar('\n');
}
/* yeah, slightly annoying duplication */
void printHex32(const UChar32 *s)
{
while (*s)
printf("%x ", *s++);
putchar('\n');
}
int main(int argc, char **argv)
{
UFILE *in;
/* read line into ICU's default UTF-16 representation */
UChar line[BUFSZ];
/* A buffer to hold codepoints of "line" as UTF-32 code
* units. The length is sufficient because it requires
* fewer (or at least no greater) code units in UTF-32 to
* encode the string */
UChar32 codepoints[BUFSZ];
UChar *final;
UErrorCode err = U_ZERO_ERROR;
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* read lines one small BUFSZ chunk at a time */
while (u_fgets(line, BUFSZ, in))
{
/* correct for split surrogate pairs only
* if the "fix" argument is present */
if (argc > 1 && strcmp(argv[1], "fix") == 0)
{
final = line + u_strlen(line);
/* want to consider the character before \0
* if such exists */
if (final > line)
final--;
/* if it is the lead unit of a surrogate pair */
if (U16_IS_LEAD(*final))
{
/* push it back for a future read, and
* truncate the string */
u_fungetc(*final, in);
*final = '\0';
}
}
printf("UTF-16 : ");
printHex(line);
u_strToUTF32(
codepoints, BUFSZ, NULL,
line, -1, &err);
printf("Error? : %s\n", u_errorName(err));
printf("Codepoints: ");
printHex32(codepoints);
/* reset potential errors and go for another chunk */
err = U_ZERO_ERROR;
*codepoints = '\0';
}
u_fclose(in);
return EXIT_SUCCESS;
}

If the program reads the two weird numerals 𝟘𝟙 (different from ASCII 01), neither of which is in the BMP, it finds one codepoint but chokes on the broken pair:
$ echo -n 𝟘𝟙 | ./codeunit
UTF-16 : d835 dfd8 d835
Error? : U_INVALID_CHAR_FOUND
Codepoints: 1d7d8
UTF-16 : dfd9
Error? : U_INVALID_CHAR_FOUND
Codepoints:

However, if we pass the "fix" argument, the program reads two complete codepoints:
$ echo -n 𝟘𝟙 | ./codeunit fix
UTF-16 : d835 dfd8
Error? : U_ZERO_ERROR
Codepoints: 1d7d8
UTF-16 : d835 dfd9
Error? : U_ZERO_ERROR
Codepoints: 1d7d9

Perhaps a better way to read a line of limited length is to use a "break iterator" to stop on a word boundary. We'll see more about that later.
Our next example will rather laboriously remove diacritical marks from a string. There’s an easier way to do this called “transformation,” but doing it manually provides an opportunity to decompose characters and iterate over them with the U16_NEXT macro.
/*** nomarks.c ***/
#include <stdlib.h>
#include <unicode/uchar.h>
#include <unicode/unorm2.h>
#include <unicode/ustdio.h>
#include <unicode/utf16.h>
/* Limit to how many decomposed UTF-16 units a single
* codepoint will become in NFD. I don't know the
* correct value here so I chose a value that seems
* to be overkill */
#define MAX_DECOMP_LEN 16
int main(void)
{
long i, n;
UChar32 c;
UFILE *in, *out;
UChar decomp[MAX_DECOMP_LEN];
UErrorCode status = U_ZERO_ERROR;
UNormalizer2 *norm;
out = u_get_stdout();
in = u_finit(stdin, NULL, NULL);
if (!in)
{
/* using stdio functions with stderr and ustdio
* with stdout. Mixing the two on a single file
* handle would probably be bad. */
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* create a normalizer, in this case one going to NFD */
norm = (UNormalizer2 *)unorm2_getNFDInstance(&status);
if (U_FAILURE(status)) {
fprintf(stderr,
"unorm2_getNFDInstance(): %s\n",
u_errorName(status));
return EXIT_FAILURE;
}
/* consume input as UTF-32 units one by one */
while ((c = u_fgetcx(in)) != U_EOF)
{
/* Decompose c to isolate its n combining character
* codepoints. Saves them as UTF-16 code units. FYI,
* this function ignores the type of "norm" and always
* denormalizes */
n = unorm2_getDecomposition(
norm, c, decomp, MAX_DECOMP_LEN, &status
);
if (U_FAILURE(status)) {
fprintf(stderr,
"unorm2_getDecomposition(): %s\n",
u_errorName(status));
u_fclose(in);
return EXIT_FAILURE;
}
/* if c does not decompose and is not itself
* a diacritical mark */
if (n < 0 && ublock_getCode(c) !=
UBLOCK_COMBINING_DIACRITICAL_MARKS)
u_fputc(c, out);
/* walk canonical decomposition, reuse c variable */
for (i = 0; i < n; )
{
			/* the U16_NEXT macro iterates over UChar (aka
			 * UTF-16), advancing by one or two elements as
			 * needed to get a codepoint. It saves the result
			 * in UTF-32. The macro updates i and c. */
U16_NEXT(decomp, i, n, c);
/* output only if not combining diacritical */
if (ublock_getCode(c) !=
UBLOCK_COMBINING_DIACRITICAL_MARKS)
u_fputc(c, out);
}
}
u_fclose(in);
/* u_get_stdout() doesn't need to be u_fclose'd */
return EXIT_SUCCESS;
}Here’s an example of running the program:
$ echo "résumé façade" | ./nomarks
resume facade

ICU provides a rich domain-specific language for transforming strings. For example, our entire program in the previous section can be replaced by the transformation NFD; [:Nonspacing Mark:] Remove; NFC. This means: perform a canonical decomposition, remove nonspacing marks, then canonically compose again. (In fact, our program above didn't re-compose.)
The program below echoes stdin to stdout, but passes the output through a transformation.
/*** trans-stream.c ***/
#include <stdlib.h>
#include <string.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>
int main(int argc, char **argv)
{
UChar32 c;
UParseError pe;
UFILE *in, *out;
UTransliterator *t;
UErrorCode status = U_ZERO_ERROR;
UChar *xform_id;
size_t n;
if (argc != 2)
{
fprintf(stderr,
"Usage: %s \"translation rules\"\n", argv[0]);
return EXIT_FAILURE;
}
/* the UTF-16 string should never be longer than the UTF-8
* argv[1], so this should be safe */
n = strlen(argv[1]) + 1;
xform_id = malloc(n * sizeof(UChar));
u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);
/* create transliterator by identifier */
t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
NULL, -1, &pe, &status);
/* don't need the identifier any more */
free(xform_id);
if (U_FAILURE(status)) {
fprintf(stderr, "utrans_open(%s): %s\n",
argv[1], u_errorName(status));
return EXIT_FAILURE;
}
out = u_get_stdout();
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* transparently transliterate stdout */
u_fsettransliterator(out, U_WRITE, t, &status);
if (U_FAILURE(status)) {
fprintf(stderr,
"Failed to set transliterator on stdout: %s\n",
u_errorName(status));
u_fclose(in);
return EXIT_FAILURE;
}
	/* what looks like a simple echo loop actually
	 * transliterates characters */
while ((c = u_fgetcx(in)) != U_EOF)
u_fputc(c, out);
utrans_close(t);
u_fclose(in);
	return EXIT_SUCCESS;
}

As mentioned, it can emulate our earlier "nomarks" program:
$ echo "résumé façade" | ./trans "NFD; [:Nonspacing Mark:] Remove; NFC"
resume facade

It can also transliterate between scripts like this:
$ echo "miirekkaḍiki veḷutunnaaru?" | ./trans "Telugu"
మీరెక్కడికి వెళుతున్నఅరు?

Applying the transformation to a stream with u_fsettransliterator is a simple way to do things. However, I did discover and file an ICU bug, which will be fixed in version 65.1.
A more robust way to apply transformations is by manipulating UChar strings directly. The technique is also probably more applicable in real applications.
Here’s a rewrite of trans-stream that operates on strings directly:
/*** trans-string.c ***/
#include <stdlib.h>
#include <string.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>
/* max number of UTF-16 code units to accumulate while looking
* for an unambiguous transliteration. Has to be fairly long to
* handle names in Name-Any transliteration like
* \N{LATIN CAPITAL LETTER O WITH OGONEK AND MACRON} */
#define CONTEXT 100
int main(int argc, char **argv)
{
UErrorCode status = U_ZERO_ERROR;
UChar c, *end;
UChar input[CONTEXT] = {0}, *buf, *enlarged;
UFILE *in, *out;
UTransPosition pos;
int32_t width, sizeNeeded, bufLen;
size_t n;
UChar *xform_id;
UTransliterator *t;
/* bufLen must be able to hold at least CONTEXT, and
* will be increased as needed for transliteration */
bufLen = CONTEXT;
buf = malloc(sizeof(UChar) * bufLen);
if (argc != 2)
{
fprintf(stderr,
"Usage: %s \"translation rules\"\n", argv[0]);
return EXIT_FAILURE;
}
/* allocate and read identifier, like earlier example */
n = strlen(argv[1]) + 1;
xform_id = malloc(n * sizeof(UChar));
u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);
t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
NULL, -1, NULL, &status);
free(xform_id);
if (U_FAILURE(status)) {
fprintf(stderr, "utrans_open(%s): %s\n",
argv[1], u_errorName(status));
return EXIT_FAILURE;
}
out = u_get_stdout();
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
end = input;
/* append UTF-16 code units one at a time for incremental
* transliteration */
while ((c = u_fgetc(in)) != U_EOF)
{
/* we consider at most CONTEXT consecutive code units
* for transliteration (minus one for \0) */
if (end - input >= CONTEXT-1)
{
fprintf(stderr,
"Exceeded max (%i) code units "
"for context.\n",
CONTEXT);
break;
}
*end++ = c;
*end = '\0';
/* copy string so far to buf to operate on */
u_strcpy(buf, input);
pos.start = pos.contextStart = 0;
pos.limit = pos.contextLimit = end - input;
sizeNeeded = -1;
utrans_transIncrementalUChars(
t, buf, &sizeNeeded, bufLen, &pos, &status
);
/* if buf not big enough for transliterated result */
if (status == U_BUFFER_OVERFLOW_ERROR)
{
/* utrans_transIncrementalUChars sets sizeNeeded,
* so resize the buffer */
if ((enlarged =
realloc(buf, sizeof(UChar)*sizeNeeded))
== NULL)
{
fprintf(stderr,
"Unable to grow buffer.\n");
/* fail gracefully and display
* what we can */
break;
}
buf = enlarged;
bufLen = sizeNeeded;
u_strcpy(buf, input);
pos.start = pos.contextStart = 0;
pos.limit = pos.contextLimit = end - input;
sizeNeeded = -1;
/* one more time, but with sufficient space */
status = U_ZERO_ERROR;
utrans_transIncrementalUChars(
t, buf, &sizeNeeded, bufLen,
&pos, &status
);
}
/* handle errors other than U_BUFFER_OVERFLOW_ERROR */
if (U_FAILURE(status)) {
fprintf(stderr,
"utrans_transIncrementalUChars(): %s\n",
u_errorName(status));
break;
}
/* print buf[0 .. pos.start - 1] */
u_printf("%.*S", pos.start, buf);
/* Remove the code units which were processed,
* shifting back the remaining ones which could
* not be unambiguously transliterated. Then hit
* the loop to get another code unit and try again. */
u_strcpy(input, buf+pos.start);
end = input + (pos.limit - pos.start);
}
/* if any leftovers from incremental transliteration */
if (end > input)
{
/* transliterate input array in place, do our best */
width = end - input;
utrans_transUChars(
t, input, NULL, CONTEXT, 0, &width, &status);
u_printf("%S", input);
}
utrans_close(t);
u_fclose(in);
free(buf);
return U_SUCCESS(status) ? EXIT_SUCCESS : EXIT_FAILURE;
}
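Invoked with the same rules as the stream version, it should produce the same results (assuming the binary is named trans-string):

$ echo "résumé façade" | ./trans-string "NFD; [:Nonspacing Mark:] Remove; NFC"
resume facade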
Punycode is a representation of Unicode within the limited ASCII character subset used for Internet host names. If you enter a non-ASCII URL into a web browser's navigation bar, the browser translates it to Punycode before making the actual DNS lookup.

The encoding is part of the more general process of Internationalizing Domain Names in Applications (IDNA), which also normalizes the string.
Note that not all Unicode strings can be successfully encoded. For instance, codepoints like "⒈" include a period in the glyph and are used for numbered lists. Carrying that dot into an ASCII hostname would inadvertently specify a subdomain. ICU turns the offending character into U+FFFD (the "replacement character") in the output and returns an error.
The following program uses uidna_nameToASCII or uidna_nameToUnicode as needed to translate between Unicode and Punycode.
/*** puny.c ***/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* uidna stands for International Domain Names in
* Applications and contains punycode routines */
#include <unicode/uidna.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
void chomp(UChar *s)
{
/* unicode characters that split lines */
UChar splits[] =
{0xa, 0xb, 0xc, 0xd, 0x85, 0x2028, 0x2029, '\0'};
if (s)
s[u_strcspn(s, splits)] = '\0';
}
int main(int argc, char **argv)
{
UFILE *in;
UChar input[1024], output[1024];
UIDNAInfo info = UIDNA_INFO_INITIALIZER;
UErrorCode status = U_ZERO_ERROR;
UIDNA *idna = uidna_openUTS46(UIDNA_DEFAULT, &status);
/* default action is performing punycode */
int32_t (*action)(
const UIDNA*, const UChar*, int32_t, UChar*,
int32_t, UIDNAInfo*, UErrorCode*
) = uidna_nameToASCII;
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* the "decode" option reverses our action */
if (argc > 1 && strcmp(argv[1], "decode") == 0)
action = uidna_nameToUnicode;
/* u_fgets includes the newline, so we chomp it */
u_fgets(input, sizeof(input)/sizeof(*input), in);
chomp(input);
action(idna, input, -1, output,
sizeof(output)/sizeof(*output),
&info, &status);
if (U_SUCCESS(status) && info.errors!=0)
fputs("Bad input.\n", stderr);
u_printf("%S\n", output);
uidna_close(idna);
u_fclose(in);
return 0;
}

Example of using the program:
$ echo "façade.com" | ./puny
xn--faade-zra.com
# not every string is allowed
$ echo "a⒈.com" | ./puny
Bad input.
a�.com

The C standard library has functions like toupper which operate on a single character at a time. ICU has equivalents like u_toupper, but working on single codepoints isn't sufficient for proper casing. Let's examine a codepoint-at-a-time program and see why.
/*** pointcase.c ***/
#include <stdlib.h>
#include <string.h>
#include <unicode/uchar.h>
#include <unicode/ustdio.h>
int main(int argc, char **argv)
{
UChar32 c;
UFILE *in, *out;
UChar32 (*op)(UChar32) = NULL;
/* set op to one of the casing operations
* in uchar.h */
if (argc < 2 || strcmp(argv[1], "upper") == 0)
op = u_toupper;
else if (strcmp(argv[1], "lower") == 0)
op = u_tolower;
else if (strcmp(argv[1], "title") == 0)
op = u_totitle;
else
{
fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
return EXIT_FAILURE;
}
out = u_get_stdout();
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* operates on UTF-32 */
while ((c = u_fgetcx(in)) != U_EOF)
u_fputc(op(c), out);
u_fclose(in);
return EXIT_SUCCESS;
}

# not quite right, ß should become SS:
$ echo "Die große Stille" | ./pointcase upper
DIE GROßE STILLE
# also wrong, final sigma should be ς:
$ echo "ΣΊΣΥΦΟΣ" | ./pointcase lower
σίσυφοσ

As you can see, some graphemes need to "expand" into a greater number of characters, and others are position-sensitive. To case text properly, we have to operate on entire strings rather than individual characters. Here is a program that does it right:
/*** strcase.c ***/
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#define BUFSZ 1024
/* wrapper function for u_strToTitle with signature
* matching the other casing functions */
int32_t title(UChar *dest, int32_t destCapacity,
const UChar *src, int32_t srcLength,
const char *locale, UErrorCode *pErrorCode)
{
return u_strToTitle(dest, destCapacity, src,
srcLength, NULL, locale, pErrorCode);
}
int main(int argc, char **argv)
{
UFILE *in;
char *locale;
UChar line[BUFSZ], cased[BUFSZ];
UErrorCode status = U_ZERO_ERROR;
int32_t (*op)(
UChar*, int32_t, const UChar*, int32_t,
const char*, UErrorCode*
) = NULL;
/* casing is locale-dependent */
if (!(locale = setlocale(LC_CTYPE, "")))
{
fputs("Cannot determine system locale\n", stderr);
return EXIT_FAILURE;
}
if (argc < 2 || strcmp(argv[1], "upper") == 0)
op = u_strToUpper;
else if (strcmp(argv[1], "lower") == 0)
op = u_strToLower;
else if (strcmp(argv[1], "title") == 0)
op = title;
else
{
fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
return EXIT_FAILURE;
}
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* Ideally we should change case up to the last word
* break and push the remaining characters back for
* a future read if the line was longer than BUFSZ.
* Currently, if the string is truncated, the final
* character would incorrectly be considered
* terminal, which affects casing rules in Greek. */
while (u_fgets(line, BUFSZ, in))
{
op(cased, BUFSZ, line, -1, locale, &status);
/* if casing increases string length, and goes
* beyond buffer size like the german ß -> SS */
if (status == U_BUFFER_OVERFLOW_ERROR)
{
/* Just issue a warning and read another line.
* Don't treat it as severely as other errors. */
fputs("Line too long\n", stderr);
status = U_ZERO_ERROR;
}
else if (U_FAILURE(status))
{
fputs(u_errorName(status), stderr);
break;
}
else
u_printf("%S", cased);
}
u_fclose(in);
return U_SUCCESS(status)
? EXIT_SUCCESS : EXIT_FAILURE;
}

This works better:
$ echo "Die große Stille" | ./strcase upper
DIE GROSSE STILLE
$ echo "ΣΊΣΥΦΟΣ" | ./strcase lower
σίσυφος

Let's make a version of wc (the Unix word count program) that knows more about Unicode. Our version will properly count grapheme clusters and word boundaries.
For example, regular wc gets confused by the ancient Ogham script. This was a series of notches scratched into fence posts, and has a space character which is nonblank.
$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | wc
1 1 37

One word, you say? Puh-leaze, if your program can't handle Medieval Irish carvings then I want nothing to do with it. Here's one that can:
/*** uwc.c ***/
#include <locale.h>
#include <stdlib.h>
#include <unicode/ubrk.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#define BUFSZ 512
/* line feed, vertical tab, form feed, carriage return,
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
((c) >= 0xa && (c) <= 0xd) || \
(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )
int main(void)
{
UFILE *in;
char *locale;
UChar line[BUFSZ];
UBreakIterator *brk_g, *brk_w;
UErrorCode status = U_ZERO_ERROR;
long ngraph = 0, nword = 0, nline = 0;
size_t len;
/* word breaks are locale-specific, so we'll obtain
* LC_CTYPE from the environment */
if (!(locale = setlocale(LC_CTYPE, "")))
{
fputs("Cannot determine system locale\n", stderr);
return EXIT_FAILURE;
}
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
/* create an iterator for graphemes */
brk_g = ubrk_open(
UBRK_CHARACTER, locale, NULL, -1, &status);
/* and another for the edges of words */
brk_w = ubrk_open(
UBRK_WORD, locale, NULL, -1, &status);
/* yes, this is sensitive to splitting end of line
* surrogate pairs and can be improved by our previous
* function for reading bounded lines */
while (u_fgets(line, BUFSZ, in))
{
len = u_strlen(line);
ubrk_setText(brk_g, line, len, &status);
ubrk_setText(brk_w, line, len, &status);
/* Start at beginning of string, count breaks.
* Could have been a for loop, but this looks
* simpler to me. */
ubrk_first(brk_g);
while (ubrk_next(brk_g) != UBRK_DONE)
ngraph++;
ubrk_first(brk_w);
while (ubrk_next(brk_w) != UBRK_DONE)
if (ubrk_getRuleStatus(brk_w) ==
UBRK_WORD_LETTER)
nword++;
/* count the newline if it exists */
if (len > 0 && NEWLINE(line[len-1]))
nline++;
}
	/* the counters are long, so print with %ld (not %zu) */
	printf("locale : %s\n"
	       "Grapheme: %ld\n"
	       "Word : %ld\n"
	       "Line : %ld\n",
	       locale, ngraph, nword, nline);
/* clean up iterators after use */
ubrk_close(brk_g);
ubrk_close(brk_w);
u_fclose(in);
	return EXIT_SUCCESS;
}

Much better:
$ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | ./uwc
locale : en_US.UTF-8
Grapheme: 14
Word : 4
Line : 1

When comparing strings, we can be more or less strict. A familiar example is case sensitivity, but Unicode provides other options. Comparing strings for equality is a degenerate case of sorting, where the strings must not only be determined equal, but put in order. Sorting is called "collation," and the Unicode collation algorithm supports multiple levels of increasing strictness.
| Level | Description |
|---|---|
| Primary | base characters |
| Secondary | accents |
| Tertiary | case/variant |
| Quaternary | punctuation |
Each level acts as a tie-breaker when strings match in previous levels. When searching we can choose how deep to check before declaring strings equal. To illustrate, consider a text file called words.txt containing these words:
Cooperate
coöperate
COÖPERATE
co-operate
ﬁnal
fides
We will write a program called ugrep, where we can specify a comparison level and search string. If we search for “cooperate” and allow comparisons up to the tertiary level it matches nothing:
$ ./ugrep 3 cooperate < words.txt
# it's an exact match, no results

It is possible to shift certain "ignorable" characters (like '-') down to the quaternary level while conducting the original level 3 search:
$ ./ugrep 3i cooperate < words.txt
4: co-operate

Doing the same search at the secondary level disregards case, but is still sensitive to accents:
$ ./ugrep 2 cooperate < words.txt
1: Cooperate

Once again, we can allow ignorables at this level:
$ ./ugrep 2i cooperate < words.txt
1: Cooperate
4: co-operate

Finally, going only to the primary level, we match words with the same base letters, modulo case and accents:
$ ./ugrep 1 cooperate < words.txt
1: Cooperate
2: coöperate
3: COÖPERATE

Note that the idea of a "base character" is dependent on locale. In Swedish, the letters o and ö are quite distinct, and not minor variants as in English. Setting the locale prior to search restricts the results even at the primary level:
$ LC_COLLATE=sv_SE ./ugrep 1 cooperate < words.txt
1: Cooperate

One note about the tertiary level: it distinguishes not just case, but ligature presentation forms. Our words.txt spells "ﬁnal" with the ﬁ ligature, so a tertiary search for plain "fi" matches only "fides":
$ ./ugrep 3 fi < words.txt
6: fides
# vs
$ ./ugrep 2 fi < words.txt
5: ﬁnal
6: fides

Pretty flexible, right? Let's see the code.
/*** ugrep.c ***/
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <unicode/ucol.h>
#include <unicode/usearch.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#define BUFSZ 1024
int main(int argc, char **argv)
{
char *locale;
UFILE *in;
UCollator *col;
UStringSearch *srch = NULL;
UErrorCode status = U_ZERO_ERROR;
UChar *needle, line[BUFSZ];
UColAttributeValue strength;
int ignoreInsignificant = 0, asymmetric = 0;
size_t n;
long i;
if (argc != 3)
{
fprintf(stderr,
"Usage: %s {1,2,@,3}[i] pattern\n", argv[0]);
return EXIT_FAILURE;
}
/* cryptic parsing for our cryptic options */
switch (*argv[1])
{
case '1':
strength = UCOL_PRIMARY;
break;
case '2':
strength = UCOL_SECONDARY;
break;
case '@':
strength = UCOL_SECONDARY, asymmetric = 1;
break;
case '3':
strength = UCOL_TERTIARY;
break;
default:
fprintf(stderr,
"Unknown strength: %s\n", argv[1]);
return EXIT_FAILURE;
}
/* length of argv[1] is >0 or we would have died */
ignoreInsignificant = argv[1][strlen(argv[1])-1] == 'i';
n = strlen(argv[2]) + 1;
/* if UTF-8 could encode it in n, then UTF-16
* should be able to as well */
needle = malloc(n * sizeof(*needle));
u_strFromUTF8(needle, n, NULL, argv[2], -1, &status);
/* searching is a degenerate case of collation,
* so we read the LC_COLLATE locale */
if (!(locale = setlocale(LC_COLLATE, "")))
{
fputs("Cannot determine system collation locale\n",
stderr);
return EXIT_FAILURE;
}
if (!(in = u_finit(stdin, NULL, NULL)))
{
fputs("Error opening stdin as UFILE\n", stderr);
return EXIT_FAILURE;
}
col = ucol_open(locale, &status);
ucol_setStrength(col, strength);
if (ignoreInsignificant)
/* shift ignorable characters down to
* quaternary level */
ucol_setAttribute(col, UCOL_ALTERNATE_HANDLING,
UCOL_SHIFTED, &status);
/* Assumes all lines fit in BUFSZ. Should
* fix this in real code and not increment i */
for (i = 1; u_fgets(line, BUFSZ, in); ++i)
{
/* first time through, set up all options */
if (!srch)
{
srch = usearch_openFromCollator(
needle, -1, line, -1,
col, NULL, &status
);
if (asymmetric)
usearch_setAttribute(
srch, USEARCH_ELEMENT_COMPARISON,
USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD,
&status
);
}
/* afterward just switch text */
else
usearch_setText(srch, line, -1, &status);
/* check if keyword appears in line */
if (usearch_first(srch, &status) != USEARCH_DONE)
u_printf("%ld: %S", i, line);
}
usearch_close(srch);
ucol_close(col);
u_fclose(in);
free(needle);
return EXIT_SUCCESS;
}

In the concepts section, we saw that a single grapheme can be constructed with different combinations of codepoints. In many cases when comparing strings for equality, we're most interested in whether the user perceives the strings the same way, not in a byte-for-byte match.
The ICU library provides a unorm_compare function which returns a value similar to strcmp, and acts in a normalization-independent way. It normalizes both strings incrementally while comparing them, so it can stop early if it finds a difference.
Here is code to check that the five ways of representing ộ are equivalent:
#include <stdio.h>
/* unorm_compare is declared in the older unorm.h header */
#include <unicode/unorm.h>
int main(void)
{
UErrorCode status = U_ZERO_ERROR;
UChar s[][4] = {
{0x006f,0x0302,0x0323,0},
{0x006f,0x0323,0x0302,0},
{0x00f4,0x0323,0,0},
{0x1ecd,0x0302,0,0},
{0x1ed9,0,0,0}
};
const size_t n = sizeof(s)/sizeof(s[0]);
size_t i;
for (i = 0; i < n; ++i)
printf("%zu == %zu: %d\n", i, (i+1)%n,
unorm_compare(
s[i], -1, s[(i+1)%n], -1, 0, &status));
}

Output:
0 == 1: 0
1 == 2: 0
2 == 3: 0
3 == 4: 0
4 == 0: 0
A return value of 0 means the strings are equal.
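The options argument (0 in our call) accepts flags for looser matching; for instance, adding U_COMPARE_IGNORE_CASE additionally folds case, making the comparison caseless as well as normalization-independent.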
Because Unicode introduces so many graphemes, there are more possibilities for scammers to confuse people using lookalike glyphs. For instance, domains like adoḅe.com or pаypal.com (with Cyrillic а) can direct unwary visitors to phishing sites. ICU contains an entire module for detecting “confusables,” those strings which are known to look too similar when rendered in common fonts. Each string is assigned a “skeleton” such that confusable strings get the same skeleton.
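To give a flavor of the API without any database plumbing, here's a minimal sketch (the file name and sample strings are my own) that asks ICU whether two strings are confusable:

/*** confusable.c ***/
#include <stdio.h>
#include <stdlib.h>
#include <unicode/uspoof.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	/* "pаypal" with U+0430 CYRILLIC SMALL LETTER A,
	 * versus plain ASCII "paypal" */
	UChar s1[] = {0x70, 0x0430, 0x79, 0x70, 0x0430, 0x6c, 0};
	UChar s2[] = {0x70, 0x61, 0x79, 0x70, 0x61, 0x6c, 0};
	int32_t result;
	USpoofChecker *sc = uspoof_open(&status);
	if (U_FAILURE(status))
	{
		fprintf(stderr, "uspoof_open(): %s\n",
		        u_errorName(status));
		return EXIT_FAILURE;
	}
	/* nonzero means the strings could be mistaken
	 * for one another */
	result = uspoof_areConfusable(sc, s1, -1, s2, -1, &status);
	printf("confusable: %s\n",
	       U_SUCCESS(status) && result ? "yes" : "no");
	uspoof_close(sc);
	return EXIT_SUCCESS;
}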
For an example, see my utility utofu. It has a little extra complexity with sqlite access code, so I am not reproducing it here. It’s designed to check Unicode strings to detect changes over time that might be spoofing.
In short, it stores a skeleton for each string it monitors, and flags a string if a later run computes a different skeleton.
Unicode and internationalization is a huge topic, and I could only scratch the surface in this article. For further study, the Unicode Standard itself and the ICU User Guide are excellent references.