Parsing with Binary String Pattern Matching
I want to show you a glimpse of what it could look like when you parse it binary string pattern matching in OCaml like you'd do on Erlang, Elixir, or Gleam
If you've ever had to parse anything, (anything really), I want to show you a glimpse of what it could look like when you parse it binary string pattern matching, like you'd do on Erlang, Elixir, or Gleam.
In OCaml, we can do this with the bitstring
package.
To do this we'll use a small example of something I've had to parse recently: WebSocket Frames.
A WebSocket frame looks like this:
This roughly means:
1st byte includes the
fin
,rsv1,2,3
andopcode
headers2nd byte includes a
mask
and apayload_length
3rd-7th byte includes an extended payload (can be shorter too)
the byte after that is going to be a masking key
and then the rest is the actual data you packed in the WebSocket frame
The way you'd parse this with parse combinators looks a lot like the BNF grammar used to describe this packet. Let's look at some code, shall we?
We'll assume we have a function Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload
that will create a record of type Frame.t
. That part is the least interesting. The interesting thing is how do we get those values.
Here's how the websocketaf
parses this frame using :
(* this example code taken almost verbatim from the `websocketaf`
opam package.
I've only amended the last `parse` function to make it more obvious how
the parsing is becoming a frame. In practice they do something
different after parsing.
*)
(* a Bigstringaf.t (i know, i giggle too) is an efficient type
for a large string that doesn't have to be copied over and over.
If you need something similar in your code, consider using Cstruct.t
*)
type t = Bigstringaf.t
(* this function takes a bigstring and checks if the first bit, which
corresponds to the FIN bit is set *)
let is_fin t =
let bits = Bigstringaf.unsafe_get t 0 |> Char.code in
bits land (1 lsl 8) = 1 lsl 8
;;
(* this function extracts the reserverd bits from the bigstring *)
let rsv t =
let bits = Bigstringaf.unsafe_get t 0 |> Char.code in
(bits lsr 4) land 0b0111
;;
(* this function creates an int representation for an opcode *)
let opcode t =
let bits = Bigstringaf.unsafe_get t 0 |> Char.code in
bits land 0b1111 |> Opcode.unsafe_of_code
;;
(* this function gets the length of the payload.
and yes that comment was already there.
*)
let payload_length_of_offset t off =
let bits = Bigstringaf.unsafe_get t (off + 1) |> Char.code in
let length = bits land 0b01111111 in
if length = 126 then Bigstringaf.unsafe_get_int16_be t (off + 2) else
(* This is technically unsafe, but if somebody's asking us to read 2^63
* bytes, then we're already screwd. *)
if length = 127 then Bigstringaf.unsafe_get_int64_be t (off + 2) |> Int64.to_int else
length
;;
let payload_length t =
payload_length_of_offset t 0
;;
(* this function checks if the bigstring has a specific mask set *)
let has_mask t =
let bits = Bigstringaf.unsafe_get t 1 |> Char.code in
bits land (1 lsl 7) = 1 lsl 7
(* and this one checks if the mask exists and if so, gets the value *)
let mask t =
if not (has_mask t)
then None
else
Some (
let bits = Bigstringaf.unsafe_get t 1 |> Char.code in
if bits = 254 then Bigstringaf.unsafe_get_int32_be t 4 else
if bits = 255 then Bigstringaf.unsafe_get_int32_be t 10 else
Bigstringaf.unsafe_get_int32_be t 2)
;;
(* this computes the offset of the payload in bits *)
let payload_offset_of_bits bits =
let initial_offset = 2 in
let mask_offset = (bits land (1 lsl 7)) lsr (7 - 2) in
let length_offset =
let length = bits land 0b01111111 in
if length < 126
then 0
else 2 lsl ((length land 0b1) lsl 2)
in
initial_offset + mask_offset + length_offset
;;
let payload_offset t =
let bits = Bigstringaf.unsafe_get t 1 |> Char.code in
payload_offset_of_bits bits
;;
let length_of_offset t off =
let bits = Bigstringaf.unsafe_get t (off + 1) |> Char.code in
let payload_offset = payload_offset_of_bits bits in
let payload_length = payload_length_of_offset t off in
payload_offset + payload_length
;;
let length t =
length_of_offset t 0
;;
let apply_mask mask bs ~off ~len =
for i = off to off + len - 1 do
let j = (i - off) mod 4 in
let c = Bigstringaf.unsafe_get bs i |> Char.code in
let c = c lxor Int32.(logand (shift_right mask (8 * (3 - j))) 0xffl |> to_int) in
Bigstringaf.unsafe_set bs i (Char.unsafe_chr c)
done
;;
let unmask_inplace t =
if has_mask t then begin
let mask = mask_exn t in
let len = payload_length t in
let off = payload_offset t in
apply_mask mask t ~off ~len
end
;;
(* this was confusing to me too *)
let mask_inplace = unmask_inplace
let parse =
let open Angstrom in
Unsafe.peek 2 (fun bs ~off ~len:_ -> length_of_offset bs off)
>>= fun len -> Unsafe.take len Bigstringaf.sub
>>| fun frame ->
let fin = Websocket.Frame.is_fin frame in
let opcode = Websocket.Frame.opcode frame in
unmask_inplace frame;
let len = payload_length t in
let off = payload_offset t in
Frame.make ~fin ~opcode ~payload:(len, off, frame)
;;
This process begins by defining from the bottom up all the smallest parts we need to use in our final combinator. Only at the very end do we stitch everything together. It also requires us to keep track of several small numbers in different positions to understand what is being read from where.
If you wanted to run this parser incrementally, you now need an external loop that would normally read something like this:
let parse ?(parser = Frame.parse) data =
match parser with
(* check if the parser is currently in a partial state *)
| Angstrom.Buffered.Partial continue -> (
(* continue by with the next piece of data *)
match continue (`Bigstring data) with
(* if we're done, return our frame *)
| Angstrom.Buffered.Done (_unconsumed, frame) -> Ok (`frame frame)
(* handle any parsing errors *)
| Angstrom.Buffered.Fail (_, _, _) -> Error `bad_frame
(* otherwise return that we need more data to continue *)
| parser -> Ok (`more parser))
| Angstrom.Buffered.Done (_unconsumed, frame) -> Ok (`frame frame)
| Angstrom.Buffered.Fail (_, _, _) -> Error `bad_frame
(* helper function to read data and continue parsing until we are done
or until the parser errors *)
let rec read_all_and_parse ?(parser=Frame.parse) () =
let data = read_data () in
match parse ~parser data with
| Ok (`more parser) -> read_all_and_parse ~parser ()
| Ok (`frame frame) -> Ok frame
| Error reason -> Error reason
Not bad! I kinda like being able to control the parsing incrementally like this. It can make for really efficient code.
And here's how it can look when pattern matching on binary strings:
let deserialize ?(max_frame_size = 0) data =
match%bitstring Bitstring.bitstring_of_string data with
| {| fin : 1;
compressed : 1;
rsv : 2;
opcode : 4;
pad1 : 1 : check( pad1 = true );
pad2 : 7 : check( pad2 = 127 );
length : 64;
mask : 32;
payload : Int64.(mul length 8L |> to_int) : string;
rest : -1 : string |}
when max_frame_size = 0 || Int64.(length <= of_int max_frame_size) ->
Some (Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload, rest)
| {| fin : 1;
compressed : 1;
rsv : 2;
opcode : 4;
pad1 : 1 : check( pad1 = true );
pad2 : 7 : check( pad2 = 126 );
length : 16 : int;
mask : 32 : int;
payload : (length * 8) : string;
rest : -1 : string |}
when max_frame_size = 0 || length <= max_frame_size ->
Some (Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload, rest)
| {| fin : 1;
compressed : 1;
rsv : 2;
opcode : 4;
x : 1 : check( x = true );
length : 7 : int;
mask : 32 : int;
payload : (length * 8) : string;
rest : -1 : string |}
when length <= 125 && (max_frame_size == 0 || length <= max_frame_size) ->
Some (Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload, rest)
| {| data : -1 : string |} -> Some (`more data, "")
This function also takes in a string as data, but it does a much clearer job of outlining what are the valid cases for frames on each matching branch.
Let's break this down.
A binary pattern match starts with match%bitstring data with
and includes several cases. Every case is a regular OCaml string with the delimiters {| ... |}
. The bitstring
package will translate this string into a series of operations over the string data. This is why this is more declarative.
Okay but what goes inside each pattern?
Roughly every binary pattern follows this format: {| variable : size : type |}
and they are all separated by semicolons.
For example, let's grab the first branch of the match:
{| fin : 1;
compressed : 1;
rsv : 2;
opcode : 4;
pad1 : 1 : check( pad1 = true );
pad2 : 7 : check( pad2 = 127 );
length : 64;
mask : 32;
payload : Int64.(mul length 8L |> to_int) : string;
rest : -1 : string |}
In here we are saying that:
the first bit will be captured by the variable
fin
,the second one by the variable
compressed
,the next 2 bits by the variable
rsv
the next 4 bits by the variable
opcode
the next 1 bit by the variable
pad1
, and we will check that the variablepad1
is true – since it is a single bit, the library maps this to abool
the next 7 bits by the variable
pad2
, and we will check that it is the number 127the next 64 bits by the variable
length
the next 32 bits by the variable
mask
and next
length * 8
bits by the variable payload – and we specifically want this to be captured as a stringthe rest of the bits (
-1
) as a string
Now compare that to our little diagram and tell me if this, while packed with info, is not easier to double-check that you are in fact parsing the right thing.
With that explained, we can go back to calling our function.
Since we control the returning values in this function, it is straightforward to fit it into other abstractions that we may be using. For example, we can use it with Seq.unfold
to generate a sequence of frames:
Seq.unfold deserialize data
(* `ok frame
`ok frame
`more rest *)
And with a little combinator for sequences, we can make it quite pleasant to incrementally parse over streams of data:
(* reduce a value until we explicitly halt or the sequence is finished *)
let rec reduce_while init fn t =
match t () with
| Seq.Nil -> init
| Seq.Cons (v, t') -> (
match fn v init with
| `continue acc -> reduce_while acc fn t'
| `halt acc -> acc)
(* do an incremental read/parse/handle_frame loop *)
let rec parse data =
let next =
Seq.unfold Frame.deserialize data
|> reduce_while `ok (fun frame state ->
match (state, frame) with
| `ok, `more buf -> `halt (`more buf)
| `ok, `ok frame -> `continue (handle_frame frame)
| `ok, `error reason -> `halt (`error reason)
| `error reason, _ -> `halt (`error reason)
| `more buf, _ -> `halt (`more buf))
in
match next with
| `ok -> Ok ()
| `error reason -> Error reason
| `more buffer ->
let more_data = read_data () in
parse (buffer ^ more_data)
Just to clarify, I'm sure you can fit Angstrom into a Seq.t
as well. So this last reduce_while
trick isn't a win specifically for binary pattern matching, but having full control over how the parser executes sure helps us build parsers that fit exactly where we need them to.
Conclusion
I like binary pattern matching better, it has way less cognitive overhead, it helps me parse instantly what a sequence of bits/bytes would look like, and its recursive descent approach helps me write parsers that fit exactly into the code around them.
They're not for everything, and some things can be rather painful to parse where you end up using tiny combinators anyway (like "parse all until new line"), but overall I think they're a major improvement in readability and declarativeness over parser combinators.
If you find this interesting, have a look at bitstring. I'll be writing more about this package soon.
That's all I've got for today folks, it's the last day of the year and I figured we should end it with one last 2023 issue.
If you liked what I've written you can support me on Github Sponsors ✨
Here's to more Practical OCaml in 2024 🥂
Thanks to @dillon_mulroy for reviewing a draft of this post.