Patches, Part 4: Line IDs
Joe Neeman February 25, 2019I've written quite a bit about the theory of patches and merging, but nothing yet about how to actually implement anything efficiently. That will be the subject of this post, and probably some future posts too. Algorithms and efficiency are not really discussed in the original paper, so most of this material I learned from reading the pijul source code. Having said that, my main focus here is on broader ideas and algorithms, and so you shouldn't assume that anything written here is an accurate reflection of pijul (plus, my pijul knowledge is about 2 years out of date by now).
Three sizes
Before getting to the juicy details, we have to decide what it means for things to be fast. In a VCS, there are three different size scales that we need to think about. From smallest to biggest, we have:
- The size of the change. Like, if we're changing one line in a giant file, then the size of the change is just the length of the one line that we're changing.
- The size of the current output. In the case of ojo (which just tracks a single file), this is just the size of the file.
- The size of the history. This includes everything that has ever been in the repository, so if the repository has been active for years then the size of the history could be much bigger than the current size.
The first obvious requirement is that the size of a patch should be proportional to the size of the change. This sounds almost too obvious to mention, but remember the definition of a patch from here: a patch consists of a source file, a target file, and a function from one to the other that has certain additional properties. If we were to naively translate this definition into code, the size of a patch would be proportional to the size of the entire file.
Of course, this is a solved problem in the world of UNIX-style diffs (which I mentioned all the way back in the first post). The problem is to adapt the diff approach to our mathematical patch framework; for example, the fact that our files need not even be ordered means that it doesn't make sense to talk about inserting a line "after line 62."
The key to solving this turns out to be to unique IDs: give every line in the
entire history of the repository a unique ID. This isn't even very difficult:
we can give every patch in the history of the repository a unique ID by hashing
its contents. For each patch, we can enumerate the lines that it adds and then
for the rest of time, we can uniquely refer to those lines like "the third line
added by patch Ar8f
."
Representing patches
Once we've added unique IDs to every line, it becomes pretty easy to encode patches compactly. For example, suppose we want to describe this patch:
Here, dA
is the unique ID of the patch that introduced the to-do, shoes, and
garbage lines, and x5
is the unique ID of the patch that we want to describe.
Anyway, the patch is now easy to describe by using the unique IDs to specify
what we want to do: delete the line with ID dA/1
, add the line with ID x5/0
and contents "work", and add an edge from the line dA/2
to the line
x5/0
.
Let's have a quick look at how this is implemented in ojo, by taking a peek
at the API docs. Patches, funnily
enough, are represented by the
Patch
struct, which
basically consists of metadata (author, commit message, timestamp) and a list
of Change
s. The
Change
s are the most interesting part, and they look like this:
In other words, the example that we saw above is basically all there is to it, as far as patches go.
If you want to see what actual patches look like in actual usage, you can do
that too because ojo keeps all of its data in human-readable text. After installing
ojo (with cargo install ojo
), you can create a new repository (with ojo init
), edit the file ojo_file.txt
with your favorite editor, and then:
$ ojo patch create -m "Initial commit" -a Me
Created patch PMyANESmvMQ8WR8ccSKpnH8pLc-uyt0jzGkauJBWeqx4=
$ ojo patch export -o out.txt PSc97nCk9oRrRl-2IW3H8TYVtA0hArdVtj5F0f4YSqqs=
Successfully wrote the file 'out.txt'
Now look in out.txt
to see your NewNode
s and NewEdge
s in all their glory.
Antiquing
I introduced unique IDs as a way to achieve compact representations of patches,
but it turns out that they also solve a problem that I promised to explain
two years ago:
how do I compute the "most antique"
version of a patch? Or equivalently, if I have some patch but I want to apply
it to a slightly different repository, how do I know whether I can do that?
With our description of patches above, this is completely trivial: a patch can
only add lines, delete lines, or add edges. Adding lines is always valid, no
matter what the repository contains. Deleting lines and adding edges can be
done if and only if the lines to delete, or the lines to connect, exist in the
repository. Since lines have unique IDs, checking this is unambiguous.
Actually, it's really easy because the line IDs are tied to the patch that
introduced them: a patch can be applied if and only if all the patch IDs that
it refers to have already been applied. For obvious reasons, we refer to these
as "dependencies": the dependencies of a patch are all the other patches that
it refers to in DeleteNode
and NewEdge
commands. You can see this in action
here.
By the way, this method will always give a minimal set of dependencies (in other words, the most antique version of a patch), but it isn't necessarily the right thing to do. For example, if a patch deletes a line then it seems reasonable for it to also depend on the lines adjacent to the deleted line. Ojo might do this in the future, but for now it sticks to the minimal dependencies.
Applying patches
Now that we know how to compactly represent patches, how quickly can we apply
them? To get really into detail here, we'd need to talk about how the state of
the repository is represented on disk (which is an interesting topic on its
own, but a bit out of scope for this post). Let's just pretend for now that the
current state of the repository is stored as a graph in memory, using some
general-purpose crate
(like, say, petgraph
).
Each node in the graph needs to store the contents of the corresponding line,
as well as a "tombstone" saying whether it has been deleted
(see the first post). Assuming we can
add nodes and edges in constant time (like, say, in petgraph
), applying
a single change is a constant time operation. That means the time it takes
to apply the whole patch is proportional to the number of changes.
That's the best we could hope for, so we're done, right? What was even the
point of the part about three size scales?
Revenge of the ghosts
Imagine you have a file that contains three lines:
first line
second line
last line
but behind the scenes, there are a bunch of lines that used to be there. So ojo's representation of your file might look like:
Now let's imagine that we delete "second line." The patch to do this consists
of a single DeleteLine
command, and it takes almost no time to apply:
Now that we have this internal representation, ojo needs to create a file on disk showing the new state. That is, we want to somehow go from the internal representation above to the file
first line
last line
Do you see the problem? Even though the output file is only two lines long, in order to produce it we need to visit all of the lines that used to be there but have since been deleted. In other words, we can apply patches quickly (in timescale 1), but rendering the output file is slow (in timescale 3). For a real VCS that tracks decade-old repositories, that clearly isn't going to fly.
Pseudo-edges
There are several ingredients that go into supporting fast rendering of output files (fast here means "timescale 2, most of the time", which is the best that we can hope for). Those are going to be the subject of the next post. So that you have something to think about until then, let me get you started: the key idea is to introduce "pseudo-edges," which are edges that we insert on our own in order to allow us to "skip" large regions of deleted lines. In the example above, the goal is to actually generate this graph:
These extra edges will allow us to quickly render output files, but they open
up a new can of worms (and were the source of several subtle bugs in pijul
last time I used it (i.e. two years ago)): how do we know when to add (or
remove, or update) pseudo-edges? Keep in mind that we aren't willing to
traverse the entire graph to compute the pseudo-edges, because that would
defeat the purpose.