Published on: June 16, 2025

11 min read

What’s new in Git 2.50.0?

Here are contributions from GitLab's Git team and the wider Git community, such as the new git-diff-pairs(1) command and the git-update-ref(1) option to perform batched reference updates.

The Git project recently released Git 2.50.0. Let's look at a few notable highlights from this release, which includes contributions from the Git team at GitLab as well as the wider Git community.

New git-diff-pairs(1) command

Diffs are at the heart of every code review and show all the changes made between two revisions. GitLab shows diffs in various places, but the most common place is a merge request's "Changes" tab. Behind the scenes, diff generation is powered by git-diff(1). For example:

$ git diff HEAD~1 HEAD

This command returns the full diff for all changed files. This can pose a scalability challenge: the number of files changed between two revisions may be very large, causing the command to hit the self-imposed timeouts of the GitLab backend. For large change sets, it would be better if diff computation could be broken into smaller, more digestible chunks.

One way this can be achieved is by using git-diff-tree(1) to retrieve info about all the changed files:

$ git diff-tree -r -M --abbrev HEAD~ HEAD
:100644 100644 c9adfed339 99acf81487 M      Documentation/RelNotes/2.50.0.adoc
:100755 100755 1047b8d11d 208e91a17f M      GIT-VERSION-GEN

Git refers to this output as the "raw" format. In short, each line of output describes a single filepair along with metadata about what changed between the start and end revisions. Compared to generating the "patch" output for large changes, this process is relatively quick and provides a summary of everything that changed. Appending the -M flag additionally enables rename detection, so changes caused by a file rename are identified as such.
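
For reference, the fields of each raw line follow the format documented for git-diff(1) "raw" output; for renames (status R<score>), both the source and destination paths follow the status. A brief annotated breakdown of the first line above:

# :<src mode> <dst mode> <src blob ID> <dst blob ID> <status>	<path>
:100644 100644 c9adfed339 99acf81487 M      Documentation/RelNotes/2.50.0.adoc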

With this information, we could use git-diff(1) to compute each of the filepair diffs individually. For example, we can provide the blob IDs directly:

$ git diff 1047b8d11de767d290170979a9a20de1f5692e26 208e91a17f04558ca66bc19d73457ca64d5385f

We can repeat this process for each of the filepairs, but spinning up a separate Git process for each individual file diff is not very efficient. Furthermore, when using blob IDs directly, the diff loses contextual information such as the change status and the file modes, which are stored in the parent tree objects. What we really want is a mechanism to feed "raw" filepair info to Git and have it generate the corresponding patch output.

With the 2.50 release, Git has a new built-in command named git-diff-pairs(1). This command accepts "raw" formatted filepair info as input on stdin to determine exactly which patches to output. The following example showcases how this command could be used:

$ git diff-tree -r -z -M HEAD~ HEAD | git diff-pairs -z

When used in this manner, the resulting output is identical to using git-diff(1). By having a separate command to generate patch output, the "raw" output from git-diff-tree(1) can be broken up into smaller batches of filepairs and fed to separate git-diff-pairs(1) processes. This solves the previously mentioned scalability concern because diffs no longer have to be computed all at once. Future GitLab releases could build upon this mechanism to improve diff generation performance, especially in cases where large change sets are concerned. For more information on this change, check out the corresponding mailing-list thread.
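
As a rough sketch of how a caller could take advantage of this decoupling (not necessarily how GitLab will implement it), the raw filepair listing can be computed once, cached, and then turned into patches in a separate step, or split into smaller batches of records:

# Compute the raw filepair listing once and cache it.
$ git diff-tree -r -z -M HEAD~ HEAD >filepairs.raw
# Later, or per batch of filepair records, generate the actual patches.
$ git diff-pairs -z <filepairs.raw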

This project was led by Justin Tobler.

Batched reference updates

Git provides the git-update-ref(1) command to perform reference updates. When used with the --stdin flag, multiple reference updates can be batched together in a single transaction by specifying instructions for each reference update to be performed on stdin. Bulk updating references in this manner also provides atomic behavior whereby a single reference update failure results in an aborted transaction and no references being updated. Here is an example showcasing this behavior:

# Create repository with three empty commits and branch named "foo"
$ git init
$ git commit --allow-empty -m 1
$ git commit --allow-empty -m 2
$ git commit --allow-empty -m 3
$ git branch foo

# Print out the commit IDs
$ git rev-list HEAD
cf469bdf5436ea1ded57670b5f5a0797f72f1afc
5a74cd330f04b96ce0666af89682d4d7580c354c
5a6b339a8ebffde8c0590553045403dbda831518

# Attempt to create a new reference and update existing reference in transaction.
# Update is expected to fail because the specified old object ID doesn’t match.
$ git update-ref --stdin <<EOF
> create refs/heads/bar cf469bdf5436ea1ded57670b5f5a0797f72f1afc
> update refs/heads/foo 5a6b339a8ebffde8c0590553045403dbda831518 5a74cd330f04b96ce0666af89682d4d7580c354c
> EOF
fatal: cannot lock ref 'refs/heads/foo': is at cf469bdf5436ea1ded57670b5f5a0797f72f1afc but expected 5a74cd330f04b96ce0666af89682d4d7580c354c

# The "bar" reference was not created.
$ git switch bar
fatal: invalid reference: bar

Compared to updating many references individually, updating in bulk is also much more efficient. While this works well, there might be certain circumstances where it is okay for a subset of the requested reference updates to fail, but we still want to take advantage of the efficiency gains of bulk updates.

With this release, git-update-ref(1) has the new --batch-updates option, which allows the transaction to proceed even when one or more of the reference updates fail. In this mode, individual failures are reported in the following format:

rejected SP (<old-oid> | <old-target>) SP (<new-oid> | <new-target>) SP <rejection-reason> LF

This allows successful reference updates to proceed while indicating which updates were rejected and why. Using the same repository from the previous example:

# Attempt to create a new reference and update existing reference in transaction.
$ git update-ref --stdin --batch-updates <<EOF
> create refs/heads/bar cf469bdf5436ea1ded57670b5f5a0797f72f1afc
> update refs/heads/foo 5a6b339a8ebffde8c0590553045403dbda831518 5a74cd330f04b96ce0666af89682d4d7580c354c
> EOF
rejected refs/heads/foo 5a6b339a8ebffde8c0590553045403dbda831518 5a74cd330f04b96ce0666af89682d4d7580c354c incorrect old value provided

# The "bar" reference was created even though the update to "foo" was rejected.
$ git switch bar
Switched to branch 'bar'

This time, with the --batch-updates option, the reference creation succeeded even though the update didn't work. This patch series lays the groundwork for future performance improvements in git-fetch(1) and git-receive-pack(1) when references are updated in bulk. For more information, check out the mailing-list thread.

This project was led by Karthik Nayak.

New filter option for git-cat-file(1)

With git-cat-file(1), it is possible to print info for all objects contained in the repository via the --batch-all-objects option. For example:

# Setup simple repository.
$ git init
$ echo foo >foo
$ git add foo
$ git commit -m init

# Create an unreachable object.
$ git commit --amend --no-edit

# Use git-cat-file(1) to print info about all objects including unreachable objects.
$ git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)'
commit 0b07e71d14897f218f23d9a6e39605b466454ece
tree 205f6b799e7d5c2524468ca006a0131aa57ecce7
blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
commit c999f781fd7214b3caab82f560ffd079ddad0115

In some situations, a user might want to search through all objects in the repository, but only output a subset based on some specified attribute. For example, if we wanted to see only the objects that are commits, we could use grep(1):

$ git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' | grep ^commit
commit 0b07e71d14897f218f23d9a6e39605b466454ece
commit c999f781fd7214b3caab82f560ffd079ddad0115

While this works, one downside with filtering the output is that git-cat-file(1) still has to traverse all the objects in the repository, even the ones that the user is not interested in. This can be rather inefficient.

With this release, git-cat-file(1) now has the --filter option, which only shows objects matching the specified criteria. This is similar to the option of the same name for git-rev-list(1), but only a subset of the filters is supported: blob:none, blob:limit=<n>, and object:type=<type>. Repeating the previous example, objects can now be filtered by type by Git directly:

$ git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' --filter='object:type=commit'
commit 0b07e71d14897f218f23d9a6e39605b466454ece
commit c999f781fd7214b3caab82f560ffd079ddad0115
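
The other supported filters behave similarly. As a sketch continuing the same example repository (object IDs reused from the session above), blob:none would drop the blob while keeping commits and trees:

$ git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' --filter='blob:none'
commit 0b07e71d14897f218f23d9a6e39605b466454ece
tree 205f6b799e7d5c2524468ca006a0131aa57ecce7
commit c999f781fd7214b3caab82f560ffd079ddad0115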

Not only is it more convenient to let Git handle the filtering, it is also potentially more efficient for large repositories with many objects. If a repository has bitmap indices, Git can efficiently look up objects of a specific type and avoid scanning through the packfile, which leads to a significant speedup. Benchmarks conducted on the Chromium repository show significant improvements:

Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-filter
   Time (mean ± σ):     82.806 s ±  6.363 s    [User: 30.956 s, System: 8.264 s]
   Range (min … max):   73.936 s … 89.690 s    10 runs

Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=tag
   Time (mean ± σ):      20.8 ms ±   1.3 ms    [User: 6.1 ms, System: 14.5 ms]
   Range (min … max):    18.2 ms …  23.6 ms    127 runs

Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=commit
   Time (mean ± σ):      1.551 s ±  0.008 s    [User: 1.401 s, System: 0.147 s]
   Range (min … max):    1.541 s …  1.566 s    10 runs

Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=tree
   Time (mean ± σ):     11.169 s ±  0.046 s    [User: 10.076 s, System: 1.063 s]
   Range (min … max):   11.114 s … 11.245 s    10 runs

Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=blob
   Time (mean ± σ):     67.342 s ±  3.368 s    [User: 20.318 s, System: 7.787 s]
   Range (min … max):   62.836 s … 73.618 s    10 runs

Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=blob:none
   Time (mean ± σ):     13.032 s ±  0.072 s    [User: 11.638 s, System: 1.368 s]
   Range (min … max):   12.960 s … 13.199 s    10 runs

Summary
   git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=tag
    74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=commit
   538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=tree
   627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=blob:none
  3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --filter=object:type=blob
  3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-filter

Interestingly, these results indicate that the computation time now scales with the number of objects of a given type instead of the total number of objects in the packfile. The original mailing-list thread can be found here.

This project was led by Patrick Steinhardt.

Improved performance when generating bundles

Via the git-bundle(1) command, Git can generate an archive of a repository containing a specified set of references and their accompanying reachable objects. This operation is used by GitLab to generate repository backups and also as part of the bundle URI mechanism.
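
For example, a backup bundle containing all references and their objects can be created and checked like this (a minimal illustration; the file name is arbitrary):

$ git bundle create backup.bundle --all
$ git bundle verify backup.bundle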

For large repositories containing millions of references, this operation can take hours or even days. For example, with the main GitLab repository (gitlab-org/gitlab), backup times were around 48 hours. Investigation revealed a performance bottleneck in how Git checked that duplicated references were not included in the bundle: the implementation used nested for loops to iterate over and compare all listed references, leading to O(N^2) time complexity. This scales very poorly as the number of references in a repository increases.

In this release, this issue was addressed by replacing the nested loops with a map data structure, leading to a significant speedup. The following benchmark shows the performance improvement when creating a bundle in a repository containing 100,000 references:

Benchmark 1: bundle (refcount = 100000, revision = master)
  Time (mean ± σ):     14.653 s ±  0.203 s    [User: 13.940 s, System: 0.762 s]
  Range (min … max):   14.237 s … 14.920 s    10 runs

Benchmark 2: bundle (refcount = 100000, revision = HEAD)
  Time (mean ± σ):      2.394 s ±  0.023 s    [User: 1.684 s, System: 0.798 s]
  Range (min … max):    2.364 s …  2.425 s    10 runs

Summary
  bundle (refcount = 100000, revision = HEAD) ran
    6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)

To learn more, check out our blog post How we decreased GitLab repo backup times from 48 hours to 41 minutes. You can also find the original mailing list thread here.

This project was led by Karthik Nayak.

Better bundle URI unbundling

Through the bundle URI mechanism in Git, locations to fetch bundles from can be provided to clients with the goal of speeding up clones and fetches. When a client downloads a bundle, references under refs/heads/* are copied from the bundle into the repository along with their accompanying objects. A bundle might contain additional references outside of refs/heads/*, such as refs/tags/*, which are simply ignored when using a bundle URI on clone.
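
For context, one way a server can provide such locations is through its configuration; a minimal sketch (the bundle ID and URI below are placeholders) might look like:

# Advertise bundle URIs to cloning clients.
$ git config uploadpack.advertiseBundleURIs true
$ git config bundle.version 1
$ git config bundle.mode all
$ git config bundle.base.uri https://example.com/gitlab-base.bundle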

In Git 2.50, this restriction is lifted, and all references matching refs/* contained in the downloaded bundle are copied. Scott Chacon, who contributed this functionality, demonstrates the difference when cloning gitlab-org/gitlab-foss:

$ git-v2.49 clone --bundle-uri=gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git gl-2.49
Cloning into 'gl-2.49'...
remote: Enumerating objects: 1092703, done.
remote: Counting objects: 100% (973405/973405), done.
remote: Compressing objects: 100% (385827/385827), done.
remote: Total 959773 (delta 710976), reused 766809 (delta 554276), pack-reused 0 (from 0)
Receiving objects: 100% (959773/959773), 366.94 MiB | 20.87 MiB/s, done.
Resolving deltas: 100% (710976/710976), completed with 9081 local objects.
Checking objects: 100% (4194304/4194304), done.
Checking connectivity: 959668, done.
Updating files: 100% (59972/59972), done.

$ git-v2.50 clone --bundle-uri=gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git gl-2.50
Cloning into 'gl-2.50'...
remote: Enumerating objects: 65538, done.
remote: Counting objects: 100% (56054/56054), done.
remote: Compressing objects: 100% (28950/28950), done.
remote: Total 43877 (delta 27401), reused 25170 (delta 13546), pack-reused 0 (from 0)
Receiving objects: 100% (43877/43877), 40.42 MiB | 22.27 MiB/s, done.
Resolving deltas: 100% (27401/27401), completed with 8564 local objects.
Updating files: 100% (59972/59972), done.

Comparing these results, we see that Git 2.50 fetches 43,877 objects (40.42 MiB) after the bundle is extracted, whereas Git 2.49 fetches a total of 959,773 objects (366.94 MiB). Git 2.50 fetches roughly 95% fewer objects and 90% less data, which benefits both the client and the server: the server needs to send far less data, and the client needs to download and extract less. In the example provided by Scott, this led to a speedup of 25%.

To learn more, check out the corresponding mailing-list thread.

This patch series was contributed by Scott Chacon.

Read more

This article highlighted just a few of the contributions made by GitLab and the wider Git community to this latest release. You can learn more from the official release announcement of the Git project. Also, check out our previous Git release blog posts to see other past highlights of contributions from GitLab team members.

We want to hear from you

Enjoyed reading this blog post or have questions or feedback? Share your thoughts by creating a new topic in the GitLab community forum.
