This document is also recorded in the storyboard as
https://storyboard.baserock.org/#!/story/62,
but this may be a more comfortable place to comment.
# Deprecating lorried repositories
Sometimes we lorry a repository for the sake of some changes, then later no longer need it,
because the changes were moved into the primary repository,
or they were part of work in progress that was abandoned before completion.
We'd like to be able to:
1. Stop polling its upstream repository
2. Hide the deprecated repository from listings by default,
so new users don't start using it.
It needs to be hidden by both the cgit UI and the gitano `ls`,
so downstream troves also don't mirror deprecated repositories.
3. Potentially have downstream troves also deprecate their repositories.
4. Reclaim any used space.
5. All while still allowing old URLs to work.
## Stopping future polling
This is already possible by taking the lorry file out of
the active set.
## Hiding deprecated repositories
This is already partially possible, since cgit has the ability to hide
repositories from listing, and gitano will hide repositories with
`project.archived` set.
Propagating this archived flag would require lorry controller to be able to
query this config in a more performant way than running `config $repo show
project.archived` for every repository.
This information is stored in the admin ref, so you could trigger some merge
based on whether that has changed, but you most certainly don't want to fetch
it in directly (and can't with the default configuration).
## Reclaiming space while allowing old URLs to work
Without any magic or knowledge about which repositories share space, the best
that can be done to reclaim used space would be to aggressively repack when a
repository is deprecated.
If you *do* know some repositories that are likely to contain shared objects,
such as projects that split out components, or forks, then there are a few
possible approaches:
### git-relink to de-duplicate objects by hardlinks
`git relink` will de-duplicate loose git objects,
but given objects tend to be packed, this may have limited success.
### Sharing object stores
You can use `.git/objects/info/alternates` to have the fork/child-project
repositories share the common history objects with the parent repository.
Doing this post-hoc is more awkward:
1. Create a repository containing all the objects and refs,
potentially by nominating a master and fetching all the refs in
a namespaced manner.
2. Duplicate this repository for each of the fork repositories.
3. In each repository, remove refs that aren't supposed to be there.
4. In the primary repository, `git gc --prune=all`
5. In all the forks, set .git/objects/info/alternates to point to the
primary repository's object store, so that they can look for
objects in the other repository.
6. Perform a `git gc --prune=all` or a `git repack -a -d -l` in the forks
to discard any objects that are stored locally,
but are also available from the alternate object store.
After this, the fork repositories will share all common objects
that came about because the forked branch was merged,
but if the history is removed from the primary repository,
then the forks will stop functioning.
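
Step 5 above can be sketched in Python as follows. This is a minimal illustration, not a tested migration tool: the function name `add_alternate` is made up for this example, and it assumes both arguments are git directories (bare repositories, or the `.git` directory of a non-bare one).

```python
import os


def add_alternate(fork_git_dir, primary_git_dir):
    """Make fork_git_dir borrow objects from primary_git_dir.

    Hypothetical helper: it only edits objects/info/alternates,
    it does not repack or prune anything.
    """
    # The alternates file lists object directories, one path per line.
    primary_objects = os.path.join(os.path.abspath(primary_git_dir), "objects")
    info_dir = os.path.join(fork_git_dir, "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    alternates = os.path.join(info_dir, "alternates")

    # Keep any alternates that are already listed.
    entries = []
    if os.path.exists(alternates):
        with open(alternates) as f:
            entries = [line.rstrip("\n") for line in f if line.strip()]
    if primary_objects not in entries:
        entries.append(primary_objects)

    with open(alternates, "w") as f:
        f.write("".join(entry + "\n" for entry in entries))
    return alternates
```

The repack in step 6 would still need to be run in each fork afterwards, since writing the alternates file does not by itself remove any local objects.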
### Merging forks into the same repository with namespaced branches
The unified repository that we would create as a step in sharing object stores
could be used directly, with some work,
if the refs are fetched into proper git namespaces.
Each logical repository would then be served by performing git operations
on the physical repository, restricted to the refs in that logical
repository's namespace.
Either the git-{{upload,receive}-pack,upload-archive} commands need to be made
natively aware of how to locate the physical repository,
when run in a stub repository,
or you would have a proxy service run before the git service process,
which redirects the operation to the physical repository
by changing the git directory
and setting GIT_NAMESPACE to the name of the logical repository.
This approach has an advantage over the alternates approach: it won't
break everything if you delete refs from the physical repository and repack.
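
As a rough sketch of the proxy idea, the function below works out the command line and environment for serving one logical repository out of a physical one. The function name and the example paths are hypothetical. gitnamespaces(7) documents GIT_NAMESPACE for git-upload-pack and git-receive-pack, so the sketch only allows those two; git-upload-archive is left out because it is not clear that it honours the variable.

```python
import os

# Services that gitnamespaces(7) documents as honouring GIT_NAMESPACE.
ALLOWED_SERVICES = ("git-upload-pack", "git-receive-pack")


def namespaced_invocation(service, physical_git_dir, namespace, base_env=None):
    """Return (argv, env) for running `service` against one namespace.

    Hypothetical helper: how the physical repository and namespace are
    looked up for a given logical repository is outside this sketch.
    """
    if service not in ALLOWED_SERVICES:
        raise ValueError("unsupported service: %r" % (service,))
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_NAMESPACE"] = namespace
    # "git-upload-pack" is run as "git upload-pack <directory>".
    argv = ["git", service[len("git-"):], physical_git_dir]
    return argv, env
```

A proxy would pass the result to something like `subprocess.call(argv, env=env)` with the client's stdio attached, so the client only ever sees the refs inside its namespace.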
#### Handling http[s]:// URLs
git-http-backend's manpage has examples showing how to have all
repositories matching a given path share one physical repository.
This wouldn't work for us as-is, since we need it to be more dynamic,
but it proves the concept.
We'd have a proxy CGI that asks gitano how to find the parent repository,
then munges the paths and sets GIT_NAMESPACE before exec'ing git-http-backend.
Naturally we'd need to run this service rather than using cgit's version.
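
The path munging could look roughly like the sketch below. It is only an illustration: `rewrite_cgi_env` and the `lookup` callback are made-up names, the callback stands in for asking gitano, and the code assumes repository paths end in `.git` and that the backend locates repositories from PATH_INFO.

```python
def rewrite_cgi_env(environ, lookup):
    """Return a copy of a CGI environment redirected to the physical repository.

    `lookup` is a hypothetical callback mapping a logical repository path
    to (physical_path, namespace), or None if no redirection is needed.
    """
    path_info = environ.get("PATH_INFO", "")
    split_at = path_info.find(".git/")
    if split_at < 0:
        return dict(environ)
    end = split_at + len(".git")
    logical = path_info[:end]   # e.g. "/forks/foo.git"
    rest = path_info[end:]      # e.g. "/info/refs"
    found = lookup(logical)
    if found is None:
        return dict(environ)
    physical, namespace = found
    new_env = dict(environ)
    new_env["PATH_INFO"] = physical + rest
    new_env["GIT_NAMESPACE"] = namespace
    return new_env
```

The proxy would then exec git-http-backend with the returned environment, leaving every other CGI variable untouched.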
#### Handling ssh://git@ URLs
Similar redirection could be added to gitano's processing of the commands,
though it may require a rethink of which admin-ref is checked.
#### Handling git:// URLs
The standard git-daemon would either need to be wrapped in `--inetd` mode,
or re-implemented, to have it proxy to the physical repository.
The protocol is fairly simple though: the request is a single pkt-line,
which starts with a 4-character hexadecimal length
(counting the 4 length characters themselves),
followed by the service name (e.g. git-upload-archive), a space,
the path part of the git:// URL and a NUL,
then an optional NUL-terminated host=$HOST:$PORT of what the client
thinks the server's address is (to allow for virtual hosting).
After this the service subcommand should be run from the git directory
with the stdio proxied through.
So it would be feasible to have a program that parses the header,
determines what namespace and repository to use, sets GIT_NAMESPACE,
forks off the git-daemon in `--inetd` mode, writes a new header, then
proxies the stdio of the client into the git-daemon.
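
Parsing and rewriting the header might look like the sketch below. The function names are made up for this example, and it assumes the request layout from git's pack-protocol documentation: a 4-hex-digit length that includes itself, the service name, a space, the path, a NUL, then any NUL-terminated parameters such as `host=`. Everything after the first NUL is carried through untouched so that parameters the sketch doesn't know about survive the rewrite.

```python
def parse_request(data):
    """Split the first pkt-line of a git:// connection.

    Returns (service, path, params), all bytes; params is everything
    after the NUL that ends the path (e.g. b"host=example.com\\0").
    """
    length = int(data[:4].decode("ascii"), 16)
    if length < 4 or len(data) < length:
        raise ValueError("short or malformed pkt-line")
    payload = data[4:length]
    command, _, params = payload.partition(b"\0")
    service, sep, path = command.partition(b" ")
    if not sep:
        raise ValueError("request has no path")
    return service, path, params


def build_request(service, path, params=b""):
    """Build the pkt-line for a request; the length counts its own 4 bytes."""
    payload = service + b" " + path + b"\0" + params
    return b"%04x" % (len(payload) + 4) + payload
```

A proxy would call `parse_request` on the client's first bytes, look up the physical repository and namespace for the path, then send `build_request(service, physical_path, params)` to the daemon it forked before relaying the rest of the stream.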
#### Handling the CGit web UI
For CGit, either the logical repositories need to be hidden,
or CGit needs to be made aware of how to show them,
since there is currently no way to hook this in,
and CGit is not ref namespace aware.