The objective is to create smaller module distribution out of a subset of bioperl-live modules, together with its related tests and scripts, as well as the development history of those files. For this purpose, we start from a bioperl-live repository, and filter out its history of unwanted files. This allows us to keep the development history of those modules.
The plan goes like this:
Aim: files-to-keep
file with one file path per line
When deciding what files should go into a module distribution, the following steps are helpful:
Start from the modules that there’s no doubt will be part of the distribution such as the main module that will give the module its name.
Check the dependencies and reverse dependencies of those modules.
A useful command to do it is grep -hPo 'Bio::[:a-z0-9]*'
module-to-check | sort -u
. Decide if they should be part of the
new distribution.
Repeat step above until there’s no new modules added.
Once there’s a list of modules for the new distribution, find the scripts that make use of those modules and decide if they should be part of that distribution too. Finally, find the tests units for those modules as well as any data files required for those tests.
Since we are keeping the development history, one needs to select all
related files across history. This includes files that have been
renamed as well as files that have been removed or merged together.
Searching the whole git history with the --stat
option is helpful as
it gives hints on renames and removed files which --follow
will
miss.
git log --stat
It is also helpful to scan the list of all files that ever existed in the history of a repository:
git log --all --pretty=format: --name-only --diff-filter=A | sort -u
List the path for the selected files, relative to the root of the
repository, one per line. Save the file as files-to-keep
.
Aim: git repository with the history of selected files
This step makes use of git filter-branch
which rewrites the history
of a repository by removing unwanted files and any empty commit. For
this purpose, a list of files to be removed is required. This list is
generated from the list of files to be kept that was previously
generated.
Start from a new clone of bioperl-live, name it after the new distribution, and remove the original remote to avoid accidentally push the modified history:
git clone https://github.com/bioperl/bioperl-live.git Bio-New-Distribution
cd Bio-New-Distribution
git remote remove origin
Generate the list of files to be removed using the files-to-keep
file previously created:
git log --all --pretty=format: --name-only --diff-filter=A | sort -u > all-files-ever
awk 'NR == FNR {a[$0]; next} !($0 in a)' files-to-keep all-files-ever > files-to-rm
Leave the file files-to-rm
at the root of the repository and call
(this command will take a very long time):
git filter-branch \
--prune-empty \
--index-filter 'xargs -a ../../files-to-rm git rm --cached --ignore-unmatch --quiet -- ' \
-- --all
Note the path ../../files-to-rm
. While git filter-branch
is
called from the root of the repository, the --index-filter
command
will be called from .git-rewrite/hash/
.
A few things remain to be done. First, we need to remove the empty merge commits that were left behind:
git filter-branch \
--force \
--prune-empty \
--parent-filter 'sed "s/-p //g" | xargs -r git show-branch --independent | sed "s/\</-p /g"' \
-- --all
Second, the old tags don’t make any sense, they point to objects that no longer exist and releases that were not done from this new repository. Remove them all:
git tag | xargs git tag -d
Finally, clean up the repository from references to old objects:
git gc --aggressive --prune=all
The only unwanted commit left behind will be a root commit. However,
removing the root commit requires a rebase which changes the committer
information (see author and committer of each commit with git log
--format=raw
).
The files to be part of the new distribution may already be in a separate repositories. For example, bioperl-live and bioperl-run each has half of the original bioperl repository. If the new distribution takes files from these two repositories, one needs to merge the two histories together.
First filter out of the history of both repositories, so you have two
directories, for example clean-live
and clean-run
. The two
repositories may have different organisation, for example,
bioperl-live has Bio
at the root of the repository while bioperl-run
has it in the lib
directory. It is easier to restructure each of
them in preparation for the merge. See the section about
restructuring the files for instructions.
cd clean-live
git mv ..
git commmit -m "maint: restructure for merge with ..."
cd clean-run
git mv ..
git commmit -m "maint: restructure for merge with ..."
The actual merge is done like so:
git remote add clean-run ../clean-run
git fetch clean-run
git merge --no-ff --no-commit clean-run/master
git commit -m "Merge Bio-... (bioperl-live) and Bio-... (bioperl-run)"
Aim: reach a state where releases can be made with dzil release
Organise the repository like this:
inc/
- perl modules to be installedt/
- test units and required databin/
- programs to be installed (remove the .pl
file extension)inc/
- modules required for build and test but not installedIn addition, create a new Changes
file:
Summary of important user-visible changes for THIS-DISTRIBUTION
---------------------------------------------------------------
{{$NEXT}}
* First release after split from bioperl-live.
And an appropriate dist.ini
file. See
Bio-ASN1-EntrezGene
and
Bio-EUtilities
as examples.
To make releases easier, there is a distzilla pluginbundle for bioperl. It will perform a set of pre-defined tests and define metadata in a standard form. It also comes with a Pod::Weaver configuration which will format the POD in a uniform manner across all bioperl modules. It’s best to do things in separate, one commit for each:
The top of a module will look something like this:
# ABSTRACT: Parses output from the PAML programs codeml, baseml, basemlg, codemlsites and yn00
# AUTHOR: Jason Stajich <jason@bioperl.org>
# AUTHOR: Aaron Mackey <amackey@virginia.edu>
# OWNER: Jason Stajich <jason@bioperl.org>
# OWNER: Aaron Mackey <amackey@virginia.edu>
# LICENSE: Perl_5
Use this checklist:
ABSTRACT
Software::License::*
module.
Check existing module for any specific license.BioPerl module for ...
# Let the code begin...
line=head2
with =method
, =attr
, or =internal
as
appropriate.PODNAME
with the script name too.When this is done:
git commit -m "maint: fix POD to make use of PodWeaver"
This stuff should be at the top of each module, before the POD blocks.
package Foo;
declaration to the very first lineVERSION
variable.utf8
, strict
, and warnings
.
If utf8
is set, should be the first.use
statements next. It’s handy to have them at the
top.When it is done:
git commmit -m "maint: move 'package' and 'use' statements to top"
The bioperl dist zilla plugin will enable a bunch of new tests. Most of the existing bioperl code will fail and will require changes. In addition, there is a lot of Perl best practices that are not followed. Some of the things to do are:
use warnings
use utf8
. This should be the third line of a module. The first
is the package declaration, the second is empty for dzil’s
[PkgVersion]
, the third declares file encoding.use parent
instead of use base
and manually modigying @ISA
our
instead of use vars
#!/usr/bin/env perl
Aim: new releases
First remove the files that are part of the new distribution from bioperl-live. It is important for the release to happen soon after because some people will be installing bioperl-live from the git repository. As such, it’s important that the removed modules are available for installation on CPAN from some distribution.
When removing files, be careful with data files for tests. Those files are sometimes shared between multiple tests so some may still be required.
Set both bioperl-live and the new distribution for the same version number. Add a note on the bioperl distribution about the removal of this modules.
Ask on the mailing list for a new bioperl release. Make the new distribution.