This folder contains "registration files" for UCSC genomes.

These files are used by the getChromInfoFromUCSC() function defined in
the GenomeInfoDb package.

There must be one file per genome.

Each file must be an R script (.R extension) and its name must be the name
of the genome (e.g. 'danRer11.R').

The script must work as a standalone script, so it should explicitly
load any package it needs (e.g. with 'library(IRanges)' and/or
'library(GenomeInfoDb)').

At a minimum, the script must define the following 4 variables:

  o GENOME:              Single non-empty string.

  o ORGANISM:            Single non-empty string.

  o ASSEMBLED_MOLECULES: Character vector with no NAs, no empty strings,
                         and no duplicates.

  o CIRC_SEQS:           Character vector (subset of ASSEMBLED_MOLECULES).
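
For illustration, here is a minimal sketch of a registration file that
defines only the 4 mandatory variables. The genome 'fooBar1', its organism,
and its chromosome names are made up for this example; they do not
correspond to a real UCSC genome.

```r
## Hypothetical content of a file named 'fooBar1.R' (made-up genome).

## Single non-empty string. Must match the file name without '.R'.
GENOME <- "fooBar1"

## Single non-empty string.
ORGANISM <- "Foo bar"

## No NAs, no empty strings, no duplicates.
ASSEMBLED_MOLECULES <- paste0("chr", c(1:4, "X", "M"))

## Must be a subset of ASSEMBLED_MOLECULES.
CIRC_SEQS <- "chrM"
```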

It can also define any of the following variables:

  o FETCH_ORDERED_CHROM_SIZES:
                         Function with 1 argument. Must return a 2-column
                         data frame with columns "chrom" (character)
                         and "size" (integer). Rows must be in "canonical"
                         chromosome order. In particular, the assembled
                         molecules must come first.
                         Note that defining this function is not needed if
                         all the sequences in the genome are assembled
                         molecules. See for example the registration files
                         for Worm (ce*.R files).

  o NCBI_LINKER:         Named list.

    Valid NCBI_LINKER components:
    - assembly_accession: single non-empty string.
    - AssemblyUnits: character vector.
    - special_mappings: named character vector.
    - unmapped_seqs: named list of character vectors.
    - drop_unmapped: TRUE or FALSE.

  o ENSEMBL_LINKER:      Single non-empty string (can only be "ucscToEnsembl"
                         or "chromAlias" at the moment).
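
The optional variables can be sketched as follows. This is only an
illustration under stated assumptions, not how any particular file in this
folder does it: the argument name 'goldenPath.url' and its default are
assumptions, the helper .fake_chrom_sizes() is a made-up stand-in for the
step that would download the chromosome sizes from UCSC, all sequence names
and sizes are invented, and the assembly accession is a placeholder.

```r
ASSEMBLED_MOLECULES <- paste0("chr", c(1:3, "M"))

## Made-up stand-in for the download step. A real script would fetch this
## table from UCSC instead of hard-coding it.
.fake_chrom_sizes <- function(goldenPath.url)
{
    data.frame(
        chrom=c("chrUn_2", "chr2", "chrM", "chr1", "chrUn_1", "chr3"),
        size=c(500L, 2000L, 160L, 3000L, 800L, 1000L),
        stringsAsFactors=FALSE
    )
}

## Function with 1 argument that returns the "chrom"/"size" data frame
## with the assembled molecules first.
FETCH_ORDERED_CHROM_SIZES <-
    function(goldenPath.url=getOption("UCSC.goldenPath.url"))
{
    chrom_sizes <- .fake_chrom_sizes(goldenPath.url)
    stopifnot(all(ASSEMBLED_MOLECULES %in% chrom_sizes$chrom))
    ## Assembled molecules first, in the order of ASSEMBLED_MOLECULES...
    idx1 <- match(ASSEMBLED_MOLECULES, chrom_sizes$chrom)
    ## ...then the other sequences, here simply sorted by name.
    other <- setdiff(chrom_sizes$chrom, ASSEMBLED_MOLECULES)
    idx2 <- match(sort(other), chrom_sizes$chrom)
    ans <- chrom_sizes[c(idx1, idx2), ]
    rownames(ans) <- NULL
    ans
}

## Named list. The accession below is a placeholder, not a real one.
NCBI_LINKER <- list(
    assembly_accession="GCF_000000000.0",
    drop_unmapped=FALSE
)

## Single non-empty string: "ucscToEnsembl" or "chromAlias".
ENSEMBL_LINKER <- "ucscToEnsembl"
```

Note that the helper's name is in lower case and starts with a dot, so it
cannot be confused with the variables recognized by getChromInfoFromUCSC().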

All the above variables are recognized by getChromInfoFromUCSC(). They
must be defined at the top-level of the script and their names must be
in UPPER CASE.

The script can define its own top-level variables and functions, but, by
convention, their names should be in lower case and start with a dot.

See the files in this folder for numerous examples.

Here is how to perform some basic testing of a new registration file:

  1. In a **fresh** R session, use source() to source the new file.
     This executes the code in the script (alternatively, you can
     copy-paste the content of the script into your session).
     Note that a registration file is required to be a **standalone** R
     script. This means that we should be able to source it in a fresh R
     session, and it should just work (provided that all the required
     packages are installed). "Just work" here means that all the top-level
     variables defined in the script get defined in your session (you
     should see them with ls(), or with ls(all.names=TRUE) for the ones
     whose names start with a dot).

  2. Check the values of the 4 mandatory variables: GENOME, ORGANISM,
     ASSEMBLED_MOLECULES, and CIRC_SEQS.

  3. If the file defines FETCH_ORDERED_CHROM_SIZES(), call it and make sure
     it behaves as expected. Does the returned data frame have its rows in
     the expected order?

  4. Install GenomeInfoDb (with the new registration file in it), start R,
     load the package, and try to call registered_UCSC_genomes(). The
     returned data frame should now have an entry for the new genome.
     Make sure that all the fields in the new entry look as expected.
     Then call getChromInfoFromUCSC() on the new genome. Do the "assembled"
     and "circular" columns look as expected? Try with and without
     setting the 'assembled.molecules.only' argument to TRUE.

  5. If you've defined NCBI_LINKER in the registration file: try to call
     getChromInfoFromUCSC() with 'map.NCBI' set to TRUE.

  6. If you've defined ENSEMBL_LINKER in the registration file: try to call
     getChromInfoFromUCSC() with 'add.ensembl.col' set to TRUE.
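
The steps above can be partly automated. The sketch below is only an
illustration: .check_registration_file() and .try_installed_genome() are
hypothetical helpers that are not part of GenomeInfoDb, and 'fooBar1' is a
made-up genome. The first helper covers steps 1 and 2 by sourcing the file
into its own environment rather than into a fresh session. The second one
covers steps 4 to 6. It assumes that the data frame returned by
registered_UCSC_genomes() has a "genome" column, and it needs GenomeInfoDb
installed with the new file in it plus internet access, so it is defined
here but not called.

```r
.is_single_string <- function(x)
    is.character(x) && length(x) == 1L && !is.na(x) && nzchar(x)

## Steps 1 and 2: source the file and check the 4 mandatory variables.
.check_registration_file <- function(filepath)
{
    env <- new.env(parent=globalenv())
    source(filepath, local=env)
    mandatory <- c("GENOME", "ORGANISM", "ASSEMBLED_MOLECULES", "CIRC_SEQS")
    missing_vars <- setdiff(mandatory, ls(env, all.names=TRUE))
    if (length(missing_vars) != 0L)
        stop("missing variable(s): ", paste(missing_vars, collapse=", "))
    stopifnot(
        .is_single_string(env$GENOME),
        .is_single_string(env$ORGANISM),
        is.character(env$ASSEMBLED_MOLECULES),
        !anyNA(env$ASSEMBLED_MOLECULES),
        all(nzchar(env$ASSEMBLED_MOLECULES)),
        anyDuplicated(env$ASSEMBLED_MOLECULES) == 0L,
        is.character(env$CIRC_SEQS),
        all(env$CIRC_SEQS %in% env$ASSEMBLED_MOLECULES)
    )
    ## The file name must be the name of the genome followed by '.R'.
    stopifnot(identical(basename(filepath), paste0(env$GENOME, ".R")))
    invisible(env)
}

## Steps 4 to 6: defined but not run here (see the assumptions above).
.try_installed_genome <- function(genome)
{
    reg <- GenomeInfoDb::registered_UCSC_genomes()
    stopifnot(genome %in% reg$genome)
    print(GenomeInfoDb::getChromInfoFromUCSC(genome))
    print(GenomeInfoDb::getChromInfoFromUCSC(genome,
                                             assembled.molecules.only=TRUE))
    ## Only meaningful if the file defines NCBI_LINKER:
    print(GenomeInfoDb::getChromInfoFromUCSC(genome, map.NCBI=TRUE))
    ## Only meaningful if the file defines ENSEMBL_LINKER:
    print(GenomeInfoDb::getChromInfoFromUCSC(genome, add.ensembl.col=TRUE))
}

## Usage on a made-up registration file written to a temporary folder.
filepath <- file.path(tempdir(), "fooBar1.R")
writeLines(c(
    'GENOME <- "fooBar1"',
    'ORGANISM <- "Foo bar"',
    'ASSEMBLED_MOLECULES <- paste0("chr", c(1:4, "X", "M"))',
    'CIRC_SEQS <- "chrM"'
), filepath)
env <- .check_registration_file(filepath)
ls(env)
```

Step 3 is left to a manual call of FETCH_ORDERED_CHROM_SIZES(), since what
counts as the expected row order depends on the genome.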