PhyloSoC 2011: Automated submission to TreeBASE: charsets, matrices, and unit tests...Oh My!

Greetings everyone! I apologize that it has been a while since my last post. I wanted to make sure I fully understood what I was talking about before writing it down for all the world to see. A few weeks ago, I was asked to add charset functionality to TreeBASE. What exactly does that mean? Two weeks ago, I was in the same boat you are in now. After attempting to figure it out on my own, I ended up on a wild goose chase and gained virtually no further understanding of my actual project. So I decided that it couldn't hurt to ask for more details. This post summarizes the information I obtained from Bill and Rutger over the past week about CharSets. Much of it comes directly from the e-mails that were sent back and forth, for simplicity's sake (and because Rutger and Bill worded it very well).

Here it goes. First off, this was my original task: I needed to generate a NeXML solution both for expressing CHARSET free text and for expressing the row-segment metadata like the Genbank accession number.

What is a CHARSET? In NEXUS parlance, a CHARSET is simply a named set of characters. This allows the user to annotate various sets of characters to mean different things, e.g., saying which sequence positions belong to which genes, or which characters are behavioral vs. morphological. It is one of the few really universal ways of specifying sets of characters. It is recognized by PAUP, MrBayes, etc., which can use it when defining partitions and applying different substitution models to different genes.

How are CHARSETs used in TreeBASE? What are the advantages and disadvantages of CHARSETs? TreeBASE currently parses CHARSETs out of submitted NEXUS files, stores them, and then generates them when outputting a NEXUS serialization. However, our NeXML does not seem to output the equivalent metadata. Consequently, the NEXUS output contains some annotations that are not found in the NeXML output, which is a bummer. On the other hand, CHARSET annotations are only really human-readable, in that there is no machine-readable syntax that uses a controlled vocabulary (i.e., a human knows that "CHARSET CO1 = 1-505;" and "CHARSET COI = 1-505;" both state that positions 1 through 505 come from the gene cytochrome oxidase I, but a computer only knows that "CO1" and "COI" are two different strings). Likewise, "CHARSET ambiguous_regions = 1-34 543-551 601 893-901;" is clear to humans but not to machines. NeXML has an opportunity to be much more explicit, but only if it is supplied with data that use a controlled vocabulary (which we can't do, because we ingest CHARSETs from free-text NEXUS). Nonetheless, it would be valuable for TreeBASE's CHARSETs to be expressed in NeXML too, even if their names are uninterpretable strings.
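To make the "clear to humans but not to machines" point concrete, here is a minimal sketch of what a program can actually recover from a CHARSET statement: a name and a list of index ranges, nothing more. The class CharSetSketch and its methods are my own hypothetical names, not TreeBASE or NeXML API code, and the parser assumes only the simple "name = ranges;" form used in the examples above (it ignores other NEXUS features such as quoted names).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of TreeBASE or the NeXML API.
class CharSetSketch {
    // Matches "CHARSET <name> = <ranges>;" with a case-insensitive keyword.
    private static final Pattern STATEMENT =
            Pattern.compile("^\\s*CHARSET\\s+([^\\s=]+)\\s*=\\s*([^;]*);\\s*$",
                    Pattern.CASE_INSENSITIVE);

    final String name;
    final List<int[]> ranges = new ArrayList<>(); // {start, end}, 1-based, inclusive

    private CharSetSketch(String name) {
        this.name = name;
    }

    // Parses one simple CHARSET statement into a name plus index ranges.
    static CharSetSketch parse(String statement) {
        Matcher m = STATEMENT.matcher(statement);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a simple CHARSET statement: " + statement);
        }
        CharSetSketch set = new CharSetSketch(m.group(1));
        for (String token : m.group(2).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty character list
            }
            String[] ends = token.split("-");
            int start = Integer.parseInt(ends[0]);
            int end = ends.length > 1 ? Integer.parseInt(ends[1]) : start;
            set.ranges.add(new int[] {start, end});
        }
        return set;
    }

    // True if the given 1-based character index falls in any range.
    boolean contains(int position) {
        for (int[] r : ranges) {
            if (position >= r[0] && position <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Total number of character indices covered (assumes ranges do not overlap).
    int size() {
        int total = 0;
        for (int[] r : ranges) {
            total += r[1] - r[0] + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        CharSetSketch a = parse("CHARSET CO1 = 1-505;");
        CharSetSketch b = parse("CHARSET COI = 1-505;");
        System.out.println(a.name.equals(b.name)); // false: just two different strings
        System.out.println(a.size() == b.size());  // true: same 505 characters
    }
}
```

The ranges come out identical, but nothing in the statement tells the program that "CO1" and "COI" name the same gene. That is the gap a controlled vocabulary would fill.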

Until recently, the NeXML API did not have programmatic access for creating such sets. Rutger has just added it to the code. It is simple to use:

Subset charset = matrix.createSubset("ambiguous_regions");
charset.addThing(char1);
charset.addThing(char2); // etc.

These charset objects inherit from Annotatable, so we can then attach annotations to them.

What are the differences between CHARSETS and RowSegments?

In addition to CHARSETs, TreeBASE also implements a similar annotation called "RowSegments" but this differs in several important ways:

1. A CHARSET applies to all homologous character scorings for all OTUs (i.e., taxa) in a character block or alignment. A RowSegment specifies a set of characters for one particular OTU (a taxon, or a row in a matrix). So taxon_a can have a RowSegment annotation for positions 34 through 42, and taxon_b can have a RowSegment annotation for positions 39 through 45, whereas a CHARSET annotation has to apply to the same homologous characters in both taxon_a and taxon_b.

2. CHARSETS allow you to specify a scattering of characters (see the ambiguous_regions example above), whereas RowSegments have a single begin and end index -- each can only specify one contiguous stretch of sequence or characters.

3. CHARSETS have no controlled vocabulary. RowSegments have hard-typed fields for basic DarwinCore metadata plus culture numbers and Genbank accession numbers. It is likely that once we implement a MIAPA standard, we will need to soft-type our RowSegment annotations -- i.e. subject-predicate-object.

4. The conceptual understanding of a CHARSET is that it refers to an abstract class of characters: a type of gene, a type of morphological character, etc. The conceptual understanding of RowSegments is that these are metadata attached to the specimen(s) that were examined when deriving the characters: e.g. the specimen's culture number, museum collection code, the Genbank accession number for the sequence derived from the specimen, etc.

5. CHARSETS are a formal part of NEXUS and NeXML; RowSegments are not.
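Points 1 and 2 are easier to see in code. The sketch below is a rough model of my own: ColumnSet and RowStretch are hypothetical stand-ins for a CHARSET and a RowSegment, not TreeBASE's actual classes. A ColumnSet may hold scattered ranges and covers the same columns in every row, while a RowStretch holds one begin/end pair and belongs to a single row.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical classes for illustration; these are not TreeBASE's model classes.

// Plays the role of a CHARSET: possibly scattered columns, shared by all rows.
class ColumnSet {
    final String title;
    final List<int[]> ranges; // {start, end}, 1-based, inclusive

    ColumnSet(String title, List<int[]> ranges) {
        this.title = title;
        this.ranges = ranges;
    }

    // The row label is ignored: a CHARSET covers the same columns in every row.
    boolean covers(String rowLabel, int position) {
        for (int[] r : ranges) {
            if (position >= r[0] && position <= r[1]) {
                return true;
            }
        }
        return false;
    }
}

// Plays the role of a RowSegment: one contiguous stretch in one row.
class RowStretch {
    final String rowLabel;
    final int begin;
    final int end;

    RowStretch(String rowLabel, int begin, int end) {
        this.rowLabel = rowLabel;
        this.begin = begin;
        this.end = end;
    }

    // Only matches positions inside the stretch, and only in its own row.
    boolean covers(String rowLabel, int position) {
        return this.rowLabel.equals(rowLabel) && position >= begin && position <= end;
    }
}

class CharsetVsRowSegmentDemo {
    public static void main(String[] args) {
        ColumnSet ambiguous = new ColumnSet("ambiguous_regions", Arrays.asList(
                new int[] {1, 34}, new int[] {543, 551},
                new int[] {601, 601}, new int[] {893, 901}));
        RowStretch a = new RowStretch("taxon_a", 34, 42);
        RowStretch b = new RowStretch("taxon_b", 39, 45);

        System.out.println(ambiguous.covers("taxon_a", 601)); // true
        System.out.println(ambiguous.covers("taxon_b", 601)); // true
        System.out.println(a.covers("taxon_a", 35));          // true
        System.out.println(b.covers("taxon_b", 35));          // false
    }
}
```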

If RowSegments are not a part of NEXUS or NeXML, that must cause some problems...

One of our big problems is that we offer the ability to capture some pretty important metadata using RowSegment annotation, but the concept of the "RowSegment" is missing from NEXUS -- so we can't export it or ingest it using NEXUS. Mesquite has something that comes close: the NOTES block of a Mesquite-written NEXUS file has special annotations, such as this one:

SUTM T = 4 N = genBankNumber S = AF284000;

... which means "the entire row in the matrix for taxon number 4 comes from the Genbank accession AF284000." This doesn't work for us because we allow a row to have more than one Genbank accession number -- e.g. positions 1-500 come from AF284000 but positions 501-850 come from AF45345.

Alternatively, there is also:

AN T = 4 C = 1 AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );

... which means, "the annotation for character 1 of taxon 4 was provided by someone named 'TreeBASE', who gave it the value 'AF284000' with reference to something called 'genBankNumber' ". This doesn't work for us because we would want to annotate a whole range of positions, not just one base. That is, it would be better if Mesquite allowed us to write:

AN T = 4 C = 1-500 AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );

AN T = 4 C = 501-850 AU = TreeBASE TF = ( CM AF45345 ) TF = ( R genBankNumber );

So unless Mesquite expands its NOTES annotation capability, we cannot export NEXUS with our RowSegment annotation using a syntax that Mesquite understands and then present it to the user in a nice graphical way.
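For what it's worth, here is a small sketch of the case that neither Mesquite annotation can express: one row whose stretches come from different Genbank accessions. RowAccessionSketch is a hypothetical class of my own, not TreeBASE code. It simply stores begin/end/accession triples for a single row and answers the question "which accession(s) does this position come from?"

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from TreeBASE's code.
class RowAccessionSketch {
    // One contiguous stretch of a row, tagged with an accession number.
    private static final class Segment {
        final int begin;
        final int end;
        final String accession;

        Segment(int begin, int end, String accession) {
            this.begin = begin;
            this.end = end;
            this.accession = accession;
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    // Registers positions begin..end (1-based, inclusive) as coming from the accession.
    RowAccessionSketch add(int begin, int end, String accession) {
        if (begin < 1 || end < begin) {
            throw new IllegalArgumentException("Bad range: " + begin + "-" + end);
        }
        segments.add(new Segment(begin, end, accession));
        return this;
    }

    // Returns every accession whose stretch contains the position (possibly none).
    List<String> accessionsAt(int position) {
        List<String> hits = new ArrayList<>();
        for (Segment s : segments) {
            if (position >= s.begin && position <= s.end) {
                hits.add(s.accession);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        RowAccessionSketch taxon4 = new RowAccessionSketch()
                .add(1, 500, "AF284000")
                .add(501, 850, "AF45345");
        System.out.println(taxon4.accessionsAt(250)); // [AF284000]
        System.out.println(taxon4.accessionsAt(600)); // [AF45345]
    }
}
```

A single per-row value (the SUTM style) would have to pick one of the two accessions, and a single per-character value (the AN style) would need one annotation for every position.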

Nonetheless, it would be great to figure out how these metadata could be expressed in NeXML, especially if there's a way to embed Darwin Core syntax and vocabularies inside the NeXML.

What is DarwinCore syntax? It is abbreviated DwC. The Wikipedia page is pretty helpful (http://en.wikipedia.org/wiki/Darwin_Core). Basically, it is a set of data standards used in biodiversity research to help keep everyone on the same page. The Term Reference page (http://rs.tdwg.org/dwc/terms/#theterms) is also very useful for understanding what sorts of data it standardizes.

Isn't there already metadata being expressed in NeXML?
Currently, *some* of these metadata are being expressed in our NeXML, but probably in the wrong fashion (IMO). For example, the latitude and longitude are attached to OTU elements. However, the lat/long is an attribute of a specimen, and some of that specimen's characters or sequences were aligned in a matrix, and a derived analysis of the matrix produced a tree with OTUs in it. So the lat/long only very indirectly belongs to the OTUs. It much more directly annotates a set of sequences or characters for a particular row of character data. The issue is how to express this in NeXML.

Can you give me a two-sentence summary, please?

If the ultimate goal is to allow TreeBASE to ingest all the data it needs exclusively via NeXML, then the first step is to fully express all of TreeBASE's data and metadata in TreeBASE's NeXML output. And then the second step is to create a NeXML ingest that knows what to do and where to store this richly-annotated NeXML.

I will give you a moment to digest all of that. After sifting through all of this valuable information, I let out a sigh of relief, happy to finally have a thorough understanding of what was going on. Our next step was to start putting that into action, which will be the subject of my next post (I promise I will write it later tonight!). I just didn't want to overwhelm readers too much!

Tuesday, June 14, 2011