Here it goes. First off, this was my original task: I needed to generate a NeXML solution both for expressing CHARSET free text and for expressing the row-segment metadata like the Genbank accession number.
What is a CHARSET? In NEXUS parlance, a CHARSET is simply a named set of characters. It lets the user annotate sets of characters to mean different things, e.g., saying which sequences belong to which genes, or which characters are behavioral vs. morphological. It is one of the few truly universal ways of specifying sets of characters: PAUP, MrBayes, and other programs recognize it and can use it when defining partitions and applying different substitution models to different genes.
How are CHARSETs used in TreeBASE? What are the advantages and disadvantages of CHARSETs? TreeBASE currently parses CHARSETs out of submitted NEXUS files, stores them, and writes them back out when generating a NEXUS serialization. However, our NeXML does not seem to output the equivalent metadata. Consequently, the NEXUS output contains some annotations that are not found in the NeXML output, which is a bummer. On the other hand, CHARSET annotations are really only human-readable, in that there is no machine-readable syntax that uses a controlled vocabulary. For example, a human knows that "CHARSET CO1 = 1-505;" and "CHARSET COI = 1-505;" both state that sequences 1 through 505 come from the gene cytochrome oxidase I, but a computer only knows that "CO1" and "COI" are two different strings. Likewise, "CHARSET ambiguous_regions = 1-34 543-551 601 893-901;" is clear to humans but not to machines. NeXML has an opportunity to be much more explicit, but only if it is supplied data with a controlled vocabulary (which we can't do because we ingest them as free-text NEXUS). Nonetheless, it would be valuable for TreeBASE's CHARSETs to be expressed in NeXML too, even if they carry uninterpretable strings.
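To make the "human-readable only" point concrete, here is a minimal Java sketch that parses the CHARSET statements quoted above. The class name CharsetSketch is hypothetical (it is not part of TreeBASE or the NeXML API), and it only handles plain indices and ranges, not the other range syntaxes NEXUS allows.

```java
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of TreeBASE or the NeXML API.
class CharsetSketch {
    // Matches e.g. "CHARSET CO1 = 1-505;" (the keyword is matched case-insensitively).
    private static final Pattern STATEMENT =
        Pattern.compile("(?i)\\s*CHARSET\\s+(\\S+)\\s*=\\s*(.*?)\\s*;?\\s*");

    final String name;                                        // free-text title
    final SortedSet<Integer> characters = new TreeSet<>();    // 1-based indices

    CharsetSketch(String statement) {
        Matcher m = STATEMENT.matcher(statement);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a CHARSET statement: " + statement);
        }
        name = m.group(1);
        // Each whitespace-separated token is a single index ("601") or a range ("1-34").
        for (String token : m.group(2).split("\\s+")) {
            if (token.isEmpty()) continue;
            String[] bounds = token.split("-");
            int first = Integer.parseInt(bounds[0]);
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1]) : first;
            for (int i = first; i <= last; i++) characters.add(i);
        }
    }

    public static void main(String[] args) {
        CharsetSketch a = new CharsetSketch("CHARSET CO1 = 1-505;");
        CharsetSketch b = new CharsetSketch("CHARSET COI = 1-505;");
        // Same characters, but nothing tells the machine the two titles mean the same gene.
        System.out.println("same characters: " + a.characters.equals(b.characters));
        System.out.println("same title: " + a.name.equals(b.name));
    }
}
```

The parser recovers the indices without trouble; what it cannot recover is that the two titles refer to the same gene, which is exactly the gap a controlled vocabulary would fill.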
Until recently, the NeXML API had no programmatic way to create such sets. Rutger has just added this to the code, and it is simple to use:
Subset charset = matrix.createSubset("
charset.addThing(char1);
charset.addThing(char2); // etc.
These charset objects inherit from Annotatable and we can then attach annotations.
In addition to CHARSETs, TreeBASE also implements a similar annotation called "RowSegments" but this differs in several important ways:
1. CHARSETs apply to all homologous character scorings for all OTUs (i.e. taxa) in a character block or alignment. RowSegments specify a set of characters for one particular OTU, taxon, or row in a matrix. So taxon_a can have a RowSegment annotation for sequences 34 through 42, and taxon_b can have a RowSegment annotation for sequences 39 through 45, whereas a CHARSET annotation has to apply to the same homologous characters in both taxon_a and taxon_b.
2. CHARSETs allow you to specify a scattering of characters (see the ambiguous_regions example above), whereas RowSegments have a single begin and end index -- each can only specify one contiguous stretch of sequence or characters.
3. CHARSETs have no controlled vocabulary. RowSegments have hard-typed fields for basic DarwinCore metadata plus culture numbers and Genbank accession numbers. It is likely that once we implement a MIAPA standard, we will need to soft-type our RowSegment annotations, i.e. express them as subject-predicate-object.
4. The conceptual understanding of a CHARSET is that it refers to an abstract class of characters: a type of gene, a type of morphological character, etc. The conceptual understanding of RowSegments is that these are metadata attached to the specimen(s) that were examined when deriving the characters: e.g. the specimen's culture number, museum collection code, the Genbank accession number for the sequence derived from the specimen, etc.
5. CHARSETs are a formal part of NEXUS and NeXML; RowSegments are not.
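The differences in the list above can be sketched as two small Java classes. Everything here is hypothetical (class names, field names, and the placeholder accession strings); it is not TreeBASE's actual schema, just a way to see that a CHARSET is keyed by column only while a RowSegment is keyed by row plus one contiguous range.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical classes for illustration only; not TreeBASE's schema or the NeXML API.
class MatrixAnnotationSketch {

    /** A titled, possibly scattered set of columns that applies to every row. */
    static class CharSet {
        final String title;                            // free text, e.g. "CO1"
        final Set<Integer> columns = new TreeSet<>();

        CharSet(String title) { this.title = title; }

        // No row argument: a CHARSET covers the same columns for all OTUs.
        boolean covers(int column) { return columns.contains(column); }
    }

    /** One contiguous stretch of a single row, carrying specimen metadata. */
    static class RowSegment {
        final String rowLabel;    // the OTU/taxon row this segment belongs to
        final int begin;          // first column, inclusive
        final int end;            // last column, inclusive
        final String accession;   // e.g. a Genbank accession number

        RowSegment(String rowLabel, int begin, int end, String accession) {
            this.rowLabel = rowLabel;
            this.begin = begin;
            this.end = end;
            this.accession = accession;
        }

        boolean covers(String row, int column) {
            return rowLabel.equals(row) && begin <= column && column <= end;
        }
    }

    /** Finds the accession annotating one cell of the matrix, if any segment covers it. */
    static Optional<String> accessionAt(List<RowSegment> segments, String row, int column) {
        for (RowSegment s : segments) {
            if (s.covers(row, column)) return Optional.of(s.accession);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder accession strings; the ranges are the taxon_a/taxon_b example above.
        List<RowSegment> segments = Arrays.asList(
            new RowSegment("taxon_a", 34, 42, "ACCESSION_A"),
            new RowSegment("taxon_b", 39, 45, "ACCESSION_B"));
        System.out.println(accessionAt(segments, "taxon_a", 40)); // covered by taxon_a's segment
        System.out.println(accessionAt(segments, "taxon_b", 36)); // outside taxon_b's segment
    }
}
```

Column 40 is annotated for taxon_a, while column 36 is not annotated for taxon_b, which is something a CHARSET cannot express because it has no notion of a row.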
If RowSegments are not a part of NEXUS or NeXML, that must cause some problems...
One of our big problems is that we offer the ability to capture some pretty important metadata using RowSegment annotation, but the concept of the "RowSegment" is missing from NEXUS, so we can't export it or ingest it using NEXUS. Mesquite has something that comes close: the NOTES block of a Mesquite-written NEXUS file has special annotations, such as these two:
SUTM  T = 4 N = genBankNumber S = AF284000;
... which means "the entire row in the matrix for taxon number 4 comes from the Genbank accession AF284000." This doesn't work for us because we allow a row to have more than one Genbank accession number, e.g. sequence 1-500 is AF284000 but 501-850 is AF45345.
Alternatively there is also:
AN T = 4 C = 1  AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );
... which means, "the annotation for character 1 of taxon 4 was provided by someone named 'TreeBASE', who gave it the value 'AF284000' with reference to something called 'genBankNumber'". This doesn't work for us because we would want to annotate a whole range of sequences, not just one base. It would be better if Mesquite allowed us to write:
AN T = 4 C = 1-500  AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );
AN T = 4 C = 501-850  AU = TreeBASE TF = ( CM AF45345 ) TF = ( R genBankNumber );
So unless Mesquite expands its NOTES annotation capability, we cannot export NEXUS with our RowSegment annotation using a syntax that Mesquite understands and can then present to the user in a nice graphical way.
But nonetheless, it would be great to figure out how these metadata could be expressed in NeXML, especially if there's a way to embed Darwin Core syntax and vocabularies inside the NeXML.
What is DarwinCore syntax? It is abbreviated DwC. The Wikipedia page is pretty helpful (http://en.wikipedia.org/wiki/Darwin_Core). Basically, it is a set of data standards used in biodiversity research to help keep everyone on the same page. The Term Reference page (http://rs.tdwg.org/dwc/terms/#theterms) is also very useful for understanding what sorts of data it standardizes.
Isn't there already metadata being expressed in NeXML?
Currently, *some* of these metadata are being expressed in our NeXML, but probably in the wrong fashion (IMO). For example, the latitude and longitude are attached to OTU elements. However, the lat/long is an attribute of a specimen, and some of that specimen's characters or sequences were aligned in a matrix, and a derived analysis of the matrix produced a tree with OTUs in it. So the lat/long only very indirectly belongs to the OTUs. It much more directly annotates a set of sequences or characters for a particular row of character data. The issue is how to express this in NeXML.
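One way to picture the alternative is as soft-typed subject-predicate-object annotations whose subject is a row segment rather than an OTU. The Java sketch below is only an illustration under assumptions: the identifiers "rowsegment_1" and "otu_4" and the placeholder values are made up, and it says nothing about how NeXML itself would serialize these. The predicates are Darwin Core term names under the namespace from the Term Reference page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the subject identifiers and values below are made up.
class TripleSketch {
    // Darwin Core terms namespace.
    static final String DWC = "http://rs.tdwg.org/dwc/terms/";

    /** One soft-typed annotation: subject, predicate, object. */
    static class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Returns the annotations whose subject is the given identifier. */
    static List<Triple> about(List<Triple> all, String subject) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : all) {
            if (t.subject.equals(subject)) result.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Triple> annotations = new ArrayList<>();
        // Specimen metadata hangs off the row segment, not the OTU.
        annotations.add(new Triple("rowsegment_1", DWC + "decimalLatitude", "LATITUDE_PLACEHOLDER"));
        annotations.add(new Triple("rowsegment_1", DWC + "decimalLongitude", "LONGITUDE_PLACEHOLDER"));
        annotations.add(new Triple("rowsegment_1", DWC + "catalogNumber", "CATALOG_PLACEHOLDER"));

        for (Triple t : about(annotations, "rowsegment_1")) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
        // The OTU itself carries no specimen metadata in this arrangement.
        System.out.println(about(annotations, "otu_4").size());
    }
}
```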
Can you give me a two-sentence summary, please?
If the ultimate goal is to allow TreeBASE to ingest all the data it needs exclusively via NeXML, then the first step is to fully express all of TreeBASE's data and metadata in TreeBASE's NeXML output. And then the second step is to create a NeXML ingest that knows what to do with this richly-annotated NeXML and where to store it.
I will give you a moment to digest all of that. After sifting through all of this valuable information, I let out a sigh of relief, because I finally had a thorough understanding of what was going on. Our next step was to start putting that into action, which will be my next post (I promise I will do it later tonight!). I just didn't want to overwhelm readers too much!
 
 