Wednesday, June 29, 2011

express charsets : CHECK! what's next?

Happy Wednesday everybody!

For those receiving my progress report, you know that we have successfully added charset expression in NeXML for TreeBASE. It feels extremely awesome to know I have actually contributed something important to an active and ongoing project.

The most important thing I have learned from this portion of the project is test-driven development: you write a failing test first, then fix the code until it satisfies all the tests that you wrote for it. Then you know you have a working program. I never learned that in my classes, and it is very cool to learn an important technique that I will continue to use in the future.

Summary of what we did:

  • We wrote two JUnit tests. testNexmlMatrixConverter() creates a NexmlMatrix within a Nexml document and checks that it matches up with the appropriate matrix fetched from TreeBASE. testNexmlMatrixCharSets() verifies that the coordinates and names of the character sets in the NexmlMatrix actually match those of the TreeBASE matrix. It also checks that the study actually has charsets associated with it. This is important to prevent NullPointerExceptions.
  • After establishing a well thought-out and commented unit test for implementing charset expression, I took the logic from the unit test and reworked it to fit in with the actual methods of the NexmlMatrixConverter() class. The code for this class can be found here: It was pretty exciting when I got everything to pass all of the tests we created for it and I was finally able to commit the code. Unfortunately, I included an unused import and it threw off the entire build of the project, causing problems for a lot of people. So...note to self: don't do that again. Anyway, after that was all sorted out, it is safe to say that TreeBASE now supports charset expression for NeXML.
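To give a flavor of what a test like testNexmlMatrixCharSets() checks, here is a minimal standalone sketch. It is not the actual TreeBASE test: the CharSetInfo class, the compare() helper, and the sample data are all made up for illustration, and plain Java checks stand in for JUnit assertions so the snippet runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a character set: a name plus the column indices it covers.
class CharSetInfo {
    final String name;
    final List<Integer> columns;

    CharSetInfo(String name, Integer... columns) {
        this.name = name;
        this.columns = Arrays.asList(columns);
    }
}

class CharSetCheckSketch {
    // Returns a list of mismatch descriptions; an empty list means the two sides agree.
    static List<String> compare(List<CharSetInfo> treebaseSets, List<CharSetInfo> nexmlSets) {
        List<String> problems = new ArrayList<>();
        // Guard first: a study without charsets is skipped instead of
        // triggering a NullPointerException further down.
        if (treebaseSets == null || treebaseSets.isEmpty()) {
            return problems;
        }
        if (nexmlSets == null || nexmlSets.size() != treebaseSets.size()) {
            problems.add("different number of character sets");
            return problems;
        }
        for (int i = 0; i < treebaseSets.size(); i++) {
            CharSetInfo expected = treebaseSets.get(i);
            CharSetInfo actual = nexmlSets.get(i);
            if (!expected.name.equals(actual.name)) {
                problems.add("name mismatch: " + expected.name + " vs " + actual.name);
            }
            if (!expected.columns.equals(actual.columns)) {
                problems.add("coordinate mismatch in " + expected.name);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<CharSetInfo> treebase = Arrays.asList(new CharSetInfo("CO1", 1, 2, 3));
        List<CharSetInfo> nexml = Arrays.asList(new CharSetInfo("CO1", 1, 2, 3));
        System.out.println(compare(treebase, nexml).isEmpty() ? "charsets match" : "charsets differ");
    }
}
```

In the test-driven spirit, a check like this would be written first and would fail until the converter actually produces the character sets.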
Now what are the next steps?  I am going to be working on including row-segment metadata in the NeXML output. Right now, I am working to express the metadata with Darwin Core terms for the specimen-related information. The sequence information is going to have to be expressed differently to be useful to anybody though. This involves editing the populateXmlMatrix() methods.  However, these row-segment annotations possibly involve (in Bill Piel's words) partitioning a row of character states into every possible segment and then assembling sets of row segments as needed. Then, for each set of partitions, you can express metadata as described above.  I have completed the editing of the populateXmlMatrix() methods, but I need to talk to Rutger now about what we are doing in regards to implementing them.
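Here is one possible reading of the partitioning idea, as a rough sketch rather than anything that exists in TreeBASE. It assumes each annotation is an inclusive, 1-based {start, end} range that lies within the row, and it cuts the row at every annotation boundary so that each resulting piece is covered by a fixed set of annotations. The class name, method name, and the overlapping sample ranges are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class RowPartitionSketch {
    // Splits columns 1..rowLength at every annotation boundary.
    // Returns inclusive {start, end} pairs that tile the whole row without overlap.
    static List<int[]> partition(int rowLength, List<int[]> annotations) {
        // Collect every column where a new piece must begin.
        TreeSet<Integer> starts = new TreeSet<>();
        starts.add(1);
        for (int[] a : annotations) {
            starts.add(a[0]);            // a piece begins where an annotation begins
            if (a[1] < rowLength) {
                starts.add(a[1] + 1);    // and right after an annotation ends
            }
        }
        // Each piece runs from one start to just before the next start.
        List<int[]> pieces = new ArrayList<>();
        Integer prev = null;
        for (Integer s : starts) {
            if (prev != null) {
                pieces.add(new int[] {prev, s - 1});
            }
            prev = s;
        }
        pieces.add(new int[] {prev, rowLength});
        return pieces;
    }

    public static void main(String[] args) {
        // Two made-up annotations that overlap on columns 401-500.
        List<int[]> annotations = Arrays.asList(new int[] {1, 500}, new int[] {401, 850});
        for (int[] p : partition(850, annotations)) {
            System.out.println(p[0] + "-" + p[1]);
        }
        // prints 1-400, 401-500, 501-850 (one per line)
    }
}
```

Sets of these pieces can then be assembled as needed, and metadata attached to each set.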

Alright, that's all folks. Have a good night!

Friday, June 24, 2011


All right, I need to get back down to business.  Where was it I left off again? Oh yes, character sets. My last post about these dealt with the purpose and function of char sets but I don't think I actually explained what I was doing with them.

I have been working in the NexmlMatrixConverterTest() class.  Here I was creating a unit test, testNexmlCharSets(), that verifies that the title and coordinates of a TreeBASE character set match those of the corresponding NeXML character set. It also checks to see if the study being tested even has any character sets associated with it. The unit test was modeled off of the other one in this class, created by Rutger.  I was feeling a little lost at first, so he wrote out a unit test that was very well-commented so I could figure out what was going on. You can check out the most recent code in this class at:

After I wrote my portion, Rutger put some finishing touches on it and I am now trying to get the actual NexmlMatrixConverter() class working by applying the logic from the unit tests to the actual code. Once I can get it passing all of the checks of the test, I think I will be good to go and be able to commit.

Time to get back to work now though! Toodles!

Thursday, June 23, 2011

impression of evolution & ievobio 2011: impressive!

My mind is about to explode from all of the information that it took in these past five days.  I was not exactly sure what to expect out of these conferences. I had been to several smaller conferences before but nothing like this. Every day was jam-packed with talks and you basically had to run from one room to the next based on what you wanted to listen to. It was nuts.  The whole experience got me really excited about the whole field though. There are so many people doing so many different cool things all over the world. That was another cool aspect: people were from all over the world at this conference. Probably the best part of it all was the fact that everyone was just so friendly.  Even though I only knew 3 people there when I arrived, I felt as if I had met a million people by the time I left. Many of them I now would not even hesitate to contact in the future. Everyone was so genuinely interested in what other people were doing and it was so exciting to be a part of it all.  I was a little apprehensive as to how much fun we could actually have in Norman, Oklahoma.  However, I think the fact that we were all crammed into this isolated conference center forced people together and encouraged them to introduce themselves. And we all found our own fun.

The Evolution portion lasted Friday evening through Tuesday.  iEvoBio (more computer science focused) overlapped Tuesday and then continued into Wednesday.  It had already been a long week and I wasn't sure how much more information and ideas I could cram into my head.  However, once things got started, I was so excited to be there! There are so many awesome projects going on and it was pretty amazing that they are all open source.

Some of my favorite ongoing projects (besides anything to do with TreeBASE hehe) :

  • The Moorea Biocode Project: Here they are trying to barcode every organism of the ecosystem for this tiny island in the middle of the Pacific Ocean. And so far they are doing an awesome job at it! The database they have set up is amazing. For each specimen, they have pictures of it, georeferenced maps, and even links to any sequences associated with it.
  • The Map of Life project: This includes a plethora of resources regarding species distributions. It allows the user to search for a species, view its interactive range map, and then investigate information from other species data repositories and data from past surveys.
  • The idea of sharing and re-using trees: Here is a nicely-written NaturePrecedings abstract that describes the purpose and importance of this concept.

There were also many more, but I just wanted to highlight a few.  Also at iEvoBio, I was able to personally meet one of my mentors, Bill Piel, as well as other people involved in the TreeBASE project such as Hilmar and Karen. It was quite the week. After a good night's sleep I am ready to get back to work.

Tuesday, June 14, 2011

charsets, matrices, and unit tests...Oh My!

Greetings everyone! I apologize for the fact that it has been a while since I last posted.  I wanted to make sure I fully understood what I was talking about before I actually wrote it down for all the world to see. A few weeks ago, I was asked to add charset functionality to TreeBASE.  What exactly does that mean? Two weeks ago, I was in the same boat you are in now.  After attempting to figure it out on my own, I ended up going on a wild goose chase and gaining virtually no further understanding of my actual project. So I decided that it couldn't hurt to ask for more details.  This post is a summary of the information I obtained from Bill and Rutger over the past week about CharSets. Much of it comes directly from the e-mails that were sent back and forth, just for simplicity's sake (and because Rutger and Bill worded it very well).

Here it goes. First off, this was my original task: I needed to generate a NeXML solution both for expressing CHARSET free text and for expressing the row-segment metadata like the Genbank accession number.

What is a CHARSET? In NEXUS parlance, the CHARSET is simply a specified set of characters with a title. This allows the user to annotate various sets of characters to mean different things, e.g., saying which sequences belong to which genes, or saying which characters are behavioral vs morphological, etc etc. This is one of the few really universal ways of specifying sets of characters. It is recognized by PAUP, MrBayes, etc, and they can use it in conjunction with defining partitions and applying different substitution models to different genes.

How are CHARSETs used in TreeBASE? What are the advantages and disadvantages of CHARSETs? TreeBASE currently parses CHARSETs out of submitted NEXUS files, stores them, and then generates them when outputting a NEXUS serialization.  However, our NeXML does not seem to output the equivalent metadata. Consequently, the NEXUS output contains some annotations that are not found in the NeXML output, which is a bummer. On the other hand, CHARSET annotations are only really human-readable, in that there is no machine-readable syntax that uses a controlled vocabulary, etc. (i.e. a human knows that "CHARSET CO1 = 1-505;" and "CHARSET COI = 1-505;" both state that sequences 1 through 505 come from the gene cytochrome oxidase I, but a computer only knows that "CO1" and "COI" are two different strings).  Likewise "CHARSET ambiguous_regions = 1-34 543-551 601 893-901;" is clear to humans but not to machines.  NeXML has an opportunity to be much more explicit, but only if it is supplied data with a controlled vocabulary (which we can't do because we ingest them with free-text NEXUS). But nonetheless, it would be valuable for TreeBASE's CHARSETs to be expressed in NeXML too, even if they have uninterpretable strings.
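To make the "human-readable but not machine-readable" point concrete, here is a small sketch that pulls the name and the character positions out of CHARSET statements like the ones quoted above. It is a toy parser written for this post, not TreeBASE's NEXUS parser, and it only handles the simple "CHARSET name = ranges;" form (real NEXUS files allow more syntax than this).

```java
import java.util.ArrayList;
import java.util.List;

class CharSetParseSketch {
    // Extracts the set name from a statement such as "CHARSET CO1 = 1-505;".
    static String name(String statement) {
        String body = statement.trim().substring("CHARSET".length());
        return body.substring(0, body.indexOf('=')).trim();
    }

    // Expands the ranges to the right of '=' into individual 1-based positions.
    static List<Integer> positions(String statement) {
        String s = statement.trim();
        String ranges = s.substring(s.indexOf('=') + 1).replace(";", "").trim();
        List<Integer> result = new ArrayList<>();
        for (String token : ranges.split("\\s+")) {
            String[] ends = token.split("-");
            int start = Integer.parseInt(ends[0]);
            // A lone number such as "601" is a range of length one.
            int end = ends.length > 1 ? Integer.parseInt(ends[1]) : start;
            for (int i = start; i <= end; i++) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String a = "CHARSET CO1 = 1-505;";
        String b = "CHARSET COI = 1-505;";
        // Same coordinates, but the names are just two different strings to a program.
        System.out.println(positions(a).equals(positions(b)));  // prints true
        System.out.println(name(a).equals(name(b)));            // prints false
        // Scattered ranges are allowed: 34 + 9 + 1 + 9 positions.
        System.out.println(positions("CHARSET ambiguous_regions = 1-34 543-551 601 893-901;").size());  // prints 53
    }
}
```

The coordinates can be compared mechanically, but nothing in the syntax tells a program that "CO1" and "COI" name the same gene.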

The NeXML API did not yet have programmatic access to create such sets. These have just recently been added to the code by Rutger. It is simple to use:

Subset charset = matrix.createSubset("
charset.addThing(char2); // etc.

These charset objects inherit from Annotatable and we can then attach annotations.
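For readers who want to see the overall shape of that pattern, here is a self-contained sketch. The classes below are hypothetical stand-ins written for this post, NOT the real org.nexml.model classes: only the createSubset(...) and addThing(...) call shapes come from the snippet above, and the assumption that the string argument is the set's name, the addAnnotation() method, and the "dc:description" predicate are all illustrative guesses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a character (a matrix column).
class SketchCharacter {
    final String label;
    SketchCharacter(String label) { this.label = label; }
}

// Hypothetical stand-in for a named subset that can also carry annotations.
class SketchSubset {
    final String name;
    final List<SketchCharacter> things = new ArrayList<>();
    final List<String[]> annotations = new ArrayList<>();  // {predicate, value} pairs

    SketchSubset(String name) { this.name = name; }

    void addThing(SketchCharacter c) { things.add(c); }

    void addAnnotation(String predicate, String value) {
        annotations.add(new String[] {predicate, value});
    }
}

// Hypothetical stand-in for a matrix that owns its subsets.
class SketchMatrix {
    final List<SketchSubset> subsets = new ArrayList<>();

    SketchSubset createSubset(String name) {
        SketchSubset s = new SketchSubset(name);
        subsets.add(s);
        return s;
    }
}

class SubsetSketch {
    public static void main(String[] args) {
        SketchMatrix matrix = new SketchMatrix();
        SketchCharacter char1 = new SketchCharacter("c1");
        SketchCharacter char2 = new SketchCharacter("c2");

        SketchSubset charset = matrix.createSubset("CO1");  // name is illustrative
        charset.addThing(char1);
        charset.addThing(char2);
        // Because the subset is annotatable, free-text metadata can ride along with it.
        charset.addAnnotation("dc:description", "cytochrome oxidase I");

        System.out.println(charset.name + " has " + charset.things.size() + " characters");
    }
}
```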

What are the differences between CHARSETS and RowSegments?
In addition to CHARSETs, TreeBASE also implements a similar annotation called "RowSegments" but this differs in several important ways:

1. A CHARSET applies to all homologous character scorings for all OTUs (i.e. taxa) in a character block or alignment. RowSegments specify a set of characters for any particular OTU or taxon or row in a matrix. So, taxon_a can have a RowSegment annotation for sequences 34 through 42, and taxon_b can have a RowSegment annotation for sequences 39 through 45, whereas a CHARSET annotation has to apply to the same homologous characters in both taxon_a and taxon_b.

2. CHARSETS allow you to specify a scattering of characters (i.e. see the ambiguous_regions example above), whereas RowSegments have a single begin and end index -- each can only specify a stretch of sequence or characters. 

3. CHARSETS have no controlled vocabulary. RowSegments have hard-typed fields for basic DarwinCore metadata plus culture numbers and Genbank accession numbers. It is likely that once we implement a MIAPA standard, we will need to soft-type our RowSegment annotations -- i.e. subject-predicate-object. 

4. The conceptual understanding of a CHARSET is that it refers to an abstract class of characters: a type of gene, a type of morphological character, etc. The conceptual understanding of RowSegments is that these are metadata attached to the specimen(s) that were examined when deriving the characters: e.g. the specimen's culture number, museum collection code, the Genbank accession number for the sequence derived from the specimen, etc. 

5. CHARSETS are a formal part of NEXUS and NeXML; RowSegments are not.
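The first three differences in the list above can be sketched as two small data structures. These classes are illustrative only: the class and field names are assumptions, not TreeBASE's actual model, and the pairing of accession numbers to taxa is arbitrary (the accession strings are just borrowed from examples elsewhere in this post).

```java
import java.util.Arrays;
import java.util.List;

// A CHARSET: a free-text title plus possibly scattered ranges, shared by every row.
class SketchCharSet {
    final String title;
    final List<int[]> ranges;  // inclusive {start, end} pairs

    SketchCharSet(String title, int[]... ranges) {
        this.title = title;
        this.ranges = Arrays.asList(ranges);
    }

    // No taxon argument: the same columns apply to all OTUs in the matrix.
    boolean covers(int column) {
        for (int[] r : ranges) {
            if (column >= r[0] && column <= r[1]) {
                return true;
            }
        }
        return false;
    }
}

// A RowSegment: one contiguous stretch of one row, with a typed metadata field.
class SketchRowSegment {
    final String taxon;
    final int start;
    final int end;
    final String genbankAccession;

    SketchRowSegment(String taxon, int start, int end, String genbankAccession) {
        this.taxon = taxon;
        this.start = start;
        this.end = end;
        this.genbankAccession = genbankAccession;
    }

    // The taxon matters here: each row can be segmented differently.
    boolean covers(String taxonLabel, int column) {
        return taxon.equals(taxonLabel) && column >= start && column <= end;
    }
}

class CharSetVsRowSegmentSketch {
    public static void main(String[] args) {
        SketchCharSet ambiguous = new SketchCharSet("ambiguous_regions",
                new int[] {1, 34}, new int[] {543, 551}, new int[] {601, 601}, new int[] {893, 901});
        SketchRowSegment segA = new SketchRowSegment("taxon_a", 34, 42, "AF284000");
        SketchRowSegment segB = new SketchRowSegment("taxon_b", 39, 45, "AF45345");

        System.out.println(ambiguous.covers(601));      // true for every taxon
        System.out.println(segA.covers("taxon_a", 36)); // true
        System.out.println(segB.covers("taxon_b", 36)); // false: taxon_b's segment starts at 39
    }
}
```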

 If RowSegments are not a part of NEXUS or NeXML, that must cause some problems...
One of our big problems is that we offer the ability to capture some pretty important metadata using RowSegment annotation, but the concept of the "RowSegment" is missing from NEXUS -- so we can't export it or ingest it using NEXUS.  Mesquite has something that comes close: the NOTES block of a Mesquite-written NEXUS file has a special annotation, such as these two:

SUTM  T = 4 N = genBankNumber S = AF284000;

... which means "the entire row in the matrix for taxon number 4 comes from the Genbank accession AF284000."  This doesn't work for us because we allow a row to have more than one Genbank accession number -- e.g. sequence 1-500 is AF284000 but 501-850 is AF45345.

Alternatively there is also:

AN T = 4 C = 1  AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );
... which means, "the annotation for character 1 of taxon 4 was provided by someone named 'TreeBASE', who gave it the value 'AF284000' with reference to something called 'genBankNumber' ".  This doesn't work for us because we would want to provide a whole range of sequences, not just one base. i.e., it would be better if Mesquite allowed us to write:

AN T = 4 C = 1-500  AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );
AN T = 4 C = 501-850  AU = TreeBASE TF = ( CM AF45345 ) TF = ( R genBankNumber );

So unless Mesquite expands its NOTES annotation capability, we cannot export NEXUS with our RowSegment annotation using a syntax that Mesquite understands and then present it to the user in a nice graphical way.

But nonetheless, it would be great to figure out how these metadata could be expressed in NeXML, especially if there's a way to embed Darwin Core syntax and vocabularies inside the NeXML.

What is DarwinCore syntax? It is abbreviated DwC. The Wikipedia page is pretty helpful. Basically, it is a set of data standards that are used in biodiversity research to help keep everyone on the same page.  The Term Reference page is also very useful in gaining an understanding of what sorts of data it standardizes.

Isn't there already metadata being expressed in NeXML?
Currently, *some* of these metadata are being expressed in our NeXML, but probably in the wrong fashion (IMO). For example, the latitude and longitude are attached to OTU elements. However, the lat/long is an attribute of a specimen, and some of that specimen's characters or sequences were aligned in a matrix, and a derived analysis of the matrix produced a tree with OTUs in it. So the lat/long only very indirectly belongs to the OTUs. It much more directly annotates a set of sequences or characters for a particular row of character data. The issue is how to express this in NeXML.

Can you give me a two-sentence summary, please?
If the ultimate goal is to allow TreeBASE to ingest all the data it needs exclusively via NeXML, then the first step is to fully express all of TreeBASE's data and metadata in TreeBASE's NeXML output. And then the second step is to create a NeXML ingest that knows what to do and where to store this richly-annotated NeXML. 
I will give you a moment to digest all of that. After sifting through all of this valuable information, I let out a sigh of relief because I finally had a thorough understanding of what was going on.  Our next step was to start putting that into action, which will be my next post (I promise I will do it later tonight!). I just didn't want to overwhelm readers too much!

Friday, June 3, 2011

right under my nose...

Don't you hate it when you think you have been doing something right the entire time and then you find something way later on that would have made your life so much easier? That just happened to me.

It kept crossing my mind that the code for NeXML was a little skimpy. I kind of felt like I was missing something. However, I had only been looking under When studying this a little closer, I realized I am missing out on more than half the relevant code. Within the nexml-1.5-SNAPSHOT.jar, there is a whole set of important classes. Sure enough, Rutger even told me to study org.nexml.model* 

And now the frustration sets in....GRRR.