PhyloSoC 2011: Automated submission to TreeBASE

Monday, September 12, 2011

so long and farewell...maybe

Greetings everybody. I know it has been a long while since I posted, but it has been a busy month as I have now moved to Vietnam for the year to carry out my Fulbright grant. However, I did not want to leave my blog hanging...I wanted to write one last post.

To reflect a little bit, I have to say this has been one of the most rewarding experiences that I have ever participated in. It is hard to explain how difficult it is to learn computer science just within the walls of the classroom. After even one summer of participating in an actual project, I now feel competent to take on anything. This project has proven to me that I can actually contribute something useful to a field that I love. I am excited to see where this takes me and I truly believe this experience has opened up many doors for me.

It is neat working for an organization with many different projects going on, but all working towards a similar goal of trying to make the field of phyloinformatics more manageable and useful to scientists everywhere. It makes me excited to enter into a field that is exponentially growing.

To recap my summer:

I learned to set up a large complicated code base locally on my computer.
I have gained a profound appreciation of the power of the nexml file and the importance of making nexml files more useful to the scientific community.
I learned what char sets and row segments were and why they are important to nexml. I even implemented char set functionality for TreeBASE.
I was able to participate in a few discussions regarding ideas about how to solve various issues for TreeBASE and fully understand the meaning of "easier said than done".
I learned to not be scared of the debugger and actually debugged a bug!
I learned/"got my feet wet" with Firebug while trying to figure out how to make jsPhyloSVG work for TreeBASE, but that project was started toward the end and never exactly finished.

I am sure my list can go on and on regarding little odds and ends of things I learned while doing this project, but I just wanted to write a brief summary :) I want to thank Rutger and Bill, my mentors, as I was very new to a big project like this and they were extremely helpful and patient as I figured things out. This will be my last post for my 2011 project :( It went much faster than expected. However, I do hope to continue to contribute to TreeBASE in the future :) :) :)

Cheers!

Monday, August 1, 2011

bugs and trees and bugs in trees

It has been rather hectic these past few days. Yesterday, I moved back home to Dayton after four years of living in Chicago. Very sad :( And now I am getting ready for my adventures in Vietnam! Haven't really had time to give a full update on here as to what I have been doing because of the craziness.

Now that the Chicago chapter of my life has been closed, I have some time to blog about what has been going on with my project since my last post.

Thing #1: The Bug. I have to be honest, I was a bit scared to use the debugger in Eclipse. I have stealthily managed to avoid using debuggers in my college days. I was only familiar with the ridiculously simple debugger found within BlueJ from my Intro to Object-Oriented Programming Days. Anyways, I sucked it up and thought it was about time to learn the power of the debugger. I watched a series of video tutorials and they helped immensely. Still working out all of my kinks with it, but I was able to run a series of unit tests for my bug and at least eliminate some issues that are not causing the problem. I need to further inspect the nexml matrix contents to see what is coming up as null or something. More later...

Thing #2: jsPhyloSVG. I updated a javascript file to its most recent version to see if that would fix the problem with the viewer. I just committed it and it does not appear to do so. Now I get to install a debugger within my Firefox browser and see what is going on within the drawjsPhyloSVGTree(namespacedGUID,ntax) function. I will provide you with more insight on that later as well.

Happy August everyone. Rabbit rabbit rabbit!

Friday, July 15, 2011

change of plans

So I have been in hibernation mode. Sorry it has been so long since the last post -- my audience deserves better. I promise I have reasons for my lack of updates. More to come. But speaking of audiences...I just happened to take a look at my blog stats. It feels pretty cool that I have had 446 visitors (well, probably 100+ are from me refreshing the page a million times after I make a post), but still it is nice to know people are stopping by. And not just people I know...people from all over the world! Check it out:

So greetings to people from India, Romania, and all those other cool places I wish I had been to (Vietnam--see you in a few weeks). These little stats help remind me that this Google Summer of Code thing is on a global scale and it is nice to know that I am contributing to something like that.

So why the lack of presence? Well, to tell you the truth...there has not been too much to update you on. We have been in a "transition phase" trying to figure out what to do next. From my last post you may have noted that what we were interested in was including row-segment metadata in the NeXML output. Turns out that this is easier said than done. There is going to have to be some editing of the nexml.jar files (something that Rutger has to do) so we may have to put that on hold until more thought gets put into it. I did add some lines of code that allow more annotations to be added to the row segments, though. You can see that here. But character state sequence annotations remain an issue.

So what am I doing in the meantime? Well, I thought it might be nice to take some weight off of TreeBASE's shoulders by suggesting that I fix some of the 64 bugs that are currently open. This never-ending To Do list is about to get some attention! We settled on bug #3303002: NeXML not working for matrix or tree. For one reason or another, some studies (not all) can get the tree but not the matrix for nexml. After doing some investigating, I am working on (and hopefully finishing up tonight) a unit test that is going to nail down the problem. And then I'm going to fix it! :)

Another thing we are considering for me, after I get tired of debugging and before the row-segment drama gets settled, is to add another tree viewer (in addition to PhyloWidget...or maybe in place of?) called jsPhyloSVG ... a bit less catchy of a name but it seems super cool. Hopefully I will have some time to work on that.

Today was the official Midterm Evaluation day and it is nice to say I have made it halfway. I am learning quite a bit about myself -- to name a few:

The philosophy of "1) Don't panic; 2) Start with what you know" is extremely useful in the field of computer science. See previous post for an example. But I pretty much use this whenever I am about to start a new task.
I seem to be in my "coding groove" 3PM onward. This is good and bad.

It is good because I spend my evenings productively and am able to work diligently with "Seinfeld" episodes on or NPR "Fresh Air" playing in the background, making the mood a little lighter. It gives me time to do morning things like respond to e-mails, read through my Bookmarks bar pages, clean up the apartment, hug my cat excessively etc. Then I finally sit down around noon-ish looking at my code and trying to gather my thoughts. This is much harder than it sounds. I read and reread the latest e-mails and gchats from Rutger and Bill, reminding myself of what I am doing. Then, out of nowhere, things are finally clear and I can often times get obsessed with what I am doing and not want to do anything else.
It is bad for several reasons:

I like drinking coffee when I am coding but coffee at 3PM is not the same as coffee at 8AM. Oh well, you can't have it all Laurel...
On the obsessive note, I can get so involved in what I am doing that I forget to eat dinner and/or socialize. #dweeb
Most coding jobs are 9-5 and I would be required to readjust my whole circadian rhythm if I were to pursue them. The luxury of Google Summer of Code is that if I want to start working at 3PM and continue throughout the evening, I can :) But it is not exactly normal...

Epiphanies are awesome. And they happen to come at very strange hours of the day during very strange times -- such as at 2AM when you are trying to fall asleep but can't because there is no AC and it is 101 degrees outside, or mid-conversation with someone when you have to interrupt them and say "wait! wait! hold that thought I need to write something down..." I haven't really had this sort of experience during my biology research. Maybe one day. But the moment that one dawns on you is truly an experience :)

One last thing worth mentioning...since the last time I posted, Google+ has exploded. And I hope it replaces Facebook because it is so much more awesome (and less creepy).

Okay it's Friday night and I think that is quite enough for one blog post. Bye bye!

Wednesday, June 29, 2011

express charsets : CHECK!...so what's next?

Happy Wednesday everybody!

For those receiving my progress report, you know that we have successfully added charset expression in NeXML for TreeBASE. It feels extremely awesome to know I have actually contributed something important to an active and ongoing project.

The most important thing I have learned from this portion of the project: test-driven development, where you write a failing test first and then fix the code until it satisfies all the tests that you wrote for it. Then you know you have a working program (at least as far as the tests go). I never learned that in my classes and it is very cool to learn an important technique that I will continue to use in the future.

Summary of what we did:

We wrote two JUnit tests. testNexmlMatrixConverter() creates a NexmlMatrix within a Nexml document and checks to make sure that it matches up with the appropriate fetched matrix from TreeBASE. testNexmlMatrixCharSets() verifies that the coordinates and names of the character sets in the NexmlMatrix actually match up to those of the TreeBASE matrix. It also does a check to make sure that the study actually has charsets associated with it. This is important to prevent NullPointerExceptions.
After establishing a well thought-out and commented unit test for implementing char set expression, I took the logic from the unit test and reworked it to fit in with the actual methods of the NexmlMatrixConverter class. The code for this class can be found here: http://treebase.svn.sourceforge.net/viewvc/treebase/trunk/treebase-core/src/main/java/org/cipres/treebase/domain/nexus/nexml/NexmlMatrixConverter.java?revision=926&view=markup. It was pretty exciting when I was able to get everything to pass all of the tests we created for it and I was finally able to commit the code. Unfortunately, I included an unused import and it threw off the entire build of the project, causing problems for a lot of people. So...note to self: don't do that again. Anyways, after that was all sorted out, it is safe to say that TreeBASE now supports charset expression for NeXML.
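
For anyone curious what that kind of check looks like, here is a minimal sketch in plain Java (no JUnit, no TreeBASE classes). Everything in it is an assumption for illustration: `CharSetInfo` and `sameCharSets` are hypothetical stand-ins, not the real TreeBASE or NeXML API. It only shows the two ideas from the tests above: guard against a study with no charsets before touching them, then compare names and coordinates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CharSetCheck {
    // Hypothetical stand-in for a character set: a title plus the columns it covers.
    static final class CharSetInfo {
        final String name;
        final List<Integer> columns;

        CharSetInfo(String name, Integer... columns) {
            this.name = name;
            this.columns = new ArrayList<>(Arrays.asList(columns));
        }
    }

    // Returns true when both sides carry the same charsets (same order, names, columns).
    // A null or empty TreeBASE list means the study has no charsets, so we return
    // early instead of dereferencing null.
    static boolean sameCharSets(List<CharSetInfo> treebase, List<CharSetInfo> nexml) {
        if (treebase == null || treebase.isEmpty()) {
            return nexml == null || nexml.isEmpty();
        }
        if (nexml == null || nexml.size() != treebase.size()) {
            return false;
        }
        for (int i = 0; i < treebase.size(); i++) {
            CharSetInfo expected = treebase.get(i);
            CharSetInfo actual = nexml.get(i);
            if (!expected.name.equals(actual.name)
                    || !expected.columns.equals(actual.columns)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<CharSetInfo> tb = Arrays.asList(new CharSetInfo("CO1", 1, 2, 3));
        List<CharSetInfo> xml = Arrays.asList(new CharSetInfo("CO1", 1, 2, 3));
        System.out.println(sameCharSets(tb, xml));   // same name and columns on both sides
        System.out.println(sameCharSets(null, xml)); // study without charsets vs. output with one
    }
}
```

In the real tests the comparison is wrapped in JUnit assertions and the two lists come from the TreeBASE matrix and the generated NeXML document; the sketch leaves both of those out.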

Now what are the next steps? I am going to be working on including row-segment metadata for NeXML output. Right now, I am working to express the metadata with Darwin Core terms for the specimen-related information. The sequence information is going to have to be expressed differently to be useful to anybody though. This involves editing the populateXmlMatrix() functions. However, these row-segment annotations possibly involve (in Bill Piel's words) partitioning a row of character states into every possible segment and then assembling sets of row segments as needed. Then, for each set of partitions, you can express metadata as described above. I have completed the editing of the populateXmlMatrix() methods, but I need to talk to Rutger now about what we are doing in regards to implementing them.

Alright, that's all folks. Have a good night!

Friday, June 24, 2011

back2work

All right, I need to get back down to business. Where was it I left off again? Oh yes, character sets. My last post about these dealt with the purpose and function of char sets but I don't think I actually explained what I was doing with them.

I have been working in the NexmlMatrixConverterTest class. Here I was creating a unit test, testNexmlCharSets(), that verifies that the title and coordinates of a TreeBASE character set match those of the corresponding NeXML character set. It also checks to see if the study being tested even has any character sets associated with it. The unit test was modeled off of the other one in this class created by Rutger. Since I was feeling a little lost at first, he wrote out a unit test that was very well-commented so I could figure out what was going on. You can check out the most recent code in this class at: http://treebase.svn.sourceforge.net/viewvc/treebase/trunk/treebase-core/src/test/java/org/cipres/treebase/domain/nexus/NexmlMatrixConverterTest.java?view=log.

After I wrote my portion, Rutger put some finishing touches on it and I am now trying to get the actual NexmlMatrixConverter class working by applying the logic from the unit tests to the actual code. Once I can get it passing all of the checks of the test, I think I will be good to go and be able to commit.

Time to get back to work now though! Toodles!

Thursday, June 23, 2011

impression of evolution & ievobio 2011: impressive!

My mind is about to explode from all of the information that it took in these past five days. I was not exactly sure what to expect out of these conferences. I had been to several smaller conferences before but nothing like this. Every day was jam packed with talks and you basically had to run from one room to the next based on what you wanted to listen to. It was nuts. The whole experience got me really excited about the whole field though. There are so many people doing so many different cool things all over the world. That was another cool aspect. People were from all over the world at this conference. Probably the best part of it all was the fact that everyone was just so friendly. Even though I only knew 3 people there when I arrived, I felt as if I had met a million people by the time I left. Many of them I now would not even hesitate to contact in the future. Everyone was so genuinely interested in what other people were doing and it was so exciting to be a part of it all. I was a little apprehensive as to how much fun we could actually have in Norman, Oklahoma. However, I think the fact that we were all crammed into this isolated conference center forced people together and encouraged people to introduce themselves. And we all found our own fun.

The Evolution portion lasted Friday evening through Tuesday. iEvoBio (more computer science focused) overlapped Tuesday and then continued into Wednesday. It had already been a long week and I wasn't sure how much more information and ideas I could cram into my head. However, once things got started, I was so excited to be there! There are so many awesome projects going on and it was pretty amazing that they are all open source.

Some of my favorite ongoing projects (besides anything to do with TreeBASE hehe) :

The Moorea Biocode Project (http://mooreabiocode.org/): Here they are trying to barcode every organism of the ecosystem for this tiny island in the middle of the Pacific Ocean. And so far they are doing an awesome job at it! The database they have set up is amazing. For each specimen, they have pictures of it, georeferenced maps, and even links to any sequences associated with it.
The Map of Life project (http://www.mappinglife.org/): This includes a plethora of resources regarding species distributions. It allows the user to search for a species, view its interactive range map, and then investigate information from other species data repositories and data from past surveys.
The idea of sharing and re-using trees: Here is a nicely-written NaturePrecedings abstract http://precedings.nature.com/documents/6048/version/1 that describes the purpose and importance of this concept.

There were also many more, but I just wanted to highlight a few. Also at iEvoBio, I was able to personally meet one of my mentors, Bill Piel as well as other people involved in the TreeBASE project such as Hilmar and Karen. It was quite the week. After a good night's sleep I am ready to get back to work.

Tuesday, June 14, 2011

charsets, matrices, and unit tests...Oh My!

Greetings everyone! I apologize for the fact that it has been a while since I last posted. I wanted to make sure I fully understood what I was talking about before I actually wrote it down for all the world to see. A few weeks ago, I was asked to add charset functionality to TreeBASE. What exactly does that mean? Two weeks ago, I was in the same boat you are in now. And after attempting to figure it out on my own, I ended up going on a wild goose chase and gaining virtually no further understanding of my actual project. So I decided that it couldn't hurt to ask for more details. This post will involve a summary of the information I obtained from Bill and Rutger over the past week about CharSets. Much of it will come directly from the e-mails that were sent back and forth just for simplicity's sake (and Rutger and Bill worded it very well).

Here it goes. First off, this was my original task: I needed to generate a NeXML solution both for expressing CHARSET free text and for expressing the row-segment metadata like the Genbank accession number.

What is a CHARSET? In NEXUS parlance, the CHARSET is simply a specified set of characters with a title. This allows the user to annotate various sets of characters to mean different things, e.g., saying which sequences belong to which genes, or saying which characters are behavioral vs morphological, etc etc. This is one of the few really universal ways of specifying sets of characters. It is recognized by PAUP, MrBayes, etc, and they can use it in conjunction with defining partitions and applying different substitution models to different genes.

How are CHARSETs used in TreeBASE? What are the advantages and disadvantages of CHARSETs? TreeBASE currently parses CHARSETs out of submitted NEXUS files, stores them, and then generates them when outputting a NEXUS serialization. However, our NeXML does not seem to output the equivalent metadata. Consequently, the NEXUS output contains some annotations that are not found in the NeXML output, which is a bummer. On the other hand, CHARSET annotations are only really human-readable, in that there is no machine-readable syntax that uses a controlled vocabulary, etc. (i.e. a human knows that "CHARSET CO1 = 1-505;" and "CHARSET COI = 1-505;" both state that sequences 1 through 505 come from the gene cytochrome oxidase I, but a computer only knows that "CO1" and "COI" are two different strings). Likewise "CHARSET ambiguous_regions = 1-34 543-551 601 893-901;" is clear to humans but not to machines. NeXML has an opportunity to be much more explicit, but only if it is supplied data with a controlled vocabulary (which we can't do because we ingest them with free-text NEXUS). But nonetheless, it would be valuable for TreeBASE's CHARSETs to be expressed in NeXML too, even if they have uninterpretable strings.
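
To make the "human-readable but not machine-readable" point concrete, here is a small Java sketch that parses the CHARSET statements quoted above. It is an illustration under assumptions, not TreeBASE's actual NEXUS parser: `CharSetParser` and `ParsedCharSet` are made-up names, and it only handles the simple forms shown in this post (single indices and start-end ranges), not the rest of the NEXUS syntax.

```java
import java.util.ArrayList;
import java.util.List;

class CharSetParser {
    // Parsed result: the free-text title and the inclusive [start, end] ranges.
    static final class ParsedCharSet {
        final String name;
        final List<int[]> ranges = new ArrayList<>();

        ParsedCharSet(String name) {
            this.name = name;
        }
    }

    // Parses one statement of the form "CHARSET name = 1-34 543-551 601;".
    static ParsedCharSet parse(String statement) {
        String body = statement.trim();
        if (body.endsWith(";")) {
            body = body.substring(0, body.length() - 1);
        }
        String[] sides = body.split("=", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("Missing '=' in: " + statement);
        }
        String[] head = sides[0].trim().split("\\s+");
        if (head.length != 2 || !head[0].equalsIgnoreCase("CHARSET")) {
            throw new IllegalArgumentException("Not a CHARSET statement: " + statement);
        }
        ParsedCharSet result = new ParsedCharSet(head[1]);
        for (String token : sides[1].trim().split("\\s+")) {
            String[] bounds = token.split("-", 2);
            int start = Integer.parseInt(bounds[0]);
            // A lone index such as "601" is treated as the range 601-601.
            int end = bounds.length == 2 ? Integer.parseInt(bounds[1]) : start;
            result.ranges.add(new int[] {start, end});
        }
        return result;
    }

    public static void main(String[] args) {
        ParsedCharSet a = parse("CHARSET CO1 = 1-505;");
        ParsedCharSet b = parse("CHARSET COI = 1-505;");
        // Same coordinates, but the program has no way to know the titles mean the same gene.
        System.out.println(a.name.equals(b.name));
    }
}
```

The coordinates come out as clean numbers, but the title is still just a string: nothing in the statement tells the program that "CO1" and "COI" are the same gene, which is exactly the controlled-vocabulary problem.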

The NeXML API did not yet have programmatic access to create such sets. These have just recently been added to the code by Rutger. It is simple to use:

Subset charset = matrix.createSubset("ambiguous_regions");
charset.addThing(char1);
charset.addThing(char2); // etc.

These charset objects inherit from Annotatable and we can then attach annotations.

What are the differences between CHARSETS and RowSegments?

In addition to CHARSETs, TreeBASE also implements a similar annotation called "RowSegments" but this differs in several important ways:

1. CHARSETS apply to all homologous character scorings for all OTUs (i.e. taxa) in a character block or alignment. RowSegments specify a set of characters for any particular OTU or taxon or row in a matrix. So, taxon_a can have a RowSegment annotation for sequences 34 through 42, and taxon_b can have a RowSegment annotation for sequences 39 through 45, whereas a CHARSET annotation has to apply to the same homologous characters in both taxon_a and taxon_b.

2. CHARSETS allow you to specify a scattering of characters (i.e. see the ambiguous_regions example above), whereas RowSegments have a single begin and end index -- each can only specify a stretch of sequence or characters.

3. CHARSETS have no controlled vocabulary. RowSegments have hard-typed fields for basic DarwinCore metadata plus culture numbers and Genbank accession numbers. It is likely that once we implement a MIAPA standard, we will need to soft-type our RowSegment annotations -- i.e. subject-predicate-object.

4. The conceptual understanding of a CHARSET is that it refers to an abstract class of characters: a type of gene, a type of morphological character, etc. The conceptual understanding of RowSegments is that these are metadata attached to the specimen(s) that were examined when deriving the characters: e.g. the specimen's culture number, museum collection code, the Genbank accession number for the sequence derived from the specimen, etc.

5. CHARSETS are a formal part of NEXUS and NeXML, RowSegments are not.
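
The first three differences can be sketched as two tiny Java data structures. This is only an illustration of the concepts in the list: `CharSet`, `RowSegment`, and `accessionFor` are hypothetical names, not the TreeBASE domain classes, and the accession strings are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AnnotationModel {
    // A CHARSET: a free-text title plus any scattering of column ranges,
    // applied to the same columns in every row of the matrix.
    static final class CharSet {
        final String title;
        final List<int[]> ranges = new ArrayList<>();

        CharSet(String title) {
            this.title = title;
        }

        CharSet add(int start, int end) {
            ranges.add(new int[] {start, end});
            return this;
        }

        boolean covers(int column) {
            for (int[] r : ranges) {
                if (column >= r[0] && column <= r[1]) {
                    return true;
                }
            }
            return false;
        }
    }

    // A RowSegment: one contiguous stretch of ONE row, with a typed metadata field.
    static final class RowSegment {
        final String taxon;
        final int start;
        final int end;
        final String genbankAccession; // placeholder value in this sketch

        RowSegment(String taxon, int start, int end, String genbankAccession) {
            this.taxon = taxon;
            this.start = start;
            this.end = end;
            this.genbankAccession = genbankAccession;
        }
    }

    // Looks up the accession annotating a given column of a given row, or null if none.
    static String accessionFor(List<RowSegment> segments, String taxon, int column) {
        for (RowSegment s : segments) {
            if (s.taxon.equals(taxon) && column >= s.start && column <= s.end) {
                return s.genbankAccession;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CharSet ambiguous = new CharSet("ambiguous_regions")
                .add(1, 34).add(543, 551).add(601, 601).add(893, 901);
        List<RowSegment> segments = Arrays.asList(
                new RowSegment("taxon_a", 34, 42, "EXAMPLE_A"),
                new RowSegment("taxon_b", 39, 45, "EXAMPLE_B"));
        System.out.println(ambiguous.covers(601));
        System.out.println(accessionFor(segments, "taxon_a", 40));
        System.out.println(accessionFor(segments, "taxon_b", 40));
    }
}
```

Column 40 gets a different answer depending on which row you ask about, which is something a CHARSET cannot express, while the scattered `ambiguous_regions` ranges are something a single RowSegment cannot express.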

If RowSegments are not a part of NEXUS or NeXML, that must cause some problems...

One of our big problems is that we offer the ability to capture some pretty important metadata using RowSegment annotation, but the concept of the "RowSegment" is missing from NEXUS -- so we can't export it or ingest it using NEXUS. Mesquite has something that comes close: the NOTES block of a Mesquite-written NEXUS file has a special annotation, such as these two:

SUTM T = 4 N = genBankNumber S = AF284000;

... which means "the entire row in the matrix for taxon number 4 comes from the Genbank accession AF284000." This doesn't work for us because we allow a row to have more than one Genbank accession number -- e.g. sequence 1-500 is AF284000 but 501-850 is AF45345.

Alternatively there is also:

AN T = 4 C = 1 AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );

... which means, "the annotation for character 1 of taxon 4 was provided by someone named 'TreeBASE', who gave it the value 'AF284000' with reference to something called 'genBankNumber' ". This doesn't work for us because we would want to provide a whole range of sequences, not just one base. i.e., it would be better if Mesquite allowed us to write:

AN T = 4 C = 1-500 AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );

AN T = 4 C = 501-850 AU = TreeBASE TF = ( CM AF45345 ) TF = ( R genBankNumber );

So unless Mesquite expands its NOTES annotation capability, we cannot export NEXUS with our RowSegment annotation using a syntax that Mesquite understands and then present it to the user in a nice graphical way.

But nonetheless, it would be great to figure out how these metadata could be expressed in NeXML, especially if there's a way to embed Darwin Core syntax and vocabularies inside the NeXML.

What is DarwinCore syntax? It is abbreviated DwC. The wikipedia page is pretty helpful (http://en.wikipedia.org/wiki/Darwin_Core). But basically it is a set of data standards that are used in biodiversity research to help keep everyone on the same page. The Term Reference page (http://rs.tdwg.org/dwc/terms/#theterms) is also very useful in gaining an understanding of what sorts of data it standardizes.

Isn't there already metadata being expressed in NeXML?
Currently, *some* of these metadata are being expressed in our NeXML, but probably in the wrong fashion (IMO). For example, the latitude and longitude are attached to OTU elements. However, the lat/long is an attribute of a specimen, and some of that specimen's characters or sequences were aligned in a matrix, and a derived analysis of the matrix produced a tree with OTUs in it. So the lat/long only very indirectly belongs to the OTUs. It much more directly annotates a set of sequences or characters for a particular row of character data. The issue is how to express this in NeXML.
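
One way to picture "annotate the row segment, not the OTU" is as subject-predicate-object triples with Darwin Core terms as the predicates. The sketch below is only that, a picture: `Triple` and `georeference` are hypothetical names, the row-segment identifier and coordinates are made up, and this is not how the NeXML library actually attaches annotations. The two predicates, `decimalLatitude` and `decimalLongitude`, are real Darwin Core terms from the namespace linked above.

```java
import java.util.ArrayList;
import java.util.List;

class DwcTriples {
    // Darwin Core terms namespace (see the Term Reference page linked above).
    static final String DWC = "http://rs.tdwg.org/dwc/terms/";

    // A minimal subject-predicate-object statement.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    // Builds lat/long statements whose subject is a row segment rather than an OTU.
    static List<Triple> georeference(String rowSegmentId, double latitude, double longitude) {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple(rowSegmentId, DWC + "decimalLatitude", String.valueOf(latitude)));
        triples.add(new Triple(rowSegmentId, DWC + "decimalLongitude", String.valueOf(longitude)));
        return triples;
    }

    public static void main(String[] args) {
        // "rowsegment_1" and the coordinates are made-up illustration values.
        for (Triple t : georeference("rowsegment_1", -17.5, -149.8)) {
            System.out.println(t);
        }
    }
}
```

The design point is just where the subject sits: once the statement hangs off the row segment, the OTU inherits the location only indirectly, which matches the chain of reasoning above (specimen, then sequence, then matrix row, then tree).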

Can you give me a two-sentence summary, please?

If the ultimate goal is to allow TreeBASE to ingest all the data it needs exclusively via NeXML, then the first step is to fully express all of TreeBASE's data and metadata in TreeBASE's NeXML output. And then the second step is to create a NeXML ingest that knows what to do and where to store this richly-annotated NeXML.

I will give you a moment to digest all of that. After finally sifting through all of this valuable information, I finally let out a sigh of relief because I was happy to finally have a thorough understanding as to what was going on. Our next step was to start putting that into action--which will be my next post (I promise I will do it later tonight!). I just didn't want to overwhelm readers too much!

Friday, June 3, 2011

right under my nose...

Don't you hate it when you think you have been doing something right the entire time and then you find something way later on that would have made your life so much easier? That just happened to me.

It kept crossing my mind that the code for NeXML was a little skimpy. I kind of felt like I was missing something. However, I had only been looking under org.cipres.treebase.domain.nexus.nexml. When studying this a little closer, I realized I am missing out on more than half the relevant code. Within the nexml-1.5-SNAPSHOT.jar, there is a whole set of important classes. Sure enough, Rutger even told me to study org.nexml.model*

And now the frustration sets in....GRRR.

Monday, May 30, 2011

Setbacks and Starting Fresh/Community Bonding Period--Take 2

Happy Memorial Day everyone! Finally feels like summer here in Chicago. First day in a while that it hasn't felt like November. Just wanted to provide you with a few updates as to what has been going on the past week.

First, remember my last post when I was really excited to get started because my code base was finally built? Well, lesson #1: never get too excited. Somehow, some way, I managed to seriously botch up what I had spent the past few weeks doing and had to have Rutger help me out. Nothing says mentor-mentee bonding like a 4 hour VNC session!

That is one extremely useful thing I have learned to do thus far: VNC. I had never run it before on my Mac, so I spent an evening testing it out with my dad. It's actually very simple once you figure it out. First, for Macs, you go to System Preferences, Sharing, and then check the box that says "Screen Sharing". There, you can click on Computer Settings and set the password that people will use to log in to your computer over VNC. I also had to set up permissions for my router--that was probably the hardest part. And finally, I found that the easiest way to figure out your ip address is http://whatismyipaddress.com. Pretty cool :) Although I have to say it is extremely freaky to see your mouse moving on its own.

Thankfully I had this all figured out at the time I was in real need. After creating an endless amount of new projects, I ended up having to check out the code in Eclipse, and then installing Maven via my terminal rather than through Eclipse.

So I think I am OFFICIALLY ready to go now. Before this drama happened, I was just about ready to start writing code :( The first thing I was working on was expressing row-segment metadata for NeXML. At first, I was a little nervous. It seemed like there were a MILLION places to start. But then, the advice of a professor (I really wish I could remember which one so I could give him credit for his words of wisdom that I use daily) resounded in my head--Step 1: Don't Panic. Okay...trying my best. Step 2: Start with what you know. Okay...I know that I am working on the NeXML section. I also know what metadata annotations entail, so I was looking for keywords. And I ALSO vaguely remember an e-mail from a while back that said that annotations to georeference information had been included. After spending some time looking at how pieces fit together, I found that I was going to be editing the populateXmlMatrix() function within the NexmlMatrixConverter class. I checked with Rutger and it was very exciting to know that I ended up being correct! Yay.

Now that I am finally on the right page, I will be putting together a JUnit test to make this work. I have never written one before so it is going to be challenging, but I am looking at code that was already written to see how others have done it and am reading up on tutorials to do so. If I can get it working, I will be committing my first piece of code!

Other things I will be working on this week (this is mainly to keep tabs on myself):
--Progress report.
--Add charsets. There are two ways to do so; get back to Rutger about which way would be best and whether changes to the NeXML API are necessary.
--Fill in Wiki.

Okay that's all for now. I know I'm supposed to be taking the day off for Memorial Day, but I at least wanted to work a little in the morning to make up for some lost time from last week. Ta-ta for now!

Monday, May 23, 2011

If you build it...

...you can finally get started! So my code base is built (kinda--more in a minute) and just in time to get down to real work. But boy was that a challenge. My only error that occurs when I build it is:

The markup in the document following the root element must be well-formed. styles.xml

So not really sure what is going on, but if I comment out lines 3 and 4, the code builds. So moving on now....
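
For what it's worth, Java's built-in XML parser typically reports that message when something follows the closing tag of the root element, for example a second top-level element. I can't say that is what was wrong in styles.xml, but here is a minimal sketch that checks a string for well-formedness with the standard javax.xml.parsers API; the `WellFormedCheck` class and the sample XML strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {
    // Returns null when the XML parses cleanly, otherwise the parser's error message.
    static String parseError(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            // Thrown for input that is not well-formed.
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One root element: parses fine.
        System.out.println(parseError("<styles><style/></styles>"));
        // A second element after the root has closed: not well-formed.
        System.out.println(parseError("<styles/><extra/>"));
    }
}
```

Running a check like this against the file itself, instead of a sample string, would at least show which line the parser objects to, so commenting lines out does not have to be guesswork.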

To kick everything off, I am going to be expressing CHARSET free text and the row-segment metadata (Genbank accession number) for NeXML. Right now, it only seems to be working for NEXUS. And don't worry...I am working on my directions for setting up the code base as well so more to come.

Okay back to work. Bye bye.