FLOSSing in the lab – What Plant and Food Research does with FLOSS
Roy Storey, Ben Warren, Eric Burgueno, Zane Gilmore, and Matthew Laurenson
Plant & Food Research (PFR; www.plantandfood.co.nz) is a New Zealand government-owned research institute that directs significant effort to breeding and delivering new cultivars to industry. Breeding is a numbers game. Each year PFR produces hundreds of thousands of new plant genotypes, the bulk of which are deliberately discarded before significant resource has been invested in growing them. Some plant characteristics cannot be observed until plants have reached reproductive maturity, which can take several years. Analysing a seedling’s genetic sequence data with genomics tools provides an opportunity to identify early on whether it carries useful genes, such as those associated with disease resistance or fruit colour.
The compressed genetic sequence of a single plant can occupy hundreds of gigabytes. Generating such data becomes cheaper every year; scientists respond by sequencing more often, which increases reliance on robust data processing and management methods. Tracking sample processing requires careful identification and management of individual plants, their associated genotypes, any samples collected from those plants, and the locations of both plants and samples. PFR has developed several software systems to address this challenge.
"Kea", a sample handling system developed using the Python-based Django framework and an underlying postgreSQL database, helps users to manage laboratory analysis of plant samples. Kea tracks hundreds of thousands of samples through laboratory processing. It uses Elastic Search for scalable indexing to provide rapid response times for data filtering and reporting.
"powerPlant" provides scientists with the tools carry out their own bioinformatics (the science of managing and analyzing genetic data) using open source analysis tools such as Galaxy, Ensembl, and R-Studio. powerPlant employs the open source job scheduler OpenLava to allocate processing tasks from front end web servers to back end processing nodes and databases.
Open data directives such as the New Zealand Government Open Access and Licensing framework (NZGOAL), the push for reproducible research, and the desire to collaborate with other organisations led PFR to establish scinet.org.nz to share data sets. Each data set has a MediaWiki instance for commentary and content management, coupled with a data browsing application such as GBrowse for manipulating and displaying annotations on genomes.
PFR now has a single 300 TB file system capable of storing large datasets, but we currently lack smart ways of processing them. We use a range of Linux-based servers with RAM ranging from 84 GB to 4 TB and 8 to 160 hyper-threaded CPU cores. Typical analyses can fully utilise even the largest of these servers. Much current IT industry activity focuses on dividing hardware, or even the operating system itself, into ever smaller chunks. In contrast, we need to find ways to divide a single process and distribute it across multiple servers to maximise resource use. Standards like MPI (implemented by projects such as Open MPI) are only partially adopted, and many bioinformatics applications cannot use processors across multiple servers.
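The divide-and-distribute idea can be sketched on a single server with Python's standard library: split one analysis into chunks and farm them out to worker processes. The GC-content workload here is a hypothetical stand-in for a real analysis; spanning multiple servers rather than cores would require something like MPI, which, as noted above, many bioinformatics tools do not support.

```python
from concurrent.futures import ProcessPoolExecutor

def gc_count(chunk: str) -> int:
    """Count G and C bases in one chunk of sequence."""
    return sum(1 for base in chunk if base in "GC")

def parallel_gc_fraction(seq: str, n_workers: int = 4) -> float:
    """Split a sequence into chunks, process them in parallel
    worker processes, and combine the partial results."""
    size = max(1, len(seq) // n_workers)
    chunks = [seq[i:i + size] for i in range(0, len(seq), size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(gc_count, chunks)) / len(seq)
```

On one machine this saturates the available cores; the open problem the text describes is doing the same split-and-combine step transparently across servers.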
In the future we plan to explore using systems like iRODS to automate file and metadata handling, and perhaps enable asynchronous integrations with national processing facilities like NeSI.
Zane is a developer and computing consultant for scientists at Plant & Food Research. He writes software (mostly in Python) and advises scientists on how computing can support their science. He has worked as a developer since 2000, after completing a degree in Computer Science at the University of Canterbury.