Lessons on computational biology teams from working in biotech

There’s a lot of talk out there about the rise of “computational platforms” for drug discovery and other life sciences applications. No matter the application space, at the end of the day, a biotech or techbio startup needs to assemble skilled scientists and engineers to build a product.  Given the inherent complexity and broad base of scientific knowledge involved, it’s no surprise that some teams and platforms turn out better than others.

Here are some examples of computational biology / bioinformatics team-building decisions I have encountered that ran into problems:

  • Letting one developer build something on their own for a lengthy period, leading a computational tool or pipeline to become sprawling and difficult to scale or to improve.  Similarly, relying on bespoke or hard-to-automate algorithms, or setting up a pipeline without proper build and test systems or without proper documentation.  In such cases, it can be nearly impossible to reconstruct what was done after the fact.

  • Throwing more bodies at a problem (e.g., bringing in additional computational biologists to support a poorly architected data analysis pipeline that frequently crashes for unclear reasons), rather than evaluating fundamental needs and bringing in a software engineer instead.

  • Relying too heavily on one programming language, particularly a non-mainstream one.  I generally recommend that your computational group be comfortable working in both R and Python.  Other languages can be brought on board if they are critical for what you are building.  Lower-level, compiled languages like C++ and Java can be important for speed and scalability (particularly for crunching through large data), but they raise the barrier to entry and may be best left for when you can afford to hire experienced engineers.

  • Reinventing the wheel for a standard computational task like mutation calling.  When beginning to build a new computational tool, one should always ask: Can I use an off-the-shelf method to solve XYZ?  Or do I really need to develop a method de novo?

  • Establishing blanket automated review policies on all code, even code written for exploratory data analysis.  This is a great way to choke off creativity and impede often-overlooked-but-important exploratory analyses. (Yes, one does need to be careful about sharing quickly written exploratory analysis code to ensure it doesn’t end up somewhere unintended.)
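On the build-and-test point above: even a small, quickly written pipeline step benefits from a unit test that pins down its expected behavior and can run automatically on every merge. A minimal sketch in Python (the `counts_per_million` function and its values are hypothetical, just to illustrate the shape of a testable pipeline step):

```python
# Hypothetical example: a tiny, self-contained pipeline step
# with a test that documents and enforces its expected behavior.

def counts_per_million(counts):
    """Normalize raw read counts to counts-per-million (CPM)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has zero total counts")
    return [c * 1_000_000 / total for c in counts]

# A minimal test that can run in CI before any merge:
assert counts_per_million([5, 5, 10]) == [250_000.0, 250_000.0, 500_000.0]
```

Tests like this take minutes to write when the code is fresh, and they make it possible to reconstruct (and safely refactor) what was done long after the original author has moved on.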

The good news is that all of the above pitfalls are foreseeable.  Once you have fallen in, some of these are easier to extricate yourself from than others.

Here are some examples of “bioinformatics done right” — teams I have worked with that were super productive and fun:

  • A computational biologist (CB) + a software engineer (SE) building a genomics data analysis pipeline. The CB would prototype new pipeline functions or modules, and the SE would revise and augment the prototype code for efficiency/scalability while integrating it into the pipeline’s end-to-end testing and build system. (I’ve seen multiple examples of this, from a team of 2 up to a team of 8-10, where the tasks get further subdivided and iterations can happen quickly.)

  • A team of Machine Learning (ML) engineers partnered with a team of CBs, together nested within a larger project team working on a personalized health care product. The ML team led the algorithm development (deep learning model), the CB team led the training data set generation, and the two computational teams regularly met to exchange updates and iterate on the jointly developed tool. This configuration helped ensure the ML model was pointed at the right target and able to make useful predictions for the intended product.

  • A small team of computational biologists building a workflow in the cloud for single-cell RNA-seq analysis and related data visualization tools (for target selection). One computational biologist was an expert in the data types + standard tools + analysis workflow, while the others were experts in DevOps for the cloud and collaborative development. (The founding team had the core biology and disease area expertise.)

An important theme that emerges from the above experiences is the need for complementary skills. These often fall along the divide between core biology / data analysis on one side and software engineering / ML / algorithms on the other, as described in this post. Best practices in collaborative development are also super important to get right early on (using source control tools like GitHub, regular check-ins and code reviews prior to merging branches, and project management tools like Jira or Confluence to draft requirements and track issues, etc.).
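As one illustration of wiring those practices together, here is a minimal continuous-integration sketch (a hypothetical GitHub Actions workflow; file name and steps are assumptions) that runs the test suite on every pull request, so code reviews always happen against passing code:

```yaml
# .github/workflows/ci.yml -- hypothetical example
name: tests
on:
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest
```

Paired with a branch-protection rule requiring an approving review and green checks before merging, this keeps the main branch in a state the whole team can build on.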
