Cell In Silico


Molecules - Sequence - Disease

What I Learn from Porting a Bioinformatics Software from Python 2 to Python 3

July 03, 2020 | 6 Minute Read

When I volunteer to take over the job of maintaining CLIPper, I was thrilled. On one hand, I was excited to learn how others code, and the dos and don’ts of software engineering. I look forward to add useful features for many other scientists over the world. On the other hand, I was afraid that I will ruin the ENCORE standard pipeline CLIPper serve as, and destroy numerous people’s research projects. Though I have been completely “dry” for a while, I had little formal CS eduation and is very inexperienced.

via GIPHY

In my vague memory, porting stuffs from python 2 to python 3 is a lot more than “changing print something to print(something). I need to ensure all the functions are behaving the same, instead of just getting all of them running. Some modules can help with backward incompatibility…etc. This article will describe steps I went through and what I learn.

Step 0: It’s Hard to Ruin Things, If You Do Proper Version Control

via GIPHY

Open up your own branch before you do anything is always a good thing. If that branch end up not working, just don’t merge it. It also ensures the master branch is stable. I record every significant change I made with a commit message. I was lucky that I don’t need to revert things back. But I believe with more complex projects, the need to revert will happen.

I first started a branch called charlene_working, intended to clean up the python 2 code.

Step 1: Read the Mysterious Code. Where do I even start?

via GIPHY

Sometimes the repository isn’t well annotated. Sometimes it contains a lot of unused functions and files. Unfortunately these are the cases for CLIPper. Making things worse, the method CLIPper employ is dispersed through several papers. To efficiently understand which file is doing what, I found it helpful to find and read through the “main function”.

Never start at line 1, file 1. If you are unlucky, starting in a helper function of a submodule of a module of some 100,000-line package, it will be very tricky to understand what this function is doing. You will stuck here FOREVER.

But where is the main function? Here are some tips to help you locate it.

Find main function

  1. Many bioinformatics software comes with a command line tool. Line console_script: FILENAME.py in setup.py points you to the corret file. Look for the function after if __name__ == '__main__':. That points you to the main function.
  2. Popular python package structure: usually in PACKAGE_NAME/src/, PACKAGE_NAME/ with some “sorta generic name”.
  3. Trace back from the function featured in “Getting Started”. Those usually start from a high level.

Practice with Scoary CLIPper BioGraPy

Trace down along the main function.

From reading the main function, I found it calls to function: call_peaks and filter_peak_dicts. I search the whole directory (Ctrl-Shift-F in PyCharm) to find where those functions are, what other functions they call, trace all the way until no other function is called. Also pay attention how the whole repository is organised. This bird-eye’s view help locate the bug when tests don’t pass.

Along the way, I took some notes in the docstring, and identified some files that I suspect are useless.

Step 2: Run Tests and Ensure 100% coverage

via GIPHY

After I have a vague birds eye’s view on CLIPper, I run the test once. This step is to ensure that the test pass BEFORE you do anything to the code, so that if it failed after some modification, you know IT’S YOUR FAULT!! CLIPper’s test don’t pass initially, since the test wasn’t maintained with the code. Luckily, these are mostly some syntax changes, additional options. I was able to recover and pass the tests with a few lines modification.

Then I use coverage to ensure all functions has a test.

Step 3: Reorganize if Messy

via GIPHY

I removed unused functions and modularize function with proper file naming. peakfinder.py was quite long and contain a myriad of functions. I end up splitting it into three files: filter_peak.py, main.py and utils.py. A good IDE can save a lot of time when moving functions, chaning names.

Remember to rerun test after any major change.

Step 4: Port to python3

via GIPHY

  1. Create a new environment for python 3.
  2. Create a new branch for python 3, in case stuffs broke down its always easy to revert.
  3. Use (futurize)[https://python-future.org/futurize.html] to automatically fix common syntatic difference.
  4. Run Test, fix bugs, until all is complete (usually some really minor stuffs like map() return list in python 2 but map Object in python 3; or function name change when package is upgraded sklearn.mixture.GMM to sklearn.mixture.GaussianMixture).

The biggest challenge I faced in this step is dealing with the C extension. I had never written C before. I had to go through a few tutorials to deal with the syntatic difference between python 2 and 3’s extension.

Step 5: Make Others Easy to Install

via GIPHY

Brian helped me to containerize everything (the conda environment as well as the gcc version), and installed on the server. conda env export > environment.yml helps others recreate your conda environment.

Step 6: Run More Tests

via GIPHY

Since CLIPper is previously used in other datasets, we plan to rerun those using the new CLIPper, to really make sure that it behaves the same. Hope the new CLIPper can pass and we can distribute it soon!

Summary

Although it is a lot of pain (and work) to maintain CLIPper, I learn a lot from this. I was able to complete local testing in 4-5 days, which isn’t too bad from hindsight. (I felt it will take forever when I started, or I may die from it). Apart from those I mentioned above, I became more familiar with the PyCharm IDE, and know how a proper development environment can save a lot of time. I learned what setup.py is doing, how C extension can be made and proper packaging. Most importantly, CLIPper is no mysterious to me anymore. I hope I can add more features to CLIPper and improve its statistical methods to deal with noisy sequencing data.

Useful Tutorial And References!

Portion Python 2 to Python 3

Unittest

Coverage

Packaging python