Weeknote 20200119 - Python, pip, dependencies

I started working with a new team focussed on using data science to make GOV.UK better for its end users. My previous team focussed on long-term maintenance of the platform. Data science is a whole new world for me. I'm excited to get stuck in and learn.

The code is largely in Python, and this is the first time I have worked with a significant piece of production Python code. I immediately bumped into problems trying to install the requirements for the project.

In the project, two of the top-level dependencies rely on Pygments. Pip installed an up-to-date version of Pygments when it resolved the first dependency, but the second dependency required an earlier version of Pygments. I think this was the result of merging Dependabot PRs without first running pip install to confirm that the requirements still installed cleanly.

Pip, to my surprise, does not resolve conflicting version requirements between dependencies, and I am reminded of DLL Hell.
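One way to surface this kind of conflict after an install is pip's own `pip check` command, which reports installed packages whose declared requirements are not satisfied. The following is a minimal sketch rather than anything the project actually uses; the function name `run_pip_check` is made up for illustration.

```python
import subprocess
import sys


def run_pip_check():
    """Run `pip check` in the current environment and return (ok, output).

    `run_pip_check` is a hypothetical helper name, not part of pip.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pip", "check"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code signals a problem, not a crash
    )
    return result.returncode == 0, result.stdout + result.stderr
```

Calling `ok, output = run_pip_check()` and printing `output` when `ok` is false should list any package whose requirement is unmet by what is installed.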

My current understanding of the workflow on a Python project is that developers occasionally use pip freeze to write the installed packages to a requirements file, and later installs work from that file. Apart from the versioning issues, it is also easy to commit unwanted packages to a project if developers are not careful, because pip freeze lists everything installed in the environment.

I spent some time investigating how other projects manage Python dependencies. A colleague pointed me towards PEP 518, which looks interesting. In my limited view of repositories around GDS, though, I did not see the pyproject.toml file it specifies being used, so I don't know whether people are doing this in the wild or not.

I found pipdeptree, which outputs a tree view of dependencies similar to Bundler's Gemfile.lock.
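To get a feel for what such a tree is built from, here is a rough standard-library sketch that lists each installed distribution alongside the requirements it declares. It is not pipdeptree and only goes one level deep; it assumes Python 3.8 or later for `importlib.metadata`, and the function names are my own.

```python
from importlib import metadata


def direct_requirements():
    """Map each installed distribution name to its declared requirement strings."""
    tree = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name is None:  # skip distributions with broken metadata
            continue
        tree[name] = list(dist.requires or [])
    return tree


def format_tree(tree):
    """Render the mapping as indented text, sorted case-insensitively by name."""
    lines = []
    for name in sorted(tree, key=str.lower):
        lines.append(name)
        for requirement in tree[name]:
            lines.append("  - " + requirement)
    return "\n".join(lines)
```

Running `print(format_tree(direct_requirements()))` inside the project's virtual environment should show which packages ask for Pygments and with what version constraints.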

I discussed the problem with one of the other developers on my new team and went digging around some other Python projects. We found a pattern of two requirements files: one that is hand-crafted and lists only the base project dependencies, and one that is autogenerated and lists all dependencies and their sub-dependencies. The advantage is that developers are more aware of the dependencies they are adding, which should stop unwanted or unneeded packages sneaking their way into a project.
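As a sketch of how the two files relate, the snippet below compares the package names in a hand-written list with those in a generated one and reports the names that only appear in the generated file, which are the sub-dependencies. The parsing is deliberately simplified (it ignores pip options and URL requirements), and the function names and sample versions are invented for illustration.

```python
import re


def package_names(requirements_text):
    """Return the normalised package names listed in requirements-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and options such as -r
            continue
        # The name is whatever precedes the first specifier, extra or marker.
        name = re.split(r"[=<>!~\[;\s]", line, maxsplit=1)[0]
        names.add(re.sub(r"[-_.]+", "-", name).lower())
    return names


def transitive_only(base_text, frozen_text):
    """Names in the generated file that nobody listed by hand."""
    return package_names(frozen_text) - package_names(base_text)


# Hypothetical file contents; the versions are made up.
base = "requests>=2.0\nSphinx\n"
frozen = "requests==2.22.0\nSphinx==2.3.1\nPygments==2.5.2\nurllib3==1.25.7\n"
print(sorted(transitive_only(base, frozen)))
```

Anything in that output that no base dependency actually needs is a candidate for an accidentally committed package.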

Going forward, I want to add some automated process around the projects to catch these errors earlier. The first thing to do is spin up a new virtual environment and run pip install -r requirements.txt on each branch push. After that, I want to put some linting in place. The code is still small enough for this not to be too daunting.
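The clean install step could be sketched roughly as below, assuming a script the CI job runs from the repository root. The function names are hypothetical, the trailing `pip check` is my own addition rather than part of the plan above, and the CI configuration that triggers this on each push is not shown.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir):
    """Path to the interpreter inside a virtual environment directory."""
    env_dir = Path(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_commands(env_dir, requirements="requirements.txt"):
    """Commands to run inside the fresh environment, in order."""
    python = str(venv_python(env_dir))
    return [
        [python, "-m", "pip", "install", "-r", requirements],
        [python, "-m", "pip", "check"],  # assumption: also verify compatibility
    ]


def clean_install(requirements="requirements.txt"):
    """Create a throwaway virtual environment and install into it.

    Returns True only if every command exits with status 0.
    """
    with tempfile.TemporaryDirectory() as env_dir:
        venv.create(env_dir, with_pip=True)
        for command in install_commands(env_dir, requirements):
            if subprocess.run(command, check=False).returncode != 0:
                return False
    return True
```

A CI job could call `clean_install()` and fail the build when it returns false, so a broken requirements file is caught on the branch rather than after merging.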