PDB/mmCIF parsers spawn multiple threads (thanks to numpy) #1397

@StefansM

Description

If the underlying BLAS library allows it, numpy will spawn a thread for every core on the machine as soon as it is imported. As far as I can tell, these are POSIX threads rather than Python threads, so the overhead is not huge, but this can still cause problems when a program is expected to run in a single thread. If you run 64 jobs on a 64-core machine, you will end up with 4096 threads, which will almost certainly breach the resource limits on a shared-use machine, causing your jobs to fail and your system administrator to be angry.

This behaviour can be disabled by setting some environment variables, but this is not particularly obvious, especially if biopython was installed by an administrator and the underlying dependencies aren't known to the user.

It's not clear to me what the best solution to this is, if any. I see a few options:

  1. Ignore it. This is the usual behaviour of numpy and it is up to the user to configure their system correctly.

  2. As above, but add a warning to the FAQ.

  3. Override the environment variables that define the number of threads before numpy is imported. This has the disadvantage of changing the environment variables in the user's code as well as in the library code.

  4. Whenever numpy is to be used, fork off a subprocess with the correct environment variables. This has all the overhead associated with forking and interprocess communication, and would probably just be a huge hassle.
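For option 3, a minimal sketch of what the override could look like. The variable names below cover the common BLAS backends; which one actually takes effect is an assumption that depends on how numpy was built (here, OpenBLAS):

```python
import os

# Cap BLAS/OpenMP thread pools *before* numpy is imported; once the BLAS
# library has initialised, these variables are ignored.
for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    # setdefault so a user's explicit setting is not clobbered
    os.environ.setdefault(var, "1")

import numpy  # must come after the environment is configured
```

Doing this inside a library would still mutate the caller's environment, which is exactly the disadvantage noted above.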

Personally, I found this non-obvious enough that it caused me some real problems with system resource limits (the default on the machines I'm using is 1024 threads or processes) and cost me some time debugging. I didn't expect a PDB parser to be spawning 64 threads, so I spent time combing through my own code first.

Example

I am using CPython on Linux-2.6.32-504.el6.x86_64-x86_64-with-redhat-6.6-Santiago and biopython 1.71.dev0. In one terminal, I open a python interpreter:

$ python
Python 3.6.1 (default, Jun 28 2017, 11:45:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.getpid()
27015
>>> 

I keep that interpreter open, and in a bash prompt on the same machine I check the number of threads that it is using:

$ ps --no-header -L -p 27015 | wc -l
1

As expected. Now, in the python interpreter I import numpy:

>>> import numpy

Back in the bash prompt, I check the number of threads again:

$ ps --no-header -L -p 27015 | wc -l
64

The same thing happens when I import Bio.PDB or any other module that uses numpy.
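The same check can be done from inside the interpreter itself, without a second terminal. This is a Linux-only sketch (the function name is my own) that reads /proc/self/status, since threading.active_count() only reports Python-level threads and would miss the POSIX threads spawned by the BLAS library:

```python
def os_thread_count():
    """Return the number of OS-level threads of this process (Linux only)."""
    with open("/proc/self/status") as fh:
        for line in fh:
            # The kernel reports the thread count as "Threads:\t<n>"
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("No 'Threads:' line found; is this Linux?")

print(os_thread_count())  # jumps from 1 to one-per-core after importing numpy
```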

Numpy configuration

In case it's relevant, numpy is linked against openblas:

>>> import numpy
>>> numpy.show_config()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
