PDB/mmCIF parsers spawn multiple threads (thanks to numpy) #1397

@StefansM

Description

If the underlying BLAS library allows it, numpy will spawn a thread for every core on the machine as soon as it is imported. As far as I can tell, these are POSIX threads rather than Python threads, so the overhead is not huge, but this can still cause problems when a program is expected to run in a single thread. If you run 64 jobs on a 64-core machine, you will end up with 4096 threads, which will almost certainly breach the resource limits on a shared-use machine, causing your jobs to fail and your system administrator to be angry.

This behaviour can be disabled by setting some environment variables, but this is not particularly obvious, especially if biopython was installed by an administrator and the underlying dependencies aren't known to the user.

It's not clear to me what the best solution to this is, if any. I see a few options:

  1. Ignore it. This is the usual behaviour of numpy and it is up to the user to configure their system correctly.

  2. As above, but add a warning to the FAQ.

  3. Override the environment variables that define the number of threads before numpy is imported. This has the disadvantage of changing the environment variables in the user's code as well as in the library code.

  4. Whenever numpy is to be used, fork off a subprocess with the correct environment variables. This has all the overhead associated with forking and interprocess communication, and would probably just be a huge hassle.
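For option 3, a minimal sketch of what the override could look like. The variable names below cover the common BLAS backends; which one actually takes effect is an assumption that depends on how numpy was built (here, OpenBLAS):

```python
import os

# Cap BLAS/OpenMP thread pools *before* numpy is imported; once the BLAS
# library has initialised, these variables are ignored.
for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    # setdefault so a user's explicit setting is not clobbered
    os.environ.setdefault(var, "1")

import numpy  # must come after the environment is configured
```

Doing this inside a library would still mutate the caller's environment, which is exactly the disadvantage noted above.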

Personally, I found this non-obvious enough that it caused me some real problems with system resource limits (the default on the machines I'm using is 1024 threads or processes) and cost me some time debugging. I didn't expect a PDB parser to be spawning 64 threads, so I spent time combing through my own code first.

Example

I am using CPython on Linux-2.6.32-504.el6.x86_64-x86_64-with-redhat-6.6-Santiago and biopython 1.71.dev0. In one terminal, I open a python interpreter:

$ python
Python 3.6.1 (default, Jun 28 2017, 11:45:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.getpid()
27015
>>> 

I keep that interpreter open, and in a bash prompt on the same machine I check the number of threads that it is using:

$ ps --no-header -L -p 27015 | wc -l
1

As expected. Now, in the python interpreter I import numpy:

>>> import numpy

Back in the bash prompt, I check the number of threads again:

$ ps --no-header -L -p 27015 | wc -l
64

The same thing happens when I import Bio.PDB or any other module that uses numpy.
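The same check can be done from inside the interpreter itself, without a second terminal. This is a Linux-only sketch (the function name is my own) that reads /proc/self/status, since threading.active_count() only reports Python-level threads and would miss the POSIX threads spawned by the BLAS library:

```python
def os_thread_count():
    """Return the number of OS-level threads of this process (Linux only)."""
    with open("/proc/self/status") as fh:
        for line in fh:
            # The kernel reports the thread count as "Threads:\t<n>"
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("No 'Threads:' line found; is this Linux?")

print(os_thread_count())  # jumps from 1 to one-per-core after importing numpy
```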

Numpy configuration

In case it's relevant, numpy is linked against openblas:

>>> import numpy
>>> numpy.show_config()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
