How can we trust a Docker container that has been sent to us by a remote party and that performs a computation on our privacy-sensitive data? This is one of the questions you might ask when considering using VANTAGE6. If we take a closer look, there are three major requirements for trusting algorithms sent by other parties:
- As a data provider, you want to have control over which algorithms are allowed to run on your data.
- You also need to trust that the algorithms do what they advertise, for example that they do not send raw data records to the central server. (They cannot send data anywhere else: Docker algorithm containers have no internet connection of their own and can only reach the central server through a proxy server; see the sketch after this list.)
- Finally, you need to trust the source from which you retrieve the Docker image.
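To illustrate that isolation, the sketch below shows how a node could start an algorithm container on an internal Docker network, so that the proxy is the only endpoint it can reach. This is a minimal sketch using the Docker Python SDK; the image names, network name, and environment variable are assumptions for illustration, not the actual vantage6 setup.

```python
import docker

client = docker.from_env()

# An internal bridge network has no route to the outside world; containers
# attached to it can only talk to other containers on the same network.
client.networks.create("algo-isolated-net", driver="bridge", internal=True)

# Hypothetical proxy container that forwards allowed traffic to the central
# server. (In practice, the proxy would also be attached to a second network
# that does have internet access, so it can reach the central server.)
client.containers.run(
    "vantage6-node-proxy",          # hypothetical image name
    detach=True,
    name="node-proxy",
    network="algo-isolated-net",
)

# The algorithm container joins only the isolated network, so the proxy is
# the only service it can talk to; direct internet access is impossible.
client.containers.run(
    "harbor.vantage6.ai/iknl-legacy/dl_summary",
    detach=True,
    network="algo-isolated-net",
    environment={"PROXY_SERVER": "http://node-proxy"},  # assumed variable name
)
```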
Control Allowed Algorithms
Once your organization takes part in a collaboration on the central server, a researcher can send any (Docker) image name to your organization. This is a trust issue: you need to trust the researcher not to send a container that sends back all the raw data. In most cases, it is not acceptable to give third-party researchers such power.
To control which algorithms are allowed to run, you should be able to specify them in the configuration file, either by exact name (e.g., harbor.vantage6.ai/iknl-legacy/dl_summary) or by a pattern such as a regular expression.
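A minimal sketch of what such a check could look like at the node, assuming a hypothetical allowlist read from the node configuration file (the list name and entries are illustrative, not the actual configuration format):

```python
import re

# Hypothetical allowlist as it could appear in a node configuration file.
# Entries are either exact image names or regular expressions.
ALLOWED_ALGORITHMS = [
    "harbor.vantage6.ai/iknl-legacy/dl_summary",   # exact match
    r"^harbor\.vantage6\.ai/trusted/.*$",          # regex: any image in a trusted project
]

def is_allowed(image: str) -> bool:
    """Return True if the requested image matches an allowlist entry."""
    for entry in ALLOWED_ALGORITHMS:
        if image == entry or re.fullmatch(entry, image):
            return True
    return False

# A task for an image that is not on the allowlist would be refused by the node.
print(is_allowed("harbor.vantage6.ai/iknl-legacy/dl_summary"))  # True
print(is_allowed("docker.io/evil/steal_data"))                  # False
```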
Malicious Containers and Trusted Image Source
If a researcher or anyone else who develops an algorithm for VANTAGE6 has bad intentions, it would be possible to put (and even hide) malicious code in a (Docker) algorithm container (for example, see this paper). Data leaks may also occur when the researcher is simply unaware of the privacy issues of a particular algorithm.
To avoid data leaks from malicious containers, two things need to be in place:
- Review the code and algorithm thoroughly
- At the node, verify that the container is the one that has been reviewed and accepted by your organization
Reviewing the code and the algorithm could be done either by the community or by any professional you trust (possibly yourself). We are currently working on such a review process, but it is not ready to be shared yet.
To verify that the container pulled to the node is the same container that has been reviewed, we could make use of Docker Notary (Docker Content Trust), which is also available in Harbor, the registry package that we use at distributedlearning.ai.
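As a rough sketch of how such a check could be wired in at the node, the example below pulls an image with Docker Content Trust enabled, so that unsigned images are rejected. This is an assumption about how verification could be done, not the current vantage6 implementation; the image tag shown is only an example.

```python
import os
import subprocess

def pull_verified(image: str) -> bool:
    """Pull an image with Docker Content Trust enabled.

    With DOCKER_CONTENT_TRUST=1 the Docker CLI only accepts images that carry
    a valid Notary signature; pulling an unsigned image fails.
    """
    env = dict(os.environ, DOCKER_CONTENT_TRUST="1")
    result = subprocess.run(
        ["docker", "pull", image],
        env=env,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # The pull was rejected, e.g. because no trust data exists for this tag.
        print(f"Refusing to run {image}: {result.stderr.strip()}")
        return False
    return True

# Hypothetical usage at the node, before handing the image to the algorithm runner:
if pull_verified("harbor.vantage6.ai/iknl-legacy/dl_summary:1.0.0"):
    print("Image signature verified, safe to start the task.")
```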