The changes that vantage6 version 5 ("Uluru") brings are very beneficial to algorithm developers in the long run. These changes also have the consequence that algorithms written for version 4 need to be updated so that they may be used in version 5. Here, we provide a step-by-step guide to update your algorithms.
Which changes in v5 require algorithm changes?
One of the biggest changes in Uluru is the introduction of sessions. If you haven't heard about this, we recommend reading our blog post on sessions before continuing with this article. Briefly, sessions split the responsibility of an algorithm into different, clearly defined functionalities: data extraction, preprocessing and compute. Where every function in a v4 algorithm had to handle all of these functionalities by itself, these responsibilities are split up into different functions in v5. The requirements for sessions cause most of the changes needed when migrating from v4 to v5 - however, as we will see, they make algorithms more flexible, easier to connect to different data sources, and easier to maintain.
A second change - optional but recommended - is to migrate from a setup.py-based Python package to a modern pyproject.toml structure. Neither v4 nor v5 has hard requirements for using one or the other, but if you were using v6 algorithm create to help you create an algorithm, note that running this command in v5 will give you a pyproject.toml.
Start migrating: update or create a new algorithm
Normally, minor changes in recommended algorithm code can be applied to your algorithm by using the following command:
v6 algorithm update
This command applies the latest changes in the algorithm template to your own algorithm. In principle, this command may also be run to update algorithms from version 4 to version 5. However, the changes are so large this time that this might not be the ideal workflow. An alternative therefore would be to start fresh using:
v6 algorithm create
Since this command essentially means starting from scratch, it would require copy-pasting your algorithm code into the new template, which is not ideal either.
Which of the strategies - update+modify or create+copy/paste - you choose depends on your own preference and on the algorithm. For small algorithms, copy-pasting may be easier than it is for large algorithms with proper testing. For most production algorithms, we would therefore recommend the update strategy.
Splitting up your algorithm into session steps
The main difference between a v4 and a v5 algorithm is that data extraction and compute are separated into distinct functions. Additionally, you may specify preprocessing functions, though if you always want to apply the same preprocessing after extracting data, you could also make that preprocessing part of the data extraction function. That would ensure that all dataframes contain properly processed data.
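To make the split concrete, here is a minimal sketch of the three steps in plain pandas, without any vantage6 decorators. The function names and the age column are hypothetical and only serve as illustration; in a real v5 algorithm each step would be a separately decorated function, as described in the sections below.

```python
import pandas as pd


def extract(records: list[dict]) -> pd.DataFrame:
    """Data extraction: turn a raw source into a dataframe."""
    return pd.DataFrame(records)


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocessing: return a cleaned dataframe that can be reused."""
    return df.dropna()


def column_mean(df: pd.DataFrame, column_name: str) -> dict:
    """Compute: return an aggregate result, not the rows themselves."""
    return {"mean": float(df[column_name].mean())}


# Hypothetical raw data with one missing value
raw_records = [{"age": 30}, {"age": None}, {"age": 50}]

session_df = drop_missing(extract(raw_records))  # extracted and cleaned once
result = column_mean(session_df, "age")          # can be repeated on the same dataframe
```

The point of the split is that the dataframe produced by the first two steps can be reused by any number of compute steps.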
Data extraction
In version 4, data could be extracted automatically for common database types such as CSV, Excel, Parquet and SQL. For instance, the following is a valid algorithm function in v4:
import pandas as pd
from vantage6.algorithm.tools.decorators import data


@data()
def partial_average(df: pd.DataFrame, column_name: str):
    local_sum = float(df[column_name].sum())
    local_count = len(df[column_name])
    return {"sum": local_sum, "count": local_count}
These algorithm functions would only work in v4 for databases where the extraction of the data could be handled automatically by the vantage6 infrastructure. Projects using other databases that wanted to use these algorithm functions would need to modify the algorithm code to be able to use it. They could then use the @database_connection decorator to get database connection details - but such a function would then be only usable in that project and not outside of it.
In v5, data extraction has to be performed in separate functions, because it initializes a dataframe in a session that can then be reused. In order to not require everyone to write their own data extraction functions for simple cases, however, vantage6 provides functions for simple data extraction. These can be used as follows:
# in your algorithm's __init__.py file
from vantage6.algorithm.data_extraction import read_csv
from vantage6.algorithm.data_extraction import read_excel
from vantage6.algorithm.data_extraction import read_parquet
from vantage6.algorithm.data_extraction import read_sql_database
from vantage6.algorithm.data_extraction import read_sparql_database
These functions can then be called directly by the user to create a dataframe in v5.
However, as mentioned, the simple data extraction protocols do not cover all use cases. In v5, it is easier to create your own custom data extraction function. Let's say you want to read a CSV file and allow the user to specify some columns that they want to exclude from the dataframe. Such a function could look like this:
import pandas as pd
from vantage6.algorithm.decorator.action import data_extraction


@data_extraction
def read_csv(connection_details: dict, drop_columns: list[str] | None = None) -> pd.DataFrame:
    database_uri = connection_details["uri"]  # e.g. /path/to/my/data.csv
    df = pd.read_csv(database_uri)
    if drop_columns is not None:
        df = df.drop(columns=drop_columns)
    return df
Note that a data extraction function (like the preprocessing function) returns a dataframe, not a JSON result like all functions in v4. The result of all data extraction and preprocessing functions is not sent to the server, but stored as a local file on the node. So don't feel like you are sharing node data - this is not the case!
The connection_details dictionary is inserted by the vantage6 infrastructure; the drop_columns argument is specified by the user when they request the data extraction. The contents of the connection_details dictionary depend on the node configuration, but always contain the uri and type keys. To give an example, let's say that the node configuration contains the following database configuration:
databases:
  serviceBased:
    - name: my_postgres_db
      uri: postgres://vantage6-postgres:5432/vantage6
      type: sql
      env:
        user: vantage6-db
        password: vantage6-is-awesome
If the user requests the my_postgres_db database, that would lead to the following connection_details dictionary being passed to the data extraction function:
{
    "uri": "postgres://vantage6-postgres:5432/vantage6",
    "type": "sql",
    "user": "vantage6-db",
    "password": "vantage6-is-awesome"
}
Since the type and uri keys are obligatory in the node configuration, they are always present in the connection_details dictionary. The entries under the env key in the node configuration are added as additional keys, like user and password in the example above.
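The mapping from node configuration to connection_details can be pictured with a few lines of Python. This is only a sketch of the flattening shown in the example above: build_connection_details is a hypothetical helper written for this post, not the function that vantage6 uses internally.

```python
def build_connection_details(db_config: dict) -> dict:
    """Hypothetical helper: flatten one database entry from the node config."""
    # uri and type are obligatory, so they are always copied over
    details = {"uri": db_config["uri"], "type": db_config["type"]}
    # entries under env become additional top-level keys
    details.update(db_config.get("env", {}))
    return details


node_db_config = {
    "name": "my_postgres_db",
    "uri": "postgres://vantage6-postgres:5432/vantage6",
    "type": "sql",
    "env": {"user": "vantage6-db", "password": "vantage6-is-awesome"},
}

connection_details = build_connection_details(node_db_config)
```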
You can extend the simple example above as much as you want. For instance, you could add a query key to the connection_details dictionary to allow the user to specify a SQL query to execute on the database, or create a function that gets certain query parameters from the user to create a safe SPARQL query - the sky is the limit!
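As an illustration of taking query parameters from the user safely, here is a minimal sketch that uses SQLite from the Python standard library instead of a real project database. The table name patients, the column whitelist and the function name are assumptions made for this example, and the vantage6 decorator is left out so that the snippet runs on its own. The value is passed as a bound parameter, and the column name is checked against a whitelist because column names cannot be bound.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

import pandas as pd

ALLOWED_COLUMNS = {"age", "weight"}  # hypothetical whitelist


def read_rows_at_least(
    connection_details: dict, column: str, min_value: float
) -> pd.DataFrame:
    """Select rows where `column` >= `min_value` without string-built values."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Column not allowed: {column}")
    query = f"SELECT * FROM patients WHERE {column} >= ? ORDER BY {column}"
    with closing(sqlite3.connect(connection_details["uri"])) as conn:
        return pd.read_sql_query(query, conn, params=(min_value,))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small throwaway database to query
    db_path = Path(tmp_dir) / "demo.sqlite"
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE patients (age INTEGER, weight REAL)")
        conn.executemany(
            "INSERT INTO patients VALUES (?, ?)",
            [(34, 70.5), (51, 82.0), (67, 64.2)],
        )
        conn.commit()

    details = {"uri": str(db_path), "type": "sql"}
    older_patients = read_rows_at_least(details, "age", 50)
```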
Preprocessing
Preprocessing functions are a new element in v5. In v4, they would simply be a part of the algorithm function itself. For example, if you wanted to compute the average BMI in your dataset, but only had height and weight columns, you would have to compute the BMI in the algorithm function itself, before being able to compute the average BMI.
In v5, you can specify a preprocessing function that will add the BMI column to the dataframe and store it on the node. Such a function may be implemented as follows:
import pandas as pd
from vantage6.algorithm.decorator.action import preprocessing


@preprocessing
def compute_BMI(df1: pd.DataFrame, weight_col: str, height_col: str) -> pd.DataFrame:
    # BMI = weight / height^2 (weight in kg, height in m)
    df1["BMI"] = df1[weight_col] / (df1[height_col] ** 2)
    return df1
By running this preprocessing function, you only need to compute the BMI column once, and you can reuse the same dataframe for other computations - maybe you later also want to know the average BMI for a different subset of your dataset?
Of course, if you don't want to add a separate preprocessing function, you can also just compute the BMI in the compute function. That way, you need to redo it for every compute task, which for some cases could be computationally expensive. Also, your algorithm function would become more complex and less reusable: if you run this preprocessing function, you can use a generic average function to compute the average BMI. A function that computes only an average BMI is far more specific and therefore less likely to be reusable by you yourself, your project partners and the vantage6 community.
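The reusability argument can be shown with plain pandas. In the sketch below, the decorators are left out and the data is made up; local_average is a hypothetical generic function that knows nothing about BMI, yet works on the BMI column once preprocessing has added it.

```python
import pandas as pd


def add_bmi(df: pd.DataFrame, weight_col: str, height_col: str) -> pd.DataFrame:
    """Preprocessing step: add a BMI column (weight in kg, height in m)."""
    out = df.copy()
    out["BMI"] = out[weight_col] / (out[height_col] ** 2)
    return out


def local_average(df: pd.DataFrame, column_name: str) -> dict:
    """Generic compute step: works for any numeric column."""
    return {
        "sum": float(df[column_name].sum()),
        "count": int(df[column_name].count()),
    }


# Hypothetical data
df = pd.DataFrame({"weight": [80.0, 45.0], "height": [2.0, 1.5]})
df = add_bmi(df, "weight", "height")

bmi_stats = local_average(df, "BMI")        # uses the preprocessed column
weight_stats = local_average(df, "weight")  # same function, different column
```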
There are also some preprocessing functions provided by the vantage6 infrastructure. These can be used as follows:
# in your algorithm's __init__.py file
# function to filter a column by min and/or max value
from vantage6.algorithm.preprocessing import filter_range
# import all preprocessing functions
from vantage6.algorithm.preprocessing import *
You can see all available preprocessing functions in the vantage6 repository.
Compute
In v4, every function was ultimately a compute function: each one returned a dictionary with the results of the computation. The only distinction was that some functions were central compute functions and others were federated (partial) compute functions. However, this distinction was not enforced by the infrastructure - a partial function had all the same permissions as a central compute function.
In v5, algorithm developers are required to specify the type of compute function they want to write. They can indicate this by using the @central and @federated decorators. The infrastructure will then provide these functions with the correct permissions. For example, a central function is allowed to create subtasks, while a federated function is not.
Other than that, compute functions in v5 are very similar to compute functions in v4. Just note that v4's @data decorator is renamed to @dataframe in v5. These are examples of central and federated compute functions in v5:
from vantage6.algorithm.decorator.algorithm_client import algorithm_client
from vantage6.algorithm.decorator.action import central
from vantage6.algorithm.client import AlgorithmClient


@central
@algorithm_client
def central_function(client: AlgorithmClient, column_name: str):
    org_ids = [organization.get("id") for organization in client.organization.list()]
    task = client.task.create(
        method="federated_average",
        arguments={"column_name": column_name},
        organizations=org_ids,
        name="My subtask",
    )
    results = client.wait_for_results(task_id=task.get("id"))
    return _aggregate_results(results)

import pandas as pd
from vantage6.algorithm.decorator.action import federated
from vantage6.algorithm.decorator.data import dataframe


@federated
@dataframe(1)
def federated_average(df1: pd.DataFrame, column_name: str):
    local_sum = float(df1[column_name].sum())
    local_count = len(df1[column_name])
    return {"average": local_sum / local_count}
That's all for this blog as far as sessions and compute functions are concerned. Below, we will detail the remaining changes for v5 algorithms.
Generate an algorithm.json file easily
In both v4 and v5, a JSON file is used to describe the algorithm and how to run it. This file is required if you want to upload your algorithm to the algorithm store. Having your algorithm in the algorithm store can make it easier to find and use by others, but also allows users in your collaboration to run the algorithm in the vantage6 UI, which is usually a lot easier than running it via the Python client or the API.
In v4, you would have to create this JSON file by hand. In v5, you can use the following command to generate it automatically:
v6 algorithm generate-store-json
This command will look at your algorithm code and generate a JSON file that describes the algorithm. Note that it is still important to check whether the content of the generated file is correct, as it is not always able to infer the correct information from the algorithm code. Doing so is most easily done when uploading the file to the algorithm store in the vantage6 UI.
Other changes
There are not many other changes for algorithms in v5. We wanted to briefly mention the following:
- In your Dockerfile, make sure that you update the base image to harbor2.vantage6.ai/infrastructure/algorithm-base:5.0. If you keep using the v4 base image, your algorithm will no longer work!
- The use of VPNs for node-to-node communication is no longer supported. If you were using that, you will need to use a different method to communicate between your nodes. We are planning to introduce a new, much better way to communicate between nodes in the future.
- The test files have been updated. Whether you have run v6 algorithm create or v6 algorithm update, note that the scripts are somewhat different now. Specifically, the MockAlgorithmClient from v4 has been replaced by a MockNetwork that separates the responsibilities of the client and the server more clearly. If you had extensive testing in a v4 algorithm, you may need to update your tests to work with the new MockNetwork.
Final thoughts
The changes in v5 make algorithms more flexible and easier to maintain, which will benefit algorithm developers in the long run. For now, we hope this guide will help you to migrate your algorithms to v5. Good luck!