User Manual

In this chapter we will discuss the parts of Fastr in more detail. We will give a more complete overview of the system and describe the more advanced features.

Tools

The Tools in Fastr are the building blocks of each workflow. A Tool represents a program/script/binary that can be called by Fastr and can be seen as a template. A Node can be created based on a Tool; the Node will be one processing step in a workflow, and the Tool defines what that step does.

When Fastr is imported, all available Tools are loaded into a default ToolManager that can be accessed via fastr.toollist. To get an overview of the tools in the system, simply print the repr() of the ToolManager:

>>> fastr.toollist
AddImages                v0.1    :  /home/hachterberg/dev/fastr/fastr/resources/tools/addimages/v1_0/addimages.xml
AddInt                   v0.1    :  /home/hachterberg/dev/fastr/fastr/resources/tools/addint/v1_0/addint.xml

As you can see, it gives the tool id, version, and the file from which it was loaded for each tool in the system. To view the layout of a tool, print the repr() of the tool itself:

>>> fastr.toollist['AddInt']
Tool AddInt v0.1 (Add two integers)
       Inputs          |       Outputs
---------------------------------------------
left_hand  (Int)       |  result   (Int)
right_hand (Int)       |

To add a Tool to the system, a file should be added to one of the paths in fastr.config.tools_path. The structure of a tool file is described in Tool description.
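
You can inspect the currently configured search paths from Python (the output shown is illustrative; the exact value depends on your installation):

>>> fastr.config.tools_path
['/home/hachterberg/dev/fastr/fastr/resources/tools']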

Create your own tool

There are 4 steps in creating a tool:

  1. CREATE FOLDERS. We will call the tool ThrowDie. Create the folder throw_die in the folder fastr-tools. In this folder create another folder called bin.

  2. PLACE EXECUTABLE IN CORRECT PLACE. In this example we will use a snippet of executable Python code:

    #!/usr/bin/env python
    import sys
    import random
    import json

    # Read the number of sides from the first command line argument, default to 6
    if len(sys.argv) > 1:
        sides = int(sys.argv[1])
    else:
        sides = 6

    # A single die throw; the list forms the output cardinality
    result = [random.randint(1, sides)]

    # Fastr collects the output by matching this line (see the xml below)
    print('RESULT={}'.format(json.dumps(result)))
    

    Save this text in a file called throw_die.py.

    Place the executable Python script in the folder throw_die/bin.

  3. CREATE AND EDIT XML FILE FOR TOOL.

    Put the following text in a file called throw_die.xml.

    <tool id="ThrowDie" description="Simulates a throw of a die. Number of sides of the die is provided by user"
          name="throw_die" version="1.0">
      <authors>
        <author name="John Doe" />
      </authors>
      <command version="1.0" >
        <authors>
          <author name="John Doe" url="http://a.b/c" />
        </authors>
        <targets>
          <target arch="*" bin="throw_die.py" interpreter="python" os="*" paths='bin/'/>
        </targets>
        <description>
           throw_die.py number_of_sides
           output = simulated die throw
        </description>
      </command>
      <interface>
        <inputs>
          <input cardinality="1" datatype="Int" description="Number of die sides" id="die_sides" name="die sides" nospace="False" order="0" required="True"/>
         </inputs>
        <outputs>
          <output id="output" name="output value" datatype="Int" automatic="True" cardinality="1" method="json" location="^RESULT=(.*)$" />
        </outputs>
      </interface>
    </tool>
    

    Put throw_die.xml in the folder throw_die. All attributes in the example above are required. For a complete overview of the xml attributes that can be used to define a tool, check the Tool description. The most important attributes in this xml are:

    id      : The id is used in Fastr to create an instance of your tool; this name will appear in the toollist when you type fastr.toollist.
    targets : This defines where the executables are located and on which platform they are available.
    inputs  : This defines the inputs that you want to be used in Fastr, how Fastr should use them and what data is allowed to be put in there.
    

    More xml examples can be found in the fastr-tools folder.

  4. EDIT CONFIGURATION FILE. Append the path [PATH TO LOCATION OF FASTR-TOOLS]/fastr-tools/throw_die/ to the tools_path in config.py (located in the ~/.fastr/ directory). See Config file for more information on configuration.
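
    Since config.py is a Python file, the change could look roughly like this (a sketch; keep the placeholder path for your own location, and see Config file for the exact override/extend semantics of tools_path):

    # in ~/.fastr/config.py
    tools_path = [
        '[PATH TO LOCATION OF FASTR-TOOLS]/fastr-tools/throw_die/',
    ]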

    You should now have a working tool. To test that everything is OK, do the following in Python:

    >>> import fastr
    >>> fastr.toollist
    

Now a list of available tools should be produced, including the tool ThrowDie.

To test the tool, create the script test_throwdie.py:

import fastr

# Build a small network: source -> throwdie -> sink
network = fastr.Network()
source1 = network.create_source(fastr.typelist['Int'], id_='source1')
sink1 = network.create_sink(fastr.typelist['Int'], id_='sink1')
throwdie = network.create_node(fastr.toollist['ThrowDie'], id_='throwdie')

link1 = network.create_link(source1.output, throwdie.inputs['die_sides'])
link2 = network.create_link(throwdie.outputs['output'], sink1.inputs['input'])

# Four samples, each giving the number of sides for one die throw
source_data = {'source1': {'s1': 4, 's2': 5, 's3': 6, 's4': 7}}
sink_data = {'sink1': 'vfs://tmp/fastr_result_{sample_id}.txt'}

network.draw_network()
network.execute(source_data, sink_data)

Call the script from the command line:

$ python test_throwdie.py

An image of the network will be created in the current directory and the result files will be put in the tmp directory. The result files are called fastr_result_s1.txt, fastr_result_s2.txt, fastr_result_s3.txt, and fastr_result_s4.txt.

Note

If you have code which is operating system dependent, you will have to edit the xml file. The following gives an example of how the elastix tool does this:

<targets>
      <target os="windows" arch="*" bin="elastix.exe">
        <paths>
          <path type="bin" value="vfs://apps/elastix/4.7/install/" />
          <path type="lib" value="vfs://apps/elastix/4.7/install/lib" />
        </paths>
      </target>
      <target os="linux" arch="*" modules="elastix/4.7" bin="elastix">
        <paths>
          <path type="bin" value="vfs://apps/elastix/4.7/install/" />
          <path type="lib" value="vfs://apps/elastix/4.7/install/lib" />
        </paths>
      </target>
      <target os="darwin" arch="*" modules="elastix/4.7" bin="elastix">
        <paths>
          <path type="bin" value="vfs://apps/elastix/4.7/install/" />
          <path type="lib" value="vfs://apps/elastix/4.7/install/lib" />
        </paths>
      </target>
   </targets>

vfs:// is the URL scheme of the virtual file system; more information can be found at VirtualFileSystem.

Network

A Network represents an entire workflow. It holds all Nodes, Links and other information required to execute the workflow. Networks can be visualized as a number of building blocks (the Nodes) and the links between them:

../_images/network_multi_atlas.svg

An empty network is easy to create; all you need to do is name it:

>>> network = fastr.Network(id_="network_name")

The Network is the main interface to Fastr; from it you can create all elements of a workflow. In the following sections the different elements of a Network will be described in more detail.

Node

Nodes are the points in the Network where the processing happens. A Node takes the input data and executes jobs as specified by the underlying Tool. A Node can be created in two different ways:

>>> node1 = fastr.Node(tool, id_='node1', parent=network)
>>> node2 = network.create_node(tool, id_='node2', stepid='step1')

In the first way, we explicitly create a Node object, passing it an id and the parent network. If the parent is None, fastr.current_network will be used. The Node constructor will automatically add the new node to the parent network.

Note

For a Node, the tool can be given either as the Tool object or as the id of the tool.

In the second way, we tell the network to create a Node. The network will automatically assign itself as the parent. Optionally, you can define a stepid for the node, which is a logical grouping of Nodes that is mostly used for visualization.

A Node contains Inputs and Outputs. To see the layout of the Node one can simply look at the repr():

>>> addint = fastr.Node(fastr.toollist['AddInt'], id_='addint')
>>> addint
Node addint (tool: AddInt v1.0)
       Inputs          |       Outputs
---------------------------------------------
left_hand  (Int)       |  result   (Int)
right_hand (Int)       |

The inputs and outputs are located in mappings with the same name:

>>> addint.inputs
InputDict([('left_hand', <Input: fastr:///networks/unnamed_network/nodelist/addint/inputs/left_hand>), ('right_hand', <Input: fastr:///networks/unnamed_network/nodelist/addint/inputs/right_hand>)])

>>> addint.outputs
OutputDict([('result', Output fastr:///networks/unnamed_network/nodelist/addint/outputs/result)])

The InputDict and OutputDict are classes that behave like mappings. The InputDict also facilitates the linking shorthand: by assigning an Output to an existing key, a Link is created between that Output and the corresponding Input.
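
For example, the following shorthand is equivalent to creating a Link explicitly with network.create_link (it assumes a SourceNode source1 exists in the same network):

>>> addint.inputs['left_hand'] = source1.output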

SourceNode

A SourceNode is a special kind of node that forms the start of a workflow. SourceNodes are given data at run-time, which is fetched via IOPlugins. At creation, only the datatype of the data that the SourceNode supplies needs to be known. Creating a SourceNode is very similar to creating an ordinary node:

>>> source1 = fastr.SourceNode('Int', id_='source1')
>>> source2 = network.create_source(fastr.typelist['Int'], id_='source2', stepid='step1')

In both cases, the source is automatically assigned to a network: in the first case to fastr.current_network and in the second case to the network used to call the method. A SourceNode only has a single output, which has shortcut access via source.output.
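
For instance, linking a source to a node input can then be written concisely (a sketch; it assumes the addint node from the previous section lives in the same network):

>>> link = network.create_link(source2.output, addint.inputs['left_hand'])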

Note

For a source or constant node, the datatype can be given both as the BaseDataType class or the id of the datatype.

ConstantNode

A ConstantNode is another special node. It is a subclass of the SourceNode and has a similar function. However, instead of setting the data at run-time, the data of a constant is given at creation and saved in the object. Creating a ConstantNode is similar to creating a source, but with the data supplied:

>>> constant1 = fastr.ConstantNode('Int', [42], id_='constant1')
>>> constant2 = network.create_constant('Int', [42], id_='constant2', stepid='step1')

Often, when a ConstantNode is created, it is created specifically for one input and will not be reused. In this case there is a shorthand to create and link a constant to an input:

>>> addint.inputs['value1'] = [42]

will create a constant node with the value 42 and a link between its output and the input addint.inputs['value1'].
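
This shorthand is equivalent to creating the constant and the link explicitly (a sketch; the id_ used here is a hypothetical name):

>>> constant = network.create_constant('Int', [42], id_='addint_const_value1')
>>> link = network.create_link(constant.output, addint.inputs['value1'])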

SinkNode

The SinkNode is the counterpart of the SourceNode. Instead of getting data into the workflow, it saves the data resulting from the workflow. For this, a rule has to be given at run-time that determines where to store the data. The information about how to create such a rule is described at SinkNode.set_data. At creation time, only the datatype has to be specified:

>>> sink1 = fastr.SinkNode('Int', id_='sink1')
>>> sink2 = network.create_sink(fastr.typelist['Int'], id_='sink2', stepid='step1')
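
The rule itself is a URL template in which fields such as {sample_id} are substituted per sample, as used earlier in this chapter:

>>> sink_data = {'sink1': 'vfs://tmp/fastr_result_{sample_id}.txt'}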

Data Flow

The data enters the Network via SourceNodes, flows via other Nodes and leaves the Network via SinkNodes. The flow between Nodes goes from an Output via a Link to an Input. In the following image it is simple to track the data from the SourceNodes on the left to the SinkNodes on the right side:

../_images/network1.svg

Note that the data in Fastr is stored in the Output; the Link and Input just give access to it (possibly transforming the data on the way).

Data flow inside a Node

In a Node all data from the Inputs will be combined and the jobs will be generated. There are strict rules about how this combination is performed. In the default case all inputs will be used pair-wise, and if there is only a single value for an input, it will be considered a constant.

To illustrate this, we will consider the following Tool (note this is a simplified version of the real tool):

>>> fastr.toollist['Elastix']
Tool Elastix v4.8 (Elastix Registration)
                         Inputs                            |             Outputs
----------------------------------------------------------------------------------------------
fixed_image       (ITKImageFile)                           |  transform (ElastixTransformFile)
moving_image      (ITKImageFile)                           |
parameters        (ElastixParameterFile)                   |

Also, it is important to know that for this tool (by definition) the cardinality of the transform Output will match the cardinality of the parameters Input.

If we supply a Node based on this Tool with a single sample on each Input, there will be one single matching Output sample created:

../_images/flow_simple_one_sample.svg

If the cardinality of the parameters sample were increased to 2, the cardinality of the resulting transform sample would also become 2:

../_images/flow_simple_one_sample_two_cardinality.svg

Now if the number of samples for fixed_image were increased to 3, the moving_image and parameters would be considered constant and repeated, resulting in 3 transform samples.

../_images/flow_simple_three_sample.svg

Then, if the number of samples for moving_image is also increased to 3, the moving_image and fixed_image will be used pair-wise and the parameters will be considered constant.

../_images/flow_simple_three_sample_two_cardinality.svg

Advanced flows in a Node

Sometimes the default pair-wise behaviour is not desirable, for example if you want to test all combinations of certain input samples. To achieve this, we can change the input_group of Inputs to set them apart from the rest. By default, all Inputs are assigned to the default input group. Now let us change that:

>>> node = network.create_node('Elastix', id_='elastix')
>>> node.inputs['moving_image'].input_group = 'moving'

This will result in moving_image being put in a different input group. Now if we supply fixed_image with 3 samples and moving_image with 4 samples, instead of an error we would get the following result:

../_images/flow_cross_three_sample.svg

Warning

TODO: Expand this section with the merging dimensions

Data flows in an Input

If an Input has multiple Links attached to it, the data will be combined by concatenating the values of each corresponding sample in the cardinality.
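
For example, two Outputs can feed the same Input by creating two Links to it; each Link then contributes its values to the cardinality of the combined samples (a sketch; the addimages node and its images input are hypothetical names):

>>> link_a = network.create_link(source1.output, addimages.inputs['images'])
>>> link_b = network.create_link(source2.output, addimages.inputs['images'])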

Broadcasting (matching data of different dimensions)

Sometimes you might want to combine data that does not have the same number of dimensions. As long as all dimensions of the lower dimensional datasets match a dimension in the higher dimensional dataset, this can be achieved using broadcasting. The term broadcasting is borrowed from NumPy and described as:

“The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.”

NumPy manual on broadcasting

In Fastr it works similarly, but it is used to combine different Inputs in an InputGroup. To illustrate broadcasting it is best to use an example; the following network uses broadcasting in the transformix Node:

../_images/network_multi_atlas.svg

As you can see this visualization prints the dimensions for each Input and Output (e.g. the elastix.fixed_image Input has dimensions [N]). To explain what happens in more detail, we present an image illustrating the details for the samples in elastix and transformix:

../_images/flow_broadcast.svg

In the figure the moving_image (and references to it) are identified with different colors, so they are easy to track across the different steps.

At the top the Inputs for the elastix Node are illustrated. Because the input groups are set differently, output samples are generated for all combinations of fixed_image and moving_image (see Advanced flows in a Node for details).

In the transformix Node, we want to combine a list of samples that is related to the moving_image (it has the same dimension name and sizes) with the resulting transform samples from the elastix Node. As you can see, the sizes of the sample collections do not match ([N] vs [N x M]). This is where broadcasting comes into play: it allows the system to match these related sample collections. Because all the dimensions in [N] are known in [N x M], it is possible to match them uniquely. This is done automatically and the result is a new [N x M] sample collection. To create a matching sample collection, the samples in the transformix.image Input are reused as indicated by the colors.

Warning

Note that this might fail when there are data-blocks with non-unique dimension names, as it will not be clear which of the dimensions with identical names should be matched!

DataTypes

In Fastr all data is contained in objects of a specific type. The types in Fastr are represented by classes that subclass BaseDataType. There are a few other classes under BaseDataType that are each a base class for a family of types:

  • DataType – The base class for all types that hold data
    • ValueType – The base class for types that contain simple data (e.g. Int, String) that can be represented as a str
    • EnumType – The base class for all types that are a choice from a set of options
    • URLType – The base class for all types that have their data stored in files (which are referenced by URL)
  • TypeGroup – The base class for all types that actually represent a group of types
../_images/datatype_diagram.svg

Fig. 2 The relation between the different DataType classes

The types are defined in xml files and created by the DataTypeManager. The DataTypeManager acts as a container holding all Fastr types. It is automatically instantiated as fastr.typelist. The created DataType classes are also automatically placed in the fastr.datatypes module.
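
Both access routes give the same class (the output shown is illustrative):

>>> fastr.typelist['Int']
<class 'fastr.datatypes.Int'>
>>> from fastr.datatypes import Int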

Resolving Datatypes

Outputs in Fastr can have a TypeGroup or a number of DataTypes associated with them. The final DataType used will depend on the linked Inputs. The DataType resolving works as a two-step procedure:

  1. All possible DataTypes are determined and considered as options.
  2. The best possible DataType from the options is selected for non-automatic Outputs.

The options are defined as the intersection of the set of possible values for the Output and each separate Input connected to the Output. Given the resulting options, there are three scenarios:

  • If there are no valid DataTypes (options is empty) the result will be None.
  • If there is a single valid DataType, then this is automatically the result (even if it is not a preferred DataType).
  • If there are multiple valid DataTypes, then the preferred DataTypes are used to resolve conflicts.

There are a number of places where the preferred DataTypes can be set; these are used in the order given:

  1. The preferred keyword argument to match_types
  2. The preferred types specified in the fastr.config
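
As an illustrative sketch (hypothetical call; the exact location and signature of match_types may differ, here it is assumed to be reachable via the DataTypeManager):

>>> # resolve the best matching DataType, steering ties with `preferred`
>>> fastr.typelist.match_types(fastr.typelist['Int'], fastr.typelist['Int'], preferred=[fastr.typelist['Int']])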

Execution

Executing a Network is very simple:

>>> source_data = {'source_id1': ['val1', 'val2'],
                   'source_id2': {'id3': 'val3', 'id4': 'val4'}}
>>> sink_data = {'sink_id1': 'vfs://some_output_location/{sample_id}/file.txt'}
>>> network.execute(source_data, sink_data)

The Network.execute method takes a dict of source data and a dict of sink data as arguments. The dictionaries should have a key for each SourceNode and SinkNode.

The execution of a Network uses a layered model:

  • Network.execute will analyze the Network and call all Nodes.
  • Node.execute will create jobs and fill their payload.
  • execute_job will execute the job on the execution machine and resolve any deferred values (val:// urls).
  • Tool.execute will find the correct target, call the interface and, if required, resolve vfs:// urls.
  • Interface.execute will actually run the required command(s).

The ExecutionPlugin will call the executionscript.py for each job, passing the job as a gzipped pickle file. The executionscript.py will resolve deferred values and then call Tool.execute, which analyses the required target and executes the underlying Interface. The Interface actually executes the job and collects the results. The result is returned (via the Tool) to the executionscript.py. There we save the result, provenance and profiling in a new gzipped pickle file. The execution system will use a callback to load the data back into the Network.

The selection and settings of the ExecutionPlugin are defined in the fastr config.

Continuing a Network

Normally a random temporary directory is created for each run. To continue a previously stopped/crashed network, you should call the Network.execute method using the same temporary directory (tmpdir). You can set the temporary directory to a fixed value using the following code:

>>> tmpdir = '/tmp/example_network_rerun'
>>> network.execute(source_data, sink_data, tmpdir=tmpdir)

Warning

Be aware that, at this moment, Fastr will only rerun jobs for which not all output files are present or for which the job/tool parameters have been changed. It will not rerun if the input data of the node has changed or if the actual tools have been adjusted. In these cases you should remove the output files of these nodes to force a rerun.

IOPlugins

Sources and sinks are used to get data in and out of a Network during execution. To make data retrieval and storage easier, a plugin system was created that selects different plugins based on the URL scheme used. For example, a URL starting with vfs:// will be handled by the VirtualFileSystem plugin. A list of all the IOPlugins known by the system and their use can be found at the IOPlugin Reference.
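
For example, source data may also be given as URLs; the scheme of the URL then selects the IOPlugin used to fetch the data (a sketch; the path is illustrative):

>>> source_data = {'source1': 'vfs://tmp/input_value.txt'}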

Naming Convention

For the naming convention of the tools we tried to stay close to the Python PEP 8 coding style. In short, we define tool names like classes, so they should be UpperCamelCase. The inputs and outputs of a tool we consider to be like function or method arguments; these should be named lower_case_with_underscores.

An overview of the mapping of Fastr to PEP 8:

Fastr construct    Python PEP 8 equivalent    Examples
Network.id         module                     brain_tissue_segmentation
Tool.id            class                      BrainExtractionTool, ThresholdImage
Node.id            variable name              brain_extraction, threshold_mask
Input/Output.id    method                     image, number_of_classes, probability_image

Furthermore there are some small guidelines:

  • No input or output in the input or output names. This is already specified when setting or getting the data.

  • Add the type of the output that is named, e.g. enum, string, flag, image:

    • No File in the input/output name (passing files around is what Fastr was developed for).
    • No type necessary where the type is implied, e.g. lower_threshold, number_of_levels, max_threads.

  • Where possible/useful, use the full name instead of an abbreviation.

Provenance

For every derived data object, Fastr records the provenance. The SinkNode writes provenance records next to every data object it writes out. The records contain information on what operations were performed to obtain the resulting data object.

W3C Prov

The provenance is recorded using the W3C Prov Data Model (PROV-DM). Behind the scenes we are using the Python prov implementation.

The PROV-DM defines 3 Starting Point Classes and their relating properties. See Fig. 3 for a graphical representation of the classes and their relations.

../_images/provo.svg

Fig. 3 The three Starting Point classes and the properties that relate them. The diagrams in this document depict Entities as yellow ovals, Activities as blue rectangles, and Agents as orange pentagons. The responsibility properties are shown in pink. [*]

Implementation

In the workflow document, the provenance classes map to Fastr concepts in the following way:

Agent:    Fastr, Networks, Tools, Nodes
Activity: Jobs
Entity:   Data

Usage

The provenance is stored in ProvDocument objects in pickles. The convenience command line tool fastr prov can be used to extract the provenance in PROV-N notation, which can be serialized to PROV-JSON and PROV-XML. The provenance document can also be visualized using the fastr prov command line tool.

Footnotes

[*] This picture and caption are taken from http://www.w3.org/TR/prov-o/. “Copyright © 2011-2013 World Wide Web Consortium, (MIT, ERCIM, Keio, Beihang). http://www.w3.org/Consortium/Legal/2015/doc-license”