User Manual¶
In this chapter we will discuss the parts of Fastr in more detail. We will give a more complete overview of the system and describe the more advanced features.
Tools¶
The Tools in Fastr are the building blocks of each workflow. A tool represents a program/script/binary that can be called by Fastr and can be seen as a template. A Node can be created based on a Tool. The Node will be one processing step in a workflow, and the tool defines what the step does.
On the import of Fastr, all available Tools will be loaded in a default ToolManager that can be accessed via fastr.toollist. To get an overview of the tools in the system, just print the repr() of the ToolManager:
>>> fastr.toollist
AddImages v0.1 : /home/hachterberg/dev/fastr/fastr/resources/tools/addimages/v1_0/addimages.xml
AddInt v0.1 : /home/hachterberg/dev/fastr/fastr/resources/tools/addint/v1_0/addint.xml
As you can see it gives the tool id, version and the file from which it was loaded for each tool in the system.
To view the layout of a tool, just print the repr() of the tool itself.
>>> fastr.toollist['AddInt']
Tool AddInt v0.1 (Add two integers)
Inputs | Outputs
---------------------------------------------
left_hand (Int) | result (Int)
right_hand (Int) |
To add a Tool to the system, a file should be added to one of the paths in fastr.config.tools_path. The structure of a tool file is described in Tool description.
Create your own tool¶
There are 4 steps in creating a tool:
CREATE FOLDERS. We will call the tool ThrowDie. Create the folder throw_die in the folder fastr-tools. In this folder create another folder called bin.
PLACE EXECUTABLE IN CORRECT PLACE. In this example we will use a snippet of executable python code:
#!/usr/bin/env python
import sys
import random
import json

if len(sys.argv) > 1:
    sides = int(sys.argv[1])
else:
    sides = 6
result = [int(random.randint(1, sides))]
print('RESULT={}'.format(json.dumps(result)))
Save this text in a file called throw_die.py. Place the executable python script in the folder throw_die/bin.
CREATE AND EDIT XML FILE FOR TOOL.
Put the following text in a file called throw_die.xml.

<tool id="ThrowDie" description="Simulates a throw of a die. Number of sides of the die is provided by user" name="throw_die" version="1.0">
  <authors>
    <author name="John Doe" />
  </authors>
  <command version="1.0">
    <authors>
      <author name="John Doe" url="http://a.b/c" />
    </authors>
    <targets>
      <target arch="*" bin="throw_die.py" interpreter="python" os="*" paths='bin/'/>
    </targets>
    <description>
      throw_die.py number_of_sides
      output = simulated die throw
    </description>
  </command>
  <interface>
    <inputs>
      <input cardinality="1" datatype="Int" description="Number of die sides" id="die_sides" name="die sides" nospace="False" order="0" required="True"/>
    </inputs>
    <outputs>
      <output id="output" name="output value" datatype="Int" automatic="True" cardinality="1" method="json" location="^RESULT=(.*)$" />
    </outputs>
  </interface>
</tool>
Put throw_die.xml in the folder throw_die. All attributes in the example above are required. For a complete overview of the xml attributes that can be used to define a tool, check the Tool description. The most important attributes in this xml are:

id : The id is used in FASTR to create an instance of your tool; this name will appear in the toollist when you type fastr.toollist.
targets : This defines where the executables are located and on which platform they are available.
inputs : This defines the inputs that you want to be used in FASTR, how FASTR should use them and what data is allowed to be put in there.
More xml examples can be found in the fastr-tools folder.
EDIT CONFIGURATION FILE. Append the line
[PATH TO LOCATION OF FASTR-TOOLS]/fastr-tools/throw_die/
to the tools_path in the config.py (located in the ~/.fastr/ directory). See Config file for more information on configuration.

You should now have a working tool. To test that everything is ok, do the following in python:
>>> import fastr
>>> fastr.toollist
Now a list of available tools should be produced, including the tool throw_die.
To test the tool, create the script test_throwdie.py:
import fastr

network = fastr.Network()
source1 = network.create_source(fastr.typelist['Int'], id_='source1')
sink1 = network.create_sink(fastr.typelist['Int'], id_='sink1')
throwdie = network.create_node(fastr.toollist['ThrowDie'], id_='throwdie')

link1 = network.create_link(source1.output, throwdie.inputs['die_sides'])
link2 = network.create_link(throwdie.outputs['output'], sink1.inputs['input'])

source_data = {'source1': {'s1': 4, 's2': 5, 's3': 6, 's4': 7}}
sink_data = {'sink1': 'vfs://tmp/fastr_result_{sample_id}.txt'}

network.draw_network()
network.execute(source_data, sink_data)
Call the script from the command line:

$ python test_throwdie.py
An image of the network will be created in the current directory and result files will be put in the tmp directory. The result files are called fastr_result_s1.txt, fastr_result_s2.txt, fastr_result_s3.txt, and fastr_result_s4.txt.
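The automatic output in the tool XML works by matching the location regular expression against the tool's stdout and decoding the captured group with the json method. A rough sketch of that collection step (the helper collect_json_output is hypothetical, not Fastr's actual code):

```python
import json
import re

def collect_json_output(stdout, pattern=r'^RESULT=(.*)$'):
    """Scan tool stdout for a RESULT= line and decode its JSON payload."""
    for line in stdout.splitlines():
        match = re.match(pattern, line)
        if match:
            return json.loads(match.group(1))
    return None

# A die throw that printed RESULT=[4] would be collected as the value [4]
print(collect_json_output('RESULT=[4]\n'))  # [4]
```

Lines that do not match the pattern are ignored, which is why the script can print other diagnostics without disturbing the result collection.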
Note
If you have code which is operating system dependent you will have to edit the xml file. The following gives an example of how the elastix tool does this:
<targets>
<target os="windows" arch="*" bin="elastix.exe">
<paths>
<path type="bin" value="vfs://apps/elastix/4.7/install/" />
<path type="lib" value="vfs://apps/elastix/4.7/install/lib" />
</paths>
</target>
<target os="linux" arch="*" modules="elastix/4.7" bin="elastix">
<paths>
<path type="bin" value="vfs://apps/elastix/4.7/install/" />
<path type="lib" value="vfs://apps/elastix/4.7/install/lib" />
</paths>
</target>
<target os="darwin" arch="*" modules="elastix/4.7" bin="elastix">
<paths>
<path type="bin" value="vfs://apps/elastix/4.7/install/" />
<path type="lib" value="vfs://apps/elastix/4.7/install/lib" />
</paths>
</target>
</targets>
vfs is the virtual file system path; more information can be found at VirtualFileSystem.
Network¶
A Network represents an entire workflow. It holds all Nodes, Links and other information required to execute the workflow. Networks can be visualized as a number of building blocks (the Nodes) and the links between them:
An empty network is easy to create, all you need is to name it:
>>> network = fastr.Network(id_="network_name")
The Network is the main interface to Fastr; from it you can create all elements of a workflow. In the following sections the different elements of a Network will be described in more detail.
Node¶
Nodes are the points in the Network where the processing happens. A Node takes the input data and executes jobs as specified by the underlying Tool. A Node can be created in two different ways:
>>> node1 = fastr.Node(tool, id_='node1', parent=network)
>>> node2 = network.create_node(tool, id_='node2', stepid='step1')
In the first way, we specifically create a Node object. We pass it an id and the parent network. If the parent is None, the fastr.current_network will be used. The Node constructor will automatically add the new node to the parent network.
Note
For a Node, the tool can be given both as the Tool class or the id of the tool.
In the second way, we tell the network to create a Node. The network will automatically assign itself as the parent. Optionally you can define a stepid for the node, which is a logical grouping of Nodes that is mostly used for visualization.
A Node contains Inputs and Outputs. To see the layout of the Node one can simply look at the repr():
>>> addint = fastr.Node(fastr.toollist['AddInt'], id_='addint')
>>> addint
Node addint (tool: AddInt v1.0)
Inputs | Outputs
---------------------------------------------
left_hand (Int) | result (Int)
right_hand (Int) |
The inputs and outputs are located in mappings with the same name:
>>> addint.inputs
InputDict([('left_hand', <Input: fastr:///networks/unnamed_network/nodelist/addint/inputs/left_hand>), ('right_hand', <Input: fastr:///networks/unnamed_network/nodelist/addint/inputs/right_hand>)])
>>> addint.outputs
OutputDict([('result', Output fastr:///networks/unnamed_network/nodelist/addint/outputs/result)])
The InputDict and OutputDict are classes that behave like mappings. The InputDict also facilitates the linking shorthand: by assigning an Output to an existing key, the InputDict will create a Link between the Input and the Output.
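Conceptually, the shorthand can be pictured as a mapping that intercepts assignment and records a Link instead of overwriting the key. The following is a simplified sketch with stand-in Link and InputDict classes, not Fastr's actual implementation:

```python
class Link:
    """Minimal stand-in: records which output feeds which input."""
    def __init__(self, source, target):
        self.source = source
        self.target = target

class InputDict(dict):
    """Assigning a value to an existing key creates a Link instead of replacing the Input."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []

    def __setitem__(self, key, output):
        if key not in self:
            raise KeyError('unknown input: {}'.format(key))
        self.links.append(Link(output, self[key]))

inputs = InputDict(left_hand='<Input: left_hand>')
inputs['left_hand'] = '<Output: result>'  # creates a Link, the Input itself is untouched
print(len(inputs.links))  # 1
```

The key point of the shorthand is that assignment has a side effect (link creation) rather than replacing the stored Input object.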
SourceNode¶
A SourceNode is a special kind of node that is the start of a workflow. The SourceNodes are given data at run-time that is fetched via IOPlugins. On creation, only the datatype of the data that the SourceNode supplies needs to be known. Creating a SourceNode is very similar to creating an ordinary node:
>>> source1 = fastr.SourceNode('Int', id_='source1')
>>> source2 = network.create_source(fastr.typelist['Int'], id_='source2', stepid='step1')
In both cases, the source is automatically assigned to a network; in the first case to the fastr.current_network and in the second case to the network used to call the method. A SourceNode only has a single output, which has short-cut access via source.output.
Note
For a source or constant node, the datatype can be given both as the BaseDataType class or the id of the datatype.
ConstantNode¶
A ConstantNode is another special node. It is a subclass of the SourceNode and has a similar function. However, instead of setting the data at run-time, the data of a constant is given at creation and saved in the object. Creating a ConstantNode is similar to creating a source, but with the data supplied:
>>> constant1 = fastr.ConstantNode('Int', [42], id_='constant1')
>>> constant2 = network.create_constant('Int', [42], id_='constant2', stepid='step1')
Often, when a ConstantNode is created, it is created specifically for one input and will not be reused. In this case there is a shorthand to create and link a constant to an input:
>>> addint.inputs['value1'] = [42]
will create a constant node with the value 42 and create a link between its output and the input addint.value1.
SinkNode¶
The SinkNode is the counterpart of the source node. Instead of getting data into the workflow, it saves the data resulting from the workflow. For this a rule has to be given at run-time that determines where to store the data. The information about how to create such a rule is described at SinkNode.set_data. At creation time, only the datatype has to be specified:
>>> sink1 = fastr.SinkNode('Int', id_='sink1')
>>> sink2 = network.create_sink(fastr.typelist['Int'], id_='sink2', stepid='step1')
Link¶
Links indicate how the data flows between Nodes. Links can be created explicitly using one of the following:
>>> link = fastr.Link(node1.outputs['image'], node2.inputs['image'])
>>> link = network.create_link(node1.outputs['image'], node2.inputs['image'])
or can be created implicitly by assigning an Output to an Input in the InputDict:
# This style of assignment will create a Link similar to above
>>> node2.inputs['image'] = node1.outputs['image']
Note that a Link is also created automatically when using the short-hand for the ConstantNode.
Data Flow¶
The data enters the Network via SourceNodes, flows via other Nodes and leaves the Network via SinkNodes. The flow between Nodes goes from an Output via a Link to an Input. In the following image it is simple to track the data from the SourceNodes on the left to the SinkNodes on the right side:
Note that the data in Fastr is stored in the Output; the Link and Input just give access to it (possibly while transforming the data).
Data flow inside a Node¶
In a Node all data from the Inputs will be combined and the jobs will be generated. There are strict rules for how this combination is performed. In the default case all inputs will be used pair-wise, and if there is only a single value for an input, it will be considered a constant. To illustrate this we will consider the following Tool (note this is a simplified version of the real tool):
>>> fastr.toollist['Elastix']
Tool Elastix v4.8 (Elastix Registration)
Inputs | Outputs
----------------------------------------------------------------------------------------------
fixed_image (ITKImageFile) | transform (ElastixTransformFile)
moving_image (ITKImageFile) |
parameters (ElastixParameterFile) |
Also it is important to know that for this tool (by definition) the cardinality of the transform Output will match the cardinality of the parameters Input.
If we supply a Node based on this Tool with a single sample on each Input, there will be one single matching Output sample created:
If the cardinality of the parameters sample were increased to 2, the resulting transform sample would also have cardinality 2:
Now if the number of samples on fixed_image were increased to 3, the moving_image and parameters would be considered constant and be repeated, resulting in 3 transform samples.
Then, if the number of samples for moving_image is also increased to 3, the moving_image and fixed_image will be used pairwise and the parameters will be constant.
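The pair-wise-with-constants rule described above can be sketched in plain Python: inputs with a single sample are repeated as constants, and all other inputs must have matching sample counts (combine_pairwise is a hypothetical helper, not part of the Fastr API):

```python
def combine_pairwise(**inputs):
    """Combine input sample lists pair-wise; single-sample inputs act as constants."""
    sizes = {len(values) for values in inputs.values() if len(values) > 1}
    if len(sizes) > 1:
        raise ValueError('non-constant inputs must have equal sample counts')
    count = sizes.pop() if sizes else 1
    # Build one job per sample index; constants are reused for every job
    return [{name: (values[i] if len(values) > 1 else values[0])
             for name, values in inputs.items()}
            for i in range(count)]

# Three fixed images with one moving image and one parameter file -> 3 jobs
jobs = combine_pairwise(fixed_image=['f1', 'f2', 'f3'],
                        moving_image=['m1'],
                        parameters=['p1'])
print(len(jobs))  # 3
```

With three samples on both fixed_image and moving_image, the helper would pair them index by index, matching the last example above.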
Advanced flows in a Node¶
Sometimes the default pairwise behaviour is not desirable, for example if you want to test all combinations of certain input samples. To achieve this we can change the input_group of Inputs to set them apart from the rest. By default all Inputs are assigned to the default input group. Now let us change that:
>>> node = network.create_node('Elastix', id_='elastix')
>>> node.inputs['moving_image'].input_group = 'moving'
This will result in moving_image being put in a different input group. Now if we supplied fixed_image with 3 samples and moving_image with 4 samples, instead of an error we would get the following result:
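The sample count that results from separate input groups is the Cartesian product of the groups, which itertools.product illustrates (plain Python, independent of Fastr):

```python
from itertools import product

fixed_images = ['f1', 'f2', 'f3']          # input group 'default'
moving_images = ['m1', 'm2', 'm3', 'm4']   # input group 'moving'

# Every combination of fixed and moving image becomes an output sample: 3 x 4 = 12
combinations = list(product(fixed_images, moving_images))
print(len(combinations))  # 12
```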
Warning
TODO: Expand this section with the merging dimensions
Data flows in a Link¶
As mentioned before, the data flows from an Output to an Input through a Link. By default the Link passes the data as-is; however, there are two special directives that change the shape of the data:
Collapsing flow: this collapses certain dimensions from the sample array into the cardinality. As a user you have to specify the dimension or tuple of dimensions you want to collapse. This is useful in situations where you want to use a tool that aggregates over a number of samples (e.g. takes a mean or sum). To achieve this you can set the collapse property of the Link as follows:

>>> link.collapse = 'dim1'
>>> link.collapse = ('dim1', 'dim2')  # In case you want to collapse multiple dimensions
Expanding flow: this turns the cardinality into a new dimension. The new dimension will be named after the Output from which the link originates, in the form {nodeid}__{outputid}. This flow directive is useful if you want to split a large sample into multiple smaller samples. This could be because processing the whole sample is not feasible due to resource constraints. An example would be splitting a 3D image into slices to process separately, to avoid high memory use or to achieve parallelism. To achieve this you can set the expand property of the Link to True:

>>> link.expand = True
Note
Both collapsing and expanding can be used on the same link; it will execute similarly to an expand-collapse sequence, but the newly created expand dimension is ignored in the collapse.
>>> link.collapse = 'dim1'
>>> link.expand = True
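The effect of the two directives can be sketched on nested lists, where the outer list plays the role of a one-dimensional sample array and the inner lists hold the cardinality (collapse and expand are hypothetical helpers, not the Fastr API):

```python
def collapse(samples):
    """Collapse a dimension: N samples of any cardinality become 1 sample
    whose cardinality is the concatenation of all values."""
    return [[value for sample in samples for value in sample]]

def expand(samples):
    """Expand: each value in the cardinality becomes its own sample."""
    return [[value] for sample in samples for value in sample]

data = [[1], [2], [3]]        # three samples, cardinality 1
print(collapse(data))         # [[1, 2, 3]]  -> one sample, cardinality 3
grouped = [[1, 2, 3]]         # one sample, cardinality 3
print(expand(grouped))        # [[1], [2], [3]]  -> three samples, cardinality 1
```

In this simplified picture, expand and collapse are inverse operations, which matches the intuition of moving values between the sample array and the cardinality.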
Data flows in an Input¶
If an Input has multiple Links attached to it, the data will be combined by concatenating the values for each corresponding sample in the cardinality.
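This concatenation can be sketched in plain Python, with each link delivering a list of samples and each sample being a list of values (combine_links is a hypothetical helper):

```python
def combine_links(*link_data):
    """Concatenate per-sample values from multiple links into one cardinality."""
    return [sum(samples, []) for samples in zip(*link_data)]

link_a = [[1], [2]]      # two samples, cardinality 1
link_b = [[10], [20]]    # two samples, cardinality 1
print(combine_links(link_a, link_b))  # [[1, 10], [2, 20]]
```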
Broadcasting (matching data of different dimensions)¶
Sometimes you might want to combine data that does not have the same number of dimensions. As long as all dimensions of the lower dimensional datasets match a dimension in the higher dimensional dataset, this can be achieved using broadcasting. The term broadcasting is borrowed from NumPy and described as:
“The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.”
In fastr it works similarly, but is used to combine different Inputs in an InputGroup. To illustrate broadcasting it is best to use an example; the following network uses broadcasting in the transformix Node:
As you can see, this visualization prints the dimensions for each Input and Output (e.g. the elastix.fixed_image Input has dimensions [N]). To explain what happens in more detail, we present an image illustrating the details for the samples in elastix and transformix:
In the figure the moving_image (and references to it) are identified with different colors, so they are easy to track across the different steps.

At the top, the Inputs for the elastix Node are illustrated. Because the input groups are set differently, output samples are generated for all combinations of fixed_image and moving_image (see Advanced flows in a Node for details).
In the transformix Node, we want to combine a list of samples that is related to the moving_image (it has the same dimension name and sizes) with the resulting transform samples from the elastix Node. As you can see, the sizes of the sample collections do not match ([N] vs [N x M]). This is where broadcasting comes into play: it allows the system to match these related sample collections. Because all the dimensions in [N] are known in [N x M], it is possible to match them uniquely. This is done automatically and the result is a new [N x M] sample collection. To create matching sample collections, the samples in the transformix.image Input are reused as indicated by the colors.
Warning
Note that this might fail when there are data-blocks with non-unique dimension names, as it will not be clear which of the dimensions with identical names should be matched!
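The matching itself can be sketched with dictionaries keyed by dimension indices: every [N x M] sample is paired with the [N] sample that shares its first index, so the lower-dimensional samples are reused M times (a plain-Python illustration with made-up sample names):

```python
# transform samples laid out as [N x M] (N=2, M=3); moving images as [N]
transforms = {(n, m): 'transform_{}_{}'.format(n, m)
              for n in range(2) for m in range(3)}
images = {(n,): 'moving_{}'.format(n) for n in range(2)}

# Broadcasting: every (n, m) transform is paired with the image for its n index,
# so each image sample is reused M times
pairs = {key: (images[(key[0],)], transforms[key]) for key in transforms}
print(pairs[(1, 2)])  # ('moving_1', 'transform_1_2')
```

The match is unambiguous here because the dimension N of the smaller collection occurs exactly once in [N x M]; with duplicated dimension names the lookup key would no longer be unique, which is what the warning above refers to.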
DataTypes¶
In Fastr all data is contained in objects of a specific type. The types in Fastr are represented by classes that subclass BaseDataType. There are a few other classes under BaseDataType that are each a base class for a family of types:

DataType – The base class for all types that hold data
TypeGroup – The base class for all types that actually represent a group of types
The types are defined in xml files and created by the DataTypeManager. The DataTypeManager acts as a container holding all Fastr types. It is automatically instantiated as fastr.typelist. In fastr the created DataType classes are also automatically placed in the fastr.datatypes module once created.
Resolving Datatypes¶
Outputs in fastr can have a TypeGroup or a number of DataTypes associated with them. The final DataType used will depend on the linked Inputs. The DataType resolving works as a two-step procedure:
1. All possible DataTypes are determined and considered as options.
2. The best possible DataType from the options is selected for non-automatic Outputs.
The options are defined as the intersection of the set of possible DataTypes for the Output and each separate Input connected to the Output. Given the resulting options, there are three scenarios:
- If there are no valid DataTypes (the options set is empty), the result will be None.
- If there is a single valid DataType, then this is automatically the result (even if it is not a preferred DataType).
- If there are multiple valid DataTypes, then the preferred DataTypes are used to resolve conflicts.
There are a number of places where the preferred DataTypes can be set; these are used in the order given:

- The preferred keyword argument to match_types
- The preferred types specified in the fastr.config
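The two-step resolution can be sketched with set intersections (resolve_datatype is a hypothetical helper; the real resolution also honours the preference sources listed above):

```python
def resolve_datatype(output_types, input_types_per_link, preferred=()):
    """Intersect the output's possible types with every connected input's types,
    then use the preferred order to break ties."""
    options = set(output_types)
    for input_types in input_types_per_link:
        options &= set(input_types)
    if not options:
        return None                    # no valid DataType
    if len(options) == 1:
        return options.pop()           # single valid DataType wins automatically
    for datatype in preferred:         # multiple candidates: preference decides
        if datatype in options:
            return datatype
    return sorted(options)[0]          # deterministic fallback for the sketch

result = resolve_datatype({'NiftiImage', 'ITKImageFile'},
                          [{'ITKImageFile'}, {'ITKImageFile', 'NiftiImage'}])
print(result)  # ITKImageFile
```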
Execution¶
Executing a Network is very simple:
>>> source_data = {'source_id1': ['val1', 'val2'],
'source_id2': {'id3': 'val3', 'id4': 'val4'}}
>>> sink_data = {'sink_id1': 'vfs://some_output_location/{sample_id}/file.txt'}
>>> network.execute(source_data, sink_data)
The Network.execute method takes a dict of source data and a dict of sink data as arguments. The dictionaries should have a key for each SourceNode or SinkNode.
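The sink rules are templates in which a field such as {sample_id} is substituted for every sample; the substitution behaves like Python's str.format (a plain-Python sketch, not Fastr's internal code):

```python
rule = 'vfs://some_output_location/{sample_id}/file.txt'
sample_ids = ['id3', 'id4']

# Each sample gets its own storage location derived from the rule
targets = [rule.format(sample_id=sid) for sid in sample_ids]
print(targets[0])  # vfs://some_output_location/id3/file.txt
```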
The execution of a Network uses a layered model:

- Network.execute will analyze the Network and call all Nodes.
- Node.execute will create jobs and fill their payload.
- execute_job will execute the job on the executing machine and resolve any deferred values (val:// urls).
- Tool.execute will find the correct target, call the interface and, if required, resolve vfs:// urls.
- Interface.execute will actually run the required command(s).
The ExecutionPlugin will call the executionscript.py for each job, passing the job as a gzipped pickle file. The executionscript.py will resolve deferred values and then call Tool.execute, which analyses the required target and executes the underlying Interface. The Interface actually executes the job and collects the results. The result is returned (via the Tool) to the executionscript.py. There we save the result, provenance and profiling in a new gzipped pickle file. The execution system will use a callback to load the data back into the Network.
The selection and settings of the ExecutionPlugin
are defined in the fastr config.
Continuing a Network¶
Normally a random temporary directory is created for each run. To continue a previously stopped/crashed network, you should call the Network.execute method using the same temporary directory. You can set the temporary directory to a fixed value using the following code:
>>> tmpdir = '/tmp/example_network_rerun'
>>> network.execute(source_data, sink_data, tmpdir=tmpdir)
Warning
Be aware that at this moment, Fastr will only rerun jobs for which not all output files are present or for which the job/tool parameters have been changed. It will not rerun if the input data of the node has changed or the actual tools have been adjusted. In these cases you should remove the output files of these nodes to force a rerun.
IOPlugins¶
Sources and sinks are used to get data in and out of a Network during execution. To make the data retrieval and storage easier, a plugin system was created that selects different plugins based on the URL scheme used. So, for example, a url starting with vfs:// will be handled by the VirtualFileSystem plugin. A list of all the IOPlugins known by the system and their use can be found at IOPlugin Reference.
Naming Convention¶
For the naming convention of the tools we tried to stay close to the Python PEP 8 coding style. In short, we define tool names like classes, so they should be UpperCamelCase. The inputs and outputs of a tool we consider as functions or method arguments; these should be named lower_case_with_underscores.
An overview of the mapping of Fastr to PEP 8:
Fastr construct | Python PEP8 equivalent | Examples |
---|---|---|
Network.id | module | brain_tissue_segmentation |
Tool.id | class | BrainExtractionTool, ThresholdImage |
Node.id | variable name | brain_extraction, threshold_mask |
Input/Output.id | method | image, number_of_classes, probability_image |
Furthermore there are some small guidelines:

- No input or output in the input or output names. This is already specified when setting or getting the data.
- Add the type of the output that is named, i.e. enum, string, flag, image.
- No File in the input/output name (passing files around is what Fastr was developed for).
- No type where the type is implied, i.e. lower_threshold, number_of_levels, max_threads.
- Where possible/useful use the full name instead of an abbreviation.
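These two styles can be checked mechanically with regular expressions (a plain-Python sketch; the patterns are our reading of the convention, not something Fastr enforces):

```python
import re

def is_valid_tool_id(name):
    """Tool ids follow class style: UpperCamelCase."""
    return re.fullmatch(r'[A-Z][a-zA-Z0-9]*', name) is not None

def is_valid_io_id(name):
    """Input/output ids follow argument style: lower_case_with_underscores."""
    return re.fullmatch(r'[a-z][a-z0-9_]*', name) is not None

print(is_valid_tool_id('ThresholdImage'))   # True
print(is_valid_io_id('number_of_classes'))  # True
print(is_valid_io_id('OutputImage'))        # False
```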
Provenance¶
For every derived data object, Fastr records the provenance. The SinkNode writes provenance records next to every data object it writes out. The records contain information on what operations were performed to obtain the resulting data object.
W3C Prov¶
The provenance is recorded using the W3C Prov Data Model (PROV-DM). Behind the scenes we are using the python prov implementation.
The PROV-DM defines 3 Starting Point Classes and their relating properties. See Fig. 3 for a graphic representation of the classes and the relations.
Implementation¶
In the workflow document the provenance classes map to fastr concepts in the following way:
Agent: | Fastr, Networks, Tools, Nodes |
---|---|
Activity: | Jobs |
Entities: | Data |
Usage¶
The provenance is stored in ProvDocument objects in pickles. The convenience command line tool fastr prov can be used to extract the provenance in the PROV-N notation, which can be serialized to PROV-JSON and PROV-XML. The provenance document can also be visualized using the fastr prov command line tool.
Footnotes
[*] | This picture and caption is taken from http://www.w3.org/TR/prov-o/ . “Copyright © 2011-2013 World Wide Web Consortium, (MIT, ERCIM, Keio, Beihang). http://www.w3.org/Consortium/Legal/2015/doc-license“ |