# Welcome to FAB (Flash Analysis for Beamtimes)
The purpose of this library is to give users convenient access to, and analysis tools for, the data generated at the Free Electron Laser FLASH.

It abstracts the details of loading the data, which would otherwise require accessing the multiple HDF5 files generated by the DAQ during the beamtime. It also provides easy access to the Maxwell cluster resources, so that parallel computation can be performed effortlessly.
The code repo can be found at: https://gitlab.desy.de/fabiano.lever/flashanalysis/
# Installation
If you use fab on the Maxwell cluster or through the jupyterhub, you can find fab
already installed in the 'flash' kernel (for jhub) or in the flash environment on Maxwell.
To activate the environment, simply run `module load flash flashenv` in your terminal.
**NOTE**: If you use VS Code to SSH into Maxwell, you can select the flash Python interpreter
by its path: `/software/ps/flash/envs/flash/bin/python`
# Quickstart
A brief introduction to the ideas behind the module and some info on how to get started. For an even quicker introduction, have a look at the notebooks in the example folder.
In most cases, in order to use `fab`, you need to provide a configuration file
specifying the kind of data you want to load, how you want to load it, and additional
parameters on how `fab` should behave. Let's have a look at a quick example:
```toml
#fab_config.toml

[instruments.ursa.delay_set]
__type__ = 'fab.datasources.HDFSource'
hdf_key = '/zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup'
fillna_method = 'ffill'

[instruments.ursa.GMD]
__type__ = 'fab.datasources.GMD'
data_key = "/FL2/Photon Diagnostic/GMD/Pulse resolved energy/energy hall"
calibration_key = "/FL2/Photon Diagnostic/GMD/Average energy/energy hall"

[instruments.ursa.eTof]
__type__ = 'fab.datasources.SlicedADC'
hdf_key = '/FL2/Experiment/MTCA-EXP1/ADQ412 GHz ADC/CH00/TD'
offset = 2246
window = 3000
period = 9969.23
t0 = 0.153
dt = 0.0005
baseline = 200
dim_names = ['shot_id', 'eTof_trace']
```
In this file, we tell `fab` that we want to create a new instrument called `ursa`, and
that instrument should contain three data variables called `delay_set`, `GMD` and `eTof`.
We are configuring the `delay_set` loader to look for data in the HDF5 table
`/zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup`,
and to fill missing values with the `ffill` method (that is, missing values are filled
with the last valid value). To find the HDF paths for the available data, please
open one of the HDF files with hdfview or similar software, or ask the local contact
of your beamline for help.
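
To illustrate what forward-filling does (this is not fab-specific code; the filling happens inside the loader), here is a minimal sketch using xarray's own `ffill`:

```python
# Minimal illustration of forward-filling (requires xarray's optional
# bottleneck dependency): each NaN takes the last valid value.
import numpy as np
import xarray as xr

vals = xr.DataArray([1.0, np.nan, np.nan, 4.0], dims='train_id')
print(vals.ffill(dim='train_id').values)  # -> [1. 1. 1. 4.]
```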
The `eTof` and `GMD` values are also loaded from the HDF files, but in this case we
ask `fab` for a more sophisticated loading strategy, implemented in the
`fab.datasources.SlicedADC` and `fab.datasources.GMD` classes. Refer to the `fab.datasources`
documentation for more info.
We suggest you create a single config file for each beamtime, placing it in the shared folder, so that each participant can use the same configuration.
After we have defined what data we want, we are ready to load it:
```python
from fab.magic import config, beamtime, ursa

result = ursa.load(daq_run=43861)
```
The `fab.magic` module attempts to do a few things for you. By importing `config`, it looks for a
configuration file named `fab_config.toml` (or any file matching the pattern `fab_config*.toml`)
in the current directory or in one of its parents. It uses the first one it finds. The `beamtime`
import gives access to the beamtime run table created via `fablive`. If you are in a gpfs beamtime
directory, the beamtime number is automatically detected. Otherwise you can specify it as e.g.
`beamtime=11013355` in the config file.
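
For illustration, the config lookup behaves roughly like the sketch below; the actual discovery logic lives inside `fab.magic`, so treat this only as a mental model:

```python
# Rough sketch of the fab_config*.toml discovery, for illustration only.
from pathlib import Path
from typing import Optional

def find_config(start: Optional[Path] = None) -> Optional[Path]:
    """Walk from the current directory upwards and return the first
    file matching fab_config*.toml, or None if nothing is found."""
    start = start or Path.cwd()
    for folder in (start, *start.parents):
        matches = sorted(folder.glob('fab_config*.toml'))
        if matches:
            return matches[0]
    return None
```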
**NOTE**: The `fab.magic` module provides a convenient way to quickly set up the analysis
environment. It makes some assumptions on how you use `fab`. If you prefer a more structured
setup or finer control over the configuration process, please refer to the `fab.settings`,
`fab.instruments` and `fab.beamtime` modules.
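
As a hypothetical sketch of such a structured setup (the exact import path and constructor signature are assumptions; the `Instrument(instr_dict)` pattern reappears in the Dask examples below, so check `fab.instruments` for the real API):

```python
# Hypothetical non-magic setup: build the instrument from a plain dict
# instead of relying on fab.magic. Verify the API in fab.instruments.
from fab.instruments import Instrument

instr_dict = {
    'delay_set': {
        '__type__': 'fab.datasources.HDFSource',
        'hdf_key': '/zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup',
        'fillna_method': 'ffill',
    },
}

ursa = Instrument(instr_dict)
result = ursa.load(daq_run=43861)
```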
Finally, the `ursa` import instantiates the instrument we defined in the config file, which we can
then use to load the data. Calling `load` with no arguments will load all available data.
To access the data, we can simply use:
```python
result.delay_set
result.eTof
```
Depending on its size, the data will either already be in RAM or be represented by
a lazy `dask.array` object. You can force the data to be preloaded in RAM (or force it not to be)
by passing the `preload_values` attribute. Please go through the advantages and disadvantages of this
approach by reading the documentation in `fab.datasources`, as it can have a catastrophic impact on
performance if used incorrectly.
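
A hypothetical per-source setting could look like this; where exactly `preload_values` belongs in the config is an assumption, so check `fab.datasources` before relying on it:

```toml
# Hypothetical example: the placement of preload_values is an assumption.
[instruments.ursa.delay_set]
__type__ = 'fab.datasources.HDFSource'
hdf_key = '/zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup'
fillna_method = 'ffill'
preload_values = true   # keep this source's values in RAM
```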
To force the data to be loaded into RAM, we can use the `.compute()` method. Please refer to the
xarray and dask documentation for more information. When using fab on one of the Maxwell cluster
login nodes, a job will be automatically submitted and the computation will be performed on
the cluster.
Please have a look at the `fab.datasources` and `fab.instruments` modules for more detailed information
about how to configure `fab` and tailor it to your needs. The `settings` and `maxwell` modules'
documentation will help you write more complex configuration files and customize how
fab uses the Maxwell HPC cluster.
**NOTE**: You don't need to be on Maxwell to use this module. If you have the HDF files on your local
machine, you can configure `fab` to load those files by using the `hdf_path` configuration parameter.
`fab` will then happily run on your local machine.
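
A minimal sketch, assuming `hdf_path` sits at the top level of the config file (the exact section is an assumption, and the directory below is a placeholder):

```toml
# Hypothetical placement of hdf_path; the directory is a placeholder.
hdf_path = '/path/to/local/hdf/files'
```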
# xarray, dask and the Maxwell HPC cluster
## xarray
`xarray` (shortened to `xr`) is a wrapper around numpy arrays that allows labelling data. The data
loaded by the `fab` library is returned in the form of `xr.DataArray` or `xr.Dataset` objects.
This allows easy indexing of the data (label-based indexing instead of positional indexing) and
offers much of the analysis functionality of pandas, extended to data with more than two dimensions.
Please refer to the official documentation of `xarray` for more info about the analysis API.
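
A minimal sketch of what label-based indexing looks like; the dimension names mirror the eTof config above, while the coordinate values are made up for illustration:

```python
# Label-based indexing with xarray; dims mirror the eTof example above,
# coordinate values are made up.
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.random.rand(3, 4),
    dims=['shot_id', 'eTof_trace'],
    coords={'shot_id': [10, 11, 12]},
)

arr.sel(shot_id=11)         # select by label instead of position
arr.mean(dim='shot_id')     # reduce over a dimension by name
```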
## Dask
Dask is a Python tool that allows easy parallelization of large tasks that might exceed the
resources of a normal workstation. It can be integrated with HPC clusters so that computations
happen remotely on the cluster and only the results are returned.
In general, the data contained in the objects loaded by `fab` is given as a `dask.array`. That
means that all operations and analysis will be computed lazily, only when needed. For many tasks,
the computation is triggered automatically when the data needs to be displayed (e.g. when plotting),
so that `dask.array` objects can be used in the same way one uses `np.ndarray`.

In case you wish to trigger a computation manually, all you have to do is call the `.compute()`
method on the array:
```python
data = Instrument(instr_dict).load(daq_run=43861)  # A dataset made up of dask arrays
raw_data = data.compute()  # Performs the computation; values are now in-memory numpy arrays
```
Please be aware that computing large datasets will be slow and could lead to memory errors. If working with large datasets, it's best to do all the analysis on the lazy arrays and compute the result at the end, only after the data has been reduced. This way the computation happens in a parallelized manner on the cluster and only the reduced final result is loaded into memory.
```python
# DO THIS:
data = Instrument(instr_dict).load(daq_run=43861)
mean = data.mean(dim='train_id').compute()

# DON'T DO THIS:
data = Instrument(instr_dict).load(daq_run=43861)
mean = data.compute().mean(dim='train_id')
```
Note that plotting a dask array will automatically trigger the computation, so you don't need to
call `.compute()` before plotting.
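
For instance (a sketch assuming `eTof` is one of the loaded variables; `.plot()` is xarray's matplotlib wrapper):

```python
# Plotting triggers the lazy computation implicitly; no .compute() needed.
data.eTof.mean(dim='shot_id').plot()
```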
You can perform multiple computations in one call by passing several arrays to `dask.compute`.
This will speed up the calculation, as the scheduler will load the underlying data only once (as
opposed to loading it multiple times if you call `compute` on each array).
```python
import dask
from fab.magic import config, beamtime, your_instrument

data = your_instrument.load(daq_run=43861)
mean, std = data.mean(dim='train_id'), data.std(dim='train_id')
mean, std = dask.compute(mean, std)
```
## The Maxwell cluster
Most of the analysis of FLASH data is done on the Maxwell cluster. If the `fab` module detects
that the program is running on one of Maxwell's login nodes, such as `max-display`, it automatically
configures dask to start a `dask.distributed` scheduler that runs jobs on the Maxwell cluster.

This way, you don't need to do anything to run your computations efficiently and in parallel on the Maxwell cluster. Just connect to a display node and import fab. The jobs will be automatically sent to the cluster. In order to configure the automatic setup (e.g. which Maxwell partition to use, or to specify hardware requirements), have a look at the configuration section.
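
Roughly speaking, the automatic setup does something like the following sketch, shown here with `dask_jobqueue` for illustration only; the partition name and resources are placeholders, not fab's actual defaults:

```python
# Illustration only: a manual dask-on-SLURM setup similar in spirit to
# what fab configures for you. All parameters here are placeholders.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(queue='some-partition', cores=16, memory='64GB')
cluster.scale(jobs=2)     # submit two SLURM jobs that act as dask workers
client = Client(cluster)  # subsequent .compute() calls run on the cluster
```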