# Welcome to FAB (Flash Analysis for Beamtimes)
The purpose of this library is to give users convenient access and analysis tools for the data generated at the Free Electron Laser FLASH.
It abstracts away the details of loading the data, which would otherwise require accessing the multiple hdf5 files generated by the DAQ during the beamtime. It also provides easy access to the Maxwell cluster resources, so that parallel computation can be performed effortlessly.
The code repo can be found at: https://gitlab.desy.de/fabiano.lever/flashanalysis/
# Installation
If you use `fab` on the Maxwell cluster or through the JupyterHub, you will find it already installed in the 'flash' kernel (on JupyterHub) or in the flash environment on Maxwell. To activate the environment, simply run `module load flash flashenv` in your terminal.
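As a quick sanity check that the environment is active, you can inspect the interpreter path and the installed `fab` version from Python:

```python
# Verify that the flash interpreter is the one running and that fab imports.
import sys
print(sys.executable)   # expected: /software/ps/flash/envs/flash/bin/python

import fab
print(fab.__version__)  # should print the installed fab version
```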
**NOTE**: If you use VS Code to SSH into Maxwell, you can select the flash Python interpreter by its path: `/software/ps/flash/envs/flash/bin/python`
# Quickstart
A brief introduction to the ideas behind the module and some info on how to get started. For an even quicker introduction, have a look at the notebooks in the example folder.
## Configuration and loading data
In most cases, in order to use `fab`, you need to provide a configuration file specifying the kind of data you want to load, how you want to load it, and additional parameters on how `fab` should behave. Let's have a look at a quick example:
```yaml
#fab_config.yaml

instruments:
  ursa:
    delay_set:
      __type__: fab.datasources.HDFSource
      hdf_key: /zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup
      fillna_method: ffill
    GMD:
      __type__: fab.datasources.GMD
      data_key: /FL2/Photon Diagnostic/GMD/Pulse resolved energy/energy hall
      calibration_key: /FL2/Photon Diagnostic/GMD/Average energy/energy hall
    eTof:
      __type__: fab.datasources.SlicedADC
      hdf_key: /FL2/Experiment/MTCA-EXP1/ADQ412 GHz ADC/CH00/TD
      offset: 2246
      window: 3000
      period: 9969.23
      t0: 0.153
      dt: 0.0005
      baseline: 200
      dim_names:
        - shot_id
        - eTof_trace
```
In this file, we tell `fab` that we want to create a new instrument called `ursa`, and that instrument should contain three data variables called `delay_set`, `GMD` and `eTof`.
We are configuring the `delay_set` loader to look for data in the hdf5 file table `/zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup`, and to fill the missing values with the `ffill` method (that is, missing values are filled with the last valid value). To find out the hdf paths for the available data, please open one of the HDF files with hdfview or similar software, or ask the local contact of your beamline for help.
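If you prefer exploring the file structure programmatically rather than with hdfview, here is a minimal sketch using `h5py` (the file path below is a placeholder for one of your beamtime's HDF files):

```python
import h5py

# Placeholder path: point this at one of your beamtime's HDF5 files.
with h5py.File("/path/to/your/run.h5", "r") as f:
    # Recursively print the path of every group and dataset in the file.
    f.visit(print)
```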
The `eTof` and `GMD` values are also loaded from the HDF files, but in this case we ask `fab` for a more sophisticated loading strategy, implemented in the `fab.datasources.SlicedADC` and `fab.datasources.GMD` classes. Refer to the `fab.datasources` documentation for more info.
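To make the `SlicedADC` parameters above more concrete, here is an illustrative numpy sketch of the underlying idea: one long ADC trace is cut into fixed-length per-shot windows. This is not fab's actual implementation, and the interpretation of `baseline` as a number of samples used for baseline subtraction is an assumption; check `fab.datasources` for the authoritative definitions:

```python
import numpy as np

# Parameters taken from the config example above.
offset, window, period = 2246, 3000, 9969.23
n_shots = 4                                    # illustrative shot count
trace = np.random.rand(int(offset + n_shots * period) + window)  # fake raw ADC trace

# Cut the long trace into one fixed-length window per FEL shot.
shots = np.stack([
    trace[int(offset + i * period):int(offset + i * period) + window]
    for i in range(n_shots)
])                                             # shape: (shot_id, eTof_trace)

# Assumption: 'baseline: 200' means the first 200 samples of each window
# are averaged and subtracted as the baseline level.
shots -= shots[:, :200].mean(axis=1, keepdims=True)
```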
We suggest you create a single config file for each beamtime, placing it in the shared folder, so that each participant can use the same configuration.
After defining what data we want, we are ready to load it:
## Magic imports and automatic config detection
```python
from fab.magic import config, beamtime, ursa

result = ursa.load(daq_run=43861)
```
The `fab.magic` module attempts to do a few things for you. By importing `config`, it looks for a configuration file named `fab_config.yaml` (or any file matching the pattern `fab_config*.yaml`) in the current directory or in one of its parents. It uses the first one it finds. The `beamtime` import gives access to the beamtime run table created via `fabscan`. If you are in a gpfs beamtime directory, the beamtime number is automatically detected. Otherwise you can specify it as e.g. `beamtime: 11013355` in the config file.
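For example, to set it explicitly:

```yaml
# fab_config.yaml
beamtime: 11013355
```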
**NOTE**: The `fab.magic` module provides a convenient way to quickly set up the analysis environment. It makes some assumptions on how you use `fab`. If you prefer a more structured setup or finer control over the configuration process, please refer to the `fab.settings`, `fab.instruments` and `fab.beamtime` modules.
Finally, the `ursa` import instantiates the instrument we defined in the config file, which we can then use to load the data. Calling `load` with no arguments will load all available data.
To access the data, we can simply use:
```python
result.delay_set
result.eTof
```
Depending on its size, the data will either already be in RAM or be represented by a lazy `dask.array` object. You can force the data to be preloaded in RAM (or force it not to be) by passing the `preload_values` attribute. Please weigh the advantages and disadvantages of this approach by reading the documentation in `fab.datasources`, as it can have a catastrophic impact on performance if used incorrectly.
To force the data to be loaded in RAM, we can use the `.compute()` method. Please refer to the xarray and dask documentation for more information. When using `fab` on one of the Maxwell cluster login nodes, a job will be automatically submitted and the computation will be performed on the cluster.
Please have a look at the `fab.datasources` and `fab.instruments` modules for more detailed information about how to configure `fab` and tailor it to your needs. The `settings` and `maxwell` modules' documentation will help you write more complex configuration files and customize how `fab` uses the Maxwell HPC cluster.
**NOTE**: You don't need to be on Maxwell to use this module. If you have the HDF files on your local machine, you can configure `fab` to load those files by using the `hdf_path` configuration parameter. `fab` will then happily run on your local machine.
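A minimal sketch of such a local setup, assuming `hdf_path` sits at the top level of the config file (the exact placement may differ; see the `fab.settings` documentation), with a placeholder path:

```yaml
# Placeholder path and assumed top-level placement; consult fab.settings.
hdf_path: /home/user/mybeamtime/hdf
```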
## Beamtime objects
If your beamtime used `fabtrack`/`fabscan` to create a run table, you can access it through the `beamtime` object. This provides a pandas dataframe with the run metadata. You can use this dataframe to filter the runs you want to analyze, and then load the data for those runs only, as sketched below. See the `fab.beamtime` documentation for more info.
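A hedged sketch of this workflow (the attribute and column names below are assumptions, not the verified `fab.beamtime` interface):

```python
from fab.magic import config, beamtime, ursa

# Assumed: the run table is exposed as a pandas DataFrame; the attribute
# and column names here are hypothetical placeholders.
runs = beamtime.run_table
selected = runs[runs["type"] == "delay scan"]   # ordinary pandas filtering

# Load data only for the runs we selected.
results = [ursa.load(daq_run=run_id) for run_id in selected.index]
```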
The documentation for `fabtrack` is still a work in progress. If you need support, ask Fabiano Lever or your local contact for help.
## Analysis functions
Once data is loaded, the `fab.analysis` module provides common analysis functions optimized for beamtime data, including bootstrap resampling for uncertainty estimation, quantile-based binning, and utilities for shot-pair analysis in pump-probe experiments. Like datasources and instruments, analysis functions can be pre-configured through the YAML config file by setting default parameters in the `analysis` section. This allows you to keep analysis parameters consistent across your code without specifying them in every function call.
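A hypothetical sketch of such a section (the keys below are illustrative assumptions, not the actual `fab.analysis` schema; consult its documentation for the real parameter names):

```yaml
# Hypothetical sketch only: keys and values are placeholders.
analysis:
  bootstrap:
    n_samples: 1000   # assumed default for bootstrap resampling
  binning:
    n_bins: 50        # assumed default for quantile-based binning
```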
# xarray, dask and the Maxwell HPC cluster
## xArray
`xarray` (shortened to `xr`) is a wrapper around numpy arrays that allows labelling data. The data loaded by the `fab` library is returned in the form of `xr.DataArray` or `xr.Dataset` objects. This allows easy indexing of the data (label-based indexing instead of positional indexing) and offers much of the analysis functionality of pandas, extended to data with more than 2 dimensions. Please refer to the official `xarray` documentation for more info about the analysis API.
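A small self-contained example of label-based indexing, reusing the dimension names from the config example above:

```python
import numpy as np
import xarray as xr

# Build a toy DataArray with named dimensions and a labelled coordinate.
arr = xr.DataArray(
    np.random.rand(3, 4),
    dims=("train_id", "eTof_trace"),
    coords={"train_id": [100, 101, 102]},
)

arr.sel(train_id=101)        # select by coordinate label, not position
arr.mean(dim="eTof_trace")   # reduce over a named dimension
```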
## Dask
Dask is a Python tool that allows easy parallelization of large tasks that might exceed the resources of a normal workstation. It can be integrated with HPC clusters so that computations happen remotely on the cluster and only the results are returned.
In general, the data contained in the objects loaded by `fab` is given as a `dask.array`. That means that all operations and analysis will be computed lazily, only when needed. For many tasks, the computation is triggered automatically when the data needs to be displayed (e.g. when plotting), so that `dask.array` objects can be used in the same way one uses `np.ndarray`. In case you wish to trigger a computation manually, all you have to do is call the `.compute()` method on the array:
```python
data = Instrument(instr_dict).load(daq_run=43861)  # This is a dataset made up of dask arrays
raw_data = data.compute()                          # Performs the computation and returns a normal np.array
```
Please be aware that computing large datasets will be slow and could lead to memory errors. When working with large datasets, it's best to do all the analysis on the lazy arrays and compute the result only at the end, after the data has been reduced. This way the computation happens in a parallelized manner on the cluster, and only the reduced final result is loaded in memory.
```python
# DO THIS:
data = Instrument(instr_dict).load(daq_run=43861)
mean = data.mean(dim='train_id').compute()

# DON'T DO THIS:
data = Instrument(instr_dict).load(daq_run=43861)
mean = data.compute().mean(dim='train_id')
```
Note that plotting a dask array will automatically trigger the computation, so you don't need to call `.compute()` before plotting.
You can perform multiple computations in one call by passing several arrays to `dask.compute`. This will speed up the calculation, as the scheduler will load the underlying data only once (as opposed to loading it multiple times if you call `compute` on each array).
```python
import dask
from fab.magic import config, beamtime, your_instrument

data = your_instrument.load(daq_run=43861)
mean, std = data.mean(dim='train_id'), data.std(dim='train_id')
mean, std = dask.compute(mean, std)
```
## The Maxwell cluster
Most of the analysis of FLASH data is done on the Maxwell cluster. If the `fab` module detects that the program is running on one of Maxwell's login nodes, such as `max-display`, it automatically configures dask to start a `dask.distributed` scheduler that runs jobs on the Maxwell cluster.
This way, you don't need to do anything to run your computations efficiently and in parallel on the Maxwell cluster: just connect to a display node and import `fab`, and the jobs will be automatically sent to the cluster. To configure the automatic setup (e.g. which Maxwell partition to use, or to specify hardware requirements), have a look at the configuration section.
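As a hedged sketch of what such a configuration might look like (the key names below are assumptions, not the verified schema; see the `fab.settings` and `fab.maxwell` documentation for the actual options):

```yaml
# Hypothetical sketch only: keys are placeholders for the real schema.
maxwell:
  partition: allcpu   # which Maxwell SLURM partition to submit jobs to
  cores: 32           # per-job hardware request
```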