
# Welcome to FAB (Flash Analysis for Beamtimes)

The purpose of this library is to give users convenient access and analysis tools for the data generated at the Free Electron Laser FLASH.

It abstracts the details of loading the data, which would otherwise require accessing the multiple hdf5 files generated by the DAQ during the beamtime. It also provides easy access to the Maxwell cluster resources, so that parallel computation can be performed effortlessly.

The code repo can be found at: https://gitlab.desy.de/fabiano.lever/flashanalysis/

# Installation

If you use fab on the Maxwell cluster, or through the JupyterHub, you will find fab already installed in the 'flash' kernel (for JupyterHub) and in the flash environment on Maxwell. To activate the environment, simply run `module load flash flashenv` in your terminal.
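Once the environment is active, you can quickly check that the right interpreter is picked up, for example:

```python
import fab

print(fab.__version__)  # prints the installed fab version
```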

NOTE: If you use VS Code to SSH into Maxwell, you can select the flash Python interpreter by its path: `/software/ps/flash/envs/flash/bin/python`

# Quickstart

A brief introduction to the ideas behind the module and some info on how to get started. For an even quicker introduction, have a look at the notebooks in the example folder.

In most cases, in order to use `fab`, you need to provide a configuration file specifying the kind of data you want to load, how you want to load it, and additional parameters on how `fab` should behave. Let's have a look at a quick example:

```yaml
#fab_config.yaml

instruments:
  ursa:
    delay_set:
      __type__: fab.datasources.HDFSource
      hdf_key: /zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup
      fillna_method: ffill
    GMD:
      __type__: fab.datasources.GMD
      data_key: /FL2/Photon Diagnostic/GMD/Pulse resolved energy/energy hall
      calibration_key: /FL2/Photon Diagnostic/GMD/Average energy/energy hall
    eTof:
      __type__: fab.datasources.SlicedADC
      hdf_key: /FL2/Experiment/MTCA-EXP1/ADQ412 GHz ADC/CH00/TD
      offset: 2246
      window: 3000
      period: 9969.23
      t0: 0.153
      dt: 0.0005
      baseline: 200
      dim_names:
        - shot_id
        - eTof_trace
```

In this file, we tell fab that we want to create a new instrument called ursa, and that instrument should contain three data variables called delay_set, GMD and eTof.

We are configuring the `delay_set` loader to look for data in the hdf5 table `/zraw/FLASH.SYNC/LASER.LOCK.EXP/F2.PPL.OSC/FMC0.MD22.0.POSITION_SET.WR/dGroup`, and to fill missing values with the `ffill` method (that is, missing values are filled with the last valid value). To find out the hdf paths for the available data, please open one of the HDF files with hdfview or similar software, or ask the local contact of your beamline for help.
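If you prefer to explore the files programmatically, a small sketch using h5py can list all available keys (the file name below is a placeholder):

```python
import h5py

# Print every group and dataset path contained in the file.
with h5py.File("/path/to/a_daq_file.h5", "r") as f:
    f.visit(print)
```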

The `eTof` and `GMD` values are also loaded from the HDF files, but in this case we ask `fab` for a more sophisticated loading strategy, implemented in the `fab.datasources.SlicedADC` and `fab.datasources.GMD` classes. Refer to the `fab.datasources` documentation for more info.

We suggest you create a single config file for each beamtime, placing it in the shared folder, so that each participant can use the same configuration.

Having defined what data we want, we are now ready to load it:

```python
from fab.magic import config, beamtime, ursa

result = ursa.load(daq_run=43861)
```

The `fab.magic` module attempts to do a few things for you. By importing `config`, it looks for a configuration file named `fab_config.yaml` (or any file matching the pattern `fab_config*.yaml`) in the current directory or in one of its parents; it uses the first one it finds. The `beamtime` import gives access to the beamtime run table created via `fabscan`. If you are in a gpfs beamtime directory, the beamtime number is automatically detected; otherwise you can specify it as e.g. `beamtime: 11013355` in the config file.

NOTE: The `fab.magic` module provides a convenient way to quickly set up the analysis environment. It makes some assumptions on how you use `fab`. If you prefer a more structured setup or finer control over the configuration process, please refer to the `fab.settings`, `fab.instruments` and `fab.beamtime` modules.

Finally, the ursa import instantiates the instrument we defined in the config file, which we can then use to load the data. Calling load with no arguments will load all available data.
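For example, sticking with the `ursa` instrument and run number from above:

```python
result_all = ursa.load()           # no arguments: load all available data
result = ursa.load(daq_run=43861)  # restrict to a single DAQ run, as in the quickstart
```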

To access the data, we can simply use:

```python
result.delay_set
result.eTof
```

Depending on the size of the data, it will either already be in RAM or be represented by a lazy `dask.array` object. You can force the data to be preloaded in RAM (or force it not to be) by passing the `preload_values` attribute. Please go through the advantages and disadvantages of this approach in the `fab.datasources` documentation, as it can have a catastrophic impact on performance if used incorrectly. To force the data to be loaded into RAM, we can use the `.compute()` method; please refer to the xarray and dask documentation for more information. When using fab on one of the Maxwell cluster login nodes, a job will be automatically submitted and the computation will be performed on the cluster.

Please have a look at the `fab.datasources` and `fab.instruments` modules for more detailed information about how to configure `fab` and tailor it to your needs. The `settings` and `maxwell` modules documentation will help you write more complex configuration files and customize how fab uses the Maxwell HPC cluster.

NOTE: You don't need to be on Maxwell to use this module. If you have the HDF files on your local machine, you can configure `fab` to load those files by using the `hdf_path` configuration parameter. `fab` will then happily run on your local machine.

# xarray, dask and the Maxwell HPC cluster

## xarray

`xarray` (shortened to `xr`) is a wrapper around numpy arrays that allows labelling data. The data loaded by the `fab` library is returned in the form of `xr.DataArray` or `xr.Dataset` objects. This allows easy indexing of the data (label-based indexing instead of positional indexing) and offers much of the analysis functionality of pandas, extended to data with more than 2 dimensions. Please refer to the official `xarray` documentation for more info about the analysis API.
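As a rough illustration of label-based indexing (a toy example, not fab's actual data; the dimension names are borrowed from the eTof config above):

```python
import numpy as np
import xarray as xr

# A small DataArray with named dimensions, standing in for data loaded by fab.
traces = xr.DataArray(
    np.random.rand(5, 3000),
    dims=("shot_id", "eTof_trace"),
    coords={"shot_id": np.arange(5)},
)

first_shot = traces.sel(shot_id=0)    # select by label, not by position
average = traces.mean(dim="shot_id")  # reduce over a named dimension
```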

## Dask

Dask is a Python tool that allows easy parallelization of large tasks that might exceed the resources of a normal workstation. It can be integrated with HPC clusters so that computations happen remotely on the cluster and only the results are returned.
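As a minimal, fab-independent sketch of how lazy evaluation works in dask:

```python
import dask.array as da

# A large array that is never fully materialized; chunks are processed in parallel.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

column_mean = x.mean(axis=0)    # builds a task graph, computes nothing yet
values = column_mean.compute()  # triggers the actual (parallel) computation
```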

In general, the data contained in the objects loaded by `fab` is given as a `dask.array`. That means that all operations and analyses are computed lazily, only when needed. For many tasks, the computation is triggered automatically when the data needs to be displayed (e.g. when plotting), so that `dask.array` objects can be used in the same way one uses `np.ndarray`. In case you wish to trigger a computation manually, all you have to do is call the `.compute()` method on the array:

```python
data = Instrument(instr_dict).load(daq_run=43861)  # a dataset made up of lazy dask arrays
raw_data = data.compute()  # performs the computation and loads the values into memory as numpy arrays
```

Please be aware that computing large datasets will be slow and could lead to memory errors. If working with large datasets, it's best to do all the analysis on the lazy arrays and compute the result at the end, only after the data has been reduced. This way the computation happens in a parallelized manner on the cluster and only the reduced final result is loaded into memory.

```python
# DO THIS:
data = Instrument(instr_dict).load(daq_run=43861)
mean = data.mean(dim='train_id').compute()

# DON'T DO THIS:
data = Instrument(instr_dict).load(daq_run=43861)
mean = data.compute().mean(dim='train_id')
```

Note that plotting a dask array will automatically trigger the computation, so you don't need to call .compute() before plotting.
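For instance, a reduced quantity can be plotted directly, and the plot call evaluates the lazy array (a sketch assuming the `eTof` variable and `shot_id` dimension from the quickstart config):

```python
# xarray's .plot() triggers the dask computation before drawing.
result.eTof.mean(dim='shot_id').plot()
```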

You can perform multiple computations in one call by passing several arrays to `dask.compute`. This speeds up the calculation, as the scheduler will load the underlying data only once (as opposed to loading it multiple times if you call `compute` on each array).

```python
import dask
from fab.magic import config, beamtime, your_instrument

data = your_instrument.load(daq_run=43861)
mean, std = data.mean(dim='train_id'), data.std(dim='train_id')
mean, std = dask.compute(mean, std)
```

## The Maxwell cluster

Most of the analysis of FLASH data is done on the Maxwell cluster. If the `fab` module detects that the program is running on one of Maxwell's login nodes, such as `max-display`, it automatically configures dask to start a `dask.distributed` scheduler that runs jobs on the Maxwell cluster.

This way, you don't need to do anything to run your computations efficiently and in parallel on the Maxwell cluster. Just connect to a display node and import fab; the jobs will be automatically sent to the cluster. To configure the automatic setup (e.g. which Maxwell partition to use, or to specify hardware requirements), have a look at the configuration section.
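For orientation, a SLURM-backed `dask.distributed` setup typically looks something like the sketch below (using `dask_jobqueue`; this is a generic illustration of the mechanism, not fab's internal code, and the partition name and resources are placeholders):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Request dask workers as SLURM jobs; fab does something along these lines
# automatically when it detects a Maxwell login node.
cluster = SLURMCluster(
    queue="allcpu",        # placeholder partition name
    cores=16,
    memory="64GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)      # ask for four worker jobs
client = Client(cluster)   # subsequent dask computations run on the cluster
```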
