Skip to content

Data Processing

Ryan Quinn edited this page Jul 18, 2019 · 11 revisions

This page serves to provide a set of instructions for running the analysis chain, as well as provide information in order to customize the analysis chain for a specific detector setup.

Running the Data Processing Chain

Unpacking the Data

The raw data is unpacked with the unpackRaw.py script. Run ./unpackRaw.py --help to see a list of available options.

Hit Reconstruction

The unpacked data is converted into hits using the hitReco.py script. Run ./hitReco.py to see a list of available options.

The default hit selection criteria is extremely basic. A more selective criteria should be implemented for your specific setup.

The reconstructed data can then be used for data analysis.

Data Analysis

Event Data

The event data is stored in the reco_events array inside the reco file:

recoFile = np.load(recoFileName, allow_pickle=True)
eventData = recoFile['reco_events']

You can loop over this array and access each event's data with the typical python loop syntax:

for event in eventArray:
    # process event here

Each event consists of a series of 'hits'. A hit is a single cell that passes some criteria defined in 'hitReco.py'. Hits are supposed to represent the positions in the detector where some particle has hit the sensor and deposited energy. Each hit has a number of properties describing the position, energy deposition, and other parameters. Inside an event, the properties of the hits are stored in separate arrays.

As an example, consider an event with 10 hits. This event will contain a number of elements. Two elements, event and nhits, are single numbers representing the event number and the number of hits in the event, respectively. We can access these elements like they are dictionary elements:

>>> event = eventArray[0] # take the first event, here we assume it contains 10 hits
>>> event['event']
1
>>> event['nhits']
10

The properties of the 10 hits are stored in arrays prefixed with hit_. All available arrays can be seen in hitReco.py. Let's print the x-positions of each of the hits:

>>> event['hit_x']
array([0., -1., 1., 2., 2., -1., -2., 1., 0., 0.])

and the low gain ADC waveforms (note there are 11 time samples per waveform, so this is a 10x11 array:

>>> event['hit_adc_lg']
array([[105., 108., 112., 110., 102., 100., 100., 100., 100., 100.],
       [98.,  120., 125., 109., 104., 98.,  98.,  97.,  98.,  99. ],
       ...
       [96.,  112., 122., 119., 102., 98.,  98.,  101., 99.,  100.]])

Analysis on the hits can be performed in two ways. You can use full-array manipulation (thanks to NumPy), or you can loop over the hits individually. As an example, we calculate the average hit radii relative to the coordinates (x=-2,y=3,) both ways:

# loop method
mean_hit_radius_loopmethod = []
for hit in range(nhits):
    hit_radius = np.sqrt((event['hit_x'][hit] - (-2))**2 + (event['hit_y'][hit] - 3)**2)
    mean_hit_radius_loopmethod += hit_radius
mean_hit_radius_loopmethod /= nhits

# array manipulation method
mean_hit_radius_arraymethod = np.mean(np.sqrt(np.power(event['hit_x'] - (-2), 2) + np.power(event['hit_y'] - 3, 2)))

Note that this is a slightly weird calculation to make, since the hits are in different layers. So, in effect, we are calculating the average hit distance from the line going down the z axis at the coordinates (x=-2,y=3).

In general, the array manipulation method is much preferred for its speed advantages. The loop method is much slower, but it may be the only option for some types of calculations.

Example Analyzer

An example data analyzer exampleAnalyzer.py is provided. This produces a 2-dimensional histogram of the low gain ADC waveforms (after common mode subtraction) using all events in data/reco/run1.reco.npz. For a 1000 event run on a dummy sensor, the output looks like: 2-D Waveform Histogram

Data Format

Raw

Each hexaboard contains four SKIROCs (chips). Each of these chips reads out 64 channels (of which only the even-numbered are connected to a cell). The raw data from these chips is processed by the MAX10 FPGA on the hexaboard, and sent out. This data is collected from all chips and sent into the main FIFO on the readout board. This FIFO is 32 bits 'wide', with one bit for each possible connected skiroc. The data in the FIFO is formatted as follows (the 32 bit words that are read from the FIFO are vertically oriented): FIFO Memory Mapping The FIFO header bits are 1 if there is a connected skiroc in that spot, and 0 otherwise. The 30784 words in the FIFO are read out with the header coming first (the MSB of the FIFO words are on top). Each set of four skiroc bitstreams corresponds to one hexaboard. The top port of the readout board corresponds to the four least significant bits in the FIFO (the bottom four rows in the above diagram).

IPBusDAQ.py collects these 'blocks' of 30785 32-bit words, compresses them, and writes them to disk. No processing is done to extract the skiroc data - that is done in the unpacking step.

Unpacked

The goal of the unpacker is to 'unpack' the raw blocks written by the DAQ into a manageable data structure containing the data recorded by the skiroc.

The data for one chip coming out of the MAX10 is formatted as follows: SKIROC Memory Mapping The unpacker first extracts the data (4 bits * 30795 bits) for one hexaboard from the block and drops the header:

hxData = ((block>>hx) & 0xf)[1:]

It then looks at the data from each individual skiroc. Looking at the SKIROC Memory Mapping, we see the SKIROC data is a set of 1924 16-bit words. The unpacker extracts the data from one skiroc ski, and reshapes it into a 1924*16 array - one entry for each bit:

skiBits = ((hxData>>ski) & 0x1).reshape((1924,16))

which then gets bitwise OR'd together to form the final 16-bit words:

# FIFO bits are MSB first, so we must fill the word from left to right (15-bitidx)
for bitidx in range(16):
    skiWords[ski,:] |= skiBits[:,bitidx] << (15-bitidx)

The data from each channel is then formatted into an array of custom chanType elements:

chanType = np.dtype([
        ('hg_adc', 'u2', (13,)), ('hg_adc_hit', '?', (13,)),
        ('lg_adc', 'u2', (13,)), ('lg_adc_hit', '?', (13,)),
        ('toa_fall', 'u2'), ('toa_fall_hit', '?'),
        ('toa_rise', 'u2'), ('toa_rise_hit', '?'),
        ('tot_slow', 'u2'), ('tot_slow_hit', '?'),
        ('tot_fast', 'u2'), ('tot_fast_hit', '?'),
    ])

The data from all channels for one skiroc are combined into an array of chipType objects, along with the chip-specific data:

chipType = np.dtype([
        ('chan', chanType, (64,)), # 64 channels per skiroc
        ('roll_position', 'u2'),
        ('global_ts', 'u4'),
        ('chip_id', 'u1'),
    ])

And finally the data from the chips is combined into an array of hexbdType objects:

hexbdType = np.dtype([
        ('chip', chipType, (4,)), # 4 skirocs per hexbd
        ('index', 'u1'),
    ])

unpackRaw.py does the job of unpacking the raw data, formatting, compressing, and saving to disk. Each run is stored as an hdf5 dataset of events, and each event is stored as a numpy array of hexbdType objects.

Reco

The next step is to 'reconstruct' the unpacked data into usable data (hits) in a useful format. This requires multiple steps.

Hit Selection

During an event, there are usually only a few cells that particles pass through. Most of the cells in the detector see nothing. Hit selection is used to drop data that is not important.

The hitReco.py script (as of this writing) employs a very basic hit selection algorithm. Any cells with low gain ADC at time sample 3 (usually the TS with max energy when a particle passes through) greater than the common mode noise are selected as hits. All others are dropped.

def checkHit(lgADC_CMSub):
    return lgADC_CMSub[2] > 0 # lgADC_CMSub already has cm noise subtracted from it

This will likely need to be updated for an actual functioning setup.

Common Mode Noise

Noise is present in the background of every event. All cells will likely have a nonzero ADC value, even when nothing is passing through them. This offset is known as the common mode noise.

The common mode noise calculation done in hitReco.py (as of this writing) calculates a value of the common mode noise for each hexaboard: the mean of the first time sample ADC over all cells in the layer. The first time sample is used since this time sample is (usually) taken before the particle passes through the sensor - meaning it is free from any signal. This calculation is done separately for high and low gain. The common mode noise is subtracted from all ADC counts.

More sophisticated common mode calculations are easy to implement. A simple improvement is to calculate the noise per skiroc, instead of just per hexaboard.

SCA to Time Sample Conversion

As seen in the SKIROC memory diagram, each chip records one 13 ADC values per channel per gain during an event. These 13 values correspond to different time samples, but not directly. A conversion is needed between SCA and time samples, and this conversion depends on the roll_positions - a value generated by each chip.

From the roll_positions value, we can generate a SCA to time sample conversion table:

def getSCAConversion(rollMask):
    scaToTS = -np.ones(13, dtype=np.uint8) # start with -1 (aka 255 in uint8)
    if ((rollMask>>11) & 1) and (rollMask & 1):
        scaToTS[0] = 12
        for i in range(1,13): scaToTS[i] = i-1
    else:
        posTrk = -1 # posTrk is the position of the first bit that is set
        for i in range(13):
            if (rollMask>>i) & 1:
                posTrk = i
                break
        for i in range(1, posTrk+2):
            scaToTS[i] = i + 12 - (posTrk+1)
        for i in range(posTrk+2, 12):
            scaToTS[i] = i - 1 - (posTrk+1)
    return scaToTS[scaToTS < 13] # some time samples are invalid -- only use good ones

Note that this conversion table must be generated for each chip.

There are only 11 time samples available of the 13 SCA. Two SCA are unusable. With an array of ADC values organized by SCA, we can convert to time samples with:

adc_array_ts = adc_array_sca[scaToTS][:11]

or a single SCA value to its corresponding TS value with:

ts_value = scaToTS[sca_value]

The adc_array_ts value is in order of time samples, so it is in order temporally. Each time sample is 25ns apart.

Data Formatting

The reconstructor also formats the data into a set of arrays. Each hit has a number of properties (position, ADC counts, etc...); per event, there is one array for each of these properties. See the data analysis section for more information.