Relativistic momentum #200

henry2004y · 2024-10-27T22:01:23Z

Handle #198 by solving for the "momentum" $\vec{p}/m = \gamma \vec{v}$ instead of $\vec{v}$ in the relativistic case. This way we no longer need to check whether the computed velocity exceeds the speed of light, since the conversion back to velocity guarantees a speed below c. A small caveat is that when the velocity is 0, the direction is undetermined, which requires an additional branch.
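
The bound on the speed can be checked directly. Writing $\vec{u} = \gamma\vec{v}$ (a shorthand introduced here, not a name from the code), the standard definition of $\gamma$ gives:

$$
\gamma^2 = \frac{1}{1 - v^2/c^2}
\quad\Rightarrow\quad
u^2 = \gamma^2 v^2 = \frac{v^2}{1 - v^2/c^2}
\quad\Rightarrow\quad
\gamma^2 = 1 + \frac{u^2}{c^2}
$$

$$
\vec{v} = \frac{\vec{u}}{\gamma} = \frac{\vec{u}}{\sqrt{1 + u^2/c^2}},
\qquad
|\vec{v}| = c\,\frac{u/c}{\sqrt{1 + (u/c)^2}} < c
$$

So any finite $\vec{u}$ maps back to a speed strictly below c.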

As quoted from the original discussion note, this form is a bit unintuitive and requires a conversion from relativistic momentum to velocity in certain cases. However, I feel it is more natural in relativity. Further discussion is welcome! @Beforerr

The normalization case may need further testing.

codecov · 2024-10-27T22:13:08Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.39%. Comparing base (c1034ec) to head (aeaae8d).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #200      +/-   ##
==========================================
+ Coverage   83.48%   84.39%   +0.90%     
==========================================
  Files           9        9              
  Lines         660      692      +32     
==========================================
+ Hits          551      584      +33     
+ Misses        109      108       -1


github-actions · 2024-10-28T00:54:33Z

Benchmark result

Judge result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

Time of benchmarks:
- Target: 28 Oct 2024 - 00:52
- Baseline: 28 Oct 2024 - 00:54
Package commits:
- Target: 6f6f36
- Baseline: c1034e
Julia commits:
- Target: 8f5b7c
- Baseline: 8f5b7c
Julia command flags:
- Target: None
- Baseline: None
Environment variables:
- Target: None
- Baseline: None

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).

ID	time ratio	memory ratio
`["trace", "GC", "in place"]`	1.07 (5%) ❌	1.00 (1%)
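
As a worked check of the ratio above, the sketch below divides the target time by the baseline time for `["trace", "GC", "in place"]`, using the values from the Target result and Baseline result tables of this report. The variable names are illustrative, and the rule "flag when the ratio exceeds 1 plus the tolerance" is an assumption about how the judge applies the 5% tolerance.

```julia
# Times (μs) for ["trace", "GC", "in place"] taken from the two result tables
target_time   = 136.976
baseline_time = 127.799
tolerance     = 0.05   # the 5% noise tolerance shown next to the ratio

ratio = target_time / baseline_time

# Assumed rule: a ratio above 1 + tolerance counts as a possible regression
is_regression = ratio > 1 + tolerance

println(round(ratio; digits = 2), " ", is_regression)   # prints "1.07 true"
```

The memory ratio is 1.00 because both runs report 122.69 KiB for this benchmark.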

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

["trace", "GC"]
["trace", "analytic field"]
["trace", "numerical field"]
["trace", "time-dependent field"]

Julia versioninfo

Target

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4281 s          0 s        293 s       3586 s          0 s
       #2     0 MHz       4626 s          0 s        282 s       3248 s          0 s
       #3     0 MHz       4148 s          0 s        277 s       3740 s          0 s
       #4     0 MHz       3945 s          0 s        283 s       3937 s          0 s
  Memory: 15.606491088867188 GB (13119.08984375 MB free)
  Uptime: 819.43 sec
  Load Avg:  1.27  2.71  1.81
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Baseline

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4376 s          0 s        311 s       4627 s          0 s
       #2     0 MHz       4684 s          0 s        301 s       4325 s          0 s
       #3     0 MHz       4566 s          0 s        294 s       4460 s          0 s
       #4     0 MHz       4504 s          0 s        329 s       4487 s          0 s
  Memory: 15.606491088867188 GB (13048.6328125 MB free)
  Uptime: 935.11 sec
  Load Avg:  1.05  2.17  1.71
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Target result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

Time of benchmark: 28 Oct 2024 - 0:52
Package commit: 6f6f36
Julia commit: 8f5b7c
Julia command flags: None
Environment variables: None

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

ID	time	memory	allocations
`["trace", "GC", "in place"]`	136.976 μs (5%)	122.69 KiB (1%)	2286
`["trace", "analytic field", "in place relativistic"]`	8.636 μs (5%)	13.97 KiB (1%)	289
`["trace", "analytic field", "in place"]`	5.971 μs (5%)	10.45 KiB (1%)	199
`["trace", "analytic field", "out of place"]`	3.637 μs (5%)	8.61 KiB (1%)	159
`["trace", "numerical field", "Boris ensemble"]`	3.261 μs (5%)	3.30 KiB (1%)	25
`["trace", "numerical field", "Boris"]`	1.684 μs (5%)	1.88 KiB (1%)	19
`["trace", "numerical field", "in place"]`	18.504 μs (5%)	12.73 KiB (1%)	200
`["trace", "numerical field", "out of place"]`	13.235 μs (5%)	9.81 KiB (1%)	159
`["trace", "time-dependent field", "in place"]`	7.341 μs (5%)	11.30 KiB (1%)	220
`["trace", "time-dependent field", "out of place"]`	5.064 μs (5%)	9.39 KiB (1%)	177

Baseline result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

Time of benchmark: 28 Oct 2024 - 0:54
Package commit: c1034e
Julia commit: 8f5b7c
Julia command flags: None
Environment variables: None

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

ID	time	memory	allocations
`["trace", "GC", "in place"]`	127.799 μs (5%)	122.69 KiB (1%)	2286
`["trace", "analytic field", "in place"]`	5.865 μs (5%)	10.45 KiB (1%)	199
`["trace", "analytic field", "out of place"]`	3.564 μs (5%)	8.61 KiB (1%)	159
`["trace", "numerical field", "Boris ensemble"]`	3.265 μs (5%)	3.30 KiB (1%)	25
`["trace", "numerical field", "Boris"]`	1.688 μs (5%)	1.88 KiB (1%)	19
`["trace", "numerical field", "in place"]`	18.274 μs (5%)	12.73 KiB (1%)	200
`["trace", "numerical field", "out of place"]`	12.974 μs (5%)	9.81 KiB (1%)	159
`["trace", "time-dependent field", "in place"]`	7.376 μs (5%)	11.30 KiB (1%)	220
`["trace", "time-dependent field", "out of place"]`	5.043 μs (5%)	9.39 KiB (1%)	177

Runtime information

Runtime Info
BLAS #threads	2
`BLAS.vendor()`	`lbt`
`Sys.CPU_THREADS`	4

lscpu output:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7763 64-Core Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           4890.87
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           1 MiB (2 instances)
L3 cache:                           32 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Cpu Property	Value
Brand	AMD EPYC 7763 64-Core Processor
Vendor	:AMD
Architecture	:Unknown
Model	Family: 0xaf, Model: 0x01, Stepping: 0x01, Type: 0x00
Cores	16 physical cores, 16 logical cores (on executing CPU)
	No Hyperthreading hardware capability detected
Clock Frequencies	Not supported by CPU
Data Cache	Level 1:3 : (32, 512, 32768) kbytes
	64 byte cache line size
Address Size	48 bits virtual, 48 bits physical
SIMD	256 bit = 32 byte max. SIMD vector size
Time Stamp Counter	TSC is accessible via `rdtsc`
	TSC runs at constant rate (invariant from clock frequency)
Perf. Monitoring	Performance Monitoring Counters (PMC) are not supported
Hypervisor	Yes, Microsoft

Beforerr · 2024-10-28T07:13:44Z

I would also prefer solving momentum (which is also more common in PIC codes).

PS: If you are concerned about velocity accuracy, norm may be better.
https://discourse.julialang.org/t/performance-of-norm-function/14709

PPS: this may be faster.

   if γ²v² > 1e-20 
      vmag = √(γ²v² / (1 + γ²v²/c2))
      vx, vy, vz = vmag * normalize(γv)
   else # no velocity
      vx, vy, vz = 0, 0, 0
   end

...

const c2 = c^2

Beforerr · 2024-10-28T07:53:01Z

There should be no $\Omega$ in the normalized case. See also pull request #202.

henry2004y · 2024-10-28T15:07:10Z

I feel that in the normalized relativistic case, the only reasonable velocity normalization factor is c; otherwise, c will inevitably show up in the Lorentz factor $\gamma$.
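
To make this concrete, here is a minimal sketch of the conversion when velocities are measured in units of c, so that c = 1 and $\gamma^2 = 1 + (\gamma v)^2$. The function name `velocity_from_momentum` is hypothetical and is not the package's API.

```julia
# Hypothetical helper (not part of TestParticle.jl): convert γv to v when
# velocities are normalized by c, so that c == 1 and no constant appears.
function velocity_from_momentum(γv::NTuple{3, T}) where {T <: AbstractFloat}
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    γ = sqrt(one(T) + γ²v²)   # γ² = 1 + (γv)² when c = 1
    return (γv[1] / γ, γv[2] / γ, γv[3] / γ)
end

v = velocity_from_momentum((0.5, 0.0, 0.0))
speed = sqrt(sum(abs2, v))    # stays below 1, i.e. below c

println("v = ", v, ", |v| = ", speed)
```

With any other normalization factor, the `one(T)` term would have to be replaced by a ratio involving c.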

Beforerr · 2024-10-28T15:41:06Z

BTW why use v̂ and 1e-20?

I feel like v̂ is an intermediate variable; creating it anyway would slow down the code if it is not optimized away by the compiler.
And 1e-20 is far from underflow for double precision (and should be different for Float16 and for the (un)normalized case). I would just like to avoid a branch in computationally intensive code.
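
To illustrate the point about the threshold, this short sketch prints how the fixed value 1e-20 compares with the type-dependent quantities `eps` and `floatmin` for three floating-point types. It uses only Base Julia functions.

```julia
# Compare the fixed threshold 1e-20 with type-dependent quantities.
for T in (Float16, Float32, Float64)
    println(T, ": eps = ", eps(T),
            ", floatmin = ", floatmin(T),
            ", T(1e-20) = ", T(1e-20))
end
```

For Float64, 1e-20 lies between `floatmin` and `eps`, while in Float16 it is not representable and rounds to zero, so a single hard-coded threshold cannot suit every element type.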

henry2004y · 2024-10-28T16:06:14Z

> BTW why use v̂ and 1e-20?
>
> I feel like v̂ is an intermediate variable; creating it anyway would slow down the code if it is not optimized away by the compiler. And 1e-20 is far from underflow for double precision (and should be different for Float16 and for the (un)normalized case). I would just like to avoid a branch in computationally intensive code.

Modern CPUs are very good at branch prediction. In our case, a small γ²v² value is rare, so the zero-velocity branch will almost never be taken. Even if you write

vx, vy, vz = vmag * normalize(γv)

without v̂, it will still allocate because normalize() returns a new array. With an SVector, the allocation is optimized away by the Julia compiler.
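
The allocation difference can be seen in isolation with a small sketch. The helper names `unit_view`, `unit_static`, and `allocated_bytes` are made up for this illustration; `normalize`, `SVector`, and `@allocated` are the real LinearAlgebra, StaticArrays, and Base entry points.

```julia
using LinearAlgebra
using StaticArrays

y = [0.0, 0.0, 0.0, 0.5, 0.0, 0.0]

# normalize on a view copies the data into a new heap-allocated Vector
unit_view(y) = normalize(@view y[4:6])

# building an SVector first lets normalize return a stack-allocated SVector
unit_static(y) = normalize(SVector{3, eltype(y)}(y[4], y[5], y[6]))

# Measure inside a function so that global-variable overhead is not counted
function allocated_bytes(f, y)
    f(y)                    # warm-up call, so compilation is not measured
    return @allocated f(y)
end

println("view:    ", allocated_bytes(unit_view, y), " bytes")
println("SVector: ", allocated_bytes(unit_static, y), " bytes")
```

Both helpers return the same unit vector; only the container type, and therefore where the result lives, differs.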

Let's compare them side-by-side:

function trace_relativistic_normalized!(dy, y, p::TestParticle.TPNormalizedTuple, t)
   _, E, B = p
   Ex, Ey, Ez = E(y, t)
   Bx, By, Bz = B(y, t)

   γv = @view y[4:6]
   γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
   if γ²v² > eps(eltype(dy))
      v̂ = SVector{3, eltype(dy)}(normalize(γv))
   else # no velocity
      v̂ = SVector{3, eltype(dy)}(0, 0, 0)
   end
   vmag = √(γ²v² / (1 + γ²v²))
   vx, vy, vz = vmag * v̂[1], vmag * v̂[2], vmag * v̂[3]

   dy[1], dy[2], dy[3] = vx, vy, vz
   dy[4] = vy*Bz - vz*By + Ex
   dy[5] = vz*Bx - vx*Bz + Ey
   dy[6] = vx*By - vy*Bx + Ez

   return
end

function trace_relativistic_normalized2!(dy, y, p::TestParticle.TPNormalizedTuple, t)
   _, E, B = p
   Ex, Ey, Ez = E(y, t)
   Bx, By, Bz = B(y, t)

   γv = @view y[4:6]
   γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
   T = eltype(dy)
   if γ²v² > eps(T)
      vmag = √(γ²v² / (1 + γ²v²))
      vx, vy, vz = vmag * normalize(γv)
   else # no velocity
      vx, vy, vz = zero(T), zero(T), zero(T)
   end

   dy[1], dy[2], dy[3] = vx, vy, vz
   dy[4] = vy*Bz - vz*By + Ex
   dy[5] = vz*Bx - vx*Bz + Ey
   dy[6] = vx*By - vy*Bx + Ez

   return
end


using BenchmarkTools

# Tracing relativistic particle in dimensionless units
param = prepare(xu -> SA[0.0, 0.0, 0.0], xu -> SA[0.0, 0.0, 1.0]; species=User)
tspan = (0.0, 1.0) # 1/2π period
stateinit = [0.0, 0.0, 0.0, 0.5, 0.0, 0.0]
prob = ODEProblem(trace_relativistic_normalized!, stateinit, tspan, param)

julia> @benchmark TestParticle.trace_relativistic_normalized!(stateinit, stateinit, prob.p, 0.0)
BenchmarkTools.Trial: 10000 samples with 927 evaluations.
 Range (min … max):  109.385 ns …  30.531 μs  ┊ GC (min … max): 0.00% … 99.22%
 Time  (median):     132.039 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   157.137 ns ± 382.613 ns  ┊ GC (mean ± σ):  7.44% ±  3.89%

    ▃█▆▃
  ▃▄████▇▆▅▅▄▄▃▃▄▅▄▅▅▄▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  109 ns           Histogram: frequency by time          289 ns <

 Memory estimate: 96 bytes, allocs estimate: 3.

julia> @benchmark TestParticle.trace_relativistic_normalized2!(stateinit, stateinit, prob.p, 0.0)
BenchmarkTools.Trial: 10000 samples with 875 evaluations.
 Range (min … max):  129.257 ns …  32.808 μs  ┊ GC (min … max):  0.00% … 99.35%
 Time  (median):     143.086 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   177.729 ns ± 448.856 ns  ┊ GC (mean ± σ):  10.46% ±  5.04%

   ▂▅█▁
  ▅████▆▄▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  129 ns           Histogram: frequency by time          344 ns <

 Memory estimate: 176 bytes, allocs estimate: 5.

Regarding 1e-20, I agree it's an arbitrarily chosen bad value. I will replace it with eps.

Beforerr · 2024-10-28T16:58:58Z

Your benchmark is surprising. I did a simple test, and test3 is the fastest. I think the benefit comes from StaticArrays: the earlier we convert to a static array, the faster the code. Eliminating v̂ still helps.

using StaticArrays
using LinearAlgebra

function test1(u)
    γv = @view u[4:6]
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    vmag = √(γ²v² / (1 + γ²v²))
    v̂ = SVector{3, eltype(vmag)}(normalize(γv))
    vx, vy, vz = vmag * v̂[1], vmag * v̂[2], vmag * v̂[3]
    return vx, vy, vz
end

function test2(u)
    γv = @view u[4:6]
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    vmag = √(γ²v² / (1 + γ²v²))
    vx, vy, vz =  vmag * normalize(γv)
    return vx, vy, vz
end

function test3(u)
    γv = @views SVector{3, eltype(u)}(u[4:6])
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    vmag = √(γ²v² / (1 + γ²v²))
    vx, vy, vz =  vmag * normalize(γv)
    return vx, vy, vz
end

Results

julia> @benchmark test1(u)
BenchmarkTools.Trial: 10000 samples with 970 evaluations.
 Range (min … max):  77.320 ns …  2.744 μs  ┊ GC (min … max): 0.00% … 95.91%
 Time  (median):     82.732 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.239 ns ± 86.160 ns  ┊ GC (mean ± σ):  4.45% ±  4.49%

   ▃▄ █▄▁▃                                                     
  ▂████████▆▄▃▃▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  77.3 ns         Histogram: frequency by time         134 ns <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> @benchmark test2(u)
BenchmarkTools.Trial: 10000 samples with 829 evaluations.
 Range (min … max):  147.768 ns …   3.315 μs  ┊ GC (min … max): 0.00% … 93.31%
 Time  (median):     170.537 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   178.993 ns ± 122.050 ns  ┊ GC (mean ± σ):  4.29% ±  5.80%

              ▂▄▄▅▅▇▄██▆▇▇▇▆▅▄▃▂▁                                
  ▂▂▂▂▂▃▄▄▄▆▇████████████████████▇▆▆▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▅
  148 ns           Histogram: frequency by time          212 ns <

 Memory estimate: 192 bytes, allocs estimate: 5.

julia> @benchmark test3(u)
BenchmarkTools.Trial: 10000 samples with 990 evaluations.
 Range (min … max):  44.402 ns …  2.527 μs  ┊ GC (min … max): 0.00% … 96.99%
 Time  (median):     45.033 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.068 ns ± 40.773 ns  ┊ GC (mean ± σ):  1.85% ±  2.13%

   ▇█▇▅▄▂▂▂▂▃▃▂▂▂▄▄▄▅▃▃▂▁▁▁▁▁▁▁   ▁                           ▂
  ▆████████████████████████████████████▇▆▆▆▆▇▆▅▅▆▅▆▆▆▄▅▄▂▄▄▃▂ █
  44.4 ns      Histogram: log(frequency) by time      54.2 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.

julia> @benchmark test1(su)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
 Range (min … max):  77.823 ns …  2.693 μs  ┊ GC (min … max): 0.00% … 95.79%
 Time  (median):     82.128 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.694 ns ± 86.851 ns  ┊ GC (mean ± σ):  4.61% ±  4.49%

   ▆█▃▅▄▁▁                                                     
  ▃████████▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▂ ▃
  77.8 ns         Histogram: frequency by time         128 ns <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> @benchmark test2(su)
BenchmarkTools.Trial: 10000 samples with 855 evaluations.
 Range (min … max):  137.913 ns …   3.428 μs  ┊ GC (min … max): 0.00% … 94.06%
 Time  (median):     161.890 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   172.117 ns ± 124.310 ns  ┊ GC (mean ± σ):  4.55% ±  5.89%

            ▁▂▄▆█▇▅▄▄▂                                           
  ▁▁▁▂▂▂▃▅▆▇███████████▇▅▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  138 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 192 bytes, allocs estimate: 5.

julia> @benchmark test3(su)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  43.348 ns …  2.509 μs  ┊ GC (min … max): 0.00% … 96.93%
 Time  (median):     44.442 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.364 ns ± 41.202 ns  ┊ GC (mean ± σ):  1.90% ±  2.13%

   ▁▆██                                                        
  ▃████▇▂▃▅▅▄▃▄▆▄▄▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  43.3 ns         Histogram: frequency by time        56.3 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.

henry2004y · 2024-10-28T17:30:00Z

> Your benchmark is surprising. I did a simple test. test3 is the fastest, I think the benefits come from StaticArrays, the earliest we use the fastest.

I followed your advice in the new commit. Converting to an SVector before calling normalize now removes all the allocations:

julia> @benchmark TestParticle.trace_relativistic_normalized!($stateinit, $stateinit, $prob.p, 0.0)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  17.900 ns … 222.200 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.883 ns ±   4.304 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █        ▂▂▁                                                 ▁
  █▇▆▆▇▇▇▇█████████▇▇▆▇▆▆▇▇▅▄▄▅▆▄▅▅▅▄▄▅▄▄▄▄▄▄▄▄▅▅▅▄▄▁▅▃▄▄▃▁▃▁▄ █
  17.9 ns       Histogram: log(frequency) by time      36.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

P.S. The earlier benchmarks did not interpolate the arguments with `$`, so they also counted the allocations from passing the input arrays.
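
The effect of `$` interpolation can be reproduced on a toy function. This is a sketch with a made-up function `sumsq`, not the tracing kernel from this PR; `@benchmark` and `BenchmarkTools.allocs` are the real BenchmarkTools entry points.

```julia
using BenchmarkTools

x = rand(6)
sumsq(v) = v[4]^2 + v[5]^2 + v[6]^2

# `x` is a non-constant global here, so the measurement includes the cost of
# looking it up and dispatching on its runtime type
b_global = @benchmark sumsq(x)

# `$x` splices the value into the benchmark, so only the work inside sumsq
# is measured
b_interp = @benchmark sumsq($x)

println("allocs without \$: ", BenchmarkTools.allocs(b_global))
println("allocs with \$:    ", BenchmarkTools.allocs(b_interp))
```

The same pattern applies to `$stateinit` and `$prob.p` in the call above.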

github-actions · 2024-10-28T17:40:03Z

Benchmark result

Judge result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

Time of benchmarks:
- Target: 28 Oct 2024 - 17:38
- Baseline: 28 Oct 2024 - 17:39
Package commits:
- Target: 778813
- Baseline: c1034e
Julia commits:
- Target: 8f5b7c
- Baseline: 8f5b7c
Julia command flags:
- Target: None
- Baseline: None
Environment variables:
- Target: None
- Baseline: None

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).

ID	time ratio	memory ratio
`["trace", "analytic field", "in place"]`	1.08 (5%) ❌	1.00 (1%)

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

["trace", "GC"]
["trace", "analytic field"]
["trace", "numerical field"]
["trace", "time-dependent field"]

Julia versioninfo

Target

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4367 s          0 s        291 s       3873 s          0 s
       #2     0 MHz       4144 s          0 s        296 s       4095 s          0 s
       #3     0 MHz       4377 s          0 s        280 s       3890 s          0 s
       #4     0 MHz       3977 s          0 s        255 s       4323 s          0 s
  Memory: 15.606491088867188 GB (13108.9765625 MB free)
  Uptime: 857.47 sec
  Load Avg:  1.25  2.68  1.78
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Baseline

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4689 s          0 s        310 s       4675 s          0 s
       #2     0 MHz       4308 s          0 s        318 s       5053 s          0 s
       #3     0 MHz       4767 s          0 s        310 s       4615 s          0 s
       #4     0 MHz       4222 s          0 s        283 s       5194 s          0 s
  Memory: 15.606491088867188 GB (13084.7109375 MB free)
  Uptime: 972.08 sec
  Load Avg:  1.06  2.15  1.69
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Target result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

Time of benchmark: 28 Oct 2024 - 17:38
Package commit: 778813
Julia commit: 8f5b7c
Julia command flags: None
Environment variables: None

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

ID	time	memory	allocations
`["trace", "GC", "in place"]`	127.196 μs (5%)	122.69 KiB (1%)	2286
`["trace", "analytic field", "in place relativistic"]`	6.857 μs (5%)	10.45 KiB (1%)	199
`["trace", "analytic field", "in place"]`	6.135 μs (5%)	10.45 KiB (1%)	199
`["trace", "analytic field", "out of place"]`	3.675 μs (5%)	8.61 KiB (1%)	159
`["trace", "numerical field", "Boris ensemble"]`	3.256 μs (5%)	3.30 KiB (1%)	25
`["trace", "numerical field", "Boris"]`	1.674 μs (5%)	1.88 KiB (1%)	19
`["trace", "numerical field", "in place"]`	18.584 μs (5%)	12.73 KiB (1%)	200
`["trace", "numerical field", "out of place"]`	13.304 μs (5%)	9.81 KiB (1%)	159
`["trace", "time-dependent field", "in place"]`	7.386 μs (5%)	11.30 KiB (1%)	220
`["trace", "time-dependent field", "out of place"]`	5.181 μs (5%)	9.39 KiB (1%)	177

Baseline result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

Time of benchmark: 28 Oct 2024 - 17:39
Package commit: c1034e
Julia commit: 8f5b7c
Julia command flags: None
Environment variables: None

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

ID	time	memory	allocations
`["trace", "GC", "in place"]`	127.616 μs (5%)	122.69 KiB (1%)	2286
`["trace", "analytic field", "in place"]`	5.696 μs (5%)	10.45 KiB (1%)	199
`["trace", "analytic field", "out of place"]`	3.627 μs (5%)	8.61 KiB (1%)	159
`["trace", "numerical field", "Boris ensemble"]`	3.251 μs (5%)	3.30 KiB (1%)	25
`["trace", "numerical field", "Boris"]`	1.689 μs (5%)	1.88 KiB (1%)	19
`["trace", "numerical field", "in place"]`	18.414 μs (5%)	12.73 KiB (1%)	200
`["trace", "numerical field", "out of place"]`	13.424 μs (5%)	9.81 KiB (1%)	159
`["trace", "time-dependent field", "in place"]`	7.311 μs (5%)	11.30 KiB (1%)	220
`["trace", "time-dependent field", "out of place"]`	5.019 μs (5%)	9.39 KiB (1%)	177

Runtime information

Runtime Info
BLAS #threads	2
`BLAS.vendor()`	`lbt`
`Sys.CPU_THREADS`	4

lscpu output:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7763 64-Core Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           4890.84
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           1 MiB (2 instances)
L3 cache:                           32 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Cpu Property	Value
Brand	AMD EPYC 7763 64-Core Processor
Vendor	:AMD
Architecture	:Unknown
Model	Family: 0xaf, Model: 0x01, Stepping: 0x01, Type: 0x00
Cores	16 physical cores, 16 logical cores (on executing CPU)
	No Hyperthreading hardware capability detected
Clock Frequencies	Not supported by CPU
Data Cache	Level 1:3 : (32, 512, 32768) kbytes
	64 byte cache line size
Address Size	48 bits virtual, 48 bits physical
SIMD	256 bit = 32 byte max. SIMD vector size
Time Stamp Counter	TSC is accessible via `rdtsc`
	TSC runs at constant rate (invariant from clock frequency)
Perf. Monitoring	Performance Monitoring Counters (PMC) are not supported
Hypervisor	Yes, Microsoft

henry2004y added 2 commits October 27, 2024 17:58

solve momentum in the relativistic case #198

393ca59

Add get_velocity

e900cd8

henry2004y requested a review from TCLiuu October 27, 2024 22:08

henry2004y added 2 commits October 27, 2024 18:58

Add relativistic benchmark

e2b3f61

Update sol extraction; add utility methods

6641b6c