Relativistic momentum #200

Merged
merged 13 commits into from
Oct 28, 2024

henry2004y
Owner

@henry2004y henry2004y commented Oct 27, 2024

Handle #198 by solving for the "momentum" per unit mass, $\vec{p}/m = \gamma \vec{v}$, instead of $\vec{v}$ in the relativistic case. This way we no longer need to check whether the computed velocity exceeds the speed of light, since the conversion from $\gamma \vec{v}$ back to $\vec{v}$ guarantees a speed below c. A small caveat is that when the velocity is 0 the direction is undetermined, which requires an additional branch.
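
The bound on the speed follows directly from the definition of $\gamma$. A short derivation, writing $\vec{u} = \gamma \vec{v}$ (the shorthand $\vec{u}$ is introduced here only for brevity and does not appear in the PR):

```latex
\gamma^2 \left(1 - \frac{v^2}{c^2}\right) = 1
\;\Rightarrow\;
\gamma^2 = 1 + \frac{u^2}{c^2},
\qquad
\vec{v} = \frac{\vec{u}}{\gamma} = \frac{\vec{u}}{\sqrt{1 + u^2/c^2}},
\qquad
|\vec{v}| = c \, \sqrt{\frac{u^2}{c^2 + u^2}} < c .
```

For any finite $u$ the fraction under the square root is below 1, so the recovered speed stays below c in exact arithmetic.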

As quoted from the original discussion note, this form is a bit unintuitive and requires a conversion from relativistic momentum to velocity in certain cases. However, I feel it is more natural in a relativistic setting. Further discussion is welcome! @Beforerr

The normalized case may need further testing.

@henry2004y henry2004y requested a review from TCLiuu October 27, 2024 22:08

codecov bot commented Oct 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.39%. Comparing base (c1034ec) to head (aeaae8d).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #200      +/-   ##
==========================================
+ Coverage   83.48%   84.39%   +0.90%     
==========================================
  Files           9        9              
  Lines         660      692      +32     
==========================================
+ Hits          551      584      +33     
+ Misses        109      108       -1     

Repository owner deleted a comment from github-actions bot Oct 28, 2024
Contributor

Benchmark result

Judge result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

  • Time of benchmarks:
    • Target: 28 Oct 2024 - 00:52
    • Baseline: 28 Oct 2024 - 00:54
  • Package commits:
    • Target: 6f6f36
    • Baseline: c1034e
  • Julia commits:
    • Target: 8f5b7c
    • Baseline: 8f5b7c
  • Julia command flags:
    • Target: None
    • Baseline: None
  • Environment variables:
    • Target: None
    • Baseline: None

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).

| ID | time ratio | memory ratio |
|---|---|---|
| ["trace", "GC", "in place"] | 1.07 (5%) ❌ | 1.00 (1%) |

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

  • ["trace", "GC"]
  • ["trace", "analytic field"]
  • ["trace", "numerical field"]
  • ["trace", "time-dependent field"]

Julia versioninfo

Target

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4281 s          0 s        293 s       3586 s          0 s
       #2     0 MHz       4626 s          0 s        282 s       3248 s          0 s
       #3     0 MHz       4148 s          0 s        277 s       3740 s          0 s
       #4     0 MHz       3945 s          0 s        283 s       3937 s          0 s
  Memory: 15.606491088867188 GB (13119.08984375 MB free)
  Uptime: 819.43 sec
  Load Avg:  1.27  2.71  1.81
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Baseline

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4376 s          0 s        311 s       4627 s          0 s
       #2     0 MHz       4684 s          0 s        301 s       4325 s          0 s
       #3     0 MHz       4566 s          0 s        294 s       4460 s          0 s
       #4     0 MHz       4504 s          0 s        329 s       4487 s          0 s
  Memory: 15.606491088867188 GB (13048.6328125 MB free)
  Uptime: 935.11 sec
  Load Avg:  1.05  2.17  1.71
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Target result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

  • Time of benchmark: 28 Oct 2024 - 0:52
  • Package commit: 6f6f36
  • Julia commit: 8f5b7c
  • Julia command flags: None
  • Environment variables: None

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

| ID | time | GC time | memory | allocations |
|---|---|---|---|---|
| ["trace", "GC", "in place"] | 136.976 μs (5%) | | 122.69 KiB (1%) | 2286 |
| ["trace", "analytic field", "in place relativistic"] | 8.636 μs (5%) | | 13.97 KiB (1%) | 289 |
| ["trace", "analytic field", "in place"] | 5.971 μs (5%) | | 10.45 KiB (1%) | 199 |
| ["trace", "analytic field", "out of place"] | 3.637 μs (5%) | | 8.61 KiB (1%) | 159 |
| ["trace", "numerical field", "Boris ensemble"] | 3.261 μs (5%) | | 3.30 KiB (1%) | 25 |
| ["trace", "numerical field", "Boris"] | 1.684 μs (5%) | | 1.88 KiB (1%) | 19 |
| ["trace", "numerical field", "in place"] | 18.504 μs (5%) | | 12.73 KiB (1%) | 200 |
| ["trace", "numerical field", "out of place"] | 13.235 μs (5%) | | 9.81 KiB (1%) | 159 |
| ["trace", "time-dependent field", "in place"] | 7.341 μs (5%) | | 11.30 KiB (1%) | 220 |
| ["trace", "time-dependent field", "out of place"] | 5.064 μs (5%) | | 9.39 KiB (1%) | 177 |

Baseline result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

  • Time of benchmark: 28 Oct 2024 - 0:54
  • Package commit: c1034e
  • Julia commit: 8f5b7c
  • Julia command flags: None
  • Environment variables: None

Results

| ID | time | GC time | memory | allocations |
|---|---|---|---|---|
| ["trace", "GC", "in place"] | 127.799 μs (5%) | | 122.69 KiB (1%) | 2286 |
| ["trace", "analytic field", "in place"] | 5.865 μs (5%) | | 10.45 KiB (1%) | 199 |
| ["trace", "analytic field", "out of place"] | 3.564 μs (5%) | | 8.61 KiB (1%) | 159 |
| ["trace", "numerical field", "Boris ensemble"] | 3.265 μs (5%) | | 3.30 KiB (1%) | 25 |
| ["trace", "numerical field", "Boris"] | 1.688 μs (5%) | | 1.88 KiB (1%) | 19 |
| ["trace", "numerical field", "in place"] | 18.274 μs (5%) | | 12.73 KiB (1%) | 200 |
| ["trace", "numerical field", "out of place"] | 12.974 μs (5%) | | 9.81 KiB (1%) | 159 |
| ["trace", "time-dependent field", "in place"] | 7.376 μs (5%) | | 11.30 KiB (1%) | 220 |
| ["trace", "time-dependent field", "out of place"] | 5.043 μs (5%) | | 9.39 KiB (1%) | 177 |

Runtime information

Runtime Info
BLAS #threads 2
BLAS.vendor() lbt
Sys.CPU_THREADS 4

lscpu output:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7763 64-Core Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           4890.87
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           1 MiB (2 instances)
L3 cache:                           32 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
| Cpu Property | Value |
|---|---|
| Brand | AMD EPYC 7763 64-Core Processor |
| Vendor | :AMD |
| Architecture | :Unknown |
| Model | Family: 0xaf, Model: 0x01, Stepping: 0x01, Type: 0x00 |
| Cores | 16 physical cores, 16 logical cores (on executing CPU) |
| | No Hyperthreading hardware capability detected |
| Clock Frequencies | Not supported by CPU |
| Data Cache | Level 1:3 : (32, 512, 32768) kbytes |
| | 64 byte cache line size |
| Address Size | 48 bits virtual, 48 bits physical |
| SIMD | 256 bit = 32 byte max. SIMD vector size |
| Time Stamp Counter | TSC is accessible via rdtsc |
| | TSC runs at constant rate (invariant from clock frequency) |
| Perf. Monitoring | Performance Monitoring Counters (PMC) are not supported |
| Hypervisor | Yes, Microsoft |

@Beforerr
Contributor

Beforerr commented Oct 28, 2024

I would also prefer solving momentum (which is also more common in PIC codes).

PS: If you are concerned about velocity accuracy, norm may be better.
https://discourse.julialang.org/t/performance-of-norm-function/14709

PPS: This may be faster:

   if γ²v² > 1e-20 
      vmag = √(γ²v² / (1 + γ²v²/c2))
      vx, vy, vz = vmag * normalize(γv)
   else # no velocity
      vx, vy, vz = 0, 0, 0
   end

...

const c2 = c^2

@Beforerr
Contributor

Beforerr commented Oct 28, 2024

There should be no $\Omega$ in the normalized case. Also see pull #202.

image

@henry2004y
Owner Author

I feel like in the normalized relativistic case, the only reasonable velocity normalization factor is c; otherwise, in the Lorentz factor $\gamma$, c will inevitably show up.
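
To make this concrete, here is a short worked form, assuming velocities are normalized by a generic unit $U$ (a symbol introduced here for illustration, not taken from the PR):

```latex
v' = \frac{v}{U},
\qquad
\gamma = \frac{1}{\sqrt{1 - v^2/c^2}} = \frac{1}{\sqrt{1 - v'^2 \, U^2/c^2}}
\quad\xrightarrow{\; U = c \;}\quad
\gamma = \frac{1}{\sqrt{1 - v'^2}} .
```

Only the choice $U = c$ removes the explicit factor $U^2/c^2$ from $\gamma$; any other unit leaves c in the normalized equations.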

@Beforerr
Contributor

BTW why use v̂ and 1e-20?

v̂ is only an intermediate variable, and creating it would slow the code down unless the compiler optimizes it away.
And 1e-20 is far from the underflow limit for double precision (the threshold should also differ for Float16 and between the normalized and unnormalized cases). I would just like to avoid a branch in computationally intensive code.
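
One possible way to avoid both the intermediate v̂ and the branch is to note that `vmag * normalize(γv)` is algebraically the same as `γv / √(1 + γ²v²/c²)`, which is well defined at zero momentum. The following is a minimal sketch using plain tuples and only Base Julia. The helper names `velocity_from_momentum` and `speed` are hypothetical, this is not the implementation used in the PR, and it has not been benchmarked against it.

```julia
# Hypothetical helper, not part of TestParticle.jl.
# Converts γv (momentum per unit mass) to v without normalizing,
# so no unit vector and no zero-velocity branch are needed.
function velocity_from_momentum(γv::NTuple{3,T}, c::T = one(T)) where {T<:AbstractFloat}
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    # γ = √(1 + (γv)²/c²), hence v = γv / γ; the divisor is never below 1
    invγ = inv(sqrt(one(T) + γ²v² / c^2))
    return (γv[1] * invγ, γv[2] * invγ, γv[3] * invγ)
end

# Magnitude of a 3-tuple
speed(v) = sqrt(v[1]^2 + v[2]^2 + v[3]^2)

v_rest = velocity_from_momentum((0.0, 0.0, 0.0))    # particle at rest, c = 1
v_fast = velocity_from_momentum((1.0e3, 0.0, 0.0))  # large momentum, c = 1

println("at rest: ", v_rest)
println("fast:    ", v_fast, "  speed below c: ", speed(v_fast) < 1.0)
```

The trade-off is that the direction is never formed explicitly, so code that needs v̂ for other purposes would still have to compute it separately.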

@henry2004y
Owner Author

> BTW why use v̂ and 1e-20?
>
> v̂ is only an intermediate variable, and creating it would slow the code down unless the compiler optimizes it away. And 1e-20 is far from the underflow limit for double precision (the threshold should also differ for Float16 and between the normalized and unnormalized cases). I would just like to avoid a branch in computationally intensive code.

Modern CPUs are very good at branch prediction. In our case a small γ²v² value is rare, so the zero-velocity branch will almost never be taken. Even if you write

vx, vy, vz = vmag * normalize(γv)

without v̂, it will still allocate because of normalize(). With SVector, this will be optimized by the Julia compiler.

Let's compare them side-by-side:

function trace_relativistic_normalized!(dy, y, p::TestParticle.TPNormalizedTuple, t)
   _, E, B = p
   Ex, Ey, Ez = E(y, t)
   Bx, By, Bz = B(y, t)

   γv = @view y[4:6]
   γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
   if γ²v² > eps(eltype(dy))
      v̂ = SVector{3, eltype(dy)}(normalize(γv))
   else # no velocity
      v̂ = SVector{3, eltype(dy)}(0, 0, 0)
   end
   vmag = √(γ²v² / (1 + γ²v²))
   vx, vy, vz = vmag * v̂[1], vmag * v̂[2], vmag * v̂[3]

   dy[1], dy[2], dy[3] = vx, vy, vz
   dy[4] = vy*Bz - vz*By + Ex
   dy[5] = vz*Bx - vx*Bz + Ey
   dy[6] = vx*By - vy*Bx + Ez

   return
end

function trace_relativistic_normalized2!(dy, y, p::TestParticle.TPNormalizedTuple, t)
   _, E, B = p
   Ex, Ey, Ez = E(y, t)
   Bx, By, Bz = B(y, t)

   γv = @view y[4:6]
   γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
   T = eltype(dy)
   if γ²v² > eps(T)
      vmag = √(γ²v² / (1 + γ²v²))
      vx, vy, vz = vmag * normalize(γv)
   else # no velocity
      vx, vy, vz = zero(T), zero(T), zero(T)
   end

   dy[1], dy[2], dy[3] = vx, vy, vz
   dy[4] = vy*Bz - vz*By + Ex
   dy[5] = vz*Bx - vx*Bz + Ey
   dy[6] = vx*By - vy*Bx + Ez

   return
end


using BenchmarkTools, OrdinaryDiffEq, StaticArrays, TestParticle

# Tracing relativistic particle in dimensionless units
param = prepare(xu -> SA[0.0, 0.0, 0.0], xu -> SA[0.0, 0.0, 1.0]; species=User)
tspan = (0.0, 1.0) # 1/2π period
stateinit = [0.0, 0.0, 0.0, 0.5, 0.0, 0.0]
prob = ODEProblem(trace_relativistic_normalized!, stateinit, tspan, param)
julia> @benchmark TestParticle.trace_relativistic_normalized!(stateinit, stateinit, prob.p, 0.0)
BenchmarkTools.Trial: 10000 samples with 927 evaluations.
 Range (min … max):  109.385 ns …  30.531 μs  ┊ GC (min … max): 0.00% … 99.22%
 Time  (median):     132.039 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   157.137 ns ± 382.613 ns  ┊ GC (mean ± σ):  7.44% ±  3.89%

    ▃█▆▃
  ▃▄████▇▆▅▅▄▄▃▃▄▅▄▅▅▄▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  109 ns           Histogram: frequency by time          289 ns <

 Memory estimate: 96 bytes, allocs estimate: 3.

julia> @benchmark TestParticle.trace_relativistic_normalized2!(stateinit, stateinit, prob.p, 0.0)
BenchmarkTools.Trial: 10000 samples with 875 evaluations.
 Range (min … max):  129.257 ns …  32.808 μs  ┊ GC (min … max):  0.00% … 99.35%
 Time  (median):     143.086 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   177.729 ns ± 448.856 ns  ┊ GC (mean ± σ):  10.46% ±  5.04%

   ▂▅█▁
  ▅████▆▄▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  129 ns           Histogram: frequency by time          344 ns <

 Memory estimate: 176 bytes, allocs estimate: 5.

Regarding 1e-20, I agree it's an arbitrarily chosen bad value. I will replace it with eps.
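
As a rough illustration of why a fixed 1e-20 does not carry over between floating-point types, the sketch below prints `eps(T)` next to the value of 1e-20 after conversion to each type. The predicate `is_moving` is a hypothetical name used only here. Note that `eps(T)` is the spacing of floating-point numbers at 1.0 rather than the underflow limit, so it is a natural scale mainly when γ²v² is of order one, as in the normalized case.

```julia
# Hypothetical predicate: the threshold follows the element type
# instead of a hard-coded 1e-20.
is_moving(γ²v²::T) where {T<:AbstractFloat} = γ²v² > eps(T)

for T in (Float16, Float32, Float64)
    # In Float16 the literal 1e-20 underflows to zero, so a comparison
    # against it would only reject exactly-zero momentum.
    println(T, ": eps = ", eps(T), ", T(1e-20) = ", T(1e-20))
end

println(is_moving(0.25))   # γv = 0.5 along x gives γ²v² = 0.25
println(is_moving(0.0f0))  # particle at rest in Float32
```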

@Beforerr
Contributor

Your benchmark is surprising. I did a simple test, and test3 is the fastest. I think the benefit comes from StaticArrays: the earlier we convert to an SVector, the faster the code runs. Eliminating v̂ still helps.

using StaticArrays
using LinearAlgebra

function test1(u)
    γv = @view u[4:6]
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    vmag = √(γ²v² / (1 + γ²v²))
    v̂ = SVector{3, eltype(vmag)}(normalize(γv))
    vx, vy, vz = vmag * v̂[1], vmag * v̂[2], vmag * v̂[3]
    return vx, vy, vz
end

function test2(u)
    γv = @view u[4:6]
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    vmag = √(γ²v² / (1 + γ²v²))
    vx, vy, vz =  vmag * normalize(γv)
    return vx, vy, vz
end

function test3(u)
    γv = @views SVector{3, eltype(u)}(u[4:6])
    γ²v² = γv[1]^2 + γv[2]^2 + γv[3]^2
    vmag = √(γ²v² / (1 + γ²v²))
    vx, vy, vz =  vmag * normalize(γv)
    return vx, vy, vz
end

Results

julia> @benchmark test1(u)
BenchmarkTools.Trial: 10000 samples with 970 evaluations.
 Range (min … max):  77.320 ns …  2.744 μs  ┊ GC (min … max): 0.00% … 95.91%
 Time  (median):     82.732 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.239 ns ± 86.160 ns  ┊ GC (mean ± σ):  4.45% ±  4.49%

   ▃▄ █▄▁▃                                                     
  ▂████████▆▄▃▃▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  77.3 ns         Histogram: frequency by time         134 ns <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> @benchmark test2(u)
BenchmarkTools.Trial: 10000 samples with 829 evaluations.
 Range (min … max):  147.768 ns …   3.315 μs  ┊ GC (min … max): 0.00% … 93.31%
 Time  (median):     170.537 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   178.993 ns ± 122.050 ns  ┊ GC (mean ± σ):  4.29% ±  5.80%

              ▂▄▄▅▅▇▄██▆▇▇▇▆▅▄▃▂▁                                
  ▂▂▂▂▂▃▄▄▄▆▇████████████████████▇▆▆▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▅
  148 ns           Histogram: frequency by time          212 ns <

 Memory estimate: 192 bytes, allocs estimate: 5.

julia> @benchmark test3(u)
BenchmarkTools.Trial: 10000 samples with 990 evaluations.
 Range (min … max):  44.402 ns …  2.527 μs  ┊ GC (min … max): 0.00% … 96.99%
 Time  (median):     45.033 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.068 ns ± 40.773 ns  ┊ GC (mean ± σ):  1.85% ±  2.13%

   ▇█▇▅▄▂▂▂▂▃▃▂▂▂▄▄▄▅▃▃▂▁▁▁▁▁▁▁   ▁                           ▂
  ▆████████████████████████████████████▇▆▆▆▆▇▆▅▅▆▅▆▆▆▄▅▄▂▄▄▃▂ █
  44.4 ns      Histogram: log(frequency) by time      54.2 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.

julia> @benchmark test1(su)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
 Range (min … max):  77.823 ns …  2.693 μs  ┊ GC (min … max): 0.00% … 95.79%
 Time  (median):     82.128 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.694 ns ± 86.851 ns  ┊ GC (mean ± σ):  4.61% ±  4.49%

   ▆█▃▅▄▁▁                                                     
  ▃████████▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▂ ▃
  77.8 ns         Histogram: frequency by time         128 ns <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> @benchmark test2(su)
BenchmarkTools.Trial: 10000 samples with 855 evaluations.
 Range (min … max):  137.913 ns …   3.428 μs  ┊ GC (min … max): 0.00% … 94.06%
 Time  (median):     161.890 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   172.117 ns ± 124.310 ns  ┊ GC (mean ± σ):  4.55% ±  5.89%

            ▁▂▄▆█▇▅▄▄▂                                           
  ▁▁▁▂▂▂▃▅▆▇███████████▇▅▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  138 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 192 bytes, allocs estimate: 5.

julia> @benchmark test3(su)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  43.348 ns …  2.509 μs  ┊ GC (min … max): 0.00% … 96.93%
 Time  (median):     44.442 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.364 ns ± 41.202 ns  ┊ GC (mean ± σ):  1.90% ±  2.13%

   ▁▆██                                                        
  ▃████▇▂▃▅▅▄▃▄▆▄▄▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  43.3 ns         Histogram: frequency by time        56.3 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.

@henry2004y
Owner Author

henry2004y commented Oct 28, 2024

> Your benchmark is surprising. I did a simple test, and test3 is the fastest. I think the benefit comes from StaticArrays: the earlier we convert to an SVector, the faster the code runs.

I followed your advice in the new commit. Converting to an SVector before calling normalize now removes all the allocations:

julia> @benchmark TestParticle.trace_relativistic_normalized!($stateinit, $stateinit, $prob.p, 0.0)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  17.900 ns … 222.200 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.883 ns ±   4.304 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █        ▂▂▁                                                 ▁
  █▇▆▆▇▇▇▇█████████▇▇▆▇▆▆▇▇▅▄▄▅▆▄▅▅▅▄▄▅▄▄▄▄▄▄▄▄▅▅▅▄▄▁▅▃▄▄▃▁▃▁▄ █
  17.9 ns       Histogram: log(frequency) by time      36.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

P.S. The earlier benchmarks did not interpolate the input arrays with `$`, so they also counted the allocations from passing those arrays.

Contributor

Benchmark result

Judge result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

  • Time of benchmarks:
    • Target: 28 Oct 2024 - 17:38
    • Baseline: 28 Oct 2024 - 17:39
  • Package commits:
    • Target: 778813
    • Baseline: c1034e
  • Julia commits:
    • Target: 8f5b7c
    • Baseline: 8f5b7c
  • Julia command flags:
    • Target: None
    • Baseline: None
  • Environment variables:
    • Target: None
    • Baseline: None

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).

| ID | time ratio | memory ratio |
|---|---|---|
| ["trace", "analytic field", "in place"] | 1.08 (5%) ❌ | 1.00 (1%) |

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

  • ["trace", "GC"]
  • ["trace", "analytic field"]
  • ["trace", "numerical field"]
  • ["trace", "time-dependent field"]

Julia versioninfo

Target

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4367 s          0 s        291 s       3873 s          0 s
       #2     0 MHz       4144 s          0 s        296 s       4095 s          0 s
       #3     0 MHz       4377 s          0 s        280 s       3890 s          0 s
       #4     0 MHz       3977 s          0 s        255 s       4323 s          0 s
  Memory: 15.606491088867188 GB (13108.9765625 MB free)
  Uptime: 857.47 sec
  Load Avg:  1.25  2.68  1.78
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Baseline

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4689 s          0 s        310 s       4675 s          0 s
       #2     0 MHz       4308 s          0 s        318 s       5053 s          0 s
       #3     0 MHz       4767 s          0 s        310 s       4615 s          0 s
       #4     0 MHz       4222 s          0 s        283 s       5194 s          0 s
  Memory: 15.606491088867188 GB (13084.7109375 MB free)
  Uptime: 972.08 sec
  Load Avg:  1.06  2.15  1.69
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Target result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

  • Time of benchmark: 28 Oct 2024 - 17:38
  • Package commit: 778813
  • Julia commit: 8f5b7c
  • Julia command flags: None
  • Environment variables: None

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

| ID | time | GC time | memory | allocations |
|---|---|---|---|---|
| ["trace", "GC", "in place"] | 127.196 μs (5%) | | 122.69 KiB (1%) | 2286 |
| ["trace", "analytic field", "in place relativistic"] | 6.857 μs (5%) | | 10.45 KiB (1%) | 199 |
| ["trace", "analytic field", "in place"] | 6.135 μs (5%) | | 10.45 KiB (1%) | 199 |
| ["trace", "analytic field", "out of place"] | 3.675 μs (5%) | | 8.61 KiB (1%) | 159 |
| ["trace", "numerical field", "Boris ensemble"] | 3.256 μs (5%) | | 3.30 KiB (1%) | 25 |
| ["trace", "numerical field", "Boris"] | 1.674 μs (5%) | | 1.88 KiB (1%) | 19 |
| ["trace", "numerical field", "in place"] | 18.584 μs (5%) | | 12.73 KiB (1%) | 200 |
| ["trace", "numerical field", "out of place"] | 13.304 μs (5%) | | 9.81 KiB (1%) | 159 |
| ["trace", "time-dependent field", "in place"] | 7.386 μs (5%) | | 11.30 KiB (1%) | 220 |
| ["trace", "time-dependent field", "out of place"] | 5.181 μs (5%) | | 9.39 KiB (1%) | 177 |

Baseline result

Benchmark Report for /home/runner/work/TestParticle.jl/TestParticle.jl

Job Properties

  • Time of benchmark: 28 Oct 2024 - 17:39
  • Package commit: c1034e
  • Julia commit: 8f5b7c
  • Julia command flags: None
  • Environment variables: None

Results

| ID | time | GC time | memory | allocations |
|---|---|---|---|---|
| ["trace", "GC", "in place"] | 127.616 μs (5%) | | 122.69 KiB (1%) | 2286 |
| ["trace", "analytic field", "in place"] | 5.696 μs (5%) | | 10.45 KiB (1%) | 199 |
| ["trace", "analytic field", "out of place"] | 3.627 μs (5%) | | 8.61 KiB (1%) | 159 |
| ["trace", "numerical field", "Boris ensemble"] | 3.251 μs (5%) | | 3.30 KiB (1%) | 25 |
| ["trace", "numerical field", "Boris"] | 1.689 μs (5%) | | 1.88 KiB (1%) | 19 |
| ["trace", "numerical field", "in place"] | 18.414 μs (5%) | | 12.73 KiB (1%) | 200 |
| ["trace", "numerical field", "out of place"] | 13.424 μs (5%) | | 9.81 KiB (1%) | 159 |
| ["trace", "time-dependent field", "in place"] | 7.311 μs (5%) | | 11.30 KiB (1%) | 220 |
| ["trace", "time-dependent field", "out of place"] | 5.019 μs (5%) | | 9.39 KiB (1%) | 177 |

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

  • ["trace", "GC"]
  • ["trace", "analytic field"]
  • ["trace", "numerical field"]
  • ["trace", "time-dependent field"]

Julia versioninfo

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 22.04.5 LTS
  uname: Linux 6.5.0-1025-azure #26~22.04.1-Ubuntu SMP Thu Jul 11 22:33:04 UTC 2024 x86_64 x86_64
  CPU: AMD EPYC 7763 64-Core Processor: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz       4689 s          0 s        310 s       4675 s          0 s
       #2     0 MHz       4308 s          0 s        318 s       5053 s          0 s
       #3     0 MHz       4767 s          0 s        310 s       4615 s          0 s
       #4     0 MHz       4222 s          0 s        283 s       5194 s          0 s
  Memory: 15.606491088867188 GB (13084.7109375 MB free)
  Uptime: 972.08 sec
  Load Avg:  1.06  2.15  1.69
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Runtime information

| Runtime Info | |
|:--|:--|
| BLAS #threads | 2 |
| `BLAS.vendor()` | `lbt` |
| `Sys.CPU_THREADS` | 4 |

lscpu output:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7763 64-Core Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           4890.84
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           1 MiB (2 instances)
L3 cache:                           32 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

| Cpu Property | Value |
|:--|:--|
| Brand | AMD EPYC 7763 64-Core Processor |
| Vendor | :AMD |
| Architecture | :Unknown |
| Model | Family: 0xaf, Model: 0x01, Stepping: 0x01, Type: 0x00 |
| Cores | 16 physical cores, 16 logical cores (on executing CPU) |
| | No Hyperthreading hardware capability detected |
| Clock Frequencies | Not supported by CPU |
| Data Cache | Level 1:3 : (32, 512, 32768) kbytes |
| | 64 byte cache line size |
| Address Size | 48 bits virtual, 48 bits physical |
| SIMD | 256 bit = 32 byte max. SIMD vector size |
| Time Stamp Counter | TSC is accessible via `rdtsc` |
| | TSC runs at constant rate (invariant from clock frequency) |
| Perf. Monitoring | Performance Monitoring Counters (PMC) are not supported |
| Hypervisor | Yes, Microsoft |

@henry2004y henry2004y merged commit f1f6197 into master Oct 28, 2024
7 checks passed
@henry2004y henry2004y deleted the relativistic_momentum branch October 28, 2024 20:37