Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance memcpy and remove redundant implementations #22513

Merged
merged 6 commits into from
Jan 18, 2025
Merged

enhance memcpy and remove redundant implementations #22513

merged 6 commits into from
Jan 18, 2025

Conversation

andrewrk
Copy link
Member

perf data point: building hello world with a compiler with different memcpy versions.

stage4/bin/zig build -p fast -Doptimize=ReleaseFast -Dno-lib -Dforce-link-libc -Dtarget=native-native-musl

Benchmark 1 (23 runs): fast-4bace0f6/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           222ms ± 9.73ms     205ms …  239ms          0 ( 0%)        0%
  peak_rss           69.3MB ±  557KB    68.2MB … 70.5MB          0 ( 0%)        0%
  cpu_cycles         1.21G  ± 55.4M     1.15G  … 1.27G           0 ( 0%)        0%
  instructions       2.22G  ±  479K     2.22G  … 2.22G           0 ( 0%)        0%
  cache_references   89.2M  ± 1.11M     87.5M  … 91.2M           0 ( 0%)        0%
  cache_misses       10.6M  ±  357K     9.98M  … 11.1M           0 ( 0%)        0%
  branch_misses      7.73M  ± 94.4K     7.55M  … 7.87M           0 ( 0%)        0%
Benchmark 2 (24 runs): fast/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           217ms ± 7.11ms     204ms …  225ms          0 ( 0%)          -  2.4% ±  2.3%
  peak_rss           69.6MB ±  813KB    68.2MB … 72.0MB          1 ( 4%)          +  0.5% ±  0.6%
  cpu_cycles         1.20G  ± 50.1M     1.13G  … 1.26G           0 ( 0%)          -  0.9% ±  2.6%
  instructions       2.17G  ±  388K     2.17G  … 2.17G           2 ( 8%)        ⚡-  2.3% ±  0.0%
  cache_references   89.6M  ± 1.07M     87.5M  … 91.8M           0 ( 0%)          +  0.5% ±  0.7%
  cache_misses       10.5M  ±  376K     9.85M  … 11.0M           0 ( 0%)          -  1.1% ±  2.0%
  branch_misses      7.69M  ± 80.3K     7.51M  … 7.80M           0 ( 0%)          -  0.6% ±  0.7%

Difference between these is the second one has the different compiler_rt memcpy. Both have musl libc memcpy deleted.

@@ -5,24 +5,141 @@ const builtin = @import("builtin");
comptime {
if (builtin.object_format != .c) {
@export(&memcpy, .{ .name = "memcpy", .linkage = common.linkage, .visibility = common.visibility });
@export(&memcpy, .{ .name = "memmove", .linkage = common.linkage, .visibility = common.visibility });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not an alias that is not portable?

Copy link
Member Author

@andrewrk andrewrk Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I left it so I could show you how it would fail and then add a commit to fix it. (Or it doesn't fail and then I need to reexamine why I thought it wasn't portable)

edit: yeah the failures are due to weak aliases not working. perhaps this should be a semantic analysis error 🤔

andrewrk and others added 4 commits January 17, 2025 12:34
@andrewrk
Copy link
Member Author

@alexrp I'm getting a bunch of arm and thumb failures like this:

zig: /home/andy/src/llvm-project-19/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:970: void {anonymous}::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*): Assertion `(TLI.getTypeAction(*DAG.getContext(), Op.getValueType()) == TargetLowering::TypeLegal || Op.getOpcode() == ISD::TargetConstant || Op.getOpcode() == ISD::Register) && "Unexpected illegal type!"' failed.
#4  0x00007ffff7cf8a56 in __assert_fail ()
   from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
#5  0x000000000ae6e962 in (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*) ()
#6  0x000000000ae6fb30 in llvm::SelectionDAG::Legalize() ()
#7  0x000000000ae91f05 in llvm::SelectionDAGISel::CodeGenAndEmitDAG() ()
#8  0x000000000ae96a6f in llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) ()
#9  0x000000000ae981e4 in llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) ()
#10 0x000000000bc625e4 in (anonymous namespace)::ARMDAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&)
    ()
#11 0x000000000ae83e13 in llvm::SelectionDAGISelLegacy::runOnMachineFunction(llvm::MachineFunction&) ()
#12 0x000000000955158b in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) [clone .part.0] ()
#13 0x0000000008b9d461 in llvm::FPPassManager::runOnFunction(llvm::Function&) ()
#14 0x0000000008b9d881 in llvm::FPPassManager::runOnModule(llvm::Module&) ()
#15 0x0000000008b9e14f in llvm::legacy::PassManagerImpl::run(llvm::Module&) ()
#16 0x0000000008e4ef2e in ZigLLVMTargetMachineEmitToFile ()

Since these look like LLVM bugs and they're in more obscure targets, I'm inclined to disable these tests and make sure there is an issue tracking them. Unless you have an appetite for looking into this?

@andrewrk
Copy link
Member Author

mm, on these targets a simpler memcpy should be used. I suspect the code is just too hard for llvm to lower.

@alexrp
Copy link
Member

alexrp commented Jan 17, 2025

Unless you have an appetite for looking into this?

Do you have a repro on hand? If so, I can try to llvm-reduce it and look into it on the LLVM side tomorrow.

@andrewrk
Copy link
Member Author

https://zig.godbolt.org/z/zhqnMb6cE

triggers an assertion in LegalizeDAG otherwise
@andrewrk
Copy link
Member Author

andrewrk commented Jan 17, 2025

Alright so the fact that we can't use the good impl for arm because of llvm bugs makes deleting the musl arm assembly implementations of memcpy a little sus but I'm still going to do it. The next person to work on memcpy (probably @dweiller) will need to ensure arm perf is measured as well. By deleting musl implementation, the memcpy implementation actually starts to matter, and it also becomes more straightforward to take real world perf data points.

I hope this also helps people understand the implications of #2879.

@andrewrk andrewrk changed the title enhance memcpy enhance memcpy and remove redundant implementations Jan 17, 2025
strong symbols always take precedence over weak symbols.
@andrewrk andrewrk enabled auto-merge January 18, 2025 02:43
@andrewrk andrewrk merged commit f38d7a9 into master Jan 18, 2025
10 checks passed
@andrewrk andrewrk deleted the memcpy branch January 18, 2025 03:52
@alexrp
Copy link
Member

alexrp commented Jan 18, 2025

https://zig.godbolt.org/z/zhqnMb6cE

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "armv7a-unknown-unknown-eabi"

define ptr @memmove2() #0 {
  store <8 x i8> zeroinitializer, ptr null, align 1
  ret ptr null
}

attributes #0 = { "target-features"="+aclass,+d32,+db,+dsp,+fp64,+fpregs,+fpregs64,+v4t,+v5t,+v5te,+v6,+v6k,+v6m,+v6t2,+v7,+v7clrex,+v8m,+neon,+perfmon,+thumb2,+armv7-a,+vfp2,+vfp2sp,+vfp3,+vfp3d16,+vfp3d16sp,+vfp3sp,-32bit,-8msecext,-aapcs-frame-chain,-acquire-release,-aes,-atomics-32,-avoid-movs-shop,-avoid-partial-cpsr,-bf16,-big-endian-instructions,-cde,-cdecp0,-cdecp1,-cdecp2,-cdecp3,-cdecp4,-cdecp5,-cdecp6,-cdecp7,-cheap-predicable-cpsr,-clrbhb,-crc,-crypto,-dfb,-disable-postra-scheduler,-dont-widen-vmovs,-dotprod,-execute-only,-expand-fp-mlx,-fix-cmse-cve-2021-35465,-fix-cortex-a57-aes-1742098,-fp16,-fp16fml,-fp-armv8,-fp-armv8d16,-fp-armv8d16sp,-fp-armv8sp,-fpao,-fpregs16,-fullfp16,-fuse-aes,-fuse-literals,-harden-sls-blr,-harden-sls-nocomdat,-harden-sls-retbr,-v8,-v8.1a,-v8.1m.main,-v8.2a,-v8.3a,-v8.4a,-v8.5a,-v8.6a,-v8.7a,-v8.8a,-v8.9a,-v8m.main,-v9.1a,-v9.2a,-v9.3a,-v9.4a,-v9.5a,-v9a,-hwdiv,-hwdiv-arm,-i8mm,-iwmmxt,-iwmmxt2,-lob,-long-calls,-loop-align,-mclass,-mp,-muxed-units,-mve,-mve1beat,-mve2beat,-mve4beat,-mve.fp,-nacl-trap,-neon-fpmovs,-neonfp,-no-branch-predictor,-no-bti-at-return-twice,-no-movt,-no-neg-immediates,-noarm,-nonpipelined-vfp,-pacbti,-prefer-ishst,-prefer-vmovsr,-prof-unpr,-ras,-rclass,-read-tp-tpidrprw,-read-tp-tpidruro,-read-tp-tpidrurw,-reserve-r9,-ret-addr-stack,-sb,-sha2,-slow-fp-brcc,-slow-load-D-subreg,-slow-odd-reg,-slow-vdup32,-slow-vgetlni32,-slowfpvfmx,-slowfpvmlx,-soft-float,-splat-vfp-neon,-strict-align,-thumb-mode,-trustzone,-use-mipipeliner,-use-misched,-armv4,-armv4t,-armv5t,-armv5te,-armv5tej,-armv6,-armv6j,-armv6k,-armv6kz,-armv6-m,-armv6s-m,-armv6t2,-armv7e-m,-armv7-m,-armv7-r,-armv7ve,-armv8.1-a,-armv8.1-m.main,-armv8.2-a,-armv8.3-a,-armv8.4-a,-armv8.5-a,-armv8.6-a,-armv8.7-a,-armv8.8-a,-armv8.9-a,-armv8-a,-armv8-m.base,-armv8-m.main,-armv8-r,-armv9.1-a,-armv9.2-a,-armv9.3-a,-armv9.4-a,-armv9.5-a,-armv9-a,-vfp4,-vfp4d16,-vfp4d16sp,-vfp4sp,-virtualization,-vldn-align,-vmlx-forwarding,-vmlx-hazards,-wide-stride-vfp,-xscale,-zcz" "use-soft-float"="true" }

Remove "use-soft-float"="true" and it works.

There is some silliness on the LLVM side where you have to massage the VFP/NEON target features when using use-soft-float, even though the latter should completely override target features as a matter of policy where relevant. (I've started the process of fixing that over at llvm/llvm-project#107022 but I've been busy with other things lately.)

If you change +vfp2 to -vfp2, it also works. We already have code in Zig to do this hack:

zig/lib/std/zig/system.zig

Lines 412 to 415 in f38d7a9

// https://github.com/llvm/llvm-project/issues/105978
if (cpu.arch.isArm() and query_abi.floatAbi() == .soft) {
cpu.features.removeFeature(@intFromEnum(Target.arm.Feature.vfp2));
}

(And yes, I do have plans to make this less hacky eventually.)

It looks like #22434 accidentally broke this hack by moving ABI detection after those CPU feature hacks. So it's on me for not noticing that during review. The good news is that those features aren't relevant for ABI detection on Arm and Hexagon, so we can simply move them after ABI detection.

#22526 should fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants