[Tutorial] A summary of page fault issues #659
Stonepia started this conversation in Show and tell
Replies: 1 comment
- For printing, we can also use the API below if we cannot get the cgf: #define DPCPP_K_PRINT(fmt_str, ...)
1. Introduction
The new driver for PVC introduces stricter checks for memory access. It can be enabled using the following flag:
A page fault may result in the following error message:
With the older driver, a kernel that makes an incorrect memory access may be silently dropped, producing no error message. With the newer driver, such incorrect accesses cause a page fault.
This thread discusses common debugging practices, drawn from bugs that have already been fixed.
2. Debug techniques
The first thing to look at is the error log. There are two kinds of errors:
- Accessing a nullptr
In this kind of error, the message indicates that the page fault happened at 0x0. One should pay attention to where a nullptr may have been passed to the kernel.
- Accessing a wrong memory address
The message may look like:
Please pay special attention to this kind of address. Normally, a GPU memory address is a high address, something like 0xff00000098400000. When one sees an address like 0x55xxx, it is very likely that the tensor is on the CPU, i.e., the GPU kernel is trying to access a CPU address.
2.1. Locating the kernel
The first step is to locate the kernel that caused the page fault. One should run the test with the following flags to print more details:
For details about those flags, please see SYCL Env Flags.
Then run the test and redirect the output to a separate file, as the log may be very large.
By looking at the end of the log, one can find something like the following:
In the above log, one should first look at the last piEnqueueKernelLaunch event; its second argument is what we are looking for (0x56067344ec70 in this event). Then one should search for the name of this kernel, which should be a piKernelCreate event. The name of this kernel is _ZTSN2at15AtenIpexTypeXPU4impl45MaxPool3dWithIndicesOutFrameImplKernelFunctorIdLb0EEE. One can use c++filt to get the readable kernel name:
In this case, one could start from MaxPool3dWithIndicesOutFrameImplKernelFunctor.
2.2. Printing the message
Outside the kernel, one can directly print the related messages using std::cout:
Inside the SYCL kernel, one should use sycl::stream and sycl::endl to print:
For more information, please refer to the Doing IO in the Kernel documentation.
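As a minimal, self-contained sketch of the in-kernel printing described above (the buffer sizes and kernel body are illustrative, not taken from the original post):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  q.submit([&](sycl::handler& cgh) {
    // A sycl::stream needs a total buffer size and a per-work-item width.
    sycl::stream out(8192, 256, cgh);
    cgh.parallel_for(sycl::range<1>(4), [=](sycl::id<1> idx) {
      // Print the work-item id from inside the kernel.
      out << "work-item " << idx[0] << sycl::endl;
    });
  });
  q.wait();
  return 0;
}
```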
Please note that printing inside the kernel will alter the kernel behavior. Thus, there may be cases where adding a print statement makes the kernel correct. In such cases, there is no good solution at the moment.
3. Possible Bugs
3.1. Unguarded memory access
This is the most common cause of page faults. In this section, we will show some typical cases we have encountered.
3.1.1. Unsafe pointer access
The kernel sometimes needs to access the data_ptr of a tensor. It will have the following pattern:
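The original snippet is not reproduced here; a minimal sketch of the pattern (the functor name, element type, and kernel body are illustrative; output_ptr and output_ follow the discussion below) might look like:

```cpp
#include <sycl/sycl.hpp>

// Illustrative functor: output_ptr comes from output_.data_ptr() on the host.
struct FillOutputKernelFunctor {
  void operator()(sycl::item<1> item) const {
    // If output_ptr is nullptr (e.g., output_ was never materialized),
    // this write page-faults at address 0x0 under the new driver.
    output_ptr[item.get_linear_id()] = 0.0f;
  }
  float* output_ptr;
};

// Host side (sketch):
//   float* output_ptr = static_cast<float*>(output_.data_ptr());
//   queue.parallel_for(sycl::range<1>(output_.numel()),
//                      FillOutputKernelFunctor{output_ptr});
```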
The above kernel has the argument output_ptr, which points to the underlying storage of the tensor output_. However, if the tensor is not fully initialized, the tensor.data_ptr() call will return a nullptr. Thus, a nullptr will be passed to the kernel, and a page fault will occur when the kernel tries to write to it.
For data_ptr access, we always encourage using the templated data_ptr API: use t.mutable_data_ptr<T>() and t.const_data_ptr<T>(). For more information, please refer to the Proposal: Switch to safer data_ptr API for details.
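For example (a hedged sketch; the function name, float dtype, and host-side loop are illustrative):

```cpp
#include <ATen/ATen.h>

void scale_in_place(at::Tensor& t) {
  // The templated accessors verify the element type and make the
  // read/write intent explicit, unlike a raw void* from data_ptr().
  float* data = t.mutable_data_ptr<float>();
  for (int64_t i = 0; i < t.numel(); ++i) {
    data[i] *= 2.0f;
  }
}
```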
3.1.2. Failure to check tensor legality
Some test cases may fail because the kernel forgets to check the tensor shape, or because a shape is supported by PyTorch but not by the SYCL implementation. In these scenarios, one should check against PyTorch's kernel implementation and add checks like:
TORCH_CHECK( indices.dim() == 1 || indices.dim() == 2, "input has to be a 1D or 2D Tensor, but got Tensor of dimension ", indices.dim());
The issue can be found in IPEX/#4482 (Requires internal access).
3.1.3. Lack of boundary check
Boundary checks are crucial for page fault issues. The former driver fails silently when the kernel is wrong, while the new driver reports a page fault, so incorrect boundaries are very likely to surface as page faults.
3.1.3.1. Incorrect accessing order
The following pattern will cause an error:
In C++, the && operator evaluates the left predicate first; only if it is true does it evaluate the right predicate. If inner_idx is greater than cfg_.problem_batch_, then sorted_indices_[inner_idx] will access an index out of bounds, causing a page fault.
The above code should be fixed as follows:
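Since the original code blocks are not shown here, the following is a hedged sketch of the wrong and fixed orderings (the helper names and exact predicates are illustrative; only the reordering matters, and the actual code in torch-xpu-ops/#595 may differ):

```cpp
#include <cstdint>

struct Config {
  int64_t problem_batch_;
};

// Buggy ordering: sorted_indices_[inner_idx] is evaluated before the bound
// check, so an out-of-range inner_idx already reads out of bounds.
bool keep_buggy(const int64_t* sorted_indices_, int64_t inner_idx, const Config& cfg_) {
  return sorted_indices_[inner_idx] >= 0 && inner_idx < cfg_.problem_batch_;
}

// Fixed ordering: && short-circuits, so the bound check guards the access.
bool keep_fixed(const int64_t* sorted_indices_, int64_t inner_idx, const Config& cfg_) {
  return inner_idx < cfg_.problem_batch_ && sorted_indices_[inner_idx] >= 0;
}
```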
This issue can be found in torch-xpu-ops/#595.
3.1.3.2. Lack of early return / assertion check
We have encountered cases where the kernel does not check the correctness of boundaries for an early return. In the former driver, this kernel would be silently dropped, resulting in no error message. However, in the new driver, this will cause a segmentation fault.
We encountered a page fault in beam_search, and the fix is as follows:
Without the early return, the rest of the kernel might execute and cause the page fault; such incorrect kernels are no longer silently dropped. Thus, it is always worth checking whether a guard is needed at the beginning of the kernel.
The above fix can be found in IPEX/#4552 (requires internal access).
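A hedged sketch of this early-return pattern (the functor and its condition are illustrative, not the actual beam_search fix):

```cpp
#include <sycl/sycl.hpp>

// Illustrative kernel: guard at the top so out-of-range work-items
// never touch memory.
struct CopyKernelFunctor {
  void operator()(sycl::nd_item<1> item) const {
    const size_t idx = item.get_global_linear_id();
    if (idx >= num_elements) {
      return;  // early return instead of an out-of-bounds access
    }
    out_ptr[idx] = in_ptr[idx];
  }
  const float* in_ptr;
  float* out_ptr;
  size_t num_elements;
};
```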
Similarly, we encourage adding checks to provide more informative error messages and to throw errors as early as possible. For example, the embedding_bag kernel lacks a boundary check; the needed check is similar to the CUDA_KERNEL_ASSERT in the EmbeddingBag kernel.
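A hedged sketch of such a check (the helper name and parameters are hypothetical; a plain device-side assert() stands in for the exact macro used by the XPU kernel):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helper: validate an index before it is used to address the
// embedding table, mirroring the CUDA_KERNEL_ASSERT idea in EmbeddingBag.
inline const float* embedding_row(const float* weight_ptr,
                                  const int64_t* indices_ptr,
                                  int64_t i,
                                  int64_t num_embeddings,
                                  int64_t embedding_dim) {
  const int64_t idx = indices_ptr[i];
  assert(idx >= 0 && idx < num_embeddings && "embedding_bag: index out of range");
  return weight_ptr + idx * embedding_dim;
}
```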
3.2. oneDNN Related Bugs
When the kernel is oneDNN related, it is recommended to reproduce it using benchdnn. @ZhiweiYan-96 has a great document explaining this; please refer to dnnl_workshop for details.
These bugs can be caught by setting the ONEDNN_VERBOSE flag:
Then you may witness the following:
In this case, one should reproduce it using benchdnn:
## batch_file.txt
$ cat batch_file.txt
8192x768:768x384
49152x64x9:49152x9x1
96x512x64:96x64x512
$ ./tests/benchdnn/benchdnn --matmul --mode=p --engine=gpu --attr-scratchpad=user --batch=batch_file.txt
3.3. API unaligned
3.3.1. Kernel re-dispatch not considered
We have encountered a page fault when the tensor is a ZeroTensor in the following kernel:
In the above kernel, the tensor is a ZeroTensor, i.e., all of its elements are 0. ZeroTensor is a backend that does not include the actual tensor storage. ZeroTensor has its own dispatch key; when a tensor is a ZeroTensor, the call should be dispatched to the corresponding backend.
The dispatch order should be something like:
Thus, the above should be changed to:
return at::add(self, wrapper, alpha);
However, in this particular situation, we should never generate the ZeroTensor kernels in the first place; the redundant generated kernels need to be removed. Please see torch-xpu-ops#689 for details.
3.3.2. Kernel does not have a same-device check
When the GPU kernel tries to access a CPU address, it will get a page fault.
Take the index_fill_ kernel as an example:
When self is on XPU and index is on CPU, it will fail with the following error:
This problem has a typical symptom. If one prints the data_ptr of the tensors:
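For instance (a hedged sketch; the helper name is hypothetical, and self/index are the tensors from the index_fill_ case above):

```cpp
#include <ATen/ATen.h>
#include <iostream>

void print_data_ptrs(const at::Tensor& self, const at::Tensor& index) {
  // An XPU pointer typically looks like 0xff00000098400000,
  // while a CPU pointer looks like 0x55xxx....
  std::cout << "self  data_ptr: " << self.data_ptr() << std::endl;
  std::cout << "index data_ptr: " << index.data_ptr() << std::endl;
}
```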
It will print the following:
From the above, one can see that the index tensor's address looks like a CPU address, while an XPU tensor's address should look like 0xff00000098400000.
To solve this kind of problem, one could either add an explicit same-device check to the operator, or build the kernel on TensorIterator, which performs this check by default.
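A hedged sketch of an explicit same-device check (the helper name and message are illustrative; TORCH_CHECK is the standard PyTorch check macro, as used above):

```cpp
#include <ATen/ATen.h>

void check_same_device(const at::Tensor& self, const at::Tensor& index) {
  // Fail with a clear message instead of letting the GPU kernel
  // dereference a CPU pointer and page-fault.
  TORCH_CHECK(
      self.device() == index.device(),
      "index_fill_: expected self and index to be on the same device, but got ",
      self.device(), " and ", index.device());
}
```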
3.3.3. Kernel implementation does not include corner cases
The IPEX implementation of the fmha_backward kernel did not consider the corner case where bias does not require grad.
These cases are unlikely to occur in stock PyTorch, as we have the same test scope as stock PyTorch. They are listed here for completeness.
See IPEX/#4428 (requires internal access).