[Tutorial] A summary of page fault issues #659
Stonepia started this conversation in Show and tell
Replies: 1 comment
- For printing, we can also use the API below if we cannot get the cgf: #define DPCPP_K_PRINT(fmt_str, ...)
1. Introduction
The new driver for PVC introduces stricter checks for memory access. It can be enabled using the following flag:
A page fault may result in the following error message:
With the older driver, a kernel that makes an incorrect memory access may be silently dropped, producing no error message. With the newer driver, such incorrect accesses cause a page fault.
This thread discusses common debugging practices, drawn from bugs that have already been fixed.
2. Debug techniques
The first thing to look at is the error log. There are two kinds of errors:
- Accessing a nullptr
In this kind of error, the message indicates that the page fault happened at 0x0. One should pay attention to where a nullptr may have been passed to the kernel.
- Accessing a wrong memory address
The message may look like:
Please pay special attention to this kind of address. Normally, a GPU memory address is a high address, something like 0xff00000098400000. When one sees an address like 0x55xxx, it is very likely that the tensor is on the CPU, i.e., the GPU kernel is trying to access a CPU address.
2.1. Locating the kernel
The first step is to locate the kernel that caused the page fault. One should run the test with the following flags to print more details:
For details about those flags, please see SYCL Env Flags.
Then run the test and redirect the output to a separate file, as the log may be very large.
By looking at the end of the log, one can find something like the following:
In the above log, one should first look at the last piEnqueueKernelLaunch event; its second argument is what we are looking for (0x56067344ec70 in this event). Then one should search for the name of this kernel, which should be a piKernelCreate event. The name of this kernel is _ZTSN2at15AtenIpexTypeXPU4impl45MaxPool3dWithIndicesOutFrameImplKernelFunctorIdLb0EEE. One can use c++filt to get the readable kernel name:
In this case, one could start from MaxPool3dWithIndicesOutFrameImplKernelFunctor.
2.2. Printing the message
Outside the kernel, one can directly print the related messages using std::cout:
Inside the SYCL kernel, one should use sycl::stream and sycl::endl to print:
For more information, please refer to the Doing IO in the Kernel documentation.
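As a minimal, self-contained sketch of the in-kernel printing described above (the buffer sizes and kernel body are illustrative, not taken from the original post):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  q.submit([&](sycl::handler& cgh) {
    // A sycl::stream needs a total buffer size and a per-work-item width.
    sycl::stream out(8192, 256, cgh);
    cgh.parallel_for(sycl::range<1>(4), [=](sycl::id<1> idx) {
      // Print the work-item id from inside the kernel.
      out << "work-item " << idx[0] << sycl::endl;
    });
  });
  q.wait();
  return 0;
}
```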
Please note that printing inside the kernel will alter the kernel behavior. Thus, there may be cases where adding a print statement makes the kernel correct. In such cases, there is no good solution at the moment.
3. Possible Bugs
3.1. Unguarded memory access
This is the most common cause of page faults. In this section, we will show some typical cases we have encountered.
3.1.1. Unsafe pointer access
The kernel sometimes needs to access the data_ptr of a tensor. It will have the following pattern:
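The original snippet is not reproduced here; a minimal sketch of the pattern (the functor name, element type, and kernel body are illustrative; output_ptr and output_ follow the discussion below) might look like:

```cpp
#include <sycl/sycl.hpp>

// Illustrative functor: output_ptr comes from output_.data_ptr() on the host.
struct FillOutputKernelFunctor {
  void operator()(sycl::item<1> item) const {
    // If output_ptr is nullptr (e.g., output_ was never materialized),
    // this write page-faults at address 0x0 under the new driver.
    output_ptr[item.get_linear_id()] = 0.0f;
  }
  float* output_ptr;
};

// Host side (sketch):
//   float* output_ptr = static_cast<float*>(output_.data_ptr());
//   queue.parallel_for(sycl::range<1>(output_.numel()),
//                      FillOutputKernelFunctor{output_ptr});
```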
The above kernel has the argument output_ptr, which points to the underlying storage of the tensor output_. However, if the tensor is not fully initialized, the tensor.data_ptr() call will return a nullptr. Thus, a nullptr will be passed to the kernel, and a page fault will occur when the kernel tries to write to it.
For data_ptr access, we always encourage using the templated data_ptr API: use t.mutable_data_ptr<T>() and t.const_data_ptr<T>(). For more information, please refer to the Proposal: Switch to safer data_ptr API for details.
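For example (a hedged sketch; the function name, float dtype, and host-side loop are illustrative):

```cpp
#include <ATen/ATen.h>

void scale_in_place(at::Tensor& t) {
  // The templated accessors verify the element type and make the
  // read/write intent explicit, unlike a raw void* from data_ptr().
  float* data = t.mutable_data_ptr<float>();
  for (int64_t i = 0; i < t.numel(); ++i) {
    data[i] *= 2.0f;
  }
}
```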
3.1.2. Failure to check tensor legality
Some test cases may fail because the kernel forgets to check the tensor shape, or because a shape is supported by PyTorch but not by the SYCL implementation. In these scenarios, one should check against PyTorch's kernel implementation and add checks like:
TORCH_CHECK( indices.dim() == 1 || indices.dim() == 2, "input has to be a 1D or 2D Tensor, but got Tensor of dimension ", indices.dim());
The issue can be found in IPEX/#4482 (Requires internal access).
3.1.3. Lack of boundary check
Boundary checks are crucial for page fault issues. The former driver fails silently when the kernel is wrong, while the new driver reports a page fault, so incorrect boundaries are very likely to surface as page faults.
3.1.3.1. Incorrect accessing order
The following pattern will cause an error:
In C++, the && operator evaluates the left predicate first; only if it is true does it evaluate the right predicate. If inner_idx is greater than cfg_.problem_batch_, then sorted_indices_[inner_idx] will access an index out of bounds, causing a page fault.
The above code should be fixed as follows:
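Since the original code blocks are not shown here, the following is a hedged sketch of the wrong and fixed orderings (the helper names and exact predicates are illustrative; only the reordering matters, and the actual code in torch-xpu-ops/#595 may differ):

```cpp
#include <cstdint>

struct Config {
  int64_t problem_batch_;
};

// Buggy ordering: sorted_indices_[inner_idx] is evaluated before the bound
// check, so an out-of-range inner_idx already reads out of bounds.
bool keep_buggy(const int64_t* sorted_indices_, int64_t inner_idx, const Config& cfg_) {
  return sorted_indices_[inner_idx] >= 0 && inner_idx < cfg_.problem_batch_;
}

// Fixed ordering: && short-circuits, so the bound check guards the access.
bool keep_fixed(const int64_t* sorted_indices_, int64_t inner_idx, const Config& cfg_) {
  return inner_idx < cfg_.problem_batch_ && sorted_indices_[inner_idx] >= 0;
}
```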
This issue can be found in torch-xpu-ops/#595.
3.1.3.2. Lack of early return / assertion check
We have encountered cases where the kernel does not check the correctness of boundaries for an early return. In the former driver, this kernel would be silently dropped, resulting in no error message. However, in the new driver, this will cause a segmentation fault.
We encountered a page fault in beam_search, and the fix is as follows:
Without the early return, the rest of the kernel might execute and cause the page fault; such incorrect kernels are no longer silently dropped. Thus, it is always worth checking whether a guard is needed at the beginning of the kernel.
The above fix can be found in IPEX/#4552 (requires internal access).
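A hedged sketch of this early-return pattern (the functor and its condition are illustrative, not the actual beam_search fix):

```cpp
#include <sycl/sycl.hpp>

// Illustrative kernel: guard at the top so out-of-range work-items
// never touch memory.
struct CopyKernelFunctor {
  void operator()(sycl::nd_item<1> item) const {
    const size_t idx = item.get_global_linear_id();
    if (idx >= num_elements) {
      return;  // early return instead of an out-of-bounds access
    }
    out_ptr[idx] = in_ptr[idx];
  }
  const float* in_ptr;
  float* out_ptr;
  size_t num_elements;
};
```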
Similarly, we encourage adding checks to provide more informative error messages and to throw errors as early as possible. For example, the embedding_bag kernel lacks a boundary check; the needed check is similar to the CUDA_KERNEL_ASSERT in the EmbeddingBag kernel.
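A hedged sketch of such a check (the helper name and parameters are hypothetical; a plain device-side assert() stands in for the exact macro used by the XPU kernel):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helper: validate an index before it is used to address the
// embedding table, mirroring the CUDA_KERNEL_ASSERT idea in EmbeddingBag.
inline const float* embedding_row(const float* weight_ptr,
                                  const int64_t* indices_ptr,
                                  int64_t i,
                                  int64_t num_embeddings,
                                  int64_t embedding_dim) {
  const int64_t idx = indices_ptr[i];
  assert(idx >= 0 && idx < num_embeddings && "embedding_bag: index out of range");
  return weight_ptr + idx * embedding_dim;
}
```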
3.2. oneDNN Related Bugs
When the kernel is oneDNN related, it is recommended to reproduce it using benchdnn. @ZhiweiYan-96 has a great document explaining this; please refer to dnnl_workshop for details.
These bugs can be caught by setting the ONEDNN_VERBOSE flag:
Then you may witness the following:
In this case, one should reproduce it using benchdnn:
## batch_file.txt
$ cat batch_file.txt
8192x768:768x384
49152x64x9:49152x9x1
96x512x64:96x64x512
$ ./tests/benchdnn/benchdnn --matmul --mode=p --engine=gpu --attr-scratchpad=user --batch=batch_file.txt
3.3. API unaligned
3.3.1. Kernel re-dispatch not considered
We have encountered a page fault when the tensor is a ZeroTensor in the following kernel:
In the above kernel, the tensor is a ZeroTensor, i.e., all of its elements are 0. ZeroTensor is a backend that does not include the actual tensor storage. ZeroTensor has its own dispatch key; when a tensor is a ZeroTensor, the call should be dispatched to the corresponding backend.
The dispatch order should be something like:
Thus, the above should be changed to:
return at::add(self, wrapper, alpha);
However, in this particular situation, we should never generate the ZeroTensor kernels in the first place; the redundant generated kernels need to be removed. Please see torch-xpu-ops#689 for details.
3.3.2. Kernel does not have a same-device check
When the GPU kernel tries to access a CPU address, it will get a page fault.
Take the index_fill_ kernel as an example:
When self is on XPU and index is on CPU, it will fail with the following error:
This problem has a typical symptom. If one prints the data_ptr of the tensors:
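For instance (a hedged sketch; the helper name is hypothetical, and self/index are the tensors from the index_fill_ case above):

```cpp
#include <ATen/ATen.h>
#include <iostream>

void print_data_ptrs(const at::Tensor& self, const at::Tensor& index) {
  // An XPU pointer typically looks like 0xff00000098400000,
  // while a CPU pointer looks like 0x55xxx....
  std::cout << "self  data_ptr: " << self.data_ptr() << std::endl;
  std::cout << "index data_ptr: " << index.data_ptr() << std::endl;
}
```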
It will print the following:
From the above, one can see that the index tensor's address looks like a CPU address, while an XPU tensor's address should look like 0xff00000098400000.
To solve this kind of problem, one could either add an explicit same-device check to the operator, or build the kernel on TensorIterator, which performs this check by default.
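A hedged sketch of an explicit same-device check (the helper name and message are illustrative; TORCH_CHECK is the standard PyTorch check macro, as used above):

```cpp
#include <ATen/ATen.h>

void check_same_device(const at::Tensor& self, const at::Tensor& index) {
  // Fail with a clear message instead of letting the GPU kernel
  // dereference a CPU pointer and page-fault.
  TORCH_CHECK(
      self.device() == index.device(),
      "index_fill_: expected self and index to be on the same device, but got ",
      self.device(), " and ", index.device());
}
```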
3.3.3. Kernel implementation does not include corner cases
The IPEX implementation of the fmha_backward kernel did not consider the corner case where bias does not require grad.
These cases are unlikely to occur in stock PyTorch, as we have the same test scope as stock PyTorch. They are listed here for completeness.
See IPEX/#4428 (requires internal access).