Commit 37d2833 (1 parent: fb140d2)
Showing 9 changed files with 354 additions and 1 deletion.
@@ -0,0 +1 @@
crossscore.active.vision
Binary file not shown.
@@ -0,0 +1,277 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="CrossScore: Towards Multi-View Image Evaluation and Scoring">
<meta name="author" content="Zirui Wang">
<meta name="generator" content="Jekyll v4.1.1">

<title>CrossScore</title>

<!-- Bootstrap core CSS -->
<link
rel="stylesheet"
href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css"
integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z"
crossorigin="anonymous">

<!-- Custom styles for this template -->
<link href="style.css" rel="stylesheet">
</head>

<body>

<main role="main" class="container">

<div class="title">
<h1>CrossScore: Towards Multi-View Image Evaluation and Scoring</h1>
</div>

<div class="col text-center">
<p class="authors">
<a href="https://scholar.google.com/citations?user=zCBKqa8AAAAJ&hl=en">Zirui Wang<sup>1</sup></a>
<a href="https://scholar.google.com/citations?user=IVfbqkgAAAAJ&hl=en">Wenjing Bian<sup>1</sup></a>
<a href="https://scholar.google.co.uk/citations?user=tiLf8UkAAAAJ&hl=en">Omkar Parkhi<sup>2</sup></a>
<a href="https://scholar.google.co.uk/citations?user=Mf6PAuQAAAAJ&hl=en">Yuheng Ren<sup>2</sup></a>
<a href="http://www.robots.ox.ac.uk/~victor/">Victor Adrian Prisacariu<sup>1</sup></a>
</p>
<p class="institution">
<sup>1</sup>University of Oxford <sup>2</sup>Meta Reality Lab
</p>
</div>

<div class="col text-center">
<a class="btn btn-secondary" href="" role="button">arXiv</a>
<a class="btn btn-secondary" href="" role="button">Code (Coming Soon)</a>
</div>

<p>
<b>TLDR</b>:
We introduce a novel image quality assessment method that evaluates an image
by comparing it with multiple views of the same scene, eliminating the
need for pre-aligned ground truth.
Our method enables effective evaluation of rendered images in novel view
synthesis applications where ground-truth references are unavailable.
</p>

<div class="col text-center">
<figure class="figure">
<embed src="assets/04_main_results.png" alt="main results" class="responsive-figure">
<figcaption class="figcaption_left">
CrossScore maps are closely correlated with SSIM score maps
across diverse datasets while not requiring ground-truth images.
Score map colour coding:
<span style="color:brown;">red</span> represents the highest score,
followed by
<span style="color:orange;">orange</span>,
<span style="color:green;">green</span>, and
<span style="color:blue;">blue</span>,
indicating decreasing scores respectively.
</figcaption>
</figure>
</div>

<h2>Abstract</h2>
<p>
We introduce a novel <i>Cross-Reference</i> image quality assessment
method that fills a gap in the image assessment landscape,
complementing the array of established evaluation schemes, ranging from
<i>Full-Reference</i> metrics like SSIM and
<i>No-Reference</i> metrics such as NIQE to
<i>General-Reference</i> metrics including FID and
<i>Multi-Modal-Reference</i> metrics, <i>e.g.</i> CLIPScore.
</p>

<p>
Utilising a neural network with a cross-attention mechanism and a unique data collection
pipeline from novel view synthesis (NVS) optimisation, our method enables accurate image
quality assessment without requiring ground-truth references.
By comparing a query image against multiple views of the same scene, our method addresses
the limitations of existing metrics in NVS and similar tasks where
direct reference images are unavailable.
Experimental results show that our method correlates closely with the
full-reference metric SSIM, while not requiring ground-truth references.
</p>

<h2>Method</h2>
<p>
Our goal is to evaluate the quality of a query image using a set of reference images
that capture the same scene as the query image but from other viewpoints.
From the NVS application perspective, the query image is often a rendered image
with artefacts, and the reference images consist of the real captured images.
</p>
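<p>
A minimal usage sketch of this setting is shown below. It is not the released CrossScore
code: the file paths and the <code>cross_score_model</code> call are hypothetical
placeholders that only illustrate how a rendered query frame and its real captured
reference views would be gathered for evaluation.
</p>
<pre>
# Hypothetical usage sketch (not the released CrossScore API): evaluating a
# rendered NVS frame against real captured views of the same scene, with no
# pixel-aligned ground-truth image required.
import numpy as np
from PIL import Image

def load_rgb(path):
    """Load an image as a float32 array in [0, 1], shape (H, W, 3)."""
    return np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0

# Paths are illustrative placeholders.
query = load_rgb("renders/frame_0001.png")             # rendered image with artefacts
references = [load_rgb(f"captures/view_{i:02d}.png")   # real captures from other viewpoints
              for i in range(5)]

# A cross-reference evaluator would return a dense per-pixel score map:
# score_map = cross_score_model(query, references)     # shape (H, W)
</pre>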

<div class="col text-center">
<figure class="figure">
<embed src="assets/01_method.png" alt="Method Overview" class="responsive-figure">

<figcaption class="figcaption_left">
Method overview:
<b>Left</b>: our NVS-based data engine generates query images, reference images, and SSIM maps,
which drive the self-supervised training of our model.
<b>Right</b>: our model takes a query image and a set of reference images
as input and predicts a score map for the query image.
</figcaption>
</figure>
</div>

<h3>Network</h3>
<p>
We propose a network that takes a query image and a set of reference images
and predicts a dense score map for the query image.
Our network consists of three components:
<ol>
<li>an image encoder that extracts feature maps from the input images;</li>
<li>a cross-reference module that associates the query image with the multi-view reference images; and</li>
<li>a score regression head that regresses a CrossScore for each pixel of the query image.</li>
</ol>
In practice, we adapt
a pretrained DINOv2-small model as the image encoder,
a Transformer decoder for the cross-reference module, and
a shallow MLP for the score regression head.
</p>
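<p>
To make the three components concrete, below is a minimal PyTorch sketch of this design.
It is illustrative rather than the authors' implementation: the simple patch embedding
stands in for the pretrained DINOv2-small encoder, and the token dimension, layer count,
and sigmoid output range are assumptions.
</p>
<pre>
# Minimal sketch of the three-component design (illustrative assumptions only).
import torch
import torch.nn as nn

class CrossScoreSketch(nn.Module):
    def __init__(self, dim=384, num_heads=6, num_layers=2, patch=14):
        super().__init__()
        # (1) Image encoder: a patch embedding standing in for a pretrained
        #     DINOv2-small backbone (token dim 384, patch size 14 assumed).
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # (2) Cross-reference module: query tokens attend to reference tokens.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.cross_ref = nn.TransformerDecoder(layer, num_layers=num_layers)
        # (3) Score regression head: shallow MLP, one score per query token.
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def tokens(self, images):
        # images: (B, 3, H, W) -> patch tokens (B, N, dim)
        feats = self.encoder(images)
        return feats.flatten(2).transpose(1, 2)

    def forward(self, query, refs):
        # query: (B, 3, H, W); refs: (B, V, 3, H, W) multi-view reference images
        q = self.tokens(query)                               # (B, Nq, dim)
        b, v = refs.shape[:2]
        r = self.tokens(refs.flatten(0, 1))                  # (B*V, Nr, dim)
        r = r.reshape(b, v * r.shape[1], r.shape[2])         # (B, V*Nr, dim)
        scores = self.head(self.cross_ref(tgt=q, memory=r))  # (B, Nq, 1)
        # Per-patch scores in [0, 1]; a dense map would be obtained by
        # reshaping to the patch grid and upsampling to image resolution.
        return torch.sigmoid(scores).squeeze(-1)

model = CrossScoreSketch()
patch_scores = model(torch.rand(2, 3, 224, 224), torch.rand(2, 5, 3, 224, 224))
print(patch_scores.shape)  # torch.Size([2, 256])
</pre>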

<h3>Self-supervised Training</h3>
<p>
We leverage existing NVS systems and abundant multi-view datasets to generate
SSIM maps for training.
</p>

<p>
Specifically, we select Neural Radiance Field (NeRF)-style NVS systems as
our data engine.
Given a set of images, a NeRF recovers a neural representation of a scene by
iteratively reconstructing the given image set with photometric losses.
</p>
<p>
By rendering images with the camera parameters of the originally captured
image set at multiple NeRF training checkpoints, we generate a large number of
images containing various types of artefacts at various levels.
From these renderings, we compute SSIM maps between
rendered images and the corresponding real captured images, which serve as
our training objectives.
</p>
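<p>
As a concrete illustration, the sketch below computes such a dense SSIM target with
scikit-image. The checkpoint directories and file names are hypothetical, and greyscale
images are used for brevity, so this approximates the form of the training targets rather
than reproducing the authors' exact pipeline.
</p>
<pre>
# Sketch of an SSIM-map supervision target (hypothetical file layout).
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def ssim_map(rendered_path, captured_path):
    rendered = np.asarray(Image.open(rendered_path).convert("L"), dtype=np.float64) / 255.0
    captured = np.asarray(Image.open(captured_path).convert("L"), dtype=np.float64) / 255.0
    # full=True returns the mean SSIM and the dense per-pixel SSIM map;
    # the map is what supervises the score-regression head during training.
    _, smap = structural_similarity(rendered, captured, data_range=1.0, full=True)
    return smap  # (H, W), values roughly in [-1, 1]

# e.g. one target per NeRF checkpoint for the same camera:
# targets = [ssim_map(f"ckpt_{k}/render_0001.png", "captures/image_0001.png")
#            for k in (1000, 5000, 20000)]
</pre>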
<!-- <p>
Together with a set of
reference images from the same scene, our model predicts a score map
for a query image, supervised by the corresponding SSIM map.
</p> -->

<h2>Additional Results</h2>
<figure>
<video controls autoplay muted loop playsinline class="center_video">
<source src="assets/additional_results.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<figcaption class="figcaption_center">
Comparison of CrossScore with SSIM maps on rendered images.
</figcaption>
</figure>

<h2>Ablation: Enabling and Disabling Reference Images</h2>
<p>
Here we show that our method effectively leverages reference views when
evaluating a query image.
With reference images enabled (ON), the score map predicted
by our method contains more detail than when reference images
are disabled (OFF), in which case the model tends to assign
a high score everywhere.
</p>
<div class="col text-center">
<figure class="figure">
<embed src="assets/02_ablation.png" alt="Ablation" class="responsive-figure">
<figcaption>
Ablation study on the importance of reference images.
</figcaption>
</figure>
</div>

<h2>Attention Weights Visualisation</h2>
<p>
We further illustrate that our model indeed attends to related context
in the reference images, as evidenced by the visualisation of attention maps below.
</p>
<div class="col text-center">
<figure class="figure">
<embed src="assets/03_attn.png" alt="BLEFF thumbnails" class="responsive-figure">
<figcaption class="figcaption_left">
Attention weights visualisation of our model.
<b>Top left</b>: a query image with a region of interest (centre of the image)
highlighted with a <span style="color:magenta;">magenta</span> box.

<b>Right column</b>: three reference images from our cross-reference
set with attention maps overlaid. The attention maps illustrate the attention
paid when predicting image quality at the query region.

<span style="color:red;">Red</span> and
<span style="color:blue;">blue</span> denote high and low
attention weights respectively.
Note that we use 5 reference images in our experiments,
but only 3 are shown due to space constraints.

<b>Bottom</b>: predicted CrossScore map and SSIM map.

<span style="color:red;">Red</span> and
<span style="color:blue;">blue</span> denote high- and low-quality
image regions respectively.
</figcaption>
</figure>
</div>

<h2>Acknowledgement</h2>
<p>
This research is supported by an <a href="https://facebookresearch.github.io/projectaria_tools/docs/intro">ARIA</a>
research gift grant from Meta Reality Lab.
We gratefully thank
<a href="https://elliottwu.com/">Shangzhe Wu</a>,
<a href="https://tengdahan.github.io/">Tengda Han</a>, and
<a href="https://scholar.google.com/citations?user=31eXgMYAAAAJ&hl=en">Zihang Lai</a>
for insightful discussions, and
<a href="https://portraits.keble.net/2022/michael-hobley">Michael Hobley</a>
for proofreading.
</p>

<h2>BibTeX</h2>
<pre>
@article{wang2024crossscore,
  title={CrossScore: Towards Multi-View Image Evaluation and Scoring},
  author={Zirui Wang and Wenjing Bian and Omkar Parkhi and Yuheng Ren and Victor Adrian Prisacariu},
  journal={arXiv preprint arXiv:},
  year={2024}
}
</pre>

</main>
</body>
</html>
@@ -0,0 +1,76 @@
h1{
  font-size: 35px;
}

h2{
  margin-top: 2rem;
  margin-bottom: 1rem;
  font-size: 28px;
}

h3{
  margin-top: 1rem;
  margin-bottom: 1rem;
  font-size: 18px;
  font-style: italic;
}

pre{
  margin-bottom: 0;
}

.title {
  padding-top: 3rem;
  text-align: center;
}

.authors{
  font-size: 18px;
  text-align: center;
  padding-top: 0px;
  padding-bottom: 0px;
  margin-top: 0px;
  margin-bottom: 0px;
}

.institution{
  font-size: 14px;
  text-align: center;
  padding-top: 10px;
  padding-bottom: 10px;
  margin-top: 0px;
  margin-bottom: 0px;
}

.center_video{
  margin: 0 auto;
  display: block;
  width: 80%;
  height: auto;
  padding-bottom: 10px;
}

.figure {
  width: 80%; /* Figure takes 80% of the container width */
  margin: 0;  /* Removes the default figure margin */
}

.responsive-figure {
  width: 100%; /* Makes the embed element take the full width of the figure */
  height: auto; /* Helps maintain the aspect ratio */
}

figcaption {
  font-size: 14px;
  color: #495057;
}
.figcaption_left {
  text-align: left;
}
.figcaption_center {
  text-align: center;
}

.btn-secondary {
  margin-bottom: 20px; /* Adds space below each button */
}