Non-Local Kernel Regression for Image and Video RestorationHaichao Zhang1,2, Jianchao Yang2, Yanning Zhang1, and Thomas S. Huang21School of Computer Science, Northwestern Polytechnical University, Xi’an, China 2Beckman Institute, University of Illinois at Urbana-Champaign, USA {hczhang, jyang29, huang}@ifp.uiuc.edu,

[email protected] This paper presents a non-local kernel regression (NL-KR) method for image and video restoration tasks, which exploits both the non-local self-similarity and local structural regularity in natural im- ages. The non-local self-similarity is based on the observation that image patches tend to repeat themselves in natural images and videos; and the local structural regularity reveals that image patches have regular struc- tures where accurate estimation of pixel values via regression is possible. Explicitly unifying both properties, the proposed non-local kernel re- gression framework is robust and applicable to various image and videorestoration tasks. In this work, we are specifically interested in applying the NL-KR model to image and video super-resolution (SR) reconstruc- tion. Extensive experimental results on both single images and realistic video sequences demonstrate the superiority of the proposed framework for SR tasks over previous works both qualitatively and quantitatively.1IntroductionOne of the recent trends in image processing is to pursue the low-dimensional models for image representation and manipulation. Examples include the lo- cal structure based methods [1], sparse representation methods [2][3], manifold methods [4], etc. The success of such models is guaranteed by the low Degree of Freedom (DOF) of the local structures in natural images, represented as mean- ingful local structural regularity as well as self-similarity of local patterns. Many conventional image processing algorithms are based on the assumption of local structural regularity, meaning that there are meaningful structures in the spatial space of natural images. Examples are structure tensor based meth-ods [1][5][6]and bilateral filtering [7]. These methods utilize the local structural patterns to regularize the image processing procedure and are based on the as- sumption that images are locally smooth except at edges. Another type of methods exploiting the self-similarity in natural images are recently emerging. The self-similarity property means that higher level patterns (e.g., texton and pixon) will repeat themselves in the image (possibly in dif- ferent scales). This also indicates the DOF in one image is less than the DOFoffered by the pixel-level representation. A representative work is the popular Non-Local Means (NL-Means) [8], which takes advantage of the redundancy of2H.Zhang, J.Yang, Y.Zhang and T.Huangsimilar patches existing in the target image for denoising tasks. Later, this idea is generalized to handle multi-frame super-resolution tasks in [9]. Recently, this self-similarity property is thoroughly explored by Glasner et. al in [10] for ad- dressing single image super-resolution problems. Gabriel Peyr´ e et. al proposed a non-local regularization method for general inverse problems [11]. We propose in this paper a Non-Local Kernel Regression (NL-KR) method for image and video restoration (see Fig. 1 for a graphical illustration of the proposed model). We take advantage of both local structural regularity and non-local similarity in a unified framework for more reliable and robust estimation. The non-local similarity and local structural regularity are intimately related, and are also complimentary in the sense that non-local similar pattern fusion can be regularized by the structural regularity while the redundancy from similar patterns enables more accurate estimation for structural regression.The rest of the paper is organized as follows. We first review and summarize related works in Section 2, then we propose our NL-KR model and discuss its relations to other algorithms in Section 3. The practical algorithm for SR based on NL-KR is described in Section 4. Experiments are carried out in Section 5 on both synthetic and real image sequences, and extensive comparisons are made with both classical as well as state-of-the-art methods. Section 6 provides some discussions and concludes our paper.Fig.1. Non-Local Kernel Regression. (1) Similar patch searching: different colors indicate the similarity (red the highest, green the medium and blue the least); (2) Structural kernel estimation and reweighting: estimate a regression kernel adapted to the structure at each position where the similar patches reside and re-weight them according to similarity; (3) Non-local kernel regression: estimate the value for the query point with both local structural and non-local similar information in raster-scan order.2Related WorksIn this work, we are interested in image and video restorations where we desire to estimate the pixel value of a given location in the image plane (e.g., image super-resolution, inpainting and denoising). This section presents a brieftechnical review of local structural regression or filtering method as well as the non-local similarity-based approach.Non-Local Kernel Regression for Image and Video Restoration32.1Local Structural RegressionTypical image filtering methods usually perform in a local manner, i.e., thevalue of the estimated image at a query location is influenced only by the pixels within a small neighborhood of that position. They usually take the form of:ˆ z(xi) = argminz?j∈N(xi)(yj− z)2Kxi(xj− xi)(1)where N(xi) denotes the neighbors of xi, and Kxi(xj− xi) is the spatial ker- nel at location xithat assigns larger weights to nearby similar pixels while smaller weights to farther non-similar pixels. Since the local image structure is not isotropic, local structure aware kernels are developed, with representative examples as Orientated Gaussian kernel [1] and Bilateral kernel [7]. To approx- imate the local structure better, higher order estimation can be used:ˆ a = argmina?RxiY − Φa?2WKxi(2)Here Φ is the polynomial bases given in Eq. 3 developed from Taylor expan- sion3with a the corresponding regression coefficients, and tril(·) extracts the lower triangular part of a matrix and stack it to a column vector. WKxi= diag[Kxi(x1−xi),Kxi(x2−xi),··· ,Kxi(xm−xi)] (m = |N(xi)|) is the weightmatrix defined by the kernel. Rxitakes a patch centered at xifrom Y and rep- resents it as a vector.Φ =⎡⎢⎢⎢⎣1 (x1− xi)Ttril{(x1− xi)(x1− xi)T}T··· 1 (x2− xi)Ttril{(x2− xi)(x2− xi)T}T··· 1 (xm− xi)Ttril{(xm− xi)(xm− xi)T}T···⎤⎥⎥⎥⎦(3)Therefore, the first element of the regression coefficient is the desired pixel value estimation at xi.ˆ z(xi) = eT1?ΦTWKxiΦ?−1ΦTWKxiRxiY(4)where e1is a vector with the first element equal to one, and the rest zero.2.2Non-Local Similarity-based EstimationLocal image structures tend to repeat themselves across the image and also the image sequence in videos. This property has been explored in many appli- cations such as texture synthesis [12], image inpainting [13], denoising [8] and super-resolution [9] [14]. This self-similarity property provides the redundancy3The regression bases do not have to be polynomial, and other choices are open. For more details about deriving the polynomial bases, one can refer to [6], which gives a nice tutorial on kernel regression.4H.Zhang, J.Yang, Y.Zhang and T.Huangthat is sometimes critical for many ill-posed image processing problems, as sim- ilar structures can be regarded as multiple observations from the same underly- ing ground truth. For instance, the NL-Means algorithm recently introduced by Buades et al. in [8] for image denoising has become very popular, due to its ef- fectiveness despite of its simplicity. The algorithm breaks the locality constraintsof previous conventional filtering methods, making use of similar patterns foundin different locations of the image to denoise the image. Specifically, NL-Means algorithm is a weighted average:z(xi) =? j∈P(xi)wijyj? j∈P(xi)wij(5)where P(xi) denote the index set for similar pixel observations for xi(includes xiitself). The weight wijreflects the similarity between the observations at xi and xj[8]. Eq. 5, can be reformulated into an optimization problem:ˆ z(xi) = argminz?j∈P(xi)[yj− z(xi)]2wij= argminz?y − 1z(xi)?2Wxi(6)where y denotes the vector consisting of the pixels at the locations in the similar set P(xi), 1 denotes a vector of all ones, and Wxi= diag[wi1,wi2,.,wim] (m = |P(xi)|). Compared with Eq. 2, the NL-Means estimation Eq. 6 can be regarded as a zero-order estimation, and the weight matrix is constructed from the similarity measures instead of the spatial kernel as before.3Non-Local Kernel Regression Model3.1Mathematical FormulationWe derive the Non-Local Kernel Regression (NL-KR) algorithm in this sec- tion. The approach makes full use of both cues from local structural regularity and non-local similarity for image and video restoration. We argue that the pro- posed approach is more reliable and robust for ill-posed inverse problems: local structural regression regularize the noisy candidates found by non-local similar- ity search; and non-local similarity provides the redundancy preventing possibleoverfitting of the local structural regression. Instead of using a point prediction model in non-local methods, we use the more reliable local structure-based pre- diction. On the other hand, rather than predicting the value with only one local patch, we can try to make use of all the non-local similar patches in natural images. Mathematically, the proposed high-order Non-Local Kernel Regression model is formulated as:Non-Local Kernel Regression for Image and Video Restoration5ˆ a = argmina1 2local???? wii?RxiY − Φa?2WKxi+1 2non−local?????j∈P(xi)\{i}wij?RxjY − Φa?2WKxj= argmina1 2?j∈P(xi)wij?RxjY − Φa?2WKxj= argmina1 2?j∈P(xi)?RxjY − Φa?2˜Wxj(7)where WKxjis the weight matrix constructed from kernel Kxj, and P(xi) again is the similar index set for xi. wijis calculated between the location xiof interest and any position xj(j ∈ P(xi)) by measuring the similarity of their neighborhoods weighted by a Gaussian kernel:wij= exp? −?RxiY − RxjY ?2WG 2σ2? (8)where WGis the weight matrix constructed from a Gaussian kernel, which puts larger weights on the centering pixels of the patch. The proposed NL-KR regres- sion model consists of two parts:1. Local regression term: the traditional local regression or filtering term, with wiiset to be 1. This term also contributes as a fidelity loss, as the estimation should be close to the observation. 2. Non-local regression term: instead of zero-order point estimation as in Non-Local means, higher-order kernel regression is also used to make full use the structural redundancy in the similar patches.The effects of these two parts will be more clear with experimental results inSection 5. To get the regression coefficients, differentiate the right hand side of Eq. 7 with respect to a and set it to zero, we haveˆ a =⎡⎣ΦT⎛⎝?j∈P(xi)wijWKxj⎞⎠Φ⎤⎦−1ΦT?j∈P(xi)wijWKxjRxjY(9)Then ˆ z(xi) = eT1ˆ a. Examination on Eq. 9, we have the following two comments: – The structural kernel is estimated from contaminated observations, and thus is not robust. Compared with Eq. 4, with non-local redundancy, our estima- tion is more stable because of the weighted average of kernel weight matrices inside the inverse, and the weighted average of the structural pixel values. – Compared with Eq. 6, our model can regularize the estimation from the non- local patches by structural higher-order regression, and thus is more robust to outliers.Therefore, the proposed model make full use of both important cues from local structure and non-local similarity, leading to more reliable and robust estimation,which will be verified by experimental results in Sec. 5.6H.Zhang, J.Yang, Y.Zhang and T.Huang3.2Structural Kernel EstimationIt is desirable to use a structure adaptive kernel in estimation. Given the observation Y and a query location xi, we can construct a structure adaptive kernel as:Kxi(x − xi) =1?det(T)exp? −12(x − xi)TT−1(x − xi)? (10)where T is the diffusion tensor at xicontrolling the spatial structure of thekernel. Given two unit vectors u and v defined by the gradient and tangent direction respectively, we can construct T = fuuT+ gvvTand adjust f and g according to the underlying structure, so that the induced kernel is isotropic (f ≈ g) at almost constant regions and aligned along the image contour (g > f) otherwise. One possible choice4is f(α,β) =β+γ α+γand g(α,β) =α+γ β+γ(γ ≥ 0),where α and β are the eigen values of the structure tensor [1], reflecting the strength of the gradient along each eigenvector directions. α, β, u and v can be calculated from the following relation using singular value decomposition (SVD):∇Yxi∇YT xi= αuuT+ βvvT(11)where ∇Yxiis a 2 × 1 vector, denoting the gradient of Y at xi.3.3Relations to Other WorksTons of works have emerged recently based on non-local redundancy and local regressions for image and video processing. It is worthwhile to talk about the relations of the proposed model to those previously proposed algorithms. The non-local models in [4], [8] and [9] use the redundancy from non-local self- similarity, but do not include the spatial structure explicitly as a regularization. The high order generalization of non-local means in [15] uses the computation ofnon-local similarity to find the local kernel for regression, which actually violates the philosophy of the non-local model. Local structural regression [1][6][7] ex- plicitly employ the spatial kernel for regularization, but neglect the redundancy of similar local patterns useful for robust estimation. The 3D kernel regression method [16] exploits the local spatial-temporal structure by extending their 2D spatial kernel regression, also discards the non-local self-similarity. Sparse repre- sentation for denoising [2] and super-resolution [3] do local regression using bases learned from a training database. They perform estimation on each individual local patch and discard the patch redundancy. The sparse representation model is later generalized in [17] for image denoising by doing simultaneous sparse cod-ing over similar patches found in different locations of the image. However, the non-local redundancy is used in a hard assignment clustering way instead of a soft way. [10] fully explores the self-similarity property for single image super- resolution, but no spatial structural regularization is applied. To summarize, ourmodel is the first work to explicitly unify the self-similarity and local structural regularization into a single model, allowing more robust estimation.4One can refer to [1] for other choices of diffusion tensor.Non-Local Kernel Regression for Image and Video Restoration7Fig.2. Operator Illustration. RlxjY , the patch of the LR image Y at xj, is formed by downsampling HR patch RhxjZ by factor r = 2 as DpRh xjZ, keeping the c