In this paper, we demonstrate the significance of restoring harmonics of the fundamental frequency (pitch) in the deep neural network (DNN)-based speech enhancement. The parameters of the DNN can be estimated by minimizing the mask loss, but it does not restore the pitch harmonics, especially at higher frequencies. In this paper, we propose to restore the pitch harmonics in the spectral domain by minimizing cepstral loss around the pitch peak. Restoring the cepstral pitch peak, in turn, helps in restoring the pitch harmonics in the enhanced spectrum. The proposed cepstral pitch-peak loss acts as an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The network parameters are estimated using a combination of mask loss and cepstral pitch-peak loss. We show that this combination offers the complementary advantages of enhancing both the voiced and unvoiced regions. The DNN-based methods primarily rely on the network architecture, and hence, the prediction accuracy improves with the increasing complexity of the architecture. The lower complex models are essential for real-time processing systems. In this work, we propose a compact model using a sliding-window attention network (SWAN). The SWAN is trained to regress the spectral magnitude mask (SMM) from the noisy speech signal. Our experimental results demonstrate that the proposed approach achieves comparable performance with the state-of-the-art noncausal and causal speech enhancement methods with much lesser computational complexity. Our three-layered noncausal SWAN achieves 2.99 PESQ on the Valentini database with only 10 9 floating-point operations (FLOPs). © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.