Scalable Video Coding for Humans and Machines
- URL: http://arxiv.org/abs/2208.02512v1
- Date: Thu, 4 Aug 2022 07:45:41 GMT
- Title: Scalable Video Coding for Humans and Machines
- Authors: Hyomin Choi and Ivan V. Bajić
- Abstract summary: We propose a scalable video coding framework that supports machine vision through its base layer bitstream and human vision via its enhancement layer bitstream.
The proposed framework includes components from both conventional and Deep Neural Network (DNN)-based video coding.
- Score: 42.870358996305356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video content is watched not only by humans, but increasingly also by
machines. For example, machine learning models analyze surveillance video for
security and traffic monitoring, search through YouTube videos for
inappropriate content, and so on. In this paper, we propose a scalable video
coding framework that supports machine vision (specifically, object detection)
through its base layer bitstream and human vision via its enhancement layer
bitstream. The proposed framework includes components from both conventional
and Deep Neural Network (DNN)-based video coding. The results show that on
object detection, the proposed framework achieves 13-19% bit savings compared
to state-of-the-art video codecs, while remaining competitive in terms of
MS-SSIM on the human vision task.
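As a rough illustration of the layered design described in the abstract, the following Python sketch shows what such a scalable bitstream interface could look like. All names (ScalableBitstream, decode_base, decode_full) are illustrative assumptions, not the authors' implementation; the point is that a machine consumer reads only the base layer, while a human viewer combines both layers.
```python
from dataclasses import dataclass

@dataclass
class ScalableBitstream:
    base: bytes         # compact features sufficient for object detection
    enhancement: bytes  # residual needed for pixel-level reconstruction

def decode_base(stream: ScalableBitstream) -> bytes:
    # Stand-in for a DNN feature decoder feeding an object-detection head.
    return stream.base

def decode_full(stream: ScalableBitstream) -> bytes:
    # Stand-in for combining both layers into a viewable frame.
    return stream.base + stream.enhancement

stream = ScalableBitstream(base=b"features", enhancement=b"residual")
assert len(decode_base(stream)) < len(decode_full(stream))  # base alone costs fewer bits
```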
Related papers
- Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called the Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
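A minimal sketch of feature-level prompting, assuming an additive prompt on intermediate features (an illustration of the general idea, not STPN's actual mechanism):
```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 64))       # e.g. 8 tokens with 64 channels
prompt = 0.01 * rng.standard_normal((1, 64))  # small learnable prompt (toy init)

prompted = features + prompt                  # broadcast add; backbone weights untouched
print(prompted.shape)                         # (8, 64)
```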
arXiv Detail & Related papers (2024-02-04T17:52:04Z)
- Learned Scalable Video Coding For Humans and Machines [4.14360329494344]
We introduce an end-to-end learnable video codec that supports a machine vision task in its base layer, while its enhancement layer, together with the base layer, supports input reconstruction for human viewing.
Our framework outperforms both state-of-the-art learned and conventional video codecs in its base layer, while maintaining comparable performance on the human vision task in its enhancement layer.
arXiv Detail & Related papers (2023-07-18T05:22:25Z)
- VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
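A toy sketch of the "analyze without decoding to pixels" idea; the functions below are stand-ins for illustration, not the VNVC API:
```python
import numpy as np

def encode(frame: np.ndarray) -> np.ndarray:
    return frame.reshape(-1)[::16].copy()     # toy "compact representation"

def pixel_decode(latent: np.ndarray) -> np.ndarray:
    return np.repeat(latent, 16)              # toy pixel reconstruction (optional path)

def task_head(latent: np.ndarray) -> float:
    return float(latent.mean())               # toy analysis run directly on the latent

frame = np.ones(256, dtype=np.float32)
latent = encode(frame)
score = task_head(latent)                     # analysis without calling pixel_decode
```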
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
- Task Oriented Video Coding: A Survey [0.5076419064097732]
State-of-the-art video coding standards, such as H.265/HEVC and Versatile Video Coding, are still designed with the assumption that the compressed video will be watched by humans.
With the tremendous advance and maturation of deep neural networks in solving computer vision tasks, more and more videos are directly analyzed by deep neural networks without human involvement.
We explore and summarize recent progress on computer vision task oriented video coding and the emerging video coding standard, Video Coding for Machines.
arXiv Detail & Related papers (2022-08-15T16:21:54Z)
- Saliency-Driven Versatile Video Coding for Neural Object Detection [7.367608892486084]
We propose a saliency-driven coding framework for the video coding for machines task using the latest video coding standard Versatile Video Coding (VVC).
To determine the salient regions before encoding, we employ the real-time-capable object detection network You Only Look Once (YOLO) in combination with a novel decision criterion.
We find that, compared to the reference VVC with a constant quality, up to 29% of bitrate can be saved with the same detection accuracy at the decoder side by applying the proposed saliency-driven framework.
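A hedged sketch of saliency-driven rate allocation as the summary describes it: blocks covered by detections get a lower QP (higher quality) than the background. The block grid and QP offsets are illustrative assumptions, not the paper's settings.
```python
import numpy as np

def qp_map(blocks_h, blocks_w, detections, base_qp=37, salient_qp=27):
    """detections: list of (top, left, bottom, right) boxes in block units."""
    qp = np.full((blocks_h, blocks_w), base_qp, dtype=int)
    for top, left, bottom, right in detections:
        qp[top:bottom, left:right] = salient_qp  # spend more bits where objects were found
    return qp

print(qp_map(4, 6, [(1, 1, 3, 4)]))
```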
arXiv Detail & Related papers (2022-03-11T14:27:43Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
- Human-Machine Collaborative Video Coding Through Cuboidal Partitioning [26.70051123157869]
We propose a video coding framework that leverages the commonality between human vision and machine vision applications using cuboids.
Cuboids, estimated rectangular regions over a video frame, are computationally efficient, have a compact representation, and are object-centric.
Herein, cuboidal feature descriptors are extracted from the current frame and then employed for accomplishing a machine vision task in the form of object detection.
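A toy sketch of cuboid-style partitioning (a quadtree stand-in, not the paper's estimation algorithm): split a frame into rectangles until each is near-homogeneous, then keep one compact descriptor per rectangle.
```python
import numpy as np

def cuboids(img, y0, x0, y1, x1, tol=10.0, out=None):
    out = [] if out is None else out
    region = img[y0:y1, x0:x1]
    if region.std() <= tol or min(y1 - y0, x1 - x0) <= 2:
        out.append((y0, x0, y1, x1, float(region.mean())))  # compact region descriptor
    else:
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        for a, b, c, d in [(y0, x0, ym, xm), (y0, xm, ym, x1),
                           (ym, x0, y1, xm), (ym, xm, y1, x1)]:
            cuboids(img, a, b, c, d, tol, out)
    return out

img = np.zeros((16, 16))
img[4:12, 4:12] = 255.0                       # one bright "object" region
print(len(cuboids(img, 0, 0, 16, 16)))
```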
arXiv Detail & Related papers (2021-02-02T04:44:45Z)
- An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machines (VCM) aims to bridge the gap between visual feature compression and classical video coding.
We employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion patterns.
By learning to extract sparse motion patterns via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
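An illustrative-only sketch of the idea just summarized: transmit a sparse motion pattern and let a generator warp a reference frame into the to-be-coded frame. The integer shift below stands in for the learned predictive model.
```python
import numpy as np

def apply_sparse_motion(reference: np.ndarray, dy: int, dx: int) -> np.ndarray:
    # Two integers stand in for a sparse, cheaply coded motion pattern.
    return np.roll(reference, shift=(dy, dx), axis=(0, 1))

reference = np.eye(8)
predicted = apply_sparse_motion(reference, 1, 0)  # "generated" next frame
residual = predicted - reference                  # only motion + residual need coding
```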
arXiv Detail & Related papers (2020-01-09T14:18:18Z)
- Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach [104.02201472370801]
We propose a novel image coding framework that leverages both compressive and generative models.
By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels.
Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection.
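A loose sketch (all names illustrative) of reconstructing an image from compact features plus a few reference pixels; a simple blend stands in for the paper's generative model.
```python
import numpy as np

def reconstruct(features: np.ndarray, reference_pixels: np.ndarray) -> np.ndarray:
    upsampled = np.repeat(features, reference_pixels.size // features.size)
    return 0.5 * upsampled + 0.5 * reference_pixels  # toy fusion of both sources

features = np.array([0.2, 0.8])         # compact feature representation
references = np.linspace(0.0, 1.0, 8)   # a few reference pixels
image = reconstruct(features, references)
print(image.round(2))
```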
arXiv Detail & Related papers (2020-01-09T10:37:17Z)