STRIVE: Scene Text Replacement In Videos
- URL: http://arxiv.org/abs/2109.02762v1
- Date: Mon, 6 Sep 2021 22:21:40 GMT
- Title: STRIVE: Scene Text Replacement In Videos
- Authors: Vijay Kumar B G, Jeyasri Subramanian, Varnith Chordia, Eugene Bart,
Shaobo Fang, Kelly Guan and Raja Bala
- Abstract summary: We propose replacing scene text in videos using deep style transfer and learned photometric transformations.
Results on synthetic and challenging real videos show realistic text transfer, competitive quantitative and qualitative performance, and superior inference speed relative to alternatives.
- Score: 5.187595026303028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose replacing scene text in videos using deep style transfer
and learned photometric transformations. Building on recent progress on still
image text replacement, we present extensions that alter text while preserving
the appearance and motion characteristics of the original video. Compared to
the problem of still image text replacement, our method addresses additional
challenges introduced by video, namely effects induced by changing lighting,
motion blur, diverse variations in camera-object pose over time, and
preservation of temporal consistency. We parse the problem into three steps.
First, the text in all frames is normalized to a frontal pose using a
spatio-temporal transformer network. Second, the text is replaced in a single
reference frame using a state-of-the-art still-image text replacement method.
Finally, the new text is transferred from the reference to the remaining frames
using a novel learned image transformation network that captures lighting and
blur effects in a temporally consistent manner. Results on synthetic and
challenging real videos show realistic text transfer, competitive quantitative
and qualitative performance, and superior inference speed relative to
alternatives. We introduce new synthetic and real-world datasets with paired
text objects. To the best of our knowledge, this is the first attempt at deep
video text replacement.
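The abstract's three-step pipeline maps directly onto a simple control flow. Below is a minimal Python sketch of that flow; the function names and interfaces (frontalize, replace_text_in_reference, propagate) are hypothetical placeholders rather than the authors' released code, and toy numpy operations stand in for the learned networks:

```python
# A minimal sketch of the three-step STRIVE pipeline described in the
# abstract. All module names and interfaces here are hypothetical
# placeholders, not the authors' actual code.
import numpy as np

def frontalize(frames):
    """Step 1: normalize text regions in all frames to a frontal pose.
    Stands in for the paper's spatio-temporal transformer network; here
    we return identity 'poses' alongside unmodified frames."""
    poses = [np.eye(3) for _ in frames]      # per-frame homographies
    frontal = [f.copy() for f in frames]     # placeholder warp
    return frontal, poses

def replace_text_in_reference(frontal_frames, new_text, ref_idx=0):
    """Step 2: swap the text in one reference frame using a still-image
    text replacement method. A real system would run a generative
    still-image editor here to render `new_text`."""
    return frontal_frames[ref_idx].copy()

def propagate(reference, frontal_frames, poses):
    """Step 3: transfer the edited reference to the remaining frames,
    re-applying each frame's lighting/blur and original pose. The paper
    learns this transformation; a brightness ratio fakes it here."""
    out = []
    for frame, pose in zip(frontal_frames, poses):
        photometric = frame.mean() / max(reference.mean(), 1e-6)
        edited = np.clip(reference * photometric, 0, 255)
        out.append(edited)   # a real system would un-warp with `pose`
    return out

# Toy usage: three grayscale "frames" with varying brightness.
frames = [np.full((64, 128), b, dtype=np.float32) for b in (100, 120, 140)]
frontal, poses = frontalize(frames)
ref = replace_text_in_reference(frontal, "NEW TEXT")
result = propagate(ref, frontal, poses)
print([f"{f.mean():.0f}" for f in result])  # brightness tracks each frame
```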
Related papers
- Text-Animator: Controllable Visual Text Video Generation [149.940821790235]
We propose an innovative approach termed Text-Animator for visual text video generation.
Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos.
We also develop a camera control module and a text refinement module to improve the stability of generated visual text.
arXiv Detail & Related papers (2024-06-25T17:59:41Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- PSGText: Stroke-Guided Scene Text Editing with PSP Module [4.151658495779136]
Scene Text Editing aims to substitute text in an image with new desired text while preserving the background and styles of the original text.
This paper introduces a three-stage framework for transferring texts across text images.
arXiv Detail & Related papers (2023-10-20T09:15:26Z)
- FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance.
arXiv Detail & Related papers (2023-10-09T17:59:53Z)
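FLATTEN's flow-guided attention idea can be illustrated with a small mask-construction sketch. This is a hypothetical toy that assumes patch-level optical flow is given; it only shows how patches sharing a flow trajectory could be restricted to attend to one another, not the paper's actual U-Net integration:

```python
# A minimal, hypothetical sketch of flow-guided attention masking in
# the spirit of FLATTEN: patches lying on the same optical-flow path
# across frames may attend to each other; all other cross-frame
# attention is masked out. Not the authors' implementation.
import numpy as np

def flow_path_ids(flows, grid_hw):
    """Assign each patch in each frame a trajectory id by chasing
    integer-rounded flow vectors forward from frame 0. Collisions are
    resolved by last-write-wins, which is fine for this toy."""
    h, w = grid_hw
    ids = [np.arange(h * w).reshape(h, w)]   # frame 0: each patch its own id
    for flow in flows:                       # flow: (h, w, 2), in patch units
        prev = ids[-1]
        cur = -np.ones((h, w), dtype=int)    # -1 = no trajectory lands here
        for y in range(h):
            for x in range(w):
                ny = int(round(y + flow[y, x, 1]))
                nx = int(round(x + flow[y, x, 0]))
                if 0 <= ny < h and 0 <= nx < w:
                    cur[ny, nx] = prev[y, x] # carry the id along the path
        ids.append(cur)
    return np.stack(ids).reshape(len(ids), -1)   # (frames, h*w)

def flow_attention_mask(ids):
    """True where token i may attend to token j: both tokens carry the
    same valid flow-trajectory id, across all frames."""
    flat = ids.reshape(-1)                   # (frames * h*w,)
    return (flat[:, None] == flat[None, :]) & (flat[:, None] >= 0)

# Toy usage: 2x2 patch grid, 3 frames, everything shifting right by 1.
h, w, frames = 2, 2, 3
flows = [np.tile([1.0, 0.0], (h, w, 1)) for _ in range(frames - 1)]
mask = flow_attention_mask(flow_path_ids(flows, (h, w)))
print(mask.shape)  # (12, 12): use as an attention bias of -inf where False
```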
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods that edit the whole image must simultaneously learn different translation rules for the background and text regions.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
- SwapText: Image Based Texts Transfer in Scenes [13.475726959175057]
We present SwapText, a framework to transfer texts across scene images.
A novel text swapping network is proposed to replace the text labels in the foreground image only.
The generated foreground and background images are then combined by a fusion network to produce the final word image.
arXiv Detail & Related papers (2020-03-18T11:02:17Z)
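As a closing illustration, the foreground/background fusion step described for SwapText can be mimicked with a simple alpha composite. This is a hypothetical stand-in for the learned fusion network, using only numpy:

```python
# A minimal, hypothetical sketch of the fusion step described for
# SwapText: a rendered foreground (new text) and a text-erased
# background are combined into the final word image. SwapText learns
# this fusion; a soft alpha composite stands in for it here.
import numpy as np

def fuse(foreground, alpha, background):
    """Composite the generated text foreground over the inpainted
    background using a soft alpha matte (float arrays in [0, 1])."""
    return alpha * foreground + (1.0 - alpha) * background

# Toy usage: 8x16 grayscale patches.
bg = np.full((8, 16), 0.8)     # text-erased background patch
fg = np.zeros((8, 16))         # dark rendered glyphs
alpha = np.zeros((8, 16))
alpha[2:6, 3:13] = 1.0         # glyph mask
word_image = fuse(fg, alpha, bg)
print(word_image.mean().round(3))
```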