Fully Automatic Page Turning on Real Scores
- URL: http://arxiv.org/abs/2111.06643v1
- Date: Fri, 12 Nov 2021 10:23:14 GMT
- Title: Fully Automatic Page Turning on Real Scores
- Authors: Florian Henkel, Stephanie Schwaiger, Gerhard Widmer
- Abstract summary: We present a prototype of an automatic page turning system that works directly on real scores, i.e., sheet images.
Our system is based on a multi-modal neural architecture that observes a complete sheet image page as input, listens to an incoming musical performance, and predicts the position in the image.
As a proof of concept we further combine our system with an actual machine that will physically turn the page on command.
- Score: 6.230751621285321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a prototype of an automatic page turning system that works
directly on real scores, i.e., sheet images, without any symbolic
representation. Our system is based on a multi-modal neural network
architecture that observes a complete sheet image page as input, listens to an
incoming musical performance, and predicts the corresponding position in the
image. Using the position estimation of our system, we use a simple heuristic
to trigger a page turning event once a certain location within the sheet image
is reached. As a proof of concept we further combine our system with an actual
machine that will physically turn the page on command.
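The triggering heuristic described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the normalized vertical-position interface, the 0.95 threshold, and the reset logic are all assumptions made for the sake of the example.

```python
# Sketch of a threshold-based page-turn trigger (assumed interface).
# The score follower is presumed to emit a vertical position normalized
# to [0, 1] within the current page; 0.95 is an illustrative threshold,
# not a value reported in the paper.

class PageTurner:
    """Fire the page-turn command once per page, then re-arm."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.turned = False  # debounce flag for the current page

    def update(self, y_position: float) -> bool:
        # Trigger once when the tracked position passes the threshold.
        if not self.turned and y_position >= self.threshold:
            self.turned = True   # avoid repeated triggers on the same page
            return True          # caller sends the physical turn command
        # Re-arm when the tracker jumps back to the top of the next page.
        if y_position < 0.1:
            self.turned = False
        return False
```

A debounce flag of this kind matters in practice because the position estimate keeps arriving after the threshold is crossed, and the physical turning mechanism should receive exactly one command per page.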
Related papers
- Visual Localization in 3D Maps: Comparing Point Cloud, Mesh, and NeRF Representations [8.522160106746478]
We present a global visual localization system capable of localizing a single camera image across various 3D map representations.
Our system generates a database by synthesizing novel views of the scene, creating RGB and depth image pairs.
NeRF synthesized images show superior performance, localizing query images at an average success rate of 72%.
arXiv Detail & Related papers (2024-08-21T19:37:17Z)
- Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP identifies co-visible image sections by obtaining patch-level embeddings using a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion [2.3020018305241337]
PlaceFormer is a transformer-based approach for visual place recognition.
PlaceFormer employs patch tokens from the transformer to create global image descriptors.
It selects patches that correspond to task-relevant areas in an image.
arXiv Detail & Related papers (2024-01-23T20:28:06Z)
- Efficient Gesture Recognition for the Assistance of Visually Impaired People using Multi-Head Neural Networks [5.883916678819684]
This paper proposes an interactive system for mobile devices controlled by hand gestures aimed at helping people with visual impairments.
This system allows the user to interact with the device by making simple static and dynamic hand gestures.
Each gesture triggers a different action in the system, such as object recognition, scene description or image scaling.
arXiv Detail & Related papers (2022-05-14T06:01:47Z)
- Temporal Graph Network Embedding with Causal Anonymous Walks Representations [54.05212871508062]
We propose a novel approach for dynamic network representation learning based on Temporal Graph Network.
We also provide a benchmark pipeline for the evaluation of temporal network embeddings.
We show the applicability and superior performance of our model in the real-world downstream graph machine learning task provided by one of the top European banks.
arXiv Detail & Related papers (2021-08-19T15:39:52Z)
- SeqNet: Learning Descriptors for Sequence-based Hierarchical Place Recognition [31.714928102950594]
We present a novel hybrid system that creates a high-performance initial match hypothesis generator.
Sequence descriptors are generated using a temporal convolutional network dubbed SeqNet.
We then perform selective sequential score aggregation using shortlisted single image learnt descriptors to produce an overall place match hypothesis.
arXiv Detail & Related papers (2021-02-23T10:32:10Z)
- Cross-Descriptor Visual Localization and Mapping [81.16435356103133]
Visual localization and mapping is the key technology underlying the majority of Mixed Reality and robotics systems.
We present three novel scenarios for localization and mapping which require the continuous update of feature representations.
Our data-driven approach is agnostic to the feature descriptor type, has low computational requirements, and scales linearly with the number of description algorithms.
arXiv Detail & Related papers (2020-12-02T18:19:51Z)
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [100.93808824091258]
We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras.
Our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a bird's-eye-view grid.
We show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
arXiv Detail & Related papers (2020-08-13T06:29:01Z)
- Learning to Read and Follow Music in Complete Score Sheet Images [8.680081568962997]
We propose the first system that directly performs score following in full-page, completely unprocessed sheet images.
Based on incoming audio and a given image of the score, our system directly predicts the most likely position within the page that matches the audio.
arXiv Detail & Related papers (2020-07-21T11:53:22Z)
- Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval [55.29233996427243]
Low-shot sketch-based image retrieval is an emerging task in computer vision.
In this paper, we address any-shot, i.e. zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks.
To solve these tasks, we propose a semantically aligned cycle-consistent generative adversarial network (SEM-PCYC).
Our results demonstrate a significant boost in any-shot performance over the state-of-the-art on the extended version of the Sketchy, TU-Berlin and QuickDraw datasets.
arXiv Detail & Related papers (2020-06-20T22:43:53Z)
- Geometrically Mappable Image Features [85.81073893916414]
Vision-based localization of an agent in a map is an important problem in robotics and computer vision.
We propose a method that learns image features targeted for image-retrieval-based localization.
arXiv Detail & Related papers (2020-03-21T15:36:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.