Looking at words and points with attention: a benchmark for
text-to-shape coherence
- URL: http://arxiv.org/abs/2309.07917v1
- Date: Thu, 14 Sep 2023 17:59:48 GMT
- Authors: Andrea Amaduzzi, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
- Abstract summary: The evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark.
We employ large language models to automatically refine descriptions associated with shapes.
To validate our approach, we conduct a user study and quantitatively compare our metric with existing ones.
The refined dataset, the new metric and a set of text-shape pairs validated by the user study comprise a novel, fine-grained benchmark.
- Score: 17.340484439401894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While text-conditional 3D object generation and manipulation have seen rapid
progress, the evaluation of coherence between generated 3D shapes and input
textual descriptions lacks a clear benchmark. The reason is twofold: a) the low
quality of the textual descriptions in the only publicly available dataset of
text-shape pairs; b) the limited effectiveness of the metrics used to
quantitatively assess such coherence. In this paper, we propose a comprehensive
solution that addresses both weaknesses. Firstly, we employ large language
models to automatically refine textual descriptions associated with shapes.
Secondly, we propose a quantitative metric to assess text-to-shape coherence,
through cross-attention mechanisms. To validate our approach, we conduct a user
study and compare quantitatively our metric with existing ones. The refined
dataset, the new metric and a set of text-shape pairs validated by the user
study comprise a novel, fine-grained benchmark that we publicly release to
foster research on text-to-shape coherence of text-conditioned 3D generative
models. Benchmark available at
https://cvlab-unibo.github.io/CrossCoherence-Web/.
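The abstract does not spell out how the cross-attention metric is computed. As a rough illustration only, and not the paper's architecture, the sketch below shows one generic way cross-attention can relate words to points: word features act as queries over point features, and the score is the mean cosine similarity between each word and the shape summary it attends to. The function names, the feature dimension, and the random features are all assumptions made for this example; a real metric would use learned text and point-cloud encoders.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (num_words, num_points)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights


def coherence_score(word_feats, point_feats):
    """Hypothetical text-to-shape score (an assumption, not the paper's metric).

    Each word gathers a summary of the shape via cross-attention; the score is
    the mean cosine similarity between words and their attended summaries.
    """
    attended, weights = cross_attention(word_feats, point_feats, point_feats)
    num = (word_feats * attended).sum(axis=-1)
    den = (np.linalg.norm(word_feats, axis=-1)
           * np.linalg.norm(attended, axis=-1) + 1e-8)
    return float((num / den).mean()), weights


# Random stand-ins for encoder outputs: 6 words and 2048 points, 32-dim features.
rng = np.random.default_rng(0)
word_feats = rng.normal(size=(6, 32))
point_feats = rng.normal(size=(2048, 32))

score, weights = coherence_score(word_feats, point_feats)
print(f"coherence score: {score:.3f}")
```

The attention weights have one row per word and one column per point, so they can also be inspected to see which regions of the shape each word focuses on.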
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- LRANet: Towards Accurate and Efficient Scene Text Detection with
Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
- X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic
Textual Guidance [70.08635216710967]
X-Mesh is a text-driven 3D stylization framework that incorporates a novel Text-guided Dynamic Attention Module.
We introduce a new standard text-mesh benchmark, MIT-30, and two automated metrics, which will enable future research to achieve fair and objective comparisons.
arXiv Detail & Related papers (2023-03-28T06:45:31Z)
- Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Parts2Words: Learning Joint Embedding of Point Clouds and Texts by
Bidirectional Matching between Parts and Words [32.47815081044594]
We propose to learn a joint embedding of point clouds and texts by bidirectional matching between parts from shapes and words from texts.
Specifically, we first segment the point clouds into parts, and then leverage an optimal transport method to match parts and words in an optimized feature space.
Experiments demonstrate that our method achieves a significant improvement in accuracy over the SOTAs on multi-modal retrieval tasks.
arXiv Detail & Related papers (2021-07-05T08:55:34Z)
- Extending Text Informativeness Measures to Passage Interestingness
Evaluation (Language Model vs. Word Embedding) [1.2998637003026272]
This paper defines the concept of Interestingness as a generalization of Informativeness.
We then study the ability of state-of-the-art Informativeness measures to cope with this generalization.
We prove that the CLEF-INEX Tweet Contextualization 2012 Logarithm Similarity measure provides the best results.
arXiv Detail & Related papers (2020-04-14T18:22:48Z)