Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech
- URL: http://arxiv.org/abs/2501.07726v1
- Date: Mon, 13 Jan 2025 22:24:52 GMT
- Title: Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech
- Authors: Bruno Ferenc Šegedin, Gašper Beguš
- Abstract summary: This study presents the first exploration of how the fully connected layer of CNNs for speech synthesis encodes linguistically relevant information.
We show that lexically specific latent codes in generative CNNs (ciwGAN) have shared lexically invariant sublexical representations in the FC-layer weights.
- Abstract: Interpretability work on the convolutional layers of CNNs has primarily focused on computer vision, but some studies also explore correspondences between the latent space and the output in the audio domain. However, it has not been thoroughly examined how acoustic and linguistic information is represented in the fully connected (FC) layer that bridges the latent space and convolutional layers. The current study presents the first exploration of how the FC layer of CNNs for speech synthesis encodes linguistically relevant information. We propose two techniques for exploring the fully connected layer. In Experiment 1, we use weight matrices as inputs into convolutional layers. In Experiment 2, we manipulate the FC layer to explore how symbolic-like representations are encoded in CNNs. We leverage the fact that the FC layer outputs a feature map and that variable-specific weight matrices are temporally structured to (1) demonstrate how the distribution of learned weights varies between latent variables in systematic ways and (2) demonstrate how manipulating the FC layer while holding subsequent model parameters constant affects the output. We ultimately present an FC manipulation that can output a single segment. Using this technique, we show that lexically specific latent codes in generative CNNs (ciwGAN) have shared, lexically invariant sublexical representations in the FC-layer weights, showing that ciwGAN encodes lexical information in a linguistically principled manner.
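The two FC-layer probes described in the abstract can be sketched in a toy, self-contained form. Everything below is an illustrative stand-in, not the actual ciwGAN/WaveGAN architecture: the dimensions are tiny, a single shared kernel with nearest-neighbour upsampling replaces the learned transposed-convolution stack, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the actual ciwGAN/WaveGAN generator uses far larger shapes.
latent_dim, channels, width = 8, 4, 16

# FC layer of a WaveGAN-style generator: z (latent_dim,) -> feature map (channels, width).
W = rng.standard_normal((latent_dim, channels * width)) * 0.1

def toy_conv_stack(fmap, kernel):
    """Stand-in for the generator's upsampling convolutional stack:
    2x nearest-neighbour upsampling followed by a shared 1-D convolution,
    with channels summed into a single waveform."""
    up = np.repeat(fmap, 2, axis=1)  # (channels, 2 * width)
    return sum(np.convolve(ch, kernel, mode="same") for ch in up)

kernel = rng.standard_normal(5) * 0.1

# Normal forward pass: latent vector -> FC -> feature map -> convolutions.
z = rng.standard_normal(latent_dim)
wave = toy_conv_stack((z @ W).reshape(channels, width), kernel)

# Experiment-1-style probe: feed the weight row of a single latent variable
# directly into the convolutional stack as if it were a feature map,
# bypassing any latent vector.
wave_from_weights = toy_conv_stack(W[3].reshape(channels, width), kernel)

# Experiment-2-style probe: manipulate the FC layer (here, silence every
# weight row except one) while keeping the convolutional stack fixed.
W_manip = np.zeros_like(W)
W_manip[3] = W[3]
wave_manip = toy_conv_stack((z @ W_manip).reshape(channels, width), kernel)
```

Because every operation in the toy stack is linear, the Experiment-2-style output here is exactly the Experiment-1-style output scaled by the surviving latent coordinate; in the real generator, nonlinearities between layers break this equivalence, which is what makes the two probes complementary.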
Related papers
- Linking in Style: Understanding learned features in deep learning models
Convolutional neural networks (CNNs) learn abstract features to perform object classification.
We propose an automatic method to visualize and systematically analyze learned features in CNNs.
(arXiv, 2024-09-25)
- Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning.
In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation.
(arXiv, 2024-05-26)
- Joint Prompt Optimization of Stacked LLMs using Variational Inference
Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences.
By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN)
We show that DLN-2 reaches higher performance than a single layer, suggesting that stacked LLMs may approach the performance of GPT-4.
(arXiv, 2023-06-21)
- Revealing Similar Semantics Inside CNNs: An Interpretable Concept-based Comparison of Feature Spaces
Safety-critical applications require transparency in artificial intelligence components.
Convolutional neural networks (CNNs), widely used for perception tasks, lack inherent interpretability.
We propose two methods for estimating the layer-wise similarity between semantic information inside CNN latent spaces.
(arXiv, 2023-04-30)
- A knowledge-driven vowel-based approach of depression classification from speech using data augmentation
We propose a novel explainable machine learning (ML) model that identifies depression from speech.
Our method first models variable-length utterances at the local level as fixed-size vowel-based embeddings.
Depression is then classified at the global level from a group of vowel CNN embeddings that serve as input to another 1D CNN.
(arXiv, 2022-10-27)
- Latent Topology Induction for Understanding Contextualized Representations
We study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models.
We show there exists a network of latent states that summarize linguistic properties of contextualized representations.
(arXiv, 2022-06-03)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
(arXiv, 2022-03-29)
- SegTransVAE: Hybrid CNN-Transformer with Regularization for medical image segmentation
A novel network named SegTransVAE is proposed in this paper.
SegTransVAE is built on an encoder-decoder architecture, combining a transformer with a variational autoencoder (VAE) branch.
Evaluation on several recently introduced datasets shows that SegTransVAE outperforms previous methods in Dice score and 95% Hausdorff distance.
(arXiv, 2022-01-21)
- RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition
We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition.
We construct convolutional layers inside a RepMLP during training and merge them into the FC for inference.
By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs.
(arXiv, 2021-05-05)
- Interpreting intermediate convolutional layers of CNNs trained on raw speech
We show that averaging over feature maps after ReLU activation in each convolutional layer yields interpretable time-series data.
The proposed technique enables acoustic analysis of intermediate convolutional layers.
(arXiv, 2021-04-19)
- Video-based Facial Expression Recognition using Graph Convolutional Networks
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition.
We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0.
(arXiv, 2020-10-26)
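The RepMLP entry above rests on the fact that a convolution is a linear map and can therefore be absorbed into a fully connected layer for inference. A minimal sketch of that re-parameterization idea, reduced to a single-channel 1-D cross-correlation with "same" padding (the paper itself merges 2-D, multi-channel convolutions into block FC weights):

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 16, 3  # toy signal length and kernel size
x = rng.standard_normal(n)
kernel = rng.standard_normal(k)

def corr1d_same(x, kernel):
    """Sliding-window cross-correlation with zero 'same' padding,
    as convolutional layers compute it."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(kernel)] @ kernel for i in range(len(x))])

def conv_to_fc(kernel, n):
    """Build the equivalent FC weight matrix once, offline: a banded
    (Toeplitz) matrix whose rows are shifted copies of the kernel."""
    pad = len(kernel) // 2
    M = np.zeros((n, n))
    for i in range(n):
        for j, w in enumerate(kernel):
            col = i + j - pad
            if 0 <= col < n:
                M[i, col] = w
    return M

M = conv_to_fc(kernel, n)
# At inference time, M @ x replaces the convolution exactly.
```

After the merge, the "convolution" is just one matrix multiply, which is the trade RepMLP makes: more parameters in exchange for an FC-only inference path.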
This list is automatically generated from the titles and abstracts of the papers on this site.