# 干渉計の光子はどこにありますか。

Where is the photon in an interferometer? ( http://arxiv.org/abs/2007.03624v2 )

In a paper titled "Asking photons where they have been?" \cite{LV} Danan, Farfurnik, Bar-Ad and Vaidman describe an experiment with pre- and post-selected photons going through nested Mach-Zehnder interferometers. They find that some of the mirrors leave no footprints on the signal and interpret this as evidence that the photon skipped these mirrors. They argue that the experiment supports Aharonov-Vaidman's formulation of quantum mechanics \cite{AharonovVaidman} where post-selected particles are assigned disconnected trajectories. I review the experiment and analyze it within the orthodox framework of quantum mechanics. The standard view of interfering trajectories accounts for the experimental findings.
# 近似グラフ彩色のためのハイブリッド量子古典アルゴリズム

Hybrid quantum-classical algorithms for approximate graph coloring ( http://arxiv.org/abs/2011.13420v2 )

We show how to apply the recursive quantum approximate optimization algorithm (RQAOA) to MAX-$k$-CUT, the problem of finding an approximate $k$-vertex coloring of a graph. We compare this proposal to the best known classical and hybrid classical-quantum algorithms. First, we show that the standard (non-recursive) QAOA fails to solve this optimization problem for most regular bipartite graphs at any constant level $p$: the approximation ratio achieved by QAOA is hardly better than assigning colors to vertices at random. Second, we construct an efficient classical simulation algorithm which simulates level-$1$ QAOA and level-$1$ RQAOA for arbitrary graphs. In particular, these hybrid algorithms give rise to efficient classical algorithms, and no benefit arising from the use of quantum mechanics is to be expected. Nevertheless, they provide a suitable testbed for assessing the potential benefit of hybrid algorithms: we use the simulation algorithm to perform large-scale simulations of level-$1$ QAOA and RQAOA with up to $300$ qutrits applied to ensembles of randomly generated $3$-colorable constant-degree graphs. We find that level-$1$ RQAOA is surprisingly competitive: for the ensembles considered, its approximation ratios are often higher than those achieved by the best known generic classical algorithm based on rounding an SDP relaxation. This suggests the intriguing possibility that higher-level RQAOA may be a potentially useful algorithm for NISQ devices.
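The random-assignment baseline mentioned in this abstract can be illustrated with a short sketch. Coloring each vertex uniformly at random with one of $k$ colors cuts each edge with probability $1 - 1/k$. The example graph (a 6-cycle), the function names, and the trial count below are illustrative assumptions, not taken from the paper.

```python
import random


def cut_fraction(edges, coloring):
    """Fraction of edges whose two endpoints receive different colors."""
    cut = sum(1 for u, v in edges if coloring[u] != coloring[v])
    return cut / len(edges)


def random_coloring(num_vertices, k, rng):
    """Assign each vertex one of k colors uniformly at random."""
    return [rng.randrange(k) for _ in range(num_vertices)]


# Hypothetical test graph: a cycle on 6 vertices (bipartite and 2-regular).
edges = [(i, (i + 1) % 6) for i in range(6)]
k = 3
rng = random.Random(0)
trials = 20000

# Empirical mean cut fraction; it should be close to 1 - 1/k.
mean_cut = sum(
    cut_fraction(edges, random_coloring(6, k, rng)) for _ in range(trials)
) / trials
print(f"empirical: {mean_cut:.3f}, expected: {1 - 1 / k:.3f}")
```

Because a bipartite graph admits a coloring that cuts every edge, the cut fraction here coincides with the approximation ratio of the random baseline.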
# デジタルコンタクトトレーシング:bluetoothベースのアプリの障害に代わる大規模位置情報データ

Digital Contact Tracing: Large-scale Geolocation Data as an Alternative to Bluetooth-based Apps' Failure ( http://arxiv.org/abs/2101.07024v2 )

The currently deployed contact-tracing mobile apps have failed as an efficient solution in the context of the COVID-19 pandemic. None of them has managed to attract the number of active users required to achieve an efficient operation. This urges the research community to re-open the debate and explore new avenues that lead to efficient contact-tracing solutions. This paper contributes to this debate with an alternative contact-tracing solution that leverages already available geolocation information owned by BigTech companies with very large penetration rates in most countries adopting contact-tracing mobile apps. Moreover, our solution provides sufficient privacy guarantees to protect the identity of infected users as well as precluding Health Authorities from obtaining the contact graph from individuals.
# 1.6 tbps 古典チャネルとdv-qkdが共存する中空コアネスト反共振ノードレスファイバー(hc-nanf)

1.6 Tbps Classical Channel Coexistence With DV-QKD Over Hollow Core Nested Antiresonant Nodeless Fibre (HC-NANF) ( http://arxiv.org/abs/2106.14560v2 )

We demonstrate for the first time the coexistence of a quantum channel and 8x200 Gbps 16-QAM optical channels with launching powers as high as -9 dBm per channel in a 2 km HC-NANF. Comparative analysis with single-mode fibre reveals that the quantum channel could not be sustained at such power levels.
# ランダム安定化テンソルネットワークにおける絡み合い相転移

Entanglement phase transitions in random stabilizer tensor networks ( http://arxiv.org/abs/2107.12376v2 )

We explore a class of random tensor network models with "stabilizer" local tensors which we name Random Stabilizer Tensor Networks (RSTNs). For RSTNs defined on a two-dimensional square lattice, we perform extensive numerical studies of entanglement phase transitions between volume-law and area-law entangled phases of the one-dimensional boundary states. These transitions occur when either (a) the bond dimension $D$ of the constituent tensors is varied, or (b) the tensor network is subject to random breaking of bulk bonds, implemented by forced measurements. In the absence of broken bonds, we find that the RSTN supports a volume-law entangled boundary state with bond dimension $D\geq3$ where $D$ is a prime number, and an area-law entangled boundary state for $D=2$. Upon breaking bonds at random in the bulk with probability $p$, there exists a critical measurement rate $p_c$ for each $D\geq 3$ above which the boundary state becomes area-law entangled. To explore the conformal invariance at these entanglement transitions for different prime $D$, we consider tensor networks on a finite rectangular geometry with a variety of boundary conditions, and extract universal operator scaling dimensions via extensive numerical calculations of the entanglement entropy, mutual information and mutual negativity at their respective critical points. Our results at large $D$ approach known universal data of percolation conformal field theory, while showing clear discrepancies at smaller $D$, suggesting a distinct entanglement transition universality class for each prime $D$. We further study universal entanglement properties in the volume-law phase and demonstrate quantitative agreement with the recently proposed description in terms of a directed polymer in a random environment.
# ブールテンソルネットワークのための量子アニーリングアルゴリズム

Quantum Annealing Algorithms for Boolean Tensor Networks ( http://arxiv.org/abs/2107.13659v4 )

Quantum annealers manufactured by D-Wave Systems, Inc., are computational devices capable of finding high-quality solutions of NP-hard problems. In this contribution, we explore the potential and effectiveness of such quantum annealers for computing Boolean tensor networks. Tensors offer a natural way to model high-dimensional data commonplace in many scientific fields, and representing a binary tensor as a Boolean tensor network is the task of expressing a tensor containing categorical (i.e., {0, 1}) values as a product of low-dimensional binary tensors. A Boolean tensor network is computed by Boolean tensor decomposition, and it is usually not exact. The aim of such decomposition is to minimize the given distance measure between the high-dimensional input tensor and the product of lower-dimensional (usually three-dimensional) tensors and matrices representing the tensor network. In this paper, we introduce and analyze three general algorithms for Boolean tensor networks: Tucker, Tensor Train, and Hierarchical Tucker networks. The computation of a Boolean tensor network is reduced to a sequence of Boolean matrix factorizations, which we show can be expressed as a quadratic unconstrained binary optimization problem suitable for solving on a quantum annealer. By using a novel method we introduce called \textit{parallel quantum annealing}, we demonstrate that tensors with up to millions of elements can be decomposed efficiently using a D-Wave 2000Q quantum annealer.
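To make the Boolean matrix factorization objective concrete, here is a minimal sketch of the Boolean matrix product and a distance between a binary matrix and its factorization. The abstract only speaks of a "given distance measure"; the Hamming distance used here, the function names, and the example matrices are assumptions for illustration and do not reproduce the paper's QUBO formulation.

```python
def boolean_product(A, B):
    """Boolean matrix product: C[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    inner = len(B)
    cols = len(B[0])
    return [
        [int(any(A[i][k] and B[k][j] for k in range(inner))) for j in range(cols)]
        for i in range(len(A))
    ]


def hamming_distance(X, Y):
    """Number of entries in which two equally sized binary matrices differ."""
    return sum(x != y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))


# Hypothetical 3x3 binary matrix and a rank-2 Boolean factorization of it.
X = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]

error = hamming_distance(X, boolean_product(A, B))
print(error)  # 0: this particular factorization is exact
```

In general the factorization is not exact, and the optimization searches for binary factors `A` and `B` that make this error as small as possible.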
# 新型コロナウイルス(covid-19)パンデミックに対するオープンウェブの弾力性は?

How resilient is the Open Web to the COVID-19 pandemic? ( http://arxiv.org/abs/2107.14534v3 )

In this paper we use the term Open Web to refer to the set of services offered freely to Internet users, representing a pillar of modern societies. Despite its importance for society, it is unknown how the COVID-19 pandemic is affecting the Open Web. In this paper, we address this issue, focusing our analysis on Spain, one of the countries which have been most impacted by the pandemic. On the one hand, we study the impact of the pandemic on the financial backbone of the Open Web, the online advertising business. To this end, we leverage concepts from Supply-Demand economic theory to perform a careful analysis of the elasticity in the supply of ad spaces to the financial shortage of the online advertising business and its subsequent reduction in ad spaces' price. On the other hand, we analyze the distribution of the Open Web composition across business categories and its evolution during the COVID-19 pandemic. These analyses are conducted between Jan 1st and Dec 31st, 2020, using a reference dataset comprising information from more than 18 billion ad spaces. Our results indicate that the Open Web has experienced a moderate shift in its composition across business categories. However, this change is not produced by the financial shortage of the online advertising business, because as our analysis shows, the Open Web's supply of ad spaces is inelastic (i.e., insensitive) to the sustained low price of ad spaces during the pandemic. Instead, existing evidence suggests that the reported shift in the Open Web composition is likely due to the change in the users' online behavior (e.g., browsing and mobile apps utilization patterns).
# ユニタリ演算間の決定論的変換:適応量子回路と不定因果性の指数的優位性

Deterministic transformations between unitary operations: Exponential advantage with adaptive quantum circuits and the power of indefinite causality ( http://arxiv.org/abs/2109.08202v3 )

This work analyses the performance of quantum circuits and general processes to transform $k$ uses of an arbitrary unitary operation $U$ into another unitary operation $f(U)$. When the desired function $f$ is a homomorphism, i.e., $f(UV)=f(U)f(V)$, it is known that optimal average fidelity is attainable by parallel circuits and indefinite causality does not provide any advantage. Here we show that the situation changes dramatically when considering anti-homomorphisms, i.e., $f(UV)=f(V)f(U)$. In particular, we prove that when $f$ is an anti-homomorphism, sequential circuits could exponentially outperform parallel ones and processes with indefinite causal order could outperform sequential ones. We present explicit constructions on how to obtain such advantages for the unitary inversion task $f(U)=U^{-1}$ and the unitary transposition task $f(U)=U^T$. We also establish a one-to-one connection between the problem of unitary estimation and parallel unitary transposition, allowing one to easily translate results from one field to the other. Finally, we apply our results to several concrete problem instances and present a method based on computer-assisted proofs to show optimality.
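As a quick check of the definitions in this abstract, the two tasks it names are indeed anti-homomorphisms, since both reverse the order of a product. Complex conjugation is added here only as a standard contrasting example of a homomorphism; it is not mentioned in the abstract.

$$
(UV)^{-1} = V^{-1}U^{-1}, \qquad (UV)^{T} = V^{T}U^{T}, \qquad (UV)^{*} = U^{*}V^{*}.
$$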
# 完全可解量子磁石のダイナミクス:量子コンピュータのためのベンチマーク

Dynamics in an exact solvable quantum magnet: benchmark for quantum computer ( http://arxiv.org/abs/2109.11371v5 )

Quantum magnets are never short of novel and fascinating dynamics, yet their simulation by classical computers requires exponentially scaling computational resources, which renders research on large-scale many-body dynamics fiendishly difficult. In this letter, we explore the dynamic behavior of the 2D large-scale ferromagnetic J1-J2 Heisenberg model both theoretically and experimentally. First, the analytical solution of the magnon dynamics is obtained, showing a clear ballistic propagation of magnons, which is typical of a quantum walk. Then, we verify the dynamic behavior of the system numerically using exact diagonalization and tensor network methods. We also calculate out-of-time-ordered correlators and butterfly velocities among different lattice points, finding that they capture the competition between different couplings well. Finally, a quantum walk experiment is designed and conducted on IBM programmable quantum processors, and the experimental results are consistent with our theoretical predictions. Since the analytical results can be used, in principle, to predict the behavior of large-scale quantum many-body systems and even infinitely large ones, this work will help facilitate further research on quantum walks and quantum many-body dynamics in large-scale lattice systems, guide the future design of quantum computers, and popularize quantum computers until they are known and available to every household in the world.
# 逆量子速度限界と最小ヒルベルト空間ノルム

Reverse quantum speed limit and minimum Hilbert space norm ( http://arxiv.org/abs/2110.01369v2 )

The reverse quantum speed limit (RQSL) gives an upper limit to the time for evolution between initial and final quantum states. We show that, in conjunction with the existence of a minimum time scale, the RQSL implies a lower limit to the norm of the change in a quantum state, and confirm that this limit is satisfied in two-state and ideal-measurement models. Such a lower limit is of relevance for interpretational issues in probability and for understanding the meaning of probability in Everett quantum theory.
# 強局在回路における励起スペクトル統計

Exact Spectral Statistics in Strongly Localised Circuits ( http://arxiv.org/abs/2110.15938v3 )

Since the seminal work of Anderson, localisation has been recognised as a standard mechanism allowing quantum many-body systems to escape ergodicity. This idea acquired even more prominence in the last decade as it has been argued that localisation -- dubbed many-body localisation (MBL) in this context -- can sometimes survive local interactions in the presence of sufficiently strong disorder. A conventional signature of localisation is in the statistical properties of the spectrum -- spectral statistics -- which differ qualitatively from those in the ergodic phase. Although features of the spectral statistics are routinely used as numerical diagnostics for localisation, they have never been derived from first principles in the presence of non-trivial interactions. Here we fill this gap and provide the example of a simple class of quantum many-body systems -- which we dub strongly localised quantum circuits -- that are interacting, localised, and where the spectral statistics can be characterised exactly. Furthermore, we show that these systems exhibit a cascade of three different regimes for spectral correlations depending on the energy scale: at small, intermediate, and large scales they behave as disconnected patches of three decreasing sizes. We argue that these features appear in generic MBL systems, with the difference that only at the smallest scale they do become Poissonian.
# 業界メンターとオープンソースインターンシップ

Open-Source Internships With Industry Mentors ( http://arxiv.org/abs/2111.04414v2 )

Internships help students connect what they have learned in the classroom to the real world, and students with access to internships are more likely to graduate and secure employment. However, many students are unable to find an internship by the time they graduate. This experience report describes a program where volunteer software engineers mentor students as they work on open-source projects in the summer, offered as an alternative to a traditional internship experience. We catalog the considerations involved in providing an experience similar to a traditional internship, describe our program's design, and provide two years' worth of participant evaluations and career outcomes as a measure of efficacy. The program served mostly undergraduates from non-R1 schools who are underrepresented in technology, and achieved similar educational outcomes to a traditional internship program. Most promisingly, mentors were willing to serve as a professional reference for 80% of students and the number of graduating seniors who secured full-time employment in technology was 7 points higher than average (despite occurring during the COVID-19 pandemic).
# 平面角の量子記述

Quantum Description of Angles in the Plane ( http://arxiv.org/abs/2111.11501v2 )

The real plane with its set of orientations or angles in $[0,\pi)$ is the simplest nontrivial example of a (projective) Hilbert space and provides nice illustrations of quantum formalism. We present some of them, namely covariant integral quantization, linear polarisation of light as a quantum measurement, interpretation of entanglement leading to the violation of Bell inequalities, and spin one-half coherent states viewed as two entangled angles.
# 線形結合粒子検出器を用いた複雑なスカラー場とフェルミオン場からのハーベストング絡み

Harvesting entanglement from complex scalar and fermionic fields with linearly coupled particle detectors ( http://arxiv.org/abs/2111.12779v4 )

We explore entanglement harvesting with particle detectors that couple linearly to non-Hermitian fields. Specifically, we analyze the case of particle detectors coupled to a complex scalar quantum field and to a spin 1/2 fermionic field. We find that the complex scalar model can be a good approximation for the fermionic model in the protocol of entanglement harvesting when the mass of the field is sufficiently large compared to the inverse interaction time. Moreover, we show that by taking advantage of the U(1) degree of freedom of a complex detector it is possible to increase the harvested negativity by up to two orders of magnitude when compared to the case of a real detector.
# 非エルミート量子力学の数学的定式化と観測可能幾何相

A mathematical formalism of non-Hermitian quantum mechanics and observable-geometric phases ( http://arxiv.org/abs/2111.12883v3 )

We present a mathematical formalism of non-Hermitian quantum mechanics, following the Dirac-von Neumann formalism of quantum mechanics. In this formalism, the state postulate is the same as in the Dirac-von Neumann formalism, but the observable postulate should be changed to include para-Hermitian operators (spectral operators of scalar type with real spectrum) representing observables; as such, both the measurement postulate and the evolution postulate must be modified accordingly. This is based on a Stone-type theorem, proved here, stating that the dynamics of non-Hermitian quantum systems is governed by para-unitary time evolution. The Born formula for the expectation of an observable in a given state is stated in the non-Hermitian setting. It is proved to be equal to the usual Born rule for every Hermitian observable, but for a non-Hermitian one it may depend on the measurement via the choice of a metric operator associated with the non-Hermitian observable under measurement. Our formalism is neither Hamiltonian-dependent nor basis-dependent, but can recover both PT-symmetric and biorthogonal quantum mechanics, and it reduces to the Dirac-von Neumann formalism of quantum mechanics in the Hermitian setting. As an application, we study observable-geometric phases for non-Hermitian quantum systems.
# 確率的不整合操作により蒸留可能な状態の基準

Criterion for a state to be distillable via stochastic incoherent operations ( http://arxiv.org/abs/2112.06168v2 )

Coherence distillation is a basic information-theoretic task in the resource theory of coherence. In this paper, we present the necessary and sufficient conditions under which a mixed state can be distilled into a pure coherent state via stochastic incoherent operations (sIOs). With the help of this result, we further show the following: (i) Any $2$-dimensional coherent state is distillable via sIOs if and only if it is a pure coherent state; (ii) a state $\rho$ is n-distillable via sIOs if and only if it is 1-distillable; and (iii) the set of distillable states via stochastic maximally incoherent operations is identical to the set of distillable states via sIOs. Finally, we analyze the reason why sIO is stronger than stochastic strictly incoherent operations when we use them to distill a coherent state.
# 創発的局所性をもつ非局所モデルにおける情報伝達

Information propagation in a non-local model with emergent locality ( http://arxiv.org/abs/2112.07541v3 )

In this paper, we revisit a "relatively local" model proposed in arXiv:1811.07241, where locality and dimensionality of space emerge only from the entanglement structure of the state the system is in. Various quantities such as butterfly velocity/entanglement speed can be defined similarly, at least in the regime where locality is well defined and a light cone structure emerges in the correlation between sites. We find that the relations observed between them in local models (arXiv:1908.06993) are not respected. In particular, we conjecture that the hierarchy of the interaction over different distances provides different "layers" of light cones. When long-range interactions are sufficiently suppressed, the effective light cones are dominated by linear behaviors with little remnant of non-locality. This could potentially be used as a physical smoking gun for emergent locality in non-local models.
# 自転車ネットワークにおける欠落リンクの自動検出

Automated Detection of Missing Links in Bicycle Networks ( http://arxiv.org/abs/2201.03402v2 )

Cycling is an effective solution for making urban transport more sustainable. However, bicycle networks are typically developed in a slow, piecewise process that leaves open a large number of gaps, even in well developed cycling cities like Copenhagen. Here, we develop the IPDC procedure (Identify, Prioritize, Decluster, Classify) for finding the most important missing links in urban bicycle networks, using data from OpenStreetMap. In this procedure we first identify all possible gaps following a multiplex network approach, prioritize them according to a flow-based metric, decluster emerging gap clusters, and manually classify the types of gaps. We apply the IPDC procedure to Copenhagen and report the 105 top priority gaps. For evaluation, we compare these gaps with the city's most recent Cycle Path Prioritization Plan and find considerable overlaps. Our results show how network analysis with minimal data requirements can serve as a cost-efficient support tool for bicycle network planning. By taking into account the whole city network for consolidating urban bicycle infrastructure, our data-driven framework can complement localized, manual planning processes for more effective, city-wide decision-making.
# 半導体量子ドットにおけるスピンの絡み合いと電荷自由度

Entangling spin and charge degrees of freedom in semiconductor quantum dots ( http://arxiv.org/abs/2201.06506v2 )

In this theoretical manuscript I propose a scheme for entangling a single electron semiconductor spin qubit with a single electron semiconductor charge qubit in a triangular triple quantum dot configuration. Two out of three quantum dots are used to define a single electron semiconductor charge qubit. Furthermore, the spin qubit is embedded in the Zeeman sub-levels of the third quantum dot. Combining single qubit gates with entangling CNOT gates allows one to construct a SWAP gate, and therefore to use the semiconductor spin qubit as a long-lived memory for the semiconductor charge qubit.
# 高磁場勾配を実現する集積電流搬送ワイヤを用いた表面イオントラップの作製

Fabrication of Surface Ion Traps with Integrated Current Carrying Wires enabling High Magnetic Field Gradients ( http://arxiv.org/abs/2202.02313v2 )

A major challenge for quantum computers is the scalable simultaneous execution of quantum gates. One approach to address this in trapped ion quantum computers is the implementation of quantum gates based on static magnetic field gradients and global microwave fields. In this paper, we present the fabrication of surface ion traps with integrated copper current carrying wires embedded inside the substrate below the ion trap electrodes, capable of generating high magnetic field gradients. The copper layer's measured sheet resistance of 1.12 m$\Omega$/sq at room temperature is sufficiently low to incorporate complex designs, without excessive power dissipation at high currents causing a thermal runaway. At a temperature of 40 K the sheet resistance drops to 20.9 $\mu\Omega$/sq giving a lower limit for the residual resistance ratio of 100. Continuous currents of 13 A can be applied, resulting in a simulated magnetic field gradient of 144 T/m at the ion position, which is 125 $\mu$m from the trap surface for the particular anti-parallel wire pair in our design.
# 保護されたゼロモードのスカーと$U(1)$量子リンクおよび量子二量体モデル

Scars from protected zero modes and beyond in $U(1)$ quantum link and quantum dimer models ( http://arxiv.org/abs/2202.03451v3 )

We demonstrate the presence of anomalous high-energy eigenstates, or many-body scars, in $U(1)$ quantum link and quantum dimer models on square and rectangular lattices. In particular, we consider the paradigmatic Rokhsar-Kivelson Hamiltonian $H=\mathcal{O}_{\mathrm{kin}} + \lambda \mathcal{O}_{\mathrm{pot}}$ where $\mathcal{O}_{\mathrm{pot}}$ ($\mathcal{O}_{\mathrm{kin}}$) is defined as a sum of terms on elementary plaquettes that are diagonal (off-diagonal) in the computational basis. Both these interacting models possess an exponentially large number of mid-spectrum zero modes in system size at $\lambda=0$ that are protected by an index theorem preventing any mixing with the nonzero modes at this coupling. We classify different types of scars for $|\lambda| \lesssim \mathcal{O}(1)$ both at zero and finite winding number sectors complementing and significantly generalizing our previous work [Banerjee and Sen, Phys. Rev. Lett. 126, 220601 (2021)]. The scars at finite $\lambda$ show a rich variety with those that are composed solely from the zero modes of $\mathcal{O}_{\mathrm{kin}}$, those that contain an admixture of both the zero and the nonzero modes of $\mathcal{O}_{\mathrm{kin}}$, and finally those composed solely from the nonzero modes of $\mathcal{O}_{\mathrm{kin}}$. We give analytic expressions for certain "lego scars" for the quantum dimer model on rectangular lattices where one of the linear dimensions can be made arbitrarily large, with the building blocks (legos) being composed of emergent singlets and other more complicated entangled structures.
# 非線形シュレーディンガー方程式のブローアップに対するザハロフ・グラスニー法の拡張

Enhancement of the Zakharov-Glassey's method for Blow-Up in nonlinear Schroedinger equations ( http://arxiv.org/abs/2203.08522v3 )

In this paper we give a sharper condition for the blow-up of the solution to a nonlinear Schroedinger equation with free/Stark/quadratic potential by improving the well known Zakharov-Glassey's method.
# ランクベース非支配ソーティング

Rank-based Non-dominated Sorting ( http://arxiv.org/abs/2203.13654v2 )

Non-dominated sorting is a computational bottleneck in Pareto-based multi-objective evolutionary algorithms (MOEAs) due to the runtime-intensive comparison operations involved in establishing dominance relationships between solution candidates. In this paper we introduce Rank Sort, a non-dominated sorting approach exploiting sorting stability and ordinal information to avoid expensive dominance comparisons in the rank assignment phase. Two algorithmic variants are proposed: the first one, RankOrdinal (RO), uses ordinal rank comparisons in order to determine dominance and requires O(N) space; the second one, RankIntersect (RS), uses set intersections and bit-level parallelism and requires O(N^2) space. We demonstrate the efficiency of the proposed methods in comparison with other state of the art algorithms in empirical simulations using the NSGA2 algorithm as well as synthetic benchmarks. The RankIntersect algorithm is able to significantly outperform the current state of the art offering up to 30% speed-up for many objectives. C++ implementations are provided for all algorithms.
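For context on why dominance comparisons are the bottleneck, the sketch below shows a naive baseline that peels off Pareto fronts by comparing every pair of remaining candidates. It is not the Rank Sort algorithm from the paper; the function names, the minimization convention, and the example points are assumptions for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better
    in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def naive_non_dominated_sort(points):
    """Return a list of fronts, each a sorted list of indices into `points`.

    Every pass compares each remaining point against all others, so the
    number of dominance checks grows quickly with the population size.
    """
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Hypothetical bi-objective population.
points = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
fronts = naive_non_dominated_sort(points)
print(fronts)  # [[0, 1, 2], [3], [4]]
```

Approaches such as the one described in the abstract aim to assign the same ranks while avoiding most of these pairwise checks.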
# ボース・アインシュタイン凝縮体の大環におけるカオスオンセット

Chaos onset in large rings of Bose-Einstein condensates ( http://arxiv.org/abs/2203.14625v1 )

We consider large rings of weakly-coupled Bose-Einstein condensates, analyzing their transition to chaotic dynamics and loss of coherence. Initially, a ring is considered to be in an eigenstate, i.e. in a commensurate configuration with equal site fillings and equal phase differences between neighboring sites. Such a ring should exhibit a circulating current whose value will depend on the initial, non-zero phase difference. The appearance of such currents is a signature of an established coherence along the ring. If the phase difference falls between $\pi/2$ and $3\pi/2$ and the interparticle interaction in the condensates exceeds a critical interaction value $u_c$, the coherence is supposed to be quickly destroyed because the system enters a chaotic regime due to inherent instabilities. This is, however, only a part of the story. It turns out that the chaotic dynamics and the resulting averaging of the circular current to zero are generally delayed by a critical time-scale $t_c$, which is almost two orders of magnitude larger than the one expected from the linear stability analysis. We study the critical time-scale in detail in a broad parameter range.
# DV-QKDがHollowコアファイバー上で1.6Tbpsのクラシックチャネルで共存

DV-QKD Coexistence With 1.6 Tbps Classical Channels Over Hollow Core Fibre ( http://arxiv.org/abs/2203.14621v1 )

The feasibility of coexisting a quantum channel with carrier-grade classical optical channels over Hollow Core Nested Antiresonant Nodeless Fibre (HC-NANF) is experimentally explored for the first time in terms of achievable quantum bit error rate (QBER), secret key rate (SKR) as well as classical signal bit error rates (BER). A coexistence transmission of 1.6 Tbps is achieved for the classical channels simultaneously with a quantum channel over a 2 km-long HC-NANF with a total coexistence power of 0 dBm. To find the best and worst wavelength position for the classical channels, we simulated different classical channel bands with different spacing between the quantum and classical channels, considering the crosstalk generated from both Raman scattering and four-wave-mixing (FWM) on the quantum channel. Following our simulation, we numerically estimate the best (Raman spectrum dip) and worst (Raman spectrum peak) locations of the classical channel with respect to its impact on the performance of the quantum channel in terms of SKR and QBER. We further implemented a testbed to experimentally test both single mode fibre (SMF) and HC-NANF in the best and worst-case scenarios. In the best-case scenario, the spacing between the quantum and classical channels is 200 GHz (1.6 nm) with 50 GHz (0.4 nm) spacing between each classical channel. The SKR was preserved without any noticeable changes when coexisting the quantum channel with eight classical channels at 0 dBm total coexistence power in HC-NANF, compared to a significant drop of 73% when using SMF at -24 dBm total coexistence power, which is 250 times lower than the power used in HC-NANF. In the worst-case scenario using the same powers, and with 1 THz (8 nm) spacing between quantum and classical channels, the SKR dropped 10% using the HC-NANF, whereas in the SMF the SKR plummeted to zero.
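The "250 times lower" figure follows from converting the two total powers from dBm to linear units. The sketch below does that arithmetic and also shows what 0 dBm total corresponds to per channel, under the assumption (not stated in the abstract) that the power is split equally over the eight classical channels. Function names are illustrative.

```python
import math


def dbm_to_mw(p_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


def mw_to_dbm(p_mw):
    """Convert a power in milliwatts to dBm."""
    return 10 * math.log10(p_mw)


# Ratio between 0 dBm (HC-NANF) and -24 dBm (SMF) total coexistence power.
ratio = dbm_to_mw(0) / dbm_to_mw(-24)
print(f"power ratio: {ratio:.1f}")  # about 251, i.e. roughly 250 times

# Assumption: the 0 dBm total is shared equally by eight classical channels.
per_channel_dbm = mw_to_dbm(dbm_to_mw(0) / 8)
print(f"per-channel power: {per_channel_dbm:.2f} dBm")  # about -9 dBm
```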
# 2次元ドリフト解析:2つの関数を同時に最適化することは難しい

Two-Dimensional Drift Analysis: Optimizing Two Functions Simultaneously Can Be Hard ( http://arxiv.org/abs/2203.14547v1 )

In this paper we show how to use drift analysis in the case of two random variables $X_1, X_2$, when the drift is approximately given by $A\cdot (X_1,X_2)^T$ for a matrix $A$. The non-trivial case is that $X_1$ and $X_2$ impede each other's progress, and we give a full characterization of this case. As an application, we develop and analyze a minimal example TwoLinear of a dynamic environment that can be hard. The environment consists of two linear functions $f_1$ and $f_2$ with positive weights $1$ and $n$, and in each generation selection is based on one of them at random. They only differ in the set of positions that have weight $1$ and $n$. We show that the $(1+1)$-EA with mutation rate $\chi/n$ is efficient for small $\chi$ on TwoLinear, but does not find the shared optimum in polynomial time for large $\chi$.
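The setting described in this abstract can be sketched as a small simulation. The abstract does not specify which positions carry weight $n$ in each function, so the split below (first half for $f_1$, second half for $f_2$), the tie-accepting selection rule, and all function names are assumptions for illustration rather than the paper's exact TwoLinear definition.

```python
import random


def mutate(x, chi, rng):
    """Standard bit mutation: flip each bit independently with probability chi/n."""
    n = len(x)
    return [b ^ 1 if rng.random() < chi / n else b for b in x]


def linear_fitness(x, heavy):
    """Linear function with weight n on positions in `heavy` and weight 1 elsewhere."""
    n = len(x)
    return sum((n if i in heavy else 1) * b for i, b in enumerate(x))


def one_plus_one_ea(n, chi, max_gens, seed=0):
    """Run a (1+1)-EA where each generation selects with f1 or f2 at random.

    Returns the generation at which the all-ones string (the shared optimum)
    is first reached, or None if it is not reached within max_gens.
    """
    rng = random.Random(seed)
    # Assumption: f1 is heavy on the first half, f2 on the second half.
    heavy_1 = set(range(n // 2))
    heavy_2 = set(range(n // 2, n))
    x = [rng.randrange(2) for _ in range(n)]
    for gen in range(max_gens):
        if all(x):
            return gen
        heavy = heavy_1 if rng.random() < 0.5 else heavy_2
        y = mutate(x, chi, rng)
        # Elitist selection with respect to the function chosen this generation.
        if linear_fitness(y, heavy) >= linear_fitness(x, heavy):
            x = y
    return None


result = one_plus_one_ea(n=16, chi=1.0, max_gens=2000)
print(result)
```

Varying `chi` in such a simulation gives an informal feel for the small-$\chi$ versus large-$\chi$ behavior, although only the paper's analysis establishes the actual runtime bounds.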
# 離散最適化のための量子アルゴリズムにおけるトレードオフと設計ツールキットのエンコード:色付け、ルーティング、スケジューリング、その他の問題

Encoding trade-offs and design toolkits in quantum algorithms for discrete optimization: coloring, routing, scheduling, and other problems ( http://arxiv.org/abs/2203.14432v1 )

Challenging combinatorial optimization problems are ubiquitous in science and engineering. Several quantum methods for optimization have recently been developed, in different settings including both exact and approximate solvers. Addressing this field of research, this manuscript has three distinct purposes. First, we present an intuitive method for synthesizing and analyzing discrete (i.e., integer-based) optimization problems, wherein the problem and corresponding algorithmic primitives are expressed using a discrete quantum intermediate representation (DQIR) that is encoding-independent. This compact representation often allows for more efficient problem compilation, automated analyses of different encoding choices, easier interpretability, more complex runtime procedures, and richer programmability, as compared to previous approaches, which we demonstrate with a number of examples. Second, we perform numerical studies comparing several qubit encodings; the results exhibit a number of preliminary trends that help guide the choice of encoding for a particular set of hardware and a particular problem and algorithm. Our study includes problems related to graph coloring, the traveling salesperson problem, factory/machine scheduling, financial portfolio rebalancing, and integer linear programming. Third, we design low-depth graph-derived partial mixers (GDPMs) up to 16-level quantum variables, demonstrating that compact (binary) encodings are more amenable to QAOA than previously understood. We expect this toolkit of programming abstractions and low-level building blocks to aid in designing quantum algorithms for discrete combinatorial problems.
# Accessing the topological Mott insulator in cold atom quantum simulators with realistic Rydberg dressing

Accessing the topological Mott insulator in cold atom quantum simulators with realistic Rydberg dressing ( http://arxiv.org/abs/2203.14818v1 )

The interplay between many-body interactions and the kinetic energy gives rise to rich phase diagrams hosting, among others, interaction-induced topological phases. These phases are characterized by both a local order parameter and a global topological invariant, and can exhibit exotic ground states such as self-trapped polarons and interaction-induced edge states. In this work, we investigate a realistic scenario for the quantum simulation of such systems using cold Rydberg-dressed atoms in optical lattices. We consider spinless fermions on a checkerboard lattice, interacting via the tunable-range effective potential induced by the Rydberg dressing. We perform a detailed analysis of the phase diagram at half- and incommensurate fillings, in the mean-field approximation. We furthermore study the stability of the phases with respect to temperature within the mean-field approximation and with respect to quantum fluctuations using the density matrix renormalization group method. Finally, we propose an implementation protocol, and in particular identify attainable regimes of experimental parameters in which the topological properties of the model become accessible. Our work, thereby, opens a realistic pathway to the outstanding experimental observation of this predicted phase in state-of-the-art cold atom quantum simulators.
# Performance of the collective three-level quantum thermal engine

Performance of the collective three-level quantum thermal engine ( http://arxiv.org/abs/2203.14811v1 )

We investigate the performance of a microscopic quantum heat engine consisting of V- or Lambda-type emitters interacting collectively or independently when being in contact with environmental thermal reservoirs. Though the efficiency of a Carnot cycle is always higher than those associated with these setups, we have found that the performance of the cooperative Lambda-type heat engine may be larger than that of the V-type under similar conditions. Cooperativity among the emitters plays an important role for the Lambda-type setup, significantly improving its performance, while it is less relevant for a V-type thermal engine. This is because the population inversion on the working atomic transition as well as its off-diagonal elements behave differently for these two atomic ensembles.
# Postselection-free controlled generation of high-dimensional OAM entangled state

Postselection-free controlled generation of high-dimensional OAM entangled state ( http://arxiv.org/abs/2203.14799v1 )

High-dimensional entangled states in orbital angular momentum (OAM) basis offer unique advantages for several quantum information applications. However, a given quantum information application for its optimal performance requires generation of a specific OAM entangled state with full control over the shape of the OAM Schmidt spectrum. Spontaneous parametric downconversion (SPDC) is the most widely used method for generating OAM entangled states. Most of the existing methods for controlling the generation of the state employ postselection, which invariably compromises the security benefits of quantum information applications. The postselection-free generation of OAM entangled states can be achieved either by adjusting the SPDC phase matching condition or by shaping the profile of the pump field used for down-conversion. While it is known that the phase matching adjustments can only change the width of the OAM Schmidt spectrum, even the most recent works based on pump shaping have demonstrated generation of only up to 24-dimensional states with very limited control over the OAM Schmidt spectrum. Here, we propose and experimentally demonstrate a technique that employs both the phase matching adjustments and pump-shaping for generating high-dimensional OAM entangled states with complete control over the shape of the OAM Schmidt spectrum. We report generation of up to 200-dimensional OAM entangled states with three different spectra, namely, Gaussian, rectangular and triangular, with up to 99% generation accuracy.
# Optimized design of the lithium niobate for spectrally-pure-state generation at MIR wavelengths using metaheuristic algorithm

Optimized design of the lithium niobate for spectrally-pure-state generation at MIR wavelengths using metaheuristic algorithm ( http://arxiv.org/abs/2203.14783v1 )

Quantum light sources in the mid-infrared (MIR) band play an important role in many applications, such as quantum sensing, quantum imaging, and quantum communication. However, there is still a lack of high-quality quantum light sources in the MIR band, such as a spectrally pure single-photon source. In this work, we present the generation of a spectrally pure state in an optimized poled lithium niobate crystal using a metaheuristic algorithm. In particular, we adopt the particle swarm optimization algorithm to optimize the duty cycle of the poling period of the lithium niobate crystal. With our approach, the spectral purity can be improved from 0.820 to 0.998 under the third group-velocity-matched condition, and the wavelength-tunable range is from 3.0 $\mu$m to 4.0 $\mu$m for the degenerate case and 3.0 $\mu$m to 3.7 $\mu$m for the nondegenerate case. Our work paves the way for developing quantum photonic technologies at the MIR wavelength band.
# Upper bound for quantum entropy production from entropy flux

Upper bound for quantum entropy production from entropy flux ( http://arxiv.org/abs/2203.14766v1 )

Entropy production is a key quantity characterizing nonequilibrium systems. However, it can often be difficult to compute in practice, as it requires detailed information about the system and the dynamics it undergoes. This becomes even more difficult in the quantum domain, and if one is interested in generic nonequilibrium reservoirs, for which the standard thermal recipes no longer apply. In this paper, we derive an upper bound for the entropy production in terms of the entropy flux for a class of systems for which the flux is given in terms of a system's observable. Since currents are often easily accessible in this class of systems, this bound should prove useful for estimating the entropy production in a broad variety of processes. We illustrate the applicability of the bound by considering a three-level maser engine and a system interacting with a squeezed bath.
# Reference-State Error Mitigation: A Strategy for High Accuracy Quantum Computation of Chemistry

Reference-State Error Mitigation: A Strategy for High Accuracy Quantum Computation of Chemistry ( http://arxiv.org/abs/2203.14756v1 )

Decoherence and gate errors severely limit the capabilities of state-of-the-art quantum computers. This work introduces a strategy for reference-state error mitigation (REM) of quantum chemistry that can be straightforwardly implemented on current and near-term devices. REM can be applied alongside existing mitigation procedures, while requiring minimal post-processing and only one or no additional measurements. The approach is agnostic to the underlying quantum mechanical ansatz and is designed for the variational quantum eigensolver (VQE). Two orders-of-magnitude improvement in the computational accuracy of ground state energies of small molecules (H2, HeH+ and LiH) is demonstrated on superconducting quantum hardware. Simulations of noisy circuits with a depth exceeding 1000 two-qubit gates are used to argue for scalability of the method.
# Continuous-variable measurement device independent quantum conferencing with post-selection

Continuous-variable measurement device independent quantum conferencing with post-selection ( http://arxiv.org/abs/2203.14657v1 )

A continuous variable (CV), measurement device independent (MDI) quantum key distribution (QKD) protocol is analyzed, enabling three parties to connect for quantum conferencing. We utilise a generalised Bell detection at an untrusted relay and a postselection procedure, in which distant parties reconcile on the signs of the displacements of the quadratures of their prepared coherent states. We derive the rate of the protocol under a collective pure-loss attack, demonstrating improved rate-distance performance compared to the equivalent non-post-selected protocol. In the symmetric configuration in which all the parties lie the same distance from the relay, we find a positive key rate over 6 km. Such postselection techniques can be used to improve the rate of multi-party quantum conferencing protocols at longer distances at the cost of reduced performance at shorter distances.
# Lindblad evolutions, Born rule, Heralding and cloning

Lindblad evolutions, Born rule, Heralding and cloning ( http://arxiv.org/abs/2203.14634v1 )

We discuss an apparent difficulty in computing the radiation emitted by a system undergoing Lindblad evolution. The difficulty is resolved by viewing the problem as a Born rule for conserved currents defined by the appropriate terms in the adjoint Lindbladian. In heralding, Alice prepares Bob's system in a state that mirrors her test. We show that heralding is consistent with no-cloning. This follows from the observation that heralding is not a completely positive map.
# Controlling thermodynamics of a quantum heat engine with modulated amplitude drivings

Controlling thermodynamics of a quantum heat engine with modulated amplitude drivings ( http://arxiv.org/abs/2203.15005v1 )

External driving of the bath temperatures of a nonequilibrium quantum engine with a phase difference leads to the emergence of geometric effects on the thermodynamics. In this work, we modulate the amplitude of the external driving protocols by introducing envelope functions and study the role of geometric effects on the flux, noise and efficiency of a four-level driven quantum heat engine coupled with two thermal baths and a unimodal cavity. We observe that having a finite width of the modulation envelope introduces an additional control knob for studying the thermodynamics in the adiabatic limit. The optimization of the flux as well as the noise with respect to thermally induced quantum coherences becomes possible in the presence of geometric effects, which was hitherto not possible with sinusoidal driving without an envelope. We also report the deviation of the slope and generation of an intercept in the standard expression for efficiency at maximum power as a function of Carnot efficiency in the presence of geometric effects under the amplitude modulation. Further, a recently developed universal bound on the efficiency obtained from the thermodynamic uncertainty relation is shown not to hold when a small width of the modulation envelope along with a large value of cavity temperature is maintained.
# Intermediate determinism in general probabilistic theories

Intermediate determinism in general probabilistic theories ( http://arxiv.org/abs/2203.14997v1 )

Quantum theory is indeterministic, but not completely so. When a system is in a pure state there are properties it possesses with certainty, known as actual properties. The actual properties of a quantum system (in a pure state) fully determine the probability of finding the system to have any other property. We call this feature intermediate determinism. In dimensions of at least three, the intermediate determinism of quantum theory is guaranteed by the structure of its lattice of properties. This observation follows from Gleason's theorem, which is why it fails to hold in dimension two. In this work we extend the idea of intermediate determinism from properties to measurements. Under this extension intermediate determinism follows from the structure of quantum effects for separable Hilbert spaces of any dimension, including dimension two. Then, we find necessary and sufficient conditions for a general probabilistic theory to obey intermediate determinism. We show that, although related, both the no-restriction hypothesis and a Gleason-type theorem are neither necessary nor sufficient for intermediate determinism.
# Measurement-based interleaved randomised benchmarking using IBM processors

Measurement-based interleaved randomised benchmarking using IBM processors ( http://arxiv.org/abs/2203.14995v1 )

Quantum computers have the potential to outperform classical computers at certain computational tasks, such as prime factorisation and unstructured searching. However, experimental realisations of quantum computers are subject to noise. Quantifying the noise is of fundamental importance, since noise is often the dominant factor preventing the successful realisation of advanced quantum computations. Here we propose an interleaved randomised benchmarking protocol for measurement-based quantum computers, in which any single-qubit measurement-based 2-design can be used to estimate the fidelity of any single-qubit measurement-based gate. We test our protocol by using a weak approximate measurement-based 2-design to estimate the fidelity of the Hadamard gate and the T gate (a universal single-qubit set) on IBM superconducting quantum computers. To this end, single-qubit measurements were performed on entangled linear cluster states of up to 31 qubits. Our estimated gate fidelities show good agreement with gate fidelities calculated from process tomography results. Furthermore, by artificially increasing noise in the measurement-based gates, we were able to show that our protocol is able to detect large noise variations in different measurement-based implementations of a gate.
# Operator fusion from wavefunction overlaps: Universal finite-size corrections and application to Haagerup model

Operator fusion from wavefunction overlaps: Universal finite-size corrections and application to Haagerup model ( http://arxiv.org/abs/2203.14992v1 )

Given a critical quantum spin chain described by a conformal field theory (CFT) at long distances, it is crucial to understand the universal conformal data. One of the most important ingredients is the set of operator product expansion (OPE) coefficients, which describe how operators fuse into each other. It has been proposed in [Zou, Vidal, Phys. Rev. B 105, 125125] that the OPE coefficients can be computed from overlaps of low-energy wavefunctions of the spin chain. In this work, we establish that all conformal data including central charge, conformal dimensions, and OPE coefficients are encoded in the wavefunction overlaps, with universal finite-size corrections that depend on the operator content of the cyclic orbifold CFT. Thus this method allows us to numerically compute all the conformal data based solely on the low-energy eigenstates. The predictions are verified in the Ising and XXZ models. As an application, we study the recently proposed Haagerup model built from the Haagerup fusion category. We find that the CFT has central charge $c \approx 2.1$ and the lowest spin-$1$ operator in the twisted sector has scaling dimension $1 < \Delta_J \leq 1.4$.
# Effects of critical correlations on quantum percolation in two dimensions

Effects of critical correlations on quantum percolation in two dimensions ( http://arxiv.org/abs/2203.14977v1 )

We analyze the out-of-equilibrium dynamics of a quantum particle coupled to local magnetic degrees of freedom that undergo a classical phase transition. Specifically, we consider a two-dimensional tight-binding model that interacts with a background of classical spins in thermal equilibrium, which are subject to Ising interactions and act as emergent, correlated disorder for the quantum particle. Particular attention is devoted to temperatures close to the ferromagnet-to-paramagnet transition. To capture the salient features of the classical transition, namely the effects of long-range correlations, we focus on the strong coupling limit, in which the model can be mapped onto a quantum percolation problem on spin clusters generated by the Ising model. By inspecting several dynamical probes such as energy level statistics, inverse participation ratios, and wave-packet dynamics, we provide evidence that the classical phase transition might induce a delocalization-localization transition in the quantum system at certain energies. We also identify further important features due to the presence of Ising correlations, such as the suppression of compact localized eigenstates.
# Casimir boundaries, monopoles, and deconfinement transition in 3+1 dimensional compact electrodynamics

Casimir boundaries, monopoles, and deconfinement transition in 3+1 dimensional compact electrodynamics ( http://arxiv.org/abs/2203.14922v1 )

Compact U(1) gauge theory in 3+1 dimensions possesses a confining phase, characterized by a linear rise of the potential between particles with opposite electric charges at sufficiently large inter-particle separation. The confinement is generated by condensation of Abelian monopoles at strong gauge coupling. We study the properties of monopoles and the deconfining order parameter in zero-temperature theory in the presence of ideally conducting parallel metallic boundaries (plates) usually associated with the Casimir effect. Using first-principle numerical simulations in compact U(1) lattice gauge theory, we show that as the distance between the plates diminishes, the vacuum in between the plates experiences a deconfining transition. The phase diagram in the space of the gauge coupling and the inter-plane distance is obtained.
# New Horizons: Scalar and Vector Ultralight Dark Matter

New Horizons: Scalar and Vector Ultralight Dark Matter ( http://arxiv.org/abs/2203.14915v1 )

The last decade has seen unprecedented effort in dark matter model building at all mass scales coupled with the design of numerous new detection strategies. Transformative advances in quantum technologies have led to a plethora of new high-precision quantum sensors and dark matter detection strategies for ultralight ($<10\,$eV) bosonic dark matter that can be described by an oscillating classical, largely coherent field. This white paper focuses on searches for wavelike scalar and vector dark matter candidates.
# Toward a Quantum Memory in a Fiber Cavity Controlled by Intracavity Frequency Translation

Toward a Quantum Memory in a Fiber Cavity Controlled by Intracavity Frequency Translation ( http://arxiv.org/abs/2203.14844v1 )

We propose a quantum memory protocol based on trapping photons in a fiber-integrated cavity, comprised of a birefringent fiber with dichroic reflective end facets. Photons are switched into resonance with the fiber cavity by intracavity Bragg-scattering frequency translation, driven by ancillary control pulses. After the storage delay, photons are switched out of resonance with the cavity, again by intracavity frequency translation. We demonstrate storage of quantum-level THz-bandwidth coherent states for a lifetime up to 16 cavity round trips, or 200 ns, and a maximum overall efficiency of 73%.
# Surface-induced decoherence and heating of charged particles

Surface-induced decoherence and heating of charged particles ( http://arxiv.org/abs/2203.15088v1 )

Levitating charged particles in ultra-high vacuum provides a preeminent platform for quantum information processing, for quantum-enhanced force and torque sensing, for probing physics beyond the standard model, and for high-mass tests of the quantum superposition principle. Existing setups range from single atomic ions, to ion chains and crystals, to charged molecules and nanoparticles. Future technological applications of such quantum systems will be crucially affected by fluctuating electric fields emanating from nearby electrodes, which interact with the levitated particles' monopole and higher charge moments. In this article, we provide a theoretical toolbox for describing how the rotational and translational quantum dynamics of charged nano- to microscale objects is affected by near metallic and dielectric surfaces, as characterized by their macroscopic dielectric response. The resulting quantum master equations describe the coherent surface-particle interaction, due to image charges and Casimir-Polder potentials, as well as surface-induced decoherence and heating, with the experimentally observed frequency and distance scaling. We explicitly evaluate the master equations for typical charge distributions and types of motion, thereby providing the tools required for describing and mitigating surface-induced decoherence in a variety of experiments with charged objects.
# Observing and braiding topological Majorana modes on programmable quantum simulators

Observing and braiding topological Majorana modes on programmable quantum simulators ( http://arxiv.org/abs/2203.15083v1 )

Despite its great promise of fault tolerance, the simplest demonstration of topological quantum computation remains elusive. Majorana modes are the primitive building blocks and their experimental realization on various platforms is yet to be confirmed. This work presents an experimental and theoretical framework for the space-resolved detection and exchange of Majorana modes on programmable (noisy) quantum hardware. We have implemented our framework by performing a series of measurements on a driven Ising-type quantum spin model with tunable interactions, which confirm the existence of the topological Majorana modes and distinguish them from trivial modes. Lastly, we propose and demonstrate a novel technique for braiding the Majorana modes which results in the correct statistics but decreases the magnitude of the signal. The present work may be seen as the first reliable observation of Majorana modes on existing quantum hardware.
# Ultra-long photonic quantum walks via spin-orbit metasurfaces

Ultra-long photonic quantum walks via spin-orbit metasurfaces ( http://arxiv.org/abs/2203.15051v1 )

The possibility of fine-tuning the couplings between optical modes is a key requirement in programmable photonic simulators and quantum computers. Engineering particle evolutions across large lattices is a challenging task, which requires sophisticated setups that are often intrinsically lossy. Here we report ultra-long photonic quantum walks across several hundred optical modes, obtained by propagating a light beam through very few closely-stacked liquid-crystal metasurfaces, without any optical amplification. By exploiting spin-orbit effects, these metasurfaces realize a space-dependent polarization transformation which mixes circularly polarized optical modes carrying quantized transverse momentum. With this setup we engineer quantum walks up to 320 discrete steps, far beyond state-of-the-art experiments. To showcase the potential of this method, we experimentally demonstrate that in the long-time limit a quantum walk affected by dynamical disorder generates maximal entanglement between two system partitions. Our platform grants experimental access to ultra-long unitary evolutions while keeping optical losses constant, thereby paving the way to massive multi-photon multi-mode quantum simulations.
# Quantum Genetic Algorithm with Individuals in Multiple Registers

Quantum Genetic Algorithm with Individuals in Multiple Registers ( http://arxiv.org/abs/2203.15039v1 )

Genetic algorithms are heuristic optimization techniques inspired by Darwinian evolution, which are characterized by successfully finding robust solutions for optimization problems. Here, we propose a subroutine-based quantum genetic algorithm with individuals codified in independent registers. This distinctive codification allows our proposal to depict all the fundamental elements characterizing genetic algorithms, i.e. population-based search with selection of many individuals, crossover, and mutation. Our subroutine-based construction permits us to consider several variants of the algorithm. For instance, we first analyze the performance of two different quantum cloning machines, a key component of the crossover subroutine. Specifically, we study two paradigmatic examples, namely, the biomimetic cloning of quantum observables and the Bu\v{z}ek-Hillery universal quantum cloning machine, observing a faster average convergence of the former, but better final populations of the latter. Additionally, we analyze the effect of introducing a mutation subroutine and find that it has a minor impact on the average performance. Furthermore, we introduce a quantum channel analysis to prove the exponential convergence of our algorithm and even predict its convergence-ratio. This tool could be extended to formally prove results on the convergence of general non-unitary iteration-based algorithms.
# A two-dimensional programmable tweezer array of fermions

A two-dimensional programmable tweezer array of fermions ( http://arxiv.org/abs/2203.15023v1 )

We prepare high-filling two-component arrays of up to fifty fermionic atoms in optical tweezers, with the atoms in the ground motional state of each tweezer. Using a stroboscopic technique, we configure the arrays in various two-dimensional geometries with negligible Floquet heating. Full spin- and density-resolved readout of individual sites allows us to post-select near-zero entropy initial states for fermionic quantum simulation. We prepare a correlated state in a two-by-two tunnel-coupled Hubbard plaquette, demonstrating all the building blocks for realizing a programmable fermionic quantum simulator.
# Room temperature deep UV photoluminescence from low dimensional hexagonal boron nitride prepared using a facile synthesis

Room temperature deep UV photoluminescence from low dimensional hexagonal boron nitride prepared using a facile synthesis ( http://arxiv.org/abs/2203.15022v1 )

Evaluation of the defect levels in low-dimensional materials is an important aspect of quantum science. In this article, we report a facile synthesis method of hexagonal boron nitride (h-BN) and evaluate the defects and their light emission characteristics. The thermal annealing procedure is optimized to obtain clean h-BN. UV-Vis spectroscopy shows an optical energy gap of 5.28 eV, which is comparable to the reported energy gap for exfoliated, clean h-BN samples. The optimized synthesis route of h-BN has generated two kinds of defects, which are characterised using room temperature photoluminescence measurements. The defects emit light at 4.18 eV (in the deep ultraviolet region) and 3.44 eV (ultraviolet), respectively. The defect emitting deep ultraviolet (DUV) light has an oscillatory dependence on the excitation energy, while that emitting 3.44 eV light (ZPL3.44 eV) has phonon bands with a mean energy level separation of 125 meV measured at room temperature. This agrees very well with a Franck-Condon-like structure having regularly spaced energy levels, which is a typical indication of single defect levels in low dimensional h-BN.
# Electron-spin spectral diffusion in an erbium doped crystal at millikelvin temperatures

Electron-spin spectral diffusion in an erbium doped crystal at millikelvin temperatures ( http://arxiv.org/abs/2203.15012v1 )

Erbium-doped crystals offer a versatile platform for hybrid quantum devices because they combine magnetically-sensitive electron-spin transitions with telecom-wavelength optical transitions. At the high doping concentrations necessary for many quantum applications, however, strong magnetic interactions of the electron-spin bath lead to excess spectral diffusion and rapid decoherence. Here we lithographically fabricate a 4.4 GHz superconducting planar micro-resonator on a $\text{CaWO}_{4}$ crystal doped with Er ions at a concentration of twenty parts per million relative to Ca. Using the microwave resonator, we characterize the spectral diffusion processes that limit the electron-spin coherence of Er ions at millikelvin temperatures by applying 2- and 3-pulse echo sequences. The coherence time shows a strong temperature dependence, reaching 1.3 ms at 23 mK for an electron-spin transition of $^{167}\text{Er}$.
# A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022

A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022 ( http://arxiv.org/abs/2205.02752v1 )

A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022. This index is incomplete as some authors of invited non-archival presentations opted not to include their papers in this index.
# Unsupervised Low-light Image Enhancement with Decoupled Networks

Unsupervised Low-light Image Enhancement with Decoupled Networks ( http://arxiv.org/abs/2005.02818v2 )

In this paper, we tackle the problem of enhancing real-world low-light images with significant noise in an unsupervised fashion. Conventional unsupervised learning-based approaches usually tackle the low-light image enhancement problem using an image-to-image translation model. They focus primarily on illumination or contrast enhancement but fail to suppress the noise that ubiquitously exists in images taken under real-world low-light conditions. To address this issue, we explicitly decouple this task into two sub-tasks: illumination enhancement and noise suppression. We propose to learn a two-stage GAN-based framework to enhance the real-world low-light images in a fully unsupervised fashion. To facilitate the unsupervised training of our model, we construct samples with pseudo labels. Furthermore, we propose an adaptive content loss to suppress real image noise in different regions based on illumination intensity. In addition to conventional benchmark datasets, a new unpaired low-light image enhancement dataset is built and used to thoroughly evaluate the performance of our model. Extensive experiments show that our proposed method outperforms the state-of-the-art unsupervised image enhancement methods in terms of both illumination enhancement and noise reduction.
# P-ADMMiRNN: Training RNN with Stable Convergence via An Efficient and Paralleled ADMM Approach

P-ADMMiRNN: Training RNN with Stable Convergence via An Efficient and Paralleled ADMM Approach ( http://arxiv.org/abs/2006.05622v3 )

It is hard to train a Recurrent Neural Network (RNN) with stable convergence while avoiding gradient vanishing and exploding problems, as the weights in the recurrent unit are repeated from iteration to iteration. Moreover, an RNN is sensitive to the initialization of weights and bias, which brings difficulties in training. The Alternating Direction Method of Multipliers (ADMM) has become a promising algorithm to train neural networks beyond traditional stochastic gradient algorithms, owing to its gradient-free features and immunity to unsatisfactory conditions. However, ADMM cannot be applied to train an RNN directly, since the state in the recurrent unit is repetitively updated over timesteps. Therefore, this work builds a new framework named ADMMiRNN upon the unfolded form of the RNN to address the above challenges simultaneously. We also provide novel update rules and theoretical convergence analysis. We explicitly specify essential update rules in the iterations of ADMMiRNN with constructed approximation techniques and solutions to each sub-problem instead of vanilla ADMM. Numerical experiments are conducted on MNIST, IMDb, and text classification tasks. ADMMiRNN achieves convergent results and outperforms the compared baselines. Furthermore, ADMMiRNN trains RNNs more stably, without gradient vanishing or exploding, than stochastic gradient algorithms. We also provide a distributed parallel algorithm for ADMMiRNN, named P-ADMMiRNN, including Synchronous Parallel ADMMiRNN (SP-ADMMiRNN) and Asynchronous Parallel ADMMiRNN (AP-ADMMiRNN), which is the first to train an RNN with ADMM in an asynchronous parallel manner. The source code is publicly available.
# Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings

Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings ( http://arxiv.org/abs/2006.06885v3 )

Biomolecular graph analysis has recently gained much attention in the emerging field of geometric deep learning. Here we focus on organizing biomolecular graphs in ways that expose meaningful relations and variations between them. We propose a geometric scattering autoencoder (GSAE) network for learning such graph embeddings. Our embedding network first extracts rich graph features using the recently proposed geometric scattering transform. Then, it leverages a semi-supervised variational autoencoder to extract a low-dimensional embedding that retains the information in these features that enables prediction of molecular properties as well as characterization of graphs. We show that GSAE organizes RNA graphs both by structure and energy, accurately reflecting bistable RNA structures. Also, the model is generative and can sample new folding trajectories.
# Stochastic Low-rank Tensor Bandits for Multi-dimensional Online Decision Making

Stochastic Low-rank Tensor Bandits for Multi-dimensional Online Decision Making ( http://arxiv.org/abs/2007.15788v2 )

Multi-dimensional online decision making plays a crucial role in many real applications such as online recommendation and digital marketing. In these problems, a decision at each time is a combination of choices from different types of entities. To solve them, we introduce stochastic low-rank tensor bandits, a class of bandits whose mean rewards can be represented as a low-rank tensor. We consider two settings, tensor bandits without context and tensor bandits with context. In the first setting, the platform aims to find the optimal decision with the highest expected reward, a.k.a. the largest entry of the true reward tensor. In the second setting, some modes of the tensor are contexts and the remaining modes are decisions, and the goal is to find the optimal decision given the contextual information. We propose two learning algorithms, tensor elimination and tensor epoch-greedy, for tensor bandits without context, and derive finite-time regret bounds for them. Compared with existing competitive methods, tensor elimination has the best overall regret bound and tensor epoch-greedy has a sharper dependency on dimensions of the reward tensor. Furthermore, we develop a practically effective Bayesian algorithm called tensor ensemble sampling for tensor bandits with context. Numerical experiments back up our theoretical findings and show that our algorithms outperform various state-of-the-art approaches that ignore the tensor low-rank structure. In an online advertising application with contextual information, our tensor ensemble sampling reduces the cumulative regret by 75% compared to the benchmark method.
# マルチアームバンディットにおける統計的にロバストなリスク回避型ベストアーム識別

Statistically Robust, Risk-Averse Best Arm Identification in Multi-Armed Bandits ( http://arxiv.org/abs/2008.13629v2 )

Traditional multi-armed bandit (MAB) formulations usually make certain assumptions about the underlying arms' distributions, such as bounds on the support or their tail behaviour. Moreover, such parametric information is usually 'baked' into the algorithms. In this paper, we show that specialized algorithms that exploit such parametric information are prone to inconsistent learning performance when the parameter is misspecified. Our key contributions are twofold: (i) We establish fundamental performance limits of statistically robust MAB algorithms under the fixed-budget pure exploration setting, and (ii) We propose two classes of algorithms that are asymptotically near-optimal. Additionally, we consider a risk-aware criterion for best arm identification, where the objective associated with each arm is a linear combination of the mean and the conditional value at risk (CVaR). Throughout, we make a very mild 'bounded moment' assumption, which lets us work with both light-tailed and heavy-tailed distributions within a unified framework.
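The risk-aware criterion can be illustrated with a small empirical estimator. This is a sketch under stated assumptions: samples are treated as losses (larger is worse), CVaR at level `alpha` is taken as the average of the worst `(1 - alpha)` fraction of samples, and the weights `w_mean` and `w_cvar` are hypothetical names for the coefficients of the linear combination. The paper's exact conventions may differ.

```python
import math

def empirical_cvar(losses, alpha):
    """Average of the worst ceil((1 - alpha) * n) losses (larger = worse)."""
    if not losses:
        raise ValueError("need at least one sample")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    return sum(ordered[:k]) / k

def risk_aware_score(losses, alpha=0.5, w_mean=1.0, w_cvar=1.0):
    """Linear combination of the sample mean and the empirical CVaR."""
    mean = sum(losses) / len(losses)
    return w_mean * mean + w_cvar * empirical_cvar(losses, alpha)

samples = list(range(1, 11))          # losses 1..10 (made-up data)
print(empirical_cvar(samples, 0.5))   # mean of the five largest losses
print(risk_aware_score(samples, alpha=0.5))
```

Under this loss convention, the best arm would be the one with the smallest score; with rewards instead of losses the tail and the direction of optimization flip.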
# 幾何散乱ネットワークのデータ駆動学習

Data-Driven Learning of Geometric Scattering Networks ( http://arxiv.org/abs/2010.02415v3 )

We propose a new graph neural network (GNN) module, based on relaxations of recently proposed geometric scattering transforms, which consist of a cascade of graph wavelet filters. Our learnable geometric scattering (LEGS) module enables adaptive tuning of the wavelets to encourage band-pass features to emerge in learned representations. The incorporation of our LEGS-module in GNNs enables the learning of longer-range graph relations compared to many popular GNNs, which often rely on encoding graph structure via smoothness or similarity between neighbors. Further, its wavelet priors result in simplified architectures with significantly fewer learned parameters compared to competing GNNs. We demonstrate the predictive performance of LEGS-based networks on graph classification benchmarks, as well as the descriptive quality of their learned features in biochemical graph data exploration tasks.
# 中国語理解のための多レベル単語アダプタによる単語情報の注入

Injecting Word Information with Multi-Level Word Adapter for Chinese Spoken Language Understanding ( http://arxiv.org/abs/2010.03903v3 )

In this paper, we improve Chinese spoken language understanding (SLU) by injecting word information. Previous studies on Chinese SLU do not consider the word information, failing to detect word boundaries that are beneficial for intent detection and slot filling. To address this issue, we propose a multi-level word adapter to inject word information for Chinese SLU, which consists of (1) sentence-level word adapter, which directly fuses the sentence representations of the word information and character information to perform intent detection and (2) character-level word adapter, which is applied at each character for selectively controlling weights on word information as well as character information. Experimental results on two Chinese SLU datasets show that our model can capture useful word information and achieve state-of-the-art performance.
# AutoMLに基づく作物と雑草の分類

Crop and weed classification based on AutoML ( http://arxiv.org/abs/2010.14708v2 )

CNN models already play an important role in classification of crop and weed with high accuracy, more than 95% as reported in literature. However, manually choosing and fine-tuning deep learning models is laborious yet indispensable in most traditional practice and research. Moreover, the classic objective functions are not thoroughly compatible with agricultural farming tasks, as the corresponding models misclassify crops as weeds more often than comparable errors occur in other deep learning application domains. In this paper, we applied autonomous machine learning with a new objective function for crop and weed classification, achieving higher accuracy and lower crop killing rate (rate of identifying a crop as a weed). The experimental results show that our method outperforms state-of-the-art applications, for example, ResNet and VGG19.
# 高次元強化学習問題に対するハミルトニアンモンテカルロサンプリングの適用について

On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning Problems in High-dimension ( http://arxiv.org/abs/2011.05927v3 )

Value function based reinforcement learning (RL) algorithms, for example, $Q$-learning, learn optimal policies from datasets of actions, rewards, and state transitions. However, when the underlying state transition dynamics are stochastic and evolve on a high-dimensional space, generating independent and identically distributed (IID) data samples for creating these datasets poses a significant challenge due to the intractability of the associated normalizing integral. In these scenarios, Hamiltonian Monte Carlo (HMC) sampling offers a computationally tractable way to generate data for training RL algorithms. In this paper, we introduce a framework, called \textit{Hamiltonian $Q$-Learning}, that demonstrates, both theoretically and empirically, that $Q$ values can be learned from a dataset generated by HMC samples of actions, rewards, and state transitions. Furthermore, to exploit the underlying low-rank structure of the $Q$ function, Hamiltonian $Q$-Learning uses a matrix completion algorithm for reconstructing the updated $Q$ function from $Q$ value updates over a much smaller subset of state-action pairs. Thus, by providing an efficient way to apply $Q$-learning in stochastic, high-dimensional settings, the proposed approach broadens the scope of RL algorithms for real-world applications.
# (参考訳) 圧力スイング吸着系への応用によるサロゲート支援進化多目的最適化

Surrogate Assisted Evolutionary Multi-objective Optimisation applied to a Pressure Swing Adsorption system ( http://arxiv.org/abs/2204.12585v1 )

Chemical plant design and optimisation have proven challenging due to the complexity of these real-world systems. The resulting complexity translates into high computational costs for these systems' mathematical formulations and simulation models. Research has illustrated the benefits of using machine learning surrogate models as substitutes for computationally expensive models during optimisation. This paper extends recent research into optimising chemical plant design and operation. The study further explores Surrogate Assisted Genetic Algorithms (SA-GA) in more complex variants of the original plant design and optimisation problems, such as the inclusion of parallel and feedback components. The novel extension to the original algorithm proposed in this study, Surrogate Assisted NSGA-II (SA-NSGA), was tested on a popular literature case, the Pressure Swing Adsorption (PSA) system. We further provide extensive experimentation, comparing various meta-heuristic optimisation techniques and numerous machine learning models as surrogates. The results for both sets of systems illustrate the benefits of using Genetic Algorithms as an optimisation framework for complex chemical plant system design and optimisation for both single and multi-objective scenarios. We confirm that Random Forest surrogate assisted Evolutionary Algorithms can be scaled to increasingly complex chemical systems with parallel and feedback components. We further find that combining a Genetic Algorithm framework with Machine Learning Surrogate models as a substitute for long-running simulation models yields significant computational efficiency improvements, 1.7 - 1.84 times speedup for the increased complexity examples and a 2.7 times speedup for the Pressure Swing Adsorption system.
# (参考訳) 生成設計の考え方:自然言語生成アプローチ

Generative Design Ideation: A Natural Language Generation Approach ( http://arxiv.org/abs/2204.09658v1 )

This paper aims to explore a generative approach for knowledge-based design ideation by applying the latest pre-trained language models in artificial intelligence (AI). Specifically, a method of fine-tuning the generative pre-trained transformer using the USPTO patent database is proposed. The AI-generated ideas are not only in concise and understandable language but also able to synthesize the target design with external knowledge sources with controllable knowledge distance. The method is tested in a case study of rolling toy design and the results show good performance in generating ideas of varied novelty with near-field and far-field source knowledge.
# (参考訳) MITボイスネームシステム

The MIT Voice Name System ( http://arxiv.org/abs/2204.09657v1 )

This RFC white Paper summarizes our progress on the MIT Voice Name System (VNS) and Huey. The VNS, similar in name and function to the DNS, is a system to reserve and use "wake words" to activate Artificial Intelligence (AI) devices. Just like you can say "Hey Siri" to activate Apple's personal assistant, we propose using the VNS in smart speakers and other devices to route wake requests based on commands such as "turn off", "open grocery shopping list" or "271, start flash card review of my computer vision class". We also introduce Huey, an unambiguous Natural Language to interact with AI devices. We aim to standardize voice interactions to a universal reach similar to that of other systems such as phone numbering, with an agreed world-wide approach to assign and use numbers, or the Internet's DNS, with a standard naming system, that has helped flourish popular services including the World-Wide-Web, FTP, and email. Just like these standards are "neutral", we also aim to endow the VNS with "wake neutrality" so that each participant can develop its own digital voice. We focus on voice as a starting point to talk to any IoT object and explain briefly how the VNS may be expanded to other AI technologies enabling person-to-machine conversations (really machine-to-machine), including computer vision or neural interfaces. We also describe briefly considerations for a broader set of standards, MIT Open AI (MOA), including a reference architecture to serve as a starting point for the development of a general conversational commerce infrastructure that has standard "Wake Words", NLP commands such as "Shopping Lists" or "Flash Card Reviews", and personalities such as Pi or 271. Privacy and security are key elements considered because of speech-to-text errors and the amount of personal information contained in a voice sample.
# データサイエンスの語彙の進化と利用。 13年でどのくらい変わりましたか。

Evolution and use of data science vocabulary. How much have we changed in 13 years? ( http://arxiv.org/abs/2204.10174v1 )

Here I present an investigation on the evolution and use of vocabulary in data science in the last 13 years. Based on a rigorous statistical analysis, a database with 12,787 documents containing the words "data science" in the title, abstract or keywords is analyzed. It is proposed to classify the evolution of this discipline in three periods: emergence, growth and boom. Characteristic words and pioneering documents are identified for each period. By proposing the distinctive vocabulary and relevant topics of data science and classified in time periods, these results add value to the scientific community of this discipline.
# (参考訳) 金融規制書類における階層的改革モデルを用いた材料情報の発見

Discovering material information using hierarchical Reformer model on financial regulatory filings ( http://arxiv.org/abs/2204.05979v1 )

Most applications of machine learning for finance are related to forecasting tasks for investment decisions. Instead, we aim to promote a better understanding of financial markets with machine learning techniques. Leveraging the tremendous progress in deep learning models for natural language processing, we construct a hierarchical Reformer ([15]) model capable of processing a large document-level dataset, SEDAR, from Canadian financial regulatory filings. Using this model, we show that it is possible to predict trade volume changes using regulatory filings. We adapt the pretraining task of HiBERT ([36]) to obtain good sentence-level representations using a large unlabelled document dataset. Finetuning the model to successfully predict trade volume changes indicates that the model captures a view from financial markets and that processing regulatory filings is beneficial. Analyzing the attention patterns of our model reveals that it is able to detect some indications of material information without explicit training, which is highly relevant for investors and also for the market surveillance mandate of financial regulators.
# 最適化と機械学習を用いたバイアニソトロピックな地表面の逆設計と実験的検証

Inverse Design and Experimental Verification of a Bianisotropic Metasurface Using Optimization and Machine Learning ( http://arxiv.org/abs/2204.00433v1 )

Electromagnetic metasurfaces have attracted significant interest recently due to their low profile and advantageous applications. Practically, many metasurface designs start with a set of constraints for the radiated far-field, such as main-beam direction(s) and side lobe levels, and end with a non-uniform physical structure for the surface. This problem is quite challenging, since the required tangential field transformations are not completely known when only constraints are placed on the scattered fields. Hence, the required surface properties cannot be solved for analytically. Moreover, the translation of the desired surface properties to the physical unit cells can be time-consuming and difficult, as it is often a one-to-many mapping in a large solution space. Here, we divide the inverse design process into two steps: a macroscopic and microscopic design step. In the former, we use an iterative optimization process to find the surface properties that radiate a far-field pattern that complies with specified constraints. This iterative process exploits non-radiating currents to ensure a passive and lossless design. In the microscopic step, these optimized surface properties are realized with physical unit cells using machine learning surrogate models. The effectiveness of this end-to-end synthesis process is demonstrated through measurement results of a beam-splitting prototype.
# 可変トレーサビリティのための高速かつ効率的な条件学習-精度とロバストさの相違

A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness ( http://arxiv.org/abs/2204.00426v1 )

Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers. These layers require many new parameters and are hyperparameter sensitive. They significantly increase training time, memory cost, and potential latency, which can prove costly for resource-limited or real-time applications. In this paper, we present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which, instead of the existing FiLM-based conditioning, uses a unique weight-conditioned learning scheme that requires no additional layer, thereby incurring no significant increase in parameter count, training time, or network latency compared to standard adversarial training. In particular, we add configurable scaled noise to the weight tensors that enables a trade-off between clean and adversarial performance. Extensive experiments show that FLOAT can yield SOTA performance, improving both clean and perturbed image classification by up to ~6% and ~10%, respectively. Moreover, real hardware measurement shows that FLOAT can reduce the training time by up to 1.43x with fewer model parameters of up to 1.47x on iso-hyperparameter settings compared to the FiLM-based alternatives. Additionally, to further improve memory efficiency we introduce FLOAT sparse (FLOATS), a form of non-iterative model pruning, and provide detailed empirical analysis of the three-way accuracy-robustness-complexity trade-off for this new class of pruned conditionally trained models.
# (参考訳) v2x情報に基づく深部強化学習支援小隊制御

Deep Reinforcement Learning Aided Platoon Control Relying on V2X Information ( http://arxiv.org/abs/2203.15781v1 )

The impact of Vehicle-to-Everything (V2X) communications on platoon control performance is investigated. Platoon control is essentially a sequential stochastic decision problem (SSDP), which can be solved by Deep Reinforcement Learning (DRL) to deal with both the control constraints and uncertainty in the platoon leading vehicle's behavior. In this context, the value of V2X communications for DRL-based platoon controllers is studied with an emphasis on the tradeoff between the gain of including exogenous information in the system state for reducing uncertainty and the performance erosion due to the curse-of-dimensionality. Our objective is to find the specific set of information that should be shared among the vehicles for the construction of the most appropriate state space. SSDP models are conceived for platoon control under different information topologies (IFT) by taking into account `just sufficient' information. Furthermore, theorems are established for comparing the performance of their optimal policies. In order to determine whether a piece of information should or should not be transmitted for improving the DRL-based control policy, we quantify its value by deriving the conditional KL divergence of the transition models. More meritorious information is given higher priority in transmission, since including it in the state space has a higher probability in offsetting the negative effect of having higher state dimensions. Finally, simulation results are provided to illustrate the theoretical analysis.
# (参考訳) 衝突の学習:学習ハッシュ関数を用いた推薦システムモデル圧縮

Learning to Collide: Recommendation System Model Compression with Learned Hash Functions ( http://arxiv.org/abs/2203.15837v1 )

A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables. These embedding tables can often reach hundreds of gigabytes which increases hardware requirements and training cost. A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space. This hashing reduces the number of unique representations that must be stored in the embedding table; thus decreasing its size. However, this approach introduces collisions between semantically dissimilar ids that degrade model quality. We introduce an alternative approach, Learned Hash Functions, which instead learns a new mapping function that encourages collisions between semantically similar ids. We derive this learned mapping from historical data and embedding access patterns. We experiment with this technique on a production model and find that a mapping informed by the combination of access frequency and a learned low dimension embedding is the most effective. We demonstrate a small improvement relative to the hashing trick and other collision related compression techniques. This is ongoing work that explores the impact of categorical id collisions on recommendation model quality and how those collisions may be controlled to improve model performance.
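The baseline "hashing trick" that this abstract contrasts against can be sketched in a few lines. This is an illustration only: the choice of CRC32 as the hash, the id format, and the helper names `hashed_row` and `count_collisions` are assumptions, and the learned mapping proposed in the paper is not reproduced here.

```python
import zlib

def hashed_row(category_id: str, num_rows: int) -> int:
    """Map an arbitrary categorical id to one of num_rows embedding rows."""
    return zlib.crc32(category_id.encode("utf-8")) % num_rows

def count_collisions(ids, num_rows: int) -> int:
    """Number of ids that share a row with an earlier id."""
    buckets = {}
    for category_id in ids:
        buckets.setdefault(hashed_row(category_id, num_rows), []).append(category_id)
    return sum(len(members) - 1 for members in buckets.values())

ids = [f"item_{i}" for i in range(100)]   # made-up ids
# Shrinking the table from 1000 rows to 10 rows forces many ids to share rows.
print(count_collisions(ids, 1000), count_collisions(ids, 10))
```

Because the hash is oblivious to meaning, ids that land in the same row are unrelated in general, which is the quality problem the abstract describes; a learned mapping would instead try to place semantically similar ids in the same row.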
# (参考訳) 言語認識のための部分空間に基づく表現と学習

Subspace-based Representation and Learning for Phonotactic Spoken Language Recognition ( http://arxiv.org/abs/2203.15576v1 )

Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution or phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone-posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for the sample represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods.
# (参考訳) テンソルネットワークのスタック操作

Stack operation of tensor networks ( http://arxiv.org/abs/2203.16338v1 )

The tensor network, as a factorization of tensors, aims at performing the operations that are common for normal tensors, such as addition, contraction and stacking. However, due to its non-unique network structure, only the tensor network contraction is so far well defined. In this paper, we propose a mathematically rigorous definition for the tensor network stack approach, which compresses a large number of tensor networks into a single one without changing their structures and configurations. We illustrate the main ideas with the matrix product states based machine learning as an example. Our results are compared with the for loop and the efficient coding method on both CPU and GPU.
# (参考訳) PPGを用いた心拍モニタリングのロバスト化と省エネルギー化

Robust and Energy-efficient PPG-based Heart-Rate Monitoring ( http://arxiv.org/abs/2203.16339v1 )

A wrist-worn PPG sensor coupled with a lightweight algorithm can run on an MCU to enable non-invasive and comfortable monitoring, but ensuring robust PPG-based heart-rate monitoring in the presence of motion artifacts is still an open challenge. Recent state-of-the-art algorithms combine PPG and inertial signals to mitigate the effect of motion artifacts. However, these approaches suffer from limited generality. Moreover, their deployment on MCU-based edge nodes has not been investigated. In this work, we tackle both of these problems by proposing the use of hardware-friendly Temporal Convolutional Networks (TCN) for PPG-based heart-rate estimation. Starting from a single "seed" TCN, we leverage an automatic Neural Architecture Search (NAS) approach to derive a rich family of models. Among them, we obtain a TCN that outperforms the previous state-of-the-art on the largest PPG dataset available (PPGDalia), achieving a Mean Absolute Error (MAE) of just 3.84 Beats Per Minute (BPM). Furthermore, we also tested a set of smaller yet still accurate (MAE of 5.64 - 6.29 BPM) networks that can be deployed on a commercial MCU (STM32L4), which require as few as 5k parameters and reach a latency of 17.1 ms while consuming just 0.21 mJ per inference.
# (参考訳) 森林火災リスク予測 : 最適な火災危険指標

Wildfire risk forecast: An optimizable fire danger index ( http://arxiv.org/abs/2203.15558v1 )

Wildfire events have caused severe losses in many places around the world and are expected to increase with climate change. Throughout the years many technologies have been developed to identify fire events early on and to simulate fire behavior once they have started. Another particularly helpful technology is fire risk indices, which use weather forcing to make advanced predictions of the risk of fire. Predictions of fire risk indices can be used, for instance, to allocate resources in places with high risk. These indices have been developed over the years as empirical models with parameters that were estimated in lab experiments and field tests. These parameters, however, may not fit well all places where these models are used. In this paper we propose a novel implementation of one index (NFDRS IC) as a differentiable function in which one can optimize its internal parameters via gradient descent. We leverage existing machine learning frameworks (PyTorch) to construct our model. This approach has two benefits: (1) the NFDRS IC parameters can be improved for each region using actual observed fire events, and (2) the internal variables remain intact for interpretations by specialists instead of meaningless hidden layers as in traditional neural networks. In this paper we evaluate our strategy with actual fire events for locations in the USA and Europe.
# (参考訳) TraHGR: 筋電図による手指ジェスチャー認識のためのFew-shot Learning

TraHGR: Few-shot Learning for Hand Gesture Recognition via ElectroMyography ( http://arxiv.org/abs/2203.16336v1 )

Deep learning-based Hand Gesture Recognition (HGR) via surface Electromyogram (sEMG) signals has recently shown significant potential for development of advanced myoelectric-controlled prostheses. Existing deep learning approaches typically include only one model and, as such, can hardly maintain acceptable generalization performance in changing scenarios. In this paper, we aim to address this challenge by capitalizing on the recent advances of hybrid models and transformers. In other words, we propose a hybrid framework based on the transformer architecture, which is a relatively new and revolutionizing deep learning model. The proposed hybrid architecture, referred to as the Transformer for Hand Gesture Recognition (TraHGR), consists of two parallel paths followed by a linear layer that acts as a fusion center to integrate the advantage of each module and provide robustness over different scenarios. We evaluated the proposed architecture TraHGR based on the commonly used second Ninapro dataset, referred to as the DB2. The sEMG signals in the DB2 dataset are measured in real-life conditions from 40 healthy users, each performing 49 gestures. We have conducted an extensive set of experiments to test and validate the proposed TraHGR architecture, and have compared its achievable accuracy with more than five recently proposed HGR classification algorithms over the same dataset. We have also compared the results of the proposed TraHGR architecture with each individual path and demonstrated the distinguishing power of the proposed hybrid architecture. The recognition accuracies of the proposed TraHGR architecture are 86.18%, 88.91%, 81.44%, and 93.84%, which are 2.48%, 5.12%, 8.82%, and 4.30% higher than the state-of-the-art performance for DB2 (49 gestures), DB2-B (17 gestures), DB2-C (23 gestures), and DB2-D (9 gestures), respectively.
# (参考訳) 確率EMアルゴリズムを用いた多成分信号の瞬時周波数推定

Instantaneous Frequency Estimation In Multi-Component Signals Using Stochastic EM Algorithm ( http://arxiv.org/abs/2203.16334v1 )

This paper addresses the problem of estimating the modes of an observed non-stationary mixture signal in the presence of an arbitrary distributed noise. A novel Bayesian model is introduced to estimate the model parameters from the spectrogram of the observed signal, by resorting to the stochastic version of the EM algorithm to avoid the computationally expensive joint parameters estimation from the posterior distribution. The proposed method is assessed through comparative experiments with state-of-the-art methods. The obtained results validate the proposed approach by highlighting an improvement of the modes estimation performance.
# (参考訳) 文脈比較:計量テンソルを用いたコサイン類似度尺度の改善

Comparing in context: Improving cosine similarity measures with a metric tensor ( http://arxiv.org/abs/2203.14996v1 )

Cosine similarity is a widely used measure of the relatedness of pre-trained word embeddings, trained on a language modeling goal. Datasets such as WordSim-353 and SimLex-999 rate how similar words are according to human annotators, and as such are often used to evaluate the performance of language models. Thus, any improvement on the word similarity task requires an improved word representation. In this paper, we propose instead the use of an extended cosine similarity measure to improve performance on that task, with gains in interpretability. We explore the hypothesis that this approach is particularly useful if the word-similarity pairs share the same context, for which distinct contextualized similarity measures can be learned. We first use the dataset of Richie et al. (2020) to learn contextualized metrics and compare the results with the baseline values obtained using the standard cosine similarity measure, which consistently shows improvement. We also train a contextualized similarity measure for both SimLex-999 and WordSim-353, comparing the results with the corresponding baselines, and using these datasets as independent test sets for the all-context similarity measure learned on the contextualized dataset, obtaining positive results for a number of tests.
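One common way to extend cosine similarity with a metric tensor is to replace the Euclidean inner product by a bilinear form. The sketch below assumes that form, with a symmetric positive-definite matrix `M`; the abstract does not give the exact parametrization, and the function names are hypothetical. With `M` equal to the identity, the measure reduces to standard cosine similarity.

```python
import math

def bilinear(u, M, v):
    """Compute u^T M v for plain Python lists."""
    return sum(u[i] * M[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v)))

def metric_cosine(u, v, M):
    """Cosine similarity under the inner product <u, v>_M = u^T M v.

    Assumes M is symmetric positive definite so both norms are positive.
    """
    return bilinear(u, M, v) / math.sqrt(bilinear(u, M, u) * bilinear(v, M, v))

identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[2.0, 0.0], [0.0, 1.0]]   # made-up metric that upweights axis 0
u, v = [1.0, 0.0], [1.0, 1.0]
print(metric_cosine(u, v, identity))   # standard cosine similarity
print(metric_cosine(u, v, stretched))  # similarity under the stretched metric
```

Learning a context-specific measure then amounts to fitting the entries of `M` to human similarity ratings, while the embeddings themselves stay fixed.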
# (参考訳) digital elevation model (dem) fusionの系統的レビューとメタ分析:前処理, 方法, 応用

A systematic review and meta-analysis of Digital Elevation Model (DEM) fusion: pre-processing, methods and applications ( http://arxiv.org/abs/2203.15026v1 )

The remote sensing community has identified data fusion as one of the key challenging topics of the 21st century. The subject of image fusion in two-dimensional (2D) space has been covered in several published reviews. However, the special case of 2.5D/3D Digital Elevation Model (DEM) fusion has not been addressed till date. DEM fusion is a key application of data fusion in remote sensing. It takes advantage of the complementary characteristics of multi-source DEMs to deliver a more complete, accurate and reliable elevation dataset. Although several methods for fusing DEMs have been developed, the absence of a well-rounded review has limited their proliferation among researchers and end-users. It is often required to combine knowledge from multiple studies to inform a holistic perspective and guide further research. In response, this paper provides a systematic review of DEM fusion: the pre-processing workflow, methods and applications, enhanced with a meta-analysis. Through the discussion and comparative analysis, unresolved challenges and open issues were identified, and future directions for research were proposed. This review is a timely solution and an invaluable source of information for researchers within the fields of remote sensing and spatial information science, and the data fusion community at large.
# (参考訳) 社会調和型ナビゲーションデータセット(SCAND) : ソーシャルナビゲーションのための大規模データ集合

Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation ( http://arxiv.org/abs/2203.15041v1 )

Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans. With the emergence of autonomously navigating mobile robots in human-populated environments (e.g., domestic service robots in homes and restaurants and food delivery robots on public sidewalks), incorporating socially compliant navigation behaviors on these robots becomes critical to ensuring safe and comfortable human-robot coexistence. To address this challenge, imitation learning is a promising framework, since it is easier for humans to demonstrate the task of social navigation than to formulate reward functions that accurately capture the complex multi-objective setting of social navigation. The application of imitation learning and inverse reinforcement learning to social navigation for mobile robots, however, is currently hindered by a lack of large-scale datasets that capture socially compliant robot navigation demonstrations in the wild. To fill this gap, we introduce the Socially CompliAnt Navigation Dataset (SCAND), a large-scale, first-person-view dataset of socially compliant navigation demonstrations. Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations that comprise multi-modal data streams including 3D lidar, joystick commands, odometry, and visual and inertial information, collected on two morphologically different mobile robots, a Boston Dynamics Spot and a Clearpath Jackal, by four different human demonstrators in both indoor and outdoor environments. We additionally perform preliminary analysis and validation through real-world robot experiments and show that navigation policies learned by imitation learning on SCAND generate socially compliant behaviors.
# (参考訳) ビッグデータとAIの時代におけるAUCの最大化:調査

AUC Maximization in the Era of Big Data and AI: A Survey ( http://arxiv.org/abs/2203.15046v1 )

Area under the ROC curve, a.k.a. AUC, is a measure of choice for assessing the performance of a classifier for imbalanced data. AUC maximization refers to a learning paradigm that learns a predictive model by directly maximizing its AUC score. It has been studied for more than two decades, dating back to the late 90s, and a huge amount of work has been devoted to AUC maximization since then. Recently, stochastic AUC maximization for big data and deep AUC maximization for deep learning have received increasing attention and have had a dramatic impact on solving real-world problems. However, to the best of our knowledge, there is no comprehensive survey of related works for AUC maximization. This paper aims to address the gap by reviewing the literature of the past two decades. We not only give a holistic view of the literature but also present detailed explanations and comparisons of different papers, from formulations to algorithms and theoretical guarantees. We also identify and discuss remaining and emerging issues for deep AUC maximization, and provide suggestions on topics for future work.
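To make the quantity being maximized concrete, here is a minimal sketch of AUC computed as the fraction of correctly ranked (positive, negative) pairs, together with a pairwise squared surrogate of the kind that can be minimized by gradient methods. The function names, the tie handling, and the choice of surrogate are illustrative assumptions, not formulations taken from the survey.

```python
def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pairwise_squared_surrogate(scores, labels, margin=1.0):
    """Mean of (margin - (s_pos - s_neg))^2 over all pairs.

    A differentiable stand-in for the non-differentiable pair-ranking
    indicator; the margin value is an assumption made for illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("the surrogate needs both classes")
    total = sum((margin - (p - n)) ** 2 for p in positives for n in negatives)
    return total / (len(positives) * len(negatives))
```

The double loop is quadratic in the number of examples, which is acceptable for a sketch but is one reason stochastic formulations matter for large datasets.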
# (参考訳) 疾患分類のためのフォローアップX線系列を用いた深層学習手法

A Deep Learning Technique using a Sequence of Follow Up X-Rays for Disease classification ( http://arxiv.org/abs/2203.15060v1 )

The ability to predict lung and heart diseases using deep learning techniques is of central interest to many researchers around the world, particularly in the medical field. In this paper, we present a new perspective on the familiar problem of disease classification using X-rays. We hypothesize that providing a patient's follow-up history, in the form of their three most recent chest X-ray images, yields better disease classification than a single chest X-ray image when an internal CNN performs feature extraction. We find that the generic deep learning architecture we propose for this problem performs well with three input X-ray images per sample for each patient. We also establish that, without additional layers before the output classification, the CNN models improve the performance of predicting the disease labels for each patient. We report our results as ROC curves and AUROC scores. We define a fresh approach of collecting three X-ray images for training deep learning models, which we conclude clearly improves the performance of the models. We show that ResNet, in general, gives better results than the other CNN models used in the feature extraction phase. With our original approach to data pre-processing, image training, and pre-trained models, we believe that this research will assist many medical institutions around the world, improving the prediction of patients' symptoms and supporting more accurate diagnosis and treatment.
# (参考訳) 反復非線形最適化とアニメーションによる意味運動補正

Semantic Motion Correction Via Iterative Nonlinear Optimization and Animation ( http://arxiv.org/abs/2203.15072v1 )

Here, we present an end-to-end method to create 2D animation for a goalkeeper attempting to block a penalty kick, and then correct that motion using an iterative nonlinear optimization scheme. The input is a raw video that is fed into pose and object detection networks to find the skeleton of the goalkeeper and the ball. The output is a set of key frames of the skeleton associated with the corrected motion, so that if the goalkeeper missed the ball, the animation will show them successfully deflecting it. Our method is robust enough to correct different kinds of mistakes the goalkeeper can make, such as not lunging far enough or jumping to the incorrect side. Our method is also meant to be semantically similar to the goalkeeper's original motion, which helps keep our animation grounded with respect to actual human behavior.
# (参考訳) ドライバ衝突警告に対するニューロシンボリックハイブリッドアプローチ

Neurosymbolic hybrid approach to driver collision warning ( http://arxiv.org/abs/2203.15076v1 )

There are two main algorithmic approaches to autonomous driving systems: (1) An end-to-end system in which a single deep neural network learns to map sensory input directly into appropriate warning and driving responses. (2) A mediated hybrid recognition system in which a system is created by combining independent modules that detect each semantic feature. While some researchers believe that deep learning can solve any problem, others believe that a more engineered and symbolic approach is needed to cope with complex environments with less data. Deep learning alone has achieved state-of-the-art results in many areas, from complex gameplay to predicting protein structures. In particular, in image classification and recognition, deep learning models have achieved accuracies as high as humans. But sometimes it can be very difficult to debug if the deep learning model doesn't work. Deep learning models can be vulnerable and are very sensitive to changes in data distribution. Generalization can be problematic. It's usually hard to prove why it works or doesn't. Deep learning models can also be vulnerable to adversarial attacks. Here, we combine deep learning-based object recognition and tracking with an adaptive neurosymbolic network agent, called the Non-Axiomatic Reasoning System (NARS), that can adapt to its environment by building concepts based on perceptual sequences. We achieved an improved intersection-over-union (IOU) object recognition performance of 0.65 in the adaptive retraining model compared to IOU 0.31 in the COCO data pre-trained model. We improved the object detection limits using RADAR sensors in a simulated environment, and demonstrated the weaving car detection capability by combining deep learning-based object detection and tracking with a neurosymbolic model.
# (参考訳) 視覚・自己教師あり音声モデルにおける単語発見

Word Discovery in Visually Grounded, Self-Supervised Speech Models ( http://arxiv.org/abs/2203.15081v1 )

We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we outperform all currently published methods on several metrics.
# (参考訳) 反復的, ディープ・シンセティック・アパーチャ・ソナー画像分割法

Iterative, Deep Synthetic Aperture Sonar Image Segmentation ( http://arxiv.org/abs/2203.15082v1 )

Synthetic aperture sonar (SAS) systems produce high-resolution images of the seabed environment. Moreover, deep learning has demonstrated superior ability in finding robust features for automating imagery analysis. However, the success of deep learning is conditioned on having lots of labeled training data, but obtaining generous pixel-level annotations of SAS imagery is often practically infeasible. This challenge has thus far limited the adoption of deep learning methods for SAS segmentation. Algorithms exist to segment SAS imagery in an unsupervised manner, but they lack the benefit of state-of-the-art learning methods and the results present significant room for improvement. In view of the above, we propose a new iterative algorithm for unsupervised SAS image segmentation combining superpixel formation, deep learning, and traditional clustering methods. We call our method Iterative Deep Unsupervised Segmentation (IDUS). IDUS is an unsupervised learning framework that can be divided into four main steps: 1) A deep network estimates class assignments. 2) Low-level image features from the deep network are clustered into superpixels. 3) Superpixels are clustered into class assignments (which we call pseudo-labels) using $k$-means. 4) Resulting pseudo-labels are used for loss backpropagation of the deep network prediction. These four steps are performed iteratively until convergence. A comparison of IDUS to current state-of-the-art methods on a realistic benchmark dataset for SAS image segmentation demonstrates the benefits of our proposal even as the IDUS incurs a much lower computational burden during inference (actual labeling of a test image). Finally, we also develop a semi-supervised (SS) extension of IDUS called IDSS and demonstrate experimentally that it can further enhance performance while outperforming supervised alternatives that exploit the same labeled training imagery.
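Step 3 of the loop described above, clustering superpixels into pseudo-labels with $k$-means, can be sketched as follows. This is a minimal pure-Python illustration that assumes each superpixel is already summarized by one feature vector. The function name, the deterministic initialization from the first `k` vectors, and the fixed iteration count are assumptions made for the example and are not details from the paper.

```python
def kmeans_pseudo_labels(features, k, iterations=20):
    """Cluster superpixel feature vectors and return one pseudo-label per vector."""
    if k < 1 or k > len(features):
        raise ValueError("k must be between 1 and the number of feature vectors")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic initialization (an assumption): the first k vectors.
    centers = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each superpixel takes the label of its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(f, centers[c]))
            for f in features
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [f for f, label in zip(features, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In the full method these pseudo-labels would then supervise the deep network in step 4; that training step is not shown here.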
# (参考訳) クロマグラムに基づくピッチ認識リミックスによる歌声分離の改善

Improved singing voice separation with chromagram-based pitch-aware remixing ( http://arxiv.org/abs/2203.15092v1 )

Singing voice separation aims to separate music into vocals and accompaniment components. One of the major constraints for the task is the limited amount of training data with separated vocals. Data augmentation techniques such as random source mixing have been shown to make better use of existing data and mildly improve model performance. We propose a novel data augmentation technique, chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed. By performing controlled experiments in both supervised and semi-supervised settings, we demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR).
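For readers unfamiliar with the metric, a basic form of the signal-to-distortion ratio can be sketched as below: the energy of the reference signal divided by the energy of the estimation error, in decibels. This is a simplified definition assumed for illustration. Evaluation toolkits typically use more elaborate variants, and the abstract does not say which one the authors used.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB: 10 * log10(||reference||^2 / ||reference - estimate||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    distortion_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal is silent; SDR is undefined")
    if distortion_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / distortion_energy)
```

An estimate that is uniformly 10% too quiet, for example, has an error energy of 1% of the signal energy and therefore an SDR of about 20 dB.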
Machine learning models such as Transformers or LSTMs struggle with tasks that are compositional in nature such as those involving reasoning/inference. Although many datasets exist to evaluate compositional generalization, when it comes to evaluating inference abilities, options are more limited. This paper presents LogicInference, a new dataset to evaluate the ability of models to perform logical inference. The dataset focuses on inference using propositional logic and a small subset of first-order logic, represented both in semi-formal logical notation, as well as in natural language. We also report initial results using a collection of machine learning models to establish an initial baseline in this dataset.
# (参考訳) 最大重み独立集合問題に対するメタヒューリスティックアルゴリズム

Motivated by a real-world vehicle routing application, we consider the maximum-weight independent set problem: Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum. Some of the graphs arising in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges. To solve instances of this size, we develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search (GRASP) framework. This algorithm, which we call METAMIS, uses a wider range of simple local search operations than previously described in the literature. We introduce data structures that make these operations efficient. A new variant of path-relinking is introduced to escape local optima, as is a new alternating augmenting-path local search move that improves algorithm performance. We compare an implementation of our algorithm with a state-of-the-art openly available code on public benchmark sets, including some large instances with hundreds of millions of vertices. Our algorithm is, in general, competitive and outperforms this openly available code on large vehicle routing instances. We hope that our results will lead to even better MWIS algorithms.
# FlexFringe:確率的オートマタ学習によるソフトウェア行動モデリング

We present the efficient implementations of probabilistic deterministic finite automaton learning methods available in FlexFringe. These implement well-known strategies for state-merging including several modifications to improve their performance in practice. We show experimentally that these algorithms obtain competitive results and significant improvements over a default implementation. We also demonstrate how to use FlexFringe to learn interpretable models from software logs and use these for anomaly detection. Although less interpretable, we show that learning smaller more convoluted models improves the performance of FlexFringe on anomaly detection, outperforming an existing solution based on neural nets.
# 垂直ジャンプ高さ推定のためのPolar V800スポーツウォッチの信頼性と妥当性

This study aimed to assess the reliability and validity of the Polar V800 to measure vertical jump height. Twenty-two physically active healthy men (age: 22.89 ± 4.23 years; body mass: 70.74 ± 8.04 kg; height: 1.74 ± 0.76 m) were recruited for the study. The reliability was evaluated by comparing measurements acquired by the Polar V800 in two identical testing sessions one week apart. Validity was assessed by comparing measurements simultaneously obtained using a force platform (gold standard), high-speed camera and the Polar V800 during squat jump (SJ) and countermovement jump (CMJ) tests. In the test-retest reliability, high intraclass correlation coefficients (ICCs) were observed (mean: 0.90, SJ and CMJ) in the Polar V800. There was no significant systematic bias ± random errors (p > 0.05) between test-retest. Low coefficients of variation (<5%) were detected in both jumps in the Polar V800. In the validity assessment, similar jump height was detected among devices (p > 0.05). There was almost perfect agreement between the Polar V800 and the force platform for the SJ and CMJ tests (mean ICCs = 0.95; no systematic bias ± random errors in SJ mean: -0.38 ± 2.10 cm, p > 0.05). Mean ICC between the Polar V800 and the high-speed camera was 0.91 for the SJ and CMJ tests; however, a significant systematic bias ± random error (0.97 ± 2.60 cm; p = 0.01) was detected in the CMJ test. The Polar V800 offers valid (compared to a force platform) and reliable information about vertical jump height performance in physically active healthy young men.
# 異なる取得スタイラスの相互運用のための手書き圧力正規化について

In this paper, we present a pressure characterization and normalization procedure for online handwriting acquisition. The normalization process has been tested in biometric recognition experiments (identification and verification) using the online signature database MCYT, which consists of signatures from 330 users. The goal is to analyze real mismatch scenarios in which users are enrolled with one stylus and later produce testing samples using a different stylus model with a different pressure response. Experimental results show: 1) a saturation behavior in the pressure signal; 2) different dynamic ranges in the different styluses studied; 3) improved biometric recognition accuracy by means of pressure signal normalization, as well as a performance degradation in mismatched conditions; 4) interoperability between different styluses can be obtained by means of pressure normalization. Normalization produces an improvement in signature identification rates higher than 7% (absolute value) when compared with mismatched scenarios.
# (参考訳) 試験試料の定量化による分布外精度の理解

License: CC BY 4.0
Existing works show that although modern neural networks achieve remarkable generalization performance on the in-distribution (ID) dataset, the accuracy drops significantly on the out-of-distribution (OOD) datasets \cite{recht2018cifar, recht2019imagenet}. To understand why a variety of models consistently make more mistakes in the OOD datasets, we propose a new metric to quantify the difficulty of the test images (either ID or OOD) that depends on the interaction of the training dataset and the model. In particular, we introduce \textit{confusion score} as a label-free measure of image difficulty which quantifies the amount of disagreement on a given test image based on the class conditional probabilities estimated by an ensemble of trained models. Using the confusion score, we investigate CIFAR-10 and its OOD derivatives. Next, by partitioning test and OOD datasets via their confusion scores, we predict the relationship between ID and OOD accuracies for various architectures. This allows us to obtain an estimator of the OOD accuracy of a given model only using ID test labels. Our observations indicate that the biggest contribution to the accuracy drop comes from images with high confusion scores. Upon further inspection, we report on the nature of the misclassified images grouped by their confusion scores: \textit{(i)} images with high confusion scores contain \textit{weak spurious correlations} that appear in multiple classes in the training data and lack clear \textit{class-specific features}, and \textit{(ii)} images with low confusion scores exhibit spurious correlations that belong to another class, namely \textit{class-specific spurious correlations}.
# (参考訳) フェデレートされた名前付きエンティティ認識

We present an analysis of the performance of Federated Learning in a paradigmatic natural-language processing task: Named-Entity Recognition (NER). For our evaluation, we use the language-independent CoNLL-2003 dataset as our benchmark dataset and a Bi-LSTM-CRF model as our benchmark NER model. We show that federated training reaches almost the same performance as the centralized model, though with some performance degradation as the learning environments become more heterogeneous. We also show the convergence rate of federated models for NER. Finally, we discuss existing challenges of Federated Learning for NLP applications that can foster future research directions.
# (参考訳) セマンティクスのセグメンテーションを再考する:プロトタイプビュー

Prevalent semantic segmentation solutions, despite their different network designs (FCN based or attention based) and mask decoding strategies (parametric softmax based or pixel-query based), can be placed in one category, by considering the softmax weights or query vectors as learnable class prototypes. In light of this prototype view, this study uncovers several limitations of such a parametric segmentation regime, and proposes a nonparametric alternative based on non-learnable prototypes. Instead of learning a single weight/query vector for each class in a fully parametric manner, as prior methods do, our model represents each class as a set of non-learnable prototypes, relying solely on the mean features of several training pixels within that class. The dense prediction is thus achieved by nonparametric nearest prototype retrieving. This allows our model to directly shape the pixel embedding space, by optimizing the arrangement between embedded pixels and anchored prototypes. It is able to handle an arbitrary number of classes with a constant amount of learnable parameters. We empirically show that, with FCN based and attention based segmentation models (i.e., HRNet, Swin, SegFormer) and backbones (i.e., ResNet, HRNet, Swin, MiT), our nonparametric framework yields compelling results over several datasets (i.e., ADE20K, Cityscapes, COCO-Stuff), and performs well in the large-vocabulary situation. We expect this work will provoke a rethink of the current de facto semantic segmentation model design.
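The idea of nearest prototype retrieval can be illustrated with a small sketch: each class holds a set of prototypes built as mean features, and a pixel embedding is assigned to the class of its closest prototype. The function names, the squared Euclidean distance, and the toy two-dimensional embeddings are assumptions chosen for illustration. The abstract does not specify the distance measure or how pixels are grouped into prototypes.

```python
def mean_prototype(features):
    """Average a list of equal-length feature vectors into one prototype."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def nearest_prototype_class(embedding, prototypes):
    """Return the class whose closest prototype is nearest to the embedding.

    prototypes maps a class name to a list of prototype vectors, so a class
    may be represented by several prototypes.
    """
    best_class, best_distance = None, float("inf")
    for class_name, class_prototypes in prototypes.items():
        for prototype in class_prototypes:
            distance = sum((a - b) ** 2 for a, b in zip(embedding, prototype))
            if distance < best_distance:
                best_class, best_distance = class_name, distance
    return best_class


# Hypothetical toy data: two classes, the second with two prototypes.
prototypes = {
    "road": [mean_prototype([[0.0, 0.0], [0.2, 0.0]])],
    "sky": [mean_prototype([[1.0, 1.0], [1.0, 0.8]]), [2.0, 2.0]],
}
```

Because the prototypes are computed rather than learned, adding a class only adds entries to the dictionary, which is consistent with the constant number of learnable parameters claimed above.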
# (参考訳) 逆行性複合機能のための逆行性前駆体

Training a high-dimensional simulated agent with an under-specified reward function often leads the agent to learn physically infeasible strategies that are ineffective when deployed in the real world. To mitigate these unnatural behaviors, reinforcement learning practitioners often utilize complex reward functions that encourage physically plausible behaviors. However, a tedious labor-intensive tuning process is often required to create hand-designed rewards which might not easily generalize across platforms and tasks. We propose substituting complex reward functions with "style rewards" learned from a dataset of motion capture demonstrations. A learned style reward can be combined with an arbitrary task reward to train policies that perform tasks using naturalistic strategies. These natural strategies can also facilitate transfer to the real world. We build upon Adversarial Motion Priors -- an approach from the computer graphics domain that encodes a style reward from a dataset of reference motions -- to demonstrate that an adversarial approach to training policies can produce behaviors that transfer to a real quadrupedal robot without requiring complex reward functions. We also demonstrate that an effective style reward can be learned from a few seconds of motion capture data gathered from a German Shepherd and leads to energy-efficient locomotion strategies with natural gait transitions.
# (参考訳) よくできたテキストは半分だ! 多様な条件生成のための組成サンプリング

We propose Composition Sampling, a simple but effective method to generate diverse outputs for conditional generation of higher quality compared to previous stochastic decoding strategies. It builds on recently proposed plan-based neural generation models (Narayan et al., 2021) that are trained to first create a composition of the output and then generate by conditioning on it and the input. Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to this entity chain. Experiments on summarization (CNN/DailyMail and XSum) and question generation (SQuAD), using existing and newly proposed automatic metrics together with human-based evaluation, demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful outputs.
# (参考訳) RGB-Dカメラ用ビジュアルオドメトリー

Visual odometry is the process of estimating the position and orientation of a camera by analyzing the images associated with it. This paper develops a quick and accurate approach to visual odometry of a moving RGB-D camera navigating in a static environment. The proposed algorithm uses SURF (Speeded Up Robust Features) as the feature extractor, RANSAC (Random Sample Consensus) to filter the results, and Minimum Mean Square to estimate the six-parameter rigid transformation between successive video frames. Data from a Kinect camera were used in the tests. The results show that this approach is feasible and promising, surpassing in performance the algorithms ICP (Iterative Closest Point) and SfM (Structure from Motion) in tests using a publicly available dataset.
# (参考訳) LocalBins: 局所分布学習による深さ推定の改善

We propose a novel architecture for depth estimation from a single image. The architecture itself is based on the popular encoder-decoder architecture that is frequently used as a starting point for all dense regression tasks. We build on AdaBins which estimates a global distribution of depth values for the input image and evolve the architecture in two ways. First, instead of predicting global depth distributions, we predict depth distributions of local neighborhoods at every pixel. Second, instead of predicting depth distributions only towards the end of the decoder, we involve all layers of the decoder. We call this new architecture LocalBins. Our results demonstrate a clear improvement over the state-of-the-art in all metrics on the NYU-Depth V2 dataset. Code and pretrained models will be made publicly available.
# (参考訳) エンド・ツー・エンド統一シーンテキスト検出とレイアウト解析に向けて

Scene text detection and document layout analysis have long been treated as two separate tasks in different image domains. In this paper, we bring them together and introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way. Comprehensive experiments show that our unified model achieves better performance than multiple well-designed baseline methods. Additionally, this model achieves state-of-the-art results on multiple scene text detection datasets without the need of complex post-processing. Dataset and code: https://github.com/google-research-datasets/hiertext.
# (参考訳) ミシガン神経障害スクリーニング装置を用いた糖尿病性感覚神経症に対する機械学習による重症度予測ツール

Background: Diabetic Sensorimotor polyneuropathy (DSPN) is a major long-term complication in diabetic patients associated with painful neuropathy, foot ulceration and amputation. The Michigan neuropathy screening instrument (MNSI) is one of the most common screening techniques for DSPN, however, it does not provide any direct severity grading system. Method: For designing and modelling the DSPN severity grading systems for MNSI, 19 years of data from Epidemiology of Diabetes Interventions and Complications (EDIC) clinical trials were used. MNSI variables and patient outcomes were investigated using machine learning tools to identify the features having higher association in DSPN identification. A multivariable logistic regression-based nomogram was generated and validated for DSPN severity grading. Results: The top-7 ranked features from MNSI: 10-gm filament, Vibration perception (R), Vibration perception (L), previous diabetic neuropathy, the appearance of deformities, appearance of callus and appearance of fissure were identified as key features for identifying DSPN using the extra tree model. The area under the curve (AUC) of the nomogram for the internal and external datasets were 0.9421 and 0.946, respectively. From the developed nomogram, the probability of having DSPN was predicted and a DSPN severity scoring system for MNSI was developed from the probability score. The model performance was validated on an independent dataset. Patients were stratified into four severity levels: absent, mild, moderate, and severe using a cut-off value of 10.5, 12.7 and 15 for a DSPN probability less than 50%, 75% to 90%, and above 90%, respectively. Conclusions: This study provides a simple, easy-to-use and reliable algorithm for defining the prognosis and management of patients with DSPN.
# wav2vec 2.0を用いた変換器によるロバスト話者認識

Recent advances in unsupervised speech representation learning have produced new approaches and new state-of-the-art results for diverse types of speech processing tasks. This paper presents an investigation of using wav2vec 2.0 deep speech representations for the speaker recognition task. The proposed fine-tuning procedure of wav2vec 2.0 with a simple TDNN and statistics pooling back-end using additive angular margin loss yields a deep speaker embedding extractor that generalizes well across different domains. It is concluded that the Contrastive Predictive Coding pretraining scheme efficiently utilizes the power of unlabeled data, and thus opens the door to powerful transformer-based speaker recognition systems. The experimental results obtained in this study demonstrate that fine-tuning can be done on relatively small sets and a clean version of data. Using data augmentation during fine-tuning provides additional performance gains in speaker verification. In this study, speaker recognition systems were analyzed on a wide range of well-known verification protocols: VoxCeleb1 cleaned test set, NIST SRE 18 development set, NIST SRE 2016 and NIST SRE 2019 evaluation set, VOiCES evaluation set, NIST 2021 SRE, and CTS challenges sets.
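The additive angular margin loss mentioned above can be sketched for a single training example as follows: the angle between the embedding and its target class weight is enlarged by a margin before a scaled softmax cross-entropy is applied. The function name and the default margin and scale values are assumptions for illustration and are not the settings used in the paper. The sketch also omits the special handling that practical implementations apply when the angle plus the margin exceeds π.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Cross-entropy over scaled cosine logits with a margin on the target angle.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true speaker class.
    """
    logits = []
    for index, cosine in enumerate(cosines):
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding
        if index == target:
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_normalizer = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_normalizer - logits[target]
```

With `margin=0.0` this reduces to an ordinary scaled cosine softmax, so the margin can be read as an extra penalty that forces target-class embeddings to sit closer to their class weight.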
# FedADMM: 部分的に参加可能なフェデレーションプリマルデュアルアルゴリズム

Federated learning is a framework for distributed optimization that places emphasis on communication efficiency. In particular, it follows a client-server broadcast model and is particularly appealing because of its ability to accommodate heterogeneity in client compute and storage resources, non-i.i.d. data assumptions, and data privacy. Our contribution is to offer a new federated learning algorithm, FedADMM, for solving non-convex composite optimization problems with non-smooth regularizers. We prove convergence of FedADMM for the case when not all clients are able to participate in a given communication round under a very general sampling model.
# 深部話者埋め込み型検証システムにおける異なるキャリブレーション手法の検討

Deep speaker embedding extractors have already become new state-of-the-art systems in the speaker verification field. However, the problem of verification score calibration for such systems often remains out of focus. An irrelevant score calibration leads to serious issues, especially in the case of unknown acoustic conditions, even if we use a strong speaker verification system in terms of threshold-free metrics. This paper presents an investigation over several methods of score calibration: a classical approach based on the logistic regression model; the recently presented magnitude estimation network MagnetO that uses activations from the pooling layer of the trained deep speaker extractor and generalization of such approach based on separate scale and offset prediction neural networks. An additional focus of this research is to estimate the impact of score normalization on the calibration performance of the system. The obtained results demonstrate that there are no serious problems if in-domain development data are used for calibration tuning. Otherwise, a trade-off between good calibration performance and threshold-free system quality arises. In most cases using adaptive s-norm helps to stabilize score distributions and to improve system performance. Meanwhile, some experiments demonstrate that novel approaches have their limits in score stabilization on several datasets.
# オーディオディープフェイクのアタッカー属性

Deepfakes are synthetically generated media often devised with malicious intent. They have become increasingly convincing with large training datasets and advanced neural networks. These fakes are readily being misused for slander, misinformation and fraud. For this reason, intensive research for developing countermeasures is also expanding. However, recent work is almost exclusively limited to deepfake detection, that is, predicting if audio is real or fake. This is despite the fact that attribution (who created which fake?) is an essential building block of a larger defense strategy, as practiced in the field of cybersecurity for a long time. This paper considers the problem of deepfake attacker attribution in the domain of audio. We present several methods for creating attacker signatures using low-level acoustic descriptors and machine learning embeddings. We show that speech signal features are inadequate for characterizing attacker signatures. However, we also demonstrate that embeddings from a recurrent neural network can successfully characterize attacks from both known and unknown attackers. Our attack signature embeddings result in distinct clusters, both for seen and unseen audio deepfakes. We show that these embeddings can be used in downstream tasks to great effect, scoring 97.10% accuracy in attacker-id classification.
# 対話的軌道予測のためのドメイン知識駆動型擬似ラベル

Motion forecasting in highly interactive scenarios is a challenging problem in autonomous driving. In such scenarios, we need to accurately predict the joint behavior of interacting agents to ensure the safe and efficient navigation of autonomous vehicles. Recently, goal-conditioned methods have gained increasing attention due to their advantage in performance and their ability to capture the multimodality in trajectory distribution. In this work, we study the joint trajectory prediction problem with the goal-conditioned framework. In particular, we introduce a conditional-variational-autoencoder-based (CVAE) model to explicitly encode different interaction modes into the latent space. However, we discover that the vanilla model suffers from posterior collapse and cannot induce an informative latent space as desired. To address these issues, we propose a novel approach to avoid KL vanishing and induce an interpretable interactive latent space with pseudo labels. The pseudo labels allow us to incorporate arbitrary domain knowledge on interaction. We motivate the proposed method using an illustrative toy example. In addition, we validate our framework on the Waymo Open Motion Dataset with both quantitative and qualitative evaluations.
# トランジットサービスのための確率的トリップ要求を用いた動的車両ルーティング問題のオンライン解法

Many transit agencies operating paratransit and microtransit services have to respond to trip requests that arrive in real-time, which entails solving hard combinatorial and sequential decision-making problems under uncertainty. To avoid decisions that lead to significant inefficiency in the long term, vehicles should be allocated to requests by optimizing a non-myopic utility function or by batching requests together and optimizing a myopic utility function. While the former approach is typically offline, the latter can be performed online. We point out two major issues with such approaches when applied to paratransit services in practice. First, it is difficult to batch paratransit requests together as they are temporally sparse. Second, the environment in which transit agencies operate changes dynamically (e.g., traffic conditions), causing estimates that are learned offline to become stale. To address these challenges, we propose a fully online approach to solve the dynamic vehicle routing problem (DVRP) with time windows and stochastic trip requests that is robust to changing environmental dynamics by construction. We focus on scenarios where requests are relatively sparse - our problem is motivated by applications to paratransit services. We formulate DVRP as a Markov decision process and use Monte Carlo tree search to evaluate actions for any given state. Accounting for stochastic requests while optimizing a non-myopic utility function is computationally challenging; indeed, the action space for such a problem is intractably large in practice. To tackle the large action space, we leverage the structure of the problem to design heuristics that can sample promising actions for the tree search. Our experiments using real-world data from our partner agency show that the proposed approach outperforms existing state-of-the-art approaches both in terms of performance and robustness.
# 一般ゲームにおけるnash平衡を産出する人工障壁を用いた適応学習

The concept of artificial barriers in Learning Automata (LA) is powerful and yet under-explored, although it was first proposed in the 1980s. Introducing artificial non-absorbing barriers makes the LA schemes resilient to being trapped in absorbing barriers, a phenomenon which is often referred to as lock-in probability, leading to an exclusive choice of one action after convergence. Within the field of LA and reinforcement learning in general, there is a scarcity of theoretical works and applications of schemes with artificial barriers. In this paper, we devise an LA with artificial barriers for solving a general form of stochastic bimatrix game. Classical LA systems possess properties of absorbing barriers; they are a powerful tool in game theory and were shown to converge to the game's Nash equilibrium under limited information. However, the stream of works in LA for solving game theoretical problems can merely solve the case where the Saddle Point of the game exists in a pure strategy and fails to reach a mixed Nash equilibrium when no Saddle Point exists for a pure strategy. In this paper, by resorting to the powerful concept of artificial barriers, we suggest an LA that converges to an optimal mixed Nash equilibrium even though there may be no Saddle Point when a pure strategy is invoked. Our deployed scheme is of Linear Reward-Inaction ($L_{R-I}$) flavor, which is originally an absorbing LA scheme; however, we render it non-absorbing by introducing artificial barriers in an elegant and natural manner, in the sense that the well-known legacy $L_{R-I}$ scheme can be seen as an instance of our proposed algorithm for a particular choice of the barrier. Furthermore, we present an $S$-Learning version of our LA with absorbing barriers that is able to handle an $S$-Learning environment in which the feedback is continuous and not binary as in the case of the $L_{R-I}$.
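For orientation, the legacy $L_{R-I}$ update that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the textbook Linear Reward-Inaction rule only; the artificial-barrier modification is not specified in the abstract and is not reproduced here. The function names, the learning rate `rate`, and the toy environment `reward_probs` are assumptions made for the example.

```python
import random


def lri_update(probs, chosen, rewarded, rate=0.1):
    """Classical Linear Reward-Inaction update of an action-probability vector."""
    if not rewarded:
        # "Inaction": a penalty leaves the probabilities unchanged.
        return list(probs)
    # Shrink every probability, then give the freed mass to the chosen action,
    # i.e. p_chosen <- p_chosen + rate * (1 - p_chosen).
    updated = [(1.0 - rate) * p for p in probs]
    updated[chosen] += rate
    return updated


def run_automaton(reward_probs, steps=2000, rate=0.05, seed=0):
    """Simulate one automaton in a hypothetical stationary binary environment."""
    rng = random.Random(seed)
    probs = [1.0 / len(reward_probs)] * len(reward_probs)
    for _ in range(steps):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        rewarded = rng.random() < reward_probs[action]
        probs = lri_update(probs, action, rewarded, rate)
    return probs
```

Because a rewarded action only ever gains probability mass, the vector drifts toward a corner of the simplex, which is the absorbing behavior the abstract describes.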
# Filler 単語の検出と分類:データセットとベンチマーク

Filler words such as "uh" or "um" are sounds or words people use to signal they are pausing to think. Finding and removing filler words from recordings is a common and tedious task in media editing. Automatically detecting and classifying filler words could greatly aid in this task, but few studies have been published on this problem. A key reason is the absence of a dataset with annotated filler words for training and evaluation. In this work, we present a novel speech dataset, PodcastFillers, with 35K annotated filler words and 50K annotations of other sounds that commonly occur in podcasts such as breaths, laughter, and word repetitions. We propose a pipeline that leverages VAD and ASR to detect filler candidates and a classifier to distinguish between filler word types. We evaluate our proposed pipeline on PodcastFillers, compare to several baselines, and present a detailed ablation study. In particular, we evaluate the importance of using ASR and how it compares to a transcription-free approach resembling keyword spotting. We show that our pipeline obtains state-of-the-art results, and that leveraging ASR strongly outperforms a keyword spotting approach. We make PodcastFillers publicly available, and hope our work serves as a benchmark for future research.
# テキストによるピアツーピアメンタルヘルス支援における人間とAIのコラボレーション

Advances in artificial intelligence (AI) are enabling systems that augment and collaborate with humans to perform simple, mechanistic tasks like scheduling meetings and grammar-checking text. However, such Human-AI collaboration poses challenges for more complex, creative tasks, such as carrying out empathic conversations, due to difficulties of AI systems in understanding complex human emotions and the open-ended nature of these tasks. Here, we focus on peer-to-peer mental health support, a setting in which empathy is critical for success, and examine how AI can collaborate with humans to facilitate peer empathy during textual, online supportive conversations. We develop Hailey, an AI-in-the-loop agent that provides just-in-time feedback to help participants who provide support (peer supporters) respond more empathically to those seeking help (support seekers). We evaluate Hailey in a non-clinical randomized controlled trial with real-world peer supporters on TalkLife (N=300), a large online peer-to-peer support platform. We show that our Human-AI collaboration approach leads to a 19.60% increase in conversational empathy between peers overall. Furthermore, we find a larger 38.88% increase in empathy within the subsample of peer supporters who self-identify as experiencing difficulty providing support. We systematically analyze the Human-AI collaboration patterns and find that peer supporters are able to use the AI feedback both directly and indirectly without becoming overly reliant on AI while reporting improved self-efficacy post-feedback. Our findings demonstrate the potential of feedback-driven, AI-in-the-loop writing systems to empower humans in open-ended, social, creative tasks such as empathic conversations.
# 深い対話型学習に基づく全スライド画像の卵巣癌分節化によるBRCA変異の形態学的解析

Deep learning has been widely used to analyze digitized hematoxylin and eosin (H&E)-stained histopathology whole slide images. Automated cancer segmentation using deep learning can be used to diagnose malignancy and to find novel morphological patterns to predict molecular subtypes. To train pixel-wise cancer segmentation models, manual annotation from pathologists is generally a bottleneck due to its time-consuming nature. In this paper, we propose Deep Interactive Learning with a pretrained segmentation model from a different cancer type to reduce manual annotation time. Instead of annotating all pixels from cancer and non-cancer regions on giga-pixel whole slide images, an iterative process of annotating mislabeled regions from a segmentation model and training/finetuning the model with the additional annotation can reduce the time. In particular, employing a pretrained segmentation model reduces the time further compared with starting annotation from scratch. Starting from a pretrained breast segmentation model, we trained an accurate ovarian cancer segmentation model with 3.5 hours of manual annotation, achieving intersection-over-union of 0.74, recall of 0.86, and precision of 0.84. With automatically extracted high-grade serous ovarian cancer patches, we attempted to train another deep learning model to predict BRCA mutation. The segmentation model and code have been released at https://github.com/MSKCC-Computational-Pathology/DMMN-ovary.
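The three segmentation metrics quoted above are all functions of the same pixel counts. The sketch below shows one common way to compute them for flat binary masks; the function names and the handling of empty masks are assumptions made for illustration, not the authors' evaluation code.

```python
def segmentation_scores(pred, truth):
    """Return (IoU, recall, precision) for two equal-length binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    # Empty denominators are scored as 1.0 (a convention chosen for this sketch).
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 1.0
    return iou, recall, precision


def iou_from_precision_recall(precision, recall):
    """IoU implied by precision and recall computed from the same counts."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)
```

If all three numbers come from the same pixel counts, they satisfy IoU = 1 / (1/precision + 1/recall - 1). Plugging in precision 0.84 and recall 0.86 gives roughly 0.74, in line with the reported IoU.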
# 顔認証バイパス

Face verification systems aim to validate the claimed identity using feature vectors and distance metrics. However, no attempt has been made to bypass such a system using generated images that are constrained by the same feature vectors. In this work, we train StarGAN v2 to generate diverse images based on a human user, that have similar feature vectors yet qualitatively look different. We then demonstrate a proof of concept on a custom face verification system and verify our claims by demonstrating the same proof of concept in a black box setting on dating applications that utilize similar face verification systems.
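As a rough illustration of the kind of check such a system performs, the sketch below compares two feature vectors with cosine similarity and accepts the claim when the score clears a threshold. The metric choice, the threshold value of 0.8, and the function names are assumptions for the example; the abstract does not say which metric or threshold the tested systems use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(claimed_embedding, probe_embedding, threshold=0.8):
    """Accept the identity claim if the embeddings are similar enough."""
    return cosine_similarity(claimed_embedding, probe_embedding) >= threshold
```

A check of this form looks only at the feature vectors, so two images that look different to a person but map to nearby embeddings would both pass, which is the weakness the abstract targets.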
# 深層学習に基づくアクセス制御に向けて

A common trait of current access control approaches is the challenging need to engineer abstract and intuitive access control models. This entails designing access control information in the form of roles (RBAC), attributes (ABAC), or relationships (ReBAC) as the case may be, and subsequently, designing access control rules. This framework has its benefits but has significant limitations in the context of modern systems that are dynamic, complex, and large-scale, due to which it is difficult to maintain an accurate access control state in the system for a human administrator. This paper proposes Deep Learning Based Access Control (DLBAC) by leveraging significant advances in deep learning technology as a potential solution to this problem. We envision that DLBAC could complement and, in the long-term, has the potential to even replace, classical access control models with a neural network that reduces the burden of access control model engineering and updates. Without loss of generality, we conduct a thorough investigation of a candidate DLBAC model, called DLBAC_alpha, using both real-world and synthetic datasets. We demonstrate the feasibility of the proposed approach by addressing issues related to accuracy, generalization, and explainability. We also discuss challenges and future research directions.
# 1石で2羽の鳥を殺す:部分fcによる顔認識cnnの効率的かつロバストな訓練

Learning discriminative deep feature embeddings by using million-scale in-the-wild datasets and margin-based softmax loss is the current state-of-the-art approach for face recognition. However, the memory and computing cost of the Fully Connected (FC) layer scales linearly with the number of identities in the training set. Besides, the large-scale training data inevitably suffers from inter-class conflict and long-tailed distribution. In this paper, we propose a sparsely updating variant of the FC layer, named Partial FC (PFC). In each iteration, positive class centers and a random subset of negative class centers are selected to compute the margin-based softmax loss. All class centers are still maintained throughout the whole training process, but only a subset is selected and updated in each iteration. Therefore, the computing requirement, the probability of inter-class conflict, and the frequency of passive update on tail class centers, are dramatically reduced. Extensive experiments across different training data and backbones (e.g. CNN and ViT) confirm the effectiveness, robustness and efficiency of the proposed PFC. The source code is available at https://github.com/deepinsight/insightface/tree/master/recognition.
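The per-iteration selection step described above can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the loss is a plain softmax cross-entropy without the margin term, class centers are plain Python lists, and the names `sample_class_subset`, `subset_softmax_loss` and `num_sampled` are hypothetical rather than taken from the released code.

```python
import math
import random


def sample_class_subset(labels, num_classes, num_sampled, rng):
    """Keep every class present in the batch, then fill up with random negatives."""
    positives = sorted(set(labels))
    positive_set = set(positives)
    negatives_pool = [c for c in range(num_classes) if c not in positive_set]
    num_negatives = max(0, num_sampled - len(positives))
    negatives = rng.sample(negatives_pool, min(num_negatives, len(negatives_pool)))
    return positives + negatives


def subset_softmax_loss(embedding, label, centers, subset):
    """Softmax cross-entropy over the sampled class centers only (no margin)."""
    logits = [sum(e * w for e, w in zip(embedding, centers[c])) for c in subset]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[subset.index(label)]
```

Only the centers listed in the sampled subset enter the loss, so only those would receive gradients in a given iteration, while the full table of centers stays in memory.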
# X-Pool:テキストビデオ検索のためのクロスプラットフォーム言語ビデオアテンション

In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text's attention weights over the frames. We evaluate our method on three benchmark datasets of MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results by up to 12% in relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text. Full code and demo can be found at: https://layer6ai-labs.github.io/xpool/
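The core mechanism, a text embedding attending over frame embeddings with scaled dot product attention, can be sketched as below. This is a stripped-down illustration: it omits any learned query, key and value projections, and the function name `text_conditioned_pool` is hypothetical.

```python
import math


def text_conditioned_pool(text, frames):
    """Pool frame embeddings with softmax(text . frame / sqrt(d)) weights."""
    dim = len(text)
    scores = [
        sum(t * f for t, f in zip(text, frame)) / math.sqrt(dim)
        for frame in frames
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the frames: the text-conditioned video representation.
    pooled = [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]
    return pooled, weights
```

With text `[1, 0]` and frames `[[4, 0], [0, 4]]`, the first frame receives most of the weight, whereas mean-pooling would weight both frames equally regardless of the text.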
# tl-gan: 自動運転のためのデータ合成による交通光認識の改善

Traffic light recognition, as a critical component of the perception module of self-driving vehicles, plays a vital role in the intelligent transportation systems. The prevalent deep learning based traffic light recognition methods heavily hinge on the large quantity and rich diversity of training data. However, it is quite challenging to collect data in various rare scenarios such as flashing, blackout or extreme weather, thus resulting in the imbalanced distribution of training data and consequently the degraded performance in recognizing rare classes. In this paper, we seek to improve traffic light recognition by leveraging data synthesis. Inspired by the generative adversarial networks (GANs), we propose a novel traffic light generation approach TL-GAN to synthesize the data of rare classes to improve traffic light recognition for autonomous driving. TL-GAN disentangles traffic light sequence generation into image synthesis and sequence assembling. In the image synthesis stage, our approach enables conditional generation to allow full control of the color of the generated traffic light images. In the sequence assembling stage, we design the style mixing and adaptive template to synthesize realistic and diverse traffic light sequences. Extensive experiments show that the proposed TL-GAN renders remarkable improvement over the baseline without using the generated data, leading to the state-of-the-art performance in comparison with the competing algorithms that are used for general image synthesis and data imbalance tackling.
# 明示的に暗黙に登録する:単一画像からの高忠実な衣料メッシュ再構築に向けて

Fueled by the power of deep learning techniques and implicit shape learning, recent advances in single-image human digitalization have reached unprecedented accuracy and can recover fine-grained surface details such as garment wrinkles. However, a common problem for the implicit-based methods is that they cannot produce separated and topology-consistent mesh for each garment piece, which is crucial for the current 3D content creation pipeline. To address this issue, we propose a novel geometry inference framework ReEF that reconstructs topology-consistent layered garment mesh by registering the explicit garment template to the whole-body implicit fields predicted from single images. Experiments demonstrate that our method notably outperforms its counterparts on single-image layered garment reconstruction and can provide high-quality digital assets for further content creation.
# 非教師なし超スペクトル画像セグメンテーションのための分布依存ムンフォード・シャーモデル

Hyperspectral images provide a rich representation of the underlying spectrum for each pixel, allowing for a pixel-wise classification/segmentation into different classes. As the acquisition of labeled training data is very time-consuming, unsupervised methods become crucial in hyperspectral image analysis. The spectral variability and noise in hyperspectral data make this task very challenging and define special requirements for such methods. Here, we present a novel unsupervised hyperspectral segmentation framework. It starts with a denoising and dimensionality reduction step by the well-established Minimum Noise Fraction (MNF) transform. Then, the Mumford-Shah (MS) segmentation functional is applied to segment the data. We equipped the MS functional with a novel robust distribution-dependent indicator function designed to handle the characteristic challenges of hyperspectral data. To optimize our objective function with respect to the parameters for which no closed form solution is available, we propose an efficient fixed point iteration scheme. Numerical experiments on four public benchmark datasets show that our method produces competitive results, which outperform two state-of-the-art methods substantially on three of these datasets.
# DeepShadow: シャドーからの神経形

This paper presents DeepShadow, a one-shot method for recovering the depth map and surface normals from photometric stereo shadow maps. Previous works that try to recover the surface normals from photometric stereo images treat cast shadows as a disturbance. We show that the self and cast shadows not only do not disturb 3D reconstruction, but can be used alone, as a strong learning signal, to recover the depth map and surface normals. We demonstrate that 3D reconstruction from shadows can even outperform shape-from-shading in certain cases. To the best of our knowledge, our method is the first to reconstruct 3D shape-from-shadows using neural networks. The method does not require any pre-training or expensive labeled data, and is optimized during inference time.
# CD-Net:ピラミッドコンテキスト詳細ネットワークを用いた病理組織学的表現学習

Extracting rich phenotype information, such as cell density and arrangement, from whole slide histology images (WSIs) requires analysis of a large field of view, i.e., more contextual information. This can be achieved through analyzing the digital slides at lower resolution. A potential drawback is missing out on details present at a higher resolution. To jointly leverage complementary information from multiple resolutions, we present a novel transformer based Pyramidal Context-Detail Network (CD-Net). CD-Net exploits the WSI pyramidal structure through co-training of proposed Context and Detail Modules, which operate on inputs from multiple resolutions. The residual connections between the modules enable the joint training paradigm while learning self-supervised representation for WSIs. The efficacy of CD-Net is demonstrated in classifying Lung Adenocarcinoma from Squamous cell carcinoma.
# DAMNETS: Markovian Network Time Series を生成するための深い自己回帰モデル

In this work, we introduce DAMNETS, a deep generative model for Markovian network time series. Time series of networks are found in many fields such as trade or payment networks in economics, contact networks in epidemiology or social media posts over time. Generative models of such data are useful for Monte-Carlo estimation and data set expansion, which is of interest for both data privacy and model fitting. Using recent ideas from the Graph Neural Network (GNN) literature, we introduce a novel GNN encoder-decoder structure in which an encoder GNN learns a latent representation of the input graph, and a decoder GNN uses this representation to simulate the network dynamics. We show using synthetic data sets that DAMNETS can replicate features of network topology across time observed in the real world, such as changing community structure and preferential attachment. DAMNETS outperforms competing methods on all of our measures of sample quality over several real and synthetic data sets.
# 木探索とグラフニューラルネットワークを用いた時間制御性制限下での不確実性を持つ分断時間ネットワークの解法

We present a novel approach based on tree search and graph machine learning for the scheduling problem known as Disjunctive Temporal Networks with Uncertainty (DTNU). Dynamic Controllability (DC) of DTNUs seeks a reactive scheduling strategy to satisfy temporal constraints in response to uncontrollable action durations. We introduce new semantics for reactive scheduling: Time-based Dynamic Controllability (TDC) and a restricted subset of TDC, R-TDC. We design a tree search algorithm to determine whether or not a DTNU is R-TDC. Moreover, we leverage a graph neural network as a heuristic for tree search guidance. Finally, we conduct experiments on a known benchmark on which we show R-TDC to retain significant completeness with regard to DC, while being faster to prove. This results in the tree search processing fifty percent more DTNU problems in R-TDC than the state-of-the-art DC solver does in DC with the same time budget. We also observe that graph neural network search guidance leads to substantial performance gains on benchmarks of more complex DTNUs, with up to eleven times more problems solved than the baseline tree search.
# 実世界ラベルのない光学的流れ, 深さ, シーンフローの学習

Self-supervised monocular depth estimation enables robots to learn 3D perception from raw video streams. This scalable approach leverages projective geometry and ego-motion to learn via view synthesis, assuming the world is mostly static. Dynamic scenes, which are common in autonomous driving and human-robot interaction, violate this assumption. Therefore, they require modeling dynamic objects explicitly, for instance via estimating pixel-wise 3D motion, i.e. scene flow. However, the simultaneous self-supervised learning of depth and scene flow is ill-posed, as there are infinitely many combinations that result in the same 3D point. In this paper we propose DRAFT, a new method capable of jointly learning depth, optical flow, and scene flow by combining synthetic data with geometric self-supervision. Building upon the RAFT architecture, we learn optical flow as an intermediate task to bootstrap depth and scene flow learning via triangulation. Our algorithm also leverages temporal and geometric consistency losses across tasks to improve multi-task learning. Our DRAFT architecture simultaneously establishes a new state of the art in all three tasks in the self-supervised monocular setting on the standard KITTI benchmark. Project page: https://sites.google.com/tri.global/draft.
# 新しいピラミッド型ハイブリッドテクスチャとディープ機能に基づく皮膚癌自動分類モデル:アンサンブルダークネットとテクスチャ特徴抽出器

Background: Skin cancer is one of the most common cancers worldwide, and automatic classification of skin cancer can help dermatology clinics reach an accurate diagnosis. Hence, a machine learning-based automatic skin cancer detection model must be developed. Material and Method: This research aims to address the automatic skin cancer detection problem. A colored skin cancer image dataset is used. This dataset contains 3297 images with two classes. An automatic multilevel textural and deep features-based model is presented. Multilevel fused feature generation using discrete wavelet transform (DWT), local phase quantization (LPQ), local binary pattern (LBP), pre-trained DarkNet19, and DarkNet53 is utilized to generate features of the skin cancer images, and the top 1000 features are selected using threshold value-based neighborhood component analysis (NCA). The chosen top 1000 features are classified using the 10-fold cross-validation technique. Results: Using ten-fold cross-validation, a classification accuracy of 91.54% is obtained with the recommended pyramidal hybrid feature generator and NCA selector-based model. Further, various training and testing separation ratios (90:10, 80:20, 70:30, 60:40, 50:50) are used and the maximum classification rate is calculated as 95.74% using the 90:10 separation ratio. Conclusions: The findings and calculated accuracies indicate that this model can be used in dermatology and pathology clinics to simplify the skin cancer detection process and help physicians.
# ロバスト3次元物体検出のためのLiDAR降雪シミュレーション

3D object detection is a central task for applications such as autonomous driving, in which the system needs to localize and classify surrounding traffic agents, even in the presence of adverse weather. In this paper, we address the problem of LiDAR-based 3D object detection under snowfall. Due to the difficulty of collecting and annotating training data in this setting, we propose a physically based method to simulate the effect of snowfall on real clear-weather LiDAR point clouds. Our method samples snow particles in 2D space for each LiDAR line and uses the induced geometry to modify the measurement for each LiDAR beam accordingly. Moreover, as snowfall often causes wetness on the ground, we also simulate ground wetness on LiDAR point clouds. We use our simulation to generate partially synthetic snowy LiDAR data and leverage these data for training 3D object detection models that are robust to snowfall. We conduct an extensive evaluation using several state-of-the-art 3D object detection methods and show that our simulation consistently yields significant performance gains on the real snowy STF dataset compared to clear-weather baselines and competing simulation approaches, while not sacrificing performance in clear weather. Our code is available at www.github.com/SysCV/LiDAR_snow_sim.
# 記述するものを分離する:言語に基づく音源分離

In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To address this issue, we propose LASS-Net, an end-to-end neural network that learns to jointly process acoustic and linguistic information, and to separate the target source that is consistent with the language query from an audio mixture. We evaluate the performance of our proposed system with a dataset created from the AudioCaps dataset. Experimental results show that LASS-Net achieves considerable improvements over baseline methods. Furthermore, we observe that LASS-Net achieves promising generalization results when using diverse human-annotated descriptions as queries, indicating its potential use in real-world scenarios. The separated audio samples and source code are available at https://liuxubo717.github.io/LASS-demopage.
# 完全クロストランスを用いたFew-Shotオブジェクト検出

Few-shot object detection (FSOD), which aims to detect novel objects using very few training examples, has recently attracted great research interest in the community. Metric-learning based methods, which use a two-branch siamese network to calculate the similarity between image regions and few-shot examples for detection, have been demonstrated to be effective for this task. However, in previous works, the interaction between the two branches is restricted to the detection head, leaving the remaining hundreds of layers for separate feature extraction. Inspired by the recent work on vision transformers and vision-language transformers, we propose a novel Fully Cross-Transformer based model (FCT) for FSOD by incorporating cross-transformer into both the feature backbone and detection head. The asymmetric-batched cross-attention is proposed to aggregate the key information from the two branches with different batch sizes. Our model can improve the few-shot similarity learning between the two branches by introducing the multi-level interactions. Comprehensive experiments on both PASCAL VOC and MSCOCO FSOD benchmarks demonstrate the effectiveness of our model.
# 変形性および高齢者の音声認識におけるオンザフライ特徴に基づく話者適応

Automatic recognition of dysarthric and elderly speech remains a highly challenging task to date. Speaker-level heterogeneity attributed to accent or gender commonly found in normal speech, when aggregated with age and speech impairment severity, creates large diversity among speakers. Speaker adaptation techniques play a crucial role in personalization of ASR systems for such users. Their mobility issues limit the amount of speaker-level data available for model based adaptation. To this end, this paper investigates two novel forms of feature based on-the-fly rapid speaker adaptation approaches. The first is based on speaker-level variance regularized spectral basis embedding (SBEVR) features, while the other uses on-the-fly learning hidden unit contributions (LHUC) transforms conditioned on speaker-level spectral features. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech datasets suggest that the proposed SBEVR features based adaptation statistically significantly outperforms both the baseline on-the-fly i-Vector adapted hybrid TDNN/DNN systems, by up to 2.48% absolute (7.92% relative) reduction in word error rate (WER), and offline batch mode model based LHUC adaptation using all speaker-level data, by 0.78% absolute (2.41% relative) reduction in WER.
# booleanルール説明によるユーザ駆動モデル調整

AI solutions are heavily dependent on the quality and accuracy of the input training data; however, the training data may not always fully reflect the most up-to-date policy landscape or may be missing business logic. The advances in explainability have opened the possibility of allowing users to interact with interpretable explanations of ML predictions in order to inject modifications or constraints that more accurately reflect current realities of the system. In this paper, we present a solution which leverages the predictive power of ML models while allowing the user to specify modifications to decision boundaries. Our interactive overlay approach achieves this goal without requiring model retraining, making it appropriate for systems that need to apply instant changes to their decision making. We demonstrate that user feedback rules can be layered with the ML predictions to provide immediate changes, which in turn supports learning with less data.
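The idea of layering user rules over model predictions without retraining can be illustrated with a very small sketch. It is a simplified stand-in, not the paper's method: rules here are plain Python predicates paired with a label, the first matching rule wins, and `toy_model` is a hypothetical placeholder for a trained classifier.

```python
def overlay_predict(model_predict, rules, features):
    """Return the first matching user rule's label, else the model's prediction."""
    for condition, label in rules:
        if condition(features):
            return label
    return model_predict(features)


def toy_model(features):
    # Hypothetical stand-in for a trained classifier.
    return "approve" if features["income"] >= 50_000 else "deny"


# A user-supplied rule reflecting a policy the training data did not capture.
user_rules = [(lambda f: f["age"] < 18, "deny")]
```

Calling `overlay_predict(toy_model, user_rules, {"income": 80_000, "age": 16})` returns `"deny"`, while the same income with `age` 30 falls through to the model. The underlying model is never modified, so a rule takes effect immediately.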
# CMGAN:音声強調のためのコンバータベースメトリックGAN

Recently, convolution-augmented transformer (Conformer) has achieved promising performance in automatic speech recognition (ASR) and time-domain speech enhancement (SE), as it can capture both local and global dependencies in the speech signal. In this paper, we propose a conformer-based metric generative adversarial network (CMGAN) for SE in the time-frequency (TF) domain. In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information by modeling both time and frequency dependencies. The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech. In addition, a metric discriminator is employed to further improve the quality of the enhanced estimated speech by optimizing the generator with respect to a corresponding evaluation score. Quantitative analysis on Voice Bank+DEMAND dataset indicates the capability of CMGAN in outperforming various previous models with a margin, i.e., PESQ of 3.41 and SSNR of 11.10 dB.
# 非パラメトリック混合学習のための超多項下限

We study the problem of learning nonparametric distributions in a finite mixture, and establish a super-polynomial lower bound on the sample complexity of learning the component distributions in such models. Namely, we are given i.i.d. samples from $f$ where $$ f=\sum_{i=1}^k w_i f_i, \quad\sum_{i=1}^k w_i=1, \quad w_i>0 $$ and we are interested in learning each component $f_i$. Without any assumptions on $f_i$, this problem is ill-posed. In order to identify the components $f_i$, we assume that each $f_i$ can be written as a convolution of a Gaussian and a compactly supported density $\nu_i$ with $\text{supp}(\nu_i)\cap \text{supp}(\nu_j)=\emptyset$. Our main result shows that $\Omega((\frac{1}{\varepsilon})^{C\log\log \frac{1}{\varepsilon}})$ samples are required for estimating each $f_i$. The proof relies on a fast rate for approximation with Gaussians, which may be of independent interest. This result has important implications for the hardness of learning more general nonparametric latent variable models that arise in machine learning applications.
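The generative model in this setup can be made concrete with a small sampler. A draw from $f_i$, the convolution of a Gaussian with $\nu_i$, is the sum of an independent draw from $\nu_i$ and Gaussian noise. In the sketch below, each $\nu_i$ is taken to be uniform on an interval purely for illustration (the abstract only requires compact, pairwise disjoint supports), and the function name and parameters are assumptions.

```python
import random


def sample_mixture(weights, supports, sigma, n, seed=0):
    """Draw n (value, component) pairs from f = sum_i w_i * (N(0, sigma^2) * nu_i).

    Each nu_i is Uniform(supports[i]) here, which is an illustrative choice.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        i = rng.choices(range(len(weights)), weights=weights)[0]
        low, high = supports[i]
        latent = rng.uniform(low, high)  # draw from nu_i (compact support)
        samples.append((latent + rng.gauss(0.0, sigma), i))  # add Gaussian noise
    return samples
```

A learner in the abstract's setting sees only the values, not the component labels returned here. The Gaussian noise makes the observed components overlap even though the supports of the $\nu_i$ are disjoint.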
# Learning Parameterized Task Structure for Generalization to Unseen Entities

Real world tasks are hierarchical and compositional. Tasks can be composed of multiple subtasks (or sub-goals) that are dependent on each other. These subtasks are defined in terms of entities (e.g., "apple", "pear") that can be recombined to form new subtasks (e.g., "pickup apple", and "pickup pear"). To solve these tasks efficiently, an agent must infer subtask dependencies (e.g. an agent must execute "pickup apple" before "place apple in pot"), and generalize the inferred dependencies to new subtasks (e.g. "place apple in pot" is similar to "place apple in pan"). Moreover, an agent may also need to solve unseen tasks, which can involve unseen entities. To this end, we formulate parameterized subtask graph inference (PSGI), a method for modeling subtask dependencies using first-order logic with subtask entities. To facilitate this, we learn entity attributes in a zero-shot manner, which are used as quantifiers (e.g. "is_pickable(X)") for the parameterized subtask graph. We show this approach accurately learns the latent structure on hierarchical and compositional tasks more efficiently than prior work, and show PSGI can generalize by modelling structure on subtasks unseen during adaptation.
# Convex Non-negative Matrix Factorization Through Quantum Annealing

In this paper we provide a quantum version of the Convex Non-negative Matrix Factorization algorithm (Convex-NMF) using the D-Wave quantum annealer. More precisely, we use the D-Wave 2000Q to find a low-rank approximation of a fixed real-valued matrix X by the product of two non-negative matrix factors W and G such that the Frobenius norm of the difference X-XWG is minimized. To solve this optimization problem we proceed in two steps. In the first step, we transform the global real optimization problem depending on W and G into two quadratic unconstrained binary optimization (QUBO) problems depending on W and G respectively. In the second step, we alternate between the two QUBO problems corresponding to W and G to find the global solution. Running these two QUBO problems on the D-Wave 2000Q requires an embedding into the Chimera graph of the D-Wave 2000Q, and this embedding is limited by the number of qubits of the D-Wave 2000Q. We study the maximum amount of real data that our approach can handle on the D-Wave 2000Q; this study is based on the number of qubits used to represent each real variable. We also tested our approach on the D-Wave 2000Q with several randomly generated data sets to show that it is faster than the classical approach and that it obtains the best results.
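
As a rough illustration of the objective described above, the following sketch evaluates the Frobenius norm of X-XWG classically for given non-negative factors. It is not the authors' annealing pipeline: the helper names `matmul` and `convex_nmf_objective` are hypothetical, and the shapes (X is d×n, W is n×k, G is k×n) are an assumption chosen so that the product XWG is well defined.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def convex_nmf_objective(x, w, g):
    """Return the Frobenius norm of X - XWG.

    Assumed shapes (not stated in the abstract): X is d x n, W is n x k,
    G is k x n. W and G must be entrywise non-negative.
    """
    if any(v < 0 for row in w for v in row) or any(v < 0 for row in g for v in row):
        raise ValueError("W and G must be non-negative")
    approx = matmul(matmul(x, w), g)  # XWG has the same shape as X
    return math.sqrt(sum((x[i][j] - approx[i][j]) ** 2
                         for i in range(len(x))
                         for j in range(len(x[0]))))
```

An alternating scheme such as the one in the abstract would repeatedly lower this value by updating W with G fixed and then G with W fixed; here both updates are left out, since their QUBO encodings are not given in the text.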
# Multilingual Knowledge Graph Completion with Self-Supervised Adaptive Graph Alignment

Predicting missing facts in a knowledge graph (KG) is crucial as modern KGs are far from complete. Due to labor-intensive human labeling, this phenomenon deteriorates when handling knowledge represented in various languages. In this paper, we explore multilingual KG completion, which leverages limited seed alignment as a bridge, to embrace the collective knowledge from multiple languages. However, language alignment used in prior works is still not fully exploited: (1) alignment pairs are treated equally to maximally push parallel entities to be close, which ignores KG capacity inconsistency; (2) seed alignment is scarce and new alignment identification is usually in a noisily unsupervised manner. To tackle these issues, we propose a novel self-supervised adaptive graph alignment (SS-AGA) method. Specifically, SS-AGA fuses all KGs as a whole graph by regarding alignment as a new edge type. As such, information propagation and noise influence across KGs can be adaptively controlled via relation-aware attention weights. Meanwhile, SS-AGA features a new pair generator that dynamically captures potential alignment pairs in a self-supervised paradigm. Extensive experiments on both the public multilingual DBPedia KG and newly-created industrial multilingual E-commerce KG empirically demonstrate the effectiveness of SS-AG
# Core Risk Minimization using Salient ImageNet

Deep neural networks can be unreliable in the real world, especially when they rely heavily on spurious features for their predictions. Recently, Singla & Feizi (2022) introduced the Salient ImageNet dataset by annotating and localizing core and spurious features of ~52k samples from 232 classes of ImageNet. While this dataset is useful for evaluating the reliance of pretrained models on spurious features, its small size limits its usefulness for training models. In this work, we first introduce the Salient ImageNet-1M dataset with more than 1 million soft masks localizing core and spurious features for all 1000 ImageNet classes. Using this dataset, we first evaluate the reliance of several ImageNet pretrained models (42 total) on spurious features and observe that: (i) transformers are more sensitive to spurious features compared to Convnets, (ii) zero-shot CLIP transformers are highly susceptible to spurious features. Next, we introduce a new learning paradigm called Core Risk Minimization (CoRM) whose objective ensures that the model predicts a class using its core features. We evaluate different computational approaches for solving CoRM and achieve significantly higher (+12%) core accuracy (accuracy when non-core regions are corrupted using noise) with no drop in clean accuracy compared to models trained via Empirical Risk Minimization.
# Text2Pos: Text-to-Point-Cloud Cross-Modal Localization

Natural language-based communication with mobile devices and home appliances is becoming increasingly popular and has the potential to become natural for communicating with mobile robots in the future. Towards this goal, we investigate cross-modal text-to-point-cloud localization that will allow us to specify, for example, a vehicle pick-up or goods delivery location. In particular, we propose Text2Pos, a cross-modal localization module that learns to align textual descriptions with localization cues in a coarse-to-fine manner. Given a point cloud of the environment, Text2Pos locates a position that is specified via a natural language-based description of the immediate surroundings. To train Text2Pos and study its performance, we construct KITTI360Pose, the first dataset for this task based on the recently introduced KITTI360 dataset. Our experiments show that we can localize 65% of textual queries within 15m distance to query locations for top-10 retrieved locations. This is a starting point that we hope will spark future developments towards language-based navigation.
# Cycle-Consistent Counterfactuals by Latent Transformations

CounterFactual (CF) visual explanations try to find images similar to the query image that change the decision of a vision system to a specified outcome. Existing methods either require inference-time optimization or joint training with a generative adversarial model which makes them time-consuming and difficult to use in practice. We propose a novel approach, Cycle-Consistent Counterfactuals by Latent Transformations (C3LT), which learns a latent transformation that automatically generates visual CFs by steering in the latent space of generative models. Our method uses cycle consistency between the query and CF latent representations which helps our training to find better solutions. C3LT can be easily plugged into any state-of-the-art pretrained generative network. This enables our method to generate high-quality and interpretable CF images at high resolution such as those in ImageNet. In addition to several established metrics for evaluating CF explanations, we introduce a novel metric tailored to assess the quality of the generated CF examples and validate the effectiveness of our method on an extensive set of experiments.
# 3MASSIV: Multilingual, Multimodal and Multi-Aspect Dataset of Social Media Short Videos

We present 3MASSIV, a multilingual, multimodal and multi-aspect, expertly-annotated dataset of diverse short videos extracted from the short-video social media platform Moj. 3MASSIV comprises 50K short videos (20 seconds average duration) and 100K unlabeled videos in 11 different languages and captures popular short video trends like pranks, fails, romance, and comedy expressed via unique audio-visual formats like self-shot videos, reaction videos, lip-synching, self-sung songs, etc. 3MASSIV presents an opportunity for multimodal and multilingual semantic understanding on these unique videos by annotating them for concepts, affective states, media types, and audio language. We present a thorough analysis of 3MASSIV and highlight the variety and unique aspects of our dataset compared to other contemporary popular datasets with strong baselines. We also show how the social media content in 3MASSIV is dynamic and temporal in nature, which can be used for semantic understanding tasks and cross-lingual analysis.
# PAEDID: Patch Autoencoder Based Deep Image Decomposition for Pixel-Level Defective Region Segmentation

Unsupervised pixel-level defective region segmentation is an important task in image-based anomaly detection for various industrial applications. The state-of-the-art methods have their own advantages and limitations: matrix-decomposition-based methods are robust to noise but lack complex background image modeling capability; representation-based methods are good at defective region localization but lack accuracy in defective region shape contour extraction; the defective regions detected by reconstruction-based methods match the ground-truth defective region shape contours well but are noisy. To combine the strengths of these approaches, we present an unsupervised patch autoencoder based deep image decomposition (PAEDID) method for defective region segmentation. In the training stage, we learn the common background as a deep image prior by a patch autoencoder (PAE) network. In the inference stage, we formulate anomaly detection as an image decomposition problem with the deep image prior and domain-specific regularizations. By adopting the proposed approach, the defective regions in the image can be accurately extracted in an unsupervised fashion. We demonstrate the effectiveness of the PAEDID method in simulation studies and on an industrial dataset in the case study.
# Multi-Model Ensemble Learning Method for Expression Recognition

Analysis of human affect plays a vital role in human-computer interaction (HCI) systems. Due to the difficulty in capturing large amounts of real-life data, most of the current methods have mainly focused on controlled environments, which limit their application scenarios. To tackle this problem, we propose our solution based on the ensemble learning method. Specifically, we formulate the problem as a classification task, and then train several expression classification models with different types of backbones--ResNet, EfficientNet and InceptionNet. After that, the outputs of several models are fused via model ensemble method to predict the final results. Moreover, we introduce the multi-fold ensemble method to train and ensemble several models with the same architecture but different data distributions to enhance the performance of our solution. We conduct many experiments on the AffWild2 dataset of the ABAW2022 Challenge, and the results demonstrate the effectiveness of our solution.
# A Novel Remote Sensing Approach to Recognize and Monitor Red Palm Weevil in Date Palm Trees

The spread of the Red Palm Weevil (RPW) has become an existential threat for palm trees around the world. In the Middle East, RPW is causing widespread damage to the date palm Phoenix dactylifera L., with both agricultural impacts on palm production and environmental impacts. Early detection of RPW is very challenging, especially at large scale. This research proposes a novel remote sensing approach to recognize and monitor red palm weevil in date palm trees, using a combination of vegetation indices, object detection and semantic segmentation techniques. The study area consists of date palm trees with three classes, including healthy palms, smallish palms and severely infected palms. The proposed method achieved a promising 0.947 F1 score on the test data set. This work paves the way for deploying artificial intelligence approaches to monitor RPW at large scale as well as providing guidance for practitioners.
# DNN-Driven Compressive Offloading for Edge-Assisted Semantic Video Segmentation

Deep learning has shown impressive performance in semantic segmentation, but it is still unaffordable for resource-constrained mobile devices. While offloading computation tasks is promising, the high traffic demands overwhelm the limited bandwidth. Existing compression algorithms are not fit for semantic segmentation, as the lack of obvious and concentrated regions of interest (RoIs) forces the adoption of uniform compression strategies, leading to low compression ratios or accuracy. This paper introduces STAC, a DNN-driven compression scheme tailored for edge-assisted semantic video segmentation. STAC is the first to exploit DNN's gradients as spatial sensitivity metrics for spatial adaptive compression and achieves superior compression ratio and accuracy. Yet, it is challenging to adapt this content-customized compression to videos. Practical issues include varying spatial sensitivity and huge bandwidth consumption for compression strategy feedback and offloading. We tackle these issues through a spatiotemporal adaptive scheme, which (1) takes partial strategy generation operations offline to reduce communication load, and (2) propagates compression strategies and segmentation results across frames through dense optical flow, and adaptively offloads keyframes to accommodate video content. We implement STAC on a commodity mobile device. Experiments show that STAC can save up to 20.95% of bandwidth without losing accuracy, compared to the state-of-the-art algorithm.
# A Deep Learning Model Leveraging Clinically Relevant Biometric Constraints for Accurate Sonographic Measurements of the Fetal Brain

Multiple studies have demonstrated that obtaining standardized fetal brain biometry from mid-trimester ultrasonography (USG) examination is key for the reliable assessment of fetal neurodevelopment and the screening of central nervous system (CNS) anomalies. Obtaining these measurements is highly subjective, expertise-driven, and requires years of training experience, limiting quality prenatal care for all pregnant mothers. In this study, we propose a deep learning (DL) approach to compute 3 key fetal brain biometry from the 2D USG images of the transcerebellar plane (TC) through the accurate and automated caliper placement (2 per biometry) by modeling it as a landmark detection problem. We leveraged clinically relevant biometric constraints (relationship between caliper points) and domain-relevant data augmentation to improve the accuracy of a U-Net DL model (trained/tested on: 596 images, 473 subjects/143 images, 143 subjects). We performed multiple experiments demonstrating the effect of the DL backbone, data augmentation, generalizability and benchmarked against a recent state-of-the-art approach through extensive clinical validation (DL vs. 7 experienced clinicians). For all cases, the mean errors in the placement of the individual caliper points and the computed biometry were comparable to error rates among clinicians. The clinical translation of the proposed framework can assist novice users from low-resource settings in the reliable and standardized assessment of fetal brain sonograms.
# Enhancing Neural Mathematical Reasoning by Abductive Combination with a Symbolic Library

Mathematical reasoning recently has been shown as a hard challenge for neural systems. Abilities including expression translation, logical reasoning, and mathematics knowledge acquiring appear to be essential to overcome the challenge. This paper demonstrates that some abilities can be achieved through abductive combination with discrete systems that have been programmed with human knowledge. On a mathematical reasoning dataset, we adopt the recently proposed abductive learning framework, and propose the ABL-Sym algorithm that combines the Transformer neural models with a symbolic mathematics library. ABL-Sym shows 9.73% accuracy improvement on the interpolation tasks and 47.22% accuracy improvement on the extrapolation tasks, over the state-of-the-art approaches. Online demonstration: http://math.polixir.ai
# ARCS: Accurate Rotation and Correspondence Search

This paper is about the old Wahba problem in its more general form, which we call "simultaneous search of rotation and correspondences". In this generalization we need to find a rotation that best aligns two partially overlapping $3$D point sets, of sizes $m$ and $n$ respectively with $m\geq n$. We first propose a solver, $\texttt{ARCS}$, that i) assumes noiseless point sets in general position, ii) requires only $2$ inliers, iii) uses $O(m\log m)$ time and $O(m)$ space, and iv) can successfully solve the problem even with, e.g., $m,n\sim 10^6$ in about $0.1$ seconds. We next robustify $\texttt{ARCS}$ to noise, for which we approximately solve consensus maximization problems using ideas from robust subspace learning and interval stabbing. Thirdly, we refine the approximately found consensus set by a Riemannian subgradient descent approach over the space of unit quaternions, which we show converges globally to an $\varepsilon$-stationary point in $O(\varepsilon^{-4})$ iterations, or locally to the ground-truth at a linear rate in the absence of noise. We combine these algorithms into $\texttt{ARCS+}$, to simultaneously search for rotations and correspondences. Experiments show that $\texttt{ARCS+}$ achieves state-of-the-art performance on large-scale datasets with more than $10^6$ points with a $10^4$ time-speedup over alternative methods. https://github.com/liangzu/ARCS
# Conjugate Gradient Method for Generative Adversarial Networks

While the generative model has many advantages, it is not feasible to calculate the Jensen-Shannon divergence of the density function of the data and the density function of the model of deep neural networks; for this reason, various alternative approaches have been developed. Generative adversarial networks (GANs) can be used to formulate this problem as a discriminative problem with two models, a generator and a discriminator whose learning can be formulated in the context of game theory and the local Nash equilibrium. Since this optimization is more difficult than minimization of a single objective function, we propose to apply the conjugate gradient method to solve the local Nash equilibrium problem in GANs. We give a proof and convergence analysis under mild assumptions showing that the proposed method converges to a local Nash equilibrium with three different learning-rate schedules including a constant learning rate. Furthermore, we demonstrate the convergence of a simple toy problem to a local Nash equilibrium and compare the proposed method with other optimization methods in experiments using real-world data, finding that the proposed method outperforms stochastic gradient descent (SGD) and momentum SGD.
# ANNA: Enhanced Language Representation for Question Answering

Pre-trained language models have brought significant improvements in performance in a variety of natural language processing tasks. Most existing models that achieve state-of-the-art results present their approaches from the separate perspectives of data processing, pre-training tasks, neural network modeling, or fine-tuning. In this paper, we demonstrate how these approaches affect performance individually, and that the language model achieves the best results on a specific question answering task when they are jointly considered in pre-training. In particular, we propose an extended pre-training task, and a new neighbor-aware mechanism that attends more to neighboring tokens to capture the richness of context for pre-training language modeling. Our best model achieves new state-of-the-art results of 95.7% F1 and 90.6% EM on SQuAD 1.1 and also outperforms existing pre-trained language models such as RoBERTa, ALBERT, ELECTRA, and XLNet on the SQuAD 2.0 benchmark.
# Encode-in-Style: Latent-Based Video Encoding using StyleGAN2

We propose an end-to-end facial video encoding approach that facilitates data-efficient high-quality video re-synthesis by optimizing low-dimensional edits of a single Identity-latent. The approach builds on StyleGAN2 image inversion and multi-stage non-linear latent-space editing to generate videos that are nearly comparable to input videos. It economically captures face identity, head-pose, and complex facial motions at fine levels, and thereby bypasses training and person modeling which tend to hamper many re-synthesis approaches. The approach is designed with maximum data efficiency, where a single W+ latent and 35 parameters per frame enable high-fidelity video rendering. This pipeline can also be used for puppeteering (i.e., motion transfer).
# Distributed Finite-Sum Constrained Optimization Subject to Nonlinearity on the Node Dynamics

Motivated by recent developments in networking and parallel data processing, we consider a distributed and localized finite-sum (or fixed-sum) allocation technique to solve resource-constrained convex optimization problems over multi-agent networks (MANs). Such networks consist of (smart) agents, each representing an intelligent entity capable of communication, processing, and decision-making. In particular, we consider problems subject to practical nonlinear constraints on the dynamics of the agents in terms of their communication and actuation capabilities (referred to as the node dynamics), e.g., networks of mobile robots subject to actuator saturation and quantized communication. The considered distributed sum-preserving optimization solution further enables adding purposeful nonlinear constraints, for example, sign-based nonlinearities, to reach convergence in predefined time or to be robust to impulsive noise and disturbances in faulty environments. Moreover, convergence can be achieved under minimal network connectivity requirements among the agents; thus, the solution is applicable over dynamic networks where the channels come and go due to the agents' mobility and limited range. This paper discusses how various nonlinearity constraints on the optimization problem (e.g., collaborative allocation of resources) can be addressed for different applications via a distributed setup (over a network).
# Specialized Document Embeddings for Aspect-Based Similarity of Research Papers

Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms the document coherence, the pairwise classification approach scales poorly to large-scale corpora. In this paper, we treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach avoids document segmentation and scales linearly w.r.t. the corpus size. In an empirical study, we use the Papers with Code corpus containing 157,606 research papers and consider the task, method, and dataset of the respective research papers as their aspects. We compare and analyze three generic document embeddings, six specialized document embeddings and a pairwise classification baseline in the context of research paper recommendations. As generic document embeddings, we consider FastText, SciBERT, and SPECTER. To compute the specialized document embeddings, we compare three alternative methods inspired by retrofitting, fine-tuning, and Siamese networks. In our experiments, Siamese SciBERT achieved the highest scores. Additional analyses indicate an implicit bias of the generic document embeddings towards the dataset aspect and against the method aspect of each research paper. Our approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit.
# UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning

Supervised deep learning methods require a large repository of annotated data; hence, label noise is inevitable. Training with such noisy data negatively impacts the generalization performance of deep neural networks. To combat label noise, recent state-of-the-art methods employ some sort of sample selection mechanism to select a possibly clean subset of data. Next, an off-the-shelf semi-supervised learning method is used for training where rejected samples are treated as unlabeled data. Our comprehensive analysis shows that current selection methods disproportionately select samples from easy (fast learnable) classes while rejecting those from relatively harder ones. This creates class imbalance in the selected clean set and in turn deteriorates performance under high label noise. In this work, we propose UNICON, a simple yet effective sample selection method which is robust to high label noise. To address the disproportionate selection of easy and hard samples, we introduce a Jensen-Shannon divergence based uniform selection mechanism which does not require any probabilistic modelling or hyperparameter tuning. We complement our selection method with contrastive learning to further combat memorization of noisy labels. Extensive experimentation on multiple benchmark datasets demonstrates the effectiveness of UNICON; we obtain an 11.4% improvement over the current state-of-the-art on the CIFAR100 dataset with a 90% noise rate.
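
The selection idea above can be sketched in a few lines, purely as an illustration and not as the authors' implementation. The sketch scores each sample by the Jensen-Shannon divergence between its predicted class distribution and its one-hot (possibly noisy) label, then keeps the same number of lowest-divergence samples from every class. The function names and the fixed `per_class` budget are assumptions made for this example; the abstract does not say how the per-class count is chosen.

```python
import math
from collections import defaultdict


def js_divergence(p, q):
    """Jensen-Shannon divergence in base 2, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero first argument contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def uniform_select(probs, labels, num_classes, per_class):
    """Keep the `per_class` lowest-JSD samples of every class.

    `per_class` is a hypothetical fixed budget used only for illustration.
    Returns the sorted indices of the selected samples.
    """
    by_class = defaultdict(list)
    for idx, (p, y) in enumerate(zip(probs, labels)):
        one_hot = [1.0 if c == y else 0.0 for c in range(num_classes)]
        by_class[y].append((js_divergence(p, one_hot), idx))
    selected = []
    for y in sorted(by_class):
        selected.extend(idx for _, idx in sorted(by_class[y])[:per_class])
    return sorted(selected)
```

Taking an equal number of samples per class, rather than applying one global threshold, is what keeps easy classes from dominating the selected set in this sketch.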
# The Moral Debater: A Study on the Computational Generation of Morally Framed Arguments

An audience's prior beliefs and morals are strong indicators of how likely they are to be affected by a given argument. Utilizing such knowledge can help focus on shared values to bring disagreeing parties towards agreement. In argumentation technology, however, this has barely been exploited so far. This paper studies the feasibility of automatically generating morally framed arguments as well as their effect on different audiences. Following the moral foundation theory, we propose a system that effectively generates arguments focusing on different morals. In an in-depth user study, we ask liberals and conservatives to evaluate the impact of these arguments. Our results suggest that, particularly when prior beliefs are challenged, an audience becomes more affected by morally framed arguments.
# Digitization of Bioassays in the Open Research Knowledge Graph

Background: Recent years have seen a growing impetus in the semantification of scholarly knowledge at the fine-grained level of scientific entities in knowledge graphs. The Open Research Knowledge Graph (ORKG) https://www.orkg.org/ represents an important step in this direction, with thousands of scholarly contributions as structured, fine-grained, machine-readable data. There is a need, however, to engender change in traditional community practices of recording contributions as unstructured, non-machine-readable text. For this in turn, there is a strong need for AI tools designed for scientists that permit easy and accurate semantification of their scholarly contributions. We present one such tool, ORKG-assays. Implementation: ORKG-assays is a freely available AI micro-service in ORKG, written in Python, designed to help scientists obtain semantified bioassays as a set of triples. It uses an AI-based clustering algorithm which, on gold-standard evaluations over 900 bioassays with 5,514 unique property-value pairs for 103 predicates, shows competitive performance. Results and Discussion: As a result, semantified assay collections can be surveyed on the ORKG platform via tabulation or chart-based visualizations of key property values of the chemicals and compounds, offering smart knowledge access to biochemists and pharmaceutical researchers in the advancement of drug development.
# Computer Science Named Entity Recognition in the Open Research Knowledge Graph

Domain-specific named entity recognition (NER) on Computer Science (CS) scholarly articles is an information extraction task that is arguably more challenging for the various annotation aims that can beset the task and has been less studied than NER in the general domain. Given that significant progress has been made on NER, we believe that scholarly domain-specific NER will receive increasing attention in the years to come. Currently, progress on CS NER -- the focus of this work -- is hampered in part by its recency and the lack of a standardized annotation aim for scientific entities/terms. This work proposes a standardized task by defining a set of seven contribution-centric scholarly entities for CS NER, viz., research problem, solution, resource, language, tool, method, and dataset. Its main contributions are as follows: it combines existing CS NER resources that maintain their annotation focus on the set or subset of contribution-centric scholarly entities we consider; further, noting the need for big data to train neural NER models, it additionally supplies thousands of contribution-centric entity annotations from article titles and abstracts, thus releasing a cumulative large novel resource for CS NER; and, finally, it trains a sequence labeling CS NER model inspired by state-of-the-art neural architectures from the general domain NER task. Throughout the work, several practical considerations are made which can be useful to information technology designers of digital libraries.
# The SAME Score: Improved Cosine-Based Bias Score for Word Embeddings

Over the last few years, word and sentence embeddings have become established as a text preprocessing step for all kinds of NLP tasks and have significantly improved performance on these tasks. Unfortunately, it has also been shown that these embeddings inherit various kinds of biases from the training data and thereby pass on biases present in society to NLP solutions. Many papers have attempted to quantify bias in word or sentence embeddings to evaluate debiasing methods or compare different embedding models, often with cosine-based scores. However, some works have raised doubts about these scores, showing that even though they report low biases, biases persist and can be shown with other tests. In fact, there is a great variety of bias scores or tests proposed in the literature without any consensus on the optimal solutions. We lack works that study the behavior of bias scores and elaborate on their advantages and disadvantages. In this work, we explore different cosine-based bias scores. We provide a bias definition based on the ideas from the literature and derive novel requirements for bias scores. Furthermore, we thoroughly investigate the existing cosine-based scores and their limitations in order to show why these scores fail to report biases in some situations. Finally, we propose a new bias score, SAME, to address the shortcomings of existing bias scores and show empirically that SAME is better suited to quantify biases in word embeddings.
# Adaptation to CT Reconstruction Kernels by Enforcing Cross-Domain Feature Map Consistency

Deep learning methods provide significant assistance in analyzing coronavirus disease (COVID-19) in chest computed tomography (CT) images, including identification, severity assessment, and segmentation. Although the earlier developed methods address the lack of data and specific annotations, the current goal is to build a robust algorithm for clinical use, having a larger pool of available data. With the larger datasets, the domain shift problem arises, affecting the performance of methods on the unseen data. One of the critical sources of domain shift in CT images is the difference in reconstruction kernels used to generate images from the raw data (sinograms). In this paper, we show a decrease in the COVID-19 segmentation quality of the model trained on the smooth and tested on the sharp reconstruction kernels. Furthermore, we compare several domain adaptation approaches to tackle the problem, such as task-specific augmentation and unsupervised adversarial learning. Finally, we propose the unsupervised adaptation method, called F-Consistency, that outperforms the previous approaches. Our method exploits a set of unlabeled CT image pairs which differ only in reconstruction kernels within every pair. It enforces the similarity of the network hidden representations (feature maps) by minimizing mean squared error (MSE) between paired feature maps. We show our method achieving 0.64 Dice Score on the test dataset with unseen sharp kernels, compared to the 0.56 Dice Score of the baseline model. Moreover, F-Consistency scores 0.80 Dice Score between predictions on the paired images, which almost doubles the baseline score of 0.46 and surpasses the other methods. We also show F-Consistency to better generalize on the unseen kernels and without the specific semantic content, e.g., presence of the COVID-19 lesions.
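
The two quantities named in this abstract, the mean squared error between paired feature maps and the Dice Score, can be written down as a small sketch. This is only an illustration under simplifying assumptions: feature maps and masks are flattened into plain Python lists, the function names are hypothetical, and returning 1.0 for two empty masks is a convention chosen here rather than something stated in the abstract.

```python
def feature_consistency_mse(feat_a, feat_b):
    """Mean squared error between two flattened, paired feature maps.

    In the setting described above, feat_a and feat_b would come from the
    same network layer for two images that differ only in the reconstruction
    kernel.
    """
    if len(feat_a) != len(feat_b) or not feat_a:
        raise ValueError("feature maps must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)


def dice_score(pred, target):
    """Dice Score between two flattened binary masks (entries 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have equal length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2 * intersection / total
```

Minimizing the first function over unlabeled image pairs pushes the hidden representations of the two kernels together, while the second is the kind of overlap measure used to report the segmentation and prediction-consistency numbers.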
# (参考訳) 潰瘍性および非潰瘍性患者の足底軟組織分布と均質性に関するelastographyによる定量的比較

The primary objective of this study was to develop a method that allows accurate quantification of plantar soft tissue stiffness distribution and homogeneity. The secondary aim was to investigate whether differences in soft tissue stiffness distribution and homogeneity can be detected between ulcerated and non-ulcerated feet. Novel measures of individual pixel stiffness, named quantitative strainability (QS) and relative strainability (RS), were developed. SE data were obtained from 39 patients with diabetic neuropathy (9 with active diabetic foot ulcers). The patients with active diabetic foot ulcers had wounds in parts of the foot other than the first metatarsal head and the heel, where the elastography measurements were conducted. RS was used to measure changes and gradients in the stiffness distribution of plantar soft tissues in participants with and without active diabetic foot ulcers. The plantar soft tissue homogeneity in the superior-inferior direction in the left forefoot was significantly (p<0.05) higher in the ulcerated group than in the non-ulcerated group. The assessment of homogeneity showed potential to further explain the nature of the change in tissue that can increase internal stress. This can have implications for assessing the vulnerability to soft tissue damage and ulceration in diabetes.
# (参考訳) 低リソース言語のための同型言語間埋め込み

Cross-Lingual Word Embeddings (CLWEs) are a key component to transfer linguistic information learnt from higher-resource settings into lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficacy, and ability to work with minimal parallel resources. However, they crucially depend on the assumption of embedding spaces being approximately isomorphic, i.e., sharing similar geometric structure, which does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. In our work, we first pre-align the low-resource and related language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the related language and the target embedding space. Finally, we remap the pre-aligned low-resource space and the target space to generate the final CLWEs. We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity respectively, across several language pairs: {Nepali, Finnish, Romanian, Gujarati, Hungarian}-English. Lastly, our analysis also points to the relatedness as well as the amount of related language data available as being key factors in determining the quality of embeddings achieved.
# (参考訳) シンボリック音楽作曲のための深層学習モデルの主観評価

Deep learning models are typically evaluated to measure and compare their performance on a given task. The metrics that are commonly used to evaluate these models are standard metrics that are used for different tasks. In the field of music composition or generation, the standard metrics used in other fields have no clear meaning in terms of music theory. In this paper, we propose a subjective method to evaluate AI-based music composition systems by asking questions related to basic music principles to different levels of users based on their musical experience and knowledge. We use this method to compare state-of-the-art models for music composition with deep learning. We give the results of this evaluation method and we compare the responses of each user level for each evaluated model.
# (参考訳) モデルベース価値拡大の再検討

Model-based value expansion methods promise to improve the quality of value function targets and, thereby, the effectiveness of value function learning. However, to date, these methods are being outperformed by Dyna-style algorithms with conceptually simpler 1-step value function targets. This shows that in practice, the theoretical justification of value expansion does not seem to hold. We provide a thorough empirical study to shed light on the causes of the failure of value expansion methods in practice, which is commonly attributed to compounding model error. By leveraging GPU-based physics simulators, we are able to efficiently use the true dynamics for analysis inside the model-based reinforcement learning loop. Performing extensive comparisons between true and learned dynamics sheds light into this black box. This paper provides a better understanding of the actual problems in value expansion. We provide future directions of research by empirically testing the maximum theoretical performance of current approaches.
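To make the contrast between 1-step and multi-step targets concrete, here is a minimal sketch of an H-step value expansion target, assuming the common form $\sum_{t=0}^{H-1}\gamma^t r_t + \gamma^H V(s_H)$. The abstract does not spell out the exact target used in the paper, so the formula, the function name `value_expansion_target`, and the numbers below are illustrative assumptions.

```python
from typing import Sequence


def value_expansion_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """H-step target: discounted rewards along a model rollout plus a
    discounted bootstrap from the value estimate at the final state.

    len(rewards) is the horizon H; H = 1 recovers the 1-step target.
    """
    target = 0.0
    discount = 1.0
    for reward in rewards:
        target += discount * reward
        discount *= gamma
    # After the loop, discount equals gamma ** H.
    return target + discount * bootstrap_value


# Hypothetical rollout: three model-predicted rewards, then V(s_3) = 10.
three_step = value_expansion_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
one_step = value_expansion_target([1.0], bootstrap_value=10.0, gamma=0.5)
```

With a learned model, every additional step feeds predicted states back into the model, which is where compounding model error can enter; replacing the rollout with true dynamics, as the study does, removes that source of error from the target.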
# (参考訳) 位置認識ニューロンを用いた連合学習

Federated Learning (FL) fuses collaborative models from local nodes without centralizing users' data. The permutation invariance property of neural networks and the non-i.i.d. data across clients make the locally updated parameters imprecisely aligned, disabling the coordinate-based parameter averaging. Traditional neurons do not explicitly consider position information. Hence, we propose Position-Aware Neurons (PANs) as an alternative, fusing position-related values (i.e., position encodings) into neuron outputs. PANs couple themselves to their positions and minimize the possibility of dislocation, even updating on heterogeneous data. We turn on/off PANs to disable/enable the permutation invariance property of neural networks. PANs are tightly coupled with positions when applied to FL, making parameters across clients pre-aligned and facilitating coordinate-based parameter averaging. PANs are algorithm-agnostic and could universally improve existing FL algorithms. Furthermore, "FL with PANs" is simple to implement and computationally friendly.
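The alignment problem that motivates PANs can be illustrated with a toy example. The sketch below does not implement PANs themselves (the abstract gives no formula for the position encodings); it only shows, for a hypothetical two-layer ReLU network, that permuting hidden neurons leaves the function unchanged while coordinate-wise averaging of two such permuted copies does not. All names and weights are illustrative assumptions.

```python
def forward(w1, w2, x):
    """Two-layer network: ReLU hidden layer followed by a linear output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(v * h for v, h in zip(w2, hidden))


def permute_hidden(w1, w2, perm):
    """Reorder hidden neurons; the network computes the same function."""
    return [w1[i] for i in perm], [w2[i] for i in perm]


def average_params(params_a, params_b):
    """Coordinate-based parameter averaging of two networks."""
    (w1_a, w2_a), (w1_b, w2_b) = params_a, params_b
    w1 = [[(a + b) / 2 for a, b in zip(row_a, row_b)] for row_a, row_b in zip(w1_a, w1_b)]
    w2 = [(a + b) / 2 for a, b in zip(w2_a, w2_b)]
    return w1, w2


w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [1.0, -1.0]
x = [2.0, 1.0]

original = forward(w1, w2, x)
w1_perm, w2_perm = permute_hidden(w1, w2, [1, 0])
permuted = forward(w1_perm, w2_perm, x)  # same output as `original`

# Averaging the two equivalent networks coordinate by coordinate mixes
# unrelated neurons and yields a different function.
w1_avg, w2_avg = average_params((w1, w2), (w1_perm, w2_perm))
averaged = forward(w1_avg, w2_avg, x)
```

Here two clients holding functionally identical models would still produce a degraded average, which is the misalignment that position-dependent neuron outputs are meant to discourage.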
# (参考訳) 機械学習を用いたインドヒマラヤ地域の衛星画像時系列からの開地作付け地図の作成

Crop maps are crucial for agricultural monitoring and food management and can additionally support domain-specific applications, such as setting cold supply chain infrastructure in developing countries. Machine learning (ML) models, combined with freely-available satellite imagery, can be used to produce cost-effective and high spatial-resolution crop maps. However, accessing ground truth data for supervised learning is especially challenging in developing countries due to factors such as smallholding and fragmented geography, which often results in a lack of crop type maps or even reliable cropland maps. Our area of interest for this study lies in Himachal Pradesh, India, where we aim at producing an open-access binary cropland map at 10-meter resolution for the Kullu, Shimla, and Mandi districts. To this end, we developed an ML pipeline that relies on Sentinel-2 satellite image time series. We investigated two pixel-based supervised classifiers, support vector machines (SVM) and random forest (RF), which are used to classify per-pixel time series for binary cropland mapping. The ground truth data used for training, validation and testing was manually annotated from a combination of field survey reference points and visual interpretation of very high resolution (VHR) imagery. We trained and validated the models via spatial cross-validation to account for local spatial autocorrelation and selected the RF model due to overall robustness and lower computational cost. We tested the generalization capability of the chosen model at the pixel level by computing the accuracy, recall, precision, and F1-score on hold-out test sets of each district, achieving an average accuracy for the RF (our best model) of 87%. We used this model to generate a cropland map for three districts of Himachal Pradesh, spanning 14,600 km², which improves the resolution and quality of existing public maps.
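The pixel-level metrics named above follow directly from the confusion counts of a binary classifier. The sketch below uses the standard definitions; the function name `binary_metrics` and the label vectors are hypothetical and are not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = cropland)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical hold-out pixels: ground truth versus classifier output.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Reporting these per district, as the abstract describes, amounts to calling the function once per hold-out set and averaging the resulting accuracies.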
# (参考訳) amcad:適応型混合曲率表現に基づく広告検索システム

Graph embedding based retrieval has become one of the most popular techniques in the information retrieval community and search engine industry. The classical paradigm mainly relies on the flat Euclidean geometry. In recent years, hyperbolic (negative curvature) and spherical (positive curvature) representation methods have shown their superiority in capturing hierarchical and cyclic data structures, respectively. However, in industrial scenarios such as e-commerce sponsored search platforms, the large-scale heterogeneous query-item-advertisement interaction graphs often have multiple structures coexisting. Existing methods either consider only a single geometric space or combine several spaces manually, and are thus too inflexible to model the complexity and heterogeneity of real scenarios. To tackle this challenge, we present a web-scale Adaptive Mixed-Curvature ADvertisement retrieval system (AMCAD) to automatically capture the complex and heterogeneous graph structures in non-Euclidean spaces. Specifically, entities are represented in adaptive mixed-curvature spaces, where the types and curvatures of the subspaces are trained to be optimal combinations. Besides, an attentive edge-wise space projector is designed to model the similarities between heterogeneous nodes according to local graph structures and the relation types. Moreover, to deploy AMCAD in Taobao, one of the largest e-commerce platforms with hundreds of millions of users, we design an efficient two-layer online retrieval framework for the task of graph based advertisement retrieval. Extensive evaluations on real-world datasets and A/B tests on online traffic are conducted to illustrate the effectiveness of the proposed system.
# (参考訳) reptile: 積極的なリアルタイム深層強化学習自己適応フレームワーク

In this work, a general framework is proposed to support the development of software systems that are able to adapt their behaviour according to changes in the operating environment. The proposed approach, named REPTILE, works in a completely proactive manner and relies on Deep Reinforcement Learning-based agents to react to events, referred to as novelties, that can affect the expected behaviour of the system. In our framework, two types of novelties are taken into account: those related to the context/environment and those related to the physical architecture itself. The framework, predicting those novelties before their occurrence, extracts time-changing models of the environment and uses a suitable Markov Decision Process to deal with the real-time setting. Moreover, the architecture of our RL agent evolves based on the possible actions that can be taken.
# (参考訳) 限られたデータを用いた話者認識システム

This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work. We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset. These subsets are restricted to 50 k audio files (versus over 1 M files available), and vary on the axis of number of speakers and session variability. We train three speaker recognition systems on these subsets; the X-vector, ECAPA-TDNN, and wav2vec2 network architectures. We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited. Code and data subsets are available at \url{https://github.com/nikvaessen/w2v2-speaker-few-samples}.
# (参考訳) LiDARCap:LiDARポイント雲を用いた長距離マーカーレス3Dモーションキャプチャ

Existing motion capture datasets are largely short-range and cannot yet fit the need of long-range applications. We propose LiDARHuman26M, a new human motion capture dataset captured by LiDAR at a much longer range to overcome this limitation. Our dataset also includes the ground truth human motions acquired by the IMU system and the synchronous RGB images. We further present a strong baseline method, LiDARCap, for LiDAR point cloud human motion capture. Specifically, we first utilize PointNet++ to encode features of points and then employ the inverse kinematics solver and SMPL optimizer to regress the pose through aggregating the temporally encoded features hierarchically. Quantitative and qualitative experiments show that our method outperforms the techniques based only on RGB images. Ablation experiments demonstrate that our dataset is challenging and worthy of further research. Finally, the experiments on the KITTI Dataset and the Waymo Open Dataset show that our method can be generalized to different LiDAR sensor settings.
# (参考訳) MSTR: エンドツーエンドのヒューマンオブジェクトインタラクション検出のためのマルチスケールトランス

Human-Object Interaction (HOI) detection is the task of identifying a set of <human, object, interaction> triplets from an image. Recent work proposed transformer encoder-decoder architectures that successfully eliminated the need for many hand-designed components in HOI detection through end-to-end training. However, they are limited to single-scale feature resolution, providing suboptimal performance in scenes containing humans, objects and their interactions with vastly different scales and distances. To tackle this problem, we propose a Multi-Scale TRansformer (MSTR) for HOI detection powered by two novel HOI-aware deformable attention modules called Dual-Entity attention and Entity-conditioned Context attention. While existing deformable attention comes at a huge cost in HOI detection performance, our proposed attention modules of MSTR learn to effectively attend to sampling points that are essential to identify interactions. In experiments, we achieve the new state-of-the-art performance on two HOI detection benchmarks.
# (参考訳) 画像テキスト検索:最近の研究・開発に関する調査

In the past few years, cross-modal image-text retrieval (ITR) has experienced increased interest in the research community due to its excellent research value and broad real-world application. It is designed for the scenarios where the queries are from one modality and the retrieval galleries from another modality. This paper presents a comprehensive and up-to-date survey on the ITR approaches from four perspectives. By dissecting an ITR system into two processes, feature extraction and feature alignment, we summarize the recent advances of the ITR approaches from these two perspectives. On top of this, the efficiency-focused study on the ITR system is introduced as the third perspective. To keep pace with the times, we also provide a pioneering overview of the cross-modal pre-training ITR approaches as the fourth perspective. Finally, we outline the common benchmark datasets and evaluation metrics for ITR, and conduct an accuracy comparison among the representative ITR approaches. Some critical yet less studied issues are discussed at the end of the paper.
# (参考訳) PIT(Pruning In Time) - 時間的畳み込みネットワークのための軽量ネットワークアーキテクチャ最適化

Temporal Convolutional Networks (TCNs) are promising Deep Learning models for time-series processing tasks. One key feature of TCNs is time-dilated convolution, whose optimization requires extensive experimentation. We propose an automatic dilation optimizer, which tackles the problem as weight pruning on the time axis, and learns dilation factors together with weights in a single training run. Our method reduces the model size and inference latency on a real SoC hardware target by up to 7.4x and 3x, respectively, with no accuracy drop compared to a network without dilation. It also yields a rich set of Pareto-optimal TCNs starting from a single model, outperforming hand-designed solutions in both size and accuracy.
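The idea of treating dilation as pruning on the time axis can be illustrated with a small equivalence check: a dense causal convolution whose filter taps are masked at regular intervals produces the same output as a shorter dilated convolution. The sketch below fixes the mask by hand, whereas the method described above learns it during training; the helper names and weights are assumptions for illustration only.

```python
def causal_conv1d(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zero left-padding."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def dilation_mask(kernel_size, dilation):
    """Keep every `dilation`-th tap of a dense filter, prune the rest."""
    return [1.0 if k % dilation == 0 else 0.0 for k in range(kernel_size)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dense_weights = [0.5, -1.0, 2.0, 0.25, 1.5]  # hypothetical dense filter, 5 taps

# Pruning taps 1 and 3 turns the 5-tap filter into a 3-tap filter with dilation 2.
mask = dilation_mask(len(dense_weights), dilation=2)
pruned_weights = [w * m for w, m in zip(dense_weights, mask)]
kept_weights = [w for w, m in zip(dense_weights, mask) if m == 1.0]

y_masked = causal_conv1d(x, pruned_weights, dilation=1)
y_dilated = causal_conv1d(x, kept_weights, dilation=2)
```

The two outputs agree, but the dilated form stores 3 weights instead of 5, which is the source of the size and latency savings once the mask has been chosen.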
# (参考訳) 5Gルーティング干渉環境

5G is the next-generation cellular network technology, with the goal of meeting the critical demand for bandwidth required to accommodate a high density of users. It employs flexible architectures to accommodate the high density \cite{7390965}. 5G is enabled by mmWave communication, which operates at frequencies ranging from 30 to 300 GHz. This paper discusses the creation of a Python-based environment known as the 5G Routing Interfered Environment (5GRIE). The environment can run different algorithms to route packets with source and destination pairs using a formulated interference model. Deep Reinforcement Learning algorithms that use Stable-Baselines3 \cite{Raffin_Stable_Baselines3_2020}, as well as heuristic-based algorithms like random or greedy, can be run on it. Profitable is an algorithm that is provided.
# (参考訳) mixnn: ディープラーニングモデルを保護するための設計

In this paper, we propose a novel design, called MixNN, for protecting deep learning model structure and parameters. The layers in a deep learning model of MixNN are fully decentralized. It hides communication addresses, layer parameters and operations, and forward as well as backward message flows among non-adjacent layers using the ideas from mix networks. MixNN has the following advantages: 1) an adversary cannot fully control all layers of a model, including the structure and parameters, 2) even if some layers collude, they cannot tamper with other honest layers, 3) model privacy is preserved in the training phase. We provide detailed descriptions for deployment. In one classification experiment, we compared a neural network deployed in a virtual machine with the same one using the MixNN design on AWS EC2. The result shows that MixNN differs by less than 0.001 in classification accuracy, while the whole running time of MixNN is about 7.5 times slower than the one running on a single virtual machine.
# (参考訳) 部分的に行う:部分入力によるシーンレベルFG-SBIRに向けて

We scrutinise an important observation plaguing scene-level sketch research -- that a significant portion of scene sketches are "partial". A quick pilot study reveals: (i) a scene sketch does not necessarily contain all objects in the corresponding photo, due to the subjective holistic interpretation of scenes, (ii) there exist significant empty (white) regions as a result of object-level abstraction, and as a result, (iii) existing scene-level fine-grained sketch-based image retrieval methods collapse as scene sketches become more partial. To solve this "partial" problem, we advocate for a simple set-based approach using optimal transport (OT) to model cross-modal region associativity in a partially-aware fashion. Importantly, we improve upon OT to further account for holistic partialness by comparing intra-modal adjacency matrices. Our proposed method is not only robust to partial scene-sketches but also yields state-of-the-art performance on existing datasets.
# (参考訳) クラウドファンディング成功予測のための画像特徴抽出

S. J. Blanchard, T. J. Noseworthy, E. Pancer, M. Poole(参考訳) クラウドファンディングプラットフォームに関する実証研究の増加と視覚情報の普及にもかかわらず、運用管理とマーケティングの文献は、イメージ特性がクラウドファンディングの成功に果たす役割を探求していない。 この原稿の著者は、視覚処理に関する文献を合成し、クラウドファンディングの成功を形作る可能性のあるいくつかの画像の特徴を特定することから始める。 それぞれの画像特性について詳細な測定を行った後、彼らは機械学習アルゴリズム(ベイジアン加法木)の一部として、プロジェクト特性とテキスト情報とともに、クラウドファンディングの成功を予測する。 その結果、これらの画像特性の包含は、ベースラインのプロジェクト変数に対する予測とテキスト的特徴を大幅に改善することが示された。 さらに、画像特性変数は、画像数と動画数に関連付けられた変数と同様に、重要度が高い。 この研究は、新しい製品の成功を確実にするための視覚情報の役割に関心がある研究者や管理者に貴重な資源を提供する。

Despite an increase in the empirical study of crowdfunding platforms and the prevalence of visual information, operations management and marketing literature has yet to explore the role that image characteristics play in crowdfunding success. The authors of this manuscript begin by synthesizing literature on visual processing to identify several image characteristics that are likely to shape crowdfunding success. After detailing measures for each image characteristic, they use them as part of a machine-learning algorithm (Bayesian additive trees), along with project characteristics and textual information, to predict crowdfunding success. Results show that the inclusion of these image characteristics substantially improves prediction over baseline project variables, as well as textual features. Furthermore, image characteristic variables exhibit high importance, similar to variables linked to the number of pictures and number of videos. This research therefore offers valuable resources to researchers and managers who are interested in the role of visual information in ensuring new product success.
# (参考訳) 算数条件付きデータ認識プロセスの健全性

Soundness of Data-Aware Processes with Arithmetic Conditions ( http://arxiv.org/abs/2203.14809v1 )

Paolo Felli, Marco Montali, Sarah Winkler(参考訳) データ・アウェア・プロセスは単一のモデルにおける構造的および行動的制約を表現・統合し、ビジネスプロセス管理や情報システム工学においてますます研究されている。 このスペクトルでは、単純さと表現性のバランスをとる能力により、データペトリネット(DPN)の人気が高まっている。 データと制御フローの相互作用は、そのようなモデルの正確性、特に音質の周知な特性、重要かつ困難性をチェックする。 DPNの健全性をチェックするための従来のアプローチの最大の欠点は、実世界の具体的なアプリケーションを扱う上で重要な特徴である算術のないデータ条件を考えることである。 本稿では,算術データ条件に富むDPNの健全性を評価するための基礎的かつ運用的な枠組みを提供することにより,このオープンな問題に対処する。 このフレームワークには概念実証実装が付属しており、アドホックな技術に頼るのではなく、既製のSMT技術を採用している。 この実装は、文献の例集と、そのような例から構築された合成変異体上で検証される。

Data-aware processes represent and integrate structural and behavioural constraints in a single model, and are thus increasingly investigated in business process management and information systems engineering. In this spectrum, Data Petri nets (DPNs) have gained increasing popularity thanks to their ability to balance simplicity with expressiveness. The interplay of data and control-flow makes checking the correctness of such models, specifically the well-known property of soundness, crucial and challenging. A major shortcoming of previous approaches for checking soundness of DPNs is that they consider data conditions without arithmetic, an essential feature when dealing with real-world, concrete applications. In this paper, we attack this open problem by providing a foundational and operational framework for assessing soundness of DPNs enriched with arithmetic data conditions. The framework comes with a proof-of-concept implementation that, instead of relying on ad-hoc techniques, employs off-the-shelf established SMT technologies. The implementation is validated on a collection of examples from the literature, and on synthetic variants constructed from such examples.
# (参考訳) 中国における衛星降水ダウンスケーリングのための注意機構に基づく畳み込みネットワーク

An attention mechanism based convolutional network for satellite precipitation downscaling over China ( http://arxiv.org/abs/2203.14812v1 )

Yinghong Jing, Liupeng Lin, Xinghua Li, Tongwen Li, Huanfeng Shen(参考訳) 降水は水循環の重要な部分であり、気候変動の敏感な指標である。 グローバル降水量測定(GPM)ミッション(IMERG)の総合的マルチサテライトE検索は,グローバルおよび地域降水量調査に広く利用されている。 しかし、局所的な応用は比較的粗い空間分解能によって制限される。 そこで本稿では,GPM IMERGの月間降水量のダウンスケールのために,注意機構に基づく畳み込みネットワーク(AMCN)を提案する。 提案手法は,大域的なクロスアテンションモジュール,多要素クロスアテンションモジュール,残留畳み込みモジュールからなるエンド・ツー・エンドネットワークであり,降水と複雑な表面特性の関係を包括的に検討した。 また,低分解能降水に基づく劣化損失関数は,ネットワークトレーニングを物理的に制約し,提案するネットワークのロバスト性を向上させるように設計されている。 実験の結果,提案するネットワークは3つのベースライン法より有意に優れていた。 最後に, 地理的差分解析手法を導入し, 高精度・微粒な降水量推定のためのその場測定を取り入れた。

Precipitation is a key part of hydrological circulation and is a sensitive indicator of climate change. The Integrated Multi-satellitE Retrievals for the Global Precipitation Measurement (GPM) mission (IMERG) datasets are widely used for global and regional precipitation investigations. However, their local application is limited by the relatively coarse spatial resolution. Therefore, in this paper, an attention mechanism based convolutional network (AMCN) is proposed to downscale GPM IMERG monthly precipitation data. The proposed method is an end-to-end network, which consists of a global cross-attention module, a multi-factor cross-attention module, and a residual convolutional module, comprehensively considering the potential relationships between precipitation and complicated surface characteristics. In addition, a degradation loss function based on low-resolution precipitation is designed to physically constrain the network training, to improve the robustness of the proposed network under different time and scale variations. The experiments demonstrate that the proposed network significantly outperforms three baseline methods. Finally, a geographic difference analysis method is introduced to further improve the downscaled results by incorporating in-situ measurements for high-quality and fine-scale precipitation estimation.
# (参考訳) 確率的パラメータ化:確率論的機械学習による時間相関のモデル化

Stochastic Parameterizations: Better Modelling of Temporal Correlations using Probabilistic Machine Learning ( http://arxiv.org/abs/2203.14814v1 )

Raghul Parthipan, Hannah M. Christensen, J. Scott Hosking, Damon J. Wischik(参考訳) 小規模プロセスのモデリングは気候モデルの主要なエラー源であり、パラメータ化によってそのようなプロセスを近似しなければならない低コストモデルの精度を妨げる。 確率性と機械学習を使うことは、よりよいモデルにつながったが、両方の利点を組み合わせる作業が不足している。 確率的枠組み内で物理的に変形したリカレントニューラルネットワークを用いることで,lorenz 96大気シミュレーションのモデルが,従来のベースラインと既存の確率的機械学習(gan)モデルの両方に匹敵することを示した。 これは、標準の1次自己回帰スキームと比較して時間的相関をモデル化する能力が優れているためである。 このモデルは目に見えない体制にも一般化する。 文献から多くの指標を評価するとともに、将来の確率的気候モデルにおいて、確率論的指標が統一的な選択である可能性についても論じる。

The modelling of small-scale processes is a major source of error in climate models, hindering the accuracy of low-cost models which must approximate such processes through parameterization. Using stochasticity and machine learning has led to better models, but there is a lack of work on combining the benefits from both. We show that by using a physically-informed recurrent neural network within a probabilistic framework, our resulting model for the Lorenz 96 atmospheric simulation is competitive with, and often superior to, both a bespoke baseline and an existing probabilistic machine-learning (GAN) one. This is due to a superior ability to model temporal correlations compared to standard first-order autoregressive schemes. The model also generalises to unseen regimes. We evaluate across a number of metrics from the literature, but also discuss how the probabilistic metric of likelihood may be a unifying choice for future probabilistic climate models.
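For reference, the "standard first-order autoregressive scheme" that the recurrent model is compared against can be sketched as an AR(1) noise process, $e_t = \phi\, e_{t-1} + \sigma\sqrt{1-\phi^2}\, z_t$ with $z_t \sim \mathcal{N}(0,1)$. This is the textbook form, assumed here for illustration; the abstract does not state the exact baseline parameterization, and the function names and parameter values are hypothetical.

```python
import math
import random


def ar1_series(n_steps, phi, sigma, seed=0):
    """Stationary AR(1) noise with lag-1 correlation phi and std sigma."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    if n_steps <= 0:
        return []
    rng = random.Random(seed)
    e = rng.gauss(0.0, sigma)  # draw the first value from the stationary distribution
    series = [e]
    innovation_scale = sigma * math.sqrt(1.0 - phi ** 2)
    for _ in range(n_steps - 1):
        e = phi * e + innovation_scale * rng.gauss(0.0, 1.0)
        series.append(e)
    return series


def lag1_autocorrelation(series):
    """Sample correlation between consecutive values of the series."""
    mean = sum(series) / len(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((a - mean) ** 2 for a in series)
    return num / den


noise = ar1_series(n_steps=20000, phi=0.9, sigma=1.0, seed=42)
estimated_phi = lag1_autocorrelation(noise)  # close to 0.9 for a long series
```

An AR(1) process remembers only its previous value, so its correlations decay geometrically with lag; a recurrent network carries a richer hidden state, which is one plausible reading of why it models temporal correlations better.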
# (参考訳) 心配せずにスケッチする: ノイズ耐性のあるスケッチに基づく画像検索

Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval ( http://arxiv.org/abs/2203.14817v1 )

Ayan Kumar Bhunia and Subhadeep Koley and Abdullah Faiz Ur Rahman Khilji and Aneeshan Sain and Pinaki Nath Chowdhury and Tao Xiang and Yi-Zhe Song(参考訳) スケッチによって多くのエキサイティングなアプリケーション、特に画像検索が可能になる。 しかし、恐怖からスケッチへの問題(すなわち「スケッチできない」)は、その普及によって致命的であることが証明されている。 本稿では,この「恐ろしい」見出しに取り組み,ユーザが心配せずにスケッチできる既存の検索モデルのための補助モジュールを初めて提案する。 我々は最初に、ノイズストロークの存在に秘密があることを示すパイロット実験を行ったが、「私はスケッチできない」というほどではなかった。 そこで我々は,検索に肯定的な寄与をもたらすノイズストロークのみを検出可能なストロークサブセットセレクタを設計した。 強化学習に基づく定式化は,与えられた部分集合に存在する各ストロークの重要性を,そのストロークが検索にどの程度寄与するかに基づいて定量化する。 事前学習した検索モデルを前処理モジュールとして組み合わせることで,標準ベースラインよりも8%~10%の大幅な向上を実現し,新たな最先端性能を報告した。 最後に、一度トレーニングされたセレクタをプラグイン・アンド・プレイ方式で使用して、以前は不可能だった方法で様々なスケッチアプリケーションを強化することを実証する。

Sketching enables many exciting applications, notably image retrieval. The fear-to-sketch problem (i.e., "I can't sketch") has, however, proven to be fatal for its widespread adoption. This paper tackles this "fear" head-on, and for the first time, proposes an auxiliary module for existing retrieval models that predominantly lets the users sketch without having to worry. We first conducted a pilot study that revealed the secret lies in the existence of noisy strokes, but not so much of the "I can't sketch". We consequently design a stroke subset selector that detects noisy strokes, leaving only those which make a positive contribution towards successful retrieval. Our Reinforcement Learning based formulation quantifies the importance of each stroke present in a given subset, based on the extent to which that stroke contributes to retrieval. When combined with pre-trained retrieval models as a pre-processing module, we achieve a significant gain of 8%-10% over standard baselines and in turn report new state-of-the-art performance. Last but not least, we demonstrate that the selector, once trained, can also be used in a plug-and-play manner to empower various sketch applications in ways that were not previously possible.
# (参考訳) ブラケット付き露光・イベントからのHDR再構成

HDR Reconstruction from Bracketed Exposures and Events ( http://arxiv.org/abs/2203.14825v1 )

Richard Shaw, Sibi Catley-Chandar, Ales Leonardis, Eduardo Perez-Pellitero(参考訳) 高品質なHDR画像の再構成は、現代の計算写真の中心にある。 マルチフレームhdr再構成法では,高精細で高精細で高精度な色再現が実現されている。 しかし、フレームのミスアライメントがしばしば目に見えるゴーストアーティファクトをもたらすような、動的または大部分が露出過剰なシーンで失敗する傾向にある。 近年のアプローチでは、照明の2値変化のみを測定するイベントベースカメラ(EBC)を用いてこれを緩和しようとしている。 望まれる高時間分解能とダイナミックレンジ特性にもかかわらず、これらの手法は色情報や低解像度センサの欠如により従来の多フレーム再構成手法よりも優れていなかった。 本稿では, ブラケット付きLDR画像と, 同時にキャプチャしたイベントを両世界の長所として活用し, ブラケット付きLDRから高画質なRGB情報と, イベントからの相補的な高周波およびダイナミックレンジ情報とを両世界の長所から得ることを提案する。 本稿では,注意と多スケール空間アライメントモジュールを用いて,特徴領域におけるブラケット画像とイベントモダリティを融合するマルチモーダル・エンド・ツー・エンド学習型hdrイメージングシステムを提案する。 イベント特徴を自己スーパービジョンで画像空間に変換することを学習する新しいイベント・ツー・イメージ機能蒸留モジュールを提案する。 当社のフレームワークでは,入力イベントストリームをスライディングウィンドウを使ってサブサンプリングすることで,イベントの時間分解能の向上を実現しています。 提案手法は,2dBと1dBのPSNR-LとPSNR-muをそれぞれHdM HDRデータセット上で改良し,合成イベントと実イベントを用いたSoTA多フレームHDR再構成手法を克服する。

Reconstruction of high-quality HDR images is at the core of modern computational photography. Significant progress has been made with multi-frame HDR reconstruction methods, producing high-resolution, rich and accurate color reconstructions with high-frequency details. However, they are still prone to fail in dynamic or largely over-exposed scenes, where frame misalignment often results in visible ghosting artifacts. Recent approaches attempt to alleviate this by utilizing an event-based camera (EBC), which measures only binary changes of illumination. Despite their desirable high temporal resolution and dynamic range characteristics, such approaches have not outperformed traditional multi-frame reconstruction methods, mainly due to the lack of color information and low-resolution sensors. In this paper, we propose to leverage both bracketed LDR images and simultaneously captured events to obtain the best of both worlds: high-quality RGB information from bracketed LDRs and complementary high-frequency and dynamic range information from events. We present a multi-modal end-to-end learning-based HDR imaging system that fuses bracketed images and event modalities in the feature domain using attention and multi-scale spatial alignment modules. We propose a novel event-to-image feature distillation module that learns to translate event features into the image-feature space with self-supervision. Our framework exploits the higher temporal resolution of events by sub-sampling the input event streams using a sliding window, enriching our combined feature representation. Our proposed approach surpasses SoTA multi-frame HDR reconstruction methods using synthetic and real events, with a 2dB and 1dB improvement in PSNR-L and PSNR-mu on the HdM HDR dataset, respectively.
# (参考訳) 多言語同時音声翻訳

Multilingual Simultaneous Speech Translation ( http://arxiv.org/abs/2203.14835v1 )

Shashank Subramanya, Jan Niehues(参考訳) 会議や会議などのイベント中に同時に音声翻訳を行うために設計されたアプリケーションは、優れたユーザエクスペリエンスを提供するために翻訳テキストを表示しながら、品質と遅延のバランスを取る必要がある。 オンライン音声翻訳システムを構築する一般的なアプローチは、オフライン音声翻訳用に構築されたモデルを活用することである。 エンド・ツー・エンドのモノリンガルモデルを適応させる手法に基づいて、オンライン音声翻訳を行う上での多言語モデルと異なるアーキテクチャ(エンド・ツー・エンド、カスケード)について検討する。 多言語TEDxコーパスでは、アプローチが異なるアーキテクチャに一般化されることを示す。 言語やアーキテクチャのレイテンシ低減(40%相対)も同様に向上しています。 しかし、エンドツーエンドアーキテクチャは、オンラインモデルに適応した後、翻訳品質の損失を小さくする。 さらに、このアプローチはゼロショット方向までスケールする。

Applications designed for simultaneous speech translation during events such as conferences or meetings need to balance quality and lag while displaying translated text to deliver a good user experience. One common approach to building online spoken language translation systems is by leveraging models built for offline speech translation. Based on a technique to adapt end-to-end monolingual models, we investigate multilingual models and different architectures (end-to-end and cascade) on the ability to perform online speech translation. On the multilingual TEDx corpus, we show that the approach generalizes to different architectures. We see similar gains in latency reduction (40% relative) across languages and architectures. However, the end-to-end architecture leads to smaller translation quality losses after adapting to the online model. Furthermore, the approach even scales to zero-shot directions.
# (参考訳) Doodle It Yourself:小さめのスケッチを引いて授業のインクリメンタル学習

Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches ( http://arxiv.org/abs/2203.14843v1 )

Ayan Kumar Bhunia, Viswanatha Reddy Gajjala, Subhadeep Koley, Rohit Kundu, Aneeshan Sain, Tao Xiang and Yi-Zhe Song(参考訳) 人間の視覚システムは、ほんの数例から新しい視覚概念を学ぶのに顕著である。 これはまさに、モデルが"偽造"に苦しめられないようにすることに重点を置いている、数発のクラスインクリメンタルラーニング(FSCIL)の背景にある目標である。 本稿では、そのユビキタスな応用をボトルネックにする2つの重要な問題に対処することで、FSCILの境界をさらに推し進める。 i) モデルが単に写真(人間のように)以外の様々なモダリティから学習できるか、そして (ii)写真のアクセスが容易でない場合(倫理的・プライバシー上の制約により)はどうか。 私たちの重要なイノベーションは、クラスサポートの新しいモダリティとしてスケッチを使うことを提唱することです。 この製品は“doodle it yourself”(diy)のfscilフレームワークで、ユーザが新しいクラスの例を自由にスケッチして、そのクラスの写真を認識できるようになる。 そのために、我々はこのフレームワークを (i)ドメイン不変学習における勾配コンセンサス (ii)旧級情報保存のための知識蒸留 (iii)旧クラスと新クラス間のメッセージパッシングのためのグラフアテンションネットワーク。 FSCILの文脈では,スケッチがテキストよりも優れたクラスサポートであることを実験的に示す。

The human visual system is remarkable in learning new visual concepts from just a few examples. This is precisely the goal behind few-shot class incremental learning (FSCIL), where the emphasis is additionally placed on ensuring the model does not suffer from "forgetting". In this paper, we push the boundary further for FSCIL by addressing two key questions that bottleneck its ubiquitous application: (i) can the model learn from diverse modalities other than just photos (as humans do), and (ii) what if photos are not readily accessible (due to ethical and privacy constraints)? Our key innovation lies in advocating the use of sketches as a new modality for class support. The product is a "Doodle It Yourself" (DIY) FSCIL framework where users can freely sketch a few examples of a novel class for the model to learn to recognize photos of that class. For that, we present a framework that infuses (i) gradient consensus for domain invariant learning, (ii) knowledge distillation for preserving old class information, and (iii) graph attention networks for message passing between old and novel classes. We experimentally show that sketches are better class support than text in the context of FSCIL, echoing findings elsewhere in the sketching literature.
# (参考訳) タスク分割によるマルチタスク模倣学習のためのモジュール適応ポリシー選択

Modular Adaptive Policy Selection for Multi-Task Imitation Learning through Task Division ( http://arxiv.org/abs/2203.14855v1 )

Dafni Antotsiou, Carlo Ciliberto and Tae-Kyun Kim(参考訳) 深い模倣学習は、多くの専門家によるデモンストレーションを必要とするが、特に多くのタスクが関与している場合には、取得が困難である。 しかし、異なるタスクはしばしば類似点を共有するため、それらを一緒に学ぶことは彼らにとって大きな利益となり、多くのデモの必要性を軽減できる。 しかし、共同マルチタスク学習はしばしば負の伝達に悩まされ、タスク固有の情報を共有する。 本稿では,タスク特有の特徴を生かしながらマルチタスク模倣を行う手法を提案する。 これは、プロトポリケーションをモジュールとして使用して、タスクを共有可能な単純なサブ振る舞いに分割する。 プロトポリアは並列に動作し、モジュールと共同で訓練されたセレクタ機構によって適応的に選択される。 異なるタスクセットにおける実験により,単一エージェント,タスクコンディショニングエージェント,マルチヘッドマルチタスクエージェント,最先端のメタ学習エージェントの精度が向上した。 また、タスクを共有行動とタスク固有のサブ行動の両方に自律的に分割する能力を示す。

Deep imitation learning requires many expert demonstrations, which can be hard to obtain, especially when many tasks are involved. However, different tasks often share similarities, so learning them jointly can greatly benefit them and alleviate the need for many demonstrations. Yet joint multi-task learning often suffers from negative transfer, sharing information that should be task-specific. In this work, we introduce a method to perform multi-task imitation while allowing for task-specific features. This is done by using proto-policies as modules to divide the tasks into simple sub-behaviours that can be shared. The proto-policies operate in parallel and are adaptively chosen by a selector mechanism that is jointly trained with the modules. Experiments on different sets of tasks show that our method improves upon the accuracy of single agents, task-conditioned and multi-headed multi-task agents, as well as state-of-the-art meta learning agents. We also demonstrate its ability to autonomously divide the tasks into both shared and task-specific sub-behaviours.
# (参考訳) HIME:複数例による高能率ヘッドショット画像超解像

HIME: Efficient Headshot Image Super-Resolution with Multiple Exemplars ( http://arxiv.org/abs/2203.14863v1 )

Xiaoyu Xiang, Jon Morton, Fitsum A Reda, Lucas Young, Federico Perazzi, Rakesh Ranjan, Amit Kumar, Andrea Colaco, Jan Allebach(参考訳) 低解像度のヘッドショット画像において、失われた情報を復元するための有望な方向は、同一のアイデンティティから高解像度の例証のセットを活用することである。 参照セットの補完画像は、多くの異なるビューやポーズで生成されたヘッドショットの品質を改善することができる。 しかし、複数の例を最大限に活用することは困難であり、それぞれの例の品質とアライメントは保証できない。 低品質で不整合なイメージを参照として使用すると、結果が損なわれる。 これらの課題を克服するために,HIME (Multiple Exemplars Network) 法を用いたヘッドショット画像超解法を提案する。 従来の手法と比較して,我々のネットワークは,顔の事前処理を必要とせずに,入力と参照のミスアライメントを効果的に処理することができる。 さらに,より詳細な顔特徴を再構築するために,制御可能な空間範囲における局所的なテクスチャの豊かな表現を提供する相関損失を提案する。 実験の結果, 提案手法の計算コストは, 最近のexemplar-guided法に比べて大幅に低減するだけでなく, 質的, 定量的性能も向上した。

A promising direction for recovering the lost information in low-resolution headshot images is utilizing a set of high-resolution exemplars from the same identity. Complementary images in the reference set can improve the generated headshot quality across many different views and poses. However, it is challenging to make the best use of multiple exemplars: the quality and alignment of each exemplar cannot be guaranteed. Using low-quality and mismatched images as references will impair the output results. To overcome these issues, we propose an efficient Headshot Image Super-Resolution with Multiple Exemplars network (HIME) method. Compared with previous methods, our network can effectively handle the misalignment between the input and the reference without requiring facial priors and learn the aggregated reference set representation in an end-to-end manner. Furthermore, to reconstruct more detailed facial features, we propose a correlation loss that provides a rich representation of the local texture in a controllable spatial range. Experimental results demonstrate that the proposed framework not only has a significantly lower computational cost than recent exemplar-guided methods but also achieves better qualitative and quantitative performance.
# (参考訳) フィンランド議会asrコーパスの分析,ベンチマーク,統計

Finnish Parliament ASR corpus - Analysis, benchmarks and statistics ( http://arxiv.org/abs/2203.14876v1 )

Anja Virkkunen and Aku Rouhe and Nhan Phan and Mikko Kurimo(参考訳) 議会の会議記録や書き起こしなどの公開資料は、自動音声認識(ASR)システムの訓練と評価のために、成長を続ける材料を提供する。 本稿では,3000時間を超える発話データと449人の話者からなるフィンランド議会asrコーパスの公開分析を行った。 このコーパスは初期の作業に基づいて構築され、結果としてコーパスは2つの期間から2つのトレーニングサブセットに自然に分割される。 同様に、異なる時間をカバーする2つの公式な修正テストセットがあり、縦方向の分布シフト特性を持つASRタスクを設定している。 公式開発セットも用意されている。 我々は、カルディに基づく完全なデータ準備パイプラインと隠れマルコフモデル(HMM)、ハイブリッドディープニューラルネットワーク(HMM-DNN)、アテンションベースのエンコーダデコーダ(AED)ASRレシピを開発した。 公式のテストセットにベンチマークを設定し、他の複数の最近使われたテストセットにもベンチマークを設定しました。 どちらの時間的コーパスサブセットも既に大きく、その規模を超えて、公式なテストセットのASRパフォーマンスは高められるが、他のドメインは追加データから恩恵を受ける。 HMM-DNN と AED のアプローチは、HMM-DNN システムとよく一致した同値なデータ設定で比較される。 最後に、議会メタデータで利用可能な話者カテゴリー間でasrの精度のばらつきを比較し、性別、年齢、教育などの要因に基づいて潜在的なバイアスを検出する。

Public sources like parliament meeting recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish parliament ASR corpus, the largest publicly available collection of manually transcribed speech data for Finnish with over 3000 hours of speech and 449 speakers for which it provides rich demographic metadata. This corpus builds on earlier initial work, and as a result the corpus has a natural split into two training subsets from two periods of time. Similarly, there are two official, corrected test sets covering different times, setting an ASR task with longitudinal distribution-shift characteristics. An official development set is also provided. We develop a complete Kaldi-based data preparation pipeline, and hidden Markov model (HMM), hybrid deep neural network (HMM-DNN) and attention-based encoder-decoder (AED) ASR recipes. We set benchmarks on the official test sets, as well as multiple other recently used test sets. Both temporal corpus subsets are already large, and we observe that beyond their scale, ASR performance on the official test sets plateaus, whereas other domains benefit from added data. The HMM-DNN and AED approaches are compared in a carefully matched equal data setting, with the HMM-DNN system consistently performing better. Finally, the variation of the ASR accuracy is compared between the speaker categories available in the parliament metadata to detect potential biases based on factors such as gender, age, and education.
# (参考訳) AWA Part:知識グラフのアダプティブなワークロード対応分割

AWAPart: Adaptive Workload-Aware Partitioning of Knowledge Graphs ( http://arxiv.org/abs/2203.14884v1 )

Amitabh Priyadarshi, Krzysztof J. Kochut(参考訳) 大規模知識グラフは多くの領域でますます一般的になっている。 その大きなサイズは、特にメインメモリに配置された場合、集中型データストアにグラフを格納するシステムの限界を超えることが多い。 これを解決するには、大規模な知識グラフを複数のサブグラフに分割し、分散システムのノードに配置する必要がある。 しかし、これらの断片化されたサブグラフのクエリは、切断エッジを含む分散結合による通信コストの増加など、新たな課題を引き起こす。 これらの問題に対処するため、優れたパーティショニングは、与えられたクエリのワークロードを考慮してエッジカットを減らす必要がある。 しかし、分割されたグラフは、クエリのワークロードの変更に対応し、平均処理時間を維持するために、継続的に再分割する必要がある。 本稿では,大規模知識グラフに対する適応的分割手法を導入し,クエリ処理量の変化に応じて分割を適応させる。 本評価では,知識グラフトリプルの分割を動的に適応することで,クエリの処理時間の性能が向上することを示す。

Large-scale knowledge graphs are increasingly common in many domains. Their large sizes often exceed the limits of systems storing the graphs in a centralized data store, especially if placed in main memory. To overcome this, large knowledge graphs need to be partitioned into multiple sub-graphs and placed on nodes in a distributed system. But querying these fragmented sub-graphs poses new challenges, such as increased communication costs due to distributed joins involving cut edges. To combat these problems, a good partitioning should reduce the edge cuts while considering a given query workload. However, a partitioned graph needs to be continually re-partitioned to accommodate changes in the query workload and maintain a good average processing time. In this paper, an adaptive partitioning method for large-scale knowledge graphs is introduced, which adapts the partitioning in response to changes in the query workload. Our evaluation demonstrates that query processing time improves after dynamically adapting the partitioning of knowledge graph triples.
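The idea of reducing edge cuts "while considering a given query workload" can be illustrated with a small sketch. It is not the AWAPart method: the function names, the toy triples, and the choice to weight each cut triple by how often its predicate appears in the workload are all assumptions made for illustration.

```python
from collections import Counter


def predicate_frequencies(workload):
    """Count how often each predicate occurs across the query workload.

    Each query is represented (hypothetically) as a list of predicates.
    """
    counts = Counter()
    for query in workload:
        counts.update(query)
    return dict(counts)


def weighted_edge_cut(triples, partition_of, predicate_weight=None):
    """Sum the weights of triples whose subject and object sit in different partitions.

    With predicate_weight=None every cut triple counts 1 (plain edge cut).
    Otherwise a cut triple counts as much as its predicate is used by the workload.
    """
    cut = 0.0
    for subject, predicate, obj in triples:
        if partition_of[subject] != partition_of[obj]:
            if predicate_weight is None:
                cut += 1.0
            else:
                cut += predicate_weight.get(predicate, 0.0)
    return cut


# Toy knowledge graph and workload (illustrative data only).
triples = [
    ("a", "knows", "b"),
    ("b", "worksAt", "c"),
    ("a", "livesIn", "d"),
    ("c", "locatedIn", "d"),
]
workload = [["knows", "worksAt"], ["knows"], ["locatedIn"]]
weights = predicate_frequencies(workload)

partition_1 = {"a": 0, "b": 0, "c": 1, "d": 1}
partition_2 = {"a": 0, "b": 1, "c": 1, "d": 0}

for name, part in [("partition_1", partition_1), ("partition_2", partition_2)]:
    print(name, weighted_edge_cut(triples, part), weighted_edge_cut(triples, part, weights))
```

Both toy partitionings cut the same number of edges, but they differ once cut edges are weighted by the workload, which is why a workload change can make a previously good partitioning worth revisiting.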
# (参考訳) HUNIS:高性能無監督核インスタンスセグメンテーション

HUNIS: High-Performance Unsupervised Nuclei Instance Segmentation ( http://arxiv.org/abs/2203.14887v1 )

Vasileios Magoulianitis, Yijing Yang and C.-C. Jay Kuo(参考訳) 本研究では,高性能非教師付き核インスタンス分割法(HUNIS)を提案する。 hunisは2段階のブロックワイズ操作からなる。 第1段は以下の通り。 1)画素強度の適応しきい値化 2 核の大きさ・形状の事前の組み入れ及び 3)偽陽性核インスタンスの除去。 そして、HUNISは、第1のステージからガイダンスを受け取って第2のステージセグメンテーションを行う。 第2段は第1段で得られたセグメンテーションマスクを利用し、色と形状の分布を利用してより正確なセグメンテーションを行う。 2段設計の主な目的は、第1段から第2段までのピクセル単位の擬似ラベルを提供することである。 この自己超越メカニズムは新しくて効果的です。 MoNuSegデータセットの実験結果によると、HUNISは他の教師なしの手法よりもかなり優れていた。 また、最先端の監視手法の間でも競争的な立場にある。

A high-performance unsupervised nuclei instance segmentation (HUNIS) method is proposed in this work. HUNIS consists of two-stage block-wise operations. The first stage includes: 1) adaptive thresholding of pixel intensities, 2) incorporation of nuclei size/shape priors and 3) removal of false positive nuclei instances. Then, HUNIS conducts the second stage segmentation by receiving guidance from the first one. The second stage exploits the segmentation masks obtained in the first stage and leverages color and shape distributions for a more accurate segmentation. The main purpose of the two-stage design is to provide pixel-wise pseudo-labels from the first to the second stage. This self-supervision mechanism is novel and effective. Experimental results on the MoNuSeg dataset show that HUNIS outperforms all other unsupervised methods by a substantial margin. It also has a competitive standing among state-of-the-art supervised methods.
# (参考訳) コミュニティ構造を持つABCDeランダムグラフモデルの特性と性能

Properties and Performance of the ABCDe Random Graph Model with Community Structure ( http://arxiv.org/abs/2203.14899v1 )

Bogumi{\l} Kami\'nski, Tomasz Olczak, Bartosz Pankratz, Pawe{\l} Pra{\l}at, Fran\c{c}ois Th\'eberge(参考訳) 本稿では,コミュニティ構造を組み込んだ合成ランダムグラフモデルの特性と性能について検討する。 このようなモデルは自然に制御されないコミュニティ検出アルゴリズムの評価とチューニングに重要である。 本稿では,マルチスレッドを用いたABCDグラフ生成器ABCDeの新たな実装を提案する。 本稿では,本アルゴリズムの実装の詳細を議論するとともに,従来利用可能であったabcdモデルの逐次バージョンと,標準および広く使用されているlfrジェネレータの並列実装と比較する。 ABCDe は NetworKit で提供される LFR の並列実装よりも 10 倍以上高速でスケール可能であることを示す。 さらに、このアルゴリズムは高速であるだけでなく、ABCDが生成したランダムグラフは、元のLFRアルゴリズムが生成したグラフと類似した特性を持つ一方、並列化されたNetworKit実装は、顕著に異なる特性を持つグラフを生成する。

In this paper, we investigate properties and performance of synthetic random graph models with a built-in community structure. Such models are important for evaluating and tuning community detection algorithms that are unsupervised by nature. We propose a new implementation of the ABCD graph generator, ABCDe, that uses multithreading. We discuss the implementation details of the algorithm and compare it with both the previously available sequential version of the ABCD model and the parallel implementation of the standard and extensively used LFR generator. We show that ABCDe is more than ten times faster and scales better than the parallel implementation of LFR provided in NetworKit. Moreover, the algorithm is not only faster: random graphs generated by ABCD have properties similar to those generated by the original LFR algorithm, while the parallelized NetworKit implementation of LFR produces graphs that have noticeably different characteristics.
# MolGenSurvey: 分子設計のための機械学習モデルに関するシステム調査

MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design ( http://arxiv.org/abs/2203.14500v1 )

Yuanqi Du, Tianfan Fu, Jimeng Sun, Shengchao Liu(参考訳) 分子設計は分子科学における根本的な問題であり、創薬、物質科学など様々な分野において重要な応用がある。 しかし、大規模な探索空間のため、人間の専門家が湿式実験で全ての分子を列挙してテストすることは不可能である。 近年,機械学習手法,特に生成手法の急速な発展に伴い,分子設計は機械学習モデルを利用して候補分子を生成することで大きな進歩を遂げている。 本稿では、分子設計のための機械学習モデルにおける最も関連する研究を体系的に概観する。 まず,1d文字列,2dグラフ,3dジオメトリを含む,メインストリーム分子の成熟・表現法と一般生成法(深部生成法と組合せ最適化法)について概観する。 次に、既存の分子設計問題をすべて、入力、出力タイプ、目標を含む問題設定に従って複数の場所にまとめる。 最後に、オープンチャレンジで締めくくり、現実世界のアプリケーションにおける分子設計のための機械学習モデルの将来の機会を指摘した。

Molecule design is a fundamental problem in molecular science and has critical applications in a variety of areas, such as drug discovery and materials science. However, due to the large search space, it is impossible for human experts to enumerate and test all molecules in wet-lab experiments. Recently, with the rapid development of machine learning methods, especially generative methods, molecule design has achieved great progress by leveraging machine learning models to generate candidate molecules. In this paper, we systematically review the most relevant work in machine learning models for molecule design. We start with a brief review of the mainstream molecule featurization and representation methods (including 1D string, 2D graph, and 3D geometry) and general generative methods (deep generative and combinatorial optimization methods). Then we summarize all the existing molecule design problems into several categories according to the problem setup, including input, output types and goals. Finally, we conclude with the open challenges and point out future opportunities of machine learning models for molecule design in real-world applications.
# 不均質なフォッグにおける分散タスクマネジメント : 社会的にコンケーブなバンディットゲーム

Distributed Task Management in the Heterogeneous Fog: A Socially Concave Bandit Game ( http://arxiv.org/abs/2203.14572v1 )

Xiaotong Cheng and Setareh Maghsudi(参考訳) フォグコンピューティングは、モバイルユーザーの爆発的な計算需要に対する潜在的な解決策として登場した。 このポテンシャルは主に、ネットワークエッジにおけるタスクのオフロードとアロケーションの能力に起因しており、遅延を低減し、サービスの品質を改善する。 大きな可能性にもかかわらず、霧ネットワークの性能を最適化することはしばしば困難である。 フォグアーキテクチャでは、計算ノードは異なる能力と能力を持つ異質のスマートデバイスであり、したがって好みである。 また、ランダムなタスク到着を伴う超高密度霧ネットワークでは、集中制御が過大なオーバーヘッドをもたらすため、実現不可能である。 不確実性下における異種フォグコンピューティングネットワークにおける分散タスク割り当て問題について検討する。 この問題をソーシャル・コンケーブゲームとして定式化し、プレイヤーはナッシュ均衡への道のりで後悔を最小化しようとする。 定式化問題を解決するため,我々は2つの無規制意思決定戦略を考案する。 一つの戦略、すなわちbandit gradient ascent with momentumは、banditフィードバックを伴うオンライン凸最適化アルゴリズムである。 もうひとつの戦略であるLipschitz Bandit with Initializationは、EXP3のマルチアームバンディットアルゴリズムである。 両戦略に対する後悔関係を確立し,その収束特性を解析する。 さらに,提案手法をLearning with Linear Rewardsという集中型アロケーション戦略と比較した。 理論的および数値解析により,提案手法は最先端手法と比較して効率的なタスク割当を行うための優れた性能を示す。

Fog computing has emerged as a potential solution to the explosive computational demand of mobile users. This potential mainly stems from the capacity of task offloading and allocation at the network edge, which reduces the delay and improves the quality of service. Despite the significant potential, optimizing the performance of a fog network is often challenging. In the fog architecture, the computing nodes are heterogeneous smart devices with distinct abilities and capacities and, therefore, distinct preferences. Besides, in an ultra-dense fog network with random task arrival, centralized control results in excessive overhead and is therefore infeasible. We study a distributed task allocation problem in a heterogeneous fog computing network under uncertainty. We formulate the problem as a social-concave game, where the players attempt to minimize their regret on the path to Nash equilibrium. To solve the formulated problem, we develop two no-regret decision-making strategies. One strategy, namely bandit gradient ascent with momentum, is an online convex optimization algorithm with bandit feedback. The other strategy, Lipschitz Bandit with Initialization, is an EXP3 multi-armed bandit algorithm. We establish a regret bound for both strategies and analyze their convergence characteristics. Moreover, we compare the proposed strategies with a centralized allocation strategy named Learning with Linear Rewards. Theoretical and numerical analyses show the superior performance of the proposed strategies for efficient task allocation compared to the state-of-the-art methods.
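Since the abstract describes one of its strategies as an EXP3 multi-armed bandit algorithm, a minimal sketch of the textbook EXP3 update may help readers unfamiliar with it. This is plain EXP3, not the paper's Lipschitz Bandit with Initialization; the function name, the exploration rate `gamma`, and the toy reward function are assumptions for illustration.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Run textbook EXP3 and return how often each arm was pulled.

    reward_fn(arm, rng) must return a reward in [0, 1].
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, rng)
        # Importance-weighted estimate: only the pulled arm is observed.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so the largest weight is 1; this leaves probs unchanged
        # and avoids floating-point overflow on long runs.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls[arm] += 1
    return pulls


# Toy usage: three "fog nodes" with fixed, unknown-to-the-learner rewards.
node_rewards = [0.2, 0.5, 0.9]
pulls = exp3(lambda arm, rng: node_rewards[arm], n_arms=3, n_rounds=5000)
print(pulls)
```

The learner only ever sees the reward of the arm it pulled (bandit feedback), which is the setting the abstract refers to.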
# 終端雑音-ロバスト音声認識のためのデュアルパス型学習

Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition ( http://arxiv.org/abs/2203.14838v1 )

Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng(参考訳) ノイズロスト自動音声認識は、通常フロントエンド音声強調モジュールに存在する過剰抑圧問題に直面して大幅に劣化する。 このような問題を緩和するために, エンドツーエンドノイズロスト自動音声認識(DPSL-ASR)のための新しいデュアルパス学習を提案する。 具体的には, DPSL-ASR方式では, IFF-Net をデュアルパス入力として融合したクリーンな特徴を導入し, 過度に抑圧された情報を復元する。 さらに,融通した特徴をクリーンな特徴にマッピングすることで,詳細な情報や潜伏情報を学ぶためのスタイル学習を提案する。 さらに,2経路間のデコード埋め込み距離を最小化するために,一貫性損失を利用する。 実験の結果,提案手法は,RATS Channel-AデータセットとCHiME-4 1-Channel Trackデータセットを用いて,相対単語誤り率(WER)を10.6%,8.6%削減できることがわかった。 中間埋め込みの可視化は、提案したDPSL-ASRが最良のベースラインよりも詳細を学習できることを示唆している。 私たちのコード実装はgithubで利用可能です。

Noise-robust automatic speech recognition degrades significantly in the face of the over-suppression problem, which usually arises in the front-end speech enhancement module. To alleviate this issue, we propose a novel dual-path style learning approach for end-to-end noise-robust automatic speech recognition (DPSL-ASR). Specifically, the proposed DPSL-ASR approach introduces the clean feature along with the fused feature from the IFF-Net as dual-path inputs to recover the over-suppressed information. Furthermore, we propose style learning to learn abundant details and latent information by mapping the fused feature to the clean feature. In addition, we utilize a consistency loss to minimize the distance between the decoded embeddings of the two paths. Experimental results show that the proposed DPSL-ASR approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% on the RATS Channel-A dataset and the CHiME-4 1-Channel Track dataset, respectively. The visualizations of intermediate embeddings also indicate that the proposed DPSL-ASR can learn more details than the best baseline. Our code implementation is available on GitHub: https://github.com/YUCHEN005/DPSL-ASR.
# 4ボソン正規化群極限サイクルに関する新しい知見

New insights into four-boson renormalization group limit cycles ( http://arxiv.org/abs/2203.14597v1 )

Bastian Kaspschak, Ulf.-G. Mei{\ss}ner(参考訳) 機械学習技術を用いて,単位極限を超える再正規化群制限サイクルの出現が,3ボソンサブシステムから4ボソンシステム全体へ伝達されることを検証する。 4つの同一ボソンに着目して、変分オートエンコーダの強化されたアンサンブルの潜在空間内で合成特異ポテンシャルの集団を生成する。 制限サイクルの挙動から与えられた再正規化群フローの偏差を測定するための制限サイクル損失を導入した後, 得られた集団にエリート的遺伝的アルゴリズムを適用して最小化する。 フィットテストポテンシャルは逆二乗ポテンシャルの周りに蓄積し、4つのボソンの極限サイクルを生成し、既に3つのボソン系で極限サイクルを生成することが知られている。 これはまた、4体の項が先行する順序で低エネルギーの観測値に入り込まないことを示唆している。

Using machine learning techniques, we verify that the emergence of renormalization group limit cycles beyond the unitary limit is transferred from the three-boson subsystems to the whole four-boson system. Focussing on four identical bosons, we first generate populations of synthetic singular potentials within the latent space of a boosted ensemble of variational autoencoders. After introducing the limit cycle loss for measuring the deviation of a given renormalization group flow from limit cycle behavior, we minimize it by applying an elitist genetic algorithm to the generated populations. The fittest potentials are observed to accumulate around the inverse-square potential, which we prove to generate limit cycles for four bosons and which is already known to produce limit cycles in the three-boson system. This also indicates that a four-body term does not enter low-energy observables at leading order, since we do not observe any additional scale to emerge.
# フローベース変分量子モンテカルロの数値的および幾何学的側面

Numerical and geometrical aspects of flow-based variational quantum Monte Carlo ( http://arxiv.org/abs/2203.14824v1 )

James Stokes, Brian Chen, Shravan Veerapaneni(参考訳) 本稿では,流れに基づく変分量子モンテカルロ法を用いて連続変数量子系をシミュレートするための近年の取り組みを要約し,場の振幅(四分数)に基づくボソンの例に着目した。 特に、時間依存の変動原理の確率的推定と情報幾何学との関係を慎重に検討し、変動実時間および想像時間進化問題に重点を置いている。 pytorchコードの実装を導くための実践的な手順がいくつか提供されている。 このレビューは、機械学習と量子情報科学に関心のある研究者が利用できることを意図している。

This article aims to summarize recent and ongoing efforts to simulate continuous-variable quantum systems using flow-based variational quantum Monte Carlo techniques, focusing for pedagogical purposes on the example of bosons in the field amplitude (quadrature) basis. Particular emphasis is placed on the variational real- and imaginary-time evolution problems, carefully reviewing the stochastic estimation of the time-dependent variational principles and their relationship with information geometry. Some practical instructions are provided to guide the implementation of a PyTorch code. The review is intended to be accessible to researchers interested in machine learning and quantum information science.
# (参考訳) WawPart: 知識グラフの作業負荷対応分割

WawPart: Workload-Aware Partitioning of Knowledge Graphs ( http://arxiv.org/abs/2203.14888v1 )

Amitabh Priyadarshi, Krzysztof J. Kochut(参考訳) 知識グラフという形での大規模なデータセットは、今日では多くのドメインでよく使われている。 ナレッジグラフのサイズはしばしば単一のコンピュータシステムの容量を超え、特にグラフをメインメモリに保存しなければならない場合である。 これを克服するために、知識グラフを複数のサブグラフに分割し、多くの計算ノードにシャードとして分散することができる。 しかしながら、クエリなどのグラフ上で実行される多くの共通タスクのパフォーマンスは、結果として低下する。 これは分割を横断(切断)するグラフエッジによって要求される分散結合に起因する。 本稿では,一連のクエリ(作業負荷)を考慮した知識グラフ分割手法を提案する。 結果として生じる分割は、分散結合の数を減らし、ワークロードパフォーマンスを改善することを目的としている。 クエリワークロードとナレッジグラフで識別される重要な機能は、クエリをクラスタ化し、グラフを分割するために使用される。 クエリはグラフのパーティショニングを考慮して書き直される。 評価結果は,ワークロード処理時間の性能改善を示す。

Large-scale datasets in the form of knowledge graphs are used in numerous domains today. A knowledge graph's size often exceeds the capacity of a single computer system, especially if the graph must be stored in main memory. To overcome this, knowledge graphs can be partitioned into multiple sub-graphs and distributed as shards among many computing nodes. However, the performance of many common tasks performed on graphs, such as querying, suffers as a result. This is due to distributed joins mandated by graph edges crossing (cutting) the partitions. In this paper, we propose a method of knowledge graph partitioning that takes into account a set of queries (workload). The resulting partitioning aims to reduce the number of distributed joins and improve the workload performance. Critical features identified in the query workload and the knowledge graph are used to cluster the queries and then partition the graph. Queries are rewritten to account for the graph partitioning. Our evaluation results demonstrate the performance improvement in workload processing time.
# FS6D:新しい物体のFew-Shot 6D Pose Estimation

FS6D: Few-Shot 6D Pose Estimation of Novel Objects ( http://arxiv.org/abs/2203.14628v1 )

Yisheng He, Yao Wang, Haoqiang Fan, Jian Sun, Qifeng Chen(参考訳) 6次元オブジェクトポーズ推定ネットワークは、近接した仮定と高忠実度オブジェクトCADモデルに依存するため、多数のオブジェクトインスタンスにスケールする能力に制限されている。 本研究では,未知の物体の6次元位置を,予備のトレーニングなしで数回の支持ビューで推定する,数ショットの6次元物体のポーズ推定という新しいオープン集合問題について検討する。 この問題に対処するため,我々は,与えられたサポートビューとクエリシーンパッチの外観と幾何学的関係を十分に検討することの重要性を指摘し,高密度rgbdプロトタイプとトランスフォーマーの抽出・マッチングによる高密度プロトタイプマッチングフレームワークを提案する。 さらに,ネットワーク事前学習のための大規模rgbdフォトリアリスティックデータセット(shapenet6d)を提案する。 簡易かつ効果的なオンラインテクスチャブレンディングアプローチも導入され、低コストで外観の多様性を豊かにする合成データセットからドメインギャップを取り除く。 最後に、この問題に対する解決策を議論し、今後の研究を促進するために人気のあるデータセットのベンチマークを確立する。 プロジェクトページは \url{https://fs6d.github.io/} にある。

6D object pose estimation networks are limited in their capability to scale to large numbers of object instances due to the closed-set assumption and their reliance on high-fidelity object CAD models. In this work, we study a new open-set problem, few-shot 6D object pose estimation: estimating the 6D pose of an unknown object from a few support views without extra training. To tackle the problem, we point out the importance of fully exploring the appearance and geometric relationship between the given support views and query scene patches and propose a dense prototypes matching framework by extracting and matching dense RGBD prototypes with transformers. Moreover, we show that the priors from diverse appearances and shapes are crucial to the generalization capability under the problem setting and thus propose a large-scale RGBD photorealistic dataset (ShapeNet6D) for network pre-training. A simple and effective online texture blending approach is also introduced to eliminate the domain gap from the synthetic dataset, which enriches appearance diversity at a low cost. Finally, we discuss possible solutions to this problem and establish benchmarks on popular datasets to facilitate future research. The project page is at \url{https://fs6d.github.io/}.
# 長期記憶に基づくインターベンショナルMRI再構成のためのリカレントニューラルネットワーク

A Long Short-term Memory Based Recurrent Neural Network for Interventional MRI Reconstruction ( http://arxiv.org/abs/2203.14769v1 )

Ruiyang Zhao, Zhao He, Tao Wang, Suhao Qiu, Pawel Herman, Yanle Hu, Chencheng Zhang, Dinggang Shen, Bomin Sun, Guang-Zhong Yang, and Yuan Feng(参考訳) 外科的指導のためのインターベンショナル磁気共鳴イメージング(i-MRI)は、深部脳刺激(DBS)のような介入過程を可視化し、手術のパフォーマンスと患者の結果を改善するのに役立つ。 従来のダイナミックイメージングにおける振り返り再構成とは異なり、DBS用のi-MRIは、介入画像の連続的取得と再構成をオンラインで行う必要がある。 そこで本研究では,convolutional long short-term memory (conv-lstm) を用いたリカレントニューラルネットワーク (recurrent neural network, convlr) を提案する。 初期化器とConv-LSTMブロックを用いることで、前操作参照画像と術中フレームの先行を現在のフレームの再構築に利用した。 放射状サンプリングのためのデータ一貫性をソフト投射法により実現した。 再現精度を向上させるために,逆学習戦略を採用した。 術前および術後のMR画像に基づく介入画像のセットをシミュレーションし,アルゴリズムによる検証を行った。 その結果、10個のラジアルスポークしか得られず、ConvLRは最先端の手法と比較して最高の性能を示し、最大40倍の加速を実現した。 提案アルゴリズムは,DBSのリアルタイムi-MRIを実現する可能性があり,汎用的なMR誘導介入に使用できる。

Interventional magnetic resonance imaging (i-MRI) for surgical guidance could help visualize interventional processes such as deep brain stimulation (DBS), improving surgical performance and patient outcomes. Unlike retrospective reconstruction in conventional dynamic imaging, i-MRI for DBS has to acquire and reconstruct the interventional images sequentially online. Here we propose a convolutional long short-term memory (Conv-LSTM) based recurrent neural network (RNN), or ConvLR, to reconstruct interventional images with golden-angle radial sampling. By using an initializer and Conv-LSTM blocks, the priors from the pre-operative reference image and intra-operative frames are exploited for reconstructing the current frame. Data consistency for radial sampling is implemented by a soft-projection method. An adversarial learning strategy is adopted to improve the reconstruction accuracy. A set of interventional images based on the pre-operative and post-operative MR images was simulated for algorithm validation. Results showed that, with only 10 radial spokes, ConvLR provided the best performance compared with state-of-the-art methods, giving an acceleration of up to 40-fold. The proposed algorithm has the potential to achieve real-time i-MRI for DBS and can be used for general-purpose MR-guided intervention.
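For readers unfamiliar with golden-angle radial sampling, the spoke ordering can be sketched as follows. This is a generic illustration, not the authors' acquisition code: the function name and the `first_index` parameter are assumptions, and the sketch uses the common convention of incrementing the spoke angle by π divided by the golden ratio (about 111.25°).

```python
import math

# Golden-angle increment for radial k-space sampling: pi / golden ratio.
GOLDEN_RATIO = (1 + math.sqrt(5)) / 2
GOLDEN_ANGLE = math.pi / GOLDEN_RATIO  # about 111.25 degrees, in radians


def golden_angle_spokes(n_spokes, first_index=0):
    """Return spoke angles in [0, pi) for consecutive golden-angle spokes.

    A spoke at angle theta covers the same line as theta + pi, so angles
    are wrapped modulo pi. first_index lets a later frame continue the
    sequence where the previous frame stopped.
    """
    return [((first_index + k) * GOLDEN_ANGLE) % math.pi for k in range(n_spokes)]


# Two consecutive frames of 10 spokes each (10 matches the spoke count
# quoted in the abstract; the framing itself is illustrative).
frame_0 = golden_angle_spokes(10)
frame_1 = golden_angle_spokes(10, first_index=10)
print([round(math.degrees(a), 1) for a in frame_0])
```

Because consecutive spokes never repeat an angle, any contiguous window of spokes covers k-space fairly evenly, which is what makes this ordering convenient for sequential, online reconstruction.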
# LiDAR蒸留:3次元物体検出のためのビーム誘起領域ギャップのブリッジ

LiDAR Distillation: Bridging the Beam-Induced Domain Gap for 3D Object Detection ( http://arxiv.org/abs/2203.14956v1 )

Yi Wei, Zibu Wei, Yongming Rao, Jiaxin Li, Jie Zhou, Jiwen Lu(参考訳) 本稿では,異なるLiDARビームによる3次元物体検出のための領域ギャップをブリッジするLiDAR蒸留法を提案する。 多くの現実世界の応用において、大量生産されたロボットや車両が使用するLiDARポイントは通常、大規模な公開データセットよりもビームが少ない。 さらに、LiDARはビーム量が異なる他の製品モデルにアップグレードされるため、以前のバージョンの高解像度センサーが取得したラベル付きデータを利用するのは難しい。 領域適応型3D検出の最近の進歩にもかかわらず、ほとんどの手法はビーム誘起領域ギャップを取り除くのに苦労している。 トレーニングプロセス中に、ソースドメインのポイントクラウド密度とターゲットドメインのポイントクラウド密度を一致させることが不可欠であることがわかった。 この発見に触発されて、ビーム誘起ドメインシフトを緩和するプログレッシブフレームワークを提案する。 各イテレーションにおいて、ハイビーム点雲をダウンサンプリングすることで、まず低ビーム擬似LiDARを生成する。 次に、教師学習フレームワークを用いて、より多くのビームでデータからリッチな情報を蒸留する。 Waymo、nuScenes、KITTIの3つの異なるLiDARベースの検出器による大規模な実験は、我々のLiDAR蒸留の有効性を実証している。 特に、我々の手法は推論の計算コストを増大させません。

In this paper, we propose LiDAR Distillation to bridge the domain gap induced by different LiDAR beams for 3D object detection. In many real-world applications, the LiDAR point clouds used by mass-produced robots and vehicles usually have fewer beams than those in large-scale public datasets. Moreover, as LiDARs are upgraded to other product models with different numbers of beams, it becomes challenging to utilize the labeled data captured by previous versions' high-resolution sensors. Despite the recent progress on domain adaptive 3D detection, most methods struggle to eliminate the beam-induced domain gap. We find that it is essential to align the point cloud density of the source domain with that of the target domain during the training process. Inspired by this discovery, we propose a progressive framework to mitigate the beam-induced domain shift. In each iteration, we first generate low-beam pseudo LiDAR by downsampling the high-beam point clouds. Then the teacher-student framework is employed to distill rich information from the data with more beams. Extensive experiments on Waymo, nuScenes and KITTI datasets with three different LiDAR-based detectors demonstrate the effectiveness of our LiDAR Distillation. Notably, our approach adds no computational cost at inference.
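The step of generating low-beam pseudo LiDAR by downsampling high-beam point clouds can be sketched as below. This is an illustrative simplification, not the authors' pipeline: it assumes every point already carries a beam (ring) index, and the function name and the `keep_every` / `offset` parameters are hypothetical.

```python
def downsample_beams(points, beam_ids, keep_every=2, offset=0):
    """Keep only points from every `keep_every`-th beam.

    points   : list of (x, y, z) tuples
    beam_ids : list of integer beam indices, one per point (assumed given)
    """
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return [p for p, b in zip(points, beam_ids) if b % keep_every == offset]


# Toy 64-beam scan with 3 points per beam (synthetic coordinates).
beam_ids = [beam for beam in range(64) for _ in range(3)]
points = [(float(i), 0.0, 0.1 * b) for i, b in enumerate(beam_ids)]

pseudo_32 = downsample_beams(points, beam_ids, keep_every=2)  # 64 -> 32 beams
pseudo_16 = downsample_beams(points, beam_ids, keep_every=4)  # 64 -> 16 beams
print(len(points), len(pseudo_32), len(pseudo_16))
```

Applying the reduction in stages (64 to 32 to 16 beams) mirrors the progressive flavor described in the abstract, although how the actual method schedules those stages is not specified here.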
# ランダム化実験における汎用機械学習による異種処理効果の統計的推測

Statistical Inference for Heterogeneous Treatment Effects Discovered by Generic Machine Learning in Randomized Experiments ( http://arxiv.org/abs/2203.14511v1 )

Kosuke Imai, Michael Lingzhi Li(参考訳) 研究者たちは、ランダム化実験における因果不均一性を調べるために、機械学習(ML)アルゴリズムに目を向けている。 その約束にもかかわらず、MLアルゴリズムは、多くの共変量と小さなサンプルサイズを持つ実用的な設定の下で、不均一な処理効果を正確に確認できないかもしれない。 さらに、推定の不確実性の定量化は依然として課題である。 汎用MLアルゴリズムによって発見された不均一な処理効果の統計的推測に対する一般手法を開発する。 本研究では,Neymanの繰り返しサンプリングフレームワークを,MLアルゴリズムを用いて条件平均処理効果を推定し,推定した効果の大きさに基づいてサンプルを複数のグループに分割する,共通の設定に適用する。 本研究は,各群の平均治療効果を推定する方法を示し,有効信頼区間を構築する。 さらに, 群間における治療効果の均一性, 群内平均治療効果のランク一貫性に関する非パラメトリックテストを行った。 本手法の有効性は,処理代入のランダム化と単位のランダムサンプリングにのみ依存するため,MLアルゴリズムの特性に依存しない。 最後に,データのランダム分割によって引き起こされる付加的不確実性を考慮し,提案手法をクロスフィッティング手法に一般化する。

Researchers are increasingly turning to machine learning (ML) algorithms to investigate causal heterogeneity in randomized experiments. Despite their promise, ML algorithms may fail to accurately ascertain heterogeneous treatment effects under practical settings with many covariates and small sample sizes. In addition, the quantification of estimation uncertainty remains a challenge. We develop a general approach to statistical inference for heterogeneous treatment effects discovered by a generic ML algorithm. We apply Neyman's repeated sampling framework to a common setting, in which researchers use an ML algorithm to estimate the conditional average treatment effect and then divide the sample into several groups based on the magnitude of the estimated effects. We show how to estimate the average treatment effect within each of these groups, and construct a valid confidence interval. In addition, we develop nonparametric tests of treatment effect homogeneity across groups, and rank-consistency of within-group average treatment effects. The validity of our methodology does not rely on the properties of ML algorithms because it is solely based on the randomization of treatment assignment and random sampling of units. Finally, we generalize our methodology to the cross-fitting procedure by accounting for the additional uncertainty induced by the random splitting of data.
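The setting described above (sort units by an estimated effect, split them into groups, estimate the average treatment effect within each group) can be illustrated with a naive sketch. It uses a plain difference in means and the conventional Neyman variance estimate inside each group; it is not the estimator or the variance derived in the paper, which accounts for the grouping step. The function name and the toy data are assumptions.

```python
from statistics import mean, variance


def grouped_ate(scores, treated, outcomes, n_groups=2):
    """Naive within-group difference-in-means after sorting by ML score.

    scores   : estimated treatment effects from some ML model (assumed given)
    treated  : 1 if the unit was randomized to treatment, else 0
    outcomes : observed outcomes
    Returns a list of (estimate, standard_error), lowest-score group first.
    Each group needs at least two treated and two control units.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_groups
    results = []
    for g in range(n_groups):
        start = g * size
        end = (g + 1) * size if g < n_groups - 1 else len(order)
        idx = order[start:end]
        y1 = [outcomes[i] for i in idx if treated[i]]
        y0 = [outcomes[i] for i in idx if not treated[i]]
        estimate = mean(y1) - mean(y0)
        # Conventional Neyman variance estimate for a difference in means.
        std_error = (variance(y1) / len(y1) + variance(y0) / len(y0)) ** 0.5
        results.append((estimate, std_error))
    return results


# Toy experiment with eight units (synthetic numbers).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcomes = [1.0, 1.0, 1.2, 1.2, 3.0, 1.0, 3.4, 1.4]

results = grouped_ate(scores, treated, outcomes, n_groups=2)
print(results)
```

In this toy data the high-score group shows a larger effect than the low-score group, which is the kind of pattern the paper's homogeneity and rank-consistency tests are designed to assess formally.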
# 高密度部分グラフによる相関 Erd\H{o}s-R\enyi グラフの検出しきい値

Detection threshold for correlated Erd\H{o}s-R\'enyi graphs via densest subgraphs ( http://arxiv.org/abs/2203.14573v1 )

Jian Ding, Hang Du(参考訳) $n$ 個の非ラベルノード上の 2 つの Erd\H{o}s-R\'enyi ランダムグラフ間の辺相関を検出する問題は、仮説検定問題として定式化することができる: 帰無仮説の下では、2つのグラフは独立にサンプリングされる; 対立仮説の下では、2つのグラフは Erd\H{o}s-R\'enyi $\mathbf{G}(n, p)$ の親グラフから独立にサブサンプリングされる(周辺分布は帰無仮説と同じになる)。 $p = n^{-\alpha+o(1)}$ ($\alpha\in (0, 1]$) のとき、鋭い情報理論的しきい値を確立し、Wu, Xu, Yu の最近の研究における定数因子を改善する。 我々の研究における重要な新規性は、検出問題と Erd\H{o}s-R\'enyi グラフの最も密度の高い部分グラフの間の興味深い関係である。

The problem of detecting edge correlation between two Erd\H{o}s-R\'enyi random graphs on $n$ unlabeled nodes can be formulated as a hypothesis testing problem: under the null hypothesis, the two graphs are sampled independently; under the alternative, the two graphs are independently sub-sampled from a parent graph which is Erd\H{o}s-R\'enyi $\mathbf{G}(n, p)$ (so that their marginal distributions are the same as the null). We establish a sharp information-theoretic threshold when $p = n^{-\alpha+o(1)}$ for $\alpha\in (0, 1]$ which sharpens a constant factor in a recent work by Wu, Xu and Yu. A key novelty in our work is an interesting connection between the detection problem and the densest subgraph of an Erd\H{o}s-R\'enyi graph.
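
The testing problem above can be written out explicitly. The sketch below assumes the usual subsampling model in which every edge of the parent graph is kept independently with probability $s$; the symbols $s$ and $\pi$ are notation introduced here for illustration and do not appear in the abstract.

```latex
\begin{align*}
H_0 &: \; G_1, G_2 \overset{\text{i.i.d.}}{\sim} \mathbf{G}(n, ps), \\
H_1 &: \; G \sim \mathbf{G}(n, p), \quad G_1, G_2 \text{ each keep every edge of } G
        \text{ independently with probability } s, \\
    &\phantom{:}\; \text{and the vertices of } G_2 \text{ are relabeled by a uniformly random permutation } \pi .
\end{align*}
% Under both hypotheses each graph is marginally G(n, ps).
% For a fixed vertex pair (matched through \pi under H_1):
%   P_{H_1}(edge present in both graphs) = p s^2,
%   P_{H_0}(edge present in both graphs) = (ps)^2 .
```

Because the marginals coincide, any test has to rely on the joint structure of the two graphs, which is unobserved up to the relabeling $\pi$.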
# 線形システム同定における無限次元スパース学習

Infinite-Dimensional Sparse Learning in Linear System Identification ( http://arxiv.org/abs/2203.14731v1 )

Mingzhou Yin, Mehmet Tolga Akan, Andrea Iannelli, Roy S. Smith(参考訳) 正規化法は既知のモデル構造を持たないシステム同定問題に広く適用されている。 本稿では,原子ノルム正規化に基づく無限次元スパース学習アルゴリズムを提案する。 原子ノルム正規化は、伝達関数を一階原子モデルに分解し、粗い極の集合を選択し、対応する係数を識別する群ラスソ問題を解く。 この問題を解決することの難しさは、可能な原子モデルが無限に存在するという事実にある。 本研究は,既存の問題の最適条件の破れを最大化する新しい候補原子モデルを生成する欲望アルゴリズムを提案する。 このアルゴリズムは、無限次元群ラッソ問題を高精度に解くことができる。 このアルゴリズムはさらに、反復的に重み付けされた適応群ラスソと相補的なペア安定性の選択により、極位置推定におけるバイアスの低減と偽陽性の否定のために拡張される。 数値計算により,提案アルゴリズムは,インパルス応答フィッティングと極位置推定の両方の観点から,ベンチマークパラメータ化および正規化手法よりも優れた性能を示した。

Regularized methods have been widely applied to system identification problems without known model structures. This paper proposes an infinite-dimensional sparse learning algorithm based on atomic norm regularization. Atomic norm regularization decomposes the transfer function into first-order atomic models and solves a group lasso problem that selects a sparse set of poles and identifies the corresponding coefficients. The difficulty in solving the problem lies in the fact that there are an infinite number of possible atomic models. This work proposes a greedy algorithm that generates new candidate atomic models maximizing the violation of the optimality condition of the existing problem. This algorithm is able to solve the infinite-dimensional group lasso problem with high precision. The algorithm is further extended to reduce the bias and reject false positives in pole location estimation by iteratively reweighted adaptive group lasso and complementary pairs stability selection respectively. Numerical results demonstrate that the proposed algorithm performs better than benchmark parameterized and regularized methods in terms of both impulse response fitting and pole location estimation.
# 量子回路を用いた最適無分類と密度推定

Optimisation-free Classification and Density Estimation with Quantum Circuits ( http://arxiv.org/abs/2203.14452v1 )

Vladimir Vargas-Calder\'on, Fabio A. Gonz\'alez, and Herbert Vinck-Posada(参考訳) 量子回路を用いた分類と確率密度推定のための新しい機械学習フレームワークの実装を実証する。 このフレームワークは、トレーニングデータセットまたは単一のデータサンプルを、量子特徴マップを介して物理システムの量子状態にマップする。 任意の大きなトレーニングデータセットの量子状態は、その確率分布を有限次元の量子波動関数で要約する。 新しいデータサンプルの量子状態をトレーニングデータセットの量子状態に投影することにより、統計を導出して、新しいデータサンプルの密度を分類または推定することができる。 注目すべきは、実際の量子デバイスに対する我々のフレームワークの実装は、量子回路パラメータの最適化を必要としないことである。 それにもかかわらず、我々はこのフレームワークの量子長所を活用できる変分量子回路アプローチについて論じる。

We demonstrate the implementation of a novel machine learning framework for classification and probability density estimation using quantum circuits. The framework maps a training data set or a single data sample to the quantum state of a physical system through quantum feature maps. The quantum state of the arbitrarily large training data set summarises its probability distribution in a finite-dimensional quantum wave function. By projecting the quantum state of a new data sample onto the quantum state of the training data set, one can derive statistics to classify or estimate the density of the new data sample. Remarkably, the implementation of our framework on a real quantum device does not require any optimisation of quantum circuit parameters. Nonetheless, we discuss a variational quantum circuit approach that could leverage quantum advantage for our framework.
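
The projection idea above can be simulated classically in a few lines. This is a minimal sketch under strong assumptions: the feature map is a hypothetical single-qubit angle encoding chosen for illustration, and the training state is the normalised sum of feature states. The abstract does not specify the paper's feature maps or state-preparation circuits, so none of the names below come from it.

```python
import math


def feature_map(x):
    """Hypothetical angle encoding: x in [0, pi] -> amplitudes (cos(x/2), sin(x/2))."""
    return (math.cos(x / 2), math.sin(x / 2))


def training_state(samples):
    """Normalised superposition of the feature states of all training samples."""
    a = sum(feature_map(x)[0] for x in samples)
    b = sum(feature_map(x)[1] for x in samples)
    norm = math.hypot(a, b)
    return (a / norm, b / norm)


def overlap(state, x):
    """Squared projection of the new sample's state onto a training state."""
    a, b = feature_map(x)
    return (state[0] * a + state[1] * b) ** 2


def classify(states, x):
    """Pick the class whose training state has the largest overlap with x."""
    return max(states, key=lambda label: overlap(states[label], x))
```

No parameter is fitted anywhere: building one training state per class is a single pass over the data, which is the sense in which the scheme is optimisation-free. With `states = {"low": training_state([...]), "high": training_state([...])}`, calling `classify(states, x)` returns the label with the larger overlap.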
# 薬物・薬物相互作用予測のための多視点サブ構造学習

Multi-View Substructure Learning for Drug-Drug Interaction Prediction ( http://arxiv.org/abs/2203.14513v1 )

Zimeng Li, Shichao Zhu, Bin Shao, Tie-Yan Liu, Xiangxiang Zeng and Tong Wang(参考訳) 薬物と薬物の相互作用(DDI)予測は、体系的に有効な治療のための薬物の組み合わせ戦略を提供する。 先行研究は通常、薬物自体のような単一視点に制約された薬物情報をモデル化し、不完全でノイズの多い情報となり、DDI予測の精度が制限される。 本研究では,単剤 (intra-view) と対 (inter-view) の両方の表現から化学サブストラクチャーを学習し,そのサブストラクチャーを用いて反復的に薬物表現を更新するddi予測(msn-ddi)のための新しい多視点薬物サブストラクチャーネットワークを提案する。 総合的な評価では、MSN-DDIは、トランスダクティブ・セッティングの下で比較的改善された19.32%と99%以上の精度を達成することで、既存の薬物に対するDDI予測をほぼ解決したことを示している。 さらに重要なことは、MSN-DDIはより困難な誘導シナリオの下で、比較的改善された7.07%の精度で薬物を発見できるより良い一般化能力を示す。 最後に、MSN-DDIは、新しい薬物に対する現実世界のDDIアプリケーションの予測性能を改善する。

Drug-drug interaction (DDI) prediction provides a drug combination strategy for systemically effective treatment. Previous studies usually model drug information constrained to a single view such as the drug itself, leading to incomplete and noisy information, which limits the accuracy of DDI prediction. In this work, we propose a novel multi-view drug substructure network for DDI prediction (MSN-DDI), which learns chemical substructures from both the representations of the single drug (intra-view) and the drug pair (inter-view) simultaneously and utilizes the substructures to update the drug representation iteratively. Comprehensive evaluations demonstrate that MSN-DDI has almost solved DDI prediction for existing drugs by achieving a relatively improved accuracy of 19.32% and an over 99% accuracy under the transductive setting. More importantly, MSN-DDI exhibits better generalization ability to unseen drugs with a relatively improved accuracy of 7.07% under more challenging inductive scenarios. Finally, MSN-DDI improves prediction performance for real-world DDI applications to new drugs.
# 動的環境における最適オンライン凸最適化

Optimistic Online Convex Optimization in Dynamic Environments ( http://arxiv.org/abs/2203.14520v1 )

Qing-xin Meng, Jian-wei Liu(参考訳) 本稿では,動的環境における楽観的なオンライン凸最適化問題について検討する。 既存の研究によると、Ader は $O\left(\sqrt{\left(1+P_T\right)T}\right)$ の動的リグレット上界を持ち、$T$ はラウンド数、$P_T$ は参照戦略列のパス長である。 しかし、Aderは環境適応的ではない。 楽観性が環境適応性を実現するためのフレームワークを提供するという事実に基づいて,Ader の Greedy Projection (GP) と Normalized Exponentiated Subgradient (NES) をそれぞれ Optimistic-GP と Optimistic-NES に置き換え,対応するアルゴリズムを ONES-OGP と名付ける。 さらに倍化トリックを適応的トリックに拡張し、楽観性から自然に生じる $M_T$, $\widetilde{M}_T$, $V_T+1_{L^2\rho\left(\rho+2 P_T\right)\leqslant\varrho^2 V_T}D_T$ という3つの特性項を導入することで、動的リグレット上界の $T$ への依存性を置き換える。 我々は,適応的トリックを用いた ONES-OGP とその劣勾配変動バージョンを詳述し,これらはすべて環境適応的である。

In this paper, we study the optimistic online convex optimization problem in dynamic environments. Existing works have shown that Ader enjoys an $O\left(\sqrt{\left(1+P_T\right)T}\right)$ dynamic regret upper bound, where $T$ is the number of rounds, and $P_T$ is the path length of the reference strategy sequence. However, Ader is not environment-adaptive. Based on the fact that optimism provides a framework for implementing environment adaptivity, we replace Greedy Projection (GP) and Normalized Exponentiated Subgradient (NES) in Ader with Optimistic-GP and Optimistic-NES respectively, and name the corresponding algorithm ONES-OGP. We also extend the doubling trick to the adaptive trick, and introduce three characteristic terms that naturally arise from optimism, namely $M_T$, $\widetilde{M}_T$ and $V_T+1_{L^2\rho\left(\rho+2 P_T\right)\leqslant\varrho^2 V_T}D_T$, to replace the dependence of the dynamic regret upper bound on $T$. We elaborate ONES-OGP with the adaptive trick and its subgradient variation version, all of which are environment-adaptive.
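
To make the role of optimism concrete, the sketch below shows a generic optimistic projected-gradient update in which the previous gradient serves as the prediction (hint) for the next one. It is an assumption-laden illustration of the Optimistic-GP idea only: the step size, the L2-ball feasible set, and the choice of hint are placeholders, and the expert-combination layer (Optimistic-NES) and the adaptive trick of ONES-OGP are not reproduced.

```python
def project(x, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = sum(v * v for v in x) ** 0.5
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def optimistic_gd(grad_fns, dim, eta=0.1, radius=1.0):
    """Play one point per round; grad_fns[t](x) returns the round-t (sub)gradient at x."""
    y = [0.0] * dim     # secondary iterate, updated with the observed gradient
    hint = [0.0] * dim  # prediction of the upcoming gradient (assumed: last gradient)
    plays = []
    for grad in grad_fns:
        # Optimistic step: move from y along the predicted gradient.
        x = project([yi - eta * mi for yi, mi in zip(y, hint)], radius)
        g = grad(x)
        # Correction step: update y with the gradient that was actually observed.
        y = project([yi - eta * gi for yi, gi in zip(y, g)], radius)
        hint = g
        plays.append(x)
    return plays
```

When consecutive gradients are similar, the hint is accurate and the played point is close to what a one-step lookahead would have chosen, which is the mechanism behind variation-dependent regret terms.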
# SEによる視覚オブザーバの方向性ランドマーク配置の最適化(3)

Optimization of Directional Landmark Deployment for Visual Observer on SE(3) ( http://arxiv.org/abs/2203.14485v1 )

Zike Lei, Xi Chen, Ying Tan, Xiang Chen, Li Chai(参考訳) 本稿では, 3次元タスク空間内の任意の領域内の方向ランドマーク数(位置とポーズ)を新たに配置するための最適化手法を提案する。 この新しい展開技術はランドマークとモノクラーカメラの両方の幾何学モデルに基づいて構築されている。 特に、カメラが固定位置で同時にカバーする少なくともn個のランドマークの確率を特徴付けるために、MCP(Multiple Coverage Probability)という新しい概念が定義される。 この最適化は、与えられた3次元空間をグローバルに探索することでmcpを最大化するために与えられたランドマークの数と位置について行われる。 除去遺伝的アルゴリズムを採用することにより、大域的最適解を得ることができ、実演例としてSE(3)上の視覚観察者の収束性能を改善するために応用される。 提案手法の有効性を検証するため,シミュレーションと実験を行った。

An optimization method is proposed in this paper for the novel deployment of a given number of directional landmarks (location and pose) within a given region in the 3-D task space. This new deployment technique is built on the geometric models of both landmarks and the monocular camera. In particular, a new concept of Multiple Coverage Probability (MCP) is defined to characterize the probability of at least n landmarks being covered simultaneously by a camera at a fixed position. The optimization is conducted with respect to the position and pose of the given number of landmarks to maximize MCP through global exploration of the given 3-D space. By adopting the elimination genetic algorithm, the global optimal solutions can be obtained, which are then applied to improve the convergence performance of the visual observer on SE(3) as a demonstration example. Both simulation and experimental results are presented to validate the effectiveness of the proposed landmark deployment optimization method.
# Open-VICO:人間-ロボットコラボレーションにおけるマルチカメラベースの骨格追跡のためのオープンソースのガゼボツールキット

Open-VICO: An Open-Source Gazebo Toolkit for Multi-Camera-based Skeleton Tracking in Human-Robot Collaboration ( http://arxiv.org/abs/2203.14733v1 )

Luca Fortini (1), Mattia Leonori (1), Juan M. Gandarias (1), Arash Ajoudani (1) ((1) Human-Robot Interfaces and Physical Interaction, Istituto Italiano di Tecnologia)(参考訳) シミュレーションツールはロボット研究、特にヒューマン・ロボティクス・コラボレーション(HRC)のような安全性が重要である分野において不可欠である。 しかし、人間の振る舞いをシミュレートすることは困難であり、既存のロボットシミュレータは機能的人間モデルを統合していない。 Open-VICO~\footnote{\url{https://gitlab.iit.it/hrii-public/open-vico}}はガゼボで仮想人間モデルを統合するオープンソースツールキットである。 特にOpen-VICOは、現実的な人間の運動モデル、マルチカメラビジョンのセットアップ、そして人間の追跡技術、そしてGazeboのおかげで多くのロボットやセンサーモデルを組み合わせることができる。 予め記録された人間の骨格運動をモーションキャプチャーシステムに組み込むことは、人間-ロボットインタラクション(HRI)設定における人間のパフォーマンス行動解析の景観を広げる。 機能とストレスを説明するために,本研究のシミュレーションツールを用いて,関連する文献課題の中から選択した4つの具体例について述べる。 一 シミュレーションにおける3次元マルチRGB-Dカメラキャリブレーション ii)openposeに基づく人工ヒト骨格追跡データセットの作成 三 シミュレーションにおけるヒト骨格追跡のためのマルチカメラシナリオ iv) 人間とロボットの相互作用例。 この研究の鍵は、軽量な人間追跡と柔軟な人間ロボットアプリケーションのための新しいビジョンベースのアルゴリズムと方法論の研究を動機付ける、素直なパイプラインを作ることだ。

Simulation tools are essential for robotics research, especially for those domains in which safety is crucial, such as Human-Robot Collaboration (HRC). However, it is challenging to simulate human behaviors, and existing robotics simulators do not integrate functional human models. This work presents Open-VICO (https://gitlab.iit.it/hrii-public/open-vico), an open-source toolkit to integrate virtual human models in Gazebo focusing on vision-based human tracking. In particular, Open-VICO allows combining, in the same simulation environment, realistic human kinematic models, multi-camera vision setups, and human-tracking techniques along with numerous robot and sensor models thanks to Gazebo. The possibility to incorporate pre-recorded human skeleton motion with Motion Capture systems broadens the landscape of human performance behavioral analysis within Human-Robot Interaction (HRI) settings. To describe the functionalities and stress the potential of the toolkit, four specific examples, chosen among relevant literature challenges in the field, are developed using our simulation utils: i) 3D multi-RGB-D camera calibration in simulation, ii) creation of a synthetic human skeleton tracking dataset based on OpenPose, iii) multi-camera scenario for human skeleton tracking in simulation, and iv) a human-robot interaction example. The key of this work is to create a straightforward pipeline which we hope will motivate research on new vision-based algorithms and methodologies for lightweight human-tracking and flexible human-robot applications.
# 三次元感情認識における音声・視覚融合の連関モデル

A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition ( http://arxiv.org/abs/2203.14779v1 )

Gnana Praveen Rajasekar, Wheidima Carneiro de Melo, Nasib Ullah, Haseeb Aslam, Osama Zeeshan, Th\'eo Denorme, Marco Pedersoli, Alessandro Koerich, Patrick Cardinal, and Eric Granger(参考訳) マルチモーダル感情認識は,複数のモーダル(音声,視覚,生体信号など)に対する多様かつ相補的な関係を活用でき,ノイズモーダルに対してある程度の堅牢性を提供できるため,近年注目を集めている。 オーディオ・ヴィジュアル(A-V)融合の最先端手法の多くは、A-Vの相補的な性質を効果的に活用しない再帰的ネットワークや従来の注意機構に依存している。 本稿では,ビデオから抽出した顔と声のモーダリティの融合に基づく,次元的感情認識に焦点をあてる。 具体的には,原子価と覚醒の連続値の正確な予測を可能にするために,a-vモダリティにまたがるサルエント特徴を抽出するための相補的関係に依存する結合的クロス・アテンションモデルを提案する。 提案する融合モデルはモーダル間関係を効率的に活用し,特徴間の不均一性を低減できる。 特に、合成特徴表現と個々のモダリティの相関関係に基づいて、クロスアテンション重みを計算する。 結合したA-V特徴表現をクロスアテンションモジュールにデプロイすることで、当社の融合モジュールの性能はバニラクロスアテンションモジュールよりも大幅に向上する。 AffWild2データセットによる検証セットビデオの実験結果から,提案したA-V融合モデルが,最先端のアプローチよりも優れたコスト効率のソリューションを提供することが示された。 コードはGitHubで入手できる。 https://github.com/praveena2j/JointCrossAttentional-AV-Fusion。

Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention weights based on correlation between the combined feature representation and individual modalities. By deploying the combined A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
# 全光学的定量位相顕微鏡の微分顕微鏡設計

Differentiable Microscopy Designs an All Optical Quantitative Phase Microscope ( http://arxiv.org/abs/2203.14944v1 )

Kithmini Herath, Udith Haputhanthri, Ramith Hettiarachchi, Hasindu Kariyawasam, Azeem Ahmad, Balpreet S. Ahluwalia, Chamira U. S. Edussooriya, Dushan Wadduwage(参考訳) 16世紀後半のザカリアス・ヤンセンによる最初の顕微鏡以来、科学者たちは様々なタスクのために新しいタイプの顕微鏡を発明してきた。 新たなアーキテクチャを作るには、何十年も、科学的な経験と創造性が必要になります。 本研究では,深層学習に基づく設計パラダイムである微分顕微鏡($\partial\mu$)を導入し,新しい解釈可能な顕微鏡アーキテクチャの設計を支援する。 微分可能な顕微鏡は、まず一般的な物理ベースの光学系をモデル化するが、トレーニング可能な光学素子は光学経路上の重要な位置にある。 事前に取得したデータを使用して、関心のあるタスクのためにモデルエンドツーエンドをトレーニングします。 学習デザインの提案は、学習した光学要素を解釈することで単純化することができる。 まず,光学式4-$f$システムを用いて,計算後再構成を必要としない全光学的定量的位相顕微鏡(qpm)設計を提案する。 続く文献調査では、学習アーキテクチャは20年前に開発された一般化フェーズの概念と似ていることが示唆された。 次に、一般化位相コントラストの概念を取り入れ、学習手順を簡素化する。 さらに、この物理光学装置は、D2NN(diffractive Deep Neural Network)を用いて小型化される。 我々は、複数のデータセットで全光位相-強度変換を行うための既存のベンチマークを上回り、D2NN上でこの種のデモを初めて行った。 提案された微分可能な顕微鏡フレームワークは、新しい光学系を設計する創造的なプロセスを補完するものであり、おそらくは従来と変わらないがより良い光学設計につながるだろう。

Ever since the first microscope by Zacharias Janssen in the late 16th century, scientists have been inventing new types of microscopes for various tasks. Inventing a novel architecture demands years, if not decades, worth of scientific experience and creativity. In this work, we introduce Differentiable Microscopy ($\partial\mu$), a deep learning-based design paradigm, to help scientists design new interpretable microscope architectures. Differentiable microscopy first models a common physics-based optical system, but with trainable optical elements at key locations on the optical path. Using pre-acquired data, we then train the model end-to-end for a task of interest. The learnt design proposal can then be simplified by interpreting the learnt optical elements. As a first demonstration, based on the optical 4-$f$ system, we present an all-optical quantitative phase microscope (QPM) design that requires no computational post-reconstruction. A follow-up literature survey suggested that the learnt architecture is similar to the generalized phase concept developed two decades ago. We then incorporate the generalized phase contrast concept to simplify the learning procedure. Furthermore, this physical optical setup is miniaturized using a diffractive deep neural network (D2NN). We outperform the existing benchmark for all-optical phase-to-intensity conversion on multiple datasets, and ours is the first demonstration of its kind on D2NNs. The proposed differentiable microscopy framework supplements the creative process of designing new optical systems and would perhaps lead to unconventional but better optical designs.
# コンテンツとタスクアウェアの異なる顕微鏡による圧縮蛍光イメージング

Differentiable Microscopy for Content and Task Aware Compressive Fluorescence Imaging ( http://arxiv.org/abs/2203.14945v1 )

Udith Haputhanthri, Andrew Seeber, Dushan Wadduwage(参考訳) スループットと画質のトレードオフは、顕微鏡に固有の課題である。 スループットを向上させるため、圧縮撮像は画像信号をアンダーサンプリングし、画像は正規化逆問題を解くことで計算的に再構成される。 従来の正規化器と比較すると、Deep Learningベースの手法は圧縮と画質において大きな成功を収めている。 しかし、取得プロセスにおける情報損失は圧縮限界を設定する。 したがって、復元品質を損なうことなく圧縮のさらなる改善が課題となる。 本研究では,学習可能な物理パラメータ(例えば照明パターン)を持つ現実的な一般化されたフォワードモデルと,新しい物理に触発された逆モデルを含む,微分可能な圧縮蛍光顕微鏡($\partial \mu$)を提案する。 カスケードモデルはエンドツーエンドの微分可能であり、トレーニングデータを通じて最適な圧縮サンプリングスキームを学習できる。 本モデルでは, 各種圧縮顕微鏡構成の数値実験を数千回行った。 以上の結果から,学習サンプリングは,高い圧縮率($\times 100- 1000$)において,広く用いられている従来の圧縮サンプリング方式を再構成品質の点で上回ることが示唆された。 タスク認識圧縮のためのフレームワークをさらに活用する。 実験の結果,超高圧縮($\times 4096$)でもセグメンテーションタスクにおいて優れた性能を示す。

The trade-off between throughput and image quality is an inherent challenge in microscopy. To improve throughput, compressive imaging under-samples image signals; the images are then computationally reconstructed by solving a regularized inverse problem. Compared to traditional regularizers, deep learning-based methods have achieved greater success in compression and image quality. However, the information loss in the acquisition process sets the compression bounds. Further improvement in compression without compromising the reconstruction quality is thus a challenge. In this work, we propose differentiable compressive fluorescence microscopy ($\partial \mu$), which includes a realistic generalizable forward model with learnable physical parameters (e.g. illumination patterns), and a novel physics-inspired inverse model. The cascaded model is end-to-end differentiable and can learn optimal compressive sampling schemes through training data. With our model, we performed thousands of numerical experiments on various compressive microscope configurations. Our results suggest that learned sampling outperforms widely used traditional compressive sampling schemes at higher compressions ($\times 100- 1000$) in terms of reconstruction quality. We further utilize our framework for Task Aware Compression. The experimental results show superior performance on segmentation tasks even at extremely high compression ($\times 4096$).
# グリーディパラメータ探索による除去テンプレートの最適化

Optimizing Elimination Templates by Greedy Parameter Search ( http://arxiv.org/abs/2203.14901v1 )

Evgeniy Martyushev, Jana Vrablikova, Tomas Pajdla(参考訳) 本研究では, 移動, 画像マッチング, カメラトラッキングなどの最小問題を解くために, 効率的な多項式システムのための除去テンプレートを構築する手法を提案する。 まず,有限個の異なる解を持つ系に対する除去テンプレートの特定のアフィンパラメータ化を構築する。 次に,パラメータ空間上のヒューリスティックなグリーディ最適化戦略を用いて,小さなサイズのテンプレートを得る。 コンピュータビジョンにおける34の最小問題に対して本手法をテストした。 それらすべてにおいて、テンプレートは最先端のものと比べて、同じか小さいかのどちらかです。 難しい例では、テンプレートは2.1、2.5、3.8、6.6倍小さくなります。 焦点距離が不明な屈折絶対ポーズ推定の問題に対して,20倍のテンプレートが発見された。 また,合成データを用いた実験により,新しい解法が高速かつ数値的精度を示す。 また,未知の共通焦点長と放射歪を持つ相対ポーズ推定問題に対して,高速で高精度な解法を提案する。

We propose a new method for constructing elimination templates for efficient polynomial system solving of minimal problems in structure from motion, image matching, and camera tracking. We first construct a particular affine parameterization of the elimination templates for systems with a finite number of distinct solutions. Then, we use a heuristic greedy optimization strategy over the space of parameters to get a template with a small size. We test our method on 34 minimal problems in computer vision. For all of them, we found templates of the same or smaller size than the state of the art. For some difficult examples, our templates are, e.g., 2.1, 2.5, 3.8, 6.6 times smaller. For the problem of refractive absolute pose estimation with unknown focal length, we have found a template that is 20 times smaller. Our experiments on synthetic data also show that the new solvers are fast and numerically accurate. We also present a fast and numerically accurate solver for the problem of relative pose estimation with unknown common focal length and radial distortion.
# バーチャルリアリティによるパーソナライズされた人間認識ロボットナビゲーションの学習

Learning Personalized Human-Aware Robot Navigation Using Virtual Reality Demonstrations from a User Study ( http://arxiv.org/abs/2203.14741v1 )

Jorge de Heuvel, Nathan Corral, Lilli Bruckschen, Maren Bennewitz(参考訳) 最も快適で人間を意識したロボットナビゲーションのためには、主観的なユーザー好みを考慮する必要がある。 本稿では,パーソナライズされたナビゲーションコントローラと直感的なバーチャルリアリティデモインタフェースを学習するための,新しい強化学習フレームワークを提案する。 実施したユーザー調査は、私たちのパーソナライズされたアプローチが、より快適な人間-ロボット体験で古典的アプローチを著しく上回っていることを示している。 これらの結果を得るためには,非熟練ユーザによるデモトラジェクタをほんの数個使用して,直感的なデモ設定を主に評価する。 実験で示すように、学習したコントローラは、ナビゲーション中のユーザの好みを反映しながら、デモデータにカバーされていない状態によく一般化する。 最後に,実ロボットに性能を損なうことなくナビゲーションコントローラを転送する。

For the most comfortable, human-aware robot navigation, subjective user preferences need to be taken into account. This paper presents a novel reinforcement learning framework to train a personalized navigation controller along with an intuitive virtual reality demonstration interface. The conducted user study provides evidence that our personalized approach significantly outperforms classical approaches with more comfortable human-robot experiences. We achieve these results using only a few demonstration trajectories from non-expert users, who predominantly appreciate the intuitive demonstration setup. As we show in the experiments, the learned controller generalizes well to states not covered in the demonstration data, while still reflecting user preferences during navigation. Finally, we transfer the navigation controller without loss in performance to a real robot.
# 適応的リスク傾向:分散強化学習によるクラッタ環境におけるナノドローンナビゲーション

Adaptive Risk Tendency: Nano Drone Navigation in Cluttered Environments with Distributional Reinforcement Learning ( http://arxiv.org/abs/2203.14749v1 )

Cheng Liu, Erik-Jan van Kampen, Guido C.H.E. de Croon(参考訳) リスク評価能力とリスク認識決定能力を備えたロボットの開発は、不確実性の下で動作しているロボットの堅牢性を確保するための重要なステップとして広く考えられている。 本稿では,nano drone robotが部分的可観測性下で障害物を避けながら,aprioriの未知環境をナビゲートする特定の事例について考察する。 本稿では,適応的リスク傾向を学習するための分散強化学習フレームワークを提案する。 具体的には,学習行動値分布のテール条件分散を不確実性測定として使用し,指数重み付け平均予測アルゴリズムを用いて,環境内の観測された不確実性に基づいて,実行時のリスクテンデンシーを自動的に適応する手法を提案する。 提案アルゴリズムは,シミュレーションと実世界の実験の両方において,ハエのリスク感度を調節し,リスクニュートラルポリシやリスク・アバースポリシよりも優れたパフォーマンスを実現する。 コードと実世界の実験ビデオはこのリポジトリにある。 \url{https://github.com/tudelft/risk-sensitive-rl.git}

Enabling robots with the capability of assessing risk and making risk-aware decisions is widely considered a key step toward ensuring robustness for robots operating under uncertainty. In this paper, we consider the specific case of a nano drone robot learning to navigate an a priori unknown environment while avoiding obstacles under partial observability. We present a distributional reinforcement learning framework in order to learn adaptive risk tendency policies. Specifically, we propose to use the tail conditional variance of the learnt action-value distribution as an uncertainty measurement, and use an exponentially weighted average forecasting algorithm to automatically adapt the risk tendency at run-time based on the observed uncertainty in the environment. We show that our algorithm can adjust its risk sensitivity on the fly both in simulation and real-world experiments, and that it achieves better performance than risk-neutral or risk-averse policies. Code and real-world experiment video can be found in this repository: https://github.com/tudelft/risk-sensitive-rl.git
# あなたが何をしたかを学ぶ: ユーザー行動監督による製品分類学の拡張

Learning What You Need from What You Did: Product Taxonomy Expansion with User Behaviors Supervision ( http://arxiv.org/abs/2203.14921v1 )

Sijie Cheng, Zhouhong Gu, Bang Liu, Rui Xie, Wei Wu and Yanghua Xiao(参考訳) 分類学は様々な領域で広く使われており、多くの応用がなされている。 特に、商品分類は、レコメンデーション、ブラウジング、クエリ理解のためのeコマースドメインにおいて重要な役割を果たす。 しかし、タコノミクスは、手動のメンテナンスや更新に依存する場合、高価で労働集約的なEコマースプラットフォームにおいて、新しく登場した用語や概念を常に把握する必要がある。 そこで,既存の分類群に新しい概念を自動的に付加する分類展開タスクを目標とした。 本稿では,既存の分類体系に新たな概念を付加するための,自己監督型およびユーザ行動指向の製品分類拡張フレームワークを提案する。 本フレームワークは,ユーザの意図や認知に適合した偽善関係を抽出する。 具体的には 一 ユーザーの行動情報を十分に活用するために、クエリークリックの概念からユーザーの興味に合致する候補の偽善関係を抽出する。 二 新しい概念のセマンティック情報を強化し、偽名関係をよりよく検出するために、事前学習言語モデルとグラフニューラルネットワークとコントラスト学習を併用することにより、既存の分類とユーザクリックログにおけるユーザ生成コンテンツと構造情報の両方を通して概念と関係をモデル化する。 三 データセット構築のコストを削減し、データスキューを克服するために、既存の分類学からの高品質でバランスの取れたトレーニングデータセットを監督なしで構築する。 毎日7000万人以上のアクティブユーザーとテイクアウトを注文する中国の垂直eコマースプラットフォームであるMeituan Platformにおける実世界の製品分類に関する大規模な実験は、最先端の手法よりも提案するフレームワークの優位性を実証している。 特に,実世界の製品分類を39,263から94,698まで88%の精度で拡張した。

Taxonomies have been widely used in various domains to underpin numerous applications. In particular, product taxonomies serve an essential role in the e-commerce domain for recommendation, browsing, and query understanding. However, taxonomies need to constantly capture the newly emerged terms or concepts in e-commerce platforms to keep up-to-date, which is expensive and labor-intensive if it relies on manual maintenance and updates. Therefore, we target the taxonomy expansion task to attach new concepts to existing taxonomies automatically. In this paper, we present a self-supervised and user behavior-oriented product taxonomy expansion framework to append new concepts into existing taxonomies. Our framework extracts hyponymy relations that conform to users' intentions and cognition. Specifically, i) to fully exploit user behavioral information, we extract candidate hyponymy relations that match user interests from query-click concepts; ii) to enhance the semantic information of new concepts and better detect hyponymy relations, we model concepts and relations through both user-generated content and structural information in existing taxonomies and user click logs, by leveraging Pre-trained Language Models and Graph Neural Network combined with Contrastive Learning; iii) to reduce the cost of dataset construction and overcome data skews, we construct a high-quality and balanced training dataset from existing taxonomy with no supervision. Extensive experiments on real-world product taxonomies in Meituan Platform, a leading Chinese vertical e-commerce platform to order take-out with more than 70 million daily active users, demonstrate the superiority of our proposed framework over state-of-the-art methods. Notably, our method enlarges the size of real-world product taxonomies from 39,263 to 94,698 relations with 88% precision.
# ブートストラップによるブラックボックス選択推論

Black-box Selective Inference via Bootstrapping ( http://arxiv.org/abs/2203.14504v1 )

Sifan Liu, Jelena Markovic, Jonathan Taylor(参考訳) 本稿では,ブラックボックスとなる可能性のあるモデル選択手順の後に選択推論を行う手法を提案する。 条件付き選択後推論フレームワークにおいて、テスト統計量の選択後分布を決定する重要な量は、統計上のモデル条件を選択する確率である。 ブートストラップされたデータセット上でモデル選択手順を繰り返し実行することにより、選択イベントを示すバイナリ応答と、特別に設計された共変量を含むトレーニングデータを生成し、選択確率を学習する。 構成された信頼区間は、対象パラメータの近傍で十分な選択確率を学習できれば漸近的に有効であることを示す。 提案アルゴリズムの有効性をいくつかの例で検証する。

We propose a method for selective inference after a model selection procedure that is potentially a black box. In the conditional post-selection inference framework, a crucial quantity in determining the post-selection distribution of a test statistic is the probability of selecting the model conditional on the statistic. By repeatedly running the model selection procedure on bootstrapped datasets, we can generate training data with binary responses indicating the selection event as well as specially designed covariates, which are then used to learn the selection probability. We prove that the constructed confidence intervals are asymptotically valid if we can learn the selection probability sufficiently well around a neighborhood of the target parameter. The validity of the proposed algorithm is verified by several examples.
# インタラクティブな画像ベースモデリングシステム

An Interactive Image-based Modeling System ( http://arxiv.org/abs/2203.14441v1 )

Zhi He, Rui Wang, Wei Hua, Yuchi Huo(参考訳) 本稿では, 対話型3次元モデリング手法と, 単一のあるいは複数の未校正画像に基づく対応システムを提案する。 本手法の主な特徴は,一般人のモデリング習慣により,対象物の3dモデルが粗画像から細画像に再構成される点である。 近似形状の決定に基づいて、投影制約と空間制約を追加または修正し、トポロジー修正を適用し、カメラキャリブレーションを徐々に実現し、粗いモデルを洗練し、最終的に任意の幾何学とトポロジーでオブジェクトの再構成を完了させる。 インタラクティブな処理の間、幾何学的パラメータとカメラ投影行列をリアルタイムで解き、再構成結果を3Dウィンドウに表示する。

This paper proposes an interactive 3D modeling method and a corresponding system based on single or multiple uncalibrated images. The main feature of this method is that, following the modeling habits of ordinary people, the 3D model of the target is reconstructed from the images in a coarse-to-fine manner. After determining the approximate shape, the user adds or modifies projection constraints and spatial constraints and applies topology modifications, thereby gradually realizing camera calibration, refining the rough model, and finally completing the reconstruction of objects with arbitrary geometry and topology. During the interactive process, the geometric parameters and camera projection matrix are solved in real time, and the reconstruction results are displayed in a 3D window.
# オープンセットオブジェクト検出のための低密度潜在領域の拡張

Expanding Low-Density Latent Regions for Open-Set Object Detection ( http://arxiv.org/abs/2203.14911v1 )

Jiaming Han, Yuqiang Ren, Jian Ding, Xingjia Pan, Ke Yan, Gui-Song Xia(参考訳) 現代の物体検出器は、クローズセット設定下で素晴らしい進歩を遂げた。 しかし、未知のカテゴリのオブジェクトは、しばしば既存の既知のクラスに誤って分類されるため、オープンセットオブジェクト検出(OSOD)は難しいままである。 本研究では,未知の物体が通常低密度の潜在領域に分布しているという認識に基づいて,未知の物体を潜在空間内の高密度領域と低密度領域を分離して同定することを提案する。 従来のしきい値に基づく手法は、未知のオブジェクトを全てカバーできない限られた低密度領域のみを保持するため、拡張された低密度領域を持つ新しいOpen-set Detector(OpenDet)を提案する。 この目的のために、OpenDetに2人の学習者、Contrastive Feature Learner (CFL) と Unknown Probability Learner (UPL) を設ける。 CFLは、既知のクラスのコンパクトな特徴を促進するために、インスタンスレベルのコントラスト学習を行い、未知のクラスに対してより低密度な領域を残し、UPLは予測の不確実性に基づいて未知の確率を最適化する。 したがって、低密度領域における未知の物体は、学習された未知の確率と容易に識別できる。 例えば、OpenDetは6つのOSODベンチマークでAbsolute Open-Set Errorsを25%-35%削減する。 コードは https://github.com/csuhan/opendet2 で入手できる。

Modern object detectors have achieved impressive progress under the closed-set setup. However, open-set object detection (OSOD) remains challenging since objects of unknown categories are often misclassified to existing known classes. In this work, we propose to identify unknown objects by separating high/low-density regions in the latent space, based on the consensus that unknown objects are usually distributed in low-density latent regions. As traditional threshold-based methods only maintain limited low-density regions, which cannot cover all unknown objects, we present a novel Open-set Detector (OpenDet) with expanded low-density regions. To this aim, we equip OpenDet with two learners, Contrastive Feature Learner (CFL) and Unknown Probability Learner (UPL). CFL performs instance-level contrastive learning to encourage compact features of known classes, leaving more low-density regions for unknown classes; UPL optimizes unknown probability based on the uncertainty of predictions, which further divides more low-density regions around the cluster of known classes. Thus, unknown objects in low-density regions can be easily identified with the learned unknown probability. Extensive experiments demonstrate that our method can significantly improve the OSOD performance, e.g., OpenDet reduces the Absolute Open-Set Errors by 25%-35% on six OSOD benchmarks. Code is available at: https://github.com/csuhan/opendet2.
# Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model

Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model ( http://arxiv.org/abs/2203.14940v1 )

Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, Guoqi Li

Recently, vision-language pre-training has shown great potential in open-vocabulary object detection, where detectors trained on base classes are devised for detecting new classes. The class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model. It is then used as the region classifier to supervise the training of a detector. The key element that leads to the success of this model is the proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, some prompt representation learning methods have been proposed for the image classification task; however, they can only be sub-optimal solutions when applied to the detection task. In this paper, we introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection based on the pre-trained vision-language model. Unlike previous classification-oriented methods, DetPro has two highlights: 1) a background interpretation scheme to include the proposals in image background in the prompt training; 2) a context grading scheme to separate proposals in image foreground for tailored prompt training. We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector, and conduct experiments on LVIS as well as transfer learning on the Pascal VOC, COCO, and Objects365 datasets. Experimental results show that DetPro outperforms the baseline ViLD in all settings, e.g., +3.4 APbox and +3.0 APmask improvements on the novel classes of LVIS. Code and models are available at https://github.com/dyabel/detpro.
# Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning ( http://arxiv.org/abs/2203.14957v1 )

Minghao Chen, Fangyun Wei, Chong Li, Deng Cai

Prior works on action representation learning mainly focus on designing various architectures to extract global representations for short video clips. In contrast, many practical applications such as video alignment have a strong demand for learning dense representations for long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner. Concretely, we introduce a simple yet efficient video encoder that considers spatio-temporal context to extract frame-wise representations. Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views obtained through a series of spatio-temporal data augmentations. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of two augmented views and a prior Gaussian distribution of timestamp distance. Experiments on the FineGym, PennAction and Pouring datasets show that our method outperforms the previous state of the art by a large margin for downstream fine-grained action classification. Surprisingly, even without training on paired videos, our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks. Code and models are available at https://github.com/minghchen/CARL_code.
# Gradient-Matching Coresets for Rehearsal-Based Continual Learning

Gradient-Matching Coresets for Rehearsal-Based Continual Learning ( http://arxiv.org/abs/2203.14544v1 )

Lukas Balles, Giovanni Zappella, Cédric Archambeau

The goal of continual learning (CL) is to efficiently update a machine learning model with new data without forgetting previously-learned knowledge. Most widely-used CL methods rely on a rehearsal memory of data points to be reused while training on new data. Curating such a rehearsal memory to maintain a small, informative subset of all the data seen so far is crucial to the success of these methods. We devise a coreset selection method for rehearsal-based continual learning. Our method is based on the idea of gradient matching: The gradients induced by the coreset should match, as closely as possible, those induced by the original training dataset. Inspired by the neural tangent kernel theory, we perform this gradient matching across the model's initialization distribution, allowing us to extract a coreset without having to train the model first. We evaluate the method on a wide range of continual learning scenarios and demonstrate that it improves the performance of rehearsal-based CL methods compared to competing memory management strategies such as reservoir sampling.
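
As a rough illustration of the gradient-matching idea, the selection problem can be written as follows. The notation is assumed here and is not taken from the abstract: $\ell(x_i;\theta)$ is the per-example loss, $w$ is a vector of nonnegative coreset weights with at most $m$ nonzero entries, and $p_{\mathrm{init}}$ is the model's initialization distribution.

$$
\min_{w \ge 0,\ \|w\|_0 \le m}\ \mathbb{E}_{\theta \sim p_{\mathrm{init}}}\left\| \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \ell(x_i;\theta) - \sum_{i=1}^{n} w_i \nabla_\theta \ell(x_i;\theta) \right\|_2^2
$$

The expectation over initializations reflects the statement that the coreset is extracted without training the model first; the exact objective and solver used by the authors may differ.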
# On the Handwriting Tasks' Analysis to Detect Fatigue

On the Handwriting Tasks' Analysis to Detect Fatigue ( http://arxiv.org/abs/2203.14782v1 )

Manuel-Vicente Garnacho-Castaño, Marcos Faundez-Zanuy, Josep Lopez-Xarbau

Practical determination of physical recovery after intense exercise is a challenging topic that must include mechanical aspects as well as cognitive ones, because most physical sport activities, as well as professional activities (including brain-computer interface-operated systems), require good shape in both. This paper presents a new online handwriting database of 20 healthy subjects. The main goal was to study the influence of several physical exercise stimuli on different handwriting tasks and to evaluate the recovery after strenuous exercise. To this end, the subjects performed different handwriting tasks before and after physical exercise, and other measurements such as metabolic and mechanical fatigue assessments were taken. Experimental results showed that although mechanical recovery happens quickly and can be measured by lactate concentrations and mechanical fatigue, this is not the case when cognitive effort is required. Handwriting analysis revealed that statistical differences in handwriting performance exist even after lactate concentration and mechanical assessment have recovered. Conclusions: This points to the need for more recovery time in sport and professional activities than that measured in classic ways.
# Differentiable, learnable, regionalized process-based models with physical outputs can approach state-of-the-art hydrologic prediction accuracy

Differentiable, learnable, regionalized process-based models with physical outputs can approach state-of-the-art hydrologic prediction accuracy ( http://arxiv.org/abs/2203.14827v1 )

Dapeng Feng, Jiangtao Liu, Kathryn Lawson, and Chaopeng Shen

Predictions of hydrologic variables across the entire water cycle have significant value for water resource management as well as downstream applications such as ecosystem and water quality modeling. Recently, purely data-driven deep learning models like long short-term memory (LSTM) showed seemingly insurmountable performance in modeling rainfall-runoff and other geoscientific variables, yet they cannot predict unobserved physical variables and remain challenging to interpret. Here we show that differentiable, learnable, process-based models (called {\delta} models here) can approach the performance level of LSTM for the intensively observed variable (streamflow) with regionalized parameterization. We use a simple hydrologic model, HBV, as the backbone and use embedded neural networks, which can only be trained in a differentiable programming framework, to parameterize, replace, or enhance the process-based model modules. Without using an ensemble or post-processor, {\delta} models can obtain a median Nash-Sutcliffe efficiency of 0.715 for 671 basins across the USA for a particular forcing dataset, compared to 0.72 from a state-of-the-art LSTM model with the same setup. Meanwhile, the resulting learnable process-based models can be evaluated (and, later, trained) by multiple sources of observations, e.g., groundwater storage, evapotranspiration, surface runoff, and baseflow. Both simulated evapotranspiration and fraction of discharge from baseflow agreed decently with alternative estimates. The general framework can work with models of various process complexity and opens up the path for learning physics from big data.
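
The Nash-Sutcliffe efficiency (NSE) quoted above is a standard skill score: 1 minus the ratio of the squared simulation error to the variance of the observations. The following is a minimal sketch of that formula; the function name and the sample streamflow values are made up for illustration and are not from the paper.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Squared error of the simulation against the observations.
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Spread of the observations around their own mean.
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series has zero variance")
    return 1.0 - residual / variance


# Hypothetical streamflow values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.0, 2.0, 3.0, 5.0]
print(nash_sutcliffe_efficiency(observed, simulated))
```

An NSE of 1 means a perfect match, and 0 means the simulation is no better than predicting the observed mean, which is why medians of 0.715 and 0.72 are close.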
# TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs

License: see the linked page
Many real-world graphs contain time domain information. Temporal Graph Neural Networks capture temporal information as well as structural and contextual information in the generated dynamic node embeddings. Researchers have shown that these embeddings achieve state-of-the-art performance in many different tasks. In this work, we propose TGL, a unified framework for large-scale offline Temporal Graph Neural Network training where users can compose various Temporal Graph Neural Networks with simple configuration files. TGL comprises five main components: a temporal sampler, a mailbox, a node memory module, a memory updater, and a message passing engine. We design a Temporal-CSR data structure and a parallel sampler to efficiently sample temporal neighbors to form training mini-batches. We propose a novel random chunk scheduling technique that mitigates the problem of obsolete node memory when training with a large batch size. To address the limitation that current TGNNs are only evaluated on small-scale datasets, we introduce two large-scale real-world datasets with 0.2 and 1.3 billion temporal edges. We evaluate the performance of TGL on four small-scale datasets with a single GPU and the two large datasets with multiple GPUs for both link prediction and node classification tasks. We compare TGL with the open-sourced code of five methods and show that TGL achieves similar or better accuracy with an average of 13x speedup. Our temporal parallel sampler achieves an average of 173x speedup on a multi-core CPU compared with the baselines. On a 4-GPU machine, TGL can train one epoch of more than one billion temporal edges within 1-10 hours. To the best of our knowledge, this is the first work that proposes a general framework for large-scale Temporal Graph Neural Network training on multiple GPUs.
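
The abstract does not spell out the Temporal-CSR layout, so the following is only a sketch under an assumption: each node's outgoing edges are stored contiguously (as in CSR) and sorted by timestamp, so the neighbors that precede a query time can be located by binary search. The function names `build_temporal_csr` and `sample_temporal_neighbors` are hypothetical and are not TGL's API.

```python
import bisect
import random


def build_temporal_csr(num_nodes, edges):
    """Build (indptr, indices, times) from (src, dst, timestamp) edges."""
    buckets = [[] for _ in range(num_nodes)]
    for src, dst, ts in edges:
        buckets[src].append((ts, dst))
    indptr, indices, times = [0], [], []
    for bucket in buckets:
        bucket.sort()  # order each node's edges by timestamp
        for ts, dst in bucket:
            times.append(ts)
            indices.append(dst)
        indptr.append(len(indices))
    return indptr, indices, times


def sample_temporal_neighbors(csr, node, t, k, rng=random):
    """Uniformly sample up to k neighbors of `node` whose edge time is < t."""
    indptr, indices, times = csr
    lo, hi = indptr[node], indptr[node + 1]
    # First position in this node's slice whose timestamp is >= t.
    end = bisect.bisect_left(times, t, lo, hi)
    candidates = range(lo, end)
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return [(indices[i], times[i]) for i in chosen]


# Hypothetical toy graph, for illustration only.
edges = [(0, 1, 1.0), (0, 2, 3.0), (0, 3, 2.0), (1, 0, 5.0)]
csr = build_temporal_csr(4, edges)
print(sorted(sample_temporal_neighbors(csr, 0, 2.5, 10)))
```

Sorting once at build time keeps each query at logarithmic cost in the node's degree, which is the kind of property a sampler for billion-edge graphs would need; a parallel version would shard queries across threads.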
# Sketch3T: Test-Time Training for Zero-Shot SBIR

License: CC BY 4.0
Zero-shot sketch-based image retrieval (ZS-SBIR) typically asks for a trained model to be applied as is to unseen categories. In this paper, we question this setup, arguing that by definition it is not compatible with the inherently abstract and subjective nature of sketches, i.e., the model might transfer well to new categories, but as a result will not understand sketches drawn from a different test-time distribution. We thus extend ZS-SBIR, asking it to transfer to both categories and sketch distributions. Our key contribution is a test-time training paradigm that can adapt using just one sketch. Since there is no paired photo, we make use of a sketch raster-vector reconstruction module as a self-supervised auxiliary task. To maintain the fidelity of the trained cross-modal joint embedding during test-time update, we design a novel meta-learning based training paradigm to learn a separation between model updates incurred by this auxiliary task and those of the primary objective of discriminative learning. Extensive experiments show that our model outperforms the state of the art, thanks to the proposed test-time adaptation that not only transfers to new categories but also accommodates new sketching styles.
# Optimal Correction Cost for Object Detection Evaluation

License: see the linked page
Mean Average Precision (mAP) is the primary evaluation measure for object detection. Although object detection has a broad range of applications, mAP evaluates detectors in terms of the performance of ranked instance retrieval. Such an assumption about the evaluation task does not suit some downstream tasks. To alleviate the gap between downstream tasks and the evaluation scenario, we propose Optimal Correction Cost (OC-cost), which assesses detection accuracy at the image level. OC-cost computes the cost of correcting detections to ground truths as a measure of accuracy. The cost is obtained by solving an optimal transportation problem between the detections and the ground truths. Unlike mAP, OC-cost is designed to penalize false positive and false negative detections properly, and every image in a dataset is treated equally. Our experimental results validate that OC-cost has better agreement with human preference than a ranking-based measure, i.e., mAP for a single image. We also show that detectors' rankings by OC-cost are more consistent on different data splits than mAP. Our goal is not to replace mAP with OC-cost but to provide an additional tool to evaluate detectors from another aspect. To help future researchers and developers choose a target measure, we provide a series of experiments to clarify how mAP and OC-cost differ.
# Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing

License: see the linked page
This paper probes intrinsic factors behind typical failure cases (e.g. spatial inconsistency and boundary confusion) produced by the existing state-of-the-art method in face parsing. To tackle these problems, we propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR) for face parsing. Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection. These tasks only share low-level encoder weights without high-level interactions with each other, which enables the auxiliary modules to be decoupled from the whole network at the inference stage. To address spatial inconsistency, we develop a dynamic dual graph convolutional network to capture global contextual information without using any extra pooling operation. To handle boundary confusion in both single and multiple face scenarios, we exploit binary and category edge detection to jointly obtain generic geometric structure and fine-grained semantic clues of human faces. In addition, to prevent noisy labels from degrading model generalization during training, cyclical self-regulation is proposed to self-ensemble several model instances into a new model, which is then used to self-distill subsequent models through alternating iterations. Experiments show that our method achieves new state-of-the-art performance on the Helen, CelebAMask-HQ, and LaPa datasets. The source code is available at https://github.com/deepinsight/insightface/tree/master/parsing/dml_csr.
# SC^2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration

License: see the linked page
In this paper, we present a second order spatial compatibility (SC^2) measure based method for efficient and robust point cloud registration (PCR), called SC^2-PCR. First, we propose a second order spatial compatibility (SC^2) measure to compute the similarity between correspondences. It considers global compatibility instead of local consistency, allowing for more distinctive clustering between inliers and outliers at an early stage. Based on this measure, our registration pipeline employs a global spectral technique to find some reliable seeds from the initial correspondences. Then we design a two-stage strategy to expand each seed to a consensus set based on the SC^2 measure matrix. Finally, we feed each consensus set to a weighted SVD algorithm to generate a candidate rigid transformation and select the best model as the final result. Our method can guarantee finding a certain number of outlier-free consensus sets using fewer samplings, making the model estimation more efficient and robust. In addition, the proposed SC^2 measure is general and can be easily plugged into deep learning based frameworks. Extensive experiments are carried out to investigate the performance of our method. Code will be available at https://github.com/ZhiChen902/SC2-PCR.
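
The abstract does not define the SC^2 measure, so the sketch below shows one plausible formalization and should be read as an assumption rather than the authors' exact definition. Two correspondences are treated as first-order compatible when the distance between their source points matches the distance between their target points (rigid motions preserve lengths); the second-order score of a compatible pair then counts how many other correspondences are compatible with both. Function names and the threshold are hypothetical.

```python
import math


def first_order_compatibility(src, dst, threshold):
    """C[i][j] = 1 if correspondences i and j preserve pairwise length."""
    n = len(src)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gap = abs(math.dist(src[i], src[j]) - math.dist(dst[i], dst[j]))
            C[i][j] = 1 if gap <= threshold else 0
    return C


def second_order_compatibility(C):
    """SC2[i][j] = C[i][j] * (number of k compatible with both i and j)."""
    n = len(C)
    return [
        [C[i][j] * sum(C[i][k] * C[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: three inliers related by a pure translation, one outlier.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dst = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0), (9.0, 9.0, 9.0)]
C = first_order_compatibility(src, dst, threshold=0.1)
sc2 = second_order_compatibility(C)
print(sc2)
```

Under this reading, an outlier that happens to be length-consistent with a single inlier still scores zero unless a third correspondence supports the pair, which is one way to obtain the sharper inlier/outlier separation the abstract describes.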
# OTFace: Hard Samples Guided Optimal Transport Loss for Deep Face Representation

License: see the linked page
Face representation in the wild is extremely hard due to large-scale face variations. To this end, some deep convolutional neural networks (CNNs) have been developed to learn discriminative features by designing proper margin-based losses, which perform well on easy samples but fail on hard samples. Based on this, some methods mainly adjust the weights of hard samples in the training stage to improve feature discrimination. However, these methods overlook the feature distribution property, which may lead to better results since the misclassified hard samples may be corrected by using the distribution metric. This paper proposes the hard samples guided optimal transport (OT) loss for deep face representation, OTFace for short. OTFace aims to enhance the performance on hard samples by introducing the feature distribution discrepancy while maintaining the performance on easy samples. Specifically, we adopt a triplet scheme to indicate hard sample groups in one mini-batch during training. OT is then used to characterize the distribution differences of features from the high-level convolutional layer. Finally, we integrate the margin-based softmax (e.g. ArcFace or AM-Softmax) and OT to guide deep CNN learning. Extensive experiments are conducted on several benchmark databases. The quantitative results demonstrate the advantages of the proposed OTFace over state-of-the-art methods.
# Structured Local Radiance Fields for Human Avatar Modeling

License: see the linked page
It is extremely challenging to create an animatable clothed human avatar from RGB videos, especially for loose clothes, due to the difficulties in motion modeling. To address this problem, we introduce a novel representation on the basis of recent neural scene rendering techniques. The core of our representation is a set of structured local radiance fields, which are anchored to the pre-defined nodes sampled on a statistical human body template. These local radiance fields not only leverage the flexibility of implicit representation in shape and appearance modeling, but also factorize cloth deformations into skeleton motions, node residual translations and the dynamic detail variations inside each individual radiance field. To learn our representation from RGB data and facilitate pose generalization, we propose to learn the node translations and the detail variations in a conditional generative latent space. Overall, our method enables automatic construction of animatable human avatars for various types of clothes without the need for scanning subject-specific templates, and can generate realistic images with dynamic details for novel poses. Experiments show that our method outperforms state-of-the-art methods both qualitatively and quantitatively.
# Equivariant Point Cloud Analysis via Learning Orientations for Message Passing

License: see the linked page
Equivariance has been a long-standing concern in various fields ranging from computer vision to physical modeling. Most previous methods struggle with generality, simplicity, and expressiveness: some are designed ad hoc for specific data types, some are too complex to be accessible, and some sacrifice flexible transformations. In this work, we propose a novel and simple framework to achieve equivariance for point cloud analysis based on the message passing (graph neural network) scheme. We find that the equivariant property can be obtained by introducing an orientation for each point to decouple the relative position of each point from the global pose of the entire point cloud. Therefore, we extend current message passing networks with a module that learns orientations for each point. Before aggregating information from the neighbors of a point, the network transforms the neighbors' coordinates based on the point's learned orientation. We provide formal proofs to show the equivariance of the proposed framework. Empirically, we demonstrate that our proposed method is competitive on both point cloud analysis and physical modeling tasks. Code is available at https://github.com/luost26/Equivariant-OrientedMP .
# NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge

License: see the linked page
Novel object captioning aims at describing objects absent from training data, with the key ingredient being the provision of object vocabulary to the model. Although existing methods heavily rely on an object detection model, we view the detection step as vocabulary retrieval from external knowledge in the form of embeddings of object definitions from Wiktionary, where retrieval uses image region features learned from a transformer model. We propose an end-to-end Novel Object Captioning with Retrieved vocabulary from External Knowledge method (NOC-REK), which simultaneously learns vocabulary retrieval and caption generation, successfully describing novel objects outside of the training dataset. Furthermore, our model eliminates the requirement for model retraining by simply updating the external knowledge whenever a novel object appears. Our comprehensive experiments on the held-out COCO and Nocaps datasets show that NOC-REK is considerably effective compared with state-of-the-art methods.
# Catching Both Gray and Black Swans: Open-set Supervised Anomaly Detection

License: see the linked page
Although most existing anomaly detection studies assume the availability of normal training samples only, a few labeled anomaly examples are often available in many real-world applications, such as defect samples identified during random quality inspection, lesion images confirmed by radiologists in daily medical screening, etc. These anomaly examples provide valuable knowledge about the application-specific abnormality, enabling significantly improved detection of similar anomalies in some recent models. However, the anomalies seen during training often do not illustrate every possible class of anomaly, rendering these models ineffective in generalizing to unseen anomaly classes. This paper tackles open-set supervised anomaly detection, in which we learn detection models using the anomaly examples with the objective of detecting both seen anomalies (`gray swans') and unseen anomalies (`black swans'). We propose a novel approach that learns disentangled representations of abnormalities illustrated by seen anomalies, pseudo anomalies, and latent residual anomalies (i.e., samples that have unusual residuals compared to the normal data in a latent space), with the last two abnormalities designed to detect unseen anomalies. Extensive experiments on nine real-world anomaly detection datasets show superior performance of our model in detecting seen and unseen anomalies under diverse settings. Code and data are available at: https://github.com/choubo/DRA.
# ImFace: A Nonlinear 3D Morphable Face Model with Implicit Neural Representations

License: see the linked page
Precise representations of 3D faces are beneficial to various computer vision and graphics applications. Due to the data discretization and model linearity, however, it remains challenging to capture accurate identity and expression clues in current studies. This paper presents a novel 3D morphable face model, namely ImFace, to learn a nonlinear and continuous space with implicit neural representations. It builds two explicitly disentangled deformation fields to model complex shapes associated with identities and expressions, respectively, and designs an improved learning strategy to extend embeddings of expressions to allow more diverse changes. We further introduce a Neural Blend-Field to learn sophisticated details by adaptively blending a series of local fields. In addition to ImFace, an effective preprocessing pipeline is proposed to address the issue of watertight input requirement in implicit representations, enabling them to work with common facial surfaces for the first time. Extensive experiments are performed to demonstrate the superiority of ImFace.
# REGTR: End-to-end Point Cloud Correspondences with Transformers

License: see the linked page
Despite recent success in incorporating learning into point cloud registration, many works focus on learning feature descriptors and continue to rely on nearest-neighbor feature matching and outlier filtering through RANSAC to obtain the final set of correspondences for pose estimation. In this work, we conjecture that attention mechanisms can replace the role of explicit feature matching and RANSAC, and thus propose an end-to-end framework to directly predict the final set of correspondences. We use a network architecture consisting primarily of transformer layers containing self and cross attentions, and train it to predict the probability each point lies in the overlapping region and its corresponding position in the other point cloud. The required rigid transformation can then be estimated directly from the predicted correspondences without further post-processing. Despite its simplicity, our approach achieves state-of-the-art performance on 3DMatch and ModelNet benchmarks. Our source code can be found at https://github.com/yewzijian/RegTR .
# Uni6D: A Unified CNN Framework without Projection Breakdown for 6D Pose Estimation

License: see the linked page
As RGB-D sensors become more affordable, using RGB-D images to obtain high-accuracy 6D pose estimation results becomes a better option. State-of-the-art approaches typically use different backbones to extract features for RGB and depth images. They use a 2D CNN for RGB images and a per-pixel point cloud network for depth data, as well as a fusion network for feature fusion. We find that the essential reason for using two independent backbones is the "projection breakdown" problem. In the depth image plane, the projected 3D structure of the physical world is preserved by the 1D depth value and its built-in 2D pixel coordinate (UV). Any spatial transformation that modifies UV, such as resize, flip, crop, or pooling operations in the CNN pipeline, breaks the binding between the pixel value and UV coordinate. As a consequence, the 3D structure is no longer preserved by a modified depth image or feature. To address this issue, we propose a simple yet effective method denoted as Uni6D that explicitly takes the extra UV data along with RGB-D images as input. Our method has a Unified CNN framework for 6D pose estimation with a single CNN backbone. In particular, the architecture of our method is based on Mask R-CNN with two extra heads, one named RT head for directly predicting 6D pose and the other named abc head for guiding the network to map the visible points to their coordinates in the 3D model as an auxiliary module. This end-to-end approach balances simplicity and accuracy, achieving accuracy comparable with the state of the art and 7.2x faster inference speed on the YCB-Video dataset.
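
The "projection breakdown" described above can be illustrated with the standard pinhole back-projection, where a pixel (u, v) with depth z maps to a 3D point via the camera intrinsics. This is a minimal sketch with made-up intrinsics and pixel values, not the Uni6D implementation: it shows that a crop changes the pixel index and therefore the recovered point, while carrying the original UV alongside the depth keeps the 3D structure recoverable.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z to a 3D point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


# Hypothetical intrinsics and a single depth pixel, for illustration only.
fx = fy = 500.0
cx, cy = 320.0, 240.0
u, v, z = 420.0, 340.0, 2.0
true_point = back_project(u, v, z, fx, fy, cx, cy)

# Crop with top-left corner (100, 100): the pixel now sits at (320, 240).
crop_x, crop_y = 100.0, 100.0
u_crop, v_crop = u - crop_x, v - crop_y

# Using the post-crop pixel index silently changes the recovered 3D point.
broken_point = back_project(u_crop, v_crop, z, fx, fy, cx, cy)

# Carrying the original UV as extra data alongside the depth preserves it.
uv_channel = (u, v)
restored_point = back_project(uv_channel[0], uv_channel[1], z, fx, fy, cx, cy)

print(true_point, broken_point, restored_point)
```

The same argument applies to resize, flip, and pooling: any operation that changes where a depth value sits in the grid invalidates the implicit coordinates unless they are stored explicitly.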
# Reference-based Video Super-Resolution Using Multi-Camera Video Triplets

License: see the linked page
We propose the first reference-based video super-resolution (RefVSR) approach that utilizes reference videos for high-fidelity results. We focus on RefVSR in a triple-camera setting, where we aim at super-resolving a low-resolution ultra-wide video utilizing wide-angle and telephoto videos. We introduce the first RefVSR network that recurrently aligns and propagates temporal reference features fused with features extracted from low-resolution frames. To facilitate the fusion and propagation of temporal reference features, we propose a propagative temporal fusion module. For learning and evaluation of our network, we present the first RefVSR dataset consisting of triplets of ultra-wide, wide-angle, and telephoto videos concurrently taken from the triple cameras of a smartphone. We also propose a two-stage training strategy that fully utilizes video triplets in the proposed dataset for real-world 4x video super-resolution. We extensively evaluate our method, and the results show state-of-the-art performance in 4x super-resolution.
# Pyramid Feature Alignment Network for Video Deblurring

License: see the linked page
Video deblurring remains a challenging task due to various causes of blurring. Traditional methods have considered how to utilize neighboring frames through single-scale alignment for restoration. However, they typically suffer from misalignment caused by severe blur. In this work, we aim to better utilize neighboring frames with highly efficient feature alignment. We propose a Pyramid Feature Alignment Network (PFAN) for video deblurring. First, the multi-scale feature of blurry frames is extracted with the strategy of Structure-to-Detail Downsampling (SDD) before alignment. This downsampling strategy makes the edges sharper, which is helpful for alignment. Then we align the feature at each scale and reconstruct the image at the corresponding scale. This strategy effectively supervises the alignment at each scale, overcoming the problem of errors propagated from the scales above at the alignment stage. To better handle the challenges of complex and large motions, instead of aligning features at each scale separately, lower-scale motion information is used to guide the higher-scale motion estimation. Accordingly, a Cascade Guided Deformable Alignment (CGDA) is proposed to integrate coarse motion into deformable convolution for finer and more accurate alignment. As demonstrated in extensive experiments, our proposed PFAN achieves superior performance with competitive speed compared to the state-of-the-art methods.
# Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment

License: see the linked page
Visual (image, video) quality assessments can be modelled by visual features in different domains, e.g., spatial, frequency, and temporal domains. Perceptual mechanisms in the human visual system (HVS) play a crucial role in the generation of quality perception. This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures. A lightweight module for multi-stage channel attention is integrated into the Swin (shifted window) Transformer. Such a module can represent the appropriate perceptual mechanisms in image quality assessment (IQA) to build an accurate IQA model. Meanwhile, representative features for image quality perception in the spatial and frequency domains can also be derived from the IQA model, which are then fed into another windowed transformer architecture for video quality assessment (VQA). The VQA model efficiently reuses attention information across local windows to tackle the expensive time and memory complexities of the original transformer. Experimental results on both large-scale IQA and VQA databases demonstrate that the proposed quality assessment models outperform other state-of-the-art models by large margins. The complete source code will be published on GitHub.
# HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network

License: see the linked page
Hands are often severely occluded by objects, which makes 3D hand mesh estimation challenging. Previous works have often disregarded information at occluded regions. However, we argue that occluded regions have strong correlations with hands so that they can provide highly beneficial information for complete 3D hand mesh estimation. Thus, in this work, we propose a novel 3D hand mesh estimation network, HandOccNet, that can fully exploit the information at occluded regions as a secondary means to enhance image features and make them much richer. To this end, we design two successive Transformer-based modules, called feature injecting transformer (FIT) and self-enhancing transformer (SET). FIT injects hand information into occluded regions by considering their correlation. SET refines the output of FIT by using a self-attention mechanism. By injecting the hand information into the occluded regions, our HandOccNet reaches state-of-the-art performance on 3D hand mesh benchmarks that contain challenging hand-object occlusions. The code is available at https://github.com/namepllet/HandOccNet.
# Style-Guided Domain Adaptation for Face Presentation Attack Detection

License: see the linked page
Domain adaptation (DA) or domain generalization (DG) for face presentation attack detection (PAD) has attracted attention recently for its robustness against unseen attack scenarios. Existing DA/DG-based PAD methods, however, have not yet fully explored the domain-specific style information that can provide knowledge regarding attack styles (e.g., materials, background, illumination and resolution). In this paper, we introduce a novel Style-Guided Domain Adaptation (SGDA) framework for inference-time adaptive PAD. Specifically, Style-Selective Normalization (SSN) is proposed to explore the domain-specific style information within the high-order feature statistics. The proposed SSN enables the adaptation of the model to the target domain by reducing the style difference between the target and the source domains. Moreover, we carefully design Style-Aware Meta-Learning (SAML) to boost the adaptation ability, which simulates the inference-time adaptation with a style selection process on a virtual test domain. In contrast to previous domain adaptation approaches, our method requires neither additional auxiliary models (e.g., domain adaptors) nor the unlabeled target domain during training, which makes our method more practical for the PAD task. For our experiments, we use the public datasets MSU-MFSD, CASIA-FASD, OULU-NPU and Idiap REPLAYATTACK. In most assessments, the results demonstrate a notable performance gap compared to the conventional DA/DG-based PAD methods.
# S2-Net: Self-supervision Guided Feature Representation Learning for Cross-Modality Images

License: see the linked page
Combining the respective advantages of cross-modality images can compensate for the lack of information in a single modality, which has drawn increasing attention from researchers to multi-modal image matching tasks. Meanwhile, due to the great appearance differences between cross-modality image pairs, it is often difficult to make the feature representations of correspondences as close as possible. In this letter, we design a cross-modality feature representation learning network, S2-Net, which is based on the recently successful detect-and-describe pipeline, originally proposed for visible images but adapted to work with cross-modality image pairs. To solve the consequent problem of optimization difficulties, we introduce self-supervised learning with a well-designed loss function to guide the training without discarding the original advantages. This novel strategy simulates image pairs in the same modality, which is also a useful guide for the training of cross-modality images. Notably, it does not require additional data but significantly improves the performance and is even workable for all methods of the detect-and-describe pipeline. Extensive experiments are conducted to evaluate the performance of the proposed strategy, compared to both handcrafted and deep learning-based methods. Results show that our elegant formulation of combined optimization of supervised and self-supervised learning outperforms state-of-the-art methods on the RoadScene and RGB-NIR datasets.
# Towards Implicit Text-Guided 3D Shape Generation

License: see the linked page
In this work, we explore the challenging task of generating 3D shapes from text. Beyond the existing works, we propose a new approach for text-guided 3D shape generation, capable of producing high-fidelity shapes with colors that match the given text description. This work has several technical contributions. First, we decouple the shape and color predictions for learning features in both texts and shapes, and propose the word-level spatial transformer to correlate word features from text with spatial features from shape. Also, we design a cyclic loss to encourage consistency between text and shape, and introduce the shape IMLE to diversify the generated shapes. Further, we extend the framework to enable text-guided shape manipulation. Extensive experiments on the largest existing text-shape benchmark demonstrate the superiority of this work. The code and the models are available at https://github.com/liuzhengzhe/Towards-Implicit-Text-Guided-Shape-Generation.
# SPIQ: Data-Free Per-Channel Static Input Quantization

License: see the linked page
Computationally expensive neural networks are ubiquitous in computer vision, and solutions for efficient inference have drawn growing attention in the machine learning community. Examples of such solutions include quantization, i.e. converting the processing values (weights and inputs) from floating point into integers, e.g. int8 or int4. Concurrently, the rise of privacy concerns has motivated the study of less invasive acceleration methods, such as data-free quantization of pre-trained models' weights and activations. Previous approaches either exploit statistical information to deduce scalar ranges and scaling factors for the activations in a static manner, or dynamically adapt this range on-the-fly for each input of each layer (also referred to as activations), the latter generally being more accurate at the expense of significantly slower inference. In this work, we argue that static input quantization can reach the accuracy levels of dynamic methods by means of a per-channel input quantization scheme that allows one to more finely preserve cross-channel dynamics. We show through a thorough empirical evaluation on multiple computer vision problems (e.g. ImageNet classification, Pascal VOC object detection as well as CityScapes semantic segmentation) that the proposed method, dubbed SPIQ, achieves accuracies rivalling dynamic approaches with static-level inference speed, significantly outperforming state-of-the-art quantization methods on every benchmark.
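
The contrast between one static range for the whole input and one static range per channel can be illustrated with a small sketch. This is not the SPIQ algorithm: the activation values and the precomputed ranges below are hypothetical, and the sketch only shows why a single scale loses channels with a small dynamic range.

```python
def static_scale(max_abs, bits=8):
    """Scale of a symmetric uniform quantizer for a fixed (static) range."""
    qmax = 2 ** (bits - 1) - 1
    return max_abs / qmax

def quantize(values, scale, bits=8):
    """Map floats to clipped integer codes using a fixed scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(codes, scale):
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def reconstruction_error(channels, maxima, bits=8):
    """Average per-channel error when channel i uses the static range maxima[i]."""
    total = 0.0
    for ch, m in zip(channels, maxima):
        s = static_scale(m, bits)
        total += mean_abs_error(ch, dequantize(quantize(ch, s, bits), s))
    return total / len(channels)

# Hypothetical activations: two channels with very different dynamic ranges.
channels = [
    [0.01, -0.02, 0.015, 0.005],  # small-range channel
    [10.0, -8.0, 6.5, -9.5],      # large-range channel
]
# Hypothetical ranges, fixed before inference (static), not measured per input.
per_channel_max = [0.02, 10.0]
per_tensor_max = max(per_channel_max)

err_tensor = reconstruction_error(channels, [per_tensor_max] * len(channels))
err_channel = reconstruction_error(channels, per_channel_max)
print(err_tensor, err_channel)
```

With the shared range, every value of the small-range channel rounds to zero, whereas the per-channel ranges keep it resolvable at the same bit width.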
# REx: Data-Free Residual Quantization Error Expansion

License: see the linked page
Deep neural networks (DNNs) are nowadays ubiquitous in the computer vision landscape. However, they suffer from high computational costs in inference, particularly when evaluated on edge devices. This problem is generally addressed via post-hoc quantization, i.e. converting the DNN values (weights and inputs) from floating point into e.g. int8, int4 or ternary quantization. In this paper, we propose REx, a data-free quantization algorithm for pre-trained models that is compliant with data protection regulations, convenient and fast to execute. First, we improve upon the naive linear quantization operator by decomposing the weights as an expansion of residual quantization errors. Second, we propose a budgeted group-sparsity formulation to achieve better accuracy vs. number of bit-wise operation trade-offs with sparse, higher expansion orders. Third, we show that this sparse expansion can be approximated by an ensemble of quantized neural networks to dramatically improve the evaluation speed through more efficient parallelization. We provide theoretical guarantees of the efficiency of REx as well as a thorough empirical validation on several popular DNN architectures applied to multiple computer vision problems, e.g. ImageNet classification, object detection as well as semantic segmentation. In particular, we show that REx significantly outperforms existing state-of-the-art data-free quantization techniques.
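
The first idea in the abstract, decomposing the weights as an expansion of residual quantization errors, can be sketched as follows. This is a minimal illustration under assumptions, not the REx implementation: the weights are hypothetical, the quantizer is a plain symmetric linear one, and the group-sparsity and ensemble steps are omitted.

```python
def quantize(values, bits=4):
    """Naive symmetric linear quantization; returns de-quantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0] * len(values)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]

def residual_expansion(weights, order, bits=4):
    """Quantize the weights, then repeatedly quantize what is left over."""
    terms = []
    residual = list(weights)
    for _ in range(order):
        term = quantize(residual, bits)
        terms.append(term)
        residual = [r - t for r, t in zip(residual, term)]
    return terms

def max_error(weights, terms):
    """Largest absolute gap between the weights and the sum of the terms."""
    approx = [sum(col) for col in zip(*terms)]
    return max(abs(w - a) for w, a in zip(weights, approx))

weights = [0.83, -0.41, 0.07, -0.96, 0.52]  # hypothetical weights
errors = [max_error(weights, residual_expansion(weights, k)) for k in (1, 2, 3)]
print(errors)
```

Each additional term is itself low-bit, and in this toy example the approximation error shrinks with every expansion order.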
# MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation

License: see the linked page
This paper studies the 3D instance segmentation problem, which has a variety of real-world applications such as robotics and augmented reality. Since the surroundings of 3D objects are of high complexity, separating different objects is very difficult. To address this challenging problem, we propose a novel framework to group and refine the 3D instances. In practice, we first learn an offset vector for each point and shift it to its predicted instance center. To better group these points, we propose a Hierarchical Point Grouping algorithm to merge the centrally aggregated points progressively. All points are grouped into small clusters, which further gradually undergo another clustering procedure to merge into larger groups. These multi-scale groups are exploited for instance prediction, which is beneficial for predicting instances with different scales. In addition, a novel MaskScoreNet is developed to produce binary point masks of these groups for further refining the segmentation results. Extensive experiments conducted on the ScanNetV2 and S3DIS benchmarks demonstrate the effectiveness of the proposed method. For instance, our approach achieves a 66.4% mAP with the 0.5 IoU threshold on the ScanNetV2 test set, which is 1.9% higher than the state-of-the-art method.
# Diverse 360-Degree Image Outpainting for Efficient 3DCG Background Creation

License: see the linked page
We address the problem of generating a 360-degree image from a single image with a narrow field of view by estimating its surroundings. Previous methods suffered from overfitting to the training resolution and deterministic generation. This paper proposes a completion method using a transformer for scene modeling and novel methods to improve the properties of a 360-degree image on the output image. Specifically, we use CompletionNets with a transformer to perform diverse completions and AdjustmentNet to match color, stitching, and resolution with an input image, enabling inference at any resolution. To improve the properties of a 360-degree image on an output image, we also propose WS-perceptual loss and circular inference. Thorough experiments show that our method outperforms state-of-the-art (SOTA) methods both qualitatively and quantitatively. For example, compared to SOTA methods, our method completes images 16 times larger in resolution and achieves 1.7 times lower Frechet inception distance (FID). Furthermore, we propose a pipeline that uses the completion results for lighting and background of 3DCG scenes. Our plausible background completion enables perceptually natural results in the application of inserting virtual objects with specular surfaces.
# Are High-Resolution Event Cameras Really Needed?

License: see the linked page
Due to their outstanding properties in challenging conditions, event cameras have become indispensable in a wide range of applications, from automotive to computational photography and SLAM. However, as further improvements are made to the sensor design, modern event cameras are trending toward higher and higher sensor resolutions, which result in higher bandwidth and computational requirements on downstream tasks. Despite this trend, the benefits of using high-resolution event cameras to solve standard computer vision tasks are still not clear. In this work, we report the surprising discovery that, in low-illumination conditions and at high speeds, low-resolution cameras can outperform high-resolution ones, while requiring a significantly lower bandwidth. We provide both empirical and theoretical evidence for this claim, which indicates that high-resolution event cameras exhibit higher per-pixel event rates, leading to higher temporal noise in low-illumination conditions and at high speeds. As a result, in most cases, high-resolution event cameras show lower task performance than lower-resolution sensors in these conditions. We empirically validate our findings across several tasks, namely image reconstruction, optical flow estimation, and camera pose tracking, both on synthetic and real data. We believe that these findings will provide important guidelines for future trends in event camera development.
# Part-based Pseudo Label Refinement for Unsupervised Person Re-identification

License: see the linked page
Unsupervised person re-identification (re-ID) aims at learning discriminative representations for person retrieval from unlabeled data. Recent techniques accomplish this task by using pseudo-labels, but these labels are inherently noisy and deteriorate the accuracy. To overcome this problem, several pseudo-label refinement methods have been proposed, but they neglect the fine-grained local context essential for person re-ID. In this paper, we propose a novel Part-based Pseudo Label Refinement (PPLR) framework that reduces the label noise by employing the complementary relationship between global and part features. Specifically, we design a cross agreement score as the similarity of k-nearest neighbors between feature spaces to exploit the reliable complementary relationship. Based on the cross agreement, we refine pseudo-labels of global features by ensembling the predictions of part features, which collectively alleviate the noise in global feature clustering. We further refine pseudo-labels of part features by applying label smoothing according to the suitability of given labels for each part. Thanks to the reliable complementary information provided by the cross agreement score, our PPLR effectively reduces the influence of noisy labels and learns discriminative representations with rich local contexts. Extensive experimental results on Market-1501 and MSMT17 demonstrate the effectiveness of the proposed method over the state-of-the-art performance. The code is available at https://github.com/yoonkicho/PPLR.
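
A score that compares k-nearest neighbors across two feature spaces can be sketched as below. The abstract only says the cross agreement score is a similarity of k-nearest neighbors between feature spaces; using the Jaccard similarity of the two neighbor sets is an assumption made here for illustration, and the feature vectors are hypothetical.

```python
def knn(features, query_index, k):
    """Indices of the k nearest neighbours of one sample (squared Euclidean)."""
    q = features[query_index]
    dists = []
    for i, f in enumerate(features):
        if i == query_index:
            continue
        dists.append((sum((a - b) ** 2 for a, b in zip(q, f)), i))
    dists.sort()
    return {i for _, i in dists[:k]}

def cross_agreement(global_feats, part_feats, index, k):
    """Jaccard similarity of the neighbour sets in the two feature spaces."""
    g = knn(global_feats, index, k)
    p = knn(part_feats, index, k)
    return len(g & p) / len(g | p)

# Hypothetical 2-D features for five samples.
global_feats = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0]]
part_agree = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0]]
part_disagree = [[0.0, 0.0], [5.0, 5.0], [5.1, 5.0], [0.1, 0.0], [0.2, 0.0]]

score_high = cross_agreement(global_feats, part_agree, index=0, k=2)
score_low = cross_agreement(global_feats, part_disagree, index=0, k=2)
print(score_high, score_low)
```

A part whose neighborhood matches the global one scores high and could be trusted more when its predictions are ensembled; a part with a disjoint neighborhood scores zero.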
# Brain-inspired Multilayer Perceptron with Spiking Neurons

License: see the linked page
Recently, the Multilayer Perceptron (MLP) has become a hotspot in computer vision. Without inductive bias, MLPs perform well on feature extraction and achieve amazing results. However, due to the simplicity of their structures, the performance highly depends on the local feature communication mechanism. To further improve the performance of MLPs, we introduce information communication mechanisms from brain-inspired neural networks. The Spiking Neural Network (SNN) is the most famous brain-inspired neural network and has achieved great success in dealing with sparse data. Leaky Integrate and Fire (LIF) neurons in SNNs are used to communicate between different time steps. In this paper, we incorporate the mechanism of LIF neurons into MLP models to achieve better accuracy without extra FLOPs. We propose a full-precision LIF operation to communicate between patches, including horizontal LIF and vertical LIF in different directions. We also propose to use group LIF to extract better local features. With LIF modules, our SNN-MLP model achieves 81.9%, 83.3% and 83.5% top-1 accuracy on the ImageNet dataset with only 4.4G, 8.5G and 15.2G FLOPs, respectively, which are state-of-the-art results as far as we know.
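
For readers unfamiliar with LIF neurons, the textbook discrete-time recurrence is sketched below. It is a generic spiking LIF model, not the full-precision LIF operation proposed in the paper; the decay factor, threshold, and reset-to-zero rule are assumptions chosen for illustration.

```python
def lif(inputs, decay=0.5, threshold=1.0):
    """Leaky integrate-and-fire over a sequence of input currents.

    The membrane potential leaks by `decay` at each step, integrates the new
    input, and emits a spike (then resets to zero) once it reaches `threshold`.
    """
    membrane = 0.0
    spikes = []
    for x in inputs:
        membrane = decay * membrane + x
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0
        else:
            spikes.append(0)
    return spikes

print(lif([0.6, 0.6, 0.6, 0.0, 1.2]))  # [0, 0, 1, 0, 1]
```

The state carried from one step to the next is what lets earlier inputs influence later outputs; the abstract describes applying this kind of communication between patches, horizontally and vertically, rather than between time steps.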
# ObjectFormer for Image Manipulation Detection and Localization

License: see the linked page
Recent advances in image editing techniques have posed serious challenges to the trustworthiness of multimedia data, which drives the research of image tampering detection. In this paper, we propose ObjectFormer to detect and localize image manipulations. To capture subtle manipulation traces that are no longer visible in the RGB domain, we extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings. Additionally, we use a set of learnable object prototypes as mid-level representations to model the object-level consistencies among different regions, which are further used to refine patch embeddings to capture the patch-level consistencies. We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method, outperforming state-of-the-art tampering detection and localization methods.
# Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

License: see the linked page
Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections. Assembly101 is the first multi-view action dataset, with simultaneous static (8) and egocentric (4) recordings. Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses. We benchmark on three action understanding tasks: recognition, anticipation and temporal segmentation. Additionally, we propose a novel task of detecting mistakes. The unique recording format and rich set of annotations allow us to investigate generalization to new toys, cross-view transfer, long-tailed distributions, and pose vs. appearance. We envision that Assembly101 will serve as a new challenge to investigate various activity understanding problems.
# Primitive-Based Shape Abstraction via Nonparametric Bayesian Inference

License: see the linked page
3D shape abstraction has drawn great interest over the years. Apart from low-level representations such as meshes and voxels, researchers also seek to semantically abstract complex objects with basic geometric primitives. Recent deep learning methods rely heavily on datasets, with limited generality to unseen categories. Furthermore, abstracting an object accurately yet with a small number of primitives still remains a challenge. In this paper, we propose a novel non-parametric Bayesian statistical method to infer an abstraction, consisting of an unknown number of geometric primitives, from a point cloud. We model the generation of points as observations sampled from an infinite mixture of Gaussian Superquadric Taper Models (GSTM). Our approach formulates the abstraction as a clustering problem, in which: 1) each point is assigned to a cluster via the Chinese Restaurant Process (CRP); 2) a primitive representation is optimized for each cluster, and 3) a merging post-process is incorporated to provide a concise representation. We conduct extensive experiments on various datasets. The results indicate that our method outperforms the state-of-the-art in terms of accuracy and is generalizable to various types of objects.
# (Reference translation) Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

License: CC BY 4.0
Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood. In this work, we make a substantial step towards unveiling this underlying prediction process, by reverse-engineering the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models. We view the token representation as a changing distribution over the vocabulary, and the output from each FFN layer as an additive update to that distribution. Then, we analyze the FFN updates in the vocabulary space, showing that each update can be decomposed to sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable. We then leverage these findings for controlling LM predictions, where we reduce the toxicity of GPT2 by almost 50%, and for improving computation efficiency with a simple early exit rule, saving 20% of computation on average.
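
The additive view described in the abstract can be made concrete with a toy calculation: if the FFN output is a weighted sum of parameter (value) vectors, then projecting it onto the vocabulary is the same as summing the projections of the individual sub-updates. The sketch below uses hypothetical two-dimensional vectors and a three-token vocabulary; all numbers and names are illustrative assumptions.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r * v for r, v in zip(row, vec)) for row in matrix]

# Hypothetical FFN parameter (value) vectors and their activation coefficients.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coeffs = [0.5, 0.0, 2.0]
# Hypothetical output embedding matrix: one row per vocabulary token.
embedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Each sub-update is one coefficient times one value vector.
sub_updates = [[m * x for x in v] for m, v in zip(coeffs, values)]
ffn_out = [sum(col) for col in zip(*sub_updates)]

# Projecting the full update equals summing the projected sub-updates.
logit_shift = matvec(embedding, ffn_out)
per_sub = [matvec(embedding, u) for u in sub_updates]
summed = [sum(col) for col in zip(*per_sub)]
print(logit_shift, summed)
```

Because the projection is linear, each sub-update can be inspected on its own to see which tokens it promotes, which is the kind of per-vector analysis the abstract refers to.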
# (Reference translation) Hierarchical Transformer Model for Scientific Named Entity Recognition

License: CC BY 4.0
The task of Named Entity Recognition (NER) is an important component of many natural language processing systems, such as relation extraction and knowledge graph construction. In this work, we present a simple and effective approach for Named Entity Recognition. The main idea of our approach is to encode the input subword sequence with a pre-trained transformer such as BERT, and then, instead of directly classifying the word labels, another layer of transformer is added to the subword representation to better encode the word-level interaction. We evaluate our approach on three benchmark datasets for scientific NER, particularly in the computer science and biomedical domains. Experimental results show that our model outperforms the current state-of-the-art on SciERC and TDM datasets without requiring external resources or specific data augmentation. Code is available at https://github.com/urchade/HNER.
# Leveraging Domain Knowledge for Low-Resource Entity Recognition

License: see the linked page
In recent years, named entity recognition has been a popular research topic in natural language processing. However, traditional deep learning methods require a large amount of labeled data for model training, which makes them unsuitable for areas where labeling resources are scarce. In addition, existing cross-domain knowledge transfer methods need to adjust the entity labels for different fields, which increases the training cost. To solve these problems, inspired by a processing method for Chinese named entity recognition, we propose to use domain knowledge to improve the performance of named entity recognition in low-resource areas. The domain knowledge we mainly apply consists of a domain dictionary and domain labeled data. We use dictionary information for each word to strengthen its word embedding, and domain labeled data to reinforce the recognition effect. The proposed model avoids large-scale data adjustments in different domains while handling named entity recognition with low resources. Experiments demonstrate the effectiveness of our method, which achieves impressive results on a data set in the field of scientific and technological equipment, with a significantly improved F1 score compared with many other baseline methods.
# UTSA NLP at SemEval-2022 Task 4: An Exploration of Simple Ensembles of Transformer, Convolutional, and Recurrent Neural Networks

License: see the linked page
Appearing kind or helpful while harboring a feeling of superiority, through the use of condescending and patronizing language, can have serious mental health implications for those who experience it. Thus, detecting this condescending and patronizing language online can be useful for online moderation systems. In this manuscript, we describe the system developed by Team UTSA for SemEval-2022 Task 4, Detecting Patronizing and Condescending Language. Our approach explores the use of several deep learning architectures including RoBERTa, convolutional neural networks, and Bidirectional Long Short-Term Memory Networks. Furthermore, we explore simple and effective methods to create ensembles of neural network models. Overall, we experimented with several ensemble models and found that a simple combination of five RoBERTa models achieved an F-score of .6441 on the development dataset and .5745 on the final test dataset. Finally, we also performed a comprehensive error analysis to better understand the limitations of the model and provide ideas for further research.
# Learning Sketches for Decomposing Planning Problems into Subproblems of Bounded Width: Extended Version

License: see the linked page
Recently, sketches have been introduced as a general language for representing the subgoal structure of instances drawn from the same domain. Sketches are collections of rules of the form C -> E over a given set of features where C expresses Boolean conditions and E expresses qualitative changes. Each sketch rule defines a subproblem: going from a state that satisfies C to a state that achieves the change expressed by E or a goal state. Sketches can encode simple goal serializations, general policies, or decompositions of bounded width that can be solved greedily, in polynomial time, by the SIW_R variant of the SIW algorithm. Previous work has shown the computational value of sketches over benchmark domains that, while tractable, are challenging for domain-independent planners. In this work, we address the problem of learning sketches automatically given a planning domain, some instances of the target class of problems, and the desired bound on the sketch width. We present a logical formulation of the problem, an implementation using the ASP solver Clingo, and experimental results. The sketch learner and the SIW_R planner yield a domain-independent planner that learns and exploits domain structure in a crisp and explicit form.
# (Reference translation) Limited Parameter Denoising for Low-Dose X-ray Computed Tomography Using Deep Reinforcement Learning

License: CC BY 4.0
The use of deep learning has successfully solved several problems in the field of medical imaging. Deep learning has been applied to the CT denoising problem successfully. However, the use of deep learning requires large amounts of data to train deep convolutional networks (CNNs). Moreover, due to large parameter count, such deep CNNs may cause unexpected results. In this study, we introduce a novel CT denoising framework, which has interpretable behaviour, and provides useful results with limited data. We employ bilateral filtering in both the projection and volume domains to remove noise. To account for non-stationary noise, we tune the $\sigma$ parameters of the volume for every projection view, and for every volume pixel. The tuning is carried out by two deep CNNs. Due to impracticality of labelling, the two deep CNNs are trained via a Deep-Q reinforcement learning task. The reward for the task is generated by using a custom reward function represented by a neural network. Our experiments were carried out on abdominal scans for the Mayo Clinic TCIA dataset, and the AAPM Low Dose CT Grand Challenge. Our denoising framework has excellent denoising performance increasing the PSNR from 28.53 to 28.93, and increasing the SSIM from 0.8952 to 0.9204. We outperform several state-of-the-art deep CNNs, which have several orders of magnitude higher number of parameters (p-value (PSNR) = 0.000, p-value (SSIM) = 0.000). Our method does not introduce any blurring, which is introduced by MSE loss based methods, or any deep learning artifacts, which are introduced by WGAN based models. Our ablation studies show that parameter tuning and using our reward network results in the best possible results.
# Robust Unlearnable Examples: Protecting Data Against Adversarial Learning

License: see the linked page
The tremendous amount of accessible data in cyberspace faces the risk of unauthorized use for training deep learning models. To address this concern, methods have been proposed to make data unlearnable for deep learning models by adding a type of error-minimizing noise. However, such conferred unlearnability is found to be fragile under adversarial training. In this paper, we design new methods to generate robust unlearnable examples that are protected from adversarial training. We first find that the vanilla error-minimizing noise, which suppresses the informative knowledge of data via minimizing the corresponding training loss, could not effectively minimize the adversarial training loss. This explains the vulnerability of error-minimizing noise in adversarial training. Based on the observation, robust error-minimizing noise is then introduced to reduce the adversarial training loss. Experiments show that the unlearnability brought by robust error-minimizing noise can effectively protect data from adversarial training in various scenarios. The code is available at https://github.com/fshp971/robust-unlearnable-examples.
# Boosting Black-Box Attacks with Meta-Learning

License: see the linked page
Deep neural networks (DNNs) have achieved remarkable success in diverse fields. However, it has been demonstrated that DNNs are very vulnerable to adversarial examples even in black-box settings. A large number of black-box attack methods have been proposed in the literature. However, those methods usually suffer from low success rates and large query counts, which cannot fully satisfy practical purposes. In this paper, we propose a hybrid attack method which trains meta adversarial perturbations (MAPs) on surrogate models and performs black-box attacks by estimating gradients of the models. Our method uses the meta adversarial perturbation as an initialization and subsequently trains any black-box attack method for several epochs. Furthermore, the MAPs enjoy favorable transferability and universality, in the sense that they can be employed to boost the performance of other black-box adversarial attack methods. Extensive experiments demonstrate that our method can not only improve the attack success rates but also reduce the number of queries compared to other methods.
# RAVIR: A Dataset and Methodology for the Segmentation and Quantitative Analysis of Retinal Arteries and Veins in Infrared Reflectance Imaging

License: see the linked page
The retinal vasculature provides important clues in the diagnosis and monitoring of systemic diseases including hypertension and diabetes. The microvascular system is of primary involvement in such conditions, and the retina is the only anatomical site where the microvasculature can be directly observed. The objective assessment of retinal vessels has long been considered a surrogate biomarker for systemic vascular diseases, and with recent advancements in retinal imaging and computer vision technologies, this topic has become the subject of renewed attention. In this paper, we present a novel dataset, dubbed RAVIR, for the semantic segmentation of Retinal Arteries and Veins in Infrared Reflectance (IR) imaging. It enables the creation of deep learning-based models that distinguish extracted vessel type without extensive post-processing. We propose a novel deep learning-based methodology, denoted as SegRAVIR, for the semantic segmentation of retinal arteries and veins and the quantitative measurement of the widths of segmented vessels. Our extensive experiments validate the effectiveness of SegRAVIR and demonstrate its superior performance in comparison to state-of-the-art models. Additionally, we propose a knowledge distillation framework for the domain adaptation of RAVIR pretrained networks on color images. We demonstrate that our pretraining procedure yields new state-of-the-art benchmarks on the DRIVE, STARE, and CHASE_DB1 datasets. Dataset link: https://ravirdataset.github.io/data/
# Neural Vocoder is All You Need for Speech Super-resolution

License: see the linked page
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components. Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio. These strong constraints can potentially lead to poor generalization ability in mismatched real-world cases. In this paper, we propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios. NVSR consists of a mel-bandwidth extension module, a neural vocoder module, and a post-processing module. Our proposed system achieves state-of-the-art results on the VCTK multi-speaker benchmark. On 44.1 kHz target resolution, NVSR outperforms WSRGlow and Nu-wave by 8% and 37% respectively on log spectral distance and achieves a significantly better perceptual quality. We also demonstrate that prior knowledge in the pre-trained vocoder is crucial for speech SR by performing mel-bandwidth extension with a simple replication-padding method. Samples can be found at https://haoheliu.github.io/nvsr.
# Interpretable Research Replication Prediction via Variational Contextual Consistency Sentence Masking

License: see the linked page
Research Replication Prediction (RRP) is the task of predicting whether a published research result can be replicated or not. Building an interpretable neural text classifier for RRP promotes the understanding of why a research paper is predicted as replicable or non-replicable and therefore makes its real-world application more reliable and trustworthy. However, prior works on model interpretation mainly focused on improving model interpretability at the word/phrase level, which is insufficient, especially for long research papers in RRP. Furthermore, the existing methods cannot utilize large unlabeled datasets to further improve model interpretability. To address these limitations, we aim to build an interpretable neural model which can provide sentence-level explanations, and to apply a weakly supervised approach that leverages the large corpus of unlabeled data to boost interpretability in addition to improving prediction performance as existing works have done. In this work, we propose the Variational Contextual Consistency Sentence Masking (VCCSM) method to automatically extract key sentences based on the context in the classifier, using both labeled and unlabeled datasets. Results of our experiments on RRP along with European Convention of Human Rights (ECHR) datasets demonstrate that VCCSM is able to improve the model interpretability for long document classification tasks, using the area over the perturbation curve and post-hoc accuracy as evaluation metrics.
# Automatic Debate Evaluation with Argumentation Semantics and Natural Language Argument Graph Networks

License: see the linked page
The lack of annotated data on professional argumentation and complete argumentative debates has led to oversimplification and an inability to approach more complex natural language processing tasks. Such is the case of automatic debate evaluation. In this paper, we propose an original hybrid method to automatically evaluate argumentative debates. For that purpose, we combine concepts from argumentation theory, such as argumentation frameworks and semantics, with Transformer-based architectures and neural graph networks. Furthermore, we obtain promising results that lay the groundwork for a new, unexplored instance of the automatic analysis of natural language arguments.
# Learning Where to Learn in Cross-View Self-Supervised Learning

License: CC BY 4.0
Self-supervised learning (SSL), in which representation learning is mainly guided by a projection into an embedding space, has made enormous progress and largely narrowed the gap with supervised learning. During the projection, current methods simply adopt uniform aggregation of pixels for embedding; however, this risks involving object-irrelevant nuisances and spatial misalignment for different augmentations. In this paper, we present a new approach, Learning Where to Learn (LEWEL), to adaptively aggregate spatial information of features, so that the projected embeddings can be exactly aligned and thus guide the feature learning better. Concretely, we reinterpret the projection head in SSL as a per-pixel projection and predict a set of spatial alignment maps from the original features by this weight-sharing projection head. A spectrum of aligned embeddings is thus obtained by aggregating the features with spatial weighting according to these alignment maps. As a result of this adaptive alignment, we observe substantial improvements on both image-level prediction and dense prediction at the same time: LEWEL improves MoCov2 by 1.6%/1.3%/0.5%/0.4% points and BYOL by 1.3%/1.3%/0.7%/0.6% points on ImageNet linear/semi-supervised classification, Pascal VOC semantic segmentation, and object detection, respectively.
# Random Matrix Analysis of Deep Neural Network Weight Matrices

License: see the linked page
Neural networks have been used successfully in a variety of fields, which has led to a great deal of interest in developing a theoretical understanding of how they store the information needed to perform a particular task. We study the weight matrices of trained deep neural networks using methods from random matrix theory (RMT) and show that the statistics of most of the singular values follow universal RMT predictions. This suggests that they are random and do not contain system specific information, which we investigate further by comparing the statistics of eigenvector entries to the universal Porter-Thomas distribution. We find that for most eigenvectors the hypothesis of randomness cannot be rejected, and that only eigenvectors belonging to the largest singular values deviate from the RMT prediction, indicating that they may encode learned information. We analyze the spectral distribution of such large singular values using the Hill estimator and find that the distribution cannot be characterized by a tail index, i.e. is not of power law type.
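
As a rough illustration of the tail analysis described above, the following sketch computes the Hill estimate of a tail index from the `k` largest values of a sample. The function name, the choice of `k`, and the synthetic Pareto data are assumptions for illustration; the paper's actual procedure on singular values may differ.

```python
import math

def hill_tail_index(values, k):
    """Hill estimate of the tail index from the k largest (positive) values."""
    ordered = sorted(values, reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < len(values)")
    threshold = ordered[k]  # the (k+1)-th largest value
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess

# Synthetic check: exact quantiles of a Pareto distribution with tail index 2.
n, alpha = 1000, 2.0
sample = [((j - 0.5) / n) ** (-1.0 / alpha) for j in range(1, n + 1)]
estimate = hill_tail_index(sample, k=100)
```

For genuinely power-law data the estimate is roughly stable across a range of `k`; an estimate that keeps drifting with `k` is the kind of behaviour that suggests no single tail index describes the distribution.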
# Controllable Dynamic Multi-Task Architectures

License: CC BY 4.0
Multi-task learning commonly encounters competition for resources among tasks, specifically when model capacity is limited. This challenge motivates models which allow control over the relative importance of tasks and total compute cost during inference time. In this work, we propose such a controllable multi-task network that dynamically adjusts its architecture and weights to match the desired task preference as well as the resource constraints. In contrast to the existing dynamic multi-task approaches that adjust only the weights within a fixed architecture, our approach affords the flexibility to dynamically control the total computational cost and match the user-preferred task importance better. We propose a disentangled training of two hypernetworks, by exploiting task affinity and a novel branching regularized loss, to take input preferences and accordingly predict tree-structured models with adapted weights. Experiments on three multi-task benchmarks, namely PASCAL-Context, NYU-v2, and CIFAR-100, show the efficacy of our approach. Project page is available at https://www.nec-labs.com/~mas/DYMU.
# Integrating Physiological Time Series and Clinical Notes for Early Sepsis Prediction

License: see the linked page
Sepsis is a leading cause of death in the Intensive Care Units (ICU). Early detection of sepsis is critical for patient survival. In this paper, we propose a multimodal Transformer model for early sepsis prediction, using the physiological time series data and clinical notes for each patient within $36$ hours of ICU admission. Specifically, we aim to predict sepsis using only the first 12, 18, 24, 30 and 36 hours of laboratory measurements, vital signs, patient demographics, and clinical notes. We evaluate our model on two large critical care datasets: MIMIC-III and eICU-CRD. The proposed method is compared with six baselines. In addition, ablation analysis and case studies are conducted to study the influence of each individual component of the model and the contribution of each data modality for early sepsis prediction. Experimental results demonstrate the effectiveness of our method, which outperforms competitive baselines on all metrics.
# Improving Transformer Efficiency for Multivariate Time Series Classification

License: see the linked page
Most current multivariate time series (MTS) classification algorithms focus on improving the predictive accuracy. However, for large-scale (either high-dimensional or long-sequential) time series (TS) datasets, there is an additional consideration: to design an efficient network architecture to reduce computational costs such as training time and memory footprint. In this work we propose a methodology based on module-wise pruning and Pareto analysis to investigate the relationship between model efficiency and accuracy, as well as its complexity. Comprehensive experiments on benchmark MTS datasets illustrate the effectiveness of our method.
# The Neural Tangent Kernel from a Practical Perspective: Can It Be Trusted for Neural Architecture Search without Training?

License: see the linked page
In Neural Architecture Search (NAS), reducing the cost of architecture evaluation remains one of the most crucial challenges. Among a plethora of efforts to bypass training of each candidate architecture to convergence for evaluation, the Neural Tangent Kernel (NTK) is emerging as a promising theoretical framework that can be utilized to estimate the performance of a neural architecture at initialization. In this work, we revisit several at-initialization metrics that can be derived from the NTK and reveal their key shortcomings. Then, through the empirical analysis of the time evolution of NTK, we deduce that modern neural architectures exhibit highly non-linear characteristics, making the NTK-based metrics incapable of reliably estimating the performance of an architecture without some amount of training. To take such non-linear characteristics into account, we introduce Label-Gradient Alignment (LGA), a novel NTK-based metric whose inherent formulation allows it to capture the large amount of non-linear advantage present in modern neural architectures. With a minimal amount of training, LGA obtains a meaningful level of rank correlation with the post-training test accuracy of an architecture. Lastly, we demonstrate that LGA, complemented with a few epochs of training, successfully guides existing search algorithms to achieve competitive search performance at significantly lower search cost. The code is available at: https://github.com/nutellamok/DemystifyingNTK.
# Risk Regularization through Bidirectional Dispersion

License: see the linked page
Many alternative notions of "risk" (e.g., CVaR, entropic risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. In this work, we study a complementary new risk class that penalizes loss deviations in a bidirectional manner, while having more flexibility in terms of tail sensitivity than is offered by classical mean-variance, without sacrificing computational or analytical tractability.
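
The contrast drawn above between upside-focused risks and bidirectional dispersion can be seen in a small numerical sketch. It uses simple empirical estimators of CVaR, entropic risk, and variance; the function names and the toy loss values are illustrative assumptions and do not reproduce the risk class proposed in the paper.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def empirical_cvar(losses, alpha):
    """Average of the worst (1 - alpha) fraction of the losses."""
    k = max(1, round((1.0 - alpha) * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return mean(worst)

def entropic_risk(losses, theta):
    """(1 / theta) * log E[exp(theta * loss)], for theta > 0."""
    return math.log(mean([math.exp(theta * x) for x in losses])) / theta

losses = [float(v) for v in range(1, 11)]   # 1.0, 2.0, ..., 10.0
downside = [-100.0] + losses[1:]            # same upper tail, one very low loss
```

Here `empirical_cvar(losses, 0.8)` and `empirical_cvar(downside, 0.8)` coincide because only the upper tail enters the estimate, whereas `variance` reacts strongly to the unusually low loss.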
# Safe Active Learning for Multi-Output Gaussian Processes

License: see the linked page
Multi-output regression problems are commonly encountered in science and engineering. In particular, multi-output Gaussian processes have emerged as a promising tool for modeling these complex systems since they can exploit the inherent correlations and provide reliable uncertainty estimates. In many applications, however, acquiring the data is expensive and safety concerns might arise (e.g. robotics, engineering). We propose a safe active learning approach for multi-output Gaussian process regression. This approach queries the most informative data or output, taking the relatedness between the regressors and safety constraints into account. We demonstrate the effectiveness of our approach by providing theoretical analysis and empirical results on simulated datasets and on a real-world engineering dataset. On all datasets, our approach shows improved convergence compared to its competitors.
# Time-Inhomogeneous Diffusion Geometry and Topology

License: see the linked page
Diffusion condensation is a dynamic process that yields a sequence of multiscale data representations that aim to encode meaningful abstractions. It has proven effective for manifold learning, denoising, clustering, and visualization of high-dimensional data. Diffusion condensation is constructed as a time-inhomogeneous process where each step first computes and then applies a diffusion operator to the data. We theoretically analyze the convergence and evolution of this process from geometric, spectral, and topological perspectives. From a geometric perspective, we obtain convergence bounds based on the smallest transition probability and the radius of the data, whereas from a spectral perspective, our bounds are based on the eigenspectrum of the diffusion kernel. Our spectral results are of particular interest since most of the literature on data diffusion is focused on homogeneous processes. From a topological perspective, we show diffusion condensation generalizes centroid-based hierarchical clustering. We use this perspective to obtain a bound based on the number of data points, independent of their location. To understand the evolution of the data geometry beyond convergence, we use topological data analysis. We show that the condensation process itself defines an intrinsic diffusion homology. We use this intrinsic topology as well as an ambient topology to study how the data changes over diffusion time. We demonstrate both homologies in well-understood toy examples. Our work gives theoretical insights into the convergence of diffusion condensation, and shows that it provides a link between topological and geometric data analysis.
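
The "compute, then apply" structure of each condensation step can be sketched as follows. This is a minimal illustration that assumes a Gaussian kernel with a fixed bandwidth and a row-stochastic normalisation; the function names, the bandwidth value, and the toy points are hypothetical, and the kernel or schedule analysed in the paper may differ.

```python
import math

def diffusion_operator(points, bandwidth):
    """Row-stochastic Gaussian-kernel matrix built from the current points."""
    operator = []
    for p in points:
        row = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / bandwidth ** 2)
            for q in points
        ]
        total = sum(row)
        operator.append([w / total for w in row])
    return operator

def condensation_step(points, bandwidth):
    """Compute the operator from the data, then apply it to the same data."""
    operator = diffusion_operator(points, bandwidth)
    dim = len(points[0])
    return [
        [sum(w * q[d] for w, q in zip(row, points)) for d in range(dim)]
        for row in operator
    ]

def spread(points):
    """Largest coordinate range, a crude stand-in for the radius of the data."""
    return max(
        max(p[d] for p in points) - min(p[d] for p in points)
        for d in range(len(points[0]))
    )

points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]]
history = [points]
for _ in range(5):
    history.append(condensation_step(history[-1], bandwidth=1.0))
```

Because the operator is recomputed from the updated points at every step, the process is time-inhomogeneous; each step replaces every point with a weighted average of all points, so the spread shrinks as the data condenses.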
# Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for Length-Normalized Embeddings

License: see the linked page
In speaker recognition, where speech segments are mapped to embeddings on the unit hypersphere, two scoring backends are commonly used, namely cosine scoring or PLDA. Both have advantages and disadvantages, depending on the context. Cosine scoring follows naturally from the spherical geometry, but for PLDA the blessing is mixed -- length normalization Gaussianizes the between-speaker distribution, but violates the assumption of a speaker-independent within-speaker distribution. We propose PSDA, an analogue to PLDA that uses Von Mises-Fisher distributions on the hypersphere for both within and between-class distributions. We show how the self-conjugacy of this distribution gives closed-form likelihood-ratio scores, making it a drop-in replacement for PLDA at scoring time. All kinds of trials can be scored, including single-enroll and multi-enroll verification, as well as more complex likelihood-ratios that could be used in clustering and diarization. Learning is done via an EM-algorithm with closed-form updates. We explain the model and present some first experiments.
# Relaxation Labeling Meets GANs: Solving Jigsaw Puzzles with Missing Borders

License: see the linked page
This paper proposes JiGAN, a GAN-based method for solving Jigsaw puzzles with eroded or missing borders. Missing borders are a common real-world situation, for example, when dealing with the reconstruction of broken artifacts or ruined frescoes. In this condition, the puzzle's pieces do not align perfectly because of the gaps at the borders, and direct matching of the patches is infeasible due to the lack of color and line continuations. JiGAN is a two-step procedure that tackles this issue: first, we repair the eroded borders with a GAN-based image extension model and measure the alignment affinity between pieces; then, we solve the puzzle with the relaxation labeling algorithm to enforce consistency in the positioning of pieces, thereby reconstructing the puzzle. We test the method on a large dataset of small puzzles and on three commonly used benchmark datasets to demonstrate the feasibility of the proposed approach.
# A Semi-Supervised Anomaly Detection Algorithm Based on KL Divergence (SAD-KL)

License: see the linked page
Unlabeled data are generally assumed to be normal data when detecting abnormal data via semi-supervised learning. This assumption, however, causes inevitable detection errors when the distribution of the unlabeled data differs from the distribution of the labeled normal dataset. To deal with the problem caused by the distribution gap between labeled and unlabeled data, we propose a semi-supervised anomaly detection algorithm using KL divergence (SAD-KL). The proposed SAD-KL is composed of two steps: (1) estimating the KL divergence between the probability density functions (PDFs) of the local outlier factors (LOFs) of the labeled normal data and the unlabeled data; (2) estimating the detection probability and the threshold for detecting normal data in the unlabeled data by using the KL divergence. We show that the PDFs of the LOFs follow the Burr distribution and use them for detection. Once the threshold is computed, SAD-KL runs iteratively until the labeling change rate is lower than a predefined threshold. Experimental results show that SAD-KL achieves a higher detection probability than existing algorithms even though it takes less learning time.
# To Fold or Not to Fold: A Necessary and Sufficient Condition for Batch-Normalization Layers

License: see the linked page
Batch-Normalization (BN) layers have become fundamental components in ever more complex deep neural network architectures. Such models require acceleration processes for deployment on edge devices. However, BN layers add computation bottlenecks due to sequential operation processing; thus, a key yet often overlooked component of the acceleration process is BN layer folding. In this paper, we demonstrate that the current BN folding approaches are suboptimal in terms of how many layers can be removed. We therefore provide a necessary and sufficient condition for BN folding and a corresponding optimal algorithm. The proposed approach systematically outperforms existing baselines and allows a dramatic reduction in the inference time of deep neural networks.
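
For readers unfamiliar with BN folding, the sketch below shows the standard case of folding an inference-time BN layer into the linear layer that precedes it. It illustrates the basic identity only, not the optimal algorithm or the necessary and sufficient condition from the paper; the function names and the toy parameters are assumptions.

```python
import math

def linear(weights, bias, x):
    """y = W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN using fixed running statistics."""
    return [
        g * (v - m) / math.sqrt(s + eps) + b
        for v, g, b, m, s in zip(y, gamma, beta, mean, var)
    ]

def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weights, bias) of a single linear layer equal to linear + BN."""
    scales = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    folded_weights = [[scale * w for w in row] for row, scale in zip(weights, scales)]
    folded_bias = [
        scale * (b - m) + shift
        for scale, b, m, shift in zip(scales, bias, mean, beta)
    ]
    return folded_weights, folded_bias

# Toy layer with 2 outputs and 3 inputs (values chosen only for illustration).
W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
running_mean, running_var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -1.0]

unfolded = batch_norm(linear(W, b, x), gamma, beta, running_mean, running_var)
W_folded, b_folded = fold_batch_norm(W, b, gamma, beta, running_mean, running_var)
folded = linear(W_folded, b_folded, x)
```

After folding, the BN layer no longer appears as a separate operation at inference time, which is where the speed-up comes from.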
# A Meta Functional Learning Framework for Regularising Knowledge Transfer

License: see the linked page
Machine learning classifiers' capability is largely dependent on the scale of available training data and limited by the model overfitting in data-scarce learning tasks. To address this problem, this work proposes a novel framework of Meta Functional Learning (MFL) by meta-learning a generalisable functional model from data-rich tasks whilst simultaneously regularising knowledge transfer to data-scarce tasks. The MFL computes meta-knowledge on functional regularisation generalisable to different learning tasks by which functional training on limited labelled data promotes more discriminative functions to be learned. Based on this framework, we formulate three variants of MFL: MFL with Prototypes (MFL-P) which learns a functional by auxiliary prototypes, Composite MFL (ComMFL) that transfers knowledge from both functional space and representational space, and MFL with Iterative Updates (MFL-IU) which improves knowledge transfer regularisation from MFL by progressively learning the functional regularisation in knowledge transfer. Moreover, we generalise these variants for knowledge transfer regularisation from binary classifiers to multi-class classifiers. Extensive experiments on two few-shot learning scenarios, Few-Shot Learning (FSL) and Cross-Domain Few-Shot Learning (CD-FSL), show that meta functional learning for knowledge transfer regularisation can improve FSL classifiers.
# Attributable Visual Similarity Learning

License: see the linked page
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images. Most existing similarity learning methods exacerbate the unexplainability by mapping each sample to a single point in the embedding space with a distance metric (e.g., Mahalanobis distance, Euclidean distance). Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph and then infer the overall similarity accordingly. Furthermore, we establish a bottom-up similarity construction and top-down similarity inference framework to infer the similarity based on semantic hierarchy consistency. We first identify unreliable higher-level similarity nodes and then correct them using the most coherent adjacent lower-level similarity nodes, which simultaneously preserve traces for similarity attribution. Extensive experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods and verify the interpretability of our framework. Code is available at https://github.com/zbr17/AVSL.
# Few-Shot Learning with Siamese Networks and Label Tuning

License: CC BY 4.0
We study the problem of building text classifiers with little or no training data, commonly known as zero and few-shot text classification. In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. In this work, we show that with proper pre-training, Siamese Networks that embed texts and labels offer a competitive alternative. These models allow for a large reduction in inference cost: constant in the number of labels rather than linear. Furthermore, we introduce label tuning, a simple and computationally efficient approach that adapts the models in a few-shot setup by changing only the label embeddings. While giving lower performance than model fine-tuning, this approach has the architectural advantage that a single encoder can be shared by many different tasks.
# Large-Scale Bilingual Language-Image Contrastive Learning

License: see the linked page
This paper is a technical report sharing our experience and findings from building a Korean and English bilingual multimodal model. While many multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts, such machine-translated texts are limited in describing unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) a multimodal model trained without cross-lingual relations can learn them via visual semantics; 3) our bilingual KELIP can capture cultural differences of visual semantics for words with the same meaning; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP.
# Stratified Transformer for 3D Point Cloud Segmentation

License: see the linked page
3D point cloud segmentation has made tremendous progress in recent years. Most current methods focus on aggregating local features, but fail to directly model long-range dependencies. In this paper, we propose Stratified Transformer that is able to capture long-range contexts and demonstrates strong generalization ability and high performance. Specifically, we first put forward a novel key sampling strategy. For each query point, we sample nearby points densely and distant points sparsely as its keys in a stratified way, which enables the model to enlarge the effective receptive field and enjoy long-range contexts at a low computational cost. Also, to combat the challenges posed by irregular point arrangements, we propose first-layer point embedding to aggregate local information, which facilitates convergence and boosts performance. Besides, we adopt contextual relative position encoding to adaptively capture position information. Finally, a memory-efficient implementation is introduced to overcome the issue of varying point numbers in each window. Extensive experiments demonstrate the effectiveness and superiority of our method on S3DIS, ScanNetv2 and ShapeNetPart datasets. Code is available at https://github.com/dvlab-research/Stratified-Transformer.
# Translation Consistent Semi-Supervised Segmentation for 3D Medical Images

License: see the linked page
3D medical image segmentation methods have been successful, but their dependence on large amounts of voxel-level annotated data is a disadvantage that needs to be addressed given the high cost of obtaining such annotation. Semi-supervised learning (SSL) addresses this issue by training models with a large unlabelled and a small labelled dataset. The most successful SSL approaches are based on consistency learning that minimises the distance between model responses obtained from perturbed views of the unlabelled data. These perturbations usually keep the spatial input context between views fairly consistent, which may cause the model to learn segmentation patterns from the spatial input contexts instead of the segmented objects. In this paper, we introduce Translation Consistent Co-training (TraCoCo), a consistency learning SSL method that perturbs the input data views by varying their spatial input context, allowing the model to learn segmentation patterns from visual objects. Furthermore, we propose replacing the commonly used mean squared error (MSE) semi-supervised loss with a new Cross-model confident Binary Cross entropy (CBC) loss, which improves training convergence and maintains robustness to co-training pseudo-labelling mistakes. We also extend CutMix augmentation to 3D SSL to further improve generalisation. Our TraCoCo shows state-of-the-art results for the Left Atrium (LA) and Brain Tumor Segmentation (BRaTS19) datasets with different backbones. Our code is available at https://github.com/yyliu01/TraCoCo.
# CenterLoc3D: Monocular 3D Vehicle Localization Network for Roadside Surveillance Cameras

License: see the linked page
Monocular 3D vehicle localization is an important task in Intelligent Transportation System (ITS) and Cooperative Vehicle Infrastructure System (CVIS), which is usually achieved by monocular 3D vehicle detection. However, depth information cannot be obtained directly by monocular cameras due to the inherent imaging mechanism, which makes monocular 3D tasks more challenging. Most current monocular 3D vehicle detection methods leverage 2D detectors and additional geometric modules, which reduces efficiency. In this paper, we propose CenterLoc3D, a 3D vehicle localization network for roadside monocular cameras, which directly predicts the centroid and eight vertexes in image space and the dimensions of 3D bounding boxes without 2D detectors. To improve the precision of 3D vehicle localization, we propose a weighted-fusion module and a loss with spatial constraints embedded in CenterLoc3D. Firstly, the transformation matrix between 2D image space and 3D world space is solved by camera calibration. Secondly, the vehicle type, centroid, eight vertexes and dimensions of 3D vehicle bounding boxes are obtained by CenterLoc3D. Finally, the centroid in 3D world space is obtained from the camera calibration and CenterLoc3D for 3D vehicle localization. To the best of our knowledge, this is the first application of 3D vehicle localization for roadside monocular cameras. Hence, we also propose a benchmark for this application including a dataset (SVLD-3D), an annotation tool (LabelImg-3D) and evaluation metrics. Through experimental validation, the proposed method achieves high accuracy and real-time performance.
# WSEBP: A Novel Width-Depth Synchronous Extension-Based Basis Pursuit Algorithm for Multi-Layer Convolutional Sparse Coding

License: see the linked page
The pursuit algorithms integrated in multi-layer convolutional sparse coding (ML-CSC) can interpret convolutional neural networks (CNNs). However, many current state-of-the-art (SOTA) pursuit algorithms require multiple iterations to optimize the solution of ML-CSC, which limits their application to deeper CNNs because of the high computational cost and the large amount of resources needed for a very small performance gain. In this study, we focus on the 0th iteration of the pursuit algorithm by introducing an effective initialization strategy for each layer, by which the solution for ML-CSC can be improved. Specifically, we first propose a novel width-depth synchronous extension-based basis pursuit (WSEBP) algorithm, which solves the ML-CSC problem without the iteration-count limitation of the SOTA algorithms and maximizes performance through an effective initialization in each layer. Then, we propose a simple and unified ML-CSC-based classification network (ML-CSC-Net), which consists of an ML-CSC-based feature encoder and a fully-connected layer, to validate the performance of WSEBP on the image classification task. The experimental results show that our proposed WSEBP outperforms SOTA algorithms in terms of accuracy and resource consumption. In addition, WSEBP integrated in CNNs can improve the performance of deeper CNNs and make them interpretable. Finally, taking VGG as an example, we propose WSEBP-VGG13 to enhance the performance of VGG13, which achieves competitive results on four public datasets, i.e., 87.79% vs. 86.83% on the Cifar-10 dataset, 58.01% vs. 54.60% on the Cifar-100 dataset, 91.52% vs. 89.58% on the COVID-19 dataset, and 99.88% vs. 99.78% on the Crack dataset. The results show the effectiveness of the proposed WSEBP, the improved performance of ML-CSC with WSEBP, and the interpretation of CNNs or deeper CNNs.
# Energy-Based Latent Aligner for Incremental Learning

License: CC BY 4.0
Deep learning models tend to forget their earlier knowledge while incrementally learning new tasks. This behavior emerges because the parameter updates optimized for the new tasks may not align well with the updates suitable for older tasks. The resulting latent representation mismatch causes forgetting. In this work, we propose ELI: Energy-based Latent Aligner for Incremental Learning, which first learns an energy manifold for the latent representations such that previous task latents will have low energy and the current task latents have high energy values. This learned manifold is used to counter the representational shift that happens during incremental learning. The implicit regularization that is offered by our proposed methodology can be used as a plug-and-play module in existing incremental learning methodologies. We validate this through extensive evaluation on CIFAR-100, ImageNet subset, ImageNet 1k and Pascal VOC datasets. We observe consistent improvement when ELI is added to three prominent methodologies in class-incremental learning, across multiple incremental settings. Further, when added to the state-of-the-art incremental object detector, ELI provides over 5% improvement in detection accuracy, corroborating its effectiveness and complementary advantage to existing art.
# Knowledge Distillation: Bad Models Can Be Good Role Models

License: see the linked page
Large neural networks trained in the overparameterized regime are able to fit noise to zero train error. Recent work \citep{nakkiran2020distributional} has empirically observed that such networks behave as "conditional samplers" from the noisy distribution. That is, they replicate the noise in the train data to unseen examples. We give a theoretical framework for studying this conditional sampling behavior in the context of learning theory. We relate the notion of such samplers to knowledge distillation, where a student network imitates the outputs of a teacher on unlabeled data. We show that samplers, while being bad classifiers, can be good teachers. Concretely, we prove that distillation from samplers is guaranteed to produce a student which approximates the Bayes optimal classifier. Finally, we show that some common learning algorithms (e.g., Nearest-Neighbours and Kernel Machines) can generate samplers when applied in the overparameterized regime.
# Automated Progressive Learning for Efficient Training of Vision Transformers

License: see the linked page
Recent advances in vision Transformers (ViTs) have come with a voracious appetite for computing power, highlighting the urgent need to develop efficient training methods for ViTs. Progressive learning, a training scheme where the model capacity grows progressively during training, has started showing its ability in efficient training. In this paper, we take a practical step towards efficient training of ViTs by customizing and automating progressive learning. First, we develop a strong manual baseline for progressive learning of ViTs, by introducing momentum growth (MoGrow) to bridge the gap brought by model growth. Then, we propose automated progressive learning (AutoProg), an efficient training scheme that aims to achieve lossless acceleration by automatically increasing the training overload on-the-fly; this is achieved by adaptively deciding whether, where and how much the model should grow during progressive learning. Specifically, we first relax the optimization of the growth schedule to a sub-network architecture optimization problem, then propose one-shot estimation of the sub-network performance via an elastic supernet. The search overhead is minimized by recycling the parameters of the supernet. Extensive experiments on efficient training on ImageNet with two representative ViT models, DeiT and VOLO, demonstrate that AutoProg can accelerate ViT training by up to 85.1% with no performance drop. Code: https://github.com/changlin31/AutoProg
# Multi-Task Learning for Visual Scene Understanding

License: see the linked page
Despite the recent progress in deep learning, most approaches still go for a silo-like solution, focusing on learning each task in isolation: training a separate neural network for each individual task. Many real-world problems, however, call for a multi-modal approach and, therefore, for multi-tasking models. Multi-task learning (MTL) aims to leverage useful information across tasks to improve the generalization capability of a model. This thesis is concerned with multi-task learning in the context of computer vision. First, we review existing approaches for MTL. Next, we propose several methods that tackle important aspects of multi-task learning. The proposed methods are evaluated on various benchmarks. The results show several advances in the state-of-the-art of multi-task learning. Finally, we discuss several possibilities for future work.
# GIRAFFE HD: A High-Resolution 3D-Aware Generative Model

License: see the linked page
3D-aware generative models have shown that the introduction of 3D information can lead to more controllable image generation. In particular, the current state-of-the-art model GIRAFFE can control each object's rotation, translation, scale, and scene camera pose without corresponding supervision. However, GIRAFFE only operates well when the image resolution is low. We propose GIRAFFE HD, a high-resolution 3D-aware generative model that inherits all of GIRAFFE's controllable features while generating high-quality, high-resolution images ($512^2$ resolution and above). The key idea is to leverage a style-based neural renderer, and to independently generate the foreground and background to force their disentanglement while imposing consistency constraints to stitch them together to composite a coherent final image. We demonstrate state-of-the-art 3D controllable high-resolution image generation on multiple natural image datasets.
# Corpus of Japanese Empathetic Dialogue Speech: Towards a Friendly Voice Agent

License: see the linked page
We present STUDIES, a new speech corpus for developing a voice agent that can speak in a friendly manner. Humans naturally control their speech prosody to empathize with each other. By incorporating this "empathetic dialogue" behavior into a spoken dialogue system, we can develop a voice agent that can respond to a user more naturally. We designed the STUDIES corpus to include a speaker who explicitly speaks with empathy for the interlocutor's emotion. We describe our methodology to construct an empathetic dialogue speech corpus and report the analysis results of the STUDIES corpus. We conducted a text-to-speech experiment as an initial investigation of how to develop a more natural voice agent that can tune its speaking style to the interlocutor's emotion. The results show that using the interlocutor's emotion label and a conversational context embedding can produce speech with the same degree of naturalness as that synthesized by using the agent's emotion label. Our project page of the STUDIES corpus is http://sython.org/Corpus/STUDIES.
# FedVLN: Privacy-Preserving Vision-and-Language Navigation

License: see the linked page
Data privacy is a central problem for embodied agents that can perceive the environment, communicate with humans, and act in the real world. While helping humans complete tasks, the agent may observe and process sensitive information of users, such as house environments, human activities, etc. In this work, we introduce privacy-preserving embodied agent learning for the task of Vision-and-Language Navigation (VLN), where an embodied agent navigates house environments by following natural language instructions. We view each house environment as a local client, which shares nothing other than local updates with the cloud server and other clients, and propose a novel federated vision-and-language navigation (FedVLN) framework to protect data privacy during both training and pre-exploration. Particularly, we propose a decentralized training strategy to limit the data of each client to its local model training and a federated pre-exploration method to do partial model aggregation to improve model generalizability to unseen environments. Extensive results on R2R and RxR datasets show that under our FedVLN framework, decentralized VLN models achieve comparable results with centralized training while protecting seen environment privacy, and federated pre-exploration significantly outperforms centralized pre-exploration while preserving unseen environment privacy.
# STaR: Bootstrapping Reasoning With Reasoning

License: see the linked page
Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30$\times$ larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
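
The loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate` and `finetune` are hypothetical callables standing in for a prompted language model and a fine-tuning routine, and the abstract does not say whether each round of fine-tuning restarts from the original model, so the sketch simply passes along whatever `finetune` returns.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]      # (question, correct answer)
Kept = Tuple[str, str, str]    # (question, rationale, answer)

# Hypothetical signature: generate(model, question, hint) -> (rationale, answer).
# `hint` is None for a normal attempt, or the correct answer for a retry.
Generate = Callable[[object, str, Optional[str]], Tuple[str, str]]


def collect_rationales(model, dataset: List[Example], generate: Generate) -> List[Kept]:
    """Keep only rationales whose final answer matches the correct answer."""
    kept: List[Kept] = []
    for question, gold in dataset:
        rationale, answer = generate(model, question, None)
        if answer != gold:
            # Wrong answer: try again, this time given the correct answer.
            rationale, answer = generate(model, question, gold)
        if answer == gold:
            kept.append((question, rationale, answer))
    return kept


def star(model, dataset: List[Example], generate: Generate,
         finetune: Callable[[object, List[Kept]], object], n_iterations: int):
    """Repeat: generate rationales, filter by correctness, fine-tune."""
    for _ in range(n_iterations):
        kept = collect_rationales(model, dataset, generate)
        model = finetune(model, kept)
    return model
```

Filtering on the final answer is what lets a dataset without rationales act as supervision: only the answers need to be known in advance.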
# EnCBP: A Benchmark Dataset for Fine-Grained Cultural Background Prediction in English

License: see the linked page
While cultural backgrounds have been shown to affect linguistic expressions, existing natural language processing (NLP) research on culture modeling is overly coarse-grained and does not examine cultural differences among speakers of the same language. To address this problem and augment NLP models with cultural background features, we collect, annotate, manually validate, and benchmark EnCBP, a finer-grained news-based cultural background prediction dataset in English. Through language modeling (LM) evaluations and manual analyses, we confirm that there are noticeable differences in linguistic expressions among five English-speaking countries and across four states in the US. Additionally, our evaluations on nine syntactic (CoNLL-2003), semantic (PAWS-Wiki, QNLI, STS-B, and RTE), and psycholinguistic tasks (SST-5, SST-2, Emotion, and Go-Emotions) show that, while introducing cultural background information does not benefit the Go-Emotions task due to text domain conflicts, it noticeably improves deep learning (DL) model performance on other tasks. Our findings strongly support the importance of cultural background modeling to a wide variety of NLP tasks and demonstrate the applicability of EnCBP in culture-related research.
# (Reference translation) CVF-SID: Cyclic Multi-Variate Function for Self-Supervised Image Denoising by Disentangling Noise from Image

License: CC BY 4.0
Recently, significant progress has been made on image denoising with strong supervision from large-scale datasets. However, obtaining well-aligned noisy-clean training image pairs for each specific scenario is complicated and costly in practice. Consequently, applying a conventional supervised denoising network on in-the-wild noisy inputs is not straightforward. Although several studies have tackled this problem without strong supervision, they rely on less practical assumptions and cannot be applied to practical situations directly. To address the aforementioned challenges, we propose a novel and powerful self-supervised denoising method called CVF-SID based on a Cyclic multi-Variate Function (CVF) module and a self-supervised image disentangling (SID) framework. The CVF module can output multiple decomposed variables of the input and take a combination of the outputs back as an input in a cyclic manner. Our CVF-SID can disentangle a clean image and noise maps from the input by leveraging various self-supervised loss terms. Unlike several methods that only consider signal-independent noise models, we also deal with signal-dependent noise components for real-world applications. Furthermore, we do not rely on any prior assumptions about the underlying noise distribution, making CVF-SID more generalizable toward realistic noise. Extensive experiments on real-world datasets show that CVF-SID achieves state-of-the-art self-supervised image denoising performance and is comparable to other existing approaches. The code is publicly available from https://github.com/Reyhanehne/CVF-SID_PyTorch .
# (Reference translation) Multi-modal Emotion Estimation for In-the-Wild Videos

License: CC BY 4.0
In this paper, we briefly introduce our submission to the Valence-Arousal Estimation Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW) competition. Our method utilizes multi-modal information, i.e., the visual and audio information, and employs a temporal encoder to model the temporal context in the videos. In addition, a smooth processor is applied to obtain more reasonable predictions, and a model ensemble strategy is used to improve the performance of our proposed method. The experimental results show that our method achieves a CCC of 65.55% for valence and 70.88% for arousal on the validation set of the Aff-Wild2 dataset, which demonstrates the effectiveness of our proposed method.
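
The scores above are reported as CCC, the concordance correlation coefficient. As a reminder of what that number measures, here is a minimal sketch of the standard definition, $2\,\mathrm{cov}(x, y) / (\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2)$, using population variances. The function name `ccc` is chosen for illustration, and the challenge's own evaluation code may differ in details.

```python
from typing import Sequence


def ccc(pred: Sequence[float], target: Sequence[float]) -> float:
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denom
```

Unlike Pearson correlation, CCC penalizes a constant offset: predictions `[2, 3, 4]` against targets `[1, 2, 3]` are perfectly correlated but score only 4/7.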
# (Reference translation) Coarse-to-Fine Cascaded Networks with Smooth Predicting for Video Facial Expression Recognition

License: CC BY 4.0
Facial expression recognition plays an important role in human-computer interaction. In this paper, we propose the Coarse-to-Fine Cascaded network with Smooth Predicting (CFC-SP) to improve the performance of facial expression recognition. CFC-SP contains two core components, namely Coarse-to-Fine Cascaded networks (CFC) and Smooth Predicting (SP). CFC first groups several similar emotions to form a rough category, and then employs a network to conduct a coarse but accurate classification. An additional network for these grouped emotions is then used to obtain fine-grained predictions. SP improves the recognition capability of the model by capturing both universal and unique expression features. Specifically, the universal features denote the general characteristics of facial emotions within a period, and the unique features denote the specific characteristics at a given moment. Experiments on Aff-Wild2 show the effectiveness of the proposed CFC-SP.
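# (Reference translation) Towards Exemplar-Free Continual Learning in Vision Transformers: An Account of Attention, Functional and Weight Regularization

License: CC BY 4.0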
In this paper, we investigate the continual learning of Vision Transformers (ViT) for the challenging exemplar-free scenario, with special focus on how to efficiently distill the knowledge of its crucial self-attention mechanism (SAM). Our work takes an initial step towards a surgical investigation of SAM for designing coherent continual learning methods in ViTs. We first carry out an evaluation of established continual learning regularization techniques. We then examine the effect of regularization when applied to two key enablers of SAM: (a) the contextualized embedding layers, for their ability to capture well-scaled representations with respect to the values, and (b) the prescaled attention maps, for carrying value-independent global contextual information. We depict the perks of each distilling strategy on two image recognition benchmarks (CIFAR100 and ImageNet-32): while (a) leads to a better overall accuracy, (b) helps enhance the rigidity by maintaining competitive performances. Furthermore, we identify the limitation imposed by the symmetric nature of regularization losses. To alleviate this, we propose an asymmetric variant and apply it to the pooled output distillation (POD) loss adapted for ViTs. Our experiments confirm that introducing asymmetry to POD boosts its plasticity while retaining stability across (a) and (b). Moreover, we acknowledge low forgetting measures for all the compared methods, indicating that ViTs might be naturally inclined continual learners.
# (Reference translation) Statistic Selection and MCMC for Differentially Private Bayesian Estimation

License: CC BY 4.0
This paper concerns differentially private Bayesian estimation of the parameters of a population distribution, when a statistic of a sample from that population is shared in noise to provide differential privacy. This work mainly addresses two problems: (1) What statistic of the sample should be shared privately? For this first question, i.e., the one about statistic selection, we promote using the Fisher information. We find that the statistic that is most informative in a non-privacy setting may not be the optimal choice under the privacy restrictions. We provide several examples to support that point. We consider several types of data sharing settings and propose several Monte Carlo-based numerical estimation methods for calculating the Fisher information for those settings. The second question concerns inference: (2) Based on the shared statistics, how could we perform effective Bayesian inference? We propose several Markov chain Monte Carlo (MCMC) algorithms for sampling from the posterior distribution of the parameter given the noisy statistic. The proposed MCMC algorithms can be preferred over one another depending on the problem. For example, when the shared statistic is additive and Gaussian noise is added to it, a simple Metropolis-Hastings algorithm that utilizes the central limit theorem is a decent choice. We propose more advanced MCMC algorithms for several other cases of practical relevance. Our numerical examples involve comparing several candidate statistics to be shared privately. For each statistic, we perform Bayesian estimation based on the posterior distribution conditional on the privatized version of that statistic. We demonstrate that the relative performance of a statistic, in terms of the mean squared error of the Bayesian estimator based on the corresponding privatized statistic, is adequately predicted by the Fisher information of the privatized statistic.
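
To make the Metropolis-Hastings case mentioned above concrete, here is a toy sketch under assumptions that are not taken from the paper: the data are $n$ Bernoulli($\theta$) draws, the shared statistic is their sum plus Gaussian noise of standard deviation `sigma`, the prior on $\theta$ is uniform on $(0, 1)$, and the proposal is a Gaussian random walk. By the central limit theorem the noisy sum is then approximately $\mathcal{N}(n\theta,\ n\theta(1-\theta) + \sigma^2)$, which gives a tractable likelihood. All function and parameter names are hypothetical.

```python
import math
import random
from typing import List


def log_likelihood(theta: float, s: float, n: int, sigma: float) -> float:
    """CLT approximation: s | theta ~ N(n*theta, n*theta*(1-theta) + sigma^2)."""
    var = n * theta * (1.0 - theta) + sigma ** 2
    return -0.5 * math.log(var) - (s - n * theta) ** 2 / (2.0 * var)


def metropolis_hastings(s: float, n: int, sigma: float, n_iter: int = 20000,
                        step: float = 0.02, seed: int = 0) -> List[float]:
    """Random-walk MH for theta given the noisy sum s, with a uniform prior."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_likelihood(theta, s, n, sigma)
    samples: List[float] = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        # Proposals outside (0, 1) have zero prior mass and are rejected.
        if 0.0 < proposal < 1.0:
            candidate = log_likelihood(proposal, s, n, sigma)
            # Symmetric proposal, so the acceptance ratio is the likelihood ratio.
            if rng.random() < math.exp(min(0.0, candidate - current)):
                theta, current = proposal, candidate
        samples.append(theta)
    return samples
```

For instance, with `n = 1000`, a noisy sum `s = 300.0` and `sigma = 10.0`, the samples after a burn-in period should concentrate near `s / n = 0.3`.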
# (Reference translation) Learning Relational Rules from Rewards

License: CC BY 4.0
Humans perceive the world in terms of objects and relations between them. In fact, for any given pair of objects, there is a myriad of relations that apply to them. How does the cognitive system learn which relations are useful to characterize the task at hand? And how can it use these representations to build a relational policy to interact effectively with the environment? In this paper we propose that this problem can be understood through the lens of a sub-field of symbolic machine learning called relational reinforcement learning (RRL). To demonstrate the potential of our approach, we build a simple model of relational policy learning based on a function approximator developed in RRL. We trained and tested our model in three Atari games that required considering an increasing number of potential relations: Breakout, Pong and Demon Attack. In each game, our model was able to select adequate relational representations and build a relational policy incrementally. We discuss the relationship between our model and models of relational and analogical reasoning, as well as its limitations and future directions of research.
# (Reference translation) ZS4IE: A Toolkit for Zero-Shot Information Extraction with Simple Verbalizations

License: CC BY-SA 4.0
The current workflow for Information Extraction (IE) analysts involves the definition of the entities/relations of interest and a training corpus with annotated examples. In this demonstration we introduce a new workflow where the analyst directly verbalizes the entities/relations, which are then used by a Textual Entailment model to perform zero-shot IE. We present the design and implementation of a toolkit with a user interface, as well as experiments on four IE tasks that show that the system achieves very good performance at zero-shot learning using only 5--15 minutes per type of a user's effort. Our demonstration system is open-sourced at https://github.com/BBN-E/ZS4IE . A demonstration video is available at https://vimeo.com/676138340 .
# Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

License: see the linked page
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current temporal grounding datasets do not specifically test for the compositional generalizability. To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. Evaluating the state-of-the-art methods on our new dataset splits, we empirically find that they fail to generalize to queries with novel combinations of seen words. To tackle this challenge, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies and learns fine-grained semantic correspondence among them. Experiments illustrate the superior compositional generalizability of our approach. The repository of this work is at https://github.com/YYJMJC/Compositional-Temporal-Grounding.
# Multi-modal Multi-label Facial Action Unit Detection with Transformer

License: see the linked page
The Facial Action Coding System is an important approach to facial expression analysis. This paper describes our submission to the third Affective Behavior Analysis (ABAW) 2022 competition. We propose a transformer-based model to detect facial action units (FAUs) in video. Specifically, we first train a multi-modal model to extract both audio and visual features. We then propose an action unit correlation module to learn the relationships between action unit labels and refine the action unit detection results. Experimental results on the validation dataset show that our method achieves better performance than the baseline model, which verifies the effectiveness of the proposed network.
# UKP-SQUARE: An Online Platform for Question Answering Research

License: see the linked page
Recent advances in NLP and information retrieval have given rise to a diverse set of question answering tasks that are of different formats (e.g., extractive, abstractive), require different model architectures (e.g., generative, discriminative), and setups (e.g., with or without retrieval). Despite having a large number of powerful, specialized QA pipelines (which we refer to as Skills) that consider a single domain, model or setup, there exists no framework where users can easily explore and compare such pipelines and can extend them according to their needs. To address this issue, we present UKP-SQUARE, an extensible online QA platform for researchers which allows users to query and analyze a large collection of modern Skills via a user-friendly web interface and integrated behavioural tests. In addition, QA researchers can develop, manage, and share their custom Skills using our microservices that support a wide range of models (Transformers, Adapters, ONNX), datastores and retrieval techniques (e.g., sparse and dense). UKP-SQUARE is available on https://square.ukp-lab.de.
# (Reference translation) Chain-based Discriminative Autoencoders for Speech Recognition

License: CC BY 4.0
In our previous work, we proposed a discriminative autoencoder (DcAE) for speech recognition. DcAE combines two training schemes into one. First, since DcAE aims to learn encoder-decoder mappings, the squared error between the reconstructed speech and the input speech is minimized. Second, in the code layer, frame-based phonetic embeddings are obtained by minimizing the categorical cross-entropy between ground truth labels and predicted triphone-state scores. DcAE is developed based on the Kaldi toolkit by treating various TDNN models as encoders. In this paper, we further propose three new versions of DcAE. First, a new objective function that considers both categorical cross-entropy and mutual information between ground truth and predicted triphone-state sequences is used. The resulting DcAE is called a chain-based DcAE (c-DcAE). For application to robust speech recognition, we further extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE. In these two models, both the error between the reconstructed noisy speech and the input noisy speech and the error between the enhanced speech and the reference clean speech are taken into the objective function. Experimental results on the WSJ and Aurora-4 corpora show that our DcAE models outperform baseline systems.
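
One way to read the two DcAE training schemes described above is as a single combined objective. The following is only an illustrative form: the abstract does not state how the two terms are weighted, so the weight $\lambda$ and the notation ($x_t$ for an input frame, $\hat{x}_t$ for its reconstruction, $y_t$ for the ground-truth triphone-state label) are assumptions.

$$
\mathcal{L}_{\text{DcAE}} \;=\; \underbrace{-\sum_{t} \log p(y_t \mid x_t)}_{\text{categorical cross-entropy}} \;+\; \lambda \underbrace{\sum_{t} \lVert \hat{x}_t - x_t \rVert_2^2}_{\text{reconstruction error}}
$$

The first term shapes the code layer into frame-level phonetic embeddings, while the second keeps the encoder-decoder mapping faithful to the input speech.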
# MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection

License: see the linked page
Monocular 3D object detection has long been a challenging task in autonomous driving, which requires decoding 3D predictions solely from a single 2D image. Most existing methods follow conventional 2D object detectors to first localize objects by their centers, and then predict 3D attributes using center-neighboring local features. However, such a center-based pipeline views 3D prediction as a subordinate task and lacks inter-object depth interactions with global spatial clues. In this paper, we introduce a simple framework for Monocular DEtection with depth-aware TRansformer, named MonoDETR. We enable the vanilla transformer to be depth-aware and enforce that the whole detection process is guided by depth. Specifically, we represent 3D object candidates as a set of queries and produce non-local depth embeddings of the input image by a lightweight depth predictor and an attention-based depth encoder. Then, we propose a depth-aware decoder to conduct both inter-query and query-scene depth feature communication. In this way, each object estimates its 3D attributes adaptively from the depth-informative regions on the image, not limited by center-around features. With minimal handcrafted designs, MonoDETR is an end-to-end framework without additional data, anchors or NMS and achieves competitive performance on the KITTI benchmark among state-of-the-art center-based networks. Extensive ablation studies demonstrate the effectiveness of our approach and its potential to serve as a transformer baseline for future monocular research. Code is available at https://github.com/ZrrSkywalker/MonoDETR.git.
# A Conversational Paradigm for Program Synthesis

License: see the linked page
Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI's Codex on the HumanEval benchmark. We plan to make the training library JaxFormer including checkpoints available as open source.
