Abstract We describe an instantiation of a new concept for multimodal multisensor data collection of real-life, in-the-wild free-standing social interactions in the
form of a Conference Living Lab (ConfLab). ConfLab contains high fidelity data
of 49 people during a real-life professional networking event capturing a
diverse mix of status, acquaintanceship, and networking motivations at an
international conference. Recording such a dataset is challenging due to the
delicate trade-off between participant privacy and fidelity of the data, and
the technical and logistic challenges involved. We improve upon prior datasets
in the fidelity of most of our modalities: 8-camera overhead setup, personal
wearable sensors recording body motion (9-axis IMU), Bluetooth-based proximity,
and low-frequency audio. Additionally, we use a state-of-the-art hardware
synchronization solution and time-efficient continuous technique for annotating
body keypoints and actions at high frequencies. We argue that our improvements
are essential for a deeper study of interaction dynamics at finer time scales.
Our research tasks showcase some of the open challenges related to in-the-wild
privacy-preserving social data analysis: keypoint detection from overhead camera views, skeleton-based no-audio speaker detection, and F-formation
detection. With the ConfLab dataset, we aim to bridge the gap between
traditional computer vision tasks and in-the-wild ecologically valid
socially-motivated tasks.
1Delft University of Technology, Delft, The Netherlands
2Rensselaer Polytechnic Institute, New York, USA
islama6@rpi.edu
1 Introduction In this paper, we address the problem of collecting a privacy-sensitive dataset to enable the study of the unscripted social dynamics of real life relationships in-the-wild.
We focus specifically on social networking settings where people are free to move around and leave or join a conversation as they please (see Figure 1).
The majority of datasets that capture and allow for the study of group dynamics in social interactions have focused on role-played settings in custom-built instrumented lab environments [1, 2].
While such work was an invaluable first step for the automated analysis of social signals, we argue that the next steps in advancing the study of social human behavior involve recording the unconstrained social dynamics of in-the-wild behavior at a fidelity comparable to the instrumented lab environments.
As such, the study of more real-life, dynamic, and crowded free standing conversational scenes beyond the lab has gained interest in the last decade [3–7].
Existing datasets of in-the-wild social behavior (see Table 1) suffer from specific drawbacks preventing the analysis and modeling of fine-grained behavior:
(i) they lack articulated pose information in the form of body keypoints;
(ii) the number of people in the scene is too limited to capture and study complex interactions; and
(iii) the sampling rate of the provided manual annotations is too low to capture the complex dynamics of the key social phenomena [8, Sec. 3.3].
Some of these drawbacks—especially related to articulated pose—exist due to an inherent trade-off between having a well-instrumented recording setup to capture high-fidelity data and preserving participant privacy as well as the ecological validity (real life naturalness) [9–11] of the interaction in-the-wild, which entails having a non-invasive sensor setup.
For video, this has been addressed by mounting cameras overhead in a top-down perspective [5, 7].
However, state-of-the-art body keypoint estimation techniques trained on frontal or elevated side views do not perform well on top-down perspectives due to the heavy interpersonal occlusion [7, 12], preventing the automatic extraction of keypoint annotations.
As a result, prior datasets have provided manual annotations for head or body bounding boxes rather than keypoints, which entails a much larger annotation overhead.
To address these limitations, we propose the Conference Living Lab (ConfLab): a high fidelity dataset of 49 socially interacting people during a professional networking event.
Concretely, our following technical contributions (see Table 1) open the gateway to a wide range of multimodal and crossmodal behavior tasks of importance to various fields including machine learning, social psychology, and social signal processing.
(d) the direct study of interaction dynamics using full body poses (previously limited to lab settings [1]).
(ii) Subtle body dynamics: first inclusion of a full 9-axis Inertial Measurement Unit (IMU) for improved capture of body dynamics at higher rates.
Previous rates were found to be insufficient for downstream tasks [16].
(iii) Enabling finer temporal-scale research questions: a sub-second expected crossmodal latency of 13 ms for the first time along with higher sampling rate of features (60 fps video, 56 Hz IMU) enables the in-the-wild study of nuanced time-sensitive social behaviors like mimicry and synchrony which need tolerances as low as 40 ms [see 8, Sec. 3.2].
Prior works coped with lower tolerances by windowing their inputs [16–18].
To enable these technical improvements, as part of the ConfLab endeavour we developed specific methods for cross modal synchronization [8] and continuous video annotation [19] that have been published separately.
Beyond these technical considerations, ConfLab captures a diverse mix of levels of seniority, acquaintanceship, affiliation, and motivation to network (see Figure 2).
This was achieved by organizing the data collection as part of an international scientific conference specialized in signal processing and machine learning (ACM Multimedia 2019).
(ii) allowed potential users of the data who also donated their social behavior to experience first-hand and reflect on the potential privacy and ethical issues of sharing their data, and (iii) enabled high fidelity but privacy-preserving sensing as an integral part of the decision-making on what data to collect.
Figure 2: Distribution of newcomer/veteran participants (left) and their research interests (right) in percentage.
Table 1: Comparison of ConfLab with existing datasets of free-standing conversation groups in in-the-wild social interaction settings. ConfLab is the first and only social interaction dataset that offers skeletal keypoints and speaking status at high annotation resolution, as well as hardware-synchronized camera and multimodal wearable signals at high resolution. (Excerpt of the synchronization column: intra-wearable sync via gossiping protocol; inter-modal sync using manual inspection @ 1 Hz; wireless hardware sync at acquisition, max latency of ∼ 13 ms [8]. † Includes self-assessed personality ratings. ‡ Upsampled to 20 Hz by Vatic tool [22].)
Specifically, we chose not to use common approaches such as egocentric vision [20] or side-elevated viewpoints where facial behavior can be easily analyzed [3, 6, 21], and recorded audio at a frequency of 1200 Hz to mitigate extraction of the verbal content of speech, resulting in a fully General Data Protection Regulation (GDPR) compliant multimodal recording setup.
The richness of fine-grained temporal information, coupled with the unique social context, makes ConfLab a valuable first step in developing technologies to help people understand and potentially improve their social behavior.
Early efforts to record real-life events either spanned only a few minutes (e.g., the Coffee Break dataset [4]), or were recorded at such a large distance from the participants that performing robust automated person detection or tracking with state-of-the-art approaches was non-trivial (e.g., the Idiap Poster Data [5]).
In recent years, two different strategies have emerged to circumvent this issue.
One approach was to move back to a fully instrumented lab with a high-resolution multi-camera setup where state-of-the-art 3D head pose estimation could be applied [24, 25] to generate behavioral features.
The benefit of the highly instrumented lab-based setup is that it allowed researchers to focus on novel research questions related to downstream tasks of a more social nature.
Another approach exploited wearable sensor data to allow for multimodal processing—sensors included 3 or 6 DOF inertial measurement units (IMU); infrared, bluetooth, or radio sensors to measure proximity; or microphones for speech behavior [6, 7].
For the case of the sociometric badge used by the SALSA data, proximity data has been used as a proxy of face-to-face interaction, but recent findings highlight significant problems with their accuracy [26].
ConfLab enables more robust models to be developed to conceptualize and detect social involvement.
The use of the Chalcedony badges mentioned in the MatchNMingle dataset shows more promising results using their radio-based proximity sensor and acceleration data [27].
However, they still fall short of the performance required for downstream tasks due to the relatively low sampling frequency (20 Hz) and annotation frequency (1 Hz) [16].
Importantly, while both SALSA [6] and MatchNMingle [7] capture a multimodal dataset of a large group of individuals involved in mingling behavior, the inter-modal synchronization is only guaranteed at 1/3 Hz and 1 Hz, respectively.
While 1 Hz is able to capture some of the social interaction dynamics observed in conversations [28], it is insufficient to study fine-grained social phenomena such as back-channeling or mimicry that involve far lower latencies [8, Sec. 3.3].
ConfLab provides data streams with higher sampling rates, synchronized using a state-of-the-art portable multi-sensor recording technique shown to be within 13 ms latency at worst [8] (see Sec. 3.1).
Table 1 summarizes the differences between ConfLab and other datasets of real-life mingling events.
Parallel to the in-the-wild work mentioned above, there have also been considerable efforts in more controlled lab-based experiments with high-quality audio and video data.
Notable examples of these role-played conversations have included seated scenarios such as the AMI meeting corpus [2] or the more recent standing scenarios of the Panoptic Dataset [1].
Both datasets enabled breakthroughs in the learning of multimodal conversational dynamics which can inform the behaviors observed in complex conversational scenes.
However, the dynamics of seated, scripted, or role-playing scenarios are different from that of our social setting and are likely to contain unwanted biases related to the artificial nature of the setting.
There have also been related efforts in the wearable and ubiquitous computing community carrying out extensive analysis of real-life face-to-face social networks.
However, they have typically focused on longer-term analysis of social networks over days, weeks, or months but using lower resolution proxies for interaction.
In practice, findings from social science indicate that the popular Sociometric badge performs poorly at social interaction detection for short-term social interaction analysis where performance robustness is required [26].
ConfLab enables researchers in these disciplines to also investigate the benefit of exploiting both visual and wearable modalities for richer social behavior studies.
3 Data Acquisition In this section we describe the considerations for designing and collecting an interaction dataset in the wild, to serve as a template and case study for similar future efforts.
Ten cameras were placed directly overhead at 1 m intervals, with 4 cameras (not shared due to privacy reasons) at the corners providing an elevated-side-view perspective.
For the interaction area of 10 m × 5 m and the given height of the room (∼ 3.5 m), we found that 10 overhead cameras provided a suitable amount of overlap in the field of views.
For capturing multimodal data streams, we designed a custom wearable multi-sensor pack called the Midge2, based on the open-source Rhythm Badge designed for office environments [35].
We improved upon the Rhythm Badge in 3 ways: enabling higher audio recording frequency with an on-board switch to allow physical selection between high and low frequency; adding a 9-axis IMU to record pose; and an on-board SD card to directly store raw data, avoiding typical issues related to packet loss during wireless data transfer.
Widely used human behavior datasets are synchronized by maximizing similarity scores around manually identified common events in data streams, such as infrared camera detections [6], or speech plosives [36].
To synchronize the cameras and wearable sensors directly at acquisition while lowering the cost of the recording setup, we developed a method published separately [8].
The demonstrated worst-case cross-modal latency of 13 ms is well below the 40 ms latency tolerance suitable for behavior research in our setting [8, Sec. 3.3].
3.3 Ethics, Ecological Validity, and Recruitment
The collection and sharing of ConfLab is GDPR compliant.
It was approved by both the human research ethics committee at our institution and the local authorities of the country where the conference was held.
All participants gave consent for the recording and sharing of their data.
ConfLab is only available for academic research purposes under an End User License Agreement.
An often-overlooked but crucial aspect of in-the-wild data collection is the design and ecological validity of the interaction setting.
To encourage mixed levels of status, acquaintanceship, and motivations to network, we designed an event with the conference organizers called Meet the Chairs!
To further address privacy concerns, we chose an overhead camera view that makes faces and facial behavior harder to analyze, and recorded audio at low frequency.
Aside from the prospect of contributing to a community dataset and networking with the conference Chairs, as an additional incentive we provided attendees with post-hoc insights into their networking behavior using metrics computed from the wearable sensor data.
See Supplementary material for a sample participant report.
3.4 Data Association and Participant Protocol
One consideration for multimodal data recording is the data association problem—how can pixels corresponding to an individual be linked to their other data streams?
This was solved by designing a participant registration protocol.
Arriving participants were greeted and directed to a registration desk by the interaction area.
Team members fitted the participant with a Midge.
The ID of the Midge acted as the participant’s identifier.
One team member took a picture of the participant while ensuring both the face of the participant and the ID on the Midge were visible.
These pictures will not be shared.
In practice, it is preferable to avoid this step by using a fully automated multimodal association approach.
However, this remains an open research challenge [37, 38].
During the event, participants mingled freely—they were allowed to carry bags or use mobile phones.
Conference volunteers helped to fetch drinks for participants.
Participants could leave before the end of the 1 hour session.
3.5 Replicating Data Collection Setup and Community Engagement
After the event, we gave a tutorial at ACM Multimedia 2019 [39] to demonstrate how our collection setup could be replicated, and to invite conference attendees and event participants to reflect on the broader considerations surrounding privacy-preserving data capture, sharing, and future directions such initiatives could take.
Through engagement with the community we also generated a spin-off of ConfLab in the form of a mobile app to help Multimedia researchers to find others in the community with complementary research interests [40].
Figure 3: Illustration of the body keypoints annotation procedure: (a) our custom time-continuous annotation interface; (b) the gallery of person identities (faces blurred for privacy) used by annotators to identify people in the scene; and (c) the template of skeleton keypoints annotated.
4 Data Annotation
4.1 Continuous Keypoints Annotation
Existing datasets of naturalistic social interactions have used video annotation software such as Vatic [22] or CVAT [42] to annotate every N frames only, followed by interpolation, to localize subjects via bounding boxes [6, 7].
In dense and crowded social scenes, this is problematic due to the interpersonal cross-contamination caused by severely overlapping bounding boxes [17].
Furthermore, richer information about the social dynamics such as gestures and changes in orientation can be obtained through the annotation of skeletal keypoints.
To our knowledge, no dataset of in-the-wild ecologically valid conversational social interactions has previously included ground truth body pose annotations.
Even using traditional approaches in other settings, it has been considered realistic to interpolate body keypoints annotated every N frames (see [42]).
For ConfLab this approach is likely to under-sample important body movements such as speech related gestures.
To overcome these issues, we collected fine-grained time-continuous annotations of keypoints via an online interface that we implemented as an extension to the covfee framework [19], to allow annotators to track individual joints using their mouse or laptop trackpad while playing the video in their web browser.
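For illustration, the sketch below shows one way such continuous traces can be resampled onto video frame times; the (timestamp, x, y) trace format, the helper name, and the 60 fps rate are assumptions for exposition rather than the exact covfee export format.

```python
import numpy as np

def trace_to_frames(timestamps, xs, ys, fps=60.0, duration=None):
    """Resample one joint's continuous mouse trace onto video frame times.

    timestamps: sample times in seconds; xs, ys: pixel coordinates at those times.
    The (timestamp, x, y) layout is a hypothetical trace format for illustration.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    if duration is None:
        duration = float(timestamps[-1])
    frame_times = np.arange(0.0, duration, 1.0 / fps)
    # Linearly interpolate the trace at each video frame timestamp.
    frame_x = np.interp(frame_times, timestamps, np.asarray(xs, dtype=float))
    frame_y = np.interp(frame_times, timestamps, np.asarray(ys, dtype=float))
    return np.stack([frame_x, frame_y], axis=1)  # shape: (num_frames, 2)
```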
To validate the efficacy of the covfee framework, we designed a pilot study, which involved three annotators annotating the shoulders, head and nose keypoints of two people in a scene for 40 s of video using both CVAT [42] and covfee [19].
Through this pilot, we found that using this technique resulted in lower annotation times (7 min compared to 20 min in CVAT) and high agreement as shown by smaller averaged differences in pixels between covfee’s time-corresponding annotations compared to CVAT annotations interpolated at 1 Hz (17.3 ± 9.5 compared to 25.0 ± 12.3 for CVAT).
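The agreement measure in this comparison is a mean pixel distance between time-corresponding annotations; a minimal sketch of such a computation (assuming aligned keypoint arrays with missing values already masked out) is:

```python
import numpy as np

def mean_pixel_disagreement(kp_a, kp_b):
    """Mean and std of the Euclidean pixel distance between two annotators'
    time-corresponding keypoints, shaped (num_frames, num_joints, 2)."""
    diff = np.linalg.norm(np.asarray(kp_a) - np.asarray(kp_b), axis=-1)
    return float(diff.mean()), float(diff.std())
```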
Annotations for ConfLab were made per camera (so the same subject could be annotated in multiple cameras due to view overlap) for 5 of the overhead cameras (see Fig 1).
We annotated the binary speaking status of every subject due to its importance to social constructs such as rapport [44], conversation floors [45], and the particular challenge it poses to action recognition methods [16, 46, 47].
Action annotations have traditionally been carried out using frame-wise techniques [7], where annotators find the start and end frame of the action of interest using a graphical interface.
Annotations were labeled by one annotator at 1 Hz.
The best camera view was provided for each F-formation, in particular to mitigate ambiguities in dealing with truncated formations that span across two neighboring camera views.
For keypoint annotation tasks, we selected workers based on a qualification task of annotating six out of the 17 keypoints, which allowed us to manually evaluate annotator diligence by observing their annotations.
The occlusion flag was annotated per body joint, simultaneously with the continuous joint position annotation.
In Figure 5a we plotted the distribution of turn lengths in our speaking status annotations.
We defined a turn to be a contiguous segment of positively-labeled speaking status, which resulted in a total of 4096 turns annotated for the 49 participants in the 16 minutes of data recordings.
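A minimal sketch of how turns can be extracted from a binary speaking-status track is given below; the 60 Hz track rate is an assumption for illustration and should be replaced by the actual annotation rate.

```python
import numpy as np

def extract_turns(speaking, rate_hz=60.0):
    """Return (start_s, end_s) pairs for contiguous positive speaking-status runs."""
    s = np.asarray(speaking).astype(int)
    padded = np.concatenate(([0], s, [0]))
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # first index of each turn
    ends = np.flatnonzero(edges == -1)    # one past the last index of each turn
    return [(b / rate_hz, e / rate_hz) for b, e in zip(starts, ends)]
```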
Group-Level Statistics During the 16 minutes, there were 119 distinct F-formations of size greater than or equal to two, 38 singleton instances, and 16 instances of the same group (from a membership perspective) reforming after disbanding.
The group size and duration per group size distribution are shown in Figures 5b and 5c, respectively.
The number of groups is inversely related to the size of the group (i.e. there are fewer large groups).
The duration of the groups does not show particular trends with respect to the group sizes.
It is worth noting that groups of size 2, 3, and 4 have a larger spread in duration.
Based on the self-reported experience level at the conference venue, the newcomer percentage in F-formations is summarized in the histogram in Figure 5d.
6 Research Tasks We report experimental results on three baseline benchmark tasks: person and keypoints detection, speaking status detection, and F-formation detection.
The first task is a fundamental building block for automatically analyzing human social behaviors.
The other two demonstrate how learned body keypoints can be used in the pipeline.
Importantly, speaking status is a key non-verbal cue for many social interaction analysis tasks [49] while F-formations detection in dynamic scenes is necessary to establish potential inter-personal influence by determining who is conversing with whom.
We developed a system for person detection (identifying bounding boxes) and pose estimation (localizing skeletal keypoints such as elbows, wrists, etc.).
We evaluated object detection performance using the standard evaluation metrics in the MS-COCO dataset paper [52].
We report average precision (AP) for intersection over union (IoU) thresholds of 0.50 and 0.75, and the mean AP over an IoU range from 0.50 to 0.95 in 0.05 increments.
For keypoint detection, we use object keypoint similarity (OKS) [52].
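For reference, a minimal sketch of an OKS computation in the spirit of the COCO metric is shown below; the per-joint falloff constants for our 17-joint overhead template are an assumption here, not values released with the dataset.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object keypoint similarity between a predicted and a ground-truth pose.

    pred, gt : (num_joints, 2) pixel coordinates
    visible  : (num_joints,) boolean mask of labeled ground-truth joints
    area     : ground-truth instance area (scale squared in the COCO formulation)
    k        : (num_joints,) per-joint falloff constants (assumed values here)
    """
    d2 = np.sum((np.asarray(pred) - np.asarray(gt)) ** 2, axis=-1)
    sim = np.exp(-d2 / (2.0 * max(float(area), 1e-6) * np.asarray(k) ** 2))
    visible = np.asarray(visible, dtype=bool)
    return float(sim[visible].mean()) if visible.any() else 0.0
```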
The low average OKS-based AP of 10.7 indicates that the estimated keypoints are imprecise.
The choice of backbone did not affect results significantly.
Further experiments with Faster-RCNN (a detection-only model) and four different backbones (R50-C4, R101-C4, R50-FPN, and R101-FPN) revealed consistently better results for FPN backbones, with a best AP50 of 51.49.
ConfLab features high person scene density (15 people on average per camera view), which may be a useful resource for developing overhead person detection and keypoint estimation.
This has led to the exploration of the use of information from different modalities such as video and accelerometers, capable of capturing some of the motion characteristics of speaking-related gestures [17, 18].
For the acceleration modality, we use two standard convolutional neural networks: a 1-dimensional version of AlexNet [18] and a 1D ResNet [55], both of which we trained from scratch.
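As a rough illustration of this family of models, the sketch below defines a small 1D CNN over tri-axial acceleration windows; it is not the exact 1D AlexNet or ResNet architecture used in our experiments, and the example window length (3 s at 56 Hz) is an assumption.

```python
import torch
import torch.nn as nn

class Accel1DCNN(nn.Module):
    """Small 1D CNN for binary speaking-status detection from acceleration
    windows; a sketch in the spirit of the 1D AlexNet/ResNet baselines, not
    the exact architectures used in the paper."""

    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, 1)  # logit for speaking vs. not speaking

    def forward(self, x):  # x: (batch, 3, window_len), e.g. a 3 s window at 56 Hz
        h = self.features(x).squeeze(-1)
        return self.classifier(h)
```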
The corresponding acceleration time series were obtained for these segments.
The examples were labeled via a threshold of 0.5 on the fraction of positive speaking status sample labels, such that an example is labeled positive if the subject was labeled as speaking for at least half the time.
Evaluation was carried out via a train-test split at the subject level with 20% of the person identities (9 subjects) in the test set, ensuring that no examples from the test subjects were used in training.
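A minimal sketch of this labeling rule and subject-disjoint splitting protocol (with hypothetical array inputs) is:

```python
import numpy as np

def window_label(speaking_samples):
    """Positive iff the subject is labeled as speaking for at least half the window."""
    return int(np.mean(speaking_samples) >= 0.5)

def subject_level_split(example_subject_ids, test_fraction=0.2, seed=0):
    """Split example indices so that no subject appears in both train and test."""
    rng = np.random.default_rng(seed)
    subject_ids = np.asarray(example_subject_ids)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = subjects[:n_test]
    is_test = np.isin(subject_ids, test_subjects)
    return np.flatnonzero(~is_test), np.flatnonzero(is_test)
```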
The results in Table 4 indicate a better performance from the acceleration-based methods.
One possible reason for the lower performance of the pose-based methods is the significant domain shift between the Kinetics dataset and our dataset, especially due to the difference in camera viewpoint (frontal vs. top-down).
Table 5: Average F1 scores for F-formation detection comparing GTCG [15] and GCFF [21], with the effect of different thresholds and orientations (standard deviation in parentheses).
Being able to identify groups of people in a social scene sheds light on dynamics of potential social influence.
Like prior work, we consider interaction groups more rigorously as F-formations as defined by Kendon [48].
We provide performance results for F-formation detection using GTCG [15] and GCFF [21] as a baseline.
Recent deep learning methods such as DANTE [14] are not directly applicable since the inputs to the neural network architecture depend on the number of people in the scene, which varies from frame to frame in ConfLab.
We use standard evaluation metrics for group detection.
A group is correctly estimated if at least ⌈T · |G|⌉ of the members of group G are correctly identified, and no more than 1 − ⌈T · |G|⌉ are incorrectly identified, where T is the tolerance threshold.
We set T = 2/3 or T = 1 (a stricter threshold), which is common practice.
We report detection results in Table 5 in terms of F1 score, where true positives correspond to correctly detected groups, false positives to detected but non-existent groups, and false negatives to non-detected groups.
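The sketch below implements one common reading of this tolerant matching criterion and the resulting F1 computation; it is meant as an illustration of the metric rather than the exact evaluation script.

```python
import math

def groups_match(detected, truth, T=2/3):
    """One reading of the tolerant matching criterion: the detected group matches
    ground-truth group G if at least ceil(T*|G|) of G's members are found and at
    most |G| - ceil(T*|G|) of the detected members are not in G."""
    detected, truth = set(detected), set(truth)
    need = math.ceil(T * len(truth))
    return len(detected & truth) >= need and len(detected - truth) <= len(truth) - need

def group_f1(detected_groups, truth_groups, T=2/3):
    """F1 over groups: true positives are detected groups matching some ground truth."""
    tp = sum(any(groups_match(d, g, T) for g in truth_groups) for d in detected_groups)
    fp = len(detected_groups) - tp
    fn = sum(not any(groups_match(d, g, T) for d in detected_groups) for g in truth_groups)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```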
Results are obtained for videos from cameras 2, 4, 6, and 8.
We use pre-trained parameters (from Cocktail Party [3]) for field of view (FoV) and frustum aperture (GTCG) and minimum description length (GCFF), and adjusted frustum length (GTCG) and stride (GCFF) to account for average interpersonal distance in ConfLab.
Features include positions and orientations, with options for orientations derived from head, shoulders, and hips keypoints.
We show that different results are obtained using different sources of orientations.
Potential explanations include the different occlusion levels in keypoints due to camera viewpoint and the complexity in the concept of interacting groups pertaining to the original definition of F-formations [48] and conversation floors [45] in relation to head, upper-body, and lower-body orientation.
As an in-the-wild dataset with large number of participants and high-resolution annotations, ConfLab provides new opportunities and challenges for future method development in F-formation detection.
ConfLab captures a rich and high-fidelity multi-sensor and multimodal dataset of social interaction behavior in-the-wild and in a real-life networking event.
We built upon prior work by providing higher-resolution and framerate data and also carefully designed our social interaction setup to enable a diverse mix of seniority, acquaintanceship, and motivations for mingling.
Prior efforts under-sampled much of the dynamics of human social behavior while ConfLab uses a modular and scalable recording setup capable of guaranteeing inter- and intra-modal synchronization in keeping with the perception of human social cues.
We contribute a rich set of 17 body keypoint annotations of 49 people at 60 Hz from overhead cameras for developing more robust keypoint estimation, as well as manual annotations for key tasks in social behavior analysis, namely speaker and F-formation detection.
A potential benefit of our body keypoint annotations is that they enable us to revisit some of the socially related prediction tasks from prior datasets with overhead camera views (e.g., [5, 7]) by using ConfLab pre-trained body keypoint models.
Finally, to improve estimation robustness, ConfLab provides multimodal data allowing for further development of multimodal machine learning solutions [57] that could improve over vision only systems.
We believe this is an important step towards a long-term vision for developing personalized socially aware technologies that can enhance and foster positive social experience and assist people in their social decisions.
Since ConfLab captures social relationships, if we want to relate an individual's social behaviors to longer term behavioral trends within the social network (e.g., across coffee breaks in one day, days at a conference, or multiple conferences), more instantiations similar to ConfLab are needed.
This instantiation of ConfLab attempted to maximize data fidelity while preserving participants’ privacy through the choices of overhead camera perspective, low audio recording frequency, and non-intrusive wearable sensors matching a conference badge form-factor.
A crucial assumption made in many former multimodal datasets [1, 6, 7] is that the association of video data to the wearable modality can be manually performed.
Few works [37, 38] have tried to address this issue but using movement cues alone to associate the modalities is challenging as conversing individuals are mostly stationary.
However, detecting pose and actions robustly from overhead cameras remains to be solved.
Potential Negative Societal Impact.
Although ConfLab's long term vision is towards developing technology to assist individuals in navigating social interactions, such technology could also affect a community in unintended ways: e.g., causing worsened social satisfaction, lack of agency, or benefiting only those members of the community who make use of the system at the expense of the rest.
All of these must be considered when developing such systems.
Moreover, ConfLab and its trained models could be exploited to develop technologies to de-anonymize or track subjects in privacy invasive ways (i.e., harmful surveillance).
Finally, since the data was collected during a scientific conference, there is an implicit selection bias which users of the data need to take into account.
References
[1] Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Scott Godisart, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[2] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska, Iain McCowan, Wilfried Post, Dennis Reidsma, and Pierre Wellner. The AMI meeting corpus: A pre-announcement. In Steve Renals and Samy Bengio, editors, Machine Learning for Multimodal Interaction, pages 28–39, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
[3] Gloria Zen, Bruno Lepri, Elisa Ricci, and Oswald Lanz. Space speaks: towards socially and personality aware visual surveillance. In Proceedings of the 1st ACM International Workshop on Multimodal Pervasive Video Analysis, pages 37–42, 2010.
[4] Marco Cristani, Loris Bazzani, Giulia Paggetti, Andrea Fossati, Diego Tosato, Alessio Del Bue, Gloria Menegaz, and Vittorio Murino. Social interaction discovery by statistical analysis of F-formations. In Jesse Hoey, Stephen J. McKenna, and Emanuele Trucco, editors, British Machine Vision Conference, BMVC 2011, Dundee, UK, August 29 - September 2, 2011. Proceedings, pages 1–12. BMVA Press, 2011. doi: 10.5244/C.25.23. URL https://doi.org/10.5244/C.25.23.
[5] Hayley Hung and Ben Kröse. Detecting F-formations as dominant sets. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 231–238, 2011.
[6] Xavier Alameda-Pineda, Jacopo Staiano, Ramanathan Subramanian, Ligia Batrinca, Elisa Ricci, Bruno Lepri, Oswald Lanz, and Nicu Sebe. SALSA: A novel dataset for multimodal group behavior analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1707–1720, 2015.
[7] Laura Cabrera-Quiros, Andrew Demetriou, Ekin Gedik, Leander van der Meij, and Hayley Hung. The MatchNMingle dataset: A novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates.
[15] A game-theoretic probabilistic approach for detecting conversational groups. In Asian Conference on Computer Vision, pages 658–675. Springer, 2014.
[16] Ekin Gedik and Hayley Hung. Personalised models for speech detection from body movements using transductive parameter transfer. Personal and Ubiquitous Computing, 21(4):723–737, August 2017. ISSN 1617-4909. doi: 10.1007/s00779-017-1006-4.
[17] Laura Cabrera-Quiros, David M.J. Tax, and Hayley Hung. Gestures in-the-wild: Detecting conversational hand gestures in crowded scenes using a multimodal fusion of bags of video trajectories and body worn acceleration.
[19] Covfee: an extensible web framework for continuous-time annotation of human behavior. In Cristina Palmero, Julio C. S. Jacques Junior, Albert Clapés, Isabelle Guyon, Wei-Wei Tu, Thomas B. Moeslund, and Sergio Escalera, editors, Understanding Social Behavior in Dyadic and Small Group Interactions, volume 173 of Proceedings of Machine Learning Research, pages 265–293.
In Companion Proceedings of The Web Conference 2018, WWW '18, pages 109–110, Republic and Canton of Geneva, CHE, 2018. International World Wide Web Conferences Steering Committee.
[32] Daniel Olguín Olguín, Benjamin N Waber, Taemie Kim, Akshay Mohan, Koji Ara, and Alex Pentland. Sensible organizations: Technology and methodology for automatically measuring organizational behavior. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(1):43–55, 2008.
[33] Timon Elmer, Krishna Chaitanya, Prateek Purwar, and Christoph Stadtfeld.
In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8, 2013.
[37] Laura Cabrera-Quiros and Hayley Hung. Who is where? Matching people in video to wearable acceleration during crowded mingling events. In Proceedings of the 24th ACM International Conference on Multimedia, pages 267–271, 2016.
[38] Laura Cabrera-Quiros and Hayley Hung. A hierarchical approach for associating body-worn sensors to video regions in crowded mingling scenarios. IEEE Transactions on Multimedia, 21(7):1867–1879, 2018.
[39] Hayley Hung, Chirag Raman, Ekin Gedik, Stephanie Tan, and Jose Vargas Quiros. Multimodal data collection for social interaction analysis in-the-wild. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2714–2715, 2019.
[40] Ekin Gedik and Hayley Hung. ConfFlow: A tool to encourage new diverse collaborations. In Proceedings of the 28th ACM International Conference on Multimedia, pages 4562–4564, 2020.
[41] Jose Vargas. Covfee: Continuous Video Feedback Tool.
[42] Computer Vision Annotation Tool (CVAT).
[43] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs], February 2015.
[44] Philipp Müller, Michael Xuelin Huang, and Andreas Bulling. Detecting low rapport during natural interactions in small groups from non-verbal behaviour. In 23rd International Conference on Intelligent User Interfaces. ACM, 2018. ISBN 978-1-4503-4945-1. doi: 10.1145/3172944.3172969.
[45] Chirag Raman and Hayley Hung. Towards automatic estimation of conversation floors within F-formations. arXiv:1907.10384 [cs], July 2019.
[46] Cigdem Beyan, Muhammad Shahid, and Vittorio Murino. RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis. IEEE Transactions on Multimedia, 2020. doi: 10.1109/TMM.2020.3007350.
[47] Muhammad Shahid, Cigdem Beyan, and Vittorio Murino. Voice activity detection by upper body motion analysis and unsupervised domain adaptation. In Proceedings of the 2019 International Conference on Computer Vision Workshops, ICCVW 2019, pages 1260–1269, 2019. doi: 10.1109/ICCVW.2019.00159.
[48] Adam Kendon. Conducting Interaction: Patterns of Behavior in Focused Encounters, volume 7. CUP Archive, 1990.
[49] Daniel Gatica-Perez. Analyzing group interactions in conversations: a review. In 2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 41–46, 2006. doi: 10.1109/MFI.2006.265658.
[50] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[51] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick.
[52] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[53] Jiaxing Shen, Oren Lederman, Jiannong Cao, Florian Berg, Shaojie Tang, and Alex Sandy Pentland.
[55] Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F. Schmidt, Jonathan Weber, Geoffrey I. Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. InceptionTime: Finding AlexNet for time series classification. 2019.
[56] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, October 2020. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2019.2916873.
[57] Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency.
[61] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
ConfLab: A Rich Multimodal Multisensor Dataset of Free-Standing Social Interactions In-the-Wild
Appendices A Sensor Calibration For computing the camera extrinsics, we marked a grid of 1 m × 1 m squares in tape across the interaction area floor.
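For illustration, given intrinsics from a separate calibration, the extrinsics of each camera with respect to the floor plane can be estimated from correspondences between the taped grid corners and their pixel locations, e.g. with a PnP solver; the function below is a sketch under that assumption, not the calibration code released with the dataset.

```python
import numpy as np
import cv2

def extrinsics_from_floor_grid(grid_world_points, grid_image_points, K, dist_coeffs):
    """Estimate one camera's rotation and translation w.r.t. the floor plane.

    grid_world_points : (N, 3) grid-corner coordinates in metres, with z = 0 on the floor
    grid_image_points : (N, 2) corresponding pixel locations in that camera's view
    K, dist_coeffs    : intrinsics and distortion from a separate calibration
    """
    obj = np.asarray(grid_world_points, dtype=np.float32)
    img = np.asarray(grid_image_points, dtype=np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj, img, K, dist_coeffs)
    if not ok:
        raise RuntimeError("PnP failed; check the grid correspondences")
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix from the axis-angle vector
    return R, tvec
```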
To establish a correspondence with the camera frame of reference, the sensors were lined up against a common reference-line visible in the cameras to acquire an alignment so that the camera data can offer drift and bias correction for the wearable sensors.
Because of annotation errors, there are incorrectly labeled or missing keypoints in many frames.
One error is the misalignment of participant IDs during annotation.
We remove these misaligned keypoints using outlier detection: we measure the median distance between different keypoints and remove a keypoint when its distances from all other keypoints of the same person are much higher (more than 4 times the median).
We also check how many keypoints are missing for a person, and if more than 50% of the keypoints are missing we remove the person's bounding box and keypoints from the ground truth.
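A minimal sketch of these two cleaning heuristics (median-distance outlier removal and the 50% missing-joint rule) for a single person in a single frame is shown below; the thresholds and the NaN-based missing-joint convention are illustrative assumptions.

```python
import numpy as np

def clean_person_keypoints(kp, outlier_factor=4.0, max_missing=0.5):
    """Clean one person's annotated keypoints in one frame.

    kp: (num_joints, 2) array with np.nan marking missing joints (an assumed
    convention). Returns the cleaned array, or None if the person is discarded.
    """
    kp = np.asarray(kp, dtype=float).copy()
    valid = np.flatnonzero(~np.isnan(kp).any(axis=1))
    if len(valid) >= 3:
        pts = kp[valid]
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        np.fill_diagonal(d, np.nan)
        per_joint = np.nanmedian(d, axis=1)   # typical distance of each joint to the others
        overall = np.nanmedian(d)             # typical inter-joint distance overall
        kp[valid[per_joint > outlier_factor * overall]] = np.nan
    missing_fraction = np.isnan(kp).any(axis=1).mean()
    return None if missing_fraction > max_missing else kp
```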
Finally, we add the hip keypoints {rightHip, leftHip} to the set.
The experiments in the main paper are performed with 17 keypoints.
Table 6 suggests that more keypoints result in better keypoint localization.
C Datasheet For ConfLab
This document is based on Datasheets for Datasets by Gebru et al. [61].
Please see the most updated version here.
MOTIVATION For what purpose was the dataset created?
Was there a specific task in mind?
Was there a specific gap that needed to be filled?
Please provide a description.
There are two broad motivations for ConfLab: first, to enable the privacy-preserving, multimodal study of natural social conversation dynamics in a mixed-acquaintance, mixed-seniority international community; second, to bring the higher fidelity of wired in-the-lab recording setups to in-the-wild scenarios, enabling the study of fine time-scale social dynamics in-the-wild.
Existing in-the-wild datasets are limited by the lack of spatial and temporal resolution of the data, and inadequate synchronization guarantees between the data streams.
(i) enable finer temporal scale RQs: A sub-second expected cross-modal latency of 13 ms for the first time along with higher sampling rate of features (60 fps video, 56 Hz IMU) enables the in-the-wild study of nuanced time-sensitive social behaviors like mimicry and synchrony which need tolerances as low as 40 ms [42] (Sec. 3.2, L80-83).
Prior works coped with lower tolerances by windowing their inputs [15,25,40].
(ii) articulated pose: first in-the-wild social interaction dataset with full body poses (Tab. 1), enabling improvements in
(a) pose estimation and tracking in-the-wild (see next point),
(b) pose-based recognition of social actions (unexplored in aerial perspective),
(c) pose-based F-formation estimation (not possible using previous datasets (Tab. 1) and methods [29,46,49,51]),
YOUR ANSWER HERE What support was needed to make this dataset?
(e.g., who funded the creation of the dataset? If there is an associated grant, provide the name of the grantor and the grant name and number, or if it was supported by a company or government agency, give those details.)
If the dataset is a sample, then what is the larger set?
Is the sample representative of the larger set (e.g., geographic coverage)?
If so, please describe how this representativeness was validated/verified.
If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
Because participation in such a data collection can only be voluntary, the sample was not pre-designed and may not be representative of the larger set.
features? In either case, please provide a description.
Each person in the scene wore a wearable device (Mingle Midge, in a compact conference badge form factor to be hung around the neck) which recorded the individual signals that are part of the dataset:
Additionally, video cameras placed with a top-down view over the interaction area recorded all the people in it.
Ten cameras were placed directly overhead at 1 m intervals along the longer axis of the rectangle-shaped interaction space in such a way that the whole space was covered with significant overlap between adjacent cameras.
One of the cameras failed during the recording, but the space underneath it was captured by the adjacent cameras.
The number of cameras a subject is captured in varies according to their positioning, but each subject in the scene is in the field of view of at least one camera.
The confidence assessment is therefore largely based on the visibility of the target person and their speaking-associated gestures (e.g., occlusion, orientation w.r.t. the camera, visibility of the face).
Pre-existing personal relationships between the subjects were not requested for privacy reasons.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
YOUR ANSWER HERE
Are there any errors, sources of noise, or redundancies in the dataset?
If so, please provide a description.
Individual audio: Because audio was recorded by a front-facing wearable device worn on the chest, it contains a significant amount of cocktail party noise and cross-contamination from other people in the scene.
Videos and 2D body poses: It is important to consider that the same person may appear in multiple videos at the same time if the person was in view of multiple cameras.
a) are there guarantees that they will exist, and remain constant, over time;
b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created);
c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user?
Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?
The data contains personal data under GDPR in the form of video and audio recordings of subjects.
The dataset is shared under an End User License Agreement for research purposes, to ensure that the data is not made public, and to protect the privacy of data subjects.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
Data subjects answered the following questions before the start of the data collection event, after filling in their consent form:
• Is this your first time attending ACM MM?
• Select the area(s) that describes best your research interest(s) in recent years.
Descriptions of each theme are listed here: https://acmmm.org/call-for-papers/
Figure 6 shows the distribution of the responses across the participant population.
Figure 6: Distribution of newcomer/veteran participants (left) and their research interests (right), as percentages.
Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?
Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?
We did not request this information from data subjects.
Any other comments?
YOUR ANSWER HERE
COLLECTION
How was the data associated with each instance acquired?
Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)?
The collected data is directly observable, containing video recordings, low-frequency audio recordings, and wearable sensing signals (inertial measurement unit (IMU) and Bluetooth proximity sensors) of individuals in the interaction scenes.
Accompanying data includes self-reported binary categorization of experience level and interests in research topics.
Video recordings capture the whole interaction floor; the association of data with individuals was done manually by annotators by referring to frontal and overhead views.
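Because these modalities are sampled at different rates, downstream users typically need to index wearable samples against video frame times. A minimal sketch, assuming both streams carry timestamps on the same (hardware-synchronized) clock, is shown below; the sampling rates in the example are illustrative, not the dataset's actual rates.

```python
import bisect

def nearest_sample_indices(frame_times, sensor_times):
    """For each video frame timestamp, return the index of the closest
    wearable-sensor sample. Both lists must be sorted and share a clock."""
    out = []
    for t in frame_times:
        j = bisect.bisect_left(sensor_times, t)
        candidates = [c for c in (j - 1, j) if 0 <= c < len(sensor_times)]
        out.append(min(candidates, key=lambda c: abs(sensor_times[c] - t)))
    return out

# Illustrative rates: ~60 fps video frames vs. ~50 Hz IMU samples.
frames = [i / 60.0 for i in range(5)]
imu = [i / 50.0 for i in range(10)]
print(nearest_sample_indices(frames, imu))
```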
Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances? If not, please describe the timeframe in which the data associated with the instances was created.
Finally, list when the dataset was first published.
All data was collected on October 24, 2019, except the self-reported experience level and research interest topics which are either obtained on the same day or not more than one week before the data collection day.
This time frame matches the creation time frame of the data association for the wearable sensing data. Video data was associated with individuals during the annotation stage (2020-2021), but all information used for association was obtained on the data collection day.
What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?
How were these mechanisms or procedures validated?
The synchronization setup for data collection was documented and published in [8], which includes validation of the system.
To lend the reader further insight into the process of setting up the recording of such datasets in-the-wild, we share images of our process in Figure 7.
The sensors were validated by an external contract engineer.
The data collection software was documented and published in [? ], which includes validation of the system.
This hardware and software have been open-sourced along with their respective publications.
What was the resource cost of collecting the data?
(e.g., what were the required computational resources, and the associated financial costs, and energy consumption - estimate the carbon footprint. See Strubell et al. [? ] for approaches in this area.)
The resources required to collect the data include equipment, logistics, and travel costs.
Equipment includes video cameras, wearable sensors, and mounting infrastructure at the data collection venue.
In our case, we used 14 GoPro Hero 8 cameras ($350 per camera) and 60 wearable sensors ($25 per sensor).
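At those unit prices, the camera and sensor hardware alone comes to 14 × $350 + 60 × $25 = $6,400, before the mounting infrastructure, logistics, and travel costs mentioned above.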
ConfLab is an annotated subset of a larger set of the data collected.
The segment in which articulated pose and speaking status were annotated was selected based on the start time of the event so as to maximize crowd density in the scenes.
The annotated segment is 15 minutes; the whole set is roughly 1 hr of recordings.
Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
Depending on the submitted results, workers earned a qualification granting access to the actual annotation tasks.
The annotation tasks were categorized into low-effort (150), medium-effort (300), and high-effort (450), corresponding to the amount of time they would take.
The duration of the tasks was determined by the crowd density and by timing the pilot studies.
The average hourly payment to workers is around 8 US dollars.
Were any ethical review processes conducted (e.g., by an institutional review board)?
If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
The data collection was approved by the Human Research Ethics Committee (HREC) of our university (Delft University of Technology), which reviews all research involving human subjects.
The review process included addressing privacy concerns to ensure compliance with GDPR and university guidelines; review of our informed consent form, data management plan, and end user license agreement for the dataset; and a safety check of our custom wearable devices.
Were the individuals in question notified about the data collection?
If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.
The individuals were notified about the data collection, and their participation was voluntary.
The data collection was staged at an event called Meet the Chairs at ACM MM 2019.
The ConfLab web page (https://conflab.ewi.tudelft.nl/) served to communicate the aim of the event, what was being recorded, and how participants could sign up.
Figure 8: Screenshots of the ConfLab web-page used for participant recruitment and registration.
Figure 9: Consent form signed by each participant in the data collection.
Did the individuals in question consent to the collection and use of their data?
If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.
Were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).
Yes, the consenting individuals were informed about revoking access to their data.
YOUR ANSWER HERE
PREPROCESSING / CLEANING / LABELING
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
If not, you may skip the remainder of the questions in this section.
We did not pre-process the signals obtained from the wearable devices or cameras.
The only exception is the audio data, which unfortunately was not properly synced at collection time due to a bug in the code for audio storage in our wearable devices.
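One mitigation a user could attempt (offered only as a hedged sketch, not as the procedure applied to the released data) is to estimate a constant offset between two audio streams via cross-correlation and shift one stream accordingly:

```python
import numpy as np

def estimate_offset_samples(ref, other):
    """Estimate the lag (in samples) at which `other` best aligns with `ref`,
    using full cross-correlation of the two mono signals (NumPy arrays)."""
    ref = (ref - ref.mean()) / (ref.std() + 1e-9)
    other = (other - other.mean()) / (other.std() + 1e-9)
    corr = np.correlate(ref, other, mode="full")
    # Index len(other)-1 corresponds to zero lag; a positive result means
    # `other` starts earlier than `ref` and should be delayed by that many samples.
    return int(np.argmax(corr)) - (len(other) - 1)
```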
YOUR ANSWER HERE
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description.
Is there anything a future user could do to mitigate these undesirable harms?
YOUR ANSWER HERE
Are there tasks for which the dataset should not be used?
If so, please provide a description.
YOUR ANSWER HERE
Any other comments?
YOUR ANSWER HERE
DISTRIBUTION
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Does the dataset have a digital object identifier (DOI)?
YOUR ANSWER HERE
When will the dataset be distributed?
YOUR ANSWER HERE
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
Updates will be done as needed as opposed to periodically.
Instances could be deleted, added, or corrected.
The updates will be posted on the dataset website.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)?
Is there a process for communicating/distributing these contributions to other users?
If so, please provide a description.
We are open to contributions to the dataset.
We expect potential contributors to contact us, indicating any restrictions on their contribution and how they wish to be attributed, so that we can start a discussion.