Related papers: MIX-RS: A Multi-indexing System based on HDFS for Remote Sensing Data Storage

MIX-RS: A Multi-indexing System based on HDFS for Remote Sensing Data Storage

URL: http://arxiv.org/abs/2208.02987v1
Date: Fri, 5 Aug 2022 05:11:12 GMT
Title: MIX-RS: A Multi-indexing System based on HDFS for Remote Sensing Data Storage
Authors: Jiashu Wu, Jingpan Xiong, Hao Dai, Yang Wang, Chengzhong Xu
Abstract summary: A large volume of remote sensing (RS) data has been generated with the deployment of satellite technologies. The characteristics of RS data (e.g., enormous volume, large single-file size and demanding requirement of fault tolerance) make the Hadoop Distributed File System (HDFS) an ideal choice for RS data storage. To use RS data, one of the most important techniques is geospatial indexing. We propose a framework called Multi-IndeXing-RS (MIX-RS) that unifies the multi-indexing mechanism on top of the HDFS.
Score: 21.033380514644616
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: A large volume of remote sensing (RS) data has been generated with the deployment of satellite technologies. The data facilitates research in ecological monitoring, land management and desertification, etc. The characteristics of RS data (e.g., enormous volume, large single-file size and demanding requirement of fault tolerance) make the Hadoop Distributed File System (HDFS) an ideal choice for RS data storage as it is efficient, scalable and equipped with a data replication mechanism for failure resilience. To use RS data, one of the most important techniques is geospatial indexing. However, the large data volume makes it time-consuming to efficiently construct and leverage. Considering that most modern geospatial data centres are equipped with HDFS-based big data processing infrastructures, deploying multiple geospatial indices becomes natural to optimise the efficacy. Moreover, because of the reliability introduced by high-quality hardware and the infrequently modified property of the RS data, the use of multi-indexing will not cause large overhead. Therefore, we design a framework called Multi-IndeXing-RS (MIX-RS) that unifies the multi-indexing mechanism on top of the HDFS with data replication enabled for both fault tolerance and geospatial indexing efficiency. Given the fault tolerance provided by the HDFS, RS data is structurally stored inside for faster geospatial indexing. Additionally, multi-indexing enhances efficiency. The proposed technique naturally sits on top of the HDFS to form a holistic framework without incurring severe overhead or sophisticated system implementation efforts. The MIX-RS framework is implemented and evaluated using real remote sensing data provided by the Chinese Academy of Sciences, demonstrating excellent geospatial indexing performance.

Related papers

REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing [8.574396262432522]
Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on.<n>Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository.<n>We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms.
arXiv Detail & Related papers (2025-06-19T16:26:51Z)
SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis [89.99161034065614]
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios.<n>Existing approaches face critical limitations that lack high-quality training trajectories and suffer from distributional mismatches.<n>This paper introduces SimpleDeepSearcher, a framework that bridges the gap through strategic data engineering rather than complex training paradigms.
arXiv Detail & Related papers (2025-05-22T16:05:02Z)
Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN.<n>We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval [10.268774281394261]
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments. We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
arXiv Detail & Related papers (2025-02-28T11:32:22Z)
Evaluating Fault Tolerance and Scalability in Distributed File Systems: A Case Study of GFS, HDFS, and MinIO [0.9307293959047378]
Distributed File Systems (DFS) are essential for managing vast datasets across multiple servers, offering benefits in scalability, fault tolerance, and data accessibility. This paper presents a comprehensive evaluation of three prominent DFSs - Google File System (GFS), Hadoop Distributed File System (HDFS), and MinIO. Through detailed analysis, how these systems handle data redundancy, server failures, and client access protocols, ensuring reliability in dynamic, large-scale environments is assessed.
arXiv Detail & Related papers (2025-02-04T03:52:45Z)
Efficient $k$-NN Search in IoT Data: Overlap Optimization in Tree-Based Indexing Structures [0.6990493129893112]
The proliferation of interconnected devices in the Internet of Things (IoT) has led to an exponential increase in data. Efficient retrieval of this heterogeneous data demands a robust indexing mechanism for effective organization. We propose three innovatives designed to quantify and strategically reduce data space partition overlap.
arXiv Detail & Related papers (2024-08-28T16:16:55Z)
DNS-Rec: Data-aware Neural Architecture Search for Recommender Systems [79.76519917171261]
This paper addresses the computational overhead and resource inefficiency prevalent in Sequential Recommender Systems (SRSs) We introduce an innovative approach combining pruning methods with advanced model designs. Our principal contribution is the development of a Data-aware Neural Architecture Search for Recommender System (DNS-Rec)
arXiv Detail & Related papers (2024-02-01T07:22:52Z)
Efficient Architecture Search via Bi-level Data Pruning [70.29970746807882]
This work pioneers an exploration into the critical role of dataset characteristics for DARTS bi-level optimization. We introduce a new progressive data pruning strategy that utilizes supernet prediction dynamics as the metric. Comprehensive evaluations on the NAS-Bench-201 search space, DARTS search space, and MobileNet-like search space validate that BDP reduces search costs by over 50%.
arXiv Detail & Related papers (2023-12-21T02:48:44Z)
HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image Retrieval [20.53316810731414]
We propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$2$ Loss) The proposed HyP$2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs.
arXiv Detail & Related papers (2022-08-14T15:06:27Z)
Automating DBSCAN via Deep Reinforcement Learning [73.82740568765279]
We propose a novel Deep Reinforcement Learning guided automatic DBSCAN parameters search framework, namely DRL-DBSCAN. The framework models the process of adjusting the parameter search direction by perceiving the clustering environment as a Markov decision process. The framework consistently improves DBSCAN clustering accuracy by up to 26% and 25% respectively.
arXiv Detail & Related papers (2022-08-09T04:40:11Z)
UDRN: Unified Dimensional Reduction Neural Network for Feature Selection and Feature Projection [37.03465340777392]
Dimensional reduction(DR) maps high-dimensional data into a lower dimensions latent space with minimized defined optimization objectives. FS focuses on selecting a critical subset of dimensions but risks destroying the data distribution (structure) FP combines all the input features into lower dimensions space, aiming to maintain the data structure; but lacks interpretability and sparsity. We develop a unified framework, Unified Dimensional Reduction Neural-network(UDRN), that integrates FS and FP in a compatible, end-to-end way.
arXiv Detail & Related papers (2022-07-08T10:30:20Z)
A Learned Index for Exact Similarity Search in Metric Spaces [25.330353637669386]
LIMS is proposed to use data clustering and pivot-based data transformation techniques to build learned indexes. Machine learning models are developed to approximate the position of each data record on the disk. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
arXiv Detail & Related papers (2022-04-21T11:24:55Z)
Generalizing Few-Shot NAS with Gradient Matching [165.5690495295074]
One-Shot methods train one supernet to approximate the performance of every architecture in the search space via weight-sharing. Few-Shot NAS reduces the level of weight-sharing by splitting the One-Shot supernet into multiple separated sub-supernets. It significantly outperforms its Few-Shot counterparts while surpassing previous comparable methods in terms of the accuracy of derived architectures.
arXiv Detail & Related papers (2022-03-29T03:06:16Z)
$\beta$-DARTS: Beta-Decay Regularization for Differentiable Architecture Search [85.84110365657455]
We propose a simple-but-efficient regularization method, termed as Beta-Decay, to regularize the DARTS-based NAS searching process. Experimental results on NAS-Bench-201 show that our proposed method can help to stabilize the searching process and makes the searched network more transferable across different datasets.
arXiv Detail & Related papers (2022-03-03T11:47:14Z)
DHA: End-to-End Joint Optimization of Data Augmentation Policy, Hyper-parameter and Architecture [81.82173855071312]
We propose an end-to-end solution that integrates the AutoML components and returns a ready-to-use model at the end of the search. Dha achieves state-of-the-art (SOTA) results on various datasets, especially 77.4% accuracy on ImageNet with cell based search space.
arXiv Detail & Related papers (2021-09-13T08:12:50Z)
Deep Cellular Recurrent Network for Efficient Analysis of Time-Series Data with Spatial Information [52.635997570873194]
This work proposes a novel deep cellular recurrent neural network (DCRNN) architecture to process complex multi-dimensional time series data with spatial information. The proposed architecture achieves state-of-the-art performance while utilizing substantially less trainable parameters when compared to comparable methods in the literature.
arXiv Detail & Related papers (2021-01-12T20:08:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.