Scout: Leveraging Large Language Models for Rapid Digital Evidence Discovery
- URL: http://arxiv.org/abs/2507.18478v1
- Date: Thu, 24 Jul 2025 14:54:14 GMT
- Title: Scout: Leveraging Large Language Models for Rapid Digital Evidence Discovery
- Authors: Shariq Murtuza,
- Abstract summary: This work presents Scout, a digital forensics framework that performs preliminary evidence processing.<n>Scout employs text based large language models can easily process files with textual information.<n>Scout was able to identify and realize the evidence file that were of potential interest for the investigator.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent technological advancements and the prevalence of technology in day to day activities have caused a major increase in the likelihood of the involvement of digital evidence in more and more legal investigations. Consumer-grade hardware is growing more powerful, with expanding memory and storage sizes and enhanced processor capabilities. Forensics investigators often have to sift through gigabytes of data during an ongoing investigation making the process tedious. Memory forensics, disk analysis all are well supported by state of the art tools that significantly lower the effort required to be put in by a forensic investigator by providing string searches, analyzing images file etc. During the course of the investigation a lot of false positives are identified that need to be lowered. This work presents Scout, a digital forensics framework that performs preliminary evidence processing and prioritizing using large language models. Scout deploys foundational language models to identify relevant artifacts from a large number of potential evidence files (disk images, captured network packets, memory dumps etc.) which would have taken longer to get identified. Scout employs text based large language models can easily process files with textual information. For the forensic analysis of multimedia files like audio, image, video, office documents etc. multimodal models are employed by Scout. Scout was able to identify and realize the evidence file that were of potential interest for the investigator.
Related papers
- DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response [0.0]
Large Language Models (LLMs) offer new opportunities in Digital Forensics and Incident Response (DFIR)<n>LLMs offer new opportunities in DFIR tasks such as log analysis and memory, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts.<n>We present DFIR-Metric, a benchmark to evaluate LLMs across both theoretical and practical DFIR domains.
arXiv Detail & Related papers (2025-05-26T13:35:37Z) - Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry.<n>We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness.<n>We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
arXiv Detail & Related papers (2025-05-05T08:52:49Z) - Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models [53.55128042938329]
Forensics-Bench is a new forgery detection evaluation benchmark suite.<n>It comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types.<n>We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet.
arXiv Detail & Related papers (2025-03-19T09:21:44Z) - Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes.<n>Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z) - ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection [107.86009509291581]
We propose ForgerySleuth to perform comprehensive clue fusion and generate segmentation outputs indicating regions that are tampered with.<n>Our experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in robustness, generalization, and explainability.
arXiv Detail & Related papers (2024-11-29T04:35:18Z) - Multimodal Misinformation Detection using Large Vision-Language Models [7.505532091249881]
Large language models (LLMs) have shown remarkable performance in various tasks.
Few approaches consider evidence retrieval as part of misinformation detection.
We propose a novel re-ranking approach for multimodal evidence retrieval.
arXiv Detail & Related papers (2024-07-19T13:57:11Z) - An Innovative Tool for Uploading/Scraping Large Image Datasets on Social
Networks [9.27070946719462]
We propose an automated approach by means of a digital tool that we created on purpose.
The tool is capable of automatically uploading an entire image dataset to the desired digital platform and then downloading all the uploaded pictures.
arXiv Detail & Related papers (2023-11-01T23:27:37Z) - A Survey on Detection of LLMs-Generated Content [97.87912800179531]
The ability to detect LLMs-generated content has become of paramount importance.
We aim to provide a detailed overview of existing detection strategies and benchmarks.
We also posit the necessity for a multi-faceted approach to defend against various attacks.
arXiv Detail & Related papers (2023-10-24T09:10:26Z) - Audio Deepfake Attribution: An Initial Dataset and Investigation [41.62487394875349]
We design the first deepfake audio dataset for the attribution of audio generation tools, called Audio Deepfake Attribution (ADA)
We propose the Class- Multi-Center Learning ( CRML) method for open-set audio deepfake attribution (OSADA)
Experimental results demonstrate that the CRML method effectively addresses open-set risks in real-world scenarios.
arXiv Detail & Related papers (2022-08-21T05:15:40Z) - Automated Artefact Relevancy Determination from Artefact Metadata and
Associated Timeline Events [7.219077740523683]
Case-hindering, multi-year digital forensic evidence backlogs have become commonplace in law enforcement agencies throughout the world.
This is due to an ever-growing number of cases requiring digital forensic investigation coupled with the growing volume of data to be processed per case.
Leveraging previously processed digital forensic cases and their component artefact relevancy classifications can facilitate an opportunity for training automated artificial intelligence based evidence processing systems.
arXiv Detail & Related papers (2020-12-02T14:14:26Z) - MEG: Multi-Evidence GNN for Multimodal Semantic Forensics [28.12652559292884]
Fake news often involves semantic manipulations across modalities such as image, text, location etc.
Recent research has centered the problem around images, calling it image repurposing.
We introduce a novel graph neural network based model for multimodal semantic forensics.
arXiv Detail & Related papers (2020-11-23T09:01:28Z) - Multi-Modal Video Forensic Platform for Investigating Post-Terrorist
Attack Scenarios [55.82693757287532]
Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence.
We present a video analytic platform that integrates visual and audio analytic modules and fuses information from surveillance cameras and video uploads from eyewitnesses.
arXiv Detail & Related papers (2020-04-02T14:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.