
Assemblage: Automatic Binary Dataset Construction for Machine Learning

2405.03991


Published 5/8/2024 by Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski

Abstract

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish recipes for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from https://assemblage-dataset.net


Overview

  • Assemblage is an extensible, cloud-based distributed system for automatically constructing binary datasets for machine learning tasks.
  • It addresses the difficulty of obtaining high-quality corpora of benign binaries, which the abstract attributes to licensing issues and the limited quantity of open-source binaries. Related work such as On Training a Neural Network to Explain Binaries reports a similar shortage of suitable datasets.
  • The system crawls for projects, configures them, and builds them into binaries, from which a wide array of features can then be extracted.

Plain English Explanation

Assemblage is a new way to automatically build datasets for machine learning models that work with binary files, such as compiled software programs. Creating high-quality datasets for these tasks is challenging: benign binaries are often restricted by licensing, commercial corpora are costly, and freely available open-source binaries exist only in limited quantities.

Assemblage tries to make this process easier by automating many of the steps involved. First, it collects a variety of binary files from different sources. Then, it extracts important features or characteristics from these files, such as the structure of the code or the types of instructions used. Finally, it combines these features into a dataset that can be used to train machine learning models.

The goal is to create datasets that are diverse and representative of the types of binary files that the models will encounter in the real world. This can help improve the models' performance and make them more useful for practical applications, such as the two tasks the authors evaluate: identifying which compiler produced a binary (compiler provenance) and recognizing similar functions across binaries (binary function similarity).

Technical Explanation

Assemblage consists of several key components:

  1. Data Collection: The system crawls for software to build, then configures and builds it into binaries (Windows PE and Linux ELF, according to the abstract) across multiple build configurations, rather than relying on costly commercial corpora.

  2. Feature Extraction: Assemblage extracts a range of features from the collected binaries, including low-level details like assembly instructions as well as higher-level characteristics like control flow graphs and function signatures.

  3. Dataset Construction: The extracted features are then combined and organized into a structured dataset that can be used to train machine learning models. The dataset includes both positive and negative examples, ensuring a balanced distribution of classes.

  4. Evaluation and Refinement: The quality of the constructed dataset is evaluated using various metrics, such as class balance, feature diversity, and model performance. The system then iterates on the data collection and feature extraction steps to improve the dataset, enabling the training of more accurate and robust models.
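
The four components above can be sketched as a small pipeline. This is a minimal illustration only, not Assemblage's actual code or API: the names `BuildConfig`, `enumerate_configs`, `extract_features`, `build_dataset`, and `fake_build` are all hypothetical, the compiler and optimization values are placeholders, and the "build" step is a stand-in that returns bytes instead of invoking a real compiler.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# Every name here is hypothetical and chosen for illustration;
# none of it is taken from the Assemblage code base.


@dataclass(frozen=True)
class BuildConfig:
    compiler: str
    optimization: str


def enumerate_configs(compilers, optimizations):
    """Return every compiler/optimization pairing as a BuildConfig."""
    return [BuildConfig(c, o) for c, o in product(compilers, optimizations)]


def extract_features(binary: bytes) -> dict:
    """Toy feature extraction: file size and the most frequent byte value."""
    counts = Counter(binary)
    most_common = counts.most_common(1)[0][0] if binary else None
    return {"size": len(binary), "most_common_byte": most_common}


def build_dataset(projects, configs, build_fn):
    """Build each project under each config and record labeled features."""
    rows = []
    for project, config in product(projects, configs):
        binary = build_fn(project, config)
        if binary is None:  # treat None as a failed build and skip it
            continue
        rows.append({"project": project, "config": config,
                     **extract_features(binary)})
    return rows


def fake_build(project, config):
    """Stand-in for a real compiler invocation; returns placeholder bytes."""
    return f"{project}:{config.compiler}:{config.optimization}".encode()


configs = enumerate_configs(["msvc", "clang"], ["O0", "O2"])
rows = build_dataset(["proj_a", "proj_b"], configs, fake_build)
```

Because each row keeps the `BuildConfig` that produced it, the configuration can serve directly as the label for a task such as compiler provenance.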

The key insight behind Assemblage is that by automating the dataset construction process, it can produce high-quality binary datasets at scale, overcoming the limitations of manual curation. This allows for the training of more powerful machine learning models for a wide range of binary analysis tasks, such as malware detection, code clone identification, and binary program understanding.
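
As one concrete reading of the class-balance metric mentioned under Evaluation and Refinement, the sketch below scores a label distribution by its normalized Shannon entropy. This choice of metric is an assumption made for illustration; the summary does not say how Assemblage measures balance, and `normalized_entropy` is a hypothetical helper.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    1.0 means all classes are equally frequent; values near 0.0 mean
    one class dominates. Fewer than two classes returns 0.0.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


balanced = normalized_entropy(["msvc", "clang"] * 5)
skewed = normalized_entropy(["msvc"] * 9 + ["clang"])
```

A score like this could flag a dataset in which a single compiler or configuration accounts for most of the binaries.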

Critical Analysis

The Assemblage approach presents several advantages, such as the ability to create diverse and representative datasets, the scalability of the data collection and processing pipeline, and the potential for continuous refinement and improvement of the datasets. However, several limitations and open questions remain:

  1. Generalization to Unseen Domains: While Assemblage is designed to capture a wide range of binary file characteristics, there may be challenges in applying the system to specialized or domain-specific binary formats that were not well-represented in the training data.

  2. Robustness to Adversarial Attacks: The paper does not discuss the robustness of the constructed datasets and models to adversarial attacks, which is an important consideration for practical deployment of binary analysis systems.

  3. Interpretability and Explainability: The paper focuses primarily on the dataset construction process and does not explore the interpretability or explainability of the machine learning models trained on the Assemblage datasets, which can be crucial for understanding the decision-making process of these models.

  4. Ethical Considerations: The paper does not address potential ethical concerns, such as the use of Assemblage for the analysis of malicious binaries or the implications of automated dataset construction on data privacy and bias.

Further research could address these limitations, explore the practical deployment of Assemblage-generated datasets, and investigate the societal impact of this technology.

Conclusion

Assemblage presents a promising approach for automatically constructing high-quality binary datasets for machine learning tasks. By automating the data collection, feature extraction, and dataset construction processes, the system aims to overcome the challenges of manual dataset curation and enable the training of more accurate and robust binary analysis models.

The potential applications of Assemblage-generated datasets are wide-ranging, from improving the performance of malware detection systems to enhancing the understanding of binary program behavior. As the field of binary analysis continues to evolve, techniques like Assemblage can play a crucial role in advancing the state of the art and unlocking new possibilities for machine learning in this domain.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On Training a Neural Network to Explain Binaries

Alexander Interrante-Grant, Andy Davis, Heather Preslier, Tim Leek


In this work, we begin to investigate the possibility of training a deep neural network on the task of binary code understanding. Specifically, the network would take, as input, features derived directly from binaries and output English descriptions of functionality to aid a reverse engineer in investigating the capabilities of a piece of closed-source software, be it malicious or benign. Given recent success in applying large language models (generative AI) to the task of source code summarization, this seems a promising direction. However, in our initial survey of the available datasets, we found nothing of sufficiently high quality and volume to train these complex models. Instead, we build our own dataset derived from a capture of Stack Overflow containing 1.1M entries. A major result of our work is a novel dataset evaluation method using the correlation between two distances on sample pairs: one distance in the embedding space of inputs and the other in the embedding space of outputs. Intuitively, if two samples have inputs close in the input embedding space, their outputs should also be close in the output embedding space. We found this Embedding Distance Correlation (EDC) test to be highly diagnostic, indicating that our collected dataset and several existing open-source datasets are of low quality as the distances are not well correlated. We proceed to explore the general applicability of EDC, applying it to a number of qualitatively known good datasets and a number of synthetically known bad ones and found it to be a reliable indicator of dataset value.


5/1/2024

Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

Hongyu Yan, Yadong Mu


Image-guided object assembly represents a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Fed with multi-view images of the target 3D model for replication, the model designed for this task must address several sub-tasks, including recognizing individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order adhering to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph where each vertex represents recognized components from the images, and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of Neural Assembler.


4/26/2024


Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures

Jorge Martinez-Gil


The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates for performing software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment, combining the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while Transformers-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, in the case of specific small datasets (up to 500 samples), our ensemble achieves similar results, without prejudice to the interpretability of the resulting solution, and with a much lower associated carbon footprint due to training. The source code of this novel approach can be downloaded from https://github.com/jorge-martinez-gil/ensemble-codesim.


5/6/2024


Deep Multi-Task Learning for Malware Image Classification

Ahmed Bensaoud, Jugal Kalita


Malicious software is a pernicious global problem. A novel multi-task learning framework is proposed in this paper for malware image classification for accurate and fast malware detection. We generate bitmap (BMP) and (PNG) images from malware features, which we feed to a deep learning classifier. Our state-of-the-art multi-task learning approach has been tested on a new dataset, for which we have collected approximately 100,000 benign and malicious PE, APK, Mach-o, and ELF examples. Experiments with seven tasks tested with 4 activation functions, ReLU, LeakyReLU, PReLU, and ELU separately demonstrate that PReLU gives the highest accuracy of more than 99.87% on all tasks. Our model can effectively detect a variety of obfuscation methods like packing, encryption, and instruction overlapping, strengthing the beneficial claims of our model, in addition to achieving the state-of-art methods in terms of accuracy.


5/10/2024