Kaggle "Microsoft Malware Classification Challenge": essentially sandbox malicious-file identification, with Opco

<div class="blogpost-body" id="cnblogs_post_body">
<p>Graph clustering approach: <span class="col-11 text-gray-dark mr-2">Malware Classification using Graph Clustering, see https://github.com/rahulp0491/Malware-Classifier<br></span></p>
<p>Code references: https://github.com/bindog/ToyMalwareClassification, https://github.com/xiaozhouwang/kaggle_Microsoft_Malware</p>
<p>#Microsoft Malware Classification</p>
<p>Competition description and data download: <a href="https://www.kaggle.com/c/malware-classification/">https://www.kaggle.com/c/malware-classification/</a></p>
<p>##Code overview</p>
<ul><li><code>randomsubset.py</code>: samples a random training subset</li><li><code>asmimage.py</code>: image texture features from ASM files</li><li><code>opcode_n-gram.py</code>: opcode n-gram features</li><li><code>firstrandomforest.py</code>: random forest on the ASM image texture features</li><li><code>secondrandomforest.py</code>: random forest on the opcode n-gram features</li><li><code>combine.py</code>: combines the two feature types</li></ul>
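<p>To make the opcode n-gram idea concrete, here is a minimal sketch of how such features could be counted. It is not the code from <code>opcode_n-gram.py</code>: the opcode vocabulary <code>OPCODES</code>, the helper names, and the assumed IDA-style line layout (address, raw bytes, mnemonic, operands) are all illustrative assumptions.</p>

```python
from collections import Counter

# Hypothetical, deliberately small opcode vocabulary (an assumption for this sketch).
OPCODES = {"mov", "push", "pop", "call", "jmp", "add", "sub",
           "xor", "cmp", "retn", "lea", "test", "jz", "jnz"}


def extract_opcodes(asm_lines):
    """Return the opcode sequence found in disassembly lines.

    Assumes each line looks like '.text:ADDR BYTES mnemonic operands ; comment'.
    """
    ops = []
    for line in asm_lines:
        code = line.split(";", 1)[0]  # drop trailing comments
        for token in code.split():
            if token.lower() in OPCODES:
                ops.append(token.lower())
                break  # keep at most one opcode per line
    return ops


def ngram_counts(ops, n=3):
    """Count every run of n consecutive opcodes."""
    return Counter(tuple(ops[i:i + n]) for i in range(len(ops) - n + 1))


sample = [
    ".text:00401000 55          push    ebp",
    ".text:00401001 8B EC       mov     ebp, esp",
    ".text:00401003 33 C0       xor     eax, eax ; clear eax",
    ".text:00401005 5D          pop     ebp",
    ".text:00401006 C3          retn",
]
ops = extract_opcodes(sample)
counts = ngram_counts(ops, n=3)
```

<p>Each distinct n-gram then becomes one column of the feature matrix, with its count (or a 0/1 presence flag) as the value for that sample.</p>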
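<p>##How to run</p>
<ol><li>Extract the full training dataset, update the paths in <code>randomsubset.py</code>, and run it.</li><li>Update the paths in <code>asmimage.py</code> and <code>opcode_n-gram.py</code>, then run <code>run.sh</code> and wait for the results to appear.</li></ol>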
<p> </p>
<p>Reference: https://github.com/dchad/malware-detection</p>
<h1>malware-detection</h1>
<p>Experiments in malware detection and classification using machine learning techniques.</p>
<h2><a class="anchor" href="https://github.com/dchad/malware-detection#1-microsoft-malware-classification-challenge" id="user-content-1-microsoft-malware-classification-challenge"></a>1. Microsoft Malware Classification Challenge</h2>
<pre class="blockcode"><code>https://www.kaggle.com/c/malware-classification
</code></pre>
<h3><a class="anchor" href="https://github.com/dchad/malware-detection#11-feature-engineering" id="user-content-11-feature-engineering"></a>1.1 Feature Engineering</h3>
<pre class="blockcode"><code>Initial feature engineering consisted of extracting various keyword counts from the ASM files
as well as the entropy and file size from the BYTE files of the 10868 malware samples in the training set.
Image files of the first 1000 bytes of the ASM and BYTE files were created and combined with
keyword and entropy data. This resulted in a set of 2018 features.
Control flow graphs and call graphs were generated for each ASM sample. A feature set was
then generated from the graphs, including graph maximum delta, density, diameter, function
counts, etc.
</code></pre>
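<p>The entropy and file-size features mentioned above can be sketched as follows. This is an illustration rather than the repository's code; it assumes the BYTE files are hex dumps where each line is an address followed by two-digit hex bytes, with <code>??</code> marking unknown bytes, and the function names are hypothetical.</p>

```python
import math
from collections import Counter


def parse_bytes_lines(lines):
    """Parse hex-dump lines ('ADDRESS b0 b1 ...') into a list of ints.

    '??' placeholders (assumed to mark unknown bytes) are skipped.
    """
    values = []
    for line in lines:
        for token in line.split()[1:]:  # first token is the address
            if token != "??":
                values.append(int(token, 16))
    return values


def shannon_entropy(values):
    """Shannon entropy in bits per byte; 0.0 for empty input."""
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


dump = [
    "00401000 00 00 FF FF",
    "00401004 ?? ??",
]
byte_values = parse_bytes_lines(dump)
features = {
    "size": len(byte_values),                  # number of known bytes
    "entropy": shannon_entropy(byte_values),   # between 0 and 8 bits per byte
}
```

<p>High entropy often points to packed or encrypted content, which is why it tends to be a useful signal alongside keyword counts.</p>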
<h3><a class="anchor" href="https://github.com/dchad/malware-detection#12-feature-selection" id="user-content-12-feature-selection"></a>1.2 Feature Selection</h3>
<pre class="blockcode"><code>Statistical analysis of the feature set used chi-squared tests to remove features that are
independent of the class labels or have low variance. The BYTE file images were found to be weak
predictors and were removed from the feature set. The best features from the chi-squared tests
were then compared using reduced feature sets of 10% to 50% of the original features.
</code></pre>
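<p>As a rough illustration of chi-squared feature ranking, the sketch below scores non-negative count features against class labels and keeps the top fraction. It is a pure-Python approximation written for this note, not the repository's code; in practice scikit-learn's <code>chi2</code> scorer with <code>SelectPercentile</code> covers the same ground. The helper names and toy data are assumptions.</p>

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative count features.

    Observed = per-class sum of the feature; expected = overall sum of the
    feature scaled by the class frequency.
    """
    n_samples = len(X)
    n_features = len(X[0])
    class_sums = defaultdict(lambda: [0.0] * n_features)
    class_sizes = defaultdict(int)
    for row, label in zip(X, y):
        class_sizes[label] += 1
        sums = class_sums[label]
        for j, value in enumerate(row):
            sums[j] += value
    scores = []
    for j in range(n_features):
        total = sum(class_sums[c][j] for c in class_sizes)
        score = 0.0
        for c, size in class_sizes.items():
            expected = total * size / n_samples
            if expected:  # skip all-zero features to avoid dividing by zero
                score += (class_sums[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_fraction(X, y, fraction):
    """Return sorted column indices of the highest-scoring fraction of features."""
    scores = chi2_scores(X, y)
    keep = max(1, int(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Toy data: column 1 is identical across classes, so it carries no signal.
X = [[5, 1, 0], [4, 1, 0], [0, 1, 3], [0, 1, 4]]
y = [0, 0, 1, 1]
scores = chi2_scores(X, y)
kept = select_top_fraction(X, y, 0.7)
```

<p>A feature whose per-class totals match what the class frequencies alone would predict scores near zero, which is the "independent of the class labels" case that gets dropped.</p>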
<h4><a class="anchor" href="https://github.com/dchad/malware-detection#121-selection-comparison" id="user-content-121-selection-comparison"></a>1.2.1 Selection Comparison</h4>
<pre class="blockcode"><code>Testing with an ExtraTreesClassifier and 10-fold cross validation produced the following results:
- Original ASM Keyword Counts (1006 features): logloss = 0.034
- 10% Best ASM Features with Entropy and Image Features (202 features): logloss = 0.0174
- 20% Best ASM with Entropy and Image Features (402 features): logloss = 0.0164
- 30% Best ASM with Entropy and Image Features plus Feature Statistics (623 features):
  multiclass logloss = 0.0133
  accuracy score = 0.9978
  Confusion Matrix:
  [[1540 0 0 0 0 1 0 0 0]
  [ 1 2475 2 0 0 0 0 0 0]
  [ 0 0 2942 0 0 0 0 0 0]
  [ 1 0 0  474 0 0 0 0 0]
  [ 2 0 0 0 38 2 0 0 0]
  [ 3 0 0 0 0  748 0 0 0]
  [ 1 0 0 0 0 0  397 0 0]
  [ 0 0 0 0 0 0 0 1225 3]
  [ 0 0 0 0 0 0 0 8 1005]]
- 40% Best ASM and image features with feature statistics:
  ExtraTreesClassifier with 1000 estimators on 10868 training samples and 823 features
  using 10-fold cross validation:
  multiclass logloss = 0.0135
  accuracy score = 0.9976
  Confusion Matrix:
  [[1541 0 0 0 0 0 0 0 0]
  [ 1 2475 2 0 0 0 0 0 0]
  [ 0 0 2942 0 0 0 0 0 0]
  [ 1 0 0  474 0 0 0 0 0]
  [ 5 0 0 0 37 0 0 0 0]
  [ 5 0 0 0 0  746 0 0 0]
  [ 1 0 0 0 0 0  397 0 0]
  [ 0 0 0 0 0 0 0 1227 1]
  [ 0 0 0 0 0 0 0 9 1004]]
</code></pre>
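<p>For readers who want to check how the reported metrics relate to the matrices, here is a small sketch that derives accuracy from a confusion matrix and computes multiclass logloss from predicted probabilities. The matrix is copied from the 30% result above; the function names and the tiny probability example are illustrative assumptions.</p>

```python
import math


def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# Confusion matrix from the 30% feature set (rows: true class, columns: predicted).
cm_30 = [
    [1540, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 2475, 2, 0, 0, 0, 0, 0, 0],
    [0, 0, 2942, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 474, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 38, 2, 0, 0, 0],
    [3, 0, 0, 0, 0, 748, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 397, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1225, 3],
    [0, 0, 0, 0, 0, 0, 0, 8, 1005],
]
acc_30 = accuracy_from_confusion(cm_30)

# Tiny made-up example for logloss: two samples, two classes.
toy_loss = multiclass_logloss([0, 1], [[0.5, 0.5], [0.25, 0.75]])
```

<p>The matrix sums to 10868 samples with 10844 on the diagonal, which rounds to the reported accuracy of 0.9978. Logloss cannot be recovered from the matrix alone because it depends on the predicted probabilities, not just the predicted labels.</p>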
<h4><a class="anchor" href="https://github.com/dchad/malware-detection#122-feature-selection-summary" id="user-content-122-feature-selection-summary"></a>1.2.2 Feature Selection Summary</h4>
<pre class="blockcode"><code> The performance of the Extr
