- class CovertMark.strategy.sgd.SGDStrategy(pt_pcap, negative_pcap, recall_pcap=None, debug=True)
A generic SGD-based strategy for observing patterns of traffic in both directions of a stream. It is not designed to identify any particular existing PT, but instead supports a general use case based on traffic patterns. It should achieve better recall performance on unseen inputs than Logistic Regression.
- DESCRIPTION = 'Generic binary classification strategy.'
- DYNAMIC_ADJUSTMENT_STOPPING_CRITERIA = (0.75, 0.001)
- DYNAMIC_THRESHOLD_PERCENTILES = [0, 25, 50, 75, 90]
- FEATURE_SET = ['entropy', 'psh', 'interval_bins', 'tcp_len_bins']
- LOSS_FUNC = 'hinge'
- NAME = 'SGD Classifier Strategy'
- NUM_RUNS = 5
- PT_SPLIT_RATIO = 0.5
- RUN_CONFIG_DESCRIPTION = ('Occurrence Threshold (%ile)', 'Run #')
- TIME_SEGMENT_SIZE = 60
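Since LOSS_FUNC = 'hinge', each run effectively trains a linear SVM by stochastic gradient descent. A minimal plain-Python sketch of a single hinge-loss SGD update follows; this is an illustration of the technique, not CovertMark's actual implementation, which wraps a library classifier:

```python
def sgd_hinge_step(w, b, x, y, lr=0.1, reg=1e-4):
    """One SGD update for hinge loss max(0, 1 - y * (w.x + b)), y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    if margin < 1:
        # Misclassified or inside the margin: step towards the sample.
        w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
        b = b + lr * y
    else:
        # Correctly classified with margin: only apply regularisation shrinkage.
        w = [wi * (1 - lr * reg) for wi in w]
    return w, b

# Toy usage: two linearly separable 1-D samples.
w, b = [0.0], 0.0
for _ in range(20):
    for x, y in [([2.0], 1), ([-2.0], -1)]:
        w, b = sgd_hinge_step(w, b, x, y)
```

After a few epochs the learnt weights separate the two samples with a margin, at which point further updates only shrink the weights slightly.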
The lower the occurrence threshold, the easier live classification becomes. Therefore, a 5% penalty is applied for every 20% by which the occurrence threshold is raised.
Threshold percentile and run # are used to distinguish SGD runs.
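The penalty rule can be sketched as follows; this is a hypothetical helper, and the continuous interpretation of the penalty is an assumption, as the text does not specify whether it applies smoothly or in discrete 20% steps:

```python
def penalised_score(score, occurrence_threshold_pct):
    """Apply a 5% penalty for every 20% by which the occurrence
    threshold is raised (continuous interpretation assumed)."""
    penalty = 0.05 * (occurrence_threshold_pct / 20.0)
    return score * (1.0 - penalty)
```

For example, a classifier evaluated at a 40% occurrence threshold would have its score scaled by 0.9, while one at a 0% threshold keeps its score unchanged.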
Perform SGD learning on the training/testing dataset, and check for overfitting on the validation dataset.
threshold_pct (int) – the occurrence threshold percentile used to tolerate a low number of classifier positive hits, reducing false positives.
run_num (int) – the integer run number of this training/validation run.
Run the classifier with the lowest FPR at each occurrence threshold on unseen recall packets.
We cannot distinguish directions in this strategy.
Input traces are assumed to be chronologically ordered; the strategy will malfunction otherwise. Under dynamic occurrence thresholding, some false negatives are sacrificed for a low false positive rate.
window_size (int) – the number of packets in each segment of single client-remote TCP sessions.
decision_threshold (int) – leave as None for an automatic decision threshold search; otherwise, the number of IP occurrences required before positive flagging.
Only supports TCP-based PTs for now due to SEQ-related shaping.
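The windowed, occurrence-thresholded flagging described above might be sketched as follows; the function name and data layout are assumptions for illustration, not CovertMark's API:

```python
from collections import Counter

def flag_clients(window_results, decision_threshold):
    """window_results: iterable of (client_ip, classified_positive) pairs,
    one per window of a client-remote TCP session. A client IP is flagged
    once its positively classified windows reach decision_threshold."""
    positives = Counter(ip for ip, hit in window_results if hit)
    return {ip for ip, n in positives.items() if n >= decision_threshold}

flagged = flag_clients(
    [("10.0.0.1", True), ("10.0.0.1", True),
     ("10.0.0.2", True), ("10.0.0.2", False)],
    decision_threshold=2)
```

A higher decision_threshold demands more positively classified windows per IP before flagging, trading false negatives for fewer false positives.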
Split the inputs into test and validation packets through random sampling over all feature rows. We refer to testing data used during training as test data, and data used in the negative run but unseen during training as validation data. It is important to balance the positive and negative sample counts, as an over-supply of negative cases can severely damage recall performance on unseen inputs captured separately.
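A balanced split of this kind might look like the following sketch, which assumes feature rows are list-like; the function name and downsampling approach are illustrative, not CovertMark's actual code:

```python
import random

def balanced_test_validation_split(positive_rows, negative_rows,
                                   split_ratio=0.5, seed=42):
    """Downsample negatives to the positive count, then randomly split
    the pooled, labelled rows into test and validation sets."""
    rng = random.Random(seed)
    negatives = rng.sample(negative_rows,
                           min(len(positive_rows), len(negative_rows)))
    rows = [(r, 1) for r in positive_rows] + [(r, 0) for r in negatives]
    rng.shuffle(rows)
    cut = int(len(rows) * split_ratio)
    return rows[:cut], rows[cut:]

pos = [[i] for i in range(10)]
neg = [[i] for i in range(100, 140)]
test_set, validation_set = balanced_test_validation_split(pos, neg)
```

Downsampling negatives before the split keeps the class ratio at 1:1, which addresses the over-supply problem described above.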