SSH datasets

From SimpleWiki
Revision as of 06:26, 19 August 2014 by Hofstede (talk | contribs) (→‎Caveats)
Jump to navigationJump to search

This page provides access to the materials accompanying the following publication:

SSH Compromise Detection using NetFlow/IPFIX
Rick Hofstede, Luuk Hendriks, Anna Sperotto, Aiko Pras. In: ACM Computer Communication Review, 2014 (to appear).

More information regarding this publication can be found here. Any usage of materials provided on this page - be it the datasets, scripts, paper itself, or any content derived thereof - should reference this publication.


Name File Size md5
Flow data x GB xxx
Log files x GB xxx

Some results derived from these data can be found in here.

Flow data

The flow data has been exported by a Cisco Catalyst 6500 with SUP2T supervisor module (PFC4, MSFC 5), and collected using nfcapd. Neither packet sampling nor flow sampling have been applied. The following post-processing operations have however been performed:

  1. Filtering: Only SSH data has been selected, i.e., the following nfdump filter has been used: port 22 and proto tcp.
  2. Anonymization: nfanon has been used for anonymizing the flow data in a prefix-preserving manner. More precisely, nfanon relies on the CryptoPAn (Cryptography-based Prefix-preserving Anonymization) module.

Log files

The log files have been gathered from various Linux operating systems. The following post-processing operations have however been performed:

  1. Merging: On some machines, the authentication logs were distributed over <hostname>.messages and <hostname>.warn. We have merged those log files, sorted them again (if necessary), and removed any introduced duplicates.
  2. Renaming: The file names have been changed from <hostname>.<extension> into <anonymized_IP_address>.<extension>. As such, the log files can easily be correlated with the flow data.
  3. Anonymization: We have replaced any usernames by "XXXXX" and hostnames by the anonymized IP address of the considered host.


  • Note that there are activities over SSH from several (anonymized) IP address ranges that can principally be found in the log files only (so not in the flow data). These are internal IP address ranges for which the traffic is not considered for flow export.
  • (Host name resolving)