Difference between revisions of "Dropbox Crawler"

From SimpleWiki
 
(105 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Personal cloud storage is becoming more and more popular - Dropbox is certainly the best known example. Cloud storage already generates a huge amount of Internet traffic. Because of that, understanding how people interact with such applications is essential for designing efficient cloud storage systems.
+
Personal cloud storage is becoming increasingly popular, with Dropbox certainly being the best-known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
  
We have been doing research on the usage of Dropbox ([http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf see our results here]). As a next step, we need to know what types of files people store in the service. This would allow us to understand the impact of some technologies on system performance and network traffic, among other things. For that, we need volunteers to provide us with basic statistics (size, type, etc.) about the files stored in their folders.
+
In this experiment, we collected basic statistics of what files are stored in Dropbox folders.
  
== Be part of the crowd: Help our research ==
+
== Datasets ==
  
All you need to do is run a Java application on your PC. This application will read your Dropbox folder, calculate some statistics, show everything for your approval and, '''only after that''', send the statistics to us.
+
Download our datasets:
  
*  Most people will be able to run the application '''by clicking here'''
+
{| class="wikitable" style="text-align: center; width: 400px; height: 40px;"
 +
|-
 +
! scope="col" | Name
 +
! scope="col" | File Size
 +
! scope="col" | Volunteers
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/crawler/dropbox_crawler.tar.gz Crawler Dataset]
 +
| 219M || 333
 +
|}
  
* In case your browser does not support that, you can '''download the package and run it''': Just double click on it!
+
Some results derived from these data can be found [http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf here].
  
 +
In particular, the figures presented in Sect. 5.3 of the linked document are obtained using the
 +
scripts available in the folder "scripts" inside the tarball.
  
== What will be captured? ==
+
== How does our data collection work? ==
  
What we do:
+
* Scans Dropbox folders
 +
* Calculates basic statistics
 +
* Shows what has been collected for approval
 +
* Sends the statistics to us
  
We will read all your DropBox Folder;
+
== What has been logged? ==
We will collect basic statistics (log format can be viewed in the following);
 
We will send these statistics to our web server.
 
  
 +
For each file/folder in a Dropbox folder, the program collects:
 +
<pre>
 +
* Size in bytes
 +
* Last modification time
 +
* Mime type of the file
 +
* File extension
 +
* MD5 Hash of both initial and final 8 kbytes of the file
 +
* MD5 Hash of the file name/path
 +
</pre>
  
What we DO NOT do:
+
The program also sends us:
 
+
<pre>
We do not copy any file content;
+
* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
We do not copy file or folder name;
+
* MD5 Hash of the path of your Dropbox home folder
We do not copy any personal information;
+
* Your IP address and operating system version
We do not install or store anything in your computer.
+
* Error logs, in case something goes wrong during the data collection
 +
</pre>
  
 +
Collected information is sent via plain HTTP to a centralized collection server.
  
 
== Client source code ==
 
== Client source code ==
  
 +
Download the source code by clicking [http://www.simpleweb.org/dropbox/source_python.zip here] for the native versions (you will need Python 2.7 and [http://www.pyinstaller.org/ PyInstaller] for building these versions), or [http://www.simpleweb.org/dropbox/source_java.zip here] for the Java version.
  
Download the Java Source Code to Capture Files Information
+
== More information ==
The project may be used directly in NetBeans, version 7.2.1
 
  
 +
The dataset on this page is used in the following publications:
  
== Policy ==
+
  @phdthesis{drago_understanding_2013,
 +
          author      = {Idilio Drago},
 +
          title        = {Understanding and Monitoring Cloud Services},
 +
          school      = {University of Twente},
 +
          url          = {<nowiki>\url{http://eprints.eemcs.utwente.nl/24136/</nowiki>}},
 +
          year        = {2013},
 +
  },
  
 +
  @inproceedings{drago_caracterizacao_2013,
 +
          author      = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
 +
          title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
 +
          booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
 +
          series      = <nowiki>{{WP2P+}}</nowiki>,
 +
          pages        = {109--114},
 +
          year        = {2013},
 +
  },
  
We ensure that:
+
More information about our previous work can be found in these papers:
  
All data we collect are anonymized.
+
* [http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
We do not copy any file content.
 
We do not collect any personal information and file/dir names.
 
  
 +
* [http://eprints.eemcs.utwente.nl/23674/01/cloud_storage.pdf '''Drago, I. and Bocchi, E. and Mellia, M. and Slatman, H. and Pras, A. (2013) Benchmarking personal cloud storage. In: Proceedings of the 13th ACM Internet Measurement Conference, IMC 2013, 23-25 Oct 2013, Barcelona, Spain. pp. 205-212.''']
  
We will also make our data public in the near future, so that anyone will be able to use this important data source.
+
* [[Dropbox Traces|This page]] and [[Cloud benchmarks | this page]] have more traces we used in other papers.
  
== Format ==
+
== External Links ==
  
All files are in a simple format. Each line holds the attributes of one file, separated by #.
+
These institutes are involved in this research:
 
+
* [http://www.utwente.nl/ewi/dacs/ DACS - University of Twente] - Contact: Idilio Drago - idilio.drago@polito.it
The following columns are found in these traces:
+
* [http://www.ufjf.br/portal/ Universidade Federal de Juiz de Fora] Contact: Alex Vieira - alex.borges@ufjf.edu.br
 
+
* [http://www.tlc-networks.polito.it/ Telecommunication Networks Group - Politecnico di Torino] - Marco Mellia - mellia@tlc.polito.it
<pre>
############################################################################
#     #    # Short description      # Unit # Long description            #
############################################################################
#  1  #    # Length                 # -    # File size in bytes
#  2  #    # Modified               # -    # Last modification of the file (Unix date/time format)
#  3  #    # MIME                   # -    # File Mime Type using Magic Java Unit
#  4  #    # EXTENSION              # -    # File extension (substring after the last "." in the string)
#  5  #    # MD5                    # -    # MD5 hash code of the initial/final 8 bytes of the file.
#  6  #    # MD5 of the name        # -    # MD5 hash code of file name string.
############################################################################
</pre>
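As an illustration of the format described above, the following is a minimal Python sketch of how one trace line could be parsed. It assumes that each line holds exactly the six columns listed in the table, in that order, separated by "#", and that the modification time is an integer Unix timestamp. The names <code>FileRecord</code> and <code>parse_record</code> are hypothetical and are not part of the released scripts.

```python
from typing import NamedTuple


class FileRecord(NamedTuple):
    """One line of a trace file (hypothetical helper type)."""
    length: int        # column 1: file size in bytes
    modified: int      # column 2: last modification time (assumed integer Unix time)
    mime: str          # column 3: MIME type
    extension: str     # column 4: file extension
    content_md5: str   # column 5: MD5 of the initial/final part of the file
    name_md5: str      # column 6: MD5 of the file name string


def parse_record(line: str) -> FileRecord:
    """Split a '#'-separated trace line into its six columns."""
    fields = line.rstrip("\n").split("#")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    length, modified, mime, extension, content_md5, name_md5 = fields
    return FileRecord(int(length), int(modified), mime, extension,
                      content_md5, name_md5)
```

If the real traces contain a different number of columns per line, <code>parse_record</code> raises a <code>ValueError</code> instead of silently misaligning the fields.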
 
 
 
 
 
 
 
== More information ==
 
 
 
 
 
* You may find more information on our previous work about Dropbox:
 
 
 
[http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
 
 
 
* [[Dropbox Traces|This page]] has more information about the data we used in our research so far.  
 
 
 
== External Links ==
 

Latest revision as of 09:44, 9 May 2014

Personal cloud storage is becoming increasingly popular, with Dropbox certainly being the best-known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?

In this experiment, we collected basic statistics of what files are stored in Dropbox folders.

Datasets

Download our datasets:

Name             File Size  Volunteers
Crawler Dataset  219M       333

Some results derived from these data can be found in the thesis at http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf.

In particular, the figures presented in Sect. 5.3 of the linked document were obtained using the scripts available in the folder "scripts" inside the tarball.

How does our data collection work?

  • Scans Dropbox folders
  • Calculates basic statistics
  • Shows what has been collected for approval
  • Sends the statistics to us

What has been logged?

For each file/folder in a Dropbox folder, the program collects:

* Size in bytes
* Last modification time
* Mime type of the file
* File extension
* MD5 Hash of both initial and final 8 kbytes of the file
* MD5 Hash of the file name/path
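
The per-file attributes listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the released crawler: the names `md5_head_tail`, `file_stats` and `scan` are hypothetical, the sketch covers regular files only, the initial and final 8 kbytes are combined into a single MD5 digest, and the MIME type is guessed from the file extension with the standard `mimetypes` module rather than from the file content.

```python
import hashlib
import mimetypes
import os

CHUNK = 8 * 1024  # 8 kbytes, as stated in the list above


def md5_head_tail(path, chunk=CHUNK):
    """MD5 over the first and the last `chunk` bytes of a file.

    Assumption: both parts feed one digest; for files shorter than
    2 * chunk the two parts overlap.
    """
    digest = hashlib.md5()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        digest.update(f.read(chunk))
        if size > chunk:
            f.seek(-chunk, os.SEEK_END)
            digest.update(f.read(chunk))
    return digest.hexdigest()


def file_stats(path):
    """Collect the attributes listed above for one file; no content or name is kept."""
    st = os.stat(path)
    mime, _encoding = mimetypes.guess_type(path)  # extension-based stand-in
    return {
        "size": st.st_size,
        "mtime": int(st.st_mtime),
        "mime": mime or "application/octet-stream",
        "extension": os.path.splitext(path)[1].lstrip("."),
        "content_md5": md5_head_tail(path),
        "path_md5": hashlib.md5(path.encode("utf-8")).hexdigest(),
    }


def scan(root):
    """Yield one statistics record per regular file below `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield file_stats(os.path.join(dirpath, name))
```

Calling `list(scan(folder))` on a Dropbox home folder would give one dictionary per file. Only hashes of names and of the sampled content appear in the records, which matches the intent of the list above.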

The program also sends us:

* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
* MD5 Hash of the path of your Dropbox home folder
* Your IP address and operating system version
* Error logs, in case something goes wrong during the data collection

Collected information is sent via plain HTTP to a centralized collection server.

Client source code

Download the source code for the native versions from http://www.simpleweb.org/dropbox/source_python.zip (you will need Python 2.7 and PyInstaller to build these versions), or for the Java version from http://www.simpleweb.org/dropbox/source_java.zip.

More information

The dataset on this page is used in the following publications:

 @phdthesis{drago_understanding_2013,
         author       = {Idilio Drago},
         title        = {Understanding and Monitoring Cloud Services},
         school       = {University of Twente},
         url          = {\url{http://eprints.eemcs.utwente.nl/24136/}},
         year         = {2013},
 },
 @inproceedings{drago_caracterizacao_2013,
         author       = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
         title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
         booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
         series       = {{WP2P+}},
         pages        = {109--114},
         year         = {2013},
 },

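More information about our previous work can be found in these papers:

  • Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012. http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf
  • Drago, I. and Bocchi, E. and Mellia, M. and Slatman, H. and Pras, A. (2013) Benchmarking personal cloud storage. In: Proceedings of the 13th ACM Internet Measurement Conference, IMC 2013, 23-25 Oct 2013, Barcelona, Spain. pp. 205-212. http://eprints.eemcs.utwente.nl/23674/01/cloud_storage.pdf
  • The "Dropbox Traces" and "Cloud benchmarks" pages have more traces we used in other papers.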

External Links

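These institutes are involved in this research:

  • DACS - University of Twente (http://www.utwente.nl/ewi/dacs/) - Contact: Idilio Drago - idilio.drago@polito.it
  • Universidade Federal de Juiz de Fora (http://www.ufjf.br/portal/) - Contact: Alex Vieira - alex.borges@ufjf.edu.br
  • Telecommunication Networks Group - Politecnico di Torino (http://www.tlc-networks.polito.it/) - Contact: Marco Mellia - mellia@tlc.polito.it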