Web archives have been instrumental in the digital preservation of the Web and provide a great opportunity to study the past and evolution of society. These web archives are massive collections of snapshots of web resources, each taken at a given point in time. They allow researchers to analyse past versions of particular websites or of the web as a whole. The dataset comprises 55.6 TB of data collected from the .DE domain and 57.8 TB from the .UK domain. Below you can find more details about the gathered data.
Due to legal restrictions, we are not allowed to make this dataset publicly available. However, you can get access to it during an internship at L3S, where you can use our Hadoop cluster to analyse the dataset during your stay. Publication of derived data is, of course, allowed.
A good starting point to learn how to analyze WARC data, the format the archive is stored in, is the Web Archive Analysis Workshop from the Internet Archive (https://webarchive.jira.com/
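To give a first impression of what WARC data looks like, the following is a minimal sketch of a record reader that uses only the Python standard library. It assumes the general WARC layout (a `WARC/...` version line, named header fields, a blank line, then `Content-Length` bytes of payload). The function name `iter_warc_records` and the sample record are made up for illustration; this is not the tooling used on the L3S cluster, and a dedicated WARC library is likely the better choice for real analyses.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            # Split on the first colon only; values such as dates contain colons.
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()

        # Simplification: header names are matched exactly as written here.
        length = int(headers.get("Content-Length", "0"))
        payload = stream.read(length)
        yield headers, payload


# Hypothetical sample record, built in memory purely for demonstration.
sample_body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.de/\r\n",
    b"WARC-Date: 2013-09-30T23:59:59Z\r\n",
    b"Content-Length: " + str(len(sample_body)).encode("ascii") + b"\r\n",
    b"\r\n",
    sample_body,
    b"\r\n\r\n",
])

for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

WARC files are commonly stored gzip-compressed. Because Python's `gzip.open(path, "rb")` reads multi-member gzip files transparently, the same function can be pointed at such a file object instead of the in-memory sample.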
Captures over time (.DE web archive)
Overview per MIME type (.DE web archive)
MIME type | distinct URL count | capture count | distinct host count | start | end |
---|---|---|---|---|---|
application | 45,120,237 | 53,297,817 | 4,651,389 | 1995-11-15 14:34:08 | 2013-09-30 23:59:59 |
audio | 1,148,275 | 1,212,263 | 99,706 | 1996-05-10 12:18:54 | 2013-09-30 23:59:57 |
image | 407,779,594 | 417,525,373 | 10,521,649 | 1994-11-15 12:45:26 | 2013-09-30 23:59:59 |
other | 34,459,850 | 35,022,420 | 6,352,228 | 1995-11-14 13:32:46 | 2013-09-30 23:59:49 |
text A | 1,063,458,432 | 1,738,936,530 | 19,284 | 1996-05-09 11:59:07 | 2013-09-30 23:59:59 |
text B | 1,836,890,783 | 3,132,633,942 | 71,096,043 | 1994-12-02 16:27:18 | 2013-09-30 23:59:59 |
video | 310,078 | 326,530 | 56,087 | 1996-05-10 17:32:06 | 2013-09-30 23:34:00 |
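
As an example of the kind of derived figure that can be computed from this overview, the sketch below divides the capture count by the distinct URL count for each MIME type, giving the average number of captures per URL. The numbers are copied from the table above; the names `DE_OVERVIEW` and `captures_per_url` are illustrative only.

```python
# (distinct URL count, capture count) per MIME type, copied from the table above.
DE_OVERVIEW = {
    "application": (45_120_237, 53_297_817),
    "audio": (1_148_275, 1_212_263),
    "image": (407_779_594, 417_525_373),
    "other": (34_459_850, 35_022_420),
    "text A": (1_063_458_432, 1_738_936_530),
    "text B": (1_836_890_783, 3_132_633_942),
    "video": (310_078, 326_530),
}


def captures_per_url(overview):
    """Return the average number of captures per distinct URL for each MIME type."""
    return {mime: captures / urls for mime, (urls, captures) in overview.items()}


ratios = captures_per_url(DE_OVERVIEW)

# Print the MIME types from most to least frequently re-captured.
for mime, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{mime:<12} {ratio:.2f}")
```

A ratio close to 1 means most URLs of that type were captured only once, while higher values indicate that URLs were revisited more often.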