
German and UK web archive

Web archives have been instrumental in the digital preservation of the Web and provide a great opportunity to study the societal past and its evolution. These web archives are massive collections of snapshots of web resources, each taken at a given point in time. They allow researchers to analyse past versions of particular websites or the web as a whole. The dataset comprises 55.6 TB of data collected from the .DE domain and 57.8 TB from the .UK domain. Below you can find more details about the gathered data.
Due to legal restrictions, we are not allowed to make this dataset publicly available. However, you can get access to it during an internship at L3S, where you can use our Hadoop cluster to analyse the dataset during your stay. Publication of derived data is allowed.
A good starting point to learn how to analyze WARC data, the format the archive is stored in, is the Web Archive Analysis Workshop from the Internet Archive (https://webarchive.jira.com/wiki/spaces/Iresearch/pages/60620822/Web+Archive+Analysis+Workshop).
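To give a rough idea of what a WARC record looks like, the following is a minimal sketch of a reader written with the Python standard library only. It assumes an uncompressed WARC stream in which every record carries a Content-Length header; real archive files are usually gzip-compressed and are better handled with a dedicated WARC library or the tooling from the workshop. The function name `iter_warc_records`, the sample record and the URL `http://example.de/` are illustrative assumptions, not part of the dataset or its tooling.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; assumes every record has a
    Content-Length header, as the WARC format requires.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")

        # Read "Name: value" header lines up to the empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The record block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# A made-up record, used only to exercise the reader.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.de/\r\n"
    b"WARC-Date: 2013-09-30T23:59:59Z\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

if __name__ == "__main__":
    for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

The WARC-Date and WARC-Target-URI headers are the fields that capture timestamps and URLs are normally read from.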

 

Captures over time (.DE web archive)

 

Overview per MIME type (.DE web archive)

| MIME type   | Distinct URL count | Capture count | Distinct host count | Start               | End                 |
|-------------|--------------------|---------------|---------------------|---------------------|---------------------|
| application | 45,120,237         | 53,297,817    | 4,651,389           | 1995-11-15 14:34:08 | 2013-09-30 23:59:59 |
| audio       | 1,148,275          | 1,212,263     | 99,706              | 1996-05-10 12:18:54 | 2013-09-30 23:59:57 |
| image       | 407,779,594        | 417,525,373   | 10,521,649          | 1994-11-15 12:45:26 | 2013-09-30 23:59:59 |
| other       | 34,459,850         | 35,022,420    | 6,352,228           | 1995-11-14 13:32:46 | 2013-09-30 23:59:49 |
| text A      | 1,063,458,432      | 1,738,936,530 | 19,284              | 1996-05-09 11:59:07 | 2013-09-30 23:59:59 |
| text B      | 1,836,890,783      | 3,132,633,942 | 71,096,043          | 1994-12-02 16:27:18 | 2013-09-30 23:59:59 |
| video       | 310,078            | 326,530       | 56,087              | 1996-05-10 17:32:06 | 2013-09-30 23:34:00 |
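
The figures in the overview can be used to estimate how often a URL was captured on average for each MIME type. The sketch below copies the distinct URL and capture counts from the .DE overview; the function names are illustrative assumptions, and the ratio is only a coarse average that says nothing about how captures are distributed across individual URLs.

```python
# Rows copied from the .DE overview: (MIME type, distinct URL count, capture count).
ROWS = [
    ("application", 45_120_237, 53_297_817),
    ("audio", 1_148_275, 1_212_263),
    ("image", 407_779_594, 417_525_373),
    ("other", 34_459_850, 35_022_420),
    ("text A", 1_063_458_432, 1_738_936_530),
    ("text B", 1_836_890_783, 3_132_633_942),
    ("video", 310_078, 326_530),
]


def captures_per_url(rows):
    """Return the average number of captures per distinct URL for each MIME type."""
    return {mime: captures / urls for mime, urls, captures in rows}


def total_captures(rows):
    """Return the total number of captures across all MIME types."""
    return sum(captures for _, _, captures in rows)


if __name__ == "__main__":
    for mime, ratio in captures_per_url(ROWS).items():
        print(f"{mime:<12} {ratio:.2f} captures per distinct URL")
    print(f"total captures: {total_captures(ROWS):,}")
```

On these numbers, the text types are re-crawled noticeably more often per URL than images, audio or video, whose ratios stay close to one capture per URL.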