The Common Crawl corpus contains petabytes of data collected since 2008, including raw web page data, extracted metadata, and text extractions.
The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program.
You can download the files from the Public Datasets program free of charge using HTTP or S3.
As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves.
- [ARC] s3://commoncrawl/crawl-001/ – Crawl #1 (2008/2009)
- [ARC] s3://commoncrawl/crawl-002/ – Crawl #2 (2009/2010)
- [ARC] s3://commoncrawl/parse-output/ – Crawl #3 (2012)
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ – Summer 2013
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-48/ – Winter 2013
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-10/ – March 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-15/ – April 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-23/ – July 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-35/ – August 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-41/ – September 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-42/ – October 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-49/ – November 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-52/ – December 2014
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-06/ – January 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-11/ – February 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-14/ – March 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-18/ – April 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-22/ – May 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-27/ – June 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-32/ – July 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-35/ – August 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-40/ – September 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-48/ – November 2015
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-07/ – February 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-18 – April 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-22 – May 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-26 – June 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-30 – July 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-36 – August 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-40 – September 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-44 – October 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2016-50 – December 2016
- s3://commoncrawl/crawl-data/CC-MAIN-2017-04 – January 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-09 – February 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-13 – March 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-17 – April 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-22 – May 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-26 – June 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-30 – July 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-34 – August 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-39 – September 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-43 – October 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-47 – November 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2017-51 – December 2017
- s3://commoncrawl/crawl-data/CC-MAIN-2018-05 – January 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-09 – February 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-13 – March 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-17 – April 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-22 – May 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-26 – June 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-30 – July 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-34 – August 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-39 – September 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-43 – October 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-47 – November 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2018-51 – December 2018
- s3://commoncrawl/crawl-data/CC-MAIN-2019-04 – January 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-09 – February 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-13 – March 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-18 – April 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-22 – May 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-26 – June 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-30 – July 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-35 – August 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-39 – September 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-43 – October 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-47 – November 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2019-51 – December 2019
- s3://commoncrawl/crawl-data/CC-MAIN-2020-05 – January 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-10 – February 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-16 – March/April 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-24 – May/June 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-29 – July 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-34 – August 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-40 – September 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-45 – October 2020
- s3://commoncrawl/crawl-data/CC-MAIN-2020-50 – November/December 2020
For all crawls since 2013, the data has been stored in the WARC file format and is accompanied by metadata (WAT) and text (WET) extracts. We also provide file path lists for the segments, WARC, WAT, and WET files, which can be found at CC-MAIN-YYYY-WW/[segment|warc|wat|wet].paths.gz, where YYYY-WW is the year and week number of the crawl.
By replacing s3://commoncrawl/ with https://commoncrawl.s3.amazonaws.com/ on each line, you can obtain the HTTP path for any of the files stored on S3.
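The substitution described above can be sketched in a few lines of Python. The function name `s3_to_http` is an illustrative choice and not part of any Common Crawl tooling; the bucket prefix and HTTP endpoint are the ones given in the text.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def s3_to_http(s3_path: str) -> str:
    """Rewrite an s3://commoncrawl/ path as its HTTP equivalent."""
    if not s3_path.startswith(S3_PREFIX):
        raise ValueError(f"not a Common Crawl S3 path: {s3_path!r}")
    # Keep everything after the bucket prefix and prepend the HTTP endpoint.
    return HTTP_PREFIX + s3_path[len(S3_PREFIX):]


# Example: the WARC path list of the November/December 2020 crawl.
http_url = s3_to_http("s3://commoncrawl/crawl-data/CC-MAIN-2020-50/warc.paths.gz")
print(http_url)
```

The same function can be applied to every line of a decompressed paths file, after prefixing each line with `s3://commoncrawl/` if the listed paths are relative to the bucket.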
Common Crawl currently stores the crawl data using the Web ARChive (WARC) format.
Before the Summer 2013 crawl, the crawl data was stored in the ARC file format.
The WARC format allows for more efficient storage and processing of Common Crawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.
This document aims to give you an introduction to working with the new format, specifically the difference between:
- WARC files which store the raw crawl data
- WAT files which store computed metadata for the data stored in the WARC
- WET files which store extracted plaintext from the data stored in the WARC
If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.
If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.
The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).
For the HTTP responses themselves, the raw response is stored. This not only includes the response itself, what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.
In the example below, we can see the crawler contacted http://news.bbc.co.uk/2/hi/africa/3414345.stm and received an HTML page in response. We can also see that the page was served by the Apache web server, sets caching details, and attempts to set a cookie (shortened for display here).
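To make the record layout concrete, the following Python sketch parses a small hand-written response record using only the standard library. The record is not the BBC capture described above: its URL, headers, and body are placeholders made up for illustration. The parser is deliberately simplified (no folded header lines, repeated headers overwrite each other); a dedicated WARC library is the better choice for real files.

```python
def parse_header_block(block: bytes):
    """Split a header block into its first line and a dict of header fields."""
    first, *rest = block.decode("utf-8", errors="replace").split("\r\n")
    fields = {}
    for line in rest:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return first, fields


def parse_response_record(raw: bytes):
    """Return (WARC headers, HTTP status line, HTTP headers, body)."""
    warc_block, _, rest = raw.partition(b"\r\n\r\n")
    _version, warc_headers = parse_header_block(warc_block)
    # Content-Length gives the size of the stored HTTP message.
    payload = rest[: int(warc_headers["Content-Length"])]
    http_block, _, body = payload.partition(b"\r\n\r\n")
    status_line, http_headers = parse_header_block(http_block)
    return warc_headers, status_line, http_headers, body


# Placeholder HTTP message and WARC record (all values are made up).
http_message = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)
raw_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    + b"Content-Length: %d\r\n" % len(http_message)
    + b"\r\n"
    + http_message
    + b"\r\n\r\n"
)

warc_headers, status_line, http_headers, body = parse_response_record(raw_record)
print(warc_headers["WARC-Type"], warc_headers["WARC-Target-URI"])
print(status_line, http_headers["Server"])
```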
WAT Response Format
WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.
This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, you can use one of the many JSON pretty print tools available.
The HTTP response metadata is most likely to be of interest to Common Crawl users. The skeleton of the JSON format is outlined below.
As an example in Python, if we parsed the JSON into the data object, we could easily pull out interesting information from the BBC article…
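A minimal sketch of that kind of lookup follows. The record is hand-written and heavily trimmed, and the key names (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Head`, `Links`) reflect an assumed WAT nesting rather than a schema taken from this document, so check them against a real WAT file before relying on them.

```python
import json

# Hand-written, trimmed record with placeholder values; key names are assumptions.
wat_json = (
    '{"Envelope":{"WARC-Header-Metadata":{"WARC-Target-URI":"http://example.com/"},'
    '"Payload-Metadata":{"HTTP-Response-Metadata":{"Headers":{"Server":"Apache"},'
    '"HTML-Metadata":{"Head":{"Title":"Example page"},'
    '"Links":[{"path":"A@/href","url":"/about"}]}}}}}'
)

data = json.loads(wat_json)
response = data["Envelope"]["Payload-Metadata"]["HTTP-Response-Metadata"]

title = response["HTML-Metadata"]["Head"]["Title"]
server = response["Headers"]["Server"]
links = [link["url"] for link in response["HTML-Metadata"]["Links"]]

print(title, server, links)

# Pretty-print the whitespace-stripped JSON for easier inspection.
print(json.dumps(data, indent=2))
```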
WET Response Format
As many tasks only require textual information, the Common Crawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.
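The layout described above (WARC headers carrying the URL and the plaintext length, followed directly by the plaintext) can be read with a short standard-library sketch. The record below is hand-written with placeholder values, and the `WARC-Type: conversion` header is an assumption about how WET records are typed.

```python
def parse_wet_record(raw: bytes):
    """Return (target URL, extracted plaintext) for one WET-style record."""
    header_block, _, rest = raw.partition(b"\r\n\r\n")
    headers = {}
    # Skip the first line (the WARC version) and read "Name: value" pairs.
    for line in header_block.decode("utf-8").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length is the number of plaintext bytes that follow the headers.
    length = int(headers["Content-Length"])
    return headers["WARC-Target-URI"], rest[:length].decode("utf-8")


# Placeholder plaintext and record (all values are made up).
plaintext = "Example Domain\nThis text stands in for an extracted page.".encode("utf-8")
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    + b"Content-Length: %d\r\n" % len(plaintext)
    + b"\r\n"
    + plaintext
    + b"\r\n\r\n"
)

url, text = parse_wet_record(record)
print(url)
print(text)
```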
Processing the file format
We maintain introductory examples on GitHub for the following programming languages and big data processing frameworks:
For each of these platforms, the examples describe how to:
- Count the number of times various tags are used across HTML on the internet using the WARC files
- Count the number of different server types found in the HTTP headers using the WAT files
- Execute a word count over the extracted plaintext found in the WET files
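The first and third tasks can be sketched on in-memory strings with the Python standard library. This is not the code from the GitHub examples: it runs on a single machine, and the sample pages and texts are placeholders standing in for content read from WARC and WET files.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        self.counts[tag] += 1


def count_tags(html_documents):
    """Sum tag counts over an iterable of HTML strings."""
    total = Counter()
    for html in html_documents:
        parser = TagCounter()
        parser.feed(html)
        parser.close()
        total.update(parser.counts)
    return total


def count_words(texts):
    """Whitespace-based word count over an iterable of plaintext strings."""
    words = Counter()
    for text in texts:
        words.update(text.lower().split())
    return words


pages = [
    "<html><body><p>Hi</p><p>there</p></body></html>",
    "<html><body><a href='/x'>x</a><br/></body></html>",
]
tags = count_tags(pages)
words = count_words(["the cat saw the dog", "The end"])
print(tags.most_common(3))
print(words.most_common(1))
```

Because `Counter` objects can be added together, per-file counts computed in parallel can be merged afterwards, which mirrors the map and reduce steps of a distributed job.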
If you’re using a different programming language or prefer to work with another processing framework, there are a number of open source libraries that handle processing WARC files and the content therein, including:
- webrecorder’s warcio library for processing WARC and ARC files (Python 2.7 and 3.3+)
- IIPC’s Web Archive Commons library for processing WARC & WAT (Java)
More tools and libraries are found on the list of Awesome Web Archiving utilities maintained by the IIPC.
URL and metadata indexes
Using the Common Crawl URL Index of WARC and ARC files (2008 – present), you can look up URLs crawled in a given dataset, locate an archived page or pages within the dataset, search for URL prefixes to learn about the coverage of hosts or domains in the Common Crawl archives, and more. To a limited extent, the index server may be used as a “wayback machine” to manually “browse” a crawl archive.
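As a rough sketch of a lookup, the code below builds a query URL for the index server and the HTTP `Range` header needed to fetch one archived record once its location is known. Several details are assumptions not stated in this document: the server address `https://index.commoncrawl.org`, the `<crawl>-index` endpoint naming, the `url` and `output` query parameters, and the idea that each result reports a file name, byte offset, and length. The code makes no network requests.

```python
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org"  # assumed index server address


def index_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build a lookup URL for one crawl; parameter names are assumptions."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{query}"


def byte_range_header(offset: int, length: int) -> dict:
    """HTTP header that requests exactly one record from a larger file."""
    # Range end positions are inclusive, hence the "- 1".
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


query_url = index_query_url("CC-MAIN-2020-50", "example.com/")
range_header = byte_range_header(100, 50)
print(query_url)
print(range_header)
```

Requesting only the byte range of a single record avoids downloading the whole archive file that contains it.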
The Parquet Index, on AWS S3, is an index to WARC files and URLs in a columnar format; it is most useful for running analytics queries. The columnar format, in Apache Parquet, enables highly efficient querying and processing of the index, which saves time and computing resources. When only a few columns are accessed, recent big data tools will run impressively fast.
The columnar index is free to access or download for anybody. All files are on AWS S3:
To date, we have tested the following data tools on the Parquet Index: Apache Spark, Apache Hive and AWS Athena. The latter makes it possible to run SQL queries on the columnar data without launching a server. For detailed examples and instructions on querying the data with Athena, please see this blog post.
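For a flavour of such an analytics query, the sketch below assembles SQL text that counts pages per registered domain under one top-level domain. The table name `"ccindex"."ccindex"` and the column names (`crawl`, `subset`, `url_host_tld`, `url_host_registered_domain`) are assumptions about how the index is registered and laid out; verify them against the actual table schema before running anything.

```python
def domain_count_query(crawl_id: str, tld: str, min_pages: int = 100) -> str:
    """Build SQL that counts captured pages per registered domain in one TLD."""
    # The values are interpolated into SQL text, so accept only simple tokens.
    if not crawl_id.replace("-", "").isalnum() or not tld.isalnum():
        raise ValueError("crawl_id and tld must be simple alphanumeric tokens")
    return (
        "SELECT COUNT(*) AS count, url_host_registered_domain\n"
        'FROM "ccindex"."ccindex"\n'  # assumed database and table name
        f"WHERE crawl = '{crawl_id}' AND subset = 'warc' AND url_host_tld = '{tld}'\n"
        "GROUP BY url_host_registered_domain\n"
        f"HAVING COUNT(*) >= {int(min_pages)}\n"
        "ORDER BY count DESC"
    )


sql = domain_count_query("CC-MAIN-2020-50", "no")
print(sql)
```

Filtering on the crawl and subset first and selecting only two columns plays to the strength of the columnar layout, since the engine can skip the columns the query never touches.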
Statistics and metrics
We also publish statistics and basic metrics for each crawl, including:
- Size of the crawl as numbers of fetched pages, unique URLs, unique documents (by content digest), number of different hosts, domains, and top-level domains
- Distribution of pages/URLs on hosts, domains, top-level domains
- Content language, MIME types, character sets
Check out the statistics page on GitHub.