Common Crawl Foundation

@commoncrawl

Common Crawl is a non-profit foundation dedicated to the Open Web.

353 Followers, 61 Following, 95 Posts, Joined 19.11.2024

Latest posts by Common Crawl Foundation @commoncrawl

Common Crawl - Blog - Web Graph Statistics Gets a Proper Upgrade

Our Web Graph Statistics site has been updated with interactive charts, a domain lookup tool for tracking harmonic centrality and PageRank over time, mobile improvements, unified rank tables with OR filtering, and merged degree plots.

commoncrawl.org/blog/web-gra...

07.03.2026 00:55 👍 2 🔁 1 💬 0 📌 0

@thomvaughan.bsky.social did a WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive. You can read more about his study in the thread below

02.03.2026 17:12 👍 3 🔁 1 💬 0 📌 0
Common Crawl - Blog - Announcing the Whirlwind Tour of Common Crawl's Datasets Using Java

Introducing the second installment in our Whirlwind Tour series, covering crawl structure, index access, and content extraction, giving developers a practical foundation for building Java-based data workflows.

commoncrawl.org/blog/announc...

26.02.2026 21:35 👍 1 🔁 0 💬 0 📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2025 and January/February 2026

We're happy to announce the release of the Web Graphs for December 2025 and January/February 2026, consisting of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes and 5.4 billion edges at the domain level.

www.commoncrawl.org/blog/host--a...

24.02.2026 13:15 👍 2 🔁 2 💬 0 📌 0
Common Crawl - Blog - Introducing the New Examples & Resources Browser

We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.

blog.commoncrawl.org/blog/introdu...

23.02.2026 15:51 👍 3 🔁 2 💬 0 📌 0
Common Crawl - Blog - February 2026 Crawl Archive Now Available

We are pleased to announce the release of the February 2026 crawl, consisting of 2.1 billion web pages (or 363 TiB of uncompressed content). Captures are from 45.5 million hosts or 37.1 million registered domains.

blog.commoncrawl.org/blog/februar...

23.02.2026 15:50 👍 4 🔁 1 💬 0 📌 0
Preserving The Web Is Not The Problem. Losing It Is. Recent reporting by Nieman Lab describes how some major news organizations, including The Guardian, The New York Times, and Reddit, are limiting or blocking access to their content in the Internet Ar...

Preserving The Web Is Not The Problem. Losing It Is.

Mark Graham, Director of the Wayback Machine at @archive.org, walks us through the importance of preserving the Web in this recent post:

www.techdirt.com/2026/02/17/p...

19.02.2026 19:09 👍 0 🔁 0 💬 0 📌 0
Common Crawl - Blog - AI Plumbers at FOSDEM'26

Common Crawl was invited to the AI Plumbers unconference held at FOSDEM this year. The contrast between the 100 people at the unconference, compared to the 10,000 people at the main event, couldn't be bigger.

commoncrawl.org/blog/ai-plum...

17.02.2026 22:26 👍 1 🔁 0 💬 0 📌 0
Common Crawl - Blog - CC-Citations: A Visualization of Research Papers Referencing Common Crawl

We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data.

commoncrawl.org/blog/cc-cita...

17.02.2026 22:25 👍 1 🔁 0 💬 0 📌 0
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

13.02.2026 19:27 👍 22 🔁 12 💬 1 📌 0
Common Crawl - Blog - CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data We are excited to announce the release of CommonLID, a language identification benchmark for the web, covering 109 languages. CommonLID was developed in collaboration with multiple open-source organiz...

Read the blogpost: commoncrawl.org/blog/commonl...

Dataset: huggingface.co/datasets/com...

Preprint: www.arxiv.org/abs/2601.18026

10.02.2026 20:44 👍 3 🔁 0 💬 0 📌 0

CommonLID can help us to create the next generation of open-source LangID models, which can in turn help create larger multilingual datasets. We would like to thank members of Masakhane and @seacrowd.bsky.social for their support in this effort.

10.02.2026 20:44 👍 3 🔁 0 💬 1 📌 0

CommonLID proves to be the most challenging dataset in our evaluation of existing LangID systems, as can be seen in the last column of the table above.

10.02.2026 20:44 👍 3 🔁 0 💬 1 📌 0
A table showing the results of our evaluation of existing LangID systems across 6 different datasets. Full text of the table is available on the paper linked below.

Current benchmarks over-estimate LangID performance on web data. In our evaluations, we show top existing models have < 80% F1, even when limiting to languages the models explicitly support.

10.02.2026 20:44 👍 3 🔁 0 💬 1 📌 0
Examples of mislabeled web text by existing LangID systems. A full text version is available on the blog post below.

Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org @eleutherai.bsky.social @jhu.edu and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!

10.02.2026 20:44 👍 11 🔁 5 💬 1 📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026

The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.

www.commoncrawl.org/blog/host--a...

02.02.2026 18:01 👍 1 🔁 0 💬 0 📌 0
Common Crawl - Blog - January 2026 Crawl Archive Now Available

We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.

www.commoncrawl.org/blog/january...

02.02.2026 18:01 👍 3 🔁 0 💬 0 📌 0
Common Crawl - Blog - Web Archives for Social Sciences Datathon, Bristol

Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.

www.commoncrawl.org/blog/web-arc...

02.02.2026 18:00 👍 1 🔁 0 💬 0 📌 0
Common Crawl - Blog - How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals

As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.

commoncrawl.org/blog/how-seo...

21.01.2026 01:18 👍 4 🔁 0 💬 0 📌 0
Common Crawl - Blog - GneissWeb Annotations Examples

GneissWeb Annotations Examples

A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.

commoncrawl.org/blog/gneissw...

16.01.2026 13:26 👍 2 🔁 1 💬 0 📌 0
Common Crawl - Blog - Common Crawl at the Mozilla Festival 2025

From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.

www.commoncrawl.org/blog/common-...

08.01.2026 13:49 👍 0 🔁 0 💬 0 📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.

commoncrawl.org/blog/host--a...

02.01.2026 00:17 👍 0 🔁 0 💬 0 📌 0
Common Crawl - Blog - December 2025 Crawl Archive Now Available

The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content).

commoncrawl.org/blog/decembe...

02.01.2026 00:17 👍 0 🔁 0 💬 0 📌 0
Common Crawl - Blog - A Sampling of 2025 Research Referencing Common Crawl

As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl's datasets are used and referenced.

commoncrawl.org/blog/a-sampl...

18.12.2025 17:26 👍 2 🔁 0 💬 0 📌 0
Laurie Burchell at a lectern presenting her Turing Seminar talk

Laurie Burchell at a lectern, with a blackboard behind her, presenting her Turing Seminar talk

A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled "Common Crawl: open web data for everybody."

In this talk, she introduced the @commoncrawl.bsky.social and the data products they offer.

27.11.2025 13:05 👍 6 🔁 2 💬 0 📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.

commoncrawl.org/blog/host--a...

24.11.2025 17:46 👍 2 🔁 1 💬 0 📌 0
Common Crawl - Blog - November 2025 Crawl Archive Now Available

We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content.

commoncrawl.org/blog/novembe...

24.11.2025 15:49 👍 3 🔁 0 💬 0 📌 0
Banner for the World Digital Preservation Day, 6th of November 2025

Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?

commoncrawl.org/blog/common-...

06.11.2025 14:56 👍 3 🔁 0 💬 0 📌 0
Common Crawl - Blog - Setting the Record Straight: Common Crawl's Commitment to Transparency, Fair Use, and the Public Good

Setting the Record Straight

A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has "lied to publishers" about our activities.

commoncrawl.org/blog/setting...

04.11.2025 22:38 👍 2 🔁 0 💬 0 📌 0
Common Crawl - Blog - October/November 2025 Newsletter

Check out our newsletter for October/November 2025, with updates on what we've been up to

commoncrawl.org/blog/october...

04.11.2025 22:37 👍 2 🔁 1 💬 0 📌 0