Posted in

数据集会议:关于开放许可的 LLM 数据集的社区研讨会_AI阅读总结 — 包阅AI

包阅导读总结

1.

“`

Open AI Datasets、Mozilla、EleutherAI、Open Access、Best Practices

“`

2. 核心信息:Mozilla 和 EleutherAI 于 6 月 11 日在阿姆斯特丹召集专家讨论开放 LLM 训练数据集的创建及挑战。尽管面临诸多困难,但相信开放数据是公共福祉。活动旨在形成技术建议和最佳实践,未来将与参与者合作发布成果。

3.

– 背景

– Mozilla 和 EleutherAI 举办 Dataset Convening 社区研讨会。

– 召集 30 位学者和从业者讨论开放 LLM 训练数据集相关问题。

– 现状

– 竞争和法律风险致训练数据集分享减少。

– 但开源软件成功,开放数据也应是公共福祉。

– 已有一些开放 LLM 训练数据集。

– 讨论内容

– 探讨如何保证数据集合法合规、公正可持续。

– 涉及数据获取、管理、发布等方面的问题。

– 强调需要财务可持续和基础设施投资。

– 后续计划

– 与参与者合作开发共同成果并发布。

– 这是 Mozilla Convening 系列活动之一,将继续支持开放 AI 社区。

思维导图:

文章地址:https://blog.mozilla.org/en/mozilla/dataset-convening/

文章来源:blog.mozilla.org

作者:Kasia Odrozek and Stella Bidermann

发布时间:2024/7/2 14:17

语言:英文

总字数:1104字

预计阅读时间:5分钟

评分:84分

标签:人工智能,Mozilla,主页


以下为原文内容

本内容来源于用户推荐转载,旨在分享知识与观点,如有侵权请联系删除 联系邮箱 media@ilingban.com

A group photo of 27 people standing together in a room with a colorful cityscape mural on the wall behind them.
Participants of the Dataset Convening in Amsterdam.

Mozilla and EleutherAI brought together experts to discuss a critical question: How do we create openly licensed and open-access LLM training datasets and how do we tackle the challenges faced by their builders?


On June 11, on the eve of MozFest House in Amsterdam, Mozilla and EleutherAI convened an exclusive group of 30 leading scholars and practitioners from prominent open-source AI startups, nonprofit AI labs and civil society organizations to discuss emerging practices for a new focus within the open LLM community: creating open-access and openly licensed LLM training datasets.

This work is timely. Although sharing training datasets was once common practice among many AI actors, increased competitive pressures and legal risks have made it almost unheard of nowadays for pre-training datasets to be shared or even described by their developers. However, just as open-source software has made the internet safer and more robust, we at Mozilla and EleutherAI believe open-access data is a public good that can empower developers worldwide to build upon each other’s work. It fosters competition, innovation and transparency, providing clarity around legal standing and an ability to stand up to scrutiny.

Leading AI companies want us to believe that training performant LLMs without copyrighted material is impossible. We refuse to believe this. An emerging ecosystem of open LLM developers have created LLM training datasets —such as Common Corpus, YouTube-Commons, Fine Web, Dolma, Aya, Red Pajama and many more—that could provide blueprints for more transparent and responsible AI progress. We were excited to invite many of them to join us in Amsterdam for a series of discussions about the challenges and opportunities of building an alternative to the current status quo that is open, legally compliant and just.
During the event, we drew on the learnings from assembling “Common Pile” (the soon-to-be-released dataset by EleutherAI composed only of openly licensed and public domain data) which incorporates many learnings from its hugely successful predecessor, “The Pile.” At the event, EleutherAI released a technical briefing and an invitation to public consultation on Common Pile.

A speaker holding a microphone gestures while speaking, with a screen displaying "The Dataset Convening" in the background.
Participants engaged in a discussion at “The Dataset Convening,” hosted by Mozilla and EleutherAI on June 11, 2024 to explore creating open-access and openly licensed LLM training datasets.

Our goal with the convening was to bring in the experiences of open dataset builders to develop normative and technical recommendations and best practices around openly licensed and open-access datasets. Below are some highlights of our discussion:

  • Openness alone does not guarantee legal compliance or ethical outcomes, we asked which decision points can contribute to datasets being more just and sustainable in terms of public good and data rights.
  • We discussed what “good” looks like, what we want to avoid, what is realistic and what is already being implemented in the realm of sourcing, curating, governing and releasing open training datasets.
  • Issues such as the cumbersome nature of sourcing public domain and openly licensed data (e.g. extracting text from PDFs), manual verification of metadata, legal status of data across jurisdictions, retractability of consent, preference signaling, reproducibility and data curation and filtering were recurring themes in almost every discussion.
  • To enable more builders to develop open datasets and unblock the ecosystem, we need financial sustainability and smart infrastructural investments that can unblock the ecosystem.
  • The challenges faced by open datasets today bear a resemblance to those encountered in the early days of open source software (data quality, standardization and sustainability). Back then, it was the common artifacts that united the community and provided some shared understanding and language. We saw the Dataset Convening as an opportunity to start exactly there and create shared reference points that, even if not perfect, will guide us in a common direction.
  • The final insight round underscored that we have much to learn from each other: we are still in the early days of solving this immense challenge, and this nascent community needs to collaborate and think in radical and bold ways.
A group of four people sitting around a table with laptops and documents, engaged in a discussion. One person types on a laptop, while others look at papers and a phone. A colorful graffiti mural is on the wall behind them.
Participants at the Mozilla and EleutherAI event collaborating on best practices for creating open-access and openly licensed LLM training datasets.

We are immensely grateful to the participants in the Dataset Convening (including some remote contributors):

  • Stefan Baack — Researcher and Data Analyst, Insights, Mozilla
  • Mitchell Baker — Chairwoman, Mozilla Foundation
  • Ayah Bdeir — Senior Advisor, Mozilla
  • Julie Belião — Senior Director of Product Innovation, Mozilla.ai
  • Jillian Bommarito — Chief Risk Officer, 273 Ventures
  • Kasia Chmielinski — Project Lead, Data Nutrition Project
  • Jennifer Ding — Senior Researcher, Alan Turing Institute
  • Alix Dunn — CEO, Computer Says Maybe
  • Marzieh Fadaee — Senior Research Scientist, Cohere For AI
  • Maximilian Gahntz — AI Policy Lead, Mozilla
  • Paul Keller — Director of Policy and Co-Founder, Open Future
  • Hynek Kydlíček — Machine Learning Engineer, HuggingFace
  • Pierre-Carl Langlais — Co-Founder, Pleias
  • Greg Leppert — Director of Product and Research, the Library Innovation Lab, Harvard
  • EM Lewis-Jong — Director, Common Voice, Mozilla
  • Shayne Longpre — Project Lead, Data Provenance Initiative
  • Angela Lungati — Executive Director, Ushahidi
  • Sebastian Majstorovic — Open Data Specialist, EleutherAI
  • Cullen Miller — Vice President of Policy, Spawning
  • Victor Miller — Senior Product Manager, LLM360
  • Kasia Odrozek — Director, Insights, Mozilla
  • Guilherme Penedo — Machine Learning Research Engineer, HuggingFace
  • Neha Ravella — Research Project Manager, Insights Mozilla
  • Michael Running Wolf — Co-Founder and Lead Architect, First Languages AI Reality, Mila
  • Max Ryabinin — Distinguished Research Scientist, Together AI
  • Kat Siminyu — Researcher, The Distributed AI Research Institute
  • Aviya Skowron — Head of Policy and Ethics, EleutherAI
  • Andrew Strait — Associate Director, Ada Lovelace Institute
  • Mark Surman — President, Mozilla Foundation
  • Anna Tumadóttir — CEO, Creative Commons
  • Marteen Van Segbroeck — Head of Applied Science, Gretel
  • Leandro von Werra — Chief Loss Officer, HuggingFace
  • Maurice Weber — AI Researcher, Together AI
  • Lee White — Senior Full Stack Developer, Ushahidi
  • Thomas Wolf — Chief Science Officer and Co-Founder, HuggingFace

In the coming weeks, we will be working with the participants to develop common artifacts that will be released to the community, along with an accompanying paper. These resources will help researchers and practitioners navigate the definitional and executional complexities of advancing open-access and openly licensed datasets and strengthen the sense of community.

The event was part of the Mozilla Convening Series, where we bring together leading innovators in open source AI to tackle thorny issues and help move the community and movement forward. Our first convening was the Columbia Convening where we invited 40 leading scholars and practitioners to develop a framework for defining what openness means in AI. We are committed to continuing the efforts to support communities invested in openness around AI and look forward to helping grow and strengthen this movement.