What kind of metadata is included with The Stack?

The Stack includes metadata such as file size, language, extensions, and repository information.
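
As a rough illustration of how this metadata might be used, the sketch below filters records by language and file size. The field names (`lang`, `ext`, `size`, `max_stars_repo_name`) are assumptions about the record schema, and the sample records are made up for the example; check the dataset card for the actual column names.

```python
def keep_small_files(records, language, max_bytes=10_000):
    """Return the records written in `language` that are at most `max_bytes` long.

    Field names ("lang", "size") are assumed, not confirmed by this page.
    """
    return [
        r for r in records
        if r.get("lang") == language and r.get("size", 0) <= max_bytes
    ]


# Hypothetical records shaped like the metadata described above.
sample_records = [
    {"lang": "Python", "ext": "py", "size": 512, "max_stars_repo_name": "example/repo-a"},
    {"lang": "Python", "ext": "py", "size": 48_000, "max_stars_repo_name": "example/repo-b"},
    {"lang": "Java", "ext": "java", "size": 900, "max_stars_repo_name": "example/repo-c"},
]

small_python = keep_small_files(sample_records, "Python")
```

The same predicate could be applied to records read from the real dataset, once the field names have been verified.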

What programming languages are covered by The Stack?

The Stack covers 358 programming languages, including popular ones such as Python, Java, JavaScript, and C++.

The Stack

Overview

The Stack is a dataset of over 6 TB of permissively licensed source code files spanning 358 programming languages. Created as part of the BigCode Project, it is designed for training Large Language Models for Code (Code LLMs), supporting the development of AI systems that can synthesize programs from natural language descriptions and code snippets. Each data point comes with provenance information so that users can comply with the original licenses. The Stack is updated regularly to reflect data removal requests, and it aims to promote reproducibility in Code LLM training by offering an open, large-scale resource. Users must agree to the terms of use, which cover attribution and keeping their copy of the dataset up to date, to ensure responsible usage.

Common tasks

Code Completion, Documentation Generation, Code Auto-Completion, Pre-training, Code Generation, Code Understanding, Bug Detection, Code Translation

FAQ

What is The Stack?

The Stack is a large-scale dataset of permissively-licensed source code files covering 358 programming languages, designed for training Large Language Models for Code (Code LLMs).

How do I access The Stack?

You can access The Stack via the Hugging Face Datasets library using the `load_dataset` function.
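
A minimal sketch of what that might look like, assuming the dataset is published on the Hugging Face Hub as `bigcode/the-stack`, that each language lives under a `data/<language>` subdirectory, and that you have accepted the terms of use and logged in. The helper names `stack_data_dir` and `stream_stack_files` are hypothetical and exist only for this example.

```python
from itertools import islice


def stack_data_dir(language: str) -> str:
    """Build the per-language subdirectory (assumed layout: data/<language>)."""
    return f"data/{language.strip().lower()}"


def stream_stack_files(language: str, limit: int = 3):
    """Yield up to `limit` records for one language without a full download."""
    # Imported lazily so stack_data_dir() works without the library installed.
    from datasets import load_dataset

    ds = load_dataset(
        "bigcode/the-stack",                # assumed Hub repository id
        data_dir=stack_data_dir(language),  # restrict loading to one language
        split="train",
        streaming=True,                     # iterate lazily instead of downloading everything
    )
    yield from islice(ds, limit)
```

Iterating over `stream_stack_files("python", limit=1)` would then yield one record as a dictionary. Streaming is used here because the full dataset is several terabytes; dropping `streaming=True` would download the selected subset to disk instead.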

What are the terms of use for The Stack?

The terms of use require you to abide by the original licenses of the code, update your version of the dataset to reflect data removal requests, and include the terms of use when sharing the dataset.

How is The Stack updated?

The Stack is regularly updated to enact validated data removal requests. Users are notified of these updates via email and community discussions.
