Overview
spaCy is a Python library for advanced Natural Language Processing, designed for building real products and gathering real insights. It is implemented in Python and Cython, which gives it the speed needed for large-scale information extraction. spaCy provides pre-trained pipelines and supports over 75 languages. It features components for named entity recognition, part-of-speech tagging, dependency parsing, and text classification. spaCy's new project system facilitates end-to-end workflows from prototype to production, allowing users to manage data transformation, preprocessing, and training steps. The spacy-llm package integrates Large Language Models (LLMs) into structured NLP pipelines. spacy-layout, a plugin that integrates with Docling, brings structured processing of PDFs, Word documents, and other input formats to spaCy pipelines. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects.
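
To make the pipeline idea concrete, here is a minimal sketch that assumes only that spaCy v3 is installed. It deliberately uses a blank English pipeline with rule-based components instead of a pre-trained pipeline, so no model download is needed. The example sentence and the entity patterns are illustrative assumptions, not part of spaCy itself.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Built-in rule-based sentence splitter.
nlp.add_pipe("sentencizer")

# Built-in rule-based entity matcher. The patterns are made up for
# this example; a pre-trained pipeline would predict entities instead.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Explosion"},
    {"label": "PRODUCT", "pattern": "spaCy"},
])

# Calling the pipeline on text returns a Doc object.
doc = nlp("Explosion develops spaCy. It processes text quickly.")

tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(tokens)
print(sentences)
print(entities)
```

A pre-trained pipeline loaded with spacy.load would expose the same Doc interface, with statistical components such as the tagger, parser, and named entity recognizer in place of the rules used here.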
