PDF Text Extractor
PDF Text Extractor subpackage. Contains modules and utilities for extracting text from PDF documents.
- schema_miner.pdf_text_extractor.all_pdf_text_extraction(source_filepath: str, destination_filepath: str, pdf_parser: PDF_Parser = <schema_miner.services.PDF_Parsers.PyPDF_pdf_parser.PyPDF_PDF_Parser object>) None
Extract text content from all PDF files in a source directory and save them as Markdown files.
Each PDF file in the source directory is parsed using the provided parser (or the default Docling_PDF_Parser if none is supplied). The extracted content is saved as individual Markdown files in the specified destination directory.
- Parameters:
source_filepath (str) – Path to the directory containing PDF files.
destination_filepath (str) – Path to the directory where extracted Markdown files will be saved.
pdf_parser (PDF_Parser) – A PDF parser instance. If None, a default parser (Docling_PDF_Parser) is used.
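The directory-level behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the schema_miner implementation: `TextParser`, its `parse` method, and `extract_all` are hypothetical names introduced here, and the real PDF_Parser interface may differ.

```python
from pathlib import Path
from typing import Protocol


class TextParser(Protocol):
    """Hypothetical parser interface; the real PDF_Parser API may differ."""

    def parse(self, pdf_path: Path) -> str: ...


def extract_all(source_dir: str, destination_dir: str, parser: TextParser) -> list[Path]:
    """Write one Markdown file per PDF found in source_dir and return the written paths."""
    out_dir = Path(destination_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the destination if it is missing

    written = []
    # Sorted so the processing order is deterministic; only top-level *.pdf files are matched.
    for pdf_path in sorted(Path(source_dir).glob("*.pdf")):
        md_path = out_dir / f"{pdf_path.stem}.md"
        md_path.write_text(parser.parse(pdf_path), encoding="utf-8")
        written.append(md_path)
    return written
```

Any object with a `parse(path) -> str` method can be passed as `parser` in this sketch, which mirrors how the documented function accepts an interchangeable parser instance.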
- schema_miner.pdf_text_extractor.pdf_text_extractor(source_filename: str, destination_filepath: str | None = None, pdf_parser: PDF_Parser = <schema_miner.services.PDF_Parsers.PyPDF_pdf_parser.PyPDF_PDF_Parser object>, return_text: bool = False) None | str
Extract text content from a PDF file and optionally save it to a Markdown file.
This function parses a given PDF document using the provided or default parser. The extracted content is saved as a Markdown file in the specified destination directory. Optionally, the extracted text can also be returned.
- Parameters:
source_filename (str) – Path to the source PDF file to be processed.
destination_filepath (str | None, optional) – Path to the destination directory where the extracted text will be saved. Defaults to None.
pdf_parser (PDF_Parser) – A PDF parser instance. If None, a default parser (Docling_PDF_Parser) is used.
return_text (bool, optional) – If True, returns the extracted text in addition to saving it. Defaults to False.
- Returns None | str:
Extracted text content if return_text is True, otherwise None.
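The interplay between saving and returning text can be illustrated with a small stand-in. This sketch is not the library's code: `extract_one` and the `parse` callable are hypothetical, and it assumes that a destination of None skips the save step, which the documentation above does not state explicitly.

```python
from pathlib import Path
from typing import Callable, Optional


def extract_one(
    source_filename: str,
    parse: Callable[[Path], str],
    destination_dir: Optional[str] = None,
    return_text: bool = False,
) -> Optional[str]:
    """Parse one PDF, optionally save the text as Markdown, optionally return it."""
    pdf_path = Path(source_filename)
    text = parse(pdf_path)

    # Assumption: a destination of None means "do not write a file".
    if destination_dir is not None:
        out_dir = Path(destination_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{pdf_path.stem}.md").write_text(text, encoding="utf-8")

    # Mirrors the documented contract: text only when return_text is True.
    return text if return_text else None
```

Under these assumptions, calling the function with `return_text=True` and no destination gives the text without touching the filesystem, while passing a destination and leaving `return_text` at its default writes the Markdown file and returns None.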