PDF Text Extractor

PDF Text Extractor subpackage. Contains modules and utilities for extracting text from PDF documents.

schema_miner.pdf_text_extractor.all_pdf_text_extraction(source_filepath: str, destination_filepath: str, pdf_parser: PDF_Parser = <schema_miner.services.PDF_Parsers.PyPDF_pdf_parser.PyPDF_PDF_Parser object>) → None

Extract text content from all PDF files in a source directory and save them as Markdown files.

Each PDF file in the source directory is parsed with the supplied parser; if none is supplied, the default from the signature (a PyPDF_PDF_Parser instance) is used. The extracted content of each PDF is saved as a separate Markdown file in the destination directory.

Parameters:
  • source_filepath (str) – Path to the directory containing PDF files.

  • destination_filepath (str) – Path to the directory where extracted Markdown files will be saved.

  • pdf_parser (PDF_Parser) – A PDF parser instance. Optional; defaults to a PyPDF_PDF_Parser instance, as shown in the signature.
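
The batch behaviour described above can be sketched as follows. This is a minimal illustration only, not the package's implementation: extract_all, TextParser, StubParser and the parse method are hypothetical names introduced here, the real PDF_Parser interface may differ, and naming each output file after the PDF's stem is an assumption.

```python
from pathlib import Path
from typing import List, Protocol


class TextParser(Protocol):
    """Hypothetical parser interface; the real PDF_Parser API may differ."""

    def parse(self, pdf_path: Path) -> str: ...


class StubParser:
    """Stand-in parser that returns placeholder Markdown instead of reading the PDF."""

    def parse(self, pdf_path: Path) -> str:
        return f"# {pdf_path.stem}\n\n(extracted text would go here)\n"


def extract_all(source_dir: str, destination_dir: str, parser: TextParser) -> List[Path]:
    """Parse every *.pdf in source_dir and write one .md file per PDF."""
    out_dir = Path(destination_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the destination if missing
    written = []
    for pdf_path in sorted(Path(source_dir).glob("*.pdf")):
        # Assumption: the output file is named after the PDF's stem.
        md_path = out_dir / f"{pdf_path.stem}.md"
        md_path.write_text(parser.parse(pdf_path), encoding="utf-8")
        written.append(md_path)
    return written
```

With the package installed, the corresponding call would presumably pass the same two directory paths to all_pdf_text_extraction, optionally followed by a parser instance.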

schema_miner.pdf_text_extractor.pdf_text_extractor(source_filename: str, destination_filepath: str | None = None, pdf_parser: PDF_Parser = <schema_miner.services.PDF_Parsers.PyPDF_pdf_parser.PyPDF_PDF_Parser object>, return_text: bool = False) → None | str

Extract text content from a PDF file and optionally save it to a Markdown file.

This function parses a single PDF document with the supplied parser, or with the default parser if none is given. If destination_filepath is provided, the extracted content is saved as a Markdown file in that directory. If return_text is True, the extracted text is also returned.

Parameters:
  • source_filename (str) – Path to the source PDF file to be processed.

  • destination_filepath (str | None, optional) – Path to the destination directory where the extracted text will be saved. Defaults to None.

  • pdf_parser (PDF_Parser) – A PDF parser instance. Optional; defaults to a PyPDF_PDF_Parser instance, as shown in the signature.

  • return_text (bool, optional) – If True, returns the extracted text in addition to saving it. Defaults to False.

Returns None | str:

Extracted text content if return_text is True, otherwise None.
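
The interplay between destination_filepath and return_text can be sketched as follows. This is an illustration under stated assumptions, not the package's code: extract_one and stub_parse are hypothetical stand-ins, and it is assumed that nothing is written when no destination is given and that the output file is named after the PDF's stem.

```python
from pathlib import Path
from typing import Optional


def stub_parse(pdf_path: Path) -> str:
    """Stand-in for a real parser call; returns placeholder Markdown."""
    return f"# {pdf_path.stem}\n"


def extract_one(source_filename: str,
                destination_dir: Optional[str] = None,
                return_text: bool = False) -> Optional[str]:
    """Extract text from one PDF, optionally saving and/or returning it."""
    source = Path(source_filename)
    text = stub_parse(source)
    # Assumption: with no destination, the save step is skipped entirely.
    if destination_dir is not None:
        out_dir = Path(destination_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{source.stem}.md").write_text(text, encoding="utf-8")
    # The text is handed back only on request; otherwise the result is None.
    return text if return_text else None
```

Under these assumptions, calling extract_one("paper.pdf", return_text=True) yields the text without touching the disk, while passing a destination directory writes paper.md there.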