Extract PDF Tables to Text, Excel,and CSV

Extracting table data from PDF files can be a challenging task due to the complex nature of PDF documents. Unlike simple text extraction, tables require careful handling to preserve the tabular structure and relationships between rows and columns. Instead of manually extracting data from numerous PDF tables, you can streamline and automate this process programmatically. In this article, we will demonstrate how to extract PDF tables to text, Excel and CSV using Python.

Python Libraries to Extract PDF Tables to Text, Excel, and CSV

To extract data from PDF tables to text, excel, and CSV files, we can use Spire.PDF for Python and Spire.XLS for Python libraries. Spire.PDF for Python is mainly used for extracting table data from PDF, and Spire.XLS for Python is mainly used for saving the extracted table data to Excel and CSV files.

You can run the following pip commands in your project's terminal to install Spire.PDF for Python and Spire.XLS for Python:

pip install Spire.Pdf
pip install Spire.Xls

If you already have Spire.PDF for Python and Spire.XLS for Python installed and would like to upgrade to the latest versions, use the following pip commands:

pip install --upgrade Spire.Pdf
pip install --upgrade Spire.Xls

Extract PDF Tables to Text in Python

The PdfTableExtractor.ExtractTable(pageIndex: int) function provided by Spire.PDF for Python allows you to access the tables within a PDF. Once accessed, you can use PdfTable.GetText(rowIndex: int, columnIndex: int) function to easily retrieve data from the tables. Then, you can save the retrieved data to a text file for later use.

The example below shows how to extract table data from a PDF file and save the result to a text file using Python and Spire.PDF for Python:

  from spire.pdf import *
  from spire.xls import *

  # Define an extract_table_data function to extract table data from PDF
  def extract_table_data(pdf_path):
      # Create an instance of the PdfDocument class
      doc = PdfDocument()
      
      try:
          # Load a PDF document
          doc.LoadFromFile(pdf_path)
          # Create a list to store the extracted table data
          table_data = []
  
          # Create an instance of the PdfTableExtractor class
          extractor = PdfTableExtractor(doc)
  
          # Iterate through the pages in the PDF document
          for page_index in range(doc.Pages.Count):
              # Get tables within each page
              tables = extractor.ExtractTable(page_index)
              if tables is not None and len(tables) > 0:
  
                  # Iterate through the tables
                  for table_index, table in enumerate(tables):
                      row_count = table.GetRowCount()
                      col_count = table.GetColumnCount()
  
                      table_data.append(f"Table {table_index + 1} of Page {page_index + 1}:\n")
  
                      # Extract data from each table and append the data to the table_data list
                      for row_index in range(row_count):
                          row_data = []
                          for column_index in range(col_count):
                              data = table.GetText(row_index, column_index)
                              row_data.append(data.strip())
                          table_data.append("  ".join(row_data))
  
                      table_data.append("\n")
  
          return table_data
  
      except Exception as e:
          print(f"Error occurred: {str(e)}")
          return None
  
  # Define a save_table_data_to_text function to save the table data extracted from a PDF to a text file
  def save_table_data_to_text(table_data, output_path):
      try:
          with open(output_path, "w", encoding="utf-8") as file:
              file.write("\n".join(table_data))
          print(f"Table data saved to '{output_path}' successfully.")
      except Exception as e:
          print(f"Error occurred while saving table data: {str(e)}")
  
  # Example usage
  pdf_path = "Tables.pdf"
  output_path = "table_data.txt"
  
  data = extract_table_data(pdf_path)
  if data:
      save_table_data_to_text(data, output_path)

Extract PDF Tables to Text, Excel, and CSV Python

Extract PDF Tables to Excel in Python

Extracting PDF tables to Excel is useful when you need to perform further analysis, calculation or visualization on the tabular data. By using Spire.PDF for Python in conjunction with Spire.XLS for Python, you can easily export data from PDF tables to Excel worksheets.

The example below shows how to export data from PDF tables to Excel worksheets in Python using Spire.PDF for Python and Spire.XLS for Python:

  from spire.pdf import *
  from spire.xls import *

  # Define a function to extract data from PDF tables to Excel
  def extract_table_data_to_excel(pdf_path, xls_path):
      # Create an instance of the PdfDocument class
      doc = PdfDocument()
  
      try:
          # Load a PDF document
          doc.LoadFromFile(pdf_path)
  
          # Create an instance of the PdfTableExtractor class
          extractor = PdfTableExtractor(doc)
  
          # Create an instance of the Workbook class
          workbook = Workbook()
          # Remove the default 3 worksheets
          workbook.Worksheets.Clear()
          
          # Iterate through the pages in the PDF document
          for page_index in range(doc.Pages.Count):
              # Extract tables from each page
              tables = extractor.ExtractTable(page_index)
              if tables is not None and len(tables) > 0:
                  # Iterate through the extracted tables
                  for table_index, table in enumerate(tables):
                      # Create a new worksheet for each table
                      worksheet = workbook.CreateEmptySheet()  
                      # Set the worksheet name
                      worksheet.Name = f"Table {table_index + 1} of Page {page_index + 1}"  
                      
                      row_count = table.GetRowCount()
                      col_count = table.GetColumnCount()
  
                      # Extract data from the table and populate the worksheet
                      for row_index in range(row_count):
                          for column_index in range(col_count):
                              data = table.GetText(row_index, column_index)
                              worksheet.Range[row_index + 1, column_index + 1].Value = data.strip()
                      
                      # Auto adjust column widths of the worksheet
                      worksheet.Range.AutoFitColumns()
  
          # Save the workbook to the specified Excel file
          workbook.SaveToFile(xls_path, ExcelVersion.Version2013)
  
      except Exception as e:
          print(f"Error occurred: {str(e)}")
  
  # Example usage
  pdf_path = "Tables.pdf"
  xls_path = "table_data.xlsx"
  extract_table_data_to_excel(pdf_path, xls_path)

Extract PDF Tables to Excel in Python

The example below shows how to export data from PDF tables to Excel worksheets in Python using Spire.PDF for Python and Spire.XLS for Python:

  from spire.pdf import *
  from spire.xls import *

  # Define a function to extract data from PDF tables to Excel
  def extract_table_data_to_excel(pdf_path, xls_path):
      # Create an instance of the PdfDocument class
      doc = PdfDocument()
  
      try:
          # Load a PDF document
          doc.LoadFromFile(pdf_path)
  
          # Create an instance of the PdfTableExtractor class
          extractor = PdfTableExtractor(doc)
  
          # Create an instance of the Workbook class
          workbook = Workbook()
          # Remove the default 3 worksheets
          workbook.Worksheets.Clear()
          
          # Iterate through the pages in the PDF document
          for page_index in range(doc.Pages.Count):
              # Extract tables from each page
              tables = extractor.ExtractTable(page_index)
              if tables is not None and len(tables) > 0:
                  # Iterate through the extracted tables
                  for table_index, table in enumerate(tables):
                      # Create a new worksheet for each table
                      worksheet = workbook.CreateEmptySheet()  
                      # Set the worksheet name
                      worksheet.Name = f"Table {table_index + 1} of Page {page_index + 1}"  
                      
                      row_count = table.GetRowCount()
                      col_count = table.GetColumnCount()
  
                      # Extract data from the table and populate the worksheet
                      for row_index in range(row_count):
                          for column_index in range(col_count):
                              data = table.GetText(row_index, column_index)
                              worksheet.Range[row_index + 1, column_index + 1].Value = data.strip()
                      
                      # Auto adjust column widths of the worksheet
                      worksheet.Range.AutoFitColumns()
  
          # Save the workbook to the specified Excel file
          workbook.SaveToFile(xls_path, ExcelVersion.Version2013)
  
      except Exception as e:
          print(f"Error occurred: {str(e)}")
  
  # Example usage
  pdf_path = "Tables.pdf"
  xls_path = "table_data.xlsx"
  extract_table_data_to_excel(pdf_path, xls_path)

Extract PDF Tables to CSV in Python

CSV is a universal format that can be opened and processed by spreadsheet software, databases, programming languages, and data analysis tools. Extracting PDF tables to CSV format makes the data easily accessible and compatible with a wide range of applications and tools.

The example below shows how to export data from PDF tables to CSV files in Python using Spire.PDF for Python and Spire.XLS for Python:

  from spire.pdf import *
  from spire.xls import *

  # Define a function to extract data from PDF tables to CSV
  def extract_table_data_to_csv(pdf_path, csv_directory):
      # Create an instance of the PdfDocument class
      doc = PdfDocument()
  
      try:
          # Load a PDF document
          doc.LoadFromFile(pdf_path)
  
          # Create an instance of the PdfTableExtractor class
          extractor = PdfTableExtractor(doc)
  
          # Create an instance of the Workbook class
          workbook = Workbook()
          # Remove the default 3 worksheets
          workbook.Worksheets.Clear()
          
          # Iterate through the pages in the PDF document
          for page_index in range(doc.Pages.Count):
              # Extract tables from each page
              tables = extractor.ExtractTable(page_index)
              if tables is not None and len(tables) > 0:
                  # Iterate through the extracted tables
                  for table_index, table in enumerate(tables):
                      # Create a new worksheet for each table
                      worksheet = workbook.CreateEmptySheet()  
   
                      row_count = table.GetRowCount()
                      col_count = table.GetColumnCount()
  
                      # Extract data from the table and populate the worksheet
                      for row_index in range(row_count):
                          for column_index in range(col_count):
                              data = table.GetText(row_index, column_index)
                              worksheet.Range[row_index + 1, column_index + 1].Value = data.strip()
                      
                      csv_name = csv_directory + f"Table {table_index + 1} of Page {page_index + 1}" + ".csv"
  
                      # Save each worksheet to a separate CSV file
                      worksheet.SaveToFile(csv_name, ",", Encoding.get_UTF8())
  
      except Exception as e:
          print(f"Error occurred: {str(e)}")
  
  # Example usage
  pdf_path = "Tables.pdf"
  csv_directory = "CSV/"
  extract_table_data_to_csv(pdf_path, csv_directory)

Conclusion

This article demonstrated how to retrieve data from PDF tables to text, Excel and CSV files using Python. We hope you find it helpful.

PythonTechnoGyan

Search Suggest

Extract PDF Tables to Text, Excel,and CSV

Python Libraries to Extract PDF Tables to Text, Excel, and CSV

Extract PDF Tables to Text in Python

Extract PDF Tables to Excel in Python

Extract PDF Tables to Excel in Python

Extract PDF Tables to CSV in Python

Conclusion

You may like these posts

2 comments

Build a LLM application with OpenAI, Flask and Pinecone

Book Management System Django Project

Top 10 Basic Python Interview Questions

Python Django Commands Cheat Sheets

Building A Blog Application With Django

50 Days to AI/ML: from Zero to Hero

PythonTechnoGyan