Extracting table data from PDF files can be a challenging task due to the complex nature of PDF documents. Unlike simple text extraction, tables require careful handling to preserve the tabular structure and relationships between rows and columns. Instead of manually extracting data from numerous PDF tables, you can streamline and automate this process programmatically. In this article, we will demonstrate how to extract PDF tables to text, Excel and CSV using Python.
Python Libraries to Extract PDF Tables to Text, Excel, and CSV
To extract data from PDF tables to text, excel, and CSV files, we can use Spire.PDF for Python and Spire.XLS for Python libraries. Spire.PDF for Python is mainly used for extracting table data from PDF, and Spire.XLS for Python is mainly used for saving the extracted table data to Excel and CSV files.
You can run the following pip commands in your project's terminal to install Spire.PDF for Python and Spire.XLS for Python:
pip install Spire.Pdf pip install Spire.Xls
If you already have Spire.PDF for Python and Spire.XLS for Python installed and would like to upgrade to the latest versions, use the following pip commands:
pip install --upgrade Spire.Pdf pip install --upgrade Spire.Xls
Extract PDF Tables to Text in Python
The PdfTableExtractor.ExtractTable(pageIndex: int) function provided by Spire.PDF for Python allows you to access the tables within a PDF. Once accessed, you can use PdfTable.GetText(rowIndex: int, columnIndex: int) function to easily retrieve data from the tables. Then, you can save the retrieved data to a text file for later use.
The example below shows how to extract table data from a PDF file and save the result to a text file using Python and Spire.PDF for Python:
from spire.pdf import * from spire.xls import * # Define an extract_table_data function to extract table data from PDF def extract_table_data(pdf_path): # Create an instance of the PdfDocument class doc = PdfDocument() try: # Load a PDF document doc.LoadFromFile(pdf_path) # Create a list to store the extracted table data table_data = [] # Create an instance of the PdfTableExtractor class extractor = PdfTableExtractor(doc) # Iterate through the pages in the PDF document for page_index in range(doc.Pages.Count): # Get tables within each page tables = extractor.ExtractTable(page_index) if tables is not None and len(tables) > 0: # Iterate through the tables for table_index, table in enumerate(tables): row_count = table.GetRowCount() col_count = table.GetColumnCount() table_data.append(f"Table {table_index + 1} of Page {page_index + 1}:\n") # Extract data from each table and append the data to the table_data list for row_index in range(row_count): row_data = [] for column_index in range(col_count): data = table.GetText(row_index, column_index) row_data.append(data.strip()) table_data.append(" ".join(row_data)) table_data.append("\n") return table_data except Exception as e: print(f"Error occurred: {str(e)}") return None # Define a save_table_data_to_text function to save the table data extracted from a PDF to a text file def save_table_data_to_text(table_data, output_path): try: with open(output_path, "w", encoding="utf-8") as file: file.write("\n".join(table_data)) print(f"Table data saved to '{output_path}' successfully.") except Exception as e: print(f"Error occurred while saving table data: {str(e)}") # Example usage pdf_path = "Tables.pdf" output_path = "table_data.txt" data = extract_table_data(pdf_path) if data: save_table_data_to_text(data, output_path)
Extract PDF Tables to Excel in Python
Extracting PDF tables to Excel is useful when you need to perform further analysis, calculation or visualization on the tabular data. By using Spire.PDF for Python in conjunction with Spire.XLS for Python, you can easily export data from PDF tables to Excel worksheets.
The example below shows how to export data from PDF tables to Excel worksheets in Python using Spire.PDF for Python and Spire.XLS for Python:
from spire.pdf import * from spire.xls import * # Define a function to extract data from PDF tables to Excel def extract_table_data_to_excel(pdf_path, xls_path): # Create an instance of the PdfDocument class doc = PdfDocument() try: # Load a PDF document doc.LoadFromFile(pdf_path) # Create an instance of the PdfTableExtractor class extractor = PdfTableExtractor(doc) # Create an instance of the Workbook class workbook = Workbook() # Remove the default 3 worksheets workbook.Worksheets.Clear() # Iterate through the pages in the PDF document for page_index in range(doc.Pages.Count): # Extract tables from each page tables = extractor.ExtractTable(page_index) if tables is not None and len(tables) > 0: # Iterate through the extracted tables for table_index, table in enumerate(tables): # Create a new worksheet for each table worksheet = workbook.CreateEmptySheet() # Set the worksheet name worksheet.Name = f"Table {table_index + 1} of Page {page_index + 1}" row_count = table.GetRowCount() col_count = table.GetColumnCount() # Extract data from the table and populate the worksheet for row_index in range(row_count): for column_index in range(col_count): data = table.GetText(row_index, column_index) worksheet.Range[row_index + 1, column_index + 1].Value = data.strip() # Auto adjust column widths of the worksheet worksheet.Range.AutoFitColumns() # Save the workbook to the specified Excel file workbook.SaveToFile(xls_path, ExcelVersion.Version2013) except Exception as e: print(f"Error occurred: {str(e)}") # Example usage pdf_path = "Tables.pdf" xls_path = "table_data.xlsx" extract_table_data_to_excel(pdf_path, xls_path)
Extract PDF Tables to Excel in Python
Extracting PDF tables to Excel is useful when you need to perform further analysis, calculation or visualization on the tabular data. By using Spire.PDF for Python in conjunction with Spire.XLS for Python, you can easily export data from PDF tables to Excel worksheets.
The example below shows how to export data from PDF tables to Excel worksheets in Python using Spire.PDF for Python and Spire.XLS for Python:
from spire.pdf import * from spire.xls import * # Define a function to extract data from PDF tables to Excel def extract_table_data_to_excel(pdf_path, xls_path): # Create an instance of the PdfDocument class doc = PdfDocument() try: # Load a PDF document doc.LoadFromFile(pdf_path) # Create an instance of the PdfTableExtractor class extractor = PdfTableExtractor(doc) # Create an instance of the Workbook class workbook = Workbook() # Remove the default 3 worksheets workbook.Worksheets.Clear() # Iterate through the pages in the PDF document for page_index in range(doc.Pages.Count): # Extract tables from each page tables = extractor.ExtractTable(page_index) if tables is not None and len(tables) > 0: # Iterate through the extracted tables for table_index, table in enumerate(tables): # Create a new worksheet for each table worksheet = workbook.CreateEmptySheet() # Set the worksheet name worksheet.Name = f"Table {table_index + 1} of Page {page_index + 1}" row_count = table.GetRowCount() col_count = table.GetColumnCount() # Extract data from the table and populate the worksheet for row_index in range(row_count): for column_index in range(col_count): data = table.GetText(row_index, column_index) worksheet.Range[row_index + 1, column_index + 1].Value = data.strip() # Auto adjust column widths of the worksheet worksheet.Range.AutoFitColumns() # Save the workbook to the specified Excel file workbook.SaveToFile(xls_path, ExcelVersion.Version2013) except Exception as e: print(f"Error occurred: {str(e)}") # Example usage pdf_path = "Tables.pdf" xls_path = "table_data.xlsx" extract_table_data_to_excel(pdf_path, xls_path)
Extract PDF Tables to CSV in Python
CSV is a universal format that can be opened and processed by spreadsheet software, databases, programming languages, and data analysis tools. Extracting PDF tables to CSV format makes the data easily accessible and compatible with a wide range of applications and tools.
The example below shows how to export data from PDF tables to CSV files in Python using Spire.PDF for Python and Spire.XLS for Python:
from spire.pdf import * from spire.xls import * # Define a function to extract data from PDF tables to CSV def extract_table_data_to_csv(pdf_path, csv_directory): # Create an instance of the PdfDocument class doc = PdfDocument() try: # Load a PDF document doc.LoadFromFile(pdf_path) # Create an instance of the PdfTableExtractor class extractor = PdfTableExtractor(doc) # Create an instance of the Workbook class workbook = Workbook() # Remove the default 3 worksheets workbook.Worksheets.Clear() # Iterate through the pages in the PDF document for page_index in range(doc.Pages.Count): # Extract tables from each page tables = extractor.ExtractTable(page_index) if tables is not None and len(tables) > 0: # Iterate through the extracted tables for table_index, table in enumerate(tables): # Create a new worksheet for each table worksheet = workbook.CreateEmptySheet() row_count = table.GetRowCount() col_count = table.GetColumnCount() # Extract data from the table and populate the worksheet for row_index in range(row_count): for column_index in range(col_count): data = table.GetText(row_index, column_index) worksheet.Range[row_index + 1, column_index + 1].Value = data.strip() csv_name = csv_directory + f"Table {table_index + 1} of Page {page_index + 1}" + ".csv" # Save each worksheet to a separate CSV file worksheet.SaveToFile(csv_name, ",", Encoding.get_UTF8()) except Exception as e: print(f"Error occurred: {str(e)}") # Example usage pdf_path = "Tables.pdf" csv_directory = "CSV/" extract_table_data_to_csv(pdf_path, csv_directory)
Conclusion
This article demonstrated how to retrieve data from PDF tables to text, Excel and CSV files using Python. We hope you find it helpful.