PDF2Table-3.811.1045

 

 

PDF2Table

 

 

Author: Kyobong An

 

Description

This plugin extracts text and tables from PDF files.

 

Restrictions

This plugin works only with PDF files in which the text is selectable. Scanned (image) PDF files will not work with this plugin.

 

ABBYY gives a good explanation of the different types of PDF files at the link below. In summary, the files that ABBYY calls “true PDFs” do not need ABBYY and can be used with this plugin.  https://pdf.abbyy.com/learning-center/pdf-types/
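One way to check in advance whether a PDF has selectable text is to look at what a text extractor returns for each page. The sketch below assumes pdfplumber, the Python library this plugin is based on, is installed. The helper names has_selectable_text and pdf_has_selectable_text are illustrative and are not part of the plugin.

```python
def has_selectable_text(page_texts):
    """Return True if at least one page yielded non-whitespace text."""
    # extract_text() may give None or an empty string for image-only pages.
    return any(bool(text and text.strip()) for text in page_texts)


def pdf_has_selectable_text(pdf_path):
    """Open a PDF with pdfplumber and test every page for extractable text."""
    import pdfplumber  # third-party; imported lazily so the helper above has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        return has_selectable_text(page.extract_text() for page in pdf.pages)
```

If pdf_has_selectable_text returns False, the file is probably a scanned (image) PDF and would need OCR before this plugin can process it.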

 

 

 

Need help?

For technical support, contact tech@argos-labs.com

 

You can also search all operations.


 

From version 3.819.1007

Additional advanced parameters were added under the “TABLE” option, as described below.

 

To use these options, users must understand the functions and terminology of pdfplumber, the Python library on which this plugin is based.

For more resources, please visit these websites.

 

 

Input (Required)

  • Operation mode: either TEXT or TABLE

  • Input PDF file path (.pdf; digitally generated PDFs only)

  • Output file name and path (for the TABLE option, you can choose .csv or .txt; for the TEXT option, only .txt is available)

 

Input (Optional)

  • Page number

  • Table number (when a PDF contains multiple tables, they are indexed (numbered) from top to bottom)

  • Separator used to separate values in the table

  • Horizontal and Vertical Strategies – these are used to determine the boundaries of values in the table when “lines” are not very clear

    • Lines

    • Lines – strict

    • Text

 

 

Output/Return Value

 

Return Value

String                Full file path for the output file

Csv                   Full file path for the output file

File                   Full file path for the output file

 

Return Code

0          Execution Successful

1          The table is not included in the PDF file

9          All other responses from the plugin  
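As a small illustration of how a calling script might interpret these codes, the sketch below maps each documented return code to its meaning. The names RETURN_CODES, describe_return_code and is_success are hypothetical and are not part of the plugin.

```python
# Return codes as documented above; anything else is treated as undocumented.
RETURN_CODES = {
    0: "Execution successful",
    1: "The table is not included in the PDF file",
    9: "Other response from the plugin",
}


def describe_return_code(code):
    """Translate a return code into a human-readable message."""
    return RETURN_CODES.get(code, f"Undocumented return code: {code}")


def is_success(code):
    """Only code 0 means the output file was written successfully."""
    return code == 0
```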

 

 

Parameter Settings

 

For TABLE option

 

 

 

 

 

For TEXT option

 

 

 

 

 

 

More tips for TABLE option

 

Table index:    Select the table within the selected page.

Separator:       Enter the separator to insert between values in the exported .txt file (default: ‘,’)
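The effect of these two settings can be sketched in plain Python. The example assumes the tables on a page are available as a list of tables, each table being a list of rows of cell strings (the shape pdfplumber returns), and it assumes a zero-based table index. The plugin's own numbering may differ, and select_table and rows_to_lines are illustrative names only.

```python
def select_table(tables, table_index=0):
    """Pick one table from those found on a page (zero-based by assumption)."""
    if not 0 <= table_index < len(tables):
        raise IndexError(
            f"page has {len(tables)} table(s); index {table_index} is out of range"
        )
    return tables[table_index]


def rows_to_lines(rows, separator=","):
    """Join the cells of each row with the separator; empty (None) cells become ''."""
    return [
        separator.join("" if cell is None else str(cell) for cell in row)
        for row in rows
    ]
```

For example, with two tables on a page, rows_to_lines(select_table(tables, 1), separator="|") would produce the lines of the second table with cells separated by a vertical bar.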

 

What are VERTICAL and HORIZONTAL Strategies?

Strategy              Description

"lines"               Use the page's graphical lines, including the sides of rectangle objects, as the borders of potential table-cells.

"lines_strict"        Use the page's graphical lines, but not the sides of rectangle objects, as the borders of potential table-cells.

"text"                For vertical_strategy: deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For horizontal_strategy: the same, but using the tops of words.
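In pdfplumber itself, these strategies are passed as a table_settings dictionary. The sketch below shows one way this could look. It assumes pdfplumber is installed, uses a 1-based page number as an assumption, and limits the choices to the three strategies listed above. The function names are illustrative and do not describe the plugin's internal code.

```python
VALID_STRATEGIES = ("lines", "lines_strict", "text")


def build_table_settings(vertical="lines", horizontal="lines"):
    """Build a pdfplumber table_settings dict from the two strategy names."""
    for name in (vertical, horizontal):
        if name not in VALID_STRATEGIES:
            raise ValueError(
                f"unknown strategy {name!r}; expected one of {VALID_STRATEGIES}"
            )
    return {"vertical_strategy": vertical, "horizontal_strategy": horizontal}


def extract_tables_from_page(pdf_path, page_number=1, vertical="lines", horizontal="lines"):
    """Return every table found on one page (page_number is 1-based in this sketch)."""
    import pdfplumber  # third-party; imported lazily

    if page_number < 1:
        raise ValueError("page_number starts at 1")
    settings = build_table_settings(vertical, horizontal)
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        # Each table is a list of rows; each row is a list of cell strings (or None).
        return page.extract_tables(table_settings=settings)
```

When a table has no ruling lines, trying vertical="text" and horizontal="text" is a reasonable first experiment, since both then rely on word positions rather than drawn lines.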