In this tutorial, I will explain how to use the Advanced Step "Read PDF All Pages" on the Tess AI platform. This step is useful for extracting text from a PDF, allowing you to use it to train your model or query the document. Here are the details on how to fill in the fields and examples of use cases:
Input Fields:
Insert the PDF file or link: In this field, you must provide a link to a publicly accessible PDF file on the internet. Alternatively, you can use the output of the "Upload File" user input to extract data from files stored on your computer.
Output Result:
The text from the entire PDF will be extracted.
Use Cases:
Importing Contracts for Queries: Imagine you have a library of contracts in PDF format. Using the "Read PDF All Pages" Step, you can extract the text from all these contracts and create a search model that allows users to search for specific terms within the contracts. This is useful for quickly locating important information.
Importing Knowledge Bases for Querying: If you have a knowledge base in PDF format, you can use this step to extract the content from all of its documents and make it available in a query system. Users can then efficiently search for and access relevant information.
Importing Documents for Training in Various Markets: If you are training an AI model for a specific market, such as the financial, legal, or medical sector, you can use the "Read PDF All Pages" Step to collect data from relevant PDF documents. This data can be used to train the model and improve its understanding of the market, enabling it to provide more accurate and contextual information.
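As a rough illustration of the contract-search use case above, the sketch below searches already-extracted text for a term and returns a short snippet around each match. It is only a minimal sketch that runs outside Tess AI: the function name `find_term`, the `documents` dictionary, and the sample texts are hypothetical and are not part of the platform.

```python
import re


def find_term(documents, term, context=40):
    """Return (document name, snippet) pairs for each case-insensitive match.

    `documents` maps a document name to its extracted text (hypothetical
    structure; in practice the text would come from the step's output).
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for name, text in documents.items():
        for match in pattern.finditer(text):
            # Keep a few characters on each side so the match is readable in context.
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            hits.append((name, text[start:end]))
    return hits


# Hypothetical sample data standing in for text extracted from two contracts.
documents = {
    "contract_a.pdf": "Either party may invoke the Termination clause with 30 days notice.",
    "contract_b.pdf": "Payment is due within 15 days of the invoice date.",
}

for name, snippet in find_term(documents, "termination"):
    print(name, "->", snippet)
```

A plain keyword search like this only finds literal matches. Querying the document through a trained model, as described in the use cases, is what lets users ask questions in natural language.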
Limitations:
Keep in mind that training your AI on text extracted from PDF documents via Tess AI is subject to a size limit.
Training cannot exceed 80,000 words, so make sure the selected PDF is within this limit. If your PDF has more than 80,000 words, consider splitting it into smaller parts or selecting only the most relevant sections.
If neither option is practical, it is better to use the GPT creation mode and add the file as RAG.
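To check a document against the 80,000-word limit before using it for training, you can count the words in the extracted text and, if needed, split it into parts. The sketch below is only an illustration that runs outside Tess AI. It assumes words are separated by whitespace, which may differ from how the platform counts them, and the function names `count_words` and `split_by_word_limit` are hypothetical.

```python
MAX_TRAINING_WORDS = 80_000  # limit stated in this tutorial


def count_words(text):
    """Count whitespace-separated words (assumed counting rule)."""
    return len(text.split())


def split_by_word_limit(text, max_words=MAX_TRAINING_WORDS):
    """Split text into consecutive parts of at most `max_words` words each.

    Note: words are re-joined with single spaces, so the original
    line breaks and spacing are not preserved.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Hypothetical extracted text of 200,000 words.
extracted_text = "word " * 200_000

if count_words(extracted_text) > MAX_TRAINING_WORDS:
    parts = split_by_word_limit(extracted_text)
    print(f"Document split into {len(parts)} parts")
```

Splitting purely by word count can cut a section in half. When the document has clear chapters or clauses, selecting the most relevant sections by hand, as suggested above, usually gives cleaner training material.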
Implementation Example
Case 1: User-End PDF Import
In this case, a template was built in which the PDF is imported by the user who runs the template.
Case 2: Fixed Link Import
In this case, a template was built in which the PDF is imported from a fixed link and used solely as training material, so that the end user can run queries against it.
Conclusion
In summary, the "Read PDF All Pages" Step is a powerful tool that enables the extraction of text from PDFs for various purposes, from contract queries to training models in different sectors. It simplifies the process of obtaining data from PDF documents and makes it easier to incorporate that data into your workflow.