GPT-3 can help us extract key figures, dates, or other important content from documents that are too big to fit into the context window. One approach is to split the document into chunks, process each chunk separately, and then combine the results into one list of answers.
In this notebook we'll run through this approach:
- Load in a long PDF and pull the text out
- Create a prompt for extracting key pieces of information
- Chunk up our document and process each chunk to pull any answers out
- Combine the answers at the end
- Extend this simple approach to three more difficult questions
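The chunk-then-combine flow listed above can be sketched in a few lines. This is a minimal illustration rather than the notebook's actual code: `chunk_text` and `extract_from_document` are hypothetical names, chunk size is measured in whitespace-separated words as a rough stand-in for model tokens, and `extract_chunk` is any callable that returns a list of answers for one chunk (in practice, a call to the model).

```python
from typing import Callable, Iterator, List


def chunk_text(text: str, max_words: int = 500) -> Iterator[str]:
    """Yield consecutive chunks of at most `max_words` words.

    Assumption: word count approximates token count closely enough
    to keep each chunk inside the context window.
    """
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])


def extract_from_document(
    text: str,
    extract_chunk: Callable[[str], List[str]],
    max_words: int = 500,
) -> List[str]:
    """Run `extract_chunk` on every chunk and combine the answers into one list."""
    answers: List[str] = []
    for chunk in chunk_text(text, max_words):
        answers.extend(extract_chunk(chunk))
    return answers


if __name__ == "__main__":
    # Toy extractor standing in for the model: keep words that look like years.
    def find_years(chunk: str) -> List[str]:
        return [w for w in chunk.split() if w.isdigit() and len(w) == 4]

    sample = "The rules apply from 2023 onward and are reviewed in 2026 each cycle"
    print(extract_from_document(sample, find_years, max_words=5))
```

Splitting on a fixed word count can cut a sentence in half at a chunk boundary; splitting on paragraphs or overlapping adjacent chunks are common ways to soften that, at the cost of some duplicated answers to merge afterwards.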
Approach
- Setup: Take a PDF, a Formula 1 Financial Regulation document on Power Units, and extract the text from it for entity extraction. We'll use this to try to extract answers that are buried in the content.
- Simple Entity Extraction: Extract key bits of information from chunks of a document by:
    - Creating a template prompt with our questions and an example of the expected answer format
    - Creating a function that takes a chunk of text as input, combines it with the prompt, and gets a response
    - Running a script to chunk the text, extract answers, and output them for parsing
- Complex Entity Extraction: Ask some more difficult questions that require tougher reasoning to work out
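As a rough sketch of the simple entity extraction step, the snippet below builds a template prompt from a list of questions, sends it through a completion function, and parses the numbered answers. Everything here is an assumption for illustration: the questions are hypothetical placeholders, `build_prompt`, `parse_answers`, and `extract_chunk` are made-up names, and `complete` stands in for whatever function actually calls the model.

```python
import re
from typing import Callable, Dict, List

# Hypothetical placeholder questions; substitute the ones you actually need answered.
QUESTIONS = [
    "What is the value of the cost cap?",
    "When does the reporting period end?",
]

# Matches response lines such as "1. some answer".
ANSWER_LINE = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$")


def build_prompt(chunk: str, questions: List[str]) -> str:
    """Combine the questions, an example of the answer format, and the chunk."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Answer the questions below using only the document excerpt.\n"
        'If the excerpt does not contain an answer, write "Not specified".\n'
        "Reply with one line per question, in the format:\n"
        "1. <answer>\n\n"
        f"Questions:\n{numbered}\n\n"
        f'Document excerpt: """{chunk}"""\n\n'
        "Answers:\n"
    )


def parse_answers(response: str) -> Dict[int, str]:
    """Map question number to answer, skipping lines marked "Not specified"."""
    answers: Dict[int, str] = {}
    for line in response.splitlines():
        match = ANSWER_LINE.match(line)
        if match and match.group(2).lower() != "not specified":
            answers[int(match.group(1))] = match.group(2)
    return answers


def extract_chunk(chunk: str, complete: Callable[[str], str]) -> Dict[int, str]:
    """Build the prompt for one chunk, get a response, and parse it."""
    return parse_answers(complete(build_prompt(chunk, QUESTIONS)))
```

Passing `complete` in as a parameter keeps the prompt and parsing logic testable without network access; asking for "Not specified" gives the model an explicit way to decline, so chunks that lack an answer are easy to filter out before combining.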