How to read a scientific research paper using a large language model?

Utpal Kumar   4 minute read      

A Python script to read a PDF document, perform question-answering on it, and generate an answer based on the provided query

Introduction

GPT (3 or 4), developed by OpenAI, is a Transformer-based large language model that can understand and generate text. GPT is trained in two steps: pretraining and fine-tuning. In the pretraining step, the model learns from a large corpus of text data using unsupervised learning; this corpus covers parts of the Internet and consists of billions of sentences from books, articles, websites, and other publicly available sources. In the fine-tuning step, which comes after pretraining, the model is trained further on more specific data to make it more useful and safe for practical applications.

Is it possible to use a GPT model to read a research paper?

Yes, you can use GPT models like ChatGPT to read recent scientific research papers, but some extra preparation is required. Accessing GPT models directly through web interfaces like ChatGPT has limitations: given the sheer volume of research published daily, the models may not cover the most recent work even with regular updates. Moreover, asking ChatGPT about recent research can yield incorrect answers due to inconsistencies in its training data, and it does not provide the sources of its information. However, it is indeed possible to constrain the model to deliver responses based solely on the input data you provide.

This limitation can be overcome by feeding the research paper into the language model as context (a lightweight alternative to true fine-tuning, since the model weights are never changed), allowing us to retrieve information directly from the paper. The process entails converting the downloaded PDF research paper into a well-structured text format and segmenting the text into smaller pieces that the language model can handle, as sketched below.
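Here is a minimal sketch of this step in Python, assuming PyPDF2 and LangChain (which the dependency list below suggests the script uses); the file name paper.pdf and the chunk sizes are placeholders:

# Extract raw text from the PDF and split it into overlapping chunks
# (paper.pdf, chunk_size, and chunk_overlap are placeholder values)
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter

reader = PdfReader("paper.pdf")
raw_text = "".join(page.extract_text() or "" for page in reader.pages)

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
    length_function=len,
)
chunks = text_splitter.split_text(raw_text)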

Each text chunk is then transformed into a vector embedding using OpenAI text embeddings and stored in a FAISS index. These embeddings serve as numerical equivalents of the texts, preserving their semantic information and forming a new knowledge base. For a deeper understanding of how embeddings work, refer to this NLP course.
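A sketch of the embedding step with LangChain's wrappers, continuing from the chunks produced above (this assumes the OPENAI_API_KEY environment variable is already set):

# Embed each chunk via the OpenAI embeddings API and index the vectors in FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()                   # calls the OpenAI embeddings API
docsearch = FAISS.from_texts(chunks, embeddings)  # `chunks` from the splitting step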

Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can “pool” the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.
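To make this concrete, here is a toy illustration with made-up four-dimensional vectors (real OpenAI embeddings have far more dimensions):

import numpy as np

# Hypothetical document embeddings (one row per document) and a query embedding
doc_vectors = np.array([[0.9, 0.1, 0.0, 0.2],
                        [0.1, 0.8, 0.3, 0.0],
                        [0.7, 0.2, 0.1, 0.3]])
query_vector = np.array([0.8, 0.1, 0.0, 0.3])

scores = doc_vectors @ query_vector   # dot-product similarity per document
ranking = np.argsort(scores)[::-1]    # most similar documents first
print(scores, ranking)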

When querying information from the input research paper through the vector store (representing the knowledge base), the results are ranked by their similarity to the query, and GPT is then used to generate a response from the top-ranked chunks. This YouTube video by Prompt Engineering gives a very good explanation of how to set up such a PDF-reading application.
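A sketch of the query step, again with LangChain (the "stuff" chain type and zero temperature here are assumptions, not necessarily what the script uses):

# Retrieve the chunks most similar to the query and let GPT answer from them
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

query = "who are the authors of the article?"
docs = docsearch.similarity_search(query)  # top-ranked chunks from the FAISS index
answer = chain.run(input_documents=docs, question=query)
print(answer)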

Implementation

I have written a Python script that reads a PDF document, performs question-answering on the document, and generates an output based on the provided query. You can download the script from here.

The script expects two arguments: configfile, the path to the configuration file, and query, the question to ask about the paper. The script retrieves the OpenAI API key from the environment variable OPENAI_API_KEY; to set it up, export your OpenAI API key in your OS environment. You can obtain the key from the OpenAI platform. You will get some free credit, which should be sufficient for heavy usage of this application for several months.
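For example, in a Unix-like shell (the key value is a placeholder for your own key):

export OPENAI_API_KEY="sk-..."   # add to ~/.bashrc or ~/.zshrc to persist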

The script also creates a local cache so that repeated runs are relatively fast.
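One way such a cache can work is to save the FAISS index to disk and reload it on later runs; a sketch, assuming LangChain's save_local/load_local helpers and a hypothetical cache directory name:

# Reuse a cached FAISS index if present; otherwise build it once and save it
import os
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

CACHE_DIR = "faiss_cache"  # hypothetical cache location
embeddings = OpenAIEmbeddings()

if os.path.isdir(CACHE_DIR):
    docsearch = FAISS.load_local(CACHE_DIR, embeddings)
else:
    docsearch = FAISS.from_texts(chunks, embeddings)  # `chunks` from the splitting step
    docsearch.save_local(CACHE_DIR)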

Setting up the Python environment

python3 -m venv venv
source venv/bin/activate
pip install langchain
pip install openai
pip install PyPDF2
pip install faiss-cpu
pip install tiktoken
pip install pyyaml

Usage

Ask who the authors of the article are:

python read_paper.py -c config.yaml -q "who are the authors of the article?"

Ask for a summary of the article:

python read_paper.py -c config.yaml -q "write a summary of this article"

Example

As an example, I use this application to read the research paper: Whole-mantle radially anisotropic shear velocity structure from spectral-element waveform tomography

French and Romanowicz 2014
❯ python read_paper.py -c config.yaml -q "who are the authors of the article?"

Cache cleared
====================================================================================================
The authors of the article are S. W. French and B. A. Romanowicz.
----------------------------------------------------------------------------------------------------
❯ python read_paper.py -c config.yaml -q "How many b spline nodes did they use in their model?"

====================================================================================================
They used 20 knots with variable spacing between the core-mantle boundary (CMB) and 30-km depth for their radial b-spline basis.
----------------------------------------------------------------------------------------------------
❯ python read_paper.py -c config.yaml -q "Which optimization algorithm did they chose for their inversion?"

====================================================================================================
They used a Gauss-Newton optimization scheme for their waveform inversion.
----------------------------------------------------------------------------------------------------

Conclusion

The outcome of this application depends on the parameters defined in the configfile, such as chunk_size, chunk_overlap, and gpt_temperature. The chunk_size keeps each piece of text within a length limit, and the chunk_overlap is important for the model to preserve context across chunks: if the overlap is too small, the model may not be able to relate neighboring chunks to one another. The temperature parameter of the OpenAI GPT models governs the randomness, and thus the creativity, of the responses. A temperature of 0 makes responses very straightforward and almost deterministic (you nearly always get the same response to a given prompt), while a temperature of 1 lets the responses vary wildly.
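For reference, here is a hypothetical config.yaml using the parameters discussed above (the exact set of keys the script expects may differ):

# Hypothetical config.yaml -- key names follow the parameters discussed above
pdf_path: paper.pdf    # assumed key: path to the PDF to read
chunk_size: 1000       # maximum characters per text chunk
chunk_overlap: 200     # overlap between consecutive chunks
gpt_temperature: 0.0   # 0 = nearly deterministic, 1 = highly varied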

Disclaimer of liability

The information provided by the Earth Inversion is made available for educational purposes only.

Whilst we endeavor to keep the information up-to-date and correct, Earth Inversion makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability, or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose.

UNDER NO CIRCUMSTANCE SHALL WE HAVE ANY LIABILITY TO YOU FOR ANY LOSS OR DAMAGE OF ANY KIND INCURRED AS A RESULT OF THE USE OF THE SITE OR RELIANCE ON ANY INFORMATION PROVIDED ON THE SITE. ANY RELIANCE YOU PLACE ON SUCH MATERIAL IS THEREFORE STRICTLY AT YOUR OWN RISK.

