How to read a scientific research paper using a large language model?

Utpal Kumar 6 minute read MACHINELEARNING May 18, 2023

Introduction

A large language model (LLM) — like OpenAI’s GPT family, or the many capable models that have appeared since — is a Transformer-based network trained on a huge text corpus to understand and generate language. It’s trained in two broad phases: pretraining on a vast general corpus, then alignment / fine-tuning to make it useful and safe. But a general model doesn’t know the specific paper on your desk — so how do we get reliable answers about one particular document?

The one mental model — this is RAG, not fine-tuning

You don’t retrain the model. Instead you use Retrieval-Augmented Generation (RAG): split the paper into chunks, turn each into an embedding stored in a vector index (FAISS), and at query time retrieve the chunks most similar to your question and hand them to the LLM as context. The model then answers grounded in the paper’s own text — which curbs hallucination and lets you point at the source.

PDF → chunk + embed → vector store → retrieve relevant chunks → LLM answers from them.

Ingest the paper once; then every question retrieves only the relevant chunks to ground the LLM's answer.
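To make the ingest-once, retrieve-per-question flow concrete, here is a minimal toy sketch in plain Python. It is not the script used in this post: the `embed` function is a bag-of-words stand-in for a real embedding model, a plain list stands in for the FAISS index, the three chunks are hypothetical stand-ins for text extracted from a PDF, and the final prompt is only printed rather than sent to an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, index, k=2):
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Ingest once: hypothetical chunks standing in for text extracted from a PDF.
chunks = [
    "The authors of the article are S. W. French and B. A. Romanowicz.",
    "They used 20 knots with variable spacing for their radial b-spline basis.",
    "They used a Gauss-Newton optimization scheme for their waveform inversion.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query as often as you like: only the most relevant chunk reaches the prompt.
question = "Which optimization scheme did they use for the inversion?"
top_chunks = retrieve(question, index, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real setup the word counts would be replaced by dense embeddings and the list by a vector index, but the control flow stays the same.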

Why not just paste the question into ChatGPT?

A web chat interface has real limits for this task: it may not be up to date with the latest research, it can give incorrect answers for niche topics, and it doesn’t cite its sources. By feeding the paper’s own text in as retrieved context, you sidestep all three — the answer comes from this document.

RAG vs. fine-tuning — an important distinction. Fine-tuning updates the model’s weights on new data (expensive, and overkill for one paper). RAG leaves the model untouched and instead supplies relevant text at query time. What this post does is RAG. (Earlier versions of this article called it “fine-tuning,” but no weights are ever changed.)

Each text chunk is transformed into a vector embedding (here via OpenAI text embeddings) and stored in FAISS. These embeddings are numerical stand-ins for the text that preserve its semantic meaning, forming a searchable knowledge base. For a deeper understanding of how embeddings work, see this NLP course.

Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can “pool” the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.
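The pooling and dot-product steps described above can be sketched in a few lines. The token vectors below are made-up 3-dimensional numbers chosen for illustration (real models use hundreds or thousands of dimensions), and mean pooling is only one of several possible pooling strategies.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into a single vector for the whole text."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up token embeddings for two documents and one query.
doc_tokens = {
    "doc_a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "doc_b": [[0.0, 1.0, 0.0], [0.0, 0.6, 0.4]],
}
query_tokens = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0]]

# Pool token vectors into one vector per text.
doc_vectors = {name: mean_pool(vecs) for name, vecs in doc_tokens.items()}
query_vector = mean_pool(query_tokens)

# Score every document against the query and keep the best match.
scores = {name: dot(query_vector, vec) for name, vec in doc_vectors.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Here `doc_a` points in nearly the same direction as the query, so it gets the higher score and would be returned first.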

When you query the paper through the vector store, results are ranked by similarity to the query, and the LLM then generates a response from them. A good walkthrough of setting up a PDF reader is this YouTube video by Prompt Engineering.

Implementation

I have written a Python script that reads a PDF document, performs question-answering on the document, and generates an output based on the provided query. You can download the script from here.

The script expects two arguments: configfile, the path to the configuration file, and query, the question to ask about the paper. It reads the OpenAI API key from the environment variable OPENAI_API_KEY, so export your key in your shell environment before running the script. You can create an API key on the OpenAI platform.
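A command-line interface of this shape might look like the sketch below. The `-c` and `-q` flags match the usage shown in this post, but the long option names, the help strings, and the `get_api_key` helper are assumptions made for illustration and may differ from the actual script.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Ask a question about a PDF paper.")
    # -c and -q match the usage in this post; the long names are assumptions.
    parser.add_argument("-c", "--configfile", required=True,
                        help="path to the YAML configuration file")
    parser.add_argument("-q", "--query", required=True,
                        help="question to ask about the paper")
    return parser.parse_args(argv)


def get_api_key(environ=os.environ):
    """Read the key from the environment and fail early with a clear message."""
    key = environ.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit("OPENAI_API_KEY is not set; export it before running the script.")
    return key


# Parse an explicit argument list so the example runs without a real command line.
args = parse_args(["-c", "config.yaml", "-q", "who are the authors of the article?"])
print(args.configfile, "|", args.query)
```

Failing early when the key is missing avoids a confusing authentication error later, after the PDF has already been parsed and chunked.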

Cost note: the OpenAI API is pay-as-you-go (prepaid credits) — the generous free trial credits from 2023 no longer apply to new accounts. The good news is that reading a handful of papers is cheap; check current pricing on the platform before a big batch.

The script also keeps a local cache so that repeated runs are faster.

Setting up the Python environment

python3 -m venv venv
source venv/bin/activate
pip install langchain
pip install openai
pip install PyPDF2
pip install faiss-cpu
pip install tiktoken
pip install pyyaml

Package note (2026): the RAG concepts are unchanged, but the libraries have since been restructured. LangChain split into langchain-core / langchain-community / langchain-openai; the OpenAI Python SDK had a v1.0 rewrite (new OpenAI() client API); and PyPDF2 is now maintained as pypdf. If you’re installing fresh today, expect to use those newer packages (and possibly adjust imports in the script accordingly).

Usage

Who are the authors of this article?

python read_paper.py -c config.yaml -q "who are the authors of the article?"

Summary of this article?

python read_paper.py -c config.yaml -q "write a summary of this article"

Example

As an example, I use this application to read the research paper: Whole-mantle radially anisotropic shear velocity structure from spectral-element waveform tomography

❯ python read_paper.py -c config.yaml -q "who are the authors of the article?"

Cache cleared
====================================================================================================
The authors of the article are S. W. French and B. A. Romanowicz.
----------------------------------------------------------------------------------------------------

❯ python read_paper.py -c config.yaml -q "How many b spline nodes did they use in their model?"

====================================================================================================
They used 20 knots with variable spacing between the core-mantle boundary (CMB) and 30-km depth for their radial b-spline basis.
----------------------------------------------------------------------------------------------------

❯ python read_paper.py -c config.yaml -q "Which optimization algorithm did they chose for their inversion?"

====================================================================================================
They used a Gauss-Newton optimization scheme for their waveform inversion.
----------------------------------------------------------------------------------------------------

Check your understanding

Why does this RAG approach give more trustworthy answers than asking a chatbot the same question cold?

Conclusion

The outcome of this application depends on the parameters defined in the configfile, such as chunk_size, chunk_overlap and gpt_temperature. chunk_size caps the length of each text chunk, and chunk_overlap makes neighboring chunks share some text so that context spanning a chunk boundary is not lost. If the overlap is too small, a retrieved chunk may be missing the sentences that connect it to its neighbors. The temperature parameter of OpenAI GPT models governs the randomness, and thus the creativity, of the responses. A temperature of 0 gives straightforward, almost deterministic responses (you almost always get the same response to a given prompt). A temperature of 1 gives noticeably more varied responses.
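To see how chunk_size and chunk_overlap interact, here is a minimal character-based splitter. It is an illustrative sketch, not the splitter the script uses: the function name `split_text` is hypothetical, and real splitters typically prefer to break on separators such as newlines rather than at arbitrary characters.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters,
    each sharing chunk_overlap characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_text(text, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.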

Recap

Without scrolling up — can you name the pattern? To ask questions about a specific paper:

Use RAG, not fine-tuning — the model’s weights never change.
Chunk the PDF, embed the chunks, and store them in a FAISS vector index.
At query time, retrieve the most similar chunks and let the LLM answer from them — grounded, and source-able.
Tune chunk_size, chunk_overlap, and temperature for the accuracy/creativity you want.

The libraries evolve fast, but this retrieve-then-generate pattern is now the standard way to put an LLM to work on your own documents.

Where to go next

LangChain documentation — the current modular packages and RAG chains.
FAISS and pypdf — the vector index and the maintained PDF reader.
The source for this tool: read_scientific_papers_gpt.

Disclaimer of liability

The information provided by the Earth Inversion is made available for educational purposes only.

Whilst we endeavor to keep the information up-to-date and correct, Earth Inversion makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services or related graphics content on the website for any purpose.

UNDER NO CIRCUMSTANCE SHALL WE HAVE ANY LIABILITY TO YOU FOR ANY LOSS OR DAMAGE OF ANY KIND INCURRED AS A RESULT OF THE USE OF THE SITE OR RELIANCE ON ANY INFORMATION PROVIDED ON THE SITE. ANY RELIANCE YOU PLACE ON SUCH MATERIAL IS THEREFORE STRICTLY AT YOUR OWN RISK.
