Tutorial: Building a Summarization System with a Large Language Model
This tutorial teaches you how to build a summarization system that generates summaries based on your documents. It uses the PromptNode with a large language model and a custom prompt.
- Level: Beginner
- Time to complete: 15 minutes
- Prerequisites:
- This tutorial assumes a basic knowledge of NLP, large language models and retrieval-augmented generation. If you need more information, have a look at Language Models.
- You must be an Admin to complete this tutorial.
- Goal: After completing this tutorial, you will have created a system that can generate summaries of reports on child obesity and food advertising regulations. You will have learned how to use PromptNode with a large language model and a custom prompt.
- Keywords: PromptNode, summarization, large language models, prompts
Choosing a Large Language Model
This tutorial uses an open source FLAN-T5 model, which has its limitations: the performance of a pipeline based on it may not be sufficient for your use case. For production scenarios, we recommend a better-performing model, such as OpenAI's gpt-3.5-turbo.
In this tutorial, you'll learn how to change both the model and the prompt, so you'll know how to do that if you decide to use a different model.
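If you do switch models later, the change comes down to one line in the pipeline YAML. Here's a minimal sketch of the PromptNode section with an OpenAI model swapped in (the model name is just an example; it assumes you've already connected deepset Cloud to OpenAI, as described later in this tutorial):

```yaml
  - name: PromptNode
    type: PromptNode
    params:
      default_prompt_template: summarization
      model_name_or_path: gpt-3.5-turbo # Example paid model; requires an OpenAI connection
      top_k: 1
```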
Upload Files
First, let's upload the files we want our search system to run on. The files here are a set of reports on the impact of food marketing on child obesity. You can replace this dataset with any other dataset.
- Download the .zip file with sample files and unpack it on your computer.
- Log in to deepset Cloud, make sure you're in the workspace you want to use for this task, and go to Data > Files.
- Click Upload Files.
- Select all the files you extracted and drop them into the Upload Files window. There should be four files in total.
- Click Upload and wait until the files are uploaded.
Result: Your files are in your workspace, and you can see them on the Files page.

Create the Pipeline
We'll use an out-of-the-box template as a baseline for our pipeline and we'll adjust it a bit:
- In deepset Cloud, go to Pipelines > New Pipeline.
- In the YAML Editor, click Create Pipeline and select From Template.
- Find the Generative Question Answering FLAN-T5 template and click Use Template. The template YAML opens in the Pipeline Designer.
- Update the template:
  - In the YAML editor, find line 7 and change the pipeline name to "summarization".
  - In line 25, change the default_prompt_template to summarization.
  - In line 26, you can change the model. For the purpose of this tutorial, we're using an open source model, but it has its limitations. For a production scenario, you may want to use a paid model. To do this, first connect to the model provider:
    - Click your name in the top right corner and select Connections.
    - Connect to the chosen model provider by entering the API key from your model provider account.
    - Return to the Pipeline Designer and type the model name, for example text-davinci-003, in the model_name_or_path parameter.
  - In line 27, change the top_k value to 1. The edited lines are sketched right after this list.
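For reference, here's a focused view of just the lines you changed, with the open source model kept in place (the full YAML follows in the next step):

```yaml
name: 'summarization' # Line 7: the new pipeline name

# ...the DocumentStore and Retriever components stay unchanged...

  - name: PromptNode
    type: PromptNode
    params:
      default_prompt_template: summarization # Line 25: the ready-made summarization template
      model_name_or_path: google/flan-t5-large # Line 26: swap this for a paid model if you set up a connection
      top_k: 1 # Line 27: the number of answers to generate
```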
- Save your pipeline. This is what your pipeline YAML should look like:
```yaml
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
# This is a Generative Question Answering pipeline for English with a good vector-based Retriever and Google's open-source FLAN-T5 model. Recommended for advanced users who want more control over models and prompts.
version: '1.16.0'
name: 'summarization'

# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store so that the model can base its generation on them
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 1 # The number of documents to return
  - name: PromptNode # The component that generates the answer based on the documents it gets from the retriever
    type: PromptNode
    params:
      default_prompt_template: summarization # PromptTemplate defines the task you want the PromptNode to do. Here, we want it to summarize documents.
      model_name_or_path: google/flan-t5-large # A default large language model for PromptNode
      top_k: 1 # The number of answers to generate
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params: # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 100 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: PromptNode
        inputs: [Retriever]
  - name: indexing
    nodes: # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
```
- At the top of the Pipeline Designer, click Deploy and wait until your pipeline is deployed and indexed. Indexing may take a couple of minutes.
Result: You have created a pipeline that summarizes documents using a large language model. Your pipeline is displayed on the Pipelines page in the Deployed section, and you can now try it out.
Test the Pipeline
Now it's time to see how your pipeline is doing. Let's run a search with it.
- In the navigation, click Search.
- Make sure the summarization pipeline is selected.
- Type the query: summarize the report on advertising food to children. The pipeline returns a decent summary of the report.
- Not bad, but let's try something that's not in the files we uploaded. Type the query: what's the impact of tv on children? The answer we get this time is rubbish. Time to modify our prompt to handle cases when the answer is not in the documents.
Customize the Prompt
PromptNode comes with ready-made prompts. The first version of your pipeline used the summarization prompt, but after testing it out, we found we want to customize it a bit more.
- First, let's check what's in the summarization prompt template. You can do that in the PromptTemplates documentation. Find the summarization template and expand it. Here is what it contains:

  ```python
  PromptTemplate(name="summarization", prompt_text="Summarize this document: {documents} Summary:")
  ```

  We'll use this template as a baseline for the custom prompt.
- Declare the prompt template as a component in the YAML. Add a new entry in line 22 of the YAML editor and type - name: summary-custom.
- In line 23, enter type: PromptTemplate.
- In line 24, enter params to specify the parameters for the prompt. prompt_text contains the instructions for the model. Let's copy the default summarization template and add this sentence to it: "If the document is not there, answer with: I can't answer the question based on the information provided.":

  ```yaml
  params:
    prompt_text: >
      Summarize this document: {documents} \n\n If the document is not there, answer with: "I can''t answer the question based on the information provided." \n\n Summary:
  ```
- In line 29, type name: summary. This is a required parameter for the PromptTemplate, and it must be different from the component name you declared in step 2 (summary-custom).
- Set your custom prompt as the default prompt template. In line 33, set default_prompt_template to summary-custom. The assembled component is sketched right after this list.
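Taken together, the new template component and the updated PromptNode parameter should read like this (a focused view of what you just added; the full pipeline YAML follows below):

```yaml
  - name: summary-custom
    type: PromptTemplate
    params:
      name: summary # Required, and must differ from the component name above
      prompt_text: >
        Summarize this document: {documents} \n\n If the document is not there, answer with: "I can''t answer the question based on the information provided." \n\n Summary:
  - name: PromptNode
    type: PromptNode
    params:
      default_prompt_template: summary-custom # Points PromptNode at the custom template
      model_name_or_path: google/flan-t5-large
      top_k: 1
```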
This is what the pipeline should look like now:

```yaml
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
# This is a Generative Question Answering pipeline for English with a good vector-based Retriever and Google's open-source FLAN-T5 model. Recommended for advanced users who want more control over models and prompts.
version: '1.16.0'
name: 'summarization'

# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store so that the model can base its generation on them
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 1 # The number of documents to return
  - name: summary-custom
    type: PromptTemplate
    params:
      name: summary
      prompt_text: >
        Summarize this document: {documents} \n\n If the document is not there, answer with: "I can''t answer the question based on the information provided." \n\n Summary:
  - name: PromptNode # The component that generates the answer based on the documents it gets from the retriever
    type: PromptNode
    params:
      default_prompt_template: summary-custom # PromptTemplate defines the task you want the PromptNode to do. Here, we point it at our custom template.
      model_name_or_path: google/flan-t5-large # A default large language model for PromptNode
      top_k: 1 # The number of answers to generate
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params: # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 100 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: PromptNode
        inputs: [Retriever]
  - name: indexing
    nodes: # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
```
- Save and deploy your pipeline.
Test the Customized Prompt
- Once your pipeline is deployed, go to Search.
- Make sure the summarization pipeline is selected and type the query: summarize the report on advertising food to children. The result is OK, we get a summary.
- Now, let's ask something that we know is not in the files we uploaded: what's the impact of tv on children? Just as we instructed in the prompt, the model tells us it can't answer the question.
Result: Congratulations! You just created a summarization pipeline that uses a large language model to generate summaries of documents. You also customized the prompt for the model.