Tutorial: Building a Summarization System with a Large Language Model

This tutorial teaches you how to build a summarization system that generates summaries based on your documents. It uses PromptNode with a large language model.

  • Level: Beginner
  • Time to complete: 15 minutes
  • Prerequisites:
    • This tutorial assumes a basic knowledge of NLP, large language models, and retrieval-augmented generation. If you need more information, have a look at Language Models.
    • You must be an Admin to complete this tutorial.
    • This tutorial uses the gpt-3.5-turbo model, so you need an API key from an active OpenAI account.
      If you don't have an account with OpenAI, you can replace this model with an open source one, like google/flan-t5-large, but bear in mind that it has limitations and its performance may not be sufficient.
  • Goal: After completing this tutorial, you will have created a system that can generate summaries of reports on child obesity and food advertising regulations. You will have learned how to use PromptNode with a large language model and a ready-made prompt template.
  • Keywords: PromptNode, summarization, large language models, prompts

Connect Your OpenAI Account

Perform this step if you want to use the gpt-3.5-turbo model by OpenAI. If you're planning to use an open source model, you can skip this step.

You'll be able to use OpenAI models without having to pass the API keys in the pipeline itself.

  1. In deepset Cloud, click your initials in the top right corner and choose Connections.
    The personal menu expanded with the Connections option underlined.
  2. Next to OpenAI, click Connect, paste your OpenAI API key, and click Submit.

Result: You're connected to your OpenAI account and can use OpenAI models in your pipelines.

The integrations section with the OpenAI option showing as connected.
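
With your account connected, deepset Cloud supplies the OpenAI key to PromptNode behind the scenes. For comparison, here's a minimal sketch of the equivalent node in local Haystack (v1) code, where you pass the key yourself; the environment variable name is just an illustration:

    import os

    from haystack.nodes import PromptNode

    # Locally, you must hand the OpenAI key to PromptNode yourself.
    # In deepset Cloud, the Connections page takes care of this.
    prompt_node = PromptNode(
        model_name_or_path="gpt-3.5-turbo",
        api_key=os.environ["OPENAI_API_KEY"],  # illustrative variable name
        max_length=400,
    )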

Upload Files

First, let's upload the files we want our search system to run on. The files here are a set of reports on the impact of food marketing on child obesity. You can replace this dataset with any other dataset.

  1. Download the .zip file with sample files and unpack it on your computer.

  2. Go to deepset Cloud, make sure you're in the workspace you want to use for this task, and go to Files.

    The left-hand navigation with the workspace name marked as 1 and the Files option marked as 2.
  3. Click Upload Files.

  4. Select all the files you extracted, drop them into the Upload Files window, and click Upload. There should be four files in total.

Result: Your files are in your workspace, and you can see them on the Files page.

The Files page with the four files successfully uploaded.
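
If you'd rather script the upload than use the UI, deepset Cloud also exposes a file upload endpoint in its REST API. The sketch below uses Python's requests library; the endpoint path follows the deepset Cloud API reference, but the API key, workspace name, and folder name are placeholders, so check the API docs before relying on it:

    from pathlib import Path

    import requests

    API_KEY = "YOUR_DEEPSET_CLOUD_API_KEY"  # placeholder
    WORKSPACE = "default"  # the workspace you chose above
    URL = f"https://api.cloud.deepset.ai/api/v1/workspaces/{WORKSPACE}/files"

    # Upload every file unpacked from the sample .zip (folder name is hypothetical).
    for path in Path("sample_files").iterdir():
        with path.open("rb") as f:
            response = requests.post(
                URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"file": (path.name, f)},
            )
        response.raise_for_status()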

Create the Pipeline

We'll use an out-of-the-box template as a baseline for our pipeline and we'll adjust it a bit:

  1. In deepset Cloud, go to Pipeline Templates.

  2. Click Basic QA, find Generative Question Answering GPT-3.5, and choose Use Template.

    The Basic QA pipeline templates, including Extractive Question Answering, Extractive Question Answering (German), and Generative Question Answering GPT-3.5, each with View Details and Use Template options.
  3. Type summarization as the pipeline name and click Create Pipeline. You're redirected to the Pipelines page, and your pipeline is in the In Development section.

  4. Click the ellipsis button next to your pipeline and choose Edit.

  5. In the Pipeline Designer, update the template:

    1. In line 37, find the top_k value of the SentenceTransformersRanker component and change it to 1.
    2. In line 65, change the default_prompt_template to deepset/summarization.
    3. Line 69 is where you can change the model.
    4. In line 70, add the top_k parameter and set it to 1.
    5. Delete the code that defines the PromptTemplate, that is, lines 46 to 61.
  6. Save your pipeline. This is what your pipeline should look like:

    version: '1.24.0'
    
    # This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
    # The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
    # Type is the class name of the component. 
    components:
      - name: DocumentStore
        type: DeepsetCloudDocumentStore
        params:
          embedding_dim: 768
          similarity: cosine
      - name: BM25Retriever # The keyword-based retriever
        type: BM25Retriever
        params:
          document_store: DocumentStore
          top_k: 10 # The number of results to return
      - name: EmbeddingRetriever # Selects the most relevant documents from the document store
        type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
        params:
          document_store: DocumentStore
          embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
          model_format: sentence_transformers
          top_k: 10 # The number of results to return
      - name: JoinResults # Joins the results from both retrievers
        type: JoinDocuments
        params:
          join_mode: concatenate # Combines documents from multiple retrievers
      - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
        type: SentenceTransformersRanker
        params:
          model_name_or_path: intfloat/simlm-msmarco-reranker # Fast model optimized for reranking
          top_k: 1 # The number of results to return
          batch_size: 20  # Keep this number equal to or larger than the sum of the two retrievers' top_k values so all documents are processed at once
          model_kwargs:  # Additional keyword arguments for the model
            torch_dtype: torch.float16
      - name: PromptNode
        type: PromptNode
        params:
          default_prompt_template: deepset/summarization
          max_length: 400 # The maximum number of tokens the generated answer can have
          model_kwargs: # Specifies additional model settings
            temperature: 0 # Lower temperature works best for fact-based qa
          model_name_or_path: gpt-3.5-turbo
          top_k: 1 # The number of answers to generate
      - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
        type: FileTypeClassifier
      - name: TextConverter # Converts files into documents
        type: TextConverter
      - name: PDFConverter # Converts PDFs into documents
        type: PDFToTextConverter
      - name: Preprocessor # Splits documents into smaller ones and cleans them up
        type: PreProcessor
        params:
          # With a vector-based retriever, it's good to split your documents into smaller ones
          split_by: word # The unit by which you want to split the documents
          split_length: 250 # The max number of words in a document
          split_overlap: 20 # Enables the sliding window approach
          language: en
          split_respect_sentence_boundary: True # Retains complete sentences in split documents
    
    # Here you define how the nodes are organized in the pipelines
    # For each node, specify its input
    pipelines:
      - name: query
        nodes:
          - name: BM25Retriever
            inputs: [Query]
          - name: EmbeddingRetriever
            inputs: [Query]
          - name: JoinResults
            inputs: [BM25Retriever, EmbeddingRetriever]
          - name: Reranker
            inputs: [JoinResults]
          - name: PromptNode
            inputs: [Reranker]
      - name: indexing
        nodes:
        # Depending on the file type, we use a Text or PDF converter
          - name: FileTypeClassifier
            inputs: [File]
          - name: TextConverter
            inputs: [FileTypeClassifier.output_1] # Ensures that this converter receives txt files
          - name: PDFConverter
            inputs: [FileTypeClassifier.output_2] # Ensures that this converter receives PDFs
          - name: Preprocessor
            inputs: [TextConverter, PDFConverter]
          - name: EmbeddingRetriever
            inputs: [Preprocessor]
          - name: DocumentStore
            inputs: [EmbeddingRetriever]
    
    
  7. At the top of the Pipeline Designer, click Deploy and wait until your pipeline is deployed and indexed. Indexing may take a couple of minutes.

Result: You have created a pipeline that summarizes documents using a large language model. The pipeline status is Indexed, which means it's ready for use.
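
The key change in the template was pointing PromptNode at the ready-made deepset/summarization prompt template. To get a feel for what that template does, here's a minimal local sketch with Haystack (v1); the document text and API key are placeholders:

    from haystack import Document
    from haystack.nodes import PromptNode

    summarizer = PromptNode(
        model_name_or_path="gpt-3.5-turbo",
        api_key="YOUR_OPENAI_API_KEY",  # placeholder
        default_prompt_template="deepset/summarization",
        max_length=400,
        model_kwargs={"temperature": 0},  # deterministic output, as in the pipeline YAML
    )

    # The template inserts the documents you pass in into a summarization prompt.
    docs = [Document(content="Food advertising to children has been linked to ...")]
    print(summarizer.prompt(prompt_template="deepset/summarization", documents=docs))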

Test the Pipeline

Now it's time to see how your pipeline is doing. Let's run a search with it.

  1. In the navigation, click Playground.

  2. Make sure the summarization pipeline is selected.

  3. Type the query: summarize the report on advertising food to children.
    Here's what the pipeline returns:
    The Search page with a summary the pipeline returned as an answer to the query.

Result: Congratulations! You just created a summarization pipeline that uses a large language model to generate summaries of documents.
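
The Playground is the quickest way to try the pipeline out, but once it's deployed you can also query it programmatically through the deepset Cloud Search API. Here's a minimal sketch; the API key and workspace name are placeholders, and the exact response shape is best confirmed against the API reference:

    import requests

    API_KEY = "YOUR_DEEPSET_CLOUD_API_KEY"  # placeholder
    WORKSPACE = "default"  # the workspace holding your pipeline
    PIPELINE = "summarization"  # the name you gave the pipeline above

    response = requests.post(
        f"https://api.cloud.deepset.ai/api/v1/workspaces/{WORKSPACE}/pipelines/{PIPELINE}/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"queries": ["summarize the report on advertising food to children"]},
    )
    response.raise_for_status()
    # Each query produces one result; the generated summary is in its answers.
    print(response.json()["results"][0]["answers"][0]["answer"])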