
🚀 Process YouTube Transcripts with Apify, OpenAI & Pinecone Database


🚀 YouTube Transcript Indexing Backend for Pinecone 🎥💾

This tutorial explains how to build the backend workflow in n8n that indexes YouTube video transcripts into a Pinecone vector database. Note: This workflow handles the processing and indexing of transcripts only; the retrieval agent (which searches these embeddings) is implemented separately.


📋 Workflow Overview

This backend workflow performs the following tasks:

  1. Fetch Video Records from Airtable 📥
    Retrieves video URLs and related metadata.

  2. Scrape YouTube Transcripts Using Apify 🎬
    Triggers an Apify actor to scrape transcripts with timestamps from each video.

  3. Update Airtable with Transcript Data 🔄
    Stores the fetched transcript JSON back in Airtable, matching records by video ID.

  4. Process & Chunk Transcripts ✂️
    Parses the transcript JSON, converts "mm:ss" timestamps to seconds, and groups entries into meaningful chunks. Each chunk is enriched with metadata such as video title, description, start/end timestamps, and a direct URL linking to that video moment.

  5. Generate Embeddings & Index in Pinecone 💾
    Uses OpenAI to create vector embeddings for each transcript chunk and indexes them in Pinecone. This enables efficient semantic search later by a separate retrieval agent.


🔧 Step-by-Step Guide

Step 1: Retrieve Video Records from Airtable 📥

  • Airtable Search Node:

    • Setup: Configure the node to fetch video records (with essential fields like url and metadata) from your Airtable base.
  • Loop Over Items:

    • Use a SplitInBatches node to process each video record individually.
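
Before wiring up Step 2, it helps to know what each item looks like as it leaves the loop. The field names below (url, videoid, ts) are the ones this workflow uses in later steps; the exact nesting depends on your Airtable node version, so treat this as an illustrative sketch rather than the guaranteed output shape:

    {
      "json": {
        "id": "recXXXXXXXXXXXXXX",
        "url": "https://www.youtube.com/watch?v=abc123",
        "videoid": "abc123",
        "ts": ""
      }
    }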

Step 2: Scrape YouTube Transcripts Using Apify 🎬

  • Trigger Apify Actor:

    • HTTP Request Node ("Apify NinjaPost"):
      • Method: POST
      • Endpoint: https://api.apify.com/v2/acts/topaz_sharingan~youtube-transcript-scraper-1/runs?token=<YOUR_TOKEN>
      • Payload Example:
        {
          "includeTimestamps": "Yes",
          "startUrls": ["{{ $json.url }}"]
        }
        
    • Purpose: Initiates transcript scraping for each video URL.
  • Wait for Processing:

    • Wait Node:
      • Duration: Approximately 1 minute, giving the Apify run time to finish; increase this for longer videos whose transcripts take more time to generate.
  • Retrieve Transcript Data:

    • HTTP Request Node ("Get JSON TS"):
      • Method: GET
      • Endpoint: https://api.apify.com/v2/acts/topaz_sharingan~youtube-transcript-scraper-1/runs/last/dataset/items?token=<YOUR_TOKEN>
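
The later steps expect transcript entries with "mm:ss" timestamps (see Step 4). The exact schema of the dataset items is defined by the actor, so treat the following as an illustrative guess at one returned item based on the fields used downstream, not a documented contract:

    [
      {
        "videoId": "abc123",
        "title": "Example video title",
        "transcript": [
          { "timestamp": "0:00", "text": "Welcome to the channel..." },
          { "timestamp": "0:04", "text": "Today we'll look at..." }
        ]
      }
    ]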

Step 3: Update Airtable with Transcript Data 🔄

  • Format Transcript Data:

    • Code Node ("Code"):
      • Task: Convert the fetched transcript JSON into a formatted string.
        // Take the first incoming item and pretty-print its JSON
        const jsonObject = items[0].json;
        const jsonString = JSON.stringify(jsonObject, null, 2);
        // n8n Code nodes must return an array of items, not a bare object
        return [{ json: { stringifiedJson: jsonString } }];
  • Extract the Video ID:

    • Set Node ("Edit Fields"):
      • Expression:
        {{$json.url.split('v=')[1].split('&')[0]}}
      • Example: https://www.youtube.com/watch?v=abc123&t=5s yields abc123 (illustrative video ID).
        
  • Update Airtable Record:

    • Airtable Update Node ("Airtable1"):
      • Updates:
        • ts: Stores the transcript string.
        • videoid: Uses the extracted video ID to match the record.

Step 4: Process Transcripts into Semantic Chunks ✂️

  • Retrieve Updated Records:

    • Airtable Search Node ("Airtable2"):
      • Purpose: Fetch records that now contain transcript data.
  • Parse and Chunk Transcripts:

    • Code Node ("Code4"):
      • Functionality (a code sketch follows at the end of this step):
        • Parses transcript JSON.
        • Converts "mm:ss" timestamps to seconds.
        • Groups transcript entries into chunks based on a 3-second gap.
        • Creates an object for each chunk that includes:
          • Text: The transcript segment.
          • Video Metadata: Video ID, title, description, published date, thumbnail.
          • Chunk Details: Start and end timestamps.
          • Direct URL: A link to the exact moment in the video (e.g., https://youtube.com/watch?v=VIDEOID&t=XXs).
  • Enrich & Split Text:

    • Default Data Loader Node:
      • Attaches additional metadata (e.g., video title, description) to each chunk.
    • Recursive Character Text Splitter Node:
      • Settings: Typically set to 500-character chunks with a 50-character overlap.
      • Purpose: Ensures long transcript texts are broken into manageable segments for embedding.
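
Below is a minimal sketch of the "Code4" chunking logic, assuming each transcript entry carries a "mm:ss" timestamp string and a text field (the assumed item shape from Step 2), and that the Airtable record exposes videoid (set in Step 3) plus title and description fields, whose names are assumptions here:

    // Convert "mm:ss" (or "hh:mm:ss") to seconds
    function toSeconds(ts) {
      return ts.split(':').map(Number).reduce((acc, p) => acc * 60 + p, 0);
    }

    const record = items[0].json;
    const parsed = JSON.parse(record.ts);          // transcript stored in Step 3
    const entries = parsed.transcript || parsed;   // adjust to the actual stored shape

    const chunks = [];
    let current = null;
    for (const entry of entries) {
      const start = toSeconds(entry.timestamp);
      // Start a new chunk when the gap since the previous entry exceeds 3 seconds
      if (!current || start - current.end > 3) {
        if (current) chunks.push(current);
        current = { text: entry.text, start, end: start };
      } else {
        current.text += ' ' + entry.text;
        current.end = start;
      }
    }
    if (current) chunks.push(current);

    // Emit one n8n item per chunk, enriched with metadata and a deep link
    return chunks.map((c) => ({
      json: {
        text: c.text,
        videoId: record.videoid,
        title: record.title,             // assumed Airtable field name
        description: record.description, // assumed Airtable field name
        startSeconds: c.start,
        endSeconds: c.end,
        url: `https://youtube.com/watch?v=${record.videoid}&t=${c.start}s`,
      },
    }));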

Step 5: Generate Embeddings & Index in Pinecone 💾

  • Generate Embeddings:

    • Embeddings OpenAI Node:
      • Task: Convert each transcript chunk into a vector embedding.
      • Tip: Adjust the batch size (e.g., 512) based on your data volume.
  • Index in Pinecone:

    • Pinecone Vector Store Node:
      • Configuration:
        • Index: Specify your Pinecone index (e.g., "videos").
        • Namespace: Use a dedicated namespace (e.g., "transcripts").
      • Outcome: Each enriched transcript chunk is stored in Pinecone, ready for semantic retrieval by a separate retrieval agent.
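
For orientation, the two nodes together do roughly what the following standalone script does with the OpenAI and Pinecone SDKs. This is a sketch of the equivalent API calls, not what n8n executes internally; the index "videos" and namespace "transcripts" match the configuration above, while the model choice and ID scheme are assumptions:

    import OpenAI from 'openai';
    import { Pinecone } from '@pinecone-database/pinecone';

    const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment
    const pc = new Pinecone();    // reads PINECONE_API_KEY from the environment
    const index = pc.index('videos').namespace('transcripts');

    // `chunks` would be the enriched items produced in Step 4
    async function indexChunks(chunks) {
      const res = await openai.embeddings.create({
        model: 'text-embedding-3-small',  // assumed; use the model set in your node
        input: chunks.map((c) => c.text),
      });
      await index.upsert(
        res.data.map((d, i) => ({
          id: `${chunks[i].videoId}-${chunks[i].startSeconds}`, // assumed ID scheme
          values: d.embedding,
          metadata: {
            text: chunks[i].text,
            url: chunks[i].url,
            title: chunks[i].title,
          },
        })),
      );
    }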

🎉 Final Thoughts

This backend workflow is dedicated to processing and indexing YouTube video transcripts so that a separate retrieval agent can perform efficient semantic searches. With this setup:

  • Transcripts Are Indexed:
    Chunks of transcripts are enriched with metadata and stored as vector embeddings.

  • Instant Topic Retrieval:
    A retrieval agent (implemented separately) can later query Pinecone to find the exact moment in a video where a topic is discussed, thanks to the direct URL and metadata stored with each chunk.

  • Scalable & Modular:
    The separation between indexing and retrieval allows for easy updates and scalability.

Happy automating and enjoy building powerful search capabilities with your YouTube content! 🎉
