Embedding Similarity

The Embedding Similarity component compares two embedding vectors: lists of numbers that represent items (such as words, sentences, or images) in a high‑dimensional space. Using the similarity metric you select, it reports how close or far apart the two items are, which is useful for tasks such as finding duplicate content, clustering similar items, or filtering results.

How it Works

The component receives two embedding vectors (lists of numbers) and a similarity metric.

  1. It checks that exactly two vectors are provided and that they have the same number of dimensions.
  2. Depending on the selected metric, it calculates a score:
    • Cosine Similarity – measures the cosine of the angle between the vectors (scores range from −1 to 1; values close to 1 mean very similar).
    • Euclidean Distance – measures straight‑line distance (smaller values mean more similar).
    • Manhattan Distance – sums absolute differences (again, smaller values mean more similar).
  3. The result is packaged into a Data object that includes the original embeddings and the computed score.

No external APIs are called; all calculations happen locally within the dashboard.
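
The steps above can be sketched in plain Python. This is a minimal illustration, not the component's actual source: the function name `compute_similarity` and the error messages are hypothetical, and only the standard library is used.

```python
import math

def compute_similarity(vec_a, vec_b, metric="Cosine Similarity"):
    """Hypothetical helper mirroring the documented steps."""
    # Step 1: both vectors must have the same number of dimensions.
    if len(vec_a) != len(vec_b):
        raise ValueError("Embedding vectors must have the same dimensions.")

    # Step 2: compute the score for the selected metric.
    if metric == "Cosine Similarity":
        dot = sum(a * b for a, b in zip(vec_a, vec_b))
        norm_a = math.sqrt(sum(a * a for a in vec_a))
        norm_b = math.sqrt(sum(b * b for b in vec_b))
        if norm_a == 0 or norm_b == 0:
            # Assumption: a zero vector is treated as an error here.
            raise ValueError("Cosine similarity is undefined for a zero vector.")
        return dot / (norm_a * norm_b)
    if metric == "Euclidean Distance":
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    if metric == "Manhattan Distance":
        return sum(abs(a - b) for a, b in zip(vec_a, vec_b))
    raise ValueError(f"Unknown metric: {metric}")

# Two perpendicular unit vectors.
a, b = [1.0, 0.0], [0.0, 1.0]
print(compute_similarity(a, b, "Cosine Similarity"))   # 0.0
print(compute_similarity(a, b, "Euclidean Distance"))  # about 1.414
print(compute_similarity(a, b, "Manhattan Distance"))  # 2.0
```

Note that the three scores are not on a common scale: cosine similarity is bounded, while the two distances grow with vector magnitude and dimensionality.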

Inputs

  • Embedding Vectors: A list containing exactly two data objects with embedding vectors to compare.
  • Similarity Metric: Select the similarity metric to use. Options are Cosine Similarity, Euclidean Distance, and Manhattan Distance.

Outputs

  • Similarity Data: A Data object that contains:
    • embedding_1 – the first input vector
    • embedding_2 – the second input vector
    • similarity_score – the calculated score, stored under a key named after the chosen metric (e.g., cosine_similarity)

This output can be fed into other components such as a threshold filter, a visualizer, or a storage component.
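
As a rough picture of that output, the sketch below uses a plain dictionary as a stand-in for the Data object. The nested layout (a metric-named key inside `similarity_score`) is one reading of the description above and should be treated as an assumption; `build_similarity_data` is a hypothetical helper, not part of the component.

```python
def build_similarity_data(embedding_1, embedding_2, metric_key, score):
    """Hypothetical helper: packs the inputs and score into a plain dict."""
    return {
        "embedding_1": embedding_1,
        "embedding_2": embedding_2,
        # Assumed layout: the score sits under a key named after the metric.
        "similarity_score": {metric_key: score},
    }

result = build_similarity_data([1.0, 0.0], [1.0, 0.0], "cosine_similarity", 1.0)

# A downstream step would read the score back by the metric's key.
score = result["similarity_score"]["cosine_similarity"]
print(score)  # 1.0
```

Inspecting the real output once in your own workflow is the safest way to confirm the exact key names before wiring it into other components.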

Usage Example

  1. Generate embeddings: Use an embedding generator component to turn two pieces of text into vectors.
  2. Add Embedding Similarity: Drag the component onto the canvas.
  3. Connect inputs: Link the two embedding outputs to the Embedding Vectors input.
  4. Choose metric: In the dropdown, pick Cosine Similarity.
  5. Run the workflow: The component will output a similarity score that you can display, log, or use to decide if the texts are duplicates.

Related Components

  • Embedding Generator – Creates the vectors that feed into this component.
  • Similarity Threshold – Filters results based on a similarity score.
  • Data Filter – Allows you to keep or discard items after similarity calculation.

Tips and Best Practices

  • Keep dimensions consistent: Both embeddings must have the same length; otherwise the component returns an error.
  • Choose the right metric: Cosine similarity is often best for text embeddings, while Euclidean or Manhattan distances can be useful for spatial data.
  • Use thresholds wisely: Combine with a threshold component to automatically flag items that are too similar or too different.
  • Check data privacy: If embeddings contain sensitive information, ensure they are stored securely.
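
When applying a threshold, keep in mind that the comparison direction depends on the metric: a higher cosine similarity means more similar, while a lower Euclidean or Manhattan distance means more similar. The sketch below illustrates this; `is_similar` is a hypothetical helper, not the Similarity Threshold component's logic, and the key names `euclidean_distance` and `manhattan_distance` are assumed by analogy with `cosine_similarity`.

```python
# Metrics where a HIGHER score means "more similar".
# Any other key is treated as a distance (lower is more similar).
HIGHER_IS_MORE_SIMILAR = {"cosine_similarity"}

def is_similar(score, metric_key, threshold):
    """Return True if the score counts as 'similar' for the given metric."""
    if metric_key in HIGHER_IS_MORE_SIMILAR:
        return score >= threshold
    return score <= threshold

print(is_similar(0.95, "cosine_similarity", 0.9))   # True
print(is_similar(0.5, "euclidean_distance", 1.0))   # True
print(is_similar(2.0, "manhattan_distance", 1.0))   # False
```

Suitable threshold values depend on the embedding model and the data, so they are best chosen by checking a few known similar and dissimilar pairs.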

Security Considerations

  • The component performs all calculations locally; no data leaves the dashboard.
  • Still, treat embedding data as potentially sensitive, especially if it originates from personal or confidential content.
  • Store or transmit the resulting similarity data only over secure channels if it will be sent outside the system.