Some HTML content
group_ids: - d290f1ee-6c54-4b01-90e6-d701748f0851 group_tracking_ids: - group_tracking_id image_urls: - https://example.com/red - https://example.com/blue link: https://example.com location: lat: -34 lon: 151 metadata: key1: value1 key2: value2 tag_set: - tag1 - tag2 time_stamp: '2021-01-01 00:00:00.000' tracking_id: tracking_id upsert_by_tracking_id: true - chunk_html: Some more HTML content
group_ids: - d290f1ee-6c54-4b01-90e6-d701748f0851 group_tracking_ids: - group_tracking_id image_urls: [] link: https://explain.com location: lat: -34 lon: 151 metadata: key1: value1 key2: value2 tag_set: - tag3 - tag4 time_stamp: '2021-01-01 00:00:00.000' tracking_id: tracking_id upsert_by_tracking_id: true weight: 0.5 SingleQueuedChunkResponse: type: object required: - chunk_metadata properties: chunk_metadata: $ref: '#/components/schemas/ChunkMetadata' example: chunk_metadata: - content: Some content link: https://example.com metadata: key1: value1 key2: value2 tag_set: - tag1 - tag2 time_stamp: '2021-01-01 00:00:00.000' tracking_id: tracking_id weight: 0.5 pos_in_queue: 1 BatchQueuedChunkResponse: type: object title: batch required: - chunk_metadata properties: chunk_metadata: type: array items: $ref: '#/components/schemas/ChunkMetadata' example: chunk_metadata: - content: Some content file_id: d290f1ee-6c54-4b01-90e6-d701748f0851 link: https://example.com metadata: key1: value1 key2: value2 tag_set: - tag1 - tag2 time_stamp: '2021-01-01 00:00:00.000' tracking_id: tracking_id weight: 0.5 - content: Some content file_id: d290f1ee-6c54-4b01-90e6-d701748f0851 link: https://example.com metadata: key1: value1 key2: value2 tag_set: - tag1 - tag2 time_stamp: '2021-01-01 00:00:00.000' tracking_id: tracking_id weight: 0.5 pos_in_queue: 2 ChunkReqPayload: type: object title: single description: Request payload for creating a new chunk properties: chunk_html: type: string description: >- HTML content of the chunk. This can also be plaintext. The innerText of the HTML will be used to create the embedding vector. HTML is accepted for convenience, as some users have applications where users submit HTML content. nullable: true convert_html_to_text: type: boolean description: >- Convert HTML to raw text before processing to avoid adding noise to the vector embeddings. By default this is true.
If you are using HTML content that you want to be included in the vector embeddings, set this to false. nullable: true fulltext_boost: allOf: - $ref: '#/components/schemas/FullTextBoost' nullable: true fulltext_content: type: string description: >- If fulltext_content is present, it will be used for creating the fulltext and bm25 sparse vectors instead of the innerText `chunk_html`. `chunk_html` will still be the only thing stored and used for semantic functionality unless the corresponding `semantic_content` field is defined. `chunk_html` must still be present for the chunk to be created properly. nullable: true group_ids: type: array items: type: string format: uuid description: >- Group ids are the Trieve generated ids of the groups that the chunk should be placed into. This is useful for when you want to create a chunk and add it to a group or multiple groups in one request. Groups with these Trieve generated ids must be created first; they cannot be created arbitrarily through this route. nullable: true group_tracking_ids: type: array items: type: string description: >- Group tracking_ids are the user-assigned tracking_ids of the groups that the chunk should be placed into. This is useful for when you want to create a chunk and add it to a group or multiple groups in one request. If a group with the tracking_id does not exist, it will be created. nullable: true high_priority: type: boolean description: >- High Priority allows you to place this chunk into a priority queue with its own ingestion workers. Can only be used by users with a Custom Pro plan. nullable: true image_urls: type: array items: type: string description: >- Image urls are a list of urls to images that are associated with the chunk. This is useful for when you want to associate images with a chunk. nullable: true link: type: string description: >- Link to the chunk. This can also be any string. Frequently, this is a link to the source of the chunk.
The link value will not affect the embedding creation. nullable: true location: allOf: - $ref: '#/components/schemas/GeoInfo' nullable: true metadata: description: >- Metadata is a JSON object which can be used to filter chunks. This is useful for when you want to filter chunks by arbitrary metadata. Unlike with tag filtering, there is a performance hit for filtering on metadata. nullable: true num_value: type: number format: double description: >- Num value is an arbitrary numerical value that can be used to filter chunks. This is useful for when you want to filter chunks by numerical value. There is no performance hit for filtering on num_value. nullable: true semantic_boost: allOf: - $ref: '#/components/schemas/SemanticBoost' nullable: true semantic_content: type: string description: >- If semantic_content is present, it will be used for creating semantic embeddings instead of the innerText `chunk_html`. `chunk_html` will still be the only thing stored and used for fulltext functionality unless the corresponding `fulltext_content` field is defined. `chunk_html` must still be present for the chunk to be created properly. nullable: true split_avg: type: boolean description: >- Split avg is a boolean which tells the server to split the text in the chunk_html into smaller chunks and average their resulting vectors. This is useful for when you want to create a chunk from a large piece of text and want to split it into smaller chunks to create a more fuzzy average dense vector. The sparse vector will be generated normally with no averaging. By default this is false. nullable: true tag_set: type: array items: type: string description: >- Tag set is a list of tags. This can be used to filter chunks by tag. Unlike with metadata filtering, HNSW indices will exist for each tag such that there is not a performance hit for filtering on them. nullable: true time_stamp: type: string description: >- Time_stamp should be an ISO 8601 combined date and time without timezone. 
It is used for time window filtering and recency-biasing search results. nullable: true tracking_id: type: string description: >- Tracking_id is a string which can be used to identify a chunk. This is useful for when you are coordinating with an external system and want to use the tracking_id to identify the chunk. nullable: true upsert_by_tracking_id: type: boolean description: >- Upsert when a chunk with the same tracking_id exists. By default this is false, and chunks will be ignored if another with the same tracking_id exists. If this is true, the chunk will be updated if a chunk with the same tracking_id exists. nullable: true weight: type: number format: double description: >- Weight is a float which can be used to bias search results. This is useful for when you want to bias search results for a chunk. The magnitude only matters relative to other chunks in the chunk's dataset. nullable: true example: chunk_html: Some HTML content
fulltext_boost: boost_factor: 5 phrase: foo group_ids: - d290f1ee-6c54-4b01-90e6-d701748f0851 group_tracking_ids: - group_tracking_id image_urls: - https://example.com/red - https://example.com/blue link: https://example.com location: lat: -34 lon: 151 metadata: key1: value1 key2: value2 semantic_boost: distance_factor: 0.5 phrase: flagship tag_set: - tag1 - tag2 time_stamp: '2021-01-01 00:00:00.000' tracking_id: tracking_id ChunkMetadata: type: object title: V2 required: - id - created_at - updated_at - dataset_id - weight properties: chunk_html: type: string description: >- HTML content of the chunk, can also be an arbitrary string which is not HTML nullable: true created_at: type: string format: date-time description: Timestamp of the creation of the chunk dataset_id: type: string format: uuid description: ID of the dataset which the chunk belongs to id: type: string format: uuid description: >- Unique identifier of the chunk, auto-generated uuid created by Trieve image_urls: type: array items: type: string nullable: true description: >- Image URLs of the chunk, can be any list of strings. Used for image search and RAG. nullable: true link: type: string description: Link to the chunk, should be a URL nullable: true location: allOf: - $ref: '#/components/schemas/GeoInfo' nullable: true metadata: description: Metadata of the chunk, can be any JSON object nullable: true num_value: type: number format: double description: >- Numeric value of the chunk, can be any float. Can represent the most relevant numeric value of the chunk, such as a price, quantity in stock, rating, etc. nullable: true tag_set: type: array items: type: string nullable: true description: >- Tag set of the chunk, can be any list of strings. Used for tag-filtered searches. nullable: true time_stamp: type: string format: date-time description: Timestamp of the chunk, can be any timestamp. Specified by the user. 
nullable: true tracking_id: type: string description: >- Tracking ID of the chunk, can be any string, determined by the user. Tracking IDs are unique identifiers for chunks within a dataset. They are designed to match the unique identifier of the chunk in the user's system. nullable: true updated_at: type: string format: date-time description: Timestamp of the last update of the chunk weight: type: number format: double description: >- Weight of the chunk, can be any float. Used as a multiplier on a chunk's relevance score for ranking purposes. example: chunk_html: Hello, world!
created_at: '2021-01-01 00:00:00.000' dataset_id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3 id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3 link: https://trieve.ai metadata: key: value tag_set: '[tag1,tag2]' time_stamp: '2021-01-01 00:00:00.000' tracking_id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3 updated_at: '2021-01-01 00:00:00.000' weight: 0.5 FullTextBoost: type: object description: >- Boost the presence of certain tokens for fulltext (SPLADE) and keyword (BM25) search. For example, boost title phrases to prioritize title matches, or make sure that the listing for AirBNB itself ranks higher than companies who make software for AirBNB hosts by boosting the in-document-frequency of the AirBNB token (AKA word) for its official listing. Conceptually it multiplies the in-document-importance second value in the tuples of the SPLADE or BM25 sparse vector of the chunk_html innerText for all tokens present in the boost phrase by the boost factor like so: (token, in-document-importance) -> (token, in-document-importance*boost_factor). required: - phrase - boost_factor properties: boost_factor: type: number format: double description: >- Amount by which to multiply the frequency of the tokens in the phrase phrase: type: string description: The phrase to boost in the fulltext document frequency index GeoInfo: type: object description: Location that you want to use as the center of the search. required: - lat - lon properties: lat: $ref: '#/components/schemas/GeoTypes' lon: $ref: '#/components/schemas/GeoTypes' SemanticBoost: type: object description: >- Semantic boosting moves the dense vector of the chunk in the direction of the distance phrase for semantic search. For example, you can force a cluster by moving every chunk for a PDF closer to its title or push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25.
Conceptually it's drawing a line (Euclidean/L2 distance) between the vector for the innerText of the chunk_html and distance_phrase then moving the vector of the chunk_html distance_factor*L2Distance closer to or away from the distance_phrase point along the line between the two points. required: - phrase - distance_factor properties: distance_factor: type: number format: float description: >- Arbitrary float (positive or negative) specifying the multiplicative factor to apply before summing the phrase vector with the chunk_html embedding vector phrase: type: string description: >- Terms to embed in order to create the vector which is combined in a weighted sum with the chunk_html embedding vector GeoTypes: oneOf: - type: integer format: int64 - type: number format: double securitySchemes: ApiKey: type: apiKey in: header name: Authorization
````
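
The `FullTextBoost` and `SemanticBoost` descriptions above are conceptual, so a small worked example may help. The following Python sketch is an illustration only, not Trieve's implementation: the function names, the whitespace tokenizer, the dict-based sparse vector, and the toy two-dimensional embeddings are all assumptions made for the example. It reads the fulltext boost as `(token, importance) -> (token, importance * boost_factor)` and the semantic boost as moving the chunk vector a fraction `distance_factor` of the way along the line towards the phrase vector.

```python
from typing import Dict, List


def apply_fulltext_boost(
    sparse: Dict[str, float], phrase: str, boost_factor: float
) -> Dict[str, float]:
    """Multiply the importance of every token that appears in the boost phrase."""
    # Hypothetical tokenizer: lowercase + whitespace split.
    # Real SPLADE/BM25 tokenizers behave differently.
    boosted_tokens = set(phrase.lower().split())
    return {
        token: weight * boost_factor if token in boosted_tokens else weight
        for token, weight in sparse.items()
    }


def apply_semantic_boost(
    chunk_vec: List[float], phrase_vec: List[float], distance_factor: float
) -> List[float]:
    """Move chunk_vec along the straight line towards phrase_vec.

    A positive distance_factor moves the chunk closer to the phrase;
    a negative one pushes it away.
    """
    if len(chunk_vec) != len(phrase_vec):
        raise ValueError("vectors must have the same dimension")
    return [c + distance_factor * (p - c) for c, p in zip(chunk_vec, phrase_vec)]


# Toy sparse vector for a chunk (token -> in-document importance).
sparse = {"airbnb": 2.0, "host": 1.0, "software": 0.5}
boosted = apply_fulltext_boost(sparse, phrase="AirBNB", boost_factor=5)

# Toy dense vectors; real embeddings have hundreds of dimensions.
moved = apply_semantic_boost([0.0, 0.0], [4.0, 0.0], distance_factor=0.25)

print(boosted)  # {'airbnb': 10.0, 'host': 1.0, 'software': 0.5}
print(moved)    # [1.0, 0.0]
```

With `distance_factor=0.25` the chunk vector ends up a quarter of the L2 distance closer to the phrase vector, which matches the "25% closer" wording in the `SemanticBoost` description. Whether the server normalizes the vector afterwards is not stated in the schema, so the sketch leaves it unnormalized.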