# Create or Upsert Chunk or Chunks

> Create new chunk(s). If the chunk has the same tracking_id as an existing chunk, the request will fail. Once a chunk is created, it can be searched for using the search endpoint. If uploading in bulk, the maximum number of chunks that can be uploaded at once is 120. The authenticated user or API key must have an admin or owner role for the specified dataset's organization.

## OpenAPI

````yaml post /api/chunk
openapi: 3.0.3
info:
  title: Trieve API
  description: >-
    Trieve OpenAPI Specification. This document describes all of the operations
    available through the Trieve API.
  contact:
    name: Trieve Team
    url: https://trieve.ai
    email: developers@trieve.ai
  license:
    name: BSL
    url: https://github.com/devflowinc/trieve/blob/main/LICENSE.txt
  version: 0.13.0
servers:
  - url: https://api.trieve.ai
    description: Production server
  - url: http://localhost:8090
    description: Local development server
security: []
tags:
  - name: Invitation
    description: Invitation endpoint. Exists to invite users to an organization.
  - name: Auth
    description: Authentication endpoint. Serves to register and authenticate users.
  - name: User
    description: User endpoint. Enables you to modify user roles and information.
  - name: Organization
    description: >-
      Organization endpoint. Enables you to modify organization roles and
      information.
  - name: Dataset
    description: >-
      Dataset endpoint. Datasets belong to organizations and hold configuration
      information for both client and server. Datasets contain chunks and chunk
      groups.
  - name: Chunk
    description: >-
      Chunk endpoint. Think of chunks as individual searchable units of
      information. The majority of your integration will likely be with the
      Chunk endpoint.
  - name: Chunk Group
    description: >-
      Chunk groups endpoint. Think of a chunk_group as a bookmark folder within
      the dataset.
  - name: Crawl
    description: Crawl endpoint. Used to create and manage crawls for datasets.
  - name: File
    description: >-
      File endpoint. When files are uploaded, they are stored in S3 and broken
      up into chunks with text extraction from Apache Tika. You can upload files
      of pretty much any type up to 1GB in size. See chunking algorithm details
      at `docs.trieve.ai` for more information on how chunking works. Improved
      default chunking is on our roadmap.
  - name: Events
    description: >-
      Notifications endpoint. Files are uploaded asynchronously and events are
      sent to the user when the upload is complete.
  - name: Topic
    description: >-
      Topic chat endpoint. Think of topics as the storage system for gen-ai chat
      memory. Gen AI messages belong to topics.
  - name: Message
    description: >-
      Message chat endpoint. Messages are units belonging to a topic in the
      context of a chat with an LLM. There are system, user, and assistant
      messages.
  - name: Stripe
    description: >-
      Stripe endpoint. Used for the managed SaaS version of this app. Eventually
      this will become a micro-service. Reach out to the team using contact info
      found at `docs.trieve.ai` for more information.
  - name: Health
    description: Health check endpoint. Used to check if the server is up and running.
  - name: Metrics
    description: Metrics endpoint. Used to get information for monitoring.
  - name: Analytics
    description: Analytics endpoint. Used to get information for search and RAG analytics.
  - name: Experiment
    description: Experiment endpoint. Used to create and manage experiments.
paths:
  /api/chunk:
    post:
      tags:
        - Chunk
      summary: Create or Upsert Chunk or Chunks
      description: >-
        Create new chunk(s). If the chunk has the same tracking_id as an
        existing chunk, the request will fail. Once a chunk is created, it can
        be searched for using the search endpoint.

        If uploading in bulk, the maximum number of chunks that can be uploaded
        at once is 120. The authenticated user or API key must have an admin or
        owner role for the specified dataset's organization.
      operationId: create_chunk
      parameters:
        - name: TR-Dataset
          in: header
          description: >-
            The dataset id or tracking_id to use for the request. We assume you
            intend to use an id if the value is a valid uuid.
          required: true
          schema:
            type: string
            format: uuid
      requestBody:
        description: JSON request payload to create new chunk(s)
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateChunkReqPayloadEnum'
        required: true
      responses:
        '200':
          description: JSON response payload containing the created chunk
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ReturnQueuedChunk'
        '400':
          description: Error typically due to deserialization issues
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
        '413':
          description: Error when more than 120 chunks are provided in bulk
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
        '426':
          description: Error when upgrade is needed to process more chunks
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponseBody'
      security:
        - ApiKey:
            - admin
components:
  schemas:
    CreateChunkReqPayloadEnum:
      oneOf:
        - $ref: '#/components/schemas/CreateSingleChunkReqPayload'
        - $ref: '#/components/schemas/CreateBatchChunkReqPayload'
    ReturnQueuedChunk:
      oneOf:
        - $ref: '#/components/schemas/SingleQueuedChunkResponse'
        - $ref: '#/components/schemas/BatchQueuedChunkResponse'
    ErrorResponseBody:
      type: object
      required:
        - message
      properties:
        message:
          type: string
      example:
        message: Bad Request
    CreateSingleChunkReqPayload:
      $ref: '#/components/schemas/ChunkReqPayload'
    CreateBatchChunkReqPayload:
      type: array
      items:
        $ref: '#/components/schemas/ChunkReqPayload'
      example:
        - chunk_html: <p>Some HTML content</p>
          group_ids:
            - d290f1ee-6c54-4b01-90e6-d701748f0851
          group_tracking_ids:
            - group_tracking_id
          image_urls:
            - https://example.com/red
            - https://example.com/blue
          link: https://example.com
          location:
            lat: -34
            lon: 151
          metadata:
            key1: value1
            key2: value2
          tag_set:
            - tag1
            - tag2
          time_stamp: '2021-01-01 00:00:00.000'
          tracking_id: tracking_id
          upsert_by_tracking_id: true
        - chunk_html: <p>Some more HTML content</p>
          group_ids:
            - d290f1ee-6c54-4b01-90e6-d701748f0851
          group_tracking_ids:
            - group_tracking_id
          image_urls: []
          link: https://explain.com
          location:
            lat: -34
            lon: 151
          metadata:
            key1: value1
            key2: value2
          tag_set:
            - tag3
            - tag4
          time_stamp: '2021-01-01 00:00:00.000'
          tracking_id: tracking_id
          upsert_by_tracking_id: true
          weight: 0.5
    SingleQueuedChunkResponse:
      type: object
      required:
        - chunk_metadata
      properties:
        chunk_metadata:
          $ref: '#/components/schemas/ChunkMetadata'
      example:
        chunk_metadata:
          - content: Some content
            link: https://example.com
            metadata:
              key1: value1
              key2: value2
            tag_set:
              - tag1
              - tag2
            time_stamp: '2021-01-01 00:00:00.000'
            tracking_id: tracking_id
            weight: 0.5
        pos_in_queue: 1
    BatchQueuedChunkResponse:
      type: object
      title: batch
      required:
        - chunk_metadata
      properties:
        chunk_metadata:
          type: array
          items:
            $ref: '#/components/schemas/ChunkMetadata'
      example:
        chunk_metadata:
          - content: Some content
            file_id: d290f1ee-6c54-4b01-90e6-d701748f0851
            link: https://example.com
            metadata:
              key1: value1
              key2: value2
            tag_set:
              - tag1
              - tag2
            time_stamp: '2021-01-01 00:00:00.000'
            tracking_id: tracking_id
            weight: 0.5
          - content: Some content
            file_id: d290f1ee-6c54-4b01-90e6-d701748f0851
            link: https://example.com
            metadata:
              key1: value1
              key2: value2
            tag_set:
              - tag1
              - tag2
            time_stamp: '2021-01-01 00:00:00.000'
            tracking_id: tracking_id
            weight: 0.5
        pos_in_queue: 2
    ChunkReqPayload:
      type: object
      title: single
      description: Request payload for creating a new chunk
      properties:
        chunk_html:
          type: string
          description: >-
            HTML content of the chunk. This can also be plaintext. The innerText
            of the HTML will be used to create the embedding vector. The point
            of using HTML is for convenience, as some users have applications
            where users submit HTML content.
          nullable: true
        convert_html_to_text:
          type: boolean
          description: >-
            Convert HTML to raw text before processing to avoid adding noise to
            the vector embeddings. By default this is true. If you are using
            HTML content that you want to be included in the vector embeddings,
            set this to false.
          nullable: true
        fulltext_boost:
          allOf:
            - $ref: '#/components/schemas/FullTextBoost'
          nullable: true
        fulltext_content:
          type: string
          description: >-
            If fulltext_content is present, it will be used for creating the
            fulltext and bm25 sparse vectors instead of the innerText of
            `chunk_html`. `chunk_html` will still be the only thing stored and
            used for semantic functionality unless the corresponding
            `semantic_content` field is defined. `chunk_html` must still be
            present for the chunk to be created properly.
          nullable: true
        group_ids:
          type: array
          items:
            type: string
            format: uuid
          description: >-
            Group ids are the Trieve generated ids of the groups that the chunk
            should be placed into. This is useful for when you want to create a
            chunk and add it to a group or multiple groups in one request.
            Groups with these Trieve generated ids must be created first; they
            cannot be created arbitrarily through this route.
          nullable: true
        group_tracking_ids:
          type: array
          items:
            type: string
          description: >-
            Group tracking_ids are the user-assigned tracking_ids of the groups
            that the chunk should be placed into. This is useful for when you
            want to create a chunk and add it to a group or multiple groups in
            one request. If a group with the tracking_id does not exist, it will
            be created.
          nullable: true
        high_priority:
          type: boolean
          description: >-
            High Priority allows you to place this chunk into a priority queue
            with its own ingestion workers. Can only be used by users with a
            Custom Pro plan.
          nullable: true
        image_urls:
          type: array
          items:
            type: string
          description: >-
            Image urls are a list of urls to images that are associated with the
            chunk. This is useful for when you want to associate images with a
            chunk.
          nullable: true
        link:
          type: string
          description: >-
            Link to the chunk. This can also be any string. Frequently, this is
            a link to the source of the chunk. The link value will not affect
            the embedding creation.
          nullable: true
        location:
          allOf:
            - $ref: '#/components/schemas/GeoInfo'
          nullable: true
        metadata:
          description: >-
            Metadata is a JSON object which can be used to filter chunks. This
            is useful for when you want to filter chunks by arbitrary metadata.
            Unlike with tag filtering, there is a performance hit for filtering
            on metadata.
          nullable: true
        num_value:
          type: number
          format: double
          description: >-
            Num value is an arbitrary numerical value that can be used to filter
            chunks. This is useful for when you want to filter chunks by
            numerical value. There is no performance hit for filtering on
            num_value.
          nullable: true
        semantic_boost:
          allOf:
            - $ref: '#/components/schemas/SemanticBoost'
          nullable: true
        semantic_content:
          type: string
          description: >-
            If semantic_content is present, it will be used for creating
            semantic embeddings instead of the innerText of `chunk_html`.
            `chunk_html` will still be the only thing stored and used for
            fulltext functionality unless the corresponding `fulltext_content`
            field is defined. `chunk_html` must still be present for the chunk
            to be created properly.
          nullable: true
        split_avg:
          type: boolean
          description: >-
            Split avg is a boolean which tells the server to split the text in
            the chunk_html into smaller chunks and average their resulting
            vectors. This is useful for when you want to create a chunk from a
            large piece of text and want to split it into smaller chunks to
            create a more fuzzy average dense vector. The sparse vector will be
            generated normally with no averaging. By default this is false.
          nullable: true
        tag_set:
          type: array
          items:
            type: string
          description: >-
            Tag set is a list of tags. This can be used to filter chunks by tag.
            Unlike with metadata filtering, HNSW indices will exist for each tag
            such that there is not a performance hit for filtering on them.
          nullable: true
        time_stamp:
          type: string
          description: >-
            Time_stamp should be an ISO 8601 combined date and time without
            timezone. It is used for time window filtering and recency-biasing
            search results.
          nullable: true
        tracking_id:
          type: string
          description: >-
            Tracking_id is a string which can be used to identify a chunk. This
            is useful for when you are coordinating with an external system and
            want to use the tracking_id to identify the chunk.
          nullable: true
        upsert_by_tracking_id:
          type: boolean
          description: >-
            Upsert when a chunk with the same tracking_id exists. By default
            this is false, and chunks will be ignored if another with the same
            tracking_id exists. If this is true, the chunk will be updated if a
            chunk with the same tracking_id exists.
          nullable: true
        weight:
          type: number
          format: double
          description: >-
            Weight is a float which can be used to bias search results. This is
            useful for when you want to bias search results for a chunk. The
            magnitude only matters relative to other chunks in the chunk's
            dataset.
          nullable: true
      example:
        chunk_html: <p>Some HTML content</p>
        fulltext_boost:
          boost_factor: 5
          phrase: foo
        group_ids:
          - d290f1ee-6c54-4b01-90e6-d701748f0851
        group_tracking_ids:
          - group_tracking_id
        image_urls:
          - https://example.com/red
          - https://example.com/blue
        link: https://example.com
        location:
          lat: -34
          lon: 151
        metadata:
          key1: value1
          key2: value2
        semantic_boost:
          distance_factor: 0.5
          phrase: flagship
        tag_set:
          - tag1
          - tag2
        time_stamp: '2021-01-01 00:00:00.000'
        tracking_id: tracking_id
    ChunkMetadata:
      type: object
      title: V2
      required:
        - id
        - created_at
        - updated_at
        - dataset_id
        - weight
      properties:
        chunk_html:
          type: string
          description: >-
            HTML content of the chunk, can also be an arbitrary string which is
            not HTML
          nullable: true
        created_at:
          type: string
          format: date-time
          description: Timestamp of the creation of the chunk
        dataset_id:
          type: string
          format: uuid
          description: ID of the dataset which the chunk belongs to
        id:
          type: string
          format: uuid
          description: >-
            Unique identifier of the chunk, auto-generated uuid created by
            Trieve
        image_urls:
          type: array
          items:
            type: string
            nullable: true
          description: >-
            Image URLs of the chunk, can be any list of strings. Used for image
            search and RAG.
          nullable: true
        link:
          type: string
          description: Link to the chunk, should be a URL
          nullable: true
        location:
          allOf:
            - $ref: '#/components/schemas/GeoInfo'
          nullable: true
        metadata:
          description: Metadata of the chunk, can be any JSON object
          nullable: true
        num_value:
          type: number
          format: double
          description: >-
            Numeric value of the chunk, can be any float. Can represent the most
            relevant numeric value of the chunk, such as a price, quantity in
            stock, rating, etc.
          nullable: true
        tag_set:
          type: array
          items:
            type: string
            nullable: true
          description: >-
            Tag set of the chunk, can be any list of strings. Used for
            tag-filtered searches.
          nullable: true
        time_stamp:
          type: string
          format: date-time
          description: Timestamp of the chunk, can be any timestamp. Specified by the user.
          nullable: true
        tracking_id:
          type: string
          description: >-
            Tracking ID of the chunk, can be any string, determined by the user.
            Tracking IDs are unique identifiers for chunks within a dataset.
            They are designed to match the unique identifier of the chunk in the
            user's system.
          nullable: true
        updated_at:
          type: string
          format: date-time
          description: Timestamp of the last update of the chunk
        weight:
          type: number
          format: double
          description: >-
            Weight of the chunk, can be any float. Used as a multiplier on a
            chunk's relevance score for ranking purposes.
      example:
        chunk_html: <p>Hello, world!</p>
        created_at: '2021-01-01 00:00:00.000'
        dataset_id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        link: https://trieve.ai
        metadata:
          key: value
        tag_set: '[tag1,tag2]'
        time_stamp: '2021-01-01 00:00:00.000'
        tracking_id: e3e3e3e3-e3e3-e3e3-e3e3-e3e3e3e3e3e3
        updated_at: '2021-01-01 00:00:00.000'
        weight: 0.5
    FullTextBoost:
      type: object
      description: >-
        Boost the presence of certain tokens for fulltext (SPLADE) and keyword
        (BM25) search. For example, boost title phrases to prioritize title
        matches, or make sure that the listing for Airbnb itself ranks higher
        than companies who make software for Airbnb hosts by boosting the
        in-document-frequency of the Airbnb token (AKA word) for its official
        listing. Conceptually it multiplies the in-document-importance second
        value in the tuples of the SPLADE or BM25 sparse vector of the
        chunk_html innerText for all tokens present in the boost phrase by the
        boost factor like so: (token, in-document-importance) -> (token,
        in-document-importance*boost_factor).
      required:
        - phrase
        - boost_factor
      properties:
        boost_factor:
          type: number
          format: double
          description: >-
            Amount to multiplicatively increase the frequency of the tokens in
            the phrase by
        phrase:
          type: string
          description: The phrase to boost in the fulltext document frequency index
    GeoInfo:
      type: object
      description: Location that you want to use as the center of the search.
      required:
        - lat
        - lon
      properties:
        lat:
          $ref: '#/components/schemas/GeoTypes'
        lon:
          $ref: '#/components/schemas/GeoTypes'
    SemanticBoost:
      type: object
      description: >-
        Semantic boosting moves the dense vector of the chunk in the direction
        of the distance phrase for semantic search. For example, you can force
        a cluster by moving every chunk for a PDF closer to its title or push a
        chunk with a chunk_html of "iphone" 25% closer to the term "flagship"
        by using the distance phrase "flagship" and a distance factor of 0.25.
        Conceptually it's drawing a line (euclidean/L2 distance) between the
        vector for the innerText of the chunk_html and distance_phrase then
        moving the vector of the chunk_html distance_factor*L2Distance closer
        to or away from the distance_phrase point along the line between the
        two points.
      required:
        - phrase
        - distance_factor
      properties:
        distance_factor:
          type: number
          format: float
          description: >-
            Arbitrary float (positive or negative) specifying the multiplicative
            factor to apply before summing the phrase vector with the chunk_html
            embedding vector
        phrase:
          type: string
          description: >-
            Terms to embed in order to create the vector which is weighted
            summed with the chunk_html embedding vector
    GeoTypes:
      oneOf:
        - type: integer
          format: int64
        - type: number
          format: double
  securitySchemes:
    ApiKey:
      type: apiKey
      in: header
      name: Authorization
````
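The bulk limit and required headers described in the spec above can be sketched in client code. The following is a minimal, non-authoritative Python sketch using only the standard library: it splits a list of chunk payloads into batches of at most 120 and builds (without sending) one `POST /api/chunk` request per batch. The helper names `batched` and `build_requests`, the dataset id, and the API key value are hypothetical placeholders for illustration; only the URL, the `TR-Dataset` and `Authorization` header names, the 120-chunk limit, and the payload field names come from the spec.

```python
import json
import urllib.request
from typing import Iterator

API_URL = "https://api.trieve.ai/api/chunk"  # production server + path from the spec
MAX_BULK_CHUNKS = 120  # bulk limit stated in the endpoint description


def batched(chunks: list[dict], size: int = MAX_BULK_CHUNKS) -> Iterator[list[dict]]:
    """Yield consecutive slices of `chunks`, each at most `size` items long."""
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]


def build_requests(chunks: list[dict], dataset_id: str, api_key: str) -> list[urllib.request.Request]:
    """Build, but do not send, one POST request per batch (hypothetical helper)."""
    requests = []
    for batch in batched(chunks):
        requests.append(
            urllib.request.Request(
                API_URL,
                data=json.dumps(batch).encode("utf-8"),  # batch payload is a JSON array
                method="POST",
                headers={
                    "Content-Type": "application/json",
                    "TR-Dataset": dataset_id,   # dataset id or tracking_id
                    "Authorization": api_key,   # apiKey security scheme from the spec
                },
            )
        )
    return requests


# Illustrative payloads; field names follow ChunkReqPayload.
chunks = [
    {
        "chunk_html": f"<p>Document {i}</p>",
        "tracking_id": f"doc-{i}",
        "upsert_by_tracking_id": True,
        "tag_set": ["example"],
    }
    for i in range(250)
]

# Placeholder credentials, assumed for illustration only.
reqs = build_requests(chunks, "d290f1ee-6c54-4b01-90e6-d701748f0851", "tr-PLACEHOLDER")
batch_sizes = [len(json.loads(r.data)) for r in reqs]
print(batch_sizes)  # [120, 120, 10]
```

Each request could then be sent with `urllib.request.urlopen(req)`. Splitting on the client side avoids the `413` response that the spec documents for bulk uploads of more than 120 chunks.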