Optional burst
Optional creation
Optional creator
Optional entity
The name of the entity to be served. The entity may be a model in the Databricks Model Registry, a model in the Unity Catalog (UC), or a function of type FEATURE_SPEC in the UC. If it is a UC object, the full name of the object should be given in the form of catalog_name.schema_name.model_name.
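As a rough illustration of the three-part naming form described above, the following sketch checks whether a string looks like catalog_name.schema_name.model_name. The helper name `is_uc_full_name` is hypothetical and not part of any Databricks SDK; it only checks the shape of the string, not whether the object exists.

```python
def is_uc_full_name(entity_name: str) -> bool:
    """Return True if entity_name has three non-empty, dot-separated parts."""
    parts = entity_name.split(".")
    return len(parts) == 3 and all(part.strip() for part in parts)


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(is_uc_full_name("main.ml.churn_model"))
    print(is_uc_full_name("churn_model"))
```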
Optional entity
Optional environment
An object containing a set of optional, user-specified environment variable key-value pairs used for serving this entity. Note: this is an experimental feature and subject to change. Example entity environment variables that refer to Databricks secrets: {"OPENAI_API_KEY": "{{secrets/my_scope/my_key}}", "DATABRICKS_TOKEN": "{{secrets/my_scope2/my_key2}}"}
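The example above uses values of the form {{secrets/scope/key}}. A minimal sketch of how such references could be picked out of an environment-variable mapping follows. The function `secret_references` and the regular expression are assumptions made for illustration; the exact grammar the service accepts is not stated in this reference.

```python
import re

# Assumed shape of a secret reference, based only on the example above.
SECRET_REF = re.compile(r"^\{\{secrets/([^/{}]+)/([^/{}]+)\}\}$")


def secret_references(env: dict) -> dict:
    """Map each variable name to its (scope, key) pair when the value is a secret reference."""
    refs = {}
    for name, value in env.items():
        match = SECRET_REF.match(value)
        if match:
            refs[name] = (match.group(1), match.group(2))
    return refs


if __name__ == "__main__":
    env = {
        "OPENAI_API_KEY": "{{secrets/my_scope/my_key}}",
        "DATABRICKS_TOKEN": "{{secrets/my_scope2/my_key2}}",
    }
    print(secret_references(env))
```

Plain values that do not match the pattern are simply left out of the result.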
Optional external
The external model to be served. NOTE: Only one of external_model and (entity_name, entity_version, workload_size, workload_type, and scale_to_zero_enabled) can be specified with the latter set being used for custom model serving for a
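The mutual-exclusion rule in the note above can be sketched as a client-side check. This is only an illustration under the assumption that a configuration is held in a plain dict keyed by the field names listed in the note; `check_exclusive` is a hypothetical helper, and the service may enforce the rule differently.

```python
# Field names taken from the note above.
CUSTOM_FIELDS = (
    "entity_name",
    "entity_version",
    "workload_size",
    "workload_type",
    "scale_to_zero_enabled",
)


def check_exclusive(config: dict) -> None:
    """Raise ValueError if external_model is combined with custom-serving fields."""
    has_external = config.get("external_model") is not None
    custom_set = [field for field in CUSTOM_FIELDS if config.get(field) is not None]
    if has_external and custom_set:
        raise ValueError(
            "external_model cannot be combined with: " + ", ".join(custom_set)
        )


if __name__ == "__main__":
    check_exclusive({"entity_name": "main.ml.churn_model", "entity_version": "3"})
    print("custom-serving config accepted")
```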
Optional foundation
Optional instance
ARN of the instance profile that the served entity uses to access AWS resources.
Optional max
The maximum provisioned concurrency that the endpoint can scale up to. Do not use if workload_size is specified.
Optional max
The maximum tokens per second that the endpoint can scale up to.
Optional min
The minimum provisioned concurrency that the endpoint can scale down to. Do not use if workload_size is specified.
Optional min
The minimum tokens per second that the endpoint can scale down to.
Optional model
Optional model
Optional name
The name of a served entity. It must be unique across an endpoint. A served entity name can consist of alphanumeric characters, dashes, and underscores. If not specified for an external model, this field defaults to external_model.name, with '.' and ':' replaced with '-'. If not specified for other entities, it defaults to entity_name-entity_version.
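The defaulting rule described above can be sketched as follows. Both functions are hypothetical helpers written for illustration, not SDK calls, and the character check assumes ASCII letters and digits, which the reference does not state explicitly.

```python
import re
from typing import Optional

# Assumed reading of "alphanumeric characters, dashes, and underscores".
VALID_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def default_served_entity_name(
    external_model_name: Optional[str] = None,
    entity_name: Optional[str] = None,
    entity_version: Optional[str] = None,
) -> str:
    """Derive the default name following the rule quoted in the text."""
    if external_model_name is not None:
        # '.' and ':' are replaced with '-'.
        return external_model_name.replace(".", "-").replace(":", "-")
    return f"{entity_name}-{entity_version}"


def is_valid_served_entity_name(name: str) -> bool:
    """Check that the name uses only letters, digits, dashes, and underscores."""
    return bool(VALID_NAME.match(name))


if __name__ == "__main__":
    # Example inputs are made up for illustration.
    print(default_served_entity_name(external_model_name="gpt-3.5:turbo"))
    print(default_served_entity_name(entity_name="my_model", entity_version="3"))
```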
Optional provisioned
The number of model units provisioned.
Optional scale
Whether the compute resources for the served entity should scale down to zero.
Optional state
Optional workload
The workload size of the served entity. The workload size corresponds to a range of provisioned concurrency that the compute autoscales between. A single unit of provisioned concurrency can process one request at a time. Valid workload sizes are "Small" (4 - 4 provisioned concurrency), "Medium" (8 - 16 provisioned concurrency), and "Large" (16 - 64 provisioned concurrency). Additional custom workload sizes can also be used when available in the workspace. If scale-to-zero is enabled, the lower bound of the provisioned concurrency for each workload size is 0. Do not use if min_provisioned_concurrency and max_provisioned_concurrency are specified.
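The size-to-concurrency mapping above can be written out as a small lookup. This is a sketch using only the three sizes and bounds quoted in the text; `concurrency_range` is a hypothetical helper, and custom workload sizes available in a workspace are not covered (they raise KeyError here).

```python
# Bounds copied from the description above: (min, max) provisioned concurrency.
WORKLOAD_RANGES = {
    "Small": (4, 4),
    "Medium": (8, 16),
    "Large": (16, 64),
}


def concurrency_range(workload_size: str, scale_to_zero_enabled: bool = False) -> tuple:
    """Return the (lower, upper) provisioned concurrency for a built-in workload size."""
    lower, upper = WORKLOAD_RANGES[workload_size]
    # With scale-to-zero enabled, the lower bound becomes 0.
    return (0 if scale_to_zero_enabled else lower, upper)


if __name__ == "__main__":
    print(concurrency_range("Medium"))
    print(concurrency_range("Large", scale_to_zero_enabled=True))
```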
Whether burst scaling is enabled. When enabled (default), the endpoint can automatically scale up beyond provisioned capacity to handle traffic spikes. When disabled, the endpoint maintains fixed capacity at provisioned_model_units.