Serving Qwen Models on Databricks

Published: March 28, 2025


Summary

  • Deploy Qwen Models on Databricks by adapting them for Llama-based infrastructure.
  • Leverage High-Performance Serving with low-latency, high-throughput endpoints.
  • Follow a Simple Workflow to convert, register, and serve Qwen models efficiently.

Qwen models, developed by Alibaba, have shown strong performance in both code completion and instruction tasks. In this blog, we’ll show how you can register and deploy Qwen models on Databricks using an approach similar to that for Llama-based architectures. By following these steps, you can take advantage of Databricks’ foundation model (Provisioned Throughput) endpoints, which benefit from low latency and high throughput.

Table of Contents

  1. Motivation: Why Serve Qwen Models on Databricks?
  2. The Core Idea
  3. Implementation: Annotated Code Walkthrough
  4. Performance and Limitations
  5. Summary and Next Steps

Motivation: Why Serve Qwen Models on Databricks?

For many enterprise workloads, Databricks is a one-stop platform to train, register, and serve large language models (LLMs). With Databricks Mosaic AI Model Serving, you can easily deploy fine-tuned or base models and use them for real-time or batch inference tasks.

The recently released Qwen 2.5 series of models provides strong performance in code completion and instruction tasks. At the time of their release, Qwen 2.5 models beat similarly sized models on standard benchmarks such as MMLU, ARC-C, MATH, and HumanEval, as well as on multilingual benchmarks such as Multi-Exam and Multi-Understanding. The Qwen 2.5 Coder models show similar gains on coding benchmarks. This gives customers strong motivation for deploying these models in Databricks Model Serving to power their use cases.

Serving a Qwen model on Databricks involves four steps:

  1. Run a notebook to convert the Qwen model files to be compatible with the Llama architecture and Databricks model serving
  2. Register the Qwen model in Unity Catalog
  3. Deploy the registered model with Databricks Foundation Model Serving
  4. Conduct quality testing on the deployment, such as manual testing or running standard benchmarks directly against the endpoint

The Core Idea

Databricks foundation model serving provides optimized performance for models such as Meta’s Llama models. Customers can deploy these models with provisioned throughput and achieve low latency and high throughput. While the Qwen models’ underlying model structure is very similar to the Llama models’ structure, certain modifications are required in order to take advantage of Databricks’ model serving infrastructure. The steps below explain how customers can make the necessary modifications.

Implementation: Annotated Code Walkthrough

Part 1) Rewrite Qwen’s weights and config to be consistent with Llama models.

The steps in modify_qwen.py take a Qwen2.5 model and rewrite it to be consistent with the Llama architecture that is optimized for provisioned throughput on Databricks. Here are the key steps in the code:

  1. Load Qwen State Dict: Collect .safetensors from the original Qwen directory.
  2. Copy & Adjust Weights: Insert zero biases for attention outputs where Llama expects them.
  3. Rewrite the Config: Update fields like "architectures", "model_type" to "llama", and remove Qwen-specific flags.
  4. Copy Tokenizer Files: Ensure we bring over tokenizer.json, merges.txt, and so on.
  5. Create Final Output Folder: The files in the new directory make it look like a typical Llama model.

At the end of this step, you have a Llama-compatible Qwen model. If you load it in vLLM, it should be treated as a Llama model and be able to generate code or follow instructions, depending on which Qwen variant you converted.
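
To make this concrete, here is a minimal sketch of the kind of rewrite modify_qwen.py performs. The paths, tensor names, and config fields below are assumptions based on the standard Hugging Face Qwen2 and Llama layouts, not the verbatim script:

```python
import json
import os

import torch
from safetensors.torch import load_file, save_file

src = "/local_disk0/qwen2.5-coder-7b-instruct"  # original Qwen download (illustrative path)
dst = "/local_disk0/qwen-as-llama"              # Llama-compatible output (illustrative path)
os.makedirs(dst, exist_ok=True)

# 1) Load the Qwen state dict from its .safetensors shards.
state_dict = {}
for shard in sorted(f for f in os.listdir(src) if f.endswith(".safetensors")):
    state_dict.update(load_file(os.path.join(src, shard)))

with open(os.path.join(src, "config.json")) as f:
    config = json.load(f)

# 2) Insert zero biases for the attention output projection, which the Llama
#    layout expects when attention biases are enabled. Qwen's q/k/v biases carry over as-is.
for layer in range(config["num_hidden_layers"]):
    bias_name = f"model.layers.{layer}.self_attn.o_proj.bias"
    if bias_name not in state_dict:
        weight = state_dict[f"model.layers.{layer}.self_attn.o_proj.weight"]
        state_dict[bias_name] = torch.zeros(weight.shape[0], dtype=weight.dtype)

save_file(state_dict, os.path.join(dst, "model.safetensors"))

# 3) Rewrite the config so the checkpoint presents itself as a Llama model.
config["architectures"] = ["LlamaForCausalLM"]
config["model_type"] = "llama"
config["attention_bias"] = True          # Qwen uses attention biases
config.pop("use_sliding_window", None)   # drop Qwen-specific flags
with open(os.path.join(dst, "config.json"), "w") as f:
    json.dump(config, f, indent=2)

# 4) Copy the tokenizer files (tokenizer.json, tokenizer_config.json, merges.txt, ...)
#    from src to dst so the converted model directory is self-contained.
```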

Tip: You can use huggingface_hub.snapshot_download to fetch one of the Qwen models, such as Qwen/Qwen2.5-Coder-7B-Instruct, from Hugging Face to a local directory before performing the conversion.
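
For example (the local path is illustrative):

```python
from huggingface_hub import snapshot_download

# Download the original Qwen checkpoint to a local directory before conversion.
snapshot_download(
    repo_id="Qwen/Qwen2.5-Coder-7B-Instruct",
    local_dir="/local_disk0/qwen2.5-coder-7b-instruct",
)
```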

Part 2) Register and Serve Qwen on Databricks

Next, we’ll focus on how to log and serve the “Qwen as Llama” model on Databricks. This is handled by register_qwen.py, whose steps ensure that the model has the configuration that model serving expects for a Llama model. The key steps (a minimal sketch follows the list):

  1. Specifying the path to the converted model from earlier.
  2. Modifying tokenizer configs (especially removing chat_template and setting tokenizer_class).
  3. Adjusting config.json to reflect Llama-compatible sequence lengths.
  4. Updating the model with Llama-like metadata before logging.
  5. Registering the model with MLflow, so it can be served on a GPU endpoint.
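
A minimal sketch of the logging and registration step is shown below. The Unity Catalog model name, the paths, and the use of mlflow.transformers.log_model with a completions task are assumptions rather than the verbatim register_qwen.py:

```python
import mlflow
from transformers import AutoModelForCausalLM, AutoTokenizer

mlflow.set_registry_uri("databricks-uc")  # register the model into Unity Catalog

converted_dir = "/local_disk0/qwen-as-llama"  # output of the conversion step (illustrative)
model = AutoModelForCausalLM.from_pretrained(converted_dir)
tokenizer = AutoTokenizer.from_pretrained(converted_dir)

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        task="llm/v1/completions",  # serve as a completions-style model
        registered_model_name="main.default.qwen25_coder_7b_as_llama",  # illustrative UC name
    )
```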

Once this notebook has run, the model will be registered in Unity Catalog. Navigate to the model and click “Serve this model” to set up the endpoint. You should see the option to set up the endpoint with provisioned throughput at different tokens/second rates.
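
If you prefer to create the endpoint programmatically rather than through the UI, a sketch along the following lines can be adapted. The endpoint name, model version, and throughput values are assumptions, and the request fields should be checked against the current Databricks serving API documentation:

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # adjust to your workspace
TOKEN = "<databricks-personal-access-token>"                     # adjust to your auth setup

payload = {
    "name": "qwen25-coder-7b",  # illustrative endpoint name
    "config": {
        "served_entities": [
            {
                "entity_name": "main.default.qwen25_coder_7b_as_llama",  # UC model from the previous step
                "entity_version": "1",
                "min_provisioned_throughput": 0,
                "max_provisioned_throughput": 9500,  # tokens/second band shown in the UI
            }
        ]
    },
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()
```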

Testing the Endpoint

Once the endpoint is ready you can conduct some basic tests to verify it is working properly. Suppose that we have deployed the Qwen2.5-Coder-7B model after performing the above conversion and registration. This model is capable of either completing a piece of code or performing fill-in-the-middle. Let’s use it to complete a simple sorting function. Under the “Use” dropdown click “Query” and enter the following request:
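
An illustrative completions-style payload (the prompt and parameters here are our own example) asking the model to finish a simple bubble sort might look like this:

```json
{
  "prompt": "def bubble_sort(numbers):\n",
  "max_tokens": 128,
  "temperature": 0.1
}
```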

The text in the response should contain the rest of the implementation of the sorting function.

For a more quantitative approach, you could generate completions for the HumanEval tasks against the endpoint, run the HumanEval evaluation to obtain the pass@1 metric, and compare it against the published results.
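
As a rough sketch, you could loop over the HumanEval problems and request a completion for each from the endpoint. The endpoint name, workspace URL, token handling, and response parsing below are assumptions and should be adjusted to your deployment:

```python
import requests
from human_eval.data import read_problems, write_jsonl

# Assumed endpoint name, workspace URL, and token; adjust to your deployment.
ENDPOINT_NAME = "qwen25-coder-7b"
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<databricks-personal-access-token>"

url = f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations"
headers = {"Authorization": f"Bearer {TOKEN}"}

samples = []
for task_id, problem in read_problems().items():
    response = requests.post(
        url,
        headers=headers,
        json={"prompt": problem["prompt"], "max_tokens": 512, "temperature": 0.0},
    )
    response.raise_for_status()
    # Completions-style endpoints return OpenAI-compatible choices with a "text" field.
    completion = response.json()["choices"][0]["text"]
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# pass@1 can then be computed with the human-eval package's
# `evaluate_functional_correctness samples.jsonl` command.
```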

Performance and Limitations

  1. Manual Chat Formatting
    Since we remove Qwen’s built-in chat template, you must manually format system/user/assistant messages in your client code (a sketch follows this list). This ensures the model can still interpret conversation turns properly.
  2. Max Position Embeddings
    We set max_position_embeddings to 16000 tokens to fit within certain Databricks constraints. If Qwen originally supported more, you might lose some maximum context length. However, you’ll still gain provisioned throughput support.
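
As an illustration of the first point, a client could build a ChatML-style prompt by hand before calling the completions endpoint. This is only a sketch; verify the special tokens against the chat format of the specific Qwen model you deploy:

```python
def format_qwen_chat(messages):
    """Format system/user/assistant turns in Qwen's ChatML-style layout.

    Sketch only: confirm the special tokens against the chat format of the
    Qwen model you actually deploy.
    """
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # Leave the assistant turn open so the model generates the reply.
    prompt += "<|im_start|>assistant\n"
    return prompt


prompt = format_qwen_chat([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that reverses a string."},
])
```

The resulting string can then be sent as the prompt field of a completions request to the endpoint.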

Summary and Next Steps

While Databricks does not support Qwen models directly on provisioned throughput model serving today, the above method allows you to register and serve these models successfully by aligning them to be compatible with the Llama models’ architecture. This workaround is particularly useful if your team requires Qwen’s capabilities but also wants the convenience of Databricks model serving endpoints and provisioned throughput.

Key Takeaway

  • The Qwen and Llama models share enough architectural similarities that, with a few minor modifications (namely, to the tokenizer config and model metadata), Databricks’ model serving infrastructure can readily serve the Qwen models using provisioned throughput.

Future Considerations

  • We encourage you to keep an eye out for official Qwen support on Databricks model serving.
  • Evaluate performance overhead from forcibly limiting context size.
  • If you rely on chat prompting, remember to manually format your prompts on the client side.

Acknowledgments

  • hiyouga's llamafy_qwen.py for the initial example that provided the basis for the Qwen conversion.
  • The Databricks engineering team for clarifying the internal serving constraints.
  • All the community members who tested and refined the approach.
