Free Dolly: 世界初の真にオープンな指示でチューニングされたLLM

Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

公開日: April 13, 2023

によって Mike Conover、Matt Hayes、Ankit Mathur、謝建偉、Jun Wan、サム・シャー、Ali Ghodsi（アリ・ゴディシ）、Patrick Wendell（パトリック・ウェンデル）、Matei Zaharia、Reynold Xin（レイノルド・シン）による投稿

Original Post：Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

翻訳：Takaaki Yayoi

2週間前、ChatGPTのような人間のインタラクティブ性(指示追従性)を示すように、$30以下でトレーニングされた大規模言語モデル(LLM)であるDollyをリリースしました。本日、研究と商用利用にライセンスされた、人の手で生成された指示データセットでファインチューンされた、史上初のオープンソース、指示追従LLMであるDolly 2.0をリリースします。

Dolly 2.0はEleutherAIのpythiaモデルファミリーをベースとした12Bのパラメーターを持つ言語モデルであり、Databricks従業員によってクラウドソースされた人間の手で生成された新たな高品質指示追従データセットのみを用いてファインチューニングされています。

我々は、すべて商用利用できるトレーニングコード、データセット、モデルの重みを含むDolly 2.0のすべてをオープンソース化します。これは、すべての組織がサードパーティにデータを共有したり、APIアクセスに支払いをすることなしに、人々と会話できるLLMを作成、所有、カスタマイズできることを意味します。

databricks-dolly-15kデータセット

databricks-dolly-15kには、大規模言語モデルのチューニングする指示として設計された、15,000の高品質な人造のプロンプト / レスポンスのペアが格納されています。databricks-dolly-15k(Creative Commons Attribution-ShareAlike 3.0 Unported License)のライセンスのもと、商用アプリケーションを含むすべての目的において誰でもこのデータセットを利用することができます。

我々の知る限り、このデータセットは大規模言語モデルがChatGPTの魔法のようなインタラクティブ性を示すように設計された、史上初のオープンソース、人造の指示データセットです。databricks-dolly-15kは、2023年3月から4月にかけて5,000以上のDatabricks従業員の手によって作成されました。これらのトレーニングレコードは、自然で表現豊かであり、ブレーンストーミングからコンテンツ生成、情報抽出、要約に至る広範な挙動を表現するように設計されています。

なぜ新たなデータセットを作成したのか？

Dolly 1.0をリリースするやいなや、試したい方々からのリクエストが殺到しました。ナンバーワンの質問は「これを商用利用できますか？」でした。

Dolly 1.0やその他の指示追従LLMを作成する重要なステップは、指示とレスポンスのペアを用いてモデルをトレーニングすることです。Dolly 1.0はスタンフォードのAlpacaチームがOpenAI APIを用いて作成したデータセットを用いて$30でトレーニングされました。このデータセットにはChatGPTのアウトプットが含まれており、スタンフォードチームが指摘しているように、利用条項はOpenAIと競合するモデルの作成を許可していません。このため、残念ですが、共通した質問に対する回答は「おそらくダメです！」でした。

我々の知る限り、既存のよく知られた指示追従モデル(Alpaca、Koala、GPT4All、Vicuna)は、商用利用を禁ずるこの制限に苦しんでいます。この難問に対処するために、我々は商用利用できるような、まだ「トレーニング」されていない新たなデータセットを作成する方法を模索し始めました。

どのように作成したのか？

我々は、OpenAIの研究論文から、InstructGPTモデルは13,000個の指示追従の挙動を示す内容から構成されるデータセットでトレーニングされたことを理解しています。これからインスピレーションを得て、Databricks従業員が先頭に立つことで同様の成果を達成できないかを検討することにしました。

結果として、13kの質問と回答を生成することは見かけよりも困難であることがわかりました。すべての回答はオリジナルではなくてはなりません。ChatGPTやウェブのどこかからコピーしてはならず、そうでないと我々のデータセットが汚染されることになります。うんざりするような話でしたが、DatabricksにはLLMに非常に興味を持っている5,000人以上の従業員がいます。そして、OpenAIの40人のラベル付け担当者よりも、さらに高品質なデータセットを作成するために、彼らの間でクラウドソースすることを検討しました。しかし、彼らはすべて多忙であり、山盛りの仕事を抱えていることを知っていましたので、これを行うためのインセンティブの方法が必要でした。

我々は、トップ20のラベラーがすごい賞を受け取れるコンテストをセットアップしました。また、7つの固有のタスクを説明したました:

オープンなQ&A: 例えば、「なぜ人々はコメディ映画を好むのか？」や「フランスの首都は？」あるケースにおいては、これには明確な回答はなく、他のケースにおいては、図示や広範な世界に対する知識を必要とします。
クローズドなQ&A: これは参照テキストの文に含まれる情報のみを用いて回答できる質問です。例えば、原子に関するWikipediaの段落に関して、「原子核における陽子と中性子の比率は？」と尋ねるというものです。
Wikipediaからの情報抽出: ここでは、アノテーターはWikipediaから段落をコピーし、文からエンティティやその他の重みや測定結果のような事実情報を抽出します。
Wikipediaからの情報の要約: ここでは、アノテーターにWikipediaの文が渡され、短い要約にまとめることが求められます。
ブレーンストーミング: このタスクでは、終わりのないアイディエーションと可能性のあるオプションとの関連付けが求められます。例えば、「今週末に友達とできる面白いアクティビティは？」と言ったものです。
分類: このタスクでは、アノテーターはクラスのメンバーシップ(例: このアイテムは動物リスト、鉱物リスト、野菜リストに含まれますか)や、感情や映画レビューのような短い文のプロパティに対する判断を求められます。
クリエイティブな記述: このタスクには、詩やラブレターの記述のようなものが含まれます。

databricks-dolly-15kにおけるオープンQAのサンプル

How do I build a campfire?

Safety should always come first when starting a campfire. Ensure you have water or another way to extinguish your fire readily available. Use a campfire ring, pit, or clear a large area around your campfire location. Also, make sure campfires are permitted in your area.

With safety out of the way, all a fire needs are heat, fuel, and oxygen. Any combustible material will start a fire, but good campfire fuel sources include tinder, kindling, and dry logs.

Use stuff like dry pine needles, dry leaves, grasses, and wood shavings for tinder. You could also use newspaper (or any paper), dryer lint, or cardboard. For kindling, use smaller pieces of wood, such as twigs and sticks no thicker than 1 inch in diameter (about the thickness of your thumb).

Dry or well-seasoned logs will ensure a good burn. Stack the wood in a way that guarantees oxygen can easily reach your flame. Many prefer a teepee or log cabin stacking design. Now, use a heat source to create a flame or spark and apply this to your tinder. Once the tinder combusts, it should light the kindling, and the kindling should eventually ignite your logs. If all goes well, you should now be enjoying your campfire.

databricks-dolly-15kにおけるブレーンストーミングのサンプル

Instead of making a peanut butter and jelly sandwich, what else could I combine peanut butter with in a sandwich? Give five ideas.

Instead of jelly, try one of the following with peanut butter in a sandwich:

1. Honey

2. Raisins

3. Fruit preserves

4. Bananas

5. Sliced apples

当初、10,000の結果を集められるかは懐疑的でした。しかし、毎夜のリーダーボードを用いたゲーミフィケーションによって、1週間で15,000以上の結果を得ることができました。我々の生産性を阻害することなしに、コンテストを終了することができました。

商用利用可能なモデルを作成するためのジャーニー

また、我々は商用利用可能なオープンソースのモデルを提供したいと考えていました。databricks-dolly-15kはAlpacaやDolly 1.0をトレーニングしたデータセットよりも非常に小さいものですが、EleutherAIのpythia-12bをベースとしたDolly 2.0モデルは、高品質な指示追従の挙動を示しました。後知恵ではありますが、これは驚くことではありません。ここ数ヶ月でリリースされた指示チューニングデータセットの多くには、合成データが含まれており、多くの場合、幻覚や事実上のエラーを含んでいます。

一方、プロフェッショナルによって生成されたdatabricks-dolly-15kは高品質であり、多くのタスクに対する長文の回答が含まれています。

ご自身で、要約やコンテンツ生成でDolly 2.0がどのように活用できるのかに関するサンプルを確認いただくことができます。初期のカスタマーフィードバックから、これらのような能力には企業の広範なアプリケーションがあることが明確となっています。

Dolly 2.0によるDatabricksドキュメントの要約

指示	レスポンス
What is the default configuration for new DBSQL warehouses? Input: Databricks SQL Serverless supports serverless compute. Admins can create serverless SQL warehouses (formerly SQL endpoints) that enable instant compute and are managed by Databricks. Serverless SQL warehouses use compute clusters in your Databricks account. Use them with Databricks SQL queries just like you normally would with the original customer-hosted SQL warehouses, which are now called classic SQL warehouses. Databricks changed the name from SQL endpoint to SQL warehouse because, in the industry, endpoint refers to either a remote computing device that communicates with a network that it's connected to, or an entry point to a cloud service. A data warehouse is a data management system that stores current and historical data from multiple sources in a business friendly manner for easier insights and reporting. SQL warehouse accurately describes the full capabilities of this compute resource. If serverless SQL warehouses are enabled for your account, note the following: New SQL warehouses are serverless by default when you create them from the UI. New SQL warehouses are not serverless by default when you create them using the API, which requires that you explicitly specify serverless. You can also create new pro or classic SQL warehouses using either method. You can upgrade a pro or classic SQL warehouse to a serverless SQL warehouse or a classic SQL warehouse to a pro SQL warehouse. You can also downgrade from serverless to pro or classic. This feature only affects Databricks SQL. It does not affect how Databricks Runtime clusters work with notebooks and jobs in the Data Science & Engineering or Databricks Machine Learning workspace environments. Databricks Runtime clusters always run in the classic data plane in your AWS account. See Serverless quotas. If your account needs updated terms of use, workspace admins are prompted in the Databricks SQL UI. If your workspace has an AWS instance profile, you might need to update the trust relationship to support serverless compute, depending on how and when it was created.	By default, when you create a new DBSQL warehouse using the UI, it will be a serverless SQL warehouse. However, when creating the same using the API, you will have to explicitly specify that it is a pro SQL warehouse.

指示

レスポンス

What is the default configuration for new DBSQL warehouses?
Input:

Databricks SQL Serverless supports serverless compute. Admins can create serverless SQL warehouses (formerly SQL endpoints) that enable instant compute and are managed by Databricks. Serverless SQL warehouses use compute clusters in your Databricks account. Use them with Databricks SQL queries just like you normally would with the original customer-hosted SQL warehouses, which are now called classic SQL warehouses. Databricks changed the name from SQL endpoint to SQL warehouse because, in the industry, endpoint refers to either a remote computing device that communicates with a network that it's connected to, or an entry point to a cloud service. A data warehouse is a data management system that stores current and historical data from multiple sources in a business friendly manner for easier insights and reporting. SQL warehouse accurately describes the full capabilities of this compute resource. If serverless SQL warehouses are enabled for your account, note the following: New SQL warehouses are serverless by default when you create them from the UI. New SQL warehouses are not serverless by default when you create them using the API, which requires that you explicitly specify serverless. You can also create new pro or classic SQL warehouses using either method. You can upgrade a pro or classic SQL warehouse to a serverless SQL warehouse or a classic SQL warehouse to a pro SQL warehouse. You can also downgrade from serverless to pro or classic. This feature only affects Databricks SQL. It does not affect how Databricks Runtime clusters work with notebooks and jobs in the Data Science & Engineering or Databricks Machine Learning workspace environments. Databricks Runtime clusters always run in the classic data plane in your AWS account. See Serverless quotas. If your account needs updated terms of use, workspace admins are prompted in the Databricks SQL UI. If your workspace has an AWS instance profile, you might need to update the trust relationship to support serverless compute, depending on how and when it was created.

By default, when you create a new DBSQL warehouse using the UI, it will be a serverless SQL warehouse. However, when creating the same using the API, you will have to explicitly specify that it is a pro SQL warehouse.

Dolly 2.0によるツイートの生成

指示	レスポンス
Write me a tweet about the launch of Dolly 2.0, our new LLM.	Dolly 2.0 is here 🤩 We've upgraded our LLM, making it more efficient, more powerful, and more accessible to a broader audience. Learn more ⬇️ https://t.co/KxzT3HXCPb

真にオープンな大規模言語モデル

我々はお客様から、自分のモデルを所有して、サードバーティに機微なデータを渡すことなしに、自身のドメイン固有のアプリケーションのために高品質なモデルを作成することがベストであるということを繰り返し聞いています。

また、我々は、バイアスや説明可能性に関する重要な問題やAIの安全は、数社の大企業ではなく多様なステークホルダーのコミュニティによって取り組まれるべきだと信じていいます。オープンソース化されたデータセットとモデルは、誰でも人工知能技術の進歩によるメリットを享受できることを確実にする助けとなるコメントや研究、イノベーションを促進します。

技術的、研究的アーティファクトとして、我々はDollyが最先端の効果を持っていることは期待していません。しかし、Dollyとオープンソースデータセットが、さらに強力な言語モデルをブートストラップする役に立つであろう今後の様々な取り組みの種になることを期待しています。

どうすれば使い始められますか？

Dolly 2.0の重みをダウンロードするには、シンプルにDatabricks Hugging Faceのページを訪れ、databricks-dolly-15k datasetをダウンロードするにはDolly repo on databricks-labsにアクセスしてください。そして、ご自身の組織でLLMを活用する方法を知るには、我々のウェビナーにご参加ください。