ローカルモデルの実行が、今や実用的になった

hackernews score 0.95 好み 0.00 en

ローカルモデルの実行が、今や実用的になった

原題: Running local models is good now

local modelsagentic codinggemma 4lm studiollama.cppollamadockerinference engine

原文 ↗

日本語訳

# タイトル

ローカルモデルの実行は、今や実用的だ

# 本文

ローカルモデルが登場して以来、ずっと使い続けてきましたが、ついに、驚くほど実用的なレベルに達しました。

私の環境は、2022年モデルのM2 Mac（64GB RAM、1TBストレージ）で、以下のモデルを使用してきました。

- Mistral 7B

- Gemma 3

- OpenAI OSS-20B

- Qwen 3 MOE、および Qwen 2.5 Coder などの Qwen のバリエーション多数

また、以下のようなさまざまなシステム構成でも使用してきました。

- Open WebUI を使用した生の llama.cpp

- llama-cpp-python

- Ollama

- llamafiles

- LM Studio

ローカルモデルの現状はどうなっているのか？

初期の頃、モデルの動作は遅く、使いにくく、ほとんどのプログラミングタスクにおいて精度も十分ではありませんでした。ローカルモデルが大きく遅れをとっているという考えは、私にとって GPT-OSS がリリースされるまでは、大部分において真実でした。これについて具体的な科学的根拠があるわけではありません。私の「モデルが十分に優れているか」を判断する感覚的な指標は、「APIモデルと照らし合わせて二重チェックする必要があるか」というものです。GPT-OSS は、その二重チェックを大幅に減らせるようになった最初のモデルでした。

その結果、私は主に、最新性を必要としない開発上の質問に対する、高速でパーソナレライズされた「開発用Google」としてローカルモデルを使用してきました。

しかし、Google からの最新リリースである Gemma 4 ファミリーにより、ついにローカルでエージェンティックなコーディング（agentic coding）ができるようになりました。ループ処理の精度と速度は、フロンティアモデルの約75%に達しており、これは驚くべきことです。

これまでのところ、デフォルトのローカルモデルとして gemma-4-26b-a4b の LM Studio 実装を使用しています。これまでのローカル環境での活用例は以下の通りです。

- ノートブック形式だった Python スクリプトを、5〜6個のモジュールからなるリポジトリにリファクタリング。

- モジュールに対して、ジェネリクスに正しい型ヒントを使用するようにリント（現在のフロンティアモデルの多くはこれを自動で行えますが、常にではありません）。

また、ブログ記事の校正、ユニットテストの作成、さらには、エージェントが白紙の状態から何を行うかを確認するために、レコメンデーション用のツータワーモデルを立ち上げるリポジトリの雛形作成にも使用しました。生成された内容は以下の通りです。非常に基本的なものではありますが、昨年の自分では不可能だと思っていた範囲を遥かに超えています。

なお、すべてのエージェンティックなワークフローを、実行権限を制限した Docker コンテナ内で実行しているため、環境は制限されています。

また、Arxiv の論文からトレンドトピックを抽出するアプリも構築しています。好奇心から、Pi に過去の LM Studio セッションログを読み込ませ、自分が LM Studio を何に使っていたかを特定させてみました。

予想通り、私は Rijksearch の開発に取り組んでいたため、以下のようになります。

これらは、決して画期的なタスクではありません（繰り返しますが、多くはパーソナライズされた Google やドキュメントの検索です）。また、これらの作業は GPU や RAM に負荷をかけ、K-V キャッシュが 64 GB RAM にまで膨らむこともあります。

しかし、私にとってより重要なストーリーは、こうしたタスクは、たとえ単純なものであっても、わずか6ヶ月前にはローカルモデルでは不可能だったということです。

Gemma-4-12b-qat が登場したばかりですが、そのサイズに対するパフォーマンスの高さには、すでに非常に感銘を受けています。モデルのアーキテクチャ自体が非常に興味深く、「パフォーマンスと価格に制約がある場合、どのようなアーキテクチャ上のトレードオフを行うべきか？」という、狂乱的なトークンのゴールドラッシュの中ではこれまであまり問われてこなかった、興味深い問いを投げかけています。

今日、エージェンティックなモデルをローカルで実行する

しかし、私の言葉を鵜呑みにせず、ぜひ自分で試してみてください！ローカルでエージェンティックなフローを実行したい場合は、ローカルモデルの推論エンジン、エージェント・ハーネス（agentic harness）、そしてローカルモデルのアーティファクトが必要です。ハーネスが、推論エンジン経由で提供されるダウンロード済みのモデル・アーティファクトを指すように設定する必要があります。

私のローカル環境では、現在エージェント・ハーネスとして Pi を、推論サーバーとして LM Studio を使用しています。ただし、llama.cpp を直接使用したほうが高速である可能性が高いため、これは将来の実験の方向性として考えています。

Pi と LM Studio を使ったエージェンティブ・コーディングのセットアップは、この記事の手順に従えば非常に簡単ですが、私は設定にいくつか変更を加えました。

- モデル: 記事では Gemma 26B A4B が推奨されていますが、gemma-4-12b-qat の方が新しく、精度を大きく損なうことなく、より小さく高速です。

- セキュリティ: すべての Pi セッションを Docker コンテナ内で実行し、Python コードの実行やウェブブラウジングができないよう、bash への権限のみを付与しています。ただし、現在行っている別の研究作業のために、別のイメージでは curl を許可する予定です。

- エージェント・ハーネスの設定: すべて Docker で実行しているため、Pi がモデルと通信できるように Pi の `models.json` を編集しました。

```json

"lmstudio": {

"baseUrl": "http://host.docker.internal:1234/v1",

"api": "openai-completions",

"apiKey": "not-needed",

"models": [

{

"id": "google/gemma-4-12b-qat",

"input": [

"text",

"image"

]

}

]

}

```

以下は、私の Docker Compose 設定です。

```yaml

services:

pi:

build:

context: .

dockerfile: Dockerdo

image: pi-agent:0.74.0

init: true

stdin_open: true

tty: true

extra_hosts:

- "host.docker.internal:host-gateway"

environment:

ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:-}

OPENAI_API_KEY: ${OPENAI_API_KEY:-not-needed}

GEMINI_API_KEY: ${GEMINI_API_KEY:-}

OPENAI_API_BASE: ${OPENAI_API_BASE:-http://host.docker.internal:1234/v1} # OpenAI の実際の完了エンドポイントにもアクセスする場合は、ベースを指定する必要があります

WHATEVER_API_KEY: ${WHATEVER_API_KEY:-}

volumes:

- ${HOME}/.pi/agent/models.json:/config/models.json

- ${WORKSPACE:-.}:/workspace

- pi-config:/config

- pi-sessions:/sessions

working_dir: /workspace

volumes:

pi-config:

pi-sessions:

```

そして、これが Pi を実行する bash スクリプトです。

```bash

#!/usr/bin/env bash

# Pi — Start the containerized Pi agent.

# Directory containing this script and the compose files.

SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Workspace to mount into the container.

WORKSPACE_DIR="${WORKSPACE:-$(pwd)}"

case "$WORKSPACE_DIR" in

/*) ;;

*) WORKSPACE_DIR="$(cd -- "$WORKSPACE_DIR" && pwd)" ;;

esac

export WORKSPACE="$WORKSPACE_DIR"

sandbox="${PI_SANDBOX:-0}"

pi_args=()

while (($#)); do

case "$1" in

--sandbox) sandbox=1 ;;

--no-sandbox) sandbox=0 ;;

*) pi_args+=("$1") ;;

esac

shift

done

compose_files=( -f "$SCRIPT_DIR/docker-compose.yml" )

if [[ "$sandbox" == "1" ]]; then

# an even more secure sandbox

compose_files+=( -f "$SCRIPT_DIR/docker-compose.sandbox.yml" )

# Derive a container name from the workspace directory's basename.

# Sanitize to characters Docker accepts: [a-zA-Z0-9][a-zA-Z0-9_.-]*

repo_slug="$(basename -- "$WORKSPACE_DIR" | tr -c 'a-zA-Z0-9_.-' '-' | sed 's/^-*//')"

[[ -z "$repo_slug" ]] && repo_slug="workspace"

container_name="pi-${repo_slug}-$$"

api_key_args=(

-e OPENAI_API_KEY

-e DEEPSEEK_API_KEY

-int ANTHROPIC_API_KEY

-e GEMINI_API_KEY

)

cmd=(

docker compose

--project-directory "$SCRIPT_DIR"

"${compose_files[@]}"

run --rm

--name "$container_name"

"${api_key_args[@]}"

)

if ((${#pi_args[@]})); then

cmd+=("${pi_args[@]}")

exec "${cmd[@]}"

```

私は Docker コンテナをビルドし、そのリポジトリ内のファイルを変更します。その後、作業中のリポジトリ

原文（英語）を表示

Running local models is good now

I’ve been working with local models since they came out, and finally, they’re surprisingly good now.

I have a 2022 M2 Mac with 64 GB RAM and 1TB storage and I’ve used

- Mistral 7B

- Gemma 3

- OpenAI OSS-20B

- Qwen 3 MOE, as well as a number of other Qwen variants like Qwen 2.5 Coder

across a lot of different system setups like

- raw llama.cpp with Open WebUI

- llama-cpp-python

- Ollama

- llamafiles and

- LM Studio

Where are local models now?

Early on, models were slow, hard to use, and just not that accurate for most programming tasks. The idea that local models were severely lagging behind was largely true until, for me, the release of GPT-OSS. I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.

As a result, I’ve mostly been using local models as fast, personalized Google for development questions that don’t require recency.

But with the most recent releases from Google in the Gemma 4, family, I’ve finally been able to do agentic coding locally and have loops work at about ~75% the accuracy/speed of frontier models, which is incredible.

I’ve so far been using gemma-4-26b-a4b

LM Studio implementation as my default local model. I’ve used the local setup so far to: Refactor a Python script that was a notebook into a repo of 5-6 modules, lint that module to use correct type hints for generics (most frontier models now do this automatically, but not always).

I’ve also used it to proofread some blog posts, write unit tests, and to bootstrap a repo that stands up a two-tower model for recommendations just to see what the agent would do with a blank slate. Here’s what it generated, which was pretty basic but still beyond the scope of anything I would have thought possible last year:

Note that the environment is restricted because I run all my agentic workflows in a Docker container with limited access to execution.

I’m also building an app that surfaces trending topics from Arxiv papers. Out of curiosity, I had Pi go through my past LM Studio session logs and figure out what I was using LM Studio for:

Unsurprisingly, since I’ve been working on Rijksearch,

None of these are groundbreaking tasks (again, a lot of personalized Google/docs lookups), and working on them does give my GPUs and RAM a workout and the K-V cache grows to 64 GB RAM.

But, the larger story for me is that these kinds of tasks, even as simple as they are, used to be impossible for local models as recently as 6 months ago.

Gemma-4-12b-qat

just came out but I’ve already also really been impressed with its performance relative to its size. The model architecture itself is really interesting and proposes a bunch of interesting questions like, “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

Running agentic models locally today

But don’t take my word for any of this, try it out for yourself! You’ll need a local model inference engine, an agentic harness, and the local model artifact if you want to try to run local agentic flows. You’ll need to set up the harness to point at your local inference endpoint, the downloaded model artifact served via the inference engine.

For my local setup, I’m currently using Pi as the agent harness and LM Studio as the inference server, although it would likely be faster if I just used llama.cpp directly - a potential direction for a future experiment.

This post was very easy to follow to set up agentic coding with Pi and LM Studio, although I did make a few tweaks to the post’s setup.

- Model: The post recommends

Gemma 26B A4B

, butgemma-4-12b-qat

is more recent and smaller and faster, without much sacrifice in accuracy. - Security: I run every Pi session in a Docker container and give it permissions only to bash so that it can’t run Python code or do web browsing, although I do plan to allow curl in a different image for some research work I’m doing.

- Agent Harness Config: Since I run everything in Docker, I edited Pi’s

models.json

in order to get Pi to talk to the model.

"lmstudio": {

"baseUrl": "http://host.docker.internal:1234/v1",

"api": "openai-completions",

"apiKey": "not-needed",

"models": [

{

"id": "google/gemma-4-12b-qat",

"input": [

"text",

"image"

]

}

]

}

Here’s my Docker Compose config:

services:

pi:

build:

context: .

dockerfile: Dockerfile

image: pi-agent:0.74.0

init: true

stdin_open: true

tty: true

extra_hosts:

- "host.docker.internal:host-gateway"

environment:

ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:-}

OPENAI_API_KEY: ${OPENAI_API_KEY:-not-needed}

GEMINI_API_KEY: ${GEMINI_API_KEY:-}

OPENAI_API_BASE: ${OPENAI_API_BASE:-http://host.docker.internal:1234/v1} # note that you'll need to specify a base if you also use OpenAI to access OpenAI's actual completions endpoint

WHATEVER_API_KEY: ${WHATEVER_API_KEY:-}

volumes:

- ${HOME}/.pi/agent/models.json:/config/models.json

- ${WORKSPACE:-.}:/workspace

- pi-config:/config

- pi-sessions:/sessions

working_dir: /workspace

volumes:

pi-config:

pi-sessions:

and here’s the bash script that runs pi

#!/usr/bin/env bash

# Pi — Start the containerized Pi agent.

# Directory containing this script and the compose files.

SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Workspace to mount into the container.

WORKSPACE_DIR="${WORKSPACE:-$(pwd)}"

case "$WORKSPACE_DIR" in

/*) ;;

*) WORKSPACE_DIR="$(cd -- "$WORKSPACE_DIR" && pwd)" ;;

esac

export WORKSPACE="$WORKSPACE_DIR"

sandbox="${PI_SANDBOX:-0}"

pi_args=()

while (($#)); do

case "$1" in

--sandbox) sandbox=1 ;;

--no-sandbox) sandbox=0 ;;

*) pi_args+=("$1") ;;

esac

shift

done

compose_files=( -f "$SCRIPT_DIR/docker-compose.yml" )

if [[ "$sandbox" == "1" ]]; then

# an even more secure sandbox

compose_files+=( -f "$SCRIPT_DIR/docker-compose.sandbox.yml" )

# Derive a container name from the workspace directory's basename.

# Sanitize to characters Docker accepts: [a-zA-Z0-9][a-zA-Z0-9_.-]*

repo_slug="$(basename -- "$WORKSPACE_DIR" | tr -c 'a-zA-Z0-9_.-' '-' | sed 's/^-*//')"

[[ -z "$repo_slug" ]] && repo_slug="workspace"

container_name="pi-${repo_slug}-$$"

api_key_args=(

-e OPENAI_API_KEY

-e DEEPSEEK_API_KEY

-e ANTHROPIC_API_KEY

-e GEMINI_API_KEY

)

cmd=(

docker compose

--project-directory "$SCRIPT_DIR"

"${compose_files[@]}"

run --rm

--name "$container_name"

"${api_key_args[@]}"

)

if ((${#pi_args[@]})); then

cmd+=("${pi_args[@]}")

exec "${cmd[@]}"

I build the Docker container and make changes to the files in its own repo. Then, I run Pi in the repo I’m working in, which spins up Docker so that Pi can’t wipe files or directories by acting on my physical hard drive. This also enables Pi running in the container to see my custom model json

config by shipping it into the container. All of this has been working fairly well for my experiments.

There are still issues with local models: inference can be slow, context windows are small and limited to your own hardware, and the ecosystem, although it’s made a ton easier by tooling like LM Studio and HuggingFace’s Use This Model button. Early releases suffer from prompt template mismatches. But, these are usually patched extremely quickly. Needless to say, I’m not sure this is ready for production software development quite yet.

The benefits, though, are numerous and the ecosystem critical to invest in, particularly now. One of the very cool parts of local models is you can introspect almost everything, like watching the token inference process live,

and watching tokens in/out.

You can do things like change the local context window and watch performance improve or degrade, and really dig into how your tokens are processed on the GPU. You can change the system prompt, the quantizations. You can pit models against each other. You can also change and introspect the harness side.

The possibilities are endless, and the tools only keep getting better.

← 一覧に戻る