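Running the 34B Code Llama at FP16 precision on an M2 Ultra, with inference above 20 tokens per second. This normally takes 4 high-end GPUs; now an M2 Ultra with 800 GB/s of memory bandwidth can do it too. The answer is Speculative Sampling 👍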

Published: 2023-09-01 17:00:32


The 34B Code Llama runs at FP16 precision on an M2 Ultra, with inference above 20 tokens per second.

This normally takes 4 high-end GPUs; now an M2 Ultra with 800 GB/s of memory bandwidth can do it too.
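For a sense of scale, here is a rough back-of-the-envelope check of those numbers. It assumes a nominal 34e9 parameters at 2 bytes each (FP16), and it assumes that plain token-by-token decoding has to read every weight from memory once per generated token. That second assumption is a common simplification and is not stated in the post. The function names are illustrative only.

```python
# Back-of-the-envelope estimate: weight size and a memory-bandwidth ceiling
# for plain (one token per pass) decoding. All inputs are nominal figures.

def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Total bytes needed to store the weights (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param


def bandwidth_bound_tokens_per_s(
    n_params: float, bandwidth_bytes_per_s: float, bytes_per_param: int = 2
) -> float:
    """Upper bound on tokens/s if every token requires one full read of the weights."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bytes_per_param)


N_PARAMS = 34e9    # nominal parameter count of the 34B model (assumption)
BANDWIDTH = 800e9  # 800 GB/s, the figure quoted in the post

size_gb = weight_bytes(N_PARAMS) / 1e9
ceiling = bandwidth_bound_tokens_per_s(N_PARAMS, BANDWIDTH)

print(f"FP16 weights: {size_gb:.0f} GB")
print(f"Bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Under these assumptions the weights take about 68 GB and the ceiling is roughly 11.8 tokens per second, below the 20 tokens per second reported. That gap is consistent with the post crediting a technique that can yield more than one token per pass of the large model.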

The answer is Speculative Sampling 👍
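The post does not describe the implementation, so the following is only a minimal sketch of the generic speculative sampling procedure as it is commonly described: a small draft model proposes `k` tokens, the large target model checks them, each proposal is accepted with probability `min(1, p/q)`, and the first rejected position is resampled from the normalized residual `max(0, p - q)`. The toy models `toy_target` and `toy_draft`, the 4-token vocabulary, and the function `speculative_step` are hypothetical stand-ins, not the setup used on the M2 Ultra.

```python
import random
from typing import Callable, List, Sequence

Dist = List[float]                      # probability of each token id
Model = Callable[[Sequence[int]], Dist]  # context -> next-token distribution

VOCAB = 4  # toy vocabulary size (assumption for illustration)


def sample(weights: Dist, rng: random.Random) -> int:
    """Draw one token id; weights need not be normalized."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(
    target: Model, draft: Model, context: Sequence[int], k: int, rng: random.Random
) -> List[int]:
    """Return between 1 and k + 1 new tokens distributed as if sampled from `target`."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, draft_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft(ctx)
        x = sample(q, rng)
        proposed.append(x)
        draft_dists.append(q)
        ctx.append(x)

    # 2. The target model scores every prefix. A real system would do this in
    #    one batched forward pass, which is where the speedup comes from.
    out: List[int] = []
    ctx = list(context)
    for x, q in zip(proposed, draft_dists):
        p = target(ctx)
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)  # accepted: keep the draft token
            ctx.append(x)
        else:
            # Rejected: resample from the residual max(0, p - q), then stop.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            out.append(sample(residual, rng))
            return out

    # 3. Every proposal was accepted: take one extra token from the target.
    out.append(sample(target(ctx), rng))
    return out


def toy_target(ctx: Sequence[int]) -> Dist:
    """Hypothetical 'large' model: favors (last token + 1) mod VOCAB."""
    last = ctx[-1] if ctx else 0
    weights = [4.0 if t == (last + 1) % VOCAB else 1.0 for t in range(VOCAB)]
    total = sum(weights)
    return [w / total for w in weights]


def toy_draft(ctx: Sequence[int]) -> Dist:
    """Hypothetical 'small' model: the target blended with a uniform distribution."""
    return [0.7 * p + 0.3 / VOCAB for p in toy_target(ctx)]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [0]
    while len(tokens) < 20:
        tokens.extend(speculative_step(toy_target, toy_draft, tokens, k=3, rng=rng))
    print(tokens)
```

The accept-or-resample rule is what keeps the output distribution identical to sampling from the target model alone. The draft model only affects how many tokens are produced per target pass, not which tokens are likely.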
IT / Technology (twitter.com)

Submitted 1 year ago by 宝玉

