Alibaba Cloud open-sources Tongyi Qianwen's multimodal large model Qwen-VL: what new techniques does it introduce?
Author: 卡卷网 | Published: 2024-12-09 14:11
1 Paper Summary
In August 2024, the Qwen team released Qwen2-VL, a multimodal large model that supports interleaved image, video, and text interaction. Qwen2-VL is an upgrade of the earlier Qwen-VL model and mainly addresses the fixed, preset input resolutions of traditional visual processing pipelines. It introduces a mechanism called Naive Dynamic Resolution, which lets the model process images at varying resolutions dynamically and produce more efficient visual representations. In addition, Qwen2-VL integrates Multimodal Rotary Position Embedding (M-RoPE) to better fuse positional information across text, images, and video. These improvements give Qwen2-VL strong results on a range of multimodal benchmarks; for model performance, see the leaderboard below:
OpenCompass (司南) evaluation leaderboard
Dynamic Resolution (Naive Dynamic Resolution)
Traditional visual processing usually requires images at a fixed resolution, which causes distortion or information loss. Qwen2-VL solves this by adjusting image resolution dynamically: it exposes `min_pixels` and `max_pixels` parameters that let the model adapt its processing to each image's native resolution. The dynamic-resolution approach consumes fewer tokens while maintaining high performance.
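As a minimal sketch of how these bounds are used in practice, the token budget can be set when constructing the processor (the concrete pixel values below are illustrative choices, not mandated defaults):
from transformers import AutoProcessor

# Each 28x28 block of pixels becomes one visual patch, so these bounds
# cap how many patches (and hence visual tokens) an image may produce.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)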
Multimodal Rotary Position Embedding (M-RoPE)
Traditional rotary position embedding (RoPE) is designed for text, but multimodal tasks need to fuse positional information from text, images, and video. Qwen2-VL therefore introduces Multimodal Rotary Position Embedding (M-RoPE), which handles the different modalities effectively: it encodes not only the 1D position of text tokens but also the spatial and temporal positions of image and video tokens.
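A minimal sketch of the idea, assuming the (temporal, height, width) decomposition described in the paper (illustrative only, not the model's actual implementation):
import torch

def mrope_position_ids(grid_t: int, grid_h: int, grid_w: int) -> torch.Tensor:
    # Build the three position-id components for a grid of vision tokens.
    # For text tokens all three components are equal, so M-RoPE reduces
    # to ordinary 1D RoPE on pure text.
    t = torch.arange(grid_t).view(-1, 1, 1).expand(grid_t, grid_h, grid_w)
    h = torch.arange(grid_h).view(1, -1, 1).expand(grid_t, grid_h, grid_w)
    w = torch.arange(grid_w).view(1, 1, -1).expand(grid_t, grid_h, grid_w)
    return torch.stack([t, h, w]).flatten(1)  # (3, grid_t*grid_h*grid_w)

print(mrope_position_ids(2, 4, 6).shape)  # torch.Size([3, 48])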
Model Architecture
Pre-training Strategy
- Initialization: the LLM component is initialized from Qwen2's parameters, and the vision encoder is initialized from DFN's ViT, with the fixed position embedding replaced by RoPE-2D.
- Stage 1: Qwen2-VL is initially pre-trained on a corpus of about 600 billion tokens. This stage focuses on image-text relationships, text recognition inside images (via OCR), and image classification.
- Stage 2: further pre-training on an additional 800 billion tokens of image-related data. This stage introduces more mixed image-text content, in particular visual question answering datasets, to improve the model's responses to image-related queries.
- Stage 3: multi-task datasets are introduced so that Qwen2-VL can handle many tasks simultaneously, which matters for complex real-world data.
- Overview: across the whole pre-training process, Qwen2-VL consumes a total of 1.4 trillion tokens, counting both text and image tokens. Supervision is applied only to the text tokens, which pushes the model to develop deep understanding across diverse language and vision scenarios.
Performance Evaluation
Qwen2-VL-72B performs strongly across many multimodal benchmarks, matching leading models such as GPT-4o and Claude3.5-Sonnet and surpassing other general-purpose models in some respects. On multilingual OCR tasks, Qwen2-VL outperforms existing general-purpose LVLMs.
2 Model Inference
2.1 Downloading the model
Option 1: Hugging Face
https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d
Option 2: ModelScope
ModelScope (魔搭社区)
2.2 Environment Setup
A Docker environment is recommended, since it makes later deployment on third-party platforms easier; this article uses PyTorch 2.4 with CUDA 12.1.
2.2.1 Downloading the Docker image
Option 1: Docker Hub
https://hub.docker.com/r/pytorch/pytorch/tags
docker pull pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel
Option 2: a third-party Docker mirror in China
https://register.liberx.info/
Common Docker commands:
wenjtop: Docker introduction and basic usage commands
2.2.2 Setting up the qwen_vl environment
If pip cannot install transformers 4.45.0, download the transformers 4.45.0 source from the official repository and install it manually:
https://github.com/huggingface/transformers
pip install transformers.zip
# assuming the ms-swift source has been cloned (https://github.com/modelscope/ms-swift)
cd ms-swift
pip install -e .[llm]
# If any of the following packages are reported missing at runtime, install them:
pip install ms-swift
pip install autoawq
pip install qwen-vl-utils
pip install deepspeed
Install anything else that turns out to be missing in the same way.
2.3 Inference
Qwen2_VL_test.py
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
# Load the model weights
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto")
# Initialize the processor that handles both text and images
processor = AutoProcessor.from_pretrained("Qwen2-VL-2B-Instruct")
# Image
url = "img.png"
image = Image.open(url)
question = "请描述一下图片。"
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": question},
        ],
    }
]
# Preprocess the input image and build the chat prompt
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>请描述一下图片。<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
# Inference: max_new_tokens=256 caps generation at 256 new tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens, keeping only the newly generated ids
generated_ids = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
# Decode the token IDs back into text
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
3 Training
3.1 Three formats for building an image-caption dataset
{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee<image>eeeee<image>eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response2"], ["query2", "response2"]], "images": []}
Data-generation script:
import json

res = {"name": "小明", "gender": "男", "age": "18"}
img_path = "img.png"
data = []
template = {
    # system/query are Chinese prompts for an OCR extraction task
    "system": "你是一个OCR帮助系统。",
    "query": "请识别图片的内容,返回name(名字),(gender)性别,(age)年龄,严格按照json格式返回。",
    "response": json.dumps(res, ensure_ascii=False),  # proper JSON, not a Python dict repr
    "images": [img_path]
}
data.append(template)
# JSONL expects one JSON object per line, so write each sample on its own line
with open('train_data.jsonl', 'w', encoding='utf-8') as file:
    for sample in data:
        file.write(json.dumps(sample, ensure_ascii=False) + '\n')
Note: if you provide only one jsonl file, swift automatically splits it into training and validation sets; you can also split it into training and validation sets yourself in advance:
--dataset train.jsonl \
--val_dataset val.jsonl \
3.2 Building an object-detection dataset
# swift's model-agnostic general format
{"query": "Find <bbox>", "response": "<ref-object>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
# mapping to multiple bboxes
{"query": "Find <ref-object>", "response": "<bbox>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [[138, 136, 235, 359],[1,2,3,4]], \"bbox_type\": \"real\", \"image\": 0}]" }
# qwen2-vl-chat specific format; note the special tokens
{"query": "Find <|object_ref_start|>the man<|object_ref_end|>", "response": "<|box_start|>(123,235),(324,546)<|box_end|>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"]}
3.3 Building a video dataset
{"query": "<video>55555", "response": "66666", "videos": ["video_path"]}
{"query": "eeeee<video>eeeee<video>eeeee", "response": "fffff", "history": [], "videos": ["video_path1", "video_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response2"], ["query2", "response2"]], "videos": []}
Note: 55555, 66666, eeeee, EEEEE, and FFFFF stand in for arbitrary text.
3.4 Starting training
nproc_per_node=4
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
--model_type qwen2-vl-2b-instruct \
--model_id_or_path /home/wenjtop/wenjtop/qwen/Qwen2-VL-2B-Instruct \
--model_revision master \
--sft_type lora \
--tuner_backend swift \
--template_type AUTO \
--dtype AUTO \
--output_dir ./llm_sft_output/ \
--ddp_backend nccl \
--custom_train_dataset_path /home/wenjtop/wenjtop/qwen/datasets/self-cognition/train.jsonl \
--train_dataset_sample -1 \
--num_train_epochs 10 \
--max_length 4096 \
--check_dataset_strategy warning \
--gradient_checkpointing true \
--batch_size 16 \
--weight_decay 0.01 \
--learning_rate 1e-4 \
--gradient_accumulation_steps $(expr 8 / $nproc_per_node) \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 3 \
--logging_steps 10 \
--use_flash_attn false \
--save_only_model true \
--deepspeed default-zero3
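With these flags, nproc_per_node=4 gives gradient_accumulation_steps = 8 / 4 = 2, so the effective global batch size is 16 (batch_size) × 2 (accumulation steps) × 4 (processes) = 128 samples per optimizer step.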
As a smaller example: nproc_per_node=2 launches 2 processes per node, and CUDA_VISIBLE_DEVICES=0,1,2,3 exposes 4 GPUs. The 4 cards then hold two model replicas, combining model parallelism with data parallelism: each replica is split in two across different GPUs, the input first runs through the front half of the model on GPU 0, and the intermediate output is passed to the back half on GPU 1.
3.5 LoRA merging and quantization
CUDA_VISIBLE_DEVICES=1,2 swift export \
    --ckpt_dir /home/wenjtop/wenjtop/ms-swift-main/yldm0226/llm_sft_output/qwen2_5-1_5b-instruct/v0-20241015-115237/checkpoint-10 \
    --merge_lora true \
    --quant_bits 8
--merge_lora true merges the LoRA weights into the base model; --quant_bits 8 quantizes the merged model to int8.
Note: during training, very large image resolutions can exhaust GPU memory. Two workarounds:
Option 1 (recommended): edit ms-swift-main/swift/llm/utils/template.py
MIN_PIXELS = 4 * 28 * 28
MAX_PIXELS = 1024 * 28 * 28
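With MAX_PIXELS = 1024 * 28 * 28, an image is capped at 1024 patches of 28×28 pixels, i.e. at most 1024 / 4 = 256 visual tokens after the 2×2 patch merging described in Section 4.1; lowering MAX_PIXELS directly reduces per-image memory.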
Option 2: edit ms-swift-main/swift/llm/utils/vision_utils.py (to be replaced if a better method turns up later):
from io import BytesIO
from typing import Union

def load_image(image: Union['PIL.Image.Image', BytesIO]) -> 'PIL.Image.Image':
    from PIL import Image
    if isinstance(image, BytesIO):
        image = Image.open(image)
    if image.mode != 'RGB':
        image = image.convert('RGB')
    # rescale so the longer side becomes max_wh pixels
    max_wh = 768
    img_max_wh = max(image.size[0], image.size[1])
    scale = max_wh / img_max_wh
    image = image.resize((int(image.size[0] * scale), int(image.size[1] * scale)))
    return image
4 Code Walkthrough
4.1 Text and image preprocessing
4.1.1 For the input image, height and width are first rounded to the nearest multiple of 28:
# input:  height, width = 336, 500
# output: height, width = 336, 504
factor = 28
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor
4.1.2 Clamp the image's resolution into the [min_pixels, max_pixels] range:
if h_bar * w_bar > max_pixels:
    beta = math.sqrt((height * width) / max_pixels)
    h_bar = math.floor(height / beta / factor) * factor
    w_bar = math.floor(width / beta / factor) * factor
elif h_bar * w_bar < min_pixels:
    beta = math.sqrt(min_pixels / (height * width))
    h_bar = math.ceil(height * beta / factor) * factor
    w_bar = math.ceil(width * beta / factor) * factor
# input img: 3 336 500 # output img: 3 336 504
resized_image = image.resize((h_bar, w_bar), resample=resample, reducing_gap=reducing_gap)
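Putting 4.1.1 and 4.1.2 together, here is a self-contained sketch of the resizing logic (the function name and the default bounds follow the values quoted elsewhere in this article; treat them as assumptions):
import math

def smart_resize(height, width, factor=28, min_pixels=4*28*28, max_pixels=1024*28*28):
    # round to the nearest multiple of `factor`, then clamp the total pixel
    # count into [min_pixels, max_pixels] while roughly keeping aspect ratio
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize(336, 500))  # (336, 504)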
4.1.3 For a single image, the image is duplicated so it can be treated as a 2-frame video:
# input img: 3 336 504 # output img: 2 3 336 504
image = np.tile(resized_image, (2, 1, 1, 1))
4.1.4 Splitting into patches
# input image: 2 3 336 504
# 3
channel = image.shape[1]
# 2/2=1
grid_t = image.shape[0] // self.temporal_patch_size
# 336/14=24 504/14=36
grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
self.merge_size = 2
image = image.reshape(
grid_t, # 1
self.temporal_patch_size, # 2
channel, # 3
grid_h // self.merge_size, # 12
self.merge_size, # 2
self.patch_size, # 14
grid_w // self.merge_size, # 18
self.merge_size, # 2
self.patch_size, # 14
) # output shape: (1, 2, 3, 12, 2, 14, 18, 2, 14)
image = image.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8) # shape (1, 12, 18, 2, 2, 3, 2, 14, 14)
flatten_patches = image.reshape(
grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
)
# output shape: (1*24*36,3*2*14*14)=(864, 1176)
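The following minimal numpy sketch reproduces the shape bookkeeping above end to end (zeros stand in for real pixel values):
import numpy as np

patch_size, temporal_patch_size, merge_size = 14, 2, 2
image = np.zeros((2, 3, 336, 504))                     # (frames, channels, H, W)
grid_t = image.shape[0] // temporal_patch_size         # 1
grid_h, grid_w = 336 // patch_size, 504 // patch_size  # 24, 36
x = image.reshape(grid_t, temporal_patch_size, 3,
                  grid_h // merge_size, merge_size, patch_size,
                  grid_w // merge_size, merge_size, patch_size)
x = x.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
flat = x.reshape(grid_t * grid_h * grid_w,
                 3 * temporal_patch_size * patch_size ** 2)
print(flat.shape)  # (864, 1176)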
4.1.5 Expanding <|image_pad|> into the image's token count (864 patches, merged 2×2 into 216 pad tokens)
input: ['<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>请描述一下图片。<|im_end|>\n<|im_start|>assistant\n']
output: the same string with the single <|image_pad|> replaced by 216 consecutive <|image_pad|> tokens (elided here for readability).
index = 0
merge_length = merge_size ** 2  # 2x2 patch merging -> 4 patches per token
for i in range(len(text)):
    while "<|image_pad|>" in text[i]:
        # each image contributes grid_t*grid_h*grid_w patches; dividing by
        # merge_length gives the number of visual tokens the LLM will see
        text[i] = text[i].replace("<|image_pad|>", "<|placeholder|>" * int(
            (image_grid_thw[index].prod() // merge_length).item()), 1)
        index += 1
    text[i] = text[i].replace("<|placeholder|>", "<|image_pad|>")
4.1.6 Converting the text into vocabulary IDs; apart from spaces, every symbol and character is encoded
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
text: the expanded prompt from step 4.1.5 (with its 216 <|image_pad|> tokens).
Note: apart from spaces, every symbol and character is encoded; <|im_start|>, system, \n, ., <|im_end|>, user, <|vision_start|>, and so on all map to token IDs.
151655 is the placeholder ID for image tokens.
Output encodings (the run of 151655 image-pad IDs is abbreviated; 15 text tokens before the pads, 216 pads, and 11 tokens after give 242 in total, matching the [1, 242] input shape in 4.1.7):
tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
         151645,    198, 151644,    872,    198, 151652, 151655, 151655, 151655,
            ...  (151655 repeated 216 times in total)  ...
         151655, 151655, 151655, 151653,  14880,  53481, 100158,  45930,   1773,
         151645,    198, 151644,  77091,    198]])
4.1.7 Embedding the text and the image
# input: [1, 242]  output: [1, 242, 1536]
inputs_embeds = self.model.embed_tokens(input_ids)
# input: [864, 1176]  output: [216, 1536]  (864 / 4 = 216)
The vision encoder applies patch merging over 2×2 windows, turning the original 864 patch tokens into 216, and then runs attention; the feature dimension is 1536.
image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
# Replace placeholder positions (True in image_mask) with the image token embeddings
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
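A toy sketch of what masked_scatter does here, with tiny illustrative shapes (the real model uses hidden size 1536 and 216 image tokens):
import torch

inputs_embeds = torch.zeros(1, 6, 4)                    # (batch, seq, hidden)
mask = torch.tensor([[0, 0, 1, 1, 1, 0]], dtype=torch.bool)
image_mask = mask.unsqueeze(-1).expand(1, 6, 4)         # broadcast over hidden dim
image_embeds = torch.ones(3, 4)                         # 3 image tokens, hidden=4
merged = inputs_embeds.masked_scatter(image_mask, image_embeds)
print(merged[0, :, 0])  # tensor([0., 0., 1., 1., 1., 0.])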
4.2 Qwen2VLRotaryEmbedding
wenjtop: Rotary Position Embedding (RoPE)