LLaVA Code Reading

llava_arch.py
llava_llama.py
pope_eval.py
LlamaAttention
generate() method
greedy_search()
attention matrix
NLP basic concepts
LLM Inference

Start with the llava-v1_5 folder.

Folder structure, printed with tree -L N, where N is the depth to display; N=2 is used here.

├── checkpoints
├── docs
├── images
├── llava
│   ├── constants.py
│   ├── conversation.py
│   ├── eval
│   ├── __init__.py
│   ├── mm_utils.py
│   ├── model
│   ├── model_post
│   ├── __pycache__
│   ├── serve
│   ├── train
│   └── utils.py
├── playground
│   └── data
├── pope_eval_post.py
├── pope_eval.py
├── pope_eval_repost.py
├── pope_popular.jsonl
├── README.md
├── replace_bin.py
├── replace_head_bin.py
├── scripts
├── shr_eval.py
├── train_dpo_head.py
├── train_dpo_post.py
└── train_dpo.py

The focus is on the contents of llava/model/.

├── constants.py
├── conversation.py
├── eval
├── __init__.py
├── mm_utils.py
├── model
│   ├── apply_delta.py
│   ├── builder.py
│   ├── consolidate.py
│   ├── __init__.py
│   ├── language_model
│   ├── llava_arch.py
│   ├── make_delta.py
│   ├── multimodal_encoder
│   ├── multimodal_projector
│   └── utils.py
├── model_post # model modified by crc
├── serve
│   ├── cli.py
│   ├── controller.py
│   ├── examples
│   ├── gradio_web_server.py
│   ├── __init__.py
│   ├── model_worker.py
│   ├── register_worker.py
│   └── test_message.py
├── train
│   ├── llama_flash_attn_monkey_patch.py
│   ├── llama_xformers_attn_monkey_patch.py
│   ├── llava_trainer.py
│   ├── train_mem.py
│   ├── train.py
│   └── train_xformers.py
└── utils.py

├── apply_delta.py
├── builder.py
├── consolidate.py
├── __init__.py
├── language_model # the LLM
│   ├── llava_llama.py
│   ├── llava_mpt.py
│   └── mpt
├── llava_arch.py
├── make_delta.py
├── multimodal_encoder # the vision encoder
│   ├── builder.py
│   └── clip_encoder.py
├── multimodal_projector # the projector from vision features to the LLM embedding space
│   └── builder.py
└── utils.py

Pick the important files and analyze them one by one.

`llava_arch.py`

This file mainly defines two classes, LlavaMetaModel and LlavaMetaForCausalLM.

class LlavaMetaModel:
    def __init__(self, config):
        super(LlavaMetaModel, self).__init__(config)

        if hasattr(config, "mm_vision_tower"):
            self.vision_tower = build_vision_tower(config, delay_load=True)
            self.mm_projector = build_vision_projector(config)

    def get_vision_tower(self):
        vision_tower = getattr(self, 'vision_tower', None)
        if type(vision_tower) is list:
            vision_tower = vision_tower[0]
        return vision_tower

    def initialize_vision_modules(self, model_args, fsdp=None):
        vision_tower = model_args.vision_tower
        mm_vision_select_layer = model_args.mm_vision_select_layer
        mm_vision_select_feature = model_args.mm_vision_select_feature
        pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter

        self.config.mm_vision_tower = vision_tower

        if self.get_vision_tower() is None:
            vision_tower = build_vision_tower(model_args)

            if fsdp is not None and len(fsdp) > 0:
                self.vision_tower = [vision_tower]
            else:
                self.vision_tower = vision_tower
        else:
            if fsdp is not None and len(fsdp) > 0:
                vision_tower = self.vision_tower[0]
            else:
                vision_tower = self.vision_tower
            vision_tower.load_model()

        self.config.use_mm_proj = True
        self.config.mm_projector_type = getattr(model_args, 'mm_projector_type', 'linear')
        self.config.mm_hidden_size = vision_tower.hidden_size
        self.config.mm_vision_select_layer = mm_vision_select_layer
        self.config.mm_vision_select_feature = mm_vision_select_feature

        if getattr(self, 'mm_projector', None) is None:
            self.mm_projector = build_vision_projector(self.config)
        else:
            # In case it is frozen by LoRA
            for p in self.mm_projector.parameters():
                p.requires_grad = True

        if pretrain_mm_mlp_adapter is not None:
            mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location='cpu')
            def get_w(weights, keyword):
                return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k}

            self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
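
The get_w helper at the end of initialize_vision_modules strips everything up to and including the "mm_projector." prefix, so that checkpoint keys line up with the projector's own state_dict keys. A minimal sketch of that renaming, with plain strings standing in for tensors; the checkpoint keys below are made up for illustration:

```python
def get_w(weights, keyword):
    # Keep only entries whose key contains `keyword`, and drop everything
    # up to and including "<keyword>." from the key.
    return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k}


# Hypothetical checkpoint: strings stand in for the real tensors.
checkpoint = {
    "model.mm_projector.0.weight": "W0",
    "model.mm_projector.0.bias": "b0",
    "model.mm_projector.2.weight": "W2",
    "model.embed_tokens.weight": "E",  # unrelated entry, filtered out
}

projector_state = get_w(checkpoint, "mm_projector")
print(sorted(projector_state))
```

The remaining keys ("0.weight", "0.bias", "2.weight") are the names a Sequential projector would expect in load_state_dict.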

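What is a decorator?

In Python a function is an object (for example, it has a __name__ attribute holding its name). A decorator is a higher-order function that takes a function and returns a function.

@abstractmethod is a decorator provided by the standard abc module that marks a method as abstract. Using it requires two things:

The enclosing class inherits from abc.ABC
The abstract instance method is decorated with @abstractmethod

Once both are in place, the class is abstract and cannot be instantiated directly. To use it, subclass it and implement every abstract method. In LlavaMetaForCausalLM, the abstract method is get_model().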
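
A small sketch of the ABC / @abstractmethod behaviour described above. The class names Base and Impl are made up for illustration; only abc.ABC and abc.abstractmethod come from the standard library:

```python
from abc import ABC, abstractmethod


class Base(ABC):
    @abstractmethod
    def get_model(self):
        """Subclasses must provide this."""

    def describe(self):
        # Concrete methods may rely on the abstract one.
        return f"model = {self.get_model()}"


class Impl(Base):
    def get_model(self):
        return "toy-model"


try:
    Base()  # abstract class: instantiation raises TypeError
    instantiated = True
except TypeError:
    instantiated = False

print(instantiated)
print(Impl().describe())
```

This mirrors how LlavaMetaForCausalLM can call self.get_model() in its own methods while leaving the implementation to the subclass.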

class LlavaMetaForCausalLM(ABC):

    @abstractmethod # decorator
    def get_model(self): # implemented by the subclass
        pass

    def get_vision_tower(self):
        return self.get_model().get_vision_tower()

    def encode_images(self, images): # encode images with vision_tower, then mm_projector
        image_features = self.get_model().get_vision_tower()(images)
        image_features = self.get_model().mm_projector(image_features)
        return image_features

    def prepare_inputs_labels_for_multimodal(
        self, input_ids, position_ids, attention_mask, past_key_values, labels, images
    ):
        vision_tower = self.get_vision_tower()
        if vision_tower is None or images is None or input_ids.shape[1] == 1:
            if past_key_values is not None and vision_tower is not None and images is not None and input_ids.shape[1] == 1:
                target_shape = past_key_values[-1][-1].shape[-2] + 1
                attention_mask = torch.cat((attention_mask, torch.ones(
                    (attention_mask.shape[0], target_shape - attention_mask.shape[1]),
                    dtype=attention_mask.dtype,
                    device=attention_mask.device
                )), dim=1)
                position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
            return input_ids, position_ids, attention_mask, past_key_values, None, labels

        if type(images) is list or images.ndim == 5:
            concat_images = torch.cat([image for image in images], dim=0)
            image_features = self.encode_images(concat_images)
            split_sizes = [image.shape[0] for image in images]
            image_features = torch.split(image_features, split_sizes, dim=0)
            image_features = [x.flatten(0, 1).to(self.device) for x in image_features]
        else:
            image_features = self.encode_images(images).to(self.device)

        # TODO: image start / end is not implemented here to support pretraining.
        if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
            raise NotImplementedError

        # Let's just add dummy tensors if they do not exist,
        # it is a headache to deal with None all the time.
        # But it is not ideal, and if you have a better idea,
        # please open an issue / submit a PR, thanks.
        _labels = labels
        _position_ids = position_ids
        _attention_mask = attention_mask
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
        else:
            attention_mask = attention_mask.bool()
        if position_ids is None:
            position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
        if labels is None:
            labels = torch.full_like(input_ids, IGNORE_INDEX)

        # remove the padding using attention_mask -- TODO: double check
        input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]
        labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]

        new_input_embeds = []
        new_labels = []
        cur_image_idx = 0
        for batch_idx, cur_input_ids in enumerate(input_ids):
            num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
            if num_images == 0:
                cur_image_features = image_features[cur_image_idx]
                cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
                cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
                new_input_embeds.append(cur_input_embeds)
                new_labels.append(labels[batch_idx])
                cur_image_idx += 1
                continue

            image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
            cur_input_ids_noim = []
            cur_labels = labels[batch_idx]
            cur_labels_noim = []
            for i in range(len(image_token_indices) - 1):
                cur_input_ids_noim.append(cur_input_ids[image_token_indices[i]+1:image_token_indices[i+1]])
                cur_labels_noim.append(cur_labels[image_token_indices[i]+1:image_token_indices[i+1]])
            split_sizes = [x.shape[0] for x in cur_labels_noim]
            cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
            cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
            cur_new_input_embeds = []
            cur_new_labels = []

            for i in range(num_images + 1):
                cur_new_input_embeds.append(cur_input_embeds_no_im[i])
                cur_new_labels.append(cur_labels_noim[i])
                if i < num_images:
                    cur_image_features = image_features[cur_image_idx]
                    cur_image_idx += 1
                    cur_new_input_embeds.append(cur_image_features)
                    cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))

            cur_new_input_embeds = torch.cat(cur_new_input_embeds)
            cur_new_labels = torch.cat(cur_new_labels)

            new_input_embeds.append(cur_new_input_embeds)
            new_labels.append(cur_new_labels)
        

        # Truncate sequences to max length as image embeddings can make the sequence longer
        tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None)
        if tokenizer_model_max_length is not None:
            new_input_embeds = [x[:tokenizer_model_max_length] for x in new_input_embeds]
            new_labels = [x[:tokenizer_model_max_length] for x in new_labels]
            

        # Combine them
        max_len = max(x.shape[0] for x in new_input_embeds)
        batch_size = len(new_input_embeds)

        new_input_embeds_padded = []
        new_labels_padded = torch.full((batch_size, max_len), IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device)
        attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)
        position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)

        for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
            cur_len = cur_new_embed.shape[0]
            if getattr(self.config, 'tokenizer_padding_side', 'right') == "left":
                new_input_embeds_padded.append(torch.cat((
                    torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device),
                    cur_new_embed
                ), dim=0))
                if cur_len > 0:
                    new_labels_padded[i, -cur_len:] = cur_new_labels
                    attention_mask[i, -cur_len:] = True
                    position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
            else:
                new_input_embeds_padded.append(torch.cat((
                    cur_new_embed,
                    torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)
                ), dim=0))
                if cur_len > 0:
                    new_labels_padded[i, :cur_len] = cur_new_labels
                    attention_mask[i, :cur_len] = True
                    position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)

        new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)

        if _labels is None:
            new_labels = None
        else:
            new_labels = new_labels_padded

        if _attention_mask is None:
            attention_mask = None
        else:
            attention_mask = attention_mask.to(dtype=_attention_mask.dtype)

        if _position_ids is None:
            position_ids = None

        return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels

    def initialize_vision_tokenizer(self, model_args, tokenizer):
        if model_args.mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
            self.resize_token_embeddings(len(tokenizer))

        if model_args.mm_use_im_start_end:
            num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
            self.resize_token_embeddings(len(tokenizer))

            if num_new_tokens > 0:
                input_embeddings = self.get_input_embeddings().weight.data
                output_embeddings = self.get_output_embeddings().weight.data

                input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
                    dim=0, keepdim=True)
                output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
                    dim=0, keepdim=True)

                input_embeddings[-num_new_tokens:] = input_embeddings_avg
                output_embeddings[-num_new_tokens:] = output_embeddings_avg

            if model_args.tune_mm_mlp_adapter:
                for p in self.get_input_embeddings().parameters():
                    p.requires_grad = True
                for p in self.get_output_embeddings().parameters():
                    p.requires_grad = False

            if model_args.pretrain_mm_mlp_adapter:
                mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu')
                embed_tokens_weight = mm_projector_weights['model.embed_tokens.weight']
                assert num_new_tokens == 2
                if input_embeddings.shape == embed_tokens_weight.shape:
                    input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
                elif embed_tokens_weight.shape[0] == num_new_tokens:
                    input_embeddings[-num_new_tokens:] = embed_tokens_weight
                else:
                    raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Numer of new tokens: {num_new_tokens}.")
        elif model_args.mm_use_im_patch_token:
            if model_args.tune_mm_mlp_adapter:
                for p in self.get_input_embeddings().parameters():
                    p.requires_grad = False
                for p in self.get_output_embeddings().parameters():
                    p.requires_grad = False
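
The core of prepare_inputs_labels_for_multimodal is replacing each image placeholder token with that image's projected features and giving those positions the ignore label. A simplified pure-Python sketch for one sample, with tuples standing in for embedding vectors. The constant values are the ones assumed from llava/constants.py; the real method additionally embeds the text in a single call, truncates, and pads the batch:

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id (assumed value, see llava/constants.py)
IGNORE_INDEX = -100       # label ignored by the loss (assumed value)


def splice_image_features(input_ids, labels, image_features):
    """Replace each image placeholder with one image's feature rows.

    input_ids, labels: lists of ints for one sample.
    image_features: list with one entry per image; each entry is a list of
    items standing in for projected patch embeddings.
    """
    new_embeds, new_labels = [], []
    image_iter = iter(image_features)
    for token_id, label in zip(input_ids, labels):
        if token_id == IMAGE_TOKEN_INDEX:
            feats = next(image_iter)
            new_embeds.extend(feats)
            # Image positions never contribute to the loss.
            new_labels.extend([IGNORE_INDEX] * len(feats))
        else:
            # A real model would look up embed_tokens(token_id) here.
            new_embeds.append(("tok", token_id))
            new_labels.append(label)
    return new_embeds, new_labels


ids = [1, 15, IMAGE_TOKEN_INDEX, 42, 7]
lbl = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 42, 7]
feats = [[("img", 0), ("img", 1), ("img", 2)]]  # one image, three patches

embeds, new_lbl = splice_image_features(ids, lbl, feats)
print(len(embeds))
print(new_lbl)
```

The sequence grows from 5 positions to 7, which is why the real method returns inputs_embeds instead of input_ids and rebuilds attention_mask and position_ids afterwards.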

`llava_llama.py`

from ..llava_arch import LlavaMetaModel, LlavaMetaForCausalLM

This line imports the two classes LlavaMetaModel and LlavaMetaForCausalLM defined in llava_arch.py.

class LlavaLlamaModel(LlavaMetaModel, LlamaModel):
    config_class = LlavaConfig

    def __init__(self, config: LlamaConfig):
        super(LlavaLlamaModel, self).__init__(config)

The class LlavaLlamaModel defined above inherits from two classes, LlavaMetaModel and LlamaModel; LlamaModel comes from the Hugging Face transformers library.

# LlamaForCausalLM class from the transformers library
class LlamaForCausalLM(LlamaPreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.model = LlamaModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

class LlavaLlamaForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM):
    config_class = LlavaConfig

    def __init__(self, config):
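        # LlamaForCausalLM comes from the transformers library. super(LlamaForCausalLM, self)
        # starts the lookup *after* LlamaForCausalLM in the MRO, so this skips
        # LlamaForCausalLM.__init__ and runs its parent's __init__ instead. The stock
        # LlamaModel is never built; self.model is assigned only once, on the next line.
        super(LlamaForCausalLM, self).__init__(config)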
        self.model = LlavaLlamaModel(config)
        self.pretraining_tp = config.pretraining_tp
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) # LM head; output dimension equals the vocabulary size

        # Initialize weights and apply final processing
        self.post_init()

    def get_model(self):
        return self.model

    def forward( # TODO: document each parameter
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None, # cached keys/values from earlier steps, used to speed up decoding
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None, # whether to return the key/value cache, used to speed up decoding
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        images: Optional[torch.FloatTensor] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        
        if inputs_embeds is None:
            (
                input_ids,
                position_ids,
                attention_mask,
                past_key_values,
                inputs_embeds,
                labels
            ) = self.prepare_inputs_labels_for_multimodal(
                input_ids,
                position_ids,
                attention_mask,
                past_key_values,
                labels,
                images
            ) # prepare_inputs_labels_for_multimodal() is a method of LlavaMetaForCausalLM in llava_arch.py; it runs the vision encoder and the projector, embeds the text tokens, and concatenates the vision and text embeddings

        return super().forward(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            labels=labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict
        )
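    # forward() here returns the result of a parent class's forward(). Which parent is decided by
    # Python's Method Resolution Order (MRO), the order in which base classes are searched under
    # multiple inheritance; Python computes it with the C3 linearization algorithm. The bases are
    # searched in the order LlamaForCausalLM, LlavaMetaForCausalLM, so the call resolves to
    # LlamaForCausalLM.forward(). In any case, the second base, LlavaMetaForCausalLM, defines no
    # forward() method.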

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
        '''
        
        '''
        images = kwargs.pop("images", None)
        _inputs = super().prepare_inputs_for_generation(
            input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs
        )
        if images is not None:
            _inputs['images'] = images
        return _inputs

The class LlavaLlamaForCausalLM defined above inherits from LlamaForCausalLM and LlavaMetaForCausalLM.
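
A toy sketch of the two inheritance details in this class: which __init__ a call of the form super(SomeBase, self).__init__() actually reaches, and which parent super().forward() resolves to. The names Base, Stock, Mixin and Custom are made up; Stock plays the role of LlamaForCausalLM, Mixin that of LlavaMetaForCausalLM, and Custom that of LlavaLlamaForCausalLM:

```python
class Base:
    def __init__(self):
        self.calls = ["Base"]


class Stock(Base):
    def __init__(self):
        super().__init__()
        self.calls.append("Stock")
        self.model = "stock model"

    def forward(self):
        return "Stock.forward"


class Mixin:
    def get_model(self):
        return self.model


class Custom(Stock, Mixin):
    def __init__(self):
        # Start the lookup *after* Stock in the MRO: Stock.__init__ is skipped.
        super(Stock, self).__init__()
        self.calls.append("Custom")
        self.model = "custom model"

    def forward(self):
        # Plain super() starts after Custom, so Stock.forward is found first.
        return super().forward()


obj = Custom()
print([c.__name__ for c in Custom.__mro__])
print(obj.calls)
print(obj.forward(), "/", obj.get_model())
```

Because Stock.__init__ never runs, "stock model" is never created; the object only ever holds the model assigned in Custom.__init__.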

`pope_eval.py`

This is the inference code. The goal now is to visualize the attention matrix and see how the summary token moves.

First, a look at the file contents.

rsync /home/cuiruochen/HA-DPO/ha_dpo/data/POPE.tar wuzongqian@shi.kongfei.life:/home/wuzongqian/xubaoduo/Deep-Learning-for-MLLMs/ha_dpo/data

rsync /home/cuiruochen/HA-DPO/ha_dpo/data.tar wuzongqian@shi.kongfei.life:/home/wuzongqian/xubaoduo/Deep-Learning-for-MLLMs/ha_dpo

rsync /home/cuiruochen/HA-DPO/ha_dpo/data/coco2024.tar wuzongqian@shi.kongfei.life:/home/wuzongqian/xubaoduo/Deep-Learning-for-MLLMs/ha_dpo/data

Model

LlavaLlamaForCausalLM(
  (model): LlavaLlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
    (vision_tower): CLIPVisionTower(
      (vision_tower): CLIPVisionModel(
        (vision_model): CLIPVisionTransformer(
          (embeddings): CLIPVisionEmbeddings(
            (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
            (position_embedding): Embedding(577, 1024)
          )
          (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (encoder): CLIPEncoder(
            (layers): ModuleList(
              (0-23): 24 x CLIPEncoderLayer(
                (self_attn): CLIPAttention(
                  (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
                )
                (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                (mlp): CLIPMLP(
                  (activation_fn): QuickGELUActivation()
                  (fc1): Linear(in_features=1024, out_features=4096, bias=True)
                  (fc2): Linear(in_features=4096, out_features=1024, bias=True)
                )
                (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              )
            )
          )
          (post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (mm_projector): Sequential(
      (0): Linear(in_features=1024, out_features=4096, bias=True)
      (1): GELU(approximate='none')
      (2): Linear(in_features=4096, out_features=4096, bias=True)
    )
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
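
A few numbers can be recovered from this printout by arithmetic alone. A rough sketch, assuming that position_embedding holds one row per patch plus one class token (the usual CLIP ViT layout, which the printout itself does not show); the helper linear_params is made up for illustration:

```python
import math

patch_size = 14      # kernel_size / stride of patch_embedding
num_positions = 577  # rows of position_embedding

num_patches = num_positions - 1    # assumption: one extra class token
grid = math.isqrt(num_patches)     # patches per image side
image_size = grid * patch_size     # implied input resolution in pixels


def linear_params(in_features, out_features, bias=True):
    # Weight matrix plus optional bias vector of an nn.Linear layer.
    return in_features * out_features + (out_features if bias else 0)


# mm_projector: Linear(1024 -> 4096), GELU, Linear(4096 -> 4096), both with bias
projector_params = linear_params(1024, 4096) + linear_params(4096, 4096)

print(num_patches, grid, image_size)
print(projector_params)
```

Under that assumption each image yields a 24 x 24 grid of patches, and the projector that maps them into the 4096-dimensional LLM space has about 21M parameters.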

['T_destination', '__abstractmethods__', '__annotations__', '__call__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_apply', '_auto_class', '_backward_compatibility_gradient_checkpointing', '_backward_hooks', '_backward_pre_hooks', '_buffers', '_call_impl', '_convert_head_mask_to_5d', '_create_repo', '_expand_inputs_for_generation', '_extract_past_from_model_output', '_forward_hooks', '_forward_hooks_with_kwargs', '_forward_pre_hooks', '_forward_pre_hooks_with_kwargs', '_from_config', '_get_backward_hooks', '_get_backward_pre_hooks', '_get_decoder_start_token_id', '_get_files_timestamps', '_get_logits_processor', '_get_logits_warper', '_get_name', '_get_resized_embeddings', '_get_resized_lm_head', '_get_stopping_criteria', '_hook_rss_memory_post_forward', '_hook_rss_memory_pre_forward', '_init_weights', '_initialize_weights', '_is_full_backward_hook', '_is_hf_initialized', '_keep_in_fp32_modules', '_keys_to_ignore_on_load_missing', '_keys_to_ignore_on_load_unexpected', '_keys_to_ignore_on_save', '_load_from_state_dict', '_load_pretrained_model', '_load_pretrained_model_low_mem', '_load_state_dict_post_hooks', '_load_state_dict_pre_hooks', '_maybe_initialize_input_ids_for_generation', '_maybe_warn_non_full_backward_hook', '_merge_criteria_processor_list', '_modules', '_named_members', '_no_split_modules', '_non_persistent_buffers_set', '_parameters', '_prepare_attention_mask_for_generation', '_prepare_decoder_input_ids_for_generation', '_prepare_encoder_decoder_kwargs_for_generation', '_prepare_model_inputs', '_register_load_state_dict_pre_hook', '_register_state_dict_hook', '_reorder_cache', 
'_replicate_for_data_parallel', '_resize_token_embeddings', '_save_to_state_dict', '_set_default_torch_dtype', '_set_gradient_checkpointing', '_skip_keys_device_placement', '_slow_forward', '_state_dict_hooks', '_state_dict_pre_hooks', '_tie_encoder_decoder_weights', '_tie_or_clone_weights', '_tied_weights_keys', '_update_model_kwargs_for_generation', '_upload_modified_files', '_validate_model_class', '_validate_model_kwargs', '_version', 'add_memory_hooks', 'add_module', 'adjust_logits_during_generation', 'apply', 'assisted_decoding', 'base_model', 'base_model_prefix', 'beam_sample', 'beam_search', 'bfloat16', 'buffers', 'call_super_init', 'can_generate', 'children', 'compute_transition_scores', 'config', 'config_class', 'constrained_beam_search', 'contrastive_search', 'cpu', 'create_extended_attention_mask_for_decoder', 'cuda', 'device', 'disable_input_require_grads', 'double', 'dtype', 'dummy_inputs', 'dump_patches', 'enable_input_require_grads', 'encode_images', 'estimate_tokens', 'eval', 'extra_repr', 'float', 'floating_point_ops', 'forward', 'framework', 'from_pretrained', 'generate', 'generation_config', 'get_buffer', 'get_decoder', 'get_extended_attention_mask', 'get_extra_state', 'get_head_mask', 'get_input_embeddings', 'get_memory_footprint', 'get_model', 'get_output_embeddings', 'get_parameter', 'get_position_embeddings', 'get_submodule', 'get_vision_tower', 'gradient_checkpointing_disable', 'gradient_checkpointing_enable', 'greedy_search', 'group_beam_search', 'half', 'hf_device_map', 'init_weights', 'initialize_vision_tokenizer', 'invert_attention_mask', 'ipu', 'is_gradient_checkpointing', 'is_loaded_in_4bit', 'is_loaded_in_8bit', 'is_parallelizable', 'is_quantized', 'lm_head', 'load_state_dict', 'main_input_name', 'model', 'modules', 'name_or_path', 'named_buffers', 'named_children', 'named_modules', 'named_parameters', 'num_parameters', 'parameters', 'post_init', 'prepare_inputs_for_generation', 'prepare_inputs_labels_for_multimodal', 
'pretraining_tp', 'prune_heads', 'push_to_hub', 'register_backward_hook', 'register_buffer', 'register_for_auto_class', 'register_forward_hook', 'register_forward_pre_hook', 'register_full_backward_hook', 'register_full_backward_pre_hook', 'register_load_state_dict_post_hook', 'register_module', 'register_parameter', 'register_state_dict_pre_hook', 'requires_grad_', 'reset_memory_hooks_state', 'resize_position_embeddings', 'resize_token_embeddings', 'retrieve_modules_from_names', 'reverse_bettertransformer', 'sample', 'save_pretrained', 'set_decoder', 'set_extra_state', 'set_input_embeddings', 'set_output_embeddings', 'share_memory', 'state_dict', 'supports_gradient_checkpointing', 'tie_weights', 'to', 'to_bettertransformer', 'to_empty', 'train', 'training', 'type', 'vocab_size', 'warn_if_padding_and_no_attention_mask', 'warnings_issued', 'xpu', 'zero_grad']

The return value has type

<class 'transformers.modeling_outputs.CausalLMOutputWithPast'>

`LlamaAttention`

The forward() method:

def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()

        if self.pretraining_tp > 1:
            key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.pretraining_tp
            query_slices = self.q_proj.weight.split((self.num_heads * self.head_dim) // self.pretraining_tp, dim=0)
            key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
            value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)

            query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
            query_states = torch.cat(query_states, dim=-1)

            key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.pretraining_tp)]
            key_states = torch.cat(key_states, dim=-1)

            value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.pretraining_tp)]
            value_states = torch.cat(value_states, dim=-1)

        else:
            query_states = self.q_proj(hidden_states)
            key_states = self.k_proj(hidden_states)
            value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

        kv_seq_len = key_states.shape[-2]
        if past_key_value is not None:
            kv_seq_len += past_key_value[0].shape[-2]
        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

        if past_key_value is not None:
            # reuse k, v, self_attention
            key_states = torch.cat([past_key_value[0], key_states], dim=2)
            value_states = torch.cat([past_key_value[1], value_states], dim=2)

        past_key_value = (key_states, value_states) if use_cache else None

        # repeat k/v heads if n_kv_heads < n_heads
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
                f" {attn_weights.size()}"
            )

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights + attention_mask

        # upcast attention to fp32
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
        attn_output = torch.matmul(attn_weights, value_states)

        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        if self.pretraining_tp > 1:
            attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
            o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.pretraining_tp, dim=1)
            attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.pretraining_tp)])
        else:
            attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value
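
The core of the forward pass above (repeating the k/v heads, scaled dot product, additive mask, softmax, weighted sum) can be sketched on small arrays. This is a minimal sketch and not the library code: it uses NumPy instead of PyTorch to stay dependency-light, and the function name `attention_core` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def repeat_kv(x, n_rep):
    """Repeat each KV head n_rep times: (b, kv_heads, s, d) -> (b, kv_heads * n_rep, s, d)."""
    return np.repeat(x, n_rep, axis=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_core(q, k, v, mask=None):
    """q: (b, heads, q_len, d); k, v: (b, kv_heads, kv_len, d); mask is additive."""
    n_rep = q.shape[1] // k.shape[1]
    k, v = repeat_kv(k, n_rep), repeat_kv(v, n_rep)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
bsz, heads, kv_heads, q_len, d = 1, 4, 2, 3, 8
q = rng.standard_normal((bsz, heads, q_len, d))
k = rng.standard_normal((bsz, kv_heads, q_len, d))
v = rng.standard_normal((bsz, kv_heads, q_len, d))
# additive causal mask of shape (1, 1, q_len, kv_len): large negative above the diagonal
causal = np.triu(np.full((q_len, q_len), -1e9), k=1)[None, None]
out, weights = attention_core(q, k, v, causal)
print(out.shape, weights.shape)  # (1, 4, 3, 8) (1, 4, 3, 3)
```

The `weights` array plays the role of `attn_weights` in the code above: it is what gets returned when `output_attentions` is enabled.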

['T_destination', '__annotations__', '__call__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_apply', '_backward_hooks', '_backward_pre_hooks', '_buffers', '_call_impl', '_forward_hooks', '_forward_hooks_with_kwargs', '_forward_pre_hooks', '_forward_pre_hooks_with_kwargs', '_get_backward_hooks', '_get_backward_pre_hooks', '_get_name', '_init_rope', '_is_full_backward_hook', '_is_hf_initialized', '_load_from_state_dict', '_load_state_dict_post_hooks', '_load_state_dict_pre_hooks', '_maybe_warn_non_full_backward_hook', '_modules', '_named_members', '_non_persistent_buffers_set', '_parameters', '_register_load_state_dict_pre_hook', '_register_state_dict_hook', '_replicate_for_data_parallel', '_save_to_state_dict', '_shape', '_slow_forward', '_state_dict_hooks', '_state_dict_pre_hooks', '_version', 'add_module', 'apply', 'bfloat16', 'buffers', 'call_super_init', 'children', 'config', 'cpu', 'cuda', 'double', 'dump_patches', 'eval', 'extra_repr', 'float', 'forward', 'get_buffer', 'get_extra_state', 'get_parameter', 'get_submodule', 'half', 'head_dim', 'hidden_size', 'ipu', 'k_proj', 'load_state_dict', 'max_position_embeddings', 'modules', 'named_buffers', 'named_children', 'named_modules', 'named_parameters', 'num_heads', 'num_key_value_groups', 'num_key_value_heads', 'o_proj', 'parameters', 'pretraining_tp', 'q_proj', 'register_backward_hook', 'register_buffer', 'register_forward_hook', 'register_forward_pre_hook', 'register_full_backward_hook', 'register_full_backward_pre_hook', 'register_load_state_dict_post_hook', 'register_module', 'register_parameter', 'register_state_dict_pre_hook', 'requires_grad_', 'rotary_emb', 'set_extra_state', 
'share_memory', 'state_dict', 'to', 'to_empty', 'train', 'training', 'type', 'v_proj', 'xpu', 'zero_grad']

`generate()` method

Goal: obtain the attention matrix output by each decoder layer during inference.

generate() is a method of the GenerationMixin class in transformers/generation/utils.py. When running LLaVA inference, is_greedy_gen_mode turns out to be True, so generate() returns

# around line 1541
return self.greedy_search(
    input_ids,
    logits_processor=logits_processor,
    stopping_criteria=stopping_criteria,
    pad_token_id=generation_config.pad_token_id,
    eos_token_id=generation_config.eos_token_id,
    output_scores=generation_config.output_scores,
    return_dict_in_generate=generation_config.return_dict_in_generate,
    synced_gpus=synced_gpus,
    streamer=streamer,
    **model_kwargs,
)

greedy_search() is also implemented in the GenerationMixin class, with the signature:

def greedy_search(
    self,
    input_ids: torch.LongTensor,
    logits_processor: Optional[LogitsProcessorList] = None,
    stopping_criteria: Optional[StoppingCriteriaList] = None,
    max_length: Optional[int] = None,
    pad_token_id: Optional[int] = None,
    eos_token_id: Optional[Union[int, List[int]]] = None,
    output_attentions: Optional[bool] = None,  # pass True for this argument
    output_hidden_states: Optional[bool] = None,
    output_scores: Optional[bool] = None,
    return_dict_in_generate: Optional[bool] = None,  # pass True for this argument
    synced_gpus: bool = False,
    streamer: Optional["BaseStreamer"] = None,
    **model_kwargs,
) -> Union[GreedySearchOutput, torch.LongTensor]:

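Pass True for the two annotated arguments above, or set them to True directly in the source. For reasons I have not pinned down, adding output_attentions=True and return_dict_in_generate=True to the greedy_search() call at the return statement of generate() raised an error. One likely cause is that this call already passes return_dict_in_generate, so adding it again repeats a keyword argument.

The relevant code is around line 2380: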

if return_dict_in_generate:
    if output_scores:
        scores += (next_tokens_scores,)
    if output_attentions:
        decoder_attentions += (
            (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
        )

These two arguments control whether the attention is stored in decoder_attentions. The final return value is

# around line 2433
if return_dict_in_generate:
    if self.config.is_encoder_decoder:
        ...
    else:  # the model is decoder-only, so this branch is returned
        return GreedySearchDecoderOnlyOutput(
            sequences=input_ids,
            scores=scores,
            attentions=decoder_attentions,
            hidden_states=decoder_hidden_states,
        )

This class exists purely to hold the output, and is also implemented in transformers/generation/utils.py:

@dataclass
class GreedySearchDecoderOnlyOutput(ModelOutput):
    """
    Base class for outputs of decoder-only generation models using greedy search.
    """

    sequences: torch.LongTensor = None
    scores: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[Tuple[torch.FloatTensor]]] = None

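The docstring for these fields is long, so here is a summary:

sequences is a tensor of shape (batch_size, sequence_length) holding the generated answer. sequence_length either equals max_length or is shorter when every sequence in the batch hit eos_token_id early, so the exact length varies.
scores is not used here, so it is skipped.
attentions has type tuple(tuple(torch.FloatTensor)). The outer tuple has one entry per generated token, the inner tuple has one entry per decoder layer, and each tensor has shape (batch_size, num_heads, generated_length, sequence_length). Here sequence_length is the number of positions attended to at that step, so it grows as tokens are generated. The meaning of generated_length is not yet clear.
hidden_states is not used either, so it is also skipped.

The open question is what generated_length means: why it equals sequence_length for the first token but is 1 for every later token.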

Answering this requires looking at how greedy_search predicts token by token.

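I asked GPT, and its answer was:

generated_length is the length of the input sequence the model attends to when generating each token. In the attention mechanism, the model decides which part of the input sequence to attend to for the next token based on the current token and its context. So generated_length changes dynamically with the position of the token being generated.

For the first token: when generating the first token, the model has to consider the whole input sequence and its context, because the first token is generated from the entire input. So for the first token generated_length equals sequence_length, meaning the model needs the attention weights over the whole input sequence.

For later tokens: once the first token has been generated, the model no longer needs to consider the whole input sequence and only has to attend to the tokens already generated and their context. So for later tokens generated_length equals 1, meaning the model only looks at the input of the current time step.

In the attention mechanism, generated_length is the length of the input sequence the model attends to when generating each token. For the first token the model attends to the whole input sequence, so generated_length equals sequence_length. For later tokens the model only attends to the generated tokens, so generated_length is 1.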

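The conclusion matches what is observed, but the real reason is the KV cache: after the first step only the newest token is fed in as the query, so generated_length is 1. That single query still attends to every earlier position through the cached keys and values, which is why sequence_length keeps growing.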
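
The shapes can be reproduced with a toy single-head attention that keeps a key/value cache. This is a sketch under simplifying assumptions, not the LLaMA implementation: there are no projection matrices, the token vectors are random, NumPy replaces PyTorch, and `attend_with_cache` is a hypothetical helper name. It only shows that the query length is the prompt length at step 0 and 1 afterwards, while the key length grows by one per step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_with_cache(x_new, cache_k, cache_v):
    """Toy single-head attention. x_new: (q_len, d) holds only the tokens not yet cached."""
    k = np.concatenate([cache_k, x_new], axis=0)  # keys cover every token seen so far
    v = np.concatenate([cache_v, x_new], axis=0)
    q_len, kv_len = x_new.shape[0], k.shape[0]
    scores = x_new @ k.T / np.sqrt(x_new.shape[-1])
    # causal mask: query row r sits at absolute position kv_len - q_len + r
    future = np.arange(kv_len)[None, :] > (kv_len - q_len + np.arange(q_len))[:, None]
    scores = np.where(future, -1e9, scores)
    return softmax(scores), k, v

rng = np.random.default_rng(0)
d, prompt_len, new_tokens = 4, 5, 3
cache_k = cache_v = np.zeros((0, d))
x = rng.standard_normal((prompt_len, d))  # step 0 feeds the whole prompt
shapes = []
for step in range(new_tokens):
    weights, cache_k, cache_v = attend_with_cache(x, cache_k, cache_v)
    shapes.append(weights.shape)
    x = rng.standard_normal((1, d))       # later steps feed only the newest token
print(shapes)  # [(5, 5), (1, 6), (1, 7)]
```

The first dimension of each shape corresponds to generated_length and the second to sequence_length in the attentions tensors described above.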

`greedy_search()`

while True:
    if synced_gpus:
        # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
        # The following logic allows an early break if all peers finished generating their sequence
        this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
        # send 0.0 if we finished, 1.0 otherwise
        dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
        # did all peers finish? the reduced sum will be 0.0 then
        if this_peer_finished_flag.item() == 0.0:
            break

    # prepare model inputs
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

    # forward pass to get next token
    outputs = self(
        **model_inputs,
        return_dict=True,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )

    if synced_gpus and this_peer_finished:
        continue  # don't waste resources running the code we don't need

    next_token_logits = outputs.logits[:, -1, :]

    # pre-process distribution
    next_tokens_scores = logits_processor(input_ids, next_token_logits)

    # Store scores, attentions and hidden_states when required
    if return_dict_in_generate:
        if output_scores:
            scores += (next_tokens_scores,)
        if output_attentions:
            decoder_attentions += (
                (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
            )
            if self.config.is_encoder_decoder:
                cross_attentions += (outputs.cross_attentions,)

        if output_hidden_states:
            decoder_hidden_states += (
                (outputs.decoder_hidden_states,)
                if self.config.is_encoder_decoder
                else (outputs.hidden_states,)
            )

    # argmax
    next_tokens = torch.argmax(next_tokens_scores, dim=-1)

    # finished sentences should have their next token be a padding token
    if eos_token_id is not None:
        if pad_token_id is None:
            raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
        next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)

    # update generated ids, model inputs, and length for next step
    input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
    if streamer is not None:
        streamer.put(next_tokens.cpu())
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
    )

    # if eos_token was found in one sentence, set sentence to finished
    if eos_token_id_tensor is not None:
        unfinished_sequences = unfinished_sequences.mul(
            next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
        )

        # stop when each sentence is finished
        if unfinished_sequences.max() == 0:
            this_peer_finished = True

    # stop if we exceed the maximum length
    if stopping_criteria(input_ids, scores):
        this_peer_finished = True

    if this_peer_finished and not synced_gpus:
        break

greedy_search() shows exactly how each token is generated. Tokens are produced with greedy decoding.

Its second argument is logits_processor, which modifies the prediction scores of the language modeling head at each generation step.
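
To make the role of logits_processor concrete, here is a toy processor with the same call shape, `processor(input_ids, scores)`. It is a hypothetical example written with NumPy, similar in spirit to a minimum-length constraint, and `suppress_eos_until` is not a transformers function.

```python
import numpy as np

def suppress_eos_until(min_length, eos_token_id):
    """Return a processor that forbids EOS while the sequence is shorter than min_length."""
    def processor(input_ids, scores):
        # input_ids: (batch, cur_len); scores: (batch, vocab_size)
        if input_ids.shape[-1] < min_length:
            scores = scores.copy()
            scores[:, eos_token_id] = -np.inf
        return scores
    return processor

processor = suppress_eos_until(min_length=4, eos_token_id=2)
input_ids = np.array([[5, 7, 9]])           # 3 tokens so far, below min_length
scores = np.array([[0.1, 0.3, 2.0, 0.5]])   # EOS (id 2) currently has the top score
processed = processor(input_ids, scores)
print(int(np.argmax(scores, axis=-1)[0]), int(np.argmax(processed, axis=-1)[0]))  # 2 3
```

Because greedy decoding takes the argmax of the processed scores, changing the scores is enough to change which token is picked.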

The procedure runs inside a while loop. Only the parts closely tied to token generation are described below:

First, model_inputs is prepared from input_ids and the other model arguments.
model_inputs and the remaining arguments are passed to self for a forward pass, which gives outputs.
outputs has a logits field. next_token_logits = outputs.logits[:, -1, :] picks the logits for the next token, and logits_processor then pre-processes the distribution to give the next-token scores:
# pre-process distribution
next_tokens_scores = logits_processor(input_ids, next_token_logits)
The scores, attention matrices, and hidden states are stored if requested.
The next token is the one with the highest score:
# argmax
next_tokens = torch.argmax(next_tokens_scores, dim=-1)
input_ids is updated by appending the new token, input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1), and the other model arguments in model_kwargs are updated.
If eos_token appears in a sequence, that sequence is marked as finished.
Finally, the loop exits once every sequence is finished or stopping_criteria is satisfied.
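
The steps above can be sketched as a self-contained loop. This is a simplified illustration, not the transformers implementation: `toy_forward` is a hypothetical stand-in for the model that always predicts "previous id + 1", NumPy replaces PyTorch, and the KV cache, logits processor, and streamer are left out.

```python
import numpy as np

def toy_forward(input_ids, vocab_size=6):
    """Hypothetical stand-in for model.forward: logits of shape (batch, seq_len, vocab)."""
    batch, seq_len = input_ids.shape
    logits = np.zeros((batch, seq_len, vocab_size))
    next_ids = (input_ids + 1) % vocab_size  # the toy "model" predicts previous id + 1
    logits[np.arange(batch)[:, None], np.arange(seq_len)[None, :], next_ids] = 1.0
    return logits

def greedy_decode(input_ids, eos_token_id, pad_token_id, max_length):
    unfinished = np.ones(input_ids.shape[0], dtype=np.int64)
    while True:
        logits = toy_forward(input_ids)                      # forward pass
        next_token_logits = logits[:, -1, :]                 # keep only the last position
        next_tokens = np.argmax(next_token_logits, axis=-1)  # greedy choice
        # finished rows keep emitting the padding token
        next_tokens = next_tokens * unfinished + pad_token_id * (1 - unfinished)
        input_ids = np.concatenate([input_ids, next_tokens[:, None]], axis=-1)
        # a row becomes finished once it emits EOS
        unfinished = unfinished * (next_tokens != eos_token_id)
        if unfinished.max() == 0 or input_ids.shape[-1] >= max_length:
            return input_ids

out = greedy_decode(np.array([[1, 2], [0, 3]]), eos_token_id=5, pad_token_id=0, max_length=10)
print(out.tolist())  # [[1, 2, 3, 4, 5], [0, 3, 4, 5, 0]]
```

The second row reaches EOS one step earlier than the first, so its last entry is the padding token, which mirrors the `unfinished_sequences` bookkeeping in the real loop.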

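Here self is a concrete model whose class inherits from GenerationMixin, so the call runs that model's forward() function. Based on the LLaVA-1.5 source, this ends up in the forward function of the LlamaForCausalLM class, whose return value is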

CausalLMOutputWithPast(
    loss=loss,
    logits=logits,
    past_key_values=outputs.past_key_values,
    hidden_states=outputs.hidden_states,
    attentions=outputs.attentions,
)

So logits, hidden_states, and attentions are indeed all available. Next, look at how logits is produced:

outputs = self.model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

hidden_states = outputs[0]
if self.pretraining_tp > 1:
    lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.pretraining_tp, dim=0)
    logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.pretraining_tp)]
    logits = torch.cat(logits, dim=-1)
else:
    logits = self.lm_head(hidden_states)
logits = logits.float()

The logits in the last line is what is finally returned. The self.lm_head it uses is the language model head, which here is just a Linear layer:

self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
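
A small numerical check makes the two branches easier to compare. The sketch below assumes toy sizes and random weights and uses NumPy rather than PyTorch. It shows that a bias-free linear head is a matrix product with the transposed weight, and that splitting the weight along the vocabulary dimension and concatenating the partial results (the pretraining_tp branch) gives the same logits.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size, tp = 8, 12, 4
hidden_states = rng.standard_normal((1, 3, hidden_size))  # (batch, seq_len, hidden_size)
weight = rng.standard_normal((vocab_size, hidden_size))   # same layout as nn.Linear.weight

# bias-free linear layer: logits = hidden_states @ W^T
logits_full = hidden_states @ weight.T

# pretraining_tp-style path: split W along the vocab dimension, project, concatenate
slices = np.split(weight, tp, axis=0)
logits_tp = np.concatenate([hidden_states @ w.T for w in slices], axis=-1)

print(logits_full.shape, np.allclose(logits_full, logits_tp))  # (1, 3, 12) True
```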

Attention matrix

import torch

# attention matrix between the output tokens
# `out` is the result returned by generate(); `vis` is defined elsewhere
attn_matrices=[]
out_token_num=len(out.attentions)
layer_num=len(out.attentions[0])
input_token_num=out.attentions[0][0].shape[2]
for i in range(layer_num):
    attn_mat=torch.zeros((out_token_num,out_token_num)).cuda()
    # build the final attn_matrix for each layer
    for j,attn in enumerate(out.attentions):
        if j>0:
            # attn : 32*([1,32,1,627+j])
            attn_max=attn[i].max(1).values.data.squeeze()
            attn_max=attn_max / attn_max.sum(-1, keepdim=True)
            # attn_max : 32*([627+j])
            attn_mat[j-1,:j]=attn_max[input_token_num:] # take the last j elements and assign them to the first j entries of row j-1
    # print(attn_mat)
    attn_matrices.append(attn_mat)

for i in range(layer_num):
    vis(attn_matrices[i].cpu(), i)

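Let $A_1$ be the attn matrix for the first generated token and $A_k$ the attn matrix for the $k$-th generated token.

The input tokens are $x_1, \dots, x_n$, so $A_1$ holds the pairwise similarities among $x_1, \dots, x_n$, and the first token $y_1$ is output. There is one $A_1$ per layer, and the same holds below.
Take $y_1$ as the input token. $A_2$ now holds the similarities between $y_1$ and $x_1, \dots, x_n, y_1$ (the similarities among $x_1, \dots, x_n$ have already been computed and need not be recomputed), and the second token $y_2$ is output.
Take $y_{m-1}$ as the input token. $A_m$ now holds the similarities between $y_{m-1}$ and $x_1, \dots, x_n, y_1, \dots, y_{m-1}$, and the last token $y_m$ is output.

What we want is the similarities among $y_1, \dots, y_m$. Clearly, it is enough to pick the needed entries out of the matrices above and stack them.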
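
The extraction can be tried without a GPU on synthetic data. In the sketch below the nested list `attentions` is a random stand-in for out.attentions with the same nesting (step, then layer) and the same per-step shapes; the sizes and the helper name `output_token_attention` are assumptions, and NumPy replaces PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, num_heads, prompt_len, out_tokens = 2, 4, 5, 4

def random_attention(q_len, kv_len):
    a = rng.random((1, num_heads, q_len, kv_len))
    return a / a.sum(axis=-1, keepdims=True)  # rows sum to 1 like softmax output

# synthetic stand-in for out.attentions: attentions[step][layer]
attentions = [
    [random_attention(prompt_len if step == 0 else 1, prompt_len + step)
     for _ in range(num_layers)]
    for step in range(out_tokens)
]

def output_token_attention(attentions, layer):
    steps = len(attentions)
    n_in = attentions[0][layer].shape[2]   # prompt length, read from the first step
    mat = np.zeros((steps, steps))
    for j in range(1, steps):
        row = attentions[j][layer].max(axis=1).squeeze()  # max over heads -> (n_in + j,)
        row = row / row.sum()
        mat[j - 1, :j] = row[n_in:]        # keep only the columns of generated tokens
    return mat

mat = output_token_attention(attentions, layer=0)
print(mat.shape)  # (4, 4)
```

In this sketch the last row of `mat` stays zero: the final generated token is appended to the sequence but never fed back as a query, so no attention row exists for it.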

One open question: what about the last token?

The details will have to wait until a GPU is available to run this again.

从获取

NLP basics

1: In code, logits refers specifically to the raw values output by the model, that is, values that have not been passed through softmax.

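2: Introducing a temperature parameter $T$ into softmax, $p_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$, adjusts how closely the output probability distribution follows the original logits $z_i$. The larger $T$ is, the closer the distribution gets to uniform; the smaller $T$ is, the more sharply it reflects the differences between the logits, concentrating on the largest one.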
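
The effect of the temperature is easy to see numerically. A minimal sketch in plain Python, with made-up logits and a hypothetical function name `softmax_with_temperature`:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With the small temperature most of the probability goes to the largest logit, and with the large temperature the three probabilities move toward one third each.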

3: token, tokenize, and tokenizer.
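
The three terms can be told apart with a toy example: a token is one unit of text, tokenize is the act of splitting text into tokens, and a tokenizer is the object that does the splitting and maps tokens to integer ids. The class below is a hypothetical whitespace tokenizer for illustration only; real LLM tokenizers generally split text into subword units instead.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer, for illustration only."""

    def __init__(self, corpus):
        words = sorted(set(corpus.split()))
        # id 0 is reserved for words that are not in the vocabulary
        self.vocab = {"<unk>": 0, **{w: i + 1 for i, w in enumerate(words)}}
        self.inverse = {i: w for w, i in self.vocab.items()}

    def tokenize(self, text):
        return text.split()                       # text -> list of tokens

    def encode(self, text):
        return [self.vocab.get(tok, 0) for tok in self.tokenize(text)]

    def decode(self, ids):
        return " ".join(self.inverse[i] for i in ids)

tok = ToyTokenizer("the cat sat on the mat")
ids = tok.encode("the cat sat on the sofa")
print(tok.tokenize("the cat sat"))  # ['the', 'cat', 'sat']
print(ids)                          # [5, 1, 4, 3, 5, 0]
print(tok.decode(ids))              # the cat sat on the <unk>
```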

LLM Inference

2024-04-23 Filed by Baoduo Xu under: Project and Research

That is all.
