Source Code of model.generate for Large-Model Inference

2024-06-11 16:52

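This article reproduces the source code of model.generate, the method called when running inference with a large language model, as a reference for anyone debugging generation code.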


File path: anaconda3/envs/<env-name>/lib/python3.10/site-packages/transformers/generation/utils.py
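
The path above depends on the local conda environment and Python version, so it may differ on your machine. A minimal sketch for asking Python where a module lives is shown below; `source_path` is a hypothetical helper name introduced here, and the demo uses the standard-library `json` package so that it runs without `transformers` installed.

```python
import importlib.util


def source_path(module_name: str) -> str:
    """Return the file that defines `module_name` in the active environment."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"cannot locate a source file for {module_name!r}")
    return spec.origin


if __name__ == "__main__":
    # Demonstrated on a standard-library package so the snippet runs anywhere.
    print(source_path("json"))
    # With transformers installed, the equivalent call would be:
    # print(source_path("transformers.generation.utils"))
```

Run inside the activated environment, the commented-out call should print the `utils.py` file that contains `generate`.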

    def generate(
        self,
        inputs: Optional[torch.Tensor] = None,
        generation_config: Optional[GenerationConfig] = None,
        logits_processor: Optional[LogitsProcessorList] = None,
        stopping_criteria: Optional[StoppingCriteriaList] = None,
        prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
        synced_gpus: Optional[bool] = None,
        assistant_model: Optional["PreTrainedModel"] = None,
        streamer: Optional["BaseStreamer"] = None,
        negative_prompt_ids: Optional[torch.Tensor] = None,
        negative_prompt_attention_mask: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> Union[GenerateOutput, torch.LongTensor]:
        r"""

        Generates sequences of token ids for models with a language modeling head.

        <Tip warning={true}>

        Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
        model's default generation configuration. You can override any `generation_config` by passing the corresponding
        parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.

        For an overview of generation strategies and code examples, check out the [following
        guide](../generation_strategies).

        </Tip>

        Parameters:
            inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
                The sequence used as a prompt for the generation or as model inputs to the encoder. If `None` the
                method initializes it with `bos_token_id` and a batch size of 1. For decoder-only models `inputs`
                should be in the format of `input_ids`. For encoder-decoder models *inputs* can represent any of
                `input_ids`, `input_values`, `input_features`, or `pixel_values`.
            generation_config ([`~generation.GenerationConfig`], *optional*):
                The generation configuration to be used as base parametrization for the generation call. `**kwargs`
                passed to generate matching the attributes of `generation_config` will override them. If
                `generation_config` is not provided, the default will be used, which has the following loading
                priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model
                configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s
                default values, whose documentation should be checked to parameterize generation.
            logits_processor (`LogitsProcessorList`, *optional*):
                Custom logits processors that complement the default logits processors built from arguments and
                generation config. If a logit processor is passed that is already created with the arguments or a
                generation config an error is thrown. This feature is intended for advanced users.
            stopping_criteria (`StoppingCriteriaList`, *optional*):
                Custom stopping criteria that complements the default stopping criteria built from arguments and a
                generation config. If a stopping criteria is passed that is already created with the arguments or a
                generation config an error is thrown. If your stopping criteria depends on the `scores` input, make
                sure you pass `return_dict_in_generate=True, output_scores=True` to `generate`. This feature is
                intended for advanced users.
            prefix_allowed_tokens_fn (`Callable[[int, torch.Tensor], List[int]]`, *optional*):
                If provided, this function constraints the beam search to allowed tokens only at each step. If not
                provided no constraint is applied. This function takes 2 arguments: the batch ID `batch_id` and
                `input_ids`. It has to return a list with the allowed tokens for the next generation step conditioned
                on the batch ID `batch_id` and the previously generated tokens `inputs_ids`. This argument is useful
                for constrained generation conditioned on the prefix, as described in [Autoregressive Entity
                Retrieval](https://arxiv.org/abs/2010.00904).
            synced_gpus (`bool`, *optional*):
                Whether to continue running the while loop until max_length. Unless overridden this flag will be set to
                `True` under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished
                generating before other GPUs. Otherwise it'll be set to `False`.
            assistant_model (`PreTrainedModel`, *optional*):
                An assistant model that can be used to accelerate generation. The assistant model must have the exact
                same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistent model
                is much faster than running generation with the model you're calling generate from. As such, the
                assistant model should be much smaller.
            streamer (`BaseStreamer`, *optional*):
                Streamer object that will be used to stream the generated sequences. Generated tokens are passed
                through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
            negative_prompt_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                The negative prompt needed for some processors such as CFG. The batch size must match the input batch
                size. This is an experimental feature, subject to breaking API changes in future versions.
            negative_prompt_attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Attention_mask for `negative_prompt_ids`.
            kwargs (`Dict[str, Any]`, *optional*):
                Ad hoc parametrization of `generation_config` and/or additional model-specific kwargs that will be
                forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder
                specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder_*.

        Return:
            [`~utils.ModelOutput`] or `torch.LongTensor`: A [`~utils.ModelOutput`] (if `return_dict_in_generate=True`
            or when `config.return_dict_in_generate=True`) or a `torch.LongTensor`.

                If the model is *not* an encoder-decoder model (`model.config.is_encoder_decoder=False`), the possible
                [`~utils.ModelOutput`] types are:

                    - [`~generation.GenerateDecoderOnlyOutput`],
                    - [`~generation.GenerateBeamDecoderOnlyOutput`]

                If the model is an encoder-decoder model (`model.config.is_encoder_decoder=True`), the possible
                [`~utils.ModelOutput`] types are:

                    - [`~generation.GenerateEncoderDecoderOutput`],
                    - [`~generation.GenerateBeamEncoderDecoderOutput`]
        """
        # 1. Handle `generation_config` and kwargs that might update it, and validate the `.generate()` call
        self._validate_model_class()
        tokenizer = kwargs.pop("tokenizer", None)  # Pull this out first, we only use it for stopping criteria
        generation_config, model_kwargs = self._prepare_generation_config(generation_config, **kwargs)
        self._validate_model_kwargs(model_kwargs.copy())

        # 2. Set generation parameters if not already defined
        if synced_gpus is None:
            if is_deepspeed_zero3_enabled() and dist.get_world_size() > 1:
                synced_gpus = True
            else:
                synced_gpus = False

        logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
        stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()

        accepts_attention_mask = "attention_mask" in set(inspect.signature(self.forward).parameters.keys())
        requires_attention_mask = "encoder_outputs" not in model_kwargs
        kwargs_has_attention_mask = model_kwargs.get("attention_mask", None) is not None

        # 3. Define model inputs
        inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(
            inputs, generation_config.bos_token_id, model_kwargs
        )
        batch_size = inputs_tensor.shape[0]

        device = inputs_tensor.device
        self._prepare_special_tokens(generation_config, kwargs_has_attention_mask, device=device)

        # decoder-only models must use left-padding for batched generation.
        if not self.config.is_encoder_decoder and not is_torchdynamo_compiling():
            # If `input_ids` was given, check if the last id in any sequence is `pad_token_id`
            # Note: If using, `inputs_embeds` this check does not work, because we want to be more hands-off.
            if (
                generation_config.pad_token_id is not None
                and batch_size > 1
                and len(inputs_tensor.shape) == 2
                and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
            ):
                logger.warning(
                    "A decoder-only architecture is being used, but right-padding was detected! For correct "
                    "generation results, please set `padding_side='left'` when initializing the tokenizer."
                )

        # 4. Define other model kwargs
        # decoder-only models with inputs_embeds forwarding must use caching (otherwise we can't detect whether we are
        # generating the first new token or not, and we only want to use the embeddings for the first new token)
        if not self.config.is_encoder_decoder and model_input_name == "inputs_embeds":
            model_kwargs["use_cache"] = True
        else:
            model_kwargs["use_cache"] = generation_config.use_cache

        if not kwargs_has_attention_mask and requires_attention_mask and accepts_attention_mask:
            model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
                inputs_tensor, generation_config.pad_token_id, generation_config.eos_token_id
            )

        if self.config.is_encoder_decoder and "encoder_outputs" not in model_kwargs:
            # if model is encoder decoder encoder_outputs are created and added to `model_kwargs`
            model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
                inputs_tensor, model_kwargs, model_input_name, generation_config
            )

        # 5. Prepare `input_ids` which will be used for auto-regressive generation
        if self.config.is_encoder_decoder:
            input_ids, model_kwargs = self._prepare_decoder_input_ids_for_generation(
                batch_size=batch_size,
                model_input_name=model_input_name,
                model_kwargs=model_kwargs,
                decoder_start_token_id=generation_config.decoder_start_token_id,
                device=inputs_tensor.device,
            )
        else:
            input_ids = inputs_tensor if model_input_name == "input_ids" else model_kwargs.pop("input_ids")

        if streamer is not None:
            streamer.put(input_ids.cpu())

        # 6. Prepare `max_length` depending on other stopping criteria.
        input_ids_length = input_ids.shape[-1]
        has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
        has_default_min_length = kwargs.get("min_length") is None and generation_config.min_length is not None
        generation_config = self._prepare_generated_length(
            generation_config=generation_config,
            has_default_max_length=has_default_max_length,
            has_default_min_length=has_default_min_length,
            model_input_name=model_input_name,
            inputs_tensor=inputs_tensor,
            input_ids_length=input_ids_length,
        )

        if generation_config.cache_implementation is not None and model_kwargs.get("past_key_values") is not None:
            raise ValueError(
                "Passing both `cache_implementation` (used to initialize certain caches) and `past_key_values` (a "
                "Cache object) is unsupported. Please use only one of the two."
            )
        elif generation_config.cache_implementation in NEED_SETUP_CACHE_CLASSES_MAPPING:
            if not self._supports_cache_class:
                raise ValueError(
                    "This model does not support the `cache_implementation` argument. Please check the following "
                    "issue: https://github.com/huggingface/transformers/issues/28981."
                )
            if generation_config.cache_implementation == "static":
                if not self._supports_static_cache:
                    raise ValueError(
                        "This model does not support `cache_implementation='static'`. Please check the following "
                        "issue: https://github.com/huggingface/transformers/issues/28981"
                    )
                model_kwargs["past_key_values"] = self._get_static_cache(batch_size, generation_config.max_length)

        self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)

        # 7. determine generation mode
        generation_mode = generation_config.get_generation_mode(assistant_model)

        if streamer is not None and (generation_config.num_beams > 1):
            raise ValueError(
                "`streamer` cannot be used with beam search (yet!). Make sure that `num_beams` is set to 1."
            )

        if self.device.type != input_ids.device.type:
            warnings.warn(
                "You are calling .generate() with the `input_ids` being on a device type different"
                f" than your model's device. `input_ids` is on {input_ids.device.type}, whereas the model"
                f" is on {self.device.type}. You may experience unexpected behaviors or slower generation."
                " Please make sure that you have put `input_ids` to the"
                f" correct device by calling for example input_ids = input_ids.to('{self.device.type}') before"
                " running `.generate()`.",
                UserWarning,
            )

        # 8. prepare distribution pre_processing samplers
        prepared_logits_processor = self._get_logits_processor(
            generation_config=generation_config,
            input_ids_seq_length=input_ids_length,
            encoder_input_ids=inputs_tensor,
            prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
            logits_processor=logits_processor,
            device=inputs_tensor.device,
            model_kwargs=model_kwargs,
            negative_prompt_ids=negative_prompt_ids,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
        )

        # 9. prepare stopping criteria
        prepared_stopping_criteria = self._get_stopping_criteria(
            generation_config=generation_config, stopping_criteria=stopping_criteria, tokenizer=tokenizer, **kwargs
        )

        # 10. go into different generation modes
        if generation_mode == GenerationMode.ASSISTED_GENERATION:
            if generation_config.num_return_sequences > 1:
                raise ValueError(
                    "num_return_sequences has to be 1 when doing assisted generate, "
                    f"but is {generation_config.num_return_sequences}."
                )
            if batch_size > 1:
                raise ValueError("assisted generate is only supported for batch_size = 1")
            if not model_kwargs["use_cache"]:
                raise ValueError("assisted generate requires `use_cache=True`")
            if generation_config.cache_implementation == "static":
                raise ValueError("assisted generate is not supported with `static_cache`")

            # 11. Get the candidate generator, given the parameterization
            candidate_generator = self._get_candidate_generator(
                generation_config=generation_config,
                input_ids=input_ids,
                inputs_tensor=inputs_tensor,
                assistant_model=assistant_model,
                logits_processor=logits_processor,
                model_kwargs=model_kwargs,
            )

            # 12. prepare logits warper (if `do_sample` is `True`)
            prepared_logits_warper = (
                self._get_logits_warper(generation_config) if generation_config.do_sample else None
            )

            # 13. run assisted generate
            result = self._assisted_decoding(
                input_ids,
                candidate_generator=candidate_generator,
                logits_processor=prepared_logits_processor,
                logits_warper=prepared_logits_warper,
                stopping_criteria=prepared_stopping_criteria,
                generation_config=generation_config,
                synced_gpus=synced_gpus,
                streamer=streamer,
                **model_kwargs,
            )

        elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
            if not model_kwargs["use_cache"]:
                raise ValueError("Contrastive search requires `use_cache=True`")

            result = self._contrastive_search(
                input_ids,
                logits_processor=prepared_logits_processor,
                stopping_criteria=prepared_stopping_criteria,
                generation_config=generation_config,
                synced_gpus=synced_gpus,
                streamer=streamer,
                **model_kwargs,
            )

        elif generation_mode in (GenerationMode.SAMPLE, GenerationMode.GREEDY_SEARCH):
            # 11. prepare logits warper
            prepared_logits_warper = (
                self._get_logits_warper(generation_config) if generation_config.do_sample else None
            )

            # 12. expand input_ids with `num_return_sequences` additional sequences per batch
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids=input_ids,
                expand_size=generation_config.num_return_sequences,
                is_encoder_decoder=self.config.is_encoder_decoder,
                **model_kwargs,
            )

            # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
            result = self._sample(
                input_ids,
                logits_processor=prepared_logits_processor,
                logits_warper=prepared_logits_warper,
                stopping_criteria=prepared_stopping_criteria,
                generation_config=generation_config,
                synced_gpus=synced_gpus,
                streamer=streamer,
                **model_kwargs,
            )

        elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
            # 11. prepare logits warper
            prepared_logits_warper = (
                self._get_logits_warper(generation_config) if generation_config.do_sample else None
            )

            # 12. prepare beam search scorer
            beam_scorer = BeamSearchScorer(
                batch_size=batch_size,
                num_beams=generation_config.num_beams,
                device=inputs_tensor.device,
                length_penalty=generation_config.length_penalty,
                do_early_stopping=generation_config.early_stopping,
                num_beam_hyps_to_keep=generation_config.num_return_sequences,
                max_length=generation_config.max_length,
            )

            # 13. interleave input_ids with `num_beams` additional sequences per batch
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids=input_ids,
                expand_size=generation_config.num_beams,
                is_encoder_decoder=self.config.is_encoder_decoder,
                **model_kwargs,
            )

            # 14. run beam sample
            result = self._beam_search(
                input_ids,
                beam_scorer,
                logits_processor=prepared_logits_processor,
                logits_warper=prepared_logits_warper,
                stopping_criteria=prepared_stopping_criteria,
                generation_config=generation_config,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )

        elif generation_mode == GenerationMode.GROUP_BEAM_SEARCH:
            # 11. prepare beam search scorer
            beam_scorer = BeamSearchScorer(
                batch_size=batch_size,
                num_beams=generation_config.num_beams,
                device=inputs_tensor.device,
                length_penalty=generation_config.length_penalty,
                do_early_stopping=generation_config.early_stopping,
                num_beam_hyps_to_keep=generation_config.num_return_sequences,
                num_beam_groups=generation_config.num_beam_groups,
                max_length=generation_config.max_length,
            )

            # 12. interleave input_ids with `num_beams` additional sequences per batch
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids=input_ids,
                expand_size=generation_config.num_beams,
                is_encoder_decoder=self.config.is_encoder_decoder,
                **model_kwargs,
            )

            # 13. run beam search
            result = self._group_beam_search(
                input_ids,
                beam_scorer,
                logits_processor=prepared_logits_processor,
                stopping_criteria=prepared_stopping_criteria,
                generation_config=generation_config,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )

        elif generation_mode == GenerationMode.CONSTRAINED_BEAM_SEARCH:
            final_constraints = []
            if generation_config.constraints is not None:
                final_constraints = generation_config.constraints

            if generation_config.force_words_ids is not None:

                def typeerror():
                    raise ValueError(
                        "`force_words_ids` has to either be a `List[List[List[int]]]` or `List[List[int]]` "
                        f"of positive integers, but is {generation_config.force_words_ids}."
                    )

                if (
                    not isinstance(generation_config.force_words_ids, list)
                    or len(generation_config.force_words_ids) == 0
                ):
                    typeerror()

                for word_ids in generation_config.force_words_ids:
                    if isinstance(word_ids[0], list):
                        if not isinstance(word_ids, list) or len(word_ids) == 0:
                            typeerror()
                        if any(not isinstance(token_ids, list) for token_ids in word_ids):
                            typeerror()
                        if any(
                            any((not isinstance(token_id, int) or token_id < 0) for token_id in token_ids)
                            for token_ids in word_ids
                        ):
                            typeerror()

                        constraint = DisjunctiveConstraint(word_ids)
                    else:
                        if not isinstance(word_ids, list) or len(word_ids) == 0:
                            typeerror()
                        if any((not isinstance(token_id, int) or token_id < 0) for token_id in word_ids):
                            typeerror()

                        constraint = PhrasalConstraint(word_ids)
                    final_constraints.append(constraint)

            # 11. prepare beam search scorer
            constrained_beam_scorer = ConstrainedBeamSearchScorer(
                constraints=final_constraints,
                batch_size=batch_size,
                num_beams=generation_config.num_beams,
                device=inputs_tensor.device,
                length_penalty=generation_config.length_penalty,
                do_early_stopping=generation_config.early_stopping,
                num_beam_hyps_to_keep=generation_config.num_return_sequences,
                max_length=generation_config.max_length,
            )

            # 12. interleave input_ids with `num_beams` additional sequences per batch
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids=input_ids,
                expand_size=generation_config.num_beams,
                is_encoder_decoder=self.config.is_encoder_decoder,
                **model_kwargs,
            )

            # 13. run beam search
            result = self._constrained_beam_search(
                input_ids,
                constrained_beam_scorer=constrained_beam_scorer,
                logits_processor=prepared_logits_processor,
                stopping_criteria=prepared_stopping_criteria,
                generation_config=generation_config,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )

        return result
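
Step 7 of the listing delegates the choice of decoding strategy to `generation_config.get_generation_mode(assistant_model)`, whose body is not part of the code above. The sketch below is a simplified reading of how such a dispatcher could map the common configuration fields onto the modes that step 10 branches on. The `Mode` enum and `pick_mode` function are hypothetical names, and the exact conditions are assumptions based on the documented decoding strategies, not a copy of the library code.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    GREEDY_SEARCH = "greedy_search"
    SAMPLE = "sample"
    CONTRASTIVE_SEARCH = "contrastive_search"
    BEAM_SEARCH = "beam_search"
    BEAM_SAMPLE = "beam_sample"
    GROUP_BEAM_SEARCH = "group_beam_search"
    CONSTRAINED_BEAM_SEARCH = "constrained_beam_search"
    ASSISTED_GENERATION = "assisted_generation"


def pick_mode(
    num_beams: int = 1,
    num_beam_groups: int = 1,
    do_sample: bool = False,
    top_k: Optional[int] = None,
    penalty_alpha: Optional[float] = None,
    has_constraints: bool = False,
    has_assistant: bool = False,
) -> Mode:
    """Assumed, simplified mapping from config fields to a decoding mode."""
    if has_constraints:
        mode = Mode.CONSTRAINED_BEAM_SEARCH
    elif num_beams == 1:
        if do_sample:
            mode = Mode.SAMPLE
        elif top_k is not None and top_k > 1 and penalty_alpha is not None and penalty_alpha > 0:
            mode = Mode.CONTRASTIVE_SEARCH
        else:
            mode = Mode.GREEDY_SEARCH
    elif num_beam_groups > 1:
        mode = Mode.GROUP_BEAM_SEARCH
    elif do_sample:
        mode = Mode.BEAM_SAMPLE
    else:
        mode = Mode.BEAM_SEARCH

    # In this sketch an assistant model is only combined with greedy search or sampling.
    if has_assistant:
        if mode not in (Mode.GREEDY_SEARCH, Mode.SAMPLE):
            raise ValueError("assisted generation is assumed to need greedy search or sampling")
        mode = Mode.ASSISTED_GENERATION
    return mode


if __name__ == "__main__":
    print(pick_mode())                              # defaults select greedy search
    print(pick_mode(num_beams=4, do_sample=True))   # beam sampling
```

Read against the listing, the defaults (`num_beams=1`, `do_sample=False`) land in the branch that calls `self._sample` with no logits warper, which is the greedy case mentioned in the comment for step 13.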

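The constrained beam search branch of the listing accepts `force_words_ids` in two shapes: a flat list of token ids becomes a `PhrasalConstraint`, and a list of alternative token-id lists becomes a `DisjunctiveConstraint`. The standalone sketch below mirrors that validation without importing `transformers`. `classify_force_words` is a hypothetical helper that returns labels instead of constraint objects, and it checks for an empty entry before indexing it, which is a small reordering of the original checks.

```python
from typing import Any, List


def classify_force_words(force_words_ids: Any) -> List[str]:
    """Label each entry as 'phrasal' or 'disjunctive', raising ValueError on bad input."""

    def fail() -> None:
        raise ValueError(
            "`force_words_ids` has to either be a `List[List[List[int]]]` or "
            f"`List[List[int]]` of positive integers, but is {force_words_ids}."
        )

    if not isinstance(force_words_ids, list) or len(force_words_ids) == 0:
        fail()

    kinds: List[str] = []
    for word_ids in force_words_ids:
        # Reordered relative to the listing: validate before touching word_ids[0].
        if not isinstance(word_ids, list) or len(word_ids) == 0:
            fail()
        if isinstance(word_ids[0], list):
            # Nested lists: any one of several alternative phrases may satisfy the constraint.
            if any(not isinstance(token_ids, list) for token_ids in word_ids):
                fail()
            if any(
                any(not isinstance(token_id, int) or token_id < 0 for token_id in token_ids)
                for token_ids in word_ids
            ):
                fail()
            kinds.append("disjunctive")
        else:
            # Flat list: a single phrase that must appear in the output.
            if any(not isinstance(token_id, int) or token_id < 0 for token_id in word_ids):
                fail()
            kinds.append("phrasal")
    return kinds


if __name__ == "__main__":
    print(classify_force_words([[5, 6], [[7], [8, 9]]]))
```

The token ids in the demo call are arbitrary placeholders; in real use they would come from a tokenizer.
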



http://www.chinasem.cn/article/1051742
