{"id":25279,"date":"2025-11-26T10:48:34","date_gmt":"2025-11-26T10:48:34","guid":{"rendered":"https:\/\/pokecon.jp\/job\/?p=25279"},"modified":"2025-11-26T10:48:34","modified_gmt":"2025-11-26T10:48:34","slug":"investigating-fine-tuning-limitations-for-vlms-with-three-case-studies","status":"publish","type":"post","link":"https:\/\/pokecon.jp\/job\/25279\/","title":{"rendered":"Investigating fine-tuning limitations for VLMs with three case studies"},"content":{"rendered":"\n<\/p>\n<div>\n<p>Hello, this is Aur\u00e9lie, working as an Artificial Intelligence Engineer at Ridge-i. Today I would like to share some insights about fine-tuning <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> language models (VLMs)!<\/p>\n<h2 id=\"Introduction\">Introduction<\/h2>\n<p>As interest in <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> language models (VLMs) grows, fine-tuning is increasingly seen as a promising way to adapt models for specific applications. For teams exploring this path, it\u2019s important to take some time to assess whether fine-tuning is the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/most\">most<\/a> suitable solution and whether it\u2019s likely to lead to meaningful improvement, given its high computational cost and the large amounts of data it requires.<\/p>\n<p>In this blog post, we share some examples of internal experiments where fine-tuning failed to <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/improve\">improve<\/a> our model\u2019s capabilities and highlight some key challenges. 
We consider full supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA), which keeps the pretrained model frozen and adds small trainable low-rank matrices to selected weights, so that only a small fraction of the parameters is updated.<\/p>\n<p>Note: All experiments used InternVL2.0 models <a target=\"_blank\" href=\"#ref_1\">[1]<\/a> and were conducted in October 2024. Since then, InternVL3.0 has been released. All experiments were done using a single <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/NVIDIA\">NVIDIA<\/a> A100 <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> with 80 GB of memory.<\/p>\n<h2 id=\"To-fine-tune-or-not-to-fine-tune\">To fine-tune or not to fine-tune<\/h2>\n<blockquote>\n<p><em>Prior to the rise of LLMs, fine-tuning was commonly used for smaller-scale models (100M \u2013 300M parameters). However, with the advent of larger models (&gt; 1B parameters), the question of fine-tuning has become more nuanced.<\/em><\/p>\n<\/blockquote>\n<p><figure><figcaption style=\"text-align: center; font-size: 90%; margin-top: 6px;\">\n    Quote from a Meta blog post: <em>&#8220;To fine-tune or not to fine-tune&#8221;<\/em> <a target=\"_blank\" href=\"#ref_2\">[2]<\/a><br \/>\n  <\/figcaption><\/figure>\n<\/p>\n<p>State-of-the-art models are often released in multiple sizes, with larger models offering the best performance but also requiring significantly more resources and large amounts of high-quality data to fine-tune.<\/p>\n<p>As a result, the first obstacle encountered by organizations looking to fine-tune a large model is the need for sufficient computing infrastructure and a large volume of high-quality data.<\/p>\n<p>Even if computing resources and <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/access\">access<\/a> to data are not an issue, <strong>fine-tuning may not be suitable for all types of tasks.<\/strong> While it is useful for adapting the model\u2019s 
output style, vocabulary, and tone, <strong>it is generally not recommended for injecting external knowledge<\/strong> <a target=\"_blank\" href=\"#ref_2\">[2]<\/a>. This is because LLMs and VLMs are not designed to <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/memorize\">memorize<\/a> and reliably retrieve highly specific facts (e.g. the temperature in a specific place on a specific day), and it is difficult to control or verify whether the knowledge is accurately embedded into the weights during fine-tuning.<\/p>\n<p>Therefore, it\u2019s important to begin by identifying the source of the performance limitations and to assess whether fine-tuning is the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/most\">most<\/a> appropriate solution.<\/p>\n<p>Finally, fine-tuning a large pretrained model with high capabilities always comes with some risk. <strong>Issues such as catastrophic forgetting,<\/strong> where the model loses previously acquired general knowledge, <strong>can lead to degraded performance and a reduced ability to follow instructions.<\/strong><\/p>\n<p>In this blog post, we share three quick fine-tuning experiments on different datasets that failed to <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/improve\">improve<\/a> performance, and we explore the underlying reasons.<\/p>\n<h2 id=\"Which-modules-to-keep-frozen-in-a-VLM-when-fine-tuning\">Which modules to keep frozen in a VLM when fine-tuning<\/h2>\n<p>A VLM is usually composed of three main modules:<\/p>\n<ul>\n<li>\n<p>A <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> encoder, which extracts features from input images<\/p>\n<\/li>\n<li>\n<p>A projector, which maps the visual features into the language model\u2019s embedding space<\/p>\n<\/li>\n<li>\n<p>A language model, which processes both the language tokens and the projected image 
features<\/p>\n<\/li>\n<\/ul>\n<p>When fine-tuning a pretrained model on a new domain-specific dataset, it is common to keep the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> encoder frozen. This is because the visual encoder is typically trained on large-scale image datasets (e.g., ImageNet <a target=\"_blank\" href=\"#ref_3\">[3]<\/a> or LAION <a target=\"_blank\" href=\"#ref_4\">[4]<\/a>) and already captures rich and generalizable visual features. Unless large amounts of training data are available and there is a significant domain gap between the target data and the pretraining data, we recommend keeping it frozen.<\/p>\n<p>By default, InternVL recommends keeping the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> encoder frozen and fine-tuning the projector and the language model <a target=\"_blank\" href=\"#ref_6\">[6]<\/a>. For our experiments, we also keep the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> part frozen.<\/p>\n<p>Below is a figure of a typical VLM architecture where the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> part is kept frozen for the fine-tuning step.<\/p>\n<p><figure class=\"figure-image figure-image-fotolife\" title=\"The vision encoder is kept frozen in a typical VLM architecture [5]\"><span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20250922\/20250922123325.png\" width=\"393\" height=\"499\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span><figcaption>The <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> encoder is kept frozen in a typical VLM architecture <a target=\"_blank\" 
href=\"#ref_5\">[5]<\/a><\/figcaption><\/figure>\n<\/p>\n<p>Alternatively, the projector can also be kept frozen during fine-tuning to help reduce the risk of overfitting when the pretraining and target domains are similar. It can be useful when working with small datasets as the projector is usually small (often just a single layer) which makes it prone to overfitting.<\/p>\n<h2 id=\"Internal-experiments\">Internal experiments<\/h2>\n<p>To explore the limitations of fine-tuning methods like\u00a0full SFT and LoRA, we conducted experiments on three <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a>-language challenges.<\/p>\n<ul>\n<li>\n<p>The DocVQA dataset evaluates content extraction from documents. Our results highlight the importance of identifying whether performance limitations originate from the language model or the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> encoder, since fine-tuning typically targets only the language component.<\/p>\n<\/li>\n<li>\n<p>The AI2D task involves diagram question answering. Our experiments highlight the impact of format mismatch between training and test data, as well as the risk of catastrophic forgetting.<\/p>\n<\/li>\n<li>\n<p>COCO Captions is a dataset designed for the image captioning task. Our experiments point out issues with metric interpretation and the effects of low-quality training data.<\/p>\n<\/li>\n<\/ul>\n<h3 id=\"The-case-of-DocVQA\">The case of DocVQA<\/h3>\n<p>DocVQA <a target=\"_blank\" href=\"#ref_7\">[7]<\/a> is a multimodal open-source dataset for content extraction. It contains about 12k images and 50k high-level questions about the documents. 
The questions are relatively simple and aim to check whether the content was correctly extracted from the image.<\/p>\n<p>Below is an example of a DocVQA image and associated questions.<\/p>\n<p><figure class=\"figure-image figure-image-fotolife\" title=\"Example of a DocVQA sample\"><span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20250922\/20250922123518.png\" width=\"315\" height=\"308\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span><figcaption>Example of a DocVQA sample<\/figcaption><\/figure>\n<\/p>\n<p>This dataset is evaluated using the Average Normalized Levenshtein Similarity (ANLS) metric <a target=\"_blank\" href=\"#ref_8\">[8]<\/a> which measures how similar a predicted answer is to the ground truth based on character-level edits. Higher scores mean more accurate text matching.<\/p>\n<p>In the table below, we compare the performance and resource usage of the pretrained models to our fine-tuned models using LoRA.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 4px;\">\n    <em>With LoRA using DocVQA 10k training set<br \/>Results are for the validation set<br \/>Default parameters: Batch = 16, Per_device = 1<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Model<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">1B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">2B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">4B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">8B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">26B<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Architecture<\/td>\n<td style=\"padding:6px 10px; 
border:1px solid #ddd;\"><small>internvl2_1b_qwen2_0_5b<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_2b_internlm2_1_8b<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_4b_phi3_3_8b<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_8b_internlm2_7b<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_26b_internlm2_20b<\/small><\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Pretrained ANLS (val)<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.7883<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.8466<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.8699<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.8972<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.9058<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">LoRA ANLS (val)<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.7919<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.8405<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.8679<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.8899<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.8991<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Difference<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">+0.0036<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">\u22120.0061<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">\u22120.0020<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">\u22120.0073<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">\u22120.0067<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 
10px; border:1px solid #ddd;\"><a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> Usage<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">39711\u00a0MiB<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">30963\u00a0MiB<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">18037\u00a0MiB<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">43029\u00a0MiB<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">78877\u00a0MiB<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Training Time<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">43\u00a0min<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">53\u00a0min<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">1h31m<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">2h19m<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">6h47m<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>Below are the results for the full SFT method.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 4px;\">\n    <em>Full SFT results on DocVQA 10k train set<br \/>We use default parameters with Batch = 128, Per_device = 4<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Model<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">1B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">2B<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Architecture<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_1b_qwen2_0_5b<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_2b_internlm2_1_8b<\/small><\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Pretrained ANLS 
(val)<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.7883<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.8466<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Fine-tuned ANLS (val)<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.7959<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.8202<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Difference<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">+0.0076<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">\u22120.0264<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> Usage<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">72423\u00a0MiB<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">76663\u00a0MiB<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Training Time<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">34\u00a0min<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">51\u00a0min<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>Unfortunately, there is almost no change in performance by using the proposed fine-tuning methods. 
There are several possible explanations.<\/p>\n<p>First, since both the dataset (text documents) and the task (text extraction, similar to <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/OCR\">OCR<\/a>) are very general, the pretrained models already perform quite well, and there is little reason to expect them to be lacking, as they have been trained on large amounts of similar data.<\/p>\n<p>Second, this task relies heavily on challenging content extraction (some documents are handwritten or hard to read), while the language task (answering simple questions from the extracted text) is simpler. As a result, there is a discrepancy between what the model is learning during fine-tuning (mainly language patterns) and what is actually being evaluated (successful visual content extraction).<\/p>\n<p><strong>Finally, since fine-tuning a VLM typically assumes the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> encoder remains frozen and only the language component is updated, performance won&#8217;t <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/improve\">improve<\/a> when the vision encoder is the true bottleneck. 
<\/strong><\/p>\n<h3 id=\"The-case-of-AI2D\">The case of AI2D<\/h3>\n<p>The AI2D dataset <a target=\"_blank\" href=\"#ref_9\">[9]<\/a> contains over 5,000 grade school science diagrams and 15,000 multiple-choice questions for research on diagram understanding and question answering.<\/p>\n<p><figure class=\"figure-image figure-image-fotolife\" title=\"Example of AI2D sample image and question\"><span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20250925\/20250925133504.png\" width=\"453\" height=\"222\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span><figcaption>Example of AI2D sample image and question<\/figcaption><\/figure>\n<\/p>\n<p>First, we try full SFT using the default parameters.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 4px;\">\n    <em>LoRA and full SFT results on AI2D<\/em><br \/>\n  <\/caption>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Model<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">1B<\/th>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Architecture<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_1b_qwen2_0_5b<\/small><\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Learning rate<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">4e-5<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Pretrained ANLS (test)<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.644<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Fine-tuning ANLS (test)<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.0<\/td>\n<\/tr>\n<\/table><\/div>\n<p>Unfortunately, the test accuracy <a 
target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/drops\">drops<\/a> to 0.<\/p>\n<p>Looking at the actual test outputs, we noticed that the fine-tuned model had a reduced ability to follow instructions. Even when explicitly prompted to answer only with the correct option\u2019s letter, it often returned the content of the option instead.<\/p>\n<pre class=\"code lang-json\" data-lang=\"json\" data-unlink=\"\"><span class=\"synSpecial\">{<\/span>\n  \"<span class=\"synStatement\">question<\/span>\": \"<span class=\"synConstant\">What is between the head and abdomen?<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">A. Antenna<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">B. Simple eye<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">C. Spiracle<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">D. Thorax<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">Answer with the option's letter from the given choices directly.<\/span>\",\n  \"<span class=\"synStatement\">image<\/span>\": \"<span class=\"synConstant\">345802<\/span>\",\n  \"<span class=\"synStatement\">answer<\/span>\": \"<span class=\"synConstant\">Thorax<\/span>\",\n  \"<span class=\"synStatement\">annotation<\/span>\": \"<span class=\"synConstant\">D<\/span>\"\n<span class=\"synSpecial\">}<\/span>\n<\/pre>\n<p style=\"font-size: 90%; text-align: center; margin-top: 4px;\">\n  <em>Example of a fine-tuned model output<br \/>Although the answer is correct, the model does not follow the expected format<\/em>\n<\/p>\n<p>The main explanation for this issue is that the training and test sets use different formats.<\/p>\n<pre class=\"code lang-json\" data-lang=\"json\" data-unlink=\"\"><span class=\"synSpecial\">{<\/span>\n\"<span class=\"synStatement\">id<\/span>\": <span class=\"synConstant\">1<\/span>, \n\"<span class=\"synStatement\">image<\/span>\": \"<span 
class=\"synConstant\">images\/7.png<\/span>\", \n\"<span class=\"synStatement\">conversations<\/span>\": \n    <span class=\"synSpecial\">[{<\/span>\"<span class=\"synStatement\">from<\/span>\": \"<span class=\"synConstant\">human<\/span>\", \"<span class=\"synStatement\">value<\/span>\": \"<span class=\"synConstant\">&lt;image&gt;<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">Which plant has leaves modified into spikes?Smilax<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">Banayan tree<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">Utricularia<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">Cactus Please answer the question based on the options mentioned before.<\/span>\"<span class=\"synSpecial\">}<\/span>, \n    <span class=\"synSpecial\">{<\/span>\"<span class=\"synStatement\">from<\/span>\": \"<span class=\"synConstant\">gpt<\/span>\", \"<span class=\"synStatement\">value<\/span>\": \"<span class=\"synConstant\">Cactus<\/span>\"<span class=\"synSpecial\">}]<\/span>\n<span class=\"synSpecial\">}<\/span>\n<\/pre>\n<p style=\"font-size: 90%; text-align: center; margin-top: 4px;\">\n  <em>Example of a training sample<br \/>The model is expected to return the text of the correct answer directly (e.g., &#8220;Cactus&#8221;), not the option number or letter<\/em>\n<\/p>\n<pre class=\"code lang-json\" data-lang=\"json\" data-unlink=\"\"><span class=\"synSpecial\">{<\/span>\n\"<span class=\"synStatement\">id<\/span>\": <span class=\"synConstant\">345802<\/span>, \n\"<span class=\"synStatement\">image<\/span>\": \"<span class=\"synConstant\">data\/ai2diagram\/AI2D_TEST\/345802.jpg<\/span>\", \n\"<span class=\"synStatement\">question<\/span>\": \"<span class=\"synConstant\">What is between the head and abdomen?<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">A. Antenna<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">B. 
Simple eye<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">C. Spiracle<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">D. Thorax<\/span><span class=\"synSpecial\">\\n<\/span><span class=\"synConstant\">Answer with the option's letter from the given choices directly.<\/span>\", \n\"<span class=\"synStatement\">question_id<\/span>\": \"<span class=\"synConstant\">345802<\/span>\", \n\"<span class=\"synStatement\">answer<\/span>\": \"<span class=\"synConstant\">D<\/span>\", \n\"<span class=\"synStatement\">category<\/span>\": \"<span class=\"synConstant\">partsOfA<\/span>\", \n\"<span class=\"synStatement\">abcLabel<\/span>\": \"<span class=\"synConstant\">False<\/span>\"\n<span class=\"synSpecial\">}<\/span>\n<\/pre>\n<p style=\"font-size: 90%; text-align: center; margin-top: 4px;\">\n  <em>Example of a test sample<br \/>The model is expected to answer with the letter corresponding to the correct answer (e.g. &#8220;D&#8221;)<br \/>The dataset was downloaded from the InternVL documentation and comes preprocessed for their format, which may differ from versions available on other platforms like Hugging Face<br \/>\n<\/em>\n<\/p>\n<p>This suggests that the model, having been trained on data where answers are given as the full option text, has learned to consistently respond with the answer text rather than the option letter, even when explicitly asked for the letter.<\/p>\n<p>To prevent the model from overfitting to the training data\u2019s format, we reduce the learning rate and repeat the experiments with LoRA and full SFT.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 4px;\">\n    <em>LoRA and full SFT results on AI2D with a reduced learning rate<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Model<\/th>\n<th style=\"padding:6px 10px; border:1px solid 
#ddd;\">1B<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Architecture<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_1b_qwen2_0_5b<\/small><\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Pretrained ANLS (test)<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.644<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">LoRA ANLS (test)<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.615<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Fine-tuning ANLS (test)<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.639 with a small lr (=1e-6)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>This time, we confirm that the model retains its ability to follow the format instruction. However, performance still <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/drops\">drops<\/a> with both LoRA (0.644 \u2192 0.615) and full SFT (0.644 \u2192 0.639).<\/p>\n<p>Suspecting that one reason for the reduced performance is catastrophic forgetting, we repeated the experiment while including part of the general-domain data originally used to train the pretrained weights.<\/p>\n<p>This strategy to limit catastrophic forgetting is widely used when fine-tuning large models (e.g. Swallow <a target=\"_blank\" href=\"#ref_10\">[10]<\/a>, InternVL <a target=\"_blank\" href=\"#ref_1\">[1]<\/a>) 
on domain-specific data, with the goal of enhancing downstream capabilities while retaining the foundational skills <a target=\"_blank\" href=\"#ref_6\">[6]<\/a>.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 4px;\">\n    <em>LoRA results on AI2D when using additional general data<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Model<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">1B<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Architecture<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\"><small>internvl2_1b_qwen2_0_5b<\/small><\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">LoRA ANLS (test)<br \/><small>Train: AI2D + General data<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.626<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Training Time<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">26h13min<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>Although using additional general data during fine-tuning seems to limit catastrophic forgetting, the performance of the fine-tuned model is still worse than that of the pretrained model. More advanced mitigation techniques exist, but they require significant changes to the training pipeline and a lot of setup effort.<\/p>\n<p>In conclusion, when fine-tuning a model on domain-specific datasets, <strong>it is crucial to ensure that the training data is of high quality and closely matches the format and expectations of the test set. 
Additionally, it is important to consider methods to limit catastrophic forgetting<\/strong>, which inevitably make the training process more complex and more time- and resource-intensive.<\/p>\n<h3 id=\"The-case-of-COCO-Captions\">The case of COCO Captions<\/h3>\n<p>The COCO Captions dataset <a target=\"_blank\" href=\"#ref_11\">[11]<\/a> contains over one and a half million captions describing over 330,000 images. For each image of the training and validation sets, five independent crowdsourced captions are provided.<\/p>\n<p><figure class=\"figure-image figure-image-fotolife\" title=\"COCO dataset example for image captioning\"><span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20251001\/20251001151222.png\" width=\"487\" height=\"187\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span><figcaption>COCO dataset example for image captioning <a target=\"_blank\" href=\"#ref_11\">[11]<\/a> <br \/>Five crowdsourced captions are provided for each sample<\/figcaption><\/figure>\n<\/p>\n<p>First, we confirm that we can reproduce the InternVL authors\u2019 LoRA results. We also include the full SFT results, which are not provided by the authors. 
The metrics are the same as used by InternVL.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 6px;\">\n    <em>Comparison of LoRA on InternVL2.0-2B to baseline and authors\u2019 results<br \/>Baseline = pretrained 2B model<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Metric<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_1<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_2<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_3<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_4<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">METEOR<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Rouge<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">CIDEr<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Baseline<br \/><small>(reproduced by RI)<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.640<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.463<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.321<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.214<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.267<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.504<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.793<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">LoRA<br \/><small>(by InternVL authors)<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.805<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.649<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.504<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.385<\/td>\n<td style=\"padding:6px 10px; 
border:1px solid #ddd;\">0.300<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.595<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">1.312<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">LoRA<br \/><small>(by RI)<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.804<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.649<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.501<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.382<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.299<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.594<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">1.305<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">SFT<br \/><small>(by RI)<\/small><\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.806<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.652<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.508<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.392<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.305<\/td>\n<td style=\"background-color: #e6f9e6;padding:6px 10px; border:1px solid #ddd;\">0.601<\/td>\n<td style=\"background-color: #e6f9e6;padding:6px 10px; border:1px solid #ddd;\">1.339<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>In terms of the metrics, we can clearly see a significant improvement for the fine-tuned models compared to the pretrained model.<\/p>\n<p><strong>However, can we truly say that the model performance has improved?<\/strong><\/p>\n<p>First, it is important to understand the BLEU (BiLingual Evaluation Understudy) metric <a target=\"_blank\" href=\"#ref_12\">[12]<\/a>. 
This metric was originally used to assess the quality of a translated text by comparing the machine translation to a ground-truth human-made translation.<\/p>\n<p>According to the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/Google\">Google<\/a> AutoML documentation <a target=\"_blank\" href=\"#ref_13\">[13]<\/a>, there are several points to be careful about when using BLEU.<\/p>\n<ul>\n<li>\n<p><strong>BLEU is a corpus-based metric.<\/strong> It performs badly when used to evaluate individual sentences. It is mainly used to compare whether two corpora are similar (length, vocabulary, etc.).<\/p>\n<\/li>\n<li>\n<p><strong>There is no distinction between content and function words.<\/strong> A dropped function word like &#8220;a&#8221; gets the same penalty as if the name &#8220;<a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/NASA\">NASA<\/a>&#8221; was erroneously replaced with &#8220;<a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/ESA\">ESA<\/a>&#8221;.<\/p>\n<\/li>\n<li>\n<p><strong>Not good at capturing the meaning and grammaticality of a sentence.<\/strong> Dropping a single word like &#8220;not&#8221; can reverse the meaning of a sentence. However, BLEU imposes only a small penalty, since it treats all words equally and counts this as a one-word difference.<\/p>\n<\/li>\n<\/ul>\n<p>As a result, it is possible to achieve a very high BLEU score using sentences that don\u2019t make sense or are opposite in meaning. For example, the following candidate reaches a clipped unigram precision (the quantity underlying BLEU-1, before the brevity penalty) of 0.8, because four of its five words are matched in the reference:<\/p>\n<p><strong>Reference:<\/strong> <code>the cat is on the mat<\/code><\/p>\n<p><strong>Candidate:<\/strong> <code>the the the cat mat<\/code><\/p>\n<p>In our case, the ground-truth annotations used as reference for BLEU are crowdsourced. 
Therefore, the descriptions are usually short and may have issues such as typos or grammatical errors.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; width: 90%;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 6px;\">\n    <em>Example of a crowdsourced annotation<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd; text-align:center; width: 40%;\">Image<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">Crowdsourced annotation<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">\n        <span style=\"display:inline-block; height:100px;\"><span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20251001\/20251001152209.png\" width=\"150\" height=\"100\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span><br \/>\n      <\/span><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left; white-space: pre-line;\">\n        &#8220;some dessert is laying out on a yellow and white plate&#8221;,<br \/>\n        <span style=\"color: red;\">\u2192 Punctuation and caps issues<\/span><br \/>\n        &#8220;A plate containing a slice of dessert, two forks and some piped cream&#8221;,<br \/>\n        &#8220;Pastry sitting on top of a golden white plate with forks.&#8221;,<br \/>\n        <span style=\"color: red;\">\u2192 Not detailed enough<\/span><br \/>\n        &#8220;Two forks on a plate of cake and cream.&#8221;,<br \/>\n        <span style=\"color: red;\">\u2192 Not detailed enough<\/span><br \/>\n        &#8220;THIS IS A PHOTO OF A DESERT PLATE FOR TWO&#8221;<br \/>\n        <span style=\"color: red;\">\u2192 All caps, includes a <a target=\"_blank\" class=\"keyword\" 
href=\"https:\/\/d.hatena.ne.jp\/keyword\/typo\">typo<\/a> &#8220;dessert&#8221; \u2192 &#8220;desert&#8221; and is not detailed enough<\/span>\n      <\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>After inspecting the outputs of the fine-tuned models, we find that their answers do appear closer to the training data. In several cases, the fine-tuned models end up producing short, undetailed descriptions, much like the crowdsourced annotations, as shown in the table below.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; width: 95%;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 6px;\">\n    <em>Comparison of Pretrained 2B, SFT-2B and LoRA-8B models on a few examples<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd; text-align:center; width:20%;\">Input image<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">Pretrained 2B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">SFT 2B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">LoRA 8B<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">\n        <span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20251001\/20251001152209.png\" width=\"150\" height=\"100\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A dessert plate with a slice of cake, two scoops of ice cream, and a spoon.\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A piece of cake with whipped cream and chocolate sauce.<br \/><span style=\"color: green;\">\u2192 
best, according to human evaluation<\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A plate of food with a fork on it.\n      <\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">\n        <span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20251001\/20251001152614.png\" width=\"150\" height=\"109\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A man in a cowboy hat rides a horse down a street, with people watching.<br \/><span style=\"color: green;\">\u2192 best, according to human evaluation<\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A man riding a horse down a street next to a crowd of people.\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A man riding a horse down a street.\n      <\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">\n        <span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20251001\/20251001152650.png\" width=\"100\" height=\"150\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A bed with colorful bedding and pillows is set up in a room covered with blue plastic sheets.<br \/><span style=\"color: green;\">\u2192 best, according to human evaluation<\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A bed in a room with blue curtains and a blue sheet.\n      
<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A bed with a blue cover and a blue curtain.\n      <\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:center;\">\n        <span itemscope=\"\" itemtype=\"http:\/\/schema.org\/Photograph\"><img decoding=\"async\" src=\"https:\/\/cdn-ak.f.st-hatena.com\/images\/fotolife\/r\/rouliiiie\/20251001\/20251001152719.png\" width=\"150\" height=\"113\" loading=\"lazy\" title=\"\" class=\"hatena-fotolife\" itemprop=\"image\"\/><\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A collection of colorful street art is displayed on a wooden fence, with a stop <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/sign\">sign<\/a> and a cityscape illustration.<br \/><span style=\"color: green;\">\u2192 best, according to human evaluation<\/span>\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A row of paintings and a stop <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/sign\">sign<\/a> on a fence.\n      <\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; text-align:left;\">\n        A bunch of paintings are leaning against a fence.\n      <\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>In this case, neither LoRA nor full SFT significantly <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/improve\">improves<\/a> performance according to human evaluation, even when using larger architectures (e.g., LoRA 8B).<\/p>\n<p>Note: We were unable to include SFT results for 8B due to <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> memory constraints.<\/p>\n<p>The results suggest that the fine-tuned models have <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/indeed\">indeed<\/a> adapted to 
the training data distribution. <strong>However, since the training data quality is poor to begin with<\/strong> (typos, grammatical errors, and generally short or inaccurate descriptions)<strong>, the actual performance of the model has degraded.<\/strong> Evaluation metrics may appear high only because the model&#8217;s outputs are more similar to the low-quality ground-truth annotations.<\/p>\n<h2 id=\"Small-model--SFT-vs-Large-model--LoRA\">Small model + SFT vs. Large model + LoRA?<\/h2>\n<p>In this section, we briefly discuss the trade-offs between using a small model with full SFT and using a larger model with lightweight adaptation methods such as LoRA.<\/p>\n<p>Full fine-tuning, which updates all the model&#8217;s parameters, requires significant computing and memory resources. In comparison, LoRA adds small trainable adapter layers to the model while keeping the original weights frozen, making it faster and more affordable to train.<\/p>\n<p>Below is a comparison table of our <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> memory usage during our experiments.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 6px;\">\n    <em>Comparison of the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> memory usage (MiB) on an <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/NVIDIA\">NVIDIA<\/a> A100 <br \/>Note: Because of optimization techniques and fragmentation strategies, a bigger model may take less memory than some smaller models (e.g., LoRA on 4B vs. 1B)<br \/>\n<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Model size<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">1B<\/th>\n<th style=\"padding:6px 10px; border:1px solid 
#ddd;\">2B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">4B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">8B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">26B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">40B<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">76B<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Inference<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">6167<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">8361<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">12497<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">21745<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">56609<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">81145<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; color: red;\">Out of memory<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">LoRA<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">39711<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">30963<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">18037<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">43029<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">78877<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; color: red;\">Out of memory<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; color: red;\">Out of memory<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Fine-tuning<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">72423<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">76663<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; color: red;\">Out of memory<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; color: red;\">Out of memory<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; 
color: red;\">Out of memory<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; color: red;\">Out of memory<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd; color: red;\">Out of memory<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>As the table shows, full fine-tuning is very resource-intensive: our 80GB <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> can only support full fine-tuning up to the 2B model.<\/p>\n<p><strong>When hardware is limited, it&#8217;s important to weigh the trade-offs among using a large pretrained model as is, fine-tuning a mid-sized model with LoRA, and applying full SFT to a smaller model.<\/strong><\/p>\n<p>The largest InternVL2.0 model that fits on our 80GB A100 is 26B for LoRA and 2B for full SFT.<br \/>\nFor the sake of this experiment, we compare the performance of LoRA-8B with SFT-2B on the COCO Captions dataset <a target=\"_blank\" href=\"#ref_11\">[11]<\/a>.<\/p>\n<div class=\"s_table\"><table style=\"margin: 0 auto; border-collapse: collapse; text-align: center;\">\n<caption style=\"caption-side: bottom; font-size: 90%; margin-top: 6px;\">\n    <em>Comparison of LoRA-8B and SFT-2B with InternVL2.0 on the COCO Captions dataset<\/em><br \/>\n  <\/caption>\n<thead>\n<tr>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Metric<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_1<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_2<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_3<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Bleu_4<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">METEOR<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">Rouge<\/th>\n<th style=\"padding:6px 10px; border:1px solid #ddd;\">CIDEr<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Baseline 2B<br \/><small>(by RI)<\/small><\/td>\n<td 
style=\"padding:6px 10px; border:1px solid #ddd;\">0.640<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.463<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.321<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.214<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.267<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.504<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.793<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">Baseline 8B<br \/><small>(by RI)<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.660<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.491<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.351<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.245<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.285<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.530<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.892<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">LoRA 8B<br \/><small>(by RI)<\/small><\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.814<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.660<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.517<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.398<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.305<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.604<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">1.358<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">SFT 2B<br \/><small>(by RI)<\/small><\/td>\n<td style=\"padding:6px 10px; border:1px solid 
#ddd;\">0.806<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.652<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.508<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.392<\/td>\n<td style=\"background-color: #e6f9e6; padding:6px 10px; border:1px solid #ddd;\">0.305<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">0.601<\/td>\n<td style=\"padding:6px 10px; border:1px solid #ddd;\">1.339<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p>Despite requiring less <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/GPU\">GPU<\/a> memory (43,029 MiB vs. 76,663 MiB in our measurements), LoRA 8B still matches or outperforms SFT 2B on every metric, i.e., it adapts better to the target dataset.<\/p>\n<p>If resources are limited and fine-tuning is necessary, it may be more practical to use a lightweight fine-tuning method that allows us to use larger pretrained architectures within the same resource constraints.<\/p>\n<p>Although we did not explore it in this study, even larger pretrained models can be considered with methods that do not require updating model weights. For example, in-context learning adds a few examples to the prompt to guide the model toward the desired output format and tone.<\/p>\n<h2 id=\"Conclusion\">Conclusion<\/h2>\n<p>In this blog post, we explored some key challenges of fine-tuning VLMs.<\/p>\n<p>Fine-tuning is often seen as the default choice to <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/improve\">improve<\/a> model performance. 
However, this assumption can be misleading, and fine-tuning is frequently more complex and resource-intensive than one may expect.<\/p>\n<p>There are many reasons why fine-tuning can fail to bring improvement.<\/p>\n<ul>\n<li>\n<p>The choice of fine-tuning may not be appropriate<\/p>\n<ul>\n<li>\n<p>Fine-tuning is not appropriate for learning external knowledge without a tremendous amount of data.<\/p>\n<\/li>\n<li>\n<p>The bottleneck is the image part of the architecture rather than the language part.<\/p>\n<\/li>\n<li>\n<p>The task is very general (e.g., captioning or summarization) and there is no reason to think that the pretrained model\u2019s training is insufficient.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p>The training data is not prepared carefully<\/p>\n<ul>\n<li>\n<p>The training data is too different from the test data and is not appropriate for the intended use of the model.<\/p>\n<\/li>\n<li>\n<p>The training data is of poor quality and cannot properly teach the model. The purpose of fine-tuning is not to &#8220;<a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/improve\">improve<\/a>&#8221; the model but to align it more closely with the distribution of the training data.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<li>\n<p>The training process is difficult<\/p>\n<ul>\n<li>\n<p>General knowledge is lost due to catastrophic forgetting.<\/p>\n<\/li>\n<li>\n<p>The ability to follow instructions or answer questions is lost despite the model successfully adapting to the new data.<\/p>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Given the high cost and the significant risk of degraded performance, we strongly recommend carefully evaluating whether <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/alternative\">alternative<\/a> approaches (such as prompt tuning, retrieval-augmented generation (RAG), etc.) 
may be more suitable and whether basic requirements (amount and quality of the available data) are met.<\/p>\n<h2 id=\"References\">References<\/h2>\n<p><a target=\"_blank\" id=\"ref_1\"\/>[1] Chen, Zhe, et al. &#8220;Internvl: Scaling up <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/vision\">vision<\/a> foundation models and aligning for generic visual-linguistic tasks.&#8221; Proceedings of the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/IEEE\">IEEE<\/a>\/CVF conference on <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/computer%20vision\">computer vision<\/a> and pattern recognition. 2024. url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2312.14238\">https:\/\/arxiv.org\/abs\/2312.14238<\/a>.<\/p>\n<p><a target=\"_blank\" id=\"ref_2\"\/>[2] Aditya Jain. \u201cTo fine-tune or not to fine-tune.\u201d 2024. url: <a target=\"_blank\" href=\"https:\/\/ai.meta.com\/blog\/when-to-fine-tune-llms-vs-other-techniques\/\">https:\/\/ai.meta.com\/blog\/when-to-fine-tune-llms-vs-other-techniques\/<\/a>.\u00a0<\/p>\n<p><a target=\"_blank\" id=\"ref_3\"\/>[3] Russakovsky, Olga, et al. &#8220;Imagenet large scale visual recognition challenge.&#8221; International journal of <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/computer%20vision\">computer vision<\/a>. 2015. url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1409.0575\">https:\/\/arxiv.org\/abs\/1409.0575<\/a>.<\/p>\n<p><a target=\"_blank\" id=\"ref_4\"\/>[4] Schuhmann, Christoph, et al. &#8220;Laion-5b: An open large-scale dataset for training next generation image-text models.&#8221; Advances in neural information processing systems. 2022. url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2210.08402\">https:\/\/arxiv.org\/abs\/2210.08402<\/a>.<\/p>\n<p><a target=\"_blank\" id=\"ref_5\"\/>[5] Hugging Face Blog. 
&#8220;<a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/Vision\">Vision<\/a> Language Models Explained.&#8221; 2025. url: <a target=\"_blank\" href=\"https:\/\/github.com\/huggingface\/blog\/blob\/main\/vlms.md\">https:\/\/github.com\/huggingface\/blog\/blob\/main\/vlms.md<\/a>.\u00a0<\/p>\n<p><a target=\"_blank\" id=\"ref_6\"\/>[6] InternVL Authors. &#8220;Fine-tune on a Custom Dataset.&#8221; 2025. url: <a target=\"_blank\" href=\"https:\/\/internvl.readthedocs.io\/en\/latest\/internvl2.0\/finetune.html\">https:\/\/internvl.readthedocs.io\/en\/latest\/internvl2.0\/finetune.html<\/a>.<\/p>\n<p><a target=\"_blank\" id=\"ref_7\"\/>[7] Mathew, Minesh, Dimosthenis Karatzas, and C. V. Jawahar. &#8220;Docvqa: A dataset for vqa on document images.&#8221; Proceedings of the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/IEEE\">IEEE<\/a>\/CVF winter conference on applications of <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/computer%20vision\">computer vision<\/a>. 2021. url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2007.00398\">https:\/\/arxiv.org\/abs\/2007.00398<\/a>.<\/p>\n<p><a target=\"_blank\" id=\"ref_8\"\/>[8] Biten, Ali Furkan, et al. &#8220;Scene text visual question answering.&#8221; Proceedings of the <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/IEEE\">IEEE<\/a>\/CVF international conference on <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/computer%20vision\">computer vision<\/a>. 2019. url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1905.13648\">https:\/\/arxiv.org\/abs\/1905.13648<\/a>.\u00a0<\/p>\n<p><a target=\"_blank\" id=\"ref_9\"\/>[9] Kembhavi, Aniruddha, et al. 
&#8220;A diagram is worth a dozen images.&#8221; <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/Computer%20Vision\">Computer Vision<\/a>\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11\u201314, 2016, Proceedings, Part IV 14. Springer International Publishing, 2016. url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1603.07396\">https:\/\/arxiv.org\/abs\/1603.07396<\/a>.\u00a0<\/p>\n<p><a target=\"_blank\" id=\"ref_10\"\/>[10] Fujii, <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/Kazuki\">Kazuki<\/a>, et al. &#8220;Continual pre-training for cross-lingual llm adaptation: Enhancing japanese language capabilities.&#8221; <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/arXiv\">arXiv<\/a> preprint <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/arXiv\">arXiv<\/a>:2404.17790 (2024). url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2404.17790\">https:\/\/arxiv.org\/abs\/2404.17790<\/a>.\u00a0<\/p>\n<p><a target=\"_blank\" id=\"ref_11\"\/>[11] Chen, Xinlei, et al. &#8220;<a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/Microsoft\">Microsoft<\/a> coco captions: Data collection and evaluation server.&#8221; <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/arXiv\">arXiv<\/a> preprint <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/arXiv\">arXiv<\/a>:1504.00325 (2015). url: <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1504.00325\">https:\/\/arxiv.org\/abs\/1504.00325<\/a>.\u00a0<\/p>\n<p><a target=\"_blank\" id=\"ref_12\"\/>[12] Papineni, Kishore, et al. &#8220;Bleu: a method for automatic evaluation of machine translation.&#8221; Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002. 
url: <a target=\"_blank\" href=\"https:\/\/aclanthology.org\/P02-1040.pdf\">https:\/\/aclanthology.org\/P02-1040.pdf<\/a>.<\/p>\n<p><a target=\"_blank\" id=\"ref_13\"\/>[13] <a target=\"_blank\" class=\"keyword\" href=\"https:\/\/d.hatena.ne.jp\/keyword\/Google\">Google<\/a> Cloud documentation. &#8220;Understanding the BLEU score.&#8221; 2025. url: <a target=\"_blank\" href=\"https:\/\/cloud.google.com\/translate\/docs\/advanced\/automl-evaluate#bleu\">https:\/\/cloud.google.com\/translate\/docs\/advanced\/automl-evaluate#bleu<\/a>.<\/p>\n<\/div>\n<p><a href=\"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501\">View the original article<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"Hello, this is Aur\u00e9lie, working as an Artificial Intelligence Engineer at Ridge-i. 
Today I would like to share [&hellip;]","protected":false},"author":1,"featured_media":25280,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-25279","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-company-tec"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Investigating fine-tuning limitations for VLMs with three case studies - \u30dd\u30b1\u30b3\u30f3<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501\" \/>\n<meta property=\"og:locale\" content=\"ja_JP\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Investigating fine-tuning limitations for VLMs with three case studies - \u30dd\u30b1\u30b3\u30f3\" \/>\n<meta property=\"og:description\" content=\"Hello, this is Aur\u00e9lie, working as an Artificial Intelligence Engineer at Ridge-i. 
Today I would like to share [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501\" \/>\n<meta property=\"og:site_name\" content=\"\u30dd\u30b1\u30b3\u30f3\" \/>\n<meta property=\"article:published_time\" content=\"2025-11-26T10:48:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/pokecon.jp\/job\/wp-content\/uploads\/2025\/11\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1300\" \/>\n\t<meta property=\"og:image:height\" content=\"1651\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"info@pokecon.jp\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u57f7\u7b46\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"info@pokecon.jp\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593\" \/>\n\t<meta name=\"twitter:data2\" content=\"18\u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/25279\\\/\"},\"author\":{\"name\":\"info@pokecon.jp\",\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/#\\\/schema\\\/person\\\/16c9f07b1ba984d165d9aee259bda997\"},\"headline\":\"Investigating fine-tuning limitations for VLMs with three case 
studies\",\"datePublished\":\"2025-11-26T10:48:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/25279\\\/\"},\"wordCount\":3531,\"image\":{\"@id\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png\",\"articleSection\":[\"\u4f01\u696d\u30c6\u30c3\u30af\"],\"inLanguage\":\"ja\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/25279\\\/\",\"url\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501\",\"name\":\"Investigating fine-tuning limitations for VLMs with three case studies - \u30dd\u30b1\u30b3\u30f3\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png\",\"datePublished\":\"2025-11-26T10:48:34+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/#\\\/schema\\\/person\\\/16c9f07b1ba984d165d9aee259bda997\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501#breadcrumb\"},\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501#primaryimage\",\"url\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/https3A2F2Fcdn-ak.f.st-hatena.com2
Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png\",\"contentUrl\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png\",\"width\":1300,\"height\":1651},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/iblog.ridge-i.com\\\/entry\\\/2025\\\/11\\\/26\\\/184501#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u30db\u30fc\u30e0\",\"item\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Investigating fine-tuning limitations for VLMs with three case studies\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/#website\",\"url\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/\",\"name\":\"\u30dd\u30b1\u30b3\u30f3\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ja\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/#\\\/schema\\\/person\\\/16c9f07b1ba984d165d9aee259bda997\",\"name\":\"info@pokecon.jp\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/2b0549cd9f7907c092ca5fbb283baf72337f235726e4b46fa39ec0b701ac2fe2?s=96&d=wavatar&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/2b0549cd9f7907c092ca5fbb283baf72337f235726e4b46fa39ec0b701ac2fe2?s=96&d=wavatar&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/2b0549cd9f7907c092ca5fbb283baf72337f235726e4b46fa39ec0b701ac2fe2?s=96&d=wavatar&r=g\",\"caption\":\"info@pokecon.jp\"},\"url\":\"https:\\\/\\\/pokecon.jp\\\/job\\\/author\\\/infopokecon-jp\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Investigating fine-tuning limitations for VLMs with three case studies - \u30dd\u30b1\u30b3\u30f3","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501","og_locale":"ja_JP","og_type":"article","og_title":"Investigating fine-tuning limitations for VLMs with three case studies - \u30dd\u30b1\u30b3\u30f3","og_description":"Hello, this is Aur\u00e9lie, working as an Artificial Intelligence Engineer at Ridge-i. Today I would like to share [&hellip;]","og_url":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501","og_site_name":"\u30dd\u30b1\u30b3\u30f3","article_published_time":"2025-11-26T10:48:34+00:00","og_image":[{"width":1300,"height":1651,"url":"https:\/\/pokecon.jp\/job\/wp-content\/uploads\/2025\/11\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png","type":"image\/png"}],"author":"info@pokecon.jp","twitter_card":"summary_large_image","twitter_misc":{"\u57f7\u7b46\u8005":"info@pokecon.jp","\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593":"18\u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501#article","isPartOf":{"@id":"https:\/\/pokecon.jp\/job\/25279\/"},"author":{"name":"info@pokecon.jp","@id":"https:\/\/pokecon.jp\/job\/#\/schema\/person\/16c9f07b1ba984d165d9aee259bda997"},"headline":"Investigating fine-tuning limitations for VLMs with three case 
studies","datePublished":"2025-11-26T10:48:34+00:00","mainEntityOfPage":{"@id":"https:\/\/pokecon.jp\/job\/25279\/"},"wordCount":3531,"image":{"@id":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501#primaryimage"},"thumbnailUrl":"https:\/\/pokecon.jp\/job\/wp-content\/uploads\/2025\/11\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png","articleSection":["\u4f01\u696d\u30c6\u30c3\u30af"],"inLanguage":"ja"},{"@type":"WebPage","@id":"https:\/\/pokecon.jp\/job\/25279\/","url":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501","name":"Investigating fine-tuning limitations for VLMs with three case studies - \u30dd\u30b1\u30b3\u30f3","isPartOf":{"@id":"https:\/\/pokecon.jp\/job\/#website"},"primaryImageOfPage":{"@id":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501#primaryimage"},"image":{"@id":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501#primaryimage"},"thumbnailUrl":"https:\/\/pokecon.jp\/job\/wp-content\/uploads\/2025\/11\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png","datePublished":"2025-11-26T10:48:34+00:00","author":{"@id":"https:\/\/pokecon.jp\/job\/#\/schema\/person\/16c9f07b1ba984d165d9aee259bda997"},"breadcrumb":{"@id":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501#breadcrumb"},"inLanguage":"ja","potentialAction":[{"@type":"ReadAction","target":["https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501"]}]},{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501#primaryimage","url":"https:\/\/pokecon.jp\/job\/wp-content\/uploads\/2025\/11\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png","contentUrl":"https:\/\/pokecon.jp\/job\/wp-content\/uploads\/2025\/11\/https3A2F2Fcdn-ak.f.st-hatena.com2Fimages2Ffotolife2Fr2Frouliiiie2F202509222F20250922123325.png","width":1300,"height":1651},{"@type":"BreadcrumbList","@id":"https:\/\
/iblog.ridge-i.com\/entry\/2025\/11\/26\/184501#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\u30db\u30fc\u30e0","item":"https:\/\/pokecon.jp\/job\/"},{"@type":"ListItem","position":2,"name":"Investigating fine-tuning limitations for VLMs with three case studies"}]},{"@type":"WebSite","@id":"https:\/\/pokecon.jp\/job\/#website","url":"https:\/\/pokecon.jp\/job\/","name":"\u30dd\u30b1\u30b3\u30f3","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/pokecon.jp\/job\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ja"},{"@type":"Person","@id":"https:\/\/pokecon.jp\/job\/#\/schema\/person\/16c9f07b1ba984d165d9aee259bda997","name":"info@pokecon.jp","image":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/secure.gravatar.com\/avatar\/2b0549cd9f7907c092ca5fbb283baf72337f235726e4b46fa39ec0b701ac2fe2?s=96&d=wavatar&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/2b0549cd9f7907c092ca5fbb283baf72337f235726e4b46fa39ec0b701ac2fe2?s=96&d=wavatar&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/2b0549cd9f7907c092ca5fbb283baf72337f235726e4b46fa39ec0b701ac2fe2?s=96&d=wavatar&r=g","caption":"info@pokecon.jp"},"url":"https:\/\/pokecon.jp\/job\/author\/infopokecon-jp\/"}]}},"_links":{"self":[{"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/posts\/25279","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/comments?post=25279"}],"version-history":[{"count":1,"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/posts\/25279\/revisions"}],"predecessor-version":[{"id":25281,
"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/posts\/25279\/revisions\/25281"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/media\/25280"}],"wp:attachment":[{"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/media?parent=25279"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/categories?post=25279"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pokecon.jp\/job\/wp-json\/wp\/v2\/tags?post=25279"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}