Presel (English)

🔍 Overview of PreSel: Pre-Instruction Data Selection for VIT

Step	Purpose	Input/Output	Core Idea
1. Task-Importance Estimation	Decide how much to sample from each task	Input: 5% reference set $D_{\text{ref}}$ (with instructions) → Output: task weights $w(T_i)$	(i) Fine-tune LVLM on small set → Reference model (ii) Compute IRS (Instruction Relevance Score) to assess task importance
2. Task-wise Cluster-based Selection	Decide which images to sample from each task	Input: Remaining 95% images (no instructions) + $w(T_i)$	(i) Extract features using DINOv2 (ii) k-means clustering per task (iii) Select representative images via NC (Neighbor Centrality)
3. Instruction Generation & LVLM Fine-tuning	Generate instructions only for selected subset $\mathcal{D}_S$ and fine-tune	Output: (Image, Instruction) pairs → LVLM tuning	Save cost by generating instructions only for $\mathcal{D}_S \ll D$

1. Problem Formulation

Let:

\[D = \bigcup_{i=1}^{M} T_i\]

Where:

$T_i$: set of unlabeled images for the $i$-th vision task (e.g., VQA, OCR)
$D$: entire image pool
Each image $I\in T_i$ is mapped to instruction $Y = F_i(I)$, a high-cost generation process (e.g., GPT or human annotator)

Objective

Select a small subset $\mathcal{D}_{S} \subset D$ with $ \lvert \mathcal{D}_{S} \rvert \ll \lvert D \rvert $, generate instructions only for these, and fine-tune the LVLM on the resulting pairs $(I_a, Y_a)$ to achieve near full-data performance.

2. Task-Importance Estimation

2.1 Reference Model

Randomly sample 5% of $D$ as $D_{\text{ref}}$ (with instruction $Y = (Q, R)$) and fine-tune LVLM for one epoch → reference model.

2.2 Instruction Relevance Score (IRS)

Given instruction $Y = (Q, R)$, define:

\[\mathcal{L}_{R|Q,I} = -\frac{1}{|t^{R}|}\sum_{j=1}^{|t^{R}|} \log P_{\theta}(t^{R}_{j} \mid I, Q, t^{R}_{<j}) \tag{1}\] \[\mathcal{L}_{R|I} = -\frac{1}{|t^{R}|}\sum_{j=1}^{|t^{R}|} \log P_{\theta}(t^{R}_{j} \mid I, t^{R}_{<j}) \tag{2}\] \[\text{IRS}(I, Y) = \frac{\mathcal{L}_{R|Q,I}}{\mathcal{L}_{R|I}} \tag{3}\]

A lower IRS implies $Q$ significantly aids in generating $R$ → more important task.
A higher IRS implies $Q$ does not help much → less important task.

2.3 Compute Task Importance

Average IRS for task $T_i$:

\[s(T_i) = \frac{1}{|D_{\text{ref}}^i|} \sum_{I \in T_i} \text{IRS}(I, Y) \tag{4}\]

Compute task weight via softmax:

\[w(T_i) = \frac{\exp(-s(T_i)/\tau)}{\sum_{j=1}^{M} \exp(-s(T_j)/\tau)},\quad \tau = \frac{1}{\sqrt{M}} \tag{5}\]

3. Task-wise Cluster-based Selection

3.1 Visual Feature Extraction and Clustering

For all unlabeled images $I \in T_i$, extract visual features $\mathbf{v}_I$ using DINOv2 ([CLS] token), and cluster into:

\[\left\{ A_c^i \right\}_{c=1}^{C},\quad \text{where } C = \frac{|T_i|}{100}\]

3.2 Cluster Allocation

For cluster $A_c^i$, compute number of samples to pick:

\[n_c = \left\lfloor \frac{w(T_i) \cdot |A_c^i|}{|T_i|} \cdot |\mathcal{D}_S| \right\rfloor \tag{6}\]

3.3 Intra-Cluster Selection: Neighbor Centrality (NC)

Score for each image $I$ based on cosine similarity with $k$-nearest neighbors:

\[s_{\text{NC}}(I) = \frac{1}{k} \sum_{I_a \in \text{kNN}(I)} \text{sim}(\mathbf{v}_I, \mathbf{v}_{I_a}) \tag{7}\]

Higher $s_{\text{NC}}(I)$ means image is central and representative.

4. Final Assembly and Fine-tuning

Selected images across all tasks form $\mathcal{D}_S$.
Generate instructions only for $\mathcal{D}_S$.
Fine-tune LVLM using $(I, Y)$ pairs from $\mathcal{D}_S$.

5. Key Takeaways

Insight	Benefit
Instruction-free selection phase	Reduces GPT or annotation cost drastically
IRS for task relevance	Captures both redundancy and informativeness
Visual feature + NC	Enables language-free, representative image selection
Lightweight pipeline	Efficient for large-scale unlabeled datasets

This approach enables scalable, cost-effective, and high-performance data curation for visual instruction tuning in Multimodal LLMs.

Presel 정리 (Korean)

단계	목적	입력/출력	핵심 아이디어
1. Task-Importance Estimation	“어떤 태스크에서 얼마나 많이 뽑을까?” 결정	5 % 참조집합 $D_{\text{ref}}$ (이미지·지시 포함) → 가중치 $w(T_i)$	(i) LVLM을 1-epoch 소량 파인튜닝 → Reference model (ii) IRS(Instruction Relevance Score)로 태스크 중요도 측정
2. Task-wise Cluster-based Selection	각 태스크 내부에서 “무엇을 뽑을까?” 결정	나머지 95 % 미라벨 데이터 (이미지만) + $w(T_i)$	(i) DINOv2 특성 추출 → k-means 군집화 (ii) 군집 크기·$w(T_i)$ 기반 샘플 수 $n_c$ 산정 (iii) NC(Neighbor Centrality)로 대표 이미지 선택
3. Instruction Generation & LVLM Fine-tune	최종 소규모 데이터 $\mathcal{D}_S$에 대해 지시 생성·파인튜닝	$\mathcal{D}_S$ (이미지) → (이미지, 지시) → 파인튜닝	비용 절감: 전체 대신 ($\mathcal{D}_S \ll D$) 만 instruction 생성

1. 문제 정의 (Problem Formulation)

풀(pool) 구성
\[D \;=\; \bigcup_{i=1}^{M} T_i,\qquad |D| = \text{총 이미지 수}\]
$T_i$ : VQA, OCR 등 시각 태스크에 속하는 unlabeled 이미지 집합.
지시 생성 비용: 각 $T_i$에는 GPT API 호출·사람 라벨 등 고비용 절차 $F_i(\cdot)$ 이 필요.
목표: $\mathcal{D}_S\subset D,\;\vert\mathcal{D}_S\vert \ll \vert D \vert$ 를 뽑아 지시를 생성하고 (이미지, 지시) 페어로 LVLM을 파인튜닝 → 전체 파인튜닝과 유사한 성능 달성.

2. Task-Importance Estimation

2-1. 참조 모델 구축

무작위 5 % 참조집합 $D_{\text{ref}}$을 선택, 1 epoch 파인튜닝 → Reference LVLM.

2-2. Instruction Relevance Score (IRS)

각 샘플 $(I,Q,R)$ 에 대해

\[\mathcal{L}_{R|Q,I}= -\frac{1}{|t^{R}|}\sum_{j=1}^{|t^{R}|}\!\log P_{\theta}\!\bigl(t^{R}_{j}\mid I,Q,t^{R}_{<j}\bigr) \tag{1}\] \[\mathcal{L}_{R|I}= -\frac{1}{|t^{R}|}\sum_{j=1}^{|t^{R}|}\!\log P_{\theta}\!\bigl(t^{R}_{j}\mid I,t^{R}_{<j}\bigr) \tag{2}\] \[\textbf{IRS}(I,Y)=\frac{\mathcal{L}_{R|Q,I}}{\mathcal{L}_{R|I}} \tag{3}\]

해석
- IRS↑ ⇒ 질문 $Q$가 도움이 안 됨 → 태스크 중요도↓
- IRS↓ ⇒ $Q$ 덕분에 혼동↓ → 태스크 중요도↑

2-3. 태스크별 평균 IRS와 가중치

\[s(T_i)=\frac{1}{|D^{i}_{\text{ref}}|}\sum_{I\in T_i}\text{IRS}(I,Y) \tag{4}\] \[w(T_i)=\frac{\exp\!\bigl(-s(T_i)/\tau\bigr)}{\sum_{j=1}^{M}\exp\!\bigl(-s(T_j)/\tau\bigr)}, \qquad \tau=\frac{1}{\sqrt{M}} \tag{5}\]

$w(T_i)$ : 최종 샘플 비중 (태스크 중요도에 softmax 적용).

3. Task-wise Cluster-based Selection

3-1. 시각 특성 & 군집

모든 unlabeled $I\in T_i$ 에 대해 DINOv2 [30]의 $[\text{CLS}]$ 벡터 $\mathbf{v}_{I}$ 추출.
$k$-means 군집: $C=\tfrac{|T_i|}{100}$ 개 → 군집 $A^{i}_{c}$ ($c=1,\dots,C$) 형성.

3-2. 군집 샘플 수 결정

\[n_c=\Bigl\lfloor \frac{w(T_i)\,|A^{i}_{c}|}{|T_i|}\,|\mathcal{D}_S| \Bigr\rfloor \tag{6}\]

3-3. Intra-Cluster 대표성 — Neighbor Centrality

\[s_{\text{NC}}(I)=\frac{1}{k}\sum_{I_a\in k\text{NN}(I)}\! \operatorname{sim}\!\bigl(\mathbf{v}_{I},\mathbf{v}_{I_a}\bigr) \tag{7}\]

knn 이웃 평균 코사인 유사도.
$s_{\text{NC}}$ 높을수록 군집 중심 → 대표 이미지로 선택.

4. 전체 파이프라인 (Figure 3 해설)

좌측: 다태스크 이미지 풀 $D$ → 5 % 추출해 $D_{\text{ref}}$ (녹색)
- 이 단계에서만 지시 생성·Reference LVLM 학습 → $w(T_i)$ 도출.
중앙: 나머지 95 % 는 DINOv2 특성 → 태스크별 k-means → 군집 색상(연두/갈색…).
우측: 각 군집 내 NC 상위 $n_c$ 이미지 선정 → $\mathcal{D}_S$ (회색 상자).
하단 확대: IRS 계산 과정
- In/Out 토큰 시퀀스 비교로 $\mathcal{L}{R \vert I}$, $\mathcal{L}{R \vert Q, I}$ 산출.
- 태스크별 평균 → $w(T_i)$ 산정 후 윗단계로 피드백.
최종: 선택된 이미지에만 지시 생성 → LVLM 다시 파인튜닝.

5. 핵심 통찰 & 장점

포인트	이유/효과
“지시 없는” 상태에서 선택 → 지시 생성	GPT API 비용·휴먼 라벨링 비용 대폭 절감
IRS 기반 태스크 중요도	중복·학습 난이도까지 반영해 균형 잡힌 서브셋 구성
시각 특성+NC	언어 정보 없이도 대표성 중심 샘플링 → 효율적 일반화
경량 파이프라인 (DINOv2, k-means)	GPU 메모리 / 연산 부담 최소화, 대규모 풀에도 적용 용이

위 과정을 구현하면 전체 VIT 데이터의 5–10 % 만으로도 풀데이터 파인튜닝 수준에 근접한 성능을 달성하면서, 지시 생성-학습 시간을 획기적으로 단축할 수 있습니다.

[Paper Review] Presel: Pre-Instruction Data Selection for Visual Instruction Tuning

Presel (English)

🔍 Overview of PreSel: Pre-Instruction Data Selection for VIT

1. Problem Formulation

Objective

2. Task-Importance Estimation

2.1 Reference Model

2.2 Instruction Relevance Score (IRS)

2.3 Compute Task Importance

3. Task-wise Cluster-based Selection

3.1 Visual Feature Extraction and Clustering

3.2 Cluster Allocation

3.3 Intra-Cluster Selection: Neighbor Centrality (NC)

4. Final Assembly and Fine-tuning

5. Key Takeaways

Presel 정리 (Korean)

1. 문제 정의 (Problem Formulation)

2. Task-Importance Estimation

2-1. 참조 모델 구축

2-2. Instruction Relevance Score (IRS)

2-3. 태스크별 평균 IRS와 가중치

3. Task-wise Cluster-based Selection

3-1. 시각 특성 & 군집

3-2. 군집 샘플 수 결정

3-3. Intra-Cluster 대표성 — Neighbor Centrality

4. 전체 파이프라인 (Figure 3 해설)

5. 핵심 통찰 & 장점

Enjoy Reading This Article?