* `normalize=` flag for transforms, returns non-normalized `torch.Tensor` with original dtype (for `chug`)
* `Searching for Better ViT Baselines (For the GPU Poor)` weights and vit variants released. Exploring model shapes between Tiny and Base.

| model | top1 | top5 | param_count | img_size |
|---|---|---|---|---|
| vit_mediumd_patch16_reg4_gap_256.sbb_in12k_ft_in1k | 86.202 | 97.874 | 64.11 | 256 |
| vit_betwixt_patch16_reg4_gap_256.sbb_in12k_ft_in1k | 85.418 | 97.48 | 60.4 | 256 |
| vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k | 84.322 | 96.812 | 63.95 | 256 |
| vit_betwixt_patch16_rope_reg4_gap_256.sbb_in1k | 83.906 | 96.684 | 60.23 | 256 |
| vit_base_patch16_rope_reg1_gap_256.sbb_in1k | 83.866 | 96.67 | 86.43 | 256 |
| vit_medium_patch16_rope_reg1_gap_256.sbb_in1k | 83.81 | 96.824 | 38.74 | 256 |
| vit_betwixt_patch16_reg4_gap_256.sbb_in1k | 83.706 | 96.616 | 60.4 | 256 |
| vit_betwixt_patch16_reg1_gap_256.sbb_in1k | 83.628 | 96.544 | 60.4 | 256 |
| vit_medium_patch16_reg4_gap_256.sbb_in1k | 83.47 | 96.622 | 38.88 | 256 |
| vit_medium_patch16_reg1_gap_256.sbb_in1k | 83.462 | 96.548 | 38.88 | 256 |
| vit_little_patch16_reg4_gap_256.sbb_in1k | 82.514 | 96.262 | 22.52 | 256 |
| vit_wee_patch16_reg1_gap_256.sbb_in1k | 80.256 | 95.360 | 13.42 | 256 |
| vit_pwee_patch16_reg1_gap_256.sbb_in1k | 80.072 | 95.136 | 15.25 | 256 |
| vit_mediumd_patch16_reg4_gap_256.sbb_in12k | N/A | N/A | 64.11 | 256 |
| vit_betwixt_patch16_reg4_gap_256.sbb_in12k | N/A | N/A | 60.4 | 256 |
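The `normalize=` transform flag noted above skips the final mean/std normalization and hands back values in their original range/dtype. A toy pure-Python sketch of the idea (the `make_transform` helper and the single-channel constants are illustrative stand-ins, not timm's actual API):

```python
# Illustrative sketch of a transform with a `normalize=` toggle.
# `make_transform` and IMAGENET_MEAN/STD are stand-ins, not timm's API.
IMAGENET_MEAN, IMAGENET_STD = 0.485, 0.229  # single-channel stand-ins

def make_transform(normalize=True):
    def transform(pixels):  # `pixels`: list of 0-255 ints standing in for an image
        if not normalize:
            # normalize=False: hand back values in their original range/dtype,
            # analogous to returning a non-normalized uint8 torch.Tensor
            return list(pixels)
        return [(p / 255.0 - IMAGENET_MEAN) / IMAGENET_STD for p in pixels]
    return transform

raw = make_transform(normalize=False)([0, 128, 255])   # [0, 128, 255]
norm = make_transform(normalize=True)([0, 128, 255])
```

Deferring normalization like this can be useful when a downstream data pipeline (e.g. a GPU-side collate step) wants to apply it itself.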
* `AttentionExtract` helper added to extract attention maps from `timm` models. See example in https://github.com/huggingface/pytorch-image-models/discussions/1232#discussioncomment-9320949
* `forward_intermediates()` API refined and added to more models including some ConvNets that have other extraction methods.
* The vast majority of model architectures now support `features_only=True` feature extraction. Remaining 34 architectures can be supported based on priority requests.
* `features_only=True` support for ViT models with flat hidden states or non-std module layouts (so far covering `'vit_*'`, `'twins_*'`, `'deit*'`, `'beit*'`, `'mvitv2*'`, `'eva*'`, `'samvit_*'`, `'flexivit*'`)
* The above feature extraction support is achieved through a new `forward_intermediates()` API that can be used with a feature wrapping module or directly.

```python
model = timm.create_model('vit_base_patch16_224')

final_feat, intermediates = model.forward_intermediates(input)
output = model.forward_head(final_feat)  # pooling + classifier head

print(final_feat.shape)
# torch.Size([2, 197, 768])

for f in intermediates:
    print(f.shape)
# torch.Size([2, 768, 14, 14])  (x12, one per block)

print(output.shape)
# torch.Size([2, 1000])
```

```python
model = timm.create_model('eva02_base_patch16_clip_224', pretrained=True, img_size=512, features_only=True, out_indices=(-3, -2,))
output = model(torch.randn(2, 3, 512, 512))

for o in output:
    print(o.shape)
# torch.Size([2, 768, 32, 32])
# torch.Size([2, 768, 32, 32])
```
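In the `eva02` example, `out_indices=(-3, -2,)` counts blocks from the end of the model. A small sketch of resolving negative indices to absolute block positions (`resolve_out_indices` is a hypothetical helper for illustration, not timm's internal function):

```python
def resolve_out_indices(num_blocks, out_indices):
    """Map possibly-negative block indices to absolute positions.

    Hypothetical helper mirroring how out_indices=(-3, -2) selects
    blocks near the end of a depth-`num_blocks` model.
    """
    resolved = tuple(i if i >= 0 else num_blocks + i for i in out_indices)
    for i in resolved:
        if not 0 <= i < num_blocks:
            raise IndexError(f'block index {i} out of range for depth {num_blocks}')
    return resolved

# for a 12-block base model: (-3, -2) -> blocks 9 and 10
print(resolve_out_indices(12, (-3, -2)))  # (9, 10)
```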
* Datasets & transform refactoring
  * HF hub streaming dataset support (`--dataset hfids:org/dataset`)
  * Tested HF `datasets` and webdataset wrapper streaming from HF hub with recent `timm` ImageNet uploads to https://huggingface.co/timm
  * Monochrome support: `--input-size 1 224 224` or `--in-chans 1` sets PIL image conversion appropriately in dataset
  * Allow train without validation set (`--val-split ''`) in train script
  * Add `--bce-sum` (sum over class dim) and `--bce-pos-weight` (positive weighting) args for training as they're common BCE loss tweaks I was often hard coding
* Support `model_args` config entry; `model_args` will be passed as kwargs through to models on creation.
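`--bce-sum` only changes the loss reduction: summing BCE over the class dimension scales each sample's loss (and its gradients) by the number of classes relative to the default mean. A plain-Python illustration of that relationship (not the train script's actual loss code):

```python
import math

def bce(p, y, eps=1e-12):
    # binary cross-entropy for one predicted probability p and target y
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

probs = [0.9, 0.2, 0.7, 0.1]    # per-class sigmoid outputs for one sample
targets = [1.0, 0.0, 1.0, 0.0]

per_class = [bce(p, y) for p, y in zip(probs, targets)]
mean_loss = sum(per_class) / len(per_class)  # default 'mean' reduction
sum_loss = sum(per_class)                    # a --bce-sum style reduction

# sum reduction == mean reduction scaled by the class count
assert abs(sum_loss - mean_loss * len(per_class)) < 1e-9
```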
* `vision_transformer.py` typing and doc cleanup by Laureηt
* `quickgelu` ViT variants for OpenAI, DFN, MetaCLIP weights that use it (less efficient)
* Dynamic image size support added to `vision_transformer.py`, `vision_transformer_hybrid.py`, `deit.py`, and `eva.py` w/o breaking backward compat.
  * Pass `dynamic_img_size=True` to args at model creation time to allow changing the grid size (interpolate abs and/or ROPE pos embed each forward pass).
  * Pass `dynamic_img_pad=True` to allow image sizes that aren't divisible by patch size (pad bottom right to patch size each forward pass).
  * Setting `img_size` (interpolate pretrained embed weights once) on creation still works.
  * Setting `patch_size` (resize pretrained `patch_embed` weights once) on creation still works.
  * Example validation cmd: `python validate.py /imagenet --model vit_base_patch16_224 --amp --amp-dtype bfloat16 --img-size 255 --crop-pct 1.0 --model-kwargs dynamic_img_size=True dynamic_img_pad=True`
* Added `--reparam` arg to `benchmark.py`, `onnx_export.py`, and `validate.py` to trigger layer reparameterization / fusion for models with any one of `reparameterize()`, `switch_to_deploy()` or `fuse()`
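`dynamic_img_pad=True` needs only enough bottom/right padding to make each spatial dim divisible by the patch size; the arithmetic behind it can be sketched as follows (`pad_for_patches` is an illustrative helper, not timm's code):

```python
def pad_for_patches(h, w, patch_size):
    """Bottom/right padding that makes (h, w) divisible by patch_size."""
    pad_h = (patch_size - h % patch_size) % patch_size
    pad_w = (patch_size - w % patch_size) % patch_size
    grid = ((h + pad_h) // patch_size, (w + pad_w) // patch_size)
    return pad_h, pad_w, grid

# a 255px input with 16x16 patches: pad 1px bottom/right -> 16x16 patch grid
print(pad_for_patches(255, 255, 16))  # (1, 1, (16, 16))
```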
* Example cmd for flexible input / window sizes: `python validate.py /imagenet --model swin_base_patch4_window7_224.ms_in22k_ft_in1k --amp --amp-dtype bfloat16 --input-size 3 256 320 --model-kwargs window_size=8,10 img_size=256,320`
* Minor updates and bug fixes.
* New ResNeXt w/ highest ImageNet eval I'm aware of in the ResNe(X)t family (`seresnextaa201d_32x8d.sw_in12k_ft_in1k_384`)
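The `--reparam` arg probes a model for whichever fusion hook it exposes and calls the first one found; a minimal dispatch sketch with a dummy module (`fuse_model` and `DummyBlock` are illustrative, not the scripts' exact helper):

```python
def fuse_model(model):
    # try the known fusion entry points in order, call the first one present
    for name in ('reparameterize', 'switch_to_deploy', 'fuse'):
        fn = getattr(model, name, None)
        if callable(fn):
            fn()
            return name
    raise AttributeError('model exposes no reparameterization / fusion method')

class DummyBlock:
    """Stand-in for a model that exposes a fuse() hook."""
    def __init__(self):
        self.fused = False
    def fuse(self):
        self.fused = True

block = DummyBlock()
print(fuse_model(block))  # fuse
```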
* Fixed `selecsls*` model naming regression
* `seresnextaa201d_32x8d.sw_in12k_ft_in1k_384` weights (and `.sw_in12k` pretrain) with 87.3% top-1 on ImageNet-1k, best ImageNet ResNet family model I'm aware of.
* The first non pre-release since Oct 2022 with a long list of changes from 0.6.x releases...
* `timm` 0.9 released, transition from 0.8.xdev releases
* Add `get_intermediate_layers` function on vit/deit models for grabbing hidden states (inspired by DINO impl). This is WIP and may change significantly... feedback welcome.
* Error out if `pretrained=True` and no weights exist (instead of continuing with random initialization)
* bitsandbytes optimizers added with `bnb` prefix, i.e. `bnbadam8bit`
* `timm` out of pre-release state
* All `timm` models uploaded to HF Hub and almost all updated to support multi-weight pretrained configs
* Gradient accumulation support added to train script (`--grad-accum-steps`), thanks Taeksang Kim
* Add `--head-init-scale` and `--head-init-bias` to train.py to scale classifier head and set fixed bias for fine-tune
* Remove all InplaceABN (`inplace_abn`) use, replaced use in tresnet with standard BatchNorm (modified weights accordingly).
* Consistent dropout args across models: `drop_rate` (classifier dropout), `proj_drop_rate` (block mlp / out projections), `pos_drop_rate` (position embedding drop), `attn_drop_rate` (attention dropout). Also add patch dropout (FLIP) to vit and eva models.
* New `timm` trained weights added with recipe based tags to differentiate
  * `resnetaa50d.sw_in12k_ft_in1k` - 81.7 @ 224, 82.6 @ 288
  * `resnetaa101d.sw_in12k_ft_in1k` - 83.5 @ 224, 84.1 @ 288
  * `seresnextaa101d_32x8d.sw_in12k_ft_in1k` - 86.0 @ 224, 86.5 @ 288
  * `seresnextaa101d_32x8d.sw_in12k_ft_in1k_288` - 86.5 @ 288, 86.7 @ 320

| model | top1 | top5 | img_size | param_count | gmacs | macts |
|---|---|---|---|---|---|---|
| convnext_xxlarge.clip_laion2b_soup_ft_in1k | 88.612 | 98.704 | 256 | 846.47 | 198.09 | 124.45 |
| convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_384 | 88.312 | 98.578 | 384 | 200.13 | 101.11 | 126.74 |
| convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_320 | 87.968 | 98.47 | 320 | 200.13 | 70.21 | 88.02 |
| convnext_base.clip_laion2b_augreg_ft_in12k_in1k_384 | 87.138 | 98.212 | 384 | 88.59 | 45.21 | 84.49 |
| convnext_base.clip_laion2b_augreg_ft_in12k_in1k | 86.344 | 97.97 | 256 | 88.59 | 20.09 | 37.55 |
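Gradient accumulation (`--grad-accum-steps`, noted above) sums scaled micro-batch gradients and steps the optimizer once per window, matching a single larger batch for losses that average over samples. A toy SGD sketch on a scalar parameter (pure Python, purely illustrative of the technique, not the train script's code):

```python
def grad_fn(w, x):
    # gradient of the per-sample loss 0.5 * (w - x)**2 w.r.t. w
    return w - x

def train(samples, accum_steps, lr=0.1):
    w, grad = 0.0, 0.0
    for step, x in enumerate(samples, start=1):
        grad += grad_fn(w, x) / accum_steps  # scale like loss / accum_steps
        if step % accum_steps == 0:
            w -= lr * grad                   # one optimizer step per window
            grad = 0.0
    return w

# accumulating 2 micro-batches == one batch of 2 with a mean-reduced loss
big_batch_w = 0.0 - 0.1 * ((0.0 - 1.0) + (0.0 - 3.0)) / 2
assert abs(train([1.0, 3.0], accum_steps=2) - big_batch_w) < 1e-12
```

The key point is that no parameter update happens between micro-batches, so the accumulated gradient is evaluated at the same weights as a large-batch gradient would be.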
| model | top1 | top5 | param_count | img_size |
|---|---|---|---|---|
| eva02_large_patch14_448.mim_m38m_ft_in22k_in1k | 90.054 | 99.042 | 305.08 | 448 |
| eva02_large_patch14_448.mim_in22k_ft_in22k_in1k | 89.946 | 99.01 | 305.08 | 448 |
| eva_giant_patch14_560.m30m_ft_in22k_in1k | 89.792 | 98.992 | 1014.45 | 560 |
| eva02_large_patch14_448.mim_in22k_ft_in1k | 89.626 | 98.954 | 305.08 | 448 |
| eva02_large_patch14_448.mim_m38m_ft_in1k | 89.57 | 98.918 | 305.08 | 448 |
| eva_giant_patch14_336.m30m_ft_in22k_in1k | 89.56 | 98.956 | 1013.01 | 336 |
| eva_giant_patch14_336.clip_ft_in1k | 89.466 | 98.82 | 1013.01 | 336 |
| eva_large_patch14_336.in22k_ft_in22k_in1k | 89.214 | 98.854 | 304.53 | 336 |
| eva_giant_patch14_224.clip_ft_in1k | 88.882 | 98.678 | 1012.56 | 224 |
| eva02_base_patch14_448.mim_in22k_ft_in22k_in1k | 88.692 | 98.722 | 87.12 | 448 |
| eva_large_patch14_336.in22k_ft_in1k | 88.652 | 98.722 | 304.53 | 336 |
| eva_large_patch14_196.in22k_ft_in22k_in1k | 88.592 | 98.656 | 304.14 | 196 |
| eva02_base_patch14_448.mim_in22k_ft_in1k | 88.23 | 98.564 | 87.12 | 448 |
| eva_large_patch14_196.in22k_ft_in1k | 87.934 | 98.504 | 304.14 | 196 |
| eva02_small_patch14_336.mim_in22k_ft_in1k | 85.74 | 97.614 | 22.13 | 336 |
| eva02_tiny_patch14_336.mim_in22k_ft_in1k | 80.658 | 95.524 | 5.76 | 336 |
* More weights pushed to HF hub along with multi-weight support, including: `regnet.py`, `rexnet.py`, `byobnet.py`, `resnetv2.py`, `swin_transformer.py`, `swin_transformer_v2.py`, `swin_transformer_v2_cr.py`
* Swin Transformer models support feature extraction (NCHW feat maps for `swinv2_cr_*`, and NHWC for all others) and spatial embedding outputs.
* New ImageNet-12k trained and fine-tuned `timm` weights:
  * `rexnetr_200.sw_in12k_ft_in1k` - 82.6 @ 224, 83.2 @ 288
  * `rexnetr_300.sw_in12k_ft_in1k` - 84.0 @ 224, 84.5 @ 288
  * `regnety_120.sw_in12k_ft_in1k` - 85.0 @ 224, 85.4 @ 288
  * `regnety_160.lion_in12k_ft_in1k` - 85.6 @ 224, 86.0 @ 288
  * `regnety_160.sw_in12k_ft_in1k` - 85.6 @ 224, 86.0 @ 288 (compare to SWAG PT + 1k FT this is same BUT much lower res, blows SEER FT away)
* Update `convnext_xxlarge` default LayerNorm eps to 1e-5 (for CLIP weights, improved stability)
* Add `convnext_large_mlp.clip_laion2b_ft_320` and `convnext_large_mlp.clip_laion2b_ft_soup_320` CLIP image tower weights for features & fine-tune
* `safetensors` checkpoint support added
* Updates to `vit_*`, `vit_relpos*`, coatnet / maxxvit (to start)
* CLIP ConvNeXt fine-tunes w/ `features_only=True` support:
  * `convnext_base.clip_laion2b_augreg_ft_in1k` - 86.2% @ 256x256
  * `convnext_base.clip_laiona_augreg_ft_in1k_384` - 86.5% @ 384x384
  * `convnext_large_mlp.clip_laion2b_augreg_ft_in1k` - 87.3% @ 256x256
  * `convnext_large_mlp.clip_laion2b_augreg_ft_in1k_384` - 87.9% @ 384x384
* Add DaViT models w/ `features_only=True` support. Adapted from https://github.com/dingmyu/davit by Fredo.
* Add `features_only=True` support to new conv variants, weight remap required.
* ImageNet label/index files moved from `/results` to `timm/data/_info`
* `timm` `inference.py` updated to use the new label info, try: `python inference.py /folder/to/images --model convnext_small.in12k --label-type detail --topk 5`
* Add two convnext 12k -> 1k fine-tunes at 384x384
  * `convnext_tiny.in12k_ft_in1k_384` - 85.1 @ 384
  * `convnext_small.in12k_ft_in1k_384` - 86.2 @ 384
* Push all MaxxViT weights to HF hub, and add new ImageNet-12k -> 1k fine-tunes for rw base MaxViT and CoAtNet 1/2 models
| model | top1 | top5 | samples / sec | Params (M) | GMAC | Act (M) |
|---|---|---|---|---|---|---|
| maxvit_xlarge_tf_512.in21k_ft_in1k | 88.53 | 98.64 | 21.76 | 475.77 | 534.14 | 1413.22 |
| maxvit_xlarge_tf_384.in21k_ft_in1k | 88.32 | 98.54 | 42.53 | 475.32 | 292.78 | 668.76 |
| maxvit_base_tf_512.in21k_ft_in1k | 88.20 | 98.53 | 50.87 | 119.88 | 138.02 | 703.99 |
| maxvit_large_tf_512.in21k_ft_in1k | 88.04 | 98.40 | 36.42 | 212.33 | 244.75 | 942.15 |
| maxvit_large_tf_384.in21k_ft_in1k | 87.98 | 98.56 | 71.75 | 212.03 | 132.55 | 445.84 |
| maxvit_base_tf_384.in21k_ft_in1k | 87.92 | 98.54 | 104.71 | 119.65 | 73.80 | 332.90 |
| maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.81 | 98.37 | 106.55 | 116.14 | 70.97 | 318.95 |
| maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.47 | 98.37 | 149.49 | 116.09 | 72.98 | 213.74 |
| coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k | 87.39 | 98.31 | 160.80 | 73.88 | 47.69 | 209.43 |
| maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.89 | 98.02 | 375.86 | 116.14 | 23.15 | 92.64 |
| maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.64 | 98.02 | 501.03 | 116.09 | 24.20 | 62.77 |
| maxvit_base_tf_512.in1k | 86.60 | 97.92 | 50.75 | 119.88 | 138.02 | 703.99 |
| coatnet_2_rw_224.sw_in12k_ft_in1k | 86.57 | 97.89 | 631.88 | 73.87 | 15.09 | 49.22 |
| maxvit_large_tf_512.in1k | 86.52 | 97.88 | 36.04 | 212.33 | 244.75 | 942.15 |
| coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k | 86.49 | 97.90 | 620.58 | 73.88 | 15.18 | 54.78 |
| maxvit_base_tf_384.in1k | 86.29 | 97.80 | 101.09 | 119.65 | 73.80 | 332.90 |
| maxvit_large_tf_384.in1k | 86.23 | 97.69 | 70.56 | 212.03 | 132.55 | 445.84 |
| maxvit_small_tf_512.in1k | 86.10 | 97.76 | 88.63 | 69.13 | 67.26 | 383.77 |
| maxvit_tiny_tf_512.in1k | 85.67 | 97.58 | 144.25 | 31.05 | 33.49 | 257.59 |
| maxvit_small_tf_384.in1k | 85.54 | 97.46 | 188.35 | 69.02 | 35.87 | 183.65 |
| maxvit_tiny_tf_384.in1k | 85.11 | 97.38 | 293.46 | 30.98 | 17.53 | 123.42 |
| maxvit_large_tf_224.in1k | 84.93 | 96.97 | 247.71 | 211.79 | 43.68 | 127.35 |
| coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k | 84.90 | 96.96 | 1025.45 | 41.72 | 8.11 | 40.13 |
| maxvit_base_tf_224.in1k | 84.85 | 96.99 | 358.25 | 119.47 | 24.04 | 95.01 |
| maxxvit_rmlp_small_rw_256.sw_in1k | 84.63 | 97.06 | 575.53 | 66.01 | 14.67 | 58.38 |
| coatnet_rmlp_2_rw_224.sw_in1k | 84.61 | 96.74 | 625.81 | 73.88 | 15.18 | 54.78 |
| maxvit_rmlp_small_rw_224.sw_in1k | 84.49 | 96.76 | 693.82 | 64.90 | 10.75 | 49.30 |
| maxvit_small_tf_224.in1k | 84.43 | 96.83 | 647.96 | 68.93 | 11.66 | 53.17 |
| maxvit_rmlp_tiny_rw_256.sw_in1k | 84.23 | 96.78 | 807.21 | 29.15 | 6.77 | 46.92 |
| coatnet_1_rw_224.sw_in1k | 83.62 | 96.38 | 989.59 | 41.72 | 8.04 | 34.60 |
| maxvit_tiny_rw_224.sw_in1k | 83.50 | 96.50 | 1100.53 | 29.06 | 5.11 | 33.11 |
| maxvit_tiny_tf_224.in1k | 83.41 | 96.59 | 1004.94 | 30.92 | 5.60 | 35.78 |
| coatnet_rmlp_1_rw_224.sw_in1k | 83.36 | 96.45 | 1093.03 | 41.69 | 7.85 | 35.47 |
| maxxvitv2_nano_rw_256.sw_in1k | 83.11 | 96.33 | 1276.88 | 23.70 | 6.26 | 23.05 |
| maxxvit_rmlp_nano_rw_256.sw_in1k | 83.03 | 96.34 | 1341.24 | 16.78 | 4.37 | 26.05 |
| maxvit_rmlp_nano_rw_256.sw_in1k | 82.96 | 96.26 | 1283.24 | 15.50 | 4.47 | 31.92 |
| maxvit_nano_rw_256.sw_in1k | 82.93 | 96.23 | 1218.17 | 15.45 | 4.46 | 30.28 |
| coatnet_bn_0_rw_224.sw_in1k | 82.39 | 96.19 | 1600.14 | 27.44 | 4.67 | 22.04 |
| coatnet_0_rw_224.sw_in1k | 82.39 | 95.84 | 1831.21 | 27.44 | 4.43 | 18.73 |
| coatnet_rmlp_nano_rw_224.sw_in1k | 82.05 | 95.87 | 2109.09 | 15.15 | 2.62 | 20.34 |
| coatnext_nano_rw_224.sw_in1k | 81.95 | 95.92 | 2525.52 | 14.70 | 2.47 | 12.80 |
| coatnet_nano_rw_224.sw_in1k | 81.70 | 95.64 | 2344.52 | 15.14 | 2.41 | 15.41 |
| maxvit_rmlp_pico_rw_256.sw_in1k | 80.53 | 95.21 | 1594.71 | 7.52 | 1.85 | 24.86 |
* New ImageNet-12k fine-tuned weights (`.in12k` tags)
  * `convnext_nano.in12k_ft_in1k` - 82.3 @ 224, 82.9 @ 288 (previously released)
  * `convnext_tiny.in12k_ft_in1k` - 84.2 @ 224, 84.5 @ 288
  * `convnext_small.in12k_ft_in1k` - 85.2 @ 224, 85.3 @ 288
* Add `--model-kwargs` and `--opt-kwargs` to scripts to pass through rare args directly to model classes from cmd line
  * `train.py /imagenet --model resnet50 --amp --model-kwargs output_stride=16 act_layer=silu`
  * `train.py /imagenet --model vit_base_patch16_clip_224 --img-size 240 --amp --model-kwargs img_size=240 patch_size=12`
* Cleanup in `convnext.py`
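`--model-kwargs` style values are `key=value` strings parsed into a kwargs dict before model creation; a hedged sketch of such parsing (illustrative, not the scripts' actual parser):

```python
from ast import literal_eval

def parse_kwargs(pairs):
    """Parse ['output_stride=16', 'act_layer=silu'] into a kwargs dict.

    Values are literal-eval'd when possible (ints, floats, tuples, bools),
    otherwise kept as plain strings. Illustrative, not timm's exact parser.
    """
    kwargs = {}
    for pair in pairs:
        key, _, value = pair.partition('=')
        try:
            kwargs[key] = literal_eval(value)
        except (ValueError, SyntaxError):
            kwargs[key] = value  # bare identifiers like 'silu' stay strings
    return kwargs

print(parse_kwargs(['output_stride=16', 'act_layer=silu']))
# {'output_stride': 16, 'act_layer': 'silu'}
```

Under this scheme a comma value such as `window_size=8,10` would parse to the tuple `(8, 10)`.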
* More ImageNet-12k fine-tuned weights:
  * `efficientnet_b5.in12k_ft_in1k` - 85.9 @ 448x448
  * `vit_medium_patch16_gap_384.in12k_ft_in1k` - 85.5 @ 384x384
  * `vit_medium_patch16_gap_256.in12k_ft_in1k` - 84.5 @ 256x256
  * `convnext_nano.in12k_ft_in1k` - 82.9 @ 288x288
* EVA large models added to `vision_transformer.py`, MAE style ViT-L/14 MIM pretrain w/ EVA-CLIP targets, FT on ImageNet-1k (w/ ImageNet-22k intermediate for some)
| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| eva_large_patch14_336.in22k_ft_in22k_in1k | 89.2 | 304.5 | 191.1 | 270.2 | link |
| eva_large_patch14_336.in22k_ft_in1k | 88.7 | 304.5 | 191.1 | 270.2 | link |
| eva_large_patch14_196.in22k_ft_in22k_in1k | 88.6 | 304.1 | 61.6 | 63.5 | link |
| eva_large_patch14_196.in22k_ft_in1k | 87.9 | 304.1 | 61.6 | 63.5 | link |
* EVA giant model weights added to `beit.py`.
| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| eva_giant_patch14_560.m30m_ft_in22k_in1k | 89.8 | 1014.4 | 1906.8 | 2577.2 | link |
| eva_giant_patch14_336.m30m_ft_in22k_in1k | 89.6 | 1013 | 620.6 | 550.7 | link |
| eva_giant_patch14_336.clip_ft_in1k | 89.4 | 1013 | 620.6 | 550.7 | link |
| eva_giant_patch14_224.clip_ft_in1k | 89.1 | 1012.6 | 267.2 | 192.6 | link |
* Initial dev release (`0.8.0dev0`) of multi-weight support (`model_arch.pretrained_tag`). Install with `pip install --pre timm`
* Add `--torchcompile` argument to scripts

| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k | 88.6 | 632.5 | 391 | 407.5 | link |
| vit_large_patch14_clip_336.openai_ft_in12k_in1k | 88.3 | 304.5 | 191.1 | 270.2 | link |
| vit_huge_patch14_clip_224.laion2b_ft_in12k_in1k | 88.2 | 632 | 167.4 | 139.4 | link |
| vit_large_patch14_clip_336.laion2b_ft_in12k_in1k | 88.2 | 304.5 | 191.1 | 270.2 | link |
| vit_large_patch14_clip_224.openai_ft_in12k_in1k | 88.2 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_224.laion2b_ft_in12k_in1k | 87.9 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_224.openai_ft_in1k | 87.9 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_336.laion2b_ft_in1k | 87.9 | 304.5 | 191.1 | 270.2 | link |
| vit_huge_patch14_clip_224.laion2b_ft_in1k | 87.6 | 632 | 167.4 | 139.4 | link |
| vit_large_patch14_clip_224.laion2b_ft_in1k | 87.3 | 304.2 | 81.1 | 88.8 | link |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 87.2 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.openai_ft_in12k_in1k | 87 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.laion2b_ft_in1k | 86.6 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.openai_ft_in1k | 86.2 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_224.laion2b_ft_in12k_in1k | 86.2 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch16_clip_224.openai_ft_in12k_in1k | 85.9 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_448.laion2b_ft_in12k_in1k | 85.8 | 88.3 | 17.9 | 23.9 | link |
| vit_base_patch16_clip_224.laion2b_ft_in1k | 85.5 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_384.laion2b_ft_in12k_in1k | 85.4 | 88.3 | 13.1 | 16.5 | link |
| vit_base_patch16_clip_224.openai_ft_in1k | 85.3 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_384.openai_ft_in12k_in1k | 85.2 | 88.3 | 13.1 | 16.5 | link |
| vit_base_patch32_clip_224.laion2b_ft_in12k_in1k | 83.3 | 88.2 | 4.4 | 5 | link |
| vit_base_patch32_clip_224.laion2b_ft_in1k | 82.6 | 88.2 | 4.4 | 5 | link |
| vit_base_patch32_clip_224.openai_ft_in1k | 81.9 | 88.2 | 4.4 | 5 | link |
| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| maxvit_xlarge_tf_512.in21k_ft_in1k | 88.5 | 475.8 | 534.1 | 1413.2 | link |
| maxvit_xlarge_tf_384.in21k_ft_in1k | 88.3 | 475.3 | 292.8 | 668.8 | link |
| maxvit_base_tf_512.in21k_ft_in1k | 88.2 | 119.9 | 138 | 704 | link |
| maxvit_large_tf_512.in21k_ft_in1k | 88 | 212.3 | 244.8 | 942.2 | link |
| maxvit_large_tf_384.in21k_ft_in1k | 88 | 212 | 132.6 | 445.8 | link |
| maxvit_base_tf_384.in21k_ft_in1k | 87.9 | 119.6 | 73.8 | 332.9 | link |
| maxvit_base_tf_512.in1k | 86.6 | 119.9 | 138 | 704 | link |
| maxvit_large_tf_512.in1k | 86.5 | 212.3 | 244.8 | 942.2 | link |
| maxvit_base_tf_384.in1k | 86.3 | 119.6 | 73.8 | 332.9 | link |
| maxvit_large_tf_384.in1k | 86.2 | 212 | 132.6 | 445.8 | link |
| maxvit_small_tf_512.in1k | 86.1 | 69.1 | 67.3 | 383.8 | link |
| maxvit_tiny_tf_512.in1k | 85.7 | 31 | 33.5 | 257.6 | link |
| maxvit_small_tf_384.in1k | 85.5 | 69 | 35.9 | 183.6 | link |
| maxvit_tiny_tf_384.in1k | 85.1 | 31 | 17.5 | 123.4 | link |
| maxvit_large_tf_224.in1k | 84.9 | 211.8 | 43.7 | 127.4 | link |
| maxvit_base_tf_224.in1k | 84.9 | 119.5 | 24 | 95 | link |
| maxvit_small_tf_224.in1k | 84.4 | 68.9 | 11.7 | 53.2 | link |
| maxvit_tiny_tf_224.in1k | 83.4 | 30.9 | 5.6 | 35.8 | link |
* apex AMP use via `--amp-impl apex`, bfloat16 supported via `--amp-dtype bfloat16`
* First non pre-release in a loooong while, changelog from 0.6.x below...
* NVIDIA Apex AMP supported via `--amp-impl apex`, bfloat16 supported via `--amp-dtype bfloat16`
* Release from 0.6.x stable branch with fix for Python 3.11. NOTE: the original 0.6.13 release tag was against the wrong branch.
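A minimal sketch of what the bfloat16 AMP mode amounts to in plain PyTorch (illustrative only, not the `train.py`/`validate.py` code itself; CPU shown, CUDA works the same way):

```python
import torch

# Sketch of bfloat16 autocast, the same dtype selected by
# --amp --amp-dtype bfloat16 in the timm scripts.
model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    y = model(x)

# Inside the autocast region, matmul-heavy ops run in bfloat16.
print(y.dtype)
```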
* More weights pushed to the HF hub along with multi-weight support, including: `regnet.py`, `rexnet.py`, `byobnet.py`, `resnetv2.py`, `swin_transformer.py`, `swin_transformer_v2.py`, `swin_transformer_v2_cr.py`
* Swin Transformer models support feature extraction (NCHW feat maps for `swinv2_cr_*`, NHWC for all others) and spatial embedding outputs.
* New `timm` weights:
  * `rexnetr_200.sw_in12k_ft_in1k` - 82.6 @ 224, 83.2 @ 288
  * `rexnetr_300.sw_in12k_ft_in1k` - 84.0 @ 224, 84.5 @ 288
  * `regnety_120.sw_in12k_ft_in1k` - 85.0 @ 224, 85.4 @ 288
  * `regnety_160.lion_in12k_ft_in1k` - 85.6 @ 224, 86.0 @ 288
  * `regnety_160.sw_in12k_ft_in1k` - 85.6 @ 224, 86.0 @ 288 (compare to SWAG PT + 1k FT: same top-1 BUT at much lower res, blows the SEER FT away)
* `convnext_xxlarge` default LayerNorm eps changed to 1e-5 (for CLIP weights, improved stability)
* `convnext_large_mlp.clip_laion2b_ft_320` and `convnext_large_mlp.clip_laion2b_ft_soup_320` CLIP image tower weights for features & fine-tune
* `safetensors` checkpoint support added
* `vit_*`, `vit_relpos*`, `coatnet` / `maxxvit` (to start)
* ConvNeXt CLIP -> ImageNet-1k fine-tunes:
  * `convnext_base.clip_laion2b_augreg_ft_in1k` - 86.2% @ 256x256
  * `convnext_base.clip_laiona_augreg_ft_in1k_384` - 86.5% @ 384x384
  * `convnext_large_mlp.clip_laion2b_augreg_ft_in1k` - 87.3% @ 256x256
  * `convnext_large_mlp.clip_laion2b_augreg_ft_in1k_384` - 87.9% @ 384x384
* Add DaViT models. Supports `features_only=True`. Adapted from https://github.com/dingmyu/davit by Fredo.
* `features_only=True` support added to new conv variants, weight remap required.
* Moved ImageNet meta-data (synsets, indices) from `/results` to `timm/data/_info`.
* Updated `inference.py`; try: `python inference.py /folder/to/images --model convnext_small.in12k --label-type detail --topk 5`
* Add two ConvNeXt 12k -> 1k fine-tunes at 384x384:
  * `convnext_tiny.in12k_ft_in1k_384` - 85.1 @ 384
  * `convnext_small.in12k_ft_in1k_384` - 86.2 @ 384
* Push all MaxxViT weights to the HF hub, and add new ImageNet-12k -> 1k fine-tunes for `rw` base MaxViT and CoAtNet 1/2 models
| model | top1 | top5 | samples / sec | Params (M) | GMAC | Act (M) |
|---|---|---|---|---|---|---|
| maxvit_xlarge_tf_512.in21k_ft_in1k | 88.53 | 98.64 | 21.76 | 475.77 | 534.14 | 1413.22 |
| maxvit_xlarge_tf_384.in21k_ft_in1k | 88.32 | 98.54 | 42.53 | 475.32 | 292.78 | 668.76 |
| maxvit_base_tf_512.in21k_ft_in1k | 88.20 | 98.53 | 50.87 | 119.88 | 138.02 | 703.99 |
| maxvit_large_tf_512.in21k_ft_in1k | 88.04 | 98.40 | 36.42 | 212.33 | 244.75 | 942.15 |
| maxvit_large_tf_384.in21k_ft_in1k | 87.98 | 98.56 | 71.75 | 212.03 | 132.55 | 445.84 |
| maxvit_base_tf_384.in21k_ft_in1k | 87.92 | 98.54 | 104.71 | 119.65 | 73.80 | 332.90 |
| maxvit_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.81 | 98.37 | 106.55 | 116.14 | 70.97 | 318.95 |
| maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k | 87.47 | 98.37 | 149.49 | 116.09 | 72.98 | 213.74 |
| coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k | 87.39 | 98.31 | 160.80 | 73.88 | 47.69 | 209.43 |
| maxvit_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.89 | 98.02 | 375.86 | 116.14 | 23.15 | 92.64 |
| maxxvitv2_rmlp_base_rw_224.sw_in12k_ft_in1k | 86.64 | 98.02 | 501.03 | 116.09 | 24.20 | 62.77 |
| maxvit_base_tf_512.in1k | 86.60 | 97.92 | 50.75 | 119.88 | 138.02 | 703.99 |
| coatnet_2_rw_224.sw_in12k_ft_in1k | 86.57 | 97.89 | 631.88 | 73.87 | 15.09 | 49.22 |
| maxvit_large_tf_512.in1k | 86.52 | 97.88 | 36.04 | 212.33 | 244.75 | 942.15 |
| coatnet_rmlp_2_rw_224.sw_in12k_ft_in1k | 86.49 | 97.90 | 620.58 | 73.88 | 15.18 | 54.78 |
| maxvit_base_tf_384.in1k | 86.29 | 97.80 | 101.09 | 119.65 | 73.80 | 332.90 |
| maxvit_large_tf_384.in1k | 86.23 | 97.69 | 70.56 | 212.03 | 132.55 | 445.84 |
| maxvit_small_tf_512.in1k | 86.10 | 97.76 | 88.63 | 69.13 | 67.26 | 383.77 |
| maxvit_tiny_tf_512.in1k | 85.67 | 97.58 | 144.25 | 31.05 | 33.49 | 257.59 |
| maxvit_small_tf_384.in1k | 85.54 | 97.46 | 188.35 | 69.02 | 35.87 | 183.65 |
| maxvit_tiny_tf_384.in1k | 85.11 | 97.38 | 293.46 | 30.98 | 17.53 | 123.42 |
| maxvit_large_tf_224.in1k | 84.93 | 96.97 | 247.71 | 211.79 | 43.68 | 127.35 |
| coatnet_rmlp_1_rw2_224.sw_in12k_ft_in1k | 84.90 | 96.96 | 1025.45 | 41.72 | 8.11 | 40.13 |
| maxvit_base_tf_224.in1k | 84.85 | 96.99 | 358.25 | 119.47 | 24.04 | 95.01 |
| maxxvit_rmlp_small_rw_256.sw_in1k | 84.63 | 97.06 | 575.53 | 66.01 | 14.67 | 58.38 |
| coatnet_rmlp_2_rw_224.sw_in1k | 84.61 | 96.74 | 625.81 | 73.88 | 15.18 | 54.78 |
| maxvit_rmlp_small_rw_224.sw_in1k | 84.49 | 96.76 | 693.82 | 64.90 | 10.75 | 49.30 |
| maxvit_small_tf_224.in1k | 84.43 | 96.83 | 647.96 | 68.93 | 11.66 | 53.17 |
| maxvit_rmlp_tiny_rw_256.sw_in1k | 84.23 | 96.78 | 807.21 | 29.15 | 6.77 | 46.92 |
| coatnet_1_rw_224.sw_in1k | 83.62 | 96.38 | 989.59 | 41.72 | 8.04 | 34.60 |
| maxvit_tiny_rw_224.sw_in1k | 83.50 | 96.50 | 1100.53 | 29.06 | 5.11 | 33.11 |
| maxvit_tiny_tf_224.in1k | 83.41 | 96.59 | 1004.94 | 30.92 | 5.60 | 35.78 |
| coatnet_rmlp_1_rw_224.sw_in1k | 83.36 | 96.45 | 1093.03 | 41.69 | 7.85 | 35.47 |
| maxxvitv2_nano_rw_256.sw_in1k | 83.11 | 96.33 | 1276.88 | 23.70 | 6.26 | 23.05 |
| maxxvit_rmlp_nano_rw_256.sw_in1k | 83.03 | 96.34 | 1341.24 | 16.78 | 4.37 | 26.05 |
| maxvit_rmlp_nano_rw_256.sw_in1k | 82.96 | 96.26 | 1283.24 | 15.50 | 4.47 | 31.92 |
| maxvit_nano_rw_256.sw_in1k | 82.93 | 96.23 | 1218.17 | 15.45 | 4.46 | 30.28 |
| coatnet_bn_0_rw_224.sw_in1k | 82.39 | 96.19 | 1600.14 | 27.44 | 4.67 | 22.04 |
| coatnet_0_rw_224.sw_in1k | 82.39 | 95.84 | 1831.21 | 27.44 | 4.43 | 18.73 |
| coatnet_rmlp_nano_rw_224.sw_in1k | 82.05 | 95.87 | 2109.09 | 15.15 | 2.62 | 20.34 |
| coatnext_nano_rw_224.sw_in1k | 81.95 | 95.92 | 2525.52 | 14.70 | 2.47 | 12.80 |
| coatnet_nano_rw_224.sw_in1k | 81.70 | 95.64 | 2344.52 | 15.14 | 2.41 | 15.41 |
| maxvit_rmlp_pico_rw_256.sw_in1k | 80.53 | 95.21 | 1594.71 | 7.52 | 1.85 | 24.86 |
* Update ConvNeXt ImageNet-12k pretrain series w/ new fine-tuned weights (and pre-FT `.in12k` tags):
  * `convnext_nano.in12k_ft_in1k` - 82.3 @ 224, 82.9 @ 288 (previously released)
  * `convnext_tiny.in12k_ft_in1k` - 84.2 @ 224, 84.5 @ 288
  * `convnext_small.in12k_ft_in1k` - 85.2 @ 224, 85.3 @ 288
* Add `--model-kwargs` and `--opt-kwargs` to scripts to pass through rare args directly to model classes from the cmd line:
  * `train.py /imagenet --model resnet50 --amp --model-kwargs output_stride=16 act_layer=silu`
  * `train.py /imagenet --model vit_base_patch16_clip_224 --img-size 240 --amp --model-kwargs img_size=240 patch_size=12`
* Updates to `convnext.py`
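The `--model-kwargs` values above are plain `key=value` pairs forwarded as keyword args to the model class. A rough stand-in for the parsing (illustrative code, not timm's exact implementation):

```python
import ast

def parse_kwargs(pairs):
    """Turn ['output_stride=16', 'act_layer=silu'] into a kwargs dict."""
    kwargs = {}
    for pair in pairs:
        key, _, value = pair.partition('=')
        try:
            # literal_eval handles numbers, tuples, booleans, etc.
            kwargs[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            kwargs[key] = value  # bare strings like 'silu' stay strings
    return kwargs

print(parse_kwargs(['output_stride=16', 'act_layer=silu']))
# -> {'output_stride': 16, 'act_layer': 'silu'}
```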
Part way through the conversion of models to multi-weight support (`model_arch.pretrained_tag`), a module reorg for future building, and lots of new weights and model additions as we go...
This is considered a development release. Please stick to 0.6.x if you need stability. Some model names and tags will shift a bit, and some old names have already been deprecated without remapping support added yet. The 0.6.x branch is considered 'stable': https://github.com/rwightman/pytorch-image-models/tree/0.6.x
* ImageNet-12k -> 1k fine-tunes:
  * `efficientnet_b5.in12k_ft_in1k` - 85.9 @ 448x448
  * `vit_medium_patch16_gap_384.in12k_ft_in1k` - 85.5 @ 384x384
  * `vit_medium_patch16_gap_256.in12k_ft_in1k` - 84.5 @ 256x256
  * `convnext_nano.in12k_ft_in1k` - 82.9 @ 288x288
* EVA large models added to `vision_transformer.py`: MAE style ViT-L/14 MIM pretrain w/ EVA-CLIP targets, FT on ImageNet-1k (w/ ImageNet-22k intermediate for some):
| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| eva_large_patch14_336.in22k_ft_in22k_in1k | 89.2 | 304.5 | 191.1 | 270.2 | link |
| eva_large_patch14_336.in22k_ft_in1k | 88.7 | 304.5 | 191.1 | 270.2 | link |
| eva_large_patch14_196.in22k_ft_in22k_in1k | 88.6 | 304.1 | 61.6 | 63.5 | link |
| eva_large_patch14_196.in22k_ft_in1k | 87.9 | 304.1 | 61.6 | 63.5 | link |
* EVA giant (BEiT style ViT-g/14) model weights, w/ both MIM pretrain and CLIP pretrain, added to `beit.py`:
| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| eva_giant_patch14_560.m30m_ft_in22k_in1k | 89.8 | 1014.4 | 1906.8 | 2577.2 | link |
| eva_giant_patch14_336.m30m_ft_in22k_in1k | 89.6 | 1013 | 620.6 | 550.7 | link |
| eva_giant_patch14_336.clip_ft_in1k | 89.4 | 1013 | 620.6 | 550.7 | link |
| eva_giant_patch14_224.clip_ft_in1k | 89.1 | 1012.6 | 267.2 | 192.6 | link |
* Pre-release (`0.8.0dev0`) of multi-weight support (`model_arch.pretrained_tag`). Install with `pip install --pre timm`.
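The new naming convention separates architecture and pretrained tag with a dot; a tiny sketch of the split (hypothetical helper, mirroring what the naming scheme implies, not timm's exact code):

```python
def split_model_name_tag(model_name: str):
    """Split 'model_arch.pretrained_tag' into its two parts; tag may be empty."""
    arch, _, pretrained_tag = model_name.partition('.')
    return arch, pretrained_tag

print(split_model_name_tag('convnext_small.in12k_ft_in1k'))
# -> ('convnext_small', 'in12k_ft_in1k')
print(split_model_name_tag('resnet50'))
# -> ('resnet50', '')
```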
* `torch.compile()` support added, use the `--torchcompile` argument

| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k | 88.6 | 632.5 | 391 | 407.5 | link |
| vit_large_patch14_clip_336.openai_ft_in12k_in1k | 88.3 | 304.5 | 191.1 | 270.2 | link |
| vit_huge_patch14_clip_224.laion2b_ft_in12k_in1k | 88.2 | 632 | 167.4 | 139.4 | link |
| vit_large_patch14_clip_336.laion2b_ft_in12k_in1k | 88.2 | 304.5 | 191.1 | 270.2 | link |
| vit_large_patch14_clip_224.openai_ft_in12k_in1k | 88.2 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_224.laion2b_ft_in12k_in1k | 87.9 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_224.openai_ft_in1k | 87.9 | 304.2 | 81.1 | 88.8 | link |
| vit_large_patch14_clip_336.laion2b_ft_in1k | 87.9 | 304.5 | 191.1 | 270.2 | link |
| vit_huge_patch14_clip_224.laion2b_ft_in1k | 87.6 | 632 | 167.4 | 139.4 | link |
| vit_large_patch14_clip_224.laion2b_ft_in1k | 87.3 | 304.2 | 81.1 | 88.8 | link |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 87.2 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.openai_ft_in12k_in1k | 87 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.laion2b_ft_in1k | 86.6 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_384.openai_ft_in1k | 86.2 | 86.9 | 55.5 | 101.6 | link |
| vit_base_patch16_clip_224.laion2b_ft_in12k_in1k | 86.2 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch16_clip_224.openai_ft_in12k_in1k | 85.9 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_448.laion2b_ft_in12k_in1k | 85.8 | 88.3 | 17.9 | 23.9 | link |
| vit_base_patch16_clip_224.laion2b_ft_in1k | 85.5 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_384.laion2b_ft_in12k_in1k | 85.4 | 88.3 | 13.1 | 16.5 | link |
| vit_base_patch16_clip_224.openai_ft_in1k | 85.3 | 86.6 | 17.6 | 23.9 | link |
| vit_base_patch32_clip_384.openai_ft_in12k_in1k | 85.2 | 88.3 | 13.1 | 16.5 | link |
| vit_base_patch32_clip_224.laion2b_ft_in12k_in1k | 83.3 | 88.2 | 4.4 | 5 | link |
| vit_base_patch32_clip_224.laion2b_ft_in1k | 82.6 | 88.2 | 4.4 | 5 | link |
| vit_base_patch32_clip_224.openai_ft_in1k | 81.9 | 88.2 | 4.4 | 5 | link |
| model | top1 | param_count | gmac | macts | hub |
|---|---|---|---|---|---|
| maxvit_xlarge_tf_512.in21k_ft_in1k | 88.5 | 475.8 | 534.1 | 1413.2 | link |
| maxvit_xlarge_tf_384.in21k_ft_in1k | 88.3 | 475.3 | 292.8 | 668.8 | link |
| maxvit_base_tf_512.in21k_ft_in1k | 88.2 | 119.9 | 138 | 704 | link |
| maxvit_large_tf_512.in21k_ft_in1k | 88 | 212.3 | 244.8 | 942.2 | link |
| maxvit_large_tf_384.in21k_ft_in1k | 88 | 212 | 132.6 | 445.8 | link |
| maxvit_base_tf_384.in21k_ft_in1k | 87.9 | 119.6 | 73.8 | 332.9 | link |
| maxvit_base_tf_512.in1k | 86.6 | 119.9 | 138 | 704 | link |
| maxvit_large_tf_512.in1k | 86.5 | 212.3 | 244.8 | 942.2 | link |
| maxvit_base_tf_384.in1k | 86.3 | 119.6 | 73.8 | 332.9 | link |
| maxvit_large_tf_384.in1k | 86.2 | 212 | 132.6 | 445.8 | link |
| maxvit_small_tf_512.in1k | 86.1 | 69.1 | 67.3 | 383.8 | link |
| maxvit_tiny_tf_512.in1k | 85.7 | 31 | 33.5 | 257.6 | link |
| maxvit_small_tf_384.in1k | 85.5 | 69 | 35.9 | 183.6 | link |
| maxvit_tiny_tf_384.in1k | 85.1 | 31 | 17.5 | 123.4 | link |
| maxvit_large_tf_224.in1k | 84.9 | 211.8 | 43.7 | 127.4 | link |
| maxvit_base_tf_224.in1k | 84.9 | 119.5 | 24 | 95 | link |
| maxvit_small_tf_224.in1k | 84.4 | 68.9 | 11.7 | 53.2 | link |
| maxvit_tiny_tf_224.in1k | 83.4 | 30.9 | 5.6 | 35.8 | link |
* NVIDIA Apex AMP supported via `--amp-impl apex`, bfloat16 supported via `--amp-dtype bfloat16`
* Minor bug fixes to HF `push_to_hub`, plus some more MaxVit weights
* New weights in the `maxxvit` series, incl. the first ConvNeXt-block-based `coatnext` and `maxxvit` experiments:
  * `coatnext_nano_rw_224` - 82.0 @ 224 (G) -- (uses ConvNeXt conv block, no BatchNorm)
  * `maxxvit_rmlp_nano_rw_256` - 83.0 @ 256, 83.7 @ 320 (G) (uses ConvNeXt conv block, no BN)
  * `maxvit_rmlp_small_rw_224` - 84.5 @ 224, 85.1 @ 320 (G)
  * `maxxvit_rmlp_small_rw_256` - 84.6 @ 256, 84.9 @ 288 (G) -- could be trained better, hparams need tuning (uses ConvNeXt block, no BN)
  * `coatnet_rmlp_2_rw_224` - 84.6 @ 224, 85 @ 320 (T)