* `timm` docs home now exists, look for more here in the future
* More weights in the `maxxvit` series incl. a pico (7.5M params, 1.9 GMACs) and two tiny variants:
* `maxvit_rmlp_pico_rw_256` - 80.5 @ 256, 81.3 @ 320 (T)
* `maxvit_tiny_rw_224` - 83.5 @ 224 (G)
* `maxvit_rmlp_tiny_rw_256` - 84.2 @ 256, 84.8 @ 320 (T)
* `maxvit_rmlp_nano_rw_256` - 83.0 @ 256, 83.6 @ 320 (T)

`timm` original models, found in the `maxxvit.py` model def (contains numerous experiments outside the scope of the original papers):

* `coatnet_nano_rw_224` - 81.7 @ 224 (T)
* `coatnet_rmlp_nano_rw_224` - 82.0 @ 224, 82.8 @ 320 (T)
* `coatnet_0_rw_224` - 82.4 (T) -- NOTE timm '0' coatnets have 2 more 3rd stage blocks
* `coatnet_bn_0_rw_224` - 82.4 (T)
* `maxvit_nano_rw_256` - 82.9 @ 256 (T)
* `coatnet_rmlp_1_rw_224` - 83.4 @ 224, 84.0 @ 320 (T)
* `coatnet_1_rw_224` - 83.6 @ 224 (G)

(T) = TPU trained with `bits_and_tpu` branch training code, (G) = GPU trained

(model code is a `timm` re-write for license purposes)

* `convnext_atto` - 75.7 @ 224, 77.0 @ 288
* `convnext_atto_ols` - 75.9 @ 224, 77.2 @ 288
* `convnext_femto` - 77.5 @ 224, 78.7 @ 288
* `convnext_femto_ols` - 77.9 @ 224, 78.9 @ 288
* `convnext_pico` - 79.5 @ 224, 80.4 @ 288
* `convnext_pico_ols` - 79.5 @ 224, 80.5 @ 288
* `convnext_nano_ols` - 80.9 @ 224, 81.6 @ 288

`timm` trained weights -- weights were created reproducing the paper architectures and exploring `timm`-specific additions such as ConvNeXt blocks, parallel partitioning, and other experiments.
Weights were trained on a mix of TPU and GPU systems. The bulk of the weights were trained on TPUs via the TRC program (https://sites.research.google/trc/about/).
CoAtNet variants run particularly well on TPU; it's a great combination. MaxViT is better suited to GPU due to the window partitioning, although there are some optimizations that can be made to improve TPU padding/utilization, including using a 256x256 image size with an (8, 8) window/grid size, and keeping tensors in NCHW format for partition attention when using PyTorch XLA.
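All of these weights load through the usual `timm` API; a minimal sketch, assuming the pretrained checkpoint is available for download (any of the model names listed above can be substituted):

```python
import torch
import timm

# Load one of the pretrained MaxViT variants listed above at its native 256x256 resolution.
model = timm.create_model('maxvit_rmlp_pico_rw_256', pretrained=True)
model.eval()

# Dummy forward pass; for real evaluation, use the model's default config transforms
# (timm.data.resolve_data_config / timm.data.create_transform).
x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```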
Glossary:
* `coatnet` - CoAtNet (MBConv + transformer blocks)
* `coatnext` - CoAtNet w/ ConvNeXt conv blocks
* `maxvit` - MaxViT (MBConv + block (ala swin) and grid partitioning transformer blocks)
* `maxxvit` - MaxViT w/ ConvNeXt conv blocks
* `rmlp` - relative position embedding w/ MLP (can be resized) -- if this isn't in the model name, it's using relative position bias (ala swin)
* `rw` - my variations on the model, slight differences in sizing / pooling / etc. from the Google paper spec

Results:
* `maxvit_rmlp_pico_rw_256` - 80.5 @ 256, 81.3 @ 320 (T)
* `coatnet_nano_rw_224` - 81.7 @ 224 (T)
* `coatnext_nano_rw_224` - 82.0 @ 224 (G) -- (uses convnext block, no BatchNorm)
* `coatnet_rmlp_nano_rw_224` - 82.0 @ 224, 82.8 @ 320 (T)
* `coatnet_0_rw_224` - 82.4 (T) -- NOTE timm '0' coatnets have 2 more 3rd stage blocks
* `coatnet_bn_0_rw_224` - 82.4 (T) -- all BatchNorm, no LayerNorm
* `maxvit_nano_rw_256` - 82.9 @ 256 (T)
* `maxvit_rmlp_nano_rw_256` - 83.0 @ 256, 83.6 @ 320 (T)
* `maxxvit_rmlp_nano_rw_256` - 83.0 @ 256, 83.7 @ 320 (G) -- (uses convnext conv block, no BatchNorm)
* `coatnet_rmlp_1_rw_224` - 83.4 @ 224, 84.0 @ 320 (T)
* `maxvit_tiny_rw_224` - 83.5 @ 224 (G)
* `coatnet_1_rw_224` - 83.6 @ 224 (G)
* `maxvit_rmlp_tiny_rw_256` - 84.2 @ 256, 84.8 @ 320 (T)
* `maxvit_rmlp_small_rw_224` - 84.5 @ 224, 85.1 @ 320 (G)
* `maxxvit_rmlp_small_rw_256` - 84.6 @ 256, 84.9 @ 288 (G) -- could be trained better, hparams need tuning (uses convnext conv block, no BN)
* `coatnet_rmlp_2_rw_224` - 84.6 @ 224, 85.0 @ 320 (T)

(T) = TPU trained with `bits_and_tpu` branch training code, (G) = GPU trained
Rehosted and remapped checkpoints from https://github.com/snap-research/EfficientFormer (originals in Google Drive)
Heavily remapped from the originals at https://github.com/NVlabs/GCVit due to a from-scratch re-write of the model code
NOTE: these checkpoints have a non-commercial CC-BY-NC-SA-4.0 license.
Minor bug fixes and a few more weights since 0.6.5
* `darknetaa53` - 79.8 @ 256, 80.5 @ 288
* `convnext_nano` - 80.8 @ 224, 81.5 @ 288
* `cs3sedarknet_l` - 81.2 @ 256, 81.8 @ 288
* `cs3darknet_x` - 81.8 @ 256, 82.2 @ 288
* `cs3sedarknet_x` - 82.2 @ 256, 82.7 @ 288
* `cs3edgenet_x` - 82.2 @ 256, 82.7 @ 288
* `cs3se_edgenet_x` - 82.8 @ 256, 83.5 @ 320

`cs3*` weights above all trained on TPU w/ the `bits_and_tpu` branch. Thanks to the TRC program!

First official release in a long while (since 0.5.4). Full change log since 0.5.4 below.
More models, more fixes
* ResNet defs added by request with 1 block repeats for both basic and bottleneck (resnet10 and resnet14)
* CspNet refactored with dataclass config, simplified CrossStage3 (`cs3`) option. These are closer to YOLO-v5+ backbone defs.
* Two `srelpos` (shared relative position) vit models trained, and a medium w/ class token.
* An EdgeNeXt `small` model trained. Better than the original small, but not their new USI trained weights.
* `resnet10t` - 66.5 @ 176, 68.3 @ 224
* `resnet14t` - 71.3 @ 176, 72.3 @ 224
* `resnetaa50` - 80.6 @ 224, 81.6 @ 288
* `darknet53` - 80.0 @ 256, 80.5 @ 288
* `cs3darknet_m` - 77.0 @ 256, 77.6 @ 288
* `cs3darknet_focus_m` - 76.7 @ 256, 77.3 @ 288
* `cs3darknet_l` - 80.4 @ 256, 80.9 @ 288
* `cs3darknet_focus_l` - 80.3 @ 256, 80.9 @ 288
* `vit_srelpos_small_patch16_224` - 81.1 @ 224, 82.1 @ 320
* `vit_srelpos_medium_patch16_224` - 82.3 @ 224, 83.1 @ 320
* `vit_relpos_small_patch16_cls_224` - 82.6 @ 224, 83.6 @ 320
* `edgenext_small_rw` - 79.6 @ 224, 80.4 @ 320
* `cs3`, darknet, and `vit_*relpos` weights above all trained on TPU thanks to the TRC program! Rest trained on overheating GPUs.
* Support for changing the image extensions scanned by `timm` datasets/parsers. See (https://github.com/rwightman/pytorch-image-models/pull/1274#issuecomment-1178303103)
* ConvNeXt LayerNorm impl defaults to `F.layer_norm(x.permute(0, 2, 3, 1), ...).permute(0, 3, 1, 2)` via `LayerNorm2d` in all cases (see the sketch after this list).
* The previous impl exists as `LayerNormExp2d` in `models/layers/norm.py`
* Refactoring of the `timm` Swin-V2-CR impl; will likely do a bit more to bring parts closer to official and decide whether to merge some aspects.
* `vit_relpos_small_patch16_224` - 81.5 @ 224, 82.5 @ 320 -- rel pos, layer scale, no class token, avg pool
* `vit_relpos_medium_patch16_rpn_224` - 82.3 @ 224, 83.1 @ 320 -- rel pos + res-post-norm, no class token, avg pool
* `vit_relpos_medium_patch16_224` - 82.5 @ 224, 83.3 @ 320 -- rel pos, layer scale, no class token, avg pool
* `vit_relpos_base_patch16_gapcls_224` - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake)
* Experiments with relative position embeddings (`vision_transformer_relpos.py`) and Residual Post-Norm branches (from Swin-V2) (`vision_transformer*.py`)
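The `LayerNorm2d` pattern referenced above normalizes the channel dim of NCHW tensors by permuting to NHWC for `F.layer_norm` and back; a minimal sketch of the idea (not necessarily identical to the `timm` implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dim of NCHW tensors via a permute to NHWC and back."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -> (N, H, W, C), normalize over C, then back to (N, C, H, W)
        x = x.permute(0, 2, 3, 1)
        x = F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2)

# usage
norm = LayerNorm2d(64)
y = norm(torch.randn(2, 64, 56, 56))
```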
* `vit_relpos_base_patch32_plus_rpn_256` - 79.5 @ 256, 80.6 @ 320 -- rel pos + extended width + res-post-norm, no class token, avg pool
* `vit_relpos_base_patch16_224` - 82.5 @ 224, 83.6 @ 320 -- rel pos, layer scale, no class token, avg pool
* `vit_base_patch16_rpn_224` - 82.3 @ 224 -- rel pos + res-post-norm, no class token, avg pool
* Vision transformer refactor to remove the representation layer that was only used in the initial vit and rarely used since with newer pretrain (ie *How to Train Your ViT*)
* `vit_*` models support removal of class token, use of global average pool, use of fc_norm (ala beit, mae).
* `timm` models are now officially supported in fast.ai! Just in time for the new Practical Deep Learning course. `timmdocs` documentation link updated to timm.fast.ai.
* `seresnext101d_32x8d` - 83.69 @ 224, 84.35 @ 288
* `seresnextaa101d_32x8d` (anti-aliased w/ AvgPool2d) - 83.85 @ 224, 84.57 @ 288
* Added `ParallelBlock` and `LayerScale` options to base vit models to support the model configs in *Three things everyone should know about ViT*
* `convnext_tiny_hnf` (head norm first) weights trained with (close to) A2 recipe, 82.2% top-1, could do better with more epochs.
* Merged `norm_norm_norm`. IMPORTANT: this update for a coming 0.6.x release will likely de-stabilize the master branch for a while. Branch 0.5.x or a previous 0.5.x release can be used if stability is required.
* `regnety_040` - 82.3 @ 224, 82.96 @ 288
* `regnety_064` - 83.0 @ 224, 83.65 @ 288
* `regnety_080` - 83.17 @ 224, 83.86 @ 288
* `regnetv_040` - 82.44 @ 224, 83.18 @ 288 (timm pre-act)
* `regnetv_064` - 83.1 @ 224, 83.71 @ 288 (timm pre-act)
* `regnetz_040` - 83.67 @ 256, 84.25 @ 320
* `regnetz_040h` - 83.77 @ 256, 84.5 @ 320 (w/ extra fc in head)
* `resnetv2_50d_gn` - 80.8 @ 224, 81.96 @ 288 (pre-act GroupNorm)
* `resnetv2_50d_evos` - 80.77 @ 224, 82.04 @ 288 (pre-act EvoNormS)
* `regnetz_c16_evos` - 81.9 @ 256, 82.64 @ 320 (EvoNormS)
* `regnetz_d8_evos` - 83.42 @ 256, 84.04 @ 320 (EvoNormS)
* `xception41p` - 82.0 @ 299 (timm pre-act)
* `xception65` - 83.17 @ 299
* `xception65p` - 83.14 @ 299 (timm pre-act)
* `resnext101_64x4d` - 82.46 @ 224, 83.16 @ 288
* `seresnext101_32x8d` - 83.57 @ 224, 84.27 @ 288
* `resnetrs200` - 83.85 @ 256, 84.44 @ 320
* `forward_head(x, pre_logits=False)` fn added to all models to allow separate calls of `forward_features` + `forward_head` (see the sketch after this list)
* `forward_features` now returns unpooled features; for consistency with CNN models, token selection or pooling is now applied in `forward_head`
* An exhaustive run-through of `timm` was posted on a blog yesterday; well worth a read: *Getting Started with PyTorch Image Models (timm): A Practitioner's Guide*
* Prepping to merge the `norm_norm_norm` branch back to master (ver 0.6.x) in the next week or so.
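A minimal sketch of the split forward path described above (the model name is just an example; any `timm` model exposes the same two calls):

```python
import torch
import timm

model = timm.create_model('resnet50', pretrained=False)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feats = model.forward_features(x)                    # unpooled backbone features
    pooled = model.forward_head(feats, pre_logits=True)  # pooled features, classifier skipped
    logits = model.forward_head(feats)                   # pooled features + classifier
print(feats.shape, pooled.shape, logits.shape)
```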
* `pip install git+https://github.com/rwightman/pytorch-image-models` installs!
* 0.5.x releases and a 0.5.x branch will remain stable with a cherry pick or two until dust clears. Recommend sticking to pypi install for a bit if you want stable.

This release holds weights for `timm`'s variant of Swin V2 (from the @ChristophReich1996 impl, https://github.com/ChristophReich1996/Swin-Transformer-V2).
NOTE: the `ns` variants of the models have extra norms on the main branch at the end of each stage; this seems to help training. The current small model is not using this, but one is currently training. A non-`ns` tiny will follow soon, as well as a comparison. In21k and 1k base models are also in the works...
Small checkpoints trained on TPU-VM instances via the TPU Research Cloud (https://sites.research.google/trc/about/)
* `swin_v2_tiny_ns_224` - 81.80 top-1
* `swin_v2_small_224` - 83.13 top-1
* `swin_v2_small_ns_224` - 83.5 top-1

A wide range of mid-large sized models trained in PyTorch XLA on TPU VM instances, demonstrating the viability of the TPU + PyTorch combo for excellent image model results. All models trained w/ the `bits_and_tpu` branch of this codebase.
A big thanks to the TPU Research Cloud (https://sites.research.google/trc/about/) for the compute used in these experiments.
This set includes several novel weights, including EvoNorm-S RegNetZ (C/D timm variants) and ResNet-V2 model experiments, as well as custom pre-activation model variants of RegNet-Y (called RegNet-V) and Xception (Xception-P) models.
Many if not all of the included RegNet weights surpass the original paper results by a wide margin and remain above other known results (e.g. recent torchvision updates) in ImageNet-1k validation, and especially in OOD test set / robustness performance and when scaling to higher resolutions.
* `regnety_040` - 82.3 @ 224, 82.96 @ 288
* `regnety_064` - 83.0 @ 224, 83.65 @ 288
* `regnety_080` - 83.17 @ 224, 83.86 @ 288
* `regnetv_040` - 82.44 @ 224, 83.18 @ 288 (timm pre-act)
* `regnetv_064` - 83.1 @ 224, 83.71 @ 288 (timm pre-act)
* `regnetz_040` - 83.67 @ 256, 84.25 @ 320
* `regnetz_040h` - 83.77 @ 256, 84.5 @ 320 (w/ extra fc in head)
* `resnetv2_50d_gn` - 80.8 @ 224, 81.96 @ 288 (pre-act GroupNorm)
* `resnetv2_50d_evos` - 80.77 @ 224, 82.04 @ 288 (pre-act EvoNormS)
* `regnetz_c16_evos` - 81.9 @ 256, 82.64 @ 320 (EvoNormS)
* `regnetz_d8_evos` - 83.42 @ 256, 84.04 @ 320 (EvoNormS)
* `xception41p` - 82.0 @ 299 (timm pre-act)
* `xception65` - 83.17 @ 299
* `xception65p` - 83.14 @ 299 (timm pre-act)
* `resnext101_64x4d` - 82.46 @ 224, 83.16 @ 288
* `seresnext101_32x8d` - 83.57 @ 224, 84.27 @ 288
* `seresnext101d_32x8d` - 83.69 @ 224, 84.35 @ 288
* `seresnextaa101d_32x8d` - 83.85 @ 224, 84.57 @ 288
* `resnetrs200` - 83.85 @ 256, 84.44 @ 320
* `vit_relpos_base_patch32_plus_rpn_256` - 79.5 @ 256, 80.6 @ 320 -- rel pos + extended width + res-post-norm, no class token, avg pool
* `vit_relpos_small_patch16_224` - 81.5 @ 224, 82.5 @ 320 -- rel pos, layer scale, no class token, avg pool
* `vit_relpos_medium_patch16_rpn_224` - 82.3 @ 224, 83.1 @ 320 -- rel pos + res-post-norm, no class token, avg pool
* `vit_base_patch16_rpn_224` - 82.3 @ 224 -- rel pos + res-post-norm, no class token, avg pool
* `vit_relpos_medium_patch16_224` - 82.5 @ 224, 83.3 @ 320 -- rel pos, layer scale, no class token, avg pool
* `vit_relpos_base_patch16_224` - 82.5 @ 224, 83.6 @ 320 -- rel pos, layer scale, no class token, avg pool
* `vit_relpos_base_patch16_gapcls_224` - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake)

Pretrained weights for MobileViT and MobileViT-V2 adapted from the Apple impl at https://github.com/apple/ml-cvnets
Checkpoints remapped to timm impl of the model with BGR corrected to RGB (for V1).
Paper: https://arxiv.org/abs/2110.00476
More details on weights and hparams to come...
A collection of weights I've trained comparing various types of SE-like (SE, ECA, GC, etc), self-attention (bottleneck, halo, lambda) blocks, and related non-attn baselines.
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| botnet26t_256 | 79.246 | 20.754 | 94.53 | 5.47 | 12.49 | 256 | 0.95 | bicubic |
| halonet26t | 79.13 | 20.87 | 94.314 | 5.686 | 12.48 | 256 | 0.95 | bicubic |
| lambda_resnet26t | 79.112 | 20.888 | 94.59 | 5.41 | 10.96 | 256 | 0.94 | bicubic |
| lambda_resnet26rpt_256 | 78.964 | 21.036 | 94.428 | 5.572 | 10.99 | 256 | 0.94 | bicubic |
| resnet26t | 77.872 | 22.128 | 93.834 | 6.166 | 16.01 | 256 | 0.94 | bicubic |
Details:
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 2967.55 | 86.252 | 256 | 256 | 857.62 | 297.984 | 256 | 256 | 16.01 |
| botnet26t_256 | 2642.08 | 96.879 | 256 | 256 | 809.41 | 315.706 | 256 | 256 | 12.49 |
| halonet26t | 2601.91 | 98.375 | 256 | 256 | 783.92 | 325.976 | 256 | 256 | 12.48 |
| lambda_resnet26t | 2354.1 | 108.732 | 256 | 256 | 697.28 | 366.521 | 256 | 256 | 10.96 |
| lambda_resnet26rpt_256 | 1847.34 | 138.563 | 256 | 256 | 644.84 | 197.892 | 128 | 256 | 10.99 |
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 3691.94 | 69.327 | 256 | 256 | 1188.17 | 214.96 | 256 | 256 | 16.01 |
| botnet26t_256 | 3291.63 | 77.76 | 256 | 256 | 1126.68 | 226.653 | 256 | 256 | 12.49 |
| halonet26t | 3230.5 | 79.232 | 256 | 256 | 1077.82 | 236.934 | 256 | 256 | 12.48 |
| lambda_resnet26rpt_256 | 2324.15 | 110.133 | 256 | 256 | 864.42 | 147.485 | 128 | 256 | 10.99 |
| lambda_resnet26t | Not Supported |  |  |  |  |  |  |  |  |
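The infer_samples_per_sec columns in the tables above can be approximated with a simple timing loop; a rough sketch in plain PyTorch (not the exact benchmark script or hardware used for these numbers; assumes a CUDA device):

```python
import time
import torch
import timm

def inference_throughput(name: str, batch_size: int = 256, img_size: int = 256, steps: int = 50) -> float:
    """Rough inference samples/sec for a timm model on the current CUDA device."""
    model = timm.create_model(name, pretrained=False).cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device='cuda')
    with torch.no_grad():
        for _ in range(10):  # warmup iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(x)
        torch.cuda.synchronize()
    return steps * batch_size / (time.perf_counter() - start)

print(inference_throughput('halonet26t'))  # model name from the tables above
```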
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| eca_halonext26ts | 79.484 | 20.516 | 94.600 | 5.400 | 10.76 | 256 | 0.94 | bicubic |
| eca_botnext26ts_256 | 79.270 | 20.730 | 94.594 | 5.406 | 10.59 | 256 | 0.95 | bicubic |
| bat_resnext26ts | 78.268 | 21.732 | 94.1 | 5.9 | 10.73 | 256 | 0.9 | bicubic |
| seresnext26ts | 77.852 | 22.148 | 93.784 | 6.216 | 10.39 | 256 | 0.9 | bicubic |
| gcresnext26ts | 77.804 | 22.196 | 93.824 | 6.176 | 10.48 | 256 | 0.9 | bicubic |
| eca_resnext26ts | 77.446 | 22.554 | 93.57 | 6.43 | 10.3 | 256 | 0.9 | bicubic |
| resnext26ts | 76.764 | 23.236 | 93.136 | 6.864 | 10.3 | 256 | 0.9 | bicubic |
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3006.57 | 85.134 | 256 | 256 | 864.4 | 295.646 | 256 | 256 | 10.3 |
| seresnext26ts | 2931.27 | 87.321 | 256 | 256 | 836.92 | 305.193 | 256 | 256 | 10.39 |
| eca_resnext26ts | 2925.47 | 87.495 | 256 | 256 | 837.78 | 305.003 | 256 | 256 | 10.3 |
| gcresnext26ts | 2870.01 | 89.186 | 256 | 256 | 818.35 | 311.97 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 2652.03 | 96.513 | 256 | 256 | 790.43 | 323.257 | 256 | 256 | 10.59 |
| eca_halonext26ts | 2593.03 | 98.705 | 256 | 256 | 766.07 | 333.541 | 256 | 256 | 10.76 |
| bat_resnext26ts | 2469.78 | 103.64 | 256 | 256 | 697.21 | 365.964 | 256 | 256 | 10.73 |
NOTE: there are performance issues with certain grouped conv configs in channels-last layout; the backwards pass in particular is really slow. This also causes issues for RegNet and NFNet networks.
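For reference, "channels last" refers to PyTorch's NHWC memory format; a minimal sketch of opting a model and its inputs into it (plain PyTorch API, using one of the model names above; assumes a CUDA device):

```python
import torch
import timm

# Convert model weights and inputs to NHWC (channels_last) memory format.
model = timm.create_model('resnext26ts', pretrained=False).cuda().eval()
model = model.to(memory_format=torch.channels_last)

x = torch.randn(8, 3, 256, 256, device='cuda').to(memory_format=torch.channels_last)
with torch.no_grad():
    y = model(x)  # grouped convs can hit slow cuDNN paths in this layout, esp. in the backward pass
```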
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3952.37 | 64.755 | 256 | 256 | 608.67 | 420.049 | 256 | 256 | 10.3 |
| eca_resnext26ts | 3815.77 | 67.074 | 256 | 256 | 594.35 | 430.146 | 256 | 256 | 10.3 |
| seresnext26ts | 3802.75 | 67.304 | 256 | 256 | 592.82 | 431.14 | 256 | 256 | 10.39 |
| gcresnext26ts | 3626.97 | 70.57 | 256 | 256 | 581.83 | 439.119 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 3515.84 | 72.8 | 256 | 256 | 611.71 | 417.862 | 256 | 256 | 10.59 |
| eca_halonext26ts | 3410.12 | 75.057 | 256 | 256 | 597.52 | 427.789 | 256 | 256 | 10.76 |
| bat_resnext26ts | 3053.83 | 83.811 | 256 | 256 | 533.23 | 478.839 | 256 | 256 | 10.73 |
The 33-layer models have an extra 1x1 FC layer between the last conv block and the classifier. There is both a non-attention 33-layer baseline and a 32-layer baseline without the extra FC.
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| sehalonet33ts | 80.986 | 19.014 | 95.272 | 4.728 | 13.69 | 256 | 0.94 | bicubic |
| seresnet33ts | 80.388 | 19.612 | 95.108 | 4.892 | 19.78 | 256 | 0.94 | bicubic |
| eca_resnet33ts | 80.132 | 19.868 | 95.054 | 4.946 | 19.68 | 256 | 0.94 | bicubic |
| gcresnet33ts | 79.99 | 20.01 | 94.988 | 5.012 | 19.88 | 256 | 0.94 | bicubic |
| resnet33ts | 79.352 | 20.648 | 94.596 | 5.404 | 19.68 | 256 | 0.94 | bicubic |
| resnet32ts | 79.028 | 20.972 | 94.444 | 5.556 | 17.96 | 256 | 0.94 | bicubic |
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet32ts | 2502.96 | 102.266 | 256 | 256 | 733.27 | 348.507 | 256 | 256 | 17.96 |
| resnet33ts | 2473.92 | 103.466 | 256 | 256 | 725.34 | 352.309 | 256 | 256 | 19.68 |
| seresnet33ts | 2400.18 | 106.646 | 256 | 256 | 695.19 | 367.413 | 256 | 256 | 19.78 |
| eca_resnet33ts | 2394.77 | 106.886 | 256 | 256 | 696.93 | 366.637 | 256 | 256 | 19.68 |
| gcresnet33ts | 2342.81 | 109.257 | 256 | 256 | 678.22 | 376.404 | 256 | 256 | 19.88 |
| sehalonet33ts | 1857.65 | 137.794 | 256 | 256 | 577.34 | 442.545 | 256 | 256 | 13.69 |
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet32ts | 3306.22 | 77.416 | 256 | 256 | 1012.82 | 252.158 | 256 | 256 | 17.96 |
| resnet33ts | 3257.59 | 78.573 | 256 | 256 | 1002.38 | 254.778 | 256 | 256 | 19.68 |
| seresnet33ts | 3128.08 | 81.826 | 256 | 256 | 950.27 | 268.581 | 256 | 256 | 19.78 |
| eca_resnet33ts | 3127.11 | 81.852 | 256 | 256 | 948.84 | 269.123 | 256 | 256 | 19.68 |
| gcresnet33ts | 2984.87 | 85.753 | 256 | 256 | 916.98 | 278.169 | 256 | 256 | 19.88 |
| sehalonet33ts | 2188.23 | 116.975 | 256 | 256 | 711.63 | 179.03 | 128 | 256 | 13.69 |
In Progress
`haloregnetz_c` uses halo attention for all of the last stage, and interleaved in every 3rd block (of 4) in the penultimate stage.

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| regnetz_d | 83.422 | 16.578 | 96.636 | 3.364 | 27.58 | 256 | 0.95 | bicubic |
| regnetz_c | 82.164 | 17.836 | 96.058 | 3.942 | 13.46 | 256 | 0.94 | bicubic |
| haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
| regnetz_b | 79.868 | 20.132 | 94.988 | 5.012 | 9.72 | 224 | 0.94 | bicubic |
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| regnetz_d | 84.04 | 15.96 | 96.87 | 3.13 | 27.58 | 320 | 0.95 | bicubic |
| regnetz_c | 82.516 | 17.484 | 96.356 | 3.644 | 13.46 | 320 | 0.94 | bicubic |
| haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
| regnetz_b | 80.728 | 19.272 | 95.47 | 4.53 | 9.72 | 288 | 0.94 | bicubic |
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|---|
| regnetz_b | 2703.42 | 94.68 | 256 | 224 | 1.45 | 764.85 | 333.348 | 256 | 224 | 9.72 |
| haloregnetz_b | 2086.22 | 122.695 | 256 | 224 | 1.88 | 620.1 | 411.415 | 256 | 224 | 11.68 |
| regnetz_c | 1653.19 | 154.836 | 256 | 256 | 2.51 | 459.41 | 277.268 | 128 | 256 | 13.46 |
| regnetz_d | 1060.91 | 241.284 | 256 | 256 | 5.98 | 296.51 | 430.143 | 128 | 256 | 27.58 |
NOTE: channels-last layout is painfully slow for the backward pass here, due to some sort of cuDNN issue
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|---|
| regnetz_b | 4152.59 | 61.634 | 256 | 224 | 1.45 | 399.37 | 639.572 | 256 | 224 | 9.72 |
| haloregnetz_b | 2770.78 | 92.378 | 256 | 224 | 1.88 | 364.22 | 701.386 | 256 | 224 | 11.68 |
| regnetz_c | 2512.4 | 101.878 | 256 | 256 | 2.51 | 376.72 | 338.372 | 128 | 256 | 13.46 |
| regnetz_d | 1456.05 | 175.8 | 256 | 256 | 5.98 | 111.32 | 1148.279 | 128 | 256 | 27.58 |
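The param_count and infer_GMACs columns can be approximated outside the benchmark harness; a rough sketch, assuming the third-party `fvcore` package for MAC counting (fvcore counts fused multiply-adds) and using a model name as listed in the tables above (names may differ in newer `timm` releases):

```python
import torch
import timm
from fvcore.nn import FlopCountAnalysis  # third-party: pip install fvcore

model = timm.create_model('regnetz_d', pretrained=False).eval()  # name as listed above

param_count = sum(p.numel() for p in model.parameters())
x = torch.randn(1, 3, 256, 256)
macs = FlopCountAnalysis(model, x).total()  # fused multiply-adds for one 256x256 image

print(f"params: {param_count / 1e6:.2f}M, GMACs: {macs / 1e9:.2f}")
```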
A catch-all (ish) release for storing vision transformer weights adapted/rehosted from 3rd parties. Too many incoming models for one release per source...
Containing weights from:
Weights from https://github.com/google/automl/tree/master/efficientnetv2
Paper: EfficientNetV2: Smaller Models and Faster Training - https://arxiv.org/abs/2104.00298
Weights for ResNet-RS models as per #554. Ported from the Tensorflow impl (https://github.com/tensorflow/tpu/tree/master/models/official/resnet/resnet_rs) by @amaarora
Weights for CoaT: Co-Scale Conv-Attentional Image Transformers (from https://github.com/mlpc-ucsd/CoaT)
Weights from https://github.com/naver-ai/pit
Copyright 2021-present NAVER Corp.
Rehosted here for easy pytorch hub downloads.
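These rehosted checkpoints download automatically the first time a model is created; a minimal sketch (`pit_s_224` is assumed to be one of the PiT model names with rehosted weights):

```python
import timm

# List PiT variants with pretrained weights available, then load one;
# the checkpoint is fetched from the rehosted release assets and cached locally.
print(timm.list_models('pit_*', pretrained=True))
model = timm.create_model('pit_s_224', pretrained=True)
model.eval()
```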
Weights converted from DeepMind Haiku impl of NFNets (https://github.com/deepmind/deepmind-research/tree/master/nfnets)