Model specification discrepancy with paper

#16 opened by Jungwonchang

First, I would like to thank Meta AI for this amazing contribution to the field.

However, I have found what seems to be a slight discrepancy in the model parameters.
This table is from the MMS paper:

[Image: hyperparameter table from the MMS paper]

The table states that the hidden size is 1024, but when I actually inspected the model, it was 1280.
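For anyone who wants to reproduce this, something like the following should work (the checkpoint name is an assumption on my part; any of the 1B MMS checkpoints shows the same encoder dimensions):

```python
from transformers import Wav2Vec2ForCTC

# Assumption: the 1B multilingual MMS ASR checkpoint; substitute whichever
# MMS checkpoint you are inspecting.
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

print(model.config.hidden_size)  # prints 1280, not the 1024 from the table
print(model)                     # full module tree, reproduced below
```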

Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (1-4): 4 x Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=1280, bias=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): Wav2Vec2EncoderStableLayerNorm(
      (pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
        (conv): ParametrizedConv1d(
          1280, 1280, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (padding): Wav2Vec2SamePadLayer()
        (activation): GELUActivation()
      )
      (layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0-47): 48 x Wav2Vec2EncoderLayerStableLayerNorm(
          (attention): Wav2Vec2Attention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (dropout): Dropout(p=0.0, inplace=False)
          (layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (feed_forward): Wav2Vec2FeedForward(
            (intermediate_dropout): Dropout(p=0.05, inplace=False)
            (intermediate_dense): Linear(in_features=1280, out_features=5120, bias=True)
            (intermediate_act_fn): GELUActivation()
            (output_dense): Linear(in_features=5120, out_features=1280, bias=True)
            (output_dropout): Dropout(p=0.0, inplace=False)
          )
          (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (adapter_layer): Wav2Vec2AttnAdapterLayer(
            (norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
            (linear_1): Linear(in_features=1280, out_features=16, bias=True)
            (act_fn): ReLU()
            (linear_2): Linear(in_features=16, out_features=1280, bias=True)
          )
        )
      )
    )
  )
  (dropout): Dropout(p=0.05, inplace=False)
  (lm_head): Linear(in_features=1280, out_features=73, bias=True)
)
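The config agrees with the module tree above; here is a quick cross-check against the paper's table (attribute names are from transformers' Wav2Vec2Config):

```python
cfg = model.config
print(cfg.hidden_size)        # 1280 (the paper's table says 1024)
print(cfg.num_hidden_layers)  # 48
print(cfg.intermediate_size)  # 5120
```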

Also, shouldn't the proportion of additional parameters contributed by the adapter modules be about 0.2% rather than 2%? Each adapter layer adds roughly 1280 × 16 + 16 × 1280 ≈ 41k weights, so the 48 layers contribute about 2M parameters in total, and the model has about 1 billion parameters (i.e. about 1000M), which works out to roughly 0.2%.
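Here is a rough way to check that fraction directly, assuming the `model` loaded above:

```python
# Count parameters belonging to the Wav2Vec2AttnAdapterLayer modules
# (their parameter names contain "adapter_layer").
total = sum(p.numel() for p in model.parameters())
adapter = sum(p.numel() for name, p in model.named_parameters()
              if "adapter_layer" in name)
print(f"total: {total / 1e6:.0f}M, "
      f"adapter: {adapter / 1e6:.1f}M ({100 * adapter / total:.2f}%)")
# ~2M adapter parameters out of ~1000M, i.e. roughly 0.2%, not 2%.
```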
