What does it actually mean to "flush a table"?

This time I'm looking at FLUSH TABLE. What kind of operation is it, exactly?

Trying FLUSH TABLE

First, SELECT from some test table and then run show open tables. The output shows that the test table is open.

mysql> select * from test limit 5;
+----+-------------+
| id | name        |
+----+-------------+
|  0 | kubo1048580 |
|  1 | kubo        |
|  2 | kubo2       |
|  3 | kubo3       |
|  4 | kubo4       |
+----+-------------+
5 rows in set (3.24 sec)

mysql> show open tables;
+----------+-------------------+--------+-------------+
| Database | Table             | In_use | Name_locked |
+----------+-------------------+--------+-------------+
| kubo     | test              |      0 |           0 |
| mysql    | column_statistics |      0 |           0 |
+----------+-------------------+--------+-------------+
2 rows in set (3.53 sec)

Now run flush table test, and the test table disappears from the show open tables output, as shown below.

mysql> flush table test;
Query OK, 0 rows affected (0.01 sec)

mysql> show open tables;
+----------+-------------------+--------+-------------+
| Database | Table             | In_use | Name_locked |
+----------+-------------------+--------+-------------+
| mysql    | column_statistics |      0 |           0 |
+----------+-------------------+--------+-------------+
1 row in set (0.00 sec)

Likewise, running flush table without a table name makes everything, system tables included, disappear from the show open tables output.

mysql> flush table;
Query OK, 0 rows affected (0.01 sec)

mysql> show open tables;
Empty set (0.00 sec)

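Putting this behavior together with the earlier post on open table (linked below), flush table can be understood as the operation that closes opened tables, that is, the operation that clears the table descriptors held in the table cache.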

amamanamam.hatenablog.com

In fact, executing flush table (test) goes through tdc_remove_table, the function that removes TABLE and TABLE_SHARE objects from the table (definition) cache.
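The session above can be replayed against a toy model to make the "open, then flush" idea concrete. This is only a sketch: ToyTableCache and toy_flush_session are hypothetical names invented here, and the real server caches TABLE and TABLE_SHARE objects rather than plain (database, table) pairs.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Toy model only (not MySQL code): one cached descriptor per opened table.
class ToyTableCache {
 public:
  using Name = std::pair<std::string, std::string>;  // (database, table)

  // A statement such as SELECT opens the table and leaves it cached.
  void open(const std::string &db, const std::string &table) {
    open_.emplace(db, table);
  }

  // FLUSH TABLE <name>: drop the descriptor of one table.
  void flush(const std::string &db, const std::string &table) {
    open_.erase(Name(db, table));
  }

  // FLUSH TABLE without a table list: drop every descriptor.
  void flush_all() { open_.clear(); }

  // What SHOW OPEN TABLES would list in this toy model.
  const std::set<Name> &show_open_tables() const { return open_; }

 private:
  std::set<Name> open_;
};

// Replays the session from the article against the toy cache.
inline void toy_flush_session() {
  ToyTableCache cache;
  cache.open("kubo", "test");
  cache.open("mysql", "column_statistics");
  assert(cache.show_open_tables().size() == 2);

  cache.flush("kubo", "test");  // flush table test;
  assert(cache.show_open_tables().size() == 1);

  cache.flush_all();  // flush table;
  assert(cache.show_open_tables().empty());
}
```

The model ignores tables that are in use by another session; how the server treats those is what the rest of the source walk-through is about.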

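With the conclusion already in hand I could stop here, but that would be a bit of a letdown, so let's follow the source. Below is the backtrace down to tdc_remove_table when executing flush table test (MySQL 8.0.28). I'll skim the source frame by frame, from the top of this backtrace down to mysql_execute_command.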

tdc_remove_table(THD * thd, enum_tdc_remove_table_type remove_type, const char * db, const char * table_name, bool has_lock) (/mysql-8.0.28/sql/sql_base.cc:10223)
close_cached_tables(THD * thd, TABLE_LIST * tables, bool wait_for_refresh, ulong timeout) (/mysql-8.0.28/sql/sql_base.cc:1184)
handle_reload_request(THD * thd, unsigned long options, TABLE_LIST * tables, int * write_to_binlog) (/mysql-8.0.28/sql/sql_reload.cc:330)
mysql_execute_command(THD * thd, bool first_level) (/mysql-8.0.28/sql/sql_parse.cc:4044)
dispatch_sql_command(THD * thd, Parser_state * parser_state) (/mysql-8.0.28/sql/sql_parse.cc:5174)
dispatch_command(THD * thd, const COM_DATA * com_data, enum_server_command command) (/mysql-8.0.28/sql/sql_parse.cc:1938)
do_command(THD * thd) (/mysql-8.0.28/sql/sql_parse.cc:1352)
handle_connection(void * arg) (/mysql-8.0.28/sql/conn_handler/connection_handler_per_thread.cc:302)
pfs_spawn_thread(void * arg) (/mysql-8.0.28/storage/perfschema/pfs.cc:2947)
libpthread.so.0!start_thread(void * arg) (/build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477)
libc.so.6!clone() (/build/glibc-SzIz7B/glibc-2.31/sysdeps/unix/sysv/linux/x86_64/clone.S:95)

tdc_remove_table

As mentioned above, this function removes table descriptors from the cache.

First, it builds a key with create_table_def_key and looks that key up in table_def_cache to get an iterator. In other words, it searches the table definition cache for the specified table.

Incidentally, table_def_cache is a malloc_unordered_map, a class template that extends the standard hash map std::unordered_map. Apparently malloc_unordered_map allocates its memory through my_malloc.
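To illustrate how a std::unordered_map can be routed through a malloc-style allocator, here is a minimal sketch. It is not the server's malloc_unordered_map: ToyMallocAllocator, ToyMallocUnorderedMap and toy_map_demo are hypothetical names, and the sketch calls plain std::malloc and std::free instead of my_malloc, so any instrumentation the real allocator may do is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <limits>
#include <new>
#include <string>
#include <unordered_map>
#include <utility>

// Minimal allocator that forwards to std::malloc / std::free.
template <class T>
struct ToyMallocAllocator {
  using value_type = T;

  ToyMallocAllocator() = default;
  template <class U>
  ToyMallocAllocator(const ToyMallocAllocator<U> &) noexcept {}

  T *allocate(std::size_t n) {
    // Guard against overflow in n * sizeof(T).
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
      throw std::bad_alloc();
    void *p = std::malloc(n * sizeof(T));
    if (p == nullptr) throw std::bad_alloc();
    return static_cast<T *>(p);
  }

  void deallocate(T *p, std::size_t) noexcept { std::free(p); }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U>
bool operator==(const ToyMallocAllocator<T> &,
                const ToyMallocAllocator<U> &) noexcept {
  return true;
}
template <class T, class U>
bool operator!=(const ToyMallocAllocator<T> &,
                const ToyMallocAllocator<U> &) noexcept {
  return false;
}

// std::unordered_map with only the allocator swapped out.
template <class Key, class Value>
using ToyMallocUnorderedMap =
    std::unordered_map<Key, Value, std::hash<Key>, std::equal_to<Key>,
                       ToyMallocAllocator<std::pair<const Key, Value>>>;

// Usage: find() and erase() behave exactly like the standard container.
inline void toy_map_demo() {
  ToyMallocUnorderedMap<std::string, int> cache;
  cache.emplace("kubo.test", 1);
  assert(cache.find("kubo.test") != cache.end());
  cache.erase("kubo.test");
  assert(cache.empty());
}
```

Because only the allocator template argument changes, code such as table_def_cache->find(...) and table_def_cache->erase(...) reads the same as it would with a plain std::unordered_map.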

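Next, the helper lambda remove_table removes the entry for the specified table. Concretely, it fetches the TABLE_SHARE object and, if the share is still referenced (ref_count() > 0), calls free_table to remove the associated TABLE objects from the table cache. If nothing references the share, the share itself is erased from table_def_cache right away (unless remove_type is TDC_RT_MARK_FOR_REOPEN).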

void tdc_remove_table(THD *thd, enum_tdc_remove_table_type remove_type,
                      const char *db, const char *table_name, bool has_lock) {
  char key[MAX_DBKEY_LENGTH];
  size_t key_length;
  ...
  
  key_length = create_table_def_key(db, table_name, key);
  
  auto it = table_def_cache->find(string(key, key_length));

  ....
  // Helper function that evicts the TABLE_SHARE pointed to by an iterator.
  auto remove_table = [&](Table_definition_cache::iterator my_it) {
    if (my_it == table_def_cache->end()) return;
    TABLE_SHARE *share = my_it->second.get();
    /*
      Since share->ref_count is incremented when a table share is opened
      in get_table_share(), before LOCK_open is temporarily released, it
      is sufficient to check this condition alone and ignore the
      share->m_open_in_progress flag.

      Note that it is safe to call table_cache_manager.free_table() for
      shares with m_open_in_progress == true, since such shares don't
      have any TABLE objects associated.
    */
    if (share->ref_count() > 0) {
      /*
        Set share's version to zero in order to ensure that it gets
        automatically deleted once it is no longer referenced.

        Note that code in TABLE_SHARE::wait_for_old_version() assumes
        that marking share as old and removal of its unused tables
        and of the share itself from TDC happens atomically under
        protection of LOCK_open, or, putting it another way, that
        TDC does not contain old shares which don't have any tables
        used.
      */
      if (remove_type != TDC_RT_REMOVE_NOT_OWN_KEEP_SHARE &&
          remove_type != TDC_RT_MARK_FOR_REOPEN)
        share->clear_version();
      table_cache_manager.free_table(thd, remove_type, share);
    } else if (remove_type != TDC_RT_MARK_FOR_REOPEN) {
      // There are no TABLE objects associated, so just remove the
      // share immediately. (Assert: When called with
      // TDC_RT_REMOVE_NOT_OWN_KEEP_SHARE, there should always be a
      // TABLE object associated with the primary TABLE_SHARE.)
      assert(remove_type != TDC_RT_REMOVE_NOT_OWN_KEEP_SHARE ||
             share->is_secondary_engine());
      table_def_cache->erase(to_string(share->table_cache_key));
    }
  };

  remove_table(it);
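The two branches of the helper can be condensed into a toy model. This is a sketch under simplifying assumptions, not the server's logic: ToyShare, ToyDefCache and toy_remove_table are hypothetical names, only the TDC_RT_REMOVE_UNUSED case is modeled, and ref_count is treated simply as the number of used plus unused TABLE instances.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy stand-in for TABLE_SHARE (assumption: ref_count = used + unused TABLEs).
struct ToyShare {
  int used_tables = 0;        // TABLE instances some statement is using
  int unused_tables = 0;      // cached TABLE instances nobody is using
  unsigned long version = 1;  // 0 means "old: delete once unreferenced"
  int ref_count() const { return used_tables + unused_tables; }
};

using ToyDefCache = std::unordered_map<std::string, ToyShare>;

// Returns true if an entry was erased from the cache immediately.
inline bool toy_remove_table(ToyDefCache &cache, const std::string &key) {
  auto it = cache.find(key);
  if (it == cache.end()) return false;  // nothing cached for this table

  ToyShare &share = it->second;
  if (share.ref_count() > 0) {
    share.version = 0;        // like share->clear_version()
    share.unused_tables = 0;  // like freeing the unused TABLE objects
    // Still used by someone: keep the (now old) share for later cleanup.
    if (share.ref_count() > 0) return false;
  }
  cache.erase(it);  // no TABLE objects left, drop the share itself
  return true;
}

// Usage: an idle table vanishes at once, a busy one is only marked old.
inline void toy_remove_demo() {
  ToyDefCache cache;
  cache["kubo.idle"].unused_tables = 2;
  cache["kubo.busy"].used_tables = 1;

  assert(toy_remove_table(cache, "kubo.idle"));
  assert(cache.count("kubo.idle") == 0);

  assert(!toy_remove_table(cache, "kubo.busy"));
  assert(cache.at("kubo.busy").version == 0);
}
```

In the toy model the busy share stays in the cache with version 0, which mirrors the source comment about shares being deleted automatically once they are no longer referenced.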

close_cached_tables

Now let's look at the function that calls tdc_remove_table.

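It starts with the branch if (!tables), which checks whether FLUSH TABLE was given a target. This time a specific table is named (FLUSH TABLE test), so the condition is false. Had it been true, the unused entries in both the table cache and the table definition cache would be cleared.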

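Since the condition is false this time, we move on to the else branch. There, get_cached_table_share fetches the TABLE_SHARE object from the table definition cache, and if one is found, tdc_remove_table is called and the TABLE objects are removed from the table cache.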

bool close_cached_tables(THD *thd, TABLE_LIST *tables, bool wait_for_refresh,
                         ulong timeout) {
  bool result = false;
  bool found = true;
  struct timespec abstime;
  DBUG_TRACE;
  assert(thd || (!wait_for_refresh && !tables));

  table_cache_manager.lock_all_and_tdc();
  if (!tables) {
    /*
      Force close of all open tables.

      Note that code in TABLE_SHARE::wait_for_old_version() assumes that
      incrementing of refresh_version and removal of unused tables and
      shares from TDC happens atomically under protection of LOCK_open,
      or putting it another way that TDC does not contain old shares
      which don't have any tables used.
    */
    refresh_version++;
    DBUG_PRINT("tcache",
               ("incremented global refresh_version to: %lu", refresh_version));

    /*
      Get rid of all unused TABLE and TABLE_SHARE instances. By doing
      this we automatically close all tables which were marked as "old".
    */
    table_cache_manager.free_all_unused_tables();
    /* Free table shares which were not freed implicitly by loop above. */
    while (oldest_unused_share->next)
      table_def_cache->erase(to_string(oldest_unused_share->table_cache_key));
  } else {
    bool share_found = false;
    for (TABLE_LIST *table = tables; table; table = table->next_local) {
      TABLE_SHARE *share = get_cached_table_share(table->db, table->table_name);

      if (share) {
        /*
          tdc_remove_table() also sets TABLE_SHARE::version to 0. Note that
          it will work correctly even if m_open_in_progress flag is true.
        */
        tdc_remove_table(thd, TDC_RT_REMOVE_UNUSED, table->db,
                         table->table_name, true);
        share_found = true;
      }
    }
    if (!share_found) wait_for_refresh = false;  // Nothing to wait for
  }

  table_cache_manager.unlock_all_and_tdc();

  ...

  /* Wait until all threads have closed all the tables we are flushing. */
  DBUG_PRINT("info", ("Waiting for other threads to close their open tables"));

  while (found && !thd->killed) {
    TABLE_SHARE *share = nullptr;
    found = false;
    /*
      To a self-deadlock or deadlocks with other FLUSH threads
      waiting on our open HANDLERs, we have to flush them.
    */
    mysql_ha_flush(thd);
    DEBUG_SYNC(thd, "after_flush_unlock");

    mysql_mutex_lock(&LOCK_open);

    if (!tables) {
      for (const auto &key_and_value : *table_def_cache) {
        share = key_and_value.second.get();
        if (share->has_old_version()) {
          found = true;
          break;
        }
      }
    } else {
      for (TABLE_LIST *table = tables; table; table = table->next_local) {
        share = get_cached_table_share(table->db, table->table_name);
        if (share && share->has_old_version()) {
          found = true;
          break;
        }
      }
    }

    if (found) {
      /*
        The method below temporarily unlocks LOCK_open and frees
        share's memory. Note that it works correctly even for
        shares with m_open_in_progress flag set.
      */
      if (share->wait_for_old_version(
              thd, &abstime, MDL_wait_for_subgraph::DEADLOCK_WEIGHT_DDL)) {
        mysql_mutex_unlock(&LOCK_open);
        result = true;
        goto err_with_reopen;
      }
    }

    mysql_mutex_unlock(&LOCK_open);
  }
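The interplay between refresh_version and the wait loop in the "flush everything" case can be sketched single-threaded. Everything here is a hypothetical model (ToyVersionedShare, ToyTdc and their members are invented names), and it assumes that a share counts as old when its version differs from the current refresh_version; the real code additionally blocks on other threads, which this sketch replaces with an explicit close_table call.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct ToyVersionedShare {
  unsigned long version;  // refresh_version when the share was created
  int used_tables;        // TABLE instances other sessions still use
};

struct ToyTdc {
  unsigned long refresh_version = 1;
  std::unordered_map<std::string, ToyVersionedShare> shares;

  void add(const std::string &key, int used_tables) {
    shares.emplace(key, ToyVersionedShare{refresh_version, used_tables});
  }

  // Assumed rule: a share is old once its version lags behind.
  bool has_old_version(const ToyVersionedShare &s) const {
    return s.version != refresh_version;
  }

  // FLUSH TABLE without a list: age every share, drop the unused ones.
  void flush_all() {
    ++refresh_version;
    for (auto it = shares.begin(); it != shares.end();) {
      if (it->second.used_tables == 0)
        it = shares.erase(it);
      else
        ++it;
    }
  }

  // One pass of the wait loop: is some old share still cached?
  bool must_wait() const {
    for (const auto &kv : shares)
      if (has_old_version(kv.second)) return true;
    return false;
  }

  // Another session finishes and closes its TABLE; old shares go away
  // as soon as nobody uses them.
  void close_table(const std::string &key) {
    auto it = shares.find(key);
    if (it == shares.end() || it->second.used_tables == 0) return;
    if (--it->second.used_tables == 0 && has_old_version(it->second))
      shares.erase(it);
  }
};

// Usage: the flush has to wait until the busy table is closed.
inline void toy_wait_demo() {
  ToyTdc tdc;
  tdc.add("kubo.busy", 1);
  tdc.add("kubo.idle", 0);
  tdc.flush_all();
  assert(tdc.must_wait());   // kubo.busy is old but still in use
  tdc.close_table("kubo.busy");
  assert(!tdc.must_wait());  // nothing old remains, the loop would end
}
```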

handle_reload_request

Next, let's look at where close_cached_tables is called.

This function has many if statements that branch on options, the flags that say which FLUSH statement is being executed (reference: COM_REFRESH Flags).

The FLUSH TABLE test statement matches the branch (options & (REFRESH_TABLES | REFRESH_READ_LOCK)). Of the two flags, it sets REFRESH_TABLES but not REFRESH_READ_LOCK; REFRESH_READ_LOCK corresponds to the FLUSH TABLES WITH READ LOCK statement. close_cached_tables is called inside this branch.
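The flag test can be mirrored in a small sketch. The TOY_* constants below are placeholders with made-up bit values, not the server's REFRESH_* definitions, and ToyBranch, toy_pick_branch and toy_wait_for_refresh are hypothetical helpers that only reproduce the shape of the if statements in the excerpt.

```cpp
#include <cassert>

// Placeholder bit values for this sketch only; the real REFRESH_* constants
// live in the server headers and are not reproduced here.
constexpr unsigned long TOY_REFRESH_TABLES = 1UL << 0;
constexpr unsigned long TOY_REFRESH_READ_LOCK = 1UL << 1;
constexpr unsigned long TOY_REFRESH_FAST = 1UL << 2;
constexpr unsigned long TOY_REFRESH_OTHER = 1UL << 3;

enum class ToyBranch { kNotTableFlush, kReadLock, kCloseCachedTables };

// Same nesting as the excerpt: outer test on either flag, inner test on
// the read-lock flag (which also requires a THD).
inline ToyBranch toy_pick_branch(unsigned long options, bool has_thd) {
  if (options & (TOY_REFRESH_TABLES | TOY_REFRESH_READ_LOCK)) {
    if ((options & TOY_REFRESH_READ_LOCK) && has_thd)
      return ToyBranch::kReadLock;
    return ToyBranch::kCloseCachedTables;
  }
  return ToyBranch::kNotTableFlush;
}

// Third argument passed to close_cached_tables in the excerpt.
inline bool toy_wait_for_refresh(unsigned long options) {
  return (options & TOY_REFRESH_FAST) == 0;
}

// Usage: FLUSH TABLE test sets only the tables flag in this model.
inline void toy_flags_demo() {
  assert(toy_pick_branch(TOY_REFRESH_TABLES, true) ==
         ToyBranch::kCloseCachedTables);
  assert(toy_pick_branch(TOY_REFRESH_TABLES | TOY_REFRESH_READ_LOCK, true) ==
         ToyBranch::kReadLock);
  assert(toy_wait_for_refresh(TOY_REFRESH_TABLES));
}
```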

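Note: there is also an if (thd && thd->locked_tables_mode) branch, but we don't go through it this time. Since we skip it there's no real need to care, but I had no idea what locked_tables_mode was and got stuck on it, so here are some notes on what I understand so far.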

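Going by the source comment on locked_tables_mode, it is a lock mode called locked tables mode, used "when many tables need to be opened and locked at once". For example, when stored routines or triggers are involved it is also called prelocked mode: on invocation, all the related tables are opened and locked in one go. That removes the need to open, lock and close the tables again on every call. Apparently. I sort of get it, but not really.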

bool handle_reload_request(THD *thd, unsigned long options, TABLE_LIST *tables,
                           int *write_to_binlog) {
  bool result = false;
  select_errors = 0; /* Write if more errors */
  int tmp_write_to_binlog = *write_to_binlog = 1;

...

  /*
    Note that if REFRESH_READ_LOCK bit is set then REFRESH_TABLES is set too
    (see sql_yacc.yy)
  */
  if (options & (REFRESH_TABLES | REFRESH_READ_LOCK)) {
    if ((options & REFRESH_READ_LOCK) && thd) {
      
      ...
      
    } else {
      if (thd && thd->locked_tables_mode) {
        /*
          If we are under LOCK TABLES we should have a write
          lock on tables which we are going to flush.
        */
        if (tables) {
          for (TABLE_LIST *t = tables; t; t = t->next_local)
            if (!find_table_for_mdl_upgrade(thd, t->db, t->table_name, false))
              return true;
        } else {
          /*
            It is not safe to upgrade the metadata lock without GLOBAL IX lock.
            This can happen with FLUSH TABLES <list> WITH READ LOCK as we in
            these cases don't take a GLOBAL IX lock in order to be compatible
            with global read lock.
          */
          if (thd->open_tables &&
              !thd->mdl_context.owns_equal_or_stronger_lock(
                  MDL_key::GLOBAL, "", "", MDL_INTENTION_EXCLUSIVE)) {
            my_error(ER_TABLE_NOT_LOCKED_FOR_WRITE, MYF(0),
                     thd->open_tables->s->table_name.str);
            return true;
          }

          for (TABLE *tab = thd->open_tables; tab; tab = tab->next) {
            if (!tab->mdl_ticket->is_upgradable_or_exclusive()) {
              my_error(ER_TABLE_NOT_LOCKED_FOR_WRITE, MYF(0),
                       tab->s->table_name.str);
              return true;
            }
          }
        }
      }

      if (close_cached_tables(
              thd, tables, ((options & REFRESH_FAST) ? false : true),
              (thd ? thd->variables.lock_wait_timeout : LONG_TIMEOUT))) {
        /*
          NOTE: my_error() has been already called by reopen_tables() within
          close_cached_tables().
        */
        result = true;
      }
    }
  }
  
  ...
  
  return result || (thd ? thd->killed : 0);
}

mysql_execute_command

Finally, handle_reload_request is called from the following code.

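The statement arrives as the SQL command SQLCOM_FLUSH, so execution goes through the code under case SQLCOM_FLUSH, and handle_reload_request is called from there. Note that when a table list is given together with REFRESH_READ_LOCK or REFRESH_FOR_EXPORT (that is, for FLUSH TABLES ... WITH READ LOCK and FLUSH TABLES ... FOR EXPORT), different functions are called instead: flush_tables_with_read_lock and flush_tables_for_export.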

int mysql_execute_command(THD *thd, bool first_level) {
  int res = false;
  LEX *const lex = thd->lex;
  
  ...
  ...
  ...
  
  case SQLCOM_FLUSH: {
      ...

      if (first_table && lex->type & REFRESH_READ_LOCK) {
        /* Check table-level privileges. */
        if (check_table_access(thd, LOCK_TABLES_ACL | SELECT_ACL, all_tables,
                               false, UINT_MAX, false))
          goto error;
        if (flush_tables_with_read_lock(thd, all_tables)) goto error;
        my_ok(thd);
        break;
      } else if (first_table && lex->type & REFRESH_FOR_EXPORT) {
        /* Check table-level privileges. */
        if (check_table_access(thd, LOCK_TABLES_ACL | SELECT_ACL, all_tables,
                               false, UINT_MAX, false))
          goto error;
        if (flush_tables_for_export(thd, all_tables)) goto error;
        my_ok(thd);
        break;
      }

      /*
        handle_reload_request() will tell us if we are allowed to write to the
        binlog or not.
      */
      if (!handle_reload_request(thd, lex->type, first_table,
                                 &write_to_binlog)) {
        /*
          We WANT to write and we CAN write.
          ! we write after unlocking the table.
        */
        /*
          Presumably, RESET and binlog writing doesn't require synchronization
        */

        if (write_to_binlog > 0)  // we should write
        {
          if (!lex->no_write_to_binlog)
            res = write_bin_log(thd, false, thd->query().str,
                                thd->query().length);
        } else if (write_to_binlog < 0) {
          /*
             We should not write, but rather report error because
             handle_reload_request binlog interactions failed
           */
          res = 1;
        }

        if (!res) my_ok(thd);
      }

      break;
    }