MySQL Character Set and Collation Guide

Complete guide to MySQL character sets and collations, including UTF-8 migration, performance implications, and best practices for international applications.

MySQL Character Set and Collation Guide

Character sets and collations are fundamental to how MySQL stores, compares, and sorts text data. Understanding these concepts is essential for building applications that properly handle international characters, emojis, and case-sensitive comparisons.


🎯 Understanding the Basics

What is a Character Set?

A character set defines the set of characters that can be stored and the encoding used to represent them.

Character SetDescriptionMax Bytes/CharUse Case
latin1Western European1Legacy applications
utf8Unicode (BMP only)3Basic international text
utf8mb4Full Unicode4Recommended - includes emojis
binaryRaw bytes1Binary data storage

What is a Collation?

A collation defines how characters are compared and sorted.

Character Set: utf8mb4
   │
   └── Collations:
       ├── utf8mb4_general_ci    (case-insensitive, accent-insensitive)
       ├── utf8mb4_unicode_ci    (case-insensitive, uses Unicode rules)
       ├── utf8mb4_0900_ai_ci    (MySQL 8.0 default, accent-insensitive)
       ├── utf8mb4_bin           (binary comparison, case-sensitive)
       └── utf8mb4_0900_as_cs    (accent-sensitive, case-sensitive)

💡 Key Insight: The _ci suffix means case-insensitive, _cs means case-sensitive, _ai means accent-insensitive, and _as means accent-sensitive.


🔧 Configuration Hierarchy

MySQL applies character set and collation settings at multiple levels:

┌─────────────────────────────────────────────────────────────────────┐
│                    CHARACTER SET HIERARCHY                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Server Level                                                        │
│       │                                                              │
│       ▼                                                              │
│  Database Level (can override server)                               │
│       │                                                              │
│       ▼                                                              │
│  Table Level (can override database)                                │
│       │                                                              │
│       ▼                                                              │
│  Column Level (can override table)                                  │
│                                                                      │
│  Most specific setting wins!                                        │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Server-Level Configuration

# /etc/mysql/mysql.conf.d/charset.cnf

[mysqld]
# Server default character set
character_set_server = utf8mb4
collation_server = utf8mb4_0900_ai_ci

# Client connection defaults
character_set_client_handshake = FALSE
character_set_client = utf8mb4

[client]
default-character-set = utf8mb4

Viewing Current Configuration

-- Check server defaults
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Check database character set
SELECT
    SCHEMA_NAME,
    DEFAULT_CHARACTER_SET_NAME,
    DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'your_database';

-- Check table character set
SELECT
    TABLE_NAME,
    TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database';

-- Check column character set
SELECT
    TABLE_NAME,
    COLUMN_NAME,
    CHARACTER_SET_NAME,
    COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
    AND DATA_TYPE IN ('varchar', 'char', 'text', 'mediumtext', 'longtext');

🚀 UTF-8 Migration Guide

Why Migrate to utf8mb4?

Characterutf8 (3-byte)utf8mb4 (4-byte)
A-Z, 0-9
Accented (é, ñ)
Chinese, Japanese
Emojis (😀 🎉 🚀)
Math symbols (𝕏)

Step-by-Step Migration

Step 1: Backup Everything

# Full backup before migration
mysqldump -u root -p \
    --single-transaction \
    --routines \
    --triggers \
    --events \
    --all-databases > pre_migration_backup.sql

Step 2: Analyze Current State

-- Find all non-utf8mb4 tables
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    TABLE_COLLATION,
    ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
    AND TABLE_COLLATION NOT LIKE 'utf8mb4%'
ORDER BY TABLE_SCHEMA, TABLE_NAME;

-- Find all non-utf8mb4 columns
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    COLUMN_NAME,
    DATA_TYPE,
    CHARACTER_SET_NAME,
    COLLATION_NAME,
    CHARACTER_MAXIMUM_LENGTH
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
    AND CHARACTER_SET_NAME IS NOT NULL
    AND CHARACTER_SET_NAME != 'utf8mb4'
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;

Step 3: Check Index Size Impact

-- VARCHAR(255) with utf8 = 255 * 3 = 765 bytes
-- VARCHAR(255) with utf8mb4 = 255 * 4 = 1020 bytes
-- InnoDB index key limit: 3072 bytes (with DYNAMIC/COMPRESSED row format)

-- Find columns that might exceed index limits
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    COLUMN_NAME,
    DATA_TYPE,
    CHARACTER_MAXIMUM_LENGTH,
    CHARACTER_MAXIMUM_LENGTH * 4 AS utf8mb4_bytes
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
    AND DATA_TYPE IN ('varchar', 'char')
    AND CHARACTER_MAXIMUM_LENGTH * 4 > 768  -- Potential index issue
ORDER BY CHARACTER_MAXIMUM_LENGTH DESC;

Step 4: Generate Migration Scripts

-- Generate ALTER statements for tables
SELECT CONCAT(
    'ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` ',
    'CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;'
) AS alter_statement
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
    AND TABLE_COLLATION NOT LIKE 'utf8mb4%'
ORDER BY TABLE_SCHEMA, TABLE_NAME;

-- Generate ALTER statements for database defaults
SELECT CONCAT(
    'ALTER DATABASE `', SCHEMA_NAME, '` ',
    'CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;'
) AS alter_statement
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
    AND DEFAULT_CHARACTER_SET_NAME != 'utf8mb4';

Step 5: Execute Migration

-- Migrate database defaults
ALTER DATABASE your_database
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_0900_ai_ci;

-- Migrate table (includes all columns)
ALTER TABLE your_table
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_0900_ai_ci;

-- Or migrate specific column
ALTER TABLE your_table
    MODIFY column_name VARCHAR(255)
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_0900_ai_ci;

-- For large tables, use pt-online-schema-change
-- pt-online-schema-change \
--     --alter "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci" \
--     D=your_database,t=your_table \
--     --execute

📊 Collation Comparison

Understanding Collation Differences

-- Create test data
CREATE TABLE collation_test (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name_general VARCHAR(100) COLLATE utf8mb4_general_ci,
    name_unicode VARCHAR(100) COLLATE utf8mb4_unicode_ci,
    name_0900 VARCHAR(100) COLLATE utf8mb4_0900_ai_ci,
    name_bin VARCHAR(100) COLLATE utf8mb4_bin
);

INSERT INTO collation_test (name_general, name_unicode, name_0900, name_bin) VALUES
    ('cafe', 'cafe', 'cafe', 'cafe'),
    ('café', 'café', 'café', 'café'),
    ('Cafe', 'Cafe', 'Cafe', 'Cafe'),
    ('CAFE', 'CAFE', 'CAFE', 'CAFE');

-- Compare: Case-insensitive collations
SELECT name_general, COUNT(*) as matches
FROM collation_test
WHERE name_general = 'cafe'
GROUP BY name_general;
-- Returns: cafe, café, Cafe, CAFE (4 matches with general_ci)

-- Compare: Binary collation (exact match)
SELECT name_bin, COUNT(*) as matches
FROM collation_test
WHERE name_bin = 'cafe'
GROUP BY name_bin;
-- Returns: only 'cafe' (1 match)

Collation Performance Comparison

CollationSpeedAccuracyUse Case
utf8mb4_general_ciFastLowerSimple comparisons
utf8mb4_unicode_ciMediumHigherInternational apps
utf8mb4_0900_ai_ciFastHighestMySQL 8.0+ default
utf8mb4_binFastestExactCase-sensitive lookups
-- Benchmark collation performance
SET @start = NOW(6);

SELECT COUNT(*)
FROM large_table
WHERE name LIKE 'test%';  -- Uses table's collation

SELECT TIMEDIFF(NOW(6), @start) AS execution_time;

⚠️ Common Pitfalls and Solutions

Pitfall 1: Illegal Mix of Collations

-- This will fail if tables have different collations
SELECT *
FROM users u
JOIN orders o ON u.email = o.customer_email;
-- ERROR 1267: Illegal mix of collations

-- Solution 1: Explicitly specify collation
SELECT *
FROM users u
JOIN orders o ON u.email = o.customer_email COLLATE utf8mb4_0900_ai_ci;

-- Solution 2: Migrate both tables to same collation
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE orders CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Pitfall 2: Index Not Used Due to Collation

-- Column has binary collation, query uses case-insensitive comparison
EXPLAIN SELECT * FROM users
WHERE username = 'John';  -- Uses index if column is case-insensitive

-- If column is binary collation:
EXPLAIN SELECT * FROM users
WHERE LOWER(username) = 'john';  -- Function prevents index use!

-- Solution: Use proper collation or change query
ALTER TABLE users
    MODIFY username VARCHAR(50) COLLATE utf8mb4_0900_ai_ci;

Pitfall 3: Connection Character Set Mismatch

-- Check connection character set
SHOW VARIABLES LIKE 'character_set_connection';
SHOW VARIABLES LIKE 'collation_connection';

-- Set connection character set
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_0900_ai_ci';

-- Or in connection string
-- jdbc:mysql://host:3306/db?characterEncoding=UTF-8&connectionCollation=utf8mb4_0900_ai_ci

Pitfall 4: Emoji Storage Fails

-- This fails with utf8 (3-byte)
INSERT INTO messages (content) VALUES ('Hello! 🎉');
-- ERROR: Incorrect string value: '\xF0\x9F\x8E\x89' for column 'content'

-- Solution: Use utf8mb4
ALTER TABLE messages
    MODIFY content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Now this works
INSERT INTO messages (content) VALUES ('Hello! 🎉');

🔍 Case Sensitivity Strategies

Making Specific Comparisons Case-Sensitive

-- Method 1: Use BINARY operator
SELECT * FROM users WHERE BINARY username = 'John';

-- Method 2: Use case-sensitive collation
SELECT * FROM users
WHERE username COLLATE utf8mb4_bin = 'John';

-- Method 3: Create column with binary collation
ALTER TABLE users
    ADD username_cs VARCHAR(50)
    GENERATED ALWAYS AS (username) STORED
    COLLATE utf8mb4_bin;

CREATE INDEX idx_username_cs ON users(username_cs);

Accent Sensitivity

-- Default utf8mb4_0900_ai_ci is accent-insensitive
SELECT * FROM products WHERE name = 'cafe';
-- Matches: cafe, café, CAFE, CAFÉ

-- Use accent-sensitive collation
SELECT * FROM products
WHERE name COLLATE utf8mb4_0900_as_cs = 'cafe';
-- Matches only: cafe

🚀 Performance Optimization

Index Size Considerations

-- Check index sizes
SELECT
    TABLE_NAME,
    INDEX_NAME,
    COLUMN_NAME,
    SEQ_IN_INDEX,
    CARDINALITY,
    ROUND(CARDINALITY * 4 / 1024 / 1024, 2) AS estimated_size_mb
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = 'your_database'
    AND INDEX_NAME != 'PRIMARY'
ORDER BY TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX;

-- Use prefix indexes for long VARCHAR columns
CREATE INDEX idx_description ON products(description(100));

-- Analyze index efficiency
SELECT
    TABLE_NAME,
    INDEX_NAME,
    COLUMN_NAME,
    CONCAT(
        CASE WHEN CHARACTER_SET_NAME = 'utf8mb4' THEN 4 ELSE 3 END * SUB_PART,
        ' bytes'
    ) AS index_key_size
FROM information_schema.STATISTICS s
JOIN information_schema.COLUMNS c USING (TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME)
WHERE s.TABLE_SCHEMA = 'your_database';

Comparison Performance Tips

-- Faster: Use appropriate collation for your needs
-- If you don't need linguistic sorting, general_ci is faster
ALTER TABLE lookup_codes
    MODIFY code VARCHAR(20) COLLATE utf8mb4_general_ci;

-- Faster: Binary comparison for exact matches
SELECT * FROM config
WHERE setting_key COLLATE utf8mb4_bin = 'app.version';

-- Faster: Avoid unnecessary collation conversions
SELECT * FROM users
WHERE email = '[email protected]'  -- Same collation, uses index
-- vs
WHERE email COLLATE utf8mb4_bin = '[email protected]';  -- Conversion needed

✅ Best Practices Checklist

Server Configuration

  • [ ] Set character_set_server = utf8mb4
  • [ ] Set collation_server = utf8mb4_0900_ai_ci (MySQL 8.0+)
  • [ ] Configure client connections to use utf8mb4
  • [ ] Document any exceptions to the standard

Development Practices

  • [ ] Always specify character set for new tables/databases
  • [ ] Use utf8mb4 for any user-generated content
  • [ ] Test with international characters and emojis
  • [ ] Document collation requirements for each column

Migration Practices

  • [ ] Audit existing character sets before migration
  • [ ] Test on non-production environment first
  • [ ] Use online DDL tools for large tables
  • [ ] Verify application compatibility after migration

📋 Quick Reference

Common Collation Suffixes

SuffixMeaning
_ciCase-insensitive
_csCase-sensitive
_aiAccent-insensitive
_asAccent-sensitive
_ksKana-sensitive (Japanese)
_binBinary (exact byte comparison)

Recommended Collations by Use Case

Use CaseRecommended Collation
General purpose (MySQL 8.0+)utf8mb4_0900_ai_ci
Legacy compatibilityutf8mb4_general_ci
Case-sensitive usernamesutf8mb4_bin
German languageutf8mb4_german2_ci
Japanese languageutf8mb4_ja_0900_as_cs

🎓 Conclusion

Proper character set and collation configuration is essential for applications that handle international text, emojis, or require specific comparison behaviors. Migrating to utf8mb4 with an appropriate collation ensures your database can store any Unicode character while providing the comparison behavior your application needs.

💡 Pro Tip: When in doubt, use utf8mb4 with utf8mb4_0900_ai_ci collation for MySQL 8.0+. This provides excellent Unicode support with high performance.

Key Takeaways

  1. Always use utf8mb4 for full Unicode support including emojis
  2. Choose collation based on needs - case sensitivity, accent handling, performance
  3. Be consistent across your entire database when possible
  4. Test thoroughly after migration, especially with international data
  5. Monitor index sizes when migrating from utf8 to utf8mb4

About the Author

Jamaurice Holt is a Cloud Database Expert and AWS Solutions Architect with over 10 years of experience in database optimization, high availability, and cloud migrations. Specializing in MySQL, SQL optimization, and enterprise database solutions.

Related Articles

Need help with MySQL character set migration? Contact us for expert database consulting.

Continue Reading

Leadership

Leadership in Motion: Technical Leadership Lessons from Database Operations

How a decade of database administration shaped my approach to technical leadership — lessons on incident response, mentoring engineers,...

July 10, 2024 · Read article →
Performance

Database Performance Tuning: Lessons from the Trenches

Real-world strategies for optimizing database performance in high-traffic production environments — query plan analysis, indexing, caching,...

August 22, 2024 · Read article →
AI

AI Governance and Hardware: Why Sovereign Refusal Belongs in Silicon

Software guardrails can be prompted around. The real frontier in AI safety is anchoring refusal below the model — at the hardware level. A...

September 15, 2024 · Read article →