MySQL Character Set and Collation Guide
Character sets and collations are fundamental to how MySQL stores, compares, and sorts text data. Understanding these concepts is essential for building applications that properly handle international characters, emojis, and case-sensitive comparisons.
🎯 Understanding the Basics
What is a Character Set?
A character set defines the set of characters that can be stored and the encoding used to represent them.
| Character Set | Description | Max Bytes/Char | Use Case |
|---|---|---|---|
latin1 | Western European | 1 | Legacy applications |
utf8 | Unicode (BMP only) | 3 | Basic international text |
utf8mb4 | Full Unicode | 4 | Recommended - includes emojis |
binary | Raw bytes | 1 | Binary data storage |
What is a Collation?
A collation defines how characters are compared and sorted.
Character Set: utf8mb4
│
└── Collations:
├── utf8mb4_general_ci (case-insensitive, accent-insensitive)
├── utf8mb4_unicode_ci (case-insensitive, uses Unicode rules)
├── utf8mb4_0900_ai_ci (MySQL 8.0 default, accent-insensitive)
├── utf8mb4_bin (binary comparison, case-sensitive)
└── utf8mb4_0900_as_cs (accent-sensitive, case-sensitive)
💡 Key Insight: The
_cisuffix means case-insensitive,_csmeans case-sensitive,_aimeans accent-insensitive, and_asmeans accent-sensitive.
🔧 Configuration Hierarchy
MySQL applies character set and collation settings at multiple levels:
┌─────────────────────────────────────────────────────────────────────┐
│ CHARACTER SET HIERARCHY │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Server Level │
│ │ │
│ ▼ │
│ Database Level (can override server) │
│ │ │
│ ▼ │
│ Table Level (can override database) │
│ │ │
│ ▼ │
│ Column Level (can override table) │
│ │
│ Most specific setting wins! │
│ │
└─────────────────────────────────────────────────────────────────────┘
Server-Level Configuration
# /etc/mysql/mysql.conf.d/charset.cnf
[mysqld]
# Server default character set
character_set_server = utf8mb4
collation_server = utf8mb4_0900_ai_ci
# Client connection defaults
character_set_client_handshake = FALSE
character_set_client = utf8mb4
[client]
default-character-set = utf8mb4
Viewing Current Configuration
-- Check server defaults
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
-- Check database character set
SELECT
SCHEMA_NAME,
DEFAULT_CHARACTER_SET_NAME,
DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'your_database';
-- Check table character set
SELECT
TABLE_NAME,
TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database';
-- Check column character set
SELECT
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
AND DATA_TYPE IN ('varchar', 'char', 'text', 'mediumtext', 'longtext');
🚀 UTF-8 Migration Guide
Why Migrate to utf8mb4?
| Character | utf8 (3-byte) | utf8mb4 (4-byte) |
|---|---|---|
| A-Z, 0-9 | ✅ | ✅ |
| Accented (é, ñ) | ✅ | ✅ |
| Chinese, Japanese | ✅ | ✅ |
| Emojis (😀 🎉 🚀) | ❌ | ✅ |
| Math symbols (𝕏) | ❌ | ✅ |
Step-by-Step Migration
Step 1: Backup Everything
# Full backup before migration
mysqldump -u root -p \
--single-transaction \
--routines \
--triggers \
--events \
--all-databases > pre_migration_backup.sql
Step 2: Analyze Current State
-- Find all non-utf8mb4 tables
SELECT
TABLE_SCHEMA,
TABLE_NAME,
TABLE_COLLATION,
ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
AND TABLE_COLLATION NOT LIKE 'utf8mb4%'
ORDER BY TABLE_SCHEMA, TABLE_NAME;
-- Find all non-utf8mb4 columns
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
DATA_TYPE,
CHARACTER_SET_NAME,
COLLATION_NAME,
CHARACTER_MAXIMUM_LENGTH
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
AND CHARACTER_SET_NAME IS NOT NULL
AND CHARACTER_SET_NAME != 'utf8mb4'
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;
Step 3: Check Index Size Impact
-- VARCHAR(255) with utf8 = 255 * 3 = 765 bytes
-- VARCHAR(255) with utf8mb4 = 255 * 4 = 1020 bytes
-- InnoDB index key limit: 3072 bytes (with DYNAMIC/COMPRESSED row format)
-- Find columns that might exceed index limits
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
DATA_TYPE,
CHARACTER_MAXIMUM_LENGTH,
CHARACTER_MAXIMUM_LENGTH * 4 AS utf8mb4_bytes
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
AND DATA_TYPE IN ('varchar', 'char')
AND CHARACTER_MAXIMUM_LENGTH * 4 > 768 -- Potential index issue
ORDER BY CHARACTER_MAXIMUM_LENGTH DESC;
Step 4: Generate Migration Scripts
-- Generate ALTER statements for tables
SELECT CONCAT(
'ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` ',
'CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;'
) AS alter_statement
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
AND TABLE_COLLATION NOT LIKE 'utf8mb4%'
ORDER BY TABLE_SCHEMA, TABLE_NAME;
-- Generate ALTER statements for database defaults
SELECT CONCAT(
'ALTER DATABASE `', SCHEMA_NAME, '` ',
'CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;'
) AS alter_statement
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
AND DEFAULT_CHARACTER_SET_NAME != 'utf8mb4';
Step 5: Execute Migration
-- Migrate database defaults
ALTER DATABASE your_database
CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
-- Migrate table (includes all columns)
ALTER TABLE your_table
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
-- Or migrate specific column
ALTER TABLE your_table
MODIFY column_name VARCHAR(255)
CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
-- For large tables, use pt-online-schema-change
-- pt-online-schema-change \
-- --alter "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci" \
-- D=your_database,t=your_table \
-- --execute
📊 Collation Comparison
Understanding Collation Differences
-- Create test data
CREATE TABLE collation_test (
id INT AUTO_INCREMENT PRIMARY KEY,
name_general VARCHAR(100) COLLATE utf8mb4_general_ci,
name_unicode VARCHAR(100) COLLATE utf8mb4_unicode_ci,
name_0900 VARCHAR(100) COLLATE utf8mb4_0900_ai_ci,
name_bin VARCHAR(100) COLLATE utf8mb4_bin
);
INSERT INTO collation_test (name_general, name_unicode, name_0900, name_bin) VALUES
('cafe', 'cafe', 'cafe', 'cafe'),
('café', 'café', 'café', 'café'),
('Cafe', 'Cafe', 'Cafe', 'Cafe'),
('CAFE', 'CAFE', 'CAFE', 'CAFE');
-- Compare: Case-insensitive collations
SELECT name_general, COUNT(*) as matches
FROM collation_test
WHERE name_general = 'cafe'
GROUP BY name_general;
-- Returns: cafe, café, Cafe, CAFE (4 matches with general_ci)
-- Compare: Binary collation (exact match)
SELECT name_bin, COUNT(*) as matches
FROM collation_test
WHERE name_bin = 'cafe'
GROUP BY name_bin;
-- Returns: only 'cafe' (1 match)
Collation Performance Comparison
| Collation | Speed | Accuracy | Use Case |
|---|---|---|---|
utf8mb4_general_ci | Fast | Lower | Simple comparisons |
utf8mb4_unicode_ci | Medium | Higher | International apps |
utf8mb4_0900_ai_ci | Fast | Highest | MySQL 8.0+ default |
utf8mb4_bin | Fastest | Exact | Case-sensitive lookups |
-- Benchmark collation performance
SET @start = NOW(6);
SELECT COUNT(*)
FROM large_table
WHERE name LIKE 'test%'; -- Uses table's collation
SELECT TIMEDIFF(NOW(6), @start) AS execution_time;
⚠️ Common Pitfalls and Solutions
Pitfall 1: Illegal Mix of Collations
-- This will fail if tables have different collations
SELECT *
FROM users u
JOIN orders o ON u.email = o.customer_email;
-- ERROR 1267: Illegal mix of collations
-- Solution 1: Explicitly specify collation
SELECT *
FROM users u
JOIN orders o ON u.email = o.customer_email COLLATE utf8mb4_0900_ai_ci;
-- Solution 2: Migrate both tables to same collation
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE orders CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Pitfall 2: Index Not Used Due to Collation
-- Column has binary collation, query uses case-insensitive comparison
EXPLAIN SELECT * FROM users
WHERE username = 'John'; -- Uses index if column is case-insensitive
-- If column is binary collation:
EXPLAIN SELECT * FROM users
WHERE LOWER(username) = 'john'; -- Function prevents index use!
-- Solution: Use proper collation or change query
ALTER TABLE users
MODIFY username VARCHAR(50) COLLATE utf8mb4_0900_ai_ci;
Pitfall 3: Connection Character Set Mismatch
-- Check connection character set
SHOW VARIABLES LIKE 'character_set_connection';
SHOW VARIABLES LIKE 'collation_connection';
-- Set connection character set
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_0900_ai_ci';
-- Or in connection string
-- jdbc:mysql://host:3306/db?characterEncoding=UTF-8&connectionCollation=utf8mb4_0900_ai_ci
Pitfall 4: Emoji Storage Fails
-- This fails with utf8 (3-byte)
INSERT INTO messages (content) VALUES ('Hello! 🎉');
-- ERROR: Incorrect string value: '\xF0\x9F\x8E\x89' for column 'content'
-- Solution: Use utf8mb4
ALTER TABLE messages
MODIFY content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
-- Now this works
INSERT INTO messages (content) VALUES ('Hello! 🎉');
🔍 Case Sensitivity Strategies
Making Specific Comparisons Case-Sensitive
-- Method 1: Use BINARY operator
SELECT * FROM users WHERE BINARY username = 'John';
-- Method 2: Use case-sensitive collation
SELECT * FROM users
WHERE username COLLATE utf8mb4_bin = 'John';
-- Method 3: Create column with binary collation
ALTER TABLE users
ADD username_cs VARCHAR(50)
GENERATED ALWAYS AS (username) STORED
COLLATE utf8mb4_bin;
CREATE INDEX idx_username_cs ON users(username_cs);
Accent Sensitivity
-- Default utf8mb4_0900_ai_ci is accent-insensitive
SELECT * FROM products WHERE name = 'cafe';
-- Matches: cafe, café, CAFE, CAFÉ
-- Use accent-sensitive collation
SELECT * FROM products
WHERE name COLLATE utf8mb4_0900_as_cs = 'cafe';
-- Matches only: cafe
🚀 Performance Optimization
Index Size Considerations
-- Check index sizes
SELECT
TABLE_NAME,
INDEX_NAME,
COLUMN_NAME,
SEQ_IN_INDEX,
CARDINALITY,
ROUND(CARDINALITY * 4 / 1024 / 1024, 2) AS estimated_size_mb
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = 'your_database'
AND INDEX_NAME != 'PRIMARY'
ORDER BY TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX;
-- Use prefix indexes for long VARCHAR columns
CREATE INDEX idx_description ON products(description(100));
-- Analyze index efficiency
SELECT
TABLE_NAME,
INDEX_NAME,
COLUMN_NAME,
CONCAT(
CASE WHEN CHARACTER_SET_NAME = 'utf8mb4' THEN 4 ELSE 3 END * SUB_PART,
' bytes'
) AS index_key_size
FROM information_schema.STATISTICS s
JOIN information_schema.COLUMNS c USING (TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME)
WHERE s.TABLE_SCHEMA = 'your_database';
Comparison Performance Tips
-- Faster: Use appropriate collation for your needs
-- If you don't need linguistic sorting, general_ci is faster
ALTER TABLE lookup_codes
MODIFY code VARCHAR(20) COLLATE utf8mb4_general_ci;
-- Faster: Binary comparison for exact matches
SELECT * FROM config
WHERE setting_key COLLATE utf8mb4_bin = 'app.version';
-- Faster: Avoid unnecessary collation conversions
SELECT * FROM users
WHERE email = '[email protected]' -- Same collation, uses index
-- vs
WHERE email COLLATE utf8mb4_bin = '[email protected]'; -- Conversion needed
✅ Best Practices Checklist
Server Configuration
- [ ] Set
character_set_server = utf8mb4 - [ ] Set
collation_server = utf8mb4_0900_ai_ci(MySQL 8.0+) - [ ] Configure client connections to use utf8mb4
- [ ] Document any exceptions to the standard
Development Practices
- [ ] Always specify character set for new tables/databases
- [ ] Use utf8mb4 for any user-generated content
- [ ] Test with international characters and emojis
- [ ] Document collation requirements for each column
Migration Practices
- [ ] Audit existing character sets before migration
- [ ] Test on non-production environment first
- [ ] Use online DDL tools for large tables
- [ ] Verify application compatibility after migration
📋 Quick Reference
Common Collation Suffixes
| Suffix | Meaning |
|---|---|
_ci | Case-insensitive |
_cs | Case-sensitive |
_ai | Accent-insensitive |
_as | Accent-sensitive |
_ks | Kana-sensitive (Japanese) |
_bin | Binary (exact byte comparison) |
Recommended Collations by Use Case
| Use Case | Recommended Collation |
|---|---|
| General purpose (MySQL 8.0+) | utf8mb4_0900_ai_ci |
| Legacy compatibility | utf8mb4_general_ci |
| Case-sensitive usernames | utf8mb4_bin |
| German language | utf8mb4_german2_ci |
| Japanese language | utf8mb4_ja_0900_as_cs |
🎓 Conclusion
Proper character set and collation configuration is essential for applications that handle international text, emojis, or require specific comparison behaviors. Migrating to utf8mb4 with an appropriate collation ensures your database can store any Unicode character while providing the comparison behavior your application needs.
💡 Pro Tip: When in doubt, use
utf8mb4withutf8mb4_0900_ai_cicollation for MySQL 8.0+. This provides excellent Unicode support with high performance.
Key Takeaways
- Always use utf8mb4 for full Unicode support including emojis
- Choose collation based on needs - case sensitivity, accent handling, performance
- Be consistent across your entire database when possible
- Test thoroughly after migration, especially with international data
- Monitor index sizes when migrating from utf8 to utf8mb4
About the Author
Jamaurice Holt is a Cloud Database Expert and AWS Solutions Architect with over 10 years of experience in database optimization, high availability, and cloud migrations. Specializing in MySQL, SQL optimization, and enterprise database solutions.
Related Articles
- MySQL Database Migration Best Practices
- MySQL Index Optimization Strategies
- MySQL Performance Schema Monitoring
Need help with MySQL character set migration? Contact us for expert database consulting.