PHP's standard library includes a very large number of string functions, and knowing which one fits a given task — rather than reaching for the first vaguely-relevant one or writing a manual loop — makes string-handling code shorter, clearer, and less likely to contain an off-by-one bug.
Searching and Extracting
str_contains, str_starts_with, and str_ends_with (added natively in PHP 8) directly answer the question they're named for, replacing older, less readable strpos-based idioms. substr extracts a portion of a string by position and length, while strpos/strrpos locate the first or last occurrence of a substring.
if (str_starts_with($url, 'https://')) { /* ... */ }
$at = strpos($email, '@');
$domain = $at === false ? '' : substr($email, $at + 1); // empty when there is no '@'
Trimming, Padding, and Case
trim, ltrim, and rtrim remove whitespace (or other specified characters) from a string's edges, essential for cleaning up user input before validation or storage. str_pad adds characters to reach a target length, useful for formatting (zero-padding an invoice number). strtolower, strtoupper, and ucfirst handle case transformation, with mb_ variants needed for correct behavior on multi-byte UTF-8 text.
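As a minimal sketch of the functions above working together (the variable names and sample values are invented for illustration, and the last line assumes the mbstring extension is loaded):

```php
<?php
// Clean up raw form input: trim() strips whitespace from both ends.
$rawName = "  alice smith \n";
$name = trim($rawName);                                  // "alice smith"

// rtrim() with a character list removes those characters from the right edge only.
$path = rtrim('/var/www/html///', '/');                  // "/var/www/html"

// Zero-pad an invoice number to six digits on the left.
$invoice = str_pad((string) 42, 6, '0', STR_PAD_LEFT);   // "000042"

// Byte-based case functions are fine for ASCII input.
$heading = ucfirst(strtolower('HELLO'));                 // "Hello"

// For UTF-8 text, the mb_ variant converts non-ASCII letters too.
$city = mb_strtoupper("z\u{00FC}rich", 'UTF-8');         // "ZÜRICH"
```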
Splitting, Joining, and Replacing
explode splits a string into an array by a delimiter; implode does the reverse, joining an array into a string. str_replace performs simple substring replacement, while preg_replace handles pattern-based replacement when the target isn't a fixed literal string. Choosing the simplest function that solves the problem, rather than defaulting to regex for tasks str_replace handles fine, keeps code easier to read.
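A short sketch of the four functions side by side may help; the sample strings are made up, and the point is only to show where a fixed literal is enough and where a pattern is actually needed:

```php
<?php
// explode() splits on a fixed delimiter; implode() joins the pieces back together.
$tags = explode(',', 'php,strings,utf-8');     // ['php', 'strings', 'utf-8']
$line = implode(' | ', $tags);                 // "php | strings | utf-8"

// A fixed literal target: str_replace() is enough.
$greeting = str_replace('{name}', 'Ada', 'Hello, {name}!');        // "Hello, Ada!"

// A pattern (any run of whitespace): this is where preg_replace() is needed.
$collapsed = preg_replace('/\s+/', ' ', "too   many\t\tspaces");   // "too many spaces"
```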
Multi-Byte String Functions
Standard string functions (strlen, substr, strtoupper) operate on bytes, which produces incorrect results for multi-byte UTF-8 characters (accented letters, non-Latin scripts) where one visible character spans multiple bytes. The mb_ prefixed equivalents (mb_strlen, mb_substr) handle multi-byte encoding correctly and should be the default choice for any text that might contain non-ASCII characters, which in practice is most real-world user-facing text.
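The difference is easiest to see on a single word. The sketch below assumes the mbstring extension is loaded and writes the accented letter as an escape so the byte count does not depend on how the file was saved:

```php
<?php
$word = "na\u{00EF}ve";   // "naïve": 5 characters, but "ï" takes 2 bytes in UTF-8

$byteLength = strlen($word);               // 6: counts bytes
$charLength = mb_strlen($word, 'UTF-8');   // 5: counts characters

// mb_strtoupper() converts the non-ASCII letter as well.
$upper = mb_strtoupper($word, 'UTF-8');    // "NAÏVE"
```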
sprintf for Structured String Formatting
Concatenating many pieces with dots to build a formatted string becomes hard to read once more than two or three values are involved; sprintf with a format template makes the intended final structure visually clear in the code, separate from the values being inserted into it.
$message = sprintf('Order #%d for %s totaling $%.2f', $orderId, $customerName, $total);
str_word_count and Text Analysis
str_word_count counts words in a string and can also return the actual words found, useful for basic text analysis tasks like enforcing a minimum content length or generating a rough reading-time estimate. It has known limitations around punctuation and non-Latin text, so for anything beyond a rough estimate, more sophisticated text-processing approaches are warranted.
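A rough reading-time estimate could be sketched like this. The helper name and the 200-words-per-minute figure are assumptions for illustration, not values from any library:

```php
<?php
// Hypothetical helper: rough reading time, assuming 200 words per minute.
function estimateReadingMinutes(string $text, int $wordsPerMinute = 200): int
{
    $words = str_word_count($text);
    return max(1, (int) ceil($words / $wordsPerMinute));
}

$sample = 'The quick brown fox jumps over the lazy dog';

$count   = str_word_count($sample);           // 9
$words   = str_word_count($sample, 1);        // format 1 returns the words as a list
$minutes = estimateReadingMinutes($sample);   // 1
```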
wordwrap for Constrained-Width Output
Formatting text for a fixed-width context (a plain-text email, a terminal-width report) benefits from wordwrap, which breaks a string into lines at word boundaries near a specified width, avoiding the readability problems of either an unbroken long line or a mid-word break.
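For instance, wrapping a sentence to a 20-column width might look like the sketch below. Note that wordwrap counts bytes, so the width will be off for multi-byte text:

```php
<?php
$text = 'The quick brown fox sat over the lazy dog';

// Break at word boundaries so that no line is longer than 20 characters.
$wrapped = wordwrap($text, 20, "\n");
$lines = explode("\n", $wrapped);
// $lines: ['The quick brown fox', 'sat over the lazy', 'dog']
```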
Case Study: The Multi-Byte Bug That Truncated Names Mid-Character
An application using substr (not mb_substr) to truncate long display names to a fixed character limit worked fine for ASCII names but corrupted names containing accented or non-Latin characters, since cutting at a byte boundary in the middle of a multi-byte UTF-8 character produces invalid, garbled output rather than a cleanly truncated string. The bug went unnoticed in testing (which used only ASCII test data) and surfaced only after international users reported their names displaying as garbled text. Switching every truncation call to mb_substr fixed the issue across the board.
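The failure can be reproduced in a few lines. The application's actual code is not shown in the case study, so the helper names and the sample name below are hypothetical:

```php
<?php
// Hypothetical helpers standing in for the application's truncation code.
function truncateBytes(string $name, int $limit): string
{
    return substr($name, 0, $limit);               // byte-based: can split a character
}

function truncateChars(string $name, int $limit): string
{
    return mb_substr($name, 0, $limit, 'UTF-8');   // character-based
}

$name = "Zo\u{00EB} Fran\u{00E7}ois";   // "Zoë François"

$broken = truncateBytes($name, 3);   // "Zo" plus the first byte of "ë"
$fixed  = truncateChars($name, 3);   // "Zoë"

$brokenIsValid = mb_check_encoding($broken, 'UTF-8');   // false
$fixedIsValid  = mb_check_encoding($fixed, 'UTF-8');    // true
```

An ASCII-only test name such as "Alice" would pass through both helpers unchanged, which is why the bug stayed hidden in testing.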
A Glossary for This Topic
Multi-byte string: text where some characters occupy more than one byte in their encoded form (most non-ASCII UTF-8 text).
mb_ functions: PHP's multi-byte-aware string function variants (mb_strlen, mb_substr).
sprintf: a function building a formatted string from a template and a set of values.
wordwrap: a function breaking text into lines near a target width at word boundaries.
Frequently Asked Questions
When should I use mb_ functions instead of standard ones? Whenever text might contain non-ASCII characters, which in practice covers most real-world user-facing text.
Is str_word_count reliable for all languages? No, it has known limitations with punctuation and non-Latin scripts; treat it as a rough estimate only.
Is regex always better than string functions for simple tasks? No, simple string functions are faster and clearer for tasks that don't genuinely need pattern matching.
Step-by-Step: Auditing String Handling for Multi-Byte Safety
1. Search the codebase for strlen, substr, strtoupper, and strtolower usage.
2. For each, determine whether the input could ever contain non-ASCII text.
3. Replace any match touching potentially non-ASCII text with its mb_ equivalent.
4. Add a test case using genuinely multi-byte input (an accented name, non-Latin text) to any function performing string truncation or case transformation.
5. Set the default internal encoding explicitly (mb_internal_encoding) to avoid relying on a possibly-incorrect default.
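The search step could be sketched as a small scanner. The function name is hypothetical, and a plain regex scan like this also flags method calls and text inside strings or comments, so its output is a list of candidates to review rather than a list of bugs:

```php
<?php
// Set the internal encoding explicitly rather than relying on the default.
mb_internal_encoding('UTF-8');

// Hypothetical audit helper: maps line numbers to the byte-based functions called there.
function findByteBasedCalls(string $source): array
{
    // \b does not match between "_" and a letter, so mb_strlen( etc. are skipped.
    $pattern = '/\b(strlen|substr|strtoupper|strtolower)\s*\(/';
    $hits = [];
    foreach (explode("\n", $source) as $index => $line) {
        if (preg_match_all($pattern, $line, $matches)) {
            $hits[$index + 1] = $matches[1];   // line number => function names
        }
    }
    return $hits;
}

$sample = "\$a = strlen(\$s);\n"
        . "\$b = mb_substr(\$s, 0, 5);\n"
        . "\$c = strtoupper(substr(\$s, 0, 1));";

$hits = findByteBasedCalls($sample);
// $hits: [1 => ['strlen'], 3 => ['strtoupper', 'substr']]
```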
A Comparison Table: String Operation Choices
substr vs mb_substr: substr is byte-based and unsafe for multi-byte text; mb_substr is character-aware and the safer default.
str_replace vs preg_replace: str_replace is faster for fixed literal replacement; preg_replace is needed for pattern-based replacement.
sprintf vs string concatenation: sprintf is clearer for templates with several inserted values; concatenation is fine for one or two simple pieces.
Security Considerations Checklist
Never assume input length in bytes equals length in displayed characters when enforcing limits for security purposes (a length-based truncation meant to prevent a denial-of-service via giant input should account for multi-byte expansion).
Be cautious with str_replace on security-sensitive data (masking, redaction), since naive replacement patterns can sometimes be bypassed with cleverly crafted input.
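Both points can be illustrated briefly. The crafted string below is one well-known way a single-pass replacement gets bypassed; it is an illustration of the general problem, not an exhaustive list of bypasses:

```php
<?php
// A single-pass str_replace() filter can be undone by nesting the target.
$crafted  = '<scr<script>ipt>alert(1)</script>';
$filtered = str_replace('<script>', '', $crafted);
// $filtered is '<script>alert(1)</script>': removing the inner tag rebuilt the outer one.

// Byte length and character length diverge for multi-byte input.
$input     = str_repeat("\u{00E9}", 100);   // 100 copies of "é"
$charCount = mb_strlen($input, 'UTF-8');    // 100
$byteCount = strlen($input);                // 200
```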
Accessibility Considerations
Text truncation using mb_substr should avoid cutting off mid-word where reasonably possible for genuinely readable previews, and any truncated text should be marked up so its full, untruncated form remains available to screen readers (an aria-label or title attribute with the complete text).
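One way to sketch this is a helper that backs up to the last whole word and keeps the full text in a title attribute. The helper name and markup are assumptions for illustration; an aria-label would work the same way:

```php
<?php
// Hypothetical helper: truncate at a word boundary, keep the full text in title="".
function previewHtml(string $text, int $limit): string
{
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }
    $cut = mb_substr($text, 0, $limit, 'UTF-8');
    $lastSpace = mb_strrpos($cut, ' ', 0, 'UTF-8');
    if ($lastSpace !== false && $lastSpace > 0) {
        $cut = mb_substr($cut, 0, $lastSpace, 'UTF-8');   // back up to the last space
    }
    return sprintf(
        '<span title="%s">%s...</span>',
        htmlspecialchars($text, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($cut, ENT_QUOTES, 'UTF-8')
    );
}

$html = previewHtml("\u{00C9}mile Zola wrote many novels", 12);
// <span title="Émile Zola wrote many novels">Émile Zola...</span>
```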
How This Plays Out at Different Scales
A small application handling primarily ASCII text may not notice multi-byte issues immediately. A growing application with any international user base needs the mb_ function discipline described earlier applied consistently. A large, genuinely global application typically needs explicit internal-encoding configuration and dedicated test coverage for non-Latin scripts as a standard part of its test suite.
What to Do When You Inherit a Codebase With Byte-Based String Bugs
Search systematically for strlen/substr/strtoupper/strtolower calls rather than waiting for international users to report garbled text. Prioritize fixing anything touching user-facing names, addresses, or any field likely to contain non-ASCII text. Add multi-byte test fixtures to the test suite going forward so this class of regression gets caught automatically rather than relying on manual international testing.
Final Checklist Before Shipping
Confirm mb_ functions are used anywhere text might be non-ASCII.
Confirm mb_internal_encoding is set explicitly rather than relying on a possibly-wrong default.
Confirm sprintf templates handle the expected value types correctly, including edge cases like null.
Confirm any user-facing truncation preserves full text accessibly for assistive technology.
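The null edge case deserves a concrete look, because sprintf coerces null silently instead of failing. The guard function below is a hypothetical sketch of one way to fail loudly instead:

```php
<?php
// sprintf() coerces null without complaint: %d gives "0", %s gives "", %.2f gives "0.00".
$line = sprintf('Order #%d for %s totaling $%.2f', null, null, null);
// "Order #0 for  totaling $0.00"

// Hypothetical guard: reject missing values before formatting.
function formatOrderLine(?int $orderId, ?string $customerName, ?float $total): string
{
    if ($orderId === null || $customerName === null || $total === null) {
        throw new InvalidArgumentException('Order line is missing a required value.');
    }
    return sprintf('Order #%d for %s totaling $%.2f', $orderId, $customerName, $total);
}

$ok = formatOrderLine(1042, 'Ada', 19.5);   // "Order #1042 for Ada totaling $19.50"
```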
Closing Thought
PHP's string function library is large enough that there's rarely a need to hand-roll string manipulation logic; the more valuable skill is knowing which function already exists for a given task, and specifically remembering to reach for its mb_ variant whenever the text in question might not be plain ASCII.
Framework-Specific Defaults Worth Knowing
Laravel's Str helper class wraps many common string operations and, for several operations, already defaults to multi-byte-safe behavior, reducing the chance of accidentally reaching for a byte-based function when a character-aware one was needed. Direct use of PHP's native string functions outside the Str helper still requires the same mb_ discipline described throughout this guide.
Performance Considerations for String-Heavy Code
mb_ functions carry a small performance overhead compared to their byte-based equivalents, which matters only in genuinely hot code paths processing large volumes of text; for the overwhelming majority of application code, correctness on non-ASCII text is worth far more than this marginal difference, and premature optimization away from mb_ functions is rarely justified.
Building a Slug or URL-Safe String Correctly
Generating a URL slug from a title containing accented or non-Latin characters requires transliteration (converting accented characters to their closest ASCII equivalent) before applying typical slug rules, since naively stripping non-ASCII characters can leave a slug that's empty or meaningless for titles written entirely in a non-Latin script. Laravel's Str::slug() handles common transliteration cases out of the box, which is worth knowing before reaching for a hand-rolled regex.
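Outside a framework, a slug helper might be sketched as follows. The function name is hypothetical, and transliteration here relies on the intl extension when it is loaded; without it, non-ASCII characters are simply dropped, which is exactly the naive behavior described above:

```php
<?php
// Hypothetical helper: transliterate when intl is available, then apply slug rules.
function slugify(string $title): string
{
    if (function_exists('transliterator_transliterate')) {
        $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII', $title);
        if ($ascii !== false) {
            $title = $ascii;
        }
    }
    $slug = strtolower($title);
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);   // collapse everything else to "-"
    return trim($slug, '-');
}

$slug = slugify('Hello, World!');   // "hello-world"
```

Callers should still check for an empty result and fall back to something like a record ID, since a title with no transliterable characters can reduce to nothing.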
Validating String Length for User-Facing Limits
A character-limit validation rule (a 280-character post limit) should count using mb_strlen, not strlen, or users writing in scripts with multi-byte characters get an effectively shorter limit than users writing in plain ASCII, which is both a correctness bug and a fairness problem across your user base.
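A minimal sketch of such a rule, with a hypothetical function name and a sample post of 200 Japanese characters, shows how far the two counts diverge:

```php
<?php
// Hypothetical validator for a 280-character post limit.
function isWithinPostLimit(string $post, int $limit = 280): bool
{
    return mb_strlen($post, 'UTF-8') <= $limit;
}

$post = str_repeat("\u{3042}", 200);   // 200 copies of "あ", 3 bytes each in UTF-8

$allowedByChars = isWithinPostLimit($post);   // true: 200 characters
$allowedByBytes = strlen($post) <= 280;       // false: 600 bytes
```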
Educating New Developers on Multi-Byte Safety
A new developer reaching for substr out of habit, without realizing mb_substr exists or why it matters, is an easy mistake to make without prior exposure; a short onboarding note with a real garbled-text example from your own codebase's history makes the lesson concrete rather than abstract.
A Final Word on String Handling Discipline
Multi-byte safety, choosing a plain string function over regex where one suffices, and clear structured formatting via sprintf each address a different failure mode; the common thread is recognizing that real-world text is messier than the ASCII test data most of us write first.
One More Practical Habit
Add at least one non-ASCII test fixture (an accented name, a non-Latin sample string) to your shared test factory data, so every test touching that fixture automatically exercises multi-byte handling without anyone needing to remember to test it specifically.
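In a plain PHP test setup, this could be as small as the sketch below. The function name and sample names are hypothetical, and a real project would put the data in whatever factory or fixture mechanism it already uses:

```php
<?php
// Hypothetical shared fixture data; the names are sample values only.
function sampleUserNames(): array
{
    return [
        'ascii'     => 'Alice Smith',
        'accented'  => 'José Núñez',
        'non_latin' => '山田太郎',
    ];
}

// Any test that loops over the fixtures now exercises multi-byte input as well.
$truncated = [];
foreach (sampleUserNames() as $label => $name) {
    $truncated[$label] = mb_substr($name, 0, 4, 'UTF-8');
}
```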
A Closing Note on Defaults
When in doubt about whether a given piece of text might contain non-ASCII characters, default to the mb_ function; the cost of using it unnecessarily is negligible, while the cost of skipping it on text that turns out to be multi-byte is a real, user-visible bug.