Real-World Data Wrangling Examples
This tutorial demonstrates practical applications of data-wrangler across different domains and use cases. We’ll explore how to handle messy, real-world data scenarios effectively.
Scenarios Covered
Customer Feedback Analysis: Processing mixed review data
Research Data Integration: Combining surveys, papers, and datasets
Content Recommendation: Building similarity systems
Data Pipeline Automation: Using decorators for robust preprocessing
Let’s dive into practical examples!
[ ]:
import datawrangler as dw
import pandas as pd
import numpy as np
from datawrangler import funnel
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import seaborn as sns
Example 1: Customer Feedback Analysis
Imagine you’re analyzing customer feedback from multiple sources: surveys, social media, emails, and review sites. Each source has different formats and structures.
[ ]:
# Simulate different types of customer feedback
survey_responses = [
    "The product quality exceeded my expectations. Highly recommend!",
    "Delivery was slow but customer service was helpful.",
    "Great value for money. Will purchase again.",
    "Product broke after one week. Very disappointed.",
    "Easy to use interface. Love the new features."
]
social_media_posts = [
    "Just tried @YourBrand and wow! 🔥 #amazing #quality",
    "Not impressed with recent purchase from @YourBrand 😞",
    "Customer support team went above and beyond! 👏 #excellent",
    "Website was confusing, took forever to find what I needed",
    "Fast shipping and great packaging! @YourBrand #satisfied"
]
email_feedback = [
    "I wanted to express my satisfaction with your recent service...",
    "The invoice process needs improvement. Too complicated.",
    "Your technical documentation is excellent and comprehensive.",
    "Had trouble with the mobile app. Crashes frequently.",
    "Pricing is competitive compared to other options."
]
print(f"Survey responses: {len(survey_responses)}")
print(f"Social media posts: {len(social_media_posts)}")
print(f"Email feedback: {len(email_feedback)}")
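Because each source carries its own markup, it can help to normalize source-specific noise before embedding. The following is a minimal sketch using only the standard library; `normalize_post` is a hypothetical helper, not part of data-wrangler, and whether stripping handles and hashtag symbols actually improves results depends on the embedding model you use.

```python
import re

def normalize_post(text):
    """Hypothetical helper: strip @mentions, unwrap #hashtags, collapse spaces."""
    text = re.sub(r"@\w+", "", text)          # drop account handles
    text = re.sub(r"#(\w+)", r"\1", text)     # keep the hashtag word, drop the '#'
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

posts = [
    "Just tried @YourBrand and wow! 🔥 #amazing #quality",
    "Fast shipping and great packaging! @YourBrand #satisfied",
]
cleaned = [normalize_post(p) for p in posts]
for original, clean in zip(posts, cleaned):
    print(f"{original!r} -> {clean!r}")
```

Emoji are left in place here, since they may carry sentiment that a modern text model can use.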
Unified Analysis with Data Wrangler
Now let’s use data-wrangler to analyze all feedback sources together, embedding the text with a sentence-transformers model (all-mpnet-base-v2) so that semantically similar feedback ends up close together:
[ ]:
@funnel
def analyze_customer_sentiment(feedback_data, text_kwargs={'model': 'all-mpnet-base-v2'}):
    """Comprehensive customer feedback analysis."""
    print(f"Analyzing {feedback_data.shape[0]} feedback items...")
    print(f"Embedding dimensions: {feedback_data.shape[1]}")

    # Cluster feedback into themes
    n_clusters = 3
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(feedback_data)

    # Crude placeholder score: the mean of each item's embedding vector.
    # This is NOT a validated sentiment measure; it only illustrates
    # computing a per-item statistic from the embeddings.
    sentiment_scores = feedback_data.mean(axis=1)

    results = pd.DataFrame({
        'cluster': clusters,
        'sentiment_proxy': sentiment_scores
    })

    # Summary statistics
    summary = {
        'total_feedback': len(feedback_data),
        'themes_identified': n_clusters,
        'avg_sentiment': sentiment_scores.mean(),
        'sentiment_std': sentiment_scores.std(),
        'cluster_distribution': results['cluster'].value_counts().to_dict()
    }
    return results, summary

# Combine all feedback sources
all_feedback = survey_responses + social_media_posts + email_feedback
# Analyze with one function call
feedback_analysis, summary = analyze_customer_sentiment(all_feedback)
print("\n=== Customer Feedback Analysis Summary ===")
for key, value in summary.items():
    print(f"{key}: {value}")
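To make the decorator pattern above more concrete, here is a simplified sketch of the general idea: a decorator that coerces its first argument into a DataFrame before the wrapped function runs. `coerce_to_dataframe` is a hypothetical stand-in written for illustration. It is not data-wrangler's implementation, and it handles only numeric input, whereas `funnel` also embeds text.

```python
import functools

import numpy as np
import pandas as pd

def coerce_to_dataframe(func):
    """Hypothetical funnel-style decorator (illustration only)."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(data, *args, **kwargs):
        # Convert lists/arrays to a 2-D DataFrame; pass DataFrames through untouched
        if not isinstance(data, pd.DataFrame):
            data = pd.DataFrame(np.atleast_2d(np.asarray(data, dtype=float)))
        return func(data, *args, **kwargs)
    return wrapper

@coerce_to_dataframe
def row_means(df):
    """Mean of each row; the body can assume df is always a DataFrame."""
    return df.mean(axis=1)

means = row_means([[1.0, 3.0], [2.0, 6.0]])  # a plain nested list goes in
print(means.tolist())
```

The benefit of this design is that the analysis function only ever has to handle one input type, while callers can pass whatever they have on hand.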
Visualizing Feedback Themes
Let’s create a visualization to understand the feedback distribution:
[ ]:
# Create source labels for visualization
source_labels = (['Survey'] * len(survey_responses) +
                 ['Social Media'] * len(social_media_posts) +
                 ['Email'] * len(email_feedback))
feedback_analysis['source'] = source_labels
feedback_analysis['feedback_text'] = all_feedback
# Plot cluster distribution by source
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
cluster_source_counts = feedback_analysis.groupby(['cluster', 'source']).size().unstack(fill_value=0)
cluster_source_counts.plot(kind='bar', ax=plt.gca())
plt.title('Feedback Themes by Source')
plt.xlabel('Theme Cluster')
plt.ylabel('Count')
plt.legend(title='Source')
plt.xticks(rotation=0)
plt.subplot(1, 2, 2)
plt.scatter(feedback_analysis['sentiment_proxy'], feedback_analysis['cluster'],
            c=feedback_analysis['cluster'], cmap='viridis', alpha=0.7)
plt.xlabel('Sentiment Proxy Score')
plt.ylabel('Theme Cluster')
plt.title('Sentiment vs Theme Distribution')
plt.colorbar(label='Cluster')
plt.tight_layout()
plt.show()
# Show example feedback from each cluster
print("\n=== Example Feedback by Theme ===")
for cluster in sorted(feedback_analysis['cluster'].unique()):
    cluster_examples = feedback_analysis[feedback_analysis['cluster'] == cluster]
    print(f"\nCluster {cluster} Examples:")
    for idx, row in cluster_examples.head(2).iterrows():
        print(f" - ({row['source']}) {row['feedback_text'][:60]}...")
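The cluster-by-source counts used for the bar chart come from a `groupby(...).size().unstack(...)` chain. As a small illustration on made-up toy data (not the tutorial's actual clustering output), the sketch below shows the table that chain produces and that `pd.crosstab` yields the same counts more directly.

```python
import pandas as pd

# Toy labels, invented purely for illustration
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2],
    "source": ["Survey", "Email", "Survey", "Survey", "Email"],
})

# Same pattern as the plotting cell: count rows per (cluster, source) pair
via_groupby = toy.groupby(["cluster", "source"]).size().unstack(fill_value=0)

# Equivalent contingency table built in a single call
via_crosstab = pd.crosstab(toy["cluster"], toy["source"])

print(via_crosstab)
same_counts = bool((via_groupby.values == via_crosstab.values).all())
print(f"Same counts: {same_counts}")
```

Either form works for plotting; `crosstab` is arguably easier to read when all you need is counts.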
Example 2: Research Data Integration
Academic researchers often need to combine data from multiple sources: research papers, survey responses, experimental datasets, and literature reviews. Let’s see how data-wrangler can streamline this process.
[ ]:
# Simulate research data from different sources
paper_abstracts = [
    "Deep learning models have shown remarkable performance in natural language processing tasks.",
    "Transformer architectures revolutionized machine translation and text understanding.",
    "BERT and its variants achieved state-of-the-art results across multiple NLP benchmarks.",
    "Large language models demonstrate emergent capabilities in few-shot learning scenarios.",
    "Fine-tuning pretrained models enables efficient adaptation to specific domains."
]
survey_responses = [
    "I find AI tools helpful for research but worry about accuracy.",
    "Machine learning has accelerated my data analysis workflow significantly.",
    "The interpretability of AI models is crucial for scientific applications.",
    "Collaborative AI tools enhance productivity in research teams.",
    "Ethical considerations in AI research are becoming increasingly important."
]
experimental_notes = [
    "Model A achieved 94% accuracy on validation set with minimal overfitting.",
    "Hyperparameter tuning improved F1-score from 0.85 to 0.91.",
    "Data augmentation techniques reduced training variance significantly.",
    "Cross-validation results show consistent performance across folds.",
    "Ensemble methods provided 3% improvement over single models."
]
print("Research data sources:")
print(f"- Paper abstracts: {len(paper_abstracts)}")
print(f"- Survey responses: {len(survey_responses)}")
print(f"- Experimental notes: {len(experimental_notes)}")
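When sources like these are later analyzed together, it is often useful to keep track of where each item came from. One possible way to do that, sketched here with plain pandas rather than any data-wrangler API, is to build a single long table with a `source` column. The `sources` dictionary and its labels are assumptions made for this illustration, and only one item per source is shown to keep it short.

```python
import pandas as pd

# One example item per source, copied from the lists above
sources = {
    "paper_abstract": [
        "Transformer architectures revolutionized machine translation and text understanding.",
    ],
    "survey_response": [
        "I find AI tools helpful for research but worry about accuracy.",
    ],
    "experimental_note": [
        "Hyperparameter tuning improved F1-score from 0.85 to 0.91.",
    ],
}

# Flatten into one row per text, remembering which source it came from
corpus = pd.DataFrame(
    [(name, text) for name, texts in sources.items() for text in texts],
    columns=["source", "text"],
)
print(corpus["source"].value_counts().to_dict())
```

Keeping the label alongside the text means later steps can group or color results by source without re-deriving it from list lengths.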