Sunday, 30 September 2018

The Best Turntables For Listening to Vinyl Records

Turntables and vinyl records have made quite the comeback in recent years: record sales are at a 25-year high.

Click Here to Continue Reading

What Is a Meme (and How Did They Originate)?

If you’ve used the internet for more than a few days, you’ve probably seen a meme. They’ve become an integral part of modern online life. But, where did they get their start? How have they evolved? And where did the word “meme” come from, anyway?

Saturday, 29 September 2018

Geek Trivia: What Version Of Windows Is Windows 7?

Think you know the answer? Click through to see if you're right!

The Best Stools For Standing Desks

Standing desks are great for keeping you moving and your posture straight.

Click Here to Continue Reading

How to Make Your Chromebook as Secure as Possible

One of Chrome OS’ biggest benefits is its inherent security features. It’s regarded as one of the most secure consumer-focused operating systems, but here’s how you can eke just a bit more out of it.

India's oldest cryptocurrency exchange Zebpay shuts down

Zebpay exchange will not accept any new order until further notice and will cancel unexecuted crypto-to-crypto orders and credit user coins/tokens back to their Zebpay wallets.

Did You Get Logged Out of Facebook? It’s Because 50 Million People Got Hacked

The bad news for Facebook’s users just won’t end. Today Facebook had to admit that the accounts for 50 million people were somehow accessed by hackers abusing a little-known feature.

Windows 10’s Mail App Won’t Force You to Use Edge After All

Here’s some good news about the October 2018 Update: Microsoft has completely backed off on plans to force the Edge browser on Windows 10 Mail app users.

How to Remove Audio When Posting Videos to Instagram

More often than not, you want to keep the audio with the videos you post to social media, but sometimes you don’t. Here’s how to remove the audio from any videos that you post to Instagram.

Friday, 28 September 2018

Good Deal: Get This AmazonBasics Smartphone Repair Kit for Just $8

In an age where a lot of users want to keep their smartphones for as long as possible, DIY mobile phone repair has been gaining steam, and Ama…

Click Here to Continue Reading

Facebook is Using Your Phone Number to Target Ads and You Can’t Stop It

Tech publications are screaming today that giving Facebook your phone number for 2FA allows them to target you for ads. But this misses a bigger point: Facebook is using your phone number to target ads whether you give it to them willingly or not.

What Are YouTube Premieres, and How Do You Use Them?

YouTube has added a new way to share and watch videos: YouTube Premieres. Premieres are a mix between a live stream and a traditional YouTube video. You prerecord them, but then play those recordings live, with live chat and donations like standard live streams.

Interview with Kinnari Ladha, Head of Business Intelligence and Data / TUI Group. 

This blog post is part of the Big Data Week Speaker interviews series. We sat down with Kinnari Ladha, Head of Business Intelligence and Data / TUI Group. She will speak at the upcoming Big Data Week London Conference, on October 5, about “Using Big Data Analytics Strategically to Drive Business Decisions to Improve Customer Experience”.  Reserve your seat today! 1. Effective use of big data seems like it is an established standard within a hospitality enterprise’s ranks, especially with the executive level and departments tasked with applying business intelligence. However, in today’s day and age, as it regards trickling down the effective use of big data insights and applications to the non-technical hospitality team member (front desk check-in, hotel [...] Continue Reading

Why the iPhone XR has the Best Battery Life of the New iPhone X Series

If you care about battery life on your iPhone, you may want to take a closer look at the XR. It has the best battery life of any iPhone before it, despite being the most affordable in the new lineup.

What Is an MPEG File (and How Do I Open One)?

A file with the .mpeg (or .mpg) file extension is an MPEG video file format, which is a popular format for movies that are distributed on the internet. They use a specific type of compression that makes streaming and downloading much quicker than other popular video formats.

Today’s VR is Just the Start: Here’s What is Coming in the Future

The Oculus Quest is an impressive piece of hardware, but there’s so much more to come. Virtual reality is just getting started. Here’s what has us most excited.

Updates for macOS Are No Longer in the Mac App Store, Here’s Where to Find Them

Updates for macOS are no longer in the Mac App Store as of macOS Mojave; they’ve moved to System Preferences, in the new Software Update panel.

The Best Sites for Background or Ambient Noise

Whether you need to focus on a project or just relax, background noise can help with either of those things. Here are the best websites and sources for background and ambient noise.

The Best Keyboard Trays For Improved Ergonomics

If you spend a lot of time at your desk, it’s important to arrange your workspace to minimize strain on your body and maximize comfort. …

Click Here to Continue Reading

Kids Can, and Will, Work Around Any Parental Restrictions You Set Up

Kids are easily working around Apple’s parental control system Screen Time, finding various ways to do what they want regardless of restrictions set up by their parents.

Should You Upgrade Your Echos to the New 2018 Models?

If you have the original Echo still kicking it in your home, it might be time for an upgrade.

Click Here to Continue Reading

In Our Customer’s Words – Previewing CX Day 2018

Next week we’ll be celebrating Customer Experience Day! October 2nd marks the 6th annual CX Day, which “celebrates the professionals and companies that make great customer experiences happen. It’s an opportunity to recognize great customer work, discover professional development opportunities, and strengthen professional networks.” (CXDay.org) Hortonworks has seen tremendous growth over the years, since its […]

The post In Our Customer’s Words – Previewing CX Day 2018 appeared first on Hortonworks.

Check Out This Virtual Tour of the Garage Where Google Started

Google is 20 years old today, and to celebrate Google Maps added the garage where the company was founded to Street View. You can check it out right now.

How to Use Drawing Tools in Windows 10 Mail

Microsoft recently released a new feature for the Windows 10 Mail app that lets you convey messages with drawings right inside the body of an email. This is a great way to quickly sketch things like graphs or tables to get your point across when simple text just doesn’t do the trick.

Use Conditional Formatting to Make Important Outlook Messages Stand Out

Outlook lets you create and customize folder views in many ways, like adding and removing columns or grouping and sorting messages. You can also apply rules to make Outlook display messages in different ways based on their properties (like the sender, subject line, or timestamp). This is called conditional formatting. Let’s take a look at how it works.

The Best Inexpensive Drones For Beginners

You’d like to get into the drone sensation, but you don’t want to break the bank to do it.

Click Here to Continue Reading

How Dual SIM Support Works in the New iPhone X Series

While dual SIM technology has been around for several years now, the new iPhone X series (XS, XS Max, and XR) marks the first time it’s been available in an iPhone. But what does that mean?

How to Hide Your IP Address (and Why You Might Want To)

Your IP address is like your public ID on the internet. Any time you do anything on the internet, your IP address lets servers know where to send back information you’ve requested. Many sites log these addresses, effectively spying on you, usually to deliver you more personalized ads to get you to spend more money. For some people, this is a significant issue, and there are ways to hide your IP address.

Plex is Killing the Plugin Directory, Here’s How to Install Plugins Yourself

Plex is shutting down its plugin directory, but will continue to support manually installed plugins “for the foreseeable future.”

What Your Function Keys Do in Microsoft PowerPoint

The function keys on keyboards don’t get the love they used to, but depending on the app you’re running, they can still be quite handy. Microsoft PowerPoint has some interesting features tucked away behind your function keys. Here’s what they do.

The Oculus Quest Is a Standalone, 6 Degree-of-Freedom VR Headset Coming Next Spring For $399

Today, Facebook announced the new Oculus Quest, a standalone VR headset that features the same six degrees of freedom as the higher-end Oculus…

Click Here to Continue Reading

Wednesday, 26 September 2018

What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release

This is a community blog from Yinan Li, a software engineer at Google, working in the Kubernetes Engine team. He is one of the contributors, drawn from a group of companies, who have worked on Kubernetes support in the upcoming Apache Spark 2.4.

Since the Kubernetes cluster scheduler backend was initially introduced in Apache Spark 2.3, the community has been working on a few important new features that make Spark on Kubernetes more usable and ready for a broader spectrum of use cases. The upcoming Apache Spark 2.4 release comes with a number of new features, some of which are highlighted below:

  • Support for running containerized PySpark and SparkR applications on Kubernetes.
  • Client mode support that allows users to run interactive applications and notebooks.
  • Support for mounting certain types of Kubernetes volumes.

Below we will take a deeper look into each of the new features.

PySpark Support

The soon-to-be-released Spark 2.4 supports running PySpark applications on Kubernetes. Both Python 2.x and 3.x are supported, and the major version of Python can be specified using the new configuration property spark.kubernetes.pyspark.pythonVersion, which accepts the value 2 or 3 and defaults to 2. Spark ships with a Dockerfile for a base image with the Python bindings required to run PySpark applications on Kubernetes. Users can build the base image from this Dockerfile as is, or customize it to build their own image.
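As a rough sketch of how the property might be passed at submission time, the Python helper below assembles a spark-submit argument list. The property name spark.kubernetes.pyspark.pythonVersion comes from the text above; the API server URL, the image name, and the application path are placeholders invented for illustration, and build_submit_args is a hypothetical helper, not part of Spark.

```python
from typing import Dict, List


def build_submit_args(master: str, app: str, conf: Dict[str, str]) -> List[str]:
    """Assemble a spark-submit command line for cluster mode on Kubernetes."""
    args = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    # Sort the keys so the generated command line is stable and reproducible.
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


# Placeholder values: replace them with your own cluster, image, and script.
conf = {
    "spark.kubernetes.container.image": "example.com/spark-py:2.4.0",
    # Property named in the post; it takes "2" or "3" and defaults to "2".
    "spark.kubernetes.pyspark.pythonVersion": "3",
}
submit_args = build_submit_args(
    "k8s://https://kube-apiserver.example.com:6443",
    "local:///opt/spark/examples/src/main/python/pi.py",
    conf,
)
print(" ".join(submit_args))
```

Building the argument list in code rather than by hand keeps the --conf pairs in one dictionary, which may make it easier to switch Python versions between runs.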

SparkR Support

Spark on Kubernetes now supports running R applications in the upcoming Spark 2.4. Spark ships with a Dockerfile of a base image with the R binding that is required to run R applications on Kubernetes. Users can use the Dockerfile to build a base image or customize it to build a custom image.

Client Mode Support

As one of the most requested features since the 2.3.0 release, client mode support is now available in the upcoming Spark 2.4. The client mode allows users to run interactive tools such as spark-shell or notebooks in a pod running in a Kubernetes cluster or on a client machine outside a cluster. Note that in both cases, users are responsible for properly setting up connectivity from the executors running in pods inside the cluster to the driver. When the driver runs in a pod in the cluster, the recommended way is to use a Kubernetes headless service to allow executors to connect to the driver using the FQDN of the driver pod. When the driver runs outside the cluster, however, it’s important for users to make sure that the driver is reachable from the executor pods in the cluster. For more detailed information on the client mode support, please refer to the documentation when Spark 2.4 is officially released.
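For the in-cluster case, the headless service idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the officially documented setup: the service name, label selector, namespace, and port are made up, the cluster DNS suffix is assumed to be the common default cluster.local, and the sketch advertises the service's DNS name to executors through the standard spark.driver.host and spark.driver.port properties.

```python
import json
from typing import Dict


def headless_service(name: str, selector: Dict[str, str], port: int) -> dict:
    """Build a manifest for a headless Service (clusterIP set to None)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # "None" is what makes the Service headless: DNS resolves
            # straight to the pod IP instead of a virtual cluster IP.
            "clusterIP": "None",
            "selector": selector,
            "ports": [{"name": "driver-rpc", "port": port, "targetPort": port}],
        },
    }


def driver_conf(service: str, namespace: str, port: int) -> Dict[str, str]:
    """Spark settings that advertise the Service's DNS name to executors."""
    return {
        "spark.driver.host": f"{service}.{namespace}.svc.cluster.local",
        "spark.driver.port": str(port),
    }


# Hypothetical names and port, chosen only for this example.
manifest = headless_service("spark-driver-svc", {"app": "spark-driver"}, 7078)
conf = driver_conf("spark-driver-svc", "default", 7078)
print(json.dumps(manifest, indent=2))
print(conf)
```

The printed JSON could be saved to a file and applied with kubectl, provided the driver pod carries a label matching the selector.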

Other Notable Changes

In addition to the new features highlighted above, the Kubernetes cluster scheduler backend in the upcoming Spark 2.4 release has also received a number of bug fixes and improvements.

  • A new configuration property spark.kubernetes.executor.request.cores was introduced for configuring the physical CPU request for the executor pods in a way that conforms to the Kubernetes convention. For example, users can now use fractional values or millicpus, like 0.5 or 500m. The value is used to set the CPU request for the container running the executor.
  • The Spark driver running in a pod in a Kubernetes cluster no longer uses an init-container for downloading remote application dependencies, e.g., jars and files on remote HTTP servers, HDFS, AWS S3, or Google Cloud Storage. Instead, the driver uses spark-submit in client mode, which automatically fetches such remote dependencies in a Spark idiomatic way.

  • Users can now specify image pull secrets for pulling Spark images from private container registries, using the new configuration property spark.kubernetes.container.image.pullSecrets.

  • Users are now able to use Kubernetes secrets as environment variables through a secretKeyRef. This is achieved using the new configuration options spark.kubernetes.driver.secretKeyRef.[EnvName] and spark.kubernetes.executor.secretKeyRef.[EnvName] for the driver and executor, respectively.

  • The Kubernetes scheduler backend code running in the driver now manages executor pods using a level-triggered mechanism and is more robust to issues talking to the Kubernetes API server.
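To illustrate the CPU convention mentioned in the list above, the short Python sketch below converts a Kubernetes-style CPU quantity into cores and shows that 0.5 and 500m describe the same request. The parse_cpu_quantity helper is hypothetical and handles only plain numbers and the "m" suffix. The configuration values are made up, and the "name:key" format shown for the secretKeyRef value is an assumption to verify against the Spark documentation.

```python
from fractions import Fraction


def parse_cpu_quantity(value: str) -> Fraction:
    """Convert a Kubernetes CPU quantity such as "0.5" or "500m" to cores."""
    value = value.strip()
    if value.endswith("m"):
        # The "m" suffix means millicpu: 1000m equals one core.
        return Fraction(int(value[:-1]), 1000)
    return Fraction(value)


# Both spellings describe half a core.
assert parse_cpu_quantity("500m") == parse_cpu_quantity("0.5")

# Hypothetical values; the property names are the ones listed above.
executor_conf = {
    "spark.kubernetes.executor.request.cores": "500m",
    "spark.kubernetes.container.image.pullSecrets": "my-registry-secret",
    # Assumed value format "<secret name>:<key>"; check the Spark docs.
    "spark.kubernetes.executor.secretKeyRef.DB_PASSWORD": "db-credentials:password",
}
requested = parse_cpu_quantity(
    executor_conf["spark.kubernetes.executor.request.cores"]
)
print(f"each executor requests {float(requested)} cores")
```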

Conclusion and Future Work

First of all, we would like to express huge thanks to Apache Spark and Kubernetes community contributors from multiple organizations (Bloomberg, Databricks, Google, Palantir, PepperData, Red Hat, Rockset and others) who have put tremendous efforts into this work and helped get Spark on Kubernetes this far. Looking forward, the community is working on or plans to work on features that further enhance the Kubernetes scheduler backend. Some of the features that are likely to be available in future Spark releases are listed below.

  • Support for using a pod template to customize the driver and executor pods. This allows maximum flexibility for customization of the driver and executor pods. For example, users would be able to mount arbitrary volumes or ConfigMaps using this feature.
  • Dynamic resource allocation and external shuffle service.
  • Support for Kerberos authentication, e.g., for accessing secure HDFS.
  • Better support for local application dependencies on submission client machines.
  • Driver resilience for Spark Streaming applications.

--

Try Databricks for free. Get started today.

The post What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release appeared first on Databricks.

Windows 10’s October Update (Redstone 5) Will Be Released on October 2nd

We’ve known for a while that the latest update for Windows 10, codenamed “Redstone 5”, was going to be released in October, but now we have a nearly confirmed date: October 2nd, 2018, the same day that Microsoft is holding an event to talk about their new Surface products.

The Best Bag Organizers for Your Laptop Bag, Backpack, or Purse

Having an organized bag can make or break your productivity levels—so why not spend more time getting w…

Click Here to Continue Reading

How to Power Up Your Twitch Stream with Streamlabs

Streamlabs OBS is a modified version of OBS designed with streamers in mind. It adds support for themes and widgets, improving the look of your stream and adding support for many useful features, such as showing donations and subscribers on stream, displaying live chat on stream, and displaying sub goals and sponsor banners.

Chrome 70 Will Let Users Opt Out of the New Auto-Sign In Feature

An upcoming Chrome option allows users to log into Google accounts without logging into the browser. The change was prompted by a backlash among users and privacy advocates.

Watch This Maniac Beat Super Mario Bros. in Four Minutes and 55 Seconds

There’s a new world record for beating Nintendo’s 1985 classic Super Mario Bros. Kosmic, a speedrunner, beat the game in 4 minutes, 55 seconds, and 913 milliseconds.

How to Enable or Disable Focused Inbox in Windows 10 Mail

Focused Inbox is a feature in the Windows Mail app that automatically filters emails into two separate tabs: Focused and Other. If you don’t like the feature, here’s how to disable it.

The Best Shaving Brushes to Lather Up With

If you have a pair of hands, you may argue that you don’t need a shaving brush. And you know what? You’d be right.

Click Here to Continue Reading

What Does “Edited for Content” Mean on Airplane Movies?

If you’ve ever watched a movie on a plane, you’ve probably seen a message like “This movie has been edited for content” pop up before it played. Ever wondered what it meant? Let’s find out.

What’s the Difference Between Optical and Digital Image Stabilization?

If you’ve ever tried to take video on your phone while walking, you know keeping the image still is tricky. There’s some neat technology designed to reduce that shaky-cam effect, and there are two different approaches to implementing it.

How to Disable that Weird New Recent Apps Section in the Dock in macOS Mojave

Install Mojave and you’ll notice it: a new section in the dock, populated by applications you’ve recently opened but don’t have pinned there. How do you get rid of it?

How to Delete Your Google Cookies in Chrome

Clearing out cookies generally signs you out of all accounts, but Chrome makes one exception: your Google account.

What Is Alexa Guard and What Can You Do With It?

Amazon held an Echo event last week where the company unveiled a huge line of new Echo products. It also unveiled some new software additions, one of which is Alexa Guard—a home security helper of sorts.

Tuesday, 25 September 2018

Facebook Seems Pretty Happy With Its Mid-Roll Video Ads, So Expect More of Them

For Facebook videos longer than a few seconds, you might have noticed mid-roll ads that interrupt what you’re watching.

Click Here to Continue Reading

Microsoft Teams Can Blur the Background During Meetings, Hiding Your Mess of an Office

Tired of tidying up your office before every team chat? Microsoft’s Teams can now blur the background for you.

Yale and August Unveil New Smart Locks That Work with Alexa, Google Assistant, and Siri

One problem with smart locks is that it’s hard to find one that works with all three voice assistant platforms, but Yale (and its sibling…

Click Here to Continue Reading

How to Place Text Over a Graphic in Microsoft Word

There are several reasons why you may want to place text over an image in a Word document. Perhaps you want to place your company logo in the background of a document you’re writing for work, or maybe you need a “confidential” watermark on a document containing sensitive information. No matter the reason, you can do it easily in Microsoft Word.

Free Download: Night Owl Automatically Switches Between Mojave’s Dark and Light Modes

Mac: Dark mode in macOS Mojave is great at night, but can make things hard to see during the day, or in bright rooms. The solution: switch between the two.

How to Generate a Wi-Fi History or WLAN Report in Windows 10

Windows 10 includes a pretty neat feature that automatically generates a detailed report of all your wireless network connection history. The report includes details about networks to which you’ve connected, session duration, errors, network adapters, and even displays the output from a few Command Prompt commands.

Office 2019 Has Arrived. Here’s Why You Probably Won’t Care.

Yesterday, Microsoft announced the availability of Office 2019 to volume licensing customers, promising general retail availability in the coming weeks. Unless you’re a business customer looking to upgrade and you’re not ready to move your Office life to the cloud, this probably won’t matter to you.

How to Safely Format SD Cards For Your Camera

A corrupt SD card is a digital photographer’s worst nightmare. All those amazing photos ruined because a few 1s and 0s are in the wrong place? While you might be able to recover your images from a corrupt card, you don’t want to be in this situation in the first place, and that means knowing how to format your SD cards properly.

What Is An EKG, and How Does It Work In The New Apple Watch?

Apple recently released its Series 4 Watch and everyone is talking about the new heart monitor feature—the electrocardiogram (EKG or ECG). While the app isn’t available yet, the hardware is in place, and it promises to be one of the Apple Watch’s most compelling new features. 

The Value of Data Analytics and the Impact it Brings upon Organisations – Interview with Alteryx’s Product Strategy Director Nick Jewell

This blog post is part of the Big Data Week Speaker interviews series. We sat down with Alteryx’s Director of Product Strategy, Dr. Nick Jewell, and discussed the importance and value of data analytics and the impact it brings upon organisations. Nick will speak at the upcoming Big Data Week London Conference, on October 5, about “Bridging the Gap: Key Lessons to Grow an Advanced Analytics Culture across Business, Technology and Data Science Teams”. Reserve your seat today! What departments have the least to do with analysing big data yet whose resistance limits the benefits of big data for a company? It can still be a battle to convince Information & Security Risk departments about the benefits of big data [...] Continue Reading

Databricks Delta: Now Available in Preview as Part of Microsoft Azure Databricks

Bringing unprecedented reliability and performance to cloud data lakes

By Cihan Biyikoglu and Singh Garewal Posted in COMPANY BLOG September 24, 2018

Designed by Databricks in collaboration with Microsoft, Azure Databricks combines the best of Databricks’ Apache Spark™-based cloud service and Microsoft Azure. The integrated service provides the Databricks Unified Analytics Platform integrated with the Azure cloud platform, encompassing the Azure Portal; Azure Active Directory; and other data services on Azure, including Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Storage; and Microsoft Power BI.

Databricks Delta, a component of Azure Databricks, addresses the data reliability and performance challenges of data lakes by bringing unprecedented data reliability and query performance to cloud data lakes. It is a unified data management system that delivers ML readiness for both batch and stream data at scale while simplifying the underlying data analytics architecture.

Further, it is easy to port code to use Delta. With today’s public preview, Azure Databricks Premium customers can start using Delta straight away and begin benefiting from the acceleration that large, reliable datasets can provide to their ML efforts. Others can try it out using the Azure Databricks 14-day trial.

Common Data Lake Challenges

Many organizations have responded to their ever-growing data volumes by adopting data lakes as places to collect their data ahead of making it available for analysis. While this has tended to improve the situation somewhat, data lakes also present some key challenges:

Query performance – The required ETL processes can add significant latency: it may take hours before incoming data shows up in a query response, so users do not benefit from the latest data. Further, as scale increases, query run times can become unacceptably long for users.

Data reliability – The complex data pipelines are error-prone and consume inordinate resources. Further, schema evolution as business needs change can be effort-intensive. Finally, errors or gaps in incoming data, a not uncommon occurrence, can cause failures in downstream applications.

System complexity – It is difficult to build flexible data engineering pipelines that combine streaming and batch analytics. Building such systems requires complex, low-level code. Interventions during stream processing, such as batch corrections or programming multiple streams from the same sources or to the same destinations, are restricted.

Databricks Delta To The Rescue

Databricks Delta is already in use by several customers (handling more than 300 billion rows and more than 100 TB of data per day) as part of a private preview. Today we are excited to announce that it is entering Public Preview for Microsoft Azure Databricks Premium customers, expanding its reach to many more.

Using an innovative new table design, Delta supports both batch and streaming use cases with high query performance and strong data reliability while requiring a simpler data pipeline architecture:

Increased query performance – Able to deliver 10 to 100 times faster performance than Apache Spark(™) on Parquet through the use of key enablers such as compaction, flexible indexing, multi-dimensional clustering and data caching.

Improved data reliability – By employing ACID (“all or nothing”) transactions, schema validation and enforcement, exactly-once semantics, snapshot isolation and support for UPSERTS and DELETES.

Reduced system complexity – Through the unification of batch and streaming in a common pipeline architecture – being able to operate on the same table also means a shorter time from data ingest to query result. Schema evolution provides the ability to infer schema from input data making it easier to deal with changing business needs.
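To make the UPSERT, schema enforcement and “all or nothing” ideas concrete, here is a toy in-memory sketch in Python. It is not how Delta is implemented and uses no Delta API; the upsert function, the id key, the schema, and the sample rows are all invented for illustration. The function validates every incoming row before touching the table, so a bad batch leaves the table unchanged.

```python
from typing import Dict, List

Row = Dict[str, object]
SCHEMA = {"id": int, "status": str}  # hypothetical table schema


def upsert(table: Dict[int, Row], batch: List[Row]) -> Dict[int, Row]:
    """Return a new table with the batch merged in, or raise and change nothing."""
    # Schema enforcement: reject the whole batch if any row is malformed.
    for row in batch:
        if set(row) != set(SCHEMA) or not all(
            isinstance(row[col], typ) for col, typ in SCHEMA.items()
        ):
            raise ValueError(f"row does not match schema: {row!r}")
    # Work on a copy so readers of the old table never see a partial write.
    merged = dict(table)
    for row in batch:
        merged[row["id"]] = row  # update if the key exists, insert otherwise
    return merged


events = {1: {"id": 1, "status": "new"}}
events = upsert(events, [{"id": 1, "status": "done"}, {"id": 2, "status": "new"}])
print(events)

try:
    # The second row has a string id, so the whole batch is rejected.
    upsert(events, [{"id": 3, "status": "new"}, {"id": "oops", "status": "new"}])
except ValueError:
    print("bad batch rejected; table still has", len(events), "rows")
```

Returning a new dictionary instead of mutating in place is a loose stand-in for snapshot isolation: anything still holding the old table keeps seeing a consistent view.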

The Versatility of Delta

Delta can be deployed to help address a myriad of use cases including IoT, clickstream analytics and cyber security. Indeed, some of our customers are already finding value with Delta for these – I hope to share more on that in future posts. My colleagues have written a blog (Simplify Streaming Stock Data Analysis Using Databricks Delta) to showcase Delta that you might find interesting.

Easy to Adopt: Check Out Delta Today

Porting existing Spark code to use Delta is as simple as changing

CREATE TABLE … USING parquet

to

CREATE TABLE … USING delta

or changing

dataframe.write.format("parquet").save("/data/events")

to

dataframe.write.format("delta").save("/data/events")
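One way to keep that switch in a single place is to treat the format as a parameter. The sketch below is a hypothetical helper, not a Databricks API: it assumes df is a Spark DataFrame, relies on the standard DataFrameWriter calls format() and save() (writes go through save(), since the writer has no load() method), and assumes it runs in an environment where the delta format is available.

```python
def write_events(df, path: str, fmt: str = "delta") -> None:
    """Write a Spark DataFrame; only the format string differs between sinks."""
    # DataFrameWriter.format() selects the data source; save() writes to path.
    df.write.format(fmt).save(path)


# Example calls (they need a live Spark session and DataFrame):
#   write_events(dataframe, "/data/events")              # Delta
#   write_events(dataframe, "/data/events_pq", "parquet")  # plain Parquet
```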

If you are already using Azure Databricks Premium you can explore Delta today using:

If you are not already using Databricks, you can try Databricks Delta for free by signing up for the free Azure Databricks 14-day trial.

You can learn more about Delta from the Databricks Delta documentation.


The post Databricks Delta: Now Available in Preview as Part of Microsoft Azure Databricks appeared first on Databricks.

Walmart’s veggie-tracking technology: Use Blockchain

Walmart says it now has a better system for pinpointing which batches of leafy green vegetables might be contaminated.

What’s with Alexa's Spaghetti Strategy?

Amazon is racing to position its tech at the centre of our Jetsons-like home of the future.

Watch This Rubik’s Cube Solve Itself

Finally: a Rubik’s cube that can solve itself. A maker named Human Controller built it in Japan, and you can see it in action right now.

How to Edit the Subject of a Message You’ve Been Sent in Outlook

It’s annoying getting an important email with an irrelevant or missing subject line. Sure, you can categorize your mail to help you find it later, but nothing beats a useful subject line when you’re looking at your search results. Outlook has a little-known feature that lets you edit the subject line of emails you’ve received, making this annoyance a thing of the past.

Best Budget Fountain Pens

If you’ve any interest in writing nicely then a $1 ballpoint won’t cut it; you need to look at a fountain pen.

Unlike a lot of our guid…

Click Here to Continue Reading

Roku Announces New $40 “Premiere” 4K Player, Upcoming Google Assistant Support

Roku just added a couple of killer little pieces to its already strong catalog of streaming devices with the new $40 Premiere and $50 Premiere+…

Click Here to Continue Reading

Chrome Now Logs all Google Users Into the Browser. Should You Care?

Google changed how logging into the browser works earlier this month: logging into any Google app now logs you in with Chrome as well.

The Best Rechargeable Batteries And Chargers

Two out of two Review Geek staffers agree: the second “e” in “rechargeable” is awkward and unnecessary.

Click Here to Continue Reading

What is a 504 Gateway Timeout Error (and How Can I Fix It)?

A 504 Gateway Timeout Error happens when a server that was attempting to load a web page did not get a response in time from another server. Almost always, the error is on the website itself, and there’s nothing you can do about it but try again later. Still, there are a few quick things you can try on your end.

How to Reset File Explorer’s Folder View in Windows 10

Windows 10 lets you customize how you see the contents of your folders by adding or removing the preview/details pane, viewing layout of icons, grouping and sorting, and more. If you want to get rid of customizations you’ve made, you can reset the folder view to its default.

Why Does My Phone Get Hot?

If you’ve had your phone for more than a few days, you’ve probably noticed that occasionally it gets hot when you’re using it. This is (almost always) normal. Here’s why it happens.

Buying New vs. Used Smartphones: What’s the Cheaper Option?

Buying your smartphones used can save you a lot of money over buying new—or can it? We’ve done some math to see what the best route is as far as buying new vs. used smartphones.

Sunday, 23 September 2018

The Best Gaming Rocker Chairs For The Gamer In Your Life

If you’re looking for a cross between the easy comfort of a bean bag chair and the support of desk chair, gaming “rocker” ch…

Click Here to Continue Reading

How 5G Could Transform Your Home Internet Connection

Verizon is about to launch home internet service using 5G. This new wireless standard isn’t just about faster data for your smartphone—it could finally offer competition for home internet, breaking the cable companies’ local monopolies and giving you a choice.

Saturday, 22 September 2018

Not Just Books: All the Free Digital Stuff Your Local Library Might Offer

You might think of libraries as old fashioned, or irrelevant in the age of the internet. You’d be wrong.

Facebook likely to launch 'Portal' video chat device

Facebook is reportedly set to announce its own video chat device called Portal next week, taking on Amazon's smart home devices.

Are Your Smarthome Devices Spying on You?

In a world where we’re all paranoid about devices spying on us (and rightfully so), perhaps no other devices receive more scrutiny than smarthome products. But is that scrutiny warranted?

Friday, 21 September 2018

Customize Chrome’s New Tab Page, No Extensions Required

Do you use an extension to customize Chrome’s new tab page? That’s not necessary anymore: you can now customize the default new tab page.

How to Make Good YouTube Videos

Using decent gear is a prerequisite to making suitable content. A good camera or a good rig for streaming games ensures your video feels more like a quality production. A good mic ensures people can hear what you’re saying and that you don’t sound like a jet engine warming up.

How to Use MLflow To Reproduce Results and Retrain Saved Keras ML Models

In part 2 of our series on MLflow blogs, we demonstrated how to use MLflow to track experiment results for a Keras network model using binary classification. We classified reviews from an IMDB dataset as positive or negative. And we created one baseline model and two experiments. For each model, we tracked its respective training accuracy and loss and validation accuracy and loss.

In this third part in our series, we’ll show how you can save your model, reproduce results, load a saved model, predict unseen reviews (all easily with MLflow), and view results in TensorBoard.

Saving Models in MLflow

MLflow logging APIs allow you to save models in two ways. First, you can save a model on a local file system or in cloud storage such as S3 or Azure Blob Storage; second, you can log a model along with its parameters and metrics. Both preserve the Keras HDF5 format, as noted in the MLflow Keras documentation.

First, if you save the model using the MLflow Keras model API to a store or filesystem, other ML developers not using MLflow can access your saved models using the generic Keras Model APIs. For example, within your MLflow runs, you can save a Keras model as shown in this sample snippet:

import mlflow.keras
# your Keras-built, trained, and tested model
model = ...
# local path, or remote path on S3 or Azure Blob Storage
model_dir_path = ...
# save the model to the local or remotely accessible path
mlflow.keras.save_model(model, model_dir_path)

Once saved, ML developers outside MLflow can simply use the Keras APIs to load the model and make predictions with it. For example:

import keras
from keras.models import load_model

model_dir_path = ...
new_data = ...
model = load_model(model_dir_path)
predictions = model.predict(new_data)

Second, you can save the model as part of your run experiments, along with other metrics and artifacts as shown in the code snippet below:

import mlflow
import mlflow.keras
# your Keras-built, trained, and tested model
model = ...
with mlflow.start_run():
    # log metrics (the values come from your own training and evaluation code)
    mlflow.log_metric("binary_loss", binary_loss)
    mlflow.log_metric("binary_acc", binary_acc)
    mlflow.log_metric("validation_loss", validation_loss)
    mlflow.log_metric("validation_acc", validation_acc)
    mlflow.log_metric("average_loss", average_loss)
    mlflow.log_metric("average_acc", average_acc)
    # log artifacts
    mlflow.log_artifacts(image_dir, "images")
    # log model
    mlflow.keras.log_model(model, "models")

With this second approach, you can find the model’s run_uuid or location in the MLflow UI, listed among the run’s saved artifacts:

Fig 1. MLflow UI showing artifacts and Keras model saved

In our IMDB example, you can view code for both modes of saving in train_nn.py, class KTrain(). Saving the model in this way lets you reproduce the results from within the MLflow platform or reload the model for further predictions, as we’ll show in the sections below.

Reproducing Results from Saved Models

As part of the machine learning development life cycle, reproducibility of any model experiment by ML team members is imperative. Often you will want to either retrain or reproduce a run from several past experiments to review the respective results for sanity, auditability, or curiosity.

One way, in our example, is to manually copy logged hyper-parameters from the MLflow UI for a particular run_uuid and rerun using main_nn.py or reload_nn.py with the original parameters as arguments, as explained in the README.md.

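Another way is to copy the run_uuid of the run you want to reproduce and pass it to reproduce_run_nn.py. Either way, you can reproduce your old runs and experiments: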

python reproduce_run_nn.py --run_uuid=5374ba7655ad44e1bc50729862b25419
python reproduce_run_nn.py --run_uuid=5374ba7655ad44e1bc50729862b25419 [--tracking_server=URI]

Or use the mlflow run command:

mlflow run keras/imdbclassifier -e reproduce -P run_uuid=5374ba7655ad44e1bc50729862b25419
mlflow run keras/imdbclassifier -e reproduce -P run_uuid=5374ba7655ad44e1bc50729862b25419 [-P tracking_server=URI]
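
If you want to launch the same command from a script, one option is to assemble the argument list in Python first. The sketch below is only an illustration: the helper name build_reproduce_command is hypothetical, and it uses only the flags and parameters shown in the commands above.

```python
import shlex

def build_reproduce_command(run_uuid, tracking_server=None,
                            project="keras/imdbclassifier"):
    """Build the argv list for `mlflow run <project> -e reproduce` (hypothetical helper)."""
    argv = ["mlflow", "run", project, "-e", "reproduce",
            "-P", "run_uuid=" + run_uuid]
    # tracking_server is optional, as in the bracketed form shown above
    if tracking_server is not None:
        argv += ["-P", "tracking_server=" + tracking_server]
    return argv

argv = build_reproduce_command("5374ba7655ad44e1bc50729862b25419")
# Print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(part) for part in argv))
# To actually execute it: subprocess.run(argv, check=True)
```

Building a list rather than a single string avoids shell quoting problems if a parameter value ever contains spaces.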

By default, tracking_server points to the local mlruns directory. Here is an animated sample output from a reproducible run:

Fig 2. Run showing reproducibility from a previous run_uuid: 5374ba7655ad44e1bc50729862b25419

Loading and Making Predictions with Saved Models

In the previous sections, when executing your test runs, the models used for these test runs were also saved via mlflow.keras.log_model(model, "models"). Your Keras model is saved in HDF5 file format as noted in MLflow > Models > Keras. Once you have found a model that you like, you can reuse it with MLflow as well.

This model can be loaded back as a Python Function, as noted in mlflow.keras, using mlflow.keras.load_model(path, run_id=None).

To execute this, you can load the model you had saved within MLflow by going to the MLflow UI, selecting your run, and copying the path of the stored model as noted in the screenshot below.

Fig 3. MLflow model saved in the Artifacts

With your model identified, you can type in your own review by loading your model and executing it. For example, let’s use a review that is not included in the IMDB Classifier dataset:


this is a wonderful film with a great acting, beautiful cinematography, and amazing direction

 

To run a prediction against this review, run predict_nn.py against your model:

python predict_nn.py --load_model_path='/Users/dennylee/github/jsd-mlflow-examples/keras/imdbclassifier/mlruns/0/55d11810dd3b445dbad501fa01c323d5/artifacts/models' --my_review='this is a wonderful film with a great acting, beautiful cinematography, and amazing direction'

Or you can run it directly using mlflow and the imdbclassifier repo package:

mlflow run keras/imdbclassifier -e predict -P load_model_path='/Users/jules/jsd-mlflow-examples/keras/imdbclassifier/keras_models/178f1d25c4614b34a50fbf025ad6f18a' -P my_review='this is a wonderful film with a great acting, beautiful cinematography, and amazing direction'

The output for this command should be similar to the following, which predicts a positive sentiment for the provided review:

Using TensorFlow backend.
load model path: /tmp/models
my review: this is a wonderful film with a great acting, beautiful cinematography, and amazing direction
verbose: False
Loading Model...
Predictions Results:
[[ 0.69213998]]
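
The single number in the output is the model's score for the review. As a minimal sketch of how such a score can be turned into a label, the snippet below assumes the score lies between 0 and 1 and that 0.5 is the cut-off. Both the threshold and the helper name label_sentiment are assumptions for illustration and are not taken from predict_nn.py.

```python
def label_sentiment(score, threshold=0.5):
    """Map a score in [0, 1] to a label; the 0.5 threshold is an assumed cut-off."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return "positive" if score >= threshold else "negative"

# The prediction is printed as a nested array such as [[0.69213998]];
# take the first (and only) entry.
predictions = [[0.69213998]]
label = label_sentiment(predictions[0][0])
print(label)
```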

Examining Results with TensorBoard

In addition to reviewing your results in the MLflow UI, the code samples save TensorFlow events so that you can visualize the TensorFlow session graph. For example, after executing the statement python main_nn.py, you will see something similar to the following output:

Average Probability Results:
[0.30386349968910215, 0.88336000000000003]

Predictions Results:
[[ 0.35428655]
[ 0.99231517]
[ 0.86375767]
...,
[ 0.15689197]
[ 0.24901576]
[ 0.4418138 ]]
Writing TensorFlow events locally to /var/folders/0q/c_zjyddd4hn5j9jkv0jsjvl00000gp/T/tmp7af2qzw4

Uploading TensorFlow events as a run artifact.
loss function use binary_crossentropy
This model took 51.23427104949951 seconds to train and test.

You can find the TensorBoard log directory in the output line that begins with Writing TensorFlow events locally to .... To start TensorBoard, run the following command:

tensorboard --logdir=/var/folders/0q/c_zjyddd4hn5j9jkv0jsjvl00000gp/T/tmp7af2qzw4
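
If you capture the script's output, the log directory can also be pulled out programmatically instead of copied by hand. This is a minimal sketch that assumes the line format shown in the sample output above and a path without spaces; the helper name find_tensorboard_logdir is hypothetical.

```python
import re

# Matches the line printed by the sample scripts (format taken from the output above).
# Assumes the path contains no whitespace.
LOGDIR_PATTERN = re.compile(
    r"^Writing TensorFlow events locally to (\S+)[ \t]*$", re.MULTILINE
)

def find_tensorboard_logdir(output):
    """Return the log directory named in the captured output, or None if absent."""
    match = LOGDIR_PATTERN.search(output)
    return match.group(1) if match else None

sample_output = """Predictions Results:
[[ 0.35428655]]
Writing TensorFlow events locally to /var/folders/0q/c_zjyddd4hn5j9jkv0jsjvl00000gp/T/tmp7af2qzw4

Uploading TensorFlow events as a run artifact.
"""
logdir = find_tensorboard_logdir(sample_output)
print("tensorboard --logdir=" + logdir)
```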

Within the TensorBoard UI:

  • Click on Scalars to review the same metrics recorded within MLflow: binary loss, binary accuracy, validation loss, and validation accuracy.
  • Click on Graph to visualize and interact with your session graph.

Closing Thoughts

In this blog post, we demonstrated how to use MLflow to save models and reproduce results from saved models as part of the machine learning development life cycle. In addition, using both Python and the mlflow command line, we loaded a saved model and predicted the sentiment of our own custom review unseen by the model. Finally, we showcased how you can utilize MLflow and TensorBoard side-by-side by providing code samples that generate TensorFlow events so you can visualize the metrics as well as the session graph.

What’s Next?

You have seen, in three parts, various aspects of MLflow: from experimentation to reproducibility and using the MLflow UI and TensorBoard for visualization of your runs.

You can try MLflow at mlflow.org to get started. Or try some of the tutorials and examples in the documentation, including our example notebook Keras_IMDB.py for this blog.


The post How to Use MLflow To Reproduce Results and Retrain Saved Keras ML Models appeared first on Databricks.

Gmail Will Let You Disable Smart Replies on the Desktop Soon

If you’re annoyed by the “smart” replies Gmail started putting below your emails, good news: you’ll be able to disable them on the desktop soon.

Everything You Need To Get Started Cooking At Home

Whether you’ve just moved out on your own for the first time or you’re finally getting serious about cooking, we’ve rounded…

Click Here to Continue Reading

What Your Function Keys Do In Microsoft Excel

The function keys on keyboards don’t get the love they used to, but depending on the app you’re running, they can still be quite handy. Microsoft Excel has some interesting features tucked away behind your function keys. Here’s what they do.

Microsoft: Upcoming Windows Update Might Fail If Your Hard Drive is Too Full

Full hard drives could cause an upcoming Windows update to fail, according to Microsoft, and the system will not check for adequate storage space before installing.

No, Google Doesn’t Just Let Apps Read Your Email

There’s a story spreading in the news today that Google is letting companies scan through your email and sell the data, but this is really misleading. So what’s actually going on?

How to Enable a Windows 8-Style Start Screen in Windows 10

Although most people prefer to forget about Windows 8 entirely, some people did enjoy the full-screen Start menu—especially those on tablets with touchscreens. Here’s how to get it back in Windows 10.

What’s the Mention Column for in Microsoft Outlook?

Microsoft’s latest versions of Outlook are Outlook 2016, Outlook 365, and Outlook.com. If you’re not using one of these versions, you won’t have mentions until you upgrade.

The Best Ultrawide Monitors For Every Need

Ultrawide monitors are designed to give you ample room in your workspace without having to set up two separate monitors.

Click Here to Continue Reading

What is DSLR Crop Factor (And Why Should I Care)

Every time we talk about digital cameras, one thing that comes up is the “crop factor” of the sensor. Let’s dig a bit more into it and explain why it matters.

Google Keeps Pushing ChromeOS and Android Closer Together

A supposed merge of Android and Chrome OS has been rumored for years—to the point where some people believe one will eventually replace the other. That’s not what’s really going to happen—but the two are joining forces.

At this farm, the boss is an AI-powered algorithm

Bowery is part of an emerging industry promising to bring new efficiencies to the millennia-old science of agriculture, focusing for now on greens such as lettuce, arugula and kale.

Everything Amazon Announced At Its Surprise Hardware Event Today

Amazon dropped a positively massive new batch of Alexa-enabled and smart home devices, from subwoofers to microwaves.

Click Here to Continue Reading

What Is a LIT File (and How Do I Open One)?

A file with the .lit file extension is an eBook in the Microsoft eReader file format. LIT (short for “Literature”) files are eBook formats designed by Microsoft to work on Microsoft devices only.

Thursday, 20 September 2018

DISCOVER with Data Steward Studio (DSS): Understand your hybrid data lakes to exploit their business value! Part-2

If data is the new bacon, data stewardship supplies its nutrition label! This is the second part of a two-part blog introducing Data Steward Studio (DSS), which covers a detailed walkthrough of the capabilities in Data Steward Studio. With GDPR coming into effect in May 2018 and California legislature signing California Consumer Privacy Act […]

The post DISCOVER with Data Steward Studio (DSS): Understand your hybrid data lakes to exploit their business value! Part-2 appeared first on Hortonworks.

John Hancock To Only Sell “Interactive” Life Insurance, Will Heavily Push Health Tracking

John Hancock plans to only sell life insurance packages that offer incentives to customers who wear a smart watch and track their health data.

It’s Not Just You: Google Renamed Keep to “Keep Notes” on Android

You’re not going crazy: Google’s Keep is now called “Keep Notes” on your Android home screen, despite it being called “Google Keep” everywhere else.

It Would Be Nice If Sony Were Planning a PS Vita Follow Up, But They’re Not

Starting next year, the PS Vita will be discontinued in Japan, officially ending its lifespan. What comes next? According to Sony, nothing.

Click Here to Continue Reading

The Best Xbox One Exclusive Games

The PS4 is currently king of the consoles, boasting enviable exclusives like Spider-Man, God of War, and Horizon: Zero D…

Click Here to Continue Reading

How to Add and Format Text in a Shape in Microsoft Word

Microsoft Word makes it easy to add geometric shapes (and a whole lot more) to your document. You can also add text into these shapes, which is handy when you’re creating flowcharts, network diagrams, mind maps, and so on. This being Word, there are plenty of options for doing this, so let’s take a look.

Upgrading your clusters and workloads from Hadoop 2 to Hadoop 3

Introduction The Apache Hadoop community announced Hadoop 3.0 GA in December, 2017 and Hadoop 3.1 in April, 2018 loaded with great features and improvements. One of the biggest challenges while upgrading to a new major release of a software platform is its compatibility. Apache Hadoop community has focused on ensuring wire and binary compatibility for […]

The post Upgrading your clusters and workloads from Hadoop 2 to Hadoop 3 appeared first on Hortonworks.

The Best Free Image Hosting Websites

Facebook or Instagram may be a great place to share your pictures, but sometimes you need to do more than just share your photos with friends and family. Let’s take a look at our favorite free image hosting websites for all your other needs.

What Are the Most Beneficial Smarthome Devices to Own?

When deciding which smarthome products to install in your house or apartment, there are a lot of things to consider. Your first step should be deciding which products will give you the most value.

How to Stop Someone Forwarding a Meeting Request in Outlook

If you’ve ever had a problem with potential meeting attendees forwarding requests to others, we have good news. If you have the latest version of Outlook (2016 or 365) or you’re an Office 365 subscriber using the Outlook web app, you can prevent people from forwarding meeting requests.

What Apps Can You Actually Run on Linux?

Chromebooks can now run Linux desktop apps, offering a whole new universe of software to Chrome OS users. You can install a Linux distribution like Ubuntu on your PC, too. But what applications are available for Linux?

Amazon plans to open 3,000 cashierless Amazon Go stores by 2021: Report

Amazon Go store, which has no cashiers and allows shoppers to buy things with the help of a smartphone app, is widely seen as a concept that can alter brick-and-mortar retail.

Android Phones Now Share Precise Location Data With More 911 Call Centers

More Android phones will share your precise location when you call 911 in the United States, thanks to a couple of new partnerships worked out by Google. The change will save lives.

How to Draw and Manipulate Arrows in Microsoft Word

Whether you need to point to an image for emphasis or demonstrate where to click for interactivity, there is a wide range of arrow shapes that you can create and customize in Microsoft Word. Let’s take a look at how they work.

The Best Bluetooth Headsets

Most Bluetooth headphones sold today include a microphone in their housing, allowing them to make and answer calls.

Click Here to Continue Reading

Wednesday, 19 September 2018

How Micron is Achieving Faster Insights, Developing Analytical Solutions

We’ve just published our most recent customer video! This video gives a look at how Micron is able to focus more on building solutions and less on day-to-day data management through partnering with Hortonworks. Micron Technology, Inc. is an American global corporation based in Boise, Idaho. The company is one of the largest memory manufacturers in the […]

The post How Micron is Achieving Faster Insights, Developing Analytical Solutions appeared first on Hortonworks.

Create RSS Feeds to Follow Instagram and Twitter Users Without an Account

Wish you could follow a couple of Twitter or Instagram users, without setting up an account? Create an RSS feed for them.

The Best Smart Water Bottles To Keep You Hydrated

If you want to drink more water, get more out of your health apps, and feel better in the proc…

Click Here to Continue Reading

This Free Font Stretches Out Your Term Paper

Are you five pages into a seven page paper, with no time to spare? A stretched out font could pad things out for you, helping you reach that arbitrary page count.

Q&A to Demystify the Power of Apache Metron in Real-Time Threat Detection

How is Apache Metron utilized in Hortonworks’ product portfolio? Hortonworks Cybersecurity Platform (HCP) is powered by Apache Metron and other open-source big data technologies. At the prime intersection of Big Data and Machine Learning, HCP employs a data-science-based approach to visualize diverse, streaming security data at scale to aid Security Operations Centers (SOC) in real-time detection […]

The post Q&A to Demystify the Power of Apache Metron in Real-Time Threat Detection appeared first on Hortonworks.

How to Get Started Editing Videos with Final Cut Pro X

Final Cut Pro X is a huge step up from iMovie, which is the video editor most macOS users probably started out with. Final Cut Pro X functions similarly but packs in a whole lot more power while sticking to iMovie’s simple design.

How to Create and Customize a Folder View in Outlook

If you work in an office, then the chances are you spend a lot of time dealing with email, most probably in Microsoft Outlook. It’s worth taking a little time to get Outlook to display the information you need. For email, the best way to do this is with folder views. Here’s how they work.

How to Delete or Disable All Alarms on Your iPhone

The iPhone’s Clock app can only turn off or delete a single alarm at a time. But, if you have a lot of alarms and want to delete them all—or just turn off all alarms at once—Siri’s got you covered.

Microsoft brings HoloLens to business with new mixed reality Dynamics 365 apps

Microsoft Dynamics 365 Remote Assist allows workers wearing HoloLens to get real-time remote help from subject experts while they are on site.

How to Make Android as Secure as Possible

Mobile security is a big deal, probably now more than ever. Most of us live on our phones, with financial information, calendar appointments, family photos, and more stored on our devices. Here’s how to keep your Android phone secure.

Lewis–Mogridge Position

The Lewis–Mogridge Position, named after David Lewis and Martin J. H. Mogridge, is a maxim in urban planning that states “traffic expands to meet the available road space”. The position points out the relative futility of building more roads to deal with traffic, as the gains are erased within a matter of weeks to months when people drive more because of the available space.

ISRO opens space technology incubation centre in Agartala

Within the next six months, ISRO plans to set up five more space technology incubation centres at Jalandhar in Punjab, Bhubaneswar in Odisha, Nagpur in Maharashtra, Indore in Madhya Pradesh and Tiruchirapalli in Tamil Nadu.

Safari 12 Finally Supports Favicons in Tabs

Safari 12 is here with a feature we’ve all been waiting for: favicons. Here’s how to enable them on macOS and on your iPhone or iPad.

Linux Apps Are Now Available in Chrome OS Stable, But What Does That Mean?

Chrome OS 69 just hit the stable channel and is currently rolling out to devices. This brings a handful of new features and changes, including Google’s Material theme, Night Light, an improved file manager, and most importantly: support for Linux apps.

What Kind of Smart Lights Should You Buy?

While most smart lights let you control them from your phone, by voice, or through automation, not all smart bulbs are created equal. Here are the different types of smart lights and which ones might be best for you.

Tuesday, 18 September 2018

Continuously delivering an easy-to-use Jenkins with Evergreen

When I first wrote about Jenkins Evergreen, which was then referred to as "Jenkins Essentials", I mentioned a number of future developments which in the subsequent months have become reality. At this year’s DevOps World - Jenkins World in San Francisco, I will be sharing more details on the philosophy behind Jenkins Evergreen, show off how far we have come, and discuss where we’re going with this radical distribution of Jenkins.

Jenkins Evergreen

As discussed in my first blog post, and JEP-300, the first two pillars of Jenkins Evergreen have been the primary focus of our efforts.

Automatically Updated Distribution

Perhaps unsurprisingly, implementing the mechanisms necessary for safely and automatically updating a Jenkins distribution, which includes core and plugins, was and continues to be a sizable amount of work. In Baptiste’s talk he will be speaking about the details which make Evergreen "go" whereas I will be speaking about why an automatically updating distribution is important.

As continuous integration and continuous delivery have become more commonplace, and fundamental to modern software engineering, Jenkins tends to live two different lifestyles depending on the organization. In some organizations, Jenkins is managed and deployed methodically with automation tools like Chef, Puppet, etc. In many other organizations, however, Jenkins is treated much more like an appliance, not unlike the office wireless router. Once it is installed, and so long as it continues to do its job, people won’t think about it too much.

Jenkins Evergreen’s distribution makes the "Jenkins as an Appliance" model much better for everybody by ensuring the latest feature updates, bug and security fixes are always installed in Jenkins.

Additionally, I believe Evergreen will serve another group we don’t adequately serve at the moment: those who want Jenkins to behave much more like a service. We typically don’t consider "versions" of GitHub.com; we receive incremental updates to the site and realize the benefits of GitHub’s on-going development without ever thinking about an "upgrade."

I believe Jenkins Evergreen can, and will, provide that same experience.

Automatic Sane Defaults

The really powerful thing about Jenkins as a platform is the broad variety of patterns and practices different organizations may adopt. For newer users, or users with common use-cases, that significant amount of flexibility can result in a paradox of choice. With Jenkins Evergreen, much of the most common configuration is automatically configured out of the box.

Jenkins Pipeline and Blue Ocean are included by default. We also removed some legacy functionality from Jenkins while we were at it.

We are also utilizing some of the fantastic Configuration as Code work, which recently had its 1.0 release, to automatically set sane defaults in Jenkins Evergreen.

Status Quo

The effort has made significant strides thus far this year, and we’re really excited for people to start trying out Jenkins Evergreen. As of today, Jenkins Evergreen is ready for early adopters. We do not yet recommend using Jenkins Evergreen for a production environment.

If you’re at DevOps World - Jenkins World in San Francisco, please come see Baptiste’s talk Wednesday at 3:45pm in Golden Gate Ballroom A, or my talk at 11:15am in Golden Gate Ballroom B.

If you can’t join us here in San Francisco, we hope to hear your feedback and thoughts in our Gitter channel!

Alexa vs. Google Assistant vs. Siri: A Curious Question Face Off

There are a lot of factors to consider when selecting a voice-assistant platform, but what if your biggest consideration is just how they a…

Click Here to Continue Reading

How to Hide Suggestions in Windows 10’s Timeline

Windows Timeline extends the Task View mode to show you the history of activities performed on your PC—or even your other devices if you have syncing turned on. Sometimes, Microsoft sticks suggestions on your timeline for you. Here’s how to make them go away.

Everything You Need to Get Started with Street Photography

Street photography is one of the most popular forms of photography for people to start with.

Click Here to Continue Reading

Chromebooks Are Getting Better Parental Controls

Better parental controls are coming to Chromebooks, with the ability to set screen time limits and manage apps.

How to Print Just the Speaker Notes for a PowerPoint Presentation

You’ve got your speaker notes set up in your PowerPoint presentation, and now you want to print a copy for quick reference. Here’s how to print speaker notes for a PowerPoint Presentation—with and without slide thumbnails.

What is LockApp.exe on Windows 10?

You may see a process named LockApp.exe running on your PC. This is normal. LockApp.exe is a part of the Windows 10 operating system and is responsible for displaying the lock screen.

Simplify Market Basket Analysis using FP-growth on Databricks

Try this notebook in Databricks

When providing recommendations to shoppers on what to purchase, you are often looking for items that are frequently purchased together (e.g. peanut butter and jelly). A key technique to uncover associations between different items is known as market basket analysis. In your recommendation engine toolbox, the association rules generated by market basket analysis (e.g. if one purchases peanut butter, then they are likely to purchase jelly) are an important and useful tool. With the rapid growth of e-commerce data, it is necessary to execute models like market basket analysis on increasingly larger sizes of data. That is, it will be important to have the algorithms and infrastructure necessary to generate your association rules on a distributed platform. In this blog post, we will discuss how you can quickly run your market basket analysis using the Apache Spark MLlib FP-growth algorithm on Databricks.

To showcase this, we will use the publicly available Instacart Online Grocery Shopping Dataset 2017. In the process, we will explore the dataset as well as perform our market basket analysis to recommend that shoppers buy items again or try new items.

 

 

The flow of this post, as well as the associated notebook, is as follows:

  • Ingest your data: Bringing in the data from your source systems; often involving ETL processes (though we will bypass this step in this demo for brevity)
  • Explore your data using Spark SQL: Now that you have cleansed data, explore it so you can get some business insight
  • Train your ML model using FP-growth: Run FP-growth, your frequent pattern mining algorithm
  • Review the association rules generated by the ML model for your recommendations

Ingest Data

The dataset we will be working with is the 3 Million Instacart Orders, Open Sourced dataset:

The “Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on 01/17/2018. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.

You will need to download the file, extract the files from the gzipped TAR archive, and upload them into Databricks DBFS using the Import Data utilities.  You should see the following files within dbfs once the files are uploaded:

  • Orders: 3.4M rows, 206K users
  • Products: 50K rows
  • Aisles: 134 rows
  • Departments: 21 rows
  • order_products__SET: 30M+ rows where SET is defined as:
    • prior: 3.2M previous orders
    • train: 131K orders for your training dataset

Refer to the Instacart Online Grocery Shopping Dataset 2017 Data Descriptions for more information including the schema.

Create DataFrames

Now that you have uploaded your data to dbfs, you can quickly and easily create your DataFrames using spark.read.csv:


# Import Data
aisles = spark.read.csv("/mnt/bhavin/mba/instacart/csv/aisles.csv", header=True, inferSchema=True)
departments = spark.read.csv("/mnt/bhavin/mba/instacart/csv/departments.csv", header=True, inferSchema=True)
order_products_prior = spark.read.csv("/mnt/bhavin/mba/instacart/csv/order_products__prior.csv", header=True, inferSchema=True)
order_products_train = spark.read.csv("/mnt/bhavin/mba/instacart/csv/order_products__train.csv", header=True, inferSchema=True)
orders = spark.read.csv("/mnt/bhavin/mba/instacart/csv/orders.csv", header=True, inferSchema=True)
products = spark.read.csv("/mnt/bhavin/mba/instacart/csv/products.csv", header=True, inferSchema=True)

# Create Temporary Tables
aisles.createOrReplaceTempView("aisles")
departments.createOrReplaceTempView("departments")
order_products_prior.createOrReplaceTempView("order_products_prior")
order_products_train.createOrReplaceTempView("order_products_train")
orders.createOrReplaceTempView("orders")
products.createOrReplaceTempView("products")

Exploratory Data Analysis

Now that you have created DataFrames, you can perform exploratory data analysis using Spark SQL.  The following queries showcase some of the quick insight you can gain from the Instacart dataset.

Orders by Day of Week

The following query shows that Sunday has the highest total number of orders, while Thursday has the fewest.


%sql
select 
  count(order_id) as total_orders, 
  (case 
     when order_dow = '0' then 'Sunday'
     when order_dow = '1' then 'Monday'
     when order_dow = '2' then 'Tuesday'
     when order_dow = '3' then 'Wednesday'
     when order_dow = '4' then 'Thursday'
     when order_dow = '5' then 'Friday'
     when order_dow = '6' then 'Saturday'              
   end) as day_of_week 
  from orders  
 group by order_dow 
 order by total_orders desc
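
For readers who want to see the logic of this aggregation outside SQL, here is a small plain-Python sketch of the same count-and-label step. The toy order_dow values are made up, the helper name orders_by_day is hypothetical, and the mapping of 0 to Sunday follows the case expression in the query above.

```python
from collections import Counter

# Index 0 is Sunday, matching the case expression in the SQL query
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

def orders_by_day(order_dows):
    """Return (day_name, total_orders) pairs, busiest day first."""
    counts = Counter(int(dow) for dow in order_dows)
    return [(DAY_NAMES[dow], total) for dow, total in counts.most_common()]

# Toy data for illustration only, not taken from the Instacart dataset
toy_order_dows = [0, 0, 0, 1, 1, 4]
result = orders_by_day(toy_order_dows)
print(result)
```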

Orders by Hour

Broken down by hour, people typically order their groceries from Instacart during business hours, with the highest number of orders at 10:00am.


%sql
select 
  count(order_id) as total_orders, 
  order_hour_of_day as hour 
  from orders 
 group by order_hour_of_day 
 order by order_hour_of_day

Understand shelf space by department

As we dive deeper into our market basket analysis, we can gain insight on the number of products by department to understand how much shelf space is being used.


%sql
select d.department, count(distinct p.product_id) as products
  from products p
    inner join departments d
      on d.department_id = p.department_id
 group by d.department
 order by products desc
 limit 10

As you can see from the preceding image, the departments with the largest number of unique items (i.e. products) are personal care and snacks.

Organize Shopping Basket

To prepare our data for downstream processing, we will organize our data by shopping basket. That is, each row of our DataFrame represents an order_id, with the items column containing an array of that order’s items.


# Organize the data by shopping basket
from pyspark.sql.functions import collect_set, col, count
rawData = spark.sql("select p.product_name, o.order_id from products p inner join order_products_train o on o.product_id = p.product_id")
baskets = rawData.groupBy('order_id').agg(collect_set('product_name').alias('items'))
baskets.createOrReplaceTempView('baskets')

Just like the preceding graphs, we can visualize the nested items using the display command in our Databricks notebooks.
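
To make the shape of the baskets data concrete, the following plain-Python sketch mirrors what groupBy plus collect_set does: it gathers the distinct product names for each order_id. The rows are toy values and the helper name group_into_baskets is hypothetical. It is not how Spark executes the aggregation, only an illustration of the result.

```python
from collections import defaultdict

# Toy (product_name, order_id) rows; values are made up for illustration
raw_rows = [
    ("Organic Bananas", 1),
    ("Organic Strawberries", 1),
    ("Organic Bananas", 1),   # duplicate within the same order
    ("Peanut Butter", 2),
    ("Jelly", 2),
]

def group_into_baskets(rows):
    """Collect the distinct product names of each order_id."""
    baskets = defaultdict(set)
    for product_name, order_id in rows:
        baskets[order_id].add(product_name)
    return dict(baskets)

baskets = group_into_baskets(raw_rows)
for order_id, items in sorted(baskets.items()):
    print(order_id, sorted(items))
```

As with collect_set, duplicates within an order are dropped, so each basket lists an item at most once.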

Train ML Model

To understand the frequency of items are associated with each other (e.g. how many times does peanut butter and jelly get purchased together), we will use association rule mining for market basket analysis. Spark MLlib implements two algorithms related to frequency pattern mining (FPM): FP-growth and PrefixSpan. The distinction is that FP-growth does not use order information in the itemsets, if any, while PrefixSpan is designed for sequential pattern mining where the itemsets are ordered. We will use FP-growth as the order information is not important for this use case.

Note: we will be using the Scala API so that we can configure setMinConfidence.


%scala
import org.apache.spark.ml.fpm.FPGrowth

// Extract out the items 
val baskets_ds = spark.sql("select items from baskets").as[Array[String]].toDF("items")

// Use FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.001).setMinConfidence(0)
val model = fpgrowth.fit(baskets_ds)

// Calculate frequent itemsets
val mostPopularItemInABasket = model.freqItemsets
mostPopularItemInABasket.createOrReplaceTempView("mostPopularItemInABasket")

With Databricks notebooks, you can use the %scala magic command to execute Scala code within a new cell in the same Python notebook.

With the mostPopularItemInABasket DataFrame created, we can use Spark SQL to query for the most popular items in a basket where there are more than 2 items with the following query.


%sql
select items, freq from mostPopularItemInABasket where size(items) > 2 order by freq desc limit 20

As can be seen in the preceding table, the most frequent purchases of more than two items involve organic avocados, organic strawberries, and organic bananas. Interestingly, the top five sets of items frequently purchased together involve various combinations of organic avocados, organic strawberries, organic bananas, organic raspberries, and organic baby spinach. From the perspective of recommendations, freqItemsets can be the basis for a buy-it-again recommendation: if a shopper has purchased the items previously, it makes sense to recommend that they purchase them again.

Review Association Rules

In addition to freqItemsets, the FP-growth model also generates associationRules. For example, if a shopper purchases peanut butter, what is the probability (or confidence) that they will also purchase jelly? For more information, a good reference is Susan Li’s A Gentle Introduction on Market Basket Analysis — Association Rules.
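As a rough illustration of what confidence measures, the plain-Python sketch below computes it directly from a handful of baskets: the number of baskets containing both the antecedent and the consequent, divided by the number of baskets containing the antecedent. The baskets and the support_count and confidence helpers are hypothetical and are not part of the Spark API.

```python
def support_count(baskets, itemset):
    """Number of baskets that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= set(basket))


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    antecedent_count = support_count(baskets, antecedent)
    if antecedent_count == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both_count = support_count(baskets, set(antecedent) | set(consequent))
    return both_count / antecedent_count


# Hypothetical baskets, for illustration only.
baskets = [
    ["peanut butter", "jelly", "bread"],
    ["peanut butter", "jelly"],
    ["peanut butter", "bread"],
    ["milk"],
]

pb_then_jelly = confidence(baskets, ["peanut butter"], ["jelly"])
jelly_then_pb = confidence(baskets, ["jelly"], ["peanut butter"])
print(pb_then_jelly, jelly_then_pb)
```

Note that confidence is not symmetric: in this toy data every basket with jelly also has peanut butter, but only two of the three peanut butter baskets have jelly.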


%scala
// Display generated association rules.
val ifThen = model.associationRules
ifThen.createOrReplaceTempView("ifThen")

A good way to think about association rules is that the model determines that if you purchased something (i.e. the antecedent), then you will purchase this other thing (i.e. the consequent) with the following confidence.


%sql
select antecedent as `antecedent (if)`, consequent as `consequent (then)`, confidence from ifThen order by confidence desc limit 20

As can be seen in the preceding graph, there is relatively strong confidence that if a shopper has organic raspberries, organic avocados, and organic strawberries in their basket, then it may make sense to recommend organic bananas as well. Interestingly, the top 10 (based on descending confidence) association rules – i.e. purchase recommendations – are associated with organic bananas or bananas.
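One way such rules could drive recommendations is sketched below in plain Python: for a given basket, pick the rules whose antecedent is fully contained in the basket and suggest their consequents, highest confidence first. Everything here is an assumption for illustration, including the recommend helper, the single-item consequents, and the confidence values, which are made up and are not results from the Instacart dataset.

```python
# Hypothetical rules as (antecedent, consequent, confidence); values are made up.
rules = [
    ({"organic raspberries", "organic avocados", "organic strawberries"},
     "organic bananas", 0.6),
    ({"organic strawberries"}, "organic bananas", 0.3),
    ({"organic avocados"}, "organic baby spinach", 0.2),
    ({"milk"}, "cereal", 0.5),
]


def recommend(basket, rules, min_confidence=0.0):
    """Suggest consequents of rules whose antecedent is contained in the basket."""
    basket = set(basket)
    best = {}
    for antecedent, consequent, conf in rules:
        applies = antecedent <= basket and consequent not in basket
        if applies and conf >= min_confidence:
            # Keep the strongest rule for each suggested item.
            best[consequent] = max(conf, best.get(consequent, 0.0))
    # Highest confidence first; ties are broken alphabetically.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


basket = ["organic raspberries", "organic avocados", "organic strawberries"]
suggestions = recommend(basket, rules)
for item, conf in suggestions:
    print(item, conf)
```

Raising min_confidence trims the weaker suggestions, which plays a similar filtering role to the minimum confidence setting used when the rules are generated.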

Discussion

In summary, we demonstrated how to explore our shopping cart data and execute market basket analysis to identify items frequently purchased together and to generate association rules. By using Databricks, in the same notebook we can visualize our data; execute Python, Scala, and SQL; and run our FP-growth algorithm on an auto-scaling distributed Spark cluster, all managed by Databricks. Putting these components together simplifies the data flow and management of your infrastructure for you and your data practitioners. Try out the Market Basket Analysis using Instacart Online Grocery Dataset with Databricks today.

The post Simplify Market Basket Analysis using FP-growth on Databricks appeared first on Databricks.

Google, Why Won’t Maps Let Me Work from Home?

The year is 2018, and telecommuting is at an all-time high. Working from home has never been easier and the benefits—both for employees and employers—are vast. There’s just one problem: Google Maps is annoying.