Thursday, 31 October 2019

Solving the Challenge of Big Data Cloud Migration with WANdisco, Databricks and Delta Lake

Migrating from Hadoop on-premises to the cloud has been a common theme in recent Databricks blog posts and conference sessions. These posts and sessions have identified key considerations, highlighted partnerships, described solutions for moving and streaming data to the cloud with governance and other controls, and compared the runtime environments of Hadoop and Databricks to highlight the benefits of the Databricks Unified Data Analytics Platform.

Challenges for Hadoop users when moving to the cloud

WANdisco has partnered with Databricks to solve many of the challenges of large-scale Hadoop migrations. A particular challenge for organizations that have adopted Hadoop at scale is the long-standing problem of data gravity. Because their applications assume ready, local, fast access to an on-premises data lake built on HDFS, building applications away from that data is difficult: it requires additional workflows to manually copy data from, or reach back into, the on-premises Hadoop data lake.

This problem becomes an order of magnitude harder if those on-premises data sets continue to change, because the workflows that move data between environments add a layer of complexity and don’t handle changing data easily.

While the cloud brings efficiencies for data lakes, concerns remain about the reliability and consistency of the data. Data lakes typically have multiple data pipelines reading and writing data concurrently, and, due to the lack of transactions, data engineers must go through a tedious process to ensure data integrity.

Hadoop migration with Databricks and WANdisco

The Databricks and WANdisco partnership solves these challenges by providing full read and write access to changing data lakes at scale during migration between on-premises systems and Databricks in the cloud. This solution, called LiveAnalytics, takes advantage of WANdisco’s platform to migrate and replicate the largest Hadoop datasets to Databricks and Delta Lake. WANdisco makes it possible to migrate data at scale, even while those data sets continue to be modified, using a novel distributed coordination engine to maintain data consistency between Hive data and Delta Lake tables.

LiveAnalytics migrates and replicates the largest Hadoop datasets to Databricks and Delta Lake

WANdisco’s architecture and consensus-based approach are the key to this capability. They allow migration without disruption, application downtime or data loss, and open up the benefits of applying Databricks to the largest data lakes, which were previously difficult to bring to the cloud.

Because WANdisco LiveAnalytics directly supports Delta Lake and Databricks along with common Hadoop platforms, it provides a compelling solution for bringing your on-premises Hadoop data to Databricks without impacting your ability to continue using Hadoop while the migration is in progress.

WANdisco’s architecture allows migration from Hadoop to the cloud without disruption, application downtime or data loss

You can take advantage of WANdisco’s technology today to help bring your Hadoop data lake to Databricks, with native support of common Hadoop platforms on-premises and for Databricks and Delta Lake on Azure or AWS.


--


The post Solving the Challenge of Big Data Cloud Migration with WANdisco, Databricks and Delta Lake appeared first on Databricks.

How to Use AI to Detect Soft Skills

The new world we live in gives us more help, and more doubts. Machines stand behind everything, and the scope of that everything is only growing. To what extent can we trust such machines? We are used to relying on them for market trends, traffic management, maybe even healthcare. Machines are now analysts, medical assistants, secretaries and teachers. Are they reliable enough to work as HR specialists? As psychologists? What can they tell about us?

Let’s see how text analysis can analyze your soft skills and tell a potential employer whether you can join the team smoothly.

Project Description

In this project, we used text analysis techniques to analyze the soft skills of young men (aged 15–24) looking for career opportunities.

Our plan was to perform a number of tests, or to choose the most effective one, to determine ground-truth values. The tests we experimented with included:


  • Mind Tools test: short and simple, but may be difficult for centennials. The test’s output is a score (from 1 to 15) for each of the following soft-skill categories: Personal Mastery, Time Management, Communication Skills, Problem Solving and Decision Making, Leadership and Management.
  • TAT test: the input is a story about a picture, and the system automatically rates it in the ...


Read More on Datafloq

Scalable near real-time S3 access logging analytics with Apache Spark and Delta Lake

Introduction

Many organizations use AWS S3 as their main storage infrastructure for their data. Moreover, by using Apache Spark™ on Databricks they often perform transformations of that data and save the refined results back to S3 for further analysis. When the size of the data and the amount of processing reach a certain scale, it often becomes necessary to observe the data access patterns. Common questions that arise include (but are not limited to): Which datasets are used the most? What is the ratio between accesses to new and past data? How quickly can a dataset be moved to a cheaper storage class without affecting the performance of the users? And so on.

At Zalando, we have faced this issue as data and computation have become a commodity for us over the last few years. Almost all of our ~200 engineering teams regularly perform analytics, reporting, or machine learning, meaning they all read data from the central data lake. The main motivation for enabling observability over this data was to reduce the cost of storage and processing, by deleting unused data and by shrinking the resource usage of the pipelines that produce that data. An additional driver was to understand whether our engineering teams need to query historical data or are only interested in the recent state of the data.

To answer these types of questions, S3 provides a useful feature – S3 Server Access Logging. When enabled, it constantly dumps logs about every read and write access in the observed bucket. The problem that appears almost immediately, especially at higher scale, is that these logs come as comparatively small text files, with a format similar to the logs of the Apache Web Server.

To query these logs we have leveraged the capabilities of Apache Spark™ Structured Streaming on Databricks, and built a streaming pipeline that constructs Delta Lake tables. These tables – one for each observed bucket – contain well-structured data from the S3 access logs; they are partitioned, can be sorted if needed, and, as a result, enable extended and efficient analysis of the access patterns of the company’s data. This allows us to answer the previously mentioned questions and many more. In this blog post we describe the production architecture we designed at Zalando, and show in detail how you can deploy such a pipeline yourself.

Solution

Before we start, let us make two qualifications.

The first note is about why we chose Delta Lake, and not plain Parquet or any other format. As you will see, to solve the problem described we are going to create a continuous application using Spark Structured Streaming. The properties of Delta Lake give us the following benefits in this case:

  • ACID Transactions: No corrupted/inconsistent reads by the consumers of the table in case a write operation is still in progress or has failed, leaving partial results on S3. More information is also available in Diving Into Delta Lake: Unpacking the Transaction Log.
  • Schema Enforcement: The metadata is controlled by the table; there is no chance that we break the schema if there is a bug in the code of the Spark job or if the format of the logs has changed. More information is available in Diving Into Delta Lake: Schema Enforcement & Evolution.
  • Schema Evolution: On the other hand, if there is a change in the log format – we can purposely extend the schema by adding new fields. More information is available in Diving Into Delta Lake: Schema Enforcement & Evolution.
  • Open Format: All the benefits of the plain Parquet format for readers apply, e.g. predicate push-down, column projection, etc.
  • Unified Batch and Streaming Source and Sink: Opportunity to chain downstream Spark Structured Streaming jobs to produce aggregations based on the new content.

The second note is about the datasets that are being read by the clients of our data lake. For the most part, these datasets fall into 2 categories: 1) snapshots of the data warehouse tables from the BI databases, and 2) continuously appended streams of events from the central event bus of the company. This means there are 2 patterns of how the data gets written in the first place: a full snapshot once per day and a continuously appended stream, respectively.

In both cases we have a hypothesis that the data generated in the last day is consumed most often. For the snapshots, we also know of infrequent comparisons between the current snapshot and past versions, for example one from a year ago. We are also aware of use cases where a whole month or even a year of historical data for a certain event stream has to be processed. This gives us an idea of what to look for, and this is where the described pipeline should help us prove or disprove our hypotheses.

Let us now dive into the technical details of the implementation of this pipeline. The only entity we have at the current stage is the S3 bucket. Our goal is to analyze what patterns appear in the read and write access to this bucket.

To give you an idea of what we are going to show, the diagram below depicts the final architecture of the pipeline. The flow is the following:

  1. AWS constantly monitors the S3 bucket data-bucket
  2. It writes raw text logs to the target S3 bucket raw-logs-bucket
  3. For every created object an Event Notification is sent to the SQS queue new-log-objects-queue
  4. Once every hour a Spark job gets started by Databricks
  5. Spark job reads all the new messages from the queue
  6. Spark job reads all the objects (described in the messages from the queue) from raw-logs-bucket
  7. Spark job writes the new data in append mode to the Delta Lake table in the delta-logs-bucket S3 bucket (optionally also executes OPTIMIZE and VACUUM, or runs in the Auto-Optimize mode)
  8. This Delta Lake table can be queried for the analysis of the access patterns


Administrative Setup

First we will perform the administrative setup: configuring S3 Server Access Logging and creating an SQS queue.

Configure S3 Server Access Logging

First of all, you need to configure S3 Server Access Logging for the data-bucket. To store the raw logs you first need to create an additional bucket – let’s call it raw-logs-bucket. Then you can configure logging via the UI or the API. Let’s assume that we specify the target prefix as data-bucket-logs/, so that we can use this bucket for the S3 access logs of multiple data buckets.

After this is done, raw logs will start appearing in the raw-logs-bucket as soon as someone makes requests to the data-bucket. The number and size of the log objects will depend on the intensity of requests. We experienced three different patterns for three different buckets, as noted in the table below.

You can see that the velocity of data can be rather different, which means you have to account for this when processing these disparate sources of data.

Create an SQS queue

Now that logs are being created, you can start thinking about how to read them with Spark to produce the desired Delta Lake table. Because S3 logs are written in append-only mode – only new objects get created, and no object is ever modified or deleted – this is a perfect case for the S3-SQS Spark reader created by Databricks. To use it, you first need to create an SQS queue. We recommend setting the Message Retention Period to 7 days and the Default Visibility Timeout to 5 minutes. In our experience these are good defaults, and they also match the defaults of the Spark S3-SQS reader. Let’s refer to the queue by the name new-log-objects-queue.

Now you need to configure the policy of the queue to allow messages to be sent to it from the raw-logs-bucket. To achieve this you can edit it directly in the Permissions tab of the queue in the UI, or do it via the API. This is how the statement should look:


{
    "Effect": "Allow",
    "Principal": "*",
    "Action": "SQS:SendMessage",
    "Resource": "arn:aws:sqs:{REGION}:{MAIN_ACCOUNT_ID}:new-log-objects-queue",
    "Condition": {
        "ArnEquals": {
            "aws:SourceArn": "arn:aws:s3:::raw-logs-bucket"
        }
    }
}

Configure S3 event notification

Now you are ready to connect raw-logs-bucket and new-log-objects-queue, so that for each new object a message is sent to the queue. To achieve this you can configure the S3 Event Notification in the UI or via the API. This is how the JSON version of this configuration would look:


{
    "QueueConfigurations": [
        {
            "Id": "raw-logs",
            "QueueArn": "arn:aws:sqs:{REGION}:{MAIN_ACCOUNT_ID}:new-log-objects-queue",
            "Events": ["s3:ObjectCreated:*"]
        }
    ]
}

Operational Setup

In this section, we will perform the necessary setup, including creating IAM roles and preparing the cluster configuration.

Create IAM roles

To be able to run the Spark job, you need to create two IAM roles – one for the job (the cluster role), and one to access S3 (the assumed role). The reason you additionally need to assume a separate S3 role is that the cluster and its cluster role are located in the dedicated AWS account for Databricks EC2 instances and roles, whereas the raw-logs-bucket is located in the AWS account where the original source bucket resides. And because every log object is written by an Amazon role, the cluster role does not have permission to read any of the logs according to the ACLs of the log objects. You can read more about it in Secure Access to S3 Buckets Across Accounts Using IAM Roles with an AssumeRole Policy.

The cluster role, referred here as cluster-role, should be created in the AWS account dedicated for Databricks, and should have these 2 policies:


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": ["arn:aws:sqs:{REGION}:{MAIN_ACCOUNT_ID}:new-log-objects-queue"],
            "Effect": "Allow"
        }
    ]
}

and


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{MAIN_ACCOUNT_ID}:role/s3-access-role-to-assume"
        }
    ]
}

You will also need to add the instance profile of this role to the Databricks platform, as usual.

The role to access S3, referred to here as s3-access-role-to-assume, should be created in the account where both buckets reside. It should refer to the cluster-role by its ARN in the assumed_by parameter, and should have these 2 policies:


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:GetObjectMetadata"
            ],
            "Resource": [
                "arn:aws:s3:::raw-logs-bucket",
                "arn:aws:s3:::raw-logs-bucket/*"
            ],
            "Effect": "Allow"
        }
    ]
}

and


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:GetObjectMetadata",
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::delta-logs-bucket",
                "arn:aws:s3:::delta-logs-bucket/data-bucket-logs/*"
            ],
            "Effect": "Allow"
        }
    ]
}

where delta-logs-bucket is another bucket you need to create, where the resulting Delta Lake tables will be located.

Prepare cluster configuration

Here we outline the spark_conf settings that are necessary in the cluster configuration so that the job can run correctly:


spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
spark.hadoop.fs.s3a.credentialsType AssumeRole
spark.hadoop.fs.s3a.stsAssumeRole.arn arn:aws:iam::{MAIN_ACCOUNT_ID}:role/s3-access-role-to-assume

If you go for more than one bucket, we also recommend these settings to enable the FAIR scheduler, the external shuffle service, and RocksDB for keeping state:


spark.sql.streaming.stateStore.providerClass com.databricks.sql.streaming.state.RocksDBStateStoreProvider
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.scheduler.mode FAIR

Generate Delta Lake table with a Continuous Application

In the previous sections you completed the administrative and operational setup. Now you can write the code that will finally produce the desired Delta Lake table, and run it as a Continuous Application.

The notebook

The code is written in Scala. First we define a record case class:
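The notebook code itself is not reproduced in this post, so here is a minimal sketch of such a record class, assuming a representative subset of the S3 Server Access Log fields (the field names are illustrative rather than the exact ones used in the notebook):

case class AccessLogRecord(
    bucketOwner: String,
    bucket: String,
    time: String,          // raw timestamp field, e.g. [06/Feb/2019:00:00:38 +0000]
    remoteIp: String,
    requester: String,
    requestId: String,
    operation: String,     // e.g. REST.GET.OBJECT
    key: String,
    requestUri: String,
    httpStatus: String,
    errorCode: String,
    bytesSent: String,
    objectSize: String,
    totalTime: String
)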

Then we create a few helper functions for parsing:
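Again as a sketch rather than the verbatim notebook code: a single regular expression can tokenize a space-separated log line (keeping bracketed and quoted fields intact), and a small helper can map the tokens onto the record class defined above. The helper name and field ordering are assumptions.

import scala.util.matching.Regex

object AccessLogParser {
  // Matches a [...] block, a "..." block, or a bare token of an S3 access log line.
  private val fieldPattern: Regex = """\[[^\]]*\]|"[^"]*"|\S+""".r

  // Returns None for malformed lines so they can be filtered out downstream.
  def parseLine(line: String): Option[AccessLogRecord] = {
    val fields = fieldPattern.findAllIn(line).toArray
    if (fields.length < 14) None
    else Some(AccessLogRecord(
      bucketOwner = fields(0),
      bucket      = fields(1),
      time        = fields(2),
      remoteIp    = fields(3),
      requester   = fields(4),
      requestId   = fields(5),
      operation   = fields(6),
      key         = fields(7),
      requestUri  = fields(8),
      httpStatus  = fields(9),
      errorCode   = fields(10),
      bytesSent   = fields(11),
      objectSize  = fields(12),
      totalTime   = fields(13)
    ))
  }
}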


And finally we define the Spark job:
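The job itself is also not shown in this post, so the following is a hedged sketch of what it can look like: it reads log lines through the Databricks S3-SQS source, parses them with the helper above, derives a date column, and appends to the Delta Lake table partitioned by date and bucket. The option values, checkpoint location and Trigger.Once choice are assumptions that match the hourly-scheduled setup described above, not the exact notebook code.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Read the raw log files through the Databricks S3-SQS source; each row is one log line.
val rawLines = spark.readStream
  .format("s3-sqs")
  .schema("value STRING")
  .option("fileFormat", "text")
  .option("queueUrl", "https://sqs.{REGION}.amazonaws.com/{MAIN_ACCOUNT_ID}/new-log-objects-queue")
  .option("region", "{REGION}")
  .load()

// Parse every line into the record class, dropping malformed lines,
// and derive the date column used for partitioning.
val accessLogs = rawLines
  .as[String]
  .flatMap(line => AccessLogParser.parseLine(line).toSeq)
  .withColumn("date",
    to_date(regexp_extract(col("time"), """\[(\d{2}/\w{3}/\d{4})""", 1), "dd/MMM/yyyy"))

// Append the parsed records to the Delta Lake table, partitioned by date and bucket.
accessLogs.writeStream
  .format("delta")
  .partitionBy("date", "bucket")
  .option("checkpointLocation", "s3://delta-logs-bucket/checkpoints/data-bucket-logs/")
  .trigger(Trigger.Once)   // the Databricks job is scheduled to run once every hour
  .start("s3://delta-logs-bucket/data-bucket-logs/")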

Create Databricks job

The last step is to make the whole pipeline run. For this you need to create a Databricks job. You can use the “New Automated Cluster” type, add the spark_conf settings we defined above, and schedule it to run, for example, once every hour using the “Schedule” section. That’s it: as soon as you confirm the creation of the job and it starts running on schedule, you should see messages from the SQS queue being consumed and the job writing to the output Delta Lake table.

Execute Notebook Queries

At this point the data is available, and you can create your notebook and execute queries to answer the questions we posed at the beginning of this blog post.

Create interactive cluster and a notebook to run analytics

As soon as the Delta Lake table has data, you can start querying it. For this you can create a permanent cluster with a role that only needs to read the delta-logs-bucket. This means it doesn’t need the AssumeRole technique; it only needs ListBucket and GetObject permissions. After that, you can attach a notebook to this cluster and run your first analysis.

Queries to analyze access patterns

Let’s get back to one of the questions we asked in the beginning – which datasets are used the most? If we assume that in the source bucket every dataset is located under the prefix data/{DATASET_NAME}/, then we could answer it with a query like this one:


SELECT dataset, count(*) AS cnt
FROM (
    SELECT regexp_extract(key, '^data\/([^/]+)\/.+', 1) AS dataset
    FROM delta.`s3://delta-logs-bucket/data-bucket-logs/`
    WHERE date = 'YYYY-MM-DD' AND bucket = 'data-bucket' AND key rlike '^data\/' AND operation = 'REST.GET.OBJECT'
)
GROUP BY dataset
ORDER BY cnt DESC;

The outcome of the query can look like this:

This query tells us how many individual GetObject requests were made to each dataset during one day, ordered from the most accessed down to the least accessed. By itself this might not be enough to tell whether one dataset is accessed more often than another; we can normalize each aggregate by the number of objects in each dataset. We can also group by dataset and day to see how access patterns correlate over time. There are many further options, but the point is that, with this Delta Lake table at hand, we can answer almost any question about the access patterns in the bucket.
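For example, here is a sketch of the group-by-dataset-and-day variant using the Scala DataFrame API, assuming the same table path and column names as in the SQL query above:

import org.apache.spark.sql.functions._

val logs = spark.read.format("delta").load("s3://delta-logs-bucket/data-bucket-logs/")

// Count GetObject requests per dataset per day, ordered by day and then by access count.
val dailyAccessCounts = logs
  .filter(col("bucket") === "data-bucket" &&
          col("key").rlike("^data/") &&
          col("operation") === "REST.GET.OBJECT")
  .withColumn("dataset", regexp_extract(col("key"), "^data/([^/]+)/.+", 1))
  .groupBy(col("dataset"), col("date"))
  .count()
  .orderBy(col("date"), col("count").desc)

display(dailyAccessCounts)   // Databricks notebook display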

Extensibility

The pipeline we have shown is extensible out of the box. You can fully reuse the same SQS queue and add more logged buckets to the pipeline by simply using the same raw-logs-bucket to store their S3 Server Access Logs. Because the Spark job already partitions by date and bucket, it will keep working fine, and your Delta Lake table will contain log data from the new buckets.

One piece of advice we can give is to use the AWS CDK to handle the infrastructure, i.e. to configure the buckets raw-logs-bucket and delta-logs-bucket, the SQS queue, and the role s3-access-role-to-assume. This simplifies operations and turns the infrastructure into code as well.

Conclusion

In this blog post we have described how S3 Server Access Logs can be transformed into a Delta Lake table in a continuous fashion, so that the access patterns to the data can be analyzed. We showed that Spark Structured Streaming together with the S3-SQS reader can be used to read the raw logging data. We described the IAM policies and spark_conf parameters you need to make this pipeline work. Overall, this solution is easy to deploy and operate, and it provides real benefit by giving you observability over access to your data.

--


The post Scalable near real-time S3 access logging analytics with Apache Spark and Delta Lake appeared first on Databricks.

Why Better Data User Experience Means Better ROI

It should be the least controversial idea in the world: when people have easy access to quality information, they’ll make better decisions.

But for so many of the world’s businesses, this quality information lives behind a veil that can only be pierced by trained professionals who specialize in data science or other niche topics. The user experience of business data generally suffers at companies seeking to make an impact: employees must be able to interpret a complicated graph, or decode a series of complex acronyms and abbreviations, to arrive at actionable information. That’s why employees tend to lean on domain experts to explain the nitty-gritty and provide the underlying analytics. This approach is less than fully productive because it doesn’t promote the worker autonomy that actualized businesses depend on in pursuit of a goal.

What’s worse is that employees may also be dismissing the data altogether. This weakens an organization’s decision-making power and increases the overall risk of arriving at a suboptimal course of action.

It’s easy to suggest data visualization tools as the catch-all solution for making it easier to interact with business data. These software products serve their purpose, but they ...


Read More on Datafloq

Neural Machine Translation

For centuries people have been dreaming of easier communication with foreigners. The idea to teach computers to translate human languages is probably as old as computers themselves. The first attempts to build such technology go back to the 1950s. However, the first decade of research failed to produce satisfactory results, and the idea of machine translation was forgotten until the late 1990s. At that time, the internet portal AltaVista launched a free online translation service called Babelfish — a system that became a forefather for a large family of similar services, including Google Translate. At present, modern machine translation systems rely on Machine Learning and Deep Learning techniques to improve the output and, possibly, tackle the issues of understanding context, tone, language registers and informal expressions.

The techniques that were used until recently, including by Google Translate, were mainly statistical. Although quite effective for related languages, they tended to perform worse for languages from different families. The problem lies in the fact that they break sentences down into individual words or phrases and can span only several words at a time while generating translations. Therefore, if languages have different word orderings, this method results in an awkward sequence of chunks of text.

Turn to ...


Read More on Datafloq

Wednesday, 30 October 2019

Why Digital Health Will Need Big Data to Support Its Infrastructure

Healthcare IT leaders must lead the charge to deploy big data analyses across the continuum of community medical services.

Rapidly expanding medical needs and a whirlwind of technological innovations have created a mass of data that no healthcare organization can ever hope to manage manually, according to the American Medical Informatics Association (AMIA). As a result, a growing number of organizations recognize big data systems as a solution for maintaining their information infrastructure.

By compelling lawmakers to consider informatics when making decisions, IT leaders can ensure that the medical field meets goals for deploying big data technologies and improving community health outcomes.

How Care Providers Use Big Data to Improve Public Health

One example of how organizations use big data to improve public health outcomes is the United States Department of Veterans Affairs’ deployment of analytics to improve treatment outcomes and well-being for veterans.

In 2018, nearly 5 million veterans received VA disability benefits. The information in their medical records can provide healthcare organizations with a wealth of data that can help improve the emotional and physical well-being of their fellow veterans. To do just that, the VA partnered with Alphabet to develop technological innovations that predict the likelihood of illness and injuries among veterans.

For many veterans, it’s ...


Read More on Datafloq

6 Tips for Image Optimization for Data Visualization

Visuals are used not only to attract attention but also to provide information. Data visualization is a basic skill of data scientists, helping them tell the story behind the data.

The popularity of infographics and data visualizations on the Internet has grown exponentially in recent years. According to a report from Infographic World, visuals improve information retention by 400%. However, attractive images and graphics can be large and slow down your website. Therefore, it is crucial to optimize images for web usage. Read on to learn more about image optimization for data visualization.

What Is Data Visualization?

The term data visualization refers to the techniques and processes used to communicate information via visual content. Data scientists use graphs, charts, and infographics to convey information in a clear way.

Presenting analytics visually often helps users understand complex concepts. Professionals often use data visualization to detect patterns in the data.

Data visualization goes beyond the realm of data science. Often, companies use data visualization to identify trends in the market. Other uses include tracking performance or calling attention to an issue. Data visualization is also part of a new branch of journalism called data journalism.

Why Visualizations are Viral

Displaying data visually is often the most efficient ...


Read More on Datafloq

Why we are investing 100 million euros in our European Development Center

A few days ago, we announced an investment of 100 million euros in our European Development Center in Amsterdam. I want to take a moment to describe why this is a pivotal moment for Databricks and why Amsterdam is a cornerstone of our growth strategy.

Solving the Hardest Data Problems at Scale

Our Unified Data Analytics Platform helps customers solve some of the hardest problems on the planet, from genomics research to credit card fraud detection. The Netherlands provides us with access to a large pool of talent that is uniquely suited to our needs. The Netherlands is home to world-class universities such as the Vrije Universiteit Amsterdam, Delft University of Technology, and many others. We have built close partnerships with local universities and research centers, helping translate cutting-edge research into product. For example, Databricks partners with Centrum Wiskunde & Informatica (CWI), one of the world leaders in distributed systems and database research.

Our employees and partners benefit from the excellent infrastructure that powers the competitive Dutch economy. For example, most of our employees in the Netherlands skip the car commute and take a quick train or bike ride to work, thanks to the superb public transport and bike-friendly infrastructure.

It’s no secret that relocating to the Netherlands and getting settled is very easy. The Dutch provide a “fast track” for knowledge workers, with streamlined entry and onboarding procedures. IN Amsterdam, for example, provides a one-stop shop for registration, immigration and much more. Many employees who relocate to the Netherlands are eligible for the 30% Ruling, which provides a significant tax incentive for up to five years.

Databricks European Development Center

Lastly, we are thrilled by the accomplishments of the European Development Center. At Databricks, we have built the European Development Center as a fully operational site from the start, with local leadership in key functions such as engineering, product management, HR and customer success. As a result, the EDC has shipped key features in almost every aspect of our Unified Data Analytics Platform.

The future is bright for the Databricks European Development Center. If you are interested in learning about opportunities in beautiful Amsterdam, message me directly or visit our Careers Site.


--


The post Why we are investing 100 million euros in our European Development Center appeared first on Databricks.

Tuesday, 29 October 2019

How to Make AR, VR, and MR a Part of Your Digital Workplace

AR, VR, and MR have evolved greatly from sci-fi gadgets, gamer gimmicks, and entertainment devices, even surviving the Pokémon GO craze to be finally recognized as technologies bound to revolutionize the workplace. Currently, Ford, Boeing, Airbus, Coca Cola, Siemens, and hundreds of global companies outside the consumer entertainment industry are busy testing and implementing AR and VR gear in every field of their operational processes, from manufacturing to customer engagement. 

Other large and mid-sized enterprises are expected to follow suit in the next five years. According to the IDC report, the five-year CAGR for AR and VR technologies will reach 78.3% by 2023. The largest investment share (80% by 2022) will come from the commercial sector companies, which are expected to allocate a lesser but still significant portion of funds to digital workplace transformation. 

To meet the burgeoning need for AR, VR and MR technologies, an impressive amount of advanced hardware has hit the market. Oculus Quest, Samsung Odyssey+ and HTC Vive VR headsets, together with Google Glass and Microsoft HoloLens AR smart glasses, are now acknowledged as top immersive technology devices. 

The selection of AR, VR and MR software is not as robust and mostly geared toward game development. However, certain business ...


Read More on Datafloq

Scaling Hyperopt to Tune Machine Learning Models in Python

Try the Hyperopt notebook to reproduce the steps outlined below and watch our on-demand webinar to learn more.

Hyperopt is one of the most popular open-source libraries for tuning Machine Learning models in Python.  We’re excited to announce that Hyperopt 0.2.1 supports distributed tuning via Apache Spark.  The new SparkTrials class allows you to scale out hyperparameter tuning across a Spark cluster, leading to faster tuning and better models. SparkTrials was contributed by the authors of this post, in collaboration with Databricks engineers Hanyu Cui, Lu Wang, Weichen Xu, and Liang Zhang.

What is Hyperopt?

Hyperopt is an open-source hyperparameter tuning library written for Python.  With 445,000+ PyPI downloads each month and 3800+ stars on Github as of October 2019, it has strong adoption and community support.  For Data Scientists, Hyperopt provides a general API for searching over hyperparameters and model types. Hyperopt offers two tuning algorithms: Random Search and the Bayesian method Tree of Parzen Estimators.

For developers, Hyperopt provides pluggable APIs for its algorithms and compute backends.  We took advantage of this pluggability to write a new compute backend powered by Apache Spark.

Scaling out Hyperopt with Spark

With the new class SparkTrials, you can tell Hyperopt to distribute a tuning job across a Spark cluster.  Initially developed within Databricks, this API for hyperparameter tuning has enabled many Databricks customers to distribute computationally complex tuning jobs, and it has now been contributed to the open-source Hyperopt project, available in the latest release.

Hyperparameter tuning and model selection often involve training hundreds or thousands of models.  SparkTrials runs batches of these training tasks in parallel, one on each Spark executor, allowing massive scale-out for tuning.  To use SparkTrials with Hyperopt, simply pass the SparkTrials object to Hyperopt’s fmin() function:

from hyperopt import fmin, tpe, SparkTrials

best_hyperparameters = fmin(
    fn=training_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=64,
    trials=SparkTrials())

For a full example with code, check out the Hyperopt documentation on SparkTrials.

Under the hood, fmin() will generate new hyperparameter settings to test and pass them to SparkTrials. The diagram below shows how SparkTrials runs these tasks asynchronously on a cluster: (A) Hyperopt’s primary logic runs on the Spark driver, computing new hyperparameter settings. (B) When a worker is ready for a new task, Hyperopt kicks off a single-task Spark job for that hyperparameter setting. (C) Within that task, which runs on one Spark executor, user code will be executed to train and evaluate a new ML model. (D) When done, the Spark task will return the results, including the loss, to the driver. These new results are used by Hyperopt to compute better hyperparameter settings for future tasks.

How new hyperparameter settings are tested and passed to SparkTrials using Hyperopt

Since SparkTrials fits and evaluates each model on one Spark worker, it is limited to tuning single-machine ML models and workflows, such as scikit-learn or single-machine TensorFlow.  For distributed ML algorithms such as Apache Spark MLlib or Horovod, you can use Hyperopt’s default Trials class.

Using SparkTrials in practice

SparkTrials takes 2 key parameters: parallelism (the maximum number of parallel trials to run, defaulting to the number of Spark executors) and timeout (the maximum time in seconds that fmin is allowed to take, defaulting to None). Timeout provides a budgeting mechanism, allowing a cap on how long tuning can take.

The parallelism parameter can be set in conjunction with the max_evals parameter for fmin() using the guideline described in the following diagram.  Hyperopt will test max_evals total settings for your hyperparameters, in batches of size parallelism. If parallelism = max_evals, then Hyperopt will do Random Search: it will select all hyperparameter settings to test independently and then evaluate them in parallel.  If parallelism = 1, then Hyperopt can make full use of adaptive algorithms like Tree of Parzen Estimators which iteratively explore the hyperparameter space: each new hyperparameter setting tested will be chosen based on previous results. Setting parallelism in between 1 and max_evals allows you to trade off scalability (getting results faster) and adaptiveness (sometimes getting better models).  Good choices tend to be in the middle, such as sqrt(max_evals).

Setting parallelism for SparkTrials.

To illustrate the benefits of tuning, we ran Hyperopt with SparkTrials on the MNIST dataset using the PyTorch workflow from our recent webinar.  Our workflow trained a basic Deep Learning model to predict handwritten digits, and we tuned 3 parameters: batch size, learning rate, and momentum.  This was run on a Databricks cluster on AWS with p2.xlarge workers and Databricks Runtime 5.5 ML.

In the plot below, we fixed max_evals to 128 and varied the number of workers.  As expected, more workers (greater parallelism) allow faster runtimes, with linear scale-out.

Plotting the effect of greater parallelism on the running time of Hyperopt

We then fixed the timeout at 4 minutes and varied the number of workers, repeating this experiment for several trials.  The plot below shows the loss (negative log likelihood, where “180m” = “0.180”) vs. the number of workers; the blue points are individual trials, and the red line is a LOESS curve showing the trend.  In general, model performance improves as we use greater parallelism since that allows us to test more hyperparameter settings. Notice that behavior varies across trials since Hyperopt uses randomization in its search.

Demonstrating improved model performance with the use of greater parallelism

Getting started with Hyperopt 0.2.1

SparkTrials is available now within Hyperopt 0.2.1 (available on the PyPi project page) and in the Databricks Runtime for Machine Learning (5.4 and later).

To learn more about Hyperopt and see examples and demos, check out:

  • Example notebooks in the Databricks Documentation for AWS and Azure

Hyperopt can also be combined with MLflow for tracking experiments and models. Learn more about this integration in the open-source MLflow example and in our Hyperparameter Tuning blog post and webinar.

You can get involved via the Github project page:


--


The post Scaling Hyperopt to Tune Machine Learning Models in Python appeared first on Databricks.

Anti-fraud Analytics Shine as AI in Banking Grows

While the finance and banking sector has a reputation of maintaining a relatively conservative, cautious approach toward major disruptions, the current push for digital transformation, CX prioritization, and data-driven automation has been leading to massive changes. AI integration is one of them.

Chatbots and CX augmentations may well be the headline-makers of today, and a likely cutting-edge tech priority of the future, but right now risk management and regulatory compliance are #1 when it comes to funding. According to the Emerj 2019 AI in Banking report, risk, safety and compliance AI alone currently accounts for more than half of the $3 billion invested.

Fraud detection and cybersecurity with AI in finance

With a 26% share of the funds raised by AI vendors working in finance (Emerj), fraud protection and cybersecurity are the top current uses and adoption opportunities for risk-related AI and ML in banking and finance.

PwC’s Global Economic Crime and Fraud Survey 2018 states that 49% of respondent organizations experienced fraud within the previous 24 months (a 19% increase over the last decade). The newest Juniper research on online payment fraud also ranks banking among the fields most targeted by financial fraud. This makes advanced detection and prevention ...


Read More on Datafloq

Multi-Party Privacy and AI Autonomous Cars

By Lance Eliot, the AI Trends Insider

You are at a bar and a friend of yours takes a selfie that includes you in the picture.

Turns out you’ve had a bit to drink and it’s not the most flattering of pictures.

In fact, you look totally plastered.

You are so hammered that you don’t even realize that your friend is taking the selfie and the next morning you don’t even remember there was a snapshot taken of the night’s efforts. About three days later, after becoming fully sober, you happen to look at the social media posts of your friend, and lo-and-behold there’s the picture, posted for her friends to see.

In a semi-panic, you contact your friend and plead with the friend to remove the picture.

The friend agrees to do so.

Meanwhile, turns out that the friends of that person happened to capture the picture, and many of them thought it was so funny that they re-posted it in other venues. It’s now on Facebook, Instagram, Twitter, etc. You look so ridiculous that it has gone viral. Some have even cut out just you from the picture and then made memes of you that are hilarious, and have spread like wildfire on social media.

People at work that only know you at work (you don’t often associate with them outside of the workplace), have come up to you at the office to mention they saw the picture.

Your boss comes to ask you about it.

The company is a bit worried that it might somehow reflect poorly on the firm.

People that you used to know from high school and even elementary school have contacted you to say that you’ve really gone wild (you used to be the studious serious person). Oddly enough, you know that this was a one-time drinking binge and that you almost never do anything like this. You certainly hadn’t been anticipating that a picture would capture the rare moment.

Frustratingly, it’s a misleading characterization of who you are.

A momentary lapse that has been blown way out of proportion.

People that you don’t know start to bombard your social media sites with requests to get linked.

Anyone that parties that hard must be worth knowing. Unfortunately, most of the requests are from creepy people. Scarier still is that they have Googled you and found out all sorts of aspects about your personal life. These complete strangers are sending you messages that appear as though they know you, doing so by referring to places you’ve lived, vacations you’ve taken, and so on.

Sadly, this leads to attempts at identity theft, such as breaking into your bank account or opening credit cards in your name, and so on. It leads to cyber-stalking by nefarious hackers. Social phishing ensues.

If this seems like a nightmare, I’d say that you can wake-up now and go along with the aspect that it was all a dream.

A really ugly dream.

Let’s also make clear that it could absolutely happen.

Multi-Party Privacy A Looming Issue

Many people that are using social media seem to not realize that their privacy is not necessarily controlled solely by themselves.

If you end up in someone else’s snapshot that just so happens to include you, maybe tangentially in the foreground or maybe even in the background, there’s a chance that you’ll now be found on social media if that person posts the photo.

Facial recognition for photos has become quite proficient. In the early days of social media, a person’s face had to be facing completely forward and fully seen by the camera, the lighting had to be top-notch, and basically if it was a pristine kind of facial shot then the computer could recognize your face. Also, there were so few faces on social media that the computer could only determine that there was a face present, but it wasn’t able to try and guess whose face it was.

Nowadays, facial recognition is so good that your head can be turned and barely seen by the camera, the lighting can be crummy, and there can be blurs and other aspects, and yet the computer can find a face. And it can label the face by using the now millions of faces already found and tagged. Remaining obscure in a photo online is no longer feasible for very long.

People are shocked to find that they went to the mall and all of a sudden there are postings that have them tagged in the photos. You are likely upset because you were just minding your own business at the mall. You didn’t take a photo of yourself and nor did a friend. But, because other people were taking photos, and because of the widespread database of faces, once these fellow mall shoppers posted the picture, it was easy enough for a computer to automatically tag you in a photo. No human intervention needed.

Notice also that in the story about being in a bar and a friend having taken a snapshot, even if your friend agrees to remove the picture from being posted, the odds are that once it’s been posted you’ll never be able to stop it from being promulgated.

There’s a rule-of-thumb these days that once something gets posted, it could be that it will last forever (unless you believe that someday the Internet will be closed down – good luck waiting for that day).

I realize you are likely already thinking about your own privacy when it comes to your own efforts, such as a selfie of yourself that you made and that you posted.

You might be very diligent about only posting selfies that you think showcase your better side.

You might be careful not to post a blog that might use foul words or offend anyone.

You might be cautious about filling in forms at web sites and be protective about private information.

Unfortunately, unless you live on a deserted island, the odds are that you are around other people, and the odds are that those people are going to capture you in their photos.

I suppose you could walk around all day long with a bag over your head. When people ask you why, you could tell them you are trying to preserve your privacy. You are trying to remain anonymous. I’d bet that you’d get a lot of strange stares and possibly people calling the police to come check-out the person wearing the bag over their head.

In some cases, you’ll perhaps know that you are in a photo and that someone that you know is going to post it. You went to a friend’s birthday party on Saturday and photos were taken. The friend already mentioned that an online photo album had been setup. You’ll be appearing in those photos. That’s something you knew about beforehand. There’s also the circumstance of being caught up in a photo that you didn’t know was being taken, and might have been a snapshot by a complete stranger, akin to the mall example earlier.

So, let’s recap:

  • You took a selfie, which you knew about because you snapped it, and then you posted it
  • You end-up in someone else’s photo, whom you know, and they posted it, but you didn’t know they would post it
  • You end-up in someone else’s photo, whom you know, and they posted it with your blessings
  • You end-up in someone else’s photo, a complete stranger, and they posted it but you didn’t know they would post it
  • You took a photo of others, whom you know, and you posted it, but you didn’t let them know beforehand
  • You took a photo of others, whom you know, and you posted it with their blessings
  • You took a photo of others, complete strangers, and you posted it but they didn’t know you would post it
  • Etc.

I purposely have pointed out in the aforementioned list that you can be both the person “victimized” by this and also the person that causes others to be victimized. I say this because I know some people that have gotten upset that others included them in a photo, and posted it without getting the permission of that person, and yet this same person routinely posts photos that include others and they don’t get their permission. Do as I say, not as I do, that’s the mantra of those people.

There’s a phrase for this multitude of participants involved in privacy: it is referred to as Multi-Party Privacy (MP).

Details About Multi-Party Privacy

Multi-Party Privacy has to do with trying to figure out what to do about intersecting privacy aspects in a contemporary world of global social media.

You might be thinking that privacy is a newer topic and that it has only emerged with the rise of the Internet and social media.

Well, you might be surprised to know that in 1948 the United Nations adopted a document known as the Universal Declaration of Human Rights (UDHR) and Article 12 refers to the right of privacy. Of course, few at that time could envision fully the world we have today, consisting of a globally interconnected electronic communications network and the use of social media, and for which it has made trying to retain or control privacy a lot harder to do.

When you have a situation involving MP, an issue can arise from conflict among the participants over the nature of the privacy involved. In some cases, there is little or no conflict and the MP can be readily dealt with; it is easy to ensure the privacy of the multiple participants.

More than likely, you’ll have to deal with Multi-Party Privacy Conflicts (MPC), wherein one or more parties disagree about the privacy aspects of something that intersects them.

In the story about you being in the bar and your friend snapped the unbecoming picture and posted it, you might have been perfectly fine with this and therefore there was no MPC. But, as per the story, you later on realized what had happened, and so you objected to your friend about the posting. This was then a conflict.

This was an MPC: multiple parties involved in a matter of privacy, over which they have a conflict, because one of them was willing to violate the privacy of the other, but the other was not willing to allow it.

In this example, your friend quickly acquiesced and agreed to remove the posting.

This seemingly resolved the MPC.

As mentioned, even if the MPC seems to be resolved, it can unfortunately be a situation wherein the horse is already out of the barn. The damage is done and cannot readily be undone. Privacy can be usurped, even if the originating point of the privacy breach is later somehow fixed or undone.

I realize that some of you will say that you’ve had such a circumstance and that rather than trying to un-post the picture that you merely removed the tag that had labeled you in the picture. Yes, many of the social media sites allow you to un-tag something that was either manually tagged or automatically tagged. This would seem to put you back into anonymity.

If so, it is likely short-lived.

All it will take is for someone else to come along and decide to re-apply a tag, or an automated crawler that does it. Trying to return to a state of anonymity is going to be very hard to do as long as the picture still remains available. There will always be an open chance that it will get tagged again.

I’ll scare you even more so.

There are maybe thousands of photos right now with you in them, perhaps in the background while at the train station, or while in a store, or at a mall, or on vacation in the wilderness. You might not yet be tagged in any of those. The more that we continue to move toward this global massive inter-combining of social media and the Internet, and the more that computers advance and computing power becomes less costly, those seemingly “obscure” photos are bound to get labeled.

Every place that you’ve ever been, in every photo so captured, and that’s posted online, might ultimately become a tagged indication of where you were. Plus, the odds are that the photo has other info embedded in it, such as the date and time of the photo, and the latitude and longitude of the photo location. Not only are you tagged, but now we’ll know when you were there and where it was. Plus, whoever else is in the photo will be tagged, so we’ll all know who you were with.

Yikes!

Time to give it all up, and go live in a cave.

Say, are there any cameras in that cave?

Multi-Party Privacy And AI Autonomous Cars

What does all this have to do with AI self-driving driverless autonomous cars?

At the Cybernetic AI Self-Driving Car Institute, we are developing AI software for self-driving cars. We’re also aware of the potential privacy aspects and looking at ways to deal with them from a technology perspective (it will also need to be dealt with from a societal and governmental perspective too).

I’ve already previously discussed at some length the matter of privacy and AI self-driving cars, so please do a quick refresher and take a look at that article: https://aitrends.com/selfdrivingcars/privacy-ai-self-driving-cars/

I’ve also pointed out that many of the pundits in support of AI self-driving cars continually hammer away at the benefits of AI self-driving cars, such as the mobility possibilities, but they often do so with idealism and don’t seem to be willing to also consider the downsides such as privacy concerns – see my article about idealism and AI self-driving cars:  https://aitrends.com/selfdrivingcars/idealism-and-ai-self-driving-cars/

Some are worried that we’re heading towards a monstrous future with AI self-driving cars, including the potential for large-scale privacy invasion; on that note, see my article about the AI self-driving car as a Frankenstein moment: https://aitrends.com/selfdrivingcars/frankenstein-and-ai-self-driving-cars/

Herein, let’s take a close look at Multi-Party Privacy and the potential for conflicts, or MPC as it relates to AI self-driving cars.

An AI self-driving car involves these key aspects as part of the driving task:

  • Sensor data collection and interpretation
  • Sensor fusion
  • Virtual world model updating
  • AI action plans
  • Car controls commands issuance

This is based on my overarching framework about AI self-driving cars, which you can read about here: https://aitrends.com/selfdrivingcars/framework-ai-self-driving-driverless-cars-big-picture/

You would normally think about the sensors of an AI self-driving car that are facing outward and detecting the world around the self-driving car. There are cameras involved, radar, sonar, LIDAR, and the like. These are continually scanning the surroundings and allow the AI to then ascertain as best possible whether there is a car ahead, or whether there might be a pedestrian nearby, and so on. Sometimes one kind of sensor might be blurry or not getting a good reading, and thus the other sensors are more so relied upon. Sometimes they are all working well and a generally full sensing of the environment will be possible.

One question for you is how long will this collected data about the surroundings be kept?

You could argue that the AI only needs the collected data from moment to moment. You are driving down a street in a neighborhood. As you proceed along, every second the sensors are collecting data. The AI is reviewing it to ascertain what’s going on. You might assume that this is a form of data streaming and there’s no “collection” per se of it.

You’d likely be wrong in that assumption.

Some or all of that data might indeed be collected and retained. For the on-board systems of the self-driving car, perhaps only portions are being kept. The AI self-driving car likely has an OTA (Over The Air) update capability, allowing it to use some kind of Internet-like communications to connect with an in-the-cloud capability of the automaker or tech firm that made the AI system. The data being collected by the AI self-driving car can potentially be beamed to the cloud via the OTA.

For my article about OTA, see: https://aitrends.com/selfdrivingcars/air-ota-updating-ai-self-driving-cars/

There are some AI developers that are going to be screaming right now and saying that Lance, there’s no way that the entire set of collected data from each AI self-driving car is going to be beamed-up. It’s too much data, it takes too much time to beam-up. And, besides, it’s wasted effort because what would someone do with the data?

I’d counter-argue that with compression and with increasingly high-speed communications, it’s not so infeasible to beam-up the data.

Plus, the data could be stored temporarily on-board the self-driving car and then piped up at a later time.

In terms of what the data could be used for, well, that’s the million-dollar question.

Or, maybe billion-dollar question.

Exploiting Multi-Party Private Data

If you were an auto maker or tech firm, and you could collect the sensory data from the AI self-driving cars that people have purchased from you, would you want to do so?

Sure, why not.

You could use it presumably to improve the capabilities of the AI self-driving cars, mining the data and improving the machine learning capabilities across all of your AI self-driving cars.

That’s a pretty clean and prudent thing to do.

You could also use the data in case there are accidents involving your AI self-driving car.

By examining the data after an accident, perhaps you’d be able to show that the dog that the AI self-driving car hit was hidden from view and darted out into the street at the last moment. This might be crucial for countering the public perception that the seemingly evil AI ran over a dog in the roadway. The data might also have important legal value. It might be used in lawsuits against the automaker or tech firm. It might be used for insurance purposes to set insurance rates. Etc.

Let’s also though put on our money making hats.

If you were an auto maker or tech firm, and you were collecting all of this data, could you make money from it? Would third parties be willing to pay for that data? Maybe so. When you consider that the AI self-driving car is driving around all over the place, and it is kind of mapping whatever it encounters, there’s bound to be business value in that data.

It could have value to the government too.

Suppose your AI self-driving car was driving past a gas station just as a thief ran out of the attached convenience store. Voila, your AI self-driving car might have captured the thief on the video that was being used by the AI to navigate the self-driving car.

In essence, with the advent of AI self-driving cars, wherever we are, whenever we are there, the roaming AI self-driving cars are now going to up the ante on video capture. If you already were leery about the number of video cameras on rooftops, walls, and poles, the AI self-driving car is going to increase that coverage exponentially.

Don’t think of the AI self-driving car as a car, instead think of it as a roaming video camera.

Right now, there are 250+ million cars in the United States.

Imagine if every one of those cars had a video camera, and the video camera had to be on whenever the car was in motion.

That’s a lot of videos. That’s a lot of videos of everyday activities.
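
To put a rough, purely illustrative number on it: if each of those cars drove about an hour a day and compressed video ran on the order of 2 GB per hour (both back-of-the-envelope guesses), the fleet would generate roughly

\[
250{,}000{,}000\ \text{cars} \times 1\,\tfrac{\text{hr}}{\text{day}} \times 2\,\tfrac{\text{GB}}{\text{hr}} \approx 500\ \text{PB of video per day}.
\]

The exact assumptions hardly matter; the order of magnitude is the point.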

I challenge you, later today when you are in your car, to look around and pretend that all the other cars have video cameras and are recording everything they see, every moment.

Eerie, yes?

Exponential Increase In Multi-Party Privacy Concerns

The point herein is that if you believe the Multi-Party Privacy issue matters, the AI self-driving car is going to make it really big-time.

And, the MPC, the conflicts over privacy, will go through the roof.

You opt to take your AI self-driving car to the local store. It captures video of your neighbors outside their homes, mowing the lawn, playing ball with their kids, watching birds, you name it. All of that video, in the normal everyday course of life activities. Suppose it gets posted someplace online. Did any of them agree to this? Would they even know they had been recorded?

I assure you that the sensors and video cameras on an AI self-driving car are so subtle that people are not going to realize that they are being recorded.

It’s not like the old days where there might be a large camera placed on the top of the car and someone holding up a sign saying you are being recorded. It will be done without any realization by people. Even if at first they are thinking about it, once AI self-driving cars become prevalent it will just become an accustomed aspect.

And, suppose the government mandated that a red recording light had to be placed on the top of an AI self-driving car, what would people do? Stop playing ball in the street, hide behind a tree, or maybe walk around all day with a bag over their heads?

Doubtful.

One unanswered question right now is whether you as the owner of an AI self-driving car will get access to the sensor data collected by your AI self-driving car. You might insist that of course you should, it’s your car, darn it. The auto makers and tech firms might disagree and say that the data collected is not your data, it is data owned by them. They can claim you have no right to it, and furthermore that your having it might undermine the privacy of others. We’ll need to see how that plays out in real life.

Let’s also consider the sensors that will be pointing inward into the AI self-driving car.

Yes, I said pointing inward.

There are likely to be both audio microphones and inward-pointing cameras inside the AI self-driving car. Why? Suppose you put your children into the AI self-driving car and tell the AI to take them to school. I’m betting you’d want to be able to see your children and make sure they are okay. You’d want to talk to them and let them talk to you. For this and a myriad of other good reasons, there are going to be cameras and microphones aimed inward inside AI self-driving cars.

If you were contemplating the privacy aspects of recording what the AI self-driving car detects outside of the self-driving car, I’m sure you’d be dismayed at the recordings of what’s happening inside the AI self-driving car.

Here’s an example.

It’s late at night. You’ve been to the bar. You want to get home. You are at least aware enough to not drive yourself. You hail an AI self-driving car. You get into the AI self-driving car, there’s no human driver. While in the AI self-driving car, you hurl whatever food and drink you had ingested while at the bar. You freak out inside the AI self-driving car due to drunkenness and you ramble about how bad your life is. You yell about the friends that don’t love you. You are out of your head.

Suppose the AI self-driving car is a ridesharing service provided by the Acme company. They now have recorded all of your actions while inside the AI self-driving car. What might they do with it? There’s also the chance that the ridesharing service is actually somebody’s personal AI self-driving car, but they let it be used when they aren’t using it, trying to earn some extra dough. They now have that recording of you in the AI self-driving car.

Eerie (again), yes?

There might be some AI self-driving ridesharing services that advertise they will never ever violate your privacy and that they don’t record what happens inside their AI self-driving cars.

Or, there might be AI ridesharing services that offer for an extra fee they won’t record. Or, for an extra fee they will give you a copy of the recording.

You might say that it is a violation of your privacy to have such a recording made.

But, keep in mind that you willingly went into the AI self-driving car.

There might even be some kind of agreement you agreed to by booking the ridesharing service or by getting into the self-driving car.

Some have suggested that once people know they are being recorded inside of a self-driving car, they’ll change their behavior and behave.

This seems so laughable that I can barely believe the persons saying this believe it. Maybe when AI self-driving cars first begin, we’ll sit in them like we used to do in an airplane, and be well-mannered and such, but after AI self-driving cars become prevalent, human behavior will be human behavior.

There are some that are exploring ways to tackle this problem using technology. Perhaps, when you get into the AI self-driving car, you have some kind of special app on your smartphone that can mask the video being recorded by the self-driving car and your face is not shown and your voice is scrambled. Or, maybe there is a bag in the self-driving car that you can put over your head (oops, back to the bag trick).
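
One purely illustrative way such masking could work is sketched below, using OpenCV’s bundled Haar-cascade face detector to blur any faces found in a frame before it is stored or shared; this is a sketch of the general idea, not how any actual smartphone app or self-driving-car system does it.

    import cv2  # assumes the opencv-python package is installed

    def mask_faces(frame):
        """Blur any detected faces in a BGR image (illustrative sketch only)."""
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
            # Replace each detected face region with a heavily blurred version.
            frame[y:y+h, x:x+w] = cv2.GaussianBlur(frame[y:y+h, x:x+w], (51, 51), 0)
        return frame

    # Example usage on a single captured frame (the file name is hypothetical):
    # masked = mask_faces(cv2.imread("street_scene.jpg"))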

The Multi-Party Privacy issue arises in this case because there is someone else potentially capturing your private moments and it is in conflict with how you want your private moments to be used. Let’s extend this idea. You get into an AI self-driving car with two of your friends. You have a great time in the self-driving car. Afterward, one of you wants to keep and post the video, the other does not. There’s another MPC.

Conclusion

Some people will like having the video recordings of the interior of the AI self-driving car.

Suppose you take the family on a road trip. You might want to keep the video of both the interior shenanigans and the video captured of where you went. In the past, you might show someone a few pictures of your family road trip. Nowadays, you tend to show them video clips. In the future, you could show the whole trip, at least from the perspective of whatever the AI self-driving car could see.

For my article about family road trips and AI self-driving cars, see: https://aitrends.com/selfdrivingcars/family-road-trip-and-ai-self-driving-cars/

I hope that this discussion about Multi-Party Privacy does not cause you to become soured on AI self-driving cars.

Nor do I want this to be something of an alarmist nature.

The point more so is that we need to be thinking now about what the future will consist of. The AI developers crafting AI self-driving cars are primarily focused on getting an AI self-driving car to be able to drive. We need to be looking further ahead, and considering what other qualms or issues might arise. I’d bet that MPC will be one of them.

Get ready for privacy conflicts.

There are going to be conflicts about conflicts, you betcha.

Copyright 2019 Dr. Lance Eliot

This content is originally posted on AI Trends.

[Ed. Note: For readers interested in Dr. Eliot’s ongoing business analyses about the advent of self-driving cars, see his online Forbes column: https://forbes.com/sites/lanceeliot/]

5G Will Require Us to Reimagine Cybersecurity

We are on the precipice of 5G, or fifth-generation wireless adoption. Many consider its development to be a technological race, one that countless organizations and countries are working to achieve.

President Donald Trump explained this movement best: “The race to 5G is on, and America must win.”

It doesn’t matter what country or region you’re from; just about everyone is focused on developing and upgrading wireless technologies to boost their capabilities. 5G is considered a revolutionary upgrade to mobile connectivity and wireless networks. It will push the entire technology field to new heights, affecting every industry, from manufacturing and retail to health care and finance.

To understand why it’s going to be so impactful, one must consider what it offers. It’s also crucial to determine the cybersecurity and IT risks presented by these more robust networks.

What 5G Networks Will Bring to the Table

Nearly everyone is already familiar with wireless networks, particularly those that support mobile devices and smartphones. 3G networks, or the third-generation, were mainly responsible for boosting the popularity of mobile connectivity — alongside some of the most renowned handsets like the original iPhone.

4G took the technology to new levels, offering an increase in signal strength and quality, alongside bandwidth improvements. 4G ...


Read More on Datafloq

Niti Aayog proposes separate regulator for medical devices

The ministry had issued a draft notification saying that all medical devices would come under the category of drugs from December 1 and would be regulated under the Drugs & Cosmetics Act.

Monday, 28 October 2019

Ex-Google chip guru takes novel approach to AI at Groq

Away from the limelight, the three-year-old startup is trying to upend the semiconductor industry dogma by designing radical chips focused on AI.

Quantum Computing and Blockchain: Facts and Myths

The biggest danger to Blockchain networks from quantum computing is its ability to break traditional encryption [3].

Google sent shockwaves around the internet when it claimed to have built a quantum computer able to solve formerly impossible mathematical calculations, with some fearing the crypto industry could be at risk [7]. Google states that its experiment is the first experimental challenge against the extended Church-Turing thesis, also known as the computability thesis, which claims that traditional computers can effectively carry out any "reasonable" model of computation.

What is Quantum Computing?

Quantum computing is the area of study focused on developing computer technology based on the principles of quantum theory. The quantum computer, following the laws of quantum physics, would gain enormous processing power through the ability to be in multiple states, and to perform tasks using all possible permutations simultaneously [5].
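
In standard quantum-computing notation, that "multiple states at once" idea is the statement that a qubit sits in a superposition of the classical 0 and 1 states,

\[
|\psi\rangle = \alpha\,|0\rangle + \beta\,|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1,
\]

so a register of n qubits carries amplitudes over all 2^n classical bit-strings at once, which is where the claimed processing power comes from.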

A Comparison of Classical and Quantum Computing

Classical computing relies, at its ultimate level, on principles expressed by Boolean algebra. Data must be processed in an exclusive binary state at any point in time, as bits. While the time that each transistor or capacitor needs to be in either the 0 or 1 state before switching is now measurable in billionths of a second, there is still a ...


Read More on Datafloq

How AI Powered Chatbots are Changing the Customer Experience

Chatbots have arrived. They’re no longer the domain of sci-fi movies or high-tech companies. They’ve gone mainstream. Last year, more than two-thirds of consumers reported interacting with a Chatbot.

People are embracing them, too.  40% of consumers said they don’t care whether it’s a Chatbot or human that helps them as long as they get what they need. 47% of consumers say they are open to the idea of buying products or services from Chatbots.

AI-Powered Chatbots

Natural Language Processing (NLP) and Artificial Intelligence (AI) are two of the main reasons Chatbots are becoming more accepted. Some of the more advanced AI-Fueled Chatbots make it difficult to know whether it’s a Chatbot or a real person.

Machine Learning (ML) can improve over time as Chatbots analyze additional data to learn how to answer specific inquiries. They can handle relentless amounts of inquiries and can be programmed to recommend upsell opportunities.
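
As a minimal sketch of the kind of learning described above, the toy classifier below maps customer messages to intents using scikit-learn; the training examples and intent labels are invented purely for illustration, and a production Chatbot would use far richer NLP than this.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny, made-up training set: past inquiries labeled with the intent they express.
    inquiries = ["where is my order", "track my package",
                 "i want a refund", "how do i return this item"]
    intents = ["order_status", "order_status", "refund", "refund"]

    bot = make_pipeline(TfidfVectorizer(), LogisticRegression())
    bot.fit(inquiries, intents)

    # As more labeled inquiries accumulate, refitting on the larger set is how the
    # bot "improves over time" at routing specific questions.
    print(bot.predict(["can you tell me where my package is"]))  # expected: ['order_status']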

Most importantly, they take the burden off of overworked and stressed-out customer support teams. Chatbots can reduce customer service costs by as much as 30%.

Chatbots Are Augmenting (Not Replacing) Support Teams

While Chatbots can handle routine functions, they can also direct more complex inquiries to the right person. When a human touch is needed, virtual AI assistants will gather ...


Read More on Datafloq

How Artificial Intelligence Is Impacting the Healthcare Industry

The influx of disruptive technologies in many industries has brought massive transformation. The healthcare industry is no different when it comes to technological intervention taking over its functioning. Emerging technologies like artificial intelligence (AI), automation, machine learning, etc., in particular, have significantly impacted the healthcare sector. Even the stats indicate the growing adoption of AI into clinical operations. For instance, as per a recent report, it is expected that by 2020 the average spend on artificial intelligence projects will be $54 million.

The application of digital automation and artificial intelligence is noticeable throughout the healthcare field. From record management to providing virtual care, automation of repetitive tasks, digital consultation and accurate diagnosis, the implementation of AI is widespread.

So, what are the areas where AI application is already making a huge impact? Here is a list of the ten ways in which AI is changing the healthcare industry for a better present and advanced tomorrow.


Enabling Next-Gen Radiology Tools 
Streamline the Use of EHR Systems
Precise Analysis of Pathology Reports with AI
Perform Diagnostic Duties to Meet Growing Demand of Trained Professionals
Adding Intelligence to Smart Machines and Devices
Extracting Actionable Insights from Wearable Devices and Apps
Powering Predictive Analytics to Treat Complex Diseases
Turning Selfies into Diagnostic ...


Read More on Datafloq

How to Improve MySQL AWS Performance 2X Over Amazon RDS at The Same Cost

AWS is the #1 cloud provider for open-source database hosting, and the go-to cloud for MySQL deployments. As organizations continue to migrate to the cloud, it’s important to get in front of performance issues, such as high latency, low throughput, and replication lag across greater distances between your users and your cloud infrastructure. While many AWS users default to the managed database solution, Amazon RDS, there are alternatives available that can improve your MySQL performance on AWS through advanced customization options and unlimited EC2 instance type support. ScaleGrid offers a compelling alternative for hosting MySQL on AWS, with better performance, more control, no cloud vendor lock-in, and the same price as Amazon RDS. In this post, we compare the performance of MySQL on Amazon RDS vs. MySQL hosting at ScaleGrid on AWS High-Performance instances.

TLDR

ScaleGrid’s MySQL on AWS High-Performance deployment can provide 2x-3x the throughput at half the latency of Amazon RDS for MySQL, with the added advantage of having 2 read replicas as compared to 1 in RDS.
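
Numbers like these are worth sanity-checking against your own workload. The sketch below is not the benchmark behind the figures above; it is just a minimal latency probe using the PyMySQL driver, with placeholder connection details you would swap for your own RDS or ScaleGrid endpoint.

    import time
    import pymysql  # pip install pymysql

    # Placeholder connection details -- substitute your own MySQL endpoint.
    conn = pymysql.connect(host="your-mysql-endpoint", user="app",
                           password="secret", database="test")

    latencies = []
    with conn.cursor() as cur:
        for _ in range(1000):
            start = time.perf_counter()
            cur.execute("SELECT 1")   # replace with a query representative of your workload
            cur.fetchall()
            latencies.append(time.perf_counter() - start)
    conn.close()

    latencies.sort()
    print("p50 latency: %.2f ms" % (latencies[len(latencies) // 2] * 1000))
    print("p95 latency: %.2f ms" % (latencies[int(len(latencies) * 0.95)] * 1000))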

MySQL on AWS Performance Test

                      ScaleGrid                                                    Amazon RDS
    Instance Type     AWS High-Performance XLarge (see system details below)       DB Instance r4.xlarge (Multi-AZ)
    Deployment Type   3 Node Master-Slave Set with Semisynchronous Replication     Multi-AZ Deployment with 1 Read Replica
    SSD Disk          Local SSD & General Purpose - ...


Read More on Datafloq

Sunday, 27 October 2019

Harnessing Big Data to Improve Healthcare

There is no doubt about how quickly the healthcare industry is progressing each day; it is expected to reach Rs. 19,56,920 crore (US$ 280 billion) by 2020. With this pace of development, it has become highly important to manage and take complete care of patients and the other particulars related to the healthcare industry.

Several hospitals, practitioners, and researchers have been continuously on a quest to find ways to improve the healthcare sector, and the result often comes down to how efficiently Big Data can be used to build a strong base of analytics.

All the data collected from the sector – patient studies, health records, and medical devices – is analyzed and processed to come up with diagnostics. Leveraging Big Data, in both its quality and sheer volume, is the need of the hour and what everyone is striving for.

Big Data has been continuously in focus to enhance the efficiency and accuracy of the sector. Along with industry analytics, Big Data in the healthcare industry has made a noticeable mark.

What Can Big Data Do for the Healthcare Sector?

Here are the benefits of big data and how it can help the healthcare industry.

High-Risk Patient Care

There have always been attempts to enhance the accuracy of ...


Read More on Datafloq

Red Hat Report: IoT Outsourcing Trend Accelerating in 2019

IoT projects are becoming increasingly popular these days. TechTarget posted an article earlier this year on developments in the field. They cited research indicating that 70% of businesses intend to incorporate IoT technology into their operating models in the next five years.

TechTarget author Nacho De Marco discussed the implications of the IoT development profession. He pointed out that the field is becoming more complex, which requires companies to utilize more sophisticated deployment options.

De Marco points out that outsourcing is going to make IoT development more viable in the near future.

“Keeping the cost advantage in mind, outsourcing allows companies to hire many more engineers and specialists for a fraction of the cost. But, more importantly in the competitive world of IoT, companies can get to market faster by outsourcing. If, per Red Hat, so many businesses want to add IoT functionality, there will be more competition in this market, so timeliness is key. Outsourcing also offers the advantage of more flexibility and more creativity with a skilled team working on the project.”

Companies developing new IoT applications are becoming more open to the concept of outsourcing. However, they are going to need to fully understand the ramifications of outsourcing, as well as the ...


Read More on Datafloq

Friday, 25 October 2019

4 Ways to Ensure Data Security in Tomorrow's Organisation

Data security in tomorrow's organisations will be fundamentally different from data security today. If you think data security today is a challenge, you will be surprised at what it will be tomorrow. As I have said many times over: every organisation will be hacked, and if you have not yet been hacked, you are simply not important enough. That holds for today, where human hackers have to be selective about who to attack. Tomorrow's organisations, however, will not only have to deal with human hackers but will increasingly face machines autonomously hacking them. Sounds scary? It should be, and you should take action today to remain safe tomorrow.

Autonomous Artificial Hackers

Artificial intelligence and machine learning offer great applications and tools for organisations and society. However, bad actors can also benefit from AI and machine learning and use that to attack your business. Fighting autonomous artificial hackers with human teams is nearly impossible. Hence, you would need to augment your security team with AI. As a result, we will see machine to machine fights operating at unbelievable speed and agility. If you don’t have your security in order, it will be an easy fight for those autonomous artificial hackers.

To make things worse, your ...


Read More on Datafloq

Dawn Fitzgerald of Schneider Electric to Speak on Preparing the Data Center for AI Ops at AI World Conference & Expo

By John Desmond, AI Trends Editor

Good data center operations were defined by best practices before AI applications came on the scene and needed to be rolled out across enterprises. Now AI applications are undergoing a maturation as those data center best practices are applied to them, a process referred to as digitization.

Digitization is the conversion of text, pictures, or sound into a digital form that can be processed by a computer. Digitization can unlock the power of AI operations in the data center, says Dawn Fitzgerald, Director of Digital Transformation, Data Center Operations for Schneider Electric.

Schneider, a French multinational corporation, is focused on sustainability and efficiency with its software and services offerings to a range of customers, many in the oil and gas, utilities and manufacturing industries.

Dawn Fitzgerald, Director of Digital Transformation, Data Center Operations for Schneider Electric

Fitzgerald is speaking at the AI World Conference & Expo on Friday, Sept. 25, at 4:15 pm, on why digitizing the data center to pave the way for AI applications has been elusive. She recently took a few minutes to speak with AI Trends Editor John P. Desmond on the topic.

She talks about how methods of procedure (MOPs) and planned maintenance (PM) are at the foundation of condition-based maintenance, predictive analytics, and AI ops.

“You need to embark on a journey to prepare your data center for digitization,” she said. “Step one is to clean your data, get it defined so that it can be used in the models. The concept of digitizing your data center is critically important, and is where many data centers are today.”

“When you are doing your digitization of your data center, it’s important to design the people, processes and tools around the end goal, which is the AI operations. That’s where you get the highest value, and you leverage AI in the long run.”

Bootstrap Yourself to Learn About AI

Fitzgerald’s career in IT started before the latest wave of AI came on the scene in the last few years. So how did she get trained in how to work with AI? “AI is and will continue to be so pervasive, that individuals in different technologies and different fields need to bootstrap themselves, to take as many courses as possible, to do workshops with AI vendors, to engage in self-starting efforts. Many companies will support their employees in doing this,” she said.

The data center services organization at Schneider provides infrastructure management services to customers with their own data centers. “We are focused on high value solutions for our customers,” Fitzgerald said. AI components help to optimize efficiencies and bring more capability for customers. AI Analytics on IoT-sourced data can be used for predictive maintenance, for example.
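
As a toy illustration of that predictive-maintenance idea, the sketch below flags IoT sensor readings that drift well outside their recent baseline; the readings, window, and threshold are invented for illustration, and real condition-based maintenance models are considerably more involved.

    import statistics

    def flag_anomalies(readings, window=20, z_threshold=3.0):
        """Flag readings more than z_threshold std-devs from the rolling baseline --
        a toy stand-in for condition-based maintenance on IoT sensor data."""
        alerts = []
        for i in range(window, len(readings)):
            baseline = readings[i - window:i]
            mean = statistics.mean(baseline)
            stdev = statistics.pstdev(baseline) or 1e-9   # avoid divide-by-zero
            if abs(readings[i] - mean) / stdev > z_threshold:
                alerts.append((i, readings[i]))           # candidate maintenance event
        return alerts

    # Made-up cooling-fan temperatures (degrees C): steady readings, then a spike.
    temps = [40.0 + 0.1 * (i % 5) for i in range(40)] + [55.0]
    print(flag_anomalies(temps))   # -> [(40, 55.0)]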

Digitization has swept through every line of business at Schneider, which now has a Chief Digital Officer.

The challenges in implementing AI industry-wide begin with the data. “You need to make sure you have a mature data management model,” Fitzgerald said. “Also, a major call to action I have for all industries using AI is that we need to design for Controlled and Ethical AI.”

Examples of that will be provided in her upcoming presentation at AI World.

For young people interested in a career in AI today, Fitzgerald advises a good understanding of analytics. “Get a course in that,” she said. Rensselaer Polytechnic Institute has made analytics a requirement for graduation, said Fitzgerald, who is on the RPI School of Engineering Leadership Council.

Learn more and register for the AI World Conference & Expo.